I've inherited a piece of code that I need to run in a different environment with some minor changes. I am trying to map a list of strings to something that applies a function to each element of that list, using Python 3.6 (a language I am not familiar with).
I would like to use map rather than a list comprehension, but now I doubt this is possible.
In the following example I've tried a combination of for loops, yield (or not), and next(...) (or not), but I am not able to make the code work as expected.
I would like to see the print:
AAA! xxx
Found: foo
Found: bar
each time the counter xxx modulo 360 is 0 (zero).
I understand the map function returns a lazy iterator and does not execute the code immediately, so I need to do something to "apply" that function to each element of the input list.
However, I am not able to make this work. The documentation at https://docs.python.org/3.6/library/functions.html#map and https://docs.python.org/3.6/howto/functional.html#iterators does not help much. I went through it, and I think at least one of the commented bits below (# <python code>) should have worked. I am not an experienced Python developer, and I think I am missing some gotcha about the syntax/conventions of Python 3.6 regarding iterators/generators.
issue_counter = 0

def foo_func(serious_stuff):
    # this is actually a call to a module to send an email with the "serious_stuff"
    print("Found: {}".format(serious_stuff))

def report_issue():
    global issue_counter
    # this actually executes once per minute (removed the logic to run this fast)
    while True:
        issue_counter += 1
        # every 6 hours (i.e. 360 minutes) I would like to send emails
        if issue_counter % 360 == 0:
            print("AAA! {}".format(issue_counter))
            # for stuff in map(foo_func, ["foo", "bar"]):
            #     yield stuff
            #     stuff()
            #     print(stuff)
            iterable_stuff = map(foo_func, ["foo", "bar"])
            for stuff in next(iterable_stuff):
                # yield stuff
                print(stuff)

report_issue()
I get lots of different errors/unexpected behaviors from that for loop when running the script:
not printing anything when I call print(...)
TypeError: 'NoneType' object is not callable
AttributeError: 'map' object has no attribute 'next'
TypeError: 'NoneType' object is not iterable
printing what I am expecting, interleaved with None, e.g.:
AAA! 3047040
Found: foo
None
Found: bar
None
I found out the call to next(iterable_thingy) actually invokes the mapped function.
Knowing the length of the input list means we know how many times to invoke next(iterable_thingy), so the function report_issue (from my previous example) runs as expected when defined like this:
def report_issue():
    global issue_counter
    original_data = ["foo", "bar"]
    # this executes once per minute
    while True:
        issue_counter += 1
        # every 6 hours I would like to send emails
        if issue_counter % 360 == 0:
            print("AAA! {}".format(issue_counter))
            iterable_stuff = map(foo_func, original_data)
            for idx in range(len(original_data)):
                next(iterable_stuff)
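For completeness, here is a minimal sketch of my own (not part of the inherited code) showing two simpler ways to run a function for its side effects:

# 1. skip map entirely; a plain for loop is the idiomatic choice
for item in ["foo", "bar"]:
    foo_func(item)

# 2. if map must be kept, force the lazy map object to be fully consumed
list(map(foo_func, ["foo", "bar"]))  # builds a throwaway list of None values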
To troubleshoot this iterable stuff, I found it useful to run ipython (an interactive REPL) and check the type and documentation of the generated iterable, like this:
In [2]: def foo_func(serious_stuff):
   ...:     # this is actually a call to a module to send an email with the "serious_stuff"
   ...:     print("Found: {}".format(serious_stuff))
   ...:

In [3]: iterable_stuff = map(foo_func, ["foo", "bar"])

In [4]: iterable_stuff?
Type:        map
String form: <map object at 0x7fcdbe8647b8>
Docstring:
map(func, *iterables) --> map object

Make an iterator that computes the function using arguments from
each of the iterables. Stops when the shortest iterable is exhausted.

In [5]: next(iterable_stuff)
Found: foo

In [6]: bar_item = next(iterable_stuff)
Found: bar

In [7]: bar_item?
Type:        NoneType
String form: None
Docstring:   <no docstring>

In [8]:
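This also explains the None values interleaved with the expected output: foo_func has no return statement, so each next(...) call prints its message as a side effect and yields None. A hypothetical variant that returns a value instead (my own sketch, not part of the original code):

def foo_func(serious_stuff):
    print("Found: {}".format(serious_stuff))
    return serious_stuff.upper()  # now next(iterable_stuff) yields "FOO", then "BAR"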
Related
I'm very new to Julia, and I'm trying to just pass an array of numbers into a function and count the number of zeros in it. I keep getting the error:
ERROR: UndefVarError: array not defined
I really don't understand what I am doing wrong, so I'm sorry if this seems like such an easy task that I can't do.
function number_of_zeros(lst::array[])
    count = 0
    for e in lst
        if e == 0
            count + 1
        end
    end
    println(count)
end

lst = [0,1,2,3,0,4]
number_of_zeros(lst)
There are two issues with your function definition:
As noted in Shayan's answer and Dan's comment, the array type in Julia is called Array (capitalized) rather than array. To see:
julia> array
ERROR: UndefVarError: array not defined
julia> Array
Array
Empty square brackets are used to instantiate an array, and if preceded by a type, they specifically instantiate an array holding objects of that type:
julia> x = Int[]
Int64[]
julia> push!(x, 3); x
1-element Vector{Int64}:
3
julia> push!(x, "test"); x
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
Thus when you do Array[] you are actually instantiating an empty vector of Arrays:
julia> y = Array[]
Array[]
julia> push!(y, rand(2)); y
1-element Vector{Array}:
[0.10298669573927233, 0.04327245960128345]
Now it is important to note that there's a difference between a type and an object of a type. If you want to restrict the types of the input arguments to your function, you do this by specifying the type that the function should accept, not an instance of that type. To see this, consider what would happen if you had fixed your array typo and annotated with Array[] instead:
julia> f(x::Array[])
ERROR: TypeError: in typeassert, expected Type, got a value of type Vector{Array}
Here Julia complains that you have provided a value of type Vector{Array} in the type annotation, when you should have provided a type.
More generally, though, you should think about why you are adding type restrictions to your functions at all. If you define a function without any input types, Julia will still compile a method instance specialised for the type of input provided when you first call the function, and will therefore (most of the time) generate machine code that is optimal for the specific types passed.
That is, there is no difference between
number_of_zeros(lst::Vector{Int64})
and
number_of_zeros(lst)
in terms of runtime performance when the second definition is called with an argument of type Vector{Int64}. Some people still like type annotations as a form of error check, but you also need to consider that adding type annotations makes your methods less generic and will often prevent you from using them in combination with code other people have written. The most common example of this is Julia's excellent autodiff capabilities: they rely on running your code with dual numbers, a specific numerical type enabling automatic differentiation. If you strictly type your functions as suggested (Vector{Int}), you preclude them from being automatically differentiated in this way.
Finally, just a note of caution about the Array type: Julia's arrays can be multidimensional, which means that Array{Int} is not a concrete type:
julia> isconcretetype(Array{Int})
false
to make it concrete, the dimensionality of the array has to be provided:
julia> isconcretetype(Array{Int, 1})
true
First, it might be better to avoid variable names that shadow function names. count is a built-in function of Julia, so if you want to use the count function inside number_of_zeros, you will undoubtedly face a problem.
Second, consider returning the value instead of printing it (although you didn't place the println call correctly either).
Third, you update a value with +=, not just +!
Last but not least, type names in Julia always start with a capital letter! So there is no standard array type; it's Array.
Here is the correction of your code.
function number_of_zeros(lst::Array{Int64})
    counter = 0
    for e in lst
        if e == 0
            counter += 1
        end
    end
    return counter
end

lst = [0,1,2,3,0,4]
number_of_zeros(lst)
would result in 2.
Additional explanation
Regarding the first point, about avoiding variable names that shadow function names: count is a built-in function of Julia, so if you shadow it with a variable and then want to call the count function inside number_of_zeros, you will face a problem. Check this example:
function number_of_zeros(lst::Array{Int64})
    count = 0
    for e in lst
        if e == 0
            count += 1
        end
    end
    return count, count(==(1), lst)
end

number_of_zeros(lst)
This code will lead to this error:
ERROR: MethodError: objects of type Int64 are not callable
Maybe you forgot to use an operator such as *, ^, %, / etc. ?
Stacktrace:
 [1] number_of_zeros(lst::Vector{Int64})
   @ Main \t.jl:10
 [2] top-level scope
   @ \t.jl:16
This happens because the count variable shadows the count function! It's possible to avoid such problems by calling the function through its module:
function number_of_zeros(lst::Array{Int64})
    count = 0
    for e in lst
        if e == 0
            count += 1
        end
    end
    return count, Base.count(==(1), lst)
end
The point is that I used Base.count, so the compiler knows which count I mean.
I had been having issues with Python 3.7 for quite some time with seemingly pointless indentation errors, so I decided to go back to 3.6, specifically repl.it's Python 3.6.1. As I mentioned, the errors occur for no good reason whatsoever as far as I can tell. The code is written below:
from random import randint
import functools

printf = functools.partial(print, end=" ")
defNuc = ['C','A','T','G']

def opNuc():

def create():
    nuc = [0]
    nucop = [0]
    length = randint(11,16)
    print(length - 1)
    for i in range(1,length):
        part = randint(1,4)
        for a in range(1,4):
            if part == a:
                nuc = defNuc[a]
                nucOp = defNuc[-a]
        if i != length - 1:
            printf(nuc[i],i,"-")
        else:
            print(nuc[i],i)
    for i in range(1,length):
        if i != length - 1:
            printf(nucOp[i],"-")
        else:
            print(nucop[i])
The error is at line 9, at
def create():
and as for the reason for the error, it just says
expected an indented block
Edit:
This was completely my stupidity; don't take the post seriously, it will be deleted in 10 minutes.
You never finished the definition of opNuc, so the parser is expecting an indented line to continue the body of that function. Either add a pass statement to provide a trivial body:
def opNuc():
    pass
or indent the definition of create if that is supposed to be local to the body of opNuc (unlikely, but possible):
def opNuc():
    def create():
        ...
The problem is that your first function, opNuc, was never finished. I have made this simple mistake many times myself, and it is very easy to miss. It's easy to fix, though: just put pass inside the opNuc function and it should be fine. Hope I helped!
Just for learning purposes, I tried to set a dictionary as a global variable via an accumulator. The add function works well, but when I ran the code and updated the dictionary inside the map function, it always came back empty. (Similar code that sets a list as a global variable works, though.)
import re

from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)

if __name__ == "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    # print(rdd.take(5))
    dict1 = sc.accumulator({}, DictParam())

    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1 += {ls[0]: ls[1]}
        return line

    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)
For anyone who arrives at this thread looking for a Dict accumulator for pyspark: the accepted solution does not solve the posed problem.
The issue is actually in the DictParam as defined: it does not update the original dictionary. This works:
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1
The original code was missing the return value.
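For illustration, here is a minimal end-to-end sketch of the fixed accumulator; the SparkContext setup, sample data, and the remember function are my own assumptions, not from the original post:

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1  # returning the merged dict is the crucial fix

sc = SparkContext("local", "dict-accumulator-demo")
dict1 = sc.accumulator({}, DictParam())

def remember(x):
    global dict1
    dict1 += {str(x): x * x}
    return x

rdd = sc.parallelize([1, 2, 3]).map(remember)
rdd.count()  # an action, which forces the map to actually run
print(dict1.value)  # expected: {'1': 1, '2': 4, '3': 9}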
I believe that print(dict1) simply gets executed before the rdd.map() does.
In Spark, there are two types of operations:
transformations, which describe the future computation
and actions, which actually trigger the execution
Accumulators are updated only when some action is executed:
Accumulators do not change the lazy evaluation model of Spark. If they
are being updated within an operation on an RDD, their value is only
updated once that RDD is computed as part of an action.
If you check out the end of this section of the docs, there is an example exactly like yours:
accum = sc.accumulator(0)

def g(x):
    accum.add(x)
    return f(x)

data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
So you would need to add some action, for instance:
rdd = rdd.map(lambda x: file_read(x)).cache() # transformation
foo = rdd.count() # action
print(dict1)
Please make sure to check on the details of various RDD functions and accumulator peculiarities because this might affect the correctness of your result. (For instance, rdd.take(n) will by default only scan one partition, not the entire dataset.)
For accumulator updates performed inside actions only, their value is
only updated once that RDD is computed as part of an action
I'm using Python because it's generally easy to read, but this is not a Python-specific question.
Take the following Python function strip_argument:
def strip_argument(func_with_no_args):
    return lambda unused: func_with_no_args()
In use, I can pass a no-argument function to strip_argument, and it will return a function that accepts one argument that is never used. For example:
# some API I want to use
def set_click_event_listener(listener):
    """Args:
        listener: function which will be passed the view that was clicked.
    """
    # ...implementation...

# my code
def my_click_listener():
    # I don't care about the view, so I don't want to make that an arg.
    print("some view was clicked")

set_click_event_listener(strip_argument(my_click_listener))
Is there a standard name for the function strip_argument? I'm interested in any languages that have a function like this in the standard library.
Most functional programming languages offer a const function: const takes a value and returns a function that ignores its own argument and always returns that value. If you pass a function to const, that's exactly the behavior you described.
In Haskell you can use it like this:
f x = x + 1
g = const f

g 2 3 == 4  -- 2 is ignored and 3 is incremented
I did a quick search for such a function in Python but haven't found anything; it seems the standard approach is to use a lambda function as you did.
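For reference, a const-like helper is easy to write in Python; this is my own sketch, not a standard-library function:

def const(x):
    """Return a one-argument function that ignores its argument and returns x."""
    return lambda _unused: x

# Note the difference from strip_argument: const(my_click_listener) returns
# a function that yields my_click_listener itself, while
# strip_argument(my_click_listener) returns a function that *calls* it.
set_click_event_listener(lambda view: my_click_listener())  # inline equivalent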
Up till now, I have parallelized functions by mapping them onto lists that are distributed out to the various clusters using map_sync(function, list).
Now I need to run a function on each entry of a dictionary.
map_sync does not seem to work on dictionaries. I have also tried to scatter the dictionary and use decorators to run the function in parallel, but dictionaries don't seem to lend themselves to scattering either. Is there some other way to parallelize functions over dictionaries without having to convert them to lists?
These are my attempts thus far:
from IPython.parallel import Client

rc = Client()
dview = rc[:]

test_dict = {'43': "lion", '34': "tiger", '343': "duck"}
dview.scatter("test", test_dict)
dview["test"]
# this yields [['343'], ['43'], ['34'], []] on 4 clusters
# which suggests that a dictionary can't be scattered?
Needless to say, when I run the function itself, I get an error:
@dview.parallel(block=True)
def run():
    for d, v in test.iteritems():
        print d, v

run()

AttributeError                            Traceback (most recent call last)
 in ()
 in run(dict)
AttributeError: 'str' object has no attribute 'iteritems'
I don't know if it's relevant, but I'm using an IPython Notebook connected to Amazon AWS clusters.
You can scatter a dict with:
def scatter_dict(view, name, d):
    """partition a dictionary across the engines of a view"""
    ntargets = len(view)
    keys = d.keys()  # list(d.keys()) in Python 3
    for i, target in enumerate(view.targets):
        subd = {}
        for key in keys[i::ntargets]:
            subd[key] = d[key]
        view.client[target][name] = subd

scatter_dict(dview, 'test', test_dict)
and then operate on it remotely, as you normally would.
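For example, here is a sketch of applying a function to each engine's sub-dict; the shout_values function and the upper-casing are my own assumptions, not from the original answer:

def shout_values(name):
    # runs on each engine: upper-case every value in that engine's sub-dict
    sub = globals()[name]
    return dict((k, v.upper()) for k, v in sub.items())

dview.apply_sync(shout_values, 'test')
# returns one partial dict per engine, e.g.
# [{'343': 'DUCK'}, {'43': 'LION'}, {'34': 'TIGER'}, {}]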
You can also gather the remote dicts into one local one again with:
def gather_dict(view, name):
    """gather dictionaries from a DirectView"""
    merged = {}
    for d in view.pull(name):
        merged.update(d)
    return merged

gather_dict(dview, 'test')
An example notebook