Parallelize function on dictionary in IPython

Up until now, I have parallelized functions by mapping them onto lists that are distributed out to the various clusters using the function map_sync(function, list).
Now, I need to run a function on each entry of a dictionary.
map_sync does not seem to work on dictionaries. I have also tried to scatter the dictionary and use decorators to run the function in parallel. However, dictionaries don't seem to lend themselves to scattering either. Is there some other way to parallelize functions on dictionaries without having to convert them to lists?
These are my attempts thus far:
from IPython.parallel import Client
rc = Client()
dview = rc[:]
test_dict = {'43':"lion", '34':"tiger", '343':"duck"}
dview.scatter("test", test_dict)
dview["test"]
# this yields [['343'], ['43'], ['34'], []] on 4 clusters
# which suggests that a dictionary can't be scattered?
Needless to say, when I run the function itself, I get an error:
@dview.parallel(block=True)
def run():
    for d, v in test.iteritems():
        print d, v
run()
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
<ipython-input> in run(dict)
AttributeError: 'str' object has no attribute 'iteritems'
I don't know if it's relevant, but I'm using an IPython Notebook connected to Amazon AWS clusters.

You can scatter a dict with:
def scatter_dict(view, name, d):
    """partition a dictionary across the engines of a view"""
    ntargets = len(view)
    keys = d.keys()  # list(d.keys()) in Python 3
    for i, target in enumerate(view.targets):
        subd = {}
        for key in keys[i::ntargets]:
            subd[key] = d[key]
        view.client[target][name] = subd

scatter_dict(dview, 'test', test_dict)
and then operate on it remotely, as you normally would.
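For instance (a hedged sketch, not from the original answer; it assumes the engines already hold their scattered chunks of test), you can run code on every engine against its own piece of the dict and then pull the per-engine results back:
dview.execute("upper_test = {k: v.upper() for k, v in test.items()}")
dview["upper_test"]
# each engine reports only the entries it received,
# e.g. [{'343': 'DUCK'}, {'43': 'LION'}, {'34': 'TIGER'}, {}]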
You can also gather the remote dicts into one local one again with:
def gather_dict(view, name):
    """gather dictionaries from a DirectView"""
    merged = {}
    for d in view.pull(name):
        merged.update(d)
    return merged

gather_dict(dview, 'test')

How do I get information about function calls from a Lua script?

I have a script written in Lua 5.1 that imports a third-party module and calls some functions from it. I would like to get a list of function calls from a module with their arguments (when they are known before execution).
So, I need to write another script which takes the source code of my first script, parses it, and extracts information from its code.
Consider the minimal example.
I have the following module:
local mod = {}
function mod.foo(a, ...)
print(a, ...)
end
return mod
And the following driver code:
local M = require "mod"
M.foo('a', 1)
M.foo('b')
What is the best way to retrieve the "use" occurrences of the M.foo function?
Ideally, I would like to get the information with the name of the function being called and the values of its arguments. From the example code above, it would be enough to get the mapping like this: {'foo': [('a', 1), ('b')]}.
I'm not sure if Lua has functions for reflection to retrieve this information. So probably I'll need to use one of the existing parsers for Lua to get the complete AST and find the function calls I'm interested in.
Any other suggestions?
If you cannot modify the files, you can read them into strings, then parse the mod file to find all the functions it defines, and use that information to parse the target file for all uses of the mod library:
functions = {}
for func in modFile:gmatch("function mod%.(%w+)") do
    functions[func] = {}
end

for func, call in targetFile:gmatch("M%.(%w+)%(([^%)]+)%)") do
    args = {}
    for arg in string.gmatch(call, "([^,]+)") do
        table.insert(args, arg)
    end
    table.insert(functions[func], args)
end
The resulting table can then be serialized:
['foo'] = {{"'a'", " 1"}, {"'b'"}}
Three possible gotchas:
M is not a very unique name and could very possibly match unintended calls to another library.
This example does not handle a function call made inside the arg list, e.g. myfunc(getStuff(), true).
The resulting table does not know the typing of the args, so they are all saved as string representations.
If modifying the target file is an option, you can create a wrapper around your required module:
function log(mod)
    local calls = {}
    local wrapper = {
        __index = function(_, k)
            if mod[k] then
                return function(...)
                    calls[k] = calls[k] or {}
                    table.insert(calls[k], {...})
                    return mod[k](...)
                end
            end
        end,
    }
    return setmetatable({}, wrapper), calls
end
You then use this function like so:
local M, calls = log(require("mod"))
M.foo('a', 1)
M.foo('b')
If your module is not just functions, you would need to handle that in the wrapper; this wrapper assumes every index is a function.
After all your calls, you can serialize the calls table to get the history of all the calls made. For the example code, the table looks like:
{
    ['foo'] = {{'a', 1}, {'b'}}
}

Vertex in Python Gremlin not updating

Using Python Gremlin on the Amazon Neptune workbench, I have two functions:
The first adds a Vertex with a set of properties and returns a reference to the traversal operation.
The second adds to that traversal operation.
For some reason, the first function's operations are getting persisted to the DB, but the second operations do not. Why is this?
Here are the two functions:
def add_v(v_type, name):
    tmp_id = get_id(f"{v_type}-{name}")
    result = g.addV(v_type).property('id', tmp_id).property('name', name)
    result.iterate()
    return result

def process_records(features):
    for i in features:
        v_type = i[0]
        name = i[1]
        v = add_v(v_type, name)
        if len(i) > 2:
            %debug
            props = i[2]
            for r in props:
                v.property(r[0], r[1]).iterate()
Your add_v function has already iterated the traversal. If you want to return the traversal from add_v in a way that lets you keep adding to it, remove the iterate().
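A minimal sketch of what that change might look like (reusing the question's get_id helper and g traversal source): build the traversal in add_v without iterating, chain any extra properties in process_records, and call iterate() once at the end so everything is persisted together.
def add_v(v_type, name):
    tmp_id = get_id(f"{v_type}-{name}")
    # build the traversal but do not iterate yet, so the caller can keep chaining
    return g.addV(v_type).property('id', tmp_id).property('name', name)

def process_records(features):
    for i in features:
        v = add_v(i[0], i[1])
        if len(i) > 2:
            for r in i[2]:
                v = v.property(r[0], r[1])
        v.iterate()  # a single iterate() persists the vertex and all its properties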

Map function and iterables in python 3.6

I've inherited a piece of code that I need to run, with some minor changes, somewhere other than where it originally ran. I am trying to map a list of strings to something that applies a function to each element of that list using Python 3.6 (a language I am not familiar with).
I would like to use map not list comprehension, but now I doubt this is possible.
In the following example I've tried a combination of for loops, yield (or not), and next(...) (or not), but I am not able to make the code work as expected.
I would like to see the print:
AAA! xxx
Found: foo
Found: bar
each time the counter xxx modulo 360 is 0 (zero).
I understand the map function does not execute the code immediately, so I need to do something to "apply" that function to each element of the input list.
However, I am not able to make this work. The documentation at https://docs.python.org/3.6/library/functions.html#map and https://docs.python.org/3.6/howto/functional.html#iterators does not help that much; I went through it and I think at least one of the commented-out bits below should have worked. I am not an experienced Python developer and I think I am missing some gotchas about the syntax/conventions of Python 3.6 regarding iterators/generators.
issue_counter = 0

def foo_func(serious_stuff):
    # this is actually a call to a module to send an email with the "serious_stuff"
    print("Found: {}".format(serious_stuff))

def report_issue():
    global issue_counter
    # this actually executes once per minute (removed the logic to run this fast)
    while True:
        issue_counter += 1
        # every 6 hours (i.e. 360 minutes) I would like to send emails
        if issue_counter % 360 == 0:
            print("AAA! {}".format(issue_counter))
            # for stuff in map(foo_func, ["foo", "bar"]):
            #     yield stuff
            #     stuff()
            #     print(stuff)
            iterable_stuff = map(foo_func, ["foo", "bar"])
            for stuff in next(iterable_stuff):
                # yield stuff
                print(stuff)

report_issue()
I get lots of different errors/unexpected behaviors of that for loop when running the script:
not printing anything when I call print(...)
TypeError: 'NoneType' object is not callable
AttributeError: 'map' object has no attribute 'next'
TypeError: 'NoneType' object is not iterable
Printing what I am expecting, interleaved with None, e.g.:
AAA! 3047040
Found: foo
None
Found: bar
None
I found out that the call to next(iterable_thingy) actually invokes the mapped function.
Knowing the length of the input list used to generate the iterable means we know how many times we have to invoke next(iterable_thingy), so the function report_issue (in my previous example) runs as expected when defined like this:
def report_issue():
    global issue_counter
    original_data = ["foo", "bar"]
    # this executes once per minute
    while True:
        issue_counter += 1
        # every 6 hours I would like to send emails
        if issue_counter % 360 == 0:
            print("AAA! {}".format(issue_counter))
            iterable_stuff = map(foo_func, original_data)
            for idx in range(len(original_data)):
                next(iterable_stuff)
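As an aside (not part of the original question or answer), the more idiomatic way to force the side effects is to skip the lazy map object entirely: either consume it eagerly with list() or loop over the input list directly.
list(map(foo_func, original_data))  # consuming the map object runs foo_func for every element
# or, more simply:
for item in original_data:
    foo_func(item)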
To troubleshoot this iterable stuff, I found it useful to run ipython (an interactive REPL) to check the type and documentation of the generated iterable, like this:
In [2]: def foo_func(serious_stuff):
   ...:     # this is actually a call to a module to send an email with the "serious_stuff"
   ...:     print("Found: {}".format(serious_stuff))
   ...:

In [3]: iterable_stuff = map(foo_func, ["foo", "bar"])

In [4]: iterable_stuff?
Type:        map
String form: <map object at 0x7fcdbe8647b8>
Docstring:
map(func, *iterables) --> map object
Make an iterator that computes the function using arguments from
each of the iterables.  Stops when the shortest iterable is exhausted.

In [5]: next(iterable_stuff)
Found: foo

In [6]: bar_item = next(iterable_stuff)
Found: bar

In [7]: bar_item?
Type:        NoneType
String form: None
Docstring:   <no docstring>

In [8]:

Julia: How to iterate with Channel

When I run the following code, I get a deprecation warning saying produce has been replaced with channels.
function source(dir)
    filelist = readdir(dir)
    for filename in filelist
        name, ext = splitext(filename)
        if ext == ".jld"
            produce(filename)
        end
    end
end

path = "somepathdirectoryhere"

for fname in Task(source(path))
    println(fname)
end
I cannot find an example on how to do this with channels. I've tried creating a global channel and using put! instead of produce with no luck.
Any ideas?
Here's one way. Modify your function to accept a channel argument, and put! data in it:
function source(dir, chnl)
    filelist = readdir(dir)
    for filename in filelist
        name, ext = splitext(filename)
        if ext == ".jld"
            put!(chnl, filename)  # this blocks until "take!" is used elsewhere
        end
    end
end
Then create your task implicitly using the Channel constructor (which takes a function with a single argument representing the channel, so we need to wrap the source function in an anonymous function):
my_channel = Channel( (channel_arg) -> source( pwd(), channel_arg) )
Then, either check that the channel is still open (i.e. the task hasn't finished) and, if so, take an element from it:
julia> while isopen(my_channel)
           take!(my_channel) |> println
       end
no.jld
yes.jld
or, use the channel itself as an iterator (iterating over Tasks is becoming deprecated, along with the produce / consume functionality)
julia> for i in my_channel
           i |> println
       end
no.jld
yes.jld
Alternatively, you can use @schedule with bind etc. as per the documentation, but it seems like the above is the most straightforward way.

accumulator in pyspark with dict as global variable

Just for learning purposes, I tried to set a dictionary as a global variable via an accumulator. The add function works well, but when I ran the code and updated the dictionary inside the map function, it always returned empty.
Similar code that sets a list as a global variable works, though.
import re
from pyspark import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)

if __name__ == "__main__":
    # init_spark is the question's own helper returning (SparkContext, SQLContext)
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    # print(rdd.take(5))
    dict1 = sc.accumulator({}, DictParam())

    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1 += {ls[0]: ls[1]}
        return line

    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)
For anyone who arrives at this thread looking for a dict accumulator for PySpark: the accepted solution does not solve the posed problem.
The issue is actually in the DictParam as defined: it does not update the original dictionary. This works:
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1
The original code was missing the return value.
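A quick way to see why the return matters (a hedged illustration in plain Python, no Spark session needed): PySpark replaces the accumulator's value with whatever addInPlace returns, so a version without the return effectively collapses the merged value to None.
p = DictParam()
merged = p.addInPlace(p.zero(), {'43': 'lion'})
print(merged)  # {'43': 'lion'} with the fixed class; None with the original one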
I believe that print(dict1) simply gets executed before the rdd.map() does.
In Spark, there are two types of operations:
transformations, which describe the future computation,
and actions, which actually trigger the execution.
Accumulators are updated only when some action is executed:
Accumulators do not change the lazy evaluation model of Spark. If they
are being updated within an operation on an RDD, their value is only
updated once that RDD is computed as part of an action.
If you check out the end of this section of the docs, there is an example exactly like yours:
accum = sc.accumulator(0)

def g(x):
    accum.add(x)
    return f(x)

data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
So you would need to add some action, for instance:
rdd = rdd.map(lambda x: file_read(x)).cache() # transformation
foo = rdd.count() # action
print(dict1)
Please make sure to check on the details of various RDD functions and accumulator peculiarities because this might affect the correctness of your result. (For instance, rdd.take(n) will by default only scan one partition, not the entire dataset.)
For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once.
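As a hedged aside, not part of the original answers: if the goal is simply to end up with a driver-side dict built from the file, a plain transformation followed by the collectAsMap() action avoids the accumulator caveats entirely.
# a minimal sketch; assumes each input line looks like "key,value"
pairs = sc.textFile('input').map(lambda line: (line.split(',')[0], line.split(',')[1]))
dict1 = pairs.collectAsMap()  # collectAsMap() is an action, so the map actually runs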
