accumulator in pyspark with dict as global variable

Just for learning purposes, I tried to use a dictionary as a global variable in an accumulator. The add function works well, but when I run the code and update the dictionary inside the map function, it always comes back empty.
Similar code that uses a list as the global variable works, though.
class DictParam(AccumulatorParam):
    def zero(self, value = ""):
        return dict()
    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)
if __name__== "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    #print(rdd.take(5))
    dict1 = sc.accumulator({}, DictParam())
    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1+={ls[0]:ls[1]}
        return line
    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)

For anyone who arrives at this thread looking for a Dict accumulator for pyspark: the accepted solution does not solve the posed problem.
The issue is actually in the DictParam definition: addInPlace updates the dictionary but does not return it, so the accumulator never receives the merged value. This works:
class DictParam(AccumulatorParam):
    def zero(self, value = ""):
        return dict()
    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1
The original code was missing the return value.
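To see why the missing return matters, here is a minimal sketch that runs without Spark. ToyAccumulator is a hypothetical stand-in, not PySpark's class. It assumes that the accumulator stores whatever addInPlace returns, which is the behavior this answer relies on.

```python
class ToyAccumulator:
    """Hypothetical stand-in: keeps whatever add_in_place returns as its new value."""

    def __init__(self, value, add_in_place):
        self.value = value
        self.add_in_place = add_in_place

    def __iadd__(self, term):
        # Assumption: the return value of add_in_place replaces the stored value.
        self.value = self.add_in_place(self.value, term)
        return self


def add_without_return(acc1, acc2):
    acc1.update(acc2)  # mutates acc1, but the function implicitly returns None


def add_with_return(acc1, acc2):
    acc1.update(acc2)
    return acc1


broken = ToyAccumulator({}, add_without_return)
broken += {"a": "1"}  # stored value is now None

fixed = ToyAccumulator({}, add_with_return)
fixed += {"a": "1"}
fixed += {"b": "2"}   # stored value holds both entries
```

With the version that lacks the return, the stored value is replaced by None after the first update, so the merged dictionary is lost.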

I believe that print(dict1) simply gets executed before the rdd.map() does.
In Spark, there are two types of operations:
transformations, which describe a future computation
and actions, which actually trigger the execution
Accumulators are updated only when some action is executed:
Accumulators do not change the lazy evaluation model of Spark. If they
are being updated within an operation on an RDD, their value is only
updated once that RDD is computed as part of an action.
If you check out the end of this section of the docs, there is an example exactly like yours:
accum = sc.accumulator(0)
def g(x):
    accum.add(x)
    return f(x)
data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
So you would need to add some action, for instance:
rdd = rdd.map(lambda x: file_read(x)).cache() # transformation
foo = rdd.count() # action
print(dict1)
Please make sure to check on the details of various RDD functions and accumulator peculiarities because this might affect the correctness of your result. (For instance, rdd.take(n) will by default only scan one partition, not the entire dataset.)
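The same effect can be reproduced without Spark. This is only a plain-Python analogy: the built-in map() is lazy, loosely like an RDD transformation, and list() plays the role of an action. The sample lines and the seen dictionary are made up for illustration.

```python
seen = {}  # stands in for the accumulator


def file_read(line):
    key, value = line.split(",")
    seen[key] = value  # side effect, comparable to the accumulator update
    return line


lines = ["a,1", "b,2"]

mapped = map(file_read, lines)  # lazy: file_read has not been called yet
before = dict(seen)             # snapshot taken before anything is consumed

consumed = list(mapped)         # forces evaluation, like an action would
after = dict(seen)              # snapshot taken after evaluation
```

Here before is still empty, while after contains both entries, which mirrors printing the accumulator before and after an action.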

For accumulator updates performed inside actions only, their value is
only updated once that RDD is computed as part of an action

Related

Python, square brackets after lambda call and dictionary comprehension

I am working through Miguel Morales' Deep Reinforcement Learning book and have come across some syntax that I am unfamiliar with. I have looked at tutorials on dictionary comprehensions and lambda functions, but I have yet to find what the square brackets at the end of the lambda do. Can anyone help?
def policy_improvement(value, mdp, gamma=1.0):
    ''' Performs improvement of a given policy by evaluating the actions for each state and choosing greedily. '''
    Q = np.zeros((len(mdp), len(mdp[0])), dtype=np.float64)
    for state in range(len(mdp)):
        for action in range(len(mdp[state])):
            for transition_prob, state_prime, reward, done in mdp[state][action]:
                Q[state][action] += transition_prob * (reward + gamma * value[state_prime] * (not done)) # Update the Q value for each action.
    new_policy_pie = lambda state: {state:action for state, action in enumerate(np.argmax(Q, axis=1))}[state]
    return new_policy_pie
The lambda is made of two parts:
# 1. Create a map of state to action
d = {state:action for state, action in enumerate(np.argmax(Q, axis=1))}
# 2. Return the value for argument `state`
d[state]
So the [state] bit is part of the lambda: it is the dictionary lookup applied to the freshly built dict.
The intent might have been to write a fail-safe lambda in case the state does not exist, but then they should use .get(state) instead of [].
So in the end, your code above could be replaced with:
lambda state: np.argmax(Q, axis=1)[state]
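A small check of that equivalence, as a sketch only: it uses plain lists instead of NumPy to stay dependency-free, and the Q values are made up. The greedy list plays the role of np.argmax(Q, axis=1).

```python
# Hypothetical Q table: 3 states x 2 actions.
Q = [[0.1, 0.9], [0.7, 0.2], [0.5, 0.5]]

# Index of the best action per state (first maximum wins on ties).
greedy = [row.index(max(row)) for row in Q]

# Form used in the book: build a dict on every call, then look up `state`.
policy_dict = lambda state: {s: a for s, a in enumerate(greedy)}[state]

# Simplified form: index the per-state result directly.
policy_direct = lambda state: greedy[state]

same = all(policy_dict(s) == policy_direct(s) for s in range(len(Q)))
```

Both lambdas return the same action for every valid state. The dict version just rebuilds the mapping on each call.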

Hiding a System.Random instance by returning a function

The following code comes from Stylish F# 6: Crafting Elegant Functional Code for .NET 6 listing 9-13:
let randomByte =
    let r = System.Random()
    fun () ->
        r.Next(0, 255) |> byte
// E.g. A3-52-31-D2-90-E6-6F-45-1C-3F-F2-9B-7F-58-34-44-
for _ in 0..15 do
    printf "%X-" (randomByte())
printfn ""
The author states, "Although we call randomByte() multiple times, only one System.Random() instance is created."
I understand randomByte returns a function that does not create a System.Random() instance, but it seems to me multiple System.Random() instances would be created each time through the for-do-loop anyway.
I would appreciate an explanation of how multiple instances of System.Random() are not created in this case.
The key point is that randomByte is not a function. It's a value with some complex initialization logic. Like, for example, I could write:
let x = 5
Or I could write:
let x =
    let fortyTwo = 42
    let thirtySeven = 37
    fortyTwo - thirtySeven
And these would be equivalent. Both declare a value named x that is equal to 5. I hope you can see how the expression fortyTwo - thirtySeven is evaluated only once, not every time somebody gets the value of x.
And so it works with randomByte too: it's a value with non-trivial initialization logic. During that value's initialization, first it creates an instance of System.Random, and then it creates an anonymous function that closes over that instance, and this anonymous function becomes the value of randomByte.
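The same idea can be sketched in Python as an analogy, not a translation of the F# listing. The names make_random_byte and created are hypothetical. The counter shows that the generator is constructed once, no matter how often the returned function is called.

```python
import random

created = 0  # counts how many random.Random instances get constructed


def make_random_byte():
    global created
    rng = random.Random()  # runs once, when make_random_byte is called
    created += 1
    # The returned function closes over rng; it never constructs a new one.
    return lambda: rng.randrange(0, 255)


random_byte = make_random_byte()  # the one-time "initialization"
values = [random_byte() for _ in range(16)]
```

After sixteen calls, created is still 1, because only the closure body runs on each call.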

porting python class to Julialang

I see that Julia explicitly does NOT have classes, and that I should embrace mutable structs instead. Am I going down the correct path here? I diffed my trivial example against an official Flux library but cannot work out how to reference self as I would on a Python object. Is the cleanest way simply to pass the type as a parameter to the function?
Python
# Dense Layer
class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        pass
My JuliaLang version so far
mutable struct LayerDense
    num_inputs::Int64
    num_neurons::Int64
    weights
    biases
end
function forward(layer::LayerDense, inputs)
    layer.weights = 0.01 * randn(layer.num_inputs, layer.num_neurons)
    layer.biases = zeros((1, layer.num_neurons))
end
The Flux library's version of a dense layer looks very different to me, and I do not know what it is doing or why. For example, where is the forward pass call? Is it here in Flux, just named after the layer, Dense?
source : https://github.com/FluxML/Flux.jl/blob/b78a27b01c9629099adb059a98657b995760b617/src/layers/basic.jl#L71-L111
struct Dense{F, M<:AbstractMatrix, B}
    weight::M
    bias::B
    σ::F
    function Dense(W::M, bias = true, σ::F = identity) where {M<:AbstractMatrix, F}
        b = create_bias(W, bias, size(W,1))
        new{F,M,typeof(b)}(W, b, σ)
    end
end
function Dense(in::Integer, out::Integer, σ = identity;
               initW = nothing, initb = nothing,
               init = glorot_uniform, bias=true)
    W = if initW !== nothing
        Base.depwarn("keyword initW is deprecated, please use init (which similarly accepts a funtion like randn)", :Dense)
        initW(out, in)
    else
        init(out, in)
    end
    b = if bias === true && initb !== nothing
        Base.depwarn("keyword initb is deprecated, please simply supply the bias vector, bias=initb(out)", :Dense)
        initb(out)
    else
        bias
    end
    return Dense(W, b, σ)
end
This is an equivalent of your Python code in Julia:
mutable struct Layer_Dense
    weights::Matrix{Float64}
    biases::Matrix{Float64}
    Layer_Dense(n_inputs::Integer, n_neurons::Integer) =
        new(0.01 * randn(n_inputs, n_neurons),
            zeros((1, n_neurons)))
end
forward(ld::Layer_Dense, inputs) = nothing
What is important here:
I create only an inner constructor here, as an outer constructor is not needed; in contrast, in the Flux.jl code you linked, the Dense type defines both inner and outer constructors
in Python the forward function does not do anything, so I mirrored that in Julia (your Julia code worked a bit differently); note that instead of self one passes an instance of the object to the function as the first argument (and adds the ::Layer_Dense type annotation so that Julia knows how to dispatch it correctly)
similarly, in Python you store only weights and biases in the class, and I have reflected this in the Julia code; note, however, that for performance reasons it is better to give these two fields of the Layer_Dense struct explicit types
like where is the forward pass call
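In the code you have shared, only the constructors of the Dense object are defined. However, in the lines below them (here and here) the Dense type is defined to be a functor, which means an instance can be called like a function, and that call is the forward pass.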
Functors are explained here (in general) and in here (more specifically for your use case)
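Since the question comes from Python, the closest Python counterpart to a functor is a class that defines __call__. The sketch below is an analogy under that assumption and not Flux's implementation. LayerDense here is a hypothetical class that uses plain lists instead of NumPy to stay self-contained.

```python
import random


class LayerDense:
    """Hypothetical layer: calling the instance runs the forward pass."""

    def __init__(self, n_inputs, n_neurons):
        # n_inputs x n_neurons weights, small random values as in the question.
        self.weights = [
            [0.01 * random.gauss(0, 1) for _ in range(n_neurons)]
            for _ in range(n_inputs)
        ]
        self.biases = [0.0] * n_neurons

    def __call__(self, inputs):
        # inputs: one sample given as a list of n_inputs floats.
        return [
            sum(x * row[j] for x, row in zip(inputs, self.weights)) + self.biases[j]
            for j in range(len(self.biases))
        ]


layer = LayerDense(3, 2)
output = layer([1.0, 2.0, 3.0])      # the instance is used like a function
zero_output = layer([0.0, 0.0, 0.0])  # all-zero input yields only the biases
```

Calling layer(...) plays the role that a separate forward(layer, ...) function would otherwise have.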

Vertex in Python Gremlin not updating

Using python gremlin on Neptune workbench, I have two functions:
The first adds a Vertex with a set of properties, and returns a reference to the traversal operation
The second adds to that traversal operation.
For some reason, the first function's operations are getting persisted to the DB, but the second operations do not. Why is this?
Here are the two functions:
def add_v(v_type, name):
    tmp_id = get_id(f"{v_type}-{name}")
    result = g.addV(v_type).property('id', tmp_id).property('name', name)
    result.iterate()
    return result
def process_records(features):
    for i in features:
        v_type = i[0]
        name = i[1]
        v = add_v(v_type, name)
        if len(i) > 2:
            %debug
            props = i[2]
            for r in props:
                v.property(r[0], r[1]).iterate()
Your add_v method has already iterated the traversal. If you want to return the traversal from add_v so that you can add to it, remove the iterate() call there and iterate once after all steps have been added.
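The difference between the two call orders can be illustrated with a toy model. ToyTraversal is a hypothetical class, not part of gremlin_python. It assumes, in line with the behavior described in the question, that a traversal sends its steps when first iterated and is spent afterwards.

```python
class ToyTraversal:
    """Toy stand-in for a traversal: steps are sent on the first iterate()."""

    def __init__(self, db):
        self.db = db      # a list acting as the "database"
        self.props = {}
        self.done = False

    def property(self, key, value):
        self.props[key] = value
        return self

    def iterate(self):
        # Assumption: once iterated, a traversal is spent and is not sent again.
        if not self.done:
            self.db.append(dict(self.props))
            self.done = True
        return self


db_early, db_late = [], []

# Pattern from the question: iterate inside the helper, then add more steps.
t = ToyTraversal(db_early).property("name", "a")
t.iterate()
t.property("color", "red").iterate()  # too late, the traversal is spent

# Suggested pattern: add every step first, iterate once at the end.
t = ToyTraversal(db_late).property("name", "a")
t.property("color", "red").iterate()
```

In the first pattern only the name is stored. In the second, both properties arrive together.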

Delete key in map

I have a map:
var sessions = map[string] chan int{}
How do I delete sessions[key]? I tried:
sessions[key] = nil,false;
That didn't work.
Update (November 2011):
The special syntax for deleting map entries is removed in Go version 1:
Go 1 will remove the special map assignment and introduce a new built-in function, delete: delete(m, x) will delete the map entry retrieved by the expression m[x]. ...
Go introduced a delete(map, key) function:
package main
func main() {
    var sessions = map[string]chan int{}
    delete(sessions, "moo")
}
Copied from Go 1 release notes
In the old language, to delete the entry with key k from the map represented by m, one wrote the statement,
m[k] = value, false
This syntax was a peculiar special case, the only two-to-one assignment. It required passing a value (usually ignored) that is evaluated but discarded, plus a boolean that was nearly always the constant false. It did the job but was odd and a point of contention.
In Go 1, that syntax has gone; instead there is a new built-in function, delete. The call
delete(m, k)
will delete the map entry retrieved by the expression m[k]. There is no return value. Deleting a non-existent entry is a no-op.
Updating: Running go fix will convert expressions of the form m[k] = value, false into delete(m, k) when it is clear that the ignored value can be safely discarded from the program and false refers to the predefined boolean constant. The fix tool will flag other uses of the syntax for inspection by the programmer.
From Effective Go:
To delete a map entry, use the delete built-in function, whose arguments are the map and the key to be deleted. It's safe to do this even if the key is already absent from the map.
delete(timeZone, "PDT") // Now on Standard Time
delete(sessions, "anykey")
These days, nothing will crash.
Use make (chan int) instead of nil. The first value has to be the same type that your map holds.
package main
import "fmt"
func main() {
    var sessions = map[string]chan int{}
    sessions["somekey"] = make(chan int)
    fmt.Printf("%d\n", len(sessions)) // 1
    // Remove somekey's value from sessions
    delete(sessions, "somekey")
    fmt.Printf("%d\n", len(sessions)) // 0
}
UPDATE: Corrected my answer.
