PySpark Array Key,Value - multidimensional-array

I currently have an RDD of key-value pairs where the key is the 2D index into an array and the value is the number at that spot. For example: [((0,0),1),((0,1),2),((1,0),3),((1,1),4)]
I want to add up the value at each key with the surrounding values. In relation to my earlier example, I want to add up 1, 2 and 3 and place the result in the (0,0) key-value spot. How would I do this?

I would suggest you do the following:
Define a function that, given a pair (i,j), returns a list with the pairs corresponding to the positions surrounding (i,j), plus the input pair (i,j) itself. For instance, let's say the function is called surrounding_pairs(pair). Then:
surrounding_pairs((0,0)) = [ (0,0), (0,1), (1,0) ]
surrounding_pairs((2,3)) = [ (2,3), (2,2), (2,4), (1,3), (3,3) ]
Of course, you need to be careful and return only valid positions.
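A minimal sketch of such a function, assuming the four orthogonal neighbours used in the examples above and grid dimensions n_rows x n_cols (placeholder parameters, not from the original post), defaulting to the 2x2 grid of the question:
def surrounding_pairs(pair, n_rows=2, n_cols=2):
    # Return (i, j) itself plus its valid up/down/left/right neighbours,
    # discarding anything that falls outside the grid.
    i, j = pair
    candidates = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for (a, b) in candidates if 0 <= a < n_rows and 0 <= b < n_cols]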
Use a flatMap on your RDD as follows:
MyRDD = MyRDD.flatMap(lambda kv: [(p, kv[1]) for p in surrounding_pairs(kv[0])])  # kv is a (pos, v) pair
This will map your RDD from
[((0,0),1),((0,1),2),((1,0),3),((1,1),4)] to
[((0,0),1),((0,1),1),((1,0),1),
((0,1),2),((0,0),2),((1,1),2),
((1,0),3),((0,0),3),((1,1),3),
((1,1),4),((1,0),4),((0,1),4)]
This way, the value at each position will be "copied" to the neighbour positions.
Finally, just use a reduceByKey to add the corresponding values at each position:
from operator import add
MyRDD = MyRDD.reduceByKey(add)
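Putting the pieces together on the example data, a hypothetical end-to-end run (assuming an existing SparkContext named sc) would look like this:
from operator import add

data = [((0, 0), 1), ((0, 1), 2), ((1, 0), 3), ((1, 1), 4)]
MyRDD = sc.parallelize(data)
MyRDD = MyRDD.flatMap(lambda kv: [(p, kv[1]) for p in surrounding_pairs(kv[0])])
MyRDD = MyRDD.reduceByKey(add)
MyRDD.collect()
# [((0, 0), 6), ((0, 1), 7), ((1, 0), 8), ((1, 1), 9)]  (order may vary)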
I hope this makes sense.

Related

Find which sum of any numbers in an array equals amount

I have a customer who sends electronic payments but doesn't bother to specify which invoices they are for. I'm left guessing which ones, and I would rather not try every single combination manually. I need some sort of pseudo-code to do it and then I can adapt it, but I'm not sure I can come up with a good algorithm myself. I'm familiar with PHP, Bash, and Python, but I can adapt.
I would need an array with the following numbers: [357.15, 223.73, 106.99, 89.96, 312.39, 120.00]. Those are the amounts of the invoices. Then I would need to find a sum of any combination of two or more of those numbers that adds up to 596.57. Once found the program would need to tell me exactly which numbers it used to reach the sum so I can then know which invoices got paid.
This is very similar to the Subset Sum problem and can be solved using a similar approach to the typical brute-force method used for that problem. I have to do this often enough that I keep a simple template of this algorithm handy for when I need it. What is posted below is a slightly modified version[1].
This has no restrictions on whether the values are integer or float. The basic idea is to iterate over the list of input values and keep a running list of every subset that sums to less than the target value (since there might be a later value in the inputs that will yield the target). It could be modified to handle negative values as well by removing the rule that only keeps candidate subsets if they sum to less than the target. In that case, you'd keep all subsets, and then search through them at the end.
import copy

def find_subsets(base_values, target):
    possible_matches = [[0, []]]  # [[known_attainable_value, [list, of, components]], [...], ...]
    matches = []                  # we'll return ALL subsets that sum to `target`
    for base_value in base_values:
        temp = copy.deepcopy(possible_matches)  # Can't modify in loop, so use a copy
        for possible_match in possible_matches:
            new_val = possible_match[0] + base_value
            if new_val <= target:
                new_possible_match = [new_val, possible_match[1]]
                new_possible_match[1].append(base_value)
                temp.append(new_possible_match)
                if new_val == target:
                    matches.append(new_possible_match[1])
        possible_matches = temp
    return matches

find_subsets([list, of, input, values], target_sum)
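As a quick sanity check (my own example, not from the original post), a small run behaves as expected:
find_subsets([1, 2, 3, 4], 6)
# [[1, 2, 3], [2, 4]]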
This is a very inefficient algorithm and it will blow up quickly as the size of the input grows. The Subset Sum problem is NP-Complete, so you are not likely to find a generalized solution that will work in all cases and is efficient.
[1]: The way lists are being used here is kludgy. If the goal were simply to find any match, the nested lists could be replaced with a dictionary, and we could exit right away once a match is found. But doing that causes intermediate subsets that sum to the same value to map to the same dictionary slot, so only one subset with that sum is kept. Since we need to report all matching subsets (because the values represent checks and are presumably not fungible even if the dollar amounts are equal), a dictionary won't work.
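For illustration only, a sketch of that dictionary variant (not part of the original answer): it exits at the first match it finds, but cannot report all of them.
def find_any_subset(base_values, target):
    # Map each attainable sum (below the target) to ONE subset producing it.
    attainable = {0: []}
    for base_value in base_values:
        for value, components in list(attainable.items()):
            new_val = value + base_value
            if new_val == target:
                return components + [base_value]  # exit on the first match
            if new_val < target and new_val not in attainable:
                attainable[new_val] = components + [base_value]
    return None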
You can use itertools.combinations(t, r) to list all combinations of r elements taken from the list t.
So we loop over the possible values of r, then over the results of itertools.combinations:
import itertools
def find_sum(t, obj):
    t = [x for x in t if x < obj]  # filter out elements which are too big
    for r in range(1, len(t) + 1):  # loop on number of elements
        for subt in itertools.combinations(t, r):  # loop on combinations of r elements
            if sum(subt) == obj:
                return subt
    return None
find_sum([1,2,3,4], 6)
# (2, 4)
find_sum([1,2,3,4], 10)
# (1, 2, 3, 4)
find_sum([1,2,3,4], 11)
# None
find_sum([35715, 22373, 10699, 8996, 31239, 12000], 59657)
# None
Rounding errors:
The code above is meant to be used with integers rather than floats.
To use it with floats, replace the test sum(subt) == obj with the more forgiving test abs(sum(subt) - obj) < 0.01.
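For example, a float-tolerant variant might look like this (a sketch using math.isclose with an absolute tolerance rather than the hard-coded 0.01):
import itertools
import math

def find_sum_float(t, obj, tol=0.01):
    t = [x for x in t if x < obj + tol]  # filter out elements which are too big
    for r in range(1, len(t) + 1):
        for subt in itertools.combinations(t, r):
            if math.isclose(sum(subt), obj, abs_tol=tol):
                return subt
    return None

find_sum_float([357.15, 223.73, 106.99, 89.96, 312.39, 120.00], 447.11)
# (357.15, 89.96)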
Relevant documentation:
itertools.combinations

Delete all duplicated elements in a vector in Julia 1.1

I am trying to write code that deletes all repeated elements in a vector. How do I do this?
I already tried using unique and union, but they both keep one copy of each repeated item. I want every copy of a repeated item to be deleted.
For example: let x = [1,2,3,4,1,6,2]. Using union or unique returns [1,2,3,4,6]. What I want as my result is [3,4,6].
There are lots of ways to go about this. One approach that is fairly straightforward and probably reasonably fast is to use countmap from StatsBase:
using StatsBase
function f1(x)
    d = countmap(x)
    return [ key for (key, val) in d if val == 1 ]
end
or as a one-liner:
[ key for (key, val) in countmap(x) if val == 1 ]
countmap creates a dictionary mapping each unique value of x to the number of times it occurs in x. The solution can then easily be found by extracting every key from the dictionary that maps to a val of 1, i.e. all elements of x that occur precisely once.
It might be faster in some situations to use sort!(x) and then construct an index for the elements of the sorted x that only occur once, but this will be messier to code, and also the output will be in sorted order, which you may not want. The countmap method preserves the original ordering.

Simple query in neo4j - no record if a node has degree 0

To better understand how results are formatted in Neo4j:
A simple query where node ENSG00000180447 has no neighbor:
MATCH (d:Target)-[r:Interaction]-(t:Target)
where d.uid = 'ENSG00000180447'
with d, count(t) as degree
Return d, degree
(no changes, no records)
Instead
MATCH (d:Target)
where d.uid = 'ENSG00000180447'
Return d // returns the node
MATCH (d:Target)-[r:Interaction]-(t:Target)
where d.uid = 'ENSG00000180447'
with count(t) as degree
Return degree // returns 0
I would like to get the node and its degree back from the same query.
What is wrong with the first query?
"MATCH" is looking for the exact pattern match, and does not find it for the node with the uid = 'ENSG00000180447'. Two ways:
1) Use OPTIONAL MATCH:
MATCH (d:Target)
WHERE d.uid = 'ENSG00000180447'
OPTIONAL MATCH (d)-[r:Interaction]-(t:Target)
RETURN d, COUNT(t) AS degree
2) Use zero length paths:
MATCH (d:Target)-[r:Interaction*0..1]-(t:Target)
where d.uid = 'ENSG00000180447'
with d, count(t) as degree
Return d, degree-1
The problem, as stdob-- points out, is that when you perform a MATCH, it only returns rows for which the match succeeds. So you're asking for a match between that one specific node and a :Target node via a relationship of type :Interaction. Since no such pattern exists, no rows are returned.
The SIZE() function will probably be your best bet for a concise query, you can use it to find the occurrences of a pattern. In this case, we can use it to find the number of relationships of that type to a :Target node:
MATCH (d:Target)
WHERE d.uid = 'ENSG00000180447'
RETURN d, SIZE( (d)-[:Interaction]-(:Target) ) AS degree
EDIT - explaining why your query returning the node and count returns no rows.
COUNT() is an aggregation that only has context from the non-aggregation columns (grouping key). On its own, COUNT() has no other context and no grouping keys, and it can handle null values:
COUNT(null) = 0.
When we perform MATCHes, we build up rows. Where a MATCH doesn't find any matches, no rows are returned:
MATCH (ele:PinkElephant)
RETURN ele
// (no changes, no records)
When we try to pair this with aggregation, we will still get no rows, because the aggregation will run for every possible row, but there are no rows to execute on:
MATCH (person:Person)-[:Hallucinates]->(ele:PinkElephant)
RETURN ele, COUNT(person)
// (no changes, no records)
In this case, you're asking for rows of :PinkElephant nodes, and for each of those nodes, a count of the people who hallucinate that pink elephant.
But there are no :PinkElephant nodes. There are no rows present for COUNT() to operate on. We can't show any rows because there are no nodes present to populate them.
Even if there WERE :PinkElephant nodes in the graph, if there were no relationships to :Person nodes, the result would be the same. The match would find nothing, because the pattern you asked for (pink elephants that are hallucinated by people) doesn't exist. There are no :PinkElephants that are hallucinated by a :Person, so there are no nodes to populate the ele column, so there are no rows, and if there are no rows, your COUNT() has nothing to execute on and no place to put a return value.

Comparing values of 2 dictionaries

Can anyone advise how I could compare the values of two dictionaries? For example:
A = {'John': [(300, 5000), (700, 750), (10, 300)], 'Mary': [(12, 300), (5678, 9000), (200, 657), (800, 7894)]}
B = {'Jim': [(500,1000), (600,1500), (900,2000)], 'Mary': [(13,250), (1000,6000), (222,600)]}
I would like to compare the two such that if a key (in this case 'Mary') is present in both A and B, and the first and second numbers in a value tuple of B fall within the corresponding value tuple of A (i.e. (13,250) and (222,600) fall within (12,300) and (200,657) respectively), then those B tuples are returned. The result would therefore be 'Mary': [(13,250), (222,600)].
Thanks
Okay, I did what I think you wanted and retrieved the results (13,250) and (222,600), so it looks like it is working. I made three classes: a main class, a Point class, and a class that fills the dictionaries and does the comparing. I didn't know how you built your dictionaries, but I made mine like so:
private Dictionary<String, Map<String, List<Point>>> first
        = new Hashtable<String, Map<String, List<Point>>>();
It is a dictionary that takes a String and a Map, which in turn takes a String and a List of Point objects. When I looked at your snippet, it just screamed Points, so I made a small class with x and y properties.
Next, make a method that compares the values of the points using less-than and greater-than checks.
Then, in another method, after you fill the dictionary, loop over the size of the second dictionary's list:
int size = second.get("B").get("Mary").size();
for(int i =0; i<size; i++){
//compare method that you just made
}
Then print out the results.
My output: Mary:(13,250)(222,600)
If you need any help with the code, please reply.
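For comparison, here is a minimal Python sketch of the same idea (my own illustration, not from the original answer), assuming the tuples are compared position by position, i.e. the first tuple of B['Mary'] against the first tuple of A['Mary'], and so on, which is what the expected output suggests:
def compare_dicts(A, B):
    # For every key present in both dicts, compare tuples position by position
    # and keep the B tuples whose bounds fall inside the corresponding A tuple.
    result = {}
    for key in A.keys() & B.keys():
        matches = [(b1, b2)
                   for (a1, a2), (b1, b2) in zip(A[key], B[key])
                   if a1 <= b1 and b2 <= a2]
        if matches:
            result[key] = matches
    return result

A = {'John': [(300, 5000), (700, 750), (10, 300)],
     'Mary': [(12, 300), (5678, 9000), (200, 657), (800, 7894)]}
B = {'Jim': [(500, 1000), (600, 1500), (900, 2000)],
     'Mary': [(13, 250), (1000, 6000), (222, 600)]}
compare_dicts(A, B)
# {'Mary': [(13, 250), (222, 600)]}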

Get a vertex in a vertex sequence

A vertex sequence returned by igraph does not seem to behave like a sequence. For example:
The vertex sequence from V(module.net) behaves like a sequence, since I can subset it with [deg == 1]. But why doesn't it work when I try peripheral[1]? Any possible explanation for this?
The dataset for this example is not easy to include, sorry about that.
Edit:
I found the answer: the index of the first vertex 'MED24' is 4, not 1. So to get the first vertex I cannot simply use peripheral[1]; I have to index it by its id, peripheral[4]. But this seems a little unreasonable. A reproducible example:
g = graph.ring(5)
V(g)$name = c('node1', 'node2', 'node3','node4','node5')
temp = V(g)[2:3]
If you want to access 'node3' from temp, you have to use temp[3] instead of temp[2].
I've always had trouble with vertex sequences and edge sequences. The problem with the indexing operator on those objects is that it searches by vertex id, not by position. So peripheral[1] is looking to see whether vertex 1 is in the sequence; it is not extracting the first element of the sequence.
The best I've come up with is converting the sequence to a simple vector and re-indexing from that. For example:
el <- cbind(letters[1:5], letters[c(2,3,5,1,4)])
gg <- graph.edgelist(el)
p <- V(gg)[c(2,3)]
V(gg)[as.vector(p)[1]]
Actually, if you just want to extract the name of a particular vertex, then
p$name[1]
would work.
