What is the "some" meaning in Collect result in Scala - rdd

"some" is not a special term which makes the googling seem to just ignore that search.
My question comes from the experiment below:
b.collect:
Array[(Int, String)] = Array((3,dog), (6,salmon), (3,rat), (8,elephant))
d.collect:
Array[(Int, String)] = Array((3,dog), (3,cat), (6,salmon), (6,rabbit), (4,wolf), (7,penguin))
If I join the two and collect the result with b.join(d).collect, I get the following:
Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (3,(dog,dog)), (3,(dog,cat)), (3,(rat,dog)), (3,(rat,cat)))
That makes sense. However, if I run b.leftOuterJoin(d).collect, I get:
Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))), (3,(dog,Some(dog))), (3,(dog,Some(cat))), (3,(rat,Some(dog))), (3,(rat,Some(cat))), (8,(elephant,None)))
My question is: why are the two results expressed differently? Why does the second result contain "Some"? What is the difference between a value with "Some" and one without? Can "Some" be removed? Does "Some" affect later operations on the contents of the RDD?
Thank you very much.

When you do a normal join with b.join(d).collect, you get Array[(Int, (String, String))].
This is because an inner join keeps only the keys present in both RDD b and RDD d, so every result is guaranteed to have a value from each side.
But when you use b.leftOuterJoin(d).collect, the return type is Array[(Int, (String, Option[String]))]. The Option is there to handle missing values: in leftOuterJoin, there is no guarantee that every key of RDD b is present in RDD d, so the right-hand value is returned as Option[String], which takes one of two forms:
Some(String) => if the key is present in both RDDs
None => if the key is present in b but not in d
You can remove the Option by extracting the value from Some and supplying a default for None, as below.
val z = b.leftOuterJoin(d).map(x => (x._1, (x._2._1, x._2._2.getOrElse("")))).collect
Now you should get Array[(Int, (String, String))], with output such as:
Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (3,(dog,dog)), (3,(dog,cat)), (3,(rat,dog)), (3,(rat,cat)), (8,(elephant,)))
Where you can replace "" with any other string as you require.
Hope this helps.
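For readers without a Spark shell at hand, the behaviour can be modelled in plain Python. This is only a rough sketch, not Spark: left_outer_join is a hypothetical helper written for illustration, and Python's None stands in for Scala's None, while a bare string stands in for Some(...). The data is the b and d from the question.

```python
# Plain-Python model of leftOuterJoin semantics (illustrative only, not Spark).
b = [(3, "dog"), (6, "salmon"), (3, "rat"), (8, "elephant")]
d = [(3, "dog"), (3, "cat"), (6, "salmon"), (6, "rabbit"), (4, "wolf"), (7, "penguin")]


def left_outer_join(left, right):
    """Pair each left value with every matching right value, or with None if the key has no match."""
    by_key = {}
    for key, value in right:
        by_key.setdefault(key, []).append(value)

    out = []
    for key, value in left:
        matches = by_key.get(key)
        if matches:
            out.extend((key, (value, m)) for m in matches)
        else:
            # No match on the right side: this is where Scala reports None.
            out.append((key, (value, None)))
    return out


joined = left_outer_join(b, d)

# Counterpart of getOrElse(""): substitute a default for the missing right-hand value.
flattened = [(k, (l, r if r is not None else "")) for k, (l, r) in joined]

print(joined)
print(flattened)
```

Only the key 8 has no partner in d, so it is the only entry whose right-hand side is missing before the default is applied.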

Related

Sort dictionary based on len(value) where value is a set

I know there are many solutions (How do I sort a dictionary by value?) for sorting a dictionary by value. However, most of those predate Python 3.7's changes to dictionaries.
I am also aware of Fastest way to sort a python 3.7+ dictionary, which seems close to the answer I need.
I have a large dictionary of keys that are ints and values that are sets of strings.
I want to create a new dictionary that is sorted by the length of the set of each value.
Dictionary:
dict1={
'12':{'sym1', 'sym2'},
'13':{'sym1', 'sym4', 'sym5', 'sym6'},
'14':{'sym1', 'sym3'},
'15':{'sym2'},
'16':{'sym2'},
'17':{'sym2'},
'18':{'sym3', 'sym89', 'sym34', 'sym5', 'sym88'}
}
New sorted dictionary:
>>> sorted_dict1
{
'18':{'sym3', 'sym89', 'sym34', 'sym5', 'sym88'},
'13':{'sym1', 'sym4', 'sym5', 'sym6'},
'12':{'sym1', 'sym2'},
'14':{'sym1', 'sym3'},
'15':{'sym2'},
'16':{'sym2'},
'17':{'sym2'}
}
I think this is a very slow way to do it, but here goes:
First, create a dictionary with the same keys, where each value is the length of the corresponding set.
from operator import itemgetter
dict1={
'12':{'sym1', 'sym2'},
'13':{'sym1', 'sym4', 'sym5', 'sym6'},
'14':{'sym1', 'sym3'},
'15':{'sym2'},
'16':{'sym2'},
'17':{'sym2'},
'18':{'sym3', 'sym89', 'sym34', 'sym5', 'sym88'}
}
dict1_len ={}
for k, v in dict1.items():
    dict1_len[k] = len(v)
Sort dict1_len by the value numbers in reverse (descending).
sorted_dict1_len = {k: v for k,v in sorted(dict1_len.items(), key=itemgetter(1), reverse=True)}
Then, using the keys in the order given by sorted_dict1_len, add each key and its value from the original dict1 to sorted_dict1.
sorted_dict1 = {}
for k in sorted_dict1_len.keys():
    print(k)
    sorted_dict1[k] = dict1[k]
Edit: improved answer
from operator import itemgetter
dict1={
'12':{'sym1', 'sym2'},
'13':{'sym1', 'sym4', 'sym5', 'sym6'},
'14':{'sym1', 'sym3'},
'15':{'sym2'},
'16':{'sym2'},
'17':{'sym2'},
'18':{'sym3', 'sym89', 'sym34', 'sym5', 'sym88'}
}
def len_val(tup):
    return len(tup[1])  # length of the value, i.e. number of elements in the set
dict2 = sorted(dict1.items(), key=len_val, reverse=True)
print(dict2)
This prints a list of (key, value) tuples, since sorted() returns a list rather than a dict:
[('18', {'sym88', 'sym5', 'sym89', 'sym3', 'sym34'}), ('13', {'sym1', 'sym5', 'sym6', 'sym4'}), ('12', {'sym1', 'sym2'}), ('14', {'sym3', 'sym1'}), ('15', {'sym2'}), ('16', {'sym2'}), ('17', {'sym2'})]
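If a dictionary is wanted rather than a list of tuples, the sorted pairs can be passed straight to dict(). A minimal sketch, assuming Python 3.7+ where dicts preserve insertion order; the name sorted_dict1 follows the question:

```python
dict1 = {
    '12': {'sym1', 'sym2'},
    '13': {'sym1', 'sym4', 'sym5', 'sym6'},
    '14': {'sym1', 'sym3'},
    '15': {'sym2'},
    '16': {'sym2'},
    '17': {'sym2'},
    '18': {'sym3', 'sym89', 'sym34', 'sym5', 'sym88'},
}

# Sort the (key, value) pairs by the size of each set, largest first,
# then rebuild a dict; insertion order is kept on Python 3.7+.
sorted_dict1 = dict(sorted(dict1.items(), key=lambda kv: len(kv[1]), reverse=True))

print(list(sorted_dict1))
```

Because Python's sort is stable, keys whose sets have the same length keep their original relative order, which matches the desired output in the question.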

Julia dictionary "key not found" only when using loop

Still trying to figure out this problem (I was having problems building a dictionary, but managed to get that working thanks to rickhg12hs).
Here's my current code:
#open files with codon:amino acid pairs, initiate dictionary:
file = open(readall, "rna_codons.txt")
seq = open(readall, "rosalind_prot.txt")
codons = {"UAA" => "stop", "UGA" => "stop", "UAG" => "stop"}
#generate dictionary entries using pairs from file:
for m in eachmatch(r"([AUGC]{3,3})\s([A-Z])\s", file)
    codon, aa = m.captures
    codons[codon] = aa
end
All of that code seems to work as intended. At this point, I have the dictionary I want, and the right keys point to the right entries. If I just do print(codons["AUG"]) for example, it prints 'M', which is the correct output. Now I want to scan through a string in the second file, and for every 3 letters, pull out the entry referenced in the dictionary and add it to the prot string. So I tried:
for m in eachmatch(r"([AUGC]{3,3})", seq)
    amac = codons[m.captures]
    prot = "$prot$amac"
end
But this kicks out the error key not found: ["AUG"]. I know the key exists, because I can print codons["AUG"] and it returns the proper entry, so why can't it find that key when it's in the loop?

Pyparsing - name not starting with a character

I am trying to use Pyparsing to identify a keyword that does not begin with $. For the following input:
$abc = 5 # is not a valid one
abc123 = 10 # is valid one
abc$ = 23 # is a valid one
I tried the following:
var = Word(printables, excludeChars='$')
var.parseString('$abc')
But this doesn't allow any $ in var. How can I specify all printable characters other than $ in the first character position? Any help will be appreciated.
Thanks
Abhijit
You can use the method I used to define "all characters except X" before I added the excludeChars parameter to the Word class:
NOT_DOLLAR_SIGN = ''.join(c for c in printables if c != '$')
keyword_not_starting_with_dollar = Word(NOT_DOLLAR_SIGN, printables)
This should be a bit more efficient than building up with a Combine and a NotAny. But this will match almost anything, integers, words, valid identifiers, invalid identifiers, so I'm skeptical of the value of this kind of expression in your parser.
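To see what the two-argument form Word(initial_chars, body_chars) checks, here is a rough standard-library sketch. It is not pyparsing itself: leading_word is a hypothetical helper, PRINTABLES is rebuilt by hand as printable ASCII minus whitespace (which, as far as I know, is how pyparsing defines printables), and unlike pyparsing it does not skip leading whitespace.

```python
import string

# Printable ASCII without whitespace, mirroring what pyparsing calls `printables`
# (an assumption about its definition, rebuilt here to stay stdlib-only).
PRINTABLES = "".join(c for c in string.printable if c not in string.whitespace)
NOT_DOLLAR_SIGN = "".join(c for c in PRINTABLES if c != "$")


def leading_word(text, init_chars=NOT_DOLLAR_SIGN, body_chars=PRINTABLES):
    """Return the longest prefix whose first character is in init_chars and
    whose remaining characters are in body_chars, or None if there is none."""
    if not text or text[0] not in init_chars:
        return None
    end = 1
    while end < len(text) and text[end] in body_chars:
        end += 1
    return text[:end]


for line in ["$abc = 5", "abc123 = 10", "abc$ = 23"]:
    print(repr(line), "->", leading_word(line))
```

The first input is rejected because its first character is $, while abc$ is accepted because $ is only excluded from the first position.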

Neo4j cypher - Compare two collections to get at least one same element

I want to return all users that I follow who are not members of any groups that I am in. If a followed user is a member of even one group that I am in, it should not be returned.
However, I am getting an error:
None.get
Neo.DatabaseError.Statement.ExecutionFailure
when I try this query:
MATCH (g1:groups)<-[:MEMBER_OF]-(u1:users{userid1:"56"})-[:FOLLOWS]->(u2:users)-[:MEMBER_OF]->(g2:groups)
WITH collect(g1.groupid) AS my_groups,u2,collect(g2.groupid) AS foll_groups
WHERE NOT any(t in foll_groups WHERE t IN extract(x IN my_groups))
RETURN u2
Here is one solution:
MATCH (g1:groups)<-[:MEMBER_OF]-(u1:users { userid1:"56" })-[:FOLLOWS]->(u2:users)-[:MEMBER_OF]->(g2:groups)
WITH u2, collect(g2) AS foll_groups, collect(g1) AS my_groups
WITH u2, reduce(dup = FALSE, g IN foll_groups | (dup OR g IN my_groups)) AS has_dup
WHERE NOT has_dup
RETURN u2;

grep on two strings

I'm trying to match two different elements in a vector of strings.
The strings look like this:
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried the different options in ?grep, but I can't get it right. I'm doing something like this:
grep('[_abc]:[_zxy]',str, value = TRUE)
What I would like is:
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
Any help would be appreciated.
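Use normal parentheses ( ) with | for alternation, not square brackets [ ]. Square brackets define a character class, so [_abc] matches any single one of the characters _, a, b or c.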
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
Which will match an underscore _ followed by any character . three times {3} followed immediately by the end of the string $
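The patterns above can be cross-checked outside R. The sketch below uses Python's re module, which is an assumption made for illustration (the thread itself uses R's grep); grep_values is a hypothetical helper that imitates grep(..., value = TRUE) by keeping the strings in which the pattern is found.

```python
import re

strs = ['a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy']


def grep_values(pattern, items):
    """Keep the items in which the pattern matches anywhere in the string."""
    return [s for s in items if re.search(pattern, s)]


# Alternation inside a group: underscore followed by abc or zxy.
alternation = grep_values(r'_(abc|zxy)', strs)

# Underscore, then any three characters, then end of string.
suffix = grep_values(r'_.{3}$', strs)

# The pattern from the question: two character classes separated by a literal colon.
char_class = grep_values(r'[_abc]:[_zxy]', strs)

print(alternation)
print(suffix)
print(char_class)
```

The question's pattern finds nothing because it requires a literal colon between two single characters, and none of the strings contains one.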
This should work: grep('_abc|_zxy', str, value = TRUE)
X|Y matches when either X matches or Y matches
In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?
