I have a cluster with 3 nodes.
I have a JSON file in which each line is a JSON string.
I need to partition the data into X blocks based on an ID field in each line, so that lines with the same ID are processed on the same node.
How can I do the partition?
I am using SparkR; the structure of the code looks like this:
getObj = function(x) {
  rec1 = rjson::fromJSON(x)
  kv = list(rec1$id, rec1)
  return(kv)
}

data = SparkR:::textFile(sc, "Path")
mapdata = SparkR:::map(data, getObj)
mapdataP = SparkR:::partitionBy(mapdata, 100)
The ID ranges from 1 to 100. I aim to partition into 100 parts, so that each part holds exactly one ID. However, the code above does not give the expected result: some partitions are null. For instance, when I try to get the second partition using

result = SparkR:::collectPartition(mapdataP, 1L)

it returns NULL.
Is there something missing or wrong here? Many thanks!
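For reference, my understanding is that a default hash partitioner computes partition = hash(key) mod numPartitions, so several IDs can collide into the same partition while others stay empty. A minimal plain-Python sketch of that arithmetic (just an illustration, not SparkR's actual hash function):

num_partitions = 100
ids = range(1, 101)

assignment = {}
for i in ids:
    # Hash the string form: Python hashes small ints to themselves,
    # which would hide the collisions this sketch is meant to show.
    p = hash(str(i)) % num_partitions
    assignment.setdefault(p, []).append(i)

empty = num_partitions - len(assignment)
print(f"{empty} of {num_partitions} partitions are empty")

Collisions like these would explain why some partitions come back NULL from collectPartition.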
I'm using the following code to get a user's recovery_token and store it in a variable:
Connect To Database psycopg2 ${DB_NAME}
... ${DB_USER_NAME}
... ${DB_USER_PASSWORD}
... ${DB_HOST}
... ${DB_PORT}
${RECOVERY_TOKEN}= Query select recovery_token FROM public."system_user" where document_number like '57136570514'
Looking at the log, the recovery_token is being saved as follows:
${RECOVERY_TOKEN} = [('eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6ImU3ZGM4MmNjLTliMGQtNDc3OC1hMzM0LWEyNjM4MDU1Mzk1MSIsImlhdCI6MTYyMzE5NjM4NSwiZXhwIjoxNjIzMTk2NDQ1fQ.mdsrQlgaWUol02tZO8dXlL3KEwY6kqwj5T7gfRDYVfU',)]
But I need the variable ${RECOVERY_TOKEN} to hold just the token, without the wrapping characters [('',)]:
${RECOVERY_TOKEN} = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpZCI6ImU3ZGM4MmNjLTliMGQtNDc3OC1hMzM0LWEyNjM4MDU1Mzk1MSIsImlhdCI6MTYyMzE5NjM4NSwiZXhwIjoxNjIzMTk2NDQ1fQ.mdsrQlgaWUol02tZO8dXlL3KEwY6kqwj5T7gfRDYVfU
Is there any way I can remove the special characters?
Thanks in advance!!
The returned value is a list of tuples, effectively a two-dimensional matrix (i.e. a table); if you had queried 3 columns, for example, each inner tuple would have 3 members, and if 5 records matched, the list would contain 5 tuples.
Thus, to get the value you are after, pick it out of the matrix by its indexes (which are 0-based, i.e. the first element has index 0):
${RECOVERY_TOKEN}= Set Variable ${RECOVERY_TOKEN[0][0]}
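For context, here is a minimal plain-psycopg2 sketch (connection values are hypothetical) producing the same shape of result the Query keyword returned above: fetchall() gives a list of row tuples, so the bare value sits at [0][0].

import psycopg2

# Hypothetical connection values, standing in for the ${DB_*} variables above.
conn = psycopg2.connect(dbname="mydb", user="user", password="pw",
                        host="localhost", port=5432)
cur = conn.cursor()
cur.execute("""select recovery_token FROM public."system_user"
               where document_number like '57136570514'""")
rows = cur.fetchall()  # e.g. [('eyJhbGciOi...',)]
token = rows[0][0]     # first row, first column: the bare token
print(token)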
I've stored numeric tabular data as relationship properties in a Neo4j database. I would like to recover the data in tabular form.
For instance, one relationship was stored as follows:
MATCH (g:GNE),(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
CREATE UNIQUE (p)-[r:Was_norm
{Method:'NULL', time_t_35: '6.04',time_t9: '6.587',time_t14: '5.708',time_t31: '6.89',time_t224: '4.842'}
]->(g)
I tried a query like this:
MATCH (g:GNE)-[r1:Was_sel]-(e:EXP)-[r2:Was_norm]-(g)
WHERE e.NExp = 'Bos_SM'
RETURN g.etr,r2
but I'd like to recover the data in tabular form, and in the correct order.
Does anyone have any suggestions?
It may not be possible to do what you want with your current data model, given Cypher's current capabilities. Part of the problem is that there is no way to get a property value without hardcoding (in your query) the name of the property. Another part of the problem is that property keys are not necessarily returned in the original order (or in any predictable order).
Instead, you can get around these problems by changing the way you store your tabular data.
For example, suppose you stored the relationship this way (notice that the collections are stored in the desired order):
MATCH (g:GNE),(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
CREATE UNIQUE
(p)-[r:Was_norm {
Method:'NULL',
times: [ 9, 14, 31, 224],
values:[6.587, 5.708, 6.89, 4.842]
}]->(g)
Given the above data model, you can easily get the tabular data back as 2 separate arrays:
MATCH (g:GNE)-[r:Was_norm]->(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
RETURN g.etr, r.times, r.values;
Or, if you wanted to get the data back in a single array:
MATCH (g:GNE)-[r:Was_norm]->(p:EXP)
WHERE g.etr='5313' AND p.NExp='Bos_RM'
RETURN g.etr,
REDUCE(s =[], i IN RANGE(0,LENGTH(r.times)-1) | s + { time: r.times[i], value: r.values[i]}) AS table;
The result of the above query would look like this:
+-------------------------------------------------------------------------------------------------------+
| g.etr | table |
+-------------------------------------------------------------------------------------------------------+
| "5313" | [{time=9, value=6.587},{time=14, value=5.708},{time=31, value=6.89},{time=224, value=4.842}] |
+-------------------------------------------------------------------------------------------------------+
For example, I have the following table named "example":
name | age | address
'abc' | 12 | {'street':'1', 'city':'kl', 'country':'malaysia'}
'cab' | 15 | {'street':'5', 'city':'jakarta', 'country':'indonesia'}
In Spark I can do this:
scala> val test = sc.cassandraTable("test","example")
and this:
scala> test.first.getString
and this:
scala> test.first.getMap[String, String]
which gives me all the fields of the address in the form of a map
Question 1: How do I use "get" to access the "city" information?
Question 2: Is there a way to flatten the entire table?
Question 3: How do I count the number of rows where "city" = "kl"?
Thanks
Question 3: How do we count the number of rows where city == something?
I'll answer 3 first because it may give you an easier way to work with the data. Something like:
sc.cassandraTable[(String,Map[String,String],Int)]("test","example")
.filter( _._2.getOrElse("city","NoCity") == "kl" )
.count
First, I use the type parameter [(String,Map[String,String],Int)] on the cassandraTable call to transform the rows into tuples. This gives me easy access to the Map without any casting. (The order is just how the columns appeared when I made the table in my test environment; you may have to change it.)
Second, I filter on _._2, which is shorthand for the second element of the incoming tuple. getOrElse returns the value for the key "city" if the key exists and "NoCity" otherwise. The final equality check tests whether that city is "kl".
Finally, I call count to find the number of matching rows.
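For comparison, the same filter-then-count logic in plain Python, over rows shaped like those tuples (sample data copied from the example table):

rows = [
    ("abc", {"street": "1", "city": "kl", "country": "malaysia"}, 12),
    ("cab", {"street": "5", "city": "jakarta", "country": "indonesia"}, 15),
]

# r[1] plays the role of _._2; dict.get(key, default) mirrors getOrElse.
count = sum(1 for r in rows if r[1].get("city", "NoCity") == "kl")
print(count)  # 1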
Question 1: How do we access the map?
Once you have a Map, you can call get("key"), getOrElse("key", default), or any of the standard Scala operations to get a value out of it; the Question 3 snippet above already uses getOrElse this way.
Question 2: How do we flatten the entire table?
Depending on what you mean by "flatten", this can be a variety of things. For example, if you want to return the entire table to the driver as an array (not recommended, since in production your RDD is likely to be very big), you can call collect.
If you want to flatten the elements of your map into (key, value) tuples, you can call toSeq on it. Feel free to ask another question if I haven't covered what you mean by "flattening".
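As a plain-Python illustration of that last kind of flattening, using the address map from the example table:

address = {"street": "1", "city": "kl", "country": "malaysia"}
pairs = list(address.items())  # the analogue of Scala's toSeq
print(pairs)  # [('street', '1'), ('city', 'kl'), ('country', 'malaysia')]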
I have two .csv files containing information I would like to link. I read each .csv file into a dictionary, named as follows:
Key -> Value
Dictionary1 = {Complex: Protein}
Dictionary2 = {Protein: Absorbance}
I would like to link the proteins from Dictionary1 to Dictionary2, so that if I look up a complex in Dictionary1, I get back the absorbances of all the proteins associated with it in Dictionary2.
Perhaps I have taken the wrong approach putting both the data sets into dictionaries...
You can use the value resulting from the lookup in the first dictionary as the key for the second. This assumes, of course, that the data is immutable (so it can serve as a dictionary key).
Python:
dict1 = {'Complex': 'Protein'}
dict2 = {'Protein': 'Absorbance'}
dict2[dict1['Complex']] # 'Absorbance'
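If a complex maps to several proteins, as the question suggests, the same chaining works with a list of proteins per complex. A sketch with hypothetical sample values:

# Hypothetical sample data: one complex can contain several proteins.
complex_to_proteins = {"ComplexA": ["Protein1", "Protein2"]}
protein_to_absorbance = {"Protein1": 0.52, "Protein2": 0.61}

def absorbances_for(complex_name):
    # Look up the complex's proteins, then each protein's absorbance.
    return [protein_to_absorbance[p] for p in complex_to_proteins[complex_name]]

print(absorbances_for("ComplexA"))  # [0.52, 0.61]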
How do I eliminate duplicate values in a single file using a Hadoop MapReduce program? The output should contain only unique values. For example, in a file:
Line 1: Hi this is Ashok
Line 2: Basics of hadoop framework
Line 3: Hi this is Ashok
From this example the output should contain only unique values, i.e. lines 1 and 3 are identical and should be printed only once. How do I do it?
This is word count without the count.
The typical way to do this is to group by the entire line, then only output the key in the reducer. Here is some pseudocode:
map(key, value):
    emit (value, null)

reducer(key, iterator):
    emit (key, null)
Notice that I'm just outputting the value here as the key from the mapper. The value can be null (i.e. NullWritable, or you can just use an integer, or whatever).
In the reducer, I don't care how many occurrences I saw; I just output the key.
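As a self-contained sketch, here is the same pipeline simulated in plain Python (map, shuffle/sort, reduce), using the three lines from the question; it is a simulation of the pseudocode, not Hadoop itself:

from itertools import groupby

lines = [
    "Hi this is Ashok",
    "Basics of hadoop framework",
    "Hi this is Ashok",
]

# Map phase: the whole line becomes the key; the value is unused (null).
mapped = [(line, None) for line in lines]

# Shuffle/sort phase: identical keys end up adjacent.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: emit each key exactly once, ignoring how often it appeared.
for key, _group in groupby(mapped, key=lambda kv: kv[0]):
    print(key)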