How to modify the Avro key/value schema in an RDD map transformation - dictionary

I'm trying to migrate some Hadoop MapReduce code to Spark and I have doubts about how to manage map and reduce transformations when the schema of either the key or the value changes from input to output.
I have Avro files with Indicator records that I want to process somehow. I already have this code that works:
val myAvroJob = new Job()
myAvroJob.setInputFormatClass(classOf[AvroKeyInputFormat[Indicator]])
myAvroJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Indicator]])
myAvroJob.setOutputValueClass(classOf[NullWritable])
AvroJob.setInputValueSchema(myAvroJob, Schema.create(Schema.Type.NULL))
AvroJob.setInputKeySchema(myAvroJob, Indicator.SCHEMA$)
AvroJob.setOutputKeySchema(myAvroJob, Indicator.SCHEMA$)
val indicatorsRdd = sc.newAPIHadoopRDD(myAvroJob.getConfiguration,
  classOf[AvroKeyInputFormat[Indicator]],
  classOf[AvroKey[Indicator]],
  classOf[NullWritable])
val myRecordOnlyRdd = indicatorsRdd.map(x => (doSomethingWith(x._1), NullWritable.get))
val indicatorPairRDD = new PairRDDFunctions(myRecordOnlyRdd)
indicatorPairRDD.saveAsNewAPIHadoopDataset(myAvroJob.getConfiguration)
But this code works because the schema of the input and output keys does not change; it is always Indicator. In Hadoop MapReduce you can define map or reduce functions that modify the schema from input to output. In fact, I have map functions that process every Indicator record and generate a new SoporteCartera record. How can I do this in Spark? Is it possible from the same RDD, or do I have to define two different RDDs and pass from one to the other somehow?
Thanks for your help.

To answer my own question: the problem was that you cannot change the type of an existing RDD; you must define a new RDD, so I solved it with the following code:
val myAvroJob = new Job()
myAvroJob.setInputFormatClass(classOf[AvroKeyInputFormat[SoporteCartera]])
myAvroJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Indicator]])
myAvroJob.setOutputValueClass(classOf[NullWritable])
AvroJob.setInputValueSchema(myAvroJob, Schema.create(Schema.Type.NULL))
AvroJob.setInputKeySchema(myAvroJob, SoporteCartera.SCHEMA$)
AvroJob.setOutputKeySchema(myAvroJob, Indicator.SCHEMA$)
val soporteCarteraRdd = sc.newAPIHadoopRDD(myAvroJob.getConfiguration,
  classOf[AvroKeyInputFormat[SoporteCartera]],
  classOf[AvroKey[SoporteCartera]],
  classOf[NullWritable])
val indicatorsRdd = soporteCarteraRdd.map(x => (fromSoporteCarteraToIndicator(x._1), NullWritable.get))
val indicatorPairRDD = new PairRDDFunctions(indicatorsRdd)
indicatorPairRDD.saveAsNewAPIHadoopDataset(myAvroJob.getConfiguration)
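For reference, here is a minimal sketch of what the conversion function used in the map above could look like; the field mapping is an assumption, since the real Indicator and SoporteCartera schemas are not shown here:
import org.apache.avro.mapred.AvroKey

// Hypothetical sketch: unwrap the SoporteCartera record, build a new Indicator
// from it, and re-wrap it in an AvroKey so the output matches the job's output key schema.
def fromSoporteCarteraToIndicator(key: AvroKey[SoporteCartera]): AvroKey[Indicator] = {
  val source: SoporteCartera = key.datum()
  val target: Indicator = new Indicator()
  // copy or derive whatever fields Indicator needs from `source` here
  new AvroKey[Indicator](target)
}
The important point is that the map transformation itself is free to return a pair of any types; the new RDD's element type is simply inferred from what the function returns.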

Related

Merging Value of one dictionary as key of another

I have 2 Dictionaries:
StatePops={'AL':4887871, 'AK':737438, 'AZ':7278717, 'AR':3013825}
StateNames={'AL':'Alabama', 'AK':'Alaska', 'AZ':'Arizona', 'AR':'Arkansas'}
I am trying to merge them so the value from StateNames becomes the key for StatePops.
Ex.
{'Alabama': 4887871, 'Alaska': 737438, ...
I also have to display the names of states with a population over 4 million.
Any help is appreciated!!!
You have not specified in what programming language you want this problem to be solved.
Nonetheless, here is a solution in Python.
state_pops = {
    'AL': 4887871,
    'AK': 737438,
    'AZ': 7278717,
    'AR': 3013825
}
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas'
}
states = dict([(state_names[k], state_pops[k]) for k in state_pops])
final = {k: v for k, v in states.items() if v > 4000000}
print(states)
print(final)
First, you can merge the two dictionaries with the built-in dict Python function and store the result in the states variable, as shown. Here, k iterates over the keys of state_pops and is used to index both state_names and state_pops.
Then, store the filtered dictionary in final, where states.items() is used to iterate over the key-value pairs in states and keep only those with a value over 4,000,000.
There may be simpler solutions, but this is as far as I can optimize the problem.
Hope this helps.
Dictionary keys cannot be changed in Python. You either add a new key-value pair with the modified key and then delete the old key, or you create a new dictionary. I'd opt for the second option, i.e., creating a new dictionary.
myDict = {}
for i in StatePops:
    myDict.update({StateNames[i]: StatePops[i]})
This outputs myDict as
{'Alabama': 4887871, 'Alaska': 737438, 'Arizona': 7278717, 'Arkansas': 3013825}

How to write Key Value Pairs in Map in a Loop without overwriting them?

The following is my problem:
I do a select on my database and want to write the values of every record I get into a map. When I do this, I only have the values of one record in my map, because the put() function overwrites the entries every time the loop starts again. Since I have to transfer the key:value pairs via JSON into my JavaScript and write something into a field for every key:value pair, an ArrayList is not an option.
I've already tried to convert an ArrayList, which contains the Map, to a Map, or to a String and then to a Map, but I failed.
EDIT:
Here's my code:
def valueArray = new ArrayList();
def recordValues = [:]
while (rs.next())
{
    fremdlKorr = rs.getFloatValue(1)
    leistungKorr = rs.getFloatValue(2)
    materialKorr = rs.getFloatValue(3)
    strid = rs.getStringValue(4)
    recordValues.put("strid", strid);
    recordValues.put("material", materialKorr);
    recordValues.put("fremdl", fremdlKorr);
    recordValues.put("leistung", leistungKorr);
    valueArray.add(recordValues);
}
The ArrayList was just a test; I don't want an ArrayList, I need a Map.
The code as written will give you a list of maps, but the maps will all contain the values of the last row. The reason is the def recordValues = [:] that sits outside of the while loop, so you basically add the same map to the list every time and its values get overwritten on each iteration.
While moving that line inside the loop would fix the problem, I'd use Sql instead. It all boils down to:
def valueArray = sql.rows("select strid, material, fremdl, leistung from ...")

How do I get a value from a dictionary when the key is a value in another dictionary in Lua?

I am writing some code where I have multiple dictionaries for my data. The reason is that I have multiple core objects and multiple smaller assets, and the user must be able to choose a smaller asset and have some function off in the distance run the code with the parent noted.
An example of one of the dictionaries: (I'm working in ROBLOX Lua 5.1 but the syntax for the problem should be identical)
local data = {
    character = workspace.Stores.NPCs.Thom,
    name = "Thom",
    npcId = 9,
    npcDialog = workspace.Stores.NPCs.Thom.Dialog
}
local items = {
    item1 = {
        model = workspace.Stores.Items.Item1.Main,
        npcName = "Thom",
    }
}
This is my function:
local function function1(item)
    if not items[item] and data[items[item[npcName]]] then return false end
end
As you can see, I try to index the dictionary using a key from another dictionary. Usually this is no problem.
local thisIsAVariable = item[item1[npcName]]
but the method I use above tries to index the data dictionary for data that is in the items dictionary.
Without a ton of local variables and clutter, is there a way to do this? I had an idea to wrap the conflicting dictionary reference in a tostring() function to separate them - would that work?
Thank you.
As I see it, your issue is that:
data[items[item[npcName]]]
is looking for data["Thom"], but you do not have such a key in the data table. You have a "name" key that has a "Thom" value. You could reverse the name key and value in the data table, i.e., use ["Thom"] = name instead of name = "Thom".

How do I get the factor-to-index mappings from RFormula/RFormulaModel in Apache Spark?

All,
I have a simple data frame like below
I am using the RFormula API to make a model matrix as below:
val formula = "dep ~ indep"
val rF = new RFormula().setFormula(formula).setFeaturesCol("features").setLabelCol("label")
val rfModel = rF.fit(df)
where rfModel is of type RFormulaModel. According to the docs here,
the mapping of the categorical variable "indep" should be accessible from this object as pipelineModel, but this seems to be a private member.
My question is: how do I get the labels and corresponding indices from the RFormulaModel object? I know I can use the metadata of the transformed dataframe and do string manipulation, but is there a straightforward way to do this?
Thanks for any help!
I came up with a hack where I had to write the RFormulaModel to disk and then read the pipelineModel part back in as a PipelineModel. From there I have access to the StringIndexerModel stages, as shown here:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.StringIndexerModel
rfModel.write.overwrite.save("/rfModel")
val pModel = PipelineModel.read.load("/rfModel/pipelineModel")
val strIndexers = pModel.stages.filter(stage => stage.isInstanceOf[StringIndexerModel])
val labelMaps = strIndexers.map { e =>
  val i = e.asInstanceOf[StringIndexerModel]
  (i.getInputCol, i.labels)
}
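As a small usage sketch (an assumption, not part of the original answer): each entry in labelMaps pairs an input column with its index-ordered labels, so a label's position in the array is the index the StringIndexerModel assigned.
labelMaps.foreach { case (col, labels) =>
  labels.zipWithIndex.foreach { case (label, idx) =>
    println(s"$col: $label -> $idx")
  }
}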
In PySpark you can extract the metadata from the target dataframe like this:
for attr in df.schema["X"].metadata["ml_attr"]["attrs"]["numeric"]:
    print(f"idx: {attr['idx']}, name: {attr['name']}")
Hope this is applicable in Scala as well!
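For what it's worth, here is a hedged Scala sketch of the same lookup, assuming df is the transformed DataFrame and that column "X" carries the "ml_attr" metadata as in the PySpark snippet (categorical columns typically appear under "nominal" rather than "numeric"):
import org.apache.spark.sql.types.Metadata

// Navigate the same ml_attr/attrs metadata from the column's schema.
val attrs: Metadata = df.schema("X").metadata
  .getMetadata("ml_attr")
  .getMetadata("attrs")

// Print idx/name pairs for whichever attribute groups are present.
Seq("nominal", "numeric").filter(attrs.contains).foreach { kind =>
  attrs.getMetadataArray(kind).foreach { attr =>
    println(s"idx: ${attr.getLong("idx")}, name: ${attr.getString("name")}")
  }
}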

Transform bag of key-value tuples to map in Apache Pig

I am new to Pig and I want to convert a bag of tuples to a map, using a specific value in each tuple as the key. Basically I want to change:
{(id1, value1),(id2, value2), ...} into [id1#value1, id2#value2]
I've been looking around online for a while, but I can't seem to find a solution. I've tried:
bigQMap = FOREACH bigQFields GENERATE TOMAP(queryId, queryStart);
but I end up with a bag of maps (e.g. {[id1#value1], [id2#value2], ...}), which is not what I want. How can I build a map out of a bag of key-value tuples?
Below is the specific script I'm trying to run, in case it's relevant
rawlines = LOAD '...' USING PigStorage('`');
bigQFields = FOREACH bigQLogs GENERATE GFV(*,'queryId')
as queryId, GFV(*, 'queryStart')
as queryStart;
bigQMap = ?? how to make a map with queryId as key and queryStart as value ?? ;
TOMAP takes a series of pairs and converts them into a map, so it is meant to be used like:
-- Schema: A:{foo:chararray, bar:int, bing:chararray, bang:int}
-- Data: (John, 27, Joe, 30)
B = FOREACH A GENERATE TOMAP(foo, bar, bing, bang) AS m ;
-- Schema: B:{m: map[]}
-- Data: (John#27,Joe#30)
So as you can see, the syntax does not support converting a bag to a map. As far as I know there is no way to convert a bag in the format you have to a map in pure Pig. However, you can definitely write a Java UDF to do this.
NOTE: I'm not too experienced with Java, so this UDF can easily be improved on (adding exception handling, deciding what happens if a key is added twice, etc.). However, it does accomplish what you need it to.
package myudfs;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class ConvertToMap extends EvalFunc<Map>
{
    public Map exec(Tuple input) throws IOException {
        DataBag values = (DataBag) input.get(0);
        Map<Object, Object> m = new HashMap<Object, Object>();
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            m.put(t.get(0), t.get(1));
        }
        return m;
    }
}
Once you compile the script into a jar, it can be used like:
REGISTER myudfs.jar ;
-- A is loading some sample data I made
A = LOAD 'foo.in' AS (foo:{T:(id:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myudfs.ConvertToMap(foo) AS bar;
Contents of foo.in:
{(open,apache),(apache,hadoop)}
{(foo,bar),(bar,foo),(open,what)}
Output from B:
([open#apache,apache#hadoop])
([bar#foo,open#what,foo#bar])
Another approach is to use Python to create the UDF:
myudfs.py
#!/usr/bin/python

@outputSchema("foo:map[]")
def BagtoMap(bag):
    d = {}
    for key, value in bag:
        d[key] = value
    return d
Which is used like this:
Register 'myudfs.py' using jython as myfuncs;
-- A is still just loading some of my test data
A = LOAD 'foo.in' AS (foo:{T:(key:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myfuncs.BagtoMap(foo) ;
And produces the same output as the Java UDF.
BONUS:
Since I don't like maps very much, here is a link explaining how the functionality of a map can be replicated with just key value pairs. Since your key value pairs are in a bag, you'll need to do the map-like operations in a nested FOREACH:
-- A is a schema that contains kv_pairs, a bag in the form {(id, value)}
B = FOREACH A {
    temp = FOREACH kv_pairs GENERATE (key=='foo'?value:NULL) ;
    -- Output is like: ({(),(thevalue),(),()})
    -- MAX will pull the maximum value from the filtered bag, which is
    -- value (the chararray) if the key matched. Otherwise it will return NULL.
    GENERATE MAX(temp) as kv_pairs_filtered ;
}
I ran into the same situation, so I submitted a patch that just got accepted: https://issues.apache.org/jira/browse/PIG-4638
This means that what you wanted is part of core Pig starting with Pig 0.16.
