How do I get the factor-to-index mappings from RFormula/RFormulaModel in Apache Spark?

All,
I have a simple data frame like below
I am using the RFormula API to build a model matrix as below
import org.apache.spark.ml.feature.RFormula
val formula = "dep ~ indep"
val rF = new RFormula().setFormula(formula).setFeaturesCol("features").setLabelCol("label")
val rfModel = rF.fit(df)
where rfModel is of type RFormulaModel. According to the docs here,
the mapping of the categorical variable "indep" should be accessible from this object as pipelineModel, but that seems to be a private member.
My question is: how do I get the labels and corresponding indices from the RFormulaModel object? I know I can use the metadata of the transformed dataframe and do string manipulation, but is there a more straightforward way to do this?
Thanks for any help!

Came up with a hack where I had to write the RFormulaModel to disk and then read the pipelineModel part back in as a PipelineModel. From there I have access to the StringIndexerModel stages, as shown below
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.StringIndexerModel
rfModel.write.overwrite.save("/rfModel")
val pModel = PipelineModel.read.load("/rfModel/pipelineModel")
val strIndexers = pModel.stages.filter(stage => stage.isInstanceOf[StringIndexerModel])
val labelMaps = strIndexers.map(e => { val i = e.asInstanceOf[StringIndexerModel]; (i.getInputCol, i.labels)})

In pyspark you can extract the metadata from the transformed dataframe like this:
for attr in df.schema["X"].metadata["ml_attr"]["attrs"]["numeric"]:
    print(f"idx: {attr['idx']}, name: {attr['name']}")
Hope this is applicable in Scala as well!
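Note that the "numeric" key only lists attributes stored as numeric; a categorical column encoded by RFormula typically shows up under a different group name (for example "binary" or "nominal"). Here is a minimal sketch that walks every attribute group present in the metadata, assuming the dataframe has already been transformed by the fitted model (called transformed_df here) and the RFormula output column is named "features":

# Sketch: list every attribute group stored in the features column metadata.
# "numeric", "binary", and "nominal" are the usual group names, but only the
# groups that actually exist for your data will be present.
attrs = transformed_df.schema["features"].metadata["ml_attr"]["attrs"]
for group_name, group in attrs.items():
    for attr in group:
        print(f"group: {group_name}, idx: {attr['idx']}, name: {attr['name']}")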

BertModel transformers outputs string instead of tensor

I'm following this tutorial that codes a sentiment analysis classifier using BERT with the huggingface library, and I'm seeing some very odd behavior. When trying the BERT model with a sample text, I get a string instead of the hidden state. This is the code I'm using:
import transformers
from transformers import BertModel, BertTokenizer
print(transformers.__version__)
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
PATH_OF_CACHE = "/home/mwon/data-mwon/paperChega/src_classificador/data/hugingface"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME,cache_dir = PATH_OF_CACHE)
sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'
encoding_sample = tokenizer.encode_plus(
    sample_txt,
    max_length=32,
    add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
    return_token_type_ids=False,
    padding=True,
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',  # Return PyTorch tensors
)
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME, cache_dir=PATH_OF_CACHE)
last_hidden_state, pooled_output = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask']
)
print([last_hidden_state, pooled_output])
that outputs:
4.0.0
['last_hidden_state', 'pooler_output']
While the answer from Aakash provides a solution to the problem, it does not explain the issue. Since one of the 3.X releases of the transformers library, the models no longer return tuples but specific output objects:
o = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask']
)
print(type(o))
print(o.keys())
Output:
transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions
odict_keys(['last_hidden_state', 'pooler_output'])
You can return to the previous behavior by adding return_dict=False to get a tuple:
o = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask'],
    return_dict=False
)
print(type(o))
Output:
<class 'tuple'>
I do not recommend that, because with a tuple it is ambiguous which element of the output is which without turning to the documentation, as shown in the example below:
o = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'], return_dict=False, output_attentions=True, output_hidden_states=True)
print('I am a tuple with {} elements. You do not know what each element represents without checking the documentation'.format(len(o)))
o = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'], output_attentions=True, output_hidden_states=True)
print('I am a cool object and you can access my elements with o.last_hidden_state, o["last_hidden_state"] or even o[0]. My keys are: {}'.format(o.keys()))
Output:
I am a tuple with 4 elements. You do not know what each element represents without checking the documentation
I am a cool object and you can access my elements with o.last_hidden_state, o["last_hidden_state"] or even o[0]. My keys are: odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])
I faced the same issue while learning how to implement BERT. I noticed that using
last_hidden_state, pooled_output = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'])
is the issue. Use:
outputs = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'])
and extract the last hidden state using
outputs[0]
You can refer to the documentation here, which tells you what is returned by BertModel.
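Putting that together, here is a minimal sketch of the fixed call (same tokenizer, encoding_sample, and bert_model objects as defined in the question):

# Sketch: read the fields from the output object instead of unpacking a tuple.
outputs = bert_model(
    encoding_sample['input_ids'],
    encoding_sample['attention_mask']
)
last_hidden_state = outputs.last_hidden_state  # same tensor as outputs[0]
pooled_output = outputs.pooler_output
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)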

Extracting dict keys from values

I am still learning Python and I'm having some trouble extracting data from a dict. I need to create a loop which checks each value and extracts the corresponding keys. For this code I need to find the nice students. I am stuck at line 3, the one marked #blank.
How do I go about this?
Thanks in advance
class = {"James":"naughty", "Lisa":"nice", "Bryan":"nice"}
for student in class:
    if #blank:
        print("Hello, "+student+" students!")
    else:
        print("odd")
Use the dictionary methods keys(), values(), and items():
def get_students_by_criteria(student_class, criteria):
    students = []
    for candidate, value in student_class.items():
        if value == criteria:
            students.append(candidate)
    return students

my_class = {"James":"naughty", "Lisa":"nice", "Bryan":"nice"}
print(get_students_by_criteria(my_class, "nice"))
Be careful with the word "class": it is a reserved keyword in Python, used for object-oriented programming, so it cannot be used as a variable name.
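For the specific question asked, the same result can also be obtained in one line; here is a small sketch using a list comprehension over items():

# Collect the names whose value is "nice" in a single expression.
my_class = {"James": "naughty", "Lisa": "nice", "Bryan": "nice"}
nice_students = [name for name, status in my_class.items() if status == "nice"]
print(nice_students)  # ['Lisa', 'Bryan']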

Merging Value of one dictionary as key of another

I have 2 Dictionaries:
StatePops={'AL':4887871, 'AK':737438, 'AZ':7278717, 'AR':3013825}
StateNames={'AL':'Alabama', 'AK':'Alaska', 'AZ':'Arizona', 'AR':'Arkansas'}
I am trying to merge them so that each value from StateNames becomes the key for the corresponding StatePops value.
Ex.
{'Alabama': 4887871, 'Alaska': 737438, ...
I also have to display the names of the states with a population over 4 million.
Any help is appreciated!!!
You have not specified in what programming language you want this problem to be solved.
Nonetheless, here is a solution in Python.
state_pops = {
    'AL': 4887871,
    'AK': 737438,
    'AZ': 7278717,
    'AR': 3013825
}
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas'
}
states = dict([(state_names[k], state_pops[k]) for k in state_pops])
final = {k: v for k, v in states.items() if v > 4000000}
print(states)
print(final)
First, you can merge the two dictionaries with the built-in dict function, storing the result in the states variable. Here, k iterates over the keys of state_pops and is used to index both state_names and state_pops.
Then, store the filtered dictionary in final, where states.items() is used to access the keys and values of states and keep only the entries with a population over 4 million.
There may be simpler solutions, but this is as far as I can optimize the problem.
Hope this helps.
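The same merge and filter can also be written as dict comprehensions, which read a little more directly; here is a sketch reusing the variable names from above:

# Merge: map each full state name to its population.
states = {state_names[k]: pop for k, pop in state_pops.items()}

# Filter: keep only states with a population over 4 million.
over_4m = {name: pop for name, pop in states.items() if pop > 4000000}
print(over_4m)  # {'Alabama': 4887871, 'Arizona': 7278717}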
Dictionary Keys cannot be changed in Python. You need to either add the modified key-value pair to the dictionary and then delete the old key, or you can create a new dictionary. I'd opt for the second option, i.e., creating a new dictionary.
myDict = {}
for i in StatePops:
    myDict.update({StateNames[i]: StatePops[i]})
This outputs myDict as
{'Alabama': 4887871, 'Alaska': 737438, 'Arizona': 7278717, 'Arkansas': 3013825}

How to modify the avro key/value schema in a RDD map transformation

I'm trying to migrate some Hadoop MapReduce code to Spark and I have doubts about how to manage map and reduce transformations when the schema of either the key or the value changes from input to output.
I have avro files with Indicator records that I want to process somehow. I already have this code that works:
val myAvroJob = new Job()
myAvroJob.setInputFormatClass(classOf[AvroKeyInputFormat[Indicator]])
myAvroJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Indicator]])
myAvroJob.setOutputValueClass(classOf[NullWritable])
AvroJob.setInputValueSchema(myAvroJob, Schema.create(Schema.Type.NULL))
AvroJob.setInputKeySchema(myAvroJob, Indicator.SCHEMA$)
AvroJob.setOutputKeySchema(myAvroJob, Indicator.SCHEMA$)
val indicatorsRdd = sc.newAPIHadoopRDD(myAvroJob.getConfiguration,
  classOf[AvroKeyInputFormat[Indicator]],
  classOf[AvroKey[Indicator]],
  classOf[NullWritable])
val myRecordOnlyRdd = indicatorsRdd.map(x => (doSomethingWith(x._1), NullWritable.get))
val indicatorPairRDD = new PairRDDFunctions(myRecordOnlyRdd)
indicatorPairRDD.saveAsNewAPIHadoopDataset(myAvroJob.getConfiguration)
But this code works because the schema of the input and output keys does not change; it is always Indicator. In Hadoop MapReduce you can define map or reduce functions and modify the schema from input to output. In fact, I have map functions which process every Indicator record and generate a new SoporteCartera record. How can I do this in Spark? Is it possible from the same RDD, or do I have to define two different RDDs and pass from one to the other somehow?
Thanks for your help.
To answer my own question... the problem was that you cannot change the RDD type; you must define a different RDD, so I solved it with the code below:
val myAvroJob = new Job()
myAvroJob.setInputFormatClass(classOf[AvroKeyInputFormat[SoporteCartera]])
myAvroJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Indicator]])
myAvroJob.setOutputValueClass(classOf[NullWritable])
AvroJob.setInputValueSchema(myAvroJob, Schema.create(Schema.Type.NULL))
AvroJob.setInputKeySchema(myAvroJob, SoporteCartera.SCHEMA$)
AvroJob.setOutputKeySchema(myAvroJob, Indicator.SCHEMA$)
val soporteCarteraRdd = sc.newAPIHadoopRDD(myAvroJob.getConfiguration,
  classOf[AvroKeyInputFormat[SoporteCartera]],
  classOf[AvroKey[SoporteCartera]],
  classOf[NullWritable])
val indicatorsRdd = soporteCarteraRdd.map(x => (fromSoporteCarteraToIndicator(x._1), NullWritable.get))
val indicatorPairRDD = new PairRDDFunctions(indicatorsRdd)
indicatorPairRDD.saveAsNewAPIHadoopDataset(myAvroJob.getConfiguration)

Transform bag of key-value tuples to map in Apache Pig

I am new to Pig and I want to convert a bag of tuples to a map, with a specific value in each tuple as the key. Basically I want to change:
{(id1, value1),(id2, value2), ...} into [id1#value1, id2#value2]
I've been looking around online for a while, but I can't seem to find a solution. I've tried:
bigQMap = FOREACH bigQFields GENERATE TOMAP(queryId, queryStart);
but I end up with a bag of maps (e.g. {[id1#value1], [id2#value2], ...}), which is not what I want. How can I build a map out of a bag of key-value tuples?
Below is the specific script I'm trying to run, in case it's relevant
rawlines = LOAD '...' USING PigStorage('`');
bigQFields = FOREACH bigQLogs GENERATE GFV(*,'queryId')
                 as queryId, GFV(*, 'queryStart')
                 as queryStart;
bigQMap = ?? how to make a map with queryId as key and queryStart as value ?? ;
TOMAP takes a series of pairs and converts them into a map, so it is meant to be used like:
-- Schema: A:{foo:chararray, bar:int, bing:chararray, bang:int}
-- Data: (John, 27, Joe, 30)
B = FOREACH A GENERATE TOMAP(foo, bar, bing, bang) AS m ;
-- Schema: B:{m: map[]}
-- Data: (John#27,Joe#30)
So as you can see, the syntax does not support converting a bag to a map. As far as I know there is no way to convert a bag in the format you have to a map in pure Pig. However, you can definitely write a Java UDF to do this.
NOTE: I'm not too experienced with Java, so this UDF can easily be improved on (adding exception handling, deciding what happens if a key is added twice, etc.). However, it does accomplish what you need it to.
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import java.util.Map;
import java.util.HashMap;
import java.util.Iterator;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataBag;

public class ConvertToMap extends EvalFunc<Map>
{
    public Map exec(Tuple input) throws IOException {
        DataBag values = (DataBag)input.get(0);
        Map<Object, Object> m = new HashMap<Object, Object>();
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            m.put(t.get(0), t.get(1));
        }
        return m;
    }
}
Once you compile the class into a jar, it can be used like:
REGISTER myudfs.jar ;
-- A is loading some sample data I made
A = LOAD 'foo.in' AS (foo:{T:(id:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myudfs.ConvertToMap(foo) AS bar;
Contents of foo.in:
{(open,apache),(apache,hadoop)}
{(foo,bar),(bar,foo),(open,what)}
Output from B:
([open#apache,apache#hadoop])
([bar#foo,open#what,foo#bar])
Another approach is to use python to create the UDF:
myudfs.py
#!/usr/bin/python
@outputSchema("foo:map[]")
def BagtoMap(bag):
    d = {}
    for key, value in bag:
        d[key] = value
    return d
Which is used like this:
Register 'myudfs.py' using jython as myfuncs;
-- A is still just loading some of my test data
A = LOAD 'foo.in' AS (foo:{T:(key:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myfuncs.BagtoMap(foo) ;
And produces the same output as the Java UDF.
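To see what the UDF computes, the same logic can be run on plain Python data outside Pig, just for illustration (a sketch using the first line of foo.in as the sample bag):

# Plain-Python illustration of the BagtoMap logic on sample key/value tuples.
def bag_to_map(bag):
    d = {}
    for key, value in bag:
        d[key] = value
    return d

sample_bag = [("open", "apache"), ("apache", "hadoop")]
print(bag_to_map(sample_bag))  # {'open': 'apache', 'apache': 'hadoop'}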
BONUS:
Since I don't like maps very much, here is a link explaining how the functionality of a map can be replicated with just key value pairs. Since your key value pairs are in a bag, you'll need to do the map-like operations in a nested FOREACH:
-- A is a schema that contains kv_pairs, a bag in the form {(id, value)}
B = FOREACH A {
    temp = FOREACH kv_pairs GENERATE (key=='foo'?value:NULL) ;
    -- Output is like: ({(),(thevalue),(),()})
    -- MAX will pull the maximum value from the filtered bag, which is
    -- value (the chararray) if the key matched. Otherwise it will return NULL.
    GENERATE MAX(temp) as kv_pairs_filtered ;
}
I ran into the same situation, so I submitted a patch that just got accepted: https://issues.apache.org/jira/browse/PIG-4638
This means that what you want is part of core Pig starting with version 0.16.
