Transform bag of key-value tuples to map in Apache Pig - dictionary

I am new to Pig and I want to convert a bag of tuples to a map with specific value in each tuple as key. Basically I want to change:
{(id1, value1),(id2, value2), ...} into [id1#value1, id2#value2]
I've been looking around online for a while, but I can't seem to find a solution. I've tried:
bigQMap = FOREACH bigQFields GENERATE TOMAP(queryId, queryStart);
but I end up with a bag of maps (e.g. {[id1#value1], [id2#value2], ...}), which is not what I want. How can I build up a map out of a bag of key-value tuple?
Below is the specific script I'm trying to run, in case it's relevant
rawlines = LOAD '...' USING PigStorage('`');
bigQFields = FOREACH bigQLogs GENERATE GFV(*,'queryId')
as queryId, GFV(*, 'queryStart')
as queryStart;
bigQMap = ?? how to make a map with queryId as key and queryStart as value ?? ;

TOMAP takes a series of pairs and converts them into the map, so it is meant to be used like:
-- Schema: A:{foo:chararray, bar:int, bing:chararray, bang:int}
-- Data: (John, 27, Joe, 30)
B = FOREACH A GENERATE TOMAP(foo, bar, bing, bang) AS m ;
-- Schema: B:{m: map[]}
-- Data: (John#27,Joe#30)
So as you can see the syntax does not support converting a bag to a map. As far as I know there is no way to convert a bag in the format you have to map in pure pig. However, you can definitively write a java UDF to do this.
NOTE: I'm not too experienced with java, so this UDF can easily be improved on (adding exception handling, what happens if a key added twice etc.). However, it does accomplish what you need it to.
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import java.util.Map;
import java.util.HashMap;
import java.util.Iterator;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataBag;
public class ConvertToMap extends EvalFunc<Map>
{
public Map exec(Tuple input) throws IOException {
DataBag values = (DataBag)input.get(0);
Map<Object, Object> m = new HashMap<Object, Object>();
for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
Tuple t = it.next();
m.put(t.get(0), t.get(1));
}
return m;
}
}
Once you compile the script into a jar, it can be used like:
REGISTER myudfs.jar ;
-- A is loading some sample data I made
A = LOAD 'foo.in' AS (foo:{T:(id:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myudfs.ConvertToMap(foo) AS bar;
Contents of foo.in:
{(open,apache),(apache,hadoop)}
{(foo,bar),(bar,foo),(open,what)}
Output from B:
([open#apache,apache#hadoop])
([bar#foo,open#what,foo#bar])
Another approach is to use python to create the UDF:
myudfs.py
#!/usr/bin/python
#outputSchema("foo:map[]")
def BagtoMap(bag):
d = {}
for key, value in bag:
d[key] = value
return d
Which is used like this:
Register 'myudfs.py' using jython as myfuncs;
-- A is still just loading some of my test data
A = LOAD 'foo.in' AS (foo:{T:(key:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myfuncs.BagtoMap(foo) ;
And produces the same output as the Java UDF.
BONUS:
Since I don't like maps very much, here is a link explaining how the functionality of a map can be replicated with just key value pairs. Since your key value pairs are in a bag, you'll need to do the map-like operations in a nested FOREACH:
-- A is a schema that contains kv_pairs, a bag in the form {(id, value)}
B = FOREACH A {
temp = FOREACH kv_pairs GENERATE (key=='foo'?value:NULL) ;
-- Output is like: ({(),(thevalue),(),()})
-- MAX will pull the maximum value from the filtered bag, which is
-- value (the chararray) if the key matched. Otherwise it will return NULL.
GENERATE MAX(temp) as kv_pairs_filtered ;
}

I ran into the same situation so I submitted a patch that just got accepted: https://issues.apache.org/jira/browse/PIG-4638
This means that what you wanted is a core part starting with pig 0.16.

Related

Merging Value of one dictionary as key of another

I have 2 Dictionaries:
StatePops={'AL':4887871, 'AK':737438, 'AZ':7278717, 'AR':3013825}
StateNames={'AL':'Alabama', 'AK':'Alaska', 'AZ':'Arizona', 'AR':'Arkansas'}
I am trying to merge so the Value of StateNames is the Key for StatePops.
Ex.
{'Alabama': 4887871, 'Alaska': 737438, ...
I also have to display the name of states w/ population over 4million.
Any help is appreciated!!!
You have not specified in what programming language you want this problem to be solved.
Nonetheless, here is a solution in Python.
state_pops = {
'AL': 4887871,
'AK': 737438,
'AZ':7278717,
'AR':3013825
}
state_names = {
'AL':'Alabama',
'AK':'Alaska',
'AZ':'Arizona',
'AR':'Arkansas'
}
states = dict([([state_names[k],state_pops[k]]) for k in state_pops])
final = {k:v for k, v in states.items() if v > 4000000}
print(states)
print(final)
First, you can merge two dictionaries with the predefined dict python function in the states variable as such. Here, k is an iterator and it is used as index for state_names and state_pops.
Then, store the filtered dictionary in final where the states.items() is used to access the keys and values in states and type-cast it as a string with the str function.
There may be more simpler solutions but this is as far as I can optimize the problem.
Hope this helps.
Dictionary Keys cannot be changed in Python. You need to either add the modified key-value pair to the dictionary and then delete the old key, or you can create a new dictionary. I'd opt for the second option, i.e., creating a new dictionary.
myDict = {}
for i in StatePops:
myDict.update({StateNames[i] : StatePops[i]})
This outputs myDict as
{'Alabama': 4887871, 'Alaska': 737438, 'Arizona': 7278717, 'Arkansas': 3013825}

Update a field within a NamedTuple (from typing)

I'm looking for a more pythonic way to update a field within a NamedTuple (from typing).
I get field name and value during runtime from a textfile und therefore used exec, but I believe, there must be a better way:
#!/usr/bin/env python3.6
# -*- coding: utf-8 -*-
from typing import NamedTuple
class Template (NamedTuple):
number : int = 0
name : str = "^NAME^"
oneInstance = Template()
print(oneInstance)
# If I know the variable name during development, I can do this:
oneInstance = oneInstance._replace(number = 77)
# I get this from a file during runtime:
para = {'name' : 'Jones'}
mykey = 'name'
# Therefore, I used exec:
ExpToEval = "oneInstance = oneInstance._replace(" + mykey + " = para[mykey])"
exec(ExpToEval) # How can I do this in a more pythonic (and secure) way?
print(oneInstance)
I guess, from 3.7 on, I could solve this issue with dataclasses, but I need it for 3.6
Using _replace on namedtuples can not be made "pythonic" whatsoever. Namedtuples are meant to be immutable. If you use a namedtuple other developers will expect that you do not intent to alter your data.
A pythonic approach is indeed the dataclass. You can use dataclasses in Python3.6 as well. Just use the dataclasses backport from PyPi.
Then the whole thing gets really readable and you can use getattrand setattr to address properties by name easily:
from dataclasses import dataclass
#dataclass
class Template:
number: int = 0
name: str = "^Name^"
t = Template()
# use setattr and getattr to access a property by a runtime-defined name
setattr(t, "name", "foo")
print(getattr(t, "name"))
This will result in
foo

How do I get a value from a dictionary when the key is a value in another dictionary in Lua?

I am writing some code where I have multiple dictionaries for my data. The reason being, I have multiple core objects and multiple smaller assets and the user must be able to choose a smaller asset and have some function off in the distance run the code with the parent noted.
An example of one of the dictionaries: (I'm working in ROBLOX Lua 5.1 but the syntax for the problem should be identical)
local data = {
character = workspace.Stores.NPCs.Thom,
name = "Thom", npcId = 9,
npcDialog = workspace.Stores.NPCs.Thom.Dialog
}
local items = {
item1 = {
model = workspace.Stores.Items.Item1.Main,
npcName = "Thom",
}
}
This is my function:
local function function1(item)
if not items[item] and data[items[item[npcName]]] then return false end
end
As you can see, I try to index the dictionary using a key from another dictionary. Usually this is no problem.
local thisIsAVariable = item[item1[npcName]]
but the method I use above tries to index the data dictionary for data that is in the items dictionary.
Without a ton of local variables and clutter, is there a way to do this? I had an idea to wrap the conflicting dictionary reference in a tostring() function to separate them - would that work?
Thank you.
As I see it, your issue is that:
data[items[item[npcName]]]
is looking for data[“Thom”] ... but you do not have such a key in the data table. You have a “name” key that has a “Thom” value. You could reverse the name key and value in the data table. “Thom” = name

How to modify the avro key/value schema in a RDD map transformation

I'm trying to migrate some Hadoop Map Reduce code to Spark and I have doubts about how to manage map and reduce transformations when the schema of either the key or value change from input to output.
I have avro files with Indicator records that I want to process somehow. I already have this code that works:
val myAvroJob = new Job()
myAvroJob.setInputFormatClass(classOf[AvroKeyInputFormat[Indicator]])
myAvroJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Indicator]])
myAvroJob.setOutputValueClass(classOf[NullWritable])
AvroJob.setInputValueSchema(myAvroJob, Schema.create(Schema.Type.NULL))
AvroJob.setInputKeySchema(myAvroJob, Indicator.SCHEMA$)
AvroJob.setOutputKeySchema(myAvroJob, Indicator.SCHEMA$)
val indicatorsRdd = sc.newAPIHadoopRDD(myAvroJob.getConfiguration,
classOf[AvroKeyInputFormat[Indicator]],
classOf[AvroKey[Indicator]],
classOf[NullWritable])
val myRecordOnlyRdd = indicatorsRdd.map(x => (doSomethingWith(x._1), NullWritable.get)
val indicatorPairRDD = new PairRDDFunctions(myRecordOnlyRdd)
indicatorPairRDD.saveAsNewAPIHadoopDataset(myAvroJob.getConfiguration)
But this code works since the schema of the input and ouput keys does not change, is always Indicator. In hadoop Map Reduce you can define a map or reduce functions and modify the schema from input to output. In fact, I have map functions which process every Indicator record and generates a new record SoporteCartera. How can I do this in spark? It is possible from the same RDD or I have to define 2 different RDDs and pass from one to another somehow?
Thanks for your help.
To answer my own question... the problem was that you cannot change the RDD type, you must define a different RDD, so I solved it with the above code:
val myAvroJob = new Job()
myAvroJob.setInputFormatClass(classOf[AvroKeyInputFormat[SoporteCartera]])
myAvroJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Indicator]])
myAvroJob.setOutputValueClass(classOf[NullWritable])
AvroJob.setInputValueSchema(myAvroJob, Schema.create(Schema.Type.NULL))
AvroJob.setInputKeySchema(myAvroJob, SoporteCartera.SCHEMA$)
AvroJob.setOutputKeySchema(myAvroJob, Indicator.SCHEMA$)
val soporteCarteraRdd = sc.newAPIHadoopRDD(myAvroJob.getConfiguration,
classOf[AvroKeyInputFormat[SoporteCartera]],
classOf[AvroKey[SoporteCartera]],
classOf[NullWritable])
val indicatorsRdd = soporteCarteraRdd.map(x => (fromSoporteCarteraToIndicator(x._1), NullWritable.get))
val indicatorPairRDD = new PairRDDFunctions(indicatorsRdd)
indicatorPairRDD.saveAsNewAPIHadoopDataset(myAvroJob.getConfiguration)

Filtering tab completion in input task implementation

I'm currently implementing a SBT plugin for Gatling.
One of its features will be to open the last generated report in a new browser tab from SBT.
As each run can have a different "simulation ID" (basically a simple string), I'd like to offer tab completion on simulation ids.
An example :
Running the Gatling SBT plugin will produce several folders (named from simulationId + date of report generaation) in target/gatling, for example mysim-20140204234534, myothersim-20140203124534 and yetanothersim-20140204234534.
Let's call the task lastReport.
If someone start typing lastReport my, I'd like to filter out tab-completion to only suggest mysim and myothersim.
Getting the simulation ID is a breeze, but how can help the parser and filter out suggestions so that it only suggest an existing simulation ID ?
To sum up, I'd like to do what testOnly do, in a way : I only want to suggest things that make sense in my context.
Thanks in advance for your answers,
Pierre
Edit : As I got a bit stuck after my latest tries, here is the code of my inputTask, in it's current state :
package io.gatling.sbt
import sbt._
import sbt.complete.{ DefaultParsers, Parser }
import io.gatling.sbt.Utils._
object GatlingTasks {
val lastReport = inputKey[Unit]("Open last report in browser")
val allSimulationIds = taskKey[Set[String]]("List of simulation ids found in reports folder")
val allReports = taskKey[List[Report]]("List of all reports by simulation id and timestamp")
def findAllReports(reportsFolder: File): List[Report] = {
val allDirectories = (reportsFolder ** DirectoryFilter.&&(new PatternFilter(reportFolderRegex.pattern))).get
allDirectories.map(file => (file, reportFolderRegex.findFirstMatchIn(file.getPath).get)).map {
case (file, regexMatch) => Report(file, regexMatch.group(1), regexMatch.group(2))
}.toList
}
def findAllSimulationIds(allReports: Seq[Report]): Set[String] = allReports.map(_.simulationId).distinct.toSet
def openLastReport(allReports: List[Report], allSimulationIds: Set[String]): Unit = {
def simulationIdParser(allSimulationIds: Set[String]): Parser[Option[String]] =
DefaultParsers.ID.examples(allSimulationIds, check = true).?
def filterReportsIfSimulationIdSelected(allReports: List[Report], simulationId: Option[String]): List[Report] =
simulationId match {
case Some(id) => allReports.filter(_.simulationId == id)
case None => allReports
}
Def.inputTaskDyn {
val selectedSimulationId = simulationIdParser(allSimulationIds).parsed
val filteredReports = filterReportsIfSimulationIdSelected(allReports, selectedSimulationId)
val reportsSortedByDate = filteredReports.sorted.map(_.path)
Def.task(reportsSortedByDate.headOption.foreach(file => openInBrowser((file / "index.html").toURI)))
}
}
}
Of course, openReport is called using the results of allReports and allSimulationIds tasks.
I think I'm close to a functioning input task but I'm still missing something...
Def.inputTaskDyn returns a value of type InputTask[T] and doesn't perform any side effects. The result needs to be bound to an InputKey, like lastReport. The return type of openLastReport is Unit, which means that openLastReport will construct a value that will be discarded, effectively doing nothing useful. Instead, have:
def openLastReport(...): InputTask[...] = ...
lastReport := openLastReport(...).evaluated
(Or, the implementation of openLastReport can be inlined into the right hand side of :=)
You probably don't need inputTaskDyn, but just inputTask. You only need inputTaskDyn if you need to return a task. Otherwise, use inputTask and drop the Def.task.

Resources