How to extract a single element from a bag in Pig? - bigdata

My Pig statement generates the following output:
({(10)},5201)
({(20),(20),(20)},3334)
({(30),(30),(30),(30)},4632)
({(40),(40)},3101)
({(50),(50)},3801)
({(60),(60),(60)},3959)
But I want to store the above output as follows in Pig:
(10,5201)
(20,3334)
(30,4632)
(40,3101)
(50,3801)
(60,3959)
Is there any way to extract only the first element from a bag in Pig?

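Use the DataFu UDF FirstTupleFromBag (datafu.pig.bags.FirstTupleFromBag) to achieve exactly this. It takes the bag plus a default value to return when the bag is empty, and it returns the first tuple rather than a bag.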
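To make the intended result concrete, here is a minimal Python sketch of what "first tuple from a bag" means for the rows in the question. It only models the behaviour and is not DataFu or Pig code; the helper name first_tuple_from_bag and the list-of-tuples representation of a bag are assumptions made for illustration.

```python
def first_tuple_from_bag(bag, default=None):
    """Return the first tuple in a bag, or `default` if the bag is empty."""
    return bag[0] if bag else default

# Rows shaped like the output in the question: (bag of 1-tuples, second field).
rows = [
    ([(10,)], 5201),
    ([(20,), (20,), (20,)], 3334),
    ([(30,), (30,), (30,), (30,)], 4632),
]

# Take the first tuple of each bag and flatten it next to the second field.
# This assumes no bag is empty; an empty bag would yield `default` instead.
flattened = [first_tuple_from_bag(bag) + (count,) for bag, count in rows]
print(flattened)
```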

Related

Extract XML child attribute based on another child attribute

I have the following XML structure. I am trying to extract the attributes StartDate and EndDate of the relationship period, that is only if rr:PeriodType is RELATIONSHIP_PERIOD.
However, the nodes for "relationship" and "accounting" have exactly the same name, and I am not sure how to proceed.
<rr:RelationshipPeriods>
<rr:RelationshipPeriod>
<rr:StartDate>2018-01-01T00:00:00.000Z</rr:StartDate>
<rr:EndDate>2018-12-31T00:00:00.000Z</rr:EndDate>
<rr:PeriodType>ACCOUNTING_PERIOD</rr:PeriodType>
</rr:RelationshipPeriod>
<rr:RelationshipPeriod>
<rr:StartDate>2019-01-02T00:00:00.000Z</rr:StartDate>
<rr:PeriodType>RELATIONSHIP_PERIOD</rr:PeriodType>
</rr:RelationshipPeriod>
</rr:RelationshipPeriods>
I tried using this code
ldply(xpathApply(xmlData, '//rr:RelationshipPeriod/rr:StartDate', getChildrenStrings), rbind)
But this doesn't work well, because it's hard to tell whether it is extracting the accounting period or the relationship period.
Any help would be greatly appreciated!
For rr:StartDate use XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']/rr:StartDate
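But it is probably better to first find the correct rr:RelationshipPeriod node using the XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']
See this answer on how to reuse the result of an XPath.
When you then query relative to that node, don't put // in front of rr:StartDate and rr:EndDate, because // searches the whole document again instead of only the node you found.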
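The "find the right rr:RelationshipPeriod first, then read its children relative to it" approach can be sketched in Python with the standard library. This is only an illustration: the namespace URI http://example.com/rr is a placeholder (the question does not show the real xmlns:rr value), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; substitute the real xmlns:rr value from the document.
NS = {"rr": "http://example.com/rr"}

XML = """<rr:RelationshipPeriods xmlns:rr="http://example.com/rr">
  <rr:RelationshipPeriod>
    <rr:StartDate>2018-01-01T00:00:00.000Z</rr:StartDate>
    <rr:EndDate>2018-12-31T00:00:00.000Z</rr:EndDate>
    <rr:PeriodType>ACCOUNTING_PERIOD</rr:PeriodType>
  </rr:RelationshipPeriod>
  <rr:RelationshipPeriod>
    <rr:StartDate>2019-01-02T00:00:00.000Z</rr:StartDate>
    <rr:PeriodType>RELATIONSHIP_PERIOD</rr:PeriodType>
  </rr:RelationshipPeriod>
</rr:RelationshipPeriods>"""

def relationship_period_dates(xml_text):
    """Return (StartDate, EndDate) for every RELATIONSHIP_PERIOD node."""
    root = ET.fromstring(xml_text)
    results = []
    for period in root.findall(".//rr:RelationshipPeriod", NS):
        # Keep only the periods whose PeriodType child matches.
        if period.findtext("rr:PeriodType", namespaces=NS) != "RELATIONSHIP_PERIOD":
            continue
        # Relative lookups: no leading //, so only this node's children are read.
        start = period.findtext("rr:StartDate", namespaces=NS)
        end = period.findtext("rr:EndDate", namespaces=NS)  # None if absent
        results.append((start, end))
    return results

print(relationship_period_dates(XML))
```

In the sample XML the relationship period has no rr:EndDate, so the second value comes back as None rather than being taken from the accounting period.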

Extract items in a list using variable names in R

I'm parsing a JSON using the RJSONIO package.
The parsed item contains nested lists.
Each item in the list can be extracted using something like this:
dat_raw$`12`[[31]]
which correctly returns the string stored at this location (in this example, the '12' refers to the month and [[31]] to day).
"31-12-2021"
I now want to run a for loop to sequentially extract the date for every month. Something like this:
for (m in 1:12) {
print(dat_raw$m[[31]])
}
This, naturally, returns a NULL because there is no $m[[31]] in the list.
Instead, I'd like to extract the objects stored at $`1`[[31]], $`2`[[31]], ... $`12`[[31]].
There must be a relatively easy solution here but I haven't managed to crack it. I'd value some help. Thanks.
EDIT: I've added a screenshot of the list structure I'm trying to extract. The actual JSON object is quite large for a dput() output. Hope this helps
So, to get the date in this list, I'd use something like dat_raw$data$`1`[[1]]$date$gregorian$date.
What I'm trying to do is run a loop to extract multiple items of the list by cycling through $data$`1`[[1]]$..., $data$`2`[[1]]$... ... $data$`12`[[1]]$... using $data$m[[1]]$... in a for loop where m is the month.
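Instead of dat_raw$`12`[[31]], you can use dat_raw[[12]][[31]] if `12` is the 12th element of the parsed JSON. The $ operator takes the name literally, whereas [[ ]] evaluates its argument, so it works with a loop variable:
for (m in 1:12) {
  # to look the month up by name rather than by position, use dat_raw[[as.character(m)]]
  print(dat_raw[[m]][[31]])
}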
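For comparison, the same idea (building the key from the loop variable instead of hard-coding it) looks like this in Python. The JSON below is a heavily trimmed, made-up stand-in for the structure described in the question's edit, and first_day_dates is a hypothetical helper name.

```python
import json

# Made-up, trimmed stand-in for the parsed JSON described in the question.
raw = """
{"data": {
  "1": [{"date": {"gregorian": {"date": "01-01-2021"}}}],
  "2": [{"date": {"gregorian": {"date": "01-02-2021"}}}]
}}
"""
dat_raw = json.loads(raw)

def first_day_dates(parsed, months):
    """Collect the first entry's gregorian date for each requested month."""
    dates = []
    for m in months:
        # The key is computed from the loop variable, not written literally.
        month_entries = parsed["data"][str(m)]
        dates.append(month_entries[0]["date"]["gregorian"]["date"])
    return dates

print(first_day_dates(dat_raw, range(1, 3)))
```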

Is there a way to print out single values from a multidimensional list in Python?

Imagine I have a multidimensional list like this: vals = [['John', '20'], ['Derron', '5'], ['Mike', '43']]. What can I do to print out only the names, e.g. John, Derron, Mike?
Which language do you want to do this in? In JavaScript, you can try the solution below.
function print(val) {
  console.log(val[0])
}
vals.forEach(print)
You can use a for loop. In Python, for example:
for sublist in vals:
    print(sublist[0])
The first line of this code will loop through each sublist. For this question, ['John','20'] is a sublist. The second line will print out the first element (aka name) of each of these sublists.
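If the names should appear on one line, as in the "John, Derron, Mike" example from the question, one possible variation is to collect the first elements with a list comprehension and join them. The variable name names is just illustrative.

```python
vals = [['John', '20'], ['Derron', '5'], ['Mike', '43']]

# Collect the first element (the name) of every sublist.
names = [sublist[0] for sublist in vals]

# Join the names with a comma and a space and print them on one line.
print(", ".join(names))  # John, Derron, Mike
```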

Automatically generate a character vector with 100 elements

I would like to automatically create a vector with the following elements:
elements <- c("elem[1]", "elem[2]", ..., "elem[100]")
without typing elem[1], elem[2] etc by hand. How can I do this automatically?
Thanks
You can use paste0(), which is vectorized over 1:100:
#Code
paste0('elem[',1:100,']')

How to use indirect reference to read contents into a data table in R.

How do you use indirect references in R? More specifically, in the following simple read statement, I would like to be able to use a variable name to read inputFile into data table myTable.
myTable <- read.csv(inputFile, sep=",", header=T)
Instead of the above, I want to define
refToMyTable <- "myTable"
Then, how can I use refToMyTable instead of myTable to read inputFile into myTable?
Thanks for the help.
R doesn't really have references like that, but you can use strings to retrieve/create variables of that name.
But first let me say this is generally not a good practice. If you're looking to do this type of thing, it's generally a sign that you're not doing it "the R way."
Nevertheless,
assign(refToMyTable, read.csv(inputFile, sep=",", header=T))
should do the trick. The complement to assign is get, which retrieves a variable's value using its name.
I think you mean something like the following:
reftomytable='~/Documents/myfile.csv'
myTable=read.csv(reftomytable)
Perhaps assign as mentioned by MrFlick.
When you want the contents of the object named "myTable" you would use get:
get("myTable")
get(refToMyTable) # since get will evaluate its argument
(It would be better to assign the results of multiple such data frames to a list object or a Reference Class.)
If you wanted a language-name object you would use as.name:
as.name("myTable")
# myTable .... is printed at the console; note no quotes
str(as.name("myTable"))
#symbol myTable
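The advice above, to keep such tables in a container keyed by name rather than creating variables from strings, can be sketched in Python for comparison. This is an illustration under stated assumptions: the in-memory CSV text stands in for inputFile, and read_table, tables, and ref_to_my_table are hypothetical names.

```python
import csv
import io

def read_table(source):
    """Read CSV rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))

# One container holds every table, keyed by the name stored in a string.
tables = {}
ref_to_my_table = "myTable"

# In-memory stand-in for inputFile; a real file opened with open() works the same way.
input_file = io.StringIO("id,value\n1,a\n2,b\n")
tables[ref_to_my_table] = read_table(input_file)

# Look the table up again through the same string.
print(tables[ref_to_my_table])
```

Because the name is just a key, no variable has to be created or fetched by string, which is the role assign and get play in the R answers.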
