GROUPing data in Pig - bigdata

I have data as follows:
name marks
ABC 2
ABC 3
ABC 3
XYZ 1
XYZ 2
I want the output to be:
ABC 8
XYZ 3
My script is:
groupdata = GROUP filedata by name;
sumdata =FOREACH groupdata GENERATE filedata.name,SUM(filedata.marks);
DUMP sumdata;
I am getting the output as:
({ABC,ABC,ABC},8)
({XYZ,XYZ},3)
What is wrong with my script?

Use the keyword group instead of filedata.name. After a GROUP, the grouping key is available as group, while filedata.name is the bag of every name value in the group, which is why you see {ABC,ABC,ABC} in your output:
sumdata = FOREACH groupdata GENERATE group,SUM(filedata.marks);
DUMP sumdata;
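For context, a minimal end-to-end sketch (the input path and schema here are assumptions, since the LOAD statement wasn't shown):
filedata = LOAD 'input/marks.txt' USING PigStorage(' ') AS (name:chararray, marks:int);  -- hypothetical path/schema
groupdata = GROUP filedata BY name;
sumdata = FOREACH groupdata GENERATE group, SUM(filedata.marks);
DUMP sumdata;
-- (ABC,8)
-- (XYZ,3)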

Related

Append data to different sheets of an Excel file in R

I have a dataframe like
All_DATA
ID Name Age
1 xyz 10
2 pqr 20
5 abc 15
6 pqr 19
8 xyz 10
9 pqr 12
10 abc 20
54 abc 41
Right now I have code that subsets the data based on Name and writes each subset to a different Excel file, but now I want them in the same Excel file on different sheets.
Here is the code that writes them to separate Excel files:
library("xlsx")
library("openxlsx")
All_DATA = read.xlsx("D:/test.xlsx")
data.list = list()
for(i in unique(All_DATA$Name)){
  data.list[[i]] = subset(All_DATA, Name == i)
  write.xlsx(data.list[[i]], file=paste0("D:/Admin/",i,".xlsx"), row.names=FALSE)
}
Is there any way to generate a single Excel file with the data on multiple sheets?
Thanks,
Domnick
You can use the xlsx package's append argument to write each subset as a new sheet of the same workbook (namespaced here because openxlsx is also loaded and masks write.xlsx; note the sheet name should not carry a .xlsx suffix):
xlsx::write.xlsx(data.list[[i]], file="file.xlsx", sheetName=paste0("Sheet_",i), row.names=FALSE, append=TRUE)
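Alternatively, with openxlsx, writing a named list creates one sheet per list element, which avoids the loop entirely; a minimal sketch using the data.list built above (the output path is an assumption):
# each named element of data.list becomes its own sheet
openxlsx::write.xlsx(data.list, file = "D:/Admin/All_DATA.xlsx", rowNames = FALSE)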

I want to match a column from one table with another table and replace those matching values [duplicate]

I have a table, say Table A:
uid pid code
1 1 aaa
2 1 ccc
3 4 ddd
4 2 eee
I have another table, Table B:
pid msg
1 good
2 inspiring
3 thing to wtch
4 terrible
Now I want to replace pid in Table A with msg from Table B.
I used merge(tableA, tableB, by = c("pid"))
and got this result:
uid pid code msg
1 1 aaa good
2 1 ccc good
3 4 ddd terrible
4 2 eee inspiring
whereas I want the result to be:
uid msg code
1 good aaa
2 good ccc
3 terrible ddd
4 inspiring eee
Your approach is correct; it just needs two further steps:
selecting the required columns
reordering them
With tidyverse functions, you can do something like:
library(dplyr)
TableA %>%
  left_join(TableB) %>%  # joins on the common column "pid"
  select(uid, msg, code)
which gives:
uid msg code
1 1 good aaa
2 2 good ccc
3 3 terrible ddd
4 4 inspiring eee
Base R solution:
newtable = merge(tableA, tableB, by = "pid")          # join on pid; pid becomes the first column
newtable$pid = NULL                                   # drop the pid column
newtable = newtable[order(newtable$uid), c(1, 3, 2)]  # sort by uid, reorder to uid, msg, code

Merge Uneven Data Files

I have 5 weeks of measured data in 5 separate CSV files, and am looking for a way to merge them into a single document that makes sense. The issue I'm having is that not all data points are present in each file; my largest file has ~20k rows and my smallest ~2k, so there isn't a 1:1 relationship. Here's what my data looks like:
Keyword URL 5/12 Rank
activity site.com 2
activity site.com/page 1
backup site.com/backup 4
The next file would look something like this:
Keyword URL 5/19 Rank
activity site.com/page 2
database site.com/data 3
What I'd like to end up with is something like this:
Keyword URL 5/12 Rank 5/19 Rank
activity site.com 2 -
activity site.com/page 1 2
backup site.com/backup 4 -
database site.com/data - 3
My preference would be to do this with R. I think plyr will make this a snap, but I've never used it before and I'm just not getting how this comes together.
Use merge:
csv1 <- read.table(header=TRUE, text="
Keyword URL 5/12_Rank
activity site.com 2
activity site.com/page 1
backup site.com/backup 4
")
csv2 <- read.table(header=TRUE, text="
Keyword URL 5/19_Rank
activity site.com/page 2
database site.com/data 3
")
csv12 <- merge(csv1, csv2, all=TRUE)
#> csv12
# Keyword URL X5.12_Rank X5.19_Rank
#1 activity site.com 2 NA
#2 activity site.com/page 1 2
#3 backup site.com/backup 4 NA
#4 database site.com/data NA 3
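If you want the literal "-" placeholders from your example instead of NA, you can substitute them afterwards (note this coerces the rank columns to character):
csv12[is.na(csv12)] <- "-"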
If you have several tables, you can put them in a list and use Reduce:
csv3 <- read.table(header=TRUE, text="
Keyword URL 5/42_Rank
activity site.com 5
html site.com/data 6
")
L <- list(csv1, csv2, csv3)
Reduce(f=function(x,y)merge(x,y,all=TRUE), L)
Result
# Keyword URL X5.12_Rank X5.19_Rank X5.42_Rank
#1 activity site.com 2 NA 5
#2 activity site.com/page 1 2 NA
#3 backup site.com/backup 4 NA NA
#4 database site.com/data NA 3 NA
#5 html site.com/data NA NA 6

Counting the repeated values of a variable by id

I am taking my very first steps with SAS, and I have run into the following problem, which I am not able to solve.
Suppose my dataset is:
data dat;
input id score gender;
cards;
1 10 1
1 10 1
1 9 1
1 9 1
1 9 1
1 8 1
2 9 2
2 8 2
2 9 2
2 9 2
2 10 2
;
run;
What I need to do is count the number of times the score variable takes the values 8, 9 and 10 for each id, and then create new variables count8, count9 and count10 so that I get the following output:
id gender count8 count9 count10
1 1 1 3 2
2 2 1 3 1
How would you suggest I proceed? Any help would be greatly appreciated.
There are lots of ways to do that. Here's a simple one-data-step approach.
data want;
    set dat;
    by id;                    /* requires dat to be sorted by id */
    if first.id then do;      /* reset the counters at each new id */
        count8=0;
        count9=0;
        count10=0;
    end;
    select(score);            /* like CASE...WHEN */
        when(8)  count8+1;    /* sum statement: value is retained across rows */
        when(9)  count9+1;
        when(10) count10+1;
        otherwise;
    end;
    if last.id then output;   /* write one row per id */
    keep id count8 count9 count10;
run;
SELECT...WHEN is essentially shorthand for a series of IF statements (like CASE...WHEN in other languages).
Gender should be dropped, by the way, unless it's always the same within an ID (or unless you intend to count by it).
A more flexible approach than this is to use a PROC FREQ (or PROC MEANS or ...) and transpose it:
proc freq data=dat noprint;
    tables id*score / out=want_pre;
run;
proc transpose data=want_pre out=want prefix=count;
    by id;
    id score;
    var count;
run;
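For reference, the OUT= dataset from PROC FREQ has one row per id*score combination with COUNT and PERCENT columns, which PROC TRANSPOSE then pivots into one row per id; with the sample data, want_pre looks roughly like this (a sketch, not actual output):
/* want_pre:
   id  score  COUNT  PERCENT
    1      8      1      ...
    1      9      3      ...
    1     10      2      ...
    2      8      1      ...
    2      9      3      ...
    2     10      1      ...
*/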
If you really only want 8, 9, and 10 and want to drop records with lower scores, filter in the data= option of PROC FREQ:
proc freq data=dat(where=(score ge 8)) noprint;
    tables id*score / out=want_pre;
run;

How to join data from three different spreadsheets?

I have 3 TSV files containing different data on my employees. I can join this data on the employees' first and last names, which appear in each file.
I would like to gather all the data for each employee into a single spreadsheet.
(I can't just copy/paste the columns, because some employees are missing from file 2, for example, but present in file 3.)
So I think (I am a beginner) a script could do that: for each employee (a row), gather as much data as possible from the files into a new TSV file.
Edit.
Example of what I have (in reality each file has approximately 300 rows; some employees are not in all files):
file 1
john hudson 03/03 male
mary kate 34/04 female
harry loup 01/01 male
file 2
harry loup 1200$
file3
mary kate atlanta
What I want:
column1 colum2 column3 column4 column5 column6
john hudson 03/03 male
mary kate 34/04 female atlanta
harry loup 01/01 male 1200$
It would help me a lot!
Use this Python script:
import sys, re

r = []    # one dict per input file: {first column: rest of the line}
res = []  # keys from the first file, in input order
for i, f in enumerate(sys.argv[1:]):
    r.append({})
    for l in open(f):
        # split on the first run of whitespace: key, remainder
        a, b = re.split(r'\s+', l.rstrip(), maxsplit=1)
        r[i][a] = b
        if i == 0:
            res.append(a)

for l in res:
    print(l, " ".join(r[k].get(l, '-') for k in range(len(r))))
The script loads each file into its own dictionary, using the first column as the key. It then iterates over the keys from the first file and prints the corresponding values from each file's dictionary, printing '-' where a file has no entry for that key.
Example of usage:
$ cat 1.txt
user1 100
user2 200
user3 300
$ cat 2.txt
user2 2200
user3 2300
$ cat 3.txt
user1 1
user3 3
$ python 1.py [123].txt
user1 100 - 1
user2 200 2200 -
user3 300 2300 3
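Note that the script keys on the first whitespace-separated field only. With your sample data, where names are two words, you would want "first last" as the key; one way (an assumption about your exact format) is to split off two fields with a, b, rest = re.split(r'\s+', l.rstrip(), maxsplit=2) and use a + ' ' + b as the key.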
If you're familiar with SQL, then you can use the Perl DBD::CSV module to do the job easily. But that also depends on whether you're comfortable writing Perl.
