How to combine values with a reference table in R? - r

This question is very similar to these two posts: How to compare values with a reference table in R?
and Combine Table with reference to another table in R. However, mine is more complicated:
I have two data frames:
> df1
             name value1 value2
1  applefromJapan      2      8
2 applesfromJapan      3      9
3  applenotfromUS      4     10
4  pearsgoxxJapan      5     11
5 bananaxxeeChina      6     12
> df2
                  name value1 value2
1       applefromJapan     33      1
2 watermeleonnotfromUS     34      2
3      applesfromJapan     35      3
4        pearfromChina     36      4
5        pearfromphina     37      5  # only one letter different; this should not cause a problem
and a reference table:
> ref.df
  fruit country             name seller
1   1:5   10:14       appleJapan   John
2   1:5   10:14       pearsJapan   Mike
3   1:6   11:15      applesJapan Nicole
4   1:6   11:12         bananaUS    Amy
5   1:4    9:13        pearChina  Jenny
6   1:5   13:14          appleUS   Mark
7  1:10   18:22 watermeleonchina  James
The reference table works as follows:
the name column in ref.df contains the canonical name; df1 (and df2) can recover this name by trimming their name columns according to the character positions given in the fruit and country columns.
My desired output table is:
#output
           name value1 value2 value3 value4
1    appleJapan      2      8     33      1  # ends up using the name from ref.df
2   applesJapan      3      9     34      2  # merged from df1 and df2
3       appleUS      4     10     NA     NA  # NA where a value does not exist
4    pearsJapan      5     11     NA     NA
5 watermeleonUS     NA     NA     34      2
6     pearChina     NA     NA     73      9  # names differing by only one letter are treated as typos, so the values are summed up
# bananaxxeeChina is absent because it is not referenced by ref.df
*There are thousands of rows in the real dataset (~12,000 rows on average in each of the 22 dfs, and ~200 rows in ref.df); this is only the first couple of rows (with some alteration). So I think it is better to compare each df to ref.df first, and then combine the dfs? But how can I achieve this?
Code for producing the data:
https://codeshare.io/2p179V
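For illustration, here is a minimal sketch of just the trimming/lookup step described above. It assumes the fruit and country columns of ref.df are character strings like "1:5"; trim_name, match_ref and the ref.name column are hypothetical names, and neither the one-letter typo tolerance nor the final merge is handled here (the typos would need an approximate comparison such as adist() on top of this).
# Hypothetical helper: trim one raw name using one ref.df row's ranges,
# e.g. "applefromJapan" with fruit "1:5" and country "10:14" -> "appleJapan".
trim_name <- function(raw, fruit_rng, country_rng) {
  fr <- as.integer(strsplit(as.character(fruit_rng), ":")[[1]])
  co <- as.integer(strsplit(as.character(country_rng), ":")[[1]])
  paste0(substr(raw, fr[1], fr[2]), substr(raw, co[1], co[2]))
}

# Return the reference name whose trimmed form matches the raw name, or NA.
match_ref <- function(raw, ref) {
  for (i in seq_len(nrow(ref))) {
    if (trim_name(raw, ref$fruit[i], ref$country[i]) == as.character(ref$name[i])) {
      return(as.character(ref$name[i]))
    }
  }
  NA_character_
}

df1$ref.name <- vapply(as.character(df1$name), match_ref, character(1), ref = ref.df)
# Once df1 and df2 both carry a ref.name column, they could be combined with
# something like merge(df1, df2, by = "ref.name", all = TRUE).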

Related

Merge, cbind: How to merge better? [duplicate]

This question already has answers here:
R: Adding NAs into Data Frame
(5 answers)
Closed 6 years ago.
I want to merge multiple vectors into a data frame. There are two variables, city and id, that are going to be used for matching the vectors to the data frame.
df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)
I created a data frame like this. To this data frame, I want to merge 50 vectors, each corresponding to one of the 50 cities. The problem is that each city only has 6 observations, so each city will have 4 NAs.
To give you an example, city 1 data looks like this:
set.seed(1234)
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))
I have 50 cities data and I want to merge them to one column in df. I have tried the following code:
for (i in 1:50) {
  citydata <- cbind(city = i, id = sample(1:10, 6), obs = rnorm(6))  # each city's data
  df <- merge(df, citydata, by = c("city", "id"), all = TRUE)        # merge into df
}
But if I run this, the loop will show warnings like this:
In merge.data.frame(df, citydata, by = c("city", "id"), ... :
column names ‘obs.x’, ‘obs.y’ are duplicated in the result
and it will create 50 columns, instead of one long column.
How can I merge cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) to df in one nice, long column? It seems neither cbind nor merge is the way to go.
In case there are 50 citydata objects (each with 6 rows), I can rbind them into one long data frame and use the data.table approach or the expand.grid+merge approach as Philip and Jaap suggested.
I wonder if I can merge each citydata through a loop one by one, instead of rbinding them all and merging the result into df.
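For what it's worth, one way such a loop could be sketched without repeated merge calls is to pre-create the obs column and fill it by matching on a pasted city/id key. This is only a sketch against the df built above; the pasted key is one arbitrary choice of lookup.
set.seed(1234)
df$obs <- NA_real_  # pre-create the target column
for (i in 1:50) {
  citydata <- data.frame(city = i, id = sample(1:10, 6), obs = rnorm(6))
  # locate each citydata row in df via a combined city/id key, then fill obs there
  idx <- match(paste(citydata$city, citydata$id), paste(df$city, df$id))
  df$obs[idx] <- citydata$obs
}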
data.table is good for this:
library(data.table)
df <- data.table(df)
> df
city id
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 1 5
---
496: 50 6
497: 50 7
498: 50 8
499: 50 9
500: 50 10
I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins each column against each value of every other column, so it makes a two-column table with each possible pair of values of city and id. The [,obs:=rnorm(.N)] command adds a third column that draws random values (without recycling them, as it would if it were inside the CJ); in this context, .N means the number of rows of the table.
citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
296: 50 2 0.30592659
297: 50 3 -0.44989646
298: 50 4 0.05359738
299: 50 5 -0.57494269
300: 50 6 0.09565473
setkey(df,city,id)
setkey(citydata,city,id)
As these two tables have the same key columns, the following looks up rows of df by the key columns in citydata, then defines obs in df by looking up the value in citydata. The resulting object is therefore the original df, but with obs defined wherever it was defined in citydata:
df[citydata,obs:=i.obs]
> df
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
496: 50 6 0.09565473
497: 50 7 NA
498: 50 8 NA
499: 50 9 NA
500: 50 10 NA
In base R you can do this with a combination of expand.grid and merge:
citydata <- expand.grid(city=1:50,id=1:6)
citydata$obs <- rnorm(nrow(citydata))
res <- merge(df, citydata, by = c("city","id"), all.x = TRUE)
which gives:
> head(res,12)
city id obs
1: 1 1 -0.3121133
2: 1 2 -1.3554576
3: 1 3 -0.9056468
4: 1 4 -0.6511869
5: 1 5 -1.0447499
6: 1 6 1.5939187
7: 1 7 NA
8: 1 8 NA
9: 1 9 NA
10: 1 10 NA
11: 2 1 0.5423479
12: 2 2 -2.3663335
A similar approach with dplyr and tidyr:
library(dplyr)
library(tidyr)
res <- crossing(city = 1:50, id = 1:6) %>%
  mutate(obs = rnorm(n())) %>%
  right_join(., df, by = c("city", "id"))
which gives:
> res
Source: local data frame [500 x 3]
city id obs
(int) (int) (dbl)
1 1 1 -0.5335660
2 1 2 1.0582001
3 1 3 -1.3888310
4 1 4 1.8519262
5 1 5 -0.9971686
6 1 6 1.3508046
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 NA
.. ... ... ...

How to merge tables and fill the empty cells at the same time in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How to merge the two tables in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45
You could try
library(data.table)
setkey(setDT(a), ID)[b, AGE:= i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
Assuming you have NA in every position in the first table where you want to use the second table's age numbers, you can use rbind and na.omit.
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
This results in what you're after (although unordered, and I assume you just forgot ID 5):
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two different data.frames and keep their columns, it's a different thing. You can use merge to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=T)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[grep('AGE\\.x', names(res))] <- 'AGE' # rename merged column AGE.x to AGE
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
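The same massage could also be written a bit more compactly; a sketch, starting from the res produced by the merge above:
res$AGE <- ifelse(is.na(res$AGE.x), res$AGE.y, res$AGE.x)  # prefer AGE.x, fall back to AGE.y
res[, c("ID", "AGE", "COUNTY", "STATE")]                   # drop the AGE.x/AGE.y columns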
The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.

transform a dataframe from long to wide in r, but needs date transformation

I have a dataframe like this (each "NUMBER" indicate a student):
NUMBER Gender Grade Date.Tested WI WR WZ
1 F 4 2014-02-18 6 9 10
1 F 3 2014-05-30 9 8 2
2 M 5 2013-05-02 7 9 15
2 M 4 2009-05-21 5 7 2
2 M 5 2010-04-29 9 1 4
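For reference, a reproducible version of this sample data might be built as follows (a sketch; the object is named data to match the reshape calls below):
data <- data.frame(
  NUMBER      = c(1, 1, 2, 2, 2),
  Gender      = c("F", "F", "M", "M", "M"),
  Grade       = c(4, 3, 5, 4, 5),
  Date.Tested = as.Date(c("2014-02-18", "2014-05-30",
                          "2013-05-02", "2009-05-21", "2010-04-29")),
  WI = c(6, 9, 7, 5, 9),
  WR = c(9, 8, 9, 7, 1),
  WZ = c(10, 2, 15, 2, 4)
)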
I know I can use:
cook <- reshape(data, timevar= "?", idvar= c("NUMBER","Gender"), direction = "wide")
to change it into a wide format. However, instead of Date.Tested I want to use the test occasion (1st time, 2nd time, etc.) as the time variable, and indicate the grade for each occasion.
What I want at the end is like this:
  NUMBER Gender Grade1 Grade2 Grade3 WI1 WR1 WZ1 WI2 WR2 WZ2 WI3 WR3 WZ3
1      1      F      3      4     NA   9   8   2   6   9  10  NA  NA  NA
and for the rest "NUMBER"s.
I have searched a lot but did not find an answer. Can someone help me with it?
Thank you very much!
Try
data$id <- with(data, ave(seq_along(NUMBER), NUMBER, FUN=seq_along))
reshape(data, idvar=c('NUMBER', 'Gender'), timevar='id', direction='wide')
If you want the Date.Tested variable to be included in the 'idvar' and you need only the first value per group ('NUMBER'/'Gender'):
data$Date.Tested <- with(data, ave(Date.Tested, NUMBER,
                                   FUN = function(x) head(x, 1)))
reshape(data, idvar = c('NUMBER', 'Gender', 'Date.Tested'),
        timevar = 'id', direction = 'wide')

Add a Count field to a data frame [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Closed 8 years ago.
I have the following dataframe dat:
> dat
subjectid variable
1 1234 12
2 1234 14
3 2143 19
4 3456 12
5 3456 14
6 3456 13
How do I add another column which shows the count of each unique subjectid?
ddply(dat,.(subjectid),summarize,quan_95=quantile(variable,0.95),uniq=count(unique(subjectid)))
Here is an approach via dplyr. First we group by subjectid, then use the function n() to count number of rows in each group:
dat <- read.table(text="
subjectid variable
1 1234 12
2 1234 14
3 2143 19
4 3456 12
5 3456 14
6 3456 13")
library(dplyr)
dat %>%
  group_by(subjectid) %>%
  mutate(count = n())
subjectid variable count
1 1234 12 2
2 1234 14 2
3 2143 19 1
4 3456 12 3
5 3456 14 3
6 3456 13 3
If dat is ordered by subjectid
tbl <- table(dat[,1])
transform(dat, count=rep(tbl, tbl))
# subjectid variable count
#1 1234 12 2
#2 1234 14 2
#3 2143 19 1
#4 3456 12 3
#5 3456 14 3
#6 3456 13 3
Similar to ave(), you may also use split/lapply/unsplit:
i = split(dat$variable, dat$subjectid)
count = unsplit(lapply(i, length), dat$subjectid)
Then graft the count variable back using data.frame() or whatever method you prefer (a short sketch follows at the end of this answer).
The split() function just creates a list of dat$variable values for each value of dat$subjectid. The count is found by using lapply() to apply the length() function over each index in the list (i) and unsplit() puts everything back in order.
unsplit() is pure magic and fairy dust. I didn't believe it the first 100 times.
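For completeness, a minimal sketch of that grafting step, using the dat defined earlier; the ave() one-liner mentioned at the start is included for comparison:
dat2 <- data.frame(dat, count)  # graft the unsplit count back onto dat
# the ave() equivalent computes the per-group length in a single call:
dat$count <- ave(dat$variable, dat$subjectid, FUN = length)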

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have two data.frames, "rec" and "book".
Each row in "book" needs to be pasted onto the end of several of the rows of "rec", according to two variables present in both: "MRN" and "COURSE", which must match.
I have tried the following and variations thereon to no avail:
i = 1
newlist = list()
colnames(newlist) = colnames(book)
for (i in 1:dim(rec)[1]) {
  mrn = as.numeric(as.vector(rec$MRN[i]))
  course = as.character(rec$COURSE[i])
  get.vector <- as.vector((as.numeric(as.vector(book$MRN)) == mrn) &
                            (as.character(book$COURSE) == course))
  newlist[i] <- book[get.vector, ]
  i = i + 1
}
If anyone has any suggestions on
1) getting this to work, or
2) making it more elegant (or perhaps just less clumsy),
I would appreciate it. If I have been unclear in any way, I beg your pardon.
I do understand I haven't combined any data above; I think that if I can generate a long-format data.frame, I can combine them all on my own.
Sounds like you need to merge the two data-frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
I've created a simple example that may help you. In my case I wanted to paste the 'value' column from df1 into each row of df2, according to variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
test x1 x2 value
1 1 1 2 12
2 2 1 3 56
3 3 2 1 35
4 4 2 2 68
5 5 1 2 12
6 6 1 3 56
7 7 2 1 35
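For comparison, the same join could be written with dplyr's left_join, which also keeps df2's row order (a sketch using the df1/df2 defined above):
library(dplyr)
left_join(df2, df1, by = c("x1", "x2"))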
