Merge, cbind: How to merge better? [duplicate]

I want to merge multiple vectors into a data frame. Two variables, city and id, will be used to match the vectors to the data frame.
df <- data.frame(array(NA, dim = c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)
I created a data frame like this. Into this data frame I want to merge 50 vectors, one per city. The problem is that each city only has 6 observations, so each city will end up with 4 NAs.
To give you an example, city 1 data looks like this:
set.seed(1234)
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))
I have data for 50 cities and I want to merge them into one column in df. I have tried the following code:
for (i in 1:50) {
  citydata <- cbind(city = i, id = sample(1:10, 6), obs = rnorm(6)) # each city's data
  df <- merge(df, citydata, by = c("city", "id"), all = TRUE)       # merge into df
}
But if I run this, the loop will show warnings like this:
In merge.data.frame(df, citydata, by = c("city", "id"), ... :
column names ‘obs.x’, ‘obs.y’ are duplicated in the result
and it will create 50 columns, instead of one long column.
How can I merge each cbind(city=i, id=sample(1:10,6), obs=rnorm(6)) into df as one nice, long column? It seems neither cbind nor merge is the way to go.
In case there are 50 citydata (each with 6 rows), I can rbind them into one long data frame and use the data.table approach or the expand.grid + merge approach that Philip and Jaap suggested.
I wonder if I can merge each citydata one by one in a loop, instead of rbinding them all and merging once.
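To make the intent concrete, here is a sketch of the kind of loop I have in mind, filling a pre-allocated obs column with match instead of merge (data.frame instead of cbind so the columns keep their types):
set.seed(1234)
df$obs <- NA_real_ # pre-allocate the result column
for (i in 1:50) {
  citydata <- data.frame(city = i, id = sample(1:10, 6), obs = rnorm(6))
  rows <- df$city == i # the 10 rows of df for this city
  df$obs[rows] <- citydata$obs[match(df$id[rows], citydata$id)] # unmatched ids stay NA
}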

data.table is good for this:
library(data.table)
df <- data.table(df)
> df
city id
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 1 5
---
496: 50 6
497: 50 7
498: 50 8
499: 50 9
500: 50 10
I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins each column against each value of each other column, so it makes a two-column table with each possible pair of values of city and id. The [,obs:=rnorm(.N)] command adds a third column that draws random values (without recycling them, as it would if it were inside the CJ); .N means "the number of rows of this table" in this context.
citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
296: 50 2 0.30592659
297: 50 3 -0.44989646
298: 50 4 0.05359738
299: 50 5 -0.57494269
300: 50 6 0.09565473
setkey(df,city,id)
setkey(citydata,city,id)
As these two tables have the same key columns, the following looks up the rows of df that match citydata on the key columns, then defines obs in df by the value found in citydata. The resulting object is the original df, with obs filled in wherever it was defined in citydata:
df[citydata,obs:=i.obs]
> df
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
496: 50 6 0.09565473
497: 50 7 NA
498: 50 8 NA
499: 50 9 NA
500: 50 10 NA
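As a quick sanity check, the unfilled rows should number 50 cities times 4 unmatched ids each:
df[is.na(obs), .N]
# [1] 200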

In base R you can do this with a combination of expand.grid and merge:
citydata <- expand.grid(city=1:50,id=1:6)
citydata$obs <- rnorm(nrow(citydata))
res <- merge(df, citydata, by = c("city","id"), all.x = TRUE)
which gives:
> head(res,12)
city id obs
1: 1 1 -0.3121133
2: 1 2 -1.3554576
3: 1 3 -0.9056468
4: 1 4 -0.6511869
5: 1 5 -1.0447499
6: 1 6 1.5939187
7: 1 7 NA
8: 1 8 NA
9: 1 9 NA
10: 1 10 NA
11: 2 1 0.5423479
12: 2 2 -2.3663335
A similar approach with dplyr and tidyr:
library(dplyr)
library(tidyr)
res <- crossing(city = 1:50, id = 1:6) %>%
  mutate(obs = rnorm(n())) %>%
  right_join(., df, by = c("city", "id"))
which gives:
> res
Source: local data frame [500 x 3]
city id obs
(int) (int) (dbl)
1 1 1 -0.5335660
2 1 2 1.0582001
3 1 3 -1.3888310
4 1 4 1.8519262
5 1 5 -0.9971686
6 1 6 1.3508046
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 NA
.. ... ... ...

Related

Unique ID for interconnected cases

I have the following data frame, that shows which cases are interconnected:
DebtorId DupDebtorId
1: 1 2
2: 1 3
3: 1 4
4: 5 1
5: 5 2
6: 5 3
7: 6 7
8: 7 6
My goal is to assign a unique group ID to each group of cases. The desired output is:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 6 2
7: 7 2
My train of thought:
library(data.table)
example <- data.table(
  DebtorId = c(1, 1, 1, 5, 5, 5, 6, 7),
  DupDebtorId = c(2, 3, 4, 1, 2, 3, 7, 6)
)
unique_pairs <- example[!duplicated(t(apply(example, 1, sort))),] # get unique pairs of DebtorId and DupDebtorId
unique_pairs[, group := .GRP, by = .(DebtorId)] # assign a group ID for each DebtorId
unique_pairs[, num := rowid(group)]
groups <- dcast(unique_pairs, group + DebtorId ~ num, value.var = 'DupDebtorId') # reshape to wide, one row per group ID
#create new data table with unique cases to assign group ID
newdt <- data.table(DebtorId = sort(unique(c(example$DebtorId, example$DupDebtorId))), group = NA)
newdt$group <- as.numeric(newdt$group)
#loop through the mapped groups, selecting the first instance of group ID for the case
for (i in 1:nrow(newdt)) {
  a <- newdt[i]$DebtorId
  b <- min(which(groups[,-1] == a, arr.ind = TRUE)[,1])
  newdt[i]$group <- b
}
Output:
DebtorId group
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
6: 6 3
7: 7 3
There are two problems with my approach:
1. From the output, you can see that it fails to recognize that case 5 belongs to group 1.
2. The final loop is agonizingly slow, which renders it useless for my real use case of 1M rows, and the traditional := approach does not work with which().
I'm not sure whether my approach can be optimized, or whether there is a better way of doing this altogether.
This functionality already exists in igraph, so if you don't need to do it yourself, we can build a graph from your data frame and then extract the cluster membership. stack() is just an easy way to convert a named vector to a data frame.
library(igraph)
g <- graph.data.frame(example)
df_membership <- clusters(g)$membership
stack(df_membership)
#> values ind
#> 1 1 1
#> 2 1 5
#> 3 2 6
#> 4 2 7
#> 5 1 2
#> 6 1 3
#> 7 1 4
Above, values corresponds to group and ind to DebtorId.
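If you want it in the exact shape of the desired output, a small follow-up sketch (res is just an illustrative name):
res <- setNames(stack(df_membership), c("group", "DebtorId"))
res$DebtorId <- as.integer(as.character(res$DebtorId)) # ind is a factor of vertex names
res[order(res$DebtorId), c("DebtorId", "group")]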

Merge multiple data frames with partially matching rows

I have data frames with lists of elements such as names. The data frames contain different names, but most of them match across the data frames. I'd like to combine all of them into one list in which I can see whether any names are missing from any of the data frames.
DATA sample for df1:
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 RH_Type-Function-S
6 6 RH_REFERENT-S
and for df2
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 UCZESTNIK
6 6 COACH
and expected result would be:
NAME. df1 df2
1 COACH NA 6
2 lh_Structure/Focus_C 4 4
3 lh_Structure/Focus_S 3 3
4 RH_REFERENT-S 6 NA
5 rh_Structure/Focus_C 2 2
6 rh_Structure/Focus_S 1 1
7 RH_Type-Function-S 5 NA
8 UCZESTNIK NA 5
I can do that for two data frames with merge.data.frame(df1, df2, by = "x", all = TRUE),
but I can't see how to do it with more data frames of similar structure.
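The kind of generalization I have in mind would be something like chaining the merge over a list with Reduce (a sketch; each X column is first renamed after its source so the value columns don't collide):
dfs <- list(df1 = df1, df2 = df2) # further data frames would go here
dfs <- Map(function(d, nm) setNames(d[c("x", "X")], c("x", nm)), dfs, names(dfs))
Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), dfs)
but I'm not sure this is the best approach. Any help would be appreciated.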
It might be easier to work with this in a long form. Just rbind all the datasets below one another with a flag for which dataset they came from. Then it's relatively straightforward to get a tabulation of all the missing values (and as an added bonus, you can see if you have any duplicates in any of the source datasets):
dfs <- c("df1","df2")
dfall <- do.call(rbind, Map(cbind, mget(dfs), src=dfs))
table(dfall$x, dfall$src)
# df1 df2
# COACH 0 1
# lh_Structure/Focus_C 1 1
# lh_Structure/Focus_S 1 1
# RH_REFERENT-S 1 0
# rh_Structure/Focus_C 1 1
# rh_Structure/Focus_S 1 1
# RH_Type-Function-S 1 0
# UCZESTNIK 0 1
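If you still want the wide layout from the expected result, this long form can be reshaped back out; a sketch, assuming each source keeps its row index in column X:
res <- reshape(dfall[, c("x", "X", "src")], direction = "wide", idvar = "x", timevar = "src")
names(res) <- c("NAME.", "df1", "df2") # rename to match the expected output
res[order(tolower(res$NAME.)), ]       # sort like the expected output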

How to replace the NAs in a data frame with the average for each group

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
I can get the average "nums" for each "id" using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is replace each NA with the average "nums" of the corresponding id. For example, if the average "nums" of ids 1, 2 and 3 are 1000, 2000 and 3000 respectively, the NA with id == 3 will be replaced by 3000, and the last NA, whose id == 1, will be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, which means I always pass NA to the NAs I want to replace.
I don't know where I went wrong. Or do you have a better method to do this?
Thank you
Your subset id_avg[id_avg[,"id"] == temp,] compares the whole id column element-wise against the vector temp (with recycling), so it doesn't pick out the rows you expect. Since the ids here are the consecutive integers 1:3 and id_avg is sorted by id, you can fix it by indexing directly:
dat[is.na(dat$nums),]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
What you want is contained in the zoo package.
library(zoo)
na.aggregate.default(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
Here is a dplyr way (note that as.integer truncates the group mean, keeping the column integer):
df %>%
group_by(id) %>%
mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = T))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
You essentially want to merge the id_avg back to the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1

How can I reshape my dataframe?

I have a huge data frame that, in a simplified version, looks like this:
trials=c("1","2","3","4","5","6","7","8","9","10")
co =c(rep ("1",10))
stim=c("8","9","11","2","4","7","8","1","12","16")
ansbin=c("1","0","1","0","0","1","0","1","1","0")
stim.1=c("11","2","11","7","4","3","9","1","4","16")
ansbin.1=c("0","0","1","0","0","1","0","1","1","1")
trials.1=c("1","2","3","4","5","6","7","8","9","10")
co.1 =c(rep ("2",10))
stim1.1=c("11","2","11","2","5","7","8","15","17","10")
ansbin1.1=c("1","1","1","0","0","1","1","1","0","1")
stim2.1=c("11","2","14","1","4","8","9","10","4","12")
ansbin2.1=c("0","1","1","0","0","1","0","0","1","0")
ID<- data.frame(trials,co,stim,ansbin,stim.1,ansbin.1,trials.1,co.1,stim1.1,ansbin1.1,stim2.1,ansbin2.1)
View(ID)
Now I would like to build a new data frame in which "stim", "stim.1", "stim1.1" and "stim2.1" are in a single column called "stimulus", and the same for the answers: "ansbin", "ansbin.1", "ansbin1.1" and "ansbin2.1" in a single column called "answers".
trials and trials.1 should likewise end up in the same column, with the "co" column marking which set each row came from.
I tried to use reshape like this:
df <- reshape(ID, direction = "long",
              idvar = c("trials", "co"),
              varying = c("stim", "stim.1", "stim1.1", "stim2.1",
                          "ansbin", "ansbin.1", "ansbin1.1", "ansbin2.1"),
              v.names = c("stimulus", "answer"),
              timevar = "num")
but I get errors and warnings every time. I think the problem is linked to the column names.
Can you help me?
Thank you in advance! :)
Here's the approach I would take:
library(data.table)
melt(
rbindlist(split.default(ID, cumsum(grepl("^trials", names(ID))))),
measure.vars = patterns("^stim", "^ansbin"), value.name = c("stim", "ansbin"))
# trials co variable stim ansbin
# 1: 1 1 1 8 1
# 2: 2 1 1 9 0
# 3: 3 1 1 11 1
# 4: 4 1 1 2 0
# 5: 5 1 1 4 0
# ---
# 36: 6 2 2 8 1
# 37: 7 2 2 9 0
# 38: 8 2 2 10 0
# 39: 9 2 2 4 1
# 40: 10 2 2 12 0
Basically, it sounds like you're looking at two rounds of "reshaping":
1. Stacking the columns from "trials" to the second set of "ansbin" on top of each other. I've done that with the rbindlist(split.default(...)) part of my answer.
2. Stacking each resulting pair of "stim" and "ansbin" columns on top of each other. I've done that with the melt(...) part of my answer.
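To see what the first round produces on its own, a sketch (use.names = FALSE binds the two six-column blocks by position, since their names differ):
stacked <- rbindlist(split.default(ID, cumsum(grepl("^trials", names(ID)))), use.names = FALSE)
names(stacked) # "trials" "co" "stim" "ansbin" "stim.1" "ansbin.1"
nrow(stacked)  # 20: the two 10-row blocks stacked on top of each other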
Consider building a list of reshaped data frames for each set: co, trials, stimulus, and answers, then merging them together. However, because co and trials only carry two columns each while the latter two carry four, consider repeating those columns prior to reshaping:
ID$co2 <- ID$co
ID$co3 <- ID$co.1
ID$trials.2 <- ID$trials
ID$trials.3 <- ID$trials.1
df_list <- lapply(c("co", "trials", "stim", "ans"), function(s)
  reshape(ID, direction = "long",
          varying = grep(s, names(ID)),
          v.names = c(s),
          drop = grep(paste0("^", s), names(ID), invert = TRUE),
          timevar = "num",
          new.row.names = 1:1000)
)
# CHAIN MERGE
finaldf <- Reduce(function(x, y) merge(x, y, by=c('id', 'num')), df_list)
finaldf <- with(finaldf, finaldf[order(num, id),]) # SORT DATAFRAME
rownames(finaldf) <- NULL # RESET ROWNAMES
head(finaldf)
# id num co trials stim ans
# 1 1 1 1 1 8 1
# 2 2 1 1 2 9 0
# 3 3 1 1 3 11 1
# 4 4 1 1 4 2 0
# 5 5 1 1 5 4 0
# 6 6 1 1 6 7 1

Counting occurrences in a column and creating a variable in R

I am new to R. I have a data frame called "CT" containing a column called "ID" with several hundred different identification numbers (these are patients). Most numbers appear once, but some appear two or three times (therefore, in different rows).
In the CT data frame, I would like to insert a new variable, called "countID", indicating the number of occurrences of each patient (multiple records should still appear several times).
I tried two different strategies after reading this forum:
1st strategy:
CT <- cbind(CT, countID = sequence(rle(CT.long$ID)$lengths))
But this doesn't work; I get only one count.
2nd strategy: create a data frame with two columns (one is ID, one is count) and then match this data frame with CT:
tabs <- table(CT.long$ID)
out <- data.frame(item=names(unlist(tabs)),count=unlist(tabs)[],stringsAsFactors=FALSE)
rownames(out) = c()
head(out)
# item count
# 1 1.312 1
# 2 1.313 2
# 3 1.316 1
# 4 1.317 1
# 5 1.321 1
# 6 1.322 1
So this works fine, but I can't merge the two data frames: the number of rows doesn't match between "out" and "CT" (out has fewer rows, of course).
Maybe someone has an elegant solution to add the number of occurrences directly in the data.frame CT, or correctly match the two data.frames?
You were almost there! rle will work very nicely; you just need to sort your table on ID before computing the rle:
CT <- data.frame(value = runif(10), id = sample(5, 10, replace = TRUE))
# sort on ID when calculating rle
Count <- rle( sort( CT$id ) )
# match values
CT$Count <- Count[[1]][ match( CT$id , Count[[2]] ) ]
CT
# value id Count
#1 0.94282600 1 4
#2 0.12170165 2 2
#3 0.04143461 1 4
#4 0.76334609 3 2
#5 0.87320740 4 1
#6 0.89766749 1 4
#7 0.16539820 1 4
#8 0.98521044 5 1
#9 0.70609853 3 2
#10 0.75134208 2 2
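Incidentally, the table()-based out from your question can be matched back the same way, using the CT and out objects as defined there; a sketch (the ids are coerced to character because out$item holds the table's names):
CT$countID <- out$count[match(as.character(CT$ID), out$item)]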
data.table usually provides the quickest way
set.seed(3)
library(data.table)
ct <- data.table(id=sample(1:10,15,replace=TRUE),item=round(rnorm(15),3))
ct[, countid := .N, by = id] # := adds countid by reference, so no separate assignment is needed
ct
id item countid
1: 2 0.953 2
2: 9 0.535 2
3: 4 -0.584 2
4: 4 -2.161 2
5: 7 -1.320 3
6: 7 0.810 3
7: 2 1.342 2
8: 3 0.693 1
9: 6 -0.323 5
10: 7 -0.117 3
11: 6 -0.423 5
12: 6 -0.835 5
13: 6 -0.815 5
14: 6 0.794 5
15: 9 0.178 2
If you don't feel the need to use base R, plyr makes this task easy:
> set.seed(3)
> library(plyr)
> ct <- data.frame(id=sample(1:10,15,replace=TRUE),item=round(rnorm(15),3))
> ct <- ddply(ct,.(id),transform,idcount=length(id))
> head(ct)
id item idcount
1 2 0.953 2
2 2 1.342 2
3 3 0.693 1
4 4 -0.584 2
5 4 -2.161 2
6 6 -0.323 5
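For completeness, base R's ave can do the same in one line; a sketch on the same ct data (countid2 is just an illustrative name):
ct$countid2 <- ave(seq_along(ct$id), ct$id, FUN = length) # group sizes, aligned to rows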
