Rearranging longitudinal data - r

I have a dataset that is roughly structured like this:
case Year 2001 2002 2003 2004
1 2003 0 0 0 3
2 2002 0 5 3 2
3 2001 3 3 2 2
I am trying to restructure it so that every column represents the first, second (etc.) year counting from the "Year" variable, i.e.:
case Year yr1 yr2 yr3 yr4
1 2003 0 3 0 0
2 2002 5 3 2 0
3 2001 3 3 2 2
This code downloads the dataset and tries the solution suggested by @akrun, but it fails.
library("devtools")
df1 <- source_gist("b4c44aa67bfbcd6b72b9")
df1[-(1:2)] <- do.call(rbind,lapply(seq_len(nrow(df1)), function(i) {x <- df1[i, ]; x1 <- unlist(x[-(1:2)]); indx <- which(!is.na(x1))[1]; i <- as.numeric(names(indx))-x[,2]+1; x2 <- x1[!is.na(x1)]; x3 <- rep(NA, length(x1)); x3[i:(i+length(x2)-1)]<- x2; x3}))
This generates:
Error in i:(i + length(x2) - 1) : NA/NaN argument
In addition: Warning message:
In FUN(1:234[[1L]], ...) : NAs introduced by coercion
How can I transform the data so that every column represents the first, second (etc.) year counting from the value in the "Year" variable for each row?

Here's a possibility:
library(dplyr)
library(reshape2)
df %>%
melt(id.vars = c("case", "Year")) %>%
mutate(variable = as.numeric(as.character(variable)),
yr = variable - Year + 1) %>%
filter(variable >= Year) %>%
dcast(case + Year ~ yr, fill = 0)
# case Year 1 2 3 4
# 1 1 2003 0 3 0 0
# 2 2 2002 5 3 2 0
# 3 3 2001 3 3 2 2
Data:
df <- structure(list(case = 1:3, Year = c(2003L, 2002L, 2001L), `2001` = c(0L,
0L, 3L), `2002` = c(0L, 5L, 3L), `2003` = c(0L, 3L, 2L), `2004` = c(3L,
2L, 2L)), .Names = c("case", "Year", "2001", "2002", "2003",
"2004"), class = "data.frame", row.names = c(NA, -3L))

This should create the manipulation you are looking for.
library("devtools")
df1 <- source_gist("b4c44aa67bfbcd6b72b9")
temp <- df1[[1]]
library(dplyr); library(tidyr); library(stringi)
temp <- temp %>%
gather(new.Years, X, -Year) %>% # convert rows to one column
mutate(Year.temp=paste0(rownames(temp), "-", Year)) %>% # concatenate the Year with row number to make them unique
mutate(new.Years = as.numeric(gsub("X", "", new.Years)), diff = new.Years-Year+1) %>% # calculate the difference to get the yr0 yr1 and so on
mutate(diff=paste0("yr", stri_sub(paste0("0", (ifelse(diff>0, diff, 0))), -2, -1))) %>% # convert the differences in Yr01 ...
select(-new.Years) %>% filter(diff != "yr00") %>% # drop new.Years column
spread(diff, X) %>% # convert column to rows
select(-Year.temp) # Drop Year.temp column
temp[is.na(temp)] <- 0 # replace NA with 0
temp %>% View
Notice that this will work for up to 99 years.
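The 99-year limit comes from keeping only the last two characters of the zero-padded offset. Here is a small sketch of why the padding matters for column order, with base R's sprintf as an alternative way to pad. The offsets vector and the variable names are made up for illustration.

```r
offsets <- c(1L, 2L, 10L, 11L)

# Without padding, character sorting puts yr10 before yr2
unpadded <- sort(paste0("yr", offsets), method = "radix")

# With zero padding, character order matches numeric order
padded <- sort(sprintf("yr%02d", offsets), method = "radix")

# Three digits of padding would raise the limit to 999 offsets
wide <- sprintf("yr%03d", 100L)
```

If more than 99 offsets are possible, the padding width has to grow accordingly; otherwise the labels stop being unique.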

Here's a data.table solution:
require(data.table)
require(reshape2)
dt <- as.data.table(df) # df as defined in the answer above
dt.m = melt(dt, id = 1:2, variable.factor = FALSE)
dt.m[, variable := as.integer(variable)-Year+1L]
dcast.data.table(dt.m, case + Year ~ variable, fill=0L,
value.var = "value", subset = (variable > 0L))
# case Year 1 2 3 4
# 1: 1 2003 0 3 0 0
# 2: 2 2002 5 3 2 0
# 3: 3 2001 3 3 2 2

library("devtools")
df1 <- source_gist("b4c44aa67bfbcd6b72b9")$value
I have an X in the colnames and remove it:
colnames(df1) <- gsub("X", "", colnames(df1))
I have got a solution without any additional packages:
startYear <- as.numeric(colnames(df1)[2])
shifts <- df1$Year - startYear
n <- ncol(df1)
df2 <- df1
colnames(df2)[-1] <- 1:(n-1)
df2[,2:n] <- NA
for(row in 1:nrow(df1)){
if(shifts[row]>=0){
df2[row,2:(n-shifts[row])] <- df1[row, (shifts[row]+2):n]
#df2[row,2:(n-shifts[row])] <- colnames(df1)[(shifts[row]+2):n]
}else{
df2[row, (-shifts[row]+2):n] <- df1[row, 2:(n+shifts[row])]
#df2[row, (-shifts[row]+2):n] <- colnames(df1)[2:(n+shifts[row])]
}
}
You can prefill df2 with 0 instead of NA, of course. To validate the permutation, uncomment the second line in each branch of the if/else and comment out the first.
Hope it does what you wanted.
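The same row-wise shift can also be sketched in a more compact base R form. This is only an illustration on the small sample data from the question; the names year_cols, years, vals and shifted are made up for this sketch, and padding with 0 is an assumption that matches the desired output in the question.

```r
# Sample data from the question
df <- data.frame(case = 1:3, Year = c(2003L, 2002L, 2001L),
                 `2001` = c(0L, 0L, 3L), `2002` = c(0L, 5L, 3L),
                 `2003` = c(0L, 3L, 2L), `2004` = c(3L, 2L, 2L),
                 check.names = FALSE)

year_cols <- setdiff(names(df), c("case", "Year"))
years <- as.integer(year_cols)
vals <- as.matrix(df[year_cols])

# For each row, keep the values from its start year onwards and pad with 0
shifted <- t(sapply(seq_len(nrow(df)), function(i) {
  keep <- unname(vals[i, years >= df$Year[i]])
  c(keep, rep(0L, length(years) - length(keep)))
}))
colnames(shifted) <- paste0("yr", seq_along(years))

result <- cbind(df[c("case", "Year")], shifted)
```

Replacing rep(0L, ...) with rep(NA_integer_, ...) would keep "not observed" distinct from a real 0.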

Related

Number of occurences in a dataframe

I've the following data frame and I want to count the occurrences of each row by the first column and append as another column say "freq" to the data frame:
df:
gene a b c
abc 1 NA 1
bca NA 1 1
cba 1 2 1
My real data frame is bigger; this is only a small example.
The desired data frame is:
gene a b c freq
abc 1 NA 1 2
bca NA 1 1 2
cba 1 2 1 3
The code I have tried is:
g <- df %>% mutate(numtwos = rowSums(. > 0))
or
df$freq <- apply(df , 1, function(x) length(which(x>0)))
But it is not working: where a row should have, for example, 150 repetitions, I get only 2 for every row.
Any help or other point of view is welcome!
Thanks
We can first convert the "Na" strings to NA:
library(dplyr)
df %>%
mutate_at(vars(a:c), ~ as.numeric(na_if(., "Na"))) %>%
mutate(freq = rowSums(select(., a:c), na.rm = TRUE))
# gene a b c freq
#1 abc 1 NA 1 2
#2 bca NA 1 1 2
#3 cba 1 1 1 3
Here, the values are all 1s, so it is the same as getting the sum of non-NA
df %>%
mutate_at(vars(a:c), ~ as.numeric(na_if(., "Na"))) %>%
mutate(freq = rowSums(!is.na(select(., a:c))))
data
df <- structure(list(gene = c("abc", "bca", "cba"), a = c("1", "Na",
"1"), b = c("Na", "1", "1"), c = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
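The difference between the two variants shows up as soon as a value other than 1 appears. Here is a base R sketch using the numbers from the question (where cba has b = 2); the name genes is made up for this example.

```r
# Data as shown in the question, with real NA values
genes <- data.frame(gene = c("abc", "bca", "cba"),
                    a = c(1, NA, 1), b = c(NA, 1, 2), c = c(1, 1, 1))

# Sum of the non-missing values per row
row_totals <- rowSums(genes[c("a", "b", "c")], na.rm = TRUE)

# Number of non-missing values per row
row_counts <- rowSums(!is.na(genes[c("a", "b", "c")]))
```

For the cba row the sum is 4 while the count of non-missing values is 3, so the counting variant is the one that reproduces the freq column requested in the question.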
I haven't used R for a while, so I won't paste in code, but you can create a new data frame by grouping the initial one by gene, and then merge/join it to your initial data frame in another line of code.
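A rough base R sketch of that idea, assuming "occurrences" means the number of rows per gene. The data below is hypothetical (the question's sample has one row per gene), and the names genes, counts and merged are made up for illustration.

```r
# Hypothetical data in which genes repeat across rows
genes <- data.frame(gene = c("abc", "bca", "abc", "cba", "abc"),
                    a = c(1, NA, 1, 1, 2))

# Step 1: count the rows per gene
counts <- as.data.frame(table(gene = genes$gene),
                        responseName = "freq", stringsAsFactors = FALSE)

# Step 2: join the counts back onto the original rows
merged <- merge(genes, counts, by = "gene")
```

Note that merge() sorts the result by the key column, so the original row order is not preserved.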

Grouping with condition in RStudio

Good morning everyone, I have a csv file (df2.csv) with several variables, as illustrated below (just for example):
CLASSE Variables Terms Number
1 DAT_1 20160701q 5
1 DAT_1 20160802q 2
1 DAT_1 20160901q 1
1 DAT_2 20161001q 1
1 DAT_2 20161201q 2
1 DAT_2 20170301q 3
2 DAT_1 20161001q 1
2 DAT_1 20161201q 2
2 DAT_1 20170301q 1
I want for each class (1 or 2 in this case), for each distinct date variable, if the number of individuals is less than 3, to group individuals with the next date. If I have a period of more than 3 individuals, in this case, I want to have a date like '20160701q-20160901q' instead of 20160701q and 20160901q separately. In this case, we group two dates or more to get a period of more than 3 individuals, and if the next date of the class has less than 3 individuals, we will group this date with the period before also.
I started with this code:
for (n in df2$CLASSE){
for (k in df2$Variables){
for (i in 1:nrow(df2)){
if (df2$Number[i]<3){
rempl_date=paste(df2$Terms[i],df2$Terms[i+1], sep="-")
df2$Terms[i]<-rempl_date
next
}
}
}
}
But it doesn't work. I want to have this after grouping:
CLASSE Variables Terms Number
1 DAT_1 20160701q 5
1 DAT_1 20160802q-20160901q 3
1 DAT_2 20161001q-20161201q 3
1 DAT_2 20170301q 3
2 DAT_1 20161001q-20170301q 4
I don't know what else I must change. I hope I was clear; thanks in advance for any help.
We can use MESS::cumsumbinning function here to create groups until a threshold is reached.
library(dplyr)
thresh <- 3
temp <- df %>%
group_by(CLASSE, Variables,
group = MESS::cumsumbinning(Number, thresh)) %>%
summarise(Terms = if(n() > 1)
paste(first(Terms), last(Terms), sep = "-") else Terms,
Number = sum(Number)) %>%
select(-group)
This returns :
temp
# A tibble: 6 x 4
# Groups: CLASSE, Variables [3]
# CLASSE Variables Terms Number
# <int> <chr> <chr> <int>
#1 1 DAT_1 20160701q 5
#2 1 DAT_1 20160802q-20160901q 3
#3 1 DAT_2 20161001q-20161201q 3
#4 1 DAT_2 20170301q 3
#5 2 DAT_1 20161001q-20161201q 3
#6 2 DAT_1 20170301q 1
To combine the last row, we can do :
n <- nrow(temp)
if(temp$Number[n] < 3) {
temp$Terms[n-1] <- paste(sub("-.*", "", temp$Terms[n-1]), temp$Terms[n], sep = "-") # also works when row n-1 is a single date
temp$Number[n-1] <- sum(temp$Number[n-1], temp$Number[n])
temp <- temp[-n,]
}
# CLASSE Variables Terms Number
# <int> <chr> <chr> <int>
#1 1 DAT_1 20160701q 5
#2 1 DAT_1 20160802q-20160901q 3
#3 1 DAT_2 20161001q-20161201q 3
#4 1 DAT_2 20170301q 3
#5 2 DAT_1 20161001q-20170301q 4
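To make the binning step less of a black box, here is a base R sketch of a greedy rule that gives the same groups on this data: start a new bin whenever adding the next value would push the running total above the threshold. This is an assumption about the behaviour, not the MESS implementation, and bin_by_cumsum is a made-up name. Unlike the call above, the sketch applies the rule separately within each CLASSE/Variables combination via ave().

```r
# Greedy binning: open a new bin when the running sum would exceed the threshold
bin_by_cumsum <- function(x, threshold) {
  groups <- integer(length(x))
  current <- 1L
  running <- 0
  for (i in seq_along(x)) {
    if (i > 1 && running + x[i] > threshold) {
      current <- current + 1L
      running <- 0
    }
    running <- running + x[i]
    groups[i] <- current
  }
  groups
}

# Sample data from the question
CLASSE <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)
Variables <- rep(c("DAT_1", "DAT_2", "DAT_1"), each = 3)
Number <- c(5, 2, 1, 1, 2, 3, 1, 2, 1)

# Bin numbers restart at 1 inside every CLASSE/Variables group
bins <- ave(Number, CLASSE, Variables,
            FUN = function(v) bin_by_cumsum(v, 3))
```

The trailing bin of a group can still fall below the threshold (the last row of CLASSE 2 here), which is why a separate merge step for the final row is needed.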
Here is a base R solution:
define custom function for grouping
f <- function(v, th = 3) {
k <- 1
r <- c()
repeat {
if (length(v)==0) break
ind<-seq(head(which(cumsum(v)>=th),1))
if (sum(v)<2*th) {
r <- c(r,rep(k,length(v)))
v <- c()
} else {
r <- c(r,rep(k,length(ind)))
v <- v[-ind]
}
k <- k+1
}
r
}
then use aggregate + ave
dfout <- subset(aggregate(Terms~.,
within(within(df,grp <- ave(Number,Classe, Variables, FUN = f)),
Number <- ave(Number,Classe,Variables,grp,FUN = sum)),
c),
select = -grp)
format the dfout to the desired style by using order
dfout <- dfout[order(dfout$Classe,dfout$Variables),]
Output
> dfout
Classe Variables Number Terms
3 1 DAT_1 5 20160701q
4 1 DAT_1 3 20160802q, 20160901q
1 1 DAT_2 3 20161001q, 20161201q
5 1 DAT_2 3 20170301q
2 2 DAT_1 4 20161001q, 20161201q, 20170301q
DATA
df <- structure(list(Classe = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Variables = c("DAT_1", "DAT_1", "DAT_1", "DAT_2", "DAT_2",
"DAT_2", "DAT_1", "DAT_1", "DAT_1"), Terms = c("20160701q",
"20160802q", "20160901q", "20161001q", "20161201q", "20170301q",
"20161001q", "20161201q", "20170301q"), Number = c(5L, 2L,
1L, 1L, 2L, 3L, 1L, 2L, 1L)), class = "data.frame", row.names = c(NA,
-9L))
UPDATE
If you want to concatenate the contents in Terms, try the code below
dfout <- subset(aggregate(Terms~.,
within(within(df,grp <- ave(Number,Classe, Variables, FUN = f)),
Number <- ave(Number,Classe,Variables,grp,FUN = sum)),
FUN = function(v) ifelse(length(v)==1,v,paste0(c(v[1],v[length(v)]),collapse = "-"))),
select = -grp)
dfout <- dfout[order(dfout$Classe,dfout$Variables),]
such that
> dfout
Classe Variables Number Terms
3 1 DAT_1 5 20160701q
4 1 DAT_1 3 20160802q-20160901q
1 1 DAT_2 3 20161001q-20161201q
5 1 DAT_2 3 20170301q
2 2 DAT_1 4 20161001q-20170301q
This is a quite cumbersome solution I've created that does what you asked. I'm sure it can be optimized or that functions from other packages can be used.
Explanations are inserted into the code
# new dataframe
df_new <- data.frame(
CLASSE = numeric(nrow(df2)),
Variables = character(nrow(df2)),
Terms = character(nrow(df2)),
Number = numeric(nrow(df2)),
stringsAsFactors = FALSE
)
# temporary dataframe
temp_df <- data.frame(
CLASSE = numeric(0),
Variables = character(0),
Terms = character(0),
Number = numeric(0),
stringsAsFactors = FALSE
)
temp_sum <- 0
present_row_temp_df <- 1
for (i in 1:nrow(df2)){
# if the row doesn't have to be grouped, just paste it in the new dataframe
if (df2$Number[i] >= 3){
df_new[i,] <- df2[i,]
next
}
# if the row has to be grouped, add it to a temporary dataframe
if (df2$Number[i] < 3){
temp_df[present_row_temp_df,] <- df2[i,]
temp_sum <- temp_sum + df2$Number[i]
present_row_temp_df <- present_row_temp_df + 1
# if the rows in the temporary dataframe need to be grouped now
if(temp_sum >= 3){
Terms_new <- paste(temp_df$Terms[1], temp_df$Terms[nrow(temp_df)], sep = "-")
Number_new <- sum(temp_df$Number)
df_new[i, c(1:3)] <- c(df2$CLASSE[i], df2$Variables[i], Terms_new)
df_new[i, 4] <- Number_new
# re-initialize temporary variables
temp_df <- data.frame(
CLASSE = numeric(0),
Variables = character(0),
Terms = character(0),
Number = numeric(0),
stringsAsFactors = FALSE
)
temp_sum <- 0
present_row_temp_df <- 1
}
# for the case in which the last row is not united with the previous rows
if (i == nrow(df2) & df2$Number[i] < 3){
Terms_new <- paste(stringr::str_extract(df_new$Terms[i-1], "^[^-]*"), df2$Terms[i], sep = "-")
Number_new <- df_new$Number[i-1] + df2$Number[i]
df_new[i, c(1:3)] <- c(df_new$CLASSE[i-1], df_new$Variables[i-1], Terms_new)
df_new[i, 4] <- Number_new
df_new[i-1,] <- c("0", "0", "0", 0)
}
}
}
# filter only relevant rows
df_new <- df_new[df_new$Number != 0,]
Result:
df_new
#CLASSE Variables Terms Number
# 1 DAT_1 20160701q 5
# 1 DAT_1 20160802q-20160901q 3
# 1 DAT_2 20161001q-20161201q 3
# 1 DAT_2 20170301q 3
# 2 DAT_1 20161001q-20170301q 4

Is there any way to treat NA differently for the same variable? [duplicate]

This question already has answers here:
sum non NA elements only, but if all NA then return NA
(2 answers)
Closed 3 years ago.
I have a dataset in which the same ID may have several records for the same variable. For some IDs only part of the records are NA; for others all of them are NA.
I want to sum the variable by ID, so that IDs with all NA in that variable get NA, while IDs with partial NA get a sum (treating NA as 0 in that case). Is there any way to do this?
I have tried na.rm = T when summing the variable, but then the all-NA groups became 0, which is not what I want.
Dataset:
ID V1
5 120
5 300
5 NA
8 NA
8 NA
8 NA
Want this:
ID V1
5 420
8 NA
I did this and all NA became 0:
df <- df %>% group_by(ID) %>% transmute(V1 = sum(V1, na.rm = T))
Most of the approaches would remove the NA group or make it 0. Maybe we can use a custom condition
library(dplyr)
df %>%
group_by(ID) %>%
summarise(V1 = if (all(is.na(V1))) NA else sum(V1, na.rm = TRUE))
# A tibble: 2 x 2
# ID V1
# <int> <int>
#1 5 420
#2 8 NA
and with base R aggregate
aggregate(V1~ID, df, function(x)
if (all(is.na(x))) NA else sum(x, na.rm = TRUE), na.action = "na.pass")
We can use sum_ from hablar which would automatically return NA if all the elements are NA. Using data.table syntax, it would be
library(data.table)
library(hablar)
setDT(df)[, .(V1 = sum_(V1)), .(ID)]
# ID V1
#1: 5 420
#2: 8 NA
Or with dplyr
library(dplyr)
df %>%
group_by(ID) %>%
summarise(V1 = sum_(V1))
# A tibble: 2 x 2
# ID V1
# <int> <int>
#1 5 420
#2 8 NA
Or with sum without using any if/else
df %>%
group_by(ID) %>%
summarise(V1 = sum(V1, na.rm = TRUE) * NA^ all(is.na(V1)))
# A tibble: 2 x 2
# ID V1
# <int> <dbl>
#1 5 420
#2 8 NA
Or using base R
out <- rowsum(df$V1, df$ID, na.rm = TRUE)
(NA^!out) * out
# [,1]
#5 420
#8 NA
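One caveat with the NA^!out trick: it cannot tell a group that is entirely NA from a group whose observed values genuinely sum to 0. A small sketch on made-up data (the names d, trick and safe are hypothetical), with an explicit all(is.na()) check for comparison:

```r
# ID 1 has a real observed 0; ID 2 is entirely missing
d <- data.frame(ID = c(1L, 1L, 2L, 2L), V1 = c(0L, NA, NA, NA))

# The trick turns every zero sum into NA, including the genuine one
out <- rowsum(d$V1, d$ID, na.rm = TRUE)
trick <- (NA^!out) * out

# An explicit check keeps the genuine 0 and returns NA only for all-NA groups
safe <- tapply(d$V1, d$ID, function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
})
```

If zero totals can occur in the real data, the explicit check is the safer choice.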
Or with by
by(df$V1, df$ID, FUN = sum_)
NOTE: all the codes are compact
data
df <- structure(list(ID = c(5L, 5L, 5L, 8L, 8L, 8L), V1 = c(120L, 300L,
NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -6L
))

Count strings with a certain condition

I have the following dataset
#mydata
Factors Transactions
a,c 2
b 0
c 0
d,a 0
a 1
a 0
b 1
I'd like to count the factors that had transactions. For example, "a" appears twice with a transaction. I can write code that gives the desired outcome for each variable separately; the following is for "a".
nrow (subset (mydata,mydata$Transaction > 0 & length(mydata[grep("a", mydata$Factors),] )> 0))
But I have too many variables and do not want to repeat the code for each of them. There should be a way to write code that gives the results for all of the variables at once. I would like the following output:
#Output
a 2
b 1
c 1
d 0
Equivalent data.table option:
library(data.table)
setDT(df)[, .(Factors = unlist(strsplit(as.character(Factors), ","))),
by = Transactions][,.(Transactions = sum(Transactions > 0)), by = Factors]
# Factors Transactions
#1: a 2
#2: c 1
#3: b 1
#4: d 0
You could create a table using the unique values of the Factor column as the levels. Consider df to be your data set.
s <- strsplit(as.character(df$Factors), ",", fixed = TRUE)
table(factor(unlist(s[df$Transactions > 0]), levels = unique(unlist(s))))
#
# a c b d
# 2 1 1 0
Wrap in as.data.frame() for data frame output.
with(df, {
s <- strsplit(as.character(Factors), ",", fixed = TRUE)
f <- factor(unlist(s[Transactions > 0]), levels = unique(unlist(s)))
as.data.frame(table(Factors = f))
})
# Factors Freq
# 1 a 2
# 2 c 1
# 3 b 1
# 4 d 0
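The explicit levels argument is what keeps factors with no transactions in the result. A short sketch of the difference, using the question's values typed in by hand (the variable names are made up for this example):

```r
factors <- c("a,c", "b", "c", "d,a", "a", "a", "b")
transactions <- c(2, 0, 0, 0, 1, 0, 1)
s <- strsplit(factors, ",", fixed = TRUE)

# Without levels: "d" never occurs with a transaction, so it is dropped
dropped <- table(unlist(s[transactions > 0]))

# With levels taken from all rows: "d" is kept with a count of 0
kept <- table(factor(unlist(s[transactions > 0]),
                     levels = sort(unique(unlist(s)))))
```

Using sort() on the levels is optional; it only puts the result in alphabetical order.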
With tidyverse packages, assuming your data is strings/factors and numbers,
library(tidyr)
library(dplyr)
# separate factors with two elements
df %>% separate_rows(Factors) %>%
# set grouping for aggregation
group_by(Factors) %>%
# for each group, count how many transactions are greater than 0
summarise(Transactions = sum(Transactions > 0))
## # A tibble: 4 x 2
## Factors Transactions
## <chr> <int>
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
You could also avoid dplyr by using xtabs, though some cleaning is necessary to get to the same arrangement:
library(tidyr)
df %>% separate_rows(Factors) %>%
xtabs(Transactions > 0 ~ Factors, data = .) %>%
as.data.frame() %>%
setNames(names(df))
## Factors Transactions
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
A full base R equivalent:
df2 <- do.call(rbind,
Map(function(f, t){data.frame(Factors = strsplit(as.character(f), ',')[[1]],
Transactions = t)},
df$Factors, df$Transactions))
df3 <- as.data.frame(xtabs(Transactions > 0 ~ Factors, data = df2))
names(df3) <- names(df)
df3
## Factors Transactions
## 1 a 2
## 2 b 1
## 3 c 1
## 4 d 0
We can use cSplit from splitstackshape to split 'Factors' into 'long' format and, grouped by 'Factors', get the sum of the logical vector ('Transactions > 0').
library(splitstackshape)
cSplit(df1, "Factors", ",", "long")[, .(Transactions=sum(Transactions > 0)),.(Factors)]
# Factors Transactions
#1: a 2
#2: c 1
#3: b 1
#4: d 0
Or using base R
with(df1, table(factor(unlist(strsplit(Factors[Transactions>0], ",")),
levels = letters[1:4]) ))
# a b c d
# 2 1 1 0
data
df1 <- structure(list(Factors = c("a,c", "b", "c", "d,a", "a", "a",
"b"), Transactions = c(2L, 0L, 0L, 0L, 1L, 0L, 1L)), .Names = c("Factors",
"Transactions"), class = "data.frame", row.names = c(NA, -7L))

Combine incomplete dataframes in R into matrix

I have a list of dataframes in R, each loaded from a different file containing ranks. For example the files could contain finishing positions for different athletes in different races.
The same element (athlete) can appear in more than one dataframe (race) but no dataframe will necessarily contain all elements.
I would like to populate a matrix of rankings with athletes as rows and races as columns. Where there is no ranking for an athlete in a particular race it should read 0.
For example, if I have:
[[1]]
name rank
1 Alice 1
2 Bob 2
3 Carla 3
4 Diego 4
[[2]]
name rank
1 Alice 2
2 Carla 1
3 Eric 3
4 Frank 4
5 Gary 5
[[3]]
name rank
1 Bob 5
2 Carla 4
3 Diego 3
4 Eric 1
5 Gary 2
I would like to generate a matrix:
1 2 3
Alice 1 2 0
Bob 2 0 5
Carla 3 1 4
Diego 4 0 3
Eric 0 3 1
Frank 0 4 0
Gary 0 5 2
I am looking for an efficient way to do this: my data is more like 200 dataframes and 10000 ranked elements per dataframe (15000 unique elements in total) so the final matrix will be approx 15000x200
Here's a solution using reshape2 package:
require(reshape2)
dcast(do.call(rbind, lapply(seq_along(ll), function(ix)
transform(ll[[ix]], id = ix))), name ~ id, value.var="rank", fill=0)
name 1 2 3
1 Alice 1 2 0
2 Bob 2 0 5
3 Carla 3 1 4
4 Diego 4 0 3
5 Eric 0 3 1
6 Frank 0 4 0
7 Gary 0 5 2
where ll is your list of data.frames.
or equivalently:
dcast(transform(do.call(rbind, ll), id = rep(seq_along(ll), sapply(ll, nrow))),
name ~ id, value.var = "rank", fill = 0)
A data.table solution:
require(data.table)
pp <- rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
setkey(pp, "name", "id")
pp[CJ(unique(name), 1:3)][is.na(rank), rank := 0L][, as.list(rank), by = name]
name V1 V2 V3
1: Alice 1 2 0
2: Bob 2 0 5
3: Carla 3 1 4
4: Diego 4 0 3
5: Eric 0 3 1
6: Frank 0 4 0
7: Gary 0 5 2
Some benchmarking (now that we've quite some answers):
names <- tapply(sample(letters, 1e4, replace=TRUE), rep(1:(1e4/5), each=5), paste, collapse="")
names <- unique(names)
dd_create <- function() {
nrow <- sample(c(100:500), 1)
ncol <- 3
data.frame(name = sample(names, nrow, replace=FALSE), rank = sample(nrow))
}
ll <- replicate(1e3, dd_create(), simplify = FALSE)
require(reshape2)
require(data.table)
Arun1_reshape2 <- function(ll) {
# same as @agstudy's
dcast(do.call(rbind, lapply(seq_along(ll), function(ix)
transform(ll[[ix]], id = ix))), name ~ id, value.var="rank", fill=0)
}
Arun2_reshape2 <- function(ll) {
dcast(transform(do.call(rbind, ll), id = rep(seq_along(ll), sapply(ll, nrow))),
name ~ id, value.var = "rank", fill = 0)
}
eddi_reshape2 <- function(ll) {
dcast(melt(ll, id.vars = 'name'), name ~ L1, fill = 0)
}
Arun_data.table <- function(ll) {
pp <- rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
setkey(pp, "name", "id")
pp[CJ(unique(name), 1:1000)][is.na(rank), rank := 0L][, as.list(rank), by = name]
}
merge.all <- function(x, y) {
merge(x, y, all=TRUE, by="name")
}
Hong_Ooi <- function(ll) {
for(i in seq_along(ll))
names(ll[[i]])[2] <- paste0("rank", i)
out <- Reduce(merge.all, ll)
}
require(microbenchmark)
microbenchmark( arun1 <- Arun1_reshape2(ll),
arun2 <- Arun2_reshape2(ll),
eddi <- eddi_reshape2(ll),
hong <- Hong_Ooi(ll),
arun.dt <- Arun_data.table(ll), times=10)
Unit: seconds
expr min lq median uq max neval
arun1 <- Arun1_reshape2(ll) 9.157160 9.177143 9.366775 9.715767 28.043125 10
arun2 <- Arun2_reshape2(ll) 8.408356 8.437066 8.494233 9.018796 10.075029 10
eddi <- eddi_reshape2(ll) 8.056605 8.314110 8.402396 8.474129 9.124581 10
hong <- Hong_Ooi(ll) 82.457432 82.716930 82.908646 108.413217 321.164598 10
arun.dt <- Arun_data.table(ll) 2.006474 2.123331 2.212783 2.311619 2.738914 10
Here's a simpler reshape2 solution:
library(reshape2)
dcast(melt(ll, id.vars = 'name'), name ~ L1, fill = 0)
# name 1 2 3
#1 Alice 1 2 0
#2 Bob 2 0 5
#3 Carla 3 1 4
#4 Diego 4 0 3
#5 Eric 0 3 1
#6 Frank 0 4 0
#7 Gary 0 5 2
Arun's benchmarks were pretty interesting, and it seems like what data.table does really well is the melting part, and what reshape2 does really well is the dcast, so here's the best of both worlds:
library(reshape2)
library(data.table)
pp = rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
dcast(pp, name ~ id, fill = 0, value.var = 'rank')
Using Arun's benchmark data:
names <- tapply(sample(letters, 1e4, replace=TRUE), rep(1:(1e4/5), each=5), paste, collapse="")
names <- unique(names)
dd_create <- function() {
nrow <- sample(c(100:500), 1)
ncol <- 3
data.frame(name = sample(names, nrow, replace=FALSE), rank = sample(nrow))
}
ll <- replicate(1e3, dd_create(), simplify = FALSE)
Arun_data.table <- function(ll) {
pp <- rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
setkey(pp, "name", "id")
pp[CJ(unique(name), 1:1000)][is.na(rank), rank := 0L][, as.list(rank), by = name]
}
mix_of_both = function(ll) {
pp = rbindlist(ll)[, id := rep(seq_along(ll), sapply(ll, nrow))]
dcast(pp, name ~ id, fill = 0, value.var = 'rank')
}
require(microbenchmark)
microbenchmark(Arun_data.table(ll), mix_of_both(ll), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# Arun_data.table(ll) 2568.333 2586.0079 2626.7704 2832.8076 2911.1314 10
# mix_of_both(ll) 615.166 739.9383 766.8994 788.5822 821.0478 10
Here is the data, since the OP didn't give a reproducible example:
dput(ll)
list(structure(list(name = structure(1:4, .Label = c("Alice",
"Bob", "Carla", "Diego"), class = "factor"), rank = 1:4), .Names = c("name",
"rank"), class = "data.frame", row.names = c("1", "2", "3", "4"
)), structure(list(name = structure(1:5, .Label = c("Alice",
"Carla", "Eric", "Frank", "Gary"), class = "factor"), rank = c(2L,
1L, 3L, 4L, 5L)), .Names = c("name", "rank"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")), structure(list(name = structure(1:5, .Label = c("Bob",
"Carla", "Diego", "Eric", "Gary"), class = "factor"), rank = c(5L,
4L, 3L, 1L, 2L)), .Names = c("name", "rank"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))
Nearly the same solution as @Arun's, but in 2 separate steps:
## add race column
ll <- lapply(seq_along(ll),function(x){
ll[[x]]$race <- x
ll[[x]]
})
## create a long data.frame
dd <- do.call(rbind,ll)
## transform to the wide format
library(reshape2)
dcast(name~race,data=dd,fill=0,value.var='rank')
name 1 2 3
1 Alice 1 2 0
2 Bob 2 0 5
3 Carla 3 1 4
4 Diego 4 0 3
5 Eric 0 3 1
6 Frank 0 4 0
7 Gary 0 5 2
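Since the desired result is a matrix, another option is to skip reshaping and fill a preallocated matrix by row name, one race at a time. This is a base R sketch on the example data, not benchmarked against the other answers; athletes and m are made-up names.

```r
# Example data from the question
ll <- list(
  data.frame(name = c("Alice", "Bob", "Carla", "Diego"), rank = 1:4),
  data.frame(name = c("Alice", "Carla", "Eric", "Frank", "Gary"),
             rank = c(2L, 1L, 3L, 4L, 5L)),
  data.frame(name = c("Bob", "Carla", "Diego", "Eric", "Gary"),
             rank = c(5L, 4L, 3L, 1L, 2L))
)

# All athletes that appear in any race
athletes <- sort(unique(unlist(lapply(ll, function(d) as.character(d$name)))))

# Preallocate with 0, then fill one column per race using row names
m <- matrix(0L, nrow = length(athletes), ncol = length(ll),
            dimnames = list(athletes, as.character(seq_along(ll))))
for (j in seq_along(ll)) {
  m[as.character(ll[[j]]$name), j] <- ll[[j]]$rank
}
```

This relies on each name appearing at most once per race; with duplicates, the last rank assigned for a name would win.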
Another Reduce use case, it seems.
merge.all <- function(x, y)
merge(x, y, all=TRUE, by="name")
# to avoid problems with merged name clashes
for(i in seq_along(ll))
names(ll[[i]])[2] <- paste0("rank", i)
out <- Reduce(merge.all, ll)
You'll have to modify your data frames slightly to avoid merge complaining about name collisions; a for loop works as well as anything for this purpose.
Any missing races will have NA. You can replace them with 0 by out[is.na(out)] <- 0; you should ask yourself whether this is sensible though. For example, if you do this, then simple summary statistics like means, variances etc will give misleading results. The same holds if you want to do any more complicated modelling. By contrast, most R modelling functions will be smart enough to exclude NAs.
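A tiny illustration of that point, with made-up ranks for one athlete who missed the third race:

```r
ranks_na <- c(1, 2, NA)                      # third race not run
ranks_zero <- replace(ranks_na, is.na(ranks_na), 0)

mean_na <- mean(ranks_na, na.rm = TRUE)      # averages the two races actually run
mean_zero <- mean(ranks_zero)                # treats the missed race as rank 0
```

The zero-filled mean of 1 looks better than any rank the athlete actually achieved on average (1.5), which is the kind of misleading summary the 0 encoding can produce.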
