Aggregate across a list of dataframes - r

I have a fitting process that runs 100 times. Each time the output is a dataframe -- which I capture by using lapply to create a list of 100 dataframes.
The first two dataframes might look something like this (I have more than 1 column):
n1 = c(4, 5, 6)
df1 = data.frame(n1, row.names = c("height", "weight", "favcolor"))
n2 = c(2, 3, 5, 7)
df2 = data.frame(n2, row.names = c("height", "weight", "inseam", "favcolor"))
I would like to combine these dataframes (take the average height value, or the standard deviation of the weights, for example).
My first thought was to turn this list into a single dataframe, but that won't work because the arguments imply differing numbers of rows.
My second thought was to insert NAs for each row name that didn't appear, so I would be adding something like:
new_row <- c(NA, NA)
row.names(new_row) <- "inseam"
But I can't get that to work for a myriad of reasons, and I'm not even able to test whether, after adding an "inseam" row of NAs to the end of df1, taking the average of "inseam" gives the correct answer (see the sketch after the expected values below).
The correct average in this case being:
height 3
weight 4
inseam 5
favcolor 6.5
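For concreteness, a minimal sketch of that padding idea on the toy df1/df2 above (I'm not sure it's the right approach) would be:
df1["inseam", ] <- NA  # adds a row of NAs named "inseam"
mean(c(df1["inseam", "n1"], df2["inseam", "n2"]), na.rm = TRUE)
# [1] 5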
So, my question:
If you had a list of dataframes, where the row names are meaningful and need to be aggregated across like the above example, what is the best way to do that?
################
EDIT
Here is the full data from 3 of my datasets:
> mega_df[1]
[[1]]
coeff error pval
rf2 -1.15099200 0.5752430 4.540538e-02
rf3 -0.53430218 0.4928974 2.783635e-01
rf4 0.08784138 0.4933079 8.586711e-01
rf5 0.96002601 0.5070957 5.833327e-02
rm2 -0.36188368 0.4626464 4.340949e-01
rm3 0.01805873 0.4355164 9.669251e-01
rm4 0.45008373 0.4319557 2.974268e-01
rm5 1.04056503 0.4441024 1.912556e-02
rc_cat1 0.86231928 0.2827566 2.290799e-03
rc_cat2 1.21335473 0.2448206 7.192321e-07
rc_cat3 0.96196637 0.2044198 2.528247e-06
rc_cat4 1.04477290 0.3302644 1.559142e-03
rc_cat5 -0.58902623 1.5893867 7.109357e-01
rc_cat6 0.62569607 0.1720676 2.765407e-04
rc_cat7 0.29381724 0.4115594 4.752815e-01
rs2 0.12333678 0.7186019 8.637250e-01
rs3 1.22018613 0.6423970 5.750837e-02
rs4 1.96075220 0.6454184 2.381892e-03
rs5 2.58404946 0.6543862 7.853927e-05
1|3 0.01561497 0.4851330 9.743229e-01
3|4 1.82853786 0.4937675 2.128663e-04
4|5 3.73480100 0.5023435 1.047616e-13
> mega_df[2]
[[1]]
coeff error pval
rf2 -0.23364248 5.849338e-01 6.895734e-01
rf3 0.24054894 5.219730e-01 6.449094e-01
rf4 0.84072979 5.208259e-01 1.064788e-01
rf5 1.47867154 5.346970e-01 5.684640e-03
rm2 -0.29555400 4.465509e-01 5.080612e-01
rm3 0.31147504 4.131024e-01 4.508553e-01
rm4 0.73696523 4.141224e-01 7.514424e-02
rm5 1.14273148 4.271863e-01 7.472508e-03
rc_cat1 1.27479299 3.094432e-01 3.794740e-05
rc_cat2 1.10917318 2.619011e-01 2.284654e-05
rc_cat3 0.65782540 2.161602e-01 2.340525e-03
rc_cat4 0.40512225 3.301662e-01 2.198131e-01
rc_cat5 12.78797722 5.612311e-08 0.000000e+00
rc_cat6 0.41622889 1.677804e-01 1.310894e-02
rc_cat7 0.16833629 3.806498e-01 6.583198e-01
rs2 -0.02279305 7.225878e-01 9.748360e-01
rs3 0.68299485 6.759050e-01 3.122608e-01
rs4 1.36149302 6.780722e-01 4.465519e-02
rs5 2.18484594 6.863696e-01 1.456612e-03
1|3 0.35419237 5.844931e-01 5.445266e-01
3|4 2.12603072 5.928308e-01 3.354863e-04
4|5 3.97564508 5.999369e-01 3.431064e-11
> mega_df[3]
[[1]]
coeff error pval
rf2 -0.2733408 0.5884741 6.422961e-01
rf3 0.1764257 0.5257697 7.372050e-01
rf4 0.6504428 0.5248386 2.152271e-01
rf5 1.3967757 0.5356706 9.119879e-03
rm2 -0.2361284 0.4870015 6.277745e-01
rm3 0.2078729 0.4609270 6.519977e-01
rm4 0.6390950 0.4622065 1.667555e-01
rm5 1.1738653 0.4730686 1.308730e-02
rc_cat1 0.9337627 0.2958630 1.599133e-03
rc_cat2 1.0292916 0.2493133 3.651281e-05
rc_cat3 0.7088285 0.2012026 4.267587e-04
rc_cat4 0.6296966 0.3664883 8.576193e-02
rc_cat6 0.5475018 0.1720841 1.464662e-03
rc_cat7 0.4521113 0.3588440 2.077017e-01
rs2 -0.4663666 0.7031265 5.071541e-01
rs3 0.7810059 0.6489673 2.287985e-01
rs4 1.5178641 0.6522175 1.995271e-02
rs5 2.1916080 0.6578769 8.643075e-04
1|3 0.2569225 0.4659460 5.813597e-01
3|4 2.0648302 0.4769118 1.493906e-05
4|5 3.9312070 0.4855339 5.648509e-16
And I'm hoping to do some basic aggregations that end up returning:
avcoef averror avpval std(coef)
rf2 W X Y Z
rf3 ...
.
.
.

I guess you could just create a new column in each data set which will contain the row names and then merge accordingly, something like:
l <- lapply(list(df1, df2), function(x) {x$New <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "New", all = TRUE), l)
cbind(Res[1], Means = rowMeans(Res[-1], na.rm = TRUE))
#        New Means
# 1 favcolor   6.5
# 2   height   3.0
# 3   inseam   5.0
# 4   weight   4.0
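If you also want per-row standard deviations (e.g. the sd of the weights mentioned in the question), a hedged extension of the same Res would be:
cbind(Res[1],
      Means = rowMeans(Res[-1], na.rm = TRUE),
      SDs   = apply(Res[-1], 1, sd, na.rm = TRUE))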
This is probably highly related to this
Edit: For the new data set
l <- lapply(list(mega_df1, mega_df2, mega_df3), function(x) {x$RowName <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "RowName", all = TRUE), l)
library(data.table) ## v1.9.6+
dcast(melt(setDT(Res), "RowName"),
      RowName ~ sub("\\..*", "", variable),
      mean,
      na.rm = TRUE,
      value.var = "value")
# RowName coeff error pval
# 1: cat1 1.0236250 0.2960209 1.309293e-03
# 2: cat2 1.1172732 0.2520117 2.002619e-05
# 3: cat3 0.7762068 0.2072609 9.232706e-04
# 4: cat4 0.6931972 0.3423063 1.023781e-01
# 5: cat5 6.0994755 0.7946934 3.554678e-01
# 6: cat6 0.5298089 0.1706440 4.950048e-03
# 7: cat7 0.3047549 0.3836844 4.471010e-01
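The desired output above also asks for std(coef); one hedged way to get both the mean and the sd of the coefficients from the same melted data (reusing Res from above) is:
molten <- melt(setDT(Res), "RowName")
molten[, stat := sub("\\..*", "", variable)]
molten[stat == "coeff", .(avcoef = mean(value, na.rm = TRUE),
                          sdcoef = sd(value, na.rm = TRUE)),
       by = RowName]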

Related

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this and how they used mice to tackle this problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit = 0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "logreg"
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0), ]
             Number    %
ascites          60  3.1
hepatomegaly     61  3.1
spiders          58  3.0
serChol         821 42.2
alkaline         60  3.1
platelets        73  3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in the traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
  expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
  1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids object using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
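From there the usual mice workflow applies. As a hedged sketch (the model formula below is purely illustrative, not part of the original analysis), you could fit a model to each imputed dataset and pool the results:
# illustrative model only; substitute your actual analysis
fit <- with(imp, lm(serBilir ~ alkaline + time))
summary(pool(fit))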

Summing values when merging rows in a data set in R

So I have a large data set (50,000 rows and 500 columns). I merged the rows I wanted to by this code:
Similarities <- Home %>%
group_by_at(c(1,2,5,9,70,26)) %>%
summarize_all(.funs = function(x) paste(unique(x), collapse = ','))
In this code, for the other rows that are combined and have different values, the output turns into a list separated by commas. However, now I want to sum all the values in one specific column, for which I tried this code:
Similarities <- Home %>%
group_by_at(c(1,2,5,9,70,26)) %>%
summarize_at(.vars = FTR, .funs = function(x) paste(sum(x))),
summarize_all(.funs = function(x) paste(unique(x), collapse = ','))
I assumed it wouldn't work because I wasn't sure what I was doing.
My goal is that for one specific column, "FTR", when I merge rows together, all the values in "FTR" are added together.
An example of the data would be:
Total Type Clm FTR Loss
300 water 2 -103 N
200 fire 3 203 Y
300 water 2 100 Y
What my code does now is:
Total Type CLM FTR Loss
300 water 2 -103, 100 Y, N
200 fire 3 203 Y
But what I want is:
Total Type CLM FTR Loss
300 water 2 -3 Y, N
200 fire 3 203 Y
The following code sums the collapsed columns, like the question asks for.
special_sum <- function(x, sep = ", ", na.rm = TRUE){
  f <- function(y, na.rm){
    y <- as.numeric(y)
    sum(y, na.rm = na.rm)
  }
  x <- as.character(x)
  x <- strsplit(x, sep)
  sapply(x, f, na.rm = na.rm)
}
With the second data.frame posted in the question, the function special_sum could be called as follows. The group columns are for test purposes only.
library(dplyr)
Home <- read.table(text = "
Total Type CLM FTR Loss
300 water 2 '-103, 100' 'Y, N'
200 fire 3 203 Y
", header = TRUE)
Home %>%
  group_by(1, 2) %>%
  summarize_at(vars('FTR'), special_sum)
## A tibble: 2 x 3
## Groups: 1, 2 [1]
# `1` `2` FTR
# <dbl> <dbl> <dbl>
#1 1 2 -3
#2 1 2 203
Note that you should probably sum first then paste the values.
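As a hedged sketch of that sum-first approach with dplyr, applied to the original (un-collapsed) data and using the question's column names for the grouping (adjust to your real key columns):
library(dplyr)
Home %>%
  group_by(Total, Type, CLM) %>%
  summarize(FTR  = sum(FTR),
            Loss = paste(unique(Loss), collapse = ", "))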

Create data frame by iteratively adding rows

I am trying to create a data frame (BOS.df) in order to explore the structure of a future analysis I will perform prior to receiving the actual data. In this scenario, let's say that there are 4 restaurants looking to run ad campaigns (the "Restaurant" variable). The total number of days that the campaign will last is cmp.lngth. I want random numbers for how much they are billing for the ads (ra.num). The ad campaigns start on StartDate. Ultimately, I want to create a data frame that cycles through each restaurant and adds a random billing number for each day of the ad campaign by adding rows.
#Create Data Placeholders
set.seed(123)
Restaurant <- c('B1', 'B2', 'B3', 'B4')
cmp.lngth <- 42
ra.num <- rnorm(cmp.lngth, mean = 100, sd = 10)
StartDate <- as.Date("2017-07-14")
BOS.df <- data.frame(matrix(NA, nrow =0, ncol = 3))
colnames(BOS.df) <- c("Restaurant", "Billings", "Date")
for(i in 1:length(Restaurant)){
  for(z in 1:cmp.lngth){
    BOS.row <- c(as.character(Restaurant[i]), ra.num[z], StartDate + cmp.lngth[z] - 1)
    BOS.df <- rbind(BOS.df, BOS.row)
  }
}
My code is not functioning correctly right now. The column names are incorrect, and the data is not being placed correctly if at all. The output comes through as follows:
X.B1. X.94.3952435344779. X.17402.
1 B1 94.3952435344779 17402
2 B1 <NA> <NA>
3 B1 <NA> <NA>
4 B1 <NA> <NA>
5 B1 <NA> <NA>
6 B1 <NA> <NA>
How can I obtain the correct output? Is there a more efficient way than using a for loop?
Using expand.grid:
cmp.lngth <- 2
StartDate <- as.Date("2017-07-14")
set.seed(1)
df1 <- data.frame(expand.grid(Restaurant, seq(cmp.lngth) + StartDate))
colnames(df1) <- c("Restaurant", "Date")
df1$Billings <- rnorm(nrow(df1), mean = 100, sd = 10)
df1 <- df1[ order(df1$Restaurant, df1$Date), ]
df1
# Restaurant Date Billings
# 1 B1 2017-07-15 93.73546
# 5 B1 2017-07-16 103.29508
# 2 B2 2017-07-15 101.83643
# 6 B2 2017-07-16 91.79532
# 3 B3 2017-07-15 91.64371
# 7 B3 2017-07-16 104.87429
# 4 B4 2017-07-15 115.95281
# 8 B4 2017-07-16 107.38325
You can use rbind, but this would be another way to do it.
Also, the number of rows of the data frame should be cmp.lngth*length(Restaurant), not cmp.lngth.
#Create Data Placeholders
set.seed(123)
Restaurant <- c('B1', 'B2', 'B3', 'B4')
cmp.lngth <- 42
ra.num <- rnorm(cmp.lngth, mean = 100, sd = 10)
StartDate <- as.Date("2017-07-14")
BOS.df <- data.frame(matrix(NA, nrow = cmp.lngth*length(Restaurant), ncol = 3))
colnames(BOS.df) <- c("Restaurant", "Billings", "Date")
count <- 1
for(name in Restaurant){
  for(z in 1:cmp.lngth){
    BOS.row <- c(name, ra.num[z], as.character(StartDate + z - 1))
    BOS.df[count, ] <- BOS.row
    count <- count + 1
  }
}
I would also recommend looking at the package called tidyverse and using add_row with a tibble instead of a data frame. Here is some sample code:
library(tidyverse)
BOS.tb <- tibble(Restaurant = character(),
                 Billings = numeric(),
                 Date = character())
for(name in Restaurant){
  for(z in 1:cmp.lngth){
    BOS.tb <- add_row(BOS.tb,
                      Restaurant = name,
                      Billings = ra.num[z],
                      Date = as.character(StartDate + z - 1))
  }
}
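Note that growing the tibble one row at a time with add_row copies it on every iteration, just like rbind in a loop. A hedged alternative (assuming tidyr >= 1.0 for expand_grid) is to build all restaurant/date combinations at once and then add the billings:
library(tidyverse)
BOS.tb <- expand_grid(Restaurant = Restaurant,
                      Date = StartDate + 0:(cmp.lngth - 1)) %>%
  mutate(Billings = rnorm(n(), mean = 100, sd = 10))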

Combine columns on factor

I have data with factors lang and alg, and I'd like to compare the difference for a selected lang pair across all alg:
> perf[perf$lang == "java", c("alg", "cpu")]
alg cpu
173 binarytrees 0.196
174 chameneosredux 0.404
175 fannkuchredux 0.648
> perf[perf$lang == "python3", c("alg", "cpu")]
alg cpu
246 binarytrees 0.972
248 fannkuchredux 13.752
249 fasta 1.152
For binarytrees I expect to get 0.196/0.972, but for chameneosredux it is NA, for fannkuchredux it is 0.648/13.752, for fasta it is NA, ...
One way is to sort rows on alg, but I don't understand how to inject rows with NA for the missing factor levels (all levels are available in unique(perf$alg)).
UPDATE Despite the original question, I think what I really want is to combine the columns of two data frames into a single data frame on the same factor:
binarytrees 0.196 0.972
chameneosredux 0.404 NA
fasta NA 1.152
fannkuchredux 0.648 13.752
What you are looking for is essentially a FULL OUTER JOIN in SQL, which can be done with base::merge using all = TRUE.
Here is a comparable data set to demonstrate with:
Df <- data.frame(
  Lang = rep(LETTERS[1:5], rep(3, 5)),
  Alg = c(replicate(5, sample(letters[1:4], 3))),
  Cpu = rnorm(15),
  stringsAsFactors = FALSE
)
Note that I'm using stringsAsFactors = FALSE. I would suggest you convert your columns to character vectors as well; I don't see any need for using factors here.
This is the merge operation in a light wrapper function, just to make the presentation a little cleaner:
compare <- function(x, y, data) {
  merge(x = data[data$Lang == x[1], 2:3],
        y = data[data$Lang == y[1], 2:3],
        by = "Alg", all = TRUE,
        suffixes = c(paste0(".", x[1]),
                     paste0(".", y[1]))
  )
}
And here it is in use:
compare("A", "D", Df)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596
compare("A", "C", Df)
# Alg Cpu.A Cpu.C
#1 b 1.0587151 -1.0230431
#2 c -2.0390119 -0.7691048
#3 d -0.8574474 -1.2421078
Regarding my comment, this can also be achieved using sqldf. SQLite does not support FULL OUTER JOIN, but this shouldn't be too much of an issue if you are comfortable with SQL, as there are probably a dozen or so ways to work around that:
library(sqldf)
sqldf(
  "select x.Alg
         ,lhs.Cpu as 'Cpu.A'
         ,rhs.Cpu as 'Cpu.D'
   from (
         select distinct d.Alg
         from Df d
        ) x
   left join Df lhs on lhs.Alg = x.Alg and lhs.Lang = 'A'
   left join Df rhs on rhs.Alg = x.Alg and rhs.Lang = 'D'
   order by x.Alg"
)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596
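If you eventually want all languages side by side rather than one pair at a time, a hedged sketch building on the same merge idea is to full-outer-join one Alg/Cpu subset per language with Reduce:
# one data frame per language, with Cpu renamed, then joined on Alg
pieces <- lapply(split(Df, Df$Lang), function(d) {
  out <- d[, c("Alg", "Cpu")]
  names(out)[2] <- paste0("Cpu.", d$Lang[1])
  out
})
Reduce(function(...) merge(..., by = "Alg", all = TRUE), pieces)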

Pivot on data.table similar to reshape melt function

I have read some references to similar problems here on SO, but haven't been able to find a solution yet, and I am wondering if there is any way to do the following using just data.table.
I'll use a simplified example, but in practice, my data table has > 1000 columns similar to var1, var2, ... var1000, etc.
dt <- data.table(uid=c("a","b"), var1=c(1,2), var2=c(100,200))
I am looking for a solution that will allow me to get an output similar to reshape's melt function --
> melt(dt, id=c("uid"))
uid variable value
1 a var1 1
2 b var1 2
3 a var2 100
4 b var2 200
That is, all the columns except for uid are listed under a single column with the corresponding values in an adjoining column. I have tried this with a combination of list, etc, but might be missing something that is obvious.
All uids in dt are unique.
Thanks in advance.
For a data.table reshape, try the following:
dt[, list(variable = names(.SD), value = unlist(.SD, use.names = F)), by = uid]
The cost of the syntax is worth it; the function runs very quickly!
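For what it's worth, more recent data.table versions also ship their own melt method for data.tables, so on a current installation a minimal sketch would simply be:
library(data.table)
dt <- data.table(uid = c("a", "b"), var1 = c(1, 2), var2 = c(100, 200))
melt(dt, id.vars = "uid")  # columns: uid, variable, value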
stack generally outperforms melt.
A straightforward approach to this problem with stack would be:
dt[, stack(.SD), by = "uid"]
Of course, you can specify your .SDcols if necessary. And then, use setnames() to change the names to whatever you want.
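A hedged sketch of that, renaming stack()'s default columns afterwards:
out <- dt[, stack(.SD), by = "uid"]
setnames(out, c("values", "ind"), c("value", "variable"))
out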
(Self-promotion alert)
I wrote some functions and put them in a package called "splitstackshape". One of the functions is called Stacked(), and in version 1.2.0 of the "splitstackshape" package, it should work very fast.
It's a little bit different from just stacking all the remaining columns in a data.table. It is more analogous to base R's reshape() than melt() from "reshape2". Here's an example of Stacked() in action.
I've created a decently large data.table to do this test. There are 50 numeric columns we want to stack, and 50 factor columns we want to stack. I've also further optimized @Andreas's answer.
The data
set.seed(1)
m1 <- matrix(rnorm(10000*50), ncol = 50)
m2 <- matrix(sample(LETTERS, 10000*50, replace = TRUE), ncol = 50)
colnames(m1) <- paste("varA", sprintf("%02d", 1:50), sep = "_")
colnames(m2) <- paste("varB", sprintf("%02d", 1:50), sep = "_")
dt <- data.table(uid = 1:10000, m1, m2)
The functions for benchmarking
test1 <- function() Stacked(dt, "uid", c("varA", "varB"), "_")
## merged.stack
test2 <- function() merged.stack(dt, "uid", c("varA", "varB"), "_")
## unlist(..., use.names = TRUE) -- OPTIMIZED
test3 <- function() {
  list(cbind(dt[, "uid", with = FALSE],
             dt[, list(variable = rep(names(.SD), each = nrow(dt)),
                       value = unlist(.SD)),
                .SDcols = 2:51]),
       cbind(dt[, "uid", with = FALSE],
             dt[, list(variable = rep(names(.SD), each = nrow(dt)),
                       value = unlist(.SD)),
                .SDcols = 52:101]))
}
## unlist(..., use.names = FALSE) -- OPTIMIZED
test4 <- function() {
  list(cbind(dt[, "uid", with = FALSE],
             dt[, list(variable = rep(names(.SD), each = nrow(dt)),
                       value = unlist(.SD, use.names = FALSE)),
                .SDcols = 2:51]),
       cbind(dt[, "uid", with = FALSE],
             dt[, list(variable = rep(names(.SD), each = nrow(dt)),
                       value = unlist(.SD, use.names = FALSE)),
                .SDcols = 52:101]))
}
## Andreas's current answer
test5 <- function() {
  list(dt[, list(variable = names(.SD),
                 value = unlist(.SD, use.names = FALSE)),
          by = uid, .SDcols = 2:51],
       dt[, list(variable = names(.SD),
                 value = unlist(.SD, use.names = FALSE)),
          by = uid, .SDcols = 52:101])
}
The results
library(microbenchmark)
microbenchmark(Stacked = test1(), merged.stack = test2(),
               unlist.namesT = test3(), unlist.namesF = test4(),
               AndreasAns = test5(), times = 3)
# Unit: milliseconds
# expr min lq median uq max neval
# Stacked 391.3251 393.0976 394.8702 421.4185 447.9668 3
# merged.stack 764.3071 769.6935 775.0799 867.2638 959.4477 3
# unlist.namesT 1680.0610 1761.9701 1843.8791 1881.9722 1920.0653 3
# unlist.namesF 215.0827 242.7748 270.4669 270.6944 270.9218 3
# AndreasAns 16193.5084 16249.5797 16305.6510 16793.3832 17281.1154 3
^^ I'm not sure why Andreas's current answer is so slow here. The "optimization" I did was basically to unlist without using by, which made a huge difference on the "varB" (factor) columns.
The manual approach is still faster than the functions from "splitstackshape", but these are milliseconds we're talking about, and some pretty compact one-liner code!
Sample output
For reference, here is what the output of Stacked() looks like. It's a list of "stacked" data.tables, one list item for each stacked variable.
test1()
# $varA
# uid .time_1 varA
# 1: 1 01 -0.6264538
# 2: 1 02 -0.8043316
# 3: 1 03 0.2353485
# 4: 1 04 0.6179223
# 5: 1 05 -0.2212571
# ---
# 499996: 10000 46 -0.6859073
# 499997: 10000 47 -0.9763478
# 499998: 10000 48 0.6579464
# 499999: 10000 49 0.7741840
# 500000: 10000 50 0.5195232
#
# $varB
# uid .time_1 varB
# 1: 1 01 D
# 2: 1 02 A
# 3: 1 03 S
# 4: 1 04 L
# 5: 1 05 T
# ---
# 499996: 10000 46 A
# 499997: 10000 47 W
# 499998: 10000 48 H
# 499999: 10000 49 U
# 500000: 10000 50 W
And, here is what the merged.stack output looks like. It's similar to what you would get when you use reshape(..., direction = "long") from base R.
test2()
# uid .time_1 varA varB
# 1: 1 01 -0.6264538 D
# 2: 1 02 -0.8043316 A
# 3: 1 03 0.2353485 S
# 4: 1 04 0.6179223 L
# 5: 1 05 -0.2212571 T
# ---
# 499996: 10000 46 -0.6859073 A
# 499997: 10000 47 -0.9763478 W
# 499998: 10000 48 0.6579464 H
# 499999: 10000 49 0.7741840 U
# 500000: 10000 50 0.5195232 W
Shameless Self-promotion
You might want to try melt_ from my package Kmisc. melt_ is essentially a rewrite of reshape2:::melt.data.frame with most of the grunt work done in C, and avoids as much copying and type coercion as possible for a speedy implementation.
An example:
## devtools::install_github("Kmisc", "kevinushey")
library(Kmisc)
library(reshape2)
library(microbenchmark)
n <- 1E6
big_df <- data.frame( stringsAsFactors=FALSE,
  x=sample(letters, n, TRUE),
  y=sample(LETTERS, n, TRUE),
  za=rnorm(n),
  zb=rnorm(n),
  zc=rnorm(n)
)
all.equal(
  melt <- melt(big_df, id.vars=c('x', 'y')),
  melt_ <- melt_(big_df, id.vars=c('x', 'y'))
)
## we don't convert the 'variable' column to factor by default
## if we do, we see they're identical
melt_$variable <- factor(melt_$variable)
stopifnot( identical(melt, melt_) )
microbenchmark( times=5,
  melt=melt(big_df, id.vars=c('x', 'y')),
  melt_=melt_(big_df, id.vars=c('x', 'y'))
)
gives me
Unit: milliseconds
expr min lq median uq max neval
melt 916.40436 931.60031 999.03877 1102.31090 1160.3598 5
melt_ 61.59921 78.08768 90.90615 94.52041 182.0879 5
With any luck, this will be fast enough for your data.
