Pivot a large data.table

Pivot a large data.table - r

I have a large data table in R:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)
I'd like to pivot this data.table, such that Category becomes a column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.
Any ideas how I might do this?
/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.

data.table implements faster versions of melt/dcast data.table specific methods (in C). It also adds additional features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.
Note that we don't need to load reshape2 package.
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802

Like that?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")

There is no data.table specific wide reshaping method.
Here is an approach that will work, but it is rather convaluted.
There is a feature request #2619 Scoping for LHS in :=to help with making this more straightforward.
Here is a simple example
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <-1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so should be
# quick (the feature request would make this possible to
# speed up using binary search joins.
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NA

EDIT
I found this SO post, which includes a better way to insert the
missing rows into a data.table. Function fun_DT adjusted
accordingly. Code is cleaner now; I don't see any speed improvements
though.
See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution here (creating an ID2 first, then creating all ID2-Category combination, then filling up the data.table, then doing the reshaping).
I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.
The solutions are roughly the same speed wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase the n any further...computer crashed to often already ;-)
library(data.table)
library(rbenchmark)
fun_reshape <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
}
#UPDATED!
fun_DT <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
agg[, ID2 := paste(ID, Month, sep="_")]
setkey(agg, ID2, Category)
agg <- agg[CJ(unique(ID2), unique(Category))]
agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
}
library(rbenchmark)
n <- 1e+07
benchmark(replications=10,
fun_reshape(n),
fun_DT(n))
test replications elapsed relative user.self sys.self user.child sys.child
2 fun_DT(n) 10 45.868 1 43.154 2.524 0 0
1 fun_reshape(n) 10 45.874 1 42.783 2.896 0 0

Related

R data.table: efficiently access and update a variable column name in j expression with grouping [duplicate]

This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 2 years ago.
I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
A = runif(n),
B = rnorm(n),
C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
A = runif(n),
B = rnorm(n),
C = rexp(n))
columns <- c("A", "B", "C")
for (x in columns) {
# This doesn't work.
# target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm = TRUE), by=year(date)]
# This doesn't work.
#target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm = TRUE), by=year(date)]
# THIS WORKS! But it is tedious writing "get(x)" every time.
target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm = TRUE), by=year(date)][, V1]
set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things which may be possibly be improved:
How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?

You can use .SDcols to specify the columns that you want to operate on :
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x- mean(x))/sd(x)),
year(date), .SDcols = columns]
This avoids using get(x) everytime and updates data.table by reference.

I think Ronak's answer is superior & preferable, just writing this to demonstrate a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[ , by = year(date), {
xval = eval(as.name(x))
(xval - mean(xval, na.rm=TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy & IME faster
I replaced [ , V1] with $V1 -- the former requires the overhead of [.data.table.
You might also like to know that the base function scale will do the center & normalize steps more concisely (if slightly inefficient for being a bit to general).

Detecting columns containing any value quickly with grep

I have a large dataset, 5000 variables and 3 million rows. I want to check what columns contain dates. I'm working with data.table and reading the data with fread. In order to know what columns contain dates I run this:
my[, lapply(.SD,function(xx)
length(grep("^\\d\\d?/\\d\\d?/\\d{4}$",xx))>0 ) ]
or the same with any(grepl())
But it's very slow.
Is there any way to do it faster? Maybe forcing grep to stop the first time it encounters a date? I think (command line) grep has an option to do it:
grep -m 1
But I think it's not available in R.
Any idea? Solutions with base R or other packages are also welcome.
I could also work only with a few rows of the data.table but some columns could have very little values different than NA and there are chances of missing them.
Very simple example with some NA:
library(data.table)
set.seed(1)
siz <- 10000000
my <- data.table(
AA=c(rep(NA,siz-1),"11/11/2001"),
BB=sample(c("wrong", "11/11/2001"),siz, prob=c(1000000,1), replace=T),
CC=sample(siz),
DD=rep("11/11/2001",siz),
EE=rep("HELLO", siz)
)
I've seen there is an option perl = FALSE but I don't know wheter it will allow me to add extra parameters.
Or similarly I want to know among the files supposed to be dates whether there are strange symbols. I could run grep on every column but it would be great to be able to stop as soon as my test is right, without continuing till the end of the column.
Maybe with some extra package such as stringi?

One option would be to check only the first row (assuming that if there is a 'Date' class it would pick it up unless the first one is a missing value)
my[1][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))]
In addition to the above, as #Frank mentioned we can check only a subset of character class columns instead of the whole columns by specifying the .SDcols
j1 <- sapply(my, is.character)
my[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1),
.SDcols = j1]
Benchmarks
set.seed(24)
dat <- data.table(col1 = rnorm(1e6), col2 = "05/05/1942",
col3 = rnorm(1e6))
system.time(res <- dat[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1)])
# user system elapsed
# 6.33 0.01 6.35
system.time(res2 <- dat[1][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))])
# user system elapsed
# 0 0 0
system.time({
j1 <- sapply(dat, is.character)
res3 <- dat[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1), .SDcols = j1]
res3 <- names(dat) %in% names(res3)
})
# user system elapsed
# 0.43 0.00 0.44
all.equal(unlist(res), res2, check.attributes = FALSE)
#[1] TRUE
all.equal(unlist(res), res3, check.attributes=FALSE)
#[1] TRUE
If there are lots of NAs, then we can check on the first row where it has all non-NA elements
set.seed(24)
dat <- data.table(col1 = sample(c(NA, 1:10), 1e6, replace=TRUE),
col2 = c(NA, "05/05/1942"),
col3 = sample(c(NA, 1:5), 1e6, replace=TRUE))
dt1 <- head(dat, 20)
#Or just a sample of 20 rows from the dataset
#dt1 <- dat[sample(1:.N, 20, replace=TRUE)]
dt1[dt1[, which(!Reduce(`|`, lapply(.SD, is.na)))[1]]
][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))]

Summarizing by groups applying function which involves the next group

Let's assume I have the following data:
set.seed(1)
test <- data.frame(letters=rep(c("A","B","C","D"),10), numbers=sample(1:50, 40, replace=TRUE))
I want to know how many numbers whose letter is A are not in B, how many numbers of B are not in C and so on.
I came up with a solution for this using base functions split and mapply:
s.test <-split(test, test$letters)
notIn <- mapply(function(x,y) sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers), x=names(s.test)[1:3], y=names(s.test)[2:4])
Which gives:
> notIn
A B C
9 7 7
But I would also like to do this with dplyr or data.table. Is it possible?

The bottleneck seems to be in split. When simulated on 200 groups and 150,000 observations each, split takes 50 seconds out of the total 54 seconds.
The split step can be made drastically faster using data.table as follows.
## test is a data.table here
s.test <- test[, list(list(.SD)), by=letters]$V1
Here's a benchmark on data of your dimensions using data.table + mapply:
## generate data
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE),
numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)
require(data.table) ## latest CRAN version is v1.9.2
setDT(test) ## convert to data.table by reference (no copy)
system.time({
s.test <- test[, list(list(.SD)), by=letters]$V1 ## split
setattr(s.test, 'names', unique(test$letters)) ## setnames
notIn <- mapply(function(x,y)
sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers),
x=names(s.test)[1:199], y=names(s.test)[2:200])
})
## user system elapsed
## 4.840 1.643 6.624
That's about ~7.5x speedup on your biggest data dimensions. Would this be sufficient?

This seems to give about the same speedup as with data.table but only uses base R. Instead of splitting the data frame it splits the numbers column only (in line marked ##):
## generate data - from Arun's post
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE),
numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)
system.time({
s.numbers <- with(test, split(numbers, letters)) ##
notIn <- mapply(function(x,y)
sum(!s.numbers[[x]] %in% s.numbers[[y]]),
x=names(s.numbers)[1:199], y=names(s.numbers)[2:200])
})

Create Sequence Number for a block of records in an R Data Frame

I have a fairly large dataset (by my standards) and I want to create a sequence number for blocks of records. I can use the plyr package, but the execution time is very slow. The code below replicates a comparable size dataframe.
## simulate an example of the size of a normal data frame
N <- 30000
id <- sample(1:17000, N, replace=T)
term <- as.character(sample(c(9:12), N, replace=T))
date <- sample(seq(as.Date("2012-08-01"), Sys.Date(), by="day"), N, replace=T)
char <- data.frame(matrix(sample(LETTERS, N*50, replace=T), N, 50))
val <- data.frame(matrix(rnorm(N*50), N, 50))
df <- data.frame(id, term, date, char, val, stringsAsFactors=F)
dim(df)
In reality, this is a little smaller than what I work with, as the values are typically larger...but this is close enough.
Here is the execution time on my machine:
> system.time(test.plyr <- ddply(df,
+ .(id, term),
+ summarise,
+ seqnum = 1:length(id),
+ .progress="text"))
|===============================================================================================| 100%
user system elapsed
63.52 0.03 63.85
Is there a "better" way to do this? Unfortunately, I am on a Windows machine.
Thanks in advance.
EDIT: Data.table is extremely fast, but I can't get my sequence numbers to calc correctly. Here is what my ddply version created. The majority only have one record in the group, but some have 2 rows, 3 rows, etc.
> with(test.plyr, table(seqnum))
seqnum
1 2 3 4 5
24272 4950 681 88 9
And using data.table as shown below, the same approach yields:
> with(test.dt, table(V1))
V1
1
24272

Use data.table
dt = data.table(df)
test.dt = dt[,.N,"id,term"]
Here is a timing comparison. I used N = 3000 and replaced the 17000 with 1700 while generating the dataset
f_plyr <- function(){
test.plyr <- ddply(df, .(id, term), summarise, seqnum = 1:length(id),
.progress="text")
}
f_dt <- function(){
dt = data.table(df)
test.dt = dt[,.N,"id,term"]
}
library(rbenchmark)
benchmark(f_plyr(), f_dt(), replications = 10,
columns = c("test", "replications", "elapsed", "relative"))
data.table speeds up things by a factor of 170
test replications elapsed relative
2 f_dt() 10 0.779 1.000
1 f_plyr() 10 132.572 170.182
Also check out Hadley's latest work on dplyr. I wouldn't be surprised if dplyr provides an additional speedup, given that a lot of the code is being reworked in C.
UPDATE: Edited code, changing length(id) to .N as per Matt's comment.

reshape wide to long with character suffixes instead of numeric suffixes

Inspired by a comment from #gsk3 on a question about reshaping data, I started doing a little bit of experimentation with reshaping data where the variable names have character suffixes instead of numeric suffixes.
As an example, I'll load the dadmomw dataset from one of the UCLA ATS Stata learning webpages (see "Example 4" on the webpage).
Here's what the dataset looks like:
library(foreign)
dadmom <- read.dta("https://stats.idre.ucla.edu/stat/stata/modules/dadmomw.dat")
dadmom
# famid named incd namem incm
# 1 1 Bill 30000 Bess 15000
# 2 2 Art 22000 Amy 18000
# 3 3 Paul 25000 Pat 50000
When trying to reshape from this wide format to long, I run into a problem. Here's what I do to reshape the data.
reshape(dadmom, direction="long", idvar=1, varying=2:5,
sep="", v.names=c("name", "inc"), timevar="dadmom",
times=c("d", "m"))
# famid dadmom name inc
# 1.d 1 d 30000 Bill
# 2.d 2 d 22000 Art
# 3.d 3 d 25000 Paul
# 1.m 1 m 15000 Bess
# 2.m 2 m 18000 Amy
# 3.m 3 m 50000 Pat
Note the swapped column names for "name" and "inc"; changing v.names to c("inc", "name") doesn't solve the problem.
reshape seems very picky about wanting the columns to be named in a fairly standard way. For example, I can reshape the data correctly (and easily) if I first rename the columns:
dadmom2 <- dadmom # Just so we can continue experimenting with the original data
# Change the names of the last four variables to include a "."
names(dadmom2)[2:5] <- gsub("(d$|m$)", "\\.\\1", names(dadmom2)[2:5])
reshape(dadmom2, direction="long", idvar=1, varying=2:5,
timevar="dadmom")
# famid dadmom name inc
# 1.d 1 d Bill 30000
# 2.d 2 d Art 22000
# 3.d 3 d Paul 25000
# 1.m 1 m Bess 15000
# 2.m 2 m Amy 18000
# 3.m 3 m Pat 50000
My questions are:
Why is R swapping the columns in the example I've provided?
Can I get to this result with base R reshape without changing the variable names before reshaping?
Are there other approaches that could be considered instead of reshape?

This works (to specify to varying what columns go with who):
reshape(dadmom, direction="long", varying=list(c(2, 4), c(3, 5)),
sep="", v.names=c("name", "inc"), timevar="dadmom",
times=c("d", "m"))
So you actually have nested repeated measures here; both name and inc for mom and dad. Because you have more than one series of repeated measures you have to supply a list to varying that tells reshape which group gets stacked on the other group.
So the two approaches to this problem are to provide a list as I did or to rename the columns the way the R beast likes them as you did.
See my recent blogs on base reshape for more on this (particularly the second link deals with this):
reshape (part I)
reshape (part II)

Though this question was specifically about base R, it is useful to know other approaches that help you to achieve the same type of outcome.
One alternative to reshape or merged.stack would be to use a combination of "dplyr" and "tidry", like this:
dadmom %>%
gather(variable, value, -famid) %>% ## Make the entire dataset long
separate(variable, into = c("var", "time"), ## Split "variable" column into two...
sep = "(?<=name|inc)", perl = TRUE) %>% ## ... using regex to split the values
spread(var, value, convert = TRUE) ## Make result wide, converting type
# famid time inc name
# 1 1 d 30000 Bill
# 2 1 m 15000 Bess
# 3 2 d 22000 Art
# 4 2 m 18000 Amy
# 5 3 d 25000 Paul
# 6 3 m 50000 Pat
Another alternative would be to use melt from "data.table", like this:
library(data.table)
melt(as.data.table(dadmom), ## melt here requres a data.table
measure = patterns("name", "inc"), ## identify columns by patterns
value.name = c("name", "inc"))[ ## specify the resulting variable names
## melt creates a numeric "variable" value. Replace with factored labels
, variable := factor(variable, labels = c("d", "m"))][]
# famid variable name inc
# 1: 1 d Bill 30000
# 2: 2 d Art 22000
# 3: 3 d Paul 25000
# 4: 1 m Bess 15000
# 5: 2 m Amy 18000
# 6: 3 m Pat 50000
How do these approaches compare with merged.stack?
Both packages are much better supported. They update and test their code more extensively than I do.
melt is blazing fast.
The Hadleyverse approach is actually slower (in many of my tests, even slower than base R's reshape) probably because of having to make the data long, then wide, then performing type conversion. However, some users like its step-by-step approach.
The Hadleyverse approach might have some unintended consequences because of the requirement of making the data long before making it wide. That forces all of the measure columns to be coerced to the same type (usually "character") if they are of different types to begin with.
Neither have the same convenience of merged.stack. Just look at the code required to get the result ;-)
merged.stack, however, can probably benefit from a simplified update, something along the lines of this function
ReshapeLong_ <- function(indt, stubs, sep = NULL) {
if (!is.data.table(indt)) indt <- as.data.table(indt)
mv <- lapply(stubs, function(y) grep(sprintf("^%s", y), names(indt)))
levs <- unique(gsub(paste(stubs, collapse="|"), "", names(indt)[unlist(mv)]))
if (!is.null(sep)) levs <- gsub(sprintf("^%s", sep), "", levs, fixed = TRUE)
melt(indt, measure = mv, value.name = stubs)[
, variable := factor(variable, labels = levs)][]
}
Which can then be used as:
ReshapeLong_(dadmom, stubs = c("name", "inc"))
How do these approaches compare with base R's reshape?
The main difference is that reshape is not able to handle unbalanced panel datasets. See, for example, "mydf2" as opposed to "mydf" in the tests below.
Test cases
Here's some sample data. "mydf" is balanced. "mydf2" is not balanced.
set.seed(1)
x <- 10000
mydf <- mydf2 <- data.frame(
id_1 = 1:x, id_2 = c("A", "B"), varAa = sample(letters, x, TRUE),
varAb = sample(letters, x, TRUE), varAc = sample(letters, x, TRUE),
varBa = sample(10, x, TRUE), varBb = sample(10, x, TRUE),
varBc = sample(10, x, TRUE), varCa = rnorm(x), varCb = rnorm(x),
varCc = rnorm(x), varDa = rnorm(x), varDb = rnorm(x), varDc = rnorm(x))
mydf2 <- mydf2[-c(9, 14)] ## Make data unbalanced
Here are some functions to test:
f1 <- function(mydf) {
mydf %>%
gather(variable, value, starts_with("var")) %>%
separate(variable, into = c("var", "time"),
sep = "(?<=varA|varB|varC|varD)", perl = TRUE) %>%
spread(var, value, convert = TRUE)
}
f2 <- function(mydf) {
melt(as.data.table(mydf),
measure = patterns(paste0("var", c("A", "B", "C", "D"))),
value.name = paste0("var", c("A", "B", "C", "D")))[
, variable := factor(variable, labels = c("a", "b", "c"))][]
}
f3 <- function(mydf) {
merged.stack(mydf, var.stubs = paste0("var", c("A", "B", "C", "D")), sep = "var.stubs")
}
## Won't run with "mydf2". Should run with "mydf"
f4 <- function(mydf) {
reshape(mydf, direction = "long",
varying = lapply(c("varA", "varB", "varC", "varD"),
function(x) grep(x, names(mydf))),
sep = "", v.names = paste0("var", c("A", "B", "C", "D")),
timevar="time", times = c("a", "b", "c"))
}
Test performance:
library(microbenchmark)
microbenchmark(f1(mydf), f2(mydf), f3(mydf), f4(mydf))
# Unit: milliseconds
# expr min lq mean median uq max neval
# f1(mydf) 463.006547 492.073086 528.533319 514.189548 538.910756 867.93356 100
# f2(mydf) 3.737321 4.108376 6.674066 4.332391 4.761681 47.71142 100
# f3(mydf) 60.211254 64.766770 86.812077 87.040087 92.841747 262.89409 100
# f4(mydf) 40.596455 43.753431 61.006337 48.963145 69.983623 230.48449 100
Observations:
Base R's reshape would not be able to handle reshaping "mydf2".
The "dplyr" + "tidyr" approach would mangle the results in the resulting "varB", "varC", and "varD" because values would be coerced to character.
As the benchmarks show, reshape gives reasonable performance.
Note: Because of the difference in time between posting my last answer and the differences in approach, I thought I would share this as a new answer.

merged.stack from my "splitstackshape" handles this by utilizing the sep = "var.stubs" construct:
library(splitstackshape)
merged.stack(dadmom, var.stubs = c("inc", "name"), sep = "var.stubs")
# famid .time_1 inc name
# 1: 1 d 30000 Bill
# 2: 1 m 15000 Bess
# 3: 2 d 22000 Art
# 4: 2 m 18000 Amy
# 5: 3 d 25000 Paul
# 6: 3 m 50000 Pat
Notice that since there is no real separator in the variables that are being stacked, we can just strip out the var.stubs from the names to create the "time" variables. Using sep = "var.stubs" is equivalent to doing sep = "inc|name".
This works because ".time_1" is created by stripping out what is left after removing the "var.stubs" from the column names.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Pivot a large data.table - r

Like that? agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")] reshape(agg, v.names = "Qty", idvar = c("ID", "Month"), timevar = "Category", direction = "wide")

Related

R data.table: efficiently access and update a variable column name in j expression with grouping [duplicate]

Detecting columns containing any value quickly with grep

Summarizing by groups applying function which involves the next group

Create Sequence Number for a block of records in an R Data Frame

reshape wide to long with character suffixes instead of numeric suffixes

Categories

Resources