reshape wide to long with character suffixes instead of numeric suffixes - r

Inspired by a comment from @gsk3 on a question about reshaping data, I started doing a little bit of experimentation with reshaping data where the variable names have character suffixes instead of numeric suffixes.
As an example, I'll load the dadmomw dataset from one of the UCLA ATS Stata learning webpages (see "Example 4" on the webpage).
Here's what the dataset looks like:
library(foreign)
dadmom <- read.dta("https://stats.idre.ucla.edu/stat/stata/modules/dadmomw.dat")
dadmom
# famid named incd namem incm
# 1 1 Bill 30000 Bess 15000
# 2 2 Art 22000 Amy 18000
# 3 3 Paul 25000 Pat 50000
When trying to reshape from this wide format to long, I run into a problem. Here's what I do to reshape the data.
reshape(dadmom, direction="long", idvar=1, varying=2:5,
sep="", v.names=c("name", "inc"), timevar="dadmom",
times=c("d", "m"))
# famid dadmom name inc
# 1.d 1 d 30000 Bill
# 2.d 2 d 22000 Art
# 3.d 3 d 25000 Paul
# 1.m 1 m 15000 Bess
# 2.m 2 m 18000 Amy
# 3.m 3 m 50000 Pat
Note the swapped column names for "name" and "inc"; changing v.names to c("inc", "name") doesn't solve the problem.
reshape seems very picky about wanting the columns to be named in a fairly standard way. For example, I can reshape the data correctly (and easily) if I first rename the columns:
dadmom2 <- dadmom # Just so we can continue experimenting with the original data
# Change the names of the last four variables to include a "."
names(dadmom2)[2:5] <- gsub("(d$|m$)", "\\.\\1", names(dadmom2)[2:5])
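# The names are now "famid", "name.d", "inc.d", "name.m", "inc.m" -
# i.e. stub, separator, then time - which is the layout reshape() expects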
reshape(dadmom2, direction="long", idvar=1, varying=2:5,
timevar="dadmom")
# famid dadmom name inc
# 1.d 1 d Bill 30000
# 2.d 2 d Art 22000
# 3.d 3 d Paul 25000
# 1.m 1 m Bess 15000
# 2.m 2 m Amy 18000
# 3.m 3 m Pat 50000
My questions are:
Why is R swapping the columns in the example I've provided?
Can I get to this result with base R reshape without changing the variable names before reshaping?
Are there other approaches that could be considered instead of reshape?

This works (you specify via varying which columns belong to which group):
reshape(dadmom, direction="long", varying=list(c(2, 4), c(3, 5)),
sep="", v.names=c("name", "inc"), timevar="dadmom",
times=c("d", "m"))
So you actually have nested repeated measures here; both name and inc for mom and dad. Because you have more than one series of repeated measures you have to supply a list to varying that tells reshape which group gets stacked on the other group.
So the two approaches to this problem are to provide a list as I did or to rename the columns the way the R beast likes them as you did.
See my recent blogs on base reshape for more on this (particularly the second link deals with this):
reshape (part I)
reshape (part II)

Though this question was specifically about base R, it is useful to know other approaches that help you to achieve the same type of outcome.
One alternative to reshape or merged.stack would be to use a combination of "dplyr" and "tidyr", like this:
library(dplyr)
library(tidyr)
dadmom %>%
gather(variable, value, -famid) %>% ## Make the entire dataset long
separate(variable, into = c("var", "time"), ## Split "variable" column into two...
sep = "(?<=name|inc)", perl = TRUE) %>% ## ... using regex to split the values
spread(var, value, convert = TRUE) ## Make result wide, converting type
# famid time inc name
# 1 1 d 30000 Bill
# 2 1 m 15000 Bess
# 3 2 d 22000 Art
# 4 2 m 18000 Amy
# 5 3 d 25000 Paul
# 6 3 m 50000 Pat
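For what it's worth, newer versions of tidyr (a sketch, assuming tidyr >= 1.0.0, which postdates this answer) can do the gather/separate/spread dance in one step with pivot_longer(); the ".value" sentinel keeps "name" and "inc" as separate columns while the trailing letter becomes "time":
library(tidyr)
pivot_longer(dadmom, cols = -famid,
             names_to = c(".value", "time"),
             names_pattern = "(name|inc)(d|m)")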
Another alternative would be to use melt from "data.table", like this:
library(data.table)
melt(as.data.table(dadmom), ## melt here requires a data.table
measure = patterns("name", "inc"), ## identify columns by patterns
value.name = c("name", "inc"))[ ## specify the resulting variable names
## melt creates a numeric "variable" value. Replace with factored labels
, variable := factor(variable, labels = c("d", "m"))][]
# famid variable name inc
# 1: 1 d Bill 30000
# 2: 2 d Art 22000
# 3: 3 d Paul 25000
# 4: 1 m Bess 15000
# 5: 2 m Amy 18000
# 6: 3 m Pat 50000
How do these approaches compare with merged.stack?
Both packages are much better supported. They update and test their code more extensively than I do.
melt is blazing fast.
The Hadleyverse approach is actually slower (in many of my tests, even slower than base R's reshape), probably because it has to make the data long, then wide, and then perform type conversion. However, some users like its step-by-step approach.
The Hadleyverse approach might have some unintended consequences because of the requirement of making the data long before making it wide. That forces all of the measure columns to be coerced to the same type (usually "character") if they are of different types to begin with.
Neither has the same convenience as merged.stack. Just look at the code required to get the result ;-)
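To see the type coercion mentioned above (a quick sketch using dadmom): after gather(), a single "value" column has to hold both the character names and the numeric incomes, so everything is coerced to character.
str(gather(dadmom, variable, value, -famid)$value)
# chr [1:12] "Bill" "Art" "Paul" "30000" "22000" "25000" ...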
merged.stack, however, can probably benefit from a simplified update, something along the lines of this function
ReshapeLong_ <- function(indt, stubs, sep = NULL) {
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  # columns belonging to each stub, identified by their prefix
  mv <- lapply(stubs, function(y) grep(sprintf("^%s", y), names(indt)))
  # whatever is left after stripping the stubs becomes the "time" labels
  levs <- unique(gsub(paste(stubs, collapse = "|"), "", names(indt)[unlist(mv)]))
  # optionally strip a literal separator (e.g. ".") from those labels
  if (!is.null(sep)) levs <- sub(sep, "", levs, fixed = TRUE)
  melt(indt, measure = mv, value.name = stubs)[
    , variable := factor(variable, labels = levs)][]
}
Which can then be used as:
ReshapeLong_(dadmom, stubs = c("name", "inc"))
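For dadmom this should reproduce the melt() output shown above; with the optional sep argument it should also cope with separator-style names like those in dadmom2 (a sketch, assuming the separator stripping behaves as intended):
ReshapeLong_(dadmom2, stubs = c("name", "inc"), sep = ".")
#    famid variable name   inc
# 1:     1        d Bill 30000
# 2:     2        d  Art 22000
# 3:     3        d Paul 25000
# 4:     1        m Bess 15000
# 5:     2        m  Amy 18000
# 6:     3        m  Pat 50000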
How do these approaches compare with base R's reshape?
The main difference is that reshape is not able to handle unbalanced panel datasets. See, for example, "mydf2" as opposed to "mydf" in the tests below.
Test cases
Here's some sample data. "mydf" is balanced. "mydf2" is not balanced.
set.seed(1)
x <- 10000
mydf <- mydf2 <- data.frame(
id_1 = 1:x, id_2 = c("A", "B"), varAa = sample(letters, x, TRUE),
varAb = sample(letters, x, TRUE), varAc = sample(letters, x, TRUE),
varBa = sample(10, x, TRUE), varBb = sample(10, x, TRUE),
varBc = sample(10, x, TRUE), varCa = rnorm(x), varCb = rnorm(x),
varCc = rnorm(x), varDa = rnorm(x), varDb = rnorm(x), varDc = rnorm(x))
mydf2 <- mydf2[-c(9, 14)] ## Make data unbalanced
Here are some functions to test:
f1 <- function(mydf) {
mydf %>%
gather(variable, value, starts_with("var")) %>%
separate(variable, into = c("var", "time"),
sep = "(?<=varA|varB|varC|varD)", perl = TRUE) %>%
spread(var, value, convert = TRUE)
}
f2 <- function(mydf) {
melt(as.data.table(mydf),
measure = patterns(paste0("var", c("A", "B", "C", "D"))),
value.name = paste0("var", c("A", "B", "C", "D")))[
, variable := factor(variable, labels = c("a", "b", "c"))][]
}
f3 <- function(mydf) {
merged.stack(mydf, var.stubs = paste0("var", c("A", "B", "C", "D")), sep = "var.stubs")
}
## Won't run with "mydf2". Should run with "mydf"
f4 <- function(mydf) {
reshape(mydf, direction = "long",
varying = lapply(c("varA", "varB", "varC", "varD"),
function(x) grep(x, names(mydf))),
sep = "", v.names = paste0("var", c("A", "B", "C", "D")),
timevar="time", times = c("a", "b", "c"))
}
Test performance:
library(microbenchmark)
microbenchmark(f1(mydf), f2(mydf), f3(mydf), f4(mydf))
# Unit: milliseconds
#      expr        min         lq       mean     median         uq       max neval
#  f1(mydf) 463.006547 492.073086 528.533319 514.189548 538.910756 867.93356   100
#  f2(mydf)   3.737321   4.108376   6.674066   4.332391   4.761681  47.71142   100
#  f3(mydf)  60.211254  64.766770  86.812077  87.040087  92.841747 262.89409   100
#  f4(mydf)  40.596455  43.753431  61.006337  48.963145  69.983623 230.48449   100
Observations:
Base R's reshape would not be able to handle reshaping "mydf2".
The "dplyr" + "tidyr" approach would mangle the results in the resulting "varB", "varC", and "varD" because values would be coerced to character.
As the benchmarks show, reshape gives reasonable performance.
Note: Because of the time that has passed since my last answer and the difference in approach, I thought I would share this as a new answer.

merged.stack from my "splitstackshape" package handles this by making use of the sep = "var.stubs" construct:
library(splitstackshape)
merged.stack(dadmom, var.stubs = c("inc", "name"), sep = "var.stubs")
# famid .time_1 inc name
# 1: 1 d 30000 Bill
# 2: 1 m 15000 Bess
# 3: 2 d 22000 Art
# 4: 2 m 18000 Amy
# 5: 3 d 25000 Paul
# 6: 3 m 50000 Pat
Notice that since there is no real separator in the variables being stacked, we can simply strip the var.stubs out of the variable names to create the "time" values; the ".time_1" column is built from whatever is left over after that stripping. Using sep = "var.stubs" is therefore equivalent to doing sep = "inc|name".
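Spelled out as a regular expression, that equivalence is simply (a sketch; both calls should return the same table as above):
merged.stack(dadmom, var.stubs = c("inc", "name"), sep = "inc|name")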


Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
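To unpack the indexing (a small sketch with these data):
inds <- match(df1$x, df2$x)
inds
# [1]  1  2 NA NA
So the first two rows of df1 have a match (at positions 1 and 2 of df2), rows 3 and 4 do not, and df2$y[na.omit(inds)] supplies the replacement values "f" and "g" in the right order.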
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of the expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. Factors are not text - in short, they represent categorical data (male vs. female; scores A, B, C, D, and F; etc.) and are really integers that have text labels. So that could be your issue.
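For example, with the pre-R-4.0 default of stringsAsFactors = TRUE, the frames look something like this (a sketch):
str(df1)
# 'data.frame': 4 obs. of  2 variables:
#  $ x: num  1 2 3 4
#  $ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4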
Running your code gives a warning because you are trying to import new factor levels (labels) into df1 that don't exist there. R doesn't know what to do with them, so it just inserts NA values.
As r2evans answered, he used stringsAsFactors = FALSE to stop strings being converted to factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors = FALSE) (and I've heard stringsAsFactors = FALSE will be the default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
The call df1$x %in% df2$x returns a logical vector (boolean) of which elements in df1$x are found in df2 - i.e. the first two, but not the last two. It does not, however, tell you which positions in the first vector correspond to which positions in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
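Concretely, with the swapped df2 from above (a sketch):
df1$x %in% df2$x
# [1]  TRUE  TRUE FALSE FALSE
which(df1$x %in% df2$x)
# [1] 1 2
df2[which(df1$x %in% df2$x), ]$y
# [1] "g" "f"    <- picked up in df2's (swapped) row order, not matched on x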
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages worth getting to know).
In his solution, he uses merge to perform a "full join", which matches rows based on value rather than position (which is, in effect, what your code did). With transform, he assigns new variables within the context of the data.frame returned by the merge call in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table: we join on the 'x' column and assign the values of 'y' from the second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
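After this update by reference, df1 holds the same result as the approaches above (a sketch; assumes character columns, see the note below):
df1
#    x y
# 1: 1 f
# 2: 2 g
# 3: 3 c
# 4: 4 d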
NOTE: It is better to use stringsAsFactors = FALSE (in R 4.0.0 it is the default anyway); otherwise we need to have all the levels common to both datasets.

optimizing solution to find common third on large data set

This is a follow-up question to my previous question. I have run into a problem finding a memory-efficient solution for identifying a common third for my large data set (3.5 million groups and 6.2 million persons).
The proposed solution using the igraph package works fast for normal-sized data sets, but unfortunately runs into memory issues by creating a large matrix for bigger data sets. A similar issue comes up with my own solution using concatenated inner joins, where the third inner join inflates the dataframe so that my pc runs out of memory (16gb).
library(dplyr)
df.output <- inner_join(df,df, by='group' ) %>%
inner_join(.,df, by=c('person.y'='person')) %>%
inner_join(.,df, by=c('group.y'='group')) %>%
rename(person_in_common='person.y', pers1='person.x',pers2='person') %>%
select(pers1,pers2,person_in_common) %>%
filter(pers1!=pers2) %>%
distinct() %>%
filter(person_in_common!=pers1 & person_in_common!=pers2)
df.output[-3] <- t(apply(df.output[-3], 1,
FUN=function(x) sort(x, decreasing=FALSE)))
df.output <- unique(df.output)
Small data set example and expected output
df <- data.frame(group= c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"), stringsAsFactors = FALSE)
df
group person
1 a Tom
2 a Jerry
3 b Tom
4 b Anna
5 b Sam
6 c Nic
and expected result
df.output
pers1 pers2 person_in_common
1 Anna Jerry Tom
2 Jerry Sam Tom
3 Sam Tom Anna
4 Anna Tom Sam
6 Anna Sam Tom
I unfortunately don't have access to a machine with more RAM and am also not really experienced with cloud computing, so I hope to make it work on my local pc. I would appreciate input on how to optimize any of the solutions, or advice on how to tackle the problem otherwise.
Edit 1
A dataframe which reflects my actual data size.
set.seed(33)
Data <- data.frame(group = sample(1:3700000, 14000000, replace=TRUE),
person = sample(1:6800000, 14000000,replace = TRUE))
Edit 2
My real data is a bit more complex than the example data, with larger groups and more persons per group, so it gets more memory intensive. I could not figure out how to simulate this kind of structure, so here is the real data for download:
Full person-group data
So, I managed to run this on your test data (I have 16 GB of RAM), but if you run it on your small example you will see that it does not give the same results. I did not get why, but maybe you could help me with that. So I will try to explain every step:
myFun <- function(dt) {
require(data.table)
# convert the data to a data.table:
setDT(dt)
# set key/order the data by group and person:
setkey(dt, group, person)
# I copy the initial data and change the name of soon to be merged column name to "p2"
# which represents person2
dta <- copy(dt)
setnames(dta, "person", "p2")
# the first merge using data.table:
dt1 <- dt[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
# now we remove rows where persons are the same:
dt1 <- dt1[person != p2] # remove equal persons
# and we also need to remove rows where person1 and person2 are the same pair,
# just in a different order, for example:
# 2: a Tom Jerry
# 3: a Jerry Tom
# are the same pair. If I get it right, you did this with apply at the end of your
# code, but it is much better to reduce the data now. My approach is also much
# faster (we put the pairwise min into the 2nd column and the max into the 3rd):
l1 <- pmin(dt1[[2]], dt1[[3]])
l2 <- pmax(dt1[[2]], dt1[[3]])
set(dt1, j = 2L, value = l1)
set(dt1, j = 3L, value = l2)
# now lets clear memory and take unique rows of dt1:
rm(l1, l2, dt)
dt1 <- unique(dt1)
gc()
# change name for group column:
setnames(dta, "group", "g2")
# second merge:
dt2 <- dt1[dta, on = "p2", allow.cartesian = TRUE, nomatch = 0]
rm(dt1)
gc()
setnames(dta, "p2", "p3")
dt3 <- dt2[dta, on = "g2", allow.cartesian = TRUE, nomatch = 0] # third merge
rm(dt2)
gc()
dt3 <- dt3[p3 != p2 & p3 != person] # removing equal persons
gc()
dt3 <- dt3[, .(person, p2, p3)]
gc()
return(dt3[])
}
On Small data set example:
df <- data.frame(group = c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
stringsAsFactors = FALSE)
df
myFun(df)
# person p2 p3
# 1: Anna Tom Jerry
# 2: Sam Tom Jerry
# 3: Jerry Tom Anna
# 4: Sam Tom Anna
# 5: Jerry Tom Sam
# 6: Anna Tom Sam
# 7: Anna Sam Tom
Something similar to your result but not quite the same
Now with larger data:
set.seed(33)
N <- 10e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # 13.22 sec
rm(results)
gc()
And:
set.seed(33)
N <- 14e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # around 40 sec, but RAM usage gets close to the max
Update:
Maybe you can try this splitting approach, say with nparts between 6 and 10:
myFunNew3 <- function(dt, nparts = 2) {
require(data.table)
setDT(dt)
setkey(dt, group, person)
dta <- copy(dt)
# split into N parts
splits <- rep(1:nparts, each = ceiling(dt[, .N]/nparts))
set(dt, j = "splits", value = splits)
dtl <- split(dt, by = "splits", keep.by = F)
set(dt, j = "splits", value = NULL)
rm(splits)
gc()
for (i in seq_along(dtl)) {
X <- copy(dtl[[i]])
setnames(dta, c("group", "person"))
X <- X[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
X <- X[person != i.person]
gc()
X <- X[dta, on = "person", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(dta, "group", "i.group")
X <- X[dta, on = "i.group", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(X, "i.person.1", "pers2")
setnames(X, "i.person", "pers1" )
setnames(X, "person", "person_in_common" )
X <- X[, .(pers1, pers2, person_in_common)]
gc()
X <- X[pers1 != pers2 & person_in_common != pers1 & person_in_common != pers2]
gc()
name1 <- "pers1"
name2 <- "pers2"
l1 <- pmin(X[[name1]], X[[name2]])
l2 <- pmax(X[[name1]], X[[name2]])
set(X, j = name1, value = l1)
set(X, j = name2, value = l2)
rm(l1, l2)
gc()
X <- unique(X)
gc()
if (i > 1) {
X1 <- rbindlist(list(X1, X), use.names = T, fill = T)
X1 <- unique(X1)
rm(X)
gc()
} else {
X1 <- copy(X)
}
dtl[[i]] <- 0L
gc()
}
rm(dta, dtl)
gc()
setkey(X1, pers1, pers2, person_in_common)
X1[]
}
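A usage sketch on the simulated data from "Edit 1" above (nparts is the only extra knob; 6-10 was the suggestion):
set.seed(33)
Data <- data.frame(group = sample(1:3700000, 14000000, replace = TRUE),
                   person = sample(1:6800000, 14000000, replace = TRUE))
results <- myFunNew3(Data, nparts = 8)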

why is dcast so impossible to pass a non-aggregate function?

I am using the data.table package for a table like this:
DT <- data.table(id=rep(1:100, each=50),
grp=rep(letters[1:4], each=1250),
time=rep(1:50,100),
outcome=rnorm(5000),
seconds=rep(1:500,10),
weights=rnorm(5000),
response=rep(1:200, each=25),
key=c("grp", "time"))
I would like to create a new (possibly rbindlisted) data table of some summary statistics from this table. I first created two intermediary tables a and b,
a <- DT[, list(mean = weighted.mean(outcome, weights),
median=median(outcome),seconds), by=c("grp","time")]
b <- DT[, list(mean=weighted.mean(response, seconds),
median=median(response)), by=c("grp","time")]
and then am trying to rowbind these together across all groups but still preserve the grouping along the rows. This does not work:
DTfinal <- data.table(DT$grp, DT$time,
outcomemean=a$mean, responsemean=b$mean,
outcomemedian=a$median, responsemedian=b$median)
I don't think a merge works, since a and b have different lengths. Row-binding a and b also mixes up the different means and medians of a and b; ideally, I would like an rbindlist result that adds some kind of suffix to each column, like c(".a", ".b").
Update:
I get an error (since a and b have different dimensions) when doing
DTfinal <- rbindlist(setNames(list(a[, c("grp", "time", "mean", "median"),
with = FALSE],
b[, c("grp", "time", "mean", "median"),
with = FALSE]),
c("a", "b")),
idcol= "id")
dcast(DTfinal, grp + time ~id, value.var = c('mean', 'median'))
where it returns
Aggregate function missing, defaulting to 'length'
We can use rbindlist after placing the datasets in a list
DTfinal <- rbindlist(list(a,b))
dim(DTfinal)
#[1] 400 4
dim(a)
#[1] 200 4
dim(b)
#[1] 200 4
Suppose both datasets have different numbers of columns, and we have a vector of column names that we need to keep
nm1 <- intersect(names(a), names(b))
rbindlist(list(a[, nm1, with = FALSE], b[, nm1, with = FALSE]), idcol= "id")
Update
If we need to convert to 'wide' format
DTfinal <- rbindlist(setNames(list(a,b), c("a", "b")), idcol= "id")
dcast(DTfinal, grp + time ~id, value.var = c('mean', 'median'))
#      grp time       mean_a    mean_b    median_a median_b
#   1:   a    1   0.52171471  25.99502 -0.06558068       25
#   2:   a    2   0.36445108  25.99010  0.13518412       25
#   3:   a    3   0.08993721  25.98522  0.20128790       25
#   4:   a    4 -64.04617391  25.98039  0.40999376       25
#   5:   a    5   0.81730847  25.97561 -0.03481697       25
#  ---
# 196:   d   46   1.62818374 176.67568 -0.26695999      176
# 197:   d   47  -1.45259871 176.67340  0.14893356      176
# 198:   d   48   9.59796683 176.67114 -0.05834959      176
# 199:   d   49  -2.74285453 176.66890 -0.22094347      176
# 200:   d   50   1.22109043 176.66667 -0.08172928      176

Aggregating data frame rows using an input vector

I have this toy data.frame:
df = data.frame(id = c("a","b","c","d"), value = c(2,3,6,5))
and I'd like to aggregate its rows according to this toy vector:
collapsed.ids = c("a,b","c","d")
where the aggregated data.frame should keep max(df$value) of its aggregated rows.
So for this toy example the output would be:
> aggregated.df
id value
1 a,b 3
2 c 6
3 d 5
I should note that my real data.frame is ~150,000 rows
I would use data.table for this.
Something like the following should work:
library(data.table)
DT <- data.table(df, key = "id") # Main data.table
Key <- data.table(ind = collapsed.ids) # your "Key" table
## We need your "Key" table in a long form
Key <- Key[, list(id = unlist(strsplit(ind, ",", fixed = TRUE))), by = ind]
setkey(Key, id) # Set the key to facilitate a merge
## Merge and aggregate in one step
DT[Key][, list(value = max(value)), by = ind]
# ind value
# 1: a,b 3
# 2: c 6
# 3: d 5
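For reference, the long-form Key built in the intermediate step looks like this (a sketch):
Key
#    ind id
# 1: a,b  a
# 2: a,b  b
# 3:   c  c
# 4:   d  d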
You don't need data.table, you can just use base R.
split.ids <- strsplit(collapsed.ids, ",")
split.df <- data.frame(id = unlist(split.ids),
                       joinid = rep(collapsed.ids, sapply(split.ids, length)))
aggregated.df <- aggregate(value ~ joinid, data = merge(df, split.df), max)
names(aggregated.df)[1] <- "id"  # rename the grouping column to match the requested output
Result:
# id value
# 1 a,b 3
# 2 c 6
# 3 d 5
Benchmark
df <- df[rep(1:4, 50000), ] # Make a big data.frame
system.time(...) # of the above code
# user system elapsed
# 1.700 0.154 1.947
EDIT: Apparently Ananda's code runs in 0.039, so I'm eating crow. But either is acceptable for this size.

Pivot a large data.table

I have a large data table in R:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)
I'd like to pivot this data.table, such that Category becomes a column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.
Any ideas how I might do this?
/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.
data.table implements faster versions of melt and dcast as data.table-specific methods (written in C). It also adds extra features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.
Note that we don't need to load reshape2 package.
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802
Like that?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
There is no data.table-specific wide reshaping method.
Here is an approach that will work, but it is rather convoluted.
There is a feature request #2619, Scoping for LHS in :=, to help with making this more straightforward.
Here is a simple example
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <-1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so it should be
# quick (the feature request would make it possible to
# speed this up using binary search joins).
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NA
EDIT
I found this SO post, which includes a better way to insert the
missing rows into a data.table. Function fun_DT adjusted
accordingly. Code is cleaner now; I don't see any speed improvements
though.
See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution here (creating an ID2 first, then creating all ID2-Category combination, then filling up the data.table, then doing the reshaping).
I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.
The solutions are roughly the same speed-wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase n any further... the computer has crashed too often already ;-)
library(data.table)
library(rbenchmark)
fun_reshape <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
}
#UPDATED!
fun_DT <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
agg[, ID2 := paste(ID, Month, sep="_")]
setkey(agg, ID2, Category)
agg <- agg[CJ(unique(ID2), unique(Category))]
agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
}
library(rbenchmark)
n <- 1e+07
benchmark(replications=10,
fun_reshape(n),
fun_DT(n))
            test replications elapsed relative user.self sys.self user.child sys.child
2      fun_DT(n)           10  45.868        1    43.154    2.524          0         0
1 fun_reshape(n)           10  45.874        1    42.783    2.896          0         0
