I run into a bit of trouble with my code and would welcome any suggestion to make it run faster.
I have a data frame that looks like that:
Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
d <- data.frame(Name,Category,More_info)
So I have duplicated entries for each row in column Name (the number of duplicates can vary). For each entry (a,b,...) I want to count the sum of each corresponding element in the Category column and keep the only category that appears the most. If an entry has an equal number of categories, I want to take one of most categories randomly.
So in this case, the output dataframe would look like this:
Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)
a have sun entry kept because it appears the most, b would be dog or whatever other value as they all appear once with b, and c wouldn't be changed.
My function looks like this:
my_choosing_function <- function(x){
tmp = dbSNP_hapmap[dbSNP_hapmap$refsnp_id==list_of_snps[x],]
snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
best_hit <- snp_freq[order(-snp_freq$Freq),]
best_hit$SNP<-list_of_snps[x]
top<-best_hit[1,]
return(top)
}
trst <- lapply(1:length(list_of_snps), function(x) my_choosing_function(x))
final <- do.call("rbind",trst)
Where I start from a list of unique elements (that would be Name in our case), for each element I do a table of the duplicated entries, I order the table by descending values and keep the top element. I do a lapply for each element of the list of unique values, then do a rbind of the whole thing.
As I have 2500000 rows in my initial data frame and 1500000 unique elements, it takes forever to run. 4 seconds for 100 lines, that would be a total of 34 hours for the lapply.
I'm sure packages like dplyr can do it in a few minutes but can't find a solution to do it. Anyone has an idea?
Thanks a lot for your help!
Note: This should be a very long comment because I use data.table instead of dplyr.
I suggest use data.table because it runs faster. And in the data.table way shown below, it randomly choose one in case of tie, not always the first one.
library(data.table)
library(dplyr)
library(microbenchmark)
d <- data.frame(
Name = as.character(sample.int(10000, 2.5e6, replace = T)),
Category = as.character(sample.int(10000, 2.5e6, replace = T)),
More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 45.932 0.808 46.745
system.time({
dt <- as.data.table(d)
dt.max <- dt[, .N, by = .(Name, Category)]
dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
dt.max <- dt.max[r == 1, .(Name, Category)]
dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 2.424 0.004 2.426
We can modify the Mode function from here and then do a group by filter
library(dplyr)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
A couple slight tweaks on #mt1022's solution can produce a marginal speedup, nothing to phone home about, but might be of use if you find your data grows another order of magnitude.
library(data.table)
library(dplyr)
d <- data.frame(
Name = as.character(sample.int(10000, 2.5e6, replace = T)),
Category = as.character(sample.int(5000, 2.5e6, replace = T)),
More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 40.459 0.180 40.743
system.time({
dt <- as.data.table(d)
dt.max <- dt[, .N, by = .(Name, Category)]
dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
dt.max <- dt.max[r == 1, .(Name, Category)]
dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 4.196 0.052 4.267
Tweaks include
Use setDT() instead of as.data.table() to avoid making a copy
Using stats::runif() to generate the random tiebreaker directly, this is of what data.table is doing internally in the the random option of frank()
Using setkey() to sort the table
Sub-setting the table by the row indices, .I, where the row within each group is equal to the number of observations, .N in each group. (This returns the last row of each group)
Results:
system.time({
dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
dt.max[,rand := stats::runif(.N)]
setkey(dt.max,Name,Count, rand)
dt.max[dt.max[,.I[.N],by = .(Name,Category)]$V1,.(Name,Category,Count)]
})
# user system elapsed
# 1.722 0.057 1.750
Related
df is data.table and df_expected is desired data.table . I want to add hour column from 0 to 23 and visits value would be filled as 0 for hours newly added .
df<-data.table(customer=c("x","x","x","y","y"),location_id=c(1,1,1,2,3),hour=c(2,5,7,0,4),visits=c(40,50,60,70,80))
df_expected<-data.table(customer=c("x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y"),
location_id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
hour=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
visits=c(0,0,40,0,0,50,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
This is what I tried to obtain my result , but it did not work
df1<-df[,':='(hour=seq(0:23)),by=(customer)]
Error in `[.data.table`(df, , `:=`(hour = seq(0L:23L)), by = (customer)) :
Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact
performance too much for the fastest cases. Either change the type of the target column, or
coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Here's an approach that creates the target and then uses a join to add in the visits information. The ifelse statement just helps up clean up the NA from the merge. You could also leave them in and replace them with := in the new data.table.
target <- data.table(
customer = rep(unique(df$customer), each = 24),
hour = 0:23)
df_join <- df[target, on = c("customer", "hour"),
.(customer, hour, visits = ifelse(is.na(visits), 0, visits))
]
all.equal(df_expected, df_join)
Edit:
This addresses the request to include the location_id column. One way to do this is with by=location in the creation of the target. I've also added in some of the code from chinsoon12's answer.
target <- df[ , .("customer" = rep(unique(customer), each = 24L),
"hour" = rep(0L:23L, times = uniqueN(customer))),
by = location_id]
df_join <- df[target, on = .NATURAL,
.(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, df_join)
Another option using CJ to generate your universe, on=.NATURAL for joining on identically named columns, and fcoalesce to handle NAs:
df[CJ(customer, hour=0L:23L, unique=TRUE), on=.NATURAL, allow.cartesian=TRUE,
.(customer=i.customer, hour=i.hour, visits=fcoalesce(visits, 0))]
here's a for-loop answer.
df_final <- data.table()
for(i in seq(24)){
if(i %in% df[,hour]){
a <- df[hour==i]
}else{
a <- data.table(customer="x", hour=i, visits=0)}
df_final <- rbind(df_final, a)
}
df_final
You can wrap this in another for-loop to have your multiple customers x, y, etc. (the following loop isnt very clean but gets the job done).
df_final <- data.table()
for(j in unique(df[,customer])){
for(i in seq(24)){
if(i %in% df[,hour]){
if(df[hour==i,customer] %in% j){
a <- df[hour==i]
}else{
a <- data.table(customer=j, hour=i, visits=0)
}
}else{
a <- data.table(customer=j, hour=i, visits=0)
}
df_final <- rbind(df_final, a)
}
}
df_final
I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
my objective is to subset data table dt by sampling one row for each letter a-h in column z.
OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
You can use dplyr
library(dplyr)
dt %>%
group_by(z) %%
sample_n(1)
I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
Below is an example on a bigger data.table with grouping by 3 columns. Comparing with #akrun ' solution seems to give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
c2 = sample(LETTERS, replace = TRUE),
c3 = sample(10^5, replace = TRUE),
c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# #akrun' solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)],
y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE
For sampling more than one row per group check here
Updated workflow for dplyr. I added a second column v that can be grouped by z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
group_by(z) %>%
slice_sample(n = 1)
I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:
Data
Given two data.tables (dt and dt_lookup)
library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n),
thisTime=sample(t, n, replace=TRUE),
thisLocation=sample(la,n,replace=TRUE),
finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)
set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
lkpTime=sample(t, 10000, replace=TRUE),
lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposly recycled
setkey(dt_lookup, lkpLocation)
I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime)
Function
## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){
## filter lookup based on thisLocation and finalLocation,
## and only return values where the lkpId has both 'this' and 'final' locations
tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
availServices <- tempThis[tempThis %in% tempFinal]
tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]
## calcualte time difference between 'thisTime' and 'lkpTime' (from thisLocation)
temp2 <- thisTime - tempThisFinal$lkpTime
## take the lkpId with the minimum non-negative difference
selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
selectedId
}
Attempts at a solution
I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) when n/nrow > 1,000,000. So I've tried to implement a data.table solution to see if it's faster:
selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]
However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:
lkpIds <- apply(dt, 1, function(x){
thisLocation <- as.character(x[["thisLocation"]])
finalLocation <- as.character(x[["finalLocation"]])
thisTime <- as.numeric(x[["thisTime"]])
myId <- getId(thisTime, thisLocation, finalLocation)
})
both taking ~30 seconds for n = 10,000.
Question
Is there a better way of using data.table to apply the getId function over each row of dt ?
Update 12/08/2015
Thanks to the pointer from #eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.
Having spent the time since asking this question looking into what data.table has to offer, researching data.table joins thanks to #eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.
One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.
And, there will no doubt be better ways of programming this, but here's my attempt.
## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'
## find all lookup id's where 'thisLocation' matches 'lookupLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
dt_this <- dt[dt_lookup, {
idx = thisTime - i.lkpTime > 0
.(id = id[idx],
lkpId = i.lkpId,
thisTime = thisTime[idx],
lkpTime = i.lkpTime)
},
by=.EACHI]
## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]
## find all matching 'finalLocation' and 'lookupLocaiton'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]
## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)
dt_join <- dt_this[dt_final, nomatch=0]
## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]
dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]
## equivalent dplyr code
# library(dplyr)
# dt_this <- dt_this %>%
# group_by(id) %>%
# arrange(timeDiff) %>%
# slice(1) %>%
# ungroup
I have a data frame that looks something like this (with a lot more observations)
df <- structure(list(session_user_id = c("1803f6c3625c397afb4619804861f75268dfc567",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"198b83b365fef0ed637576fe1bde786fc09817b2", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13",
"1b64c142b1540c43e3f813ccec09cb2dd7907c14", "1b7346d13f714c97725ba2e1c21b600535164291"
), raw_score = c(1, 1, 1, 1, 1, 0.2, NA, 1, 1, 1, 1, 0.2, 1),
submission_time = c(1389707078L, 1389694184L, 1389694188L,
1389694189L, 1389694194L, 1390115495L, 1389696939L, 1389696971L,
1389741306L, 1389985033L, 1389983862L, 1389854836L, 1389692240L
)), .Names = c("session_user_id", "raw_score", "submission_time"
), row.names = 28:40, class = "data.frame")
I want to create a new data frame with only one observation per "session_ user_id" by keeping the one with the latest "submission_time."
The only idea that I have in mind is to create a list of unique users. Write a loop to find the max of submission_time for each user and then write a loop that gets raw score fore that user and time.
Can somebody show me a better way of doing this in R?
Thanks!
You could first order your data.frame by submission_time and remove all duplicated session_user_id entries afterwards:
## order by submission_time
df <- df[order(df$submission_time, decreasing=TRUE),]
## remove duplicated user_id
df <- df[!duplicated(df$session_user_id),]
# session_user_id raw_score submission_time
#33 198b83b365fef0ed637576fe1bde786fc09817b2 0.2 1390115495
#37 19fd8069c094fb0697508cc9646513596bea30c4 1.0 1389985033
#38 1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13 1.0 1389983862
#39 1b64c142b1540c43e3f813ccec09cb2dd7907c14 0.2 1389854836
#28 1803f6c3625c397afb4619804861f75268dfc567 1.0 1389707078
#32 1924cb2ebdf29f052187b9a2d21673e4d314199b 1.0 1389694194
#40 1b7346d13f714c97725ba2e1c21b600535164291 1.0 1389692240
This is simple to express with dplyr: first group by session id, then filter, selecting the row in each group with the maximum time:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
Alternatively, if you don't want to keep all maximum times (if duplicated), you could do:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(row_number(desc(submission_time)) == 1)
I'll add a data.table solution as well, and out of curiosity benchmark against dplyr on bigger data:
require(data.table)
DT <- as.data.table(df)
DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
Here I'm assuming that the OP needs just one observation, even for multiple identical "max" values. If not, check out the function f2 below.
Benchmarks on bigger data vs dplyr:
Benchmark against #hadley's dplyr solutions on bigger data. I'll assume there are about 50e3 user ids and there are a total of 1e7 rows.
require(data.table) # 1.8.11 commit 1142
require(dplyr) # latest commit from github
set.seed(45L)
DT <- data.table(session_user_id = sample(paste0("id", 1:5e4), 1e7, TRUE),
raw_score = sample(10, 1e7, TRUE),
submission_time = sample(1e5:5e5, 1e7, TRUE))
DF <- tbl_df(as.data.frame(DT))
f1 <- function(DT) {
DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
}
f2 <- function(DT) {
DT[DT[, .I[submission_time == max(submission_time)],
by=list(session_user_id)]$V1]
}
f3 <- function(DF) {
DF %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
}
f4 <- function(DF) {
DF %.%
group_by(session_user_id) %.%
filter(row_number(desc(submission_time)) == 1)
}
And here are the timings. All are minimum of three runs:
system.time(a1 <- f1(DT))
# user system elapsed
# 1.044 0.056 1.101
system.time(a2 <- f2(DT))
# user system elapsed
# 1.384 0.080 1.475
system.time(a3 <- f3(DF))
# user system elapsed
# 4.513 0.044 4.555
system.time(a4 <- f4(DF))
# user system elapsed
# 6.312 0.004 6.314
As expected f4 is the slowest because it uses desc (which I'm guessing somehow involves in an ordering or sorting per group - a more computationally expensive operation than just getting max or which.max).
Here, a1 and a4 (only one observation even if multiple max values are present) give identical results and so does a2 and a3 (all max values).
data.table is at least 3x faster here (comparing a2 to a3) and about 5.7x times faster when comparing f1 to f4.
You could use the "plyr' package to summarize the data. Something like this should work
max_subs<-ddply(df,"session_user_id",summarize,max_sub=max(submission_time))
ddply takes a data frame in and returns a data frame, and this will give you the user and submission times you want.
To return the original data frame rows corresponding to these you could do
df2<-df[df$session_user_id %in% max_subs$session_user_id & df$submission_time %in% max_subs$max_sub,]
First find the max submission time by session_user_id. This table will be unique by session_user_id.
Then just merge (sql-speak: inner join) back to your original table joining on submission_time & session_user_id (R automatically picks up common names across the two data frames).
maxSessions<-aggregate(submission_time~session_user_id , df, max)
mySubset<-merge(df, maxSessions)
mySubset #this table has the data your are looking for
If you are looking for speed and your dataset is large then have a look at this How to summarize data by group in R? data.table & plyr are good choices.
This is just an extended comment because I was interested in how fast each of the solutions were
library(microbenchmark)
library(plyr)
library(dplyr)
library(data.table)
df <- df[sample(1:nrow(df),10000,replace=TRUE),] # 10k records
fun.test1 <- function(df) {
df <- df[order(df$submission_time, decreasing = TRUE),]
df <- df[!duplicated(df$session_user_id),]
return(df)
}
fun.test2 <- function(df) {
max_subs<-ddply(df,"session_user_id",summarize,max_sub=max(submission_time))
df2<-df[df$session_user_id %in% max_subs$session_user_id &
df$submission_time %in% max_subs$max_sub,]
return(df2)
}
fun.test3 <- function(df) {
df <- df %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
return(df)
}
fun.test4 <- function(df) {
maxSessions<-aggregate(submission_time~session_user_id , df, max)
mySubset<-merge(df, maxSessions)
return(mySubset)
}
fun.test5 <- function(df) {
df <- df[df$submission_time %in% by(df, df$session_user_id,
function(x) max(x$submission_time)),]
return(df)
}
dt <- as.data.table(df) # Assuming you're working with data.table to begin with
# Don't know a lot about data.table so I'm sure there's a faster solution
fun.test6 <- function(dt) {
dt <- unique(
dt[,
list(raw_score,submission_time=max(submission_time)),
by=session_user_id]
)
return(dt)
}
Looks like the most basic solution with !duplicated() wins by a significant margin for small data (Under 1k), followed by dplyr. dplyr wins for large samples (over 1k).
microbenchmark(
fun.test1(df),
fun.test2(df),
fun.test3(df),
fun.test4(df),
fun.test5(df),
fun.test6(dt)
)
expr min lq median uq max neval
fun.test1(df) 2476.712 2660.0805 2740.083 2832.588 9162.339 100
fun.test2(df) 5847.393 6215.1420 6335.932 6477.745 12499.775 100
fun.test3(df) 815.886 924.1405 1003.585 1050.169 1128.915 100
fun.test4(df) 161822.674 167238.5165 172712.746 173254.052 225317.480 100
fun.test5(df) 5611.329 5899.8085 6000.555 6120.123 57572.615 100
fun.test6(dt) 511481.105 541534.7175 553155.852 578643.172 627739.674 100
I have a large data table in R:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)
I'd like to pivot this data.table, such that Category becomes a column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.
Any ideas how I might do this?
/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.
data.table implements faster versions of melt/dcast data.table specific methods (in C). It also adds additional features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.
Note that we don't need to load reshape2 package.
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802
Like that?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
There is no data.table specific wide reshaping method.
Here is an approach that will work, but it is rather convaluted.
There is a feature request #2619 Scoping for LHS in :=to help with making this more straightforward.
Here is a simple example
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <-1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so should be
# quick (the feature request would make this possible to
# speed up using binary search joins.
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NA
EDIT
I found this SO post, which includes a better way to insert the
missing rows into a data.table. Function fun_DT adjusted
accordingly. Code is cleaner now; I don't see any speed improvements
though.
See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution here (creating an ID2 first, then creating all ID2-Category combination, then filling up the data.table, then doing the reshaping).
I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.
The solutions are roughly the same speed wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase the n any further...computer crashed to often already ;-)
library(data.table)
library(rbenchmark)
fun_reshape <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
}
#UPDATED!
fun_DT <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
agg[, ID2 := paste(ID, Month, sep="_")]
setkey(agg, ID2, Category)
agg <- agg[CJ(unique(ID2), unique(Category))]
agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
}
library(rbenchmark)
n <- 1e+07
benchmark(replications=10,
fun_reshape(n),
fun_DT(n))
test replications elapsed relative user.self sys.self user.child sys.child
2 fun_DT(n) 10 45.868 1 43.154 2.524 0 0
1 fun_reshape(n) 10 45.874 1 42.783 2.896 0 0