Detecting columns containing any value quickly with grep - r

I have a large dataset, 5000 variables and 3 million rows. I want to check what columns contain dates. I'm working with data.table and reading the data with fread. In order to know what columns contain dates I run this:
my[, lapply(.SD,function(xx)
length(grep("^\\d\\d?/\\d\\d?/\\d{4}$",xx))>0 ) ]
or the same with any(grepl())
But it's very slow.
Is there any way to do it faster? Maybe forcing grep to stop the first time it encounters a date? I think (command line) grep has an option to do it:
grep -m 1
But I think it's not available in R.
Any idea? Solutions with base R or other packages are also welcome.
I could also work only with a few rows of the data.table but some columns could have very little values different than NA and there are chances of missing them.
Very simple example with some NA:
library(data.table)
set.seed(1)
siz <- 10000000
my <- data.table(
AA=c(rep(NA,siz-1),"11/11/2001"),
BB=sample(c("wrong", "11/11/2001"),siz, prob=c(1000000,1), replace=T),
CC=sample(siz),
DD=rep("11/11/2001",siz),
EE=rep("HELLO", siz)
)
I've seen there is an option perl = FALSE but I don't know wheter it will allow me to add extra parameters.
Or similarly I want to know among the files supposed to be dates whether there are strange symbols. I could run grep on every column but it would be great to be able to stop as soon as my test is right, without continuing till the end of the column.
Maybe with some extra package such as stringi?

One option would be to check only the first row (assuming that if there is a 'Date' class it would pick it up unless the first one is a missing value)
my[1][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))]
In addition to the above, as #Frank mentioned we can check only a subset of character class columns instead of the whole columns by specifying the .SDcols
j1 <- sapply(my, is.character)
my[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1),
.SDcols = j1]
Benchmarks
set.seed(24)
dat <- data.table(col1 = rnorm(1e6), col2 = "05/05/1942",
col3 = rnorm(1e6))
system.time(res <- dat[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1)])
# user system elapsed
# 6.33 0.01 6.35
system.time(res2 <- dat[1][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))])
# user system elapsed
# 0 0 0
system.time({
j1 <- sapply(dat, is.character)
res3 <- dat[, lapply(.SD, function(x)
length(grep("\\d{2}/\\d{2}/\\d{4}", x))>1), .SDcols = j1]
res3 <- names(dat) %in% names(res3)
})
# user system elapsed
# 0.43 0.00 0.44
all.equal(unlist(res), res2, check.attributes = FALSE)
#[1] TRUE
all.equal(unlist(res), res3, check.attributes=FALSE)
#[1] TRUE
If there are lots of NAs, then we can check on the first row where it has all non-NA elements
set.seed(24)
dat <- data.table(col1 = sample(c(NA, 1:10), 1e6, replace=TRUE),
col2 = c(NA, "05/05/1942"),
col3 = sample(c(NA, 1:5), 1e6, replace=TRUE))
dt1 <- head(dat, 20)
#Or just a sample of 20 rows from the dataset
#dt1 <- dat[sample(1:.N, 20, replace=TRUE)]
dt1[dt1[, which(!Reduce(`|`, lapply(.SD, is.na)))[1]]
][, grepl("\\d{2}/\\d{2}/\\d{4}", unlist(.SD))]

Related

R: Suggestion to speed up a function (remove duplicates in data frame)

I run into a bit of trouble with my code and would welcome any suggestion to make it run faster.
I have a data frame that looks like that:
Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
d <- data.frame(Name,Category,More_info)
So I have duplicated entries for each row in column Name (the number of duplicates can vary). For each entry (a,b,...) I want to count the sum of each corresponding element in the Category column and keep the only category that appears the most. If an entry has an equal number of categories, I want to take one of most categories randomly.
So in this case, the output dataframe would look like this:
Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)
a have sun entry kept because it appears the most, b would be dog or whatever other value as they all appear once with b, and c wouldn't be changed.
My function looks like this:
my_choosing_function <- function(x){
tmp = dbSNP_hapmap[dbSNP_hapmap$refsnp_id==list_of_snps[x],]
snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
best_hit <- snp_freq[order(-snp_freq$Freq),]
best_hit$SNP<-list_of_snps[x]
top<-best_hit[1,]
return(top)
}
trst <- lapply(1:length(list_of_snps), function(x) my_choosing_function(x))
final <- do.call("rbind",trst)
Where I start from a list of unique elements (that would be Name in our case), for each element I do a table of the duplicated entries, I order the table by descending values and keep the top element. I do a lapply for each element of the list of unique values, then do a rbind of the whole thing.
As I have 2500000 rows in my initial data frame and 1500000 unique elements, it takes forever to run. 4 seconds for 100 lines, that would be a total of 34 hours for the lapply.
I'm sure packages like dplyr can do it in a few minutes but can't find a solution to do it. Anyone has an idea?
Thanks a lot for your help!
Note: This should be a very long comment because I use data.table instead of dplyr.
I suggest use data.table because it runs faster. And in the data.table way shown below, it randomly choose one in case of tie, not always the first one.
library(data.table)
library(dplyr)
library(microbenchmark)
d <- data.frame(
Name = as.character(sample.int(10000, 2.5e6, replace = T)),
Category = as.character(sample.int(10000, 2.5e6, replace = T)),
More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 45.932 0.808 46.745
system.time({
dt <- as.data.table(d)
dt.max <- dt[, .N, by = .(Name, Category)]
dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
dt.max <- dt.max[r == 1, .(Name, Category)]
dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 2.424 0.004 2.426
We can modify the Mode function from here and then do a group by filter
library(dplyr)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
A couple slight tweaks on #mt1022's solution can produce a marginal speedup, nothing to phone home about, but might be of use if you find your data grows another order of magnitude.
library(data.table)
library(dplyr)
d <- data.frame(
Name = as.character(sample.int(10000, 2.5e6, replace = T)),
Category = as.character(sample.int(5000, 2.5e6, replace = T)),
More_info = rep('table', 2.5e6)
)
Mode <- function(x) {
ux <- unique(x)
fr1 <- tabulate(match(x, ux))
if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}
system.time({
d %>%
group_by(Name) %>%
slice(which(Category == Mode(Category))[1])
})
# user system elapsed
# 40.459 0.180 40.743
system.time({
dt <- as.data.table(d)
dt.max <- dt[, .N, by = .(Name, Category)]
dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
dt.max <- dt.max[r == 1, .(Name, Category)]
dt[dt.max, on = .(Name, Category), mult = 'first']
})
# user system elapsed
# 4.196 0.052 4.267
Tweaks include
Use setDT() instead of as.data.table() to avoid making a copy
Using stats::runif() to generate the random tiebreaker directly, this is of what data.table is doing internally in the the random option of frank()
Using setkey() to sort the table
Sub-setting the table by the row indices, .I, where the row within each group is equal to the number of observations, .N in each group. (This returns the last row of each group)
Results:
system.time({
dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
dt.max[,rand := stats::runif(.N)]
setkey(dt.max,Name,Count, rand)
dt.max[dt.max[,.I[.N],by = .(Name,Category)]$V1,.(Name,Category,Count)]
})
# user system elapsed
# 1.722 0.057 1.750

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a vector of the above values split use strsplit. I then loop through the entire database, getting the ith value from those three vectors, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns, and want to do the split only for the 3 columns, we can specify the .SDcols= 1:3 i.e. column index or the actual column names, assign (:=) the output to new columns and subset the columns that are only needed in the output.
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name so may have to rename it/quote it depending on how exactly r has it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from #akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( af, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)

Summarizing by groups applying function which involves the next group

Let's assume I have the following data:
set.seed(1)
test <- data.frame(letters=rep(c("A","B","C","D"),10), numbers=sample(1:50, 40, replace=TRUE))
I want to know how many numbers whose letter is A are not in B, how many numbers of B are not in C and so on.
I came up with a solution for this using base functions split and mapply:
s.test <-split(test, test$letters)
notIn <- mapply(function(x,y) sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers), x=names(s.test)[1:3], y=names(s.test)[2:4])
Which gives:
> notIn
A B C
9 7 7
But I would also like to do this with dplyr or data.table. Is it possible?
The bottleneck seems to be in split. When simulated on 200 groups and 150,000 observations each, split takes 50 seconds out of the total 54 seconds.
The split step can be made drastically faster using data.table as follows.
## test is a data.table here
s.test <- test[, list(list(.SD)), by=letters]$V1
Here's a benchmark on data of your dimensions using data.table + mapply:
## generate data
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE),
numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)
require(data.table) ## latest CRAN version is v1.9.2
setDT(test) ## convert to data.table by reference (no copy)
system.time({
s.test <- test[, list(list(.SD)), by=letters]$V1 ## split
setattr(s.test, 'names', unique(test$letters)) ## setnames
notIn <- mapply(function(x,y)
sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers),
x=names(s.test)[1:199], y=names(s.test)[2:200])
})
## user system elapsed
## 4.840 1.643 6.624
That's about ~7.5x speedup on your biggest data dimensions. Would this be sufficient?
This seems to give about the same speedup as with data.table but only uses base R. Instead of splitting the data frame it splits the numbers column only (in line marked ##):
## generate data - from Arun's post
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE),
numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)
system.time({
s.numbers <- with(test, split(numbers, letters)) ##
notIn <- mapply(function(x,y)
sum(!s.numbers[[x]] %in% s.numbers[[y]]),
x=names(s.numbers)[1:199], y=names(s.numbers)[2:200])
})

R subset unique observation keeping last entry

I have a data frame that looks something like this (with a lot more observations)
df <- structure(list(session_user_id = c("1803f6c3625c397afb4619804861f75268dfc567",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"1924cb2ebdf29f052187b9a2d21673e4d314199b", "1924cb2ebdf29f052187b9a2d21673e4d314199b",
"198b83b365fef0ed637576fe1bde786fc09817b2", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "19fd8069c094fb0697508cc9646513596bea30c4",
"19fd8069c094fb0697508cc9646513596bea30c4", "1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13",
"1b64c142b1540c43e3f813ccec09cb2dd7907c14", "1b7346d13f714c97725ba2e1c21b600535164291"
), raw_score = c(1, 1, 1, 1, 1, 0.2, NA, 1, 1, 1, 1, 0.2, 1),
submission_time = c(1389707078L, 1389694184L, 1389694188L,
1389694189L, 1389694194L, 1390115495L, 1389696939L, 1389696971L,
1389741306L, 1389985033L, 1389983862L, 1389854836L, 1389692240L
)), .Names = c("session_user_id", "raw_score", "submission_time"
), row.names = 28:40, class = "data.frame")
I want to create a new data frame with only one observation per "session_ user_id" by keeping the one with the latest "submission_time."
The only idea that I have in mind is to create a list of unique users. Write a loop to find the max of submission_time for each user and then write a loop that gets raw score fore that user and time.
Can somebody show me a better way of doing this in R?
Thanks!
You could first order your data.frame by submission_time and remove all duplicated session_user_id entries afterwards:
## order by submission_time
df <- df[order(df$submission_time, decreasing=TRUE),]
## remove duplicated user_id
df <- df[!duplicated(df$session_user_id),]
# session_user_id raw_score submission_time
#33 198b83b365fef0ed637576fe1bde786fc09817b2 0.2 1390115495
#37 19fd8069c094fb0697508cc9646513596bea30c4 1.0 1389985033
#38 1a3d33c9cbb2aa41515e6ef76f123b2ea8ee2f13 1.0 1389983862
#39 1b64c142b1540c43e3f813ccec09cb2dd7907c14 0.2 1389854836
#28 1803f6c3625c397afb4619804861f75268dfc567 1.0 1389707078
#32 1924cb2ebdf29f052187b9a2d21673e4d314199b 1.0 1389694194
#40 1b7346d13f714c97725ba2e1c21b600535164291 1.0 1389692240
This is simple to express with dplyr: first group by session id, then filter, selecting the row in each group with the maximum time:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
Alternatively, if you don't want to keep all maximum times (if duplicated), you could do:
library(dplyr)
df %.%
group_by(session_user_id) %.%
filter(row_number(desc(submission_time)) == 1)
I'll add a data.table solution as well, and out of curiosity benchmark against dplyr on bigger data:
require(data.table)
DT <- as.data.table(df)
DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
Here I'm assuming that the OP needs just one observation, even for multiple identical "max" values. If not, check out the function f2 below.
Benchmarks on bigger data vs dplyr:
Benchmark against #hadley's dplyr solutions on bigger data. I'll assume there are about 50e3 user ids and there are a total of 1e7 rows.
require(data.table) # 1.8.11 commit 1142
require(dplyr) # latest commit from github
set.seed(45L)
DT <- data.table(session_user_id = sample(paste0("id", 1:5e4), 1e7, TRUE),
raw_score = sample(10, 1e7, TRUE),
submission_time = sample(1e5:5e5, 1e7, TRUE))
DF <- tbl_df(as.data.frame(DT))
f1 <- function(DT) {
DT[DT[, .I[which.max(submission_time)], by=list(session_user_id)]$V1]
}
f2 <- function(DT) {
DT[DT[, .I[submission_time == max(submission_time)],
by=list(session_user_id)]$V1]
}
f3 <- function(DF) {
DF %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
}
f4 <- function(DF) {
DF %.%
group_by(session_user_id) %.%
filter(row_number(desc(submission_time)) == 1)
}
And here are the timings. All are minimum of three runs:
system.time(a1 <- f1(DT))
# user system elapsed
# 1.044 0.056 1.101
system.time(a2 <- f2(DT))
# user system elapsed
# 1.384 0.080 1.475
system.time(a3 <- f3(DF))
# user system elapsed
# 4.513 0.044 4.555
system.time(a4 <- f4(DF))
# user system elapsed
# 6.312 0.004 6.314
As expected f4 is the slowest because it uses desc (which I'm guessing somehow involves in an ordering or sorting per group - a more computationally expensive operation than just getting max or which.max).
Here, a1 and a4 (only one observation even if multiple max values are present) give identical results and so does a2 and a3 (all max values).
data.table is at least 3x faster here (comparing a2 to a3) and about 5.7x times faster when comparing f1 to f4.
You could use the "plyr' package to summarize the data. Something like this should work
max_subs<-ddply(df,"session_user_id",summarize,max_sub=max(submission_time))
ddply takes a data frame in and returns a data frame, and this will give you the user and submission times you want.
To return the original data frame rows corresponding to these you could do
df2<-df[df$session_user_id %in% max_subs$session_user_id & df$submission_time %in% max_subs$max_sub,]
First find the max submission time by session_user_id. This table will be unique by session_user_id.
Then just merge (sql-speak: inner join) back to your original table joining on submission_time & session_user_id (R automatically picks up common names across the two data frames).
maxSessions<-aggregate(submission_time~session_user_id , df, max)
mySubset<-merge(df, maxSessions)
mySubset #this table has the data your are looking for
If you are looking for speed and your dataset is large then have a look at this How to summarize data by group in R? data.table & plyr are good choices.
This is just an extended comment because I was interested in how fast each of the solutions were
library(microbenchmark)
library(plyr)
library(dplyr)
library(data.table)
df <- df[sample(1:nrow(df),10000,replace=TRUE),] # 10k records
fun.test1 <- function(df) {
df <- df[order(df$submission_time, decreasing = TRUE),]
df <- df[!duplicated(df$session_user_id),]
return(df)
}
fun.test2 <- function(df) {
max_subs<-ddply(df,"session_user_id",summarize,max_sub=max(submission_time))
df2<-df[df$session_user_id %in% max_subs$session_user_id &
df$submission_time %in% max_subs$max_sub,]
return(df2)
}
fun.test3 <- function(df) {
df <- df %.%
group_by(session_user_id) %.%
filter(submission_time == max(submission_time))
return(df)
}
fun.test4 <- function(df) {
maxSessions<-aggregate(submission_time~session_user_id , df, max)
mySubset<-merge(df, maxSessions)
return(mySubset)
}
fun.test5 <- function(df) {
df <- df[df$submission_time %in% by(df, df$session_user_id,
function(x) max(x$submission_time)),]
return(df)
}
dt <- as.data.table(df) # Assuming you're working with data.table to begin with
# Don't know a lot about data.table so I'm sure there's a faster solution
fun.test6 <- function(dt) {
dt <- unique(
dt[,
list(raw_score,submission_time=max(submission_time)),
by=session_user_id]
)
return(dt)
}
Looks like the most basic solution with !duplicated() wins by a significant margin for small data (Under 1k), followed by dplyr. dplyr wins for large samples (over 1k).
microbenchmark(
fun.test1(df),
fun.test2(df),
fun.test3(df),
fun.test4(df),
fun.test5(df),
fun.test6(dt)
)
expr min lq median uq max neval
fun.test1(df) 2476.712 2660.0805 2740.083 2832.588 9162.339 100
fun.test2(df) 5847.393 6215.1420 6335.932 6477.745 12499.775 100
fun.test3(df) 815.886 924.1405 1003.585 1050.169 1128.915 100
fun.test4(df) 161822.674 167238.5165 172712.746 173254.052 225317.480 100
fun.test5(df) 5611.329 5899.8085 6000.555 6120.123 57572.615 100
fun.test6(dt) 511481.105 541534.7175 553155.852 578643.172 627739.674 100

Pivot a large data.table

I have a large data table in R:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)
I'd like to pivot this data.table, such that Category becomes a column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.
Any ideas how I might do this?
/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.
data.table implements faster versions of melt/dcast data.table specific methods (in C). It also adds additional features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.
Note that we don't need to load reshape2 package.
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802
Like that?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
There is no data.table specific wide reshaping method.
Here is an approach that will work, but it is rather convaluted.
There is a feature request #2619 Scoping for LHS in :=to help with making this more straightforward.
Here is a simple example
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <-1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so should be
# quick (the feature request would make this possible to
# speed up using binary search joins.
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NA
EDIT
I found this SO post, which includes a better way to insert the
missing rows into a data.table. Function fun_DT adjusted
accordingly. Code is cleaner now; I don't see any speed improvements
though.
See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution here (creating an ID2 first, then creating all ID2-Category combination, then filling up the data.table, then doing the reshaping).
I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.
The solutions are roughly the same speed wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase the n any further...computer crashed to often already ;-)
library(data.table)
library(rbenchmark)
fun_reshape <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
}
#UPDATED!
fun_DT <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
agg[, ID2 := paste(ID, Month, sep="_")]
setkey(agg, ID2, Category)
agg <- agg[CJ(unique(ID2), unique(Category))]
agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
}
library(rbenchmark)
n <- 1e+07
benchmark(replications=10,
fun_reshape(n),
fun_DT(n))
test replications elapsed relative user.self sys.self user.child sys.child
2 fun_DT(n) 10 45.868 1 43.154 2.524 0 0
1 fun_reshape(n) 10 45.874 1 42.783 2.896 0 0

Resources