I'd like to use the diff function on a really big data.frame: 140 million rows and two columns.
The goal is to compute the gap between two consecutive activity dates for each user_id.
For each user, the first activity has no previous one, so I need an NA value there.
I used this function, and it works for a small dataset, but with the big one it's really slow. I've been waiting since yesterday, and it's still running.
df2 <- as.vector(unlist(tapply(df$DATE,df$user_id, FUN=function(x){ return (c(NA,diff(x)))})))
I have a lot of memory (24 GB) and a 4-core CPU, but only one core is working.
How can I handle this problem? Would it be better to convert the data frame to a matrix?
Here is an example using some sample data: first a dataset of 10 million rows (100 users, 100,000 time points each), then one of 140 million rows (1,400 users, same number of time points each). This transposes the time points to the columns. I imagine that transposing users to columns would be even faster. I used @Arun's answer here as a template. Basically it shows that on a really big table you can do it on a single core (i7 2.6 GHz) in < 90 seconds (and that is using code which is probably not fully optimised):
require(data.table)
## Smaller sample dataset - 10 million row, 100 users, 100,000 time points each
DT <- data.table( Date = sample(100,1e7,repl=TRUE) , User = rep(1:100,each=1e5) )
## Size of table in memory
tables()
# NAME NROW MB COLS KEY
#[1,] DT 10,000,000 77 Date,User
#Total: 77MB
## Diff by user
dt.test <- quote({
DT2 <- DT[ , list(Diff=diff(c(0,Date))) , by=list(User) ]
DT2 <- DT2[, as.list(setattr(Diff, 'names', 1:length(Diff))) , by = list(User)]
})
## Benchmark it
require(microbenchmark)
microbenchmark( eval(dt.test) , times = 5L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 5.788364 5.825788 5.9295 5.942959 6.109157 5
## And with 140 million rows...
DT <- data.table( Date = sample(100,1.4e8,repl=TRUE) , User = rep(1:1400,each=1e5) )
#tables()
# NAME NROW MB
#[1,] DT 140,000,000 1069
microbenchmark( eval(dt.test) , times = 1L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 84.3689 84.3689 84.3689 84.3689 84.3689 1
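If the NA for each user's first activity matters (as the question asks), one possible variant is to lag the column within each group rather than padding with 0. This is only a minimal sketch on made-up data, assuming the table is already ordered by user and date; it uses data.table's shift() and was not part of the benchmark above.

```r
library(data.table)

# Small illustrative table (hypothetical data), already ordered by User and Date
DT_small <- data.table(User = c(1L, 1L, 1L, 2L, 2L),
                       Date = c(3L, 5L, 10L, 1L, 4L))

# Gap to the previous activity within each user. shift() lags Date by one row
# inside each group, so the first row of every user gets NA.
DT_small[, Diff := Date - shift(Date), by = User]

DT_small$Diff
```

The column is added by reference, so no copy of the table is made.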
This is a lot faster if you avoid tapply altogether, which is fairly easy because your tapply call assumes the data are already sorted by user_id and DATE.
set.seed(21)
N <- 1e6
Data <- data.frame(DATE=Sys.Date()-sample(365,N,TRUE),
USER=sample(1e3,N,TRUE))
Data <- Data[order(Data$USER,Data$DATE),]
system.time({
Data$DIFF <- unlist(tapply(Data$DATE,Data$USER, function(x) c(NA,diff(x))))
})
# user system elapsed
# 1.58 0.00 1.59
Data2 <- Data
system.time({
Data2$DIFF <- c(NA,diff(Data2$DATE))
is.na(Data2$DIFF) <- which(c(NA,diff(Data2$USER))==1)
})
# user system elapsed
# 0.12 0.00 0.12
identical(Data,Data2)
# [1] TRUE
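Note that the ==1 test above relies on the user ids being consecutive integers. A sketch of the same idea that only assumes the data are sorted by user and date (the toy data and the names Data3, gap and first_of_user are made up for illustration):

```r
# Hypothetical toy data with non-consecutive user ids, sorted by USER then DATE
Data3 <- data.frame(
  USER = c(10, 10, 10, 42, 42, 97),
  DATE = as.Date("2020-01-01") + c(0, 3, 4, 1, 8, 2)
)

# Difference over the whole column, ignoring user boundaries for now
gap <- c(NA, diff(as.numeric(Data3$DATE)))

# A row starts a new user whenever its USER differs from the row above it
first_of_user <- c(TRUE, Data3$USER[-1] != Data3$USER[-nrow(Data3)])

# Blank out the gap on the first row of every user
gap[first_of_user] <- NA

Data3$DIFF <- gap
```

Comparing each id with the one above it works for any id type (integer, character, factor), not just ids that increase by one.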
I need to get only distinct values that are spread over two columns and return the distinct values into one column.
Example:
colA colB
---- --------
darcy elizabeth
elizabeth darcy
jon doe
doe jon
It should return:
resultCol
darcy
elizabeth
jon
doe
Is there any builtin function or library that can do that more efficiently?
I tried a workaround to get the results, but it is extremely slow for more than 100 thousand observations.
#First i create a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Then i create an empty dataframe to store all values in a single column
emptyDataframe<-data.frame(resultColumn=character())
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,1])),stringsAsFactors=FALSE)
}
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,2])),stringsAsFactors=FALSE)
}
emptyDataframe
#Finally I get the distinct values using dplyr
library(dplyr)
var_distinct_values<-distinct(emptyDataframe)
I use union to get unique values across specific columns:
with(dfSample, union(col1, col2))
PS: The answer from d.b in the comments is also another way.
You can adapt his answer if you have extra columns but want to run it only over specific columns:
unique(unlist(dfSample[1:2]))
This gets the unique values from first two columns.
Here is a general purpose solution.
It's based on this answer but can be extended to any number of columns as long as the object is a data.frame or list.
Reduce(union, dfSample)
[1] "darcy" "elizabeth" "john" "doe"
Now with 100K observations in each of 10 columns.
set.seed(1234)
n <- 1e5
bigger <- replicate(n, sample(c(col1, col2), 10, TRUE))
bigger <- as.data.frame(bigger)
system.time(Reduce(union, bigger))
# user system elapsed
# 3.769 0.000 3.772
Edit.
On second thought, I realized that the test above uses a data frame with very few distinct values. A test with more distinct values does not necessarily give the same results.
set.seed(1234)
s <- sprintf("%05d", 1:5000)
big2 <- replicate(n, sample(s, 10, TRUE))
big2 <- as.data.frame(big2)
rm(s)
microbenchmark::microbenchmark(
red = Reduce(union, big2),
uniq = unique(unlist(big2)),
times = 10
)
#Unit: seconds
# expr min lq mean median uq max neval cld
# red 26.021855 26.42693 27.470746 27.198807 28.56720 29.022047 10 b
# uniq 1.405091 1.42978 1.632265 1.548753 1.56691 2.693431 10 a
The unique/unlist solution is now clearly better.
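For completeness, here is a minimal sketch of the unique/unlist approach applied to the question's sample data, returning the one-column result that was asked for. The names resultCol and result are just illustrative, and stringsAsFactors = FALSE is set explicitly so the columns are plain character vectors on any R version.

```r
# Sample data from the question
col1 <- c("darcy", "elizabeth", "elizabeth", "darcy", "john", "doe")
col2 <- c("elizabeth", "darcy", "darcy", "elizabeth", "doe", "john")
dfSample <- data.frame(col1, col2, stringsAsFactors = FALSE)

# Stack the chosen columns into one vector, then keep the distinct values
resultCol <- unique(unlist(dfSample[c("col1", "col2")], use.names = FALSE))

# One-column data frame holding the distinct values
result <- data.frame(resultCol, stringsAsFactors = FALSE)
```

Values appear in order of first occurrence, reading down col1 and then col2.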
Suppose I have a dict table like:
id value
1 168833
2 367656
3 539218
4 892211
......(millions of lines)
and an original data frame like:
name code
Abo 1
Cm3 2
LL2 6
JJ 15
How can I replace the code column in the original table with values from the dictionary table without using join or merge?
We can use match from base R
df1$value[match(df2$code, df1$id)]
Or another option is hashmap
library(hashmap)
hp <- hashmap(df1$id, df1$value)
hp[[df2$code]]
Based on the example in ?hashmap, it works faster
microbenchmark::microbenchmark(
"R" = y[match(z, x)],
"H" = H[[z]],
times = 500L
)
#Unit: microseconds
# expr min lq mean median uq max neval
# R 154.197 202.1625 240.5838 229.1625 245.1735 6853.756 500
# H 15.861 19.0235 22.7721 22.4490 24.9670 62.230 500
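A small worked sketch of the match approach on the question's example rows, assuming df1 is the dictionary and df2 the original table (the data frames here are cut-down stand-ins). It also shows what happens to codes such as 6 and 15 that have no dictionary entry:

```r
# Hypothetical small versions of the dictionary (df1) and the original table (df2)
df1 <- data.frame(id = 1:4, value = c(168833, 367656, 539218, 892211))
df2 <- data.frame(name = c("Abo", "Cm3", "LL2", "JJ"), code = c(1, 2, 6, 15))

# Position of each code among the dictionary ids; NA when the code is not listed
pos <- match(df2$code, df1$id)

# Overwrite code with the looked-up value; unmatched codes become NA
df2$code <- df1$value[pos]
```

If unmatched codes should keep their original value instead of becoming NA, they can be restored afterwards by indexing with is.na(pos).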
I have the following dataset. I want to create a column so that if there is a number in the unid column, dat$identification says "unidentified"; otherwise it should contain whatever is in the species column. So the final output of dat$identification should be x,y,unidentified,unidentified. With this code it shows 1,2,unidentified,unidentified.
Please note, for other purposes I want to use only the unid column for the !is.na part of the ifelse statement and not the species column.
unid <- c(NA,NA,1,4)
species <- c("x","y",NA,NA)
df <- data.frame(unid, species)
df$identification <- ifelse(!is.na(unid), "unidentified", df$species)
#Current Output of df$identification:
1,2,unidentified,unidentified
#Needed Output
x,y,unidentified,unidentified
You can coerce the column of class factor to class character in the ifelse.
df$identification <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df
# unid species identification
#1 NA x x
#2 NA y y
#3 1 <NA> unidentified
#4 4 <NA> unidentified
Edit.
After the OP accepted the answer, I reminded myself that ifelse is slow and indexing fast, so I tested both using a larger dataset.
First of all, see if both solutions produce the same results:
df$id1 <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df$id2 <- "unidentified"
df$id2[is.na(unid)] <- species[is.na(unid)]
identical(df$id1, df$id2)
#[1] TRUE
The results are the same.
Now time them both using package microbenchmark.
n <- 1e4
df1 <- data.frame(unid = rep(unid, n), species = rep(species, n))
microbenchmark::microbenchmark(
ifelse = {df1$id1 <- ifelse(!is.na(df1$unid), "unidentified", as.character(df1$species))},
index = {df1$id2 <- "unidentified"
df1$id2[is.na(df1$unid)] <- as.character(df1$species[is.na(df1$unid)])
},
relative = TRUE
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# ifelse 12502465 12749881 16080160.39 14365841 14507468.5 85836870 100 c
# index 3243697 3299628 4575818.33 3326692 4983170.0 74526390 100 b
#relative 67 68 208.89 228 316.5 540 100 a
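The relative row is not a comparison: microbenchmark has no relative argument, so relative = TRUE was timed as a third expression and can be ignored. Comparing the ifelse and index rows, indexing is about 3.5 times faster on average (about 4.3 times at the median). Still worth the trouble to write two lines of code instead of just one for ifelse.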
I have a data frame with dimensions of 2377426 rows by 2 columns, which looks something like this:
Name Seq
428293 ENSE00001892940:ENSE00001929862 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
431857 ENSE00001892940:ENSE00001883352 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
432253 ENSE00001892940:ENSE00003623668 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
436213 ENSE00001892940:ENSE00003534967 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
429778 ENSE00001892940:ENSE00002409454 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
431263 ENSE00001892940:ENSE00001834214 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
All the values in the first column (Name) are unique, but there are many duplicates in the column 'Seq'.
I want a data.frame which only contains unique sequences and a name. I have tried unique but this is too slow. I have also tried ordering the database and using the following code:
dat_sorted = data[order(data$Seq),]
m = dat_sorted[1,]
x =1;for(i in 1:length(dat_sorted[,1])){if(dat_sorted[i,2]!=m[x,2]){x=x+1;m[x,]=dat_sorted[i,]}}
Again this is too slow!
Is there a faster way to find unique values in one column of a data frame?
data[!duplicated(data$Seq), ]
should do the trick.
library(dplyr)
data %>% distinct(Seq, .keep_all = TRUE)
This is worth trying, especially if your data is very large for your machine.
For the fastest, you can try:
data[!kit::fduplicated(data$Seq), ]
Here are some benchmarks taken directly from the documentation:
x = sample(c(1:10,NA_integer_),1e8,TRUE) # 382 Mb
microbenchmark::microbenchmark(
duplicated(x),
fduplicated(x),
times = 5L
)
# Unit: seconds
# expr min lq mean median uq max neval
# duplicated(x) 2.21 2.21 2.48 2.21 2.22 3.55 5
# fduplicated(x) 0.38 0.39 0.45 0.48 0.49 0.50 5
kit also has a funique function.
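To make the behaviour of the duplicated-based subsetting concrete, here is a minimal base R sketch on a made-up miniature of the Name/Seq table (the names and sequences are hypothetical):

```r
# Hypothetical miniature of the Name/Seq table
data <- data.frame(
  Name = c("n1", "n2", "n3", "n4", "n5"),
  Seq  = c("AAAA", "AAGG", "AAGG", "AAGG", "AGAG"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence,
# so negating it keeps one row (the first) per distinct sequence
uniq <- data[!duplicated(data$Seq), ]
```

Only the first Name seen for each sequence survives, so sort the data first if a particular Name should be the one that is kept.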
This may look like an innocuously simple problem but it takes a very long time to execute. Any ideas for speeding it up or vectorization etc. would be greatly appreciated.
I have a R data frame with 5 million rows and 50 columns : OriginalDataFrame
A list of Indices from that Frame : IndexList (55000 [numIndex] unique indices)
It's a time series, so there are ~ 5 million rows for 55K unique indices.
The OriginalDataFrame has been ordered by dataIndex. Not all the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present, and construct a new data frame : FinalDataFrame
Currently I am running this code using library(foreach):
FinalDataFrame <- foreach (i=1:numIndex, .combine="rbind") %dopar% {
OriginalDataFrame[(OriginalDataFrame$dataIndex == IndexList[i]),]
}
I run this on a machine with 24 cores and 128GB RAM and yet this takes around 6 hours to complete.
Am I doing something exceedingly silly or are there better ways in R to do this?
Here's a little benchmark comparing data.table to data.frame. If you know the special data table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small, and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note the problem size is a little smaller than the original to make it easier to explore)
library(data.table)
library(microbenchmark)
options(digits = 3)
# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))
# Data table, with index
dt <- data.table(df)
setkey(dt, "id")
ids <- sample(1e5, 1e4)
microbenchmark(
df[df$id %in% ids , ], # won't preserve order
df[match(ids, df$id), ],
dt[id %in% ids, ],
dt[match(ids, id), ],
dt[.(ids)]
)
# Unit: milliseconds
# expr min lq median uq max neval
# df[df$id %in% ids, ] 13.61 13.99 14.69 17.26 53.81 100
# df[match(ids, df$id), ] 16.62 17.03 17.36 18.10 21.22 100
# dt[id %in% ids, ] 7.72 7.99 8.35 9.23 12.18 100
# dt[match(ids, id), ] 16.44 17.03 17.36 17.77 61.57 100
# dt[.(ids)] 1.93 2.16 2.27 2.43 5.77 100
I had originally thought you might also be able to do this with
rownames, which I thought built up a hash table and did the indexing
efficiently. But that's obviously not the case:
df2 <- df
rownames(df2) <- as.character(df$id)
df2[as.character(ids), ]
microbenchmark(
df[df$id %in% ids , ], # won't preserve order
df2[as.character(ids), ],
times = 1
)
# Unit: milliseconds
# expr min lq median uq max neval
# df[df$id %in% ids, ] 15.3 15.3 15.3 15.3 15.3 1
# df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8 1
If you have 5M rows and you use == to identify rows to subset, then for each pass of your loop, you are performing 5M comparisons. If you instead key your data (as it inherently is) then you can increase efficiency significantly:
library(data.table)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# Now inside your foreach:
OriginalDT[ .( IndexList[[i]] ) ]
Note that the setkey function uses a very fast implementation of radix sort. However, if your data is already guaranteed to be sorted, @eddi or @arun had posted a nice hack to simply set the attribute on the DT. (I can't find it right now, but perhaps someone can edit this answer and link to it.)
You might try just collecting all the results into a list of data.tables, then using rbindlist, and comparing the speed against using .combine=rbind (if you do, please feel free to post benchmark results). I've never tested .combine=rbindlist but that might work as well and would be interesting to try.
edit:
If the sole task is to index the data.frame, then simply use:
OriginalDT[ .( IndexList ) ]
No foreach is necessary and you still leverage the DT's key.
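For comparison, the loop can also be replaced in base R by a single vectorized membership test. This is only a sketch on tiny stand-in objects that reuse the names from the question; the helper name present is made up, and it is not benchmarked here.

```r
# Hypothetical stand-ins for the objects named in the question
OriginalDataFrame <- data.frame(dataIndex = c(1, 1, 2, 4, 4, 4, 7),
                                x = c(10, 11, 20, 40, 41, 42, 70))
IndexList <- c(4, 1, 5)   # 5 does not occur in the data

# One vectorized pass over dataIndex instead of one == scan per index
FinalDataFrame <- OriginalDataFrame[OriginalDataFrame$dataIndex %in% IndexList, ]

# The requested indices that were actually found in the data
present <- intersect(IndexList, OriginalDataFrame$dataIndex)
```

Rows come back in the order they appear in OriginalDataFrame, not in the order of IndexList, and indices that are absent from the data are simply skipped.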
Check out the data.table package. It works just like a data.frame, but faster.
Like this (where df is your data frame):
library(data.table)
table <- data.table(df)
and then use table in place of df.