I have a data frame with dimensions of 2,377,426 rows by 2 columns, which looks something like this:
Name Seq
428293 ENSE00001892940:ENSE00001929862 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
431857 ENSE00001892940:ENSE00001883352 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
432253 ENSE00001892940:ENSE00003623668 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
436213 ENSE00001892940:ENSE00003534967 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
429778 ENSE00001892940:ENSE00002409454 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
431263 ENSE00001892940:ENSE00001834214 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
All the values in the first column (Name) are unique, but there are many duplicates in the 'Seq' column.
I want a data frame that contains only the unique sequences, each with a name. I have tried unique, but it is too slow. I have also tried sorting the data frame and using the following code:
dat_sorted = data[order(data$Seq),]  # sort so duplicate sequences are adjacent
m = dat_sorted[1,]
x = 1
for (i in 1:nrow(dat_sorted)) {
  if (dat_sorted[i,2] != m[x,2]) { x = x + 1; m[x,] = dat_sorted[i,] }
}
Again, this is too slow!
Is there a faster way to find unique values in one column of a data frame?
data[!duplicated(data$Seq), ]
should do the trick.
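As a quick illustration on toy data (column names as in the question), duplicated() flags every occurrence of a Seq after the first, so negating it keeps the first Name seen for each sequence:
data <- data.frame(Name = c("a", "b", "c", "d"),
                   Seq  = c("AAA", "AAA", "GGG", "AAA"),
                   stringsAsFactors = FALSE)
data[!duplicated(data$Seq), ]
#   Name Seq
# 1    a AAA
# 3    c GGG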
library(dplyr)
data %>% distinct(Seq, .keep_all = TRUE)
Should be worth it, especially if your data is too big for your machine. The .keep_all = TRUE argument keeps the first Name seen for each unique Seq.
For the fastest, you can try:
data[!kit::fduplicated(data$Seq), ]
Here are some benchmarks taken directly from the documentation:
library(kit)
x = sample(c(1:10, NA_integer_), 1e8, TRUE) # 382 Mb
microbenchmark::microbenchmark(
duplicated(x),
fduplicated(x),
times = 5L
)
# Unit: seconds
# expr min lq mean median uq max neval
# duplicated(x) 2.21 2.21 2.48 2.21 2.22 3.55 5
# fduplicated(x) 0.38 0.39 0.45 0.48 0.49 0.50 5
kit also has a funique function.
Suppose I have a dict table like:
id value
1 168833
2 367656
3 539218
4 892211
......(millions of lines)
and an original data frame like:
name code
Abo 1
Cm3 2
LL2 6
JJ 15
How can I replace the code column in the original table with the corresponding values from the dictionary table, without using join or merge?
We can use match from base R
df1$value[match(df2$code, df1$id)]
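To actually overwrite the code column with the looked-up values, a minimal sketch (assuming, as above, that df1 is the dictionary and df2 is the original table; any code with no entry in the dictionary becomes NA):
df2$code <- df1$value[match(df2$code, df1$id)]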
Another option is the hashmap package:
library(hashmap)
hp <- hashmap(df1$id, df1$value)
hp[[df2$code]]
Based on the example in ?hashmap (where x, y, z and H come from that example), it is faster:
microbenchmark::microbenchmark(
"R" = y[match(z, x)],
"H" = H[[z]],
times = 500L
)
#Unit: microseconds
# expr min lq mean median uq max neval
# R 154.197 202.1625 240.5838 229.1625 245.1735 6853.756 500
# H 15.861 19.0235 22.7721 22.4490 24.9670 62.230 500
I want to understand the speed difference between select and $ to subset columns in R (whilst appreciating that they do not return exactly the same things, rather both perform the conceptual get-me-a-column operation). I would like to understand when either is most appropriate.
Specifically, under what conditions would the following select statement be faster than the corresponding $ statement?
Syntax is:
select(df, colName1, colName2, ...)
df$colName
In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.
Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
dplyr returns a different (more complex) object.
Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour); they may be harder to read and/or to maintain; and they return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.
This might also help tease out (if one is wont to avoid looking at package source code) that dplyr is doing a lot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the ops are "give me this column" ops, so read it with that context:
library(dplyr)
microbenchmark::microbenchmark(
base1 = mtcars$cyl, # returns a vector
base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
base3 = mtcars[,"cyl"], # returns a vector
base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.682 6.3860 9.23727 7.7125 10.6050 25.397 100
## base2 4.224 5.9905 9.53136 7.7590 11.1095 27.329 100
## base2a 3.710 5.5380 7.92479 7.0845 10.1045 16.026 100
## base3 6.312 10.9935 13.99914 13.1740 16.2715 37.765 100
## base4 51.084 70.3740 92.03134 76.7350 95.9365 662.395 100
## dplyr1 698.954 742.9615 978.71306 784.8050 1154.6750 3568.188 100
## dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388 100
## dplyr3 64.299 78.3745 126.97205 85.3110 112.1000 2383.731 100
## dplyr4 63.235 73.0450 99.28021 85.1080 114.8465 263.219 100
But what if we have a lot of columns?
# Make a wider version of mtcars
do.call(
cbind.data.frame,
lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
) -> mtcars_manycols
# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
base1 = mtcars_manycols$cyl_4, # returns a vector
base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
base3 = mtcars_manycols[,"cyl_4"], # returns a vector
base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.534 6.8535 12.15802 8.7865 13.1775 75.095 100
## base2 4.150 6.5390 11.59937 9.3005 13.2220 73.332 100
## base2a 3.904 5.9755 10.73095 7.5820 11.2715 61.687 100
## base3 6.255 11.5270 16.42439 13.6385 18.6910 70.106 100
## base4 66.175 89.8560 118.37694 99.6480 122.9650 340.653 100
## dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705 9354.698 100
## dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716 100
## dplyr3 124.295 142.9535 216.89692 166.7115 209.1550 1138.368 100
## dplyr4 127.280 150.0575 195.21398 169.5285 209.0480 488.199 100
For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not a strength of the "tidyverse", but the speed of development and expressiveness usually outweigh the speed difference.
NOTE: dplyr verbs are likely better candidates than subset(), and while I lazily use $, it's also a tad dangerous due to its default partial matching behaviour, as is [[]] without exact = TRUE. A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
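A small sketch of the partial-matching gotcha being warned about here (hypothetical column name, purely for illustration):
df <- data.frame(cylinders = c(4, 6, 8))
df$cyl                      # partial match: silently returns df$cylinders
df[["cyl"]]                 # exact matching by default: returns NULL
df[["cyl", exact = FALSE]]  # partial matching allowed again
options(warnPartialMatchDollar = TRUE)
df$cyl                      # same result, but now with a warning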
It is not the same. If you're looking for the same functionality, consider pull() from the same dplyr package.
$ returns a vector built from the data frame; pull does the same.
select is in the dplyr package, part of the tidyverse. https://dplyr.tidyverse.org/
you might do something like
df %>%
select(colName1, colName2)
Which would select those columns from df. These statements are written like verbs (e.g. select, arrange, group_by, etc.), which makes it much easier to work with data.
$ is from base R. It returns only that column from df, as a vector.
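To make the difference in return types concrete, a quick sketch using mtcars as a stand-in:
library(dplyr)
select(mtcars, cyl)   # a one-column data frame
mtcars$cyl            # a plain numeric vector
pull(mtcars, cyl)     # dplyr's way to get the vector directly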
Say I have
library(dplyr)
a <- list(a=tbl_df(cars), b=tbl_df(iris))
How can I add to each element of this list a column name whose values are the name of the named element of the list?
For instance, this is how the output should look for the first element:
Source: local data frame [50 x 3]
speed dist name
(dbl) (dbl) (chr)
1 4 2 a
2 4 10 a
3 7 4 a
4 7 22 a
5 8 16 a
6 9 10 a
7 10 18 a
8 10 26 a
9 10 34 a
10 11 17 a
After all this commenting, guess I'll write an answer.
You should use a for loop for this: it's quick to code, quick to execute, readable and straightforward:
for (i in seq_along(a)) a[[i]]$name = names(a)[i]
You could use map, mapply, or lapply instead of the for loop; in this case, I think it would be less readable.
You could also use mutate instead of [ for adding the column. This will be slower:
library(microbenchmark)
library(dplyr)
cars_tbl = tbl_df(cars)
mbm = microbenchmark
mbm(
mutate = {cars_tbl = mutate(cars_tbl, name = 'a')},
base = {cars_tbl['name'] = 'a'}
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# mutate 240.617 262.4730 293.29001 276.158 299.7255 813.078 100 b
# base 34.971 42.1935 55.46356 53.407 57.3980 226.932 100 a
For such a simple operation, [<- is going to be hard to beat. data.table will probably be faster, but only if the object is already a data.table. If the object is a data.frame rather than a tbl_df, then the mutate is about twice as slow. But these differences are in microseconds. Unless you are repeatedly doing this operation to lists of at least hundreds of thousands of data frames it won't matter.
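For reference, the data.table idiom alluded to above would look something like this (a sketch, assuming you are willing to convert the object first); := adds the column by reference:
library(data.table)
cars_dt <- as.data.table(cars)
cars_dt[, name := "a"]   # adds the column in place, without copying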
This is not to say dplyr has poor performance - when you are using the grouping operations, relying on the NSE built in to dplyr, it's excellent. This is just a simple case where the simple base solution is easiest and also quickest.
If we increase the size of the data enough so that it takes a noticeable amount of time to do these operations (10 million rows, here), the differences essentially go away:
df = tbl_df(data.frame(x = rep(1, 1e7)))
mbm(
mutate = {df = mutate(df, name = 'a')},
base = {df['name'] = 'a'}
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# mutate 58.08095 59.87531 132.3180 105.22507 207.6439 261.8121 100 a
# base 52.09899 53.96386 129.9304 99.96153 203.8581 237.0084 100 a
Implementing with for loops and with Map(), comparing [<- and mutate:
# base for loop
for (i in seq_along(a)) {
a[[i]]$name = names(a)[i]
}
# dplyr in for loop
for (i in seq_along(a)) {
a[[i]] = mutate(a[[i]], name = names(a)[i])
}
# dplyr hiding the loop in Map()
a = Map(function(x, y) mutate(x, name = y), x = a, y = names(a))
We could benchmark these (I did -- see the edit history if you want the results), but the differences are less than 1 millisecond so it shouldn't matter. Go with whatever is easiest for you to read, write, and understand.
All this comes with the caveat that if your eventual goal is to bind these data frames together and you want the name column to see what list element the data came from, then that is implemented directly in dplyr::bind_rows.
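A sketch of that last point, using the .id argument of bind_rows to record which list element each row came from:
library(dplyr)
a <- list(a = tbl_df(cars), b = tbl_df(iris))
combined <- bind_rows(a, .id = "name")   # adds a 'name' column holding "a" or "b"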
This may look like an innocuously simple problem but it takes a very long time to execute. Any ideas for speeding it up or vectorization etc. would be greatly appreciated.
I have an R data frame with 5 million rows and 50 columns: OriginalDataFrame
A list of indices from that frame: IndexList (55,000 [numIndex] unique indices)
It's a time series, so there are ~5 million rows for 55K unique indices.
The OriginalDataFrame has been ordered by dataIndex. Not all the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present, and construct a new data frame: FinalDataFrame
Currently I am running this code using library(foreach):
FinalDataFrame <- foreach (i=1:numIndex, .combine="rbind") %dopar% {
OriginalDataFrame[(OriginalDataFrame$dataIndex == IndexList[i]),]
}
I run this on a machine with 24 cores and 128GB RAM and yet this takes around 6 hours to complete.
Am I doing something exceedingly silly or are there better ways in R to do this?
Here's a little benchmark comparing data.table to data.frame. If you know the special data table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small, and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note the problem size is a little smaller than the original to make it easier to explore)
library(data.table)
library(microbenchmark)
options(digits = 3)
# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))
# Data table, with index
dt <- data.table(df)
setkey(dt, "id")
ids <- sample(1e5, 1e4)
microbenchmark(
df[df$id %in% ids , ], # won't preserve order
df[match(ids, df$id), ],
dt[id %in% ids, ],
dt[match(ids, id), ],
dt[.(ids)]
)
# Unit: milliseconds
# expr min lq median uq max neval
# df[df$id %in% ids, ] 13.61 13.99 14.69 17.26 53.81 100
# df[match(ids, df$id), ] 16.62 17.03 17.36 18.10 21.22 100
# dt[id %in% ids, ] 7.72 7.99 8.35 9.23 12.18 100
# dt[match(ids, id), ] 16.44 17.03 17.36 17.77 61.57 100
# dt[.(ids)] 1.93 2.16 2.27 2.43 5.77 100
I had originally thought you might also be able to do this with
rownames, which I thought built up a hash table and did the indexing
efficiently. But that's obviously not the case:
df2 <- df
rownames(df2) <- as.character(df$id)
microbenchmark(
df[df$id %in% ids , ], # won't preserve order
df2[as.character(ids), ],
times = 1
)
# Unit: milliseconds
# expr min lq median uq max neval
# df[df$id %in% ids, ] 15.3 15.3 15.3 15.3 15.3 1
# df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8 1
If you have 5M rows and you use == to identify rows to subset, then for each pass of your loop, you are performing 5M comparisons. If you instead key your data (as it inherently is) then you can increase efficiency significantly:
library(data.table)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# Now inside your foreach:
OriginalDT[ .( IndexList[[i]] ) ]
Note that the setkey function uses a very fast implementation of radix sort. However, if your data is already guaranteed to be sorted, @eddi or @Arun posted a nice hack to simply set the sorted attribute on the DT. (I can't find it right now, but perhaps someone can edit this answer and link to it.)
You might try just collecting all the results into a list of data.tables then using rbindlist and compare the speed against using .combine=rbind (if you do, please feel free to post benchmark results). I've never tested .combine=rbindlist but that might work as well and would be interesting to try.
edit:
If the sole task is to index the data.frame, then simply use:
OriginalDT[ .( IndexList ) ]
No foreach necessary, and you still leverage the keyed DT.
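Putting that together, a minimal end-to-end sketch (object and column names taken from the question; the nomatch = 0L argument is my assumption, used to drop indices that aren't present, and IndexList is assumed to be a plain vector):
library(data.table)
OriginalDT <- as.data.table(OriginalDataFrame)
setkey(OriginalDT, dataIndex)
# one keyed lookup for all indices at once
FinalDataFrame <- OriginalDT[.(IndexList), nomatch = 0L]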
Check out the data.table package. It works just like data.frame, but faster.
Like this (where df is your data frame):
dt <- data.table(df)
and then use dt in place of df.
I'd like to use the diff function on a really big data.frame: 140 million rows and two columns.
The goal is to compute the gap between consecutive activity dates, for each user_id.
For each user, the first activity doesn't have a previous one, so I need an NA value there.
I used this function, and it works for small datasets, but with the big one it's really slow. I've been waiting since yesterday, and it's still running.
df2 <- as.vector(unlist(tapply(df$DATE,df$user_id, FUN=function(x){ return (c(NA,diff(x)))})))
I have a lot of memory (24 GB) and a 4-core CPU, but only one core is working.
How can I deal with this problem? Would it be better to convert the data frame to a matrix?
Here is an example using some generated data: first a dataset of 10 million rows (100 users, 100,000 time points each), then one of 140 million rows (1,400 users, same number of time points each). This transposes the time points to the columns. I should imagine that if you were transposing users to columns it would be even faster. I used @Arun's answer here as a template. Basically it shows that on a really big table you can do it on a single core (i7 2.6 GHz) in under 90 seconds (and that is using code which is probably not fully optimised):
require(data.table)
## Smaller sample dataset - 10 million row, 100 users, 100,000 time points each
DT <- data.table( Date = sample(100,1e7,repl=TRUE) , User = rep(1:100,each=1e5) )
## Size of table in memory
tables()
# NAME NROW MB COLS KEY
#[1,] DT 10,000,000 77 Date,User
#Total: 77MB
## Diff by user
dt.test <- quote({
DT2 <- DT[ , list(Diff=diff(c(0,Date))) , by=list(User) ]
DT2 <- DT2[, as.list(setattr(Diff, 'names', 1:length(Diff))) , by = list(User)]
})
## Benchmark it
require(microbenchmark)
microbenchmark( eval(dt.test) , times = 5L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 5.788364 5.825788 5.9295 5.942959 6.109157 5
## And with 140 million rows...
DT <- data.table( Date = sample(100,1.4e8,repl=TRUE) , User = rep(1:1400,each=1e5) )
#tables()
# NAME NROW MB
#[1,] DT 140,000,000 1069
microbenchmark( eval(dt.test) , times = 1L )
#Unit: seconds
# expr min lq median uq max neval
# eval(dt.test) 84.3689 84.3689 84.3689 84.3689 84.3689 1
This is a lot faster if you avoid tapply altogether, which is fairly easy because your tapply call assumes the data are already sorted by user_id and DATE.
set.seed(21)
N <- 1e6
Data <- data.frame(DATE=Sys.Date()-sample(365,N,TRUE),
USER=sample(1e3,N,TRUE))
Data <- Data[order(Data$USER,Data$DATE),]
system.time({
Data$DIFF <- unlist(tapply(Data$DATE,Data$USER, function(x) c(NA,diff(x))))
})
# user system elapsed
# 1.58 0.00 1.59
Data2 <- Data
system.time({
Data2$DIFF <- c(NA,diff(Data2$DATE))
is.na(Data2$DIFF) <- which(c(NA,diff(Data2$USER))==1)
})
# user system elapsed
# 0.12 0.00 0.12
identical(Data,Data2)
# [1] TRUE
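For reference, a data.table sketch of the same grouped diff (column names taken from the question), which avoids both tapply and the transposing step in the earlier answer:
library(data.table)
DT <- as.data.table(df)        # df has columns user_id and DATE
setkey(DT, user_id, DATE)      # sort by user, then date
DT[, DIFF := c(NA, diff(DATE)), by = user_id]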