When trying to count the observations that meet a condition, each of these expressions appears to work, since they all return the same result. But how do they operate differently in the background, and in what scenarios would it not be appropriate to swap one for the other?
sum(grade.data$Quiz >= (100*.45))
length(which(grade.data$Quiz >= (100*.45)))
nrow(grade.data[grade.data$Quiz >= (100*.45),])
The middle one will not give misleading answers when there are missing values. Both of the other ones will.
Number 1 sums a logical vector that is coerced to 1's and 0's. If you added na.rm = TRUE it would be valid when NAs are present.
Number 2 determines the length of an integer vector: which() returns the positions of the TRUE values and drops any NA.
Number three constructs a subset and then counts the rows. I would expect it to be rather inefficient compared to the other two as well as having the problem with NA values. If you added & !is.na(grade.data$Quiz) to the logical expression inside [ , ], you would get valid answers.
A fourth method like the third (and also inefficient) without the NA problem would be:
nrow( subset( grade.data, Quiz >= (100*.45) ) )
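To make the NA behaviour concrete, here is a small sketch on made-up data. The toy data frame, its id column and the variable names are hypothetical and not taken from the question:

```r
# Hypothetical toy data: four quiz scores, one of them missing
toy <- data.frame(id = 1:4, Quiz = c(50, 30, NA, 80))
cutoff <- 100 * .45

n_sum      <- sum(toy$Quiz >= cutoff)                # NA: the missing value propagates
n_sum_narm <- sum(toy$Quiz >= cutoff, na.rm = TRUE)  # 2
n_which    <- length(which(toy$Quiz >= cutoff))      # 2: which() drops NA
n_bracket  <- nrow(toy[toy$Quiz >= cutoff, ])        # 3: the NA index adds an all-NA row
n_subset   <- nrow(subset(toy, Quiz >= cutoff))      # 2: subset() treats NA as FALSE
```

Only the plain sum() and the bracket subset disagree with the true count of 2, which is why they are the ones to watch when NAs may be present.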
Let's generate a 100k-row data.frame to see which method is fastest.
grade.data = data.frame(Quiz = sample(100000), age = sample(18:24, 100000, replace = TRUE))
library(data.table)
dt.grade.data = as.data.table(grade.data)
The methods posted here
data.table = function(x) dt.grade.data[,sum(Quiz>=100*.45)]
logical.sum = function(x) sum(grade.data$Quiz >= (100*.45))
logical.counting.table = function(x) table(grade.data$Quiz >= (100*.45))[["TRUE"]]
logical.which = function(x) length(which(grade.data$Quiz >= (100*.45)))
subsetting = function(x) nrow(grade.data[grade.data$Quiz >= (100*.45),])
subset.cmd = function(x) nrow(subset(grade.data, Quiz >= (100*.45) ))
Benchmark
library(microbenchmark)
microbenchmark(data.table(), logical.sum(), logical.counting.table(), logical.which(), subsetting(), subset.cmd(), times = 100L)
Unit: microseconds
expr min lq median uq max neval
data.table() 1766.148 2188.8000 2308.267 2469.405 29185.36 100
logical.sum() 739.385 945.4765 993.921 1074.386 10253.67 100
logical.counting.table() 28867.605 30847.0290 31546.796 32725.255 65514.14 100
logical.which() 701.205 1080.9555 1138.635 1228.545 3565.96 100
subsetting() 27376.931 28406.7730 29243.866 30564.371 168034.45 100
subset.cmd() 29004.315 31203.1730 32219.878 33362.003 89801.34 100
It seems that a vectorized logical check is the fastest method. On a smaller data frame (500 rows), data.table is actually much slower than all the other methods.
edit: Apparently, the relative efficiency of logical.sum() and logical.which() depends on the data. Using a different Quiz score distribution can make logical.sum() the fastest method. And as expected, data.table selection/subsetting blows data.frame subsetting out of the water.
Related
I want to understand the speed difference between select and $ to subset columns in R (whilst appreciating that they do not return exactly the same things, rather both perform the conceptual get-me-a-column operation). I would like to understand when either is most appropriate.
Specifically, under what conditions would the following select statement be faster than the corresponding $ statement?
Syntax is:
select(df, colName1, colName2, ...)
df$colName
In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.
Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
dplyr returns a different (more complex) object.
Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour); may be harder to read and/or maintain; and return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.
This might also help tease out (if one is wont to avoid looking at source code of packages) that dplyr is doing a lot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the ops are "give me this column" ops, so read it with that context:
library(dplyr)
microbenchmark::microbenchmark(
base1 = mtcars$cyl, # returns a vector
base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
base3 = mtcars[,"cyl"], # returns a vector
base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.682 6.3860 9.23727 7.7125 10.6050 25.397 100
## base2 4.224 5.9905 9.53136 7.7590 11.1095 27.329 100
## base2a 3.710 5.5380 7.92479 7.0845 10.1045 16.026 100
## base3 6.312 10.9935 13.99914 13.1740 16.2715 37.765 100
## base4 51.084 70.3740 92.03134 76.7350 95.9365 662.395 100
## dplyr1 698.954 742.9615 978.71306 784.8050 1154.6750 3568.188 100
## dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388 100
## dplyr3 64.299 78.3745 126.97205 85.3110 112.1000 2383.731 100
## dplyr4 63.235 73.0450 99.28021 85.1080 114.8465 263.219 100
But what if we have a lot of columns?
# Make a wider version of mtcars
do.call(
cbind.data.frame,
lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
) -> mtcars_manycols
# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
base1 = mtcars_manycols$cyl_4, # returns a vector
base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
base3 = mtcars_manycols[,"cyl_4"], # returns a vector
base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.534 6.8535 12.15802 8.7865 13.1775 75.095 100
## base2 4.150 6.5390 11.59937 9.3005 13.2220 73.332 100
## base2a 3.904 5.9755 10.73095 7.5820 11.2715 61.687 100
## base3 6.255 11.5270 16.42439 13.6385 18.6910 70.106 100
## base4 66.175 89.8560 118.37694 99.6480 122.9650 340.653 100
## dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705 9354.698 100
## dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716 100
## dplyr3 124.295 142.9535 216.89692 166.7115 209.1550 1138.368 100
## dplyr4 127.280 150.0575 195.21398 169.5285 209.0480 488.199 100
For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not an attribute of the "tidyverse" but the speed of development and expressiveness usually outweigh the speed difference.
NOTE: dplyr verbs are likely better candidates than subset(). And while I lazily use $, it's also a tad dangerous due to its default partial matching behaviour, as is [[ ]] when called with exact = FALSE. A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
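A short sketch of the partial matching risk mentioned here, using a made-up data frame (the column name value_total and the variable names are hypothetical):

```r
# Hypothetical data frame with a single, long column name
df <- data.frame(value_total = 1:3)

partial_dollar <- df$val                      # $ partially matches: returns value_total
exact_brackets <- df[["val"]]                 # [[ matches exactly by default: NULL
loose_brackets <- df[["val", exact = FALSE]]  # partial matching requested: value_total

# With this option set, the partial match via $ also raises a warning
# options(warnPartialMatchDollar = TRUE)
```

The silent success of df$val is the dangerous part: a typo or a renamed column can still return data instead of failing.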
It is not the same. If you're looking for the same functionality you could consider pull() from the same dplyr package.
The dollar sign returns a vector built from the data frame; pull does the same.
select is in the dplyr package, part of the tidyverse. https://dplyr.tidyverse.org/
you might do something like
df %>%
select(colName1, colName2)
Which would select those columns from df. These statements are written like verbs (e.g. select, arrange, group_by, etc.) and make it much easier to work with data.
$ is from base R. It would show you only that column from df.
I would like to know how many times each variable changes within each group and later add the result for all groups.
I've found this way:
mi[,lapply(.SD, function(x) sum(x != shift(x),
na.rm=T) ), by = ID][,-1][,lapply(.SD,sum, na.rm=T)]
It works and produces the proper result, but it's really slow on my large data.table.
I would like to do both operations inside the same lapply (or something faster and more compact), but the first one is done by group, the second isn't.
It could be written in an easier way (maybe not always)
mi[,lapply(.SD, function(x) sum(x != shift(x),
na.rm=T) )] [,-1]-mi[,length(unique(ID))]+1
But it's still slow and needs a lot of memory.
Any other idea?
I've also tried with diffs instead of shift, but it becomes more difficult.
Here you have a dummy example:
mi <- data.table(ID=rep(1:3,each=4) , year=rep(1:4, times=3),
VREP=rep(1:3,each=4) , VDI=rep(1:4, times=3), RAN=sample(12))
mi <- rbind(mi, data.table(4,1,1,1,0), use.names=F)
Big example for benchmark
mi <- as.data.table(matrix(sample(0:100,10000000,
replace=T), nrow=100000, ncol=100))
mi[,ID := rep(1:1000,each=100)]
My problem is that the true dataset is much bigger, at the limit of memory size, so I've configured R to use more memory via the pagefile, which makes many operations slow.
I know I could do it splitting the file and joining it again, but sometimes that makes things more difficult or some operations are not splittable.
Your second method produces incorrect results, so is not a fair comparison point. Here's an optimized version of alexis_laz's suggestion instead:
setorder(mi, ID)
setDT(Map(`!=`, mi, shift(mi)))[,
lapply(lapply(.SD, `&`, !ID), sum, na.rm = T), .SDcols = -"ID"]
# year VREP VDI RAN
#1: 9 0 9 9
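The idea behind this approach, comparing every row with the previous one and ignoring comparisons that cross a group boundary, can be sketched in base R. The helper count_changes() is a hypothetical name used only for illustration, and it assumes the data are already sorted by ID:

```r
# Count value changes between consecutive rows that belong to the same group.
# Assumes x and id have equal length and that rows are sorted by id.
count_changes <- function(x, id) {
  n <- length(x)
  if (n < 2) return(0L)
  changed    <- x[-1] != x[-n]    # value differs from the previous row
  same_group <- id[-1] == id[-n]  # previous row is in the same group
  sum(changed & same_group, na.rm = TRUE)
}

# Columns shaped like the first 12 rows of the dummy example
id   <- rep(1:3, each = 4)
year <- rep(1:4, times = 3)
vrep <- rep(1:3, each = 4)

year_changes <- count_changes(year, id)  # 3 changes in each of 3 groups
vrep_changes <- count_changes(vrep, id)  # constant within every group
```

Because the whole column is compared in one vectorized step, no per-group function call is needed, which is where most of the time goes in the grouped lapply version.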
On your bigger sample:
setorder(mi, ID)
microbenchmark(method1(), alexis_laz(), eddi(), times = 5)
#Unit: milliseconds
# expr min lq mean median uq max neval
# method1() 7336.1830 7510.9543 7932.0476 8150.3197 8207.2181 8455.563 5
# alexis_laz() 1350.0338 1492.3793 1509.0790 1492.5426 1577.3318 1633.107 5
# eddi() 400.3999 475.6908 494.5805 504.6163 524.2077 567.988 5
I have a vector of one million words called WORDS and a list of 9 million objects called SENTENCES. Each object in the list is a sentence, represented by a vector of 10 to 50 words. Here is an example:
head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"
SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"
I want to convert every sentence of my list into a numeric vector whose elements correspond to the position of the sentence's word in the WORDS big vector.
Currently, I do it with this command:
convert <- function(sentence){
return(which(WORDS %in% sentence))
}
SENTENCES_NUM <- lapply(SENTENCES, convert)
The problem is that it takes far too long. My RStudio blows up even though I have a computer with 16 GB of RAM. So the question is: do you have any ideas to speed up the computation?
fastmatch, a small package by an R core person, hashes the lookups so the initial and especially subsequent searches are faster.
What you are really doing is making a factor with predefined levels common to each sentence. The slow step in his C code is sorting the factor levels, which you can avoid by providing the (unique) list of factor levels to his fast version of the factor function.
If you just want the integer positions, you can easily convert from factor to integer: many do this inadvertently.
You don't actually need a factor at all for what you want, just match. Your code also generates a logical vector, then recalculates positions from it: match just goes straight to the positions.
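A tiny sketch of the difference between the two approaches, on a made-up five-word vocabulary (the values are hypothetical, chosen so the sentence from the question can be looked up):

```r
# Hypothetical miniature vocabulary, sorted alphabetically like WORDS
WORDS <- c("how", "interpret", "picture", "that", "to")
sentence <- c("how", "to", "interpret", "that", "picture")

# Builds a logical vector over all of WORDS, then recovers positions from it;
# the result is in vocabulary order, not in sentence order
pos_which <- which(WORDS %in% sentence)

# Goes straight to the positions, one per word, in sentence order
pos_match <- match(sentence, WORDS)
```

Here pos_which is 1 2 3 4 5 while pos_match is 1 5 2 4 3, so the two are interchangeable only if word order and repeated words do not matter. match() also returns NA for words missing from WORDS, whereas which(WORDS %in% sentence) silently omits them.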
library(fastmatch)
library(microbenchmark)
WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]
words_factor <- as.factor(WORDS)
# generate 100 sentences of 5 to 15 words (size is drawn once here, so all sentences share one length):
SENTENCES <- lapply(c(1:100), sample, x = WORDS, size = sample(c(5:15), size = 1))
bench_fun <- function(fun)
lapply(SENTENCES, fun)
# poster's slow solution:
hg_convert <- function(sentence)
return(which(WORDS %in% sentence))
jw_convert_match <- function(sentence)
match(sentence, WORDS)
jw_convert_match_factor <- function(sentence)
match(sentence, words_factor)
jw_convert_fastmatch <- function(sentence)
fmatch(sentence, WORDS)
jw_convert_fastmatch_factor <- function(sentence)
fmatch(sentence, words_factor)
message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
bench_fun(jw_convert_match),
bench_fun(jw_convert_match_factor),
bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(c(1:1e6), sample, x = WORDS, size = sample(c(5:15), size = 1))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
I put this on https://gist.github.com/jackwasey/59848d84728c0f55ef11
The results don't format very well; suffice it to say, fastmatch with or without factor input is dramatically faster.
# starting benchmark one
Unit: microseconds
expr min lq mean median uq max neval
bench_fun(hg_convert) 665167.953 678451.008 704030.2427 691859.576 738071.699 777176.143 10
bench_fun(jw_convert_match) 878269.025 950580.480 962171.6683 956413.486 990592.691 1014922.639 10
bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764 10
bench_fun(jw_convert_fastmatch) 203.031 220.134 462.1246 289.647 305.070 2196.906 10
bench_fun(jw_convert_fastmatch_factor) 251.474 300.729 1351.6974 317.439 362.127 10604.506 10
# starting benchmark two, compare with factor vs vector of words
Unit: seconds
expr min lq mean median uq max neval
bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648 10
bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907 10
And therefore I wouldn't go to the trouble of a parallel implementation just yet.
Won't be faster, but it is the tidy way of going about things.
library(dplyr)
library(tidyr)
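# one row per sentence, with its words held in a list column, then one row per word
sentence =
  tibble(word.name = SENTENCES,
         sentence.ID = seq_along(SENTENCES)) %>%
  unnest(word.name)
# lookup table giving each word its position in WORDS
word = tibble(
  word.name = WORDS,
  word.ID = seq_along(WORDS))
# attach the position to every word of every sentence
sentence__word =
  sentence %>%
  left_join(word, by = "word.name")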
I am working on a financial problem of deleting messages from a financial center. I am using data.table and I am very satisfied with its performance and easy handling.
Still, I always ask myself how to improve and use the full power of data.table.
Here is an example of my task:
set.seed(1)
DT <- data.table(SYM = c(rep("A", 10), rep("B", 12)), PRC = format(rlnorm(22, 2), digits = 2), VOL = rpois(22, 312), ID = c(seq(1000, 1009), seq(1004, 1015)), FLAG = c(rep("", 8), "R", "A", rep("", 4), "R", rep("", 7)))
DT$PRC[9] <- DT$PRC[6]
DT$PRC[7] <- DT$PRC[6]
DT$VOL[9] <- DT$VOL[6]
DT$VOL[7] <- DT$VOL[6]
DT$PRC[15] <- DT$PRC[13]
DT$VOL[15] <- DT$VOL[13]
## See the original dataset
DT
## Set the key
setkey(DT, "SYM", "PRC", "VOL", "FLAG")
## Get all rows, that match a row with FLAG == "R" on the given variables in the list
DT[DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## Remove these rows from the dataset
DT <- DT[!DT[FLAG == "R"][,list(SYM, PRC, VOL)]]
## See the modified data.table
DT
My questions are now:
Is this an efficient way to perform my task or does there exist something more 'data.table' style? Is the key set efficiently?
How can I perform my task if I do not only have three variables to match on (here: SYM, PRC, VOL) but a lot more, does there exist something like exclusion (I do know I can use it data.frame style but I want to know if there is a more elegant way for a data.table)?
What is with the copying in the last command? Following the thread on remove row by reference, I think copying is the only way to do it. What if I have several tasks, can I compound them in a way and avoid copying for each task?
I'm confused about why you're setting the key to include FLAG. Isn't what you want simply:
setkey(DT, SYM, PRC, VOL)
DT[!DT[FLAG == "R"]]
If you are only setting the key to perform this operation, @eddi's answer is the best and easiest to read.
setkey(DT, SYM, PRC, VOL)
# ^ as in @eddi's answer, since you are not using the rest of the key
microbenchmark(
notjoin=DT[!DT[FLAG == "R"][,list(SYM, PRC, VOL)]],
logi_not=DT[!DT[,rep(any(FLAG=='R'),.N),by='SYM,PRC,VOL']$V1],
idx_not=DT[!DT[,if(any(FLAG=='R')){.I}else{NULL},by='SYM,PRC,VOL']$V1],
SD=DT[,if(!any(FLAG=='R')){.SD}else{NULL},by='SYM,PRC,VOL'],
eddi=DT[!DT[FLAG == "R"]],
times=1000L
)
results:
Unit: milliseconds
expr min lq median uq max neval
notjoin 4.983404 5.577309 5.715527 5.903417 66.468771 1000
logi_not 4.393278 4.960187 5.097595 5.273607 66.429358 1000
idx_not 4.523397 5.139439 5.287645 5.453129 15.068991 1000
SD 3.670874 4.180012 4.308781 4.463737 9.429053 1000
eddi 2.767599 3.047273 3.137979 3.255680 11.970966 1000
On the other hand, several of the options above do not require that your operation involve grouping by the key. Suppose you either...
are doing this once using groups other than the key (which you don't want to change) or
want to perform several operations like this using different groupings before doing the copy operation to drop rows, newDT <- DT[...] (as mentioned in the OP's point 3).
setkey(DT,NULL)
shuffDT <- DT[sample(1:nrow(DT))] # not realistic, of course
# same benchmark with shuffDT, but without methods that require a key
# Unit: milliseconds
# expr min lq median uq max neval
# logi_not 4.466166 5.120273 5.298174 5.562732 64.30966 1000
# idx_not 4.623821 5.319501 5.517378 5.799484 15.57165 1000
# SD 4.053672 4.448080 4.612213 4.849505 66.76140 1000
In these cases, the OP's and eddi's methods are not available (since joining requires a key). For a one-off operation, using .SD seems faster. For subsetting by multiple criteria, you'll want to keep track of the rows you want to keep/drop before making the copy newDT <- DT[!union(badrows1,badrows2,...)].
DT[,rn:=1:.N] # same as .I
badflagrows <- DT[,if(any(FLAG=='R')){rn}else{NULL},by='SYM,PRC,VOL']$V1
# fill in next_cond, next_grp
badnextrows <- DT[!badflagrows][,
if(any(next_cond)){rn}else{NULL},by='next_grp']$V1
Perhaps something similar can be done with the logical subsetting ("logi_not" in the benchmarks), which is a little faster.
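On the point that joining requires a key: data.table releases from 1.9.6 onward also accept an on= argument, so a not-join can name its columns ad hoc without changing the key. A minimal sketch on made-up data, assuming such a version is installed (the table contents and the name kept are hypothetical):

```r
library(data.table)

# Hypothetical miniature of the message data, with no key set
DT <- data.table(SYM  = c("A", "A", "A", "B"),
                 PRC  = c("1.0", "1.0", "2.0", "1.0"),
                 VOL  = c(10L, 10L, 20L, 10L),
                 FLAG = c("", "R", "", ""))

# Not-join on the named columns: drop every row that matches a FLAG == "R" row
kept <- DT[!DT[FLAG == "R"], on = c("SYM", "PRC", "VOL")]
```

Both rows with SYM "A", PRC "1.0" and VOL 10 are dropped, including the unflagged one, which is the behaviour the original keyed not-join was after.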
I have an R dataframe and I'm trying to subtract one column from another. I extract the columns using the $ operator but the class of the columns is 'factor' and R won't perform arithmetic operations on factors. Are there special functions to do this?
If you really want the levels of the factor to be used, you're either doing something very wrong or too clever for its own good.
If what you have is a factor containing numbers stored in the levels of the factor, then you want to coerce it to numeric first using as.numeric(as.character(...)):
dat <- data.frame(f = as.character(runif(10)), stringsAsFactors = TRUE)  # R >= 4.0 no longer creates factors by default
You can see the difference between accessing the factor indices and assigning the factor contents here:
> as.numeric(dat$f)
[1] 9 7 2 1 4 6 5 3 10 8
> as.numeric(as.character(dat$f))
[1] 0.6369432 0.4455214 0.1204000 0.0336245 0.2731787 0.4219241 0.2910194
[8] 0.1868443 0.9443593 0.5784658
Timings against an alternative approach that only does the conversion on the levels show that it's faster if levels are not unique to each element:
dat <- data.frame( f = sample(as.character(runif(10)),10^4,replace=TRUE), stringsAsFactors = TRUE )
library(microbenchmark)
microbenchmark(
as.numeric(as.character(dat$f)),
as.numeric( levels(dat$f) )[dat$f] ,
as.numeric( levels(dat$f)[dat$f] ),
times=50
)
expr min lq median uq max
1 as.numeric(as.character(dat$f)) 7835865 7869228 7919699 7998399 9576694
2 as.numeric(levels(dat$f))[dat$f] 237814 242947 255778 270321 371263
3 as.numeric(levels(dat$f)[dat$f]) 7817045 7905156 7964610 8121583 9297819
Therefore, if length(levels(dat$f)) < length(dat$f), use as.numeric(levels(dat$f))[dat$f] for a substantial speed gain.
If length(levels(dat$f)) is approximately equal to length(dat$f), there is no speed gain:
dat <- data.frame( f = as.character(runif(10^4) ), stringsAsFactors = TRUE )
library(microbenchmark)
microbenchmark(
as.numeric(as.character(dat$f)),
as.numeric( levels(dat$f) )[dat$f] ,
as.numeric( levels(dat$f)[dat$f] ),
times=50
)
expr min lq median uq max
1 as.numeric(as.character(dat$f)) 7986423 8036895 8101480 8202850 12522842
2 as.numeric(levels(dat$f))[dat$f] 7815335 7866661 7949640 8102764 15809456
3 as.numeric(levels(dat$f)[dat$f]) 7989845 8040316 8122012 8330312 10420161
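The recommended idiom can be wrapped in a small helper. This is only a sketch: factor_to_numeric() is a hypothetical name, and the example values are made up:

```r
# Convert only the levels, then index the converted levels by the factor's codes
factor_to_numeric <- function(f) as.numeric(levels(f))[f]

f <- factor(c("10", "20", "10", "5"))  # levels sort as "10", "20", "5"

codes  <- as.numeric(f)         # 1 2 1 3: the internal integer codes
values <- factor_to_numeric(f)  # 10 20 10 5: the numbers stored in the levels
```

The conversion from character to numeric runs once per level instead of once per element, which is where the speed gain comes from when levels repeat.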
You can define your own operators to do that; see ?Arith. Without group generics, you can define your own binary operator of the form %operator%:
`%-%` <- function(factor1, factor2) {
# put in the code here to calculate difference
# of two factors (e.g. factor1 level cat - factor2 level mouse = ?)
}
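As one possible way to fill in that body, here is a sketch that assumes both factors hold numbers in their levels. That assumption comes from the question, not from this answer, and the name has to be wrapped in backticks when the operator is defined:

```r
# Hypothetical operator: subtract the numbers stored in the levels of two factors
`%-%` <- function(factor1, factor2) {
  as.numeric(as.character(factor1)) - as.numeric(as.character(factor2))
}

a <- factor(c(5, 6, 5))
b <- factor(c(3, 2, 1))

difference <- a %-% b  # 2 4 4
```

For factors with non-numeric levels such as cat and mouse, the conversion yields NA with a warning, so a different definition of "difference" would be needed there.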
You should double check how you're pulling in the data first. If these are truly numeric columns, R should recognize this (Excel messes up sometimes). Either way, the columns may be coerced to factors because they contain other undesirable values. The responses that you've received so far haven't mentioned that as.numeric() applied directly to a factor only returns the level numbers, meaning that you won't be performing the operation on the actual numbers that have been converted to factors but rather on the level numbers associated with each factor.
You'll need to convert the factors to numeric arrays.
a <- factor(c(5,6,5))
b <- factor(c(3,2,1))
df <- data.frame(a, b)
# WRONG: Factors can't be subtracted.
df$a - df$b
# CORRECT: Get the levels and subtract
as.numeric(levels(df$a)[df$a]) - as.numeric(levels(df$b)[df$b])