Canonical way to select columns in R

I am comparing common "tidying" operations in dplyr and in "plain R" (see the output here and source here to see what I mean).
I have a hard time finding a "canonical" and concise way to select columns using only variable names (by canonical, I mean pure, plain R that is easily understandable to anyone with a minimal understanding of R, so no "voodoo tricks").
Example:
## subset: all columns from "var_1" to "var_2" excluding "var_3"
## dplyr:
table %>% select(var_1:var_2, -var_3)
## plain R:
r <- sapply(c("var_1", "var_2", "var_3"), function(x) which(names(table)==x))
table[ ,setdiff(r[1]:r[2],r[3]) ]
Any suggestions to improve the plain R syntax?
Edit
I implemented some suggestions and compared performance across the different syntaxes, and noticed that the use of match and subset led to surprising drops in performance:
# plain R, v1
system.time(for (i in 1:100) {
  r <- sapply(c("size", "country"), function(x) which(names(cran_df) == x))
  cran_df[, r[1]:r[2]]
})
## user system elapsed
## 0.006 0.000 0.007
# plain R, using match
system.time(for (i in 1:100) {
  r <- match(c("size", "country"), names(cran_df))
  cran_df[, r[1]:r[2]] %>% head(n = 3)
})
## user system elapsed
## 0.056 0.028 0.084
# plain R, using match and subset
system.time(for (i in 1:100) {
  r <- match(c("size", "country"), names(cran_df))
  subset(cran_df, select = r[1]:r[2]) %>% head(n = 3)
})
## user system elapsed
## 11.556 1.057 12.640
# dplyr
system.time(for (i in 1:100) select(cran_tbl_df,size:country))
## user system elapsed
## 0.034 0.000 0.034
Looks like the implementation of subset is sub-optimal...

You can use the built-in subset function, whose select argument follows similar (though not identical) syntax to dplyr::select. Note that dropping columns has to be done in a second step:
t1 <- subset(table, select = var_1:var_2)
t2 <- subset(t1, select = -var_3)
or:
subset(subset(table, select = var_1:var_2), select = -var_3)
For example:
subset(subset(mtcars, select = c(mpg:wt)), select = -hp)
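For comparison, here is a plain R sketch of the same selection on mtcars, using match as in the question's edit (just an illustration of the idea):
r <- match(c("mpg", "wt", "hp"), names(mtcars))
mtcars[, setdiff(r[1]:r[2], r[3])]  # columns mpg through wt, minus hp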

Related

Running a quicker calculation

I have 2 data frames in R, one of which is a subset of the other. I had to do some manipulation on it and calculate, for 6 x-values (DayTreat in the code), the percentage that the subsetted data represents of the main data frame. So I created a function to do the calculation and create a new column. My issue is that it's painfully slow. Any suggestions?
percDay <- function(fullDat, subDat)
{
  subDat$DaySum <- NULL
  for (i in fullDat$DayTreat) # for each DayTreat value in fullDat. Must be a `psmelt()`-made phyloseq object
  {
    r <- sum(fullDat$Abundance[fullDat$DayTreat == i]) # Take the sum of all the taxa for that day
    subDat$DaySum[subDat$DayTreat == i] <- r           # Add the value to the subset of data
  }
  subDat$DayPerc <- subDat$Abundance/subDat$DaySum     # Make the percentage of the subset
  subDat
}
Examining your code, it looks like you are doing redundant calculations.
The line:
for (i in fullDat$DayTreat)
should be:
for (i in unique(fullDat$DayTreat))
After that, you could use data.table and avoid keeping separate data frames, since you say that one is a subset of the other:
require(data.table)
setDT(fullDat)
fullDat[, subsetI := Abundance > 30] # for example, should be your Condition
fullDat[, DaySum:= sum(Abundance), by = DayTreat]
fullDat[, DayPerc := Abundance/DaySum]
# get subset:
fullDat[subsetI == TRUE]
If you would provide example data and desired output, it could be possible to supply more concrete code.
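In the meantime, here is a rough base R sketch of the same idea without the explicit loop, using tapply (it assumes every DayTreat value in subDat also occurs in fullDat):
day_sums <- tapply(fullDat$Abundance, fullDat$DayTreat, sum)
subDat$DaySum  <- unname(day_sums[as.character(subDat$DayTreat)])
subDat$DayPerc <- subDat$Abundance / subDat$DaySum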
So, at a high level, I think the solutions are:
Use faster data classes if you aren't already.
Avoid for loops: vectorize manually, or rely on faster functions/libraries that use more C code and/or have more vectorization "under the hood".
Try data.table and/or the tidyverse for greater speed and cleaner code.
Benchmark and profile your code.
Example:
require(tidyverse)
require(data.table)
percDay <- function(fullDat, subDat)
{
  subDat$DaySum <- NULL
  for (i in fullDat$DayTreat) # for each DayTreat value in fullDat. Must be a `psmelt()`-made phyloseq object
  {
    r <- sum(fullDat$Abundance[fullDat$DayTreat == i]) # Take the sum of all the taxa for that day
    subDat$DaySum[subDat$DayTreat == i] <- r           # Add the value to the subset of data
  }
  subDat$DayPerc <- subDat$Abundance/subDat$DaySum     # Make the percentage of the subset
  subDat
}
# My simulation of your data.frame:
fullDat <- data.frame(Abundance = rnorm(200),
                      DayTreat  = c(1:100, 1:100))
subDat <- dplyr::sample_frac(fullDat, .25)
# Your function modifies the data, so I'll make a copy. For a potential
# speed improvement I'll try the data.table class
fullDat0 <- as.data.table(fullDat)
subDat0  <- as.data.table(subDat)
require(rbenchmark)
benchmark("original" = {
  percDay(fullDat, subDat)
},
"example_improvement" = {
  # Tidy approach
  tmp <- fullDat0 %>%
    group_by(DayTreat) %>%
    summarize(DaySum = sum(Abundance))
  subDat0 <- merge(subDat, tmp, by = "DayTreat")          # could use semi_join
  subDat0$DayPerc <- subDat0$Abundance / subDat0$DaySum   # could use mutate
},
replications = 100,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self"))
                 test replications elapsed relative user.self sys.self
  example_improvement          100    0.22    1.000      0.22     0.00
             original          100    1.42    6.455      1.23     0.01
Typically a data.table approach is going to have the greatest speed. The tibble-based "tidy" approach has clearer syntax while typically being faster than data.frame but slower than data.table. An experienced data.table expert like @akrun could offer a maximal-performance solution, probably in a single data.table statement.
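For what it's worth, a sketch of what such a single data.table statement might look like, using an update join (my guess only, not benchmarked; it assumes fullDat and subDat have been converted with setDT):
setDT(fullDat); setDT(subDat)
subDat[fullDat[, .(DaySum = sum(Abundance)), by = DayTreat],
       on = "DayTreat", DayPerc := Abundance / i.DaySum]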

logical indexing in data.table in R

I am a beginner in data.table and I am trying to do a really simple operation, which with base data frames would look like this:
percentages[percentages<0] = abs(percentages[percentages<0])
The data looks like this:
percentages
p1 p2 p3
1: 0.689 0.206 0.106
The solution for data.table that I have found so far to just get the data is:
percentages[,which(percentages<0),with=FALSE]
but it's more complicated than with a data.frame... there should be something better, but I can't come up with anything. Any suggestions?
A general option is to use set. It involves a for loop, but it is more efficient because we loop over the columns instead of creating a full logical matrix with df1 < 0 (for huge datasets, that matrix would consume some memory). Using set is efficient because, as the documentation says, the overhead of [.data.table is avoided:
for (j in seq_along(df1)) {
  i1 <- which(df1[[j]] < 0)
  set(df1, i = i1, j = j, value = abs(df1[[j]][i1]))
}
As the OP wants a one-liner, for the single-row example shown:
df1[, lapply(.SD, function(x) replace(x, x < 0, abs(x)))]
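The line above returns a modified copy; to update the table by reference instead, a variant (a sketch, assuming all columns are numeric) would be:
df1[, (names(df1)) := lapply(.SD, function(x) replace(x, x < 0, abs(x)))]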
Benchmarks
Based on the system.time on a slightly bigger dataset
set.seed(42)
dfN <- data.frame(p1 = rnorm(1e7), p2 = rnorm(1e7), p3 = rnorm(1e7), p4 = rnorm(1e7))
dfN1 <- copy(dfN)
setDT(dfN1)
system.time({
i1 <- dfN < 0
dfN[i1] <- abs(dfN[i1])
})
# user system elapsed
# 1.63 0.50 2.12
system.time({
for(j in seq_along(dfN1)){
set(dfN1, i = which(dfN1[[j]]<0), j=j, value = abs(dfN1[[j]][dfN1[[j]]<0]))
}
})
# user system elapsed
# 0.91 0.08 0.98
As akrun posted above, the one-liner reply is
df1[, lapply(.SD, function(x) replace(x, x < 0, abs(x)))]
However, this is not exactly what I was looking for, since data.table seems much more syntactically complicated than data.frame (at least in this example): we are basically doing the vectorization ourselves in data.table (via lapply), while with a data.frame it happens automatically.

Sort one matrix based on another matrix

I'm trying to put the rows of one matrix in the same order as the rows of another matrix of the same dimension. However I can't quite figure out how to do this without an explicit loop. It seems I should be able to do this with subsetting and an apply or Map function, but I can't figure out how to do it.
Here's a toy example:
sortMe <- matrix(rnorm(6), ncol=2)
sortBy <- matrix(c(2,1,3, 1,3,2), ncol=2)
sorted <- sortMe
for (i in 1:ncol(sortMe)) {
  sorted[, i] <- sortMe[, i][sortBy[, i]]
}
Using this method, the resulting sorted matrix contains the values from sortMe sorted in the same order as the sortBy matrix. Any idea how I'd do this without the loop?
This (using a two-column integer matrix to index the matrix's two dimensions) should do the trick:
sorted <- sortMe
sorted[] <- sortMe[cbind(as.vector(sortBy), as.vector(col(sortBy)))]
Using lapply would work.
matrix(unlist(lapply(1:2, function(n) sortMe[,n][sortBy[,n]])), ncol=2)
But there is probably a more efficient way...
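For instance, a sketch with vapply, which builds the matrix directly instead of going through unlist (assuming sortMe is numeric):
sorted_v <- vapply(seq_len(ncol(sortMe)),
                   function(n) sortMe[sortBy[, n], n],
                   numeric(nrow(sortMe)))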
I'm going to suggest that you stick with your original version. I would argue that the original loop you wrote is somewhat easier to read and comprehend (and probably easier to write) than the other solutions offered.
Also, the loop is nearly as fast as the other solutions: (I borrowed @Josh O'Brien's timing code before he removed it from his post.)
set.seed(444)
n = 1e7
sortMe <- matrix(rnorm(2 * n), ncol=2)
sortBy <- matrix(c(sample(n), sample(n)), ncol=2)
#---------------------------------------------------------------------------
# @JD Long, original post.
system.time({
  sorted_JD <- sortMe
  for (i in 1:ncol(sortMe)) {
    sorted_JD[, i] <- sortMe[, i][sortBy[, i]]
  }
})
# user system elapsed
# 1.190 0.165 1.334
#---------------------------------------------------------------------------
# @Julius (post is now deleted).
system.time({
  sorted_Jul2 <- sortMe
  sorted_Jul2[] <- sortMe[as.vector(sortBy) +
    rep(0:(ncol(sortMe) - 1) * nrow(sortMe), each = nrow(sortMe))]
})
# user system elapsed
# 1.023 0.218 1.226
#---------------------------------------------------------------------------
# @Josh O'Brien
system.time({
  sorted_Jos <- sortMe
  sorted_Jos[] <- sortMe[cbind(as.vector(sortBy), as.vector(col(sortBy)))]
})
# user system elapsed
# 1.070 0.217 1.274
#---------------------------------------------------------------------------
# @Justin
system.time({
  sorted_Just <- matrix(unlist(lapply(1:2,
    function(n) sortMe[, n][sortBy[, n]])), ncol = 2)
})
# user system elapsed
# 0.989 0.199 1.162
all.equal(sorted_JD, sorted_Jul2)
# [1] TRUE
all.equal(sorted_JD, sorted_Jos)
# [1] TRUE
all.equal(sorted_JD, sorted_Just)
# [1] TRUE

Taking row means based on a partition of the columns

I have a matrix mat and would like to calculate, for each row, the mean over each group of columns defined by a grouping variable gp.
mat<-embed(1:5000,1461)
gp<-c(rep(1:365,each=4),366)
To do this, I use the following
colavg<-t(aggregate(t(mat),list(gp),mean))
But it takes much longer than I expect.
Any suggestions on making the code run faster?
Here is a fast approach; the steps are commented in the code.
system.time({
# create a list of column indices per group
gp.list <- split(seq_len(ncol(mat)), gp)
# for each group, compute the row means
means.list <- lapply(gp.list, function(cols)rowMeans(mat[,cols, drop = FALSE]))
# paste everything together
colavg <- do.call(cbind, means.list)
})
# user system elapsed
# 0.08 0.00 0.08
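A quick spot-check of one group against a direct computation (just a sanity check using the objects above; it should return TRUE):
all.equal(colavg[, "1"], rowMeans(mat[, gp == 1]))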
You could use an apply function, for example from the excellent plyr package:
# Create data
mat<-embed(1:5000,1461)
gp<-c(rep(1:365,each=4),366)
# Your code
system.time(colavg<-t(aggregate(t(mat),list(gp),mean)))
library(plyr)
# Put all data in a data frame
df <- data.frame(t(mat))
df$gp <- gp
# Using an apply function
system.time(colavg2 <- t(daply(df, .(gp), colMeans)))
Output:
> # Your code
> system.time(colavg<-t(aggregate(t(mat),list(gp),mean)))
user system elapsed
134.21 1.64 139.00
> # Using an apply function
> system.time(colavg2 <- t(daply(df, .(gp), colMeans)))
user system elapsed
52.78 0.06 53.23

R: Tabulations and insertions with data.table

I am trying to take a very large set of records with multiple indices, calculate an aggregate statistic on groups determined by a subset of the indices, and then insert that into every row in the table. The issue here is that these are very large tables - over 10M rows each.
Code for reproducing the data is below.
The basic idea is that there are a set of indices, say ix1, ix2, ix3, ..., ixK. Generally, I am choosing only a couple of them, say ix1 and ix2. Then, I calculate an aggregation of all the rows with matching ix1 and ix2 values (over all combinations that appear), for a column called val. To keep it simple, I'll focus on a sum.
I have tried the following methods:
1. Via sparse matrices: convert the values to a coordinate list, i.e. (ix1, ix2, val), then create a sparseMatrix - this nicely sums up everything, and then I need only convert back from the sparse matrix representation to the coordinate list (a rough sketch appears after the data-generation code below). Speed: good, but it is doing more than is necessary and it doesn't generalize to higher dimensions (e.g. ix1, ix2, ix3) or to more general functions than a sum.
2. Use of lapply and split: by creating a new index that is unique over all (ix1, ix2, ...) n-tuples, I can then use split and apply. The bad thing here is that the unique index is converted by split into a factor, and this conversion is terribly time consuming. Try system.time({zz <- as.factor(1:10^7)}).
3. I'm now trying data.table, via a command like sumDT <- DT[, sum(val), by = c("ix1","ix2")]. However, I don't yet see how I can merge sumDT with DT, other than via something like DT2 <- merge(DT, sumDT, by = c("ix1","ix2")).
Is there a faster method for this data.table join than the merge operation I've described?
[I've also tried bigsplit from the bigtabulate package, and some other methods. Anything that converts to a factor is pretty much out - as far as I can tell, that conversion process is very slow.]
Code to generate data. Naturally, it's better to try a smaller N to see that something works, but not all methods scale very well for N >> 1000.
N <- 10^7
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DF <- data.frame(ix1 = ix1, ix2 = ix2, ix3 = ix3, val = val)
DF <- DF[order(DF[,1],DF[,2],DF[,3]),]
library(data.table)
DT <- as.data.table(DF)
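For reference, here is a rough sketch of the sparse-matrix approach from method 1 above, using the DF just built (an illustration only; it relies on sparseMatrix summing duplicate (i, j) entries and on ix1/ix2 being positive integers):
library(Matrix)
sm  <- sparseMatrix(i = DF$ix1, j = DF$ix2, x = DF$val)  # duplicate pairs are summed
agg <- summary(sm)  # back to a coordinate list: columns i, j, x hold the sums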
Well, it's possible you'll find that doing the merge isn't so bad as long as your keys are properly set.
Let's setup the problem again:
N <- 10^6 ## not 10^7 because RAM is tight right now
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DT <- data.table(ix1=ix1, ix2=ix2, ix3=ix3, val=val, key=c("ix1", "ix2"))
Now you can calculate your summary stats
info <- DT[, list(summary=sum(val)), by=key(DT)]
And merge the columns "the data.table way", or just with merge
m1 <- DT[info] ## the data.table way
m2 <- merge(DT, info) ## if you're just used to merge
identical(m1, m2)
[1] TRUE
If either of those ways of merging is too slow, you can try a tricky way to build info at the cost of memory:
info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)]
m3 <- transform(DT, summary=info2$summary)
identical(m1, m3)
[1] TRUE
Now let's see the timing:
#######################################################################
## Using data.table[ ... ] or merge
system.time(info <- DT[, list(summary=sum(val)), by=key(DT)])
user system elapsed
0.203 0.024 0.232
system.time(DT[info])
user system elapsed
0.217 0.078 0.296
system.time(merge(DT, info))
user system elapsed
0.981 0.202 1.185
########################################################################
## Now the two parts of the last version done separately:
system.time(info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)])
user system elapsed
0.574 0.040 0.616
system.time(transform(DT, summary=info2$summary))
user system elapsed
0.173 0.093 0.267
Or you can skip the intermediate info table building if the following doesn't seem too inscrutable for your tastes:
system.time(m5 <- DT[ DT[, list(summary=sum(val)), by=key(DT)] ])
user system elapsed
0.424 0.101 0.525
identical(m5, m1)
# [1] TRUE
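As a side note, with a reasonably recent data.table you could probably also skip building info entirely and add the group sum by reference in one grouped := statement (a sketch, not benchmarked above):
DT[, summary := sum(val), by = key(DT)]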
