How do I sub sample data by group efficiently? - r

I have a problem similar to the one described in this question. Like that question, I have a data frame with 3 columns (id, group, value), and I want to take n samples with replacement from each group and produce a smaller data frame with n samples per group.
However, I am doing hundreds of subsamples in a simulation, and the ddply-based solution is too slow for my code. I tried to rewrite a simpler version to see if I could get better performance, but it is still slow (no better than the ddply solution, if not worse). Below is my code. Can it be improved for performance?
# Producing an example data frame
dfsize <- 10
groupsize <- 7
test.frame.1 <- data.frame(
  id = 1:dfsize,
  group = rep(1:groupsize, each = ceiling(dfsize / groupsize))[1:dfsize],
  junkdata = sample(1:10000, size = dfsize))

# Main function for subsampling
sample.from.group <- function(df, dfgroup, size, replace) {
  outputsize <- 1
  newdf <- df # assuming a sample cannot be larger than the original
  uniquegroups <- unique(dfgroup)
  for (uniquegroup in uniquegroups) {
    dataforgroup <- which(dfgroup == uniquegroup)
    mysubsample <- df[sample(dataforgroup, size, replace), ]
    sizeofsample <- nrow(mysubsample)
    newdf[outputsize:(outputsize + sizeofsample - 1), ] <- mysubsample
    outputsize <- outputsize + sizeofsample
  }
  return(newdf[1:(outputsize - 1), ])
}

# Using the function
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)

Here are two plyr-based solutions:
library(plyr)

dfsize <- 1e4
groupsize <- 7
testdf <- data.frame(
  id = seq_len(dfsize),
  group = rep(1:groupsize, length = dfsize),
  junkdata = sample(1:10000, size = dfsize))

sample_by_group_1 <- function(df, dfgroup, size, replace) {
  ddply(df, dfgroup, function(x) {
    x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
  })
}

sample_by_group_2 <- function(df, dfgroup, size, replace) {
  idx <- split_indices(df[[dfgroup]])
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}

library(microbenchmark)
microbenchmark(
  ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE),
  plyr  = sample_by_group_2(testdf, "group", 100, replace = TRUE)
)
# Unit: microseconds
#  expr  min   lq median   uq   max neval
# ddply 4488 4723   5059 5360 36606   100
#  plyr  443  487    507  536 31343   100
The second approach is much faster because it does the subsetting in a single step. If you can figure out how to do an operation in one step, that is usually an easy way to get better performance.
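The same one-step idea also works without plyr; the following is a base R sketch (not from the original answers) that builds all the row indices first and subsets once, using the testdf defined above:
sample_by_group_base <- function(df, dfgroup, size, replace) {
  # split the row indices by group, sample within each group, then subset the data frame once
  idx <- split(seq_len(nrow(df)), df[[dfgroup]])
  # note: sample() treats a length-1 vector as 1:n, so guard against single-row groups on real data
  subs <- lapply(idx, sample, size = size, replace = replace)
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
sample_by_group_base(testdf, "group", 100, replace = TRUE)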

I think this is cleaner and possibly faster:
z <- sapply(unique(test.frame.1$group), FUN = function(x) {
  sample(which(test.frame.1$group == x), 100, TRUE)
})
out <- test.frame.1[z, ]
out

Is there a quick way in R to predict the class outcome of an observation from a nearest neighbours model from RANN?

I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data are incorrectly classified and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package but the predictions take too long to compute. In researching alternatives I came across the nn2 function from the RANN package, which performs a nearest-neighbour search significantly faster than caret's k-nearest-neighbours implementation.
library(RANN)
library(tidyverse)

iris.scaled <- iris %>%
  mutate_if(is.numeric, scale)

iris.nn2 <- nn2(iris.scaled[1:4])
The result of the nn2 function is two lists, one of indices and one of distances. I want to use the indices table to work out the most likely grouping of each observation; however, it returns the row number of each neighbour and not its group. I need to replace this with the group it belongs to (in this case, the Species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row? Speed of calculation is the key concern.
Your scaling can be a real bottleneck when you have more columns (tested on a 200 x 22216 gene expression matrix). My version might not seem that impressive on the iris dataset, but on the larger dataset I get 1.3 s vs. 32.8 s execution time.
Using tabulate instead of table gives an additional improvement, although it is dwarfed by the gain from faster matrix scaling.
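To see why that helps, here is a minimal illustration (not part of the benchmarked code below): table() builds a named contingency table, whereas tabulate() simply counts integer codes, so converting the factor to integers once keeps the per-row majority vote in fast integer operations.
x <- factor(c("setosa", "setosa", "virginica"))
# table(): named counts, flexible but slower inside a tight apply() loop
names(which.max(table(x)))                                        # "setosa"
# tabulate(): counts integer codes directly, then map back to the levels
levels(x)[which.max(tabulate(as.integer(x), nbins = nlevels(x)))] # "setosa"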
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
                 require, character.only = TRUE))
enableJIT(3)

# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
  if (!is.null(rows) && !is.null(cols)) {x <- x[rows, cols, drop = FALSE]
  } else if (!is.null(rows)) {x <- x[rows, , drop = FALSE]
  } else if (!is.null(cols)) x <- x[, cols, drop = FALSE]
  cm <- colMeans(x, na.rm = TRUE)
  if (scale) csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE) else
    csd <- rep(1, length = length(cm))
  if (!center) cm <- rep(0, length = length(cm))
  x <- t((t(x) - cm) / csd)
  return(x)
}

# your posted version (mostly):
oldv <- function() {
  iris.scaled <- iris %>%
    mutate_if(is.numeric, scale)
  iris.nn2 <- nn2(iris.scaled[1:4])
  distance.index <- iris.nn2$nn.idx[, -1]
  target = iris.scaled$Species
  category_neighbours <- matrix(target[distance.index[, ]],
                                nrow = nrow(distance.index), ncol = ncol(distance.index))
  class <- apply(category_neighbours, 1, function(x) {
    x1 <- table(x)
    names(x1)[which.max(x1)]})
  cbind(iris, class)
}

## my version:
myv <- function() {
  iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2] - 1)]))
  iris.nn2 <- nn2(iris.scaled)
  # set self neighbours to NA
  iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
  # match up categories
  category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[, ]],
                                nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  # turn category_neighbours into numeric for tabulate
  cn <- matrix(as.numeric(factor(category_neighbours, exclude = NULL)),
               nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  cnl <- levels(factor(category_neighbours, exclude = NULL))
  # tabulate frequencies and match up with factor levels
  class <- apply(cn, 1, function(x) {
    cnl[which.max(tabulate(x, nbins = length(cnl))[!is.na(cnl)])]})
  cbind(iris, class)
}

microbenchmark(oldv(), myv(), times = 100L)
#> Unit: milliseconds
#>   expr       min        lq      mean    median        uq      max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201   100   b
#>  myv()  2.430544  2.551342  3.020262  2.612714  2.691179 22.41435   100  a

plyr outperforms dplyr and data.table - What's wrong?

I have to apply a function to every row of a large table (~2M rows). I used to use plyr for that, but the table keeps growing and the current solution is approaching unacceptable runtimes. I thought I could just switch to data.table or dplyr and all would be fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)

dt = data.table("ID_1" = c(1:1000),                          # unique ID
                "ID_2" = ceiling(runif(1000, 0, 100)),       # other ID, duplicates possible
                "group" = sample(LETTERS[1:10], 1000, replace = T),
                "value" = runif(1000),
                "ballast1" = "X",                            # stays unchanged in derive
                "ballast2" = "Y",                            # stays unchanged in derive
                "ballast3" = "Z",                            # stays unchanged in derive
                "value_derived" = 0)
setkey(dt, ID_1)

extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  new_value = runif(1)
  row_derived = tmprow
  row_derived$group = new_group
  row_derived$value = runif(1)
  row_derived$value_derived = 1
  return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])

set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))

set.seed(42)
system.time(results_plyr <- ddply(dt, .variables = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce the runtimes for data.table and dplyr to about 0.6x and 0.8x of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's great, but I still expected a bigger performance gain from these packages. Any further suggestions?
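For reference, this is a sketch of derive with that single change applied (everything else as posted above):
derive2 = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)}
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  row_derived = as.data.frame(tmprow) # plain data.frame columns avoid data.table's per-assignment overhead
  row_derived$group = new_group
  row_derived$value = runif(1)
  row_derived$value_derived = 1
  return(row_derived)
}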
The issue is that the assignment you use has very high overhead in data.table; plyr converts each row to a data.frame before passing it to your derive function and thus avoids that overhead:
library(microbenchmark)

df = as.data.frame(dt)

microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
#                     expr      min       lq       mean    median       uq      max neval
# { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406   100
# { df$group = df$group }   26.045   45.244   64.13909   61.6045   79.635  157.266   100
I can't suggest a good fix, since you say your example isn't the real problem, so there is no point in solving it better. Some basic suggestions to look at are vectorising the code and using := or set instead (depending on what exactly you end up doing).
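To give a flavour of the := suggestion on the toy dt above (a hedged sketch only, since the real derive logic isn't shown), a vectorised by-reference update could look like:
library(data.table)
# update all non-excluded rows at once, by reference,
# instead of building a new one-row table per ID_1
dt[!(group %in% extra_arg),
   `:=`(value = runif(.N),    # one draw per remaining row
        value_derived = 1)]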

Vectorising a multiple condition dataframe merge

I am trying to merge two data frames. The original data frame is much larger than the data frame being merged in; however, there is only one possible match for each row. The match is found on the type (a factor) and the level. The level is an integer that falls into one of several buckets (the example only has two).
My current method works but uses sapply and is slow for large numbers of rows. How can I vectorise this operation?
set.seed(123)
sample <- 100
data <- data.frame(type = sample(LETTERS[1:4], sample, replace = TRUE),
                   level = round(runif(sample, 1, sample)),
                   value = round(runif(sample, 200, 1000)))
data2 <- data.frame(type = rep(LETTERS[1:4], 2),
                    lower = c(rep(1, 4), rep(51, 4)),
                    upper = c(rep(50, 4), rep(sample, 4)),
                    cost1 = runif(8, 0, 1),
                    cost2 = runif(8, 0, 1),
                    cost3 = runif(8, 0, 1))
data2[, 4:6] <- data2[, 4:6] / rowSums(data2[, 4:6]) # turns the variables into percentages; not necessary on the real data

x <- unlist(sapply(1:sample, function(n) which(ll <- data$type[n] == data2$type &
                                                     data$level[n] >= data2$lower &
                                                     data$level[n] <= data2$upper)))
data3 <- cbind(data, percentage = data2[x, -c(1:3)])
If I understand the matching problem you've set up, the following code seems to speed things up a bit by dividing data by type and then using cut to find the proper bucket. I think it will accommodate larger numbers of pairs of lower and upper values but haven't checked carefully.
library(plyr)

percents <- function(value, cost) {
  cost <- cost[cost[, 1] == value[1, 1], ]
  cost <- cost[order(cost[, 2]), ]
  ints <- cut(value[, 2], breaks = c(t(cost[, 2:3])), labels = FALSE, include.lowest = TRUE)
  cbind(value, percentage = cost[ceiling(ints / 2), -(1:3)])
}

data4 <- rbind.fill(mapply(percents, value = split(data, data$type), cost = list(data2),
                           SIMPLIFY = FALSE))
Setting
sample <- 10000
gives the following execution time comparisons
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <- data$type[n] == data2$type &
                                                                     data$level[n] >= data2$lower &
                                                                     data$level[n] <= data2$upper)));
                data3 <- cbind(data, percentage = data2[x, -c(1:3)])},
               data4 <- rbind.fill(mapply(percents, value = split(data, data$type),
                                          cost = list(data2), SIMPLIFY = FALSE)),
               times = 10)

Unit: milliseconds
 expr
 { x <- unlist(sapply(1:sample, function(n) which(ll <- data$type[n] == data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper))) data3 <- cbind(data, percentage = data2[x, -c(1:3)]) }
 data4 <- rbind.fill(mapply(percents, value = split(data, data$type), cost = list(data2), SIMPLIFY = FALSE))
        min         lq       mean     median        uq        max neval
 1198.18269 1214.10560 1225.85117 1226.79838 1234.2671 1258.63122    10
   20.81022   20.93255   21.50001   21.24237   22.1305   22.65291    10
where the first numbers are for the code shown in your question and the second times are for the code in my post. For this case, the new code seems almost 60 times faster.
Edit
To use rbind_all and avoid mapply, use the following:
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <- data$type[n] == data2$type &
                                                                     data$level[n] >= data2$lower &
                                                                     data$level[n] <= data2$upper)));
                data3 <- cbind(data, percentage = data2[x, -c(1:3)])},
               data4 <- rbind_all(lapply(split(data, data$type), percents, cost = data2)),
               times = 10)
which gives slightly improved execution times:
        min         lq       mean     median         uq        max neval
 1271.57023 1289.17614 1297.68572 1301.84540 1308.31476 1313.56822    10
   18.33819   18.57373   23.28578   19.53742   19.95132   58.96143    10
Edit 2
Modification to use the data2$lower values only for setting intervals
percents <- function(value, cost) {
  cost <- cost[cost[, "type"] == value[1, "type"], ]
  cost <- cost[order(cost[, "lower"]), ]
  ints <- cut(value[, "level"], breaks = c(cost[, "lower"], max(cost[, "upper"])),
              labels = FALSE, right = FALSE, include.lowest = TRUE) # with right = FALSE, include.lowest = TRUE closes the top interval
  cbind(value, percentage = cost[ints, -(1:3)])
}
to use with
data4 <- rbind_all(lapply(split(data, data$type), percents, cost=data2 ))

R: Creating a vector with a specific amount of random numbers

I was hoping someone could help point me in the right direction to create a vector in R, containing a defined amount of randomly generated numbers. I am a complete newbie to R, and I have learned that the concatenate function is used for creating vectors. However, I wish to populate the vector with 50 random numbers. I do not wish to specify a range or any other conditions for the numbers.
MyVectorObject <- c(...)
Any suggestions would be greatly appreciated!
It depends on which numbers you want to generate. Here are some options:
x1 <- rpois(n = 50, lambda = 10)
x2 <- runif(n = 50, min = 1, max = 10)
x3 <- sample(x = c(1, 3, 5), size = 50, replace = TRUE)
If we are talking about integers, you want to generate numbers in the interval [-base::.Machine$integer.max, base::.Machine$integer.max], which on my computer, for example, is [-2147483647, 2147483647].
Implementation
You can use base::sample to generate positive numbers from 1 to base::.Machine$integer.max:
random.pos <- function(N) {
  int.max <- .Machine$integer.max
  return(sample(int.max, N, replace = TRUE))
}
If you also want negative numbers, you can use:
random.posneg <- function(N) {
  int.max <- .Machine$integer.max
  random.numbers <- sample(int.max, N, replace = TRUE)
  random.signs <- sample(c(1, -1), N, replace = TRUE)
  return(random.numbers * random.signs)
}
NOTE: Neither of the functions above ever generates 0 (zero).
The best approach (in my opinion) is to use the base::runif function:
random.runif <- function(N) {
  int.max <- .Machine$integer.max
  int.min <- -int.max
  return(as.integer(runif(N, int.min, int.max)))
}
This will be a little slower than using base::sample, but you get uniformly distributed random numbers that can include zero.
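A quick usage sketch (not part of the original answer), just to show the three helpers side by side:
set.seed(42)
random.pos(3)    # positive integers in 1..(.Machine$integer.max)
random.posneg(3) # non-zero integers of either sign
random.runif(3)  # uniformly drawn integers, zero possible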
Benchmark
library(microbenchmark)
require(compiler)

random.runif <- cmpfun(random.runif)
random.pos <- cmpfun(random.pos)
random.posneg <- cmpfun(random.posneg)

N <- 500

op <- microbenchmark(
  RUNIF = random.runif(N),
  POS = random.pos(N),
  POSNEG = random.posneg(N),
  times = 10000
)

print(op)

## library(ggplot2)
## boxplot(op)
## qplot(y = time, data = op, colour = expr) + scale_y_log10()
And the results from the benchmark above:
Unit: microseconds
   expr    min     lq      mean median     uq      max neval
  RUNIF 13.423 14.251 15.197122 14.482 14.694 2425.290 10000
    POS  4.174  5.043  5.613292  5.317  5.645 2436.909 10000
 POSNEG 11.673 12.845 13.383194 13.285 13.800   60.304 10000

Finding the range of cols by row

I have a data frame that holds columns of numbers. I'd like to compute the range across these columns for each row and create a new column that contains this range:
tool1 tool2 tool3 range
    1    34    12    33
   NA    19    23     4
It has to be able to handle NAs too, by just ignoring them.
How could this be done?
I've decided to expand on this, because operating on rows in R is always a pain, so I'll compare base R against the two very efficient packages data.table and dplyr.
(I'm not a dplyr expert, so if someone wants to modify my answer, please do.)
Note:
Your case isn't a classic case of operating on rows, because it can be solved using the vectorised pmax and pmin, which we won't always be able to use.
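As a side note, pmax and pmin both accept na.rm, which covers the NA requirement from the question; a minimal sketch on a hypothetical data frame dat built from the question's two rows (not part of the benchmark below):
dat <- data.frame(tool1 = c(1, NA), tool2 = c(34, 19), tool3 = c(12, 23))
# row-wise range, ignoring NAs
dat$range <- pmax(dat$tool1, dat$tool2, dat$tool3, na.rm = TRUE) -
             pmin(dat$tool1, dat$tool2, dat$tool3, na.rm = TRUE)
dat$range
#> [1] 33  4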
So, creating slightly bigger data than in your example:
n <- 1e4
set.seed(123)
df <- data.frame(tool1 = sample(100, n, replace = T),
                 tool2 = sample(100, n, replace = T),
                 tool3 = sample(100, n, replace = T))
Loading the necessary packages
library(data.table)
library(dplyr)
library(microbenchmark)
Defining the functions
apply1 <- function(y) apply(y, 1, function(x) max(x, na.rm = T) - min(x, na.rm = T))
apply2 <- function(y) apply(y, 1, function(x) diff(range(x, na.rm = T)))
trans <- function(y) transform(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))
DTfunc <- function(y) setDT(y)[, range := pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3)]
DTfunc2 <- function(y) set(y, j = "range", value = with(y, pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))) # Thanks to #Arun for this
dplyrfunc <- function(y) mutate(y, range = pmax(tool1, tool2, tool3) - pmin(tool1, tool2, tool3))
df2 <- as.data.table(df) # This is in order to avoid overriding df by `setDT` during benchmarking
Running some benchmarks
microbenchmark(apply1(df), apply2(df), trans(df), DTfunc(df2), DTfunc2(df2), dplyrfunc(df), times = 100)
Unit: microseconds
          expr        min          lq      median          uq        max neval
    apply1(df)  37221.513  40699.3790  44103.3495  46777.305  94845.463   100
    apply2(df) 262440.581 278239.6460 287478.4710 297301.116 343962.869   100
     trans(df)   1088.799   1178.3355   1234.9940   1287.503   1965.328   100
   DTfunc(df2)   2068.750   2221.8075   2317.5680   2400.400   5935.883   100
  DTfunc2(df2)    903.981    959.0435    986.3355   1026.395   1235.951   100
 dplyrfunc(df)   1040.280   1118.9635   1159.9815   1200.680   1509.189   100
It seems like the second data.table approach is the most efficient. Base R transform and dplyr are pretty much the same, and both are more efficient than the first data.table approach because of the overhead of calling [.data.table.
