How to subset from a long list? (R)

I have a list x with millions of entries, and I want to put all the entries with a length greater than one into a new list z. How can I do this efficiently in R?
I tried this code, and R just keeps running for a long time.
z <- NULL
for (i in 1:length(x)) {
  if (length(x[[i]]) != 1) z <- list(z, x[[i]])
}

This is one case where you want to use vapply:
z <- x[vapply(x, length, integer(1)) > 1L]
Here are benchmarks comparing sapply and vapply:
A <- list( x = c(), y = c(1), z = c(1, 2))
B <- A[sample(1:3, 1e7, replace = TRUE)]
system.time(sapply(B, length))
# user system elapsed
# 55.95 0.54 56.50
system.time(vapply(B, length, integer(1)))
# user system elapsed
# 6.78 0.00 6.78

Just do:
z = x[sapply(x, length) > 1]
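A further base R option worth noting (a sketch, not from the original answers): lengths(), available since R 3.2.0, computes the element lengths directly in C and avoids the per-element function calls of sapply/vapply:
# same subsetting, using the dedicated lengths() helper
z <- x[lengths(x) > 1L]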

Related

Optimization of apply

I have existing code that calculates concordance value for a dataframe/matrix. It's basically the number of rows where all the values are the same over the total number of rows.
...
concordance <- new[complete.cases(new), ]  # removes rows with NAs
TF <- apply(concordance, 1, function(x) if (length(unique(x)) > 1) F else T)
# outputs vector of T/F indicating whether each row is concordant
numF <- table(TF)["TRUE"]  # gets number of TRUEs
concValue <- numF/NROW(TF)  # TRUE/total
...
Above is what I have now. It runs ok but I was wondering if there was any way to make it faster.
Edit: The dimensions of the object vary, but there are typically 2-6 columns and 1,000,000+ rows. This is part of a package I'm developing, so the input data are variable.
Because the number of rows is much larger than the number of columns, it makes sense to loop over the columns instead, dropping rows that contain more than one distinct value as you go:
propIdentical <- function(Mat){
  nrowInit <- nrow(Mat)
  for (i in 1:(ncol(Mat) - 1)) {
    if (!nrow(Mat)) break  # stop if the matrix has no rows left
    else {
      # check which elements of column i and column i+1 are equal:
      equals <- Mat[, i] == Mat[, i + 1]
      # remove all other rows from the matrix
      Mat <- Mat[equals, , drop = FALSE]
    }
  }
  return(nrow(Mat)/nrowInit)
}
some tests:
set.seed(1)
# normal case
dat <- matrix(sample(1:10, rep = T, size = 3*10^6), nrow = 10^6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.053 0.017 0.070
[1] 0.009898
# normal case on my pc for comparison:
system.time(app <- mean(apply(dat, 1, function(x) length(unique(x))) == 1L)); app
user system elapsed
12.176 0.036 12.231
[1] 0.009898
# worst case
dat <- matrix(1L, nrow = 10^6, ncol = 6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.302 0.044 0.348
[1] 1
# worst case on my pc for comparison
system.time(mean(apply(dat, 1, function(x) length(unique(x))) == 1L))
user system elapsed
12.562 0.001 12.578
# testing drop = FALSE and if (!nrow(Mat)) break
dat <- matrix(1:2, ncol = 2)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0 0 0
[1] 0
Note: if you run this on a data.frame make sure to turn it into a matrix first.
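For example, with the concordance object from the question (a minimal usage sketch):
# coerce the data.frame to a matrix before calling propIdentical
concValue <- propIdentical(as.matrix(concordance))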

How to efficiently count matches across a list in R?

I have a list of vectors of integers, for example:
set.seed(1)
vec_list <- replicate(100, sample(1:10000000, size=sample(1:10000, 1)), simplify=FALSE)
And a vector of integers, for example:
vec <- sample(1:10000000, size=10000)
How can I count the number of integers in each vector of vec_list that appear in the vector vec? I can do this using a for loop. For example:
total_match <- rep(NA, length(vec_list))
for (i in 1:length(vec_list)) {
  total_match[i] <- length(which(vec_list[[i]] %in% vec))
  print(i)
}
However, the list and vector I am trying to apply this to are very large, and this is slow. Please help with suggestions on how to improve performance.
Using data.table is much faster, but does not return 0's when there are no matches. For example:
DT <- data.table(repid=rep(1:length(vec_list), sapply(vec_list, length)), val=unlist(vec_list))
total_match2 <- DT[.(vec), on=.(val), nomatch=0L, .N, keyby=.(repid)]$N
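One way to get the zeros back from that result (a sketch, not from the original thread) is to fill a zero-initialised vector by repid:
counts <- DT[.(vec), on=.(val), nomatch=0L, .N, keyby=.(repid)]
total_match2 <- integer(length(vec_list))  # zeros for list elements with no matches
total_match2[counts$repid] <- counts$N     # fill in the non-zero counts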
What about:
sapply(vec_list, function(x) sum(x %in% vec))
Maybe try:
DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
Another, a variant of #chinsoon's:
nvec = 5000
max_size = 10000
nv = 10000000
set.seed(1)
vec_list <- replicate(nvec, sample(nv, size=sample(max_size, 1)), simplify=FALSE)
vec <- sample(nv, size=max_size)
system.time(
  res <- rbindlist(lapply(vec_list, list), id=TRUE)[.(vec), on=.(V1), nomatch=0, .N, keyby=.id]
)
# user system elapsed
# 0.86 0.20 0.47
system.time({
  DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
  res2 <- DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
})
# user system elapsed
# 1.03 0.45 1.00
identical(res2[res2 != 0], res$N) # TRUE

Is there a fast way to extract elements in a list of data frames?

I'm trying to find a fast way to extract elements in a list of data frames.
To do this, I've tested the function lapply. Here is a reproducible example:
i <- 2
dat <- replicate(100000, data.frame(x=1:5000, y = 1:5000, z = 1:5000), simplify=FALSE)
system.time(test <- lapply(dat, function(y) y[i, c("x", "y")]))
user system elapsed
7.69 0.00 7.73
Ideally, the elapsed time should be <= 1 second.
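One avenue worth sketching (not from the original thread): most of the time goes into dispatching the data.frame [ method 100,000 times, so pulling the columns out with $ and indexing the underlying vectors directly sidesteps that overhead:
# sketch: returns a 100000 x 2 matrix instead of a list of one-row data frames
xy <- t(vapply(dat, function(d) c(d$x[i], d$y[i]), numeric(2)))
colnames(xy) <- c("x", "y")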

R: Fastest way to obtain the first and last location of each unique value in a vector?

I have a vector containing a list of unknown values. I would like to know the fastest way in R to obtain the first and last index of each unique value and return an n-by-2 matrix.
For example, the below works but I think might be too slow for large vectors.
library(magrittr)
vals <- sample(1:100, 1e7, replace = T)
a = t(sapply(unique(vals), function(uv) {
  w = which(uv == vals)
  c(w[1], w[length(w)])
}))
Rcpp solutions welcome.
The current solution can be made more efficient with split from base R
system.time({
  a <- t(sapply(unique(vals), function(uv) {
    w = which(uv == vals)
    c(w[1], w[length(w)])
  }))
})
# user system elapsed
# 4.75 1.60 6.39
system.time({
  a1 <- do.call(rbind, lapply(split(seq_along(vals), vals),
    function(x) x[c(1, length(x))]))[as.character(unique(vals)), ]
})
# user system elapsed
# 0.09 0.00 0.09
all.equal(a, a1, check.attributes = FALSE)
#[1] TRUE
Another option is match/fmatch, which turns out to be slower than split:
library(fastmatch)
system.time({
  a2 <- cbind(fmatch(unique(vals), vals), length(vals) - fmatch(unique(vals), rev(vals)) + 1)
})
# user system elapsed
# 0.45 0.25 0.70
all.equal(a, a2, check.attributes = FALSE)
#[1] TRUE
data
set.seed(24)
vals <- sample(1:100, 1e7, replace = TRUE)
And a data.table version could be something like
DT <- data.table(vals)
DT[, .(first=min(.I), last=max(.I)), by=vals]
Or with dplyr:
tibble(vals) %>% mutate(row = row_number()) %>%
  group_by(vals) %>% summarise(first = min(row), last = max(row))
The timings are quite similar to what #akrun gets with the elegant base R split call, though, so not a lot is gained there.

Removing columns that satisfy a condition from a big data.frame in R

I have a big data.frame; 100,000 observations of 700 variables.
Most of the variables actually have the value 0 in all observations, and I would like to remove those variables/columns.
I tried the following,
data <- data[apply(data, 2, function(x){all(x == 0)})]
But the apply took a long time to finish.
I tried a while loop, in case the problem was working with all the data at once.
i <- 1
while (i <= ncol(data)) {
  if (all(data[i] == 0)) {
    data[i] <- NULL
  } else {
    i <- i + 1
  }
}
But I kept having the same problem; it still took too long.
So,
Why does that operation take THAT long? Even though the data.frame is big, the operation is pretty simple.
and, above all
Is there any way to do this faster?
Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any with automatic coercion of values to type logical. The usual warnings regarding comparison of floating point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.
DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
DF
#  x   y
#1 1 0.1
#2 2 0.2
#3 3 0.3
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user system elapsed
#0.10 0.02 0.11
Safer option:
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user system elapsed
#0.34 0.11 0.45
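If the columns can also contain negative values, comparing absolute values is a safer variant of the same check (a sketch, not from the original answer):
# keep a column only if at least one entry is meaningfully different from zero
DF2[] <- lapply(DF2, function(x) if (any(abs(x) > 1e-16)) x else NULL)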
Using a vectorized operation like colSums speeds up the operation on my machine:
set.seed(123)
df = data.frame(matrix(sample(0:1, 100000*700, replace = T, prob = c(0.9999999, 0.0000001)), ncol = 700))
system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
 user  system elapsed
1.386   0.821   2.225
system.time(df2 <- df[, which(colSums(df) == 0)])
 user  system elapsed
0.243   0.082   0.326
identical(df1, df2)
[1] TRUE
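If the goal is to drop the all-zero columns rather than select them (the benchmark above keeps them, mirroring the question's code), the same colSums idea can be flipped; a sketch:
# keep only the columns that contain at least one non-zero value
df_nonzero <- df[, colSums(df != 0) > 0, drop = FALSE]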
