I have existing code that calculates a concordance value for a data.frame/matrix: essentially the number of rows where all the values are identical, divided by the total number of rows.
...
concordance <- new[complete.cases(new), ]  # removes rows with NAs
TF <- apply(concordance, 1, function(x) if(length(unique(x)) > 1) FALSE else TRUE)
# outputs a vector of TRUE/FALSE: is the row concordant?
numF <- table(TF)["TRUE"]  # gets the number of TRUEs
concValue <- numF/NROW(TF)  # TRUE / total
...
Above is what I have now. It runs ok but I was wondering if there was any way to make it faster.
Edit: the dimensions of the object vary, but there are typically 2-6 columns and 1,000,000+ rows. This is part of a package I'm developing, so the input data is variable.
Because the number of rows is much larger than the number of columns, it makes sense to loop over columns instead, dropping rows that contain more than one distinct value as you go:
propIdentical <- function(Mat){
  nrowInit <- nrow(Mat)
  for(i in 1:(ncol(Mat) - 1)){
    if(!nrow(Mat)) break  # stop if the matrix has no rows left
    else{
      # check which elements of column i and column i+1 are equal:
      equals <- Mat[, i] == Mat[, i + 1]
      # remove all other rows from the matrix
      Mat <- Mat[equals, , drop = FALSE]
    }
  }
  return(nrow(Mat)/nrowInit)
}
some tests:
set.seed(1)
# normal case
dat <- matrix(sample(1:10, rep = T, size = 3*10^6), nrow = 10^6)
system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.053 0.017 0.070
[1] 0.009898
# normal case on my pc for comparison:
system.time(app <- mean(apply(dat, 1, function(x) length(unique(x))) == 1L)); app
user system elapsed
12.176 0.036 12.231
[1] 0.009898
# worst case
dat <- matrix(1L, nrow = 10^6, ncol = 6)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0.302 0.044 0.348
[1] 1
# worst case on my pc for comparison
system.time(mean(apply(dat, 1, function(x) length(unique(x))) == 1L))
user system elapsed
12.562 0.001 12.578
# testing drop = FALSE and if(!nrow(Mat)) break
dat <- matrix(1:2, ncol = 2)
> system.time(prI <- propIdentical(dat)) ; prI
user system elapsed
0 0 0
[1] 0
Note: if you run this on a data.frame make sure to turn it into a matrix first.
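For example, a minimal sketch of that conversion (df here is a made-up data.frame):
# made-up two-column data.frame; as.matrix() converts it before the call
df <- data.frame(a = c(1, 2, 2), b = c(1, 2, 3))
propIdentical(as.matrix(df))
# [1] 0.6666667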
Related
I have a big data.frame; 100,000 observations of 700 variables.
Most of the variables actually have the value 0 in all of the observations, and I would like to remove those variables/columns.
I tried the following,
data <- data[apply(data, 2, function(x){all(x == 0)})]
But the apply took a lot of time to resolve.
I tried a while loop, in case the problem was working with all the data at once.
i <- 1
while (i <= ncol(data)) {
  if (all(data[i] == 0)) {
    data[i] <- NULL
  } else {
    i <- i + 1
  }
}
But I kept having the same problem: it took too long.
So,
Why does that operation take THAT long? Even though the data.frame is big, the operation is pretty simple.
and, above all
Is there any way to do this faster?
Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any(), relying on its automatic coercion of values to type logical. The usual warnings regarding comparison of floating-point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.
DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
# x y
#1 1 0.1
#2 2 0.2
#3 3 0.3
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user system elapsed
#0.10 0.02 0.11
Safer option:
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user system elapsed
#0.34 0.11 0.45
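If the columns can also contain negative values, the x > 1e-16 test would treat them as zero. A sketch comparing absolute values against a tolerance may be safer (the tolerance here is a made-up choice; pick one appropriate for your data):
tol <- 1e-8  # assumed tolerance, not from the original answer
DF2[] <- lapply(DF2, function(x) if (any(abs(x) > tol)) x else NULL)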
Using a vectorized operation like colSums speeds things up on my machine:
> set.seed(123)
> df = data.frame(matrix(sample(0:1,100000*700,replace = T,prob = c(0.9999999,0.0000001)), ncol = 700))
> system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
user system elapsed
1.386 0.821 2.225
> system.time(df2 <- df[,which(colSums(df)==0)])
user system elapsed
0.243 0.082 0.326
> identical(df1, df2)
[1] TRUE
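Note that colSums(df) == 0 identifies all-zero columns only when the values are non-negative (a +1 and a -1 in the same column would cancel out). A sign-agnostic sketch counts non-zero entries instead:
# counts non-zero entries per column instead of summing the values,
# so it also works when columns can contain negative numbers
df3 <- df[, colSums(df != 0) == 0, drop = FALSE]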
I have a particularly big dataset consisting of 3.7 million rows and 76 string columns.
I want to compare each row with the row below it and indicate how many of their values match; I have written the code below.
a <- c("a","a","a","a","a","a","a","a","a")
b <- c("b","b","b","b","a","b","b","b","b")
c <- c("c","c","c","c","a","a","a","b","b")
d <- c("d","d","d","d","d","d","d","d","d")
features_split <- data.frame(a,b,c,d); features_split
ncol <- max(sapply(features_split, length))
library(data.table)  # needed for as.data.table()
safe <- as.data.table(lapply(1:ncol, function(i) sapply(features_split, "[", i)))
nrow(safe)
df <- safe
LIST <-list()
LIST2 <-list()
for(i in 1:(nrow(df) - 1))
{
  LIST[[i]]  <- df[i + 1, ] %in% df[i, ]
  LIST2[[i]] <- length(LIST[[i]][LIST[[i]] == TRUE])
}
safe2 <- unlist(LIST2)
not_available <- rowSums(!is.na(safe))
It takes forever to run that loop. How can I improve?
(about 1 hour for 100,000 rows, but I have more than 3.7 million)
Grateful for anything,
Tobi
Using a data.frame
Proof of concept, using data.frame:
set.seed(4)
nr <- 1000
mydf <- data.frame(a = sample(letters[1:3], nr, repl = TRUE),
                   b = sample(letters[1:3], nr, repl = TRUE),
                   c = sample(letters[1:3], nr, repl = TRUE),
                   d = sample(letters[1:3], nr, repl = TRUE),
                   stringsAsFactors = FALSE)
matches <- vapply(seq.int(nrow(mydf) - 1),
                  function(ii) sum(mydf[ii, ] == mydf[ii + 1, ]),
                  integer(1))
head(matches)
## [1] 0 3 4 2 1 0
sum(matches == 4) # total number of perfect row-matches
## 16
In matches, the integer in position i indicates how many strings from row i exactly match the corresponding string from row i+1. A match of 0 means no matches at all, and (in this case) 4 means the row is a perfect match.
Taking it a bit larger for a demonstration of time:
nr <- 100000
nc <- 76
mydf2 <- as.data.frame(matrix(sample(letters[1:4], nr*nc, repl = TRUE), nc = nc),
                       stringsAsFactors = FALSE)
dim(mydf2)
## [1] 100000 76
system.time(
  matches2 <- vapply(seq.int(nrow(mydf2) - 1),
                     function(ii) sum(mydf2[ii, ] == mydf2[ii + 1, ]),
                     integer(1))
)
## user system elapsed
## 370.63 12.14 385.36
Using a matrix instead
If you can afford to do it as a matrix (since you have a homogenous data type of "character") instead of a data.frame, you'll get considerably better performance:
nr <- 100000
nc <- 76
mymtx2 <- matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc)
dim(mymtx2)
## [1] 100000     76
system.time(
  matches2 <- vapply(seq.int(nrow(mymtx2) - 1),
                     function(ii) sum(mymtx2[ii, ] == mymtx2[ii + 1, ]),
                     integer(1))
)
## user system elapsed
## 0.81 0.00 0.81
(Compare with 370.63 user from the previous run.) Scaling it up to full-strength:
nr <- 3.7e6
nc <- 76
mymtx3 <- matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc)
dim(mymtx3)
## [1] 3700000 76
system.time(
  matches3 <- vapply(seq.int(nrow(mymtx3) - 1),
                     function(ii) sum(mymtx3[ii, ] == mymtx3[ii + 1, ]),
                     integer(1))
)
## user system elapsed
## 35.32 0.05 35.81
length(matches3)
## [1] 3699999
sum(matches3 == nc)
## [1] 0
Unfortunately, still no matches, but I think 36 seconds is considerably better for 3.7M rows than an hour for 100K. (Please correct me if I've made an incorrect assumption.)
(Ref: win7 x64, R-3.0.3-64bit, intel i7-2640M 2.8GHz, 8GB RAM)
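If memory allows holding one extra logical matrix of nearly the same size, the loop can be dropped entirely by comparing two shifted views of the matrix; a sketch (not benchmarked here):
nr <- nrow(mymtx3)
# compare row i with row i + 1 element-wise, then count the matches per row
matches4 <- rowSums(mymtx3[-nr, , drop = FALSE] == mymtx3[-1, , drop = FALSE])
identical(as.integer(matches4), matches3)  # should agree with the vapply result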
I have a list x with millions of entries in it, and I want to put all the entries with a length larger than one into a new list z. How can I do this efficiently in R?
I tried this code, and R just keeps running for a long time.
z <- NULL
for(i in 1:length(x)) {
  if(length(x[[i]]) != 1) z <- list(z, x[[i]])
}
This is one case where you want to use vapply:
z <- x[vapply(x, length, integer(1)) > 1L]
Here are benchmarks comparing sapply and vapply:
A <- list( x = c(), y = c(1), z = c(1, 2))
B <- A[sample(1:3, 1e7, replace = TRUE)]
system.time(sapply(B, length))
# user system elapsed
# 55.95 0.54 56.50
system.time(vapply(B, length, integer(1)))
# user system elapsed
# 6.78 0.00 6.78
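On R 3.2.0 or later, base lengths() computes the element lengths in C and can replace the vapply call entirely (a small sketch; not benchmarked here):
z <- x[lengths(x) > 1L]  # lengths() returns an integer vector of element lengths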
Just do:
z = x[sapply(x, length) > 1]
Fastest way to find the index of the second (third, ...) highest/lowest value in a vector or column?
In other words: what sort(x, partial = n-1)[n-1] is to max(), but for which.max().
Best,
Fastest way to find second (third...) highest/lowest value in vector or column
One possible route is to use the index.return argument to sort. I'm not sure if this is fastest though.
set.seed(21)
x <- rnorm(10)
ind <- 2
sapply(sort(x, index.return=TRUE), `[`, length(x)-ind+1)
# x ix
# 1.746222 3.000000
EDIT 2:
As Joshua pointed out, none of the given solutions actually performs correctly when there is a tie on the maxima, so:
X <- c(11:19, 19)
n <- length(unique(X))
which(X == sort(unique(X), partial = n-1)[n-1])
is the fastest way of doing it correctly. I deleted the order approach, as it doesn't work and is a lot slower, so it's not a good answer according to the OP.
To illustrate the issue we ran into:
> X <- c(11:19,19)
> n <- length(X)
> which(X == sort(X,partial=n-1)[n-1])
[1] 9 10 # which are the indices of the duplicated maximum, 19
> n <- length(unique(X))
> which(X == sort(unique(X),partial=n-1)[n-1])
[1] 8 # which is the correct index of 18
The timings of the valid solutions :
> x <- runif(1000000)
> ind <- 2
> n <- length(unique(x))
> system.time(which(x == sort(unique(x),partial=n-ind+1)[n-ind+1]))
user system elapsed
0.11 0.00 0.11
> system.time(sapply(sort(unique(x), index.return=TRUE), `[`, n-ind+1))
user system elapsed
0.69 0.00 0.69
The Rfast library implements an nth-element function with a return-index option.
UPDATE (28/FEB/21): the kit package offers a faster implementation (topn), as shown in the simulations below.
x <- runif(1e+6)
n <- 2
which_nth_highest_richie <- function(x, n)
{
  for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
  which(x == max(x))
}
which_nth_highest_joris <- function(x, n)
{
  ux <- unique(x)
  nux <- length(ux)
  which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}
microbenchmark::microbenchmark(
  topn   = kit::topn(x, n, decreasing = TRUE)[n],
  Rfast  = Rfast::nth(x, n, descending = TRUE, index.return = TRUE),
  order  = order(x, decreasing = TRUE)[n],
  richie = which_nth_highest_richie(x, n),
  joris  = which_nth_highest_joris(x, n))
Unit: milliseconds
expr min lq mean median uq max neval
topn 3.741101 3.7917 4.517201 4.060752 5.108901 7.403901 100
Rfast 15.8121 16.7586 20.64204 17.73010 20.7083 47.6832 100
order 110.5416 113.4774 120.45807 116.84005 121.2291 164.5618 100
richie 22.7846 24.1552 39.35303 27.10075 42.0132 179.289 100
joris 131.7838 140.4611 158.20704 156.61610 165.1735 243.9258 100
topn is the clear winner for finding the index of the 2nd biggest value among 1 million numbers.
Further, simulations were run to estimate the running time of finding the nth biggest number for varying n. The variable x was repopulated for each n, but its size was always 1 million numbers.
topn is the best option for finding the nth biggest element and its index, provided n is not too big; in those simulations topn becomes slower than Rfast's nth as n grows.
It is worth noting that topn has not been implemented for n > 1000 and will throw an error in such cases.
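Given that limitation, one way to combine the two is to dispatch on n, using only the calls benchmarked above (a sketch; which_nth_highest_fast is a made-up name, and the 1000 cutoff mirrors the limitation just mentioned):
which_nth_highest_fast <- function(x, n) {
  # kit::topn is fastest for small n but errors for n > 1000, so fall back to Rfast::nth
  if (n <= 1000L) {
    kit::topn(x, n, decreasing = TRUE)[n]
  } else {
    Rfast::nth(x, n, descending = TRUE, index.return = TRUE)
  }
}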
Method: Set all max values to -Inf, then find the indices of the max. No sorting required.
X <- runif(1e7)
system.time({
  X[X == max(X)] <- -Inf
  which(X == max(X))
})
Works with ties and is very fast.
If you can guarantee no ties, then an even faster version is
system.time({
  X[which.max(X)] <- -Inf
  which.max(X)
})
EDIT: As Joris mentioned, this method doesn't scale that well for finding third, fourth, etc., highest values.
which_nth_highest_richie <- function(x, n)
{
  for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
  which(x == max(x))
}
which_nth_highest_joris <- function(x, n)
{
  ux <- unique(x)
  nux <- length(ux)
  which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}
Using x <- runif(1e7) and n = 2, Richie wins
system.time(which_nth_highest_richie(x, 2)) #about half a second
system.time(which_nth_highest_joris(x, 2)) #about 2 seconds
For n = 100, Joris wins
system.time(which_nth_highest_richie(x, 100)) #about 20 seconds, ouch!
system.time(which_nth_highest_joris(x, 100)) #still about 2 seconds
The balance point, where they take the same length of time, is about n = 10.
No ties
which() is probably your friend here. Combine the output from the sort() solution with which() to find the index that matches the output from the sort() step.
> set.seed(1)
> x <- sample(1000, 250)
> sort(x,partial=n-1)[n-1]
[1] 992
> which(x == sort(x,partial=n-1)[n-1])
[1] 145
Ties handling
The solution above doesn't work properly (and wasn't intended to) if there are ties and the tied values are the ith largest or larger. We need to take the unique values of the vector before sorting, and then the above solution works:
> set.seed(1)
> x <- sample(1000, 1000, replace = TRUE)
> length(unique(x))
[1] 639
> n <- length(x)
> i <- which(x == sort(x,partial=n-1)[n-1])
> sum(x > x[i])
[1] 0
> x.uni <- unique(x)
> n.uni <- length(x.uni)
> i <- which(x == sort(x.uni, partial = n.uni-1)[n.uni-1])
> sum(x > x[i])
[1] 2
> tail(sort(x))
[1] 994 996 997 997 1000 1000
order() is also very useful here:
> head(ord <- order(x, decreasing = TRUE))
[1] 220 145 209 202 211 163
So the solution here is ord[2] for the index of the 2nd highest/largest element of x.
Some timings:
> set.seed(1)
> X <- sample(1e7, 1e7)
> system.time({n <- length(X); which(X == sort(X, partial = n-1)[n-1])})
user system elapsed
0.319 0.058 0.378
> system.time({ord <- order(X, decreasing = TRUE); ord[2]})
user system elapsed
14.578 0.084 14.708
> system.time({order(X, decreasing = TRUE)[2]})
user system elapsed
14.647 0.084 14.779
As the linked post was getting at and the timings above show, order() is much slower, but both provide the same results:
> all.equal(which(X == sort(X, partial = n-1)[n-1]),
+ order(X, decreasing = TRUE)[2])
[1] TRUE
And for the ties-handling version:
foo <- function(x, i) {
  X <- unique(x)
  N <- length(X)
  i <- i - 1
  which(x == sort(X, partial = N - i)[N - i])
}
> system.time(foo(X, 2))
user system elapsed
1.249 0.176 1.454
So the extra steps slow this solution down a bit, but it is still very competitive with order().
Use the maxN function given by Zach to find the next max value, and use which() with arr.ind = TRUE.
which(x == maxN(x, 4), arr.ind = TRUE)
Using arr.ind will return index position in any of the above solutions as well and simplify the code.
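maxN itself isn't reproduced in this thread; a minimal partial-sort helper in the same spirit (Zach's actual implementation may differ, e.g. in how it validates N) could look like:
# sketch: value of the Nth highest element via a partial sort
maxN <- function(x, N = 2) {
  len <- length(x)
  sort(x, partial = len - N + 1)[len - N + 1]
}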
This is my solution for finding the indices of the top N highest values in a vector (not exactly what the OP wanted, but it might help other people):
index.top.N <- function(xs, N = 10){
  if(length(xs) > 0) {
    o <- order(xs, na.last = FALSE)
    o.length <- length(o)
    if (N > o.length) N <- o.length
    o[((o.length - N + 1):o.length)]
  }
  else {
    0
  }
}
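A quick usage example on a made-up vector:
xs <- c(5, 1, 9, 3, 7)
index.top.N(xs, N = 2)
# [1] 5 3   (the indices of 7 and 9, in increasing order of value)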
I'm relatively new to R, and was wondering about the most efficient way to iteratively construct a data.frame, one row at a time, when the number of iterations "n" and the length of each row "l" are known beforehand. The options I'm considering:
1. Create an empty data.frame and add a row each iteration
2. Pre-allocate an n x l data.frame and modify a row each iteration
3. Pre-allocate an n x l matrix, modify a row each iteration, then make a data.frame from the matrix
4. Something else
Pre-allocate!!!
And use a matrix if the data are all the same type. It will be much faster than a data.frame.
For example:
> n <- 1000 # Number of rows
> row <- 1:20*1 # one row
>
> # Adding row, one-by-one
> Data <- data.frame()
> system.time(for(i in 1:n) Data <- rbind(Data,row))
user system elapsed
2.18 0.00 2.18
>
> # Pre-allocated data.frame
> Data <- as.data.frame(Data)
> system.time(for(i in 1:n) Data[i,] <- row)
user system elapsed
0.94 0.00 0.93
>
> # Pre-allocated matrix (fast!)
> Data <- as.matrix(Data)
> system.time({ for(i in 1:n) Data[i,] <- row; Data <- as.data.frame(Data) })
user system elapsed
0 0 0
How about pre-allocating with whatever column types you need from a list first?
as.data.frame(list(a1 = vector("numeric", n), a2 = vector("character", n)))
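One way to fill such a pre-allocated frame row by row (a sketch with made-up values; stringsAsFactors = FALSE keeps the character column assignable on older R versions):
n  <- 1000
DF <- data.frame(a1 = numeric(n), a2 = character(n), stringsAsFactors = FALSE)
for (i in seq_len(n)) {
  DF$a1[i] <- i * 1.5          # made-up numeric value
  DF$a2[i] <- paste0("id", i)  # made-up character value
}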