I have a data frame with a variable, database$VAR, which has values of 0 and 1.
How can I redefine the data frame so that the rows where it equals 1 are removed?
Thanks!
TMTOWTDI (there's more than one way to do it). Using subset:
df.new <- subset(df, VAR == 0)
EDIT:
David's solution seems to be the fastest on my machine, and subset seems to be the slowest. I won't even pretend to understand what's going on under the hood that accounts for these differences:
> df <- data.frame(y=rep(c(1,0), times=1000000))
>
> system.time(df[ -which(df[,"y"]==1), , drop=FALSE])
user system elapsed
0.16 0.05 0.23
> system.time(df[which(df$y == 0), ])
user system elapsed
0.03 0.01 0.06
> system.time(subset(df, y == 0))
user system elapsed
0.14 0.09 0.27
I'd upvote the answer using subset if I had the reputation for it :-). You can also use a logical vector directly for subsetting -- no need for which():
d <- data.frame(VAR = c(0,1,0,1,1))
d[d$VAR == 0, , drop=FALSE]
I'm surprised to find the logical version a little faster in at least one case. (I expected the "which" version might win due to R possibly preallocating the proper amount of storage for the result.)
> d <- data.frame(y=rep(c(1,0), times=1000000))
> system.time(d[which(d$y == 0), ])
user system elapsed
0.119 0.067 0.188
> system.time(d[d$y == 0, ])
user system elapsed
0.049 0.024 0.074
Try this:
R> df <- data.frame(VAR = c(0,1,0,1,1))
R> df[ -which(df[,"VAR"]==1), , drop=FALSE]
VAR
1 0
3 0
R>
We use which( booleanExpr ) to get the indices for which the condition holds, negate those indices with a minus sign to exclude the matching rows, and lastly use drop=FALSE to prevent our one-column data.frame from collapsing into a vector.
I have a big data.frame; 100,000 observations of 700 variables.
Most of the variables actually have the value 0 in all observations, and I would like to remove those variables/columns.
I tried the following,
data <- data[apply(data, 2, function(x){all(x == 0)})]
But the apply() call took a long time to finish.
I tried a while loop, in case the problem was working with all of the data at once.
i <- 1
while (i <= ncol(data)) {
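# data[i] selects column i as a one-column data.frame; assigning NULL drops that column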
if (all(data[i] == 0)) {
data[i] <- NULL
} else {
i <- i+1
}
}
But I kept having the same problem: it took too long.
So,
Why does that operation take THAT long? Even though the data.frame is big, the operation is pretty simple.
and, above all
Is there any way to do this faster?
Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any(), relying on automatic coercion of values to type logical. The usual warnings regarding comparison of floating-point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.
DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
# x y
#1 1 0.1
#2 2 0.2
#3 3 0.3
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user system elapsed
#0.10 0.02 0.11
Safer option:
set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))
system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user system elapsed
#0.34 0.11 0.45
Using a vectorized operation like colSums() speeds up the operation on my machine:
> set.seed(123)
> df = data.frame(matrix(sample(0:1,100000*700,replace = T,prob = c(0.9999999,0.0000001)), ncol = 700))
> system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
user system elapsed
1.386 0.821 2.225
> system.time(df2 <- df[,which(colSums(df)==0)])
user system elapsed
0.243 0.082 0.326
> identical(df1, df2)
[1] TRUE
In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many NA elements in both its rows and columns, and I am looking to replace these values with zero. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than on, say, "why would you replace NAs with zeros?". And, while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), the focus there was not on computational speed for a large data frame with many missing values.
Thanks!
Edited Solution:
As helpfully suggested, converting d to a matrix before applying is.na sped up the computation by an order of magnitude.
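A minimal sketch of that approach (variable names are illustrative, not the exact code used):
m <- as.matrix(d)        # coerce the data frame to a numeric matrix
m[is.na(m)] <- 0         # the logical replacement is much faster on a matrix
d <- as.data.frame(m)    # convert back if a data frame is needed afterwards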
You can get a considerable performance increase using the data.table package.
It is much faster, in general, with a lot of manipulations and transformations.
The downside is the learning curve of the syntax.
However, if you are looking for a speed performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function (df) {
require(tidyr)
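# build a named list with one 0 per column, the form replace_na() expects for a data frame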
l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function (df) {
require(data.table)
dt <- data.table(df)
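# set() updates by reference: for each column j, the rows holding NA are assigned 0 without copying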
for (j in 1:ncol(dt)) {set(dt, which(is.na(dt[[j]])), j, 0)}
setDF(dt) # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
I guess that all columns must be numeric or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
I hope this helps.
The R function
xts:::na.locf.xts
is extremely slow when used with a multicolumn xts of more than a few columns.
There is indeed a loop over the columns in the code of na.locf.xts.
I am trying to find a way to avoid this loop.
Any idea?
The loop in na.locf.xts is slow because it creates a copy of the entire object for each column in the object. The loop itself isn't slow; the copies created by [.xts are slow.
There's an experimental (and therefore unexported) version of na.locf.xts on R-Forge that moves the loop over columns to C, which avoids copying the object. It's quite a bit faster for very large objects.
library(xts)
set.seed(21)
m <- replicate(20, rnorm(1e6))
is.na(m) <- sample(length(m), 1e5)
x <- xts(m, Sys.time()-1e6:1)
y <- x[1:1e5,1:3]
> # smaller objects
> system.time(a <- na.locf(y))
user system elapsed
0.008 0.000 0.008
> system.time(b <- xts:::.na.locf.xts(y))
user system elapsed
0.000 0.000 0.003
> identical(a,b)
[1] TRUE
> # larger objects
> system.time(a <- na.locf(x))
user system elapsed
1.620 1.420 3.064
> system.time(b <- xts:::.na.locf.xts(x))
user system elapsed
0.124 0.092 0.220
> identical(a,b)
[1] TRUE
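# Save the time index, run na.locf column-by-column on the underlying matrix, then rebuild the xts: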
timeIndex <- index(x)
x <- apply(x, 2, na.locf)
x <- as.xts(x, order.by = timeIndex)
This avoids the column-by-column data copying. Without it, when filling the nth column you make a copy of columns 1:(n-1) and append the nth column to it, which becomes prohibitively slow when n is large.
What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:
sum( is.na( data ) ) > 0
But that requires examining each element, coercion, and the sum function.
As of R 3.1.0 anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na that is immediately discarded. Borrowing Joran's example:
x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
# expr min lq mean median uq
# any(is.na(x)) 13444.674 13509.454 21191.9025 13639.3065 13917.592
# anyNA(x) 6.840 13.187 13.5283 14.1705 14.774
# any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
# anyNA(y) 7193.784 7285.107 7694.1785 7497.9265 7865.064
Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
I'm thinking:
any(is.na(data))
should be slightly faster.
We mention this in some of our Rcpp presentations and actually have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution, because a vectorised R solution still computes every single element of the vector expression, whereas if your goal is just to satisfy any(), you can abort after the first match -- which is what our Rcpp sugar solution does (in essence, some C++ template magic to make C++ expressions look more like R expressions; see this vignette for more).
So by getting a compiled specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question here, I am reasonably confident about the performance.
Edit: The Rcpp package also contains examples in the sugarPerformance directory. They show a speed-up of several thousand-fold for the 'sugar-can-abort-early' version of any() over the 'R-computes-the-full-vector-expression' version, but I should add that that case does not involve is.na() but a simple boolean expression.
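For illustration, here is a minimal Rcpp sketch of that idea (the function name is hypothetical; this is not the code from the sugarPerformance examples): a compiled loop that returns as soon as it hits an NA.
library(Rcpp)
cppFunction('
bool any_na_early(NumericVector x) {
    int n = x.size();
    for (int i = 0; i < n; ++i) {
        // stop scanning at the first NA instead of evaluating the whole vector
        if (NumericVector::is_na(x[i])) return true;
    }
    return false;
}')
x <- runif(1e7); x[1e4] <- NA
any_na_early(x)   # TRUE, after touching only the first 10,000 elements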
One could write a for loop stopping at NA, but the system.time then depends on where the NA is... (if there is none, it takes looooong)
set.seed(1234)
x <- sample(c(1:5, NA), 100000000, replace = TRUE)
nacount <- function(x){
for(i in 1:length(x)){
if(is.na(x[i])) {
print(TRUE)
break}
}}
system.time(
nacount(x)
)
[1] TRUE
user system elapsed
0.14 0.04 0.18
system.time(
any(is.na(x))
)
user system elapsed
0.28 0.08 0.37
system.time(
sum(is.na(x)) > 0
)
user system elapsed
0.45 0.07 0.53
Here are some actual times from my (slow) machine for some of the various methods discussed so far:
x <- runif(1e7)
x[1e4] <- NA
> system.time(sum(is.na(x)) > 0)
user system elapsed
0.065 0.001 0.065
> system.time(any(is.na(x)))
user system elapsed
0.035 0.000 0.034
> system.time(match(NA,x))
user system elapsed
1.824 0.112 1.918
> system.time(NA %in% x)
user system elapsed
1.828 0.115 1.925
> system.time(which(is.na(x) == TRUE))
user system elapsed
0.099 0.029 0.127
It's not surprising that match and %in% are similar, since %in% is implemented using match.
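For reference, this is (essentially) how base R defines %in% as a thin wrapper around match():
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0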
You can try:
d <- c(1,2,3,NA,5,3)
which(is.na(d) == TRUE, arr.ind=TRUE)
As a matter of best practices, I'm trying to determine if it's better to create a function and apply() it across a matrix, or if it's better to simply loop a matrix through the function. I tried it both ways and was surprised to find apply() is slower. The task is to take a vector and evaluate it as either being positive or negative and then return a vector with 1 if it's positive and -1 if it's negative. The mash() function loops and the squish() function is passed to the apply() function.
million <- as.matrix(rnorm(100000))
mash <- function(x){
for(i in 1:NROW(x))
if(x[i] > 0) {
x[i] <- 1
} else {
x[i] <- -1
}
return(x)
}
squish <- function(x){
if(x >0) {
return(1)
} else {
return(-1)
}
}
ptm <- proc.time()
loop_million <- mash(million)
proc.time() - ptm
ptm <- proc.time()
apply_million <- apply(million,1, squish)
proc.time() - ptm
loop_million results:
user system elapsed
0.468 0.008 0.483
apply_million results:
user system elapsed
1.401 0.021 1.423
What is the advantage to using apply() over a for loop if performance is degraded? Is there a flaw in my test? I compared the two resulting objects for a clue and found:
> class(apply_million)
[1] "numeric"
> class(loop_million)
[1] "matrix"
Which only deepens the mystery. The apply() function cannot accept a simple numeric vector, which is why I cast it with as.matrix() at the beginning. But then it returns a numeric. The for loop is fine with a simple numeric vector, and it returns an object of the same class as the one passed to it.
The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the bookkeeping code needed with loops.
Lately, answers on stackoverflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.
In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast, (million > 0) * 2 - 1.
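A quick sketch of that one-liner in context (assuming the million matrix and the mash() function from the question above):
vec_million <- (million > 0) * 2 - 1    # logical matrix coerced to 0/1, then rescaled to -1/+1
all.equal(vec_million, mash(million))   # expected to be TRUE; the matrix shape is preserved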
As Chase said: Use the power of vectorization. You're comparing two bad solutions here.
To clarify why your apply solution is slower:
Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no type conversion going on. I'm glossing over the details a bit here, but basically the internal calculation more or less ignores the dimensions; they're just kept as an attribute and returned with the vector representing the matrix. To illustrate:
> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE
Now, when you use the apply, the matrix is split up internally in 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the result is combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
The sapply function also first uses as.vector(unlist(...)) to convert anything to a vector, and at the end tries to simplify the answer into a suitable form. That takes time as well, so sapply might be slower here too. Yet, it's not on my machine.
If apply were a solution here (and it isn't), you could compare:
> system.time(loop_million <- mash(million))
user system elapsed
0.75 0.00 0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
user system elapsed
0.25 0.00 0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user system elapsed
0.34 0.00 0.34
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
You can use lapply or sapply on vectors if you want. However, why not use the appropriate tool for the job, in this case ifelse()?
> ptm <- proc.time()
> ifelse_million <- ifelse(million > 0,1,-1)
> proc.time() - ptm
user system elapsed
0.077 0.007 0.093
> all.equal(ifelse_million, loop_million)
[1] TRUE
And for comparison's sake, here are the two comparable runs using the for loop and sapply:
> ptm <- proc.time()
> apply_million <- sapply(million, squish)
> proc.time() - ptm
user system elapsed
0.469 0.004 0.474
> ptm <- proc.time()
> loop_million <- mash(million)
> proc.time() - ptm
user system elapsed
0.408 0.001 0.417
It is far faster in this case to do index-based replacement than either the ifelse(), the *apply() family, or the loop:
> million <- million2 <- as.matrix(rnorm(100000))
> system.time(million3 <- ifelse(million > 0, 1, -1))
user system elapsed
0.046 0.000 0.044
> system.time({million2[(want <- million2 > 0)] <- 1; million2[!want] <- -1})
user system elapsed
0.006 0.000 0.007
> all.equal(million2, million3)
[1] TRUE
It is well worth having all these tools at your finger tips. You can use the one that makes the most sense to you (as you need to understand the code months or years later) and then start to move to more optimised solutions if compute time becomes prohibitive.
A better example of the speed advantage of a for loop:
for_loop <- function(x){
out <- vector(mode="numeric",length=NROW(x))
for(i in seq(length(out)))
out[i] <- max(x[i,])
return(out)
}
apply_loop <- function(x){
apply(x,1,max)
}
million <- matrix(rnorm(1000000),ncol=10)
> system.time(apply_loop(million))
user system elapsed
0.57 0.00 0.56
> system.time(for_loop(million))
user system elapsed
0.32 0.00 0.33
EDIT: a version suggested by Eduardo:
max_col <- function(x){
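# matrix indexing: for each row i, pick the element in column max.col(x)[i], i.e. the row maximum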
x[cbind(seq(NROW(x)),max.col(x))]
}
By row
> system.time(for_loop(million))
user system elapsed
0.99 0.00 1.11
> system.time(apply_loop(million))
user system elapsed
1.40 0.00 1.44
> system.time(max_col(million))
user system elapsed
0.06 0.00 0.06
By column
> system.time(for_loop(t(million)))
user system elapsed
0.05 0.00 0.05
> system.time(apply_loop(t(million)))
user system elapsed
0.07 0.00 0.07
> system.time(max_col(t(million)))
user system elapsed
0.04 0.00 0.06