data.table subsetting by NaN doesn't work (R)

I have a column in a data table with NaN values. Something like:
my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3), y = c(2, 4, 6, 8, 10, 12, 14))
setkey(my.dt, x)
I can use the J() function to find all instances where the x column is equal to .2
> my.dt[J(.2)]
x y
1: 0.2 10
2: 0.2 12
But if I try to do the same thing with NaN it doesn't work.
> my.dt[J(NaN)]
x y
1: NaN NA
I would expect:
x y
1: NaN 2
2: NaN 4
3: NaN 6
What gives? I can't find anything in the data.table documentation to explain why this is happening (although it may just be that I don't know what to look for). Is there any way to get what I want? Ultimately, I'd like to replace all of the NaN values with zero, using something like my.dt[J(NaN), x := 0]

Update: This has been fixed a while back, in v1.9.2. From NEWS:
NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.
require(data.table) ## 1.9.2+
my.dt[J(NaN)]
# x y
# 1: NaN 2
# 2: NaN 4
# 3: NaN 6
This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NAs in a data.table key.
The main idea, outlined in the FAQ, is that NAs are treated as FALSE.
Please feel free to chime in on the conversation on the mailing list. There was a thread started by @Arun:
http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html
Also, you can read more in the answers and comments to any of the following questions on SO:
subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently
In the meantime, your best bet is to use is.na.
While it is slower than a radix search, it is still faster than most vector searches in R, and certainly much, much faster than any fancy workarounds:
library(microbenchmark)
microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
# Unit: milliseconds
#                expr    median
#         my.dt[.(1)]  1.309948
#    my.dt[is.na(ID)]  3.444689  <~~ Not bad
#      my.dt[ID == 1]  4.005093
#  my.dt[!(!(!(ID)))] 10.038134
### using the following for my.dt
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
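And for the original goal of replacing the NaN values with zero, a plain vector scan does the job without a keyed join; a minimal sketch using the question's my.dt:
my.dt[is.nan(x), x := 0]  # update by reference where x is NaN
# (note: updating a key column by reference removes the key)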

Here's a fast workaround that relies a lot on what's actually happening internally (making the code a bit fragile imo). Because internally NaN is just a very very negative number, it will always be at the front of your data.table when you setkey. We can use that property to isolate those entries like so:
# this will give the index of the first element that is *not* NaN
my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T]
# this is equivalent to my.dt[!is.nan(x)], but much faster
my.dt[seq_len(my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T] - 1)]
Here's a benchmark for Ricardo's sample data:
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
# NOTE: I have to use integer max here because this example has integers
# instead of doubles, so I'll just add a simple helper function (that would
# likely need to be extended for other cases, but I'm just dealing with the ones here)
minN = function(x) if (is.integer(x)) -.Machine$integer.max else -.Machine$double.xmax
library(microbenchmark)
microbenchmark(normalJ = my.dt[J(1)],
naJ = my.dt[seq_len(my.dt[J(minN(ID)), roll = -Inf, which = T] - 1)])
#Unit: milliseconds
# expr min lq median uq max neval
# normalJ 1.645442 1.864812 2.120577 2.863497 5.431828 100
# naJ 1.465806 1.689350 2.030425 2.600720 10.436934 100
In my tests the following minN function also covers character and logical vectors:
minN = function(x) {
if (is.integer(x)) {
-.Machine$integer.max
} else if (is.numeric(x)) {
-.Machine$double.xmax
} else if (is.character(x)) {
""
} else if (is.logical(x)) {
FALSE
} else {
NA
}
}
And you will want to add mult = 'first', e.g.:
my.dt[seq_len(my.dt[J(minN(colname)), roll = -Inf, which = T, mult = 'first'] - 1)]

See if this is helpful.
my.dt[!is.finite(x),]
x y
1: NaN 2
2: NaN 4
3: NaN 6
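Note that !is.finite() also matches NA, Inf and -Inf; to match NaN alone, is.nan() works directly in i:
my.dt[is.nan(x)]
#      x y
# 1: NaN 2
# 2: NaN 4
# 3: NaN 6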

Related

Check whether elements of vectors are inside intervals given by matrix

Actually a really nice problem, for which I came up with a solution (see below) that is, however, not beautiful:
Assume you have a vector x and a matrix A which contains the start of an interval in the first column and the end of the interval in the second.
How can I get the elements of x which fall into the intervals given by A?
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
Expected output:
[1] 4 15
You can use the following information if it helps to increase performance:
Both, the vector and the rows of the matrix are ordered and the intervals don't overlap. All intervals have the same length. All numbers are integers, but can be huge.
Not wanting to be lazy, I came up with the following solution, which is too slow for long vectors and matrices:
x <- c(4, 7, 15) # Define input vector
A <- cbind(c(3, 9, 14), c(5, 11, 16)) # Define matrix with intervals
b <- vector()
for (i in 1:nrow(A)) {
b <- c(b, A[i, 1]:A[i, 2])
}
x[x %in% b]
I know that loops in R can be slow, but I did not know how to write the operation without one (maybe there is a way with apply).
We can use sapply to loop over each element of x and find if it lies in the range of any of those matrix values.
x[sapply(x, function(i) any(i > A[, 1] & i < A[,2]))]
#[1] 4 15
If length(x) and nrow(A) are the same, we don't even need the sapply loop and can use the comparison directly.
x[x > A[, 1] & x < A[,2]]
#[1] 4 15
Here is a method that does not use an explicit loop or an apply function. outer is sometimes much faster.
x[rowSums(outer(x, A[,1], `>=`) & outer(x, A[,2], `<=`)) > 0]
[1] 4 15
This answer is late, but I had the same problem to solve today, and my answer may be helpful for future readers. My solution was the following:
f3 <- function(x,A) {
Reduce(f = "|",
x = lapply(1:NROW(A),function(k) x>A[k,1] & x<A[k,2]),
init = logical(length(x)))
}
This function returns a logical vector of length(x) indicating whether the corresponding value in x falls in any of the intervals. To get the elements themselves, I simply have to write
x[f3(x,A)]
I did some benchmarks, and my function seems to work very well, also when testing with larger data.
Let's define the other solutions suggested in this post:
f1 <- function(x,A) {
sapply(x, function(i) any(i > A[, 1] & i < A[,2]))
}
f2 <- function(x,A) {
rowSums(outer(x, A[,1], `>`) & outer(x, A[,2], `<`)) > 0
}
Now they are also returning a logical vector.
The benchmarks on my machine are following:
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A))
#Unit: microseconds
# expr min lq mean median uq max neval
#f1(x, A) 21.5 23.20 25.023 24.30 25.40 61.8 100
#f2(x, A) 18.8 21.20 23.606 22.75 23.70 75.4 100
#f3(x, A) 13.9 15.85 18.682 18.30 19.15 52.2 100
It seems like there is no big difference, but the following example makes it more obvious:
x <- seq(1,100,length.out = 1e6)
A <- cbind(20:70,(20:70)+0.5)
microbenchmark::microbenchmark(f1(x,A), f2(x,A), f3(x,A), times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
#f1(x, A) 4176.172 4227.6709 4419.6010 4484.2946 4539.9668 4569.7412 10
#f2(x, A) 1418.498 1511.5647 1633.4659 1571.0249 1703.6651 1987.8895 10
#f3(x, A) 614.556 643.4138 704.3383 672.5385 770.7751 873.1291 10
That the functions all return the same result can be checked e.g. via:
all(f1(x,A)==f3(x,A))
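As an aside, since the question notes that x and the interval rows are both sorted and the intervals don't overlap, a findInterval()-based sketch avoids comparing every element against every interval (an assumption-driven alternative, not benchmarked above; it treats the interval bounds as inclusive):
x <- c(4, 7, 15)
A <- cbind(c(3, 9, 14), c(5, 11, 16))
# locate each x among the sorted interval starts, then check the matching end
idx <- findInterval(x, A[, 1])            # 0 means "before the first interval"
keep <- idx > 0 & x <= A[pmax(idx, 1), 2] # pmax() guards against a 0 index
x[keep]
# [1] 4 15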

Efficiently introduce new level on a factor vector

I have a long vector of class factor that contains NA values.
# simple example
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
For purposes of modeling, I wish to replace these NA values with a new factor level (e.g., 'Unknown') and set this level as the reference level.
Because the replacement level is not an existing level, simple replacement doesn't work:
# this won't work, since the replacement value is not an existing level of the factor
x[is.na(x)] <- '?'
x # returns: [1] <NA> A B C <NA> -- the NAs remain
# this doesn't work either:
replace(x, NA,'?')
I came up with a couple solutions, but both are kind of ugly and surprisingly slow.
f1 <- function(x, uRep='?'){
# convert to character, replace NAs with Unknown, and convert back to factor
stopifnot(is.factor(x))
newLevels <- c(uRep,levels(x))
x <- as.character(x)
x[is.na(x)] <- uRep
factor(x, levels=newLevels)
}
f2 <- function(x, uRep='?'){
# add new level for Unknown, replace NAs with Unknown, and make Unknown first level
stopifnot(is.factor(x))
levels(x) <- c(levels(x),uRep)
x[is.na(x)] <- uRep
relevel(x, ref=uRep)
}
f3 <- function(x, uRep='?'){ # thanks to @HongOoi
y <- addNA(x)
levels(y)[length(levels(y))]<-uRep
relevel(y, ref=uRep)
}
#test
f1(x) # works
f2(x) # works
f3(x) # works
Solution #2 is editing only the (relatively small) set of levels, plus one arithmetic op to relevel. I would have expected that to be faster than #1, which is casting to character and back to factor.
However, #2 is twice as slow on a benchmark vector of 10K elements with 10 levels and 10% NA.
x <- sample(factor(c(LETTERS[1:10],NA),levels=LETTERS[1:10]),10000,replace=TRUE)
library(microbenchmark)
microbenchmark(f1(x),f2(x),f3(x),times=500L)
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(x) 271.981 278.1825 322.4701 313.0360 360.7175 609.393 500
# f2(x) 651.728 703.2595 768.6756 747.9480 825.7800 1517.707 500
# f3(x) 808.246 883.2980 966.2374 927.5585 1061.1975 1779.424 500
Solution #3, my wrapper for the built-in addNA (mentioned in the answer below), was slower than either. addNA does some extra checks for NA values, sets the new level as the last one (requiring me to relevel), and names it NA (which then requires renaming by index before releveling, since an NA level is hard to access -- relevel(addNA(x), ref=NA_character_) doesn't work).
Is there a more efficient way to write this, or am I just hosed?
You can use fct_explicit_na followed by fct_relevel from the forcats package if you want a pre-fab solution. It's slower than your f1 function, but it still runs in a fraction of a second on a vector of length 100,000:
library(forcats)
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
[1] <NA> A B C <NA>
Levels: A B C
x = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown")
[1] Unknown A B C Unknown
Levels: Unknown A B C
Timings on a vector of length 100,000:
x <- sample(factor(c(LETTERS[1:10],NA), levels=LETTERS[1:10]), 1e5, replace=TRUE)
microbenchmark(forcats = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown"),
f1 = f1(x),
unit="ms", times=100L)
Unit: milliseconds
expr min lq mean median uq max neval cld
forcats 7.624158 10.634761 15.303339 12.162105 15.513846 250.0516 100 b
f1 3.568801 4.226087 8.085532 5.321338 5.995522 235.2449 100 a
There is a built-in function addNA for this.
From ?factor:
addNA(x, ifany = FALSE)
addNA modifies a factor by turning NA into an extra level (so that NA values are counted in tables, for instance).
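A minimal illustration of what addNA does (the question's f3 wraps this same idea):
x <- factor(c(NA, 'A', 'B', 'C', NA), levels = c('A', 'B', 'C'))
y <- addNA(x)
levels(y)  # "A" "B" "C" NA  -- NA is appended as the last level
table(y)   # the NA entries now get their own count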

R Improve performance of function(s)

This question is related to my previous one. Here is a small sample data. I have used both data.table and data.frame to find a faster solution.
test.dt <- data.table(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.dt[,rown:=as.numeric(row.names(test.dt))]
test.df <- data.frame(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1 1 2 1 2 3 5 4 1
2 1 1 2 4 1 1 8 2
3 2 5 3 6 2 15 2 3
4 3 5 4 8 4 10 1 4
5 5 5 5 10 5 12 3 5
6 2 4 6 12 1 10 9 6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
1 2 3 4 5 6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
tmp <- paste0("a", x, "." , x+1)
tmp1 <- paste0("a", y, "." , y+1)
rowMeans(test.dt[rown,get(tmp):get(tmp1), with=FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by=.(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1 100 0.67 4.786 0.67 0 NA NA
2 sol2 100 0.14 1.000 0.14 0 NA NA
3 sol3 100 0.15 1.071 0.16 0 NA NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
# remove start/end columns from 'm' matrix
ma <- m[,-(1:2)]
# column index for each row in 'ma' matrix
cm <- col(ma)
# logical index of whether we need the column for each row
i <- cm >= m[,1L] & cm <= m[,2L]
# multiply the input matrix by the index matrix and sum it
# divide by the sum of the index matrix to get the mean
rowSums(i*ma) / rowSums(i)
}})
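As a quick sanity check against the question's six-row example (assuming only the strt/end and a* columns are passed, in that order):
m <- as.matrix(test.df[, c("strt", "end", "a1.2", "a2.3", "a3.4", "a4.5", "a5.6")])
f_jmu(m)
# [1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667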
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
NumericVector out(nrow);
for (int i = 0; i < nrow; i++) {
double avg = 0;
int start = x(i,0);
int end = x(i,1);
int N = end - start + 1;
while(start<=end){
avg += x(i, start + 1);
start = start + 1;
}
out[i] = avg/N;
}
return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
expr min lq mean median uq max neval
f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951 4735.851 100
f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882 100
f4() 281.541 315.2680 364.2197 345.328 375.877 1089.994 100
MYrcpp(testM) 3.422 10.0205 16.7708 19.552 21.507 56.700 100
Where f2(), f3() and f4() are defined as
f2 <- function(){
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function(){
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i,as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
}
f4 <- function(){
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean)
}
That's roughly a 20x increase over the fastest.
Note, to implement the above code you'll need a C++ compiler which R can access. For Windows, look into Rtools. For more on Rcpp read this
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
expr min lq mean median uq max neval
f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49 100
MYrcpp(testM) 196.003 216.533 242.6732 235.107 261.0125 499.54 100
With 5e3 rows MYrcpp is now 550x faster. This is partially due to the fact that f4() does not scale well, as Richard discusses in the comments. f4() essentially invokes a nested for loop by calling an apply within a lapply. Interestingly, the C++ code also uses a nested loop, with a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be translated into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
nrow <- nrow(x)
ncol <- ncol(x)
out <- matrix(NA, nrow = 1, ncol = nrow)
for(i in 1:nrow){
avg <- 0
start <- x[i,1]
end <- x[i,2]
N <- end - start + 1
while(start<=end){
avg <- avg + x[i, start + 2]
start = start + 1
}
out[i] <- avg/N
}
out
}
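A quick check that MYr reproduces the earlier result, assuming testM is the matrix form of the question's original six-row test.df:
as.numeric(MYr(testM))
# [1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667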
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences
The first line of MYrcpp is different from the MYr. In words the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to declare the class of a variable when you introduce it, i.e. int nrow = x.nrow() defines a variable named nrow, of class int (i.e. integer), assigned the value x.nrow(), the number of rows of x. (Ignore this if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. As in Python, you call a method by attaching it to the instance; the R equivalents are S3 and S4 methods.)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R
++ is an operator that means increment by 1, i.e. j++ is the same as j + 1. += is an operator that means add to together and assign, i.e. a += b is the same as a = a + b
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean),
test.dt[, func.dt(rown, strt, end), by=.(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.

What is the easiest way to find the pairwise complete data for two variables?

Suppose you have two variables that both have some missing data, but these missing data may not overlap perfectly. What is the easiest way of finding the number of common datapoints with no missing values? Is there some built-in function?
One way is to make a function like the following:
pairwise.miss = function(x, y) {
#deal with input types
x = as.vector(x)
y = as.vector(y)
#make combined object
c = cbind(x, y)
#remove NA rows
c = c[complete.cases(c), ]
#return length
return(nrow(c))
}
Another idea is to use some function that returns the pairwise complete data. For instance, rcorr() from Hmisc does this, but may give errors for non-numeric data. So:
rcorr(x, y)$n[1,2]
Is there an easier way?
You can simply list the two variables in complete.cases() and sum() the output.
x <- c(1, 2, 3, NA, NA, NA, 5)
y <- c(1, NA, 3, NA, 3, 2, NA)
complete.cases(x, y)
#[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
sum(complete.cases(x, y))
#[1] 2
The sum of a logical vector is the number of TRUE elements since TRUE is coerced to 1 and FALSE to 0.
This works for any data type. However, note that empty strings, i.e. "", are not considered missing. An actual missing character value is denoted by NA_character_.
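A small illustration of that caveat (example values assumed):
a <- c("x", "", NA_character_)
b <- c("p", "q", "r")
complete.cases(a, b)       # TRUE TRUE FALSE -- the empty string still counts as complete
sum(complete.cases(a, b))
# [1] 2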
A possible solution is to use is.na and logical operators:
!(is.na(x) | is.na(y)) # logical vector
which(!(is.na(x) | is.na(y))) # integer vector of indices.
If you want only the total count, use:
sum(!(is.na(x) | is.na(y)))
I benchmarked the solutions given above:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(microbenchmark)
#fetch some data
x = iris[1] # from iris
y = iris[1]
x[sample(1:150, 50), ] = NA #random subset
y[sample(1:150, 50), ] = NA
#benchmark
times = microbenchmark(pairwise.function = pairwise.miss(x, y),
sum.is.na = sum(!is.na(x) & !is.na(y)),
sum.is.na2 = sum(!(is.na(x) | is.na(y))),
sum.complete.cases = sum(complete.cases(x, y)));times
Results:
> times
Unit: microseconds
expr min lq mean median uq max neval
pairwise.function 202.205 217.2935 244.31481 233.3150 253.8460 450.763 100
sum.is.na 75.594 78.5500 89.26383 80.5730 94.1035 248.558 100
sum.is.na2 74.662 77.6170 89.23899 80.5725 94.8825 167.676 100
sum.complete.cases 14.311 16.1770 18.77197 17.1105 17.7330 155.233 100
So my original method was horribly slow compared to the sum.complete.cases one.
Perhaps there is rarely a need for speed in this computation, but one might as well use the most efficient method when it is equally easy to use.

Should I have if statement or ifelse?

I am very new to using R and trying to get my head around different commands. I have this simple code:
setwd("C:/Research")
tempdata=read.csv("temperature_humidity.csv")
Thour=tempdata$t
RHhour=tempdata$RH
weather=data.frame(cbind(hour,Thour,RHhour))
head(weather)
if (Thour>25) {
y=0 else {
y=3
}
x=Thour+y*2
x
I simply want the code to read Thour (temperature) from the CSV file and, if it is higher than 25, use y=0 in the formula; if it is lower than 25, use y=3.
I tried ifelse but it doesn't work either.
Thanks for your help.
I've said this too many times today already, but avoid ifelse statements as much as possible (they are very inefficient and unnecessary in most cases); try this instead:
c(3, 0)[(Thour >= 25) + 1]
The comparison returns a logical vector of TRUE/FALSE, which is coerced to 0/1 when 1 is added, giving 1/2 -- the indexes into c(3, 0).
An even better solution (posted by @BondedDust in the comments) would be:
3*(Thour <= 25)
The comparison returns a logical vector of TRUE/FALSE, which is coerced to 0/1 when multiplied by 3.
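A tiny illustration of the coercion (example values assumed):
Thour <- c(10, 30)
Thour <= 25                  # TRUE FALSE
3 * (Thour <= 25)            # 3 0
c(3, 0)[(Thour >= 25) + 1]   # 3 0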
Benchmark comparison:
Thour <- sample(1:100000)
require(microbenchmark)
microbenchmark(ifel = {ifelse(Thour < 25 , 0 , 3)}, Bool = {3*(Thour >= 25)})
Unit: microseconds
expr min lq median uq max neval
ifel 38633.499 41643.768 41786.978 55153.050 59169.69 100
Bool 901.135 1848.091 1972.434 2010.841 20754.74 100
This should work for you. Just replace what you're naming Thour with the appropriate code.
Thour <- sample(1:100, 1)
Thour
# [1] 8
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 3
And:
Thour <- sample(1:100, 1)
Thour
# [1] 37
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 0
You may need to change the logical operator (>=) to match your exact circumstance, since it's unclear which end of the range, if either, you want to be inclusive.
R has a very flexible syntax. So you can write this in many ways:
# ifelse() function
y <- ifelse(Thour > 25, 0, 3)
# more ifelse()
y <- 3 * ifelse(Thour > 25, 0, 1)
# The simpler way:
y <- 3 * (Thour > 25)
By the way, use <- instead of = for assignment... it's the "preferred" style
