nested ifelse in R so close to working - r

I'm working with the following four columns of raw weight measurement data and a very nearly-functioning nested ifelse statement that results in the 'kg' vector.
Id G4_R_2_4 G4_R_2_5 G4_R_2_5_option2 kg
219 13237 16.0 NA NA 16.0
220 139129 8.50 55.70 47.20 8.50
221 139215 28.9 NA NA 28.9
222 139216 NA 46.70 8.50 46.70
223 139264 12.40 NA NA 12.40
224 139281 13.60 NA NA 13.60
225 139366 16.10 NA NA 16.10
226 139376 61.80 NA NA 61.80
227 140103 NA 48.60 9.10 48.60
The goal is to merge the three 'G4' columns into kg based on the following conditions:
1) If G4_R_2_4 is not NA, print its value
2) If G4_R_2_4 is NA, print the lesser of the values appearing in G4_R_2_5 and G4_R_2_5_option2 (sorry for lame variable names)
I've been working with the following statement (big dataset called 'child'):
child$kg <- ifelse(child$G4_R_2_4 == 'NA' & child$G4_R_2_5 < child$G4_R_2_5_option2,
child$G4_R_2_5, ifelse(child$G4_R_2_4 == 'NA' & child$G4_R_2_5 > child$G4_R_2_5_option2,
child$G4_R_2_5_option2, child$G4_R_2_4))
Which results in the 'kg' vector I have now. It seems to satisfy the G4_R_2_4 condition (is/is not NA) but always prints the value from G4_R_2_5 for the NA cases. How do I get it to incorporate the greater than/less than condition?

It's not clear from your example, but I think the problem is that you're handling NAs incorrectly and/or using the wrong type for the data.frame's columns. Try rewriting your code like this:
#if your columns are of character type (warnings are ok)
child$G4_R_2_4<-as.numeric(child$G4_R_2_4)
child$G4_R_2_5<-as.numeric(child$G4_R_2_5)
child$G4_R_2_5_option2<-as.numeric(child$G4_R_2_5_option2)
#correct NA handling
child$kg<-ifelse(is.na(child$G4_R_2_4) & child$G4_R_2_5 <
child$G4_R_2_5_option2, child$G4_R_2_5, ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 > child$G4_R_2_5_option2, child$G4_R_2_5_option2, child$G4_R_2_4))

Here is an alternative version that might be interesting, assuming that the values are stored in numerical form (else the column entries should be converted into numerical values, as suggested in the other answers):
get_kg <- function(x){
if(!is.na(x[2])) return (x[2])
return (min(x[3], x[4], na.rm = T))}
child$kg <- apply(child,1,get_kg)
#> child
# Id G4_R_2_4 G4_R_2_5 G4_R_2_5_option2 kg
#219 13237 16.0 NA NA 16.0
#220 139129 8.5 55.7 47.2 8.5
#221 139215 28.9 NA NA 28.9
#222 139216 NA 46.7 8.5 8.5
#223 139264 12.4 NA NA 12.4
#224 139281 13.6 NA NA 13.6
#225 139366 16.1 NA NA 16.1
#226 139376 61.8 NA NA 61.8
#227 140103 NA 48.6 9.1 9.1

We could do this using pmin. Assuming that your 'G4' columns are of class 'character', we convert those columns to 'numeric' and use pmin on them.
indx <- grep('^G4', names(child))
child[indx] <- lapply(child[indx], as.numeric)
d1 <- child[indx]
child$kgN <- ifelse(is.na(d1[,1]), do.call(pmin, c(d1[-1], na.rm=TRUE)), d1[,1])
child$kgN
#[1] 16.0 8.5 28.9 8.5 12.4 13.6 16.1 61.8 9.1
Or without using ifelse
cbind(d1[,1], do.call(pmin, c(d1[-1], na.rm=TRUE)))[cbind(1:nrow(d1),
(is.na(d1[,1]))+1L)]
#[1] 16.0 8.5 28.9 8.5 12.4 13.6 16.1 61.8 9.1
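The matrix-indexing trick in the version without ifelse can be hard to read at first. Here is a minimal sketch on a made-up three-row data frame (the values are picked from the question's table; the names `fallback`, `candidates` and `pick` are introduced only for this illustration). The idea is to build a two-column matrix of candidate values, then index it with a (row, column) matrix that selects column 2 whenever the first measurement is NA.

```r
# Three rows taken from the question's table, for illustration only
d1 <- data.frame(
  G4_R_2_4         = c(16.0, 8.5, NA),
  G4_R_2_5         = c(NA, 55.7, 46.7),
  G4_R_2_5_option2 = c(NA, 47.2, 8.5)
)

# Row-wise minimum of the two fallback columns, ignoring NAs
fallback <- pmin(d1$G4_R_2_5, d1$G4_R_2_5_option2, na.rm = TRUE)

# Column 1: primary measurement, column 2: fallback
candidates <- cbind(d1$G4_R_2_4, fallback)

# One (row, column) pair per row: column 2 when the primary value is NA
pick <- cbind(seq_len(nrow(d1)), is.na(d1$G4_R_2_4) + 1L)

kg <- candidates[pick]
print(kg)
```

Indexing a matrix with a two-column matrix returns one element per index row, which is why no ifelse is needed.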
Benchmarks
set.seed(24)
child1 <- as.data.frame(matrix(sample(c(NA,0:50), 1e6*3, replace=TRUE),
ncol=3, dimnames=list(NULL, c('G4_R_2_4', 'G4_R_2_5',
'G4_R_2_5_option2'))) )
cyberj0g <- function(){
with(child1, ifelse(is.na(G4_R_2_4) & G4_R_2_5 <
G4_R_2_5_option2, G4_R_2_5, ifelse(is.na(G4_R_2_4) &
G4_R_2_5 > G4_R_2_5_option2, G4_R_2_5_option2, G4_R_2_4)))
}
get_kg <- function(x){
if(!is.na(x[2])) return (x[2])
return (min(x[3], x[4], na.rm = T))}
RHertel <- function() apply(child1,1,get_kg)
akrun <- function(){cbind(child1[,1], do.call(pmin, c(child1[-1],
na.rm=TRUE)))[cbind(1:nrow(child1), (is.na(child1[,1]))+1L)]}
system.time(cyberj0g())
# user system elapsed
# 0.451 0.000 0.388
system.time(RHertel())
# user system elapsed
# 11.808 0.000 10.928
system.time(akrun())
# user system elapsed
# 0.000 0.000 0.084
library(microbenchmark)
microbenchmark(cyberj0g(), akrun(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
# cyberj0g() 3.750391 4.137777 3.538063 4.091793 2.895156 3.197511 20 b
# akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a

I'm pretty sure the problem is that you're not testing whether the values are NAs, you're testing whether they are equal to the string "NA", which they never are. This should work:
child$kg <- ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 < child$G4_R_2_5_option2,
child$G4_R_2_5,
ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 > child$G4_R_2_5_option2,
child$G4_R_2_5_option2,
child$G4_R_2_4))
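As a minimal illustration of why the original comparison never matches (the vector `x` is made up for this example and is not taken from the `child` data): comparing against the string "NA" yields NA for a missing element rather than TRUE, whereas is.na() returns a usable logical.

```r
# Hypothetical three-element vector with one missing value
x <- c(16.0, NA, 12.4)

# Comparing with the string "NA" never returns TRUE
cmp <- x == "NA"

# is.na() flags the missing element directly
miss <- is.na(x)

print(cmp)
print(miss)
```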

Related

How to group by and fill NA with closest not NA in R dataframe column with condition on another column

I have a data frame of blood test markers results and I want to fill in the NA's by the following criteria:
For each group of ID (TIME is in ascending order) if the marker value is NA then fill it with the closest not NA value in this group (may be past or future) but only if the time difference is less than 14.
This is an example of my data:
df<-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA NA
2 43 2.33 22.34 NA NA NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
ID is the patient.
The TIME is the time of the blood test.
The others are the markers.
The only way I could do it is with loops which I try to avoid as much as possible.
I expect the output to be:
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA 13.21
2 43 2.33 22.34 30.31 5.72 NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
CA.19.9 and CA.125 are filled with the previous value (10 days before).
NSE is filled with the previous value (11 days before).
CA.72.4 is not filled, since its only value (1.32, at TIME -48) is 49 days from the next measurement.
I bet there is a much simpler, vectorized solution but the following works.
fill_NA <- function(DF){
sp <- split(DF, DF$ID) # use the function argument, not the global df
sp <- lapply(sp, function(DF){
d <- diff(DF$TIME)
i_diff <- c(FALSE, d < 14)
res <- sapply(DF[-(1:2)], function(X){
inx <- i_diff & is.na(X)
if(any(inx)){
inx <- which(inx)
last_change <- -1
for(i in inx){
if(i > last_change + 1){
if(i == 1){
X[i] <- X[i + 1]
}else{
X[i] <- X[i - 1]
}
last_change <- i
}
}
}
X
})
cbind(DF[1:2], res)
})
res <- do.call(rbind, sp)
row.names(res) <- NULL
res
}
fill_NA(df)
# ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
#1 2 1 1.32 14.62 33.98 6.18 NA NA
#2 2 22 1.42 14.59 27.56 7.11 NA 13.21
#3 2 33 1.81 16.80 30.31 5.72 NA 13.21
#4 2 43 2.33 22.34 30.31 5.72 NA NA
#5 2 85 2.23 36.33 39.57 7.38 NA NA
#6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA
#7 4 1 23.34 94.09 956.50 118.20 NA NA
#8 4 30 18.23 121.50 825.30 98.26 NA NA
Yes, you can have a vectorized solution. First, consider the case in which you impute using only the future value. You need to create a few auxiliary variables:
a variable that tells you whether the next observation belongs to the same ID (so it can be used to impute),
a variable that tells you whether the next observation is less than 14 days apart from the current one.
These do not depend on the specific variable you want to impute. For each variable to be imputed you will also need a variable that tells you whether the next value is missing.
Then you can vectorize the following logic: when the next observation has the same ID, is less than 14 days from the current one, and is not missing, copy its value into the current one.
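The forward-only case just described can be sketched as follows. This is only an illustration on a made-up four-row data frame with a single marker column; the helper names `next_same_id`, `next_close`, `next_value` and `fill` are introduced here and do not appear in the full code.

```r
# Made-up data: two patients, one marker (NSE)
x <- data.frame(ID   = c(2, 2, 2, 4),
                TIME = c(1, 10, 40, 45),
                NSE  = c(NA, 13.21, NA, 7))
n <- nrow(x)

# Does the next row belong to the same ID? (last row has no next row)
next_same_id <- c(x$ID[-1] == x$ID[-n], FALSE)

# Is the next row less than 14 days away?
next_close <- c(diff(x$TIME) < 14, FALSE)

# Value of the marker in the next row
next_value <- c(x$NSE[-1], NA)

# Fill only where the current value is missing and all three conditions hold
fill <- is.na(x$NSE) & next_same_id & next_close & !is.na(next_value)
x$NSE[fill] <- next_value[fill]
print(x)
```

Row 1 is filled from row 2 (same ID, 9 days apart), while row 3 stays NA because the next row belongs to a different ID.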
Things get more complicated when you need to decide whether to use the past or future value, but the logic is the same. The code is below; it is a bit long and could be simplified, but I wanted to be clear about what it does.
Hope this helps
x <-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
### these are the columns we want to impute
cols.to.impute <- colnames(x)[! colnames(x) %in% c("ID","TIME")]
### is the next id the same?
x$diffidf <- NA
x$diffidf[1:(nrow(x)-1)] <- diff(x$ID)
x$diffidf[x$diffidf > 0] <- NA
### is the previous id the same?
x$diffidb <- NA
x$diffidb[2:nrow(x)] <- diff(x$ID)
x$diffidb[x$diffidb > 0] <- NA
### diff in time with next observation
x$difftimef <- NA
x$difftimef[1:(nrow(x)-1)] <- diff(x$TIME)
### diff in time with previous observation
x$difftimeb <- NA
x$difftimeb[2:nrow(x)] <- diff(x$TIME)
### if next (previous) id is not the same time difference is not meaningful
x$difftimef[is.na(x$diffidf)] <- NA
x$difftimeb[is.na(x$diffidb)] <- NA
### we do not need diffid anymore (due to previous statement)
x$diffidf <- x$diffidb <- NULL
### if next (previous) point in time is more than 14 days it is not useful for imputation
x$difftimef[abs(x$difftimef) > 14] <- NA
x$difftimeb[abs(x$difftimeb) > 14] <- NA
### create variable usef that tells us whether we should attempt to use the forward observation for imputation
### it is 1 only if difftime forward is less than difftime backward
x$usef <- NA
x$usef[!is.na(x$difftimef) & x$difftimef < x$difftimeb] <- 1
x$usef[!is.na(x$difftimef) & is.na(x$difftimeb)] <- 1
x$usef[is.na(x$difftimef) & !is.na(x$difftimeb)] <- 0
if (!is.na(x$usef[nrow(x)]))
stop("\nlast observation usef is not missing\n")
### now we get into column specific operations.
for (col in cols.to.impute){
### we will store the results in x$imputed, and copy into x[,col] at the end
x$imputed <- x[,col]
### x$usef needs to be modified depending on the specific column, so we define a local version of it
x$usef.local <- x$usef
### if a variable is not missing no point in looking at usef.local, so we make it missing
x$usef.local[!is.na(x[,col])] <- NA
### when usef.local is 1 but the next observation is missing it cannot be used for imputation, so we
### make it 0. but a value of 0 does not mean we can use the previous observation because that may
### be missing too. so first we make usef 0 and next we check the previous observation and if that
### is missing too we make usef missing
x$previous.value <- c(NA,x[1:(nrow(x)-1),col])
x$next.value <- c(x[2:nrow(x),col],NA)
x$next.missing <- is.na(x$next.value)
x$previous.missing <- is.na(x$previous.value)
x$usef.local[x$next.missing & x$usef.local == 1] <- 0
x$usef.local[x$previous.missing & x$usef.local == 0] <- NA
### now we can impute properly: use next value when usef.local is 1 and previous value when usef.local is 0
tmp <- rep(FALSE,nrow(x))
tmp[x$usef.local == 1] <- TRUE
x$imputed[tmp] <- x$next.value[tmp]
tmp <- rep(FALSE,nrow(x))
tmp[x$usef.local == 0] <- TRUE
x$imputed[tmp] <- x$previous.value[tmp]
### copy to column
x[,col] <- x$imputed
}
### get rid of useless temporary stuff
x$previous.value <- x$previous.missing <- x$next.value <- x$next.missing <- x$imputed <- x$usef.local <- NULL
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE difftimef difftimeb usef
1 2 1 1.32 14.62 33.98 6.18 NA NA NA NA NA
2 2 22 1.42 14.59 27.56 7.11 NA 13.21 11 NA 1
3 2 33 1.81 16.80 30.31 5.72 NA 13.21 10 11 1
4 2 43 2.33 22.34 30.31 5.72 NA NA NA 10 0
5 2 85 2.23 36.33 39.57 7.38 NA NA NA NA NA
6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA NA NA NA
7 4 1 23.34 94.09 956.50 118.20 NA NA NA NA NA
8 4 30 18.23 121.50 825.30 98.26 NA NA NA NA NA

R generate bins from a data frame respecting blanks

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # is start at minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where am I failing and how to solve it?
v<-ceiling(max(dat$V1,na.rm=T))
breaks<-seq(8,v,2)
labels=seq(8.5,length.out=length(breaks)-1,by=2)
transform(dat,Tbins=cut(V1,breaks,labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 will lie between 18 and 20 thus given the value 18.5, similar to 10.88 being between 10-11.99 thus assigned the value 10.5
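For comparison, the same labels can be computed arithmetically, without cut. This is a sketch under assumptions: the vector `air_t` is a made-up subset of the question's values, the 8-degree lower limit is taken from the answer above, and the intervals are treated as closed on the left (8 <= x < 10), which differs from the default of cut, where intervals are closed on the right unless right = FALSE is passed.

```r
# Made-up subset of the question's temperatures, including a blank (NA)
air_t <- c(8.16, 10.88, 5.28, NA, 19.82)

# Lower edge of the 2-degree bin each value falls into
lower <- floor(air_t / 2) * 2

# Label is lower edge + 0.5; values below 8 degrees get NA, NAs stay NA
tbins <- ifelse(lower < 8, NA, lower + 0.5)

print(data.frame(air_t, tbins))
```

Because the result is a plain numeric vector of the same length as the input, blank cells are preserved in place.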

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the preceding values based on the "rate". For instance, in row 1, the first NA would be replaced by "27-(0.5*3)", the second by "27-(0.5*2)" and the third by "27-(0.5*1)". This is the loop I came up with. I know the first part (the outer loop) works but the inner one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
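To make the arithmetic in fillIn concrete, here is a trace for row l1 alone. It is a sketch, not a replacement for the function; it uses rev(seq_len(n)) instead of seq(n, 1) as a precaution, because seq(0, 1) would return c(0, 1) for a row with no NAs, whereas rev(seq_len(0)) is empty.

```r
# Row l1 from the question, as a named vector
x <- c(dbh1 = NA, dbh2 = NA, dbh3 = NA, dbh4 = 27, dbh5 = 31, rate = 0.5)

n <- sum(is.na(x))          # number of leading NAs (3 here)
steps <- rev(seq_len(n))    # 3, 2, 1: how many steps back from the first value

# First non-NA value minus rate times the number of steps back
filled <- unname(x[n + 1] - x["rate"] * steps)
print(filled)
```

This gives 27 - 0.5*3, 27 - 0.5*2 and 27 - 0.5*1, matching the first three entries of row l1 in the output above.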
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need to use multiple for loops for this. Here is some simplified code to do what you want just for the for loop. Working explicitly with your data we need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data,1, function(x) sum(is.na(x)) )
# make it a matrix with a 0's appended on
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data)-1))),
nrow = nrow(data), ncol = ncol(data)-1)
# fill in the rest of the matrix
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] -1
}
Once we have that, we can use this code to back-fill the NA values in the way you are interested in.
for(i in (ncol(data)-1):1){
if(sum(is.na(data[,i]))>0){
to_fill <- which(is.na(data[,i])==TRUE)
data[to_fill,i] <- for_estimate[to_fill] - data$rate[to_fill]*n_na[to_fill,i]
}
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9

substitute into `j` element in data.table[, j , by]

I have:
DT = data.table(ID=rep(1:2,each = 2), Index=rep(1:2,times = 2), Close=3:6, Open=7:10)
My algorithm has earlier determined that the DT holds the time information in the column with name Index, hence the algorithm stores the following mapping:
time.col <- "Index"
Now the algorithm wants to perform a calculation that would be equivalent to:
DT[, list(Index, Value=cumsum(Close)),by=ID]
ID Index Value
1: 1 1 3
2: 1 2 7
3: 2 1 5
4: 2 2 11
How to rewrite the line and plug the time.col variable in?
Neither of the following works:
DT[, list(time.col, Value=cumsum(Close)),by=ID]
DT[, list(substitute(time.col), Value=cumsum(Close)),by=ID]
You can create an expression for all of j in DT:
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=ID]
EDIT
Or, if you store "Index" as a name, you can evaluate time.col within the environment of .SD:
time.col <- as.name("Index")
DT[,list(eval(time.col,envir=.SD), Value=cumsum(Close)),by=ID]
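The difference between passing the column name as a string and as a name can be seen in base R, without data.table. In this sketch a plain data frame called `block` (a hypothetical stand-in for .SD) plays the role of the evaluation environment.

```r
# Hypothetical stand-in for .SD: a data frame holding one group's columns
block <- data.frame(Index = 1:2, Close = 3:4)
time.col <- "Index"

# Evaluating the string just returns the string itself
as_string <- eval(time.col, envir = block)

# Evaluating the name looks up the column inside the data frame
as_symbol <- eval(as.name(time.col), envir = block)

print(as_string)
print(as_symbol)
```

This is why list(time.col, ...) puts the literal text "Index" in the result, while the as.name version returns the column values.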
Very similar question here: In R data.table, how do I pass variable parameters to an expression?
Also, this question helps to understand the mystery of non-standard evaluation in data.table:
eval and quote in data.table
It turns out that the fastest solution from the above-mentioned evals is
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=ID]
However, := solution is even faster. See also Arun's note regarding copying.
Dataset
dim(DT); object.size(DT); DT
[1] 1354402 8
81291568 bytes
Instrument Date Open High Low Close Volume Adjusted Close
1: GOOG/AMEX_ABI 1981-03-11 NA NA 6.56 6.75 217200 NA
2: GOOG/AMEX_ABI 1981-03-12 NA NA 6.66 6.88 616400 NA
3: GOOG/AMEX_ABI 1981-03-13 NA NA 6.81 6.84 462000 NA
4: GOOG/AMEX_ABI 1981-03-16 NA NA 6.81 7.00 306400 NA
5: GOOG/AMEX_ABI 1981-03-17 NA NA 6.88 6.88 925600 NA
---
1354398: YAHOO/TSX_AMM_TO 2014-04-24 1.56 1.58 1.56 1.58 2700 1.58
1354399: YAHOO/TSX_AMM_TO 2014-04-25 1.60 1.62 1.59 1.62 11000 1.62
1354400: YAHOO/TSX_AMM_TO 2014-04-28 1.59 1.61 1.54 1.54 7200 1.54
1354401: YAHOO/TSX_AMM_TO 2014-04-29 1.58 1.60 1.58 1.59 500 1.59
1354402: YAHOO/TSX_AMM_TO 2014-04-30 1.55 1.55 1.50 1.52 36800 1.52
Benchmarking
time.col <- "Date"
fun <- function(){
out <- DT[, list(get(time.col), Value=cumsum(Close)),by=Instrument]
setnames(out, "V1", time.col)
}
fun2 <- function() {
DT[, Value := cumsum(Close), by=Instrument]
out <- DT[ , c("Instrument", ..time.col, "Value")]
DT[, Value:=NULL] # cleanup
out
}
fun2. <- function() {
DT[, Value := cumsum(Close), by=Instrument]
# out <- DT[,c("Instrument", ..time.col, "Value")]
# DT[, Value:=NULL] # cleanup
# out
}
fun3 <- function() {
DT[,list( eval(as.name(time.col),envir=.SD), Value=cumsum(Close)),by=Instrument]
}
fun4 <- function() {
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=Instrument]
}
Result
library(rbenchmark)
benchmark(fun(),
fun2(),
fun3(),
fun4(),
replications=200)
test replications elapsed relative user.self sys.self user.child sys.child
1 fun() 200 5.40 1.327 5.29 0.11 NA NA
2 fun2() 200 5.18 1.273 4.72 0.45 NA NA
3 fun2.() 200 2.70 1.000 2.70 0.00 NA NA
3 fun3() 200 4.12 1.012 3.90 0.22 NA NA
4 fun4() 200 4.07 1.000 3.91 0.16 NA NA

Remove column values with NA in R

I have a data frame, called gen, which is represented below
A B C D E
1 NA 4.35 35.3 3.36 4.87
2 45.2 .463 34.3 NA 34.4
3 NA 34.5 35.6 .457 46.3
I would like to remove the columns where there are NA's. (I know na.omit does it for rows, but I can't seem to find one for columns). The final result would read:
B C E
1 4.35 35.3 4.87
2 .463 34.3 34.4
3 34.5 35.6 46.3
Thanks!
gen <- gen[sapply(gen, function(x) all(!is.na(x)))]
dfrm[ , sapply(dfrm, function(x){ !any(is.na(x)) } ) ]
You might want to use instead this variant:
dfrm[ , sapply(dfrm, function(x){ all(is.finite(x)) } ) ]
If you have Inf or -Inf values in a vector they are not removed or identified with selection based on is.na.
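A quick check on a made-up vector shows the difference between the two tests:

```r
# Made-up vector mixing a regular value, infinities and a missing value
x <- c(1, Inf, NA, -Inf)

na_flag     <- is.na(x)      # only the NA is flagged
finite_flag <- is.finite(x)  # only the regular value passes

print(na_flag)
print(finite_flag)
```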
Just use this:
gen[colSums(is.na(gen)) == 0]
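A small sketch of how this one-liner works, rebuilding the question's `gen` data frame by hand (the intermediate names `na_counts` and `kept` are introduced only for the illustration): is.na(gen) gives a logical matrix, colSums counts the TRUE values per column, and only columns with a count of zero are kept.

```r
# The question's data frame, typed in by hand
gen <- data.frame(A = c(NA, 45.2, NA),
                  B = c(4.35, 0.463, 34.5),
                  C = c(35.3, 34.3, 35.6),
                  D = c(3.36, NA, 0.457),
                  E = c(4.87, 34.4, 46.3))

# Number of NAs in each column
na_counts <- colSums(is.na(gen))

# Keep only the columns without any NA
kept <- gen[na_counts == 0]
print(kept)
```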
