Problem with data.table ifelse behavior - r

I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=240min"),Result],
ifelse(grepl("hlm",inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=60"),Result],
dt[which(dt[,Time] == "t=30"),Result]))
t0Value <- dt[which(dt[,Time] == "t=0"),Result]
return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by the Result for every t0Value, but what I want is a single ratio for each unique by group.
Thanks for the help.

You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: It produces a result with the attributes and length from test and the values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm or maybe it is since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
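To make the point about lengths concrete, here is a minimal base R sketch. The vector x and the flag use_x are made-up names for illustration and do not come from the question. It shows that ifelse returns something shaped like the test, while if ... else returns whichever branch was chosen, whole.

```r
# Illustrative values only; not taken from the question's data
x <- c(10, 20, 30)
use_x <- TRUE                       # a single logical, like grepl() on one file name

from_ifelse <- ifelse(use_x, x, 0)  # result has the length of the test: 1
from_if     <- if (use_x) x else 0  # result is the whole chosen branch

length(from_ifelse)  # 1
length(from_if)      # 3
```

So with a length-one test such as grepl("hlm", inFile), ifelse can only ever hand back the first element of the chosen branch, which would match the single tmaxValue described in the question.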

Related

Bug in my for-loop to iterate over data frame

I am working on a data frame and have extracted one of the columns, which holds hour data from 0 to 23. I am adding one more column for the part of the day, based on the hour. I ran the for loop below but I am getting an error. Can somebody tell me what is wrong with the syntax and how to correct it?
for(i in data$Requesthours) {
if(data$Requesthours>=0 & data$Requesthours<3) {
data$Partoftheday <- "Midnight"
} else if(data$Requesthours>=3 & data$Requesthours<6) {
data$Partoftheday <- "Early Morning"
} else if(data$Requesthours>=6 & data$Requesthours<12) {
data$Partoftheday <- "Morning"
} else if(data$Requesthours>=12 & data$Requesthours<16) {
data$Partoftheday <- "Afternoon"
} else if(data$Requesthours>=16 & data$Requesthours<20) {
data$Partoftheday <- "Evening"
} else if(data$Requesthours>=20 & data$Requesthours<=23) {
data$Partoftheday <- "Night"
}
}
Still waiting for you to post your bug, but here's an R coding tip which will reduce this to a one-liner (and bypass your bug). Also it'll be way faster (it's vectorized, unlike your for-loop and if-else-ladder).
data$Partoftheday <- as.character(
cut(data$Requesthours,
breaks=c(-1,3,6,12,16,20,24),
labels=c('Midnight', 'Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night')
)
)
# see Notes on cut() at bottom to explain this
Now back to your bug: you're confused about how to iterate over a column in R. for(i in data$Requesthours) iterates over the values in the column, not over row indices, so you're confusing indices with data values. You also make i the loop variable but never refer to it anywhere inside the loop; instead you refer back to data$Requesthours, which is an entire column, not a single value. (How do the loop contents know which value you're referring to? They don't.) You could use an ugly explicit index loop like for (i in 1:nrow(data)) ... or for (i in seq_len(nrow(data))) ... and then access data$Requesthours[i], but please don't. Because...
One of the big idiomatic lessons in learning R is that whenever you write a for-loop to iterate over a dataframe or a df column, you should stop to think (or research) whether there is a vectorized function in R that does what you want. cut, ifelse, diff, cumsum, ... are all vectorized, as are all the arithmetic and logical operators; summary functions such as sum, mean, max and sd likewise take a whole vector in one call. (Plain if is not vectorized: it expects a single TRUE or FALSE.) 'Vectorized' means you can feed them an entire (column) vector as input, and they produce an entire (column) vector as output which you can directly assign to your new column. Very simple, very fast, very powerful. Generally beats the pants off for-loops. Please read R-intro.html, esp. Section 2 about vector assignment.
And if you can't find or write a vectorized fn, there's also the *apply family of functions apply, sapply, lapply, ... to apply any arbitrary function you want to a list/vector/dataframe/df column.
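As a rough illustration of the two options above, the sketch below labels a few hours first with vectorized operators and then with sapply. The hours vector and the helper label_one are hypothetical names invented for this example.

```r
# Made-up input for illustration
hours <- c(0, 3, 11, 23)

# Vectorized: the comparison operators work on the whole vector at once
is_daytime <- hours >= 6 & hours < 20

# Fallback: apply a scalar function (hypothetical helper) to each element
label_one <- function(h) if (h >= 6 && h < 20) "day" else "night"
labels <- sapply(hours, label_one)

labels  # "night" "night" "day" "night"
```

The vectorized form is usually the first thing to try; sapply is handy when the per-element logic is awkward to express with vector operations.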
Notes on cut()
cut(data, breaks, labels, ...) is a function where data is your input vector (e.g. your selected column data$Requesthours), breaks is a vector of integer or numeric cut points, and labels is a vector to name the output. The length of labels is one less than the length of breaks, since 7 breaks divide your data into 6 ranges.
We want the output vector to be string, not categorical, hence we apply as.character() to the output from cut()
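Since your first if-else comparison is (hr>=0 & hr<3), the lowest break is set to -1 rather than 0: cut() uses right-closed intervals (a, b] by default, so with a lowest break of 0, hr==0 would wrongly give NA. Be aware that the same default also puts hr==3 in 'Midnight', hr==6 in 'Early Morning', etc., which does not match your original comparisons. To reproduce them exactly, pass right=FALSE and use breaks=c(0,3,6,12,16,20,24), which gives left-closed intervals [a, b). (There is a parameter include.lowest=TRUE/FALSE, but it only affects the outermost boundary, so on its own it would fix hr==0 and nothing else.)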
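A small sketch of how the interval direction affects the boundaries, using a made-up hours vector rather than the asker's data. It uses right = FALSE so that each range is closed on the left, matching comparisons of the form hr >= a & hr < b.

```r
# Made-up hours that sit on and around the boundaries
hours <- c(0, 2, 3, 5, 6, 12, 16, 20, 23)

part_of_day <- as.character(
  cut(hours,
      breaks = c(0, 3, 6, 12, 16, 20, 24),
      labels = c("Midnight", "Early Morning", "Morning",
                 "Afternoon", "Evening", "Night"),
      right  = FALSE)   # intervals are [0,3), [3,6), ..., [20,24)
)

part_of_day
# "Midnight" "Midnight" "Early Morning" "Early Morning" "Morning"
# "Afternoon" "Evening" "Night" "Night"
```

With right = FALSE there is no need to move the lowest break to -1, and hours 3, 6, 12, 16 and 20 land in the later range, as in the original if-else ladder.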
if(data$Requesthours>=0 & data$Requesthours<3) (and other similar ifs) make no sense since data$Requesthours is a vector. You should try either of the following:
Solution 1:
for(i in seq_along(data$Requesthours)) {
if(data$Requesthours[i]>=0 & data$Requesthours[i]<3)
data$Partoftheday[i] <- "Midnight"
....
}
This solution is slow like hell and really ugly, but it would work.
Solution 2:
data$Partoftheday[data$Requesthours>=0 & data$Requesthours<3] <- "Midnight"
...
Solution 3 = what was proposed by smci

R row selection providing partial results

I'm having an issue, which I have found a solution for, but would like to understand what was going on in the original coding.
So I started with a table pulled from an SQL database and wanted information for 1 client, who is covered by 2 client numbers.
Originally I was running this to select those account numbers.
match <- c("C524",'5568')
gtc <- gtc[gtc$AccountNumber == match,]
However this was only returning about half of the desired results, and the results returned vary at different times (this was running as a weekly report), and depending on the PC running it.
Now, I've set up a loop which works fine and extracts all the results, but would really like to know what was going on with the original query.
match <- c("C524",'5568')
for (each in match) {
gtcLoop<- gtc[gtc$AccountNumber == each,]
result<-rbind(result,gtcLoop)
}
Also, long time lurker, first time poster so let me know if I've done anything wrong in this question.
You need to replace == by %in%:
gtc <- data.frame(AccountNumber = sample(c(match, "something"), 10, replace = TRUE))
gtc[gtc$AccountNumber %in% match,]
Just to tag onto Qaswed's answer (+1), you need to understand what is happening when you compute vector comparisons like ==. See:
?`==`
and
?`%in%`
then try something like 1 == c(1,2) and 1 %in% c(1,2).
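The reason you are getting about half the results is that == compares the two vectors element by element, recycling the shorter one: row 1 is compared with the first account number, row 2 with the second, row 3 with the first again, and so on. A row is kept only when it happens to line up with its own account number, which would also explain why the output changes when the row order returned by the database changes. For example: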
df <- data.frame(id=c(1:5), acct_cd = letters[1:5])
df[df$acct_cd == c("a","c"),] # this is wrong, for demo only
df[df$acct_cd %in% c("a","c"),] # this is correct
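To see the recycling directly, the following sketch compares the two logical masks on a small made-up vector (acct and wanted are illustrative names, not the asker's data).

```r
# Made-up account codes for illustration
acct   <- c("a", "b", "c", "a", "c", "b")
wanted <- c("a", "c")

# == recycles wanted to a, c, a, c, a, c and compares position by position
eq_mask <- acct == wanted

# %in% asks, for each element of acct, whether it appears anywhere in wanted
in_mask <- acct %in% wanted

sum(eq_mask)  # 1: only the first "a" lines up with "a"
sum(in_mask)  # 4: every "a" and "c" is found
```

Shuffling acct changes eq_mask but not the number of matches found by %in%, which fits the symptom of results that vary from run to run.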

Need an explanation for a particular R code snippet

The following is the code I need an explanation for:
for (i in id) {
data <- read.csv(files[i] )
c <- complete.cases(data)
naRm <- data[c, ]
completeCases <- rbind(completeCases, c(i, nrow(naRm)))
As I understand it, the variable c here stores multiple logical values. The line after that seems foreign to me. How does data[c, ] work?
FYI, I am an R newbie.
complete.cases looks for all rows that are "complete", i.e. have no missing values, and returns one TRUE or FALSE per row. data[c, ] then keeps only the rows where c is TRUE. Here is the man page. Thus the completeCases object will tell you the number of "complete" rows in each file you have just read. You really don't need to store the value of i in the rbind call though, as it is just the row number, so it is redundant. A vector would do just fine for this application.
Also looks like you are missing a close brackets or this isn't a complete chunk of code.
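For the data[c, ] part specifically, here is a minimal sketch of logical row indexing on a made-up data frame. The names df and keep are illustrative; keep plays the role of c in the question.

```r
# Made-up data with some missing values
df <- data.frame(x = c(1, NA, 3, 4),
                 y = c("a", "b", NA, "d"))

keep <- complete.cases(df)   # one logical per row: TRUE FALSE FALSE TRUE

# A logical vector in the row position keeps the rows where it is TRUE;
# the empty column position means "all columns"
complete_rows <- df[keep, ]

nrow(complete_rows)  # 2
```

Incidentally, naming the vector c works but shadows the base function c() for readers, so a name like keep may be easier to follow.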

Double "for loops" in a dataframe in R

I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I write for loops that apply the conditions I want? I know that there are similar questions, but I haven't understood them yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
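NA_outside_3s <- function(x) {
mean_x <- mean(x, na.rm=TRUE) # ignore missing values, as for sd
sd_x <- sd(x, na.rm=TRUE)
x_outside_3s <- abs(x - mean_x) > 3 * sd_x # TRUE where x is more than 3 sd from the mean
x[which(x_outside_3s)] <- NA # no need for ifelse here; which() skips NA comparisons
x
}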
Of course, you can choose any function name you want. More descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
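A self-contained sketch of the lapply approach on a toy data frame follows. The helper na_outside_3sd and the toy data are assumptions made for illustration (the asker's sample has no value beyond 3 standard deviations, so an obvious outlier is planted here).

```r
# Hypothetical helper: set values more than 3 sd from the mean to NA
na_outside_3sd <- function(x) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  is_outlier <- abs(x - centre) > 3 * spread
  x[which(is_outlier)] <- NA   # which() drops NA comparisons
  x
}

# Toy data: height has one planted outlier, age has one existing NA
toy <- data.frame(
  name   = letters[1:21],
  height = c(rep(150, 20), 1000),
  age    = c(rep(c(29, 31), 10), NA)
)

# Apply only to the chosen numeric columns, leaving name untouched
numeric_cols <- c("height", "age")
toy[numeric_cols] <- lapply(toy[numeric_cols], na_outside_3sd)

sum(is.na(toy$height))  # 1: the planted outlier
sum(is.na(toy$age))     # 1: the NA that was already there
```

Selecting columns by name keeps non-numeric columns such as name out of the calculation, which matters in a data set with thousands of mixed columns.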
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.

R data.table replacing an index of values from another data.table

Hi still trying to figure out data.table. If I have a data.table of values such as those below, what is the most efficient way to replace the values with those from another data.table?
set.seed(123456)
a=data.table(
date_id = rep(seq(as.Date('2013-01-01'),as.Date('2013-04-10'),'days'),5),
px =rnorm(500,mean=50,sd=5),
vol=rnorm(500,mean=500000,sd=150000),
id=rep(letters[1:5],each=100)
)
b=data.table(
date_id=rep(seq(as.Date('2013-01-01'),length.out=600,by='days'),5),
id=rep(letters[1:5],each=600),
px=NA_real_,
vol=NA_real_
)
setkeyv(a,c('date_id','id'))
setkeyv(b,c('date_id','id'))
What I'm trying to do is replace the px and vol in b with those in a where date_id and id match. I'm a little flummoxed by this. I would suppose that something along the following lines might be the way to go, but I don't think it will work in practice.
b[which(b$date_id %in% a$date_id & b$id %in% a$id),list(px:=a$px,vol:=a$vol)]
EDIT
I tried the following
t = a[b,roll=T]
t[!is.na(px),list(px.1:=px,vol.1=vol),by=list(date_id,id)]
and got the error message
Error in `:=`(px.1, px) :
:= is defined for use in j only, and (currently) only once; i.e., DT[i,col:=1L] and DT[,newcol:=sum(colB),by=colA] are ok, but not DT[i,col]:=1L, not DT[i]$col:=1L and not DT[,{newcol1:=1L;newcol2:=2L}]. Please see help(":="). Check is.data.table(DT) is TRUE.
If you are wanting to replace the values within b you can use the prefix i.. From the NEWS regarding version 1.7.10
The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.
b[a, `:=`(px = i.px, vol = i.vol)]
Doesn't sound like you need the roll from your description, and it seems like you want to do this instead when you get your error:
t[!is.na(px),`:=`(px.1=px,vol.1=vol),by=list(date_id,id)]
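Here is a small sketch of the update join with the i. prefix on made-up tables, assuming a data.table version that supports the on= argument (so no keys need to be set). The table contents are invented for illustration.

```r
library(data.table)

# Made-up source values
a <- data.table(
  id      = c("a", "b"),
  date_id = as.Date(c("2013-01-01", "2013-01-02")),
  px      = c(50, 51),
  vol     = c(100, 200)
)

# Target with placeholders; the third row has no match in a
b <- data.table(
  id      = c("a", "b", "c"),
  date_id = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03")),
  px      = NA_real_,
  vol     = NA_real_
)

# For rows of b that match a on both columns, copy a's values in by reference;
# i.px and i.vol refer to the columns of a (the i argument)
b[a, on = .(id, date_id), `:=`(px = i.px, vol = i.vol)]

b$px  # 50 51 NA
```

Rows of b without a match keep their NA placeholders, and b is modified in place, so there is no need to assign the result.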

Resources