Efficiently combining multiple conditions on an R data.frame

I'm working with data frames in R and often apply many conditions to them, mainly with the & and | operators. I do that like so:
df = data.frame("col1"=c("success", "failure", "success"), "col2"=c("success", "success", "failure"), "col3"=c(1,1,100))
#multiple conditions, option 1
df[(df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1), ]
#multiple conditions, option 2
df[which((df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1)),]
However, my expressions tend to get really long and hard to read that way.
Is there a better, more readable way of doing it?
EDIT: Preferably, I'd like to work within the base R environment without external libraries.
I put this together based on other posts here on SO about not using subset and about using | correctly, but didn't find anything addressing this specific issue.
I hope this is not too opinion-based, otherwise I'll retract my question. Thanks!

One option is to use the filter() function in the dplyr package:
library(dplyr)
filter(df, col1=="success" & col2=="success" & col3==1)
You can also use commas (equivalent to &) to separate multiple arguments:
filter(df, col1=="success", col2=="success", col3==1)
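If you prefer to stay in base R (per the edit), here is a minimal sketch of the same filter using with(), so the data frame name is only written once (assuming the df defined in the question):
keep <- with(df, col1 == "success" & col2 == "success" & col3 == 1)
df[keep, ]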

Try this (perhaps not less clumsy, but with the same &):
> df[df$col1=="success" & df$col2=="success" & df$col3==1,]
col1 col2 col3
1 success success 1

Related

Filter dataframes where two columns are equal to the same value

Is there a way of writing
filter(dataDF, column1 == 'myvalue' & column2 == 'myvalue')
without having to write out myvalue twice?
You can use dplyr::filter_at
filter_at(dataDF, c("column1", "column2"), all_vars(. == 'myvalue'))
From your comment:
hmm, better than what I had. Reasons I want to do it: 1. so that if I need to go and edit 'myvalue' at a later date I only need to change it in one place, and 2. to make the code as efficient and short as possible. Your solution solves number 1 but not number 2.
You can put 'myvalue' in a variable and use it. That way you only have to update/change it in one place.
valueToCheck='myvalue'
filter(dataDF, column1 == valueToCheck & column2 == valueToCheck)
A somewhat clunky base-R solution:
df[do.call(`&`, lapply(df[, c('column1', 'column2')], `==`, myvalue)), ]
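If the same check ever needs to cover more than two columns, Reduce() generalizes the idea, since the `&` operator in do.call() only combines two vectors at a time. A sketch, reusing the df, myvalue, and column names assumed above:
cols <- c("column1", "column2")                            # columns that must all equal myvalue
keep <- Reduce(`&`, lapply(df[cols], `==`, myvalue))       # AND the per-column comparisons together
df[keep, ]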

ifelse in R: alternatives to nesting

I have written a series of ifelse statements that create a new column in a data frame. Example:
rf$rs.X<- ifelse(rf$chl==0 & rf$dm=="y" & rf$gdr=="m" & rf$smk=="n" & rf$age>=70 & rf$sbp>=160, ">=40%", NA)
When I run additional ifelse statements, they overwrite the previous one, because I map it to the same new column in the data frame (i.e. all my ifelse statements begin with the same rf$rs.X).
I know that for each statement I could specify a new column to be created, but I have 68 ifelse statements and I do not want a new column for each: I want a single new column.
To work around this I tried nesting all 68 ifelse statements but it doesn't seem to work (when I try to run it I get a blue plus sign (+)). After reading on here, it seems there is a limit to nesting (n=50).
So, for a series of ifelse statements, how can I get all the outputs into the same column of a data frame, without each ifelse statement overwriting the previous one?
Thanks!
I would have written it like this:
rf$rs.X <- with(rf, ifelse(chl==0 & dm=="y" & gdr=="m" &
                           smk=="n" & age>=70 & sbp>=160, ">=40%", NA))
Then the next one (say, for the low-sbp cases) could use the existing rs.X value as the alternative:
rf$rs.X <- with(rf, ifelse(chl==0 & dm=="y" & gdr=="m" &
                           smk=="n" & age>=70 & sbp<160, "30-39%", rs.X))
So that way the value is not overwritten for the non-qualifying rows.
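If chaining dozens of ifelse calls still feels unwieldy, a common base-R alternative is to create the column once and then fill it by indexed assignment, one condition per line, so each condition only touches its own rows. A sketch reusing the conditions from the question:
rf$rs.X <- NA_character_   # start with all NA
rf$rs.X[which(with(rf, chl==0 & dm=="y" & gdr=="m" & smk=="n" & age>=70 & sbp>=160))] <- ">=40%"
rf$rs.X[which(with(rf, chl==0 & dm=="y" & gdr=="m" & smk=="n" & age>=70 & sbp<160))] <- "30-39%"
which() is used so that NAs in the conditions are simply skipped rather than causing an error in the assignment.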

Data cleaning using subset with 2 conditions on same variable

I am a newbie to R,
I have a dataset ITEproduction_2014.2015 and I only want to see data points between 4 and 39 days. Currently I use 2 separate lines to create a subset.
Can I do this in one line? Something like Date.Difference > 3 and < 40?
ITEproduction_2014.2015 <- subset(ITEproduction_2014.2015,Date.Difference>3)
ITEproduction_2014.2015 <- subset(ITEproduction_2014.2015,Date.Difference<40)
thanks in advance,
Dirk
A little googling would have solved your problem; for example, read up on R's logical operators.
like this?
ITEproduction_2014.2015<-subset(ITEproduction_2014.2015,Date.Difference>3 & Date.Difference<40)
Avoid using subset altogether if you can. See the warning in the help file:
?subset()
If you like the syntax of subset(), and prefer it to standard subsetting functions like [, you can use dplyr:
library(dplyr)
ITEproduction_2014.2015 %>%
dplyr::filter(
Date.Difference > 3,
Date.Difference < 40
)
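For completeness, the same filter with standard [ indexing avoids both subset() and the dplyr dependency (a sketch assuming the same data frame and column as above):
keep <- with(ITEproduction_2014.2015, Date.Difference > 3 & Date.Difference < 40)
ITEproduction_2014.2015 <- ITEproduction_2014.2015[which(keep), ]   # which() drops NA rows, like subset() does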

Avoiding redundancy when selecting rows in a data frame

My code is littered with statements of the following flavor:
selected <- long_data_frame_name[long_data_frame_name$col1 == "condition1" &
                                 long_data_frame_name$col2 == "condition2" &
                                 !is.na(long_data_frame_name$col3),
                                 selected_columns]
The repetition of the data frame name is tedious and error-prone. Is there a way to avoid it?
You can use with
For instance
sel.ID <- with(long_data_frame_name, col1==2 & col2<0.5 & col3>0.2)
selected <- long_data_frame_name[sel.ID, selected_columns]
Several ways come to mind.
If you think about it, you are subsetting your data. Hence use the subset function (base package):
your_subset <- subset(long_data_frame_name,
                      col1 == "cond1" & col2 == "cond2" & !is.na(col3),
                      select = selected_columns)
This is, in my opinion, the most self-explanatory code for the task.
Use data tables.
library(data.table)
long_data_table_name = data.table(long_data_frame_name, key="col1,col2,col3")
selected <- long_data_table_name[col1 == "condition1" &
col2 == "condition2" &
!is.na(col3),
list(col4,col5,col6,col7)]
You don't have to set the key in the data.table(...) call, but if you have a large dataset, this will be much faster. Either way it will be much faster than using data frames. Finally, using J(...), as below, does require a keyed data.table, but is even faster.
selected <- long_data_table_name[J("condition1","condition2",NA),
list(col4,col5,col6,col7)]
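In case the J(...) syntax is unfamiliar, here is a minimal toy sketch of a keyed lookup (made-up data and column names, two key columns):
library(data.table)
dt <- data.table(col1 = c("a", "a", "b"),
                 col2 = c("x", "y", "x"),
                 val  = 1:3)
setkey(dt, col1, col2)   # key on the columns you will match against
dt[J("a", "x")]          # rows where col1 == "a" and col2 == "x"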
You have several possibilities:
attach, which adds the variables of the data.frame to the search path just below the global environment. Very useful for code demonstrations, but I warn you not to do that programmatically.
with, which temporarily creates a whole new environment.
In very limited cases you may want to use other options such as within.
df = data.frame(random=runif(100))
df1 = with(df,log(random))
df2 = within(df,logRandom <- log(random))
with will just evaluate your expression, while within will examine the created environment after evaluation and add the modifications to the data. Check the help page of with to see more examples.
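To make the difference concrete, here is a quick look at what the two calls above return (a sketch using the df1 and df2 just defined):
str(df1)   # num [1:100] ... -> with() returns just the result of the expression, a numeric vector
str(df2)   # 'data.frame': 100 obs. of 2 variables -> within() returns the data frame plus logRandom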

Problem with data.table ifelse behavior

I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=240min"),Result],
ifelse(grepl("hlm",inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=60"),Result],
dt[which(dt[,Time] == "t=30"),Result]))
t0Value <- dt[which(dt[,Time] == "t=0"),Result]
return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by group.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: It produces a result with the attributes and length from test and the values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm, or maybe it is what you wanted, since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
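A tiny illustration of why ifelse surprises people here (plain R, nothing data.table-specific; the result's shape follows the test, not the chosen branch):
ifelse(TRUE, 1:5, 0)    # returns 1, not 1:5 -- length is taken from the one-element test
if (TRUE) 1:5 else 0    # returns 1:5 -- if/else returns the chosen branch unchanged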
