I'm working with data frames in R and often apply many conditions to them, mainly with the & and | operators. I'm doing that like so:
df = data.frame("col1"=c("success", "failure", "success"), "col2"=c("success", "success", "failure"), "col3"=c(1,1,100))
#multiple conditions, option 1
df[(df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1), ]
#multiple conditions, option 2
df[which((df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1)),]
However, my expressions tend to get really long and hard to read that way.
Is there a better, more readable way of doing it?
EDIT: Preferably, I'd like to work within the base R environment without external libraries.
I put this together based on other posts here on SO about not using subset and about using | correctly, but I didn't find anything addressing this specific issue.
I hope this is not too opinion-based, otherwise I'll retract my question. Thanks!
One option is to use the filter() function in the dplyr package:
library(dplyr)
filter(df, col1=="success" & col2=="success" & col3==1)
You can also use commas (equivalent to &) to separate multiple arguments:
filter(df, col1=="success", col2=="success", col3==1)
Try this (perhaps less clumsy, though it still uses &):
> df[df$col1=="success" & df$col2=="success" & df$col3==1,]
col1 col2 col3
1 success success 1
Related
Is there a way of writing
filter(dataDF, column1 == 'myvalue' & column2 == 'myvalue')
without having to write out myvalue twice?
You can use dplyr::filter_at:
filter_at(dataDF, c("column1", "column2"), all_vars(. == 'myvalue'))
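(Note: in dplyr 1.0.0 and later, filter_at() is superseded; a sketch of the same filter using if_all(), with the same hypothetical data and column names as above:)
filter(dataDF, if_all(c(column1, column2), ~ .x == 'myvalue'))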
Regarding your comment:
hmm, better than what I had. Reasons I want to do it are 1. such that if I need to go and edit 'myvalue' at a later date I only need to change it in one place, and 2. to make the code as efficient and short as possible. Your solution solves number 1 but not number 2.
You can put 'myvalue' in a variable and use it. That way you only have to update/change it in one place.
valueToCheck='myvalue'
filter(dataDF, column1 == valueToCheck & column2 == valueToCheck)
Kind of a clunky base-R solution:
df[do.call(`&`, lapply(df[, c('column1', 'column2')], `==`, myvalue)), ]
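For illustration, the same line run against a made-up data frame (the column names and value are stand-ins):
df <- data.frame(column1 = c("a", "b", "a"), column2 = c("a", "a", "b"))
myvalue <- "a"
# lapply() compares each listed column to myvalue; do.call(`&`, ...) ANDs the results
df[do.call(`&`, lapply(df[, c('column1', 'column2')], `==`, myvalue)), ]
#   column1 column2
# 1       a       a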
I have written a series of ifelse statements that create a new column in a data frame. Example:
rf$rs.X<- ifelse(rf$chl==0 & rf$dm=="y" & rf$gdr=="m" & rf$smk=="n" & rf$age>=70 & rf$sbp>=160, ">=40%", NA)
When I run additional ifelse statements, they overwrite the previous one, because I map it to the same new column in the data frame (i.e. all my ifelse statements begin with the same rf$rs.X).
I know that for each statement I could specify a new column to be created, but I have 68 ifelse statements and I do not want a new column for each: I want a single new column.
To work around this I tried nesting all 68 ifelse statements, but it doesn't seem to work (when I try to run it I just get the console's blue continuation prompt (+), meaning the expression is incomplete). After reading on here, it seems there is a limit to nesting (n=50).
So, for a series of ifelse statements, how can I get all the outputs into the same column in a data frame, without each ifelse statement overwriting the previous one?
Thanks!
I would have written it like this:
rf$rs.X <- with(rf, ifelse(chl==0 & dm=="y" & gdr=="m" &
                           smk=="n" & age>=70 & sbp>=160, ">=40%", NA))
Then the next one (say, for the low-sbp cases) could use the existing rs.X value as the alternative:
rf$rs.X <- with(rf, ifelse(chl==0 & dm=="y" & gdr=="m" &
                           smk=="n" & age>=70 & sbp < 160, "30-39%", rs.X))
So that way the value is not overwritten for the non-qualifying rows.
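A minimal sketch of that chaining pattern on made-up data (these columns stand in for the real ones):
rf <- data.frame(age = c(75, 72, 40), sbp = c(170, 150, 120))
rf$rs.X <- with(rf, ifelse(age >= 70 & sbp >= 160, ">=40%", NA))     # first rule
rf$rs.X <- with(rf, ifelse(age >= 70 & sbp < 160, "30-39%", rs.X))   # later rules fall back to rs.X
rf$rs.X
# [1] ">=40%"  "30-39%" NA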
I am a newbie to R.
I have a dataset ITEproduction_2014.2015 and I only want to see datapoints between 4 and 39 days. Currently I use 2 separate lines to create a subset.
Can I do this in 1 line? Something like Date.Difference > 3 and < 40?
ITEproduction_2014.2015 <- subset(ITEproduction_2014.2015,Date.Difference>3)
ITEproduction_2014.2015 <- subset(ITEproduction_2014.2015,Date.Difference<40)
thanks in advance,
Dirk
Just a little googling would have solved your problem; for example, read this about logical operators.
Like this?
ITEproduction_2014.2015<-subset(ITEproduction_2014.2015,Date.Difference>3 & Date.Difference<40)
Avoid using subset altogether if you can. See the warning in the help file:
?subset
If you like the syntax of subset(), and prefer it to standard subsetting functions like [, you can use dplyr:
library(dplyr)
ITEproduction_2014.2015 %>%
dplyr::filter(
Date.Difference > 3,
Date.Difference < 40
)
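If you do want to stay in base R, the same filter can be written with [ and a single logical index:
ITEproduction_2014.2015[ITEproduction_2014.2015$Date.Difference > 3 &
                          ITEproduction_2014.2015$Date.Difference < 40, ]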
My code is littered with statements of the following taste:
selected <- long_data_frame_name[long_data_frame_name$col1 == "condition1" &
long_data_frame_name$col2 == "condition2" & !is.na(long_data_frame_name$col3),
selected_columns]
The repetition of the data frame name is tedious and error-prone. Is there a way to avoid it?
You can use with. For instance:
sel.ID <- with(long_data_frame_name, col1==2 & col2<0.5 & col3>0.2)
selected <- long_data_frame_name[sel.ID, selected_columns]
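One caveat worth noting: if any of those columns contain NA, the logical index will contain NA and [ will return unwanted NA rows; wrapping the index in which() drops them:
selected <- long_data_frame_name[which(sel.ID), selected_columns]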
Several ways come to mind.
If you think about it, you are subsetting your data. Hence use the subset function (base package):
your_subset <- subset(long_data_frame_name,
col1 == "cond1" & "cond2" == "cond2" & !is.na(col3),
select = selected_columns)
This is, in my opinion, the most self-explanatory code for this task.
Use data.table.
library(data.table)
long_data_table_name = data.table(long_data_frame_name, key="col1,col2,col3")
selected <- long_data_table_name[col1 == "condition1" &
col2 == "condition2" &
!is.na(col3),
list(col4,col5,col6,col7)]
You don't have to set the key in the data.table(...) call, but if you have a large dataset, this will be much faster. Either way it will be much faster than using data frames. Finally, using J(...), as below, does require a keyed data.table, but is even faster.
selected <- long_data_table_name[J("condition1","condition2",NA),
list(col4,col5,col6,col7)]
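For reference, the key can also be set (or changed) after construction with setkey(), which is what the J(...) lookup above relies on:
library(data.table)
long_data_table_name <- as.data.table(long_data_frame_name)
setkey(long_data_table_name, col1, col2, col3)   # same effect as key="col1,col2,col3" above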
You have several possibilities:
attach, which adds the variables of the data.frame to the search path just below the global environment. It is very useful for code demonstrations, but I warn you not to do that programmatically.
with, which temporarily creates a whole new environment.
In very limited cases you want to use other options such as within.
df <- data.frame(random = runif(100))
df1 <- with(df, log(random))                  # returns just the vector of logs
df2 <- within(df, logRandom <- log(random))   # returns df plus the new column
with will just evaluate your expression, while within will examine the created environment after evaluation and add the modifications to the data. Check the help page of with for more examples.
I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=240min"),Result],
ifelse(grepl("hlm",inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=60"),Result],
dt[which(dt[,Time] == "t=30"),Result]))
t0Value <- dt[which(dt[,Time] == "t=0"),Result]
return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Result values for all of the t0Value rows, but what I want is a single ratio for each unique by group.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: it produces a result with the attributes and length from test and the values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm, or maybe it is, since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
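A small sketch of the difference (independent of the question's data):
x <- c(1, 2, 3)
ifelse(TRUE, x, 0)   # returns 1: result length comes from the length-1 test
if (TRUE) x else 0   # returns c(1, 2, 3): the chosen branch, unchanged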