I have written a series of ifelse statements that create a new column in a data frame. Example:
rf$rs.X<- ifelse(rf$chl==0 & rf$dm=="y" & rf$gdr=="m" & rf$smk=="n" & rf$age>=70 & rf$sbp>=160, ">=40%", NA)
When I run additional ifelse statements, they overwrite the previous one, because I map it to the same new column in the data frame (i.e. all my ifelse statements begin with the same rf$rs.X).
I know that for each statement I could specify a new column to be created, but I have 68 ifelse statements and I do not want a new column for each: I want a single new column.
To work around this I tried nesting all 68 ifelse statements but it doesn't seem to work (when I try to run it I get a blue plus sign (+)). After reading on here, it seems there is a limit to nesting (n=50).
So, for a series of ifelse statements, how can I get all the outputs into the same column of a data frame, without each ifelse statement overwriting the previous one?
Thanks!
I would have written it like this:
rf$rs.X<- with( rf, ifelse(chl==0 & dm=="y" & gdr=="m" &
smk=="n" & age>=70 & sbp>=160, ">=40%", NA))
Then the next one (say, for the low sbp cases) could use the existing rs.X value as the alternative:
rf$rs.X<- with( rf, ifelse(chl==0 & dm=="y" & gdr=="m" &
smk=="n" & age>=70 & sbp < 160, "30-39%", rs.X))
So that way the value is not overwritten for the non-qualifying rows.
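As a rough sketch of the same idea without ifelse, you could create the column once and then fill it rule by rule with logical indexing, so later rules never touch rows they do not match. The toy values below are made up for illustration; only the column names come from the question.

```r
# Toy data; column names follow the question, values are invented.
rf <- data.frame(
  chl = c(0, 0, 1),
  dm  = c("y", "y", "n"),
  gdr = c("m", "m", "f"),
  smk = c("n", "n", "n"),
  age = c(72, 75, 60),
  sbp = c(165, 150, 120),
  stringsAsFactors = FALSE
)

# Start with an all-NA character column, then fill it one rule at a time.
rf$rs.X <- NA_character_

# Conditions shared by both rules.
base <- with(rf, chl == 0 & dm == "y" & gdr == "m" & smk == "n" & age >= 70)

# which() drops NA conditions so they never index the assignment.
rf$rs.X[which(base & rf$sbp >= 160)] <- ">=40%"
rf$rs.X[which(base & rf$sbp <  160)] <- "30-39%"

rf$rs.X
```

Rows that match no rule simply stay NA, which makes it easy to check afterwards whether all 68 rules together cover every row.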
Related
I am reproducing some Stata code in R and struggling with the following command:
gen new_var=((var1_a==1 & var1_b==0) | (var2_a==1 & var2_b==0))
I am generally familiar with the gen syntax, but in this case I do not understand how values are assigned based on the boolean condition.
What would the above be in R?
In Stata, the above gen command works because the variables var1_a, var1_b, var2_a, and var2_b exist in your in-memory dataset (similar to a single R dataframe). If these exist as vectors in your R environment, then our colleague Nick Cox is exactly correct: all that is needed is the statement without the leading gen (although typically in R we would write it like this):
new_var <- (var1_a==1 & var1_b==0) | (var2_a==1 & var2_b==0)
However, suppose you have a data frame object, say df, that contains columns with these names, and the objective is to add another column to df that reflects your logical conditions (like adding a new variable ("column") to the dataset in Stata using generate / gen). In this case, the above approach will not work, as the columns var1_a, var1_b, etc. will not be found in the global environment.
Instead, to add a new column called new_var to the dataframe called df, we would write something like this:
df["new_var"] <- (df$var1_a==1 & df$var1_b==0) | (df$var2_a==1 & df$var2_b==0)
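One difference worth keeping in mind: Stata stores the result of a logical expression as 1 (true) or 0 (false), whereas R stores TRUE/FALSE. A minimal sketch, using made-up values, of how the same column could be built with with() and converted to 0/1 if you want the result to look like the Stata variable:

```r
# Made-up values; the column names follow the question.
df <- data.frame(
  var1_a = c(1, 1, 0, 0),
  var1_b = c(0, 1, 0, 1),
  var2_a = c(0, 1, 1, 0),
  var2_b = c(1, 0, 0, 0)
)

# with() evaluates the expression inside df, so the df$ prefixes can be dropped.
# as.integer() turns TRUE/FALSE into 1/0, as Stata would store it.
df$new_var <- with(
  df,
  as.integer((var1_a == 1 & var1_b == 0) | (var2_a == 1 & var2_b == 0))
)

df
```

Leaving out as.integer() keeps the column logical, which is usually the more natural choice if it is only used for filtering in R.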
How can I convert the following code from Stata to R?
gen a01sb=cond(b01~=1 & c01~=1, a01, 0)
I know that it is sorted by and includes an if-else-condition but I don't know how to code this in R.
Thanks in advance!
In Stata both != and ~= mean "not equals", but in R only != is equivalent. The ifelse function is usually applied to columns of a dataframe, but it works on any vectors, with vectorized logical operators such as & used in the first argument:
a01sb <- ifelse( (b01 != 1)& (c01 != 1), a01, 0) # inner parens used for clarity
(There would be no sorting. Sorting would not make a great deal of sense if trying to keep results associated with the vectors on which the calculations are made.)
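One caveat when porting this: Stata treats a missing value as larger than any number, so b01~=1 is true when b01 is missing, while in R NA != 1 is NA and ifelse then returns NA for that row. The sketch below uses made-up data, and ne1 is a hypothetical helper introduced only for illustration; it shows both the direct translation and a variant that mimics the Stata behaviour.

```r
# Made-up data with one missing value in b01.
d <- data.frame(
  a01 = c(10, 20, 30, 40),
  b01 = c(0, 1, NA, 2),
  c01 = c(0, 0, 0, 1)
)

# Direct translation: the NA in b01 propagates into the result.
d$a01sb <- with(d, ifelse(b01 != 1 & c01 != 1, a01, 0))

# Hypothetical helper: treat a missing value as "not equal to 1", as Stata does.
ne1 <- function(x) is.na(x) | x != 1

# Stata-like variant: the row with missing b01 now keeps its a01 value.
d$a01sb_stata <- with(d, ifelse(ne1(b01) & ne1(c01), a01, 0))

d
```

Which of the two is right depends on how missing values should be handled in your analysis, so it is worth checking the rows with NA explicitly.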
I'd like to learn how to conditionally replace values in R data frame using if/then statements. Suppose I have a data frame like this one:
df <- data.frame(
customer_id = c(568468,568468,568468,485342,847295,847295),
customer = c('paramount','paramount','paramount','miramax','pixar','pixar'));
I'd like to do something along the lines of,
"if customer in ('paramount','pixar') make customer_id 99. Else do nothing". I'm using this code, but it's not working:
if(df$customer %in% c('paramount','pixar')){
df$customer_id == 99
}else{
df$customer_id == df$customer_id
}
I get a warning message such as the condition has length > 1 and only the first element will be used. And the values aren't replaced.
I'd also like to know how to do this using logical operators to perform something like,
"if customer_id >= 500000, replace customer with 'fox'. Else, do nothing."
Very easy to do in SQL, but can't seem to figure it out in R.
My sense is that I'm missing a bracket somewhere?
How do I conditionally replace values in R data frame using if/then statements?
You can use ifelse, like this:
df$customer_id <- ifelse(df$customer %in% c('paramount', 'pixar'), 99, df$customer_id)
The syntax is simple:
ifelse(condition, result if TRUE, result if FALSE)
This is vectorized, so you can use it on a dataframe column.
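You are using == (comparison) instead of = (the assignment operator) in the if block. I also don't think there is any need for an else block in your example, as you are not going to change the other values. Note too that if() expects a single TRUE/FALSE rather than a whole column, which is what the warning is about; indexing the column with the condition applies the change row by row:
df$customer_id[df$customer %in% c('paramount','pixar')] = 99
The above code will do the job for you.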
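For the second part of the question (replace customer with 'fox' where customer_id >= 500000), the same two patterns apply. This is a minimal sketch on the question's data; stringsAsFactors = FALSE is added here as an assumption so that customer is a character column, since assigning a new label to a factor column would give NA.

```r
df <- data.frame(
  customer_id = c(568468, 568468, 568468, 485342, 847295, 847295),
  customer = c('paramount', 'paramount', 'paramount', 'miramax', 'pixar', 'pixar'),
  stringsAsFactors = FALSE  # keep customer as character, not factor
)

# Option 1: ifelse keeps the old value wherever the condition is FALSE.
via_ifelse <- ifelse(df$customer_id >= 500000, 'fox', df$customer)

# Option 2: logical indexing changes only the matching rows in place.
df$customer[df$customer_id >= 500000] <- 'fox'

df
```

Both options give the same column; logical indexing is often the closer match to an SQL UPDATE ... WHERE statement.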
I'm working with data frames in R and often use a lot of conditions on my dataframes, mainly & and | operators. I'm doing that like so:
df = data.frame("col1"=c("success", "failure", "success"), "col2"=c("success", "success", "failure"), "col3"=c(1,1,100))
#multiple conditions, option 1
df[(df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1), ]
#multiple conditions, option 2
df[which((df[["col1"]]=="success") & (df[["col2"]]=="success") & (df[["col3"]] == 1)),]
However, my expressions tend to get really long and hard to read that way.
Is there a better, more readable way of doing it?
EDIT: Preferably, I'd like to work within the base R environment w/out external libraries.
I put this together based on other posts here on SO about not using subset and using | correctly, but didn't find anything addressing this specific issue.
I hope this is not too opinion-based, otherwise I'll retract my question. Thanks!
One option is to use the filter() function in the dplyr package:
library(dplyr)
filter(df, col1=="success" & col2=="success" & col3==1)
You can also use commas (equivalent to &) to separate multiple arguments:
filter(df, col1=="success", col2=="success", col3==1)
Try this (perhaps less clumsy, though it still uses '&'):
> df[df$col1=="success" & df$col2=="success" & df$col3==1,]
col1 col2 col3
1 success success 1
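Since the question asks for base R, another option is with(), which lets you refer to the columns by name inside the condition. A small sketch on the question's data (stringsAsFactors = FALSE is an addition here, so the comparison behaves the same on older R versions); storing the condition in a named variable also helps readability when there are many filters.

```r
df <- data.frame(
  col1 = c("success", "failure", "success"),
  col2 = c("success", "success", "failure"),
  col3 = c(1, 1, 100),
  stringsAsFactors = FALSE
)

# Build the condition once, with column names evaluated inside df.
keep <- with(df, col1 == "success" & col2 == "success" & col3 == 1)

# Use the named condition to select rows.
result <- df[keep, ]
result
```

If the columns can contain NA, df[which(keep), ] avoids the all-NA rows that a bare logical index would return for those cases.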
I am currently working with 500,000 observations of data and I have a step in my R code that does the following:
attach(ds)
weight <- rep(NA,length(date))
sales_base <- rep(NA,length(date))
cumsales <- rep(NA,length(date))
weight[dup_no!=0 & month(date)==7] = lag_sales[dup_no!=0 & month(date)==7]
sales_base[dup_no!=0 & month(date)==7] = cumsales[dup_no!=0 & month(date)==7]
cumsales [dup_no!=0 & month(date)==7] = 1+ disc[dup_no!=0 & month(date)==7]
for(i in 2:length(permno))
{
if(dup_no[i]!=0 & month(date[i])!=6 & !is.na(lag_sales[i]) & (lag_sales[i])>0)
{
cumsales[i] = cumsales[i-1]*(1+disc[i])
weight[i] = cumsales[i]*sales_base[i-1]
}
if(dup_no[i]!=0 & month(date[i])!=6 & (lag_sales[i])<=0)
{
cumsales[i] = cumsales[i-1]*(1+disc[i])
weight_port[i] = NA
}
}
(The formulae might not make sense as I haven't shown you the entire code.)
Lines 2 to 4 create three columns filled with NA. The next three lines fill in the values of the cells in those columns where a set of conditions is fulfilled. The for loop then tries to fill in the remaining empty values by calculating new values based on the previously filled-in cells (obtained from lines 5, 6, 7).
The for loop here is taking a lot of time because of the data size, and I need to optimize this code as it will run on a much larger dataset. Is there any alternative that can be used instead of this for loop?
Thanks in advance!
Loops are generally very time consuming in R. Best avoid them whenever possible. If you search for "vectorization" you will find tons of threads and tutorials discussing the topic.
Just a brief example with your code:
index <- dup_no!=0 & month(date)!=6 & !is.na(lag_sales) & (lag_sales)>0
cumsales[index] <- cumsales[which(index)-1]*(1+disc[index])
weight[index] <- cumsales[index]*sales_base[which(index)-1]
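This should be able to replace the first part of your for loop, provided the value at position i-1 does not itself depend on an earlier iteration of the loop; a true running product needs something like cumprod() instead.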
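Because cumsales[i] is built from cumsales[i-1], the recurrence is a running product, and cumprod() can compute it without a loop. The sketch below is deliberately simplified and rests on assumptions not in the question: every row after the first qualifies, there are no groups or resets, and the disc values are made up. It only demonstrates that the loop and cumprod() agree.

```r
disc <- c(0.10, 0.05, 0.20, 0.00, 0.15)  # made-up discount rates

# Loop version, following the recurrence in the question.
loop_cs <- numeric(length(disc))
loop_cs[1] <- 1 + disc[1]                # base value for the first row
for (i in 2:length(disc)) {
  loop_cs[i] <- loop_cs[i - 1] * (1 + disc[i])
}

# Vectorized version: a running product of the growth factors.
vec_cs <- cumprod(1 + disc)

# Compare the two results (allowing for floating point tolerance).
all.equal(loop_cs, vec_cs)
```

If the running product has to restart within groups (for example per permno), applying cumprod within each group, e.g. via ave(1 + disc, group, FUN = cumprod) with a suitable grouping vector, is one way to extend this; the right grouping depends on parts of the code that are not shown.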