I'm new to R. I'm trying to set a new column in my data frame depending on what's in 3 other columns. I've looked at other queries like:
Populate a column using if statements in r
I thought that would solve it, but it looks like I can only give sapply a single vector, because when I try the following code:
IHC <- c("N","N","Y","N","N")
CCD <- c("13-Nov-2009", NA, "09-Feb-2011", "10-Dec-2012", "16-Nov-2009")
IHE <- c(NA, "20-Feb-2011",NA,NA,NA)
df1 <- data.frame(IHC, CCD, IHE)
InHouse <- function(IHC,CCD,IHE) {
if(IHE == "" && CCD == NA | IHC == "N") y <- ""
if(IHE == "") y <- CCD
if(CCD > IHE) y <- IHE
else y <- CCD
return(y)
}
df1$AAA <- sapply(c(df1$IHC, df1$CCD, df1$IHE), InHouse)
I get the following error:
Error in IHE == "" : 'IHE' is missing
Any help would be great.
There are several issues.
Your conditions involve comparisons like: IHE=="". IHE is NA but never "". So I assume you want is.na(IHE)??
You are mixing the scalar form of and (&& instead of &) with the vectorized form of or (| instead of ||). Why??
The comparison CCD > IHE is meaningless if either is NA (which is always the case here). Even when both are present, > compares the strings alphabetically, not as dates.
The operator & (like &&) has higher precedence than | (and ||), so IHE == "" && CCD == NA | IHC == "N" is evaluated as (IHE == "" && CCD == NA) | IHC == "N". Is that what you want??
Most important, your conditions are not mutually exclusive.
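A quick console check of the points above. This is only a minimal sketch; the values are made up for illustration and are not the asker's data:

```r
# Illustrative value only
ihe <- NA

ihe == ""      # NA: comparing NA with anything yields NA, never TRUE
is.na(ihe)     # TRUE: the reliable way to test for a missing value

# & binds more tightly than |, so the first line is TRUE | (FALSE & FALSE)
TRUE | FALSE & FALSE      # TRUE
(TRUE | FALSE) & FALSE    # FALSE

# Date strings compare alphabetically, not chronologically
"09-Feb-2011" > "10-Dec-2009"   # FALSE, although 2011 is the later year
```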
This is a way to apply the conditions without the use of any of the apply(...) functions.
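# Keep the dates as character strings rather than factors
df1 <- data.frame(IHC, CCD, IHE, stringsAsFactors=FALSE)
df1$AAA <- df1$CCD
# Rule 1: blank when both dates are missing, or when IHC is "N"
cond <- with(df1, is.na(IHE) & is.na(CCD) | IHC == "N")
df1$AAA[cond] <- ""
# Rule 2: fall back to CCD when IHE is missing (this overrides rule 1 where both apply)
cond <- is.na(df1$IHE)
df1$AAA[cond] <- df1$CCD[cond]
# Rule 3: use IHE when both dates are present and CCD sorts after IHE
cond <- with(df1, !is.na(CCD) & !is.na(IHE) & CCD > IHE)
df1$AAA[cond] <- df1$IHE[cond]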
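If you do want to keep a row-by-row function, mapply() is the multi-vector counterpart of sapply(). The sketch below assumes a simplified rule set, because the question leaves the exact precedence of the conditions open; in_house and its three rules are hypothetical and only there to show the mechanics:

```r
IHC <- c("N", "N", "Y", "N", "N")
CCD <- c("13-Nov-2009", NA, "09-Feb-2011", "10-Dec-2012", "16-Nov-2009")
IHE <- c(NA, "20-Feb-2011", NA, NA, NA)
df1 <- data.frame(IHC, CCD, IHE, stringsAsFactors = FALSE)

# Hypothetical scalar rule, assumed for illustration only
in_house <- function(ihc, ccd, ihe) {
  if (is.na(ccd) && is.na(ihe)) return("")      # no date at all
  if (ihc == "N" && !is.na(ihe)) return(ihe)    # assumed: "N" rows prefer IHE
  if (is.na(ccd)) ihe else ccd                  # otherwise whichever date exists
}

# mapply() calls in_house() once per row, taking one element from each vector
df1$AAA <- mapply(in_house, df1$IHC, df1$CCD, df1$IHE, USE.NAMES = FALSE)
df1
```

Because the function is called once per row, the scalar operators && and || are appropriate inside it.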
I am sure this question has been asked before and has an easy solution, but I can't seem to find it.
I am trying to conditionally replace the logical value of a variable based on the value of other variables in the data. Specifically, I am trying to determine eligibility based on survey responses.
I have created my eligibility variable in dataframe screen:
screen$eligible <- ifelse (
(screen$age > 17 & screen$age < 23)
& (screen$alcohol > 3 | screen$marijuana > 3)
& (screen$country == 0 | screen$ageus < 12)
& (screen$county_1 == 17 | screen$county_1 == 27 | screen$county_1 == 31)
& (screen$residence_1 == 47),
TRUE,
FALSE)
And now, based on study changes, I would like to further limit eligibility. I tried the code below, and it works in part, but it appears that I am introducing NAs to my eligibility variable and missing out on folks who should be eligible.
screen$eligible <- ifelse( screen$eligible ==TRUE, ifelse(
(screen$gender_1 == 1 & screen$age > 18)
|(screen$gender_8 == 1 & screen$age > 20),
FALSE, TRUE), FALSE)
I ultimately want TRUE or FALSE values.
Two questions
Is there a clearer or more concise way to update the code to update my eligibility requirements?
Any ideas as to why I might be introducing NAs?
Continuing from what @zephryl wrote, an even more readable version is:
screen$eligible <- with(screen,
(age > 17 & age < 23)
& (alcohol > 3 | marijuana > 3)
& (country == 0 | ageus < 12)
& county_1 %in% c(17, 27, 31)
& (residence_1 == 47))
To detect which columns contain NAs:
sapply(screen, anyNA)
1. Is there a clearer or more concise way to update the code to update my eligibility requirements?
If you ever find yourself writing x = ifelse(condition, TRUE, FALSE), as you are here -- that's equivalent to just writing x = condition. Also, your three county_1 == x statements can be replaced with one county_1 %in% c(x, y, z). So your first code block could be written as,
# Each line ends with & so that R knows the expression continues
screen$eligible <- (screen$age > 17 & screen$age < 23) &
  (screen$alcohol > 3 | screen$marijuana > 3) &
  (screen$country == 0 | screen$ageus < 12) &
  screen$county_1 %in% c(17, 27, 31) &
  (screen$residence_1 == 47)
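Likewise, your second code block could be simplified as follows (note the negation: rows that match the new condition become ineligible, as in your nested ifelse):
screen$eligible <- screen$eligible &
  !((screen$gender_1 == 1 & screen$age > 18) |
    (screen$gender_8 == 1 & screen$age > 20))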
2. Any ideas as to why I might be introducing NAs?
It's hard to say without seeing your data, but the NAs probably indicate that one or more of your constituent variables (gender_1, gender_8, age) is NA for some cases.
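To see how a single missing value turns into an NA in the result, and one way to force a strict TRUE/FALSE, here is a small sketch. The data frame is made up and only borrows two column names from the question; treating "unknown" as not eligible is an assumption you may not share:

```r
# Made-up data: the third respondent skipped the age question
screen <- data.frame(age     = c(19, 25, NA, 20),
                     alcohol = c(5, 5, 5, 1))

eligible <- screen$age > 17 & screen$age < 23 & screen$alcohol > 3
eligible                      # TRUE FALSE NA FALSE

# Treat "unknown" as not eligible so the result is strictly TRUE/FALSE
eligible_strict <- !is.na(eligible) & eligible
eligible_strict               # TRUE FALSE FALSE FALSE
```

Note that NA & FALSE is FALSE, so an NA only survives when every other condition in the chain is TRUE.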
I have the following dataframe:
df1
Name Ch1 Val1
A a x1
B b x2
C a x3
...
And I want to add another column that holds the result of the loop I am trying to write:
for (i in nrow(df))
if ( (df[i,3]>=-2)==T & (df3[i,3] <=2)==T & df[i,2]=="a"){
df[i,4]<-TRUE
}else if ((df[i,3]>2)==T & df[i,2]=="b"){
df[i,4]<-TRUE
}else (df[i,4]<-FALSE)
So basically if the value in Val1 is in an interval of -2 and +2 AND Ch1 is "a" it should result in TRUE
OR if Val1 is bigger than 2 AND Ch1 is "b" then the result is TRUE
Otherwise it should always be false.
My loop seems to only return the result for the first row the rest is NA.
Any idea where the mistake is? Or another way to solve this (even though I actually have a few more ORs)
Thank you!
If I understand correctly, you are trying to create a new column that contains TRUE or FALSE. I would use dplyr for this.
library(dplyr)
df <- df %>%
  mutate(new_column = case_when(
    Val1 >= -2 & Val1 <= 2 & Ch1 == "a" ~ TRUE,
    Val1 > 2 & Ch1 == "b" ~ TRUE,
    TRUE ~ FALSE
  ))
Your for loop only does one iteration because it is passed a single value instead of a sequence: i takes on only the single value you specify, not each value in a sequence such as each number from 1 up to nrow(df).
For example:
df <- data.frame(a = 1:5)
for (i in nrow(df)) {
print(i)
}
results in:
5
but,
for (i in 1:nrow(df)) {
print(i)
}
results in:
1
2
3
4
5
But the answer posted by @annet is more elegant.
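For completeness, the same result is also possible in base R without a loop, since the comparison operators already work on whole columns. A minimal sketch; the numbers standing in for x1, x2, x3 are made up:

```r
# Made-up values standing in for x1, x2, x3 in the question
df <- data.frame(Name = c("A", "B", "C"),
                 Ch1  = c("a", "b", "a"),
                 Val1 = c(1.5, 3, -4))

# Each comparison runs over the whole column at once, so no loop is needed
df$new_column <- (df$Val1 >= -2 & df$Val1 <= 2 & df$Ch1 == "a") |
                 (df$Val1 > 2 & df$Ch1 == "b")
df$new_column   # TRUE TRUE FALSE
```

Further OR branches can be appended with | in the same way. If you do keep a loop, seq_len(nrow(df)) is a slightly safer sequence than 1:nrow(df), because it is empty when the data frame has no rows.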
I'm trying to apply some basic "if" by group over a large dataset in R.
I tried to write a function and to apply over the groups using dplyr but it's not working. What could be the problem?
#dataframe
db <- data.frame(ID=c(1,1),
type=c("a","b"),
qual=c("no","OK"))
#if (no problem)
attach(db)
if(db[type =="a","qual"]=="OK"){
db[type =="a","qual_fin"] <- "OK"
db[type =="b","qual_fin"] <- "no"
} else if ( db[type =="b","qual"]=="OK"){
db[type =="b","qual_fin"] <- "OK"
db[type =="a","qual_fin"] <- "no"
} else {db$qual_fin <- "no"
}
#dataframe with groups
db <- data.frame(ID=c(1,1,2,2),
type=c("a","b","a","b"),
qual=c("OK","OK","no","OK"))
#function
quality <- function( a,b, qual_fin_a,qual_fin_b){
if(a =="OK"){
qual_fin_a <- "OK"
qual_fin_b <- "no"
} else if ( b =="OK"){
qual_fin_b <- "OK"
qual_fin_a <- "no"
} else {qual_fin_a <- "no"
qual_fin_b <- "no"
}}
#if by group
library(dplyr)
db2 <- db %>%
group_by(ID) %>%
do(quality(a=db[db$type =="a","qual"],
b=db[db$type =="b","qual"],
qual_fin_a=db[db$type=="a","qual_fin"],
qual_fin_b=db[db$type=="b","qual_fin"]))
I expect this result:
> db
ID type qual qual_fin
1 1 a OK OK
2 1 b OK no
3 2 a no no
4 2 b OK OK
I imagine that the solution is pretty simple but I'm struggling to find it!
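One possible way to get the expected table, sketched in base R rather than with do(). It assumes the rule is "within each ID, the first OK row in type order (a before b) keeps OK and everything else becomes no"; the helper names ok and ok_rank are made up for this example:

```r
db <- data.frame(ID   = c(1, 1, 2, 2),
                 type = c("a", "b", "a", "b"),
                 qual = c("OK", "OK", "no", "OK"),
                 stringsAsFactors = FALSE)

# Sort so that type "a" comes before "b" within each ID
db <- db[order(db$ID, db$type), ]

# Running count of "OK" rows within each ID
ok <- db$qual == "OK"
ok_rank <- ave(as.numeric(ok), db$ID, FUN = cumsum)

# Only the first "OK" row of each ID keeps "OK"; everything else becomes "no"
db$qual_fin <- ifelse(ok & ok_rank == 1, "OK", "no")
db
```

The same ifelse() expression could go inside a grouped dplyr mutate() if you prefer to stay in that style.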
I am trying to create a subset of the rows that have a value of 1 for variable A, and a value of 1 for at least one of the following variables: B, C, or D.
Subset1 <- subset(Data,
Data$A==1 &
Data$B ==1 ||
Data$C ==1 |
Data$D == 1,
select= A)
Subset1
The problem is that the code above returns some rows that have A=0 and I am not sure why.
To troubleshoot:
As I understand it, && and || are the long forms of "and" and "or", which vectorize the comparison.
I have run this code several times using &&, ||,& and | in different places. Nothing returns what I am looking for exactly.
When I shorten the code, it works fine and I subset only the rows that I would expect:
Subset1 <- subset(Data,
Data$A==1 &
Data$B==0,
select= A)
Subset1
Unfortunately, this doesn't suffice since I also need to capture rows whose C or D value = 1.
Can anyone explain why my first code block is not subsetting what I am expecting it to?
You can use parentheses to be explicit about what your & applies to. Otherwise (as @Patrick Trentin clarified) your logical operators are combined according to operator precedence: & binds more tightly than |, and operators at the same level of precedence are evaluated from left to right.
Example:
> FALSE & TRUE | TRUE #equivalent to (FALSE & TRUE) | TRUE
[1] TRUE
> FALSE & (TRUE | TRUE)
[1] FALSE
So in your case you can try something like the line below (assuming you want rows where A == 1 that also meet at least one of the other conditions):
Data$A==1 & (Data$B==1 | Data$C==1 | Data$D==1)
Since you didn't provide the data you're working with, I've replicated some here.
set.seed(20)
Data = data.frame(A = sample(0:1, 10, replace=TRUE),
B = sample(0:1, 10, replace=TRUE),
C = sample(0:1, 10, replace=TRUE),
D = sample(0:1, 10, replace=TRUE))
If you wrap the OR conditions in parentheses so that they are evaluated as a single logical expression, you can achieve what you're looking for.
Subset1 <- subset(Data,
Data$A==1 &
(Data$B == 1 |
Data$C == 1 |
Data$D ==1),
select=A)
Subset1
A
1 1
2 1
4 1
5 1
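To make the effect of the parentheses easy to verify by eye, here is a sketch on a small hand-made data set (not the sampled one above); the names loose and strict are just labels for this example:

```r
# Small hand-made data set so the result is easy to check
Data <- data.frame(A = c(1, 0, 1, 0),
                   B = c(0, 0, 1, 0),
                   C = c(0, 1, 0, 0),
                   D = c(0, 0, 0, 0))

# Without parentheses this is read as (A & B) | C | D, so row 2 (A = 0) slips through
loose <- with(Data, A == 1 & B == 1 | C == 1 | D == 1)
loose    # FALSE TRUE TRUE FALSE

# With parentheses, A == 1 is required for every row that is kept
strict <- with(Data, A == 1 & (B == 1 | C == 1 | D == 1))
strict   # FALSE FALSE TRUE FALSE
```

The long forms && and || are meant for single TRUE/FALSE values (for example inside if()), so the short forms & and | are the ones to use for row-wise filters like this.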
I have a sequence of data frame subsetting operations. Some of them might fail because the rows to be replaced do not exist. I would still like the others to execute. Example:
source_data[source_data$abbr_d == "bdp",]$party_id <- 32
source_data[source_data$abbr_d == "svp",]$party_id <- 4
source_data[source_data$abbr_d == "cvp",]$party_id <- 2
source_data[source_data$abbr_d == "fdp",]$party_id <- 1
source_data[source_data$abbr_d == "gps",]$party_id <- 13
source_data[source_data$abbr_d == "sp",]$party_id <- 3
source_data[source_data$abbr_d == "csp",]$party_id <- 8
source_data[source_data$abbr_d == "pcs",]$party_id <- 8
Error in `$<-.data.frame`(`*tmp*`, "party_id", value = 13) :
replacement has 1 row, data has 0
source_data[source_data$abbr_d == "lega",]$party_id <- 18
source_data[source_data$abbr_d == "edu",]$party_id <- 16
source_data[source_data$abbr_d == "glp",]$party_id <- 31
I would like the script to continue after the error has been thrown. I've tried using tryCatch() but that doesn't really help because I don't know in advance at which point the replacement will fail.
Is there a way to tell R to just "not care" about those replacement errors? And still continue with the next replacement operations?
The only solution I came up with is to use if-statements like this, which is tedious:
if(nrow(source_data[source_data$abbr_d == "lega", 1]) > 0){
source_data[source_data$abbr_d == "lega",]$party_id <- 18
}
if(nrow(source_data[source_data$abbr_d == "edu", 1]) > 0){
source_data[source_data$abbr_d == "edu",]$party_id <- 16
}
etc...
That is quite verbose code. Luckily, there is a way to do this in a fraction of the code that also avoids your issue. My suggestion is to use a lookup table to build the party_id column:
df = data.frame(abbr_d = sample(LETTERS[1:8], 100, replace = TRUE))
lookup_table = 1:8
names(lookup_table) = LETTERS[1:8]
# A B C D E F G H
# 1 2 3 4 5 6 7 8
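# as.character() guards against abbr_d being a factor, which would index by level number
df$party_id = lookup_table[as.character(df$abbr_d)]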
So, you create the link between abbr_d and party_id once (here letters and simple numbers, but simply replace your values), and use the df$abbr_d column to subset the lookup table. This maps the labels in abbr_d to the values that correspond to that for party_id.
The error you see is avoided because only abbr_d values that are actually in the data are looked up in the lookup table. Entries in the lookup table that never occur in the data are simply not used, so they do not pose an issue.
A dplyr approach as a bonus:
library(dplyr)
df %>% mutate(party_id = lookup_table[abbr_d])
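Applied to the codes from the question, the lookup could look roughly like this. The codes are copied from the question; the four-row source_data and the name party_lookup are made up for the sketch:

```r
# Codes taken from the question
party_lookup <- c(bdp = 32, svp = 4, cvp = 2, fdp = 1, gps = 13, sp = 3,
                  csp = 8, pcs = 8, lega = 18, edu = 16, glp = 31)

# Made-up stand-in for source_data; "xyz" is deliberately not in the lookup
source_data <- data.frame(abbr_d = c("svp", "sp", "xyz", "glp"),
                          stringsAsFactors = FALSE)

# Index the named vector by abbreviation; unname() drops the labels again
source_data$party_id <- unname(party_lookup[source_data$abbr_d])
source_data$party_id   # 4 3 NA 31
```

Abbreviations that are missing from the lookup come back as NA instead of raising an error, which also makes them easy to spot afterwards.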
You can use the data.table package to avoid the error:
txt<-"
1,a,1
2,b,2
3,c,3
4,d,4
"
dat = read.delim(textConnection(txt),
header=FALSE,sep=",",strip.white=TRUE)
dat
dat[dat$V2=="e",]$V3<-4
# Error in `$<-.data.frame`(`*tmp*`, "V3", value = 4) :
# replacement has 1 row, data has 0
library(data.table)
data=as.data.table(dat)
data[data$V2=="e",]$V3<-4
# no error thrown
data.table is often faster than a plain data frame, afaik.
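A base R alternative that needs no extra package is to index the column rather than the rows of the data frame. A minimal sketch with a made-up two-row source_data; it assumes you are happy to create party_id up front:

```r
# Made-up stand-in for source_data; it contains no "lega" rows
source_data <- data.frame(abbr_d = c("svp", "sp"), stringsAsFactors = FALSE)
source_data$party_id <- NA_real_   # create the column before filling it

# Indexing the column (not the rows of the data frame) accepts an empty match
source_data$party_id[source_data$abbr_d == "lega"] <- 18   # no match, no error
source_data$party_id[source_data$abbr_d == "svp"]  <- 4
source_data$party_id   # 4 NA
```

Assigning into a vector with an all-FALSE index is simply a no-op, so the remaining replacements still run.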