I am trying to create a subset of the rows that have a value of 1 for variable A, and a value of 1 for at least one of the following variables: B, C, or D.
Subset1 <- subset(Data,
Data$A==1 &
Data$B ==1 ||
Data$C ==1 |
Data$D == 1,
select= A)
Subset1
The problem is that the code above returns some rows that have A=0 and I am not sure why.
To troubleshoot:
I know that && and || are the long forms of and and or, which vectorizes it.
I have run this code several times using &&, ||,& and | in different places. Nothing returns what I am looking for exactly.
When I shorten the code, it works fine and I subset only the rows that I would expect:
Subset1 <- subset(Data,
Data$A==1 &
Data$B==0,
select= A)
Subset1
Unfortunately, this doesn't suffice since I also need to capture rows whose C or D value = 1.
Can anyone explain why my first code block is not subsetting what I am expecting it to?
You can use parentheses to be explicit about what your & applies to. Otherwise (as @Patrick Trentin clarified) your logical operators are combined according to operator precedence: & binds more tightly than |, and operators at the same level of precedence are evaluated from left to right.
Example:
> FALSE & TRUE | TRUE #equivalent to (FALSE & TRUE) | TRUE
[1] TRUE
> FALSE & (TRUE | TRUE)
[1] FALSE
So in your case you can try something like below (assuming you want items that A == 1 & that meet one of the other conditions):
Data$A==1 & (Data$B==1 | Data$C==1 | Data$D==1)
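As a small illustration, here is how the two forms differ on a made-up data frame (toy and its values are invented for this sketch, not your data). It relies on & binding more tightly than |. Note also that && and || are the scalar forms, meant for single TRUE/FALSE values; as far as I know, R 4.3.0 and later raise an error when they are given longer vectors, so only & and | belong inside subset().

```r
# Hypothetical data, made up for this sketch
toy <- data.frame(A = c(1, 0, 1, 1, 0),
                  B = c(0, 1, 0, 1, 0),
                  C = c(0, 0, 0, 0, 1),
                  D = c(0, 1, 1, 0, 0))

# Without parentheses this parses as (A == 1 & B == 1) | C == 1 | D == 1
loose <- with(toy, A == 1 & B == 1 | C == 1 | D == 1)

# With parentheses, A == 1 must hold in every selected row
strict <- with(toy, A == 1 & (B == 1 | C == 1 | D == 1))

which(loose)   # 2 3 4 5 (rows 2 and 5 have A == 0)
which(strict)  # 3 4
```

The unparenthesized version lets any row with C == 1 or D == 1 through, which is why rows with A = 0 show up.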
Since you didn't provide the data you're working with, I've replicated some here.
set.seed(20)
Data = data.frame(A = sample(0:1, 10, replace=TRUE),
B = sample(0:1, 10, replace=TRUE),
C = sample(0:1, 10, replace=TRUE),
D = sample(0:1, 10, replace=TRUE))
If you group the OR conditions in parentheses, so that they are evaluated before the &, you can achieve what you're looking for.
Subset1 <- subset(Data,
Data$A==1 &
(Data$B == 1 |
Data$C == 1 |
Data$D ==1),
select=A)
Subset1
A
1 1
2 1
4 1
5 1
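If the list of "at least one of" columns grows, one possible variation is to count the matches per row with rowSums() instead of chaining |. This is only a sketch on made-up data (Data_demo and any_bcd are names invented here), and it assumes the columns hold no NAs:

```r
# Hypothetical data, made up for this sketch
Data_demo <- data.frame(A = c(1, 1, 0, 1),
                        B = c(0, 1, 1, 0),
                        C = c(0, 0, 1, 0),
                        D = c(1, 0, 0, 0))

# TRUE for rows where at least one of B, C, D equals 1
any_bcd <- rowSums(Data_demo[, c("B", "C", "D")] == 1) > 0

# Inside subset() the columns can be referred to by name, without Data_demo$
Subset_demo <- subset(Data_demo, A == 1 & any_bcd, select = A)
Subset_demo  # rows 1 and 2
```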
I am sure this question has been asked before and has an easy solution, but I can't seem to find it.
I am trying to conditionally replace the logical value of a variable based on the value of other variables in the data. Specifically, I am trying to determine eligibility based on survey responses.
I have created my eligibility variable in dataframe screen:
screen$eligible <- ifelse (
(screen$age > 17 & screen$age < 23)
& (screen$alcohol > 3 | screen$marijuana > 3)
& (screen$country == 0 | screen$ageus < 12)
& (screen$county_1 == 17 | screen$county_1 == 27 | screen$county_1 == 31)
& (screen$residence_1 == 47),
TRUE,
FALSE)
And now, based on study changes, I would like to further limit eligibility. I tried the code below, and it works in part, but it appears that I am introducing NAs to my eligibility variable and missing out on folks who should be eligible.
screen$eligible <- ifelse( screen$eligible ==TRUE, ifelse(
(screen$gender_1 == 1 & screen$age > 18)
|(screen$gender_8 == 1 & screen$age > 20),
FALSE, TRUE), FALSE)
I ultimately want TRUE or FALSE values.
Two questions
Is there a clearer or more concise way to update the code to update my eligibility requirements?
Any ideas as to why I might be introducing NAs?
Continuing from what @zephryl wrote, an even more readable version is:
screen$eligible <- with(screen,
(age > 17 & age < 23)
& (alcohol > 3 | marijuana > 3)
& (country == 0 | ageus < 12)
& county_1 %in% c(17, 27, 31)
& (residence_1 == 47))
To detect which columns contain NAs:
sapply(screen, anyNA)
1. Is there a clearer or more concise way to update the code to update my eligibility requirements?
If you ever find yourself writing x = ifelse(condition, TRUE, FALSE), as you are here -- that's equivalent to just writing x = condition. Also, your three county_1 == x statements can be replaced with one county_1 %in% c(x, y, z). So your first code block could be written as,
# Each line must end with & so that R knows the expression continues
screen$eligible <- (screen$age > 17 & screen$age < 23) &
  (screen$alcohol > 3 | screen$marijuana > 3) &
  (screen$country == 0 | screen$ageus < 12) &
  screen$county_1 %in% c(17, 27, 31) &
  (screen$residence_1 == 47)
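Likewise, your second code block could be simplified as follows. Note the !: your nested ifelse() returns FALSE when the gender/age condition is met, so that condition has to be negated:
screen$eligible <- screen$eligible &
  !((screen$gender_1 == 1 & screen$age > 18) |
    (screen$gender_8 == 1 & screen$age > 20))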
2. Any ideas as to why I might be introducing NAs?
It's hard to say without seeing your data, but the NAs probably indicate that one or more of your constituent variables (gender_1, gender_8, age) is NA for some cases.
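To illustrate how a single missing answer turns into an NA in the result, here is a minimal sketch on invented values (screen_demo is a stand-in, not your data). It also shows one possible way to force plain TRUE/FALSE, under the assumption that an unknown answer should count as "condition not met":

```r
# Hypothetical stand-in for the survey data
screen_demo <- data.frame(age      = c(19, 21, NA, 22),
                          gender_1 = c(1, NA, 1, 0))

cond <- with(screen_demo, gender_1 == 1 & age > 18)
cond                         # TRUE NA NA FALSE

# ifelse() passes the NA through instead of picking a branch
ifelse(cond, FALSE, TRUE)    # FALSE NA NA TRUE

# %in% TRUE treats NA as "not TRUE", so the result has no NAs
no_na <- cond %in% TRUE
no_na                        # TRUE FALSE FALSE FALSE

# Count the missing values per column
colSums(is.na(screen_demo))
```

Whether NA should mean eligible or not eligible is a study decision, so treat the %in% TRUE step as one option rather than the fix.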
I have the following dataframe:
df1
Name Ch1 Val1
A a x1
B b x2
C a x3
...
And I want to add another column that holds the result of the loop I am trying to write:
for (i in nrow(df))
if ( (df[i,3]>=-2)==T & (df3[i,3] <=2)==T & df[i,2]=="a"){
df[i,4]<-TRUE
}else if ((df[i,3]>2)==T & df[i,2]=="b"){
df[i,4]<-TRUE
}else (df[i,4]<-FALSE)
So basically if the value in Val1 is in an interval of -2 and +2 AND Ch1 is "a" it should result in TRUE
OR if Val1 is bigger than 2 AND Ch1 is "b" then the result is TRUE
Otherwise it should always be false.
My loop seems to only return the result for the first row the rest is NA.
Any idea where the mistake is? Or another way to solve this (even though I actually have a few more ORs)
Thank you!
If I understand correctly, you are trying to create a new column that contains TRUE or FALSE. I would use dplyr for this.
library(dplyr)

df <- df %>%
  mutate(new_column = case_when(
    Val1 >= -2 & Val1 <= 2 & Ch1 == "a" ~ TRUE,   # within [-2, 2] and Ch1 is "a"
    Val1 > 2 & Ch1 == "b" ~ TRUE,                 # above 2 and Ch1 is "b"
    TRUE ~ FALSE                                  # everything else
  ))
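For comparison, the same rule can be written in base R as one vectorized logical expression, with no loop and no extra package. This is a sketch on made-up rows (df_demo and its values are assumptions for illustration):

```r
# Hypothetical rows, made up for this sketch
df_demo <- data.frame(Name = c("A", "B", "C", "D"),
                      Ch1  = c("a", "b", "a", "b"),
                      Val1 = c(1, 3, 5, 0))

# Parentheses keep each AND group together before the OR
df_demo$new_column <- with(df_demo,
  (Val1 >= -2 & Val1 <= 2 & Ch1 == "a") |
  (Val1 > 2 & Ch1 == "b"))

df_demo$new_column  # TRUE TRUE FALSE FALSE
```

One difference to be aware of: if Val1 or Ch1 contains NA, this expression can return NA for that row, whereas the case_when() version falls through to FALSE.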
Your for loop only does one iteration because it is passed a single value instead of a sequence: i takes on only the single value you specify, not each value in a sequence such as each number from 1 up to nrow(df).
For example:
df <- data.frame(a = 1:5)
for (i in nrow(df)) {
print(i)
}
results in:
5
but,
for (i in 1:nrow(df)) {
print(i)
}
results in:
1
2
3
4
5
That said, the answer posted by @annet is more elegant.
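If you do keep a loop, seq_len(nrow(df)) is usually a safer index than 1:nrow(df), because the colon form counts down when the data frame has no rows. A minimal sketch (empty, bad, good and visited are names invented here):

```r
# A data frame with zero rows
empty <- data.frame(a = integer(0))

bad  <- 1:nrow(empty)         # 1:0 counts down, giving 1 0
good <- seq_len(nrow(empty))  # integer(0)

# With seq_len() the loop body never runs for an empty data frame
visited <- 0
for (i in good) visited <- visited + 1
visited  # 0
```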
This is my first project in R, after just having learned Java.
I have a (large) data set that I have imported from a csv file into a data frame.
I have identified the two relevant columns for this question: the first has the name of the patient, and the second records the patient's reported level of swelling.
The level of swelling is relative i.e. better, worse or about the same.
Not all patients have the same number of observations.
I am having difficulty converting these relative values into numerical values that can be used as part of a greater analysis.
Below is pseudocode for what I think could be an appropriate solution:
for row in 'patientname'
    patientcounter = dtfr1[row, 'patientname'];
    if dtfr1[row, 'patientname'] == patientcounter
        if dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'better'
            conditioncounter--;
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
        elseif dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'worse'
            conditioncounter++;
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
        else
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
    if dtfr1[row, 'patientname'] != patientcounter
        patientcounter = dtfr1[row, 'patientname'];
What would your advice be for a good solution to this problem? Thanks!
If I'm understanding correctly, you want the difference in the counts of worse and better, by patient? If so, something like this would work.
# Simulated data
dtfr1 <- data.frame(patient = sample(letters[1:3], 100, replace=TRUE),
condition = sample(c("better", "worse"), 100, replace=TRUE))
head(dtfr1)
# patient condition
# 1 a worse
# 2 b better
# 3 b worse
# 4 a better
# 5 c worse
# 6 a better
better_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "better"))
worse_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "worse"))
worse_count - better_count
# a b c
# 5 0 -1
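If what you want is the running counter from the pseudocode rather than one total per patient, a possible sketch uses ave() with cumsum(). The data below are invented, the label "about the same" is an assumption about how that answer is coded, and the rows are assumed to already be in time order within each patient:

```r
# Hypothetical observations, made up for this sketch
dtfr_demo <- data.frame(
  patient   = c("a", "a", "a", "b", "b"),
  condition = c("worse", "worse", "better", "better", "about the same"))

# Map each answer to a step: worse = +1, better = -1, anything else = 0
step <- ifelse(dtfr_demo$condition == "worse", 1,
        ifelse(dtfr_demo$condition == "better", -1, 0))

# Running total within each patient, in row order
dtfr_demo$score <- ave(step, dtfr_demo$patient, FUN = cumsum)
dtfr_demo$score  # 1 2 1 -1 -1
```

The counter restarts for every patient because ave() applies cumsum() separately to each group.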
I'm new to R. I'm trying to set a new column in my data frame depending on what's in 3 other columns. I've looked at other queries like:
Populate a column using if statements in r
Which I thought would solve it but it looks like I can only give sapply a single vector as when I try the following code:
IHC <- c("N","N","Y","N","N")
CCD <- c("13-Nov-2009", NA, "09-Feb-2011", "10-Dec-2012", "16-Nov-2009")
IHE <- c(NA, "20-Feb-2011",NA,NA,NA)
df1 <- data.frame(IHC, CCD, IHE)
InHouse <- function(IHC,CCD,IHE) {
if(IHE == "" && CCD == NA | IHC == "N") y <- ""
if(IHE == "") y <- CCD
if(CCD > IHE) y <- IHE
else y <- CCD
return(y)
}
df1$AAA <- sapply(c(df1$IHC, df1$CCD, df1$IHE), InHouse)
I get the following error:
Error in IHE == "" : 'IHE' is missing
Any help would be great.
There are several issues.
Your conditions involve comparisons like: IHE=="". IHE is NA but never "". So I assume you want is.na(IHE)??
You are mixing the scalar form of and (&& instead of &) with the vectorized form of or (| instead of ||). Why??
The comparison CCD > IHE is meaningless if either is NA (which is always the case).
The AND operators (& and &&) bind more tightly than the OR operators (| and ||), so IHE == "" && CCD == NA | IHC == "N" is equivalent to (IHE == "" && CCD == NA) | IHC == "N". Is that what you want??
Most important, your conditions are not mutually exclusive.
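The NA-related points above can be checked on a tiny vector. This is just a sketch (x and the result names are made up here): comparing with == against NA never gives TRUE, and an NA reaching if() stops with a "missing value where TRUE/FALSE needed" error, which is why is.na() is the test to use.

```r
x <- c("a", NA, "")

eq_na  <- x == NA    # NA NA NA: never TRUE, even for the missing element
is_mis <- is.na(x)   # FALSE TRUE FALSE
is_emp <- x == ""    # FALSE NA TRUE: the missing element stays NA
```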
This is a way to apply the conditions without the use of any of the apply(...) functions.
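df1 <- data.frame(IHC, CCD, IHE, stringsAsFactors = FALSE)
df1$AAA <- df1$CCD
# Blank where both dates are missing, or where IHC is "N"
cond <- with(df1, is.na(IHE) & is.na(CCD) | IHC == "N")
df1$AAA[cond] <- ""
# IHE missing: use CCD
cond <- is.na(df1$IHE)
df1$AAA[cond] <- df1$CCD[cond]
# Both present and CCD > IHE: use IHE (note the !is.na() guards)
cond <- with(df1, !is.na(CCD) & !is.na(IHE) & CCD > IHE)
df1$AAA[cond] <- df1$IHE[cond]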
Let's say I have a 4x2 matrix.
x<- matrix(seq(1:8), 4)
That contains the following elements
1 5
2 6
3 7
4 8
For this specific example, let's say I want to remove the rows that contain a '2' or a '7' (without having to manually look in the matrix and remove them). How would I do this?
Here's something I came up with but it isn't doing what I want it to. I want it to return the row indices in the matrix that contain either a 2 or a 7.
remove<- which(2 || 7 %in% x)
x<- x[-remove,]
Can anyone help me figure this out?
x[-which(x == 2 | x == 7, arr.ind = TRUE)[,1],]
is the simplest, most efficient way I can think of.
The single '|' checks element-wise whether each element is 2 or 7 (which '||' won't do). arr.ind = TRUE gives each position as a (row, column) pair rather than the default single index, and [,1] picks out the row numbers of the matches.
Hope that helps :)
As @Dirk said, which is the right function; here is my answer:
index <- apply(x, 1, function(a) 2 %in% a || 7 %in% a)
> index
[1] FALSE TRUE TRUE FALSE
x[!index, ]  # negate the index to drop the rows that contain a 2 or a 7
x[-which(...), ] is not the right approach... Why? See what happens in case which finds no match:
x <- matrix(8, nrow = 4, ncol = 2)
x[-which(x == 2 | x == 7, arr.ind = TRUE)[,1],]
# [,1] [,2]
(it returns nothing, while it should return the whole x.)
Instead, indexing with logicals is the safer approach: you can negate the vector of matches without risking the odd behavior shown above. Here is an example:
x[!(rowSums(x == 2 | x == 7) > 0), , drop = FALSE]
which can also be written in the shorter form:
x[!rowSums(x == 2 | x == 7), , drop = FALSE]
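If the set of values to exclude may grow, the same logical-indexing idea can be wrapped in a small helper built on %in%. The function name drop_rows_containing is made up for this sketch and is not part of base R:

```r
# Hypothetical helper: drop every row of matrix m that contains any of `values`
drop_rows_containing <- function(m, values) {
  # %in% returns a plain vector, so reshape it to the shape of m
  hit <- matrix(m %in% values, nrow = nrow(m))
  m[rowSums(hit) == 0, , drop = FALSE]
}

x <- matrix(1:8, 4)
kept <- drop_rows_containing(x, c(2, 7))
kept  # rows (1, 5) and (4, 8) remain

# With no match, the whole matrix comes back unchanged
y <- matrix(8, nrow = 4, ncol = 2)
nrow(drop_rows_containing(y, c(2, 7)))  # 4
```

Because the row filter is a logical vector, the no-match case keeps all rows instead of returning an empty matrix.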