Count number of rows meeting multiple conditions in dataframe - r

I have a question. I'm working on a database of patients with multiple conditions that I scored as yes/no or as numbers. I first counted the number of patients (rows) who meet at least one of 5 criteria, see this code (working):
nrow( df_1[df_1$tenderness_CS != 'no' | df_1$intoxication != 'no' |
df_1$focal_neuro_deficits != 'no' | df_1$EMV <= 13 | df_1$distr_injury != 'no',] )
But now I want to count how many patients meet 2, 3 or 4 of the criteria above. It doesn't matter which of the 5 criteria are met, only how many. I really don't know how to do that.
Any help? Thanks!

You can do
n_conditions <- (df_1$tenderness_CS != 'no') +
(df_1$intoxication != 'no') +
(df_1$focal_neuro_deficits != 'no') +
(df_1$EMV <= 13) +
(df_1$distr_injury != 'no')
which will give you a vector of the number of conditions each patient met.
You can then do
table(n_conditions)
to show how many patients met each number of conditions, and
df_1[n_conditions == 3,]
to subset the data frame to only those patients who met exactly 3 conditions, and so on.
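As a minimal sketch of the approach above: the toy data below is invented for illustration (only the column names come from the question), and the last line counts the patients who meet 2, 3 or 4 criteria.

```r
# Hypothetical toy data; column names follow the question, values are invented.
df_1 <- data.frame(
  tenderness_CS        = c("no", "yes", "yes", "no", "yes"),
  intoxication         = c("no", "no", "yes", "no", "yes"),
  focal_neuro_deficits = c("no", "no", "no", "no", "yes"),
  EMV                  = c(15, 15, 12, 13, 10),
  distr_injury         = c("no", "no", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Each comparison yields TRUE/FALSE; adding them coerces to 1/0.
n_conditions <- (df_1$tenderness_CS != "no") +
  (df_1$intoxication != "no") +
  (df_1$focal_neuro_deficits != "no") +
  (df_1$EMV <= 13) +
  (df_1$distr_injury != "no")

n_conditions                 # 0 1 3 1 5

# Number of patients meeting 2, 3 or 4 of the criteria
sum(n_conditions %in% 2:4)   # 1
```

sum() on a logical vector counts the TRUE values, so there is no need to subset the data frame and call nrow() just to get a count.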

Instead of +, we can use rowSums. The advantage is that it can handle NA elements through its na.rm argument: with +, an NA in any of the columns makes the count for that row NA.
nm1 <- c("tenderness_CS", "intoxication",
"focal_neuro_deficits", "distr_injury")
n_conditions <- rowSums(cbind(df_1[nm1] != "no", df_1$EMV <= 13), na.rm = TRUE)
Now, we get the frequency of counts with table
table(n_conditions)
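To illustrate the difference, here is a small sketch with two invented patients, the second of whom has a missing intoxication score. Note that na.rm = TRUE effectively treats a missing score as "criterion not met", which is an assumption you may or may not want.

```r
# Hypothetical two-patient example; the second patient has a missing score.
df_1 <- data.frame(
  tenderness_CS        = c("yes", "yes"),
  intoxication         = c("no", NA),
  focal_neuro_deficits = c("no", "no"),
  EMV                  = c(12, 15),
  distr_injury         = c("no", "no"),
  stringsAsFactors = FALSE
)
nm1 <- c("tenderness_CS", "intoxication",
         "focal_neuro_deficits", "distr_injury")

# Summing with + propagates the NA to the whole row
plus_count <- (df_1$tenderness_CS != "no") + (df_1$intoxication != "no") +
  (df_1$focal_neuro_deficits != "no") + (df_1$EMV <= 13) +
  (df_1$distr_injury != "no")
plus_count   # 2 NA

# rowSums with na.rm = TRUE counts only the non-missing conditions
rs_count <- rowSums(cbind(df_1[nm1] != "no", df_1$EMV <= 13), na.rm = TRUE)
rs_count     # 2 1
```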

The logicals TRUE and FALSE can be treated like numerics 1 and 0.
So for example TRUE+TRUE is equal to 2.
So you can write:
nrow( df_1[((df_1$tenderness_CS != 'no') + (df_1$intoxication != 'no') +
(df_1$focal_neuro_deficits != 'no') + (df_1$EMV <= 13) + (df_1$distr_injury != 'no')) %in% c(2,3,4),])
because this will first sum the results of each condition (1 when the condition is TRUE and 0 when it is FALSE) and then test whether the sum is in the vector c(2,3,4). The parentheses are required: + and %in% bind more tightly than != and <=, so without them R would try to evaluate 'no' + df_1$intoxication and fail.

Related

Replace logical values conditionally in R

I am sure this question has been asked before and has an easy solution, but I can't seem to find it.
I am trying to conditionally replace the logical value of a variable based on the value of other variables in the data. Specifically, I am trying to determine eligibility based on survey responses.
I have created my eligibility variable in dataframe screen:
screen$eligible <- ifelse (
(screen$age > 17 & screen$age < 23)
& (screen$alcohol > 3 | screen$marijuana > 3)
& (screen$country == 0 | screen$ageus < 12)
& (screen$county_1 == 17 | screen$county_1 == 27 | screen$county_1 == 31)
& (screen$residence_1 == 47),
TRUE,
FALSE)
And now, based on study changes, I would like to further limit eligibility. I tried the code below, and it works in part, but it appears that I am introducing NAs to my eligibility variable and missing out on folks who should be eligible.
screen$eligible <- ifelse( screen$eligible ==TRUE, ifelse(
(screen$gender_1 == 1 & screen$age > 18)
|(screen$gender_8 == 1 & screen$age > 20),
FALSE, TRUE), FALSE)
I ultimately want TRUE or FALSE values.
Two questions
Is there a clearer or more concise way to update the code to update my eligibility requirements?
Any ideas as to why I might be introducing NAs?
Continuing from what @zephryl wrote, an even more readable version is:
screen$eligible <- with(screen,
(age > 17 & age < 23)
& (alcohol > 3 | marijuana > 3)
& (country == 0 | ageus < 12)
& county_1 %in% c(17, 27, 31)
& (residence_1 == 47))
To find out which columns contain NAs:
sapply(screen, anyNA)
1. Is there a clearer or more concise way to update the code to update my eligibility requirements?
If you ever find yourself writing x = ifelse(condition, TRUE, FALSE), as you are here -- that's equivalent to just writing x = condition. Also, your three county_1 == x statements can be replaced with one county_1 %in% c(x, y, z). So your first code block could be written as,
# Operators go at the end of each line so R knows the expression continues
screen$eligible <- (screen$age > 17 & screen$age < 23) &
  (screen$alcohol > 3 | screen$marijuana > 3) &
  (screen$country == 0 | screen$ageus < 12) &
  screen$county_1 %in% c(17, 27, 31) &
  (screen$residence_1 == 47)
Likewise, your second codeblock could be simplified as:
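# Note: the nested ifelse() in the question returns FALSE where the gender/age
# condition holds; wrap that condition in !( ) if an exclusion is what you intend.
screen$eligible <- screen$eligible &
  ((screen$gender_1 == 1 & screen$age > 18) |
   (screen$gender_8 == 1 & screen$age > 20))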
2. Any ideas as to why I might be introducing NAs?
It's hard to say without seeing your data, but the NAs probably indicate that one or more of your constituent variables (gender_1, gender_8, age) is NA for some cases.
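A short sketch of how NAs travel through logical operators may help here. The screen data below is invented; only the column names come from the question. ifelse() also returns NA wherever its test is NA, which is the likely route by which NAs reach the eligibility variable.

```r
# Three-valued logic in R: NA means "unknown"
NA & FALSE   # FALSE: the result is FALSE whatever the unknown value is
NA & TRUE    # NA: the result depends on the unknown value
NA | TRUE    # TRUE

# Hypothetical screening data with one missing gender response
screen <- data.frame(age = c(19, 21, 22), gender_1 = c(1, NA, 0))

flag <- screen$gender_1 == 1 & screen$age > 18
flag         # TRUE NA FALSE

# Treat "unknown" as FALSE explicitly, if that is the intended rule
flag_no_na <- !is.na(flag) & flag
flag_no_na   # TRUE FALSE FALSE
```

Whether an unknown answer should count as eligible, ineligible, or stay NA is a study decision; the last step only shows one way to make that choice explicit.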

Is there a way to combine if statment with filter function?

I need to filter out some rows if a condition is true.
Tried this:
if(DiffTask$TaskCue == "square1"){DiffTasks<-subset(data,-subject==10)}
Warning message:
In if (DiffTask$TaskCue == "square1") { : the condition has length >
1 and only the first element will be used
You can connect any number of logical conditions, e.g.
DiffTask <- data.frame(TaskCue = rep(c("square1", "square2"), 5),
subject = rep(10:6, 2))
subset(DiffTask, !(TaskCue == "square1" & subject == 10))
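The warning in the question arises because if() expects a single TRUE or FALSE, while the comparison produces one value per row. A sketch of the same filter with plain indexing, reusing the example data above (the names drop and kept are introduced here for illustration):

```r
DiffTask <- data.frame(TaskCue = rep(c("square1", "square2"), 5),
                       subject = rep(10:6, 2))

# Element-wise test: one TRUE/FALSE per row
drop <- DiffTask$TaskCue == "square1" & DiffTask$subject == 10

# Keep every row that is not flagged
kept <- DiffTask[!drop, ]

# if() needs a single value, so collapse the vector with any() or all()
if (any(drop)) message(sum(drop), " row(s) removed")
```

Here only the first row is both "square1" and subject 10, so kept has 9 of the 10 rows.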

Creating variables in R- two issues

I have two basic questions about new variable creation in R. I will show some code and hopefully someone can help answer these!
df0$new <- ifelse(df0$old=="yes",1,0)
In this code I am creating a new variable called "new" that is equal to 1 if the variable "old" is equal to yes or is otherwise equal to 0. But in the variable "old" I have missing data (represented as -99, -98, NAN). So how can I account for there being missing values?
The second question is about using an "OR" statement.
df0$z <- ifelse(df0$x1=="yes",1,0 | )
I want to create a new variable z that is equal to 1 if the participant responds "yes" to any of 5 questions (q1-q5). So I want to code it so it looks like: z = 1 if q1 ==1 OR q2 == 1 OR q3 == 1 OR q4 == 1 OR q5 == 1. If none of q1-q5 equal 1 than I want to set z equal to 0. However this also brings up the issue with the missing values as described up above. Thanks so much!
You could do something like the following.
First, get rid of the -99, -98 and NaN. I am assuming that when, in the question, you write NAN you are meaning NaN.
Encode NA values as NA.
is.na(df0$old) <- (df0$old %in% c(-99, -98)) | is.nan(df0$old)
Now, note that FALSE/TRUE are encoded as 0/1 and coerce the logical results to class integer.
df0$new <- as.integer(df0$old == "yes")
df0$z <- with(df0, as.integer(q1 == "yes" | q2 == "yes" | q3 == "yes" | q4 == "yes" | q5 == "yes"))
Another solution for the first part.
library(dplyr) #because of left_join function
df1 <- data.frame(old = c("yes", "no"), new = c(1, 0))
df0 <- left_join(df0, df1, by = "old")
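For the second part, the five questions can also be handled in one step with rowSums. This is only a sketch: the data is invented, the answers are assumed to be stored as text ("yes", "no", "-99", "-98"), and the helper names qs, answers and is_yes are made up for the example.

```r
# Hypothetical responses; missing-value codes are stored as text here
df0 <- data.frame(
  q1 = c("yes", "no", "-99"),
  q2 = c("no", "no", "no"),
  q3 = c("no", "-98", "no"),
  q4 = c("no", "no", "no"),
  q5 = c("no", "no", "no"),
  stringsAsFactors = FALSE
)
qs <- paste0("q", 1:5)

# Recode the missing-value codes to NA in all five columns at once
answers <- as.matrix(df0[qs])
answers[answers %in% c("-99", "-98")] <- NA

# z = 1 if any of q1-q5 is "yes", otherwise 0
# (na.rm = TRUE means a missing answer is simply ignored)
is_yes <- answers == "yes"
df0$z <- as.integer(rowSums(is_yes, na.rm = TRUE) > 0)
df0$z   # 1 0 0
```

If a row with no "yes" but some missing answers should be NA rather than 0, drop na.rm = TRUE and handle the NA rows separately.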

Logical Operators not subsetting as expected

I am trying to create a subset of the rows that have a value of 1 for variable A, and a value of 1 for at least one of the following variables: B, C, or D.
Subset1 <- subset(Data,
Data$A==1 &
Data$B ==1 ||
Data$C ==1 |
Data$D == 1,
select= A)
Subset1
The problem is that the code above returns some rows that have A=0 and I am not sure why.
To troubleshoot:
I know that && and || are the long forms of "and" and "or", which vectorizes it.
I have run this code several times using &&, ||,& and | in different places. Nothing returns what I am looking for exactly.
When I shorten the code, it works fine and I subset only the rows that I would expect:
Subset1 <- subset(Data,
Data$A==1 &
Data$B==0,
select= A)
Subset1
Unfortunately, this doesn't suffice since I also need to capture rows whose C or D value = 1.
Can anyone explain why my first code block is not subsetting what I am expecting it to?
You can use parens to be more specific about what your & is referring to. Otherwise (as @Patrick Trentin clarified) your logical operators are combined according to operator precedence (within the same level of precedence they are evaluated from left to right).
Example:
> FALSE & TRUE | TRUE #equivalent to (FALSE & TRUE) | TRUE
[1] TRUE
> FALSE & (TRUE | TRUE)
[1] FALSE
So in your case you can try something like below (assuming you want items that A == 1 & that meet one of the other conditions):
Data$A==1 & (Data$B==1 | Data$C==1 | Data$D==1)
Since you didn't provide the data you're working with, I've replicated some here.
set.seed(20)
Data = data.frame(A = sample(0:1, 10, replace=TRUE),
B = sample(0:1, 10, replace=TRUE),
C = sample(0:1, 10, replace=TRUE),
D = sample(0:1, 10, replace=TRUE))
If you use parentheses to group the OR conditions into a single logical test, you can achieve what you're looking for.
Subset1 <- subset(Data,
Data$A==1 &
(Data$B == 1 |
Data$C == 1 |
Data$D ==1),
select=A)
Subset1
A
1 1
2 1
4 1
5 1
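To see concretely how a row with A = 0 can slip through, here is a small sketch with four invented rows (the vectors and the names loose and strict are made up for illustration). It uses only the vectorized & and |; the double forms && and || are meant for single TRUE/FALSE values, as in if(), and recent R versions signal an error when they are given longer vectors.

```r
A <- c(1, 0, 0, 1)
B <- c(0, 0, 1, 1)
C <- c(0, 1, 0, 0)
D <- c(0, 0, 0, 0)

# Without parentheses, & is evaluated before |, so this reads
# (A == 1 & B == 1) | C == 1 | D == 1
loose <- A == 1 & B == 1 | C == 1 | D == 1
loose    # FALSE TRUE FALSE TRUE  (row 2 has A == 0 but C == 1)

# With parentheses, A == 1 is required on every selected row
strict <- A == 1 & (B == 1 | C == 1 | D == 1)
strict   # FALSE FALSE FALSE TRUE
```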

Calculate count of number of switch in vector

I have a vector in which I have to count how many times the data switched from 0 to 100 and back to 0. An example is given below.
Input
X1<-c(100,100,100,0,0,0,0,0,100,100,100,100,100,0,0,0,0,100,100,100,0,0,100,100)
So the output should be 3, as the value started at 0, stayed at 100 for some time and went back to 0. My requirement is to count how many times this switch has occurred. I am aware of rle, but that only gives me the run lengths.
Thanks in advance for the help.
This looks sufficient
sum(X1[-1] != X1[-length(X1)]) / 2
Assumptions are that
You only have two unique values in X1
The last element of X1 equals the first element, that is, it switches back to original state in the end.
You can do something like,
sum(diff(X1) == 100)
#[1] 3
#Or
min(sum(diff(X1) == 100), sum(diff(X1) == -100))
#[1] 3
You could run rle and then iterate through three elements of values at a time to see if the required condition has been met.
with(rle(X1),
sum(sapply(3:length(lengths), function(i)
values[i-2] == 0 & values[i-1] == 100 & values[i] == 0)))
#[1] 2
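The answers above give different numbers because they count different things. As a sketch on the same X1: counting 0 to 100 steps gives 3, while counting complete 0, 100, 0 cycles (a run of 100 with a run of 0 on both sides) gives 2, since X1 starts and ends at 100. The cycle count below is a vectorized variant of the rle idea; the names v, n, rising, falling and cycles are introduced here for illustration.

```r
X1 <- c(100,100,100,0,0,0,0,0,100,100,100,100,100,0,0,0,0,100,100,100,0,0,100,100)

rising  <- sum(diff(X1) == 100)    # steps from 0 up to 100
falling <- sum(diff(X1) == -100)   # steps from 100 down to 0

v <- rle(X1)$values                # 100 0 100 0 100 0 100
n <- length(v)

# Complete 0 -> 100 -> 0 cycles: compare each run with its two predecessors
cycles <- if (n >= 3) {
  sum(v[-c(n - 1, n)] == 0 & v[-c(1, n)] == 100 & v[-c(1, 2)] == 0)
} else {
  0
}

rising    # 3
falling   # 3
cycles    # 2
```

Which of these is "the" switch count depends on whether a leading or trailing run of 100 should count as a switch.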
more generally for counting switches in n cases (numeric or character):
count_switches_groups <- function(seq.input){
  COUNT <- 0
  transition = rep("no switch",length(seq.input))
  for (i in 2:length(seq.input)) {
    if (seq.input[i] != seq.input[i - 1]) {
      COUNT <- COUNT + 1
      transition[i] <- paste0("from ",seq.input[i - 1]," to ",seq.input[i])
    }
  }
  total_switches <- COUNT
  state_transitions <- transition[transition != "no switch"]
  occurances <- as.data.frame(table(state_transitions))
  return_list <- list(total_switches,occurances)
  names(return_list) <- c("total_transitions","unique_switches")
  return(return_list)
}
count_switches_groups(X1)
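In Python with NumPy, this counts every change in either direction (6 for the example above, so halve it to get 3):
sum((np.diff(x)==100)|(np.diff(x)==-100))
I think this would be the answer; it worked for me.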

Resources