R handling NA values while doing a comparison ifelse [duplicate] - r

How can I tell the > and < operators to ignore NA values? The code below returns NA in the first row. I want it to return 0, since both conditions fail on that row.
##sum by values
df <- data.frame(sex=c('M','F','M'),occupation=c('Student','Analyst','Analyst'),age=c(NA,6,9), marks=c(34,65,21))
df
#df$counting <- ifelse(df$age > 5 & df$age < 8, 1, 0)
df$counting <- ifelse(df$age > 5 & df$age < 8, 1, 0)+ifelse(df$marks > 60 & df$marks < 70, 1, 0)
df

Please see the following SO post: How to ignore NA in ifelse statement
With respect to your question:
df$counting <- ifelse(df$age > 5 & df$age < 8 & !is.na(df$age), 1, 0) + ifelse(df$marks > 60 & df$marks < 70, 1, 0)
> df
sex occupation age marks counting
1 M Student NA 34 0
2 F Analyst 6 65 2
3 M Analyst 9 21 0
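The extra !is.na() term works because of R's three-valued logic: a comparison with NA yields NA, but NA & FALSE is FALSE. A minimal sketch of that behaviour, using an illustrative age vector rather than the full data frame:

```r
age <- c(NA, 6, 9)

# Comparing with NA yields NA rather than TRUE or FALSE
in_range <- age > 5 & age < 8            # NA TRUE FALSE

# NA & FALSE is FALSE, so the is.na() guard turns the unknown case into FALSE
in_range_no_na <- in_range & !is.na(age) # FALSE TRUE FALSE

# ifelse() propagates an NA test; the guarded version does not
counting_raw   <- ifelse(in_range, 1, 0)
counting_fixed <- ifelse(in_range_no_na, 1, 0)
```

The marks column has no missing values in this example, which is why only the age condition needs the guard.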

You could also use cut or findInterval. Note that cut uses right-closed intervals by default, so breaks of c(5,8) mean 5 < age <= 8.
df$counting <- colSums(rbind(cut(df$age, c(5,8), labels=F),cut(df$marks, c(60,70), labels=F)), na.rm=T)
df
# sex occupation age marks counting
#1 M Student NA 34 0
#2 F Analyst 6 65 2
#3 M Analyst 9 21 0
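The findInterval route is mentioned but not shown. One possible sketch, assuming the same right-closed intervals that cut uses by default (so 8 and 70 count as hits, unlike the strict < in the question); the helper names age_hit and marks_hit are only illustrative:

```r
df <- data.frame(age = c(NA, 6, 9), marks = c(34, 65, 21))

# With two breaks, findInterval() gives 0 below, 1 between, 2 above, NA for NA.
# left.open = TRUE makes the middle interval (5, 8] and (60, 70].
age_hit   <- findInterval(df$age,   c(5, 8),   left.open = TRUE) %in% 1
marks_hit <- findInterval(df$marks, c(60, 70), left.open = TRUE) %in% 1

# %in% never returns NA, so the logical vectors can be added directly
df$counting <- age_hit + marks_hit
df$counting
```

Because %in% maps NA to FALSE, no na.rm argument or is.na() guard is needed here.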

Related

How to fulfil two conditions in ifelse function in R

I have two columns: one is gender and the other is a measure, as shown below. I want to apply a gender-specific cutoff to the measure column. If the row is male (gender = 1) and measure is less than 23, the result should be 1, otherwise 0; if it is female (gender = 2) and measure is less than 15, the result should be 1, otherwise 0.
I tried the code below, but it is not working. I appreciate your help.
d$measure_status = ifelse(d$gender ==2 & d$measure<15, 1, ifelse( d$gender ==1 & d$measure<23, 1), 0)
gender measure measure_status
2 14
2 17
1 25
1 26
You can use with and make it a single ifelse condition
df = data.frame(gender = c(2,2,1,1), measure = c(14,17,25,26))
# inside with(), columns can be referenced by name without the df$ prefix
df$measure_status <- with(df, ifelse((gender == 2 & measure < 15) | (gender == 1 & measure < 23), 1, 0))
df
Output:
gender measure measure_status
1 2 14 1
2 2 17 0
3 1 25 0
4 1 26 0
You can also use transform
df = data.frame(gender = c(2,2,1,1), measure = c(14,17,25,26))
# transform() also evaluates column names inside df, so no df$ prefix is needed
df <- transform(df, measure_status = ifelse((gender == 2 & measure < 15) | (gender == 1 & measure < 23), 1, 0))
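If more groups are added later, the combined condition grows quickly. One alternative sketch, not taken from the answers above, stores each gender's cutoff in a named vector and compares once; cutoffs and row_cutoff are hypothetical names introduced for illustration:

```r
df <- data.frame(gender = c(2, 2, 1, 1), measure = c(14, 17, 25, 26))

# Cutoff per gender code, named by the code (1 = male, 2 = female)
cutoffs <- c("1" = 23, "2" = 15)

# Look up each row's cutoff by name, then do a single comparison
row_cutoff <- cutoffs[as.character(df$gender)]
df$measure_status <- as.integer(df$measure < row_cutoff)
df
```

This gives the same 1, 0, 0, 0 result as the ifelse versions for the sample data.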

Removing 0s from dataframe without removing NAs

I am trying to create a subset in which I remove all rows where variable B == 0, given that another variable A == 1. However, I want to keep the NAs in variable B (and remove only the 0s).
I tried df2 <- subset(df, B[df$A == 1] > 0), but the result makes no sense. Can someone help?
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)
subset takes a condition and returns only the rows where it evaluates to TRUE. Comparing NA with == or != (for example NA == 0 or NA != 0) always returns NA, which is neither TRUE nor FALSE, so subset drops those rows. There are several ways around this:
subset(df, !(A == 1 & B == 0) | is.na(B))
or:
subset(df, !(A == 1 & B %in% 0))
There are plenty of other options as well.
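To see why the %in% form keeps the NA rows, it may help to compare the two tests directly. This is a small sketch on the question's data; eq_zero, in_zero and kept are illustrative names:

```r
df <- data.frame(i = 1:10,
                 A = c(0, 1, 1, 1, 0, 0, 1, 1, 0, 1),
                 B = c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA))

eq_zero <- df$B == 0     # NA wherever B is NA
in_zero <- df$B %in% 0   # FALSE wherever B is NA, never NA

# Rows 7 and 8 (A == 1, B == 0) are dropped; rows with NA in B are kept
kept <- subset(df, !(A == 1 & B %in% 0))
kept$i
```

Row 1 (A == 0, B == 0) also stays, because the condition only removes zeros where A == 1.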
This should work, if I understand it correctly:
subset(df, (df$A == 1) & ((df$B != 0) | (is.na(df$B))))
outputs:
i A B
2 2 1 10
3 3 1 13
4 4 1 NA
10 10 1 NA
If you do not want to specify every single column, you can just change the 0 to NA and the NA (temporarily) to a number (for example 999/-999) and switch back after you are finished.
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)
df[is.na(df)] <- 999
df[df==0] <- NA
df <- na.omit(df)
df[df==999] <- NA
i A B
2 2 1 10
3 3 1 13
4 4 1 NA
10 10 1 NA
If i is unique, identify which cases you want to remove and select the rest. Use %in% rather than !=, because != recycles the shorter vector when more than one row matches:
df[!df$i %in% subset(df, A==1 & B==0)$i, ]
Output:
i A B
1 1 0 0
2 2 1 10
3 3 1 13
4 4 1 NA
5 5 0 NA
6 6 0 9
9 9 0 3
10 10 1 NA

How to compare the sign of two columns?

I have a dataframe with two columns. I want to compare the signs of each element in the column and see when it differs. It is easier to see with an example.
This is the dataframe:
df = data.frame(COL1 = rnorm(15, 0, 1), COL2 = rnorm(15, 0, 1))
COL1 COL2
1 0.01274137 -0.97966119
2 -0.48455106 1.19248167
3 -0.79149435 -1.45365392
4 -0.18961660 0.02216361
5 -0.34771000 1.39026672
6 0.28199427 0.49143945
7 -0.28650800 -0.71676355
8 -0.29677529 1.13092654
9 -0.24240084 0.99432286
10 2.13540200 0.66348347
11 1.94442199 0.53371032
12 -1.63108069 -0.21556863
13 0.38334186 -0.91472900
14 1.15981803 -0.54540520
15 1.04363634 -1.68835445
I would like code that compares the signs of COL1 and COL2 and tells me where they differ. The outcome should be:
# rows where the sign differs: 1, 2, 4, 5, 8, 9, 13, 14, 15
Can anyone help me with this?
Thanks
You can get the sign of each element with sign; which then returns the indices where the signs are not equal:
which(sign(df$COL1) != sign(df$COL2))
Edit: Warning: the three other approaches silently skip rows that contain NA values.
set.seed(4)
df2 = data.frame(COL1 = rnorm(15, 0, 1), COL2 = rnorm(15, 0, 1))
df2[1, 1] <- NA
COL1 COL2
1 NA 0.1690268
2 -0.54249257 1.1650268
3 0.89114465 -0.0442040
4 0.59598058 -0.1003684
5 1.63561800 -0.2834446
6 0.68927544 1.5408150
7 -1.28124663 0.1651690
8 -0.21314452 1.3076224
9 1.89653987 1.2882569
10 1.77686321 0.5928969
11 0.56660450 -0.2829437
12 0.01571945 1.2558840
13 0.38305734 0.9098392
14 -0.04513712 -0.9280281
15 0.03435191 1.2401808
which(sign(df2$COL1) != sign(df2$COL2))
[1] 2 3 4 5 7 8 11
which(sign(df2[,1] * df2[,2]) == -1)
[1] 2 3 4 5 7 8 11
which(df2$COL1 < 0 & df2$COL2 > 0 | df2$COL1 > 0 & df2$COL2 < 0)
[1] 2 3 4 5 7 8 11
Here is a solution that also reports rows containing NA values. It tests the signs for equality and keeps the indices where the result is not TRUE, using !(...) %in% TRUE instead of != TRUE:
which(!(sign(df2$COL1) == sign(df2$COL2)) %in% TRUE)
[1] 1 2 3 4 5 7 8 11
Compare output of
! NA %in% TRUE
[1] TRUE
NA != TRUE
[1] NA
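The expression relies on operator precedence: ! binds less tightly than %in%, so !x %in% TRUE is read as !(x %in% TRUE). A short sketch with made-up vectors x and y (not the data above) shows the difference from a plain negation:

```r
x <- c(NA, -1, 2)
y <- c(1, 1, 3)

same_sign <- sign(x) == sign(y)          # NA FALSE TRUE

# Parsed as !(same_sign %in% TRUE): NA counts as "not the same"
differs_or_na <- !same_sign %in% TRUE    # TRUE TRUE FALSE

with_na    <- which(differs_or_na)       # 1 2
without_na <- which(!same_sign)          # 2, because which() drops the NA
```

Whether a row with a missing value should count as "differs" is a judgment call; the %in% form simply makes that choice explicit.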
How about multiplying the columns together and checking the sign of the product with sign? A negative product means the signs differ.
which(sign(df[,1] * df[,2]) == -1)
[1] 1 2 4 5 8 9 13 14 15
You can simply apply logic that checks whether one column is below zero while the other is above zero.
library(dplyr)
df %>%
  filter(COL1 < 0 & COL2 > 0 | COL1 > 0 & COL2 < 0)
The index of rows can be obtained using which
which(df$COL1 < 0 & df$COL2 > 0 | df$COL1 > 0 & df$COL2 < 0)

Changing values in one column based on another in R

So I am using R and trying to change values in a data frame in one column by comparing two columns together. I have something like
Median MyPrice
10 0
20 18
20 20
30 35
15 NA
And I would like to say something like
if(MyPrice == 0 & MyPrice < Median){MyPrice <- 1
}else if (MyPrice == Median){MyPrice <- 2
}else if (MyPrice > Median){MyPrice <- 3
}else {MyPrice <- 4}
To come up with
Median MyPrice
10 1
20 1
20 2
30 3
15 4
But there is always an error. I have also tried something like
for(i in MyPrice){if(MyPrice == 0 & MyPrice < Median){MyPrice <- 1
}else if (MyPrice == Median){MyPrice <- 2
}else if (MyPrice > Median){MyPrice <- 3
}else {MyPrice <- 4}
}
The for loop runs but it changes all values in MyPrice to 4. I also tried the ifelse() function but it seemed to have an issue taking that many arguments at once.
I would also not be opposed to a new column being added to the end of the data frame if a solution like that is easier.
You don't necessarily have to use a for loop. Start by setting every comparison to 4.
> x$Comp=4
> x$Comp[x$Median>x$MyPrice]=1 #if Median is higher, comparison = 1
> x$Comp[x$Median==x$MyPrice]=2 #if Median is equal to MyPrice, comparison = 2
> x$Comp[x$Median<x$MyPrice]=3 #if Median is lower, comparison = 3
> x
Median MyPrice Comp
1 10 0 1
2 20 18 1
3 20 20 2
4 30 35 3
5 15 NA 4
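The NA row keeps its default of 4 because a comparison with NA produces an NA in the logical index, and R skips NA positions when a single value is assigned. A self-contained sketch of the same steps, with the data frame x rebuilt from the question's table:

```r
x <- data.frame(Median  = c(10, 20, 20, 30, 15),
                MyPrice = c(0, 18, 20, 35, NA))

x$Comp <- 4                      # default for everything, including NA rows

idx <- x$Median > x$MyPrice      # TRUE TRUE FALSE FALSE NA
x$Comp[idx] <- 1                 # the NA position is left untouched

x$Comp[x$Median == x$MyPrice] <- 2
x$Comp[x$Median <  x$MyPrice] <- 3
x$Comp
```

This skipping only applies when the assigned value has length one; assigning a longer vector through an index containing NA is an error.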
Taken literally, your first condition (MyPrice == 0 & MyPrice < Median) is not met by the second row, where Median is 20 and MyPrice is 18, so that row should also be 4. Here is a working nested ifelse statement, followed by a line that handles the NA.
df <- as.data.frame(matrix(c(10,0,20,18,20,20,30,35,15,NA), byrow = T, ncol = 2))
colnames(df) <- c("Median","MyPrice")
df$NewPrice <- ifelse(df$MyPrice == 0 & df$MyPrice < df$Median, 1,
ifelse(df$MyPrice == df$Median, 2,
ifelse(df$MyPrice > df$Median, 3, 4)))
df$NewPrice[is.na(df$MyPrice)] <- 4
df
# Median MyPrice NewPrice
#1 10 0 1
#2 20 18 4
#3 20 20 2
#4 30 35 3
#5 15 NA 4
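The separate NA patch-up line can be avoided by testing for NA first. This is a variant sketch of the nested ifelse above, under the same literal reading of the conditions:

```r
df <- data.frame(Median  = c(10, 20, 20, 30, 15),
                 MyPrice = c(0, 18, 20, 35, NA))

# Check for NA first, so the later comparisons only decide non-missing rows
df$NewPrice <- ifelse(is.na(df$MyPrice), 4,
               ifelse(df$MyPrice == 0 & df$MyPrice < df$Median, 1,
               ifelse(df$MyPrice == df$Median, 2,
               ifelse(df$MyPrice > df$Median, 3, 4))))
df$NewPrice
```

The inner ifelse calls still return NA for the last row, but the outer test is TRUE there, so that NA is never used.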
What about setting a new variable with all values in 4 and then, replace those cases where your conditions apply?
Simple, straight forward and easy to read :-)
# (Following the example from @Evans Friedland)
library(dplyr) # provides mutate()
df <- as.data.frame(matrix(c(10,0,20,18,20,20,30,35,15,NA), byrow = T, ncol = 2))
colnames(df) <- c("Median","MyPrice")
df <- mutate(df, myNewPrice = 4) # set my new price to 4, then edit by following your conditions
df$myNewPrice<- replace (df$myNewPrice, df$MyPrice == 0 & df$MyPrice < df$Median, 1)
df$myNewPrice<- replace (df$myNewPrice, df$MyPrice == df$Median , 2)
df$myNewPrice<- replace (df$myNewPrice, df$MyPrice > df$Median , 3)
df$myNewPrice <- as.numeric (df$myNewPrice) #might, might not be needed.

R data frame , New Vector with condition

Hello, I am new to R and I need some help.
I have data like this:
ID Age Sex
A01 30 m
A02 35 f
B03 45 m
C99 50 m
...
I would like to create a new column, Group, with conditions like this:
if data1$age <30 then Group is = 1
else if data1$age >=30 and data1$age <40 then Group = 2
else if data1$age >=40 and data1$age <50 then Group = 3
else data1$age >=50 group = 4
ID Age Sex Group
A01 30 m 2
A02 35 f 2
B03 45 m 3
C99 50 m 4
How do I do that in R?
You can try findInterval, which can be used like this (using @Tim's sample data):
> findInterval(data1$Age, c(0, 30, 40, 50))
[1] 2 2 3 4
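findInterval returns, for each value, the number of break points that are less than or equal to it, which is why the intervals are closed on the left. A small sketch with a few extra boundary ages (the ages vector is made up for illustration):

```r
breaks <- c(0, 30, 40, 50)
ages   <- c(29, 30, 39, 40, 50, 75, NA)

# [0,30) -> 1, [30,40) -> 2, [40,50) -> 3, 50 and above -> 4, NA stays NA
groups <- findInterval(ages, breaks)
groups
```

An age of exactly 30 lands in group 2 and exactly 50 in group 4, matching the >= conditions in the question.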
Some good old-fashioned Base R will come in handy for your problem:
data1 <- data.frame(ID=c("A01", "A02", "B03", "C99"),
Age=c(30, 35, 45, 50),
Sex=c("m", "f", "m", "m"))
data1$Group[data1$Age < 30] <- 1
data1$Group[data1$Age >= 30 & data1$Age < 40] <- 2
data1$Group[data1$Age >= 40 & data1$Age < 50] <- 3
data1$Group[data1$Age >= 50] <- 4
> data1
ID Age Sex Group
1 A01 30 m 2
2 A02 35 f 2
3 B03 45 m 3
4 C99 50 m 4
By the way, you miscategorized ID A01 in your example. Since his age is 30, he belongs in group 2 according to your logic.
We can also do this with cut:
cut(data1$Age, c(0,seq(30,50,10),Inf), right=FALSE, labels=FALSE)
#[1] 2 2 3 4
EDIT: Based on @thelatemail's comments.
