R: Troubles with ifelse - r

I have two measures for the same object. The measure is binary (1,0) but many observations are also missing, such that the possible options are: 1, 0, NA.
Data Have:
Source1 Source2
NA NA
NA 0
NA 1
0 NA
0 0
0 1
1 NA
1 0
1 1
(Sources can contradict each other, ignore that for now).
I would like to create a third composite variable that summarizes the two variables, such that IF EITHER of the two sources = 1, then the composite variable should be equal to 1. Otherwise, if either of the sources is not missing, then the composite variable should be equal to zero. Lastly, only if both sources are missing, the composite variable should be set to missing.
Data Want:
Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1
I have tried different approaches but continue to have the same issue.
Attempt 1:
df<- df %>% mutate(combined = ifelse(df$source1==1 | df$source2==1, 1,
ifelse(df$source1==0 | df$source2==0, 0, NA)))
Attempt 2:
df2<- df %>% mutate(combined = ifelse(is.na(df$source1) & is.na(df$source2), NA,
ifelse(df$source1 == 1 | df$source2 ==1, 1, 0)))
Attempt 3:
df3<- df %>% mutate(combined = ifelse(df$source1==1, 1,
ifelse(df$source1==0 & df$source2==1, 1,
ifelse(df$source1==0 & df$source2==0, 0,
ifelse(df$source1==0 & is.na(df$source2), 0,
ifelse(is.na(df$source1) & df$source2==1, 1,
ifelse(is.na(df$source1) & df$source2==0, 0, NA)))))))
All three attempts identify whether there is a 1 in either source, but every other value comes out missing, regardless of whether a 0 is present.
Actual Output:
Source1 Source2 Composite
NA NA NA
NA 0 NA
NA 1 1
0 NA NA
0 0 NA
0 1 1
1 NA 1
1 0 1
1 1 1

Assuming both Source1 and Source2 contain only 0s, 1s, and NAs (as you noted), you could use this base R solution, which uses do.call() to call pmax() over each of the relevant columns in your data frame.
cols = paste0("Source", 1:2)
df$newcol = do.call(pmax, c(df[cols], na.rm = TRUE))
# equivalent to: pmax(df$Source1, df$Source2, na.rm = TRUE)
df
Source1 Source2 Composite newcol
1 NA NA NA NA
2 NA 0 0 0
3 NA 1 1 1
4 0 NA 0 0
5 0 0 0 0
6 0 1 1 1
7 1 NA 1 1
8 1 0 1 1
9 1 1 1 1
Data:
df = read.table(header = TRUE, text = "Source1 Source2 Composite
NA NA NA
NA 0 0
NA 1 1
0 NA 0
0 0 0
0 1 1
1 NA 1
1 0 1
1 1 1")
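To see why the nested ifelse attempts can return NA while pmax does not, here is a minimal sketch on the nine combinations from the question. The vector names s1, s2, attempt and composite are made up for this illustration.

```r
# The nine Source1/Source2 combinations from the question
s1 <- c(NA, NA, NA, 0, 0, 0, 1, 1, 1)
s2 <- c(NA, 0, 1, NA, 0, 1, NA, 0, 1)

# Attempt 1 from the question: NA == 1 is NA, and NA | FALSE is still NA,
# so ifelse() returns NA for those rows and the inner result is never used
attempt <- ifelse(s1 == 1 | s2 == 1, 1,
                  ifelse(s1 == 0 | s2 == 0, 0, NA))

# pmax() with na.rm = TRUE ignores an NA when the other value is present
# and returns NA only when both values are missing
composite <- pmax(s1, s2, na.rm = TRUE)

print(data.frame(s1, s2, attempt, composite))
```

Because the sources are coded 0/1, the parallel maximum is the same thing as "1 if either source is 1, otherwise 0", which is why no explicit condition is needed.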

One approach is to use case_when rather than nested ifelse. It is simplest to check for the missing cases first, and then handle the non-missing cases afterwards:
library(tidyverse)
df %>%
  mutate(S1Miss = is.na(Source1),
         S2Miss = is.na(Source2)) %>%
  mutate(Composite = case_when(
    S1Miss & S2Miss ~ NA_real_,        # both sources missing
    Source1 == 1 | Source2 == 1 ~ 1,   # either source equals 1
    TRUE ~ 0                           # at least one observed value, none equal to 1
  )) %>%
  select(Source1, Source2, Composite)
Note that I made it easier to read by first storing the missingness indicators in one call to mutate, then dropping these intermediate columns with select.

This was fun, but I wouldn't recommend doing it like this.
source1<-c(NA, NA, NA, 0, 0, 0, 1, 1, 1)
source2<-c(NA, 0, 1, NA, 0, 1, NA, 0, 1)
df<-data.frame(source1, source2)
df$composite<-ifelse(test = is.na(df$source1) & is.na(df$source2), yes = NA,
no = ifelse(test = is.na(df$source1) & !is.na(df$source2), yes = df$source2,
no = ifelse(is.na(df$source2) & !is.na(df$source1), yes = df$source1,
no = ifelse(df$source1 > df$source2, yes = df$source1,
no = df$source2))))
source1 source2 composite
1 NA NA NA
2 NA 0 0
3 NA 1 1
4 0 NA 0
5 0 0 0
6 0 1 1
7 1 NA 1
8 1 0 1
9 1 1 1

Related

Ifelse across multiple columns matching on similar attributes

I need to create a binary variable called dum, (perhaps using an ifelse statement) matching on the number of the column names.
ifelse f[number] %in% c(4:6) & l[number]==1, 1, else 0
f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)
mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)
dum is 1 if f1 contains values between 4, 5, 6 AND l1 contains a value of 1, OR f2 contains values between 4, 5, 6 AND l2 contains a value of 1, and so on.
In essence, the expected output should be
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
I can only think of doing it in a very long way such as
mutate(dum=ifelse(f1 %in% c(4:6) & l1==1, 1,
ifelse(f2 %in% c(4:6) & l2==1, 1,
ifelse(f3 %in% c(4:6) & l3==1, 1,
ifelse(f4 %in% c(4:6) & l4==1, 1, 0)))))
But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.
Is there a more efficient way to do this?
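Here is one suggestion, although the question is not entirely clear. It assumes you want a single column dum that indicates whether the number in a column's name appears as a value in that column, for any column in the row. Note that under this reading row 2 gets dum = 1, unlike the expected output in the question: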
library(dplyr)
library(readr)
mydata %>%
mutate(across(f1:l4, ~case_when(. == parse_number(cur_column()) ~ 1,
TRUE ~ 0), .names = 'new_{col}')) %>%
mutate(sumNew = rowSums(.[9:16])) %>%
mutate(dum = ifelse(sumNew >=1, 1, 0)) %>%
select(1:8, dum)
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 1
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
Here is one option with across: loop across the 'f' columns with the first condition, loop across the 'l' columns with the second condition, and join the two with & to get a logical matrix. Then take the row-wise sum (TRUE -> 1, FALSE -> 0), check whether that sum is greater than 0 (i.e. whether there is any TRUE in that row), and coerce the logical to binary with + or as.integer.
library(dplyr)
mydata %>%
mutate(dum = +(rowSums(across(starts_with('f'), ~.x %in% 4:6) &
across(starts_with('l'), ~ .x %in% 1)) > 0))
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
We could also use base R
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 &
y %in% 1, mydata[startsWith(names(mydata), "f")],
mydata[startsWith(names(mydata), "l")])))
Here's an approach multiplying two mapplys together, columns identified with grep, then calculating rowSums > 0. If you set na.rm=F you could get NAs in respective rows.
as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6))*
mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm=T) > 0)
# [1] 1 0 1 0 0
If f* and l* each aren't consecutive, rather use sort(grep(., value=T)).
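If the real data goes up to f20 and l20, the pairing can also be driven by the column names instead of column order. A minimal base R sketch, assuming every fN column has a matching lN column; f_cols, l_cols and hits are names invented here:

```r
mydata <- data.frame(
  f1 = c(3, 2, 1, 6, 5),   f2 = c(4, 1, 5, NA, NA),
  f3 = c(5, 3, 4, NA, NA), f4 = c(1, 2, 4, NA, NA),
  l1 = c(1, 0, 1, 0, 0),   l2 = c(1, 1, 1, NA, NA),
  l3 = c(1, 0, 0, NA, NA), l4 = c(0, 0, 0, NA, NA)
)

# Pair each f column with the l column that carries the same number
f_cols <- grep("^f[0-9]+$", names(mydata), value = TRUE)
l_cols <- sub("^f", "l", f_cols)
stopifnot(all(l_cols %in% names(mydata)))

# One logical column per f/l pair; %in% treats NA as "no match" (FALSE).
# matrix() keeps the result two-dimensional even for a one-row data frame.
hits <- matrix(
  vapply(seq_along(f_cols), function(i) {
    mydata[[f_cols[i]]] %in% 4:6 & mydata[[l_cols[i]]] %in% 1
  }, logical(nrow(mydata))),
  nrow = nrow(mydata)
)

# dum is 1 when at least one pair satisfies both conditions in that row
mydata$dum <- as.integer(rowSums(hits) > 0)
print(mydata)
```

Matching by name means the result does not depend on the f and l columns appearing in the same order in the data frame.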

Convert part of the values in rows in a data frame

I have a data frame which looks like this:
df
colA colB
0 0
1 1
0 1
0 1
0 1
1 0
0 0
1 1
0 1
I would like to convert a certain proportion of the 0 in colA to NA and a certain proportion of 1 in colB to NA
if I do this:
df["colA"][df["colA"] == 0] <- NA
all the 0 in columns A will be converted to NA, however I just want half of them to be converted
Similarly, for colB I want only 1/3 of the 1 to be converted:
df["colB"][df["colB"] == 1] <- NA
Expected output:
colA colB
0 0
1 1
NA 1
0 1
NA 1
1 0
0 0
1 NA
NA NA
One way
tmp=which(df["colA"]==0)
df$colA[sample(tmp,round(length(tmp)/2))]=NA
similar for colB
tmp=which(df["colB"]==1)
df$colB[sample(tmp,round(length(tmp)/3))]=NA
You can use prodNA from the missForest package
set.seed(1)
library(missForest)
df[df$colA == 0, "colA"] <- prodNA(df[df$colA == 0, "colA", drop=F], noNA = 0.5)
df[df$colB == 1, "colB"] <- prodNA(df[df$colB == 1, "colB", drop=F], noNA = 1/3)
df
colA colB
1 NA 0
2 1 NA
3 0 NA
4 NA 1
5 NA 1
6 1 0
7 0 0
8 1 1
9 0 1
I'll contribute a tidyverse approach here.
library(tidyverse)
df %>% mutate(id_colA = ifelse(colA == 1, NA, 1:n()),
colA = ifelse(id_colA %in% sample(na.omit(id_colA), sum(!is.na(id_colA))/2), NA, colA),
id_colB = ifelse(colB == 0, NA, 1:n()),
colB = ifelse(id_colB %in% sample(na.omit(id_colB), sum(!is.na(id_colB))/3), NA, colB)) %>%
select(-starts_with("id_"))
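The which/sample idea above can be wrapped in a small helper so the same logic serves both columns. This is only a sketch: set_share_na is a hypothetical name, and rounding the count with round() is an assumption about how a fractional share should be handled.

```r
# Replace roughly `share` of the entries equal to `value` with NA
set_share_na <- function(x, value, share) {
  idx <- which(x == value)               # positions holding the target value
  n <- round(length(idx) * share)        # how many of them to blank out
  if (n > 0) {
    # Sample positions within idx; sample(idx, n) would misbehave when idx
    # has length one, because sample() then draws from 1:idx
    x[idx[sample.int(length(idx), n)]] <- NA
  }
  x
}

set.seed(42)  # only to make the illustration reproducible
df <- data.frame(colA = c(0, 1, 0, 0, 0, 1, 0, 1, 0),
                 colB = c(0, 1, 1, 1, 1, 0, 0, 1, 1))
df$colA <- set_share_na(df$colA, 0, 1/2)
df$colB <- set_share_na(df$colB, 1, 1/3)
print(df)
```

Which rows become NA depends on the seed; only the counts are fixed (three of the six zeros in colA, two of the six ones in colB).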

Changing the conditions to replace elements in a vector

Consider the vector:
use = c(1,1,2,2,5,1,2,1,2,5,1)
I'm trying to replace all the numbers different from 5 to NA before the first number 5 shows up in the sequence:
ifelse(use != 5,NA,1).
After that the condition should be
ifelse(use != 5,0,1).
The output would be:
after = c(NA,NA,NA,NA,1,0,0,0,0,1,0)
Any tips?
You should try:
`is.na<-`(match(use, 5, 0), seq(match(5, use) - 1))
[1] NA NA NA NA 1 0 0 0 0 1 0
Here is a base R solution
after <- replace(v<- ifelse(use !=5,NA,1),
which(head(which(v==1),1)<seq_along(v) & is.na(v)),
0)
such that
> after
[1] NA NA NA NA 1 0 0 0 0 1 0
Weird subsetting:
c(NA[!cumsum(use == 5)], +(use[!!cumsum(use == 5)] == 5))
#[1] NA NA NA NA 1 0 0 0 0 1 0
We can use match
replace(use, seq_len(match(5, use) - 1), NA)
#[1] NA NA NA NA 5 1 2 1 2 5 1
Or as @M-- commented, this can be changed to binary with
+(replace(use, seq_len(match(5, use) - 1), NA)==5)
This will work if there's only one 5 in your vector
use = c(1,1,2,2,5,1,2,2,2)
use <- findInterval(use,5)*5
i <- which(use > 0)
if(i > 1) use[1:(i-1)] <- NA
Here is another variation. I threw in some error handling in case there are no 5's in the vector.
test1 <- c(1,1,1,1,2,3,3)
test2 <- c(5,1,1,2,5,1,2,7,8)
test3 <- c(1,1,3,5,6,7,8,2)
test4 <- c(1,2,3,4,5,5,1,5,5,5,1,1,7,8,1)
find_and_replace <- function(vec, target){
tryCatch(
ifelse( seq_along(vec) %in% 1:{(which(vec == target)[[1]])-1}, NA, ifelse(vec == target, 1, 0)),
error = function(x) {
warning(paste("Warning: No", target))
vec
}
)
}
find_and_replace(test1, 5)
#> Warning: No 5
#> [1] 1 1 1 1 2 3 3
find_and_replace(test2, 5)
#> [1] NA 0 0 0 1 0 0 0 0
find_and_replace(test3, 5)
#> [1] NA NA NA 1 0 0 0 0
find_and_replace(test4, 5)
#> [1] NA NA NA NA 1 1 0 1 1 1 0 0 0 0 0
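The following code solves the problem when the vector contains a single 5 (the output below is for use = c(1,1,2,2,5,1,2,1,2)); any later 5 would be set to 0: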
use[1:(which(use == 5)[1]-1)] = NA
use[(which(use == 5)[1]+1):length(use)] = 0
use[which(use == 5)[1]] = 1
use
[1] NA NA NA NA 1 0 0 0 0
You can use which to find the location of the target, and then case_when
use <- c(1,1,2,2,5,1,2,1,2)
first_five <- min(which(use == 5))
dplyr::case_when(
seq_along(use) < first_five ~ NA_real_,
seq_along(use) == first_five ~ 1,
TRUE ~ 0
)
#> [1] NA NA NA NA 1 0 0 0 0
use
#> [1] 1 1 2 2 5 1 2 1 2
Created on 2020-01-14 by the reprex package (v0.3.0)
You could detect the first 5,
first_pos <- which(use==5)
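and, if such elements exist, set all entries before the first occurrence to NA, every 5 to 1, and everything else after the first 5 to 0:
if(length(first_pos)>0) {
  idx <- seq_along(use)
  use[idx > first_pos[1]] <- 0    # entries after the first 5
  use[first_pos] <- 1             # every 5, including later ones
  use[idx < first_pos[1]] <- NA   # entries before the first 5
}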
Note that first_pos[1] is used in case there is more than one 5.
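Several of the answers above rely on locating the first 5 explicitly. A compact alternative sketch uses cumsum to mark everything from the first 5 onward. The function name flag_after_first is hypothetical, and the sketch assumes the input contains no NA values.

```r
flag_after_first <- function(x, target = 5) {
  # FALSE until the first target appears, TRUE from that position onward
  seen <- cumsum(x == target) > 0
  # From the first target on: 1 for the target, 0 otherwise; before it: NA
  ifelse(seen, as.numeric(x == target), NA_real_)
}

use <- c(1, 1, 2, 2, 5, 1, 2, 1, 2, 5, 1)
after <- flag_after_first(use)
print(after)

# A vector without any 5 comes back as all NA rather than raising an error
print(flag_after_first(c(1, 1, 2, 3)))
```

Since no index arithmetic is involved, the edge cases where the first 5 sits at position 1 or at the very end need no special handling.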

recode using ifelse clause within groups

I'm trying to set up a column (called 'combined') to indicate the combined information of Owner and Head within each group (Group). There is only 1 owner in each group, and 'Head' is basically the first row of each group that has the minimum id value.
This combined column should flag '1' if the ID is flagged as owner, then the rest of the id within each group will be 0 regardless of the information in 'Head'. However for groups that do not have any Owner in the IDs (i.e. all 0 in owner within the group), then this column will take the Head column information. My data looks like this and the last column (combined) is the desired outcome.
sample <- data.frame(Group = c("46005589", "46005589","46005590","46005591", "46005591","46005592","46005592","46005592", "46005593", "46005594"), ID= c("189199", "2957073", "272448", "1872092", "10374996", "1153514", "2771118","10281300", "2610301", "3564526"), Owner = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1), Head = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1), combined = c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1))
> sample
Group ID Owner Head combined
1 46005589 189199 0 1 0
2 46005589 2957073 1 0 1
3 46005590 272448 1 0 1
4 46005591 1872092 0 1 1
5 46005591 10374996 0 0 0
6 46005592 1153514 0 1 0
7 46005592 2771118 1 0 1
8 46005592 10281300 0 0 0
9 46005593 2610301 1 1 1
10 46005594 3564526 1 1 1
I've tried a few dplyr and ifelse clauses and it didn't seem to give outputs to what I wanted. How should I recode this column? Thanks.
I don't think this is the best way but you might look at visually inspecting IDs with all 0s. You could do this with rowSums and specify these IDs using %in%. Here is a possible solution:
library(dplyr)
sample %>%
mutate_at(vars(ID,Group), as.factor) %>%
mutate(Combined=if_else(Owner==1,1,0),
NewCombi=ifelse(ID== "1872092",Head,Combined))
This yields: NewCombi is our target.
# Group ID Owner Head Combined NewCombi
#1 46005589 189199 0 1 0 0
#2 46005589 2957073 1 0 1 1
#3 46005590 272448 1 0 1 1
#4 46005591 1872092 0 1 0 1
#5 46005591 10374996 0 0 0 0
#6 46005592 1153514 0 1 0 0
#7 46005592 2771118 1 0 1 1
#8 46005592 10281300 0 0 0 0
#9 46005593 2610301 1 1 1 1
#10 46005594 3564526 1 1 1 1
The new combined column can be created in two steps in dplyr. First, use filter(all(Owner == 0)) within each group to keep only the groups that have no owner, and create a column holding the 'Head' information for those rows. Then merge this column back into the original data frame and add it to the 'Owner' column to obtain the combined information.
library(dplyr)
sample2 <- sample %>%
group_by(Group) %>%
filter(all(Owner == 0)) %>%
mutate(Head_nullowner = ifelse(Head == 1, 1, 0)) #select all rows of IDs that do not have any owners
#merge Head_nullowner with the original dataframe by both Group and ID
sample <- merge(sample, sample2[c("Group", "ID", "Head_nullowner")], by.x = c("Group", "ID"), by.y = c("Group", "ID"), all.x = T)
sample$Head_nullowner[is.na(sample$Head_nullowner)] <- 0
sample$OwnerHead_combined = sample$Owner + sample$Head_nullowner
> sample
Group ID Owner Head combined Head_nullowner OwnerHead_combined
1 46005589 189199 0 1 0 0 0
2 46005589 2957073 1 0 1 0 1
3 46005590 272448 1 0 1 0 1
4 46005591 10374996 0 0 0 0 0
5 46005591 1872092 0 1 1 1 1
6 46005592 10281300 0 0 0 0 0
7 46005592 1153514 0 1 0 0 0
8 46005592 2771118 1 0 1 0 1
9 46005593 2610301 1 1 1 0 1
10 46005594 3564526 1 1 1 0 1
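The same rule (follow Owner when the group has an owner, otherwise fall back to Head) can also be sketched in base R without a merge. The names sample_df and has_owner are invented for this illustration, and the ID column is left out because the rule does not use it.

```r
sample_df <- data.frame(
  Group = c("46005589", "46005589", "46005590", "46005591", "46005591",
            "46005592", "46005592", "46005592", "46005593", "46005594"),
  Owner = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1),
  Head  = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)
)

# ave() returns the group-wise maximum of Owner for every row, so the
# comparison is TRUE for all rows of a group that contains an owner
has_owner <- ave(sample_df$Owner, sample_df$Group, FUN = max) == 1

# Groups with an owner follow Owner; groups without one fall back to Head
sample_df$combined <- ifelse(has_owner, sample_df$Owner, sample_df$Head)
print(sample_df)
```

This keeps the original row order, whereas merge() sorts the result by the key columns.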

R - handling NA in ifelse statements

This question is slightly similar to this question with a more theoretical component.
Given df below:
varA <- c(1,0,0,NA,NA)
varB <- c(NA,NA,NA,1,0)
df <- data.frame(varA, varB)
varA varB
1 NA
0 NA
0 NA
NA 1
NA 0
What's the most elegant method to generate var (with consideration given to NA) which combines the information from varA and varB?
varA varB var
1 NA 1
0 NA 0
0 NA 0
NA 1 1
NA 0 0
My approach, right now, is as follows:
df$var[df$varA == 1 | df$varB == 1] <- 1
df$var[df$varA == 0 | df$varB == 0] <- 0
As a side question, how does R handle NA in ifelse statements? For example, if I write the following code, it does not produce the output I intended.
df$var <- ifelse(df$varA == 1 | df$varB == 1, 1,
ifelse(df$varA == 0 | df$varB == 0, 0, NA))
"combines the information from varA and varB"
It seems like you are looking for coalesce:
library(dplyr)
df %>% mutate(var = coalesce(varA, varB))
# varA varB var
#1 1 NA 1
#2 0 NA 0
#3 0 NA 0
#4 NA 1 1
#5 NA 0 0
For your purposes, NA is equivalent to 0, so why not convert them to 0?
df[is.na(df)] <- 0
df$var <- with(df, as.integer(varA | varB))
> df
varA varB var
1 1 0 1
2 0 0 0
3 0 0 0
4 0 1 1
5 0 0 0
We can use pmax
df$var <- do.call(pmax, c(df, na.rm = TRUE))
df$var
#[1] 1 0 0 1 0
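On the side question of how R handles NA in ifelse: the test is evaluated with three-valued logic, and wherever the test itself is NA, ifelse returns NA. The short sketch below prints the relevant cases and then shows one way to keep NA out of the condition by testing is.na() first, which amounts to a base R version of coalesce for two vectors.

```r
# Comparisons with NA are NA; | and & resolve only when the other side decides
print(NA == 1)
print(NA | TRUE)
print(NA | FALSE)
print(NA & FALSE)

# An NA test yields NA, not the "no" branch
print(ifelse(NA, "yes", "no"))

varA <- c(1, 0, 0, NA, NA)
varB <- c(NA, NA, NA, 1, 0)

# is.na() never returns NA, so the condition is always TRUE or FALSE:
# take varA where it is observed and fall back to varB otherwise
var <- ifelse(is.na(varA), varB, varA)
print(var)
```

In the code from the question, rows such as varA = 0, varB = NA give FALSE | NA, which is NA, so the outer ifelse returns NA before the inner condition is consulted.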
