How can I create a new variable based on conditions in R

I am trying to create a new variable based on some conditions.
My data looks like
a b
1 NA
2 3
3 3
NA 2
NA NA
What I want is a variable c such that
when a is not NA, b is NA, c = a
when a is NA, b is not NA, c = b
when a is NA, b is NA, c = NA
when a is not NA, b is not NA, and a == b, c = a
when a is not NA, b is not NA, and a != b, c = "multiple_values"
How can I do this?
It seems like ifelse() can't do what I want.

All of the conditions except one are handled by coalesce, which returns the first non-NA value of 'a' and 'b'. The exception is the case where both 'a' and 'b' are non-NA and differ from each other. So, we can use case_when to return "multiple_values" for that last condition and apply coalesce for all the others.
library(dplyr)
df1 %>%
mutate(c = case_when(!is.na(a) & !is.na(b) & a != b ~ "multiple_values",
TRUE ~ as.character(coalesce(a, b))))
# a b c
#1 1 NA 1
#2 2 3 multiple_values
#3 3 3 3
#4 NA 2 2
#5 NA NA <NA>
data
df1 <- structure(list(a = c(1L, 2L, 3L, NA, NA), b = c(NA, 3L, 3L, 2L,
NA)), class = "data.frame", row.names = c(NA, -5L))
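To see why coalesce covers all but one case, here is a minimal base-R sketch of the same logic. It assumes the two columns are available as plain vectors a and b, as in the question. The names first_non_na, conflict and c_new are illustrative, and the inner ifelse() is only a two-vector stand-in for dplyr::coalesce(a, b), not its implementation.

```r
a <- c(1L, 2L, 3L, NA, NA)
b <- c(NA, 3L, 3L, 2L, NA)

# Stand-in for coalesce(a, b): take a, fall back to b where a is NA
first_non_na <- ifelse(is.na(a), b, a)

# TRUE only where both values are present and disagree;
# FALSE & NA is FALSE, so this vector contains no NA
conflict <- !is.na(a) & !is.na(b) & a != b

# The label forces a character result, so convert the numbers explicitly
c_new <- ifelse(conflict, "multiple_values", as.character(first_non_na))
c_new
```

Because the result mixes numbers and a text label, the new column is necessarily of type character.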

In base R you could use within.
dat <- within(dat, {
  c <- NA
  c[!is.na(a) & is.na(b)] <- a[!is.na(a) & is.na(b)]
  c[is.na(a) & !is.na(b)] <- b[is.na(a) & !is.na(b)]
  # c[is.na(a) & is.na(b)] <- NA  # redundant: c already starts as NA
  c[!is.na(a) & !is.na(b) & a == b] <- a[!is.na(a) & !is.na(b) & a == b]
  c[!is.na(a) & !is.na(b) & a != b] <- "multiple_values"
})
dat
# a b c
# 1 1 NA 1
# 2 2 3 multiple_values
# 3 3 3 3
# 4 NA 2 2
# 5 NA NA <NA>
Data: dat <- data.frame(a=c(1:3, NA, NA), b=c(NA, 3, 3, 2, NA))

ifelse can do what you want; it just takes a lot of nested calls:
df$c <- with(df, ifelse(!is.na(a) & is.na(b), a,
ifelse(is.na(a) & !is.na(b), b,
ifelse(is.na(a) & is.na(b), NA,
ifelse(!is.na(a) & !is.na(b) & a == b, a, "multiple_values")))))
df
# a b c
#1 1 NA 1
#2 2 3 multiple_values
#3 3 3 3
#4 NA 2 2
#5 NA NA <NA>

Here is another base R answer that uses mapply to loop through the pairs of values, a simple function that combines them and drops NAs, and uses switch to decide on the outcome.
df1$c <-
mapply(function(x, y) {
z <- c(x, y)
z <- unique(z[!is.na(z)])
switch(length(z) + 1L, NA, z, "many")
}, df1$a, df1$b)
which returns
df1
a b c
1 1 NA 1
2 2 3 many
3 3 3 3
4 NA 2 2
5 NA NA <NA>
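The same pairwise idea can be sketched with an explicit helper that returns the label asked for in the question and always yields a character value. This is a variation under assumptions, not the answer's code: combine_pair and c_new are hypothetical names, and the columns are taken as plain vectors a and b.

```r
a <- c(1L, 2L, 3L, NA, NA)
b <- c(NA, 3L, 3L, 2L, NA)

# Hypothetical helper: reduce one (x, y) pair to a single character value
combine_pair <- function(x, y) {
  z <- c(x, y)
  z <- unique(z[!is.na(z)])  # distinct non-NA values in the pair
  if (length(z) == 0L) {
    NA_character_            # both values missing
  } else if (length(z) == 1L) {
    as.character(z)          # one value, or two equal values
  } else {
    "multiple_values"        # two different values
  }
}

c_new <- mapply(combine_pair, a, b)
c_new
```

Returning a character value in every branch means the simplified result from mapply is a character vector regardless of which cases occur in the data.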

Using data.table, you can:
df1 <- structure(list(a = c(1L, 2L, 3L, NA, NA), b = c(NA, 3L, 3L, 2L,
NA)), class = "data.frame", row.names = c(NA, -5L))
library(data.table)
df1 <- as.data.table(df1)
df1[, c := NA_character_]
df1[!is.na(a) & is.na(b), c := as.character(a)]
df1[is.na(a) & !is.na(b), c := as.character(b)]
df1[is.na(a) & is.na(b), c := NA_character_]
df1[!is.na(a) & !is.na(b) & a == b, c := as.character(a)]
df1[!is.na(a) & !is.na(b) & a != b, c := "multiple_values"]

Related

How to count nonblank values in each dataframe row

Given a data frame in R how do I determine the number of non blank values per row.
col1 col2 col3 rowCounts
   1    3              2
        1    6         2
             1         1
                       0
This is how I did it in python:
df['rowCounts'] = df.apply(lambda x: x.count(), axis=1)
What is the R Code for this?
In base R, assuming blanks are NA, rowSums gives a vectorized option: !is.na(df) is a logical matrix in which TRUE (counted as 1) marks a non-NA value, and rowSums adds these up for each row.
df$rowCounts <- rowSums(!is.na(df))
-output
df
# col1 col2 col3 rowCounts
#1 1 3 NA 2
#2 NA 1 6 2
#3 NA NA 1 1
#4 NA NA NA 0
If the blanks are empty strings ("") instead:
df$rowCounts <- rowSums(df != "", na.rm = TRUE)
Or with apply and MARGIN = 1 as a similar syntax to Python (though it will be slower compared to rowSums)
df$rowCounts <- apply(df, 1, function(x) sum(!is.na(x)))
data
df <- structure(list(col1 = c(1L, NA, NA, NA), col2 = c(3L, 1L, NA,
NA), col3 = c(NA, 6L, 1L, NA)), class = "data.frame", row.names = c(NA,
-4L))
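If a data frame mixes both kinds of blank (NA and ""), the two checks can be combined. The following is a small sketch under that assumption. The example data and the names is_blank and row_counts are made up for illustration.

```r
# Illustrative data: blanks appear both as NA and as empty strings
df <- data.frame(col1 = c("1", "", NA),
                 col2 = c("3", "1", NA),
                 stringsAsFactors = FALSE)

# TRUE where a cell is NA or an empty string.
# df == "" is NA for NA cells, but TRUE | NA is TRUE, so no NA remains.
is_blank <- is.na(df) | df == ""

# Count the non-blank cells in each row
row_counts <- rowSums(!is_blank)
row_counts
```

Computing row_counts before adding it to df avoids counting the new column itself on a second run.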

Determine if sub string appears in a string by row of dataframe

I have a dataframe that is revised every day. When an error occurs, it's checked, and if it can be solved, then the keyword "REVISED" is added to the beginning of the error message. Like so:
ID M1 M2 M3
1 NA "REVISED-error" "error"
2 "REVISED-error" "REVISED-error" NA
3 "REVISED-error" "REVISED-error" "error"
4 NA "error" NA
5 NA NA NA
I want to find a way to add two columns that help me determine whether there are any errors and how many of them have been revised. Like this:
ID M1 M2 M3 i1 ix
1 NA "REVISED-error" "error" 2 1 <- 2 errors, 1 revised
2 "REVISED-error" "REVISED-error" NA 2 2
3 "REVISED-error" "REVISED-error" "error" 3 2
4 NA "error" NA 1 0
5 NA NA NA 0 0
I found this code:
df <- df%>%mutate(i1 = rowSums(!is.na(.[2:4])))
That helps me to know how many errors are in those specific columns. How can I know if any of said errors contains the keyword REVISED? I've tried a few things but none have worked so far:
df <- df%>%
mutate(i1 = rowSums(!is.na(.[2:4])))%>%
mutate(ie = rowSums(.[2:4] %in% "REVISED"))
This returns the error: 'x' must be an array of at least two dimensions
You could use apply to find number of times "error" and "REVISED" appears in each row.
df[c("i1", "ix")] <- t(apply(df[-1], 1, function(x)
c(sum(grepl("error", x)), sum(grepl("REVISED", x)))))
df
# ID M1 M2 M3 i1 ix
#1 1 <NA> REVISED-error error 2 1
#2 2 REVISED-error REVISED-error <NA> 2 2
#3 3 REVISED-error REVISED-error error 3 2
#4 4 <NA> error <NA> 1 0
#5 5 <NA> <NA> <NA> 0 0
Alternative approach using is.na and rowSums to calculate i1.
df$i1 <- rowSums(!is.na(df[-1]))
df$ix <- apply(df[-1], 1, function(x) sum(grepl("REVISED", x)))
data
df <- structure(list(ID = 1:5, M1 = structure(c(NA, 1L, 1L, NA, NA),
.Label = "REVISED-error", class = "factor"),
M2 = structure(c(2L, 2L, 2L, 1L, NA), .Label = c("error",
"REVISED-error"), class = "factor"), M3 = structure(c(1L,
NA, 1L, NA, NA), .Label = "error", class = "factor")), row.names = c(NA,
-5L), class = "data.frame")
You can use str_count() from the stringr library to count the number of times REVISED appears, like so
df <- data.frame(M1=as.character(c(NA, "REVISED-x", "REVISED-x")),
M2=as.character(c("REVISED-x", "REVISED-x", "REVISED-x")),
stringsAsFactors = FALSE)
library(stringr)
df$ix <- str_count(paste0(df$M1, df$M2), "REVISED")
df
# M1 M2 ix
# 1 <NA> REVISED-x 1
# 2 REVISED-x REVISED-x 2
# 3 REVISED-x REVISED-x 2
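Both counts can also be computed column-wise, without a row-by-row apply. The sketch below assumes the message columns are character (or factor) and that "REVISED" always sits at the start of the message, as described in the question. The reduced example data and the names msgs and revised are illustrative.

```r
# Reduced example with two message columns
df <- data.frame(
  ID = 1:3,
  M1 = c(NA, "REVISED-error", NA),
  M2 = c("REVISED-error", "REVISED-error", "error"),
  stringsAsFactors = FALSE
)

msgs <- df[c("M1", "M2")]

# One logical column per message column; grepl() returns FALSE for NA.
# Note: with a single-row data frame sapply() returns a vector, not a matrix.
revised <- sapply(msgs, function(col) grepl("^REVISED", col))

df$i1 <- rowSums(!is.na(msgs))  # number of errors per row
df$ix <- rowSums(revised)       # number of revised errors per row
df
```

The test x %in% "REVISED" only finds cells that are exactly equal to "REVISED", which is why a pattern match such as grepl is used for messages like "REVISED-error".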

Sum many rows with some of them have NA in all needed columns

I am trying to do rowSums but I got zero for the last row and I need it to be "NA".
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code based on this link; Sum of two Columns of Data Frame with NA Values
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated
For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
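A vectorized alternative, sketched here under the same assumptions about df, is to keep rowSums with na.rm = TRUE and then overwrite the rows that contain no values at all. The names cols, total and all_missing are illustrative.

```r
df <- data.frame(a = c(1L, 2L, 3L, NA),
                 b = c(4L, NA, 5L, NA),
                 c = c(7L, 8L, NA, NA))

cols <- c("a", "b", "c")

# Sum each row, ignoring NA; a row with only NA values sums to 0
total <- rowSums(df[cols], na.rm = TRUE)

# Rows where none of the selected columns holds a value
all_missing <- rowSums(!is.na(df[cols])) == 0

# Replace the misleading 0 with NA for those rows
total[all_missing] <- NA
df$sum <- total
df
```

This avoids calling a function once per row, which may matter on large data frames.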

R compare two columns return third column if any column match conditions

I have a dataset, data1, which has ME and PDR columns.
I want to create this third column: case which would look like this:
ME PDR case
1 2 2
NA 1 1
NA 1 1
1 2 2
NA NA NA
I tried to use this command, but it doesn't return 1 when I have 1 in either column and no 2 in any of them.
data1$case=ifelse(data1$ME==2 | data1$PDR==2 ,2,ifelse(data1$ME==NA & data1$PDR==NA,NA,1))
We can use pmax
data1$case <- do.call(pmax, c(data1, na.rm = TRUE))
data1$case
#[1] 2 1 1 2 NA
Regarding the OP's code: == returns NA for any element that is NA, so each comparison needs an extra condition (& !is.na(ME), and likewise for PDR) to take care of the NA values.
with(data1, ifelse((ME == 2 & !is.na(ME)) | (PDR == 2 & !is.na(PDR)),
2, ifelse(is.na(ME) &is.na(PDR), NA, 1)))
#[1] 2 1 1 2 NA
NOTE: The == for checking NA is not recommended as there are functions to get a logical vector when there are missing values (is.na, complete.cases)
data
data1 <- structure(list(ME = c(1L, NA, NA, 1L, NA), PDR = c(2L, 1L, 1L,
2L, NA)), class = "data.frame", row.names = c(NA, -5L))
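The NA behaviour of == can be checked directly, and %in% offers a comparison that never returns NA. The sketch below is one possible rewrite of the nested ifelse under that idea, not the answer's own code. It uses plain vectors with the same values as data1, and has_two, both_na and case are illustrative names.

```r
ME  <- c(1L, NA, NA, 1L, NA)
PDR <- c(2L, 1L, 1L, 2L, NA)

NA == 2   # NA, not FALSE
ME == NA  # NA for every element, so it cannot be used to test for missing values

# %in% returns only TRUE or FALSE, so no extra is.na() guard is needed
has_two <- ME %in% 2 | PDR %in% 2
both_na <- is.na(ME) & is.na(PDR)

case <- ifelse(has_two, 2, ifelse(both_na, NA, 1))
case
```

Rows that contain a 2 are handled first, so the inner ifelse only has to separate the all-NA rows from the rest.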

Conditional Column Formatting

I have a data frame that looks like this:
cat df1 df2 df3
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 2 1 NA NA
5 2 2 NA NA
6 2 3 NA NA
I want to populate df3 so that when cat = 1, df3 = df2 and when cat = 2, df3 = df1. However I am getting a few different error messages.
My current code looks like this:
df$df3[df$cat == 1] <- df$df2
df$df3[df$cat == 2] <- df$df1
Try this code, which subsets the rows on both sides of each assignment:
df[df$cat==1,"df3"]<-df[df$cat==1,"df2"]
df[df$cat==2,"df3"]<-df[df$cat==2,"df1"]
The output:
df
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
You can try
ifelse(df$cat == 1, df$df2, df$df1)
[1] 1 2 3 1 2 3
# saving
df$df3 <- ifelse(df$cat == 1, df$df2, df$df1)
# if there are other values than 1 and 2 you can try a nested ifelse
# that is setting other values to NA
ifelse(df$cat == 1, df$df2, ifelse(df$cat == 2, df$df1, NA))
# or you can try a tidyverse solution.
library(tidyverse)
df %>%
mutate(df3=case_when(cat == 1 ~ df2,
cat == 2 ~ df1))
cat df1 df2 df3
1 1 NA 1 1
2 1 NA 2 2
3 1 NA 3 3
4 2 1 NA 1
5 2 2 NA 2
6 2 3 NA 3
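The code in the question fails because the left-hand side selects only three rows while the right-hand side supplies the whole column. A minimal sketch of the subsetting fix, storing each logical index once so that both sides select the same rows, could look like this. The names is1 and is2 are illustrative, and the data frame is rebuilt here with the question's values.

```r
df <- data.frame(cat = c(1L, 1L, 1L, 2L, 2L, 2L),
                 df1 = c(NA, NA, NA, 1L, 2L, 3L),
                 df2 = c(1L, 2L, 3L, NA, NA, NA),
                 df3 = NA)

# Compute each row selection once
is1 <- df$cat == 1
is2 <- df$cat == 2

# Use the same index on both sides so the lengths match
df$df3[is1] <- df$df2[is1]
df$df3[is2] <- df$df1[is2]
df
```

Rows whose cat is neither 1 nor 2 keep their initial NA in df3.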
# data
df <- structure(list(cat = c(1L, 1L, 1L, 2L, 2L, 2L), df1 = c(NA, NA,
NA, 1L, 2L, 3L), df2 = c(1L, 2L, 3L, NA, NA, NA), df3 = c(NA,
NA, NA, NA, NA, NA)), .Names = c("cat", "df1", "df2", "df3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Resources