Using ifelse within mutate and handling NA's - r

Thanks for your time.
I have a question about using ifelse within the mutate function. ifelse is from base R, while mutate is from the dplyr package.
My question is about how ifelse handles NA values.
I have two character vectors:
example_character_vector contains some words and occasional NA values while the other vector, color_indicator, contains only the words Green, Yellow, and Red.
I want to mutate my dataframe example_data_frame to create a new override_color_indicator variable that converts some of the yellows to greens depending on a condition in the example_character_vector.
Example data:
example_character_vector <- c("Basic", NA, "Full", "None", NA, "None",
NA)
color_indicator <- c("Green", "Green", "Yellow", "Yellow", "Yellow",
"Red", "Red")
example_data_frame <- data.frame(example_character_vector,
color_indicator)
This example_data_frame looks like so:
example_character_vector color_indicator
1 Basic Green
2 <NA> Green
3 Full Yellow
4 None Yellow
5 <NA> Yellow
6 None Red
7 <NA> Red
I am using nested ifelse statements within mutate to create a new column called override_color_indicator.
If color_indicator is yellow and the example_character_vector contains the word "Full", I want the override_color_indicator to be Green (this is a special case within my data). Otherwise, I would like the override_color_indicator to be exactly the same as the color_indicator.
Here is my mutate:
example_data_frame <- example_data_frame %>%
  mutate(override_color_indicator =
    ifelse(color_indicator == "Green",
      "Green",
      ifelse(color_indicator == "Yellow" &
               str_detect(example_character_vector, "Full"),
        "Green",
        ifelse(color_indicator == "Yellow" &
                 !str_detect(example_character_vector, "Full") |
                 color_indicator == "Yellow" &
                 is.na(example_character_vector),
          "Yellow",
          "Red"))))
(Apologies for the formatting - I tried to format this the best I could for Stack Overflow.)
This above code produces this dataframe:
example_character_vector color_indicator override_color_indicator
1 Basic Green Green
2 <NA> Green Green
3 Full Yellow Green
4 None Yellow Yellow
5 <NA> Yellow <NA>
6 None Red Red
7 <NA> Red Red
My problem here is that in row 5, an NA is introduced in the override_color_indicator column. Instead of an NA, I would like it to be "Yellow".
For clarity, this is my desired dataframe:
example_character_vector color_indicator override_color_indicator
1 Basic Green Green
2 <NA> Green Green
3 Full Yellow Green
4 None Yellow Yellow
5 <NA> Yellow Yellow
6 None Red Red
7 <NA> Red Red
I've looked quite a bit for an answer, and couldn't find one anywhere. I could just create a workaround and go back and manually assign the entries to Yellow, but I don't love that option from a programmatic standpoint.
Also, I'm just kind of curious as to why this behavior happens. I've run into this problem a few times now.
Thanks for your time!

You should use case_when here, but the reason you are getting NA is the second ifelse. One interesting thing about how NA propagates in R is that (from the docs) "the result will be NA if the outcome is ambiguous". An AND with FALSE is FALSE no matter what the missing value is, so there is no ambiguity and we have
NA & FALSE
#> [1] FALSE
but since this is ambiguous, the NA propagates here.
NA & TRUE
#> [1] NA
Row 5 has TRUE for Yellow but str_detect will return NA, so ifelse returns NA. You can get around this by adding & !is.na(example_character_vector) in that line:
library(tidyverse)
example_data_frame <- structure(list(example_character_vector = c("Basic", NA, "Full", "None", NA, "None", NA), color_indicator = c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")), class = "data.frame", row.names = c(NA, -7L))
example_data_frame %>%
mutate(
override_color_indicator =
ifelse(
color_indicator == "Green",
"Green",
ifelse(
color_indicator == "Yellow" &
str_detect(example_character_vector, "Full") & !is.na(example_character_vector),
"Green",
ifelse(
color_indicator == "Yellow" &
(!str_detect(example_character_vector, "Full") | is.na(example_character_vector)),
"Yellow",
"Red"
)
)
)
)
#> example_character_vector color_indicator override_color_indicator
#> 1 Basic Green Green
#> 2 <NA> Green Green
#> 3 Full Yellow Green
#> 4 None Yellow Yellow
#> 5 <NA> Yellow Yellow
#> 6 None Red Red
#> 7 <NA> Red Red
But definitely use case_when!
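To see the propagation rule in isolation, here is a small base R sketch (not part of the original answer; it only uses built-in operators and ifelse) showing when NA survives a logical operation and what ifelse does when its test is NA:

```r
# R's three-valued logic: NA survives only when the result is ambiguous.
and_false <- NA & FALSE   # FALSE: the answer is FALSE whatever NA stands for
and_true  <- NA & TRUE    # NA: the answer depends on the missing value
or_true   <- NA | TRUE    # TRUE: the answer is TRUE whatever NA stands for
or_false  <- NA | FALSE   # NA: the answer depends on the missing value

# ifelse() returns NA wherever its test is NA, so neither branch is used there.
picked <- ifelse(c(TRUE, NA, FALSE), "yes", "no")
print(picked)  # "yes" NA "no"
```

This is why row 5 never reaches the third ifelse: the second test is already NA, so ifelse returns NA for that row without looking at the "no" branch.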

Try this instead. case_when is a more flexible vectorised if and allows you to use TRUE to say "else, use the value in color_indicator".
example_data_frame <- structure(list(example_character_vector = c("Basic", NA, "Full", "None", NA, "None", NA), color_indicator = c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")), class = "data.frame", row.names = c(NA, -7L))
library(dplyr)
example_data_frame %>%
mutate(x = case_when(color_indicator == "Yellow" &
example_character_vector == "Full" ~ "Green",
TRUE ~ color_indicator))
#> example_character_vector color_indicator x
#> 1 Basic Green Green
#> 2 <NA> Green Green
#> 3 Full Yellow Green
#> 4 None Yellow Yellow
#> 5 <NA> Yellow Yellow
#> 6 None Red Red
#> 7 <NA> Red Red
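If you would rather keep a single ifelse and still match on a pattern (as str_detect does in the question), one possible base R sketch is to build the test with grepl, which returns FALSE rather than NA for missing input. This is an alternative I am suggesting, not what either answer above uses, and it assumes a literal match on "Full" is what you want:

```r
example_character_vector <- c("Basic", NA, "Full", "None", NA, "None", NA)
color_indicator <- c("Green", "Green", "Yellow", "Yellow", "Yellow", "Red", "Red")

# grepl() returns FALSE (not NA) for missing input, so the test is never NA.
is_full <- grepl("Full", example_character_vector, fixed = TRUE)

# Only the special case is overridden; everything else keeps its color.
override_color_indicator <- ifelse(
  color_indicator == "Yellow" & is_full,
  "Green",
  color_indicator
)
print(override_color_indicator)
```

Because the test can no longer be NA, no extra is.na() guard is needed.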

Related

R: Add new column by specific patterns in another column of the dataframe

my dataframe A looks like this:
**Group** **Pattern**
One Black & White
Two Black OR Pink
Three Red
Four Pink
Five White & Green
Six Green & Orange
Seven Orange
Eight Pink & Red
Nine Black OR White
Ten Green
. .
. .
. .
I have then dataframe B which looks like this:
**Color** **Value**
Orange 12
Pink 2
Red 4
Green 22
Black 84
White 100
I would like to add a new column, called Value, in dataframe A, based on its Pattern column. I want it to be in a way that if there is any (&), the values are summed up (for example if it is Black & White, I want it to become 184) and if there is any (OR), I want to have the higher number (in the same example, it will be 100).
I can join them with dplyr inner_join, but rows with &/OR are excluded. Is there any other way?
Cheers!
dfA <- data.frame(group=seq(1,4), pattern=c("Black & White", "Black OR Pink", "Red", "Pink"), stringsAsFactors=F)
dfB <- data.frame(color=c("Pink", "Red", "Black", "White"), value=c(2,4,84,100), stringsAsFactors=F)
getVal2return <- function(i, dfA, dfB){
andv <- unlist(strsplit(dfA$pattern[i], split=" & "))
orv <- unlist(strsplit(dfA$pattern[i], split=" OR "))
if (length(andv) > 1) {
val <- sum(dfB$value[match(andv, dfB$color)])
} else if (length(orv)> 1){
val <- max(dfB$value[match(orv, dfB$color)])
} else {
val <- dfB$value[match(dfA$pattern[i], dfB$color)]
}
return(val)
}
dfA$newVal <- sapply(1:nrow(dfA), function(x) { getVal2return(x, dfA, dfB) })
> dfA
group pattern newVal
1 1 Black & White 184
2 2 Black OR Pink 84
3 3 Red 4
4 4 Pink 2
This is a fairly pedestrian method, but effective:
A$Value <- A$Pattern
for(i in seq(nrow(B))) A$Value <- gsub(B$Color[i], B$Value[i], A$Value)
A$Value <- sub("&", "+", A$Value)
A$Value <- sub("^(\\d+) OR (\\d+)$", "max(\\1, \\2)", A$Value)
A$Value <- vapply(A$Value, function(x) eval(parse(text = x)), numeric(1))
A
#> Group Pattern Value
#> 1 One Black & White 184
#> 2 Two Black OR Pink 84
#> 3 Three Red 4
#> 4 Four Pink 2
#> 5 Five White & Green 122
#> 6 Six Green & Orange 34
#> 7 Seven Orange 12
#> 8 Eight Pink & Red 6
#> 9 Nine Black OR White 100
#> 10 Ten Green 22
Created on 2022-02-18 by the reprex package (v2.0.1)
DATA
A <- structure(list(Group = c("One", "Two", "Three", "Four", "Five",
"Six", "Seven", "Eight", "Nine", "Ten"), Pattern = c("Black & White",
"Black OR Pink", "Red", "Pink", "White & Green", "Green & Orange",
"Orange", "Pink & Red", "Black OR White", "Green")), class = "data.frame",
row.names = c(NA, -10L))
B <- structure(list(Color = c("Orange", "Pink", "Red", "Green", "Black",
"White"), Value = c(12L, 2L, 4L, 22L, 84L, 100L)), class = "data.frame",
row.names = c(NA, -6L))
I would try something like this; I prefer base R.
This assumes df is the first dataframe and df2 is the second:
df['Value'] = apply(df['Pattern'], 1, function(Pattern){
s = strsplit(Pattern, ' & ')[[1]]
if (length(s) == 2) {
return(with(df2, Value[Color == s[1]] + Value[Color == s[2]]))
}
s = strsplit(Pattern, ' OR ')[[1]]
if (length(s) == 2) {
return(with(df2, max(Value[Color == s[1]], Value[Color == s[2]])))
}
return(df2[df2$Color == Pattern,]$Value)
})
df
#> Group Pattern Value
#> 1 One Black & White 184
#> 2 Two Black OR Pink 84
#> 3 Three Red 4
#> 4 Four Pink 2
#> 5 Five White & Green 122
#> 6 Six Green & Orange 34
#> 7 Seven Orange 12
#> 8 Eight Pink & Red 6
#> 9 Nine Black OR White 100
#> 10 Ten Green 22
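The same split-then-look-up idea can also be sketched with a named vector as the lookup table, which avoids both eval(parse()) and repeated subsetting of the second data frame. This is only an illustration on the four-row sample data; score_pattern and lookup are hypothetical names, and it assumes each pattern uses either " & " or " OR " but not both:

```r
dfA <- data.frame(
  group = 1:4,
  pattern = c("Black & White", "Black OR Pink", "Red", "Pink"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  color = c("Pink", "Red", "Black", "White"),
  value = c(2, 4, 84, 100),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table: color name -> value.
lookup <- setNames(dfB$value, dfB$color)

# Hypothetical helper: sum for "&", max for "OR", plain lookup otherwise.
score_pattern <- function(p) {
  if (grepl(" & ", p, fixed = TRUE)) {
    sum(lookup[strsplit(p, " & ", fixed = TRUE)[[1]]])
  } else if (grepl(" OR ", p, fixed = TRUE)) {
    max(lookup[strsplit(p, " OR ", fixed = TRUE)[[1]]])
  } else {
    lookup[[p]]
  }
}

dfA$value <- vapply(dfA$pattern, score_pattern, numeric(1), USE.NAMES = FALSE)
print(dfA)
```

A color that is missing from the lookup gives NA for the "&" and "OR" cases and an error for a single color, so you may want to check the patterns first.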

Counting new levels of a factor per site (R)

I need to generate a table counting new levels of a factor per site.
My code is like this
# Data creation
f = c("red", "green", "blue", "orange", "yellow")
f = factor(f)
d = data.frame(
site = 1:10,
color1= c(
"red", "red", "green", "green", "green",
"blue","green", "blue", "orange", "yellow"
),
color2= c(
"green", "green", "green", "blue","green",
"blue", "orange", "yellow","red", "red"
)
)
d$color1 = factor( d$color1 , levels = levels(f) )
d$color2 = factor( d$color2 , levels = levels(f) )
d
It shows me this table
I need to count how many new colors appear at each site, counting a color only the first time it appears and not when it is duplicated. The result should be a table like this one.
The count of non-duplicated colors per site is shown in this figure.
Is there a dplyr way to find this output?
You can do:
library(tidyverse)
d %>%
pivot_longer(cols = -site) %>%
mutate(newColors = duplicated(value)) %>%
group_by(site) %>%
mutate(newColors = sum(!newColors)) %>%
ungroup() %>%
pivot_wider()
which gives:
# A tibble: 10 x 4
site newColors color1 color2
<int> <int> <fct> <fct>
1 1 2 red green
2 2 0 red green
3 3 0 green green
4 4 1 green blue
5 5 0 green green
6 6 0 blue blue
7 7 1 green orange
8 8 1 blue yellow
9 9 0 orange red
10 10 0 yellow red
Note that this differs for row 9 where you have a 1, but both colors (orange and red) already appeared in previous rows.
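For anyone who wants to check the logic without the tidyverse, the same idea (flag the first appearance of each color, then sum the flags per site) can be sketched in base R. This is an illustrative equivalent, not the answer's code; it uses plain character vectors instead of factors and assumes color1 is read before color2 within a site:

```r
site <- 1:10
color1 <- c("red", "red", "green", "green", "green",
            "blue", "green", "blue", "orange", "yellow")
color2 <- c("green", "green", "green", "blue", "green",
            "blue", "orange", "yellow", "red", "red")

# Interleave the columns: site 1 color1, site 1 color2, site 2 color1, ...
values <- as.vector(rbind(color1, color2))
sites  <- rep(site, each = 2)

# TRUE the first time a color is seen anywhere, FALSE afterwards.
is_new <- !duplicated(values)

# Number of first appearances per site.
new_colors <- as.vector(tapply(is_new, sites, sum))
print(data.frame(site, new_colors, color1, color2))
```

On this data it gives the same newColors column as the dplyr pipeline above, including 0 for site 9.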

tidyr join an ID table with main table across multiple columns

This seems like a very basic operation, but my searches are not finding a simple solution.
As an example of what I am trying to do, consider the following two data frames from a database.
First an ID table that assigns an index to a color name:
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
ColorID
# A tibble: 4 x 2
ID Name
<int> <chr>
1 1 Red
2 2 Green
3 3 Blue
4 4 Black
Next some table that points to these color indexes (instead of storing text strings):
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
Widgets
# A tibble: 6 x 4
Front Back Top Bottom
<dbl> <dbl> <dbl> <dbl>
1 1 4 4 1
2 3 4 3 2
3 4 3 2 3
4 2 3 1 4
5 1 1 2 3
6 1 2 3 2
Now I just want to join the two tables to substitute the index values with the actual color names, so what I want is:
Joined <- tibble(Front = c("Red", "Blue", "Black", "Green", "Red","Red"),
Back = c("Black", "Black", "Blue","Blue", "Red", "Green"),
Top = c("Black","Blue", "Green", "Red", "Green", "Blue"),
Bottom = c("Red", "Green", "Blue", "Black", "Blue","Green"))
Joined
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
I've tried many iterations with no success, what I thought would work is something like:
J <- Widgets %>% inner_join(ColorID, by = c(. = "ID"))
I can tackle this column by column by using one variable at a time, e.g.
J <- Widgets %>% inner_join(ColorID, by = c("Front" = "ID"))
Which doesn't replace "Front", but instead creates a new "Name" column. Seems like there has to be a simple solution to this though. Thanks.
There is no need for join functions:
library(dplyr)
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
# reorder so that row number and ID are different
ColorID <- ColorID[c(2, 1, 4, 3), ]
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
check_id <- function(col){
ColorID$Name[match(col, ColorID$ID)]
}
Widgets %>%
mutate(across(everything(), check_id))
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
(Edited) What I'm doing with dplyr and mutate is matching the numbers on Widgets with the number on the ColorID$ID column. This provides me with the row on the ColorID data frame I need for extracting the name.
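To make the match step concrete, here is a tiny sketch on the Front column alone, using the reordered ID table from the code above (the variable names ids, color_names, front and pos are just for illustration):

```r
# ID table after the reordering ColorID[c(2, 1, 4, 3), ]
ids         <- c(2, 1, 4, 3)
color_names <- c("Green", "Red", "Black", "Blue")

front <- c(1, 3, 4, 2, 1, 1)

# match() gives, for each value of front, its position in ids.
pos <- match(front, ids)
print(pos)               # 2 4 3 1 2 2

# Indexing the names by those positions does the substitution.
print(color_names[pos])  # "Red" "Blue" "Black" "Green" "Red" "Red"
```

Because match works on positions rather than on the ID values themselves, the lookup stays correct even when the rows of the ID table are not in ID order.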
Does this work:
library(dplyr)
library(tidyr)
Widgets %>% pivot_longer(everything()) %>%
inner_join(ColorID, by = c('value' = 'ID')) %>% select(-value) %>%
pivot_wider(names_from = name, values_from = Name) %>% unnest(everything())
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green

Assign unique ID based on values in EITHER of two columns

This is not a duplicate of this question. Please read questions entirely before labeling duplicates.
I have a data.frame like so:
library(tidyverse)
tibble(
color = c("blue", "blue", "red", "green", "purple"),
shape = c("triangle", "square", "circle", "hexagon", "hexagon")
)
color shape
<chr> <chr>
1 blue triangle
2 blue square
3 red circle
4 green hexagon
5 purple hexagon
I'd like to add a group_id column like this:
color shape group_id
<chr> <chr> <dbl>
1 blue triangle 1
2 blue square 1
3 red circle 2
4 green hexagon 3
5 purple hexagon 3
The difficulty is that I want to group by unique values of color or shape. I suspect the solution might be to use list-columns, but I can't figure out how.
We can use duplicated in base R
df1$group_id <- cumsum(!Reduce(`|`, lapply(df1, duplicated)))
-output
df1
# A tibble: 5 x 3
# color shape group_id
# <chr> <chr> <int>
#1 blue triangle 1
#2 blue square 1
#3 red circle 2
#4 green hexagon 3
#5 purple hexagon 3
Or using tidyverse
library(dplyr)
library(purrr)
df1 %>%
mutate(group_id = map(., duplicated) %>%
reduce(`|`) %>%
`!` %>%
cumsum)
data
df1 <- structure(list(color = c("blue", "blue", "red", "green", "purple"
), shape = c("triangle", "square", "circle", "hexagon", "hexagon"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
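Here is a step-by-step sketch of what the one-liner computes, using plain vectors so each intermediate result is visible (the intermediate names are only for illustration):

```r
color <- c("blue", "blue", "red", "green", "purple")
shape <- c("triangle", "square", "circle", "hexagon", "hexagon")

# TRUE where a value has already appeared earlier in the same column.
dup_color <- duplicated(color)   # FALSE TRUE FALSE FALSE FALSE
dup_shape <- duplicated(shape)   # FALSE FALSE FALSE FALSE TRUE

# A row continues an existing group if either column repeats.
seen_before <- dup_color | dup_shape

# Each row that repeats nothing starts a new group.
group_id <- cumsum(!seen_before)
print(group_id)                  # 1 1 2 3 3
```

One caveat, as far as I can tell: a repeating row is always assigned to the most recent group, so this relies on related rows being adjacent, as they are in the example. With color <- c("blue", "red", "blue") the third row would get group 2, not group 1.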

Delete rows from a data frame in R based on column values in another data frame

I have two data frames as follows:
df1 <- data.frame(fruit=c("apple", "blackberry", "orange", "pear", "grape"),
color=c("black", "purple", "blue", "green", "red"),
quantity1=c(1120, 7600, 21409, 120498, 25345),
quantity2=c(1200, 7898, 21500, 140985, 27098),
taste=c("sweet", "bitter", "sour", "salty", "spicy"))
df2 <- data.frame(fruit=c("apple", "orange", "pear"),
color=c("black", "yellow", "green"),
quantity=c(1145, 65094, 120500))
I would like to delete rows in df1 based on rows in df2. A row is deleted only if it matches all 3 conditions:
The fruit name must match
The color must match
The quantity in df2 must be a value in between the two quantities in df1
The output for my example should look like:
df3 <- data.frame(fruit=c("blackberry", "orange", "grape"),
color=c("purple", "blue", "red"),
quantity1=c(7600, 21409, 25345),
quantity2=c(7898, 21500, 27098),
taste=c("bitter", "sour", "spicy"))
I wonder if tidyverse could be also used:
library(tidyverse)
df1 %>%
left_join(df2, by = c("fruit", "color")) %>%
filter(is.na(quantity) | quantity <= quantity1 | quantity >= quantity2)
#> fruit color quantity1 quantity2 taste quantity
#> 1 blackberry purple 7600 7898 bitter NA
#> 2 orange blue 21409 21500 sour NA
#> 3 grape red 25345 27098 spicy NA
With data.table, we can use a non-equi join
library(data.table)
setDT(df1)[!df2, on = .(fruit, color, quantity1 <= quantity,
quantity2 >= quantity)]
# fruit color quantity1 quantity2 taste
#1: blackberry purple 7600 7898 bitter
#2: orange blue 21409 21500 sour
#3: grape red 25345 27098 spicy
Or using the same methodology with fuzzy_anti_join as shown in this post
You can use fuzzy_anti_join from the fuzzyjoin package:
fuzzyjoin::fuzzy_anti_join(df1, df2,
by = c('fruit', 'color','quantity1' = 'quantity', 'quantity2' = 'quantity'),
match_fun = list(`==`, `==`, `<=`, `>=`))
# A tibble: 3 x 5
# fruit color quantity1 quantity2 taste
# <chr> <chr> <dbl> <dbl> <chr>
#1 blackberry purple 7600 7898 bitter
#2 orange blue 21409 21500 sour
#3 grape red 25345 27098 spicy
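If you would rather avoid extra packages, the same anti-join logic can be sketched in base R with merge. This is an illustrative alternative, not one of the answers above; it assumes a quantity exactly equal to quantity1 or quantity2 counts as "in between" (as in the data.table version), and note that merge sorts the result by the key columns:

```r
df1 <- data.frame(fruit = c("apple", "blackberry", "orange", "pear", "grape"),
                  color = c("black", "purple", "blue", "green", "red"),
                  quantity1 = c(1120, 7600, 21409, 120498, 25345),
                  quantity2 = c(1200, 7898, 21500, 140985, 27098),
                  taste = c("sweet", "bitter", "sour", "salty", "spicy"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(fruit = c("apple", "orange", "pear"),
                  color = c("black", "yellow", "green"),
                  quantity = c(1145, 65094, 120500),
                  stringsAsFactors = FALSE)

# Left join: rows of df1 with no fruit/color match get quantity = NA.
m <- merge(df1, df2, by = c("fruit", "color"), all.x = TRUE)

# Keep rows with no match, or whose matched quantity lies outside the range.
keep <- is.na(m$quantity) | m$quantity < m$quantity1 | m$quantity > m$quantity2
result <- m[keep, names(df1)]
print(result)
```

If df2 can contain several rows for the same fruit and color, the merge would duplicate df1 rows, so this sketch assumes at most one match per row.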
