R coalesce two columns but keep both values if not NA - r

I have a dataframe with two columns of related data. I want to create a third column that combines them, but there are lots of NAs in one or both columns. If both columns have a non-NA value, I want the new third column to paste both values. If either of the first two columns has an NA, I want the third column to contain just the non-NA value. An example with a toy data frame is below:
x <- c("a", NA, "c", "d")
y <- c("l", "m", NA, "o")
df <- data.frame(x, y)
# this is the new column I want to produce from columns x and y above
df$z <- c("al", "m", "c", "do")
I thought coalesce would solve my problem, but I can't find a way to keep both values if there is a value in both columns. Thanks in advance for any assistance.

One possible solution:
df$z <- gsub("NA", "", paste0(df$x, df$y))
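One caveat with the gsub() approach is that it removes every literal "NA" substring, including one that is part of a real value. A minimal sketch of that pitfall and of a base R alternative follows; the helper name blank_na and the value "NASA" are made up for illustration and are not from the original question.

```r
x <- c("NASA", NA, "c")
y <- c("l", "m", NA)

# gsub() also strips the "NA" inside the real value "NASA"
naive <- gsub("NA", "", paste0(x, y))

# Hypothetical helper: turn missing values into "" before pasting
blank_na <- function(v) ifelse(is.na(v), "", as.character(v))
safe <- paste0(blank_na(x), blank_na(y))

print(naive)
print(safe)
```

If the data can never contain the letters "NA" in a real value, the shorter gsub() version is fine.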

Another possible solution:
library(dplyr)
df %>%
mutate(z = ifelse(is.na(x) | is.na(y), coalesce(x,y), paste0(x,y)))

An option with unite:
library(tidyr)
library(dplyr)
df %>%
unite(z, everything(), na.rm = TRUE, sep = "", remove = FALSE)
z x y
1 al a l
2 m <NA> m
3 c c <NA>
4 do d o

Related

Recoding factor with many levels

I need to recode a factor variable with almost 90 levels. The levels are trait names from a database, which I then need to pivot to get the dataset for analysis.
Is there a way to do it automatically without typing each OldName=NewName?
This is how I do it with dplyr for fewer levels:
df$TraitName <- recode_factor(df$TraitName, 'Old Name' = "new.name")
My idea was to use a key data frame with a column of old names and a column of corresponding new names, but I cannot figure out how to feed it to recode.
You could quite easily create a named vector from your lookup table and pass that to recode using splicing. It might well be faster than a join.
library(tidyverse)
# test data
df <- tibble(TraitName = c("a", "b", "c"))
# Make a lookup table with your own data
# You'll bind your two columns instead here.
# You'll want to keep the column order (old, new) so that deframe() works.
# Column names don't matter.
lookup <- tibble(old = c("a", "b", "c"), new = c("aa", "bb", "cc"))
# Convert to named vector and splice it within the recode
df <-
df |>
mutate(TraitNameRecode = recode_factor(TraitName, !!!deframe(lookup)))
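For readers who want to see what the named vector looks like without the tidyverse, here is a rough base R sketch of the same idea. The names key and trait are assumptions for illustration; setNames() plays the role that deframe() plays above.

```r
lookup <- data.frame(old = c("a", "b", "c"),
                     new = c("aa", "bb", "cc"),
                     stringsAsFactors = FALSE)
trait <- factor(c("a", "b", "c", "a"))

# Named vector: names are the old levels, values are the new levels
key <- setNames(lookup$new, lookup$old)

# Rename only the levels found in the key; other levels keep their name
old_levels <- levels(trait)
matched <- old_levels %in% names(key)
levels(trait)[matched] <- unname(key[old_levels[matched]])

print(levels(trait))
```

Because only the levels are rewritten, the cost depends on the number of levels (about 90 here), not on the number of rows.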
One way would be a lookup table, a join, and coalesce (to get the first non-NA value):
my_data <- data.frame(letters = letters[1:6])
levels_to_change <- data.frame(letters = letters[4:5],
new_letters = LETTERS[4:5])
library(dplyr)
my_data %>%
left_join(levels_to_change) %>%
mutate(new = coalesce(new_letters, letters))
Result
Joining, by = "letters"
letters new_letters new
1 a <NA> a
2 b <NA> b
3 c <NA> c
4 d D D
5 e E E
6 f <NA> f
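The same lookup can be sketched in base R with match(), which returns the row of the lookup table for each value, or NA when there is none. This is only an illustration of what the join plus coalesce does; idx is a name introduced here.

```r
my_data <- data.frame(letters = letters[1:6], stringsAsFactors = FALSE)
levels_to_change <- data.frame(letters = letters[4:5],
                               new_letters = LETTERS[4:5],
                               stringsAsFactors = FALSE)

# Position of each value in the lookup table; NA where there is no match
idx <- match(my_data$letters, levels_to_change$letters)

# Take the new value where one exists, otherwise keep the original
my_data$new <- ifelse(is.na(idx),
                      my_data$letters,
                      levels_to_change$new_letters[idx])

print(my_data)
```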

How can I create a new data frame with several rows for each observation based on string column?

I have a data frame in R with data on observations. One column contains several data points for each observation recorded as one long string with separators. I would like to restructure this data so that one observation can occur with several rows instead per the example below.
The data right now looks like this:
df <- data.frame(matrix(c("A", "B",
"X", "Y",
"{data1},{data2}", "{data1}"),
nrow = 2,
ncol = 3,
byrow = F))
names(df) <- c("key", "info", "more_info")
I would like it to look like this:
df <- data.frame(matrix(c("A", "A", "B",
"X", "X", "Y",
"{data1}", "{data2}", "{data1}"),
nrow = 3,
ncol = 3,
byrow = F))
names(df) <- c("key", "info", "more_info")
My first idea was to first use separate() and then use pivot_longer() but this ran into issues since the length of the last column is not the same for each observation. In fact, for some observations it may consist of hundreds of records.
You can use separate_rows from tidyr:
> library(tidyr)
> separate_rows(df, more_info, sep=",")
# A tibble: 3 x 3
key info more_info
<fct> <fct> <chr>
1 A X {data1}
2 A X {data2}
3 B Y {data1}
An option with unnest after strsplit:
library(dplyr)
library(tidyr)
df %>%
# as.character() guards against more_info being a factor, which strsplit() rejects
mutate(more_info = strsplit(as.character(more_info), ",")) %>%
unnest(c(more_info))
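To make the mechanics of the split-then-unnest step explicit, here is a small base R sketch, assuming the same three columns as in the question. Each row is repeated once per piece produced by strsplit(); the names parts and long are illustrative.

```r
df <- data.frame(key = c("A", "B"),
                 info = c("X", "Y"),
                 more_info = c("{data1},{data2}", "{data1}"),
                 stringsAsFactors = FALSE)

# One character vector of pieces per original row
parts <- strsplit(df$more_info, ",", fixed = TRUE)

# Repeat each row's other columns once per piece, then flatten the pieces
long <- data.frame(key = rep(df$key, lengths(parts)),
                   info = rep(df$info, lengths(parts)),
                   more_info = unlist(parts),
                   stringsAsFactors = FALSE)

print(long)
```

Since the row count is driven by lengths(parts), observations with hundreds of records need no special handling.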

How do I mutate a list-column to a common one leaving only the last value when there is a vector in the list?

I am trying to use purrr::map_chr to get the last element of the vector in a list-column as the actual value in case that it exists.
The reproducible example:
library(data.table)
library(purrr)
x <- data.table(one = c("a", "b", "c"), two = list("d", c("e","f","g"), NULL))
I want the data as it is, but with my list-column changed to a regular column that has "g" as the value for x[2,2]. What I've tried:
x %>% mutate(two = ifelse(is.null(.$two), map_chr(~NA_character_), map_chr(~last(.))))
The result should be the next one.
# one two
# a d
# b g
# c NA
Thanks in advance!
Here is an option. We can use if/else instead of ifelse here
library(dplyr)
library(tidyr)
x %>%
mutate(two = map_chr(two, ~ if(is.null(.x)) NA_character_ else last(.x)))
# one two
#1 a d
#2 b g
#3 c NA
Or replace the NULL elements with NA and extract the last element:
x %>%
mutate(two = map_chr(two, ~ last(replace(.x, is.null(.x), NA))))
I would propose this solution, which is a bit cleaner:
library(tidyverse)
df <- tibble(one = c("a", "b", "c"), two = list("d", c("e","f","g"), NULL))
df %>%
mutate_at("two", replace_na, NA_character_) %>%
mutate_at("two", map_chr, last)
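For comparison, the if/else idea can be written without any packages. The sketch below uses vapply() on a plain list; last_or_na is a hypothetical helper name, and the length check treats NULL (length 0) as missing.

```r
two <- list("d", c("e", "f", "g"), NULL)

# Hypothetical helper: last element of a vector, or NA when it is empty/NULL
last_or_na <- function(v) {
  if (length(v) == 0) NA_character_ else as.character(v[[length(v)]])
}

# vapply() guarantees exactly one character value per list element
result <- vapply(two, last_or_na, character(1))

print(result)
```

if/else is used rather than ifelse() because the test is a single TRUE/FALSE per list element, not a vectorised condition.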

Conditional value replacement in linked column

In a data frame, I want to replace a value based on a condition in another column.
Example: when the value in column A is above x, then both values in column A and B are replaced by NA.
I can't find the proper way to do this with the different functions: na_if, ifelse, if_else, case_when, ...
Subscript the data frame by a logical vector having the condition:
DF[DF$A > x, c("A", "B")] <- NA
Here's a working answer:
d <- data.frame("A" = 1:10, "B" = 11:20)
x <- 5
d[d$A > x, c("A", "B")] <- NA
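One thing to watch for: if column A already contains NA, the comparison d$A > x yields NA for those rows, and assigning into a data frame with NA in a logical row index can fail. A hedged variant, assuming that situation, wraps the condition in which() so that only rows where the condition is TRUE are selected; the data below are made up.

```r
d <- data.frame(A = c(1, 7, NA, 9), B = c(11, 12, 13, 14))
x <- 5

# which() drops the NA produced by comparing a missing value with x
rows <- which(d$A > x)
d[rows, c("A", "B")] <- NA

print(d)
```

Row 3 keeps its B value because its A was already missing and so never satisfied the condition.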

Rename suffix part of column name but keep the rest the same

For now I am redoing a merge because I named the columns poorly. However, I would like to know how to match on a suffix of a column name and rename that part of the column, keeping the rest the same.
For example, if I have a data.frame (could be a data.table too, doesn't matter - I could convert it):
d <- data.frame("ID" = c(1, 2, 3),
"Attribute1.prev" = c("A", "B", "C"),
"Attribute1.cur" = c("D", "E", "F"))
Now imagine that there are hundreds of columns similar to columns 2 & 3 from my sample DT. How would I go through and rename all columns ending in ".prev" to end in ".1", and all columns ending in ".cur" to end in ".2"?
So, the new column names would be: ID (unchanged), Attribute1.1, Attribute1.2 and so on for as many columns that match.
With base R we may do
names(d) <- sub("\\.prev", ".1", sub("\\.cur", ".2", names(d)))
d
# ID Attribute1.1 Attribute1.2
# 1 1 A D
# 2 2 B E
# 3 3 C F
With the stringr package you could also use:
library(stringr)
names(d) <- str_replace_all(names(d), c("\\.prev" = ".1", "\\.cur" = ".2"))
If some column names contain other dots or spaces, you can change the patterns "\\.prev" and "\\.cur" to "\\.prev$" and "\\.cur$" so that they match only at the end of the column name.
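A short base R sketch of why the end anchor matters. The column names "my.current.prev" and "xprevious.cur" are invented here to provoke a false match; they are not from the question.

```r
nms <- c("ID", "Attribute1.prev", "Attribute1.cur",
         "my.current.prev", "xprevious.cur")

# Unanchored: ".cur" inside "my.current.prev" is replaced as well
unanchored <- sub("\\.prev", ".1", sub("\\.cur", ".2", nms))

# Anchored with $: only a suffix at the very end of the name is replaced
anchored <- sub("\\.prev$", ".1", sub("\\.cur$", ".2", nms))

print(unanchored)
print(anchored)
```

With only names like Attribute1.prev both versions agree, so the anchor is a safeguard rather than a requirement.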
Here's an idea using dplyr & stringr syntax:
library(dplyr); library(stringr)
# The dot is escaped and the pattern anchored so only a literal suffix matches
names(d) <-
d %>% names() %>%
str_replace("\\.prev$", ".1") %>%
str_replace("\\.cur$", ".2")
Cheers!
Here is an option with gsubfn
library(gsubfn)
names(d) <- gsubfn("(\\w+)", list(prev = 1, cur = 2), names(d))
names(d)
#[1] "ID" "Attribute1.1" "Attribute1.2"
