Join data but ignore missing values - r

I am having some trouble with joining data frames with dplyr, where I would like to ignore the NAs.
The data that I have is quite big, but a simplified version looks like:
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C)
id A B C
1 id1 E T NA
2 id2 F NA T
3 id3 G N U
4 id4 NA T NA
I have an entry that I would like to match with df, which is e.g.:
df2 <- data.frame(A = "E", B = "T", C = "M")
A B C
1 E T M
As a result I would like to obtain all rows from df that match with df2, but the NAs should be ignored. So the result should look like this:
id A B C
1 id1 E T NA
2 id4 NA T NA
I was trying to do this with semi_join, but it did not work so far:
result <- df %>%
group_by(n = seq(n())) %>%
do(modify_if(., is.na, ~NULL) %>%
semi_join(df2, by = c("A", "B", "C"))) %>%
ungroup %>%
select(-n)
Which results in:
Error: `by` can't contain join column `C` which is missing from LHS
Call `rlang::last_error()` to see a backtrace
Who knows the answer?

Here's a solution with a mix of tidyverse and base R. I think this is pretty clear, but I'd be interested in a pure tidyverse implementation that isn't completely contrived.
The idea is to first expand all entries in df and df2 and then filter through all the columns using a loop.
The data:
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(id, A, B, C, stringsAsFactors = F) # Make sure to use strings not factors
df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = F)
Code:
library(dplyr) # for %>%, select() and distinct()
library(tidyr)
results <- crossing(df, df2)
select_columns <- c("A", "B", "C")
for(col in select_columns) {
keep <- is.na(results[[col]]) | results[[col]] == results[[paste0(col, 1)]]
results <- results[keep,, drop=F]
}
results <- results %>% dplyr::select(id, A:C) %>% distinct
results
id A B C
1 id1 E T <NA>
2 id4 <NA> T <NA>
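For a single-row df2, the same "NA acts as a wildcard" filter can also be sketched in base R without the cross join. This is only an illustration under that single-row assumption; the names match_cols, per_col and keep are made up for the example.

```r
# Same toy data as above; df2 is assumed to have exactly one row.
df <- data.frame(
  id = c("id1", "id2", "id3", "id4"),
  A = c("E", "F", "G", NA),
  B = c("T", NA, "N", "T"),
  C = c(NA, "T", "U", NA),
  stringsAsFactors = FALSE
)
df2 <- data.frame(A = "E", B = "T", C = "M", stringsAsFactors = FALSE)

match_cols <- c("A", "B", "C")  # hypothetical name: columns to compare

# For each column: TRUE where df is NA (treated as a wildcard)
# or where the df value equals the single df2 value.
per_col <- lapply(match_cols, function(col) {
  is.na(df[[col]]) | df[[col]] == df2[[col]]
})

# A row is kept only if every compared column passes.
keep <- Reduce(`&`, per_col)
result <- df[keep, , drop = FALSE]
result
```

Because `TRUE | NA` evaluates to TRUE in R, the NA comparisons never leak into keep.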

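If you only need to do this for a single set of values, this is probably the most straightforward approach (d is assumed here to be a data.table built from df; with a plain data frame, write df$A and so on):
d[A %in% c("E", NA) & B %in% c("T", NA) & C %in% c("M", NA), ]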

Another example using tidyverse and base (dplyr, tidyr, base):
Here I convert your df2 into a data frame that contains every combination of values you want to accept, i.e. (E or NA) & (T or NA) & (M or NA), and then do an inner join against this full set. There are other ways to build a data frame of all possible combinations, but this one is fairly easy with tidyr.
library(dplyr)
library(tidyr)
id <- c("id1", "id2", "id3", "id4")
A <- c("E", "F", "G", NA)
B <- c("T", NA, "N", "T")
C <- c(NA, "T", "U", NA)
df <- data.frame(A, B, C, stringsAsFactors = FALSE)
df2 <- data.frame(A = "E", B = "T", C = "M",stringsAsFactors = FALSE)
df2_expanded <- df2 %>%
rowwise() %>%
mutate(combinations = list(expand.grid(A = c(A,NA),B = c(B,NA),C = c(C,NA),stringsAsFactors = FALSE))) %>%
select(-A,-B,-C) %>%
unnest(combinations)
# A tibble: 8 x 3
# A B C
# <chr> <chr> <chr>
# 1 E T M
# 2 NA T M
# 3 E NA M
# 4 NA NA M
# 5 E T NA
# 6 NA T NA
# 7 E NA NA
# 8 NA NA NA
df %>%
inner_join(df2_expanded)
# A B C
# 1 E T <NA>
# 2 <NA> T <NA>

Related

Merge three Variables to one and replicate observations

I have a Dataframe which looks like the following:
B <- data.frame(
nr=c(1,2,3,4,5),
A=c('a','b','c','d','e'),
B=c("s", "t", "i", "u", "z"),
B1=c("", "v", "", "", ""),
B2 =c("", "g", "", "", ""))
B <- B %>% mutate_all(na_if,"")
Since my variables B1 and B2 only have one value each, I would like to merge B1 and B2 into the variable B. This should create two new observations, replicating every other variable of that observation.
It should look like the following:
B <- data.frame(
nr=c(1,2,2, 2, 3,4,5),
A=c("a","b", "b", "b", "c","d","e"),
B=c("s", "v", "g", "t", "i", "u", "z"))
Thanks for your help!!
Reshape to 'long' format with pivot_longer on the 'B' columns and drop the NA values with values_drop_na = TRUE:
library(dplyr)
library(tidyr)
B %>%
pivot_longer(cols = starts_with("B"), values_to = "B",
values_drop_na = TRUE, names_to = NULL)
Output:
# A tibble: 7 × 3
nr A B
<dbl> <chr> <chr>
1 1 a s
2 2 b t
3 2 b v
4 2 b g
5 3 c i
6 4 d u
7 5 e z

R script to generate all combinatorics of two identical lists including incomplete lists

I think this problem can be solved in many different ways, but I basically want to find a function that will give me a dataframe with every combination of values from a list into its columns, including the incomplete sets and excluding some, but not all, redundant combinations (order isn't important for now).
So I might start out with a list like this:
List = c("A","B","C")
and I want to get a dataframe that looks like
C1 = c("A","B","C","A","A","B","A")
C2 = c("","","","B","C","C","B")
C3 = c("","","","","","","C")
df <- cbind(C1, C2, C3)
row.names(df) <- c("A", "B", "C", "AB", "AC", "BC", "ABC")
colnames(df) <- c("First_Item", "Second_Item","Third_Item")
And then it fills in each cell with the corresponding letter.
e.g. position A1 in the df would be "A", positions A2 and A3 would be empty.
any idea how to do this?
I tried with tidyr:
library(tidyr)
list_1 = c("A", "B", "C", "NA")
list_2 = c("A", "B", "C", "NA")
list_3 = c("A", "B", "C", "NA")
list_4 = c("A", "B", "C", "NA")
test <- crossing(list_1, list_2,list_3,list_4)
test <- test[apply(test, MARGIN = 1, FUN = function(x) !any(duplicated(x))), ]
But I want to keep all the values with multiple NAs in them, so this doesn't quite work.
expand.grid has the same problem
expand.grid(list_1 = c("A", "B", "C", "NA"),list_2 = c("A", "B", "C", "NA"),list_3 = c("A", "B", "C", "NA"),list_4 = c("A", "B", "C", "NA"))
That's basically Roland's answer:
library(magrittr) # just for the pipe-operator
List %>%
seq_along() %>%
lapply(combn, x = List, simplify = FALSE) %>%
unlist(recursive = FALSE) %>%
sapply(`length<-`, length(List)) %>%
t() %>%
data.frame()
returns
X1 X2 X3
1 A <NA> <NA>
2 B <NA> <NA>
3 C <NA> <NA>
4 A B <NA>
5 A C <NA>
6 B C <NA>
7 A B C
Furthermore, you could use the dplyr and tidyr packages to replace the NAs. Just add one more function to the pipe:
mutate(across(everything(), replace_na, ""))
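Putting the pieces together, here is a base R sketch of the combn approach that also produces the empty strings, row names and column names from the question. The helper names combos and mat are made up for this illustration.

```r
List <- c("A", "B", "C")

# All combinations of size 1, 2, ..., length(List), as one flat list of vectors.
combos <- unlist(
  lapply(seq_along(List), function(m) combn(List, m, simplify = FALSE)),
  recursive = FALSE
)

# Pad every combination to full length with NA, then bind them as rows.
mat <- t(sapply(combos, `length<-`, length(List)))

# Replace the NA padding with empty strings, as in the desired output.
mat[is.na(mat)] <- ""

# Label rows by the combination they hold and columns as in the question.
rownames(mat) <- sapply(combos, paste, collapse = "")
colnames(mat) <- c("First_Item", "Second_Item", "Third_Item")
mat
```

The result is a character matrix, like the cbind() result in the question; wrap it in data.frame() if a data frame is needed.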
Here is my approach:
library(purrr)
List <- c("xA","xB","xC") # arbitrary as per request in comments
seq_along(List) %>% # h/t @MartinGal
map(~ combn(List, m = .x) %>%
apply(2, paste, collapse = "<!>")) %>%
unlist() %>%
tibble::tibble() %>%
tidyr::separate(1, into = c("First_Item", "Second_Item", "Third_Item"),
sep = "<!>")
Returns:
# A tibble: 7 x 3
First_Item Second_Item Third_Item
<chr> <chr> <chr>
1 xA NA NA
2 xB NA NA
3 xC NA NA
4 xA xB NA
5 xA xC NA
6 xB xC NA
7 xA xB xC

left_join in a for loop with different columns names

I have a data.frame called a whose structure is similar to:-
a <- data.frame(X1=c("A", "B", "C", "A", "C", "D"),
X2=c("B", "C", "D", "A", "B", "A"),
X3=c("C", "D", "A", "B", "A", "B")
)
And I have another set which is:-
b <- data.frame(Xn=c("A", "B", "C", "D"),
Feature=c("some", "more", "what", "why"))
I want to add all the Features from set b to set a, such that X1, X2 and X3 have their corresponding feature column in set a. In other words, the columns in set a become:-
colnames(a) <- c("X1", "X2", "X3", "Features1", "Features2", "Features3")
How can I do this using a left_join in a for loop??
In base R, we can unlist a dataframe and match it with b$Xn to get corresponding Feature value. We can cbind this dataframe to original dataframe to get final answer.
temp <- a
temp[] <- b$Feature[match(unlist(temp), b$Xn)]
names(temp) <- paste0('Feature', seq_along(temp))
cbind(a, temp)
# X1 X2 X3 Feature1 Feature2 Feature3
#1 A B C some more what
#2 B C D more what why
#3 C D A what why some
#4 A A B some some more
#5 C B A what more some
#6 D A B why some more
In tidyverse, we can get the data in long format, join the data and get it back to wide format.
library(dplyr)
library(tidyr)
a %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
left_join(b, by = c('value' = 'Xn')) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = Feature) %>%
select(-row) %>%
rename_all(~paste0('Feature', seq_along(.))) %>%
bind_cols(a, .)
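Since the question asks specifically for a left_join in a for loop, here is a minimal sketch of that variant. It assumes the keys in b$Xn are unique (otherwise the join would duplicate rows); the names result and lookup are made up for the example.

```r
library(dplyr)

a <- data.frame(X1 = c("A", "B", "C", "A", "C", "D"),
                X2 = c("B", "C", "D", "A", "B", "A"),
                X3 = c("C", "D", "A", "B", "A", "B"),
                stringsAsFactors = FALSE)
b <- data.frame(Xn = c("A", "B", "C", "D"),
                Feature = c("some", "more", "what", "why"),
                stringsAsFactors = FALSE)

result <- a
for (i in seq_along(names(a))) {
  col <- names(a)[i]
  # Rename b's columns so the key matches the current column of a
  # and the value column gets its numbered name (Features1, Features2, ...).
  lookup <- setNames(b, c(col, paste0("Features", i)))
  result <- left_join(result, lookup, by = col)
}
result
```

Renaming the lookup table inside the loop avoids the differing-key-names problem, so by can stay a single column name.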
This can be done by using mutate_all to recode all of the columns in a:
library(tidyverse)
a %>%
mutate_all(funs(feat=recode(., !!!set_names(as.character(b$Feature), b$Xn))))
X1 X2 X3 X1_feat X2_feat X3_feat
1 A B C some more what
2 B C D more what why
3 C D A what why some
4 A A B some some more
5 C B A what more some
6 D A B why some more
You can add a rename_at to get the desired names:
a %>%
mutate_all(funs(f=recode(., !!!set_names(as.character(b$Feature), b$Xn)))) %>%
rename_at(vars(matches("f")), ~gsub(".([0-9]).*", "Feature\\1", .))
X1 X2 X3 Feature1 Feature2 Feature3
1 A B C some more what
2 B C D more what why
3 C D A what why some
4 A A B some some more
5 C B A what more some
6 D A B why some more

stringr: replace string by supplying pattern through a character vector

Here is my data:
df <- tibble::tribble(
~A, ~B,
"C", "G",
"D", "H",
"E", "I",
"F", "J")
value1 <- "D"
value2 <- "C"
And, in variable A, I want to replace D and C with "m" and "n", something like this, but it's not working!
df %>% mutate(X = A %>% str_replace_all(c(value1 = "m", value2 = "n")))
My desired output is:
df %>% mutate(X = A %>% str_replace_all(c("D" = "m", "C" = "n")))
But instead of supplying "D" and "C" manually, I want to programmatically supply these, something in line with...using value1 and value2.
How should I do that?
You could try using setNames to set the names of m and n like:
library(dplyr)
library(stringr)
df %>% mutate(X = A %>% str_replace_all(setNames(c("m","n"), c(value1, value2))))
# A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
#1 C G n
#2 D H m
#3 E I E
#4 F J F
And then checking that it's equal to your desired result:
identical(
df %>% mutate(X = A %>% str_replace_all(c("D" = "m", "C" = "n"))),
df %>% mutate(X = A %>% str_replace_all(setNames(c("m","n"), c(value1, value2)))))
#[1] TRUE
I also included the other packages you use: dplyr and stringr
You can create a named vector and use it as a replacement lookup.
replacementVector <- c("m","n")
names(replacementVector) <- c("D","C")
Now, use the replacementVector in dplyr chain along with ifelse as:
df %>% mutate(X = ifelse(is.na(replacementVector[A]), A, replacementVector[A]))
# # A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
# 1 C G n
# 2 D H m
# 3 E I E
# 4 F J F
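To keep this programmatic, the names of the lookup vector can come from value1 and value2 instead of typed literals. A base R sketch of that idea (lookup and X are illustrative names):

```r
value1 <- "D"
value2 <- "C"
A <- c("C", "D", "E", "F")

# Named vector: names are the values to find, elements are the replacements.
lookup <- setNames(c("m", "n"), c(value1, value2))

# Replace only where A has an entry in the lookup; keep A otherwise.
X <- unname(ifelse(A %in% names(lookup), lookup[A], A))
X
```

Testing membership with %in% rather than is.na() makes the intent explicit and also behaves sensibly if A itself contains NA.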
Data:
library(tidyverse)
df <- tibble::tribble(
~A, ~B,
"C", "G",
"D", "H",
"E", "I",
"F", "J")
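str_replace_all is vectorized over string, pattern and replacement, so you can also pass the patterns and replacements as two separate vectors. Note that they are then paired element-wise with A rather than used as a lookup table, so the result depends on how the vectors line up with the rows, and recent stringr versions may refuse to recycle vectors of different lengths; the named-vector form above is the safer choice: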
df %>% mutate(X = A %>% str_replace_all(c("C","D"), c("m","n")))
We could use chartr
df %>%
mutate(X = chartr('DC', 'mn', A))
# A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
#1 C G n
#2 D H m
#3 E I E
#4 F J F

How do I assign one value to multiple factor levels at once in R?

I have a data frame with two columns, last name and a new column where I want to assign "Yes" or "No".
df <- data.frame(x = as.factor(c("a","b","c","d","e")), y = NA)
colnames(df)<-c("last name", "newCol")
Let's say I want to assign "No" in newCol for "a", "c", "e". How do I assign "No" to all three factor levels at once?
I tried using the logical operator, |, with the factor levels, but it did not work. I have been able to do it by assigning "No" to individual levels, but I find it inefficient.
You can use %in% to check for "a", "c", "e" in one step:
df$newCol[df$`last name` %in% c("a", "c", "e")] <- "No"
df;
#last name newCol
#1 a No
#2 b <NA>
#3 c No
#4 d <NA>
#5 e No
Or using a tidyverse approach with ifelse (NA_character_ gives a real missing value rather than the string "NA"):
library(tidyverse)
df %>% mutate(newCol = ifelse(`last name` %in% c("a", "c", "e"), "No", NA_character_))
# last name newCol
#1 a No
#2 b <NA>
#3 c No
#4 d <NA>
#5 e No
or slightly different using replace:
df %>% mutate(newCol = replace(newCol, `last name` %in% c("a", "c", "e"), "No"))
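If the goal is literally to give several factor levels the same label, the levels themselves can be reassigned in one step. A minimal base R sketch (x and y are illustrative names, not the columns from the question):

```r
x <- factor(c("a", "b", "c", "d", "e"))

# Work on a copy so the original factor stays intact.
y <- x

# Assign the same label to several levels at once;
# R merges levels that end up with identical labels.
levels(y)[levels(y) %in% c("a", "c", "e")] <- "No"

as.character(y)
levels(y)
```

This changes the factor itself, whereas the answers above fill a separate column and leave `last name` untouched.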
