Value matching with NA - missing values - using mutate

I am somewhat stuck. Is there a better way than the below to do value matching considering NAs as "real values" within mutate?
library(dplyr)
data_foo <- data.frame(A= c(1:2, NA, 4, NA), B = c(1, 3, NA, NA, 4))
Not the desired output:
data_foo %>% mutate(irr = A==B)
#> A B irr
#> 1 1 1 TRUE
#> 2 2 3 FALSE
#> 3 NA NA NA
#> 4 4 NA NA
#> 5 NA 4 NA
data_foo %>% rowwise() %>% mutate(irr = A%in%B)
#> Source: local data frame [5 x 3]
#> Groups: <by row>
#>
#> # A tibble: 5 x 3
#> A B irr
#> <dbl> <dbl> <lgl>
#> 1 1 1 TRUE
#> 2 2 3 FALSE
#> 3 NA NA FALSE
#> 4 4 NA FALSE
#> 5 NA 4 FALSE
Desired output: The code below produces the desired column, irr. I am using these somewhat cumbersome helper columns. Is there a shorter way?
data_foo %>%
mutate(NA_A = is.na(A),
NA_B = is.na(B),
irr = if_else(is.na(A)|is.na(B), NA_A == NA_B, A == B))
#> A B NA_A NA_B irr
#> 1 1 1 FALSE FALSE TRUE
#> 2 2 3 FALSE FALSE FALSE
#> 3 NA NA TRUE TRUE TRUE
#> 4 4 NA FALSE TRUE FALSE
#> 5 NA 4 TRUE FALSE FALSE

Using map2
library(tidyverse)
data_foo %>%
mutate(irr = map2_lgl(A, B, `%in%`))
# A B irr
#1 1 1 TRUE
#2 2 3 FALSE
#3 NA NA TRUE
#4 4 NA FALSE
#5 NA 4 FALSE
Or with setequal
data_foo %>%
rowwise %>%
mutate(irr = setequal(A, B))
The above methods are concise, but they loop over the rows. Instead, we can replace the NA with a sentinel value (one that does not occur in the data) and then do the ==
data_foo %>%
mutate_all(list(new = ~ replace_na(., -999))) %>%
transmute(A, B, irr = A_new == B_new)
# A B irr
#1 1 1 TRUE
#2 2 3 FALSE
#3 NA NA TRUE
#4 4 NA FALSE
#5 NA 4 FALSE
Or with bind_cols and reduce
data_foo %>%
mutate_all(replace_na, -999) %>%
reduce(`==`) %>%
bind_cols(data_foo, irr = .)

Maybe simpler than akrun's answer?
Either of the two ways below will produce the expected result. Note that as.character won't do it, because the return value of as.character(NA) is NA_character_.
data_foo %>%
mutate(irr = paste(A) == paste(B))
data_foo %>%
mutate(irr = sQuote(A) == sQuote(B))
#Source: local data frame [5 x 3]
#Groups: <by row>
#
## A tibble: 5 x 3
# A B irr
# <dbl> <dbl> <lgl>
#1 1 1 TRUE
#2 2 3 FALSE
#3 NA NA TRUE
#4 4 NA FALSE
#5 NA 4 FALSE
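The difference between as.character and paste can be checked directly in base R. The snippet below is a small sketch, not part of the original answer; the last line illustrates a caveat that only matters for character columns that may contain the literal string "NA".

```r
# as.character() keeps NA missing, so the comparison is still NA
as.character(NA) == as.character(NA)

# paste() turns NA into the string "NA", so the comparison is TRUE
paste(NA) == paste(NA)

# Caveat: a missing value and the literal string "NA" become indistinguishable
paste(NA_character_) == paste("NA")
```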
Edit.
Following the comments below I have updated the code and it now follows akrun's suggestion.
There is also the excellent idea in tmfmnk's answer. I use a similar one in yet another way of solving the question's problem.
The documentation of all.equal says that
Do not use all.equal directly in if expressions—either use
isTRUE(all.equal(....)) or identical if appropriate.
Though there is no if expression in mutate, I believe that it is more stable than identical and has the same effect if the values being compared are (sort of/in fact) equal.
data_foo %>%
rowwise() %>%
mutate(irr = isTRUE(all.equal(A, B))) %>%
ungroup()
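To illustrate the remark about all.equal versus identical, here is a small base R sketch on single values (the variable x is only for illustration). all.equal compares numbers up to a tolerance and returns a character description when they differ, which is why it is wrapped in isTRUE. Both functions compare whole objects, so inside mutate they have to be applied one row at a time.

```r
x <- 0.1 + 0.2

# identical() demands exact equality, so floating-point noise makes it FALSE
identical(x, 0.3)

# all.equal() allows a small tolerance
isTRUE(all.equal(x, 0.3))

# A missing value against a number: all.equal() returns a message, not FALSE
res <- all.equal(4, NA_real_)
is.character(res)
isTRUE(res)

# Two missing values are treated as equal
isTRUE(all.equal(NA_real_, NA_real_))
```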

Could also be a possibility:
data_foo %>%
rowwise() %>%
mutate(irr = identical(A, B)) %>%
ungroup()
A B irr
<dbl> <dbl> <lgl>
1 1 1 TRUE
2 2 3 FALSE
3 NA NA TRUE
4 4 NA FALSE
5 NA 4 FALSE

The coalesce function is useful if you want to perform an action when a value is NA
data_foo %>%
mutate(irr = coalesce(A == B, is.na(A) & is.na(B)))
# A B irr
# 1 1 1 TRUE
# 2 2 3 FALSE
# 3 NA NA TRUE
# 4 4 NA FALSE
# 5 NA 4 FALSE
Same thing for > 2 columns
data_foo %>%
mutate(irr = coalesce(reduce(., `==`), rowMeans(is.na(.)) == 1))
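For two columns, the same logic can be wrapped in a small base R helper. This is a minimal sketch; na_equal is a hypothetical name chosen here, not a function from dplyr or base R.

```r
# Element-wise equality that treats two NAs as equal
# and an NA against a value as not equal.
na_equal <- function(x, y) {
  both_na <- is.na(x) & is.na(y)
  eq <- x == y
  eq[is.na(eq)] <- FALSE   # NA comparisons count as "not equal" ...
  eq | both_na             # ... unless both sides are missing
}

data_foo <- data.frame(A = c(1:2, NA, 4, NA), B = c(1, 3, NA, NA, 4))
data_foo$irr <- na_equal(data_foo$A, data_foo$B)
data_foo
```

Because the helper is vectorised, it can also be used inside mutate as mutate(irr = na_equal(A, B)) without rowwise().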

Related

Generating True/False column in one dataframe based on multiple conditions in another dataframe in R

I have two dataframes of different length. The first looks like this and is the dataframe I want to add the True/False column to:
chr_snp loc_snp ma_snp
1 184319928 T
1 276998062 A
1 278255864 G
2 243012470 G
2 123072103 T
3 526785124 A
The second data frame is the reference dataframe that is smaller:
chr_QTL loc_QTL ma_QTL
1 281788173 G
1 203085725 C
2 241577141 C
For each row in dataframe 1 (df1), I want to first check if the value of df1$chr_snp matches a value in df2$chr_QTL. If this match is true, then I want to determine if the value in df1$loc_snp is within 10 million units (these are DNA base-pairs) above OR below any of the df2$loc_QTL values in the rows that satisfied the first condition. Now, what is tricky is that for the first three rows of df1, there are two possible row matches in df2 (rows 1 and 2) based on the first criterion alone. However, only two of those three rows of df1 also match based on the second criterion (within 10M base-pairs above OR below a value in df2$loc_QTL). Note: df1$ma_snp and df2$ma_QTL can be totally ignored. So, based on these criteria, df1 should now look like:
chr_snp loc_snp ma_snp Match
1 184319928 T FALSE
1 276998062 A TRUE
1 278255864 G TRUE
2 243012470 G TRUE
2 123072103 T FALSE
3 526785124 A FALSE

Here is one option. Join the data then look for any instance of < 10,000,000.
library(tidyverse)
left_join(df1 |>
mutate(rw_id = row_number()),
df2, by = c("chr_snp" = "chr_QTL")) |>
mutate(less = abs(loc_snp -loc_QTL) < 10e6) |>
group_by(rw_id)|>
summarise(across(contains(colnames(df1)), ~.[[1]]),
Match = any(less),
Match = ifelse(is.na(Match), FALSE, Match))
#> # A tibble: 6 x 5
#> rw_id chr_snp loc_snp ma_snp Match
#> <int> <dbl> <dbl> <chr> <lgl>
#> 1 1 1 184319928 T FALSE
#> 2 2 1 276998062 A TRUE
#> 3 3 1 278255864 G TRUE
#> 4 4 2 243012470 G TRUE
#> 5 5 2 123072103 T FALSE
#> 6 6 3 526785124 A FALSE
or another option:
library(tidyverse)
df1 |>
mutate(Match = map2_lgl(chr_snp, loc_snp,
\(x, y){
(x %in% df2$chr_QTL) &&
any(abs(df2[df2$chr_QTL == x, 'loc_QTL']-y) < 10e6)
}))
#> # A tibble: 6 x 4
#> chr_snp loc_snp ma_snp Match
#> <dbl> <dbl> <chr> <lgl>
#> 1 1 184319928 T FALSE
#> 2 1 276998062 A TRUE
#> 3 1 278255864 G TRUE
#> 4 2 243012470 G TRUE
#> 5 2 123072103 T FALSE
#> 6 3 526785124 A FALSE
Just for fun, here is a third option with base R:
df1$Match <- apply(outer(df1$loc_snp, df2$loc_QTL, \(x,y) abs(x-y) < 10e6) &
outer(df1$chr_snp, df2$chr_QTL, `==`), 1, any)
df1
#> # A tibble: 6 x 4
#> chr_snp loc_snp ma_snp Match
#> <dbl> <dbl> <chr> <lgl>
#> 1 1 184319928 T FALSE
#> 2 1 276998062 A TRUE
#> 3 1 278255864 G TRUE
#> 4 2 243012470 G TRUE
#> 5 2 123072103 T FALSE
#> 6 3 526785124 A FALSE

One way, testing each row of df1 against every row of df2:
df1$Match <- vapply(seq_len(NROW(df1)), function(i)
any(df2$chr_QTL == df1$chr_snp[i] & abs(df2$loc_QTL - df1$loc_snp[i]) < 1e7),
logical(1))
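To try any of the approaches above, the two example tables can be rebuilt from the question as plain data frames. The sketch below is an illustration only: the name window is an assumption introduced here, and the mapply call is just one more way to express the same two conditions.

```r
# Example data copied from the question
df1 <- data.frame(
  chr_snp = c(1, 1, 1, 2, 2, 3),
  loc_snp = c(184319928, 276998062, 278255864, 243012470, 123072103, 526785124),
  ma_snp  = c("T", "A", "G", "G", "T", "A")
)
df2 <- data.frame(
  chr_QTL = c(1, 1, 2),
  loc_QTL = c(281788173, 203085725, 241577141),
  ma_QTL  = c("G", "C", "C")
)

window <- 1e7  # 10 million base-pairs (hypothetical name)

# TRUE if any QTL is on the same chromosome and within the window
df1$Match <- mapply(
  function(chr, loc) any(df2$chr_QTL == chr & abs(df2$loc_QTL - loc) < window),
  df1$chr_snp, df1$loc_snp
)
df1
```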

Another "how to deal with NAs in logical statements" question

Short version: I need to get a results column r like this, ideally using dplyr (but happy for base R as well):
d <- tibble(c1 = c(T,T,F,T,F,NA), c2 = c(T,F,F,F,F,NA), c3 = c(T,F,F,NA,NA,NA))
d %>% rowwise() %>% mutate(r = something())
# A tibble: 6 x 4
c1 c2 c3 r
<lgl> <lgl> <lgl> <lgl>
1 TRUE TRUE TRUE TRUE
2 TRUE FALSE FALSE TRUE
3 FALSE FALSE FALSE FALSE
4 TRUE FALSE NA TRUE
5 FALSE FALSE NA FALSE
6 NA NA NA NA
I understand why NA|FALSE == NA. Each TRUE/FALSE in this table is the result of a comparison, and I would really like to keep the syntax as short as possible.
Long version:
I have survey results, and need to create a summary of three questions asking for the primary, secondary and tertiary 'route to something' (there are more than 3 levels in reality). The summary should tell me, for each respondent, whether they made use of route A, route B, etc. Not all respondents filled in all questions, so there might be NAs. Some respondents didn't answer any of the questions at all, and their summary should be NA. So I have:
df <- tibble(primary = c("C", "A", "B", "D", NA),
secondary = c("B", "D", "C", NA, NA),
tertiary = c("A", "E", NA, NA, NA))
# I think I need something along these lines:
df <- df %>% rowwise() %>%
mutate(
routeA = (primary == "A") | (secondary == "A") | (tertiary == "A") ...
routeB = ....
)
# Result expected
df
# A tibble:
primary secondary tertiary routeA routeB ...
<chr> <chr> <chr> <lgl> <lgl>
C B A TRUE TRUE
A D E TRUE FALSE
B C NA FALSE TRUE
D NA NA FALSE FALSE
NA NA NA NA NA

You can do this relatively efficiently with apply and match from base R:
f <- function(x, levels) {
if (all(is.na(x))) {
rep.int(NA, length(levels))
} else {
as.logical(match(levels, x, 0L))
}
}
lv <- LETTERS[1:5]
df[paste0("route", lv)] <- t(apply(df, 1L, f, levels = lv))
df
## # A tibble: 5 × 8
## primary secondary tertiary routeA routeB routeC routeD routeE
## <chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 C B A TRUE TRUE TRUE FALSE FALSE
## 2 A D E TRUE FALSE FALSE TRUE TRUE
## 3 B C NA FALSE TRUE TRUE FALSE FALSE
## 4 D NA NA FALSE FALSE FALSE TRUE FALSE
## 5 NA NA NA NA NA NA NA NA
I say "relatively" because rowwise operations on data frames tend to be less efficient than rowwise operations on matrices, requiring coercions to and from matrix or reshaping to and from long format.
This case is no exception, as apply coerces df from data frame to matrix and the assignment coerces the result of t from matrix to data frame.
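One way to avoid the row-by-row apply is to coerce to a matrix once and then work column-wise over the route levels. The following is a sketch under that assumption, not the answer's own code; it uses a plain data.frame, and the names m, routes and result are illustrative.

```r
df <- data.frame(
  primary   = c("C", "A", "B", "D", NA),
  secondary = c("B", "D", "C", NA, NA),
  tertiary  = c("A", "E", NA, NA, NA),
  stringsAsFactors = FALSE
)
lv <- LETTERS[1:5]

m <- as.matrix(df)  # single coercion to a character matrix

# One logical column per level: does the level appear anywhere in the row?
routes <- sapply(lv, function(l) rowSums(m == l, na.rm = TRUE) > 0)

# Respondents who answered nothing get NA instead of FALSE
routes[rowSums(!is.na(m)) == 0, ] <- NA

colnames(routes) <- paste0("route", lv)
result <- cbind(df, routes)
result
```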

Suboptimal:
my_match <- function(x, val) {
if (all(is.na(x))) return(NA)
return(any(na.omit(x) == val))
}
df %>% rowwise() %>% mutate(rA = my_match(c_across(where(is.character)), "A"),
rB = my_match(c_across(where(is.character)), "B"))
To be improved: this won't scale well to larger numbers of routes, and there is too much repeated code (another way of saying the same thing). I'm not quite sure how to create a function/shortcut version of this (I could loop over the possible routes, adding one column at a time, but I don't feel like going quite as far as necessary down the rlang/tidy-evaluation/NSE rabbit hole right now ...)
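The idea of looping over the possible routes and adding one column at a time can be sketched in base R without any tidy evaluation. This reuses my_match from above on a plain data.frame; answer_cols is a name introduced here for illustration.

```r
my_match <- function(x, val) {
  if (all(is.na(x))) return(NA)
  any(na.omit(x) == val)
}

df <- data.frame(
  primary   = c("C", "A", "B", "D", NA),
  secondary = c("B", "D", "C", NA, NA),
  tertiary  = c("A", "E", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Fix the answer columns up front so newly added columns are not scanned
answer_cols <- names(df)

for (route in LETTERS[1:5]) {
  df[[paste0("route", route)]] <- apply(df[answer_cols], 1, my_match, val = route)
}
df
```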

As mentioned in the comments, this is straightforward when the data is reshaped to long format and then back to wide.
library(tidyr)
library(dplyr)
library(tibble)
df <- df %>%
rowid_to_column()
df %>%
pivot_longer(-rowid) %>%
filter(!is.na(value)) %>%
pivot_wider(id_cols = rowid, names_from = value, values_fill = FALSE, values_fn = ~ TRUE, names_sort = TRUE) %>%
left_join(df, ., by = "rowid")
# A tibble: 5 x 9
rowid primary secondary tertiary A B C D E
<int> <chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 1 C B A TRUE TRUE TRUE FALSE FALSE
2 2 A D E TRUE FALSE FALSE TRUE TRUE
3 3 B C NA FALSE TRUE TRUE FALSE FALSE
4 4 D NA NA FALSE FALSE FALSE TRUE FALSE
5 5 NA NA NA NA NA NA NA NA

Another idea is:
ans = unclass(table(row(df), unlist(df)))
ans
# A B C D E
# 1 1 1 1 0 0
# 2 1 0 0 1 1
# 3 0 1 1 0 0
# 4 0 0 0 1 0
# 5 0 0 0 0 0
Missing values can also be filled in where appropriate (here, for rows with no answers at all):
ans[!rowSums(ans)] = NA
ans

How to remove columns full of only NA values

Here is an example of the output when I run the is.na() function on my data:
start_lat start_lng end_lat end_lng member_casual ride_length day_of_week X X.1 X.2
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The "x", "x.1", and "x.2" columns are added to my dataframe and I don't know where they came from. I used the na.omit function, but the columns are not recognized. In other words, they are not valid names. Can someone please help me remove these columns from my dataframe?

## figure out which columns are all NA values
all_na_cols = sapply(your_data, \(x) all(is.na(x)))
## drop them
your_data = your_data[!all_na_cols]
Running na.omit() on a data frame will drop rows if they have one or more NA values in them, so not what you want to do here.
The "x", "x.1", and "x.2" columns are added to my dataframe and I don't know where they came from.
That would worry me a lot. If I were you, I'd go back in your script and run it one line at a time until I found out where those columns came from, and then I'd solve the source of problem there rather than putting a bandage on it here.
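One possible source, offered only as an assumption since the question does not show how the data was read: a CSV file whose lines end in extra separators. read.csv gives such unnamed, empty columns the names X, X.1, X.2 and fills them with NA. The sketch below reproduces that with made-up data (csv_text is hypothetical) and then drops the columns as shown above.

```r
# Hypothetical CSV content with three trailing empty fields per line
csv_text <- "a,b,,,\n1,2,,,\n3,4,,,\n"

d <- read.csv(text = csv_text)
names(d)  # the unnamed columns are auto-named by read.csv

# Identify and drop the columns that contain only NA
all_na_cols <- vapply(d, function(x) all(is.na(x)), logical(1))
d_clean <- d[!all_na_cols]
names(d_clean)
```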

A tidyverse solution
Using dplyr::select()
Make some dummy data:
require(dplyr)
myData <- tibble(a = c(1,2,3,4), b = c("a", "b", "c", "d"),
c = c(NA, NA, NA, NA), d = c(NA, "not_na", "not_na", NA))
myData
#> # A tibble: 4 x 4
#> a b c d
#> <dbl> <chr> <lgl> <chr>
#> 1 1 a NA <NA>
#> 2 2 b NA not_na
#> 3 3 c NA not_na
#> 4 4 d NA <NA>
Select only the columns that are not all NA
myNewData <- select(myData, where(function(x) !all(is.na(x))))
myNewData
#> # A tibble: 4 x 3
#> a b d
#> <dbl> <chr> <chr>
#> 1 1 a <NA>
#> 2 2 b not_na
#> 3 3 c not_na
#> 4 4 d <NA>
Created on 2022-02-16 by the reprex package (v2.0.1)

Find the length of consecutive numbers in R

I have a data.frame that looks like this:
data=data.frame(position=c(1,2,3,1,1,4,5,6,7,8,2,2),
name=c("A","B","C","A","A","D","E","F","G","H","B","B"))
position name
1 A
2 B
3 C
1 A
1 A
4 D
5 E
6 F
7 G
8 H
2 B
2 B
I would like to be able to identify in the column "position" all the consecutive intervals
and then paste into a new column the length of each interval.
I would like my data to look somehow like this.
position length
1 - 3 3
4 - 8 5
Any help and comment are highly appreciated

Here is a base R solution.
Create a column, sequence, which indicates whether each row continues a run, i.e. whether its position is exactly one more than that of the previous row. The first row has no predecessor, so it is set to FALSE.
data$sequence <- c(NA, head(data$position, -1)) + 1 == data$position
data$sequence[[1]] <- FALSE
data
#> position name sequence
#> 1 1 A FALSE
#> 2 2 B TRUE
#> 3 3 C TRUE
#> 4 1 A FALSE
#> 5 1 A FALSE
#> 6 4 D FALSE
#> 7 5 E TRUE
#> 8 6 F TRUE
#> 9 7 G TRUE
#> 10 8 H TRUE
#> 11 2 B FALSE
#> 12 2 B FALSE
Use rle to construct the run lengths. Each run of TRUE values ends at i_ends and starts one row before its first TRUE, on the row that opens the interval.
run_lengths <- rle(data$sequence)
i_ends <- cumsum(run_lengths$lengths)[run_lengths$values]
i_starts <- i_ends - run_lengths$lengths[run_lengths$values]
data.frame(
position = paste0(data$position[i_starts], " - ", data$position[i_ends]),
length = i_ends - i_starts + 1
)
#> position length
#> 1 1 - 3 3
#> 2 4 - 8 5

Does this work:
library(dplyr)
library(tidyr)
library(data.table)
data %>% mutate(ID = case_when (position == lead(position) - 1 ~ 1, TRUE ~ 0)) %>%
mutate(ID = case_when(position == lag(position) + 1 ~ 1, TRUE ~ ID)) %>% mutate(r = rleid(ID)) %>% filter(ID == 1) %>%
group_by(r) %>% mutate(position = paste(min(position),max(position), sep = '-'), length = length(unique(name))) %>% ungroup() %>% select(1,5) %>% distinct()
# A tibble: 2 x 2
position length
<chr> <int>
1 1-3 3
2 4-8 5
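A more compact base R variant labels each run with cumsum over the break points and then keeps the runs longer than one element. This is a sketch of that technique rather than either answer's code; run_id, runs and result are names chosen for illustration.

```r
data <- data.frame(position = c(1, 2, 3, 1, 1, 4, 5, 6, 7, 8, 2, 2))

# A new run starts wherever the step from the previous row is not exactly +1
run_id <- cumsum(c(TRUE, diff(data$position) != 1))

# Collect the positions of each run and keep only real intervals
runs <- split(data$position, run_id)
runs <- runs[lengths(runs) > 1]

result <- data.frame(
  position = unname(vapply(runs, function(r) paste(r[1], "-", r[length(r)]),
                           character(1))),
  length   = unname(lengths(runs))
)
result
```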

Binding dataframes with different column names by row

I imported an Excel sheet as a list of dataframes. I want to merge the list into one dataframe. bind_rows() allows me to easily stack the dataframes, but the issue is that I have a variable/column that has a different name in each dataframe. bind_rows() will by default create two separate columns, with NA values for the rows that come from the other dataframes. How can I join these columns?
Sample code:
# Sample dataframes
df1 <- tibble(A = c(1,2,3),
B = c("X","Y","Z"),
C = c(T,F,F)
)
df2 <- tibble(A = c(3,4,5),
B = c("U","V","W"),
D = c(T,T,F)
)
# List of dataframes
my_ls <- list(df1, df2)
my_ls
[[1]]
# A tibble: 3 x 3
A B C
<dbl> <chr> <lgl>
1 1 X TRUE
2 2 Y FALSE
3 3 Z FALSE
[[2]]
# A tibble: 3 x 3
A B D
<dbl> <chr> <lgl>
1 3 U TRUE
2 4 V TRUE
3 5 W FALSE
# Creating joined dataframe:
my_df <- bind_rows(my_ls)
my_df
# Current outcome: A tibble: 6 x 4
A B C D
<dbl> <chr> <lgl> <lgl>
1 1 X TRUE NA
2 2 Y FALSE NA
3 3 Z FALSE NA
4 3 U NA TRUE
5 4 V NA TRUE
6 5 W NA FALSE
The desired outcome:
# Desired outcome: A tibble: 6 x 3
A B C
<dbl> <chr> <lgl>
1 1 X TRUE
2 2 Y FALSE
3 3 Z FALSE
4 3 U TRUE
5 4 V TRUE
6 5 W FALSE
Currently, I've been using mutate() with case_when(), where I check which column is not empty (!is.na()). This works, but I can't help but think there must be an easier way.
# Example using mutate
my_df <- my_df %>%
mutate(
C = case_when(is.na(C) & !is.na(D) ~ D,
!is.na(C) & is.na(D) ~ C,
# The lines below may be a bit redundant for my purpose, since the dataframes either have the C or D variable.
!is.na(C) & !is.na(D) ~ C, # Better would be to return that variable has overlapping information
is.na(C) & is.na(D) ~ NA
)
) %>%
select(-D)
my_df
# A tibble: 6 x 3
A B C
<dbl> <chr> <lgl>
1 1 X TRUE
2 2 Y FALSE
3 3 Z FALSE
4 3 U TRUE
5 4 V TRUE
6 5 W FALSE

You can bind_rows and then select the non-NA value using coalesce:
library(dplyr)
bind_rows(my_ls) %>% mutate(C = coalesce(C, D)) %>% select(A:C)
# A B C
# <dbl> <chr> <lgl>
#1 1 X TRUE
#2 2 Y FALSE
#3 3 Z FALSE
#4 3 U TRUE
#5 4 V TRUE
#6 5 W FALSE

Following the comment by @KarthikS you can rename your columns before binding. My approach using rename_with does not require the columns to be in a specific order. To illustrate this I used somewhat different example dataframes:
library(purrr)
library(dplyr)
d1 <- data.frame(A = 1, B = 2, C = 3)
d2 <- data.frame(A = 4, B = 5, D = 6)
d3 <- data.frame(D = 7, A = 8, B = 9)
d <- list(d1, d2, d3)
map(d, ~ rename_with(.x, ~ "C", matches("^D$"))) %>%
bind_rows()
#> A B C
#> 1 1 2 3
#> 2 4 5 6
#> 3 8 9 7
And now for your dataset:
d <- list(df1, df2)
map(d, ~ rename_with(.x, ~ "C", matches("^D$"))) %>%
bind_rows()
#> # A tibble: 6 x 3
#> A B C
#> <dbl> <chr> <lgl>
#> 1 1 X TRUE
#> 2 2 Y FALSE
#> 3 3 Z FALSE
#> 4 3 U TRUE
#> 5 4 V TRUE
#> 6 5 W FALSE
And if we add an additional one with a different column order:
df3 <- tibble(D = c(T,T,F),
A = c(7,8,9),
B = c("A","B","C"))
d <- list(df1, df2, df3)
map(d, ~ rename_with(.x, ~ "C", matches("^D$"))) %>%
bind_rows()
#> # A tibble: 9 x 3
#> A B C
#> <dbl> <chr> <lgl>
#> 1 1 X TRUE
#> 2 2 Y FALSE
#> 3 3 Z FALSE
#> 4 3 U TRUE
#> 5 4 V TRUE
#> 6 5 W FALSE
#> 7 7 A TRUE
#> 8 8 B TRUE
#> 9 9 C FALSE
Created on 2020-10-16 by the reprex package (v0.3.0)
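The same rename-then-bind idea can be sketched in base R, since rbind on data frames matches columns by name rather than by position. The names renamed and combined are illustrative only.

```r
d1 <- data.frame(A = 1, B = 2, C = 3)
d2 <- data.frame(A = 4, B = 5, D = 6)
d3 <- data.frame(D = 7, A = 8, B = 9)

# Rename D to C wherever it occurs; other data frames are left unchanged
renamed <- lapply(list(d1, d2, d3), function(x) {
  names(x)[names(x) == "D"] <- "C"
  x
})

# rbind aligns the columns by name, so the column order in d3 does not matter
combined <- do.call(rbind, renamed)
combined
```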

Apologies for breaking out of the tidyverse for a quick answer
expl <- read.table(text= " A B C D
1 1 X TRUE NA
2 2 Y FALSE NA
3 3 Z FALSE NA
4 3 U NA TRUE
5 4 V NA TRUE
6 5 W NA FALSE")
expl$E <- ifelse(is.na(expl$C), expl$D, expl$C)
print(expl)
or maybe
expl[,c("C", "D")] %>% rowMeans(na.rm = TRUE) %>% as.logical()
EDIT: Translated the latter to tidy:
expl %>% select("C", "D") %>% rowMeans(na.rm = TRUE) %>% as.logical()
EDIT after first comment:
If you want more control you should probably write the things you want to do in each case in a function similar to the following example:
library(magrittr)
expl <- read.table(text= " A B C D
1 1 X TRUE NA
2 2 Y FALSE NA
3 3 Z FALSE NA
4 3 U NA TRUE
5 4 V NA TRUE
6 5 W NA FALSE
7 7 I NA NA
8 9 J TRUE TRUE")
myfun <- function(a, b){
if(is.na(a) & is.na(b))
return(NA)
if(!is.na(a) & !is.na(b)) {
warning("too much information, a and b set!")
return(NaN)
}
return(max(a, b, na.rm=TRUE))
}
myfun = Vectorize(myfun)
myfun(expl$C, expl$D) %>% as.logical()
