What is the recommended tidy way of replacing NAs in conjunction with a predicate function?
I was hoping to leverage tidyr::replace_na() (or a similar predefined missing value handler) in some way, but I can't seem to get it to work with either a purrr or dplyr way of using predicate functions.
library(magrittr)
# Example data:
df <- tibble::tibble(
id = c(rep("A", 3), rep("B", 3)),
x = c(1, 2, NA, 10, NA, 30),
y = c("a", NA, "c", NA, NA, "f")
)
# Works, but needs manual spec of columns that should be handled:
df %>%
tidyr::replace_na(list(x = 0))
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 1 a
#> 2 A 2 <NA>
#> 3 A 0 c
#> 4 B 10 <NA>
#> 5 B 0 <NA>
#> 6 B 30 f
# Doesn't work (at least not in the intended way):
df %>%
dplyr::mutate_if(
function(.x) inherits(.x, c("integer", "numeric")),
~tidyr::replace_na(0)
)
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 0 a
#> 2 A 0 <NA>
#> 3 A 0 c
#> 4 B 0 <NA>
#> 5 B 0 <NA>
#> 6 B 0 f
# Works, but uses an inline def of the replacement function:
df %>%
dplyr::mutate_if(
function(.x) inherits(.x, c("integer", "numeric")),
function(.x) dplyr::if_else(is.na(.x), 0, .x)
)
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 1 a
#> 2 A 2 <NA>
#> 3 A 0 c
#> 4 B 10 <NA>
#> 5 B 0 <NA>
#> 6 B 30 f
# Works, but uses an inline def of the replacement function:
df %>%
purrr::modify_if(
function(.x) inherits(.x, c("integer", "numeric")),
function(.x) dplyr::if_else(is.na(.x), 0, .x)
)
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 1 a
#> 2 A 2 <NA>
#> 3 A 0 c
#> 4 B 10 <NA>
#> 5 B 0 <NA>
#> 6 B 30 f
Created on 2019-01-21 by the reprex package (v0.2.1)
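In your attempt, ~tidyr::replace_na(0) calls replace_na() on the constant 0 rather than on the column, so every value becomes 0. When using the ~ shorthand, you also need to pass . explicitly as the first argument, i.e.
library(dplyr)
library(tidyr)
df %>%
mutate_if(function(.x) inherits(.x, c("integer", "numeric")),
~ replace_na(., 0))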
# A tibble: 6 x 3
# id x y
# <chr> <dbl> <chr>
#1 A 1 a
#2 A 2 <NA>
#3 A 0 c
#4 B 10 <NA>
#5 B 0 <NA>
#6 B 30 f
otherwise, just do
df %>%
mutate_if(function(.x) inherits(.x, c("integer", "numeric")),
replace_na, replace = 0)
# A tibble: 6 x 3
# id x y
# <chr> <dbl> <chr>
#1 A 1 a
#2 A 2 <NA>
#3 A 0 c
#4 B 10 <NA>
#5 B 0 <NA>
#6 B 30 f
Or, as another variation (note that funs() is deprecated as of dplyr 0.8.0):
df %>%
mutate_if(funs(inherits(., c("integer", "numeric"))),
~ replace_na(., 0))
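With newer dplyr versions, the same idea can be sketched using across() and where(). This assumes dplyr >= 1.0.0, which the original answers (from 2019) predate; the data frame is the one from the question.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  id = c(rep("A", 3), rep("B", 3)),
  x = c(1, 2, NA, 10, NA, 30),
  y = c("a", NA, "c", NA, NA, "f")
)

# Replace NAs with 0 in numeric columns only; character columns are untouched.
res <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

res
```

Here where(is.numeric) plays the role of the predicate, and the lambda passes the column explicitly as .x, just as the mutate_if() version passes it as the dot.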
I have a data frame like the one below.
First I arrange it and then slice it by DPP.
But after arranging and slicing, I can't compute row sums:
card_201406 <- data.frame(ID = c(123, 234, 344, 456, 678, 124, 567, 256, 345),
Block_Code = c("D", "U","Z", "G","T","R","A","U", "B"),
DPP = c(1,2,2,3,3,3,4,5,1),
a = 1:9, a_1 = 1:9, a_2 = 1:9, a_3 = 1:9, a_4 = 1:9)
card_201406 <- card_201406 %>% arrange(DPP) %>% group_by(DPP) %>% slice(1)
card_201406 <- card_201406 %>%
mutate(SUM_a = rowSums(do.call(cbind, select(.,starts_with("a_")))))
RESULT:
Adding missing grouping variables: `DPP`
Error in `mutate()`:
! Problem while computing `SUM_a = rowSums(do.call(cbind, select(.,
starts_with("a_"))))`.
x `SUM_a` must be size 1, not 5.
i The error occurred in group 1: DPP = 1.
Run `rlang::last_error()` to see where the error occurred.
I just want the sum per row.
Thanks for helping
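The error most likely arises because, inside a grouped mutate(), the pipe placeholder . still refers to the whole 5-row data frame, so rowSums() returns 5 values for each 1-row group. You could simplify using across:
library(dplyr, warn.conflicts = FALSE)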
card_201406 %>%
arrange(DPP) %>%
group_by(DPP) %>%
slice(1) %>%
mutate(SUM_a = rowSums(across(starts_with("a_"))))
#> # A tibble: 5 × 9
#> # Groups: DPP [5]
#> ID Block_Code DPP a a_1 a_2 a_3 a_4 SUM_a
#> <dbl> <chr> <dbl> <int> <int> <int> <int> <int> <dbl>
#> 1 123 D 1 1 1 1 1 1 4
#> 2 234 U 2 2 2 2 2 2 8
#> 3 456 G 3 4 4 4 4 4 16
#> 4 567 A 4 7 7 7 7 7 28
#> 5 256 U 5 8 8 8 8 8 32
One way to solve your problem:
library(dplyr)
card_201406 %>%
arrange(DPP) %>%
filter(!duplicated(DPP)) %>% # keep first row of each group
mutate(SUM_a = rowSums(.[grep("a_", names(.))]))
ID Block_Code DPP a a_1 a_2 a_3 a_4 SUM_a
1 123 D 1 1 1 1 1 1 4
2 234 U 2 2 2 2 2 2 8
3 456 G 3 4 4 4 4 4 16
4 567 A 4 7 7 7 7 7 28
5 256 U 5 8 8 8 8 8 32
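As a further sketch, the original select(.) approach also seems to work once the grouping is dropped with ungroup() before the mutate(), because the result of rowSums() then has the same number of rows as the data being mutated. The data is copied from the question; only the ungroup() step is new.

```r
library(dplyr)

card_201406 <- data.frame(
  ID = c(123, 234, 344, 456, 678, 124, 567, 256, 345),
  Block_Code = c("D", "U", "Z", "G", "T", "R", "A", "U", "B"),
  DPP = c(1, 2, 2, 3, 3, 3, 4, 5, 1),
  a = 1:9, a_1 = 1:9, a_2 = 1:9, a_3 = 1:9, a_4 = 1:9
)

res <- card_201406 %>%
  arrange(DPP) %>%
  group_by(DPP) %>%
  slice(1) %>%      # first row of each DPP group
  ungroup() %>%     # drop grouping so mutate() sees all 5 rows at once
  mutate(SUM_a = rowSums(select(., starts_with("a_"))))

res
```

Note that starts_with("a_") matches a_1 to a_4 but not the plain column a, so the sum covers four columns.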
I'm having a hard time putting this into the form of a question. I have a situation where the data in a column (column B) were recorded in such a way that all the values with respect to an indicator (column A) ended up in the bottom-most row within each value of the indicator. Or more simply, like this:
(my_df <- data.frame(
A = c(rep(1, 6), rep(2, 6)),
B = c(rep(NA, 5), "a,b,c,d,e,f", rep(NA, 5), "g,h,i,j,k,l")
))
#> A B
#> 1 1 <NA>
#> 2 1 <NA>
#> 3 1 <NA>
#> 4 1 <NA>
#> 5 1 <NA>
#> 6 1 a,b,c,d,e,f
#> 7 2 <NA>
#> 8 2 <NA>
#> 9 2 <NA>
#> 10 2 <NA>
#> 11 2 <NA>
#> 12 2 g,h,i,j,k,l
Created on 2022-01-28 by the reprex package (v2.0.1)
I am trying to find a simple way to distribute the cell contents upward so that they are in their correct rows, with respect to their respective codes:
(expected_df_1 <- data.frame(
A = c(rep(1, 6), rep(2, 6)),
B = c(letters[1:6], letters[7:12])
))
#> A B
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 1 d
#> 5 1 e
#> 6 1 f
#> 7 2 g
#> 8 2 h
#> 9 2 i
#> 10 2 j
#> 11 2 k
#> 12 2 l
This would also be fine:
(expected_df_2 <- data.frame(
A = c(rep(1, 6), rep(2, 6)),
B = c(rep(NA, 5), "a,b,c,d,e,f", rep(NA, 5), "g,h,i,j,k,l"),
C = c(letters[1:6], letters[7:12])
))
#> A B C
#> 1 1 <NA> a
#> 2 1 <NA> b
#> 3 1 <NA> c
#> 4 1 <NA> d
#> 5 1 <NA> e
#> 6 1 a,b,c,d,e,f f
#> 7 2 <NA> g
#> 8 2 <NA> h
#> 9 2 <NA> i
#> 10 2 <NA> j
#> 11 2 <NA> k
#> 12 2 g,h,i,j,k,l l
I can't for the life of me find a solution to this. Ideas? Preferably I'd like to stay within the tidyverse framework if possible, but I'll take any suggestions!
Another alternative to try: after grouping by column A, use strsplit() on the comma-separated values in column B (after removing the NAs).
library(tidyverse)
my_df %>%
group_by(A) %>%
mutate(B = unlist(strsplit(na.omit(B), ',')))
Output
A B
<dbl> <chr>
1 1 a
2 1 b
3 1 c
4 1 d
5 1 e
6 1 f
7 2 g
8 2 h
9 2 i
10 2 j
11 2 k
12 2 l
A possible solution: drop the NA rows first, then split the comma-separated elements into separate rows:
library(tidyverse)
my_df <- data.frame(
A = c(rep(1, 6), rep(2, 6)),
B = c(rep(NA, 5), "a,b,c,d,e,f", rep(NA, 5), "g,h,i,j,k,l")
)
my_df %>%
drop_na(B) %>%
separate_rows(B, sep=",")
#> # A tibble: 12 × 2
#> A B
#> <dbl> <chr>
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 1 d
#> 5 1 e
#> 6 1 f
#> 7 2 g
#> 8 2 h
#> 9 2 i
#> 10 2 j
#> 11 2 k
#> 12 2 l
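For the second acceptable layout from the question (keeping B and adding a new column C), a minimal sketch along the same lines could look like this. It assumes that each group contains exactly one non-NA string and that the number of comma-separated items equals the number of rows in the group; otherwise mutate() will raise a size error.

```r
library(dplyr)

my_df <- data.frame(
  A = c(rep(1, 6), rep(2, 6)),
  B = c(rep(NA, 5), "a,b,c,d,e,f", rep(NA, 5), "g,h,i,j,k,l"),
  stringsAsFactors = FALSE  # keep B as character on R < 4.0 as well
)

res <- my_df %>%
  group_by(A) %>%
  # split the single non-NA string of each group and spread it over the group's rows
  mutate(C = unlist(strsplit(B[!is.na(B)], ",", fixed = TRUE))) %>%
  ungroup()

res
```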
I'm struggling with a problem in R. I'm trying to move all values in the RL column that share the same ID in the TRIAL column into a new column, provided that any of the values in the RL column for that ID is greater than 5.
I have a data set like this:
dt <- tibble(
TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
RL = c(1, 2, 3, 1, 6, 3, 2, 3, 1),
SL = c(1, 1.5, 1, 0, 0, 1, 1, 1.5, 0)
)
# # A tibble: 9 x 3
# TRIAL RL SL
# <chr> <dbl> <dbl>
# 1 A 1 1
# 2 A 2 1.5
# 3 A 3 1
# 4 B 1 0
# 5 B 6 0
# 6 B 3 1
# 7 C 2 1
# 8 C 3 1.5
# 9 C 1 0
This is what I want to achieve: I want all values from one column in a group to be moved to a new column if the max value for that group is greater than 5, see example below.
# # A tibble: 9 x 4
# TRIAL RL SL RLCT
# <chr> <dbl> <dbl> <dbl>
# 1 A 1 1 NA
# 2 A 2 1.5 NA
# 3 A 3 1 NA
# 4 B NA 0 1
# 5 B NA 0 6
# 6 B NA 1 3
# 7 C 2 1 NA
# 8 C 3 1.5 NA
# 9 C 1 0 NA
When I run this code, I do not get the expected output:
dt %>% group_by("TRIAL") %>% mutate(RLCT = case_when ("RL"> 5 ~ "RL"))
# # A tibble: 9 x 5
# # Groups: "TRIAL" [1]
# TRIAL RL SL `"TRIAL"` RLCT
# <chr> <dbl> <dbl> <chr> <chr>
# 1 A 1 1 TRIAL RL
# 2 A 2 1.5 TRIAL RL
# 3 A 3 1 TRIAL RL
# 4 B 1 0 TRIAL RL
# 5 B 6 0 TRIAL RL
# 6 B 3 1 TRIAL RL
# 7 C 2 1 TRIAL RL
# 8 C 3 1.5 TRIAL RL
# 9 C 1 0 TRIAL RL
Surely not the most straightforward solution, but it seems to work:
dt0 <- dt %>%
mutate(RLCT = NA) %>%
group_by(TRIAL) %>%
filter(!any(RL > 5))
dt %>%
group_by(TRIAL) %>%
filter(any(RL > 5)) %>%
mutate(RLCT = RL) %>%
rbind(dt0, .) %>%
mutate(RL = ifelse(!is.na(RLCT), NA, RL))
# A tibble: 9 x 4
# Groups: TRIAL [3]
TRIAL RL SL RLCT
<chr> <dbl> <dbl> <dbl>
1 A 1 1 NA
2 A 2 1.5 NA
3 A 3 1 NA
4 C 2 1 NA
5 C 3 1.5 NA
6 C 1 0 NA
7 B NA 0 1
8 B NA 0 6
9 B NA 1 3
Add arrange(TRIAL) at the end to restore alphabetical ordering.
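A more direct sketch that keeps the original row order is shown below. The attempt in the question quotes the column names ("TRIAL", "RL"), so dplyr treats them as constant strings; here the bare names are used instead. The helper column over is a hypothetical name introduced only for this illustration.

```r
library(dplyr)

dt <- tibble(
  TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  RL = c(1, 2, 3, 1, 6, 3, 2, 3, 1),
  SL = c(1, 1.5, 1, 0, 0, 1, 1, 1.5, 0)
)

res <- dt %>%
  group_by(TRIAL) %>%                    # bare column name, not "TRIAL"
  mutate(
    over = any(RL > 5),                  # TRUE for all rows of a group with some RL > 5
    RLCT = if_else(over, RL, NA_real_),  # copy RL into RLCT for those groups
    RL   = if_else(over, NA_real_, RL)   # and blank out RL there
  ) %>%
  ungroup() %>%
  select(TRIAL, RL, SL, RLCT)

res
```

Because mutate() evaluates its arguments in order, RLCT is filled from the original RL values before RL itself is overwritten.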
My question is similar to this one; however, I have additional columns in the LHS that should be kept: https://stackoverflow.com/a/35642948/9285732
y is a subset of x with updated values for val1. In x I want to overwrite the relevant values but keep the rest.
Sample data:
library(tidyverse)
x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"),
location = c(1,1,1,2,2,3),
val1 = 1:6, val2 = 1:6, val3 = 1:6)
y <- tibble(name = c("hans", "dieter", "hans"),
location = c(2,2,1),
val1 = 10)
> x
# A tibble: 6 x 5
name location val1 val2 val3
<chr> <dbl> <int> <int> <int>
1 hans 1 1 1 1
2 dieter 1 2 2 2
3 bohlen 1 3 3 3
4 hans 2 4 4 4
5 dieter 2 5 5 5
6 alf 3 6 6 6
> y
# A tibble: 3 x 3
name location val1
<chr> <dbl> <dbl>
1 hans 2 10
2 dieter 2 10
3 hans 1 10
> # desired output
> out
# A tibble: 6 x 5
name location val1 val2 val3
<chr> <dbl> <dbl> <int> <int>
1 hans 1 10 1 1
2 dieter 1 2 2 2
3 bohlen 1 3 3 3
4 hans 2 10 4 4
5 dieter 2 10 5 5
6 alf 3 6 6 6
I wrote a function that is doing what I want, however it's quite cumbersome. I wonder if there's a more elegant way or even a dplyr function that I'm unaware of.
overwrite_join <- function(x, y, by = NULL){
bycols <- which(colnames(x) %in% by)
commoncols <- which(colnames(x) %in% colnames(y))
extracols <- which(!(colnames(x) %in% colnames(y)))
x1 <- anti_join(x, y, by = by) %>%
bind_rows(y) %>%
select(commoncols) %>%
left_join(x %>% select(bycols, extracols), by = by)
out <- x %>% select(by) %>%
left_join(x1, by = by)
return(out)
}
overwrite_join(x, y, by = c("name", "location"))
You could do something along the lines of
> x %>%
left_join(y = y, by = c("name", "location")) %>%
within(., val1.x <- ifelse(!is.na(val1.y), val1.y, val1.x)) %>%
select(-val1.y)
# # A tibble: 6 x 5
# name location val1.x val2 val3
# <chr> <dbl> <dbl> <int> <int>
# 1 hans 1 10 1 1
# 2 dieter 1 2 2 2
# 3 bohlen 1 3 3 3
# 4 hans 2 10 4 4
# 5 dieter 2 10 5 5
# 6 alf 3 6 6 6
and then rename val1.x.
My package safejoin might help. It is only available on GitHub so far, but it has a feature designed for exactly this.
The conflict argument below must be fed a function or lambda to deal with conflicting columns when joining. Here we want to prioritize the value from the y data frame, so we can use dplyr::coalesce(). Note that we must first coerce y$val1 to integer, because in your example it is double while x$val1 is integer. Your real case might not need this step.
# remotes::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
y$val1 <- as.integer(y$val1)
safe_left_join(x, y, by = c("name", "location"), conflict = ~coalesce(.y, .x))
#> # A tibble: 6 x 5
#> name location val1 val2 val3
#> <chr> <dbl> <int> <int> <int>
#> 1 hans 1 10 1 1
#> 2 dieter 1 2 2 2
#> 3 bohlen 1 3 3 3
#> 4 hans 2 10 4 4
#> 5 dieter 2 10 5 5
#> 6 alf 3 6 6 6
Edit: inspired by your own solution, here's a 100% dplyr option that you might like better. Just like your option, though, it's not a proper join!
bind_rows(y, x) %>%
group_by(name, location) %>%
summarize_all(~na.omit(.x)[[1]]) %>%
ungroup()
#> # A tibble: 6 x 5
#> name location val1 val2 val3
#> <chr> <dbl> <dbl> <int> <int>
#> 1 alf 3 6 6 6
#> 2 bohlen 1 3 3 3
#> 3 dieter 1 2 2 2
#> 4 dieter 2 10 5 5
#> 5 hans 1 10 1 1
#> 6 hans 2 10 4 4
Try dplyr::coalesce
x %>%
left_join(y, by = c("name", "location")) %>%
mutate(val1 = coalesce(val1.y, val1.x)) %>%
select(-val1.x, -val1.y)
# A tibble: 6 x 5
name location val2 val3 val1
<chr> <dbl> <int> <int> <int>
1 hans 1 1 1 10
2 dieter 1 2 2 2
3 bohlen 1 3 3 3
4 hans 2 4 4 10
5 dieter 2 5 5 10
6 alf 3 6 6 6
This is the idiom I now use. It does not preserve the row or column order in x, if that is important.
I like it because I can evaluate the values to just before the bind_rows(), do a visual inspection, and if I like it, put the fixed rows back onto the base dataframe.
library(dplyr)
x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"),
location = c(1,1,1,2,2,3),
val1 = 1:6, val2 = 1:6, val3 = 1:6)
y <- tibble(name = c("hans", "dieter", "hans"),
location = c(2,2,1),
val1 = 10)
keys <- c("name", "location")
out <- x %>%
semi_join(y, keys) %>%
select(-matches(setdiff(names(y), keys))) %>%
left_join(y, keys) %>%
bind_rows(x %>% anti_join(y, keys))
out %>%
print()
#> # A tibble: 6 x 5
#> name location val2 val3 val1
#> <chr> <dbl> <int> <int> <dbl>
#> 1 hans 1 1 1 10
#> 2 hans 2 4 4 10
#> 3 dieter 2 5 5 10
#> 4 dieter 1 2 2 2
#> 5 bohlen 1 3 3 3
#> 6 alf 3 6 6 6
Created on 2019-12-12 by the reprex package (v0.3.0)
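Newer dplyr versions also provide rows_update(), which appears to cover this use case directly and keeps the row and column order of x. The sketch below assumes dplyr >= 1.0.0 (newer than the answers above), that every key combination in y exists in x, and that the keys in y are unique. To sidestep any type-casting questions, val1 in y is created as an integer here, which differs from the question's double.

```r
library(dplyr)

x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"),
            location = c(1, 1, 1, 2, 2, 3),
            val1 = 1:6, val2 = 1:6, val3 = 1:6)
# val1 is an integer here so that it matches the type of x$val1 (an assumption)
y <- tibble(name = c("hans", "dieter", "hans"),
            location = c(2, 2, 1),
            val1 = 10L)

# Overwrite val1 in x for the rows whose keys match y; other rows and columns are kept
out <- rows_update(x, y, by = c("name", "location"))

out
```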
When I use filter from the dplyr package to drop a level of a factor variable, filter also drops the NA values. Here's an example:
library(dplyr)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
# var1
# 1 <NA>
# 2 3
# 3 3
# 4 1
# 5 1
# 6 <NA>
# 7 2
# 8 2
# 9 <NA>
# 10 1
filter(dat, var1 != 1)
# var1
# 1 3
# 2 3
# 3 2
# 4 2
This does not seem ideal -- I only wanted to drop rows where var1 == 1.
It looks like this is occurring because any comparison with NA returns NA, which filter then drops. So, for example, filter(dat, !(var1 %in% 1)) produces the correct results. But is there a way to tell filter not to drop the NA values?
You could use this:
filter(dat, var1 != 1 | is.na(var1))
var1
1 <NA>
2 3
3 3
4 <NA>
5 2
6 2
7 <NA>
With this condition, the NA rows are kept.
Also, just for completeness: dropping NAs is the intended behavior of filter, as you can see from the following:
test_that("filter discards NA", {
temp <- data.frame(
i = 1:5,
x = c(NA, 1L, 1L, 0L, 0L)
)
res <- filter(temp, x == 1)
expect_equal(nrow(res), 2L)
})
The test above was taken from the tests for filter in the dplyr repository on GitHub.
The answers previously given are good, but when your filter statement involves a function of many fields, the workaround might not be so great. Also, who wants to use mapply with the non-vectorized identical()? Here is another, somewhat simpler solution using coalesce():
filter(dat, coalesce( var1 != 1, TRUE))
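To compare the NA-preserving variants side by side, here is a small sketch on fixed data, so that it does not depend on set.seed(). It uses a plain numeric column rather than the factor from the question, and the object names are only illustrative.

```r
library(dplyr)

dat <- data.frame(var1 = c(NA, 3, 1, 2, NA, 1))

# A comparison with NA yields NA, and filter() keeps only rows where the condition is TRUE
dropped <- filter(dat, var1 != 1)

# Three ways to keep the NA rows while dropping var1 == 1
keep_or   <- filter(dat, var1 != 1 | is.na(var1))
keep_coal <- filter(dat, coalesce(var1 != 1, TRUE))
keep_in   <- filter(dat, !(var1 %in% 1))

keep_or
```

All three conditions evaluate to TRUE for the NA rows, which is why those rows survive the filter.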
I often map identical with mapply...
(note: I believe because of changes in R 3.6.0, set.seed and sample end up with different test data)
library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = T))))
#> var1
#> 1 3
#> 2 1
#> 3 <NA>
#> 4 3
#> 5 1
#> 6 3
#> 7 2
#> 8 3
#> 9 2
#> 10 1
filter(dat, var1 != 1)
#> var1
#> 1 3
#> 2 3
#> 3 3
#> 4 2
#> 5 3
#> 6 2
filter(dat, !mapply(identical, as.numeric(var1), 1))
#> var1
#> 1 3
#> 2 <NA>
#> 3 3
#> 4 3
#> 5 2
#> 6 3
#> 7 2
It works for numerics and strings as well (probably the more common use case):
library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = sample(c(1:3, NA), size = 10, replace = T),
var2 = letters[sample(c(1:3, NA), size = 10, replace = T)],
stringsAsFactors = FALSE))
#> var1 var2
#> 1 3 <NA>
#> 2 1 a
#> 3 NA a
#> 4 3 b
#> 5 1 b
#> 6 3 <NA>
#> 7 2 a
#> 8 3 c
#> 9 2 <NA>
#> 10 1 b
filter(dat, !mapply(identical, var1, 1L))
#> var1 var2
#> 1 3 <NA>
#> 2 NA a
#> 3 3 b
#> 4 3 <NA>
#> 5 2 a
#> 6 3 c
#> 7 2 <NA>
filter(dat, !mapply(identical, var2, 'a'))
#> var1 var2
#> 1 3 <NA>
#> 2 3 b
#> 3 1 b
#> 4 3 <NA>
#> 5 3 c
#> 6 2 <NA>
#> 7 1 b