mutate if feature exists else NA_real_ - r

A dataframe:
exdf <- data.frame(
a = 1:3,
b = c(2,2,2)
)
Sometimes b is present, in which case one can do this:
exdf %>% mutate(c = a / b)
But, sometimes feature b will not be present, in which case:
exdf %>% select(-b) %>% mutate(c = a / b)
Error: Problem with `mutate()` input `c`.
x object 'b' not found
ℹ Input `c` is `a/b`.
I want to tell dplyr to try the mutation, else if something goes wrong just make new feature c all NA_real_ as opposed to a / b.
Can this be done?

We can use a condition with if/else on exists
library(dplyr)
exdf %>%
select(-b) %>%
mutate(c = if(exists('b')) a/b else NA_real_)

Set up a simple if else statement within mutate which checks whether the column name is in the data.frame or not.
> exdf %>%
... dplyr::rowwise() %>%
... dplyr::mutate(q = ifelse("b" %in% colnames(.), a/b, NA_real_))
# A tibble: 3 x 3
# Rowwise:
a b q
<int> <dbl> <dbl>
1 1 2 0.5
2 2 2 1
3 3 2 1.5
> exdf %>%
... dplyr::select(-b) %>%
... dplyr::rowwise() %>%
... dplyr::mutate(q = ifelse("b" %in% colnames(.), a/b, NA_real_))
# A tibble: 3 x 2
# Rowwise:
a q
<int> <dbl>
1 1 NA
2 2 NA
3 3 NA

Related

How to count the number of times a specified variable appears in a dataframe column using dplyr?

Suppose we start with this very simple dataframe called myData:
> myData
Element Class
1 A 0
2 A 0
3 C 0
4 A 0
5 B 1
6 B 1
7 A 2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData dataframe? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because among other things it yields another dataframe with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count which simplifies the group_by + summarise step
library(dplyr)
myData %>%
filter(Element == 'A') %>%
count(Element, name = 'counted')
Or with just summarise and sum
myData %>%
summarise(counted = sum(Element == 'A'), Element = 'A') %>%
relocate(Element, .before = 1)
Element counted
1 A 4
Another option using tally like this:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
filter(Element == "A") %>%
group_by(Element) %>%
tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)

How to use condition statement in pipe in r

I am trying to use the condition statement in the pipe but failed.
The data like this:
group = rep(letters[1:3], each = 3)
status = c(T,T,T, T,T,F, F,F,F)
value = c(1:9)
df = data.frame(group = group, status = status, value = value)
> df
group status value
1 a TRUE 1
2 a TRUE 2
3 a TRUE 3
4 b TRUE 4
5 b TRUE 5
6 b FALSE 6
7 c FALSE 7
8 c FALSE 8
9 c FALSE 9
I want to get the rows in each group that have max value with the condition that if any of the status in each group have TRUE then filter(status == T) %>% slice_max(value) or slice_max(value) otherwise.
What I have tried is this:
# way 1
df %>%
group_by(group) %>%
if(any(status) == T) {
filter(status == T) %>% slice_max(value)
} else {
slice_max(value)
}
# way 2
df %>%
group_by(group) %>%
when(any(status) == T,
filter(status == T) %>% slice_max(value),
slice_max(value))
What I expected output should like this:
> expected_df
group status value
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
Any help will be highly appreciated!
Try arranging the data by status then value, then just taking the first result
df %>%
group_by(group) %>%
arrange(!status, desc(value)) %>%
slice(1)
Since we arrange by status, if they have a TRUE value, it will come first, if not, then you just get the largest value. Generally it's a bit awkward to combine pipes and if statements but if that's something you want to look into, that's covered in this existing question but if statements don't work with group_by.
A bit more verbose :
library(dplyr)
df %>%
group_by(group) %>%
filter(if(any(status)) value ==max(value[status]) else value == max(value)) %>%
ungroup
# group status value
# <chr> <lgl> <int>
#1 a TRUE 3
#2 b TRUE 5
#3 c FALSE 9
df %>%
group_by(group) %>%
slice(which.max(value*(all(!status)|status)))
# A tibble: 3 x 3
# Groups: group [3]
group status value
<chr> <lgl> <int>
1 a TRUE 3
2 b TRUE 5
3 c FALSE 9
Though the best is to arrange the data

Adding a Proportion Column with Dplyr

Let's say I had the following data frame, that was also altered to include counts of a,b, and c, based on whether or not they are classified by Z = 0 or 1
X <- (1:10)
Y<- c('a','b','a','c','b','b','a','a','c','c')
Z <- c(0,1,1,1,0,1,0,1,1,1)
test_df <- data.frame(X,Y,Z)
(the code below was provided by a stack exchange member, thank you!)
res <- test_df %>% group_by(Y,Z) %>% summarise(N=n()) %>%
pivot_wider(names_from = Z,values_from=N,
values_fill = 0)
How might I add a column on the right which would indicate the proportion of each of the letters for which z=1, out of all appearances of that letter? It would seem that a basic summary statement should work but I figure out how...
My expected output would be something like
Z=0 Z=1 PropZ=1
a 2 2 .5
b 1 2 .66
c 0 3 1
Perhaps this helps
library(dplyr)
library(tidyr)
test_df %>%
group_by(Y, Z) %>%
summarise(N = n(), .groups = 'drop') %>%
left_join(test_df %>%
group_by(Y) %>%
summarise(Prop = mean(Z == 1), .groups = 'drop')) %>%
pivot_wider(names_from = Z, values_from = N, values_fill = 0)
-output
# A tibble: 3 x 4
# Y Prop `0` `1`
# <chr> <dbl> <int> <int>
#1 a 0.5 2 2
#2 b 0.667 1 2
#3 c 1 0 3
test_df %>% group_by(Y) %>%
summarise( z0 = sum(Z == 0), z1 = sum(Z == 1) , PropZ = z1/n())
I am not sure if what is your expected output, but below might be some options
u <- xtabs(q ~ Y + Z, cbind(test_df, q = 1))
> u
Z
Y 0 1
a 2 2
b 1 2
c 0 3
or
> prop.table(u)
Z
Y 0 1
a 0.2 0.2
b 0.1 0.2
c 0.0 0.3
To calculate proportions of 1 for each letter you can use rowSums.
transform(res, prop_1 = `1`/rowSums(res[-1]))
In dplyr :
library(dplyr)
res %>%
ungroup %>%
mutate(prop_1 = `1`/rowSums(.[-1]))
# Y `0` `1` prop_1
# <chr> <int> <int> <dbl>
#1 a 2 2 0.5
#2 b 1 2 0.667
#3 c 0 3 1

map over columns and apply custom function

Missing something small here and struggling to pass columns to function. I just want to map (or lapply) over columns and perform a custom function on each of the columns. Minimal example here:
library(tidyverse)
set.seed(10)
df <- data.frame(id = c(1,1,1,2,3,3,3,3),
r_r1 = sample(c(0,1), 8, replace = T),
r_r2 = sample(c(0,1), 8, replace = T),
r_r3 = sample(c(0,1), 8, replace = T))
df
# id r_r1 r_r2 r_r3
# 1 1 0 0 1
# 2 1 0 0 1
# 3 1 1 0 1
# 4 2 1 1 0
# 5 3 1 0 0
# 6 3 0 0 1
# 7 3 1 1 1
# 8 3 1 0 0
a function just to filter and counts unique ids remaining in the dataset:
cnt_un <- function(var) {
df %>%
filter({{var}} == 1) %>%
group_by({{var}}) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
it works outside of map
cnt_un(r_r1)
# A tibble: 1 x 2
r_r1 n_uniq
<dbl> <int>
1 1 3
I want to apply the function over all r_r columns to get something like:
df2
# y n_uniq
# 1 r_r1 3
# 2 r_r2 2
# 3 r_r3 2
I thought the following would work but doesnt
map(dplyr::select(df, matches("r_r")), ~ cnt_un(.x))
any suggestions? thanks
I'm not sure if there's a direct tidyeval way to do this with something like map. The issue you're running into is that in calling map(df, *whatever_function*), the function is being called on each column of df as a vector, whereas your function expects a bare column name in the tidyeval style. To verify that:
map(df, class)
will return "numeric" for each column.
An alternative is to iterate over column names as strings, and convert those to symbols; this takes just one additional line in the function.
library(dplyr)
library(tidyr)
library(purrr)
cnt_un_name <- function(varname) {
var <- ensym(varname)
df %>%
filter({{var}} == 1) %>%
group_by({{var}}) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
Calling the function is a little awkward because it keeps only the relevant column names (calling on "r_r1" gets columns "r_r1" and "n_uniq", etc). One way is to get the vector of column names you want, name it so you can add an ID column in map_dfr, and drop the extra columns, since they'll be mostly NA.
grep("^r_r\\d+", names(df), value = TRUE) %>%
set_names() %>%
map_dfr(cnt_un_name, .id = "y") %>%
select(y, n_uniq)
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 3
#> 2 r_r2 2
#> 3 r_r3 2
A better way is to call the function, then bind after reshaping.
grep("^r_r\\d+", names(df), value = TRUE) %>%
map(cnt_un_name) %>%
map_dfr(pivot_longer, 1, names_to = "y") %>%
select(y, n_uniq)
# same output as above
Alternatively (and maybe better/more scaleable) would be to do the column renaming inside the function definition.
Here's a base R solution that uses lapply. The tricky bit is that your function isn't actually running on single columns; it's using id, too, so you can't use canned functions that iterate column-wise.
do.call(rbind, lapply(grep("r_r", colnames(df), value = TRUE), function(i) {
X <- subset(df, df[,i] == 1)
row <- data.frame(y = i, n_uniq = length(unique(X$id)), stringsAsFactors = FALSE)
}))
y n_uniq
1 r_r1 2
2 r_r2 3
3 r_r3 2
Here is another solution. I changed the syntax of your function. Now you supply the pattern of the columns you want to select.
cnt_un <- function(var_pattern) {
df %>%
pivot_longer(cols = contains(var_pattern), values_to = "vals", names_to = "y") %>%
filter(vals == 1) %>%
group_by(y) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
cnt_un("r_r")
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 2
#> 2 r_r2 3
#> 3 r_r3 2

Filter group only when both levels are present

This feels like it should be more straightforward and I'm just missing something. The goal is to filter the data into a new df where both var values 1 & 2 are represented in the group
here's some toy data:
grp <- c(rep("A", 3), rep("B", 2), rep("C", 2), rep("D", 1), rep("E",2))
var <- c(1,1,2,1,1,2,1,2,2,2)
id <- c(1:10)
df <- as.data.frame(cbind(id, grp, var))
only grp A and C should be present in the new data because they are the only ones where var 1 & 2 are present.
I tried dplyr, but obviously '&' won't work since it's not row based and '|' just returns the same df:
df.new <- df %>% group_by(grp) %>% filter(var==1 & var==2) #returns no rows
Here is another dplyr method. This can work for more than two factor levels in var.
library(dplyr)
df2 <- df %>%
group_by(grp) %>%
filter(all(levels(var) %in% var)) %>%
ungroup()
df2
# # A tibble: 5 x 3
# id grp var
# <fct> <fct> <fct>
# 1 1 A 1
# 2 2 A 1
# 3 3 A 2
# 4 6 C 2
# 5 7 C 1
We can condition on there being at least one instance of var == 1 and at least one instance of var == 2 by doing the following:
library(tidyverse)
df1 <- data_frame(grp, var, id) # avoids coercion to character/factor
df1 %>%
group_by(grp) %>%
filter(sum(var == 1) > 0 & sum(var == 2) > 0)
grp var id
<chr> <dbl> <int>
1 A 1 1
2 A 1 2
3 A 2 3
4 C 2 6
5 C 1 7

Resources