How to use multiple arguments in mutate_all for any function? - r

My data is below
grp <- paste('group', sample(1:3, 100, replace = TRUE))
x <- rnorm(100, 100)
y <- rnorm(100, 10)
df <- data.frame(grp = grp, x = x, y = y, stringsAsFactors = FALSE)
lag_size <- c(10, 4, 9)
Now when I try to use
df %>% group_by(grp) %>% mutate_all(lag, n = lag_size) %>% arrange(grp)
it gives an error
Error in mutate_impl(.data, dots) :
Expecting a single value:
whereas this works fine
df %>% group_by(grp) %>% mutate_all(lag, n = 10) %>% arrange(grp)

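To lag each 'grp' by its corresponding value in 'lag_size', we can split the data by 'grp' and pair each piece with its lag size using map2. Note that split() orders the groups alphabetically, so 'lag_size' is matched by position to 'group 1', 'group 2', 'group 3'.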
library(tidyverse)
res <- map2(split(df[2:3], df$grp) , lag_size, ~.x %>%
mutate_all(lag, n = .y)) %>%
bind_rows(., .id = 'grp')
We can check the lag in 'grp' by the position of the first non-NA element
res %>%
group_by(grp) %>%
summarise(n = which(!is.na(x))[1]-1)
# A tibble: 3 x 2
# grp n
# <chr> <dbl>
#1 group 1 10
#2 group 2 4
#3 group 3 9
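As a rough alternative sketch (not part of the answer above), the lag sizes could be kept in a named vector and looked up per group inside mutate(across(...)). This assumes dplyr >= 1.0.0 for across(); the names on lag_size and the set.seed() call are additions for illustration only.

```r
library(dplyr)

set.seed(42)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(
  grp = paste("group", sample(1:3, 100, replace = TRUE)),
  x = rnorm(100, 100),
  y = rnorm(100, 10),
  stringsAsFactors = FALSE
)

# Named lookup (assumption): names must match the values in 'grp'
lag_size <- c("group 1" = 10, "group 2" = 4, "group 3" = 9)

res <- df %>%
  group_by(grp) %>%
  # grp[1] is the current group's label, so each group gets its own n
  mutate(across(c(x, y), ~ lag(.x, n = lag_size[[grp[1]]]))) %>%
  ungroup() %>%
  arrange(grp)

# Same check as above: position of the first non-NA value per group
chk <- res %>%
  group_by(grp) %>%
  summarise(n = which(!is.na(x))[1] - 1)
```

Matching by name rather than by position means the result does not depend on the order in which the groups happen to be split.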

Related

randomly add NA values to dataframe with the proportion set by group

I would like to randomly add NA values to my dataframe with the proportion set by group.
library(tidyverse)
set.seed(1)
dat <- tibble(group = c(rep("A", 100),
rep("B", 100)),
value = rnorm(200))
pA <- 0.5
pB <- 0.2
# does not work
# was trying to create another column that i could use with
# case_when to set value to NA if missing==1
dat %>%
group_by(group) %>%
mutate(missing = rbinom(n(), 1, c(pA, pB))) %>%
summarise(mean = mean(missing))
I'd create a small tibble to keep track of the expected missingness rates, and join it to the first data frame. Then go through row by row to decide whether to set a value to missing or not.
This is easy to generalize to more than two groups as well.
library("tidyverse")
set.seed(1)
dat <- tibble(
group = c(
rep("A", 100),
rep("B", 100)
),
value = rnorm(200)
)
expected_nans <- tibble(
group = c("A", "B"),
p = c(0.5, 0.2)
)
dat_with_nans <- dat %>%
inner_join(
expected_nans,
by = "group"
) %>%
mutate(
r = runif(n()),
value = if_else(r < p, NA_real_, value)
) %>%
select(
-p, -r
)
dat_with_nans %>%
group_by(
group
) %>%
summarise(
mean(is.na(value))
)
#> # A tibble: 2 × 2
#> group `mean(is.na(value))`
#> <chr> <dbl>
#> 1 A 0.53
#> 2 B 0.17
Created on 2022-03-11 by the reprex package (v2.0.1)
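A possible shortcut, sketched here under the assumption that the rates fit in a small named vector (the name p_by_group is made up for illustration): index the vector by the group column so every row gets its own probability. This also shows why the attempt in the question misbehaves: rbinom() recycles c(pA, pB) across rows within each group instead of picking one rate per group.

```r
library(dplyr)

set.seed(1)
dat <- tibble(
  group = rep(c("A", "B"), each = 100),
  value = rnorm(200)
)

# Hypothetical lookup: one missingness rate per group label
p_by_group <- c(A = 0.5, B = 0.2)

dat_with_nas <- dat %>%
  mutate(
    # unname() drops the names carried over from the lookup vector
    p = unname(p_by_group[group]),
    value = if_else(runif(n()) < p, NA_real_, value)
  ) %>%
  select(-p)

# Observed share of missing values per group
na_rates <- dat_with_nas %>%
  group_by(group) %>%
  summarise(na_rate = mean(is.na(value)))
```

The observed rates will only be close to 0.5 and 0.2, since each row is drawn independently.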
Nesting and unnesting works
library(tidyverse)
dat <- tibble(group = c(rep("A", 1000),
rep("B", 1000)),
value = rnorm(2000))
pA <- .1
pB <- 0.5
set.seed(1)
dat %>%
group_by(group) %>%
nest() %>%
mutate(p = case_when(
group=="A" ~ pA,
TRUE ~ pB
)) %>%
mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>%
unnest(cols = data) %>%
summarise(mean = mean(missing))
# A tibble: 2 × 2
group mean
<chr> <dbl>
1 A 0.11
2 B 0.481
set.seed(1)
dat %>%
group_by(group) %>%
nest() %>%
mutate(p = case_when(
group=="A" ~ pA,
TRUE ~ pB
)) %>%
mutate(data = purrr::map(data, ~ mutate(.x, missing = rbinom(n(), 1, p)))) %>%
unnest(cols = data) %>%
ungroup() %>%
mutate(value = case_when(
missing == 1 ~ NA_real_,
TRUE ~ value
)) %>%
select(-p, -missing)

How to estimate the mean of the 10% upper and lower values over multiple categories with dplyr?

Suppose you have this data.frame in R
set.seed(15)
df <- data.frame(cat = rep(c("a", "b"), each = 50),
x = c(runif(50, 0, 1), runif(50, 1, 2)))
I want to estimate the mean of the 10% upper and lower values in each category.
I can do it using base functions like this
dfa <- df[df$cat=="a",]
dfb <- df[df$cat=="b",]
mean(dfa[dfa$x >= quantile(dfa$x, 0.9),"x"])
# [1] 0.9537632
mean(dfa[dfa$x <= quantile(dfa$x, 0.1),"x"])
# [1] 0.07959845
mean(dfb[dfb$x >= quantile(dfb$x, 0.9),"x"])
# [1] 1.963775
mean(dfb[dfb$x <= quantile(dfb$x, 0.1),"x"])
# [1] 1.092218
However, I can't figure out how to implement this using dplyr or purrr.
Thanks for the help.
We could do this in a group by approach using cut and quantile as breaks
library(dplyr)
df %>%
group_by(cat) %>%
mutate(grp = cut(x, breaks = c(-Inf, quantile(x,
probs = c(0.1, 0.9)), Inf))) %>%
group_by(grp, .add = TRUE) %>%
summarise(x = mean(x, na.rm = TRUE), .groups = 'drop_last') %>%
slice(-2)
-output
# A tibble: 4 x 3
# Groups: cat [2]
cat grp x
<chr> <fct> <dbl>
1 a (-Inf,0.0813] 0.0183
2 a (0.853, Inf] 0.955
3 b (-Inf,1.21] 1.07
4 b (1.93, Inf] 1.95
Here's a way you can use cut() to help partition your data into groups and then take the mean
df %>%
group_by(cat) %>%
mutate(part=cut(x, c(-Inf, quantile(x, c(.1, .9)), Inf), labels=c("low","center","high"))) %>%
filter(part!="center") %>%
group_by(cat, part) %>%
summarize(mean(x))
which returns everything in a nice tibble
cat part `mean(x)`
<chr> <fct> <dbl>
1 a low 0.0796
2 a high 0.954
3 b low 1.09
4 b high 1.96
To make it a bit cleaner, you can factor out the splitting to a helper function
split_quantile <- function(x , p=c(.1, .9)) {
cut(x, c(-Inf, quantile(x, p), Inf), labels=c("low","center","high"))
}
df %>%
group_by(cat) %>%
mutate(part = split_quantile(x)) %>%
filter(part != "center") %>%
group_by(cat, part) %>%
summarize(mean(x))
A variant of @MrFlick's answer - you can use cut_number (from ggplot2) and slice:
df %>%
group_by(cat) %>%
mutate(part = cut_number(x, n = 10)) %>%
group_by(cat, part) %>%
summarise(mean(x)) %>%
slice(1, n())
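For comparison, here is a minimal sketch that translates the base R code from the question directly into a grouped summarise(), without cut(). The column names low and high are chosen here for illustration.

```r
library(dplyr)

set.seed(15)
df <- data.frame(cat = rep(c("a", "b"), each = 50),
                 x = c(runif(50, 0, 1), runif(50, 1, 2)))

tail_means <- df %>%
  group_by(cat) %>%
  summarise(
    # mean of the values at or below the group's 10% quantile
    low = mean(x[x <= quantile(x, 0.1)]),
    # mean of the values at or above the group's 90% quantile
    high = mean(x[x >= quantile(x, 0.9)])
  )
```

Because quantile() is evaluated inside each group, the thresholds are computed per category, just like the dfa / dfb version.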

Top_n return both max and min value - R

Is it possible for the top_n() command to return both the max and the min value at the same time?
Using the example from the reference page https://dplyr.tidyverse.org/reference/top_n.html
I tried the following
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(c(1,-1)) ## returns an error
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(1) %>% top_n(-1) ## returns only max value
Thanks
Not really involving top_n(), but you can try:
df %>%
arrange(x) %>%
slice(c(1, n()))
x
1 1
2 10
Or:
df %>%
slice(which(x == max(x) | x == min(x))) %>%
distinct()
Or (provided by @Gregor):
df %>%
slice(c(which.min(x), which.max(x)))
Or using filter():
df %>%
filter(x %in% range(x) & !duplicated(x))
Idea similar to @Jakub's answer with purrr::map_dfr
library(tidyverse) # dplyr and purrr for map_dfr
df %>%
map_dfr(c(1, -1), top_n, wt = x, x = .)
# x
# 1 10
# 2 1
# 3 1
# 4 1
Here is an option with top_n where we pass a logical vector that is TRUE for the min/max values (found with range), and then keep only the distinct rows, because the extremes can be tied, i.e. duplicate elements are present
library(dplyr)
df %>%
top_n(x %in% range(x), 1) %>%
distinct
# x
#1 10
#2 1
I like @tmfmnk's answer. If you want to use the top_n function, you can do this:
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
bind_rows(
df %>% top_n(1),
df %>% top_n(-1)
)
# this solution addresses the specification in comments
df %>%
group_by(y) %>%
summarise(min = min(x),
max = max(x),
average = mean(x))
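Since dplyr 1.0.0, top_n() is superseded by slice_min() and slice_max(), so a sketch along the lines of the bind_rows() idea above might look like this. It assumes dplyr >= 1.0.0; with_ties = FALSE is used here to keep a single row for the tied minimum.

```r
library(dplyr)

df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))

extremes <- bind_rows(
  # row with the largest x
  slice_max(df, x, n = 1, with_ties = FALSE),
  # one row with the smallest x (ties dropped)
  slice_min(df, x, n = 1, with_ties = FALSE)
)
```

Dropping with_ties = FALSE would return all three rows where x equals 1, similar to the map_dfr() result above.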

More efficient way to perform calculations on multiple (combined) columns by group

What is a more efficient way to perform calculations on multiple combined columns by group?
I have a dataset with Manager Effectiveness & Team Effectiveness components. How can I quickly calculate the number of 5s for each component by gender?
The desired outcome is like so:
Number of 5s for 'Manager effectiveness' = 2
Number of 5s for 'Team effectiveness' = 0
So far, I've tried the dplyr method:
Data %>%
group_by(gender) %>%
summarise(sum(c(`Manager EQ`, `Manager IQ`) == 5))
Data %>%
group_by(gender) %>%
summarise(sum(c(`Team collaboration`, `Team friendliness`) == 5))
Though it works, typing each column name quickly becomes tedious and error-prone as more columns are involved.
We can use summarise_at
library(dplyr)
Data %>%
group_by(gender) %>%
summarise_at(vars(starts_with('Manager')), ~ sum(. == 5))
Or if we are checking the sum of all numeric columns, use summarise_if
Data %>%
group_by(gender) %>%
summarise_if(is.numeric, ~ sum(. == 5))
This can be wrapped in a function
f1 <- function(dat, colPrefix, grp, val) {
dat %>%
group_by_at(grp) %>%
summarise_at(vars(starts_with(colPrefix)), ~ sum(. == val))
}
f1(Data, "Manager", "gender", 5)
Mostly expanding on @akrun's answer:
## made up data 100 observations
set.seed(133)
dat <- 1:5
gen <- c("M", "F")
z <- tibble(me = sample(dat, 100, TRUE),
mi = sample(dat, 100, TRUE),
tc = sample(dat, 100, TRUE),
tf = sample(dat, 100, TRUE),
gender = sample(gen, 100, TRUE))
# Grouping by gender, counting 5's, and reshaping data
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
pivot_wider(id_cols = gender)
Output:
# A tibble: 2 x 5
gender `# 5's for me` `# 5's for mi` `# 5's for tc` `# 5's for tf`
<chr> <int> <int> <int> <int>
1 F 6 6 8 5
2 M 10 14 20 5
This is starting to get a little hack-ey, but in response to Amanda's comment & my misunderstanding of the question:
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
mutate(grp = ifelse(str_detect(name, 'm'), 'manager', 'team')) %>%
group_by(gender, grp) %>%
summarise(total_5s = sum(value))
Gives results:
# A tibble: 4 x 3
# Groups: gender [2]
gender grp total_5s
<chr> <chr> <int>
1 F manager 12
2 F team 13
3 M manager 24
4 M team 25
Unfortunately this relies heavily on making a distinction and group based on the column names of the original data.
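One way to soften that dependence, sketched under the assumption that every score column is named "<Component> <item>" as in the question (the toy values in Data are invented for illustration): reshape to long format and derive the component from the first word of the column name.

```r
library(dplyr)
library(tidyr)

# Made-up data using the column names from the question
Data <- tibble(
  gender = c("F", "M", "F", "M"),
  `Manager EQ` = c(5, 3, 4, 5),
  `Manager IQ` = c(2, 5, 5, 1),
  `Team collaboration` = c(4, 4, 3, 2),
  `Team friendliness` = c(1, 2, 3, 4)
)

fives <- Data %>%
  pivot_longer(-gender, names_to = "item", values_to = "score") %>%
  # assumption: the component is the first word of the column name
  mutate(component = sub(" .*", "", item)) %>%
  group_by(gender, component) %>%
  summarise(n_fives = sum(score == 5), .groups = "drop")
```

New components are then picked up automatically as long as their columns follow the same naming pattern.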

how to use group_by in a function in R

I want to use group_by inside a function. In the code below, steps 1 and 2 work well, so I wrapped them in a function (step 3), but it fails when I call it in step 4. I don't know how to address this problem, so I am asking for help.
# 1 generate variables and dataframe
x <- rnorm(100)
y <- rep(c("A", "B"), 50)
df <- data.frame(y, x)
# 2 group by y
df %>%
group_by(y) %>%
summarise(n = n(),
mean = mean(x),
sd = sd(x))
# 3 create function
group <- function(df, var1, var2){
df %>%
group_by(var1) %>%
summarise(n = n(),
mean = mean(var2),
sd = sd(var2))
}
# 4 test function
group(df = df, var1 = y, var2 = x)
# the error is as follows:
# Error in grouped_df_impl(data, unname(vars), drop) :
#   Column `var1` is unknown
# Called from: grouped_df_impl(data, unname(vars), drop)
You can do:
library(dplyr)
group <- function(df, var1, var2){
var1 <- enquo(var1); var2 <- enquo(var2);
df %>%
group_by(!!var1) %>%
summarise(n = n(),
mean = mean(!!var2),
sd = sd(!!var2))
}
group(df = df, var1 = y, var2 = x)
### A tibble: 2 x 4
## y n mean sd
## <fct> <int> <dbl> <dbl>
##1 A 50 -0.133 0.866
##2 B 50 0.0770 0.976
For further reference check the link
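With rlang 0.4.0 or later (an assumption about the installed version), the enquo() / !! pair can also be written with the embrace operator {{ }}. A minimal sketch; the function name group_stats and the seed are chosen here for illustration:

```r
library(dplyr)

set.seed(1)  # assumption: seed added only for reproducibility
x <- rnorm(100)
y <- rep(c("A", "B"), 50)
df <- data.frame(y, x)

group_stats <- function(df, var1, var2) {
  df %>%
    # {{ }} forwards the unquoted column name into the dplyr verb
    group_by({{ var1 }}) %>%
    summarise(n = n(),
              mean = mean({{ var2 }}),
              sd = sd({{ var2 }}))
}

out <- group_stats(df = df, var1 = y, var2 = x)
```

The call looks the same as before, with bare column names passed as arguments.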
