I am using dplyr's summarise function. My data contain NAs, so I need to include na.rm=TRUE in each call. For example:
group <- rep(c('a', 'b'), 3)
value <- c(1:4, NA, NA)
df = data.frame(group, value)
library(dplyr)
group_by(df, group) %>% summarise(
mean = mean(value, na.rm=TRUE),
sd = sd(value, na.rm=TRUE),
min = min(value, na.rm=TRUE))
Is there a way to write the argument na.rm=TRUE only once, rather than
on every line?
You can use summarise_at, which lets you apply several functions to the supplied columns and pass arguments that are shared among them. The functions are given as a named list here, because funs() is deprecated as of dplyr 0.8.0:
df %>% group_by(group) %>%
summarise_at("value",
list(mean = mean, sd = sd, min = min),
na.rm = TRUE)
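In dplyr 1.0.0 and later, summarise_at is superseded by across(). One way to keep na.rm = TRUE in a single place there is a small wrapper that fixes the argument before the functions are handed to across(). This is only a sketch: the helper name na_rm is made up for illustration and is not part of dplyr.

```r
library(dplyr)

group <- rep(c("a", "b"), 3)
value <- c(1:4, NA, NA)
df <- data.frame(group, value)

# Hypothetical helper (not a dplyr function): returns a version of f
# that always drops NAs.
na_rm <- function(f) {
  force(f)
  function(x) f(x, na.rm = TRUE)
}

# na.rm = TRUE is written once, inside na_rm().
stats <- lapply(list(mean = mean, sd = sd, min = min), na_rm)

res <- df %>%
  group_by(group) %>%
  summarise(across(value, stats))
res
```

With a named list of functions, across() names the result columns value_mean, value_sd and value_min by default.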
If you only need to apply your functions to one column, you can instead drop the NA values of that column first with filter(!is.na(value)). Only NAs in this variable are filtered out; NAs in other variables do not affect the result.
group <- rep(c('a', 'b'), 3)
value <- c(1:4, NA, NA)
df = data.frame(group, value)
library(dplyr)
group_by(df, group) %>%
filter(!is.na(value)) %>%
summarise(mean = mean(value),
sd = sd(value),
min = min(value))
# # A tibble: 2 x 4
# group mean sd min
# <fctr> <dbl> <dbl> <dbl>
# 1 a 2 1.414214 1
# 2 b 3 1.414214 2
Suppose you have this data.frame in R
set.seed(15)
df <- data.frame(cat = rep(c("a", "b"), each = 50),
x = c(runif(50, 0, 1), runif(50, 1, 2)))
I want to estimate the mean of the 10% upper and lower values in each category.
I can do it using base functions like this
dfa <- df[df$cat=="a",]
dfb <- df[df$cat=="b",]
mean(dfa[dfa$x >= quantile(dfa$x, 0.9),"x"])
# [1] 0.9537632
mean(dfa[dfa$x <= quantile(dfa$x, 0.1),"x"])
# [1] 0.07959845
mean(dfb[dfb$x >= quantile(dfb$x, 0.9),"x"])
# [1] 1.963775
mean(dfb[dfb$x <= quantile(dfb$x, 0.1),"x"])
# [1] 1.092218
However, I can't figure out how to implement this using dplyr or purrr.
Thanks for the help.
We could do this with a grouped approach, using cut with the quantiles as breaks:
library(dplyr)
df %>%
group_by(cat) %>%
mutate(grp = cut(x, breaks = c(-Inf, quantile(x,
probs = c(0.1, 0.9)), Inf))) %>%
group_by(grp, .add = TRUE) %>%
summarise(x = mean(x, na.rm = TRUE), .groups = 'drop_last') %>%
slice(-2)
-output
# A tibble: 4 x 3
# Groups: cat [2]
cat grp x
<chr> <fct> <dbl>
1 a (-Inf,0.0813] 0.0183
2 a (0.853, Inf] 0.955
3 b (-Inf,1.21] 1.07
4 b (1.93, Inf] 1.95
Here's a way you can use cut() to partition your data into groups and then take the mean of each:
df %>%
group_by(cat) %>%
mutate(part=cut(x, c(-Inf, quantile(x, c(.1, .9)), Inf), labels=c("low","center","high"))) %>%
filter(part!="center") %>%
group_by(cat, part) %>%
summarize(mean(x))
which returns everything in a nice tibble
cat part `mean(x)`
<chr> <fct> <dbl>
1 a low 0.0796
2 a high 0.954
3 b low 1.09
4 b high 1.96
To make it a bit cleaner, you can factor out the splitting to a helper function
split_quantile <- function(x, p = c(.1, .9)) {
# use the p argument for the breaks instead of hard-coding the quantiles
cut(x, c(-Inf, quantile(x, p), Inf), labels = c("low", "center", "high"))
}
df %>%
group_by(cat) %>%
mutate(part = split_quantile(x)) %>%
filter(part != "center") %>%
group_by(cat, part) %>%
summarize(mean(x))
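A variant of @MrFlick's answer: you can use cut_number (from ggplot2) to split each category into 10 equally sized bins, then use slice to keep the first and last bin: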
df %>%
group_by(cat) %>%
mutate(part = cut_number(x, n = 10)) %>%
group_by(cat, part) %>%
summarise(mean(x)) %>%
slice(1, n())
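For comparison, the base R calculation from the question can also be written directly inside summarise(), without cut(). This is a sketch that reuses the question's data and the same <= and >= comparisons; the column names low and high are chosen here for illustration.

```r
library(dplyr)

set.seed(15)
df <- data.frame(cat = rep(c("a", "b"), each = 50),
                 x = c(runif(50, 0, 1), runif(50, 1, 2)))

# Within each category, average the values at or below the 10% quantile
# and the values at or above the 90% quantile.
tails <- df %>%
  group_by(cat) %>%
  summarise(low  = mean(x[x <= quantile(x, 0.1)]),
            high = mean(x[x >= quantile(x, 0.9)]))
tails
```

Because quantile() is evaluated inside the grouped summarise(), the thresholds are computed separately for each category, as in the dfa/dfb version.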
My problem is that I don't understand why I cannot calculate the mean and sd of the variable Total.
Steps that I have done:
I filtered the dataset in order to see data only from day 1 to 7.
I have summarised values from the variable "x" and created a new variable "Total"
I have a dataset:
name day x
ab 1 3
cd 3 5
fg 7 2
ll 3 1
kk 9 0
My code:
df_changed <- df%>%
dplyr::group_by(`name`, `day` )%>%
dplyr::filter(`day`>= 1, `day`<= 7) %>%
dplyr::summarise(Total=sum(x, na.rm = TRUE))%>%
dplyr::summarise(mean = mean(Total), sd = sd(Total)) %>%
view(df_changed)
Perhaps you want to calculate the mean and SD of x rather than of Total, as stated. Try this code:
library(tidyverse)
df_changed <- df%>%
dplyr::group_by(`name`, `day` )%>%
dplyr::filter(`day`>= 1, `day`<= 7) %>%
dplyr::summarise(Total=sum(x, na.rm = TRUE),
mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE))
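If the goal really is the mean and SD of Total, one likely cause of the problem is that after the first summarise() the data is still grouped by name, so each group holds a single Total and sd() of a single value is NA. The sketch below works under that assumption: it uses the example data from the question and drops the grouping before the second summarise().

```r
library(dplyr)

# Example data from the question
df <- data.frame(name = c("ab", "cd", "fg", "ll", "kk"),
                 day  = c(1, 3, 7, 3, 9),
                 x    = c(3, 5, 2, 1, 0))

# One Total per name/day, keeping days 1 to 7; .groups = "drop"
# (dplyr >= 1.0.0) returns an ungrouped result.
totals <- df %>%
  filter(day >= 1, day <= 7) %>%
  group_by(name, day) %>%
  summarise(Total = sum(x, na.rm = TRUE), .groups = "drop")

# Mean and SD across all Totals
overall <- totals %>%
  summarise(mean = mean(Total), sd = sd(Total))
overall
```

Note that the pipeline in the question also ends with %>% before view(df_changed), which passes the result into view() instead of finishing the assignment.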
A dataframe:
mydf <- data.frame(
x = rep(letters[1:3], 4),
y = rnorm(12, 0, 3)
)
I can easily mutate a new column z that is the value of y plus or minus a random number:
mydf <- mydf %>%
mutate(z = rnorm(nrow(.), mean = 0, sd = sd(y)))
What I would like to do is create z as a random number, but when setting the sd, use the sd for that letter only.
Tried:
mydf <- mydf %>%
group_by(x) %>%
mutate(z = rnorm(nrow(.), mean = 0, sd = sd(y)))
Error: Problem with `mutate()` input `z`.
x Input `z` can't be recycled to size 4.
ℹ Input `z` is `rnorm(nrow(.), mean = 0, sd = sd(y))`.
ℹ Input `z` must be size 4 or 1, not 12.
ℹ The error occurred in group 1: x = "a".
How can I add z, which is the value of y plus or minus a random number with an sd equal to that of the sd for the group as opposed to the column as a whole?
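Here nrow(.) ignores the grouping: `.` is the whole data frame passed in by the pipe, so nrow(.) returns the total number of rows (12) rather than the size of the current group (4). mutate requires the new column to have the same length as the group (or length 1), hence the error. Wrapping the result in a list would avoid the error, but that is probably not what the OP wants. The summarise call below shows that nrow(.) returns 12 in every group: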
library(dplyr)
mydf %>%
group_by(x) %>%
summarise(n = nrow(.))
# A tibble: 3 x 2
# x n
# <chr> <int>
#1 a 12 ###
#2 b 12 ###
#3 c 12 ###
Instead, we can use n(), which returns the size of the current group:
mydf %>%
group_by(x) %>%
mutate(z = rnorm(n(), mean = 0, sd = sd(y)))
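To see that n() and sd(y) are evaluated per group, the group sizes and group SDs can be summarised next to the mutate() call. This is a minimal sketch; the seed and the names sizes and out are added here for illustration.

```r
library(dplyr)

set.seed(1)  # seed added only to make the sketch reproducible
mydf <- data.frame(
  x = rep(letters[1:3], 4),
  y = rnorm(12, 0, 3)
)

# n() is the size of the current group; sd(y) is the SD within that group
sizes <- mydf %>%
  group_by(x) %>%
  summarise(n = n(), sd_y = sd(y))

# z is drawn separately for each letter, using that letter's SD
out <- mydf %>%
  group_by(x) %>%
  mutate(z = rnorm(n(), mean = 0, sd = sd(y))) %>%
  ungroup()
```

Each group has 4 rows, so rnorm() returns 4 values per group and mutate() no longer complains about recycling.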
As described in numerous questions on here, I should be able to take a data.frame, group it, sort by date, and then apply cumsum, to get the cumulative sum over time per grouping.
Instead, with dplyr 0.8.0, I'm getting cumulative sums that ignore the grouping.
Example code:
data.frame(
cat = sample(c("a", "b", "c"), size = 1000, replace = T),
date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 1000, replace=T)
) %>%
mutate(
x = 1
) %>%
arrange(date) %>%
group_by(cat) %>%
mutate(x = cumsum(x)) %>%
tail()
Now, I'd expect the last few rows to have x equal to around 300-something, for each group.
Instead I get:
# A tibble: 6 x 3
# Groups: cat [2]
cat date x
<chr> <date> <dbl>
1 a 1999-12-31 995
2 a 1999-12-31 996
3 c 2000-01-01 997
4 a 2000-01-01 998
5 c 2000-01-01 999
6 a 2000-01-01 1000
What am I doing wrong?
I'm guessing this is a classic problem when you load plyr after dplyr, nothing to do with your version of dplyr. For example:
tmp1<- data.frame(cat = sample(c("a", "b", "c"), size = 1000, replace = T),
date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 1000, replace=T)) %>% mutate(x = 1)
See the difference between
tmp1 %>%
arrange(date) %>%
group_by(cat) %>%
plyr::mutate(x = cumsum(x)) %>%
tail()
and
tmp1 %>%
arrange(date) %>%
group_by(cat) %>%
dplyr::mutate(x = cumsum(x)) %>%
tail()
plyr's mutate doesn't understand grouping.
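You can check whether this is the problem by calling search(): it lists the attached packages, and a package that appears earlier in the list masks same-named functions from packages that appear later.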
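The search() check can be wrapped in a small helper. This is a sketch in base R; the function name masks is hypothetical and only compares positions on the search path.

```r
# Hypothetical helper: TRUE if package `a` sits before package `b` on the
# search path, i.e. a's functions mask same-named functions from b.
masks <- function(a, b, path = search()) {
  pos_a <- match(paste0("package:", a), path)
  pos_b <- match(paste0("package:", b), path)
  !is.na(pos_a) && !is.na(pos_b) && pos_a < pos_b
}

# TRUE means a bare mutate() call resolves to plyr::mutate
masks("plyr", "dplyr")
```

If this returns TRUE, either call dplyr::mutate explicitly or load plyr before dplyr. When both packages are attached, environmentName(environment(mutate)) is another way to see which package a bare mutate currently resolves to.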
My data is below
grp <- paste('group', sample(1:3, 100, replace = T))
x <- rnorm(100, 100)
y <- rnorm(100, 10)
df <- data.frame(grp = grp, x =x , y =y , stringsAsFactors = F)
lag_size <- c(10, 4, 9)
Now when I try to use
df %>% group_by(grp) %>% mutate_all(lag, n = lag_size) %>% arrange(grp)
it gives an error
Error in mutate_impl(.data, dots) :
Expecting a single value:
whereas this works fine
df %>% group_by(grp) %>% mutate_all(lag, n = 10) %>% arrange(grp)
If we need to lag by group, i.e. lag each 'grp' by the corresponding value in 'lag_size', we can split the data by 'grp' and apply each lag size with map2:
library(tidyverse)
res <- map2(split(df[2:3], df$grp) , lag_size, ~.x %>%
mutate_all(lag, n = .y)) %>%
bind_rows(., .id = 'grp')
We can check the lag in each 'grp' from the position of the first non-NA element:
res %>%
group_by(grp) %>%
summarise(n = which(!is.na(x))[1]-1)
# A tibble: 3 x 2
# grp n
# <chr> <dbl>
#1 group 1 10
#2 group 2 4
#3 group 3 9
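An alternative that keeps everything in one grouped pipeline is to look up the lag size by group name inside mutate(). This is a sketch that assumes dplyr >= 1.0.0 (for across()); the named vector lag_lookup is introduced here for illustration, and the seed is added only for reproducibility.

```r
library(dplyr)

set.seed(42)  # seed added only to make the sketch reproducible
grp <- paste('group', sample(1:3, 100, replace = TRUE))
df <- data.frame(grp = grp, x = rnorm(100, 100), y = rnorm(100, 10),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table: lag size per group, keyed by group name
lag_lookup <- c("group 1" = 10, "group 2" = 4, "group 3" = 9)

res <- df %>%
  group_by(grp) %>%
  # grp[1] is the current group's name; [[ ]] extracts its single lag size
  mutate(across(c(x, y), ~ dplyr::lag(.x, n = lag_lookup[[grp[1]]]))) %>%
  arrange(grp)

# The position of the first non-NA value in each group, minus 1,
# should match that group's lag size
chk <- res %>%
  summarise(n = which(!is.na(x))[1] - 1)
chk
```

Keying the lag sizes by name avoids relying on the order in which split() returns the groups.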