Here is my data:
x <- rnorm(n = 6, mean = 0, sd = 1)
class <- c(1,1,1,2,2,2)
df <- cbind(x, class)
I want to calculate the mean of x by class and have it repeated for all rows (i.e., get a new column with each class's mean repeated, so that the number of rows of the data frame stays the same).
We can use
library(dplyr)
df <- df %>%
  group_by(class) %>%
  mutate(Mean = mean(x)) %>%
  ungroup()
-output
df
# A tibble: 6 x 3
x class Mean
<dbl> <dbl> <dbl>
1 2.43 1 1.05
2 0.0625 1 1.05
3 0.669 1 1.05
4 0.195 2 -0.0550
5 0.285 2 -0.0550
6 -0.644 2 -0.0550
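As a side note (not part of the original answer): with dplyr >= 1.1.0, the same per-group mutate can be written without group_by()/ungroup() by using the .by argument:
# equivalent one-liner in dplyr >= 1.1.0
df %>% mutate(Mean = mean(x), .by = class)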
data
# the question's cbind() produces a matrix, so recreate df as a data frame
df <- data.frame(x, class)
A base R option using ave
transform(
  df,
  Mean = ave(x, class)
)
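ave() returns a vector as long as its input, with each group's summary repeated; FUN defaults to mean, and any other summary function can be supplied, e.g.:
# group medians instead of group means
transform(df, Med = ave(x, class, FUN = median))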
Related
Here is my data. I would like to group it into ten groups, with a minimal sum of var1 within each group and minimal variance of var2 across groups.
data <- data.frame(id=1:100,var1=runif(1000),var2=runif(1000))
I'm not exactly sure how to interpret the minimal-sum-of-var1 condition, since the total sum is always the same. I have assumed it means having the lowest maximum group sum.
# Create dataset
set.seed(1)
data <- data.frame(var1 = runif(1000), var2 = runif(1000))
# Create 50 splits
# These are numerically balanced by var2 as specified
# meaning that groups will have similar means
library(dplyr)

data_grouped <- data %>%
  groupdata2::fold(k = 10,
                   num_col = "var2",
                   num_fold_cols = 50)
# Find the split with the lowest maximum sum of a group
data_grouped %>%
  # Note that `gather()` will lead to 50k rows,
  # so this step might need rethinking for a bigger dataset
  tidyr::gather(key = "split", value = "group", 3:52) %>%
  dplyr::group_by(split, group) %>%
  # Find sum per group per split
  dplyr::summarise(var1_sum = sum(var1), .groups = "drop_last") %>%
  # Find max group sum per split
  dplyr::summarise(var1_max = max(var1_sum)) %>%
  # Find split with lowest max group sum
  dplyr::filter(var1_max == min(var1_max))
> # A tibble: 1 x 2
> split var1_max
> <chr> <dbl>
> 1 .folds_19 51.9
# Assign best grouping factor to original data frame
data$group <- data_grouped$.folds_19
# Check the means of var2
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(var2_mean = mean(var2))
> # A tibble: 10 x 2
> group var2_mean
> <fct> <dbl>
> 1 1 0.491
> 2 2 0.489
> 3 3 0.490
> 4 4 0.490
> 5 5 0.492
> 6 6 0.490
> 7 7 0.489
> 8 8 0.491
> 9 9 0.491
> 10 10 0.490
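And (my addition, following the same pattern as the var2 check) the per-group sums of var1 can be inspected the same way to confirm the chosen split:
# Check the sums of var1
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(var1_sum = sum(var1))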
I have a data frame of 30 years of a response variable. I want to write code that will subset that df into "x" number of years "n" times, and run a regression of the response in each of those subsets.
So if we started with 30 years, x=5 & n=2, we would end with 2 regressions, each using 5 random years out of the available 30. I wrote a function for that here:
# build df
library(dplyr)
library(tidyr)

df = data.frame(year = 1:30,
                response = runif(30, 1, 100))
# create function
subsample <- function(df, x, n) {
  df %>%
    # collapse the tibble
    nest(data = everything()) %>%
    # repeat the tibble once per simulation
    slice(rep(1:n(), each = n)) %>%
    # add group number, which will be the "nth" trial
    mutate(group = 1:n) %>%
    # expand data
    unnest(cols = c(data)) %>%
    # group by trial number, then draw x rows from each trial's copy
    group_by(group) %>%
    group_map(~ sample_n(.x, x, replace = FALSE)) %>%
    # stitch back together and add the trial number column back
    bind_rows(.id = "trial") %>%
    # arrange by trial and year
    mutate(trial = as.numeric(trial)) %>%
    arrange(trial, year) %>%
    # group by subsample and run a regression per trial
    group_by(trial) %>%
    do({
      mod <- lm(response ~ year, data = .)
      data.frame(Intercept = coef(mod)[1],
                 Slope = coef(mod)[2])
    })
}
# test function
subsample(df, x=5, n=2)
# A tibble: 2 x 3
# Groups: trial [2]
trial Intercept Slope
<dbl> <dbl> <dbl>
1 1 48.5 -0.895
2 2 35.4 -0.275
Great, so that works, and we get two regressions (all I want is slope and intercept) each using a subset of 5 out of 30 years.
However, now I want to do this with every possible combination of years (so x = c(2:30)), ending with a df that should look like this
# A tibble: 58 x 4
number_of_years trial Intercept Slope
<dbl> <dbl> <dbl> <dbl>
1 2 1 48.5 -0.895
2 2 2 35.4 -0.275
3 3 1 55.2 0.333
4 3 2 34.1 0.224
5 4 1 63.2 -0.359
6 4 2 45.5 -0.241
7 5 1 43.1 0.257
8 5 2 37.9 -0.657
9 6 1 51.0 -0.456
10 6 2 65.6 0.126
This would be showing regression values of 2 trials ("n") each using 2 random years (number_of_years, "x"), then 2 trials using 3 random years, 4 random years, etc... all the way until 30.
So I tried to follow the same logic as above, but now using group_map() with the custom function that I built:
df %>%
  # collapse the tibble
  nest(data = everything()) %>%
  # repeat the tibble for the number of simulations we want to test (29, in this case)
  slice(rep(1:n(), each = (nrow(df) - 1))) %>%
  # add a column for the number of years out of the total, and unnest
  mutate(number_of_years = 2:(nrow(.) + 1)) %>%
  select(number_of_years, data) %>% # reorder
  unnest(cols = c(data)) %>%
  # group by the number of years
  group_by(number_of_years) %>%
  group_map(~ subsample(.x, x = 5, n = 2))
### this is the problematic line!
### this is giving us 2 trials (n=2) of a regression, each using
### x=5 years of sampling. but instead of x=5 years, I want x=number_of_years
### so x should be the same as the grouping variable.
So the problem here is: since my subsample() function needs 3 inputs (df, x, n), I need to figure out how to make "x" the same as the grouping variable (number_of_years). I've tried group_map(~ subsample(.x, .x$number_of_years, 2)) and variations like that, but I can't figure out how to make it return 29 tibbles of 2 trials each, meaning 2 regressions of subsamples of the original df, each one using a different number of years to calculate the regression.
I would like to stay in the tidyverse/dplyr/purrr workflow if possible.
Thanks!
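One possible way to get there (a sketch of my own, not from the thread): since subsample() already takes the number of years as its x argument, the grouping can be skipped entirely by iterating over the candidate values with purrr::map_dfr(), which row-binds the per-x results into one data frame:
library(purrr)

# run subsample() once per number_of_years and row-bind the results
all_trials <- map_dfr(2:nrow(df), function(x) {
  subsample(df, x = x, n = 2) %>%
    ungroup() %>%
    mutate(number_of_years = x)
}) %>%
  select(number_of_years, trial, Intercept, Slope)

all_trials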
Hi, I have a df that has variables as the columns and time as the rows. The element at each intersection is a count.
       Var_1  Var_2  Var_3
Time_1     5      4      5
Time_2     4     19      4
Time_3     2      2     87
This df has a lot of rows (> 30,000)
How can I calculate Z scores for every 20 rows? Thanks in advance! <3
Here is an answer that uses dplyr::summarise() to calculate means and standard deviations, merges them with the original data, and uses mutate() to calculate the z-scores. We'll illustrate the single-variable case, but it can be extended to handle multiple variables.
Given the ambiguity of the original question, we assume the Time- column is structured in groups of 20, which allows us to use it as the main grouping variable for the solution. That is, there are 20 observations at Time-1, another 20 at Time-2, etc.
If the requirement is to create groups of 20 rows based on consecutive row identifiers, the solution can easily be modified to add a grouping variable to represent sets of 20 rows.
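For instance (a minimal sketch of that modification, assuming a count column y as in the data simulated below), integer division of the row number assigns each consecutive block of 20 rows to its own group:
library(dplyr)
data %>%
  mutate(block = (row_number() - 1) %/% 20) %>% # 0 for rows 1-20, 1 for rows 21-40, ...
  group_by(block) %>%
  mutate(zScore = (y - mean(y)) / sd(y)) %>%
  ungroup()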
# simulate some data
y <- rpois(20000, 3) # simulate counts
TimeVal <- paste0(rep("Time-", 20000),
                  rep(1:1000, 20))
data <- data.frame(TimeVal, y, stringsAsFactors = FALSE)
library(dplyr)
result <- data %>%
  group_by(TimeVal) %>%
  summarise(ybar = mean(y),
            stDev = sd(y)) %>%
  full_join(data, ., by = "TimeVal") %>%
  mutate(zScore = (y - ybar) / stDev)
head(result)
...and the output:
> head(result)
TimeVal y ybar stDev zScore
1 Time-1 6 3.45 1.276302 1.99795938
2 Time-2 2 2.95 1.700619 -0.55862010
3 Time-3 2 3.20 1.908430 -0.62878909
4 Time-4 3 3.10 1.916686 -0.05217339
5 Time-5 2 3.10 1.447321 -0.76002513
6 Time-6 2 3.30 1.809333 -0.71849700
>
Extending the solution: z-scores for multiple columns
To solve for multiple columns in the original input data frame, first we create a long-form tidy data frame with tidyr::pivot_longer(), calculate means and standard deviations, merge them with the long data, and calculate z-scores.
Converting the input data to a long-form tidy data frame allows us to use the original column names in a dplyr::group_by(), eliminating a lot of code that would otherwise be required to calculate the z-scores for each column in the original data.
library(tidyr)
set.seed(95014) # set seed to make results reproducible
y2 <- rpois(20000,8)
y3 <- rpois(20000,15)
data <- data.frame(TimeVal,y,y2,y3,stringsAsFactors = FALSE)
# convert to a long-form tidy frame, then calculate means, sds, and zScores
longData <- data %>%
  group_by(TimeVal) %>%
  pivot_longer(-TimeVal,
               names_to = "variable",
               values_to = "value")
result <- longData %>%
  group_by(TimeVal, variable) %>%
  summarise(avg = mean(value), stDev = sd(value)) %>%
  full_join(longData, .) %>%
  mutate(zScore = (value - avg) / stDev)
head(result)
...and the output:
> head(result)
# A tibble: 6 x 6
# Groups: TimeVal [2]
TimeVal variable value avg stDev zScore
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 Time-1 y 6 3.45 1.28 2.00
2 Time-1 y2 13 8.7 2.23 1.93
3 Time-1 y3 20 16.4 5.25 0.686
4 Time-2 y 2 2.95 1.70 -0.559
5 Time-2 y2 6 8.2 2.89 -0.760
6 Time-2 y3 12 14.8 3.34 -0.852
>
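As an aside (my addition, not part of the original answer), the single-column version can also be done in base R with ave(), echoing the first answer above:
# base R: z-score of y within each TimeVal group
data$zScore <- ave(data$y, data$TimeVal,
                   FUN = function(v) (v - mean(v)) / sd(v))
head(data)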
I'm struggling with one task: I have a dataframe where one column is always numeric and the others are always factors. I don't know the index of the numeric column.
My task is: to group dataframe by all factors, then to find mean and sd within each group.
I have already done some part of work:
library(dplyr)
library(stats)
df <- data.frame(
  col1 = sample(LETTERS[1:3], 100, replace = TRUE),
  col2 = sample(LETTERS[1:3], 100, replace = TRUE),
  col3 = rnorm(100))
df
find_mean_sd <- function(df){
  numeric <- which(sapply(df, is.numeric) == TRUE)
  columns <- names(df)[-numeric]
  dots <- lapply(columns, as.symbol)
  df %>%
    group_by_(.dots = dots) %>%
    summarise(mean = mean(df[, numeric]), SD = sd(df[, numeric]))
}
find_mean_sd(df)
I am confused about the mean and sd: why are they the same for all groups? I wanted to get 9 different values.
In case you want to fix your code, you can try this:
library(dplyr)
find_mean_sd <- function(df){
  numeric <- which(sapply(df, is.numeric) == TRUE)
  columns <- names(df)[-numeric]
  dots <- lapply(columns, as.symbol)
  df %>%
    group_by_(.dots = dots) %>%
    summarise_all(funs(mean, sd)) # here you can summarise by the functions you need
}
find_mean_sd(df)
# A tibble: 9 x 4
# Groups: col1 [3]
col1  col2   mean    sd
<fct> <fct> <dbl> <dbl>
1 A A 0.202 1.19
2 A B -0.141 0.950
3 A C 0.585 0.596
4 B A -0.0812 1.20
5 B B -0.380 1.18
6 B C 0.300 0.846
7 C A -0.152 0.705
8 C B 0.136 1.39
9 C C 0.263 0.762
I think the problem was that you referenced df itself inside summarise(): df[, numeric] indexes the whole, ungrouped column, so every group gets the overall mean and sd. Referring to df there is not necessary inside a dplyr chain. That said, A. Suliman's solution is more elegant.
We can use dplyr::*_if to select the required columns
library(dplyr)
df %>%
  group_by_if(is.factor) %>%
  summarise_if(is.numeric, list(mean = ~mean(., na.rm = TRUE), SD = ~sd(., na.rm = TRUE)))
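Note that group_by_(), funs(), and the _if/_all verbs used above are deprecated or superseded in current dplyr; here is a sketch of the same idea with across() (assuming dplyr >= 1.0.0):
df %>%
  group_by(across(where(is.factor))) %>%              # group by every factor column
  summarise(across(where(is.numeric),                 # summarise every numeric column
                   list(mean = ~mean(.x, na.rm = TRUE),
                        SD   = ~sd(.x, na.rm = TRUE))),
            .groups = "drop")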
I am trying to draw a stratified sample from a data set for which a variable exists that indicates how large the sample size per group should be.
library(dplyr)
# example data
df <- data.frame(id = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))
In this example, grp refers to the group I want to sample by and frq is the sample size specified for that group.
Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient:
s <- split(df, df$grp)
lapply(s, function(x) sample_n(x, size = unique(x$frq))) %>%
  do.call(what = rbind)
Is there a way using just dplyr's group_by and sample_n to do this?
My first thought was:
df %>% group_by(grp) %>% sample_n(size = frq)
but this gives the error:
Error in is_scalar_integerish(size) : object 'frq' not found
This works:
df %>% group_by(grp) %>% sample_n(frq[1])
# A tibble: 9 x 3
# Groups: grp [3]
id grp frq
<int> <int> <dbl>
1 3 1 3
2 4 1 3
3 2 1 3
4 6 2 2
5 8 2 2
6 13 3 4
7 14 3 4
8 12 3 4
9 11 3 4
Not sure why it didn't work when you tried it.
library(tidyverse)
# example data
df <- data.frame(id = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))
set.seed(22)
df %>%
  group_by(grp) %>% # for each group
  nest() %>%        # nest data
  mutate(v = map(data, ~sample_n(data.frame(id = .$id), unique(.$frq)))) %>% # sample id values using the (unique) frq value
  unnest(v)         # unnest the sampled values
# # A tibble: 9 x 2
# grp id
# <int> <int>
# 1 1 2
# 2 1 5
# 3 1 3
# 4 2 8
# 5 2 9
# 6 3 14
# 7 3 13
# 8 3 15
# 9 3 11
sample_n() works if you pass it a data frame of ids (not a vector of ids) and a single frequency value for each group.
An alternative version using map2 and generating the inputs for sample_n in advance:
df %>%
  group_by(grp) %>%                        # for every group
  summarise(d = list(data.frame(id = id)), # create a data frame of ids
            frq = unique(frq)) %>%         # get the unique frq value
  mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>% # sample using the data frame of ids and the frq value
  unnest(v) %>%                            # unnest sampled values
  select(-frq)                             # remove frq column (if needed)
The following answer is not recommended; it just shows a different approach, without nests or maps, that some people might find more comprehensible. It is possibly of use to someone working with a smallish data set who wants to do something slightly different to the original question, is a bit scared of or doesn't have time to play around with functions they don't really understand, and isn't too worried about efficiency. You just need to recall the behaviour of the original sample function in base R: when provided with a (positive) integer argument x, it outputs a vector randomly permuting the integers from 1:x.
> sample(5)
[1] 5 1 4 2 3
If we had five elements, we could then obtain a random sample of size three by only selecting the positions where 1, 2 and 3 were permuted - in this case we'd pick the second, fourth and fifth elements. All clear? Then similarly we can just do that within each group, assigning random integers from 1 to the group size, and choosing as our sample the places where the random id is less than or equal to the desired sample size for that group.
library(tidyverse)
# The iris data set has three different species
# I want to sample 2, 5 and 3 flowers respectively from each
sample_sizes <- data.frame(
  Species = unique(iris$Species),
  n_to_sample = c(2, 5, 3)
)
iris %>%
  left_join(sample_sizes, by = "Species") %>% # adds a column for how many to sample from this species
  group_by(Species) %>%                       # each species is a group; the group size is n()
  mutate(random_id = sample(n())) %>%         # give each flower in the group a random id between 1 and n()
  ungroup() %>%
  filter(random_id <= n_to_sample)
Which gave me the output:
# A tibble: 10 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n_to_sample random_id
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <int>
1 4.9 3.1 1.5 0.1 setosa 2 1
2 5.7 4.4 1.5 0.4 setosa 2 2
3 6.2 2.2 4.5 1.5 versicolor 5 3
4 6.3 2.5 4.9 1.5 versicolor 5 2
5 6.4 2.9 4.3 1.3 versicolor 5 5
6 6 2.9 4.5 1.5 versicolor 5 4
7 5.5 2.4 3.8 1.1 versicolor 5 1
8 7.3 2.9 6.3 1.8 virginica 3 1
9 7.2 3 5.8 1.6 virginica 3 3
10 6.2 3.4 5.4 2.3 virginica 3 2
You can of course pipe through to select(-random_id, -n_to_sample) if you no longer have any use for the final two columns, but I left them in so it's clearer from the output how the code worked.
For the example data given in the question:
library(dplyr)
# example data
df <- data.frame(id = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

df %>%
  group_by(grp) %>%
  mutate(random_id = sample(n())) %>%
  ungroup() %>%
  filter(random_id <= frq) %>%
  select(-random_id)
# A tibble: 9 x 3
id grp frq
<int> <int> <dbl>
1 1 1 3
2 2 1 3
3 3 1 3
4 8 2 2
5 9 2 2
6 11 3 4
7 12 3 4
8 13 3 4
9 15 3 4
NB if you're a safety fanatic and x might be zero, and you want to guarantee that the length of the output matches x, you're better off with sample(seq_len(x)) than sample(x). That way you get the zero-length vector integer(0) rather than the length-one vector 0 when x is zero. In my code, the mutate never operates on a row for which n() is zero (if n() were zero, that group would be empty, so there couldn't be a row there), so this isn't a problem here. It's just something to be aware of if you're taking this approach somewhere else.
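To see the difference, compare the two calls directly (quick to check in any R session):
sample(0)          # [1] 0       -- 0 fails the "x >= 1" scalar rule, so R samples
                   #                from the vector c(0) itself, giving length 1
sample(seq_len(0)) # integer(0)  -- a genuinely zero-length sample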
Benchmarks for comparison:
f1 <- function(df) { # @AntoniosK with nest and map
  df %>%
    group_by(grp) %>%
    nest() %>%
    mutate(v = map(data, ~sample_n(data.frame(id = .$id), unique(.$frq)))) %>%
    unnest(v)
}
f2 <- function(df) { # @AntoniosK with nest and map2
  df %>%
    group_by(grp) %>%
    summarise(d = list(data.frame(id = id)),
              frq = unique(frq)) %>%
    mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>%
    unnest(v) %>%
    select(-frq)
}
f3 <- function(df) { # @thc
  df %>% group_by(grp) %>% sample_n(frq[1])
}
f4 <- function(df) { # @Silverfish
  df %>%
    group_by(grp) %>%
    mutate(random_id = sample(n())) %>%
    ungroup() %>%
    filter(random_id <= frq) %>%
    select(-random_id)
}
# example data of variable size
df_n <- function(n) {
  data.frame(id = seq_len(3 * n),
             grp = rep(1:3, each = n),
             frq = rep(c(3, 2, 4), each = n))
}
require(microbenchmark)
microbenchmark(f1(df_n(1e3)), f2(df_n(1e3)), f3(df_n(1e3)), f4(df_n(1e3)),
               f1(df_n(1e6)), f2(df_n(1e6)), f3(df_n(1e6)), f4(df_n(1e6)),
               times = 20)
Results strongly favour @thc's df %>% group_by(grp) %>% sample_n(frq[1]), both for data frames with a couple of thousand rows and for those with a couple of million. My naive approach takes two to three times as long, and @AntoniosK's faster solution is the nest-and-map2 one (slower than mine for the smaller data frames but faster for the larger ones).
Unit: milliseconds
expr min lq mean median uq max neval
f1(df_n(1000)) 12.0007 12.27295 12.479760 12.34190 12.46475 13.6403 20
f2(df_n(1000)) 9.5841 9.82185 9.905120 9.87820 9.98865 10.2993 20
f3(df_n(1000)) 1.3729 1.53470 1.593015 1.56755 1.68910 1.8456 20
f4(df_n(1000)) 3.1732 3.21600 3.558855 3.27500 3.57350 5.4715 20
f1(df_n(1e+06)) 1582.3807 1695.15655 1699.288195 1714.13435 1727.53300 1744.2654 20
f2(df_n(1e+06)) 323.3649 336.94280 407.581130 346.95390 463.69935 911.6647 20
f3(df_n(1e+06)) 216.3265 235.85830 268.756465 247.63620 259.02640 395.9372 20
f4(df_n(1e+06)) 641.5119 663.03510 737.089355 682.69730 803.98205 1132.6586 20