How to group by a fixed number of rows in dplyr? [duplicate]

This question already has answers here:
Calculate the mean of every 13 rows in data frame
(4 answers)
Closed 1 year ago.
I have a data frame:
set.seed(123)
x <- sample(10)
y <- x^2
my.df <- data.frame(x, y)
The result is this:
> my.df
x y
1 3 9
2 8 64
3 4 16
4 7 49
5 6 36
6 1 1
7 10 100
8 9 81
9 2 4
10 5 25
What I want is to group the rows every n rows and compute the mean, sum, or whatever over each group of n rows. Something like this for n = 5:
my.df %>% group_by(5) %>% summarise(sum = sum(y), mean = mean(y))
The expected output would be something like:
# A tibble: 2 x 2
sum mean
<dbl> <dbl>
1 174 34.8
2 211 42.2
Of course, the number of rows in the data frame could be 15, 20, 100, whatever. I still want to group the data every n rows.
How can I do this?

We can use rep or gl to create the grouping variable:
library(dplyr)
my.df %>%
  group_by(grp = as.integer(gl(n(), 5, n()))) %>%
  # or with rep:
  # group_by(grp = rep(row_number(), length.out = n(), each = 5))
  summarise(sum = sum(y), mean = mean(y))
# A tibble: 2 x 3
# grp sum mean
# <int> <dbl> <dbl>
#1 1 174 34.8
#2 2 211 42.2
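To see what the two helpers actually generate, here is a minimal standalone sketch for ten rows (with n standing in for n()):
n <- 10
as.integer(gl(n, 5, n))
# [1] 1 1 1 1 1 2 2 2 2 2
rep(seq_len(n), each = 5, length.out = n)
# [1] 1 1 1 1 1 2 2 2 2 2
Both produce the same grouping vector; gl() returns a factor, which is why the answer wraps it in as.integer(). Neither requires the number of rows to be a multiple of 5, since length.out simply truncates the last group.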

Another option could be:
my.df %>%
  group_by(x = ceiling(row_number() / 5)) %>%
  summarise_all(list(sum = sum, mean = mean))
# A tibble: 2 x 3
x sum mean
<dbl> <dbl> <dbl>
1 1 174 34.8
2 2 211 42.2
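The grouping vector here can likewise be checked directly outside the pipe:
ceiling(seq_len(10) / 5)
# [1] 1 1 1 1 1 2 2 2 2 2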

Related

Using Filter function in R. Need to assign NA and keep length of dataset the same for Horse Racing Database

I'm still new to the group and to R.
I had some really helpful feedback on my last query, so I'm hoping I can get some more support with the following:
I am working on a horse racing database that at this stage has 4 variables:
race horse number, race id, distance of the race, and the rating (DaH) assigned for the horse's performance in the race.
The dataset:
horse_ratings <- tibble(
  horse = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  raceid = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Dist = c(9.47, 9.47, 10, 10.1, 10.2, 9, 11, 9.47, 10.5),
  DaH = c(101, 99, 103, 101, 94, 87, 102, 96, 62)
)
Giving:
> horse_ratings
# A tibble: 9 x 4
horse raceid Dist DaH
<dbl> <dbl> <dbl> <dbl>
1 1 1 9.47 101
2 1 2 9.47 99
3 1 3 10 103
4 2 1 10.1 101
5 2 2 10.2 94
6 2 3 9 87
7 3 1 11 102
8 3 2 9.47 96
9 3 3 10.5 62
I will perform a number of calculations on the dataset, such as mean rating, max rating, etc., which I'd like to result in a number of vectors of equal length.
I'm using the filter function to look at the performance ratings achieved for different race distances (i.e. distance greater than 10 to begin with). However, if one of the horses has not run a race at that distance, then I've noticed that the result does not include that horse in the output, i.e.:
> horse_ratings %>%
+   group_by(horse) %>%
+   filter(Dist > 10) %>%
+   summarise(mean_rating = mean(DaH))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
horse mean_rating
<dbl> <dbl>
1 2 97.5
2 3 82
So horse 1 has disappeared, as it has not run a race of distance greater than 10.
I need to keep the output vector at length 3, ideally, so I can put all the calculations into a data frame of the same length (for my final data output/printout).
I'm hoping there's a way of assigning an NA or similar to the output for horse 1, giving:
# A tibble: 3 x 2
horse mean_rating
<dbl> <dbl>
1 1 NA
2 2 97.5
3 3 82
Or a similar solution.
Help would be much appreciated!!
You can use the .drop = FALSE parameter in group_by():
horse_ratings %>%
  group_by(horse, .drop = FALSE) %>%
  filter(Dist > 10) %>%
  summarise(mean_rating = mean(DaH))
# A tibble: 3 x 2
horse mean_rating
<dbl> <dbl>
1 1 NaN
2 2 97.5
3 3 82
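Note that mean() of a zero-length vector returns NaN rather than NA. If a true NA is needed, as in the question's desired output, a small follow-up step converts it (a sketch, assuming the same pipeline):
library(dplyr)
horse_ratings %>%
  group_by(horse, .drop = FALSE) %>%
  filter(Dist > 10) %>%
  summarise(mean_rating = mean(DaH)) %>%
  mutate(mean_rating = if_else(is.nan(mean_rating), NA_real_, mean_rating))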
Don't filter first; do the filtering inside summarise() so you don't drop groups (horse).
library(dplyr)
horse_ratings %>%
  group_by(horse) %>%
  summarise(mean_rating = mean(DaH[Dist > 10], na.rm = TRUE))
# A tibble: 3 x 2
# horse mean_rating
# <dbl> <dbl>
#1 1 NaN
#2 2 97.5
#3 3 82
library(tidyverse)
Method 1:
horse_stats <-
  horse_ratings %>%
  mutate(raceid = as.factor(raceid)) %>%
  filter(Dist > 10) %>%
  group_by(horse) %>%
  summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
  ungroup() %>%
  left_join(horse_ratings %>%
              select(horse) %>%
              distinct(),
            ., by = "horse")
Method 2:
horse_stats <-
  horse_ratings %>%
  mutate(raceid = factor(raceid),
         Dist = ifelse(Dist <= 10, 0, Dist),
         DaH = ifelse(Dist == 0, 0, DaH)) %>%
  group_by(horse) %>%
  summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
  ungroup() %>%
  mutate_if(is.numeric, list(~ na_if(., 0)))

dplyr: getting grouped min and max of columns in a for loop [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 3 years ago.
I am trying to get the grouped min and max of several columns using a for loop:
My data:
df <- data.frame(a=c(1:5, NA), b=c(6:10, NA), c=c(11:15, NA), group=c(1,1,1,2,2,2))
> df
a b c group
1 1 6 11 1
2 2 7 12 1
3 3 8 13 1
4 4 9 14 2
5 5 10 15 2
6 NA NA NA 2
My attempt:
cols <- df %>% select(a, b) %>% names()
for (i in seq_along(cols)) {
  output <- df %>%
    dplyr::group_by(group) %>%
    dplyr::summarise_(min = min(.dots = i, na.rm = TRUE),
                      max = max(.dots = i, na.rm = TRUE))
  print(output)
}
Desired output for column a:
group min max
<dbl> <int> <int>
1 1 1 3
2 2 4 5
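As an aside, the attempt above fails because summarise_() and its .dots argument are deprecated. For reference, a working version of the same loop might use the .data pronoun instead (a sketch, assuming a recent dplyr):
library(dplyr)
cols <- c("a", "b")
for (col in cols) {
  output <- df %>%
    group_by(group) %>%
    summarise(min = min(.data[[col]], na.rm = TRUE),  # .data[[col]] looks the column up by name
              max = max(.data[[col]], na.rm = TRUE))
  print(output)
}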
Using the dplyr and tidyr packages, you can get:
df %>%
  na.omit() %>%
  pivot_longer(-group) %>%
  group_by(group, name) %>%
  summarise(min = min(value),
            max = max(value)) %>%
  arrange(name, group)
# group name min max
# <dbl> <chr> <int> <int>
# 1 1 a 1 3
# 2 2 a 4 5
# 3 1 b 6 8
# 4 2 b 9 10
# 5 1 c 11 13
# 6 2 c 14 15
We can use summarise_all after grouping by 'group'. If the columns need to be in a particular order, use select on the column names:
library(dplyr)
library(stringr)
df %>%
  group_by(group) %>%
  summarise_all(list(min = ~ min(., na.rm = TRUE),
                     max = ~ max(., na.rm = TRUE))) %>%
  select(group, order(str_remove(names(.), "_.*")))
# A tibble: 2 x 7
# group a_min a_max b_min b_max c_min c_max
# <dbl> <int> <int> <int> <int> <int> <int>
#1 1 1 3 6 8 11 13
#2 2 4 5 9 10 14 15
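On dplyr 1.0 or later, summarise_all() is superseded; the same result can be written with across() (a sketch). Its default naming already interleaves the columns as a_min, a_max, b_min, ..., so the reordering select() is not needed:
df %>%
  group_by(group) %>%
  summarise(across(everything(),
                   list(min = ~ min(.x, na.rm = TRUE),
                        max = ~ max(.x, na.rm = TRUE))))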
Without using a for loop, with dplyr and tidyr from the tidyverse you can get the min and max of each column by 1) pivoting the data frame into a longer format, 2) getting the min and max values per group, and then 3) pivoting the data frame wider to get the expected output:
library(tidyverse)
df %>%
  pivot_longer(., cols = c(a, b, c), names_to = "Names", values_to = "Value") %>%
  group_by(group, Names) %>%
  summarise(Min = min(Value, na.rm = TRUE), Max = max(Value, na.rm = TRUE)) %>%
  pivot_wider(., names_from = Names, values_from = c(Min, Max)) %>%
  select(group, contains("_a"), contains("_b"), contains("_c"))
# A tibble: 2 x 7
# Groups: group [2]
group Min_a Max_a Min_b Max_b Min_c Max_c
<dbl> <int> <int> <int> <int> <int> <int>
1 1 1 3 6 8 11 13
2 2 4 5 9 10 14 15
Is this what you are looking for?
In base R, we can use aggregate and get min and max for multiple columns by group.
aggregate(. ~ group, df, function(x)
  c(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE)))
# group a.min a.max b.min b.max c.min c.max
#1 1 1 3 6 8 11 13
#2 2 4 5 9 10 14 15
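One caveat with this approach: when the function passed to aggregate() returns a vector, the result stores a.min/a.max and friends as matrix columns inside the data frame. If ordinary columns are needed, a common flattening step is:
res <- aggregate(. ~ group, df, function(x)
  c(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE)))
do.call(data.frame, res)  # splits each matrix column into a.min, a.max, ...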

Omitting columns instead of dropping them in purrr

I need to calculate an index for multiple lists. However, I can only do this if I drop some columns (here represented by "w" and "x"). For example:
library(tidyverse)
lists <- list(
  l1 = tribble(
    ~w, ~x, ~y, ~z,
    #--|---|---|---
    12, "a", 2, 1,
    12, "a", 5, 3,
    12, "a", 6, 2),
  l2 = tribble(
    ~w, ~x, ~y, ~z,
    #--|---|---|---
    13, "b", 5, 7,
    13, "b", 4, 6,
    13, "b", 3, 2))
lists %>%
  map(~ .x %>%
        # group_by(w, x) %>%
        select(-w, -x) %>%
        mutate(row_sums = rowSums(.)))
Instead of dropping those columns, I would like to keep them and calculate the index only for "y" and "z".
I managed to do this by first extracting those columns and binding them back afterwards. For example:
select.col <- lists %>%
  map_dfr(~ .x %>%
            select(w, x))
lists %>%
  map_dfr(~ .x %>%
            select(-w, -x) %>%
            mutate(row_sums = rowSums(.))) %>%
  bind_cols(select.col)
However, this is not very elegant, and I had to bind the lists (map_dfr); I would like to keep them as a list.
Another approach might be select_if(., is.numeric), but as I have some numeric columns I need to omit, I'm not sure that is the best option.
I'm certain there is a simple solution to this problem. Can anyone take a look at it?
Instead of dropping the columns, you can select only those columns for which you want to take the sum.
You can select by name:
library(dplyr)
library(purrr)
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[c("y", "z")])))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or by column position:
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[3:4])))
Here is a tidyverse approach to get the row sums
library(tidyverse)
lists %>%
  map(~ .x %>%
        mutate(row_sums = select(., y:z) %>%
                 reduce(`+`)))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or using base R
lapply(lists, transform, row_sums = y + z)
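For completeness, on dplyr 1.0+ the selection can also happen inside mutate() via across(), which avoids hard-coded positions and the reduce() trick alike (a sketch):
library(dplyr)
library(purrr)
lists %>%
  map(~ .x %>% mutate(row_sums = rowSums(across(y:z))))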

How to keep real values of grouped variable within dplyr package in R

My data is something like this:
group <- c(21, 21, 21, 9, 9, 9, 25, 25, 25)
a <- c(8, 3, 5, 6, 8, 3, 3, 9, 3)
b <- c(4, 9, 0, 1, 3, 5, 6, 1, 1)
c <- c(1, 7, 2, 5, 6, 8, 4, 8, 6)
value <- c(23, 34, 43, 52, 65, 21, 12, 89, 76)
df <- data.frame(group, a, b, c, value)
I applied the following to it:
out <- df %>%
  select(group, a, b, value) %>%
  group_by(group = gl(n() / 3, 3)) %>%
  summarise(res = mean(value), a = a[1], b = b[1])
print(out)
print(out)
Then I get the following result:
group res a b
<fct> <dbl> <dbl> <dbl>
1 1 33.3 8 4
2 2 46 6 1
3 3 59 3 6
My question is how to keep the original values of group in the output, like this:
group res a b
<fct> <dbl> <dbl> <dbl>
1 21 33.3 8 4
2 9 46 6 1
3 25 59 3 6
Thanks in advance!
The issue is that you are overwriting your group variable in the group_by() call, hence you are not getting the original values. Use some other name in group_by() and then do the calculations.
We can use two options:
1) With summarise
library(dplyr)
df %>%
  group_by(group1 = gl(n() / 3, 3)) %>%
  summarise(res = mean(value), a = a[1], b = b[1], group = group[1])
# group1 res a b group
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 33.3 8 4 21
#2 2 46 6 1 9
#3 3 59 3 6 25
2) With mutate
df %>%
  select(group, a, b, value) %>%
  group_by(group1 = gl(n() / 3, 3)) %>%
  mutate(res = mean(value), a = a[1], b = b[1]) %>%
  slice(1)
In both cases, if you are no longer interested in keeping the grouping variable, add ungroup() %>% select(-group1) to remove it.
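On dplyr 1.0+, the same idea reads a little more compactly with across() and first() (a sketch):
df %>%
  group_by(group1 = gl(n() / 3, 3)) %>%
  summarise(res = mean(value), across(c(a, b, group), first))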

Avoiding the use of for loop for cumsum

First generating some sample data:
doy <- rep(1:365, times = 2)
year <- rep(2000:2001, each = 365)
set.seed(1)
value <- runif(365 * 2, min = 0, max = 10)
doy.range <- c(40, 50, 60, 80)
thres <- 200
df <- data.frame(doy, year, value)
What I want to do is the following:
For df$year == 2000, starting from doy.range == 40, start adding up df$value and find the df$doy at which the cumulative sum of df$value is >= thres.
Here's my long for loop to achieve this:
# create a matrix to store results
mat <- matrix(, nrow = length(doy.range) * length(unique(year)), ncol = 3)
mat[, 1] <- rep(unique(year), each = 4)
mat[, 2] <- rep(doy.range, times = 2)
for (i in unique(df$year)) {
  dat <- df[df$year == i, ]
  for (j in doy.range) {
    dat1 <- dat[dat$doy >= j, ]
    dat1$cum.sum <- cumsum(dat1$value)
    day.thres <- dat1[dat1$cum.sum >= thres, "doy"][1]  # doy at which the cumulative sum of value first reaches thres
    mat[mat[, 2] == j & mat[, 1] == i, 3] <- day.thres
  }
}
This loop gives me, in the third column of the matrix, the doy at which the cumulative sum of value exceeded thres.
However, I really want to avoid the loops. Is there any way I can do this with less code?
If I understand correctly, you can use dplyr. Assume a threshold of 200:
library(dplyr)
df %>%
  group_by(year) %>%
  filter(doy >= 40) %>%
  mutate(CumSum = cumsum(value)) %>%
  filter(CumSum >= 200) %>%
  top_n(n = -1, wt = CumSum)
which yields
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 78 2000 3.899895 201.4864
2 75 2001 9.205178 204.3171
The verbs used are self-explanatory I guess. If not, let me know.
For different values of doy, create a function and use lapply:
f <- function(doy.range) {
  df %>%
    group_by(year) %>%
    filter(doy >= doy.range) %>%
    mutate(CumSum = cumsum(value)) %>%
    filter(CumSum >= 200) %>%
    top_n(n = -1, wt = CumSum)
}
lapply(doy.range, f)
[[1]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 78 2000 3.899895 201.4864
2 75 2001 9.205178 204.3171
[[2]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 89 2000 2.454885 200.2998
2 91 2001 6.578281 200.6544
[[3]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 98 2000 4.100841 200.5048
2 102 2001 7.158333 200.3770
[[4]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 120 2000 6.401010 204.9951
2 120 2001 5.884192 200.8252
The idea is to create a function that, given a starting doy and a threshold, gets you the relevant info, then to apply this function to different combinations of starting doys and thresholds and get back a dataset with all the relevant info:
# create example data
doy <- rep(1:365, times = 2)
year <- rep(2000:2001, each = 365)
set.seed(1)
value <- runif(365 * 2, min = 0, max = 10)
df <- data.frame(doy, year, value)
library(dplyr)
library(purrr)
# function (inputs: dr for doy range and t for threshold)
f <- function(dr, t) {
  df %>%
    filter(doy >= dr) %>%                       # keep rows at or above a given doy
    group_by(year) %>%                          # for each year
    mutate(CumSumValue = cumsum(value)) %>%     # get the cumulative sum of value
    filter(CumSumValue >= t) %>%                # keep rows at or above a given threshold
    slice(1) %>%                                # keep the first such row
    ungroup() %>%                               # forget the grouping
    select(-value) %>%                          # remove the unnecessary variable
    mutate(doy_input = dr, thres_input = t) %>% # add the input info as columns
    select(doy_input, thres_input, year, doy, CumSumValue) # rearrange columns
}
# input doy and threshold
doy.range <- c(40,50,60,80)
thres <- 200
# map those vectors to the function
map2_df(doy.range, thres, f)
# # A tibble: 8 x 5
# doy_input thres_input year doy CumSumValue
# <dbl> <dbl> <int> <int> <dbl>
# 1 40 200 2000 78 201.4864
# 2 40 200 2001 75 204.3171
# 3 50 200 2000 89 200.2998
# 4 50 200 2001 91 200.6544
# 5 60 200 2000 98 200.5048
# 6 60 200 2001 102 200.3770
# 7 80 200 2000 120 204.9951
# 8 80 200 2001 120 200.8252
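If several thresholds are needed as well as several starting days, one possible extension is to build all combinations first and map over both columns; the sketch below assumes tidyr is available and uses a second threshold value (150) purely for illustration:
library(tidyr)
params <- crossing(dr = doy.range, t = c(150, 200))
map2_df(params$dr, params$t, f)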
