cumulative grouping - r

I have the following data frame:
df = data.frame(a = c(1,1,3,2,2), b=6:10)
## a b
## 1 6
## 1 7
## 3 3
## 2 9
## 2 10
I want to analyze the data by groups (a is the grouping variable), but instead of the usual grouping (i.e. each value defines a group of rows, and the groups are disjoint) I need "cumulative groups". That is, for the value a=i, the group should contain all the rows in which a<=i. These groups are not disjoint, but I still want to summarize each group separately.
So for example, if for each group I want the mean of b, the result would be:
## a mean_b
## 1 6.5
## 2 8
## 3 7
Note that in the real scenario behind this simplified example, I cannot analyze the disjoint groups separately and then aggregate the relevant groups. The summarize function must be "aware" of all the rows in a group to perform the computation.
Of course, I can use some apply functions, compute things the good old way, and build a new df out of the results, but I am looking for dplyr/tidyverse-like functions to do that.
Any suggestions?

How about something like this?
library(dplyr)
df %>%
arrange(a) %>%
group_by(a) %>%
summarise(sum_b = sum(b)) %>%
ungroup() %>%
mutate(sum_b = cumsum(sum_b))
# a sum_b
# <dbl> <int>
#1 1. 13
#2 2. 32
#3 3. 40
We take the sum of b within each group of a, then take the cumulative sum, so that each group's total also includes the totals of all preceding groups.
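The same idea extends to the mean asked for in the question, because a mean can be rebuilt from running sums and running counts. The sketch below assumes the data as printed in the question (third value of b equal to 3); it only works for statistics that decompose this way, which the question says is not true of the real use case.

```r
library(dplyr)

# Data as printed in the question (b[3] is 3, not 8)
df <- data.frame(a = c(1, 1, 3, 2, 2), b = c(6, 7, 3, 9, 10))

cum_means <- df %>%
  group_by(a) %>%                                  # groups are sorted by a
  summarise(sum_b = sum(b), n = n()) %>%           # per-group sum and size
  mutate(mean_b = cumsum(sum_b) / cumsum(n)) %>%   # running sum / running count
  select(a, mean_b)

cum_means$mean_b
```

For a statistic such as the median there is no such shortcut, and every cumulative group has to be materialised in full.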

I had a look and I don't see how it is possible with dplyr itself. However, we can hack the group_by function to make it cumulative. I'll quickly walk you through it:
First, I make your df. It doesn't really fit your output above, so I slightly changed it.
df = data.frame(a = c(1,1,3,2,2), b=6:10)
df$b[3] <- 3
Now I use the normal group_by to check out what it actually does to the data.frame.
library(dplyr)
df_grouped <- df %>%
arrange(a) %>%
group_by(a)
> attributes(df_grouped)
$class
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5
$names
[1] "a" "b"
$vars
[1] "a"
$drop
[1] TRUE
$indices
$indices[[1]]
[1] 0 1
$indices[[2]]
[1] 2 3
$indices[[3]]
[1] 4
$group_sizes
[1] 2 2 1
$biggest_group_size
[1] 2
$labels
a
1 1
2 2
3 3
So besides other things, there is a new attribute called indices where the group of each element in the grouped variable is referenced. We can actually just change that to make it cumulative.
for (i in seq_along(attributes(df_grouped)[["indices"]])[-1]) {
attributes(df_grouped)[["indices"]][[i]] <- c(
attributes(df_grouped)[["indices"]][[i - 1]],
attributes(df_grouped)[["indices"]][[i]]
)
}
It looks a bit weird but is straightforward. The elements of each group are added to the next group. E.g. all elements from group 1 are added to group 2.
> attributes(df_grouped)$indices
[[1]]
[1] 0 1
[[2]]
[1] 0 1 2 3
[[3]]
[1] 0 1 2 3 4
We can use the changed groups in the normal dplyr way.
> df_grouped %>%
+ summarise(mean_b = mean(b))
# A tibble: 3 x 2
a mean_b
<dbl> <dbl>
1 1 6.5
2 2 8
3 3 7
Now of course this is pretty ugly and looks very hacky. But inside a function that doesn't really matter as long as it is still efficient (which it is). So let's make a custom group_by.
group_by_cuml <- function(.data, ...) {
.data_grouped <- group_by(.data, ...)
for (i in seq_along(attributes(.data_grouped)[["indices"]])[-1]) {
attributes(.data_grouped)[["indices"]][[i]] <- c(
attributes(.data_grouped)[["indices"]][[i - 1]],
attributes(.data_grouped)[["indices"]][[i]]
)
}
return(.data_grouped)
}
Now you can use the custom function in a clean dplyr pipe.
> df %>%
+ group_by_cuml(a) %>%
+ summarise(mean_b = mean(b))
# A tibble: 3 x 2
a mean_b
<dbl> <dbl>
1 1 6.5
2 2 8
3 3 7
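The indices attribute used above is an internal detail of the dplyr version this answer was written against, and newer releases store grouping data differently, so the hack may not carry over. A minimal sketch that avoids internals is to build each cumulative subset explicitly and summarise it. The helper name cumulative_summarise and its col argument are hypothetical, not part of dplyr.

```r
library(dplyr)

# Data as modified in this answer (b[3] set to 3)
df <- data.frame(a = c(1, 1, 3, 2, 2), b = c(6, 7, 3, 9, 10))

# Hypothetical helper: for each distinct value v of column `col`,
# summarise all rows whose value is <= v.
cumulative_summarise <- function(.data, col, ...) {
  vals <- sort(unique(.data[[col]]))
  pieces <- lapply(vals, function(v) {
    subset_v <- .data[.data[[col]] <= v, , drop = FALSE]
    res <- summarise(subset_v, ...)   # the summary sees every row of the group
    res[[col]] <- v
    res
  })
  bind_rows(pieces) %>% select(all_of(col), everything())
}

res <- cumulative_summarise(df, "a", mean_b = mean(b))
res
```

Each subset is passed whole to summarise(), so any summary function can be used, at the cost of copying rows once per group.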

I would do it this way:
library(dplyr)
library(purrr)
df %>%
  arrange(a) %>%
  map_dfr(seq_along(as <- unique(.$a)),
          ~filter(.y, a %in% as[1:.]), .y = ., .id = "meta_group") %>%
  group_by(a = meta_group) %>%
  summarise(b = mean(b))
With df as constructed in the question (b = 6:10) this gives:
# # A tibble: 3 x 2
# a b
# <chr> <dbl>
# 1 1 6.5
# 2 2 8.0
# 3 3 8.0
If you want a separate function you can do:
summarize2 <- function(.data, ..., .by){
  # sort the unique group values so that group i accumulates every value up to the i-th smallest
  grps <- select_at(.data, .by) %>% pull %>% unique %>% sort
  .data %>%
    arrange_at(.by) %>%
    map_dfr(seq_along(grps),
            ~ filter_at(.y, .by, all_vars(. %in% grps[1:.x])),
            .y = .,
            .id = "meta_group") %>%
    group_by(meta_group) %>%
    summarise(...)
}
df %>%
  summarize2(b = mean(b), .by = "a")
# # A tibble: 3 x 2
# meta_group b
# <chr> <dbl>
# 1 1 6.5
# 2 2 8.0
# 3 3 8.0
df %>%
  summarize2(b = mean(b), .by = vars(a))
# # A tibble: 3 x 2
# meta_group b
# <chr> <dbl>
# 1 1 6.5
# 2 2 8.0
# 3 3 8.0

One way is to use the base function Reduce with the argument accumulate = TRUE. Once you concatenate, then you can apply any function, i.e.
Reduce(c, split(df$b,df$a), accumulate = TRUE)
#[[1]]
#[1] 6 7
#[[2]]
#[1] 6 7 9 10
#[[3]]
#[1] 6 7 9 10 3
and then for the mean,
sapply(Reduce(c, split(df$b,df$a), accumulate = TRUE), mean)
#[1] 6.5 8.0 7.0
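To get back a data frame shaped like the one requested in the question, the accumulated list can be paired with the group labels that split() produces. This is a small sketch on the data as printed in the question (b[3] equal to 3); the variable names are illustrative.

```r
df <- data.frame(a = c(1, 1, 3, 2, 2), b = c(6, 7, 3, 9, 10))

groups <- split(df$b, df$a)                       # one vector of b per value of a, sorted by a
cum_groups <- Reduce(c, groups, accumulate = TRUE) # element i holds b for all groups up to i

result <- data.frame(
  a = as.numeric(names(groups)),
  mean_b = sapply(cum_groups, mean)
)
result
```

Any other function can replace mean in the sapply() call, since each list element contains the full cumulative group.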

Related

Row mean of two matching columns with same name but differ by: '_1' and '_2'

Let's say I have the data frame:
z = data.frame(col_1 = c(1,2,3,4), col_2 = c(3,4,5,6))
col_1 col_2
1 1 3
2 2 4
3 3 5
4 4 6
I want to take columns whose names differ only by the numeric suffix, e.g. '_1' and '_2', and compute the pairwise mean. In reality I have a big data frame with many pairs, and they are not in a nice order, so I am looking for a clever solution that can be applied to that case.
So the output should look like this:
col
1 2
2 3
3 4
4 5
With the column name given as the same as the column pair but with the additional label removed.
Any help would be great thanks.
Here is a base R option using list2DF + split.default + rowMeans
list2DF(lapply(split.default(z,gsub("_\\d+","",names(z))),rowMeans))
which gives
col
1 2
2 3
3 4
4 5
Try this tidyverse approach. By using separate() you can extract the name, and then with reshaping you can reach the desired output. Here is the code:
library(dplyr)
library(tidyr)
#Data
z = data.frame(col_1 = c(1,2,3,4), col_2 = c(3,4,5,6))
#Code
z1 <- z %>% mutate(id=1:n()) %>%
pivot_longer(-id) %>%
separate(name,c('var1','var2'),sep='_') %>%
group_by(id,var1) %>% summarise(Mean=mean(value)) %>%
pivot_wider(names_from = var1,values_from=Mean) %>% ungroup() %>% select(-id)
Output:
# A tibble: 4 x 1
col
<dbl>
1 2
2 3
3 4
4 5
Here is a purrr oriented solution:
library(purrr)
library(stringr)
split.default(z, str_remove(names(z), "_\\d+$")) %>% map_dfc(rowMeans)
#> # A tibble: 4 x 1
#> col
#> <dbl>
#> 1 2
#> 2 3
#> 3 4
#> 4 5
It works even if z is:
z <- data.frame(col_1 = c(1,2,3,4),
col_2 = c(3,4,5,6),
anothercol_1 = c(1,2,3,4),
anothercol_2 = c(3,4,5,6))
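As a quick check of the "many pairs, not in a nice order" case, here is a base R sketch with two pairs whose columns are interleaved. The column names are made up for illustration; the regex strips the trailing underscore together with the digits so the result is named after the bare stem.

```r
# Two pairs in scrambled column order (illustrative data)
z <- data.frame(col_2   = c(3, 4, 5, 6),
                other_1 = c(10, 20, 30, 40),
                col_1   = c(1, 2, 3, 4),
                other_2 = c(30, 40, 50, 60))

stems <- sub("_\\d+$", "", names(z))   # "col" "other" "col" "other"

# split.default() splits the data frame by columns; rowMeans() averages each pair
means <- as.data.frame(lapply(split.default(z, stems), rowMeans))
means
```

Because the grouping is done on the stem, the order of the input columns does not matter.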

Create multiple data that count for unique values of each variables using dplyr and loop

I have a question about using dplyr together with a for loop to create multiple data frames. The code without the loop works well, but the code with the for loop gives an error message instead of the expected result.
The error message was:
"Error in UseMethod ("select_") : no applicable method for 'select_'
applied to an object of class "character""
Could anyone please point me in the right direction?
The code below worked
B <- data %>% select (column1) %>% group_by (column1) %>% arrange (column1) %>% summarise (n = n ())
The code below did not work
column_list <- c ('column1', 'column2', 'column3')
for (b in column_list) {
a <- data %>% select (b) %>% group_by (b) %>% arrange (b) %>% summarise (n = n () )
assign (paste0(b), a)
}
Don't use assign. Instead use lists.
We can use the _at variants in dplyr, which work with column names given as character strings.
library(dplyr)
split_fun <- function(df, col) {
df %>% group_by_at(col) %>% summarise(n = n()) %>% arrange_at(col)
}
and then use lapply/map to apply it to different columns
purrr::map(column_list, ~split_fun(data, .))
This returns a list of data frames, whose elements can be accessed individually with [[ if needed.
An example with mtcars:
df <- mtcars
column_list <- c ('cyl', 'gear', 'carb')
purrr::map(column_list, ~split_fun(df, .))
#[[1]]
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
#[[2]]
# A tibble: 3 x 2
# gear n
# <dbl> <int>
#1 3 15
#2 4 12
#3 5 5
#[[3]]
# A tibble: 6 x 2
# carb n
# <dbl> <int>
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1
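If the goal of the original assign() loop was to look results up by column name, naming the list gives the same convenience. The sketch below is one way to do it, assuming a dplyr version that provides across() and all_of(); the helper name count_by is hypothetical.

```r
library(dplyr)

# Hypothetical helper: count rows per distinct value of the column named in `col`
count_by <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(n = n(), .groups = "drop")
}

column_list <- c("cyl", "gear", "carb")

# setNames() makes lapply() return a list named after the columns
counts <- lapply(setNames(column_list, column_list),
                 function(col) count_by(mtcars, col))

counts$cyl
```

Each table is then available as counts$cyl, counts$gear and counts$carb without creating separate variables in the workspace.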

Remove first 10 and last 10 values

I have a file that contains multiple individuals and multiple values for the same individual.
I need to remove the first 10 and last 10 values of each individual, putting all the leftover values in a new table.
This is what my data kinda looks like:
Cow Data
NL123456 123
NL123456 456
I tried a for loop, counting how many values there were per individual (but I think I already got stuck there, probably because I am not using the right command; all values in Cow are factors).
I figured removing the first and last had to be something like this:
data1[c(11: n-10),]
If you know you always have more than 20 data points per cow, you can do the following, illustrated on the iris dataset:
library(dplyr)
dim(iris)
# [1] 150 5
iris_trimmed <-
iris %>%
group_by(Species) %>%
slice(11:(n()-10)) %>%
ungroup()
dim(iris_trimmed)
# [1] 90 5
On your data :
res <-
your_data %>%
group_by(Cow) %>%
slice(11:(n()-10)) %>%
ungroup()
In base R you can do :
iris_trimmed <- do.call(
rbind,
lapply(split(iris, iris$Species),
function(x) head(tail(x,-10),-10)))
dim(iris_trimmed)
# [1] 90 5
Using data.table:
library(data.table)
idt <- as.data.table(iris)
idt[, .SD[11:(.N-10)], Species]
Same logic in base R:
do.call(
rbind,
lapply(
split(iris, iris[["Species"]]),
function(x) x[11:(nrow(x)-10), ]
)
)
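Two details in the snippets above are easy to trip over: the parentheses in 11:(n - 10) are required because ":" binds more tightly than "-", and a group with 20 rows or fewer makes the sequence run backwards. A short base R sketch of both points follows; trim_rows is a hypothetical helper, and returning zero rows for small groups is just one possible policy.

```r
# ":" is evaluated before "-"
n <- 50
wrong <- 11:n - 10     # (11:50) - 10, i.e. 1, 2, ..., 40
right <- 11:(n - 10)   # 11, 12, ..., 40

# Hypothetical helper: drop the first and last k rows of a data frame,
# returning zero rows when there are 2 * k rows or fewer.
trim_rows <- function(x, k = 10) {
  if (nrow(x) <= 2 * k) return(x[0, , drop = FALSE])
  x[(k + 1):(nrow(x) - k), , drop = FALSE]
}

iris_trimmed <- do.call(rbind, lapply(split(iris, iris$Species), trim_rows))
dim(iris_trimmed)
```

Without the guard, a group of 15 rows would be indexed with 11:5, which selects rows 11 down to 5 instead of nothing.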
Here is a solution with dplyr.
In my example I cut only the first and last value of each group (you can adapt it by changing the 2 in filter to another number).
The idea is to add, after group_by(id), the row number of each observation counted from the top (n) and from the bottom (n1), and then simply filter.
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(n = 1:n(),
         n1 = n():1) %>% # n and n1 are the row numbers from the top and from the bottom
  filter(n >= 2, n1 >= 2) %>% # use 11 instead of 2 to drop the first and last 10 rows
  # filter() keeps only the rows that you want
  select(-n, -n1) %>%
  ungroup()
# # A tibble: 4 x 2
# id value
# <dbl> <int>
# 1 1 6
# 2 1 8
# 3 2 1
# 4 2 2
Data:
set.seed(123)
data <- data.frame(id = c(rep(1,4), rep(2,4)), value=sample(8))
data
# id value
# 1 1 3
# 2 1 6
# 3 1 8
# 4 1 5
# 5 2 4
# 6 2 1
# 7 2 2
# 8 2 7
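The helper columns can be avoided by calling row_number() and n() directly inside filter(). This is a sketch of the same logic with the number of rows to drop pulled out into a variable k, a name chosen here for illustration.

```r
library(dplyr)

set.seed(123)
data <- data.frame(id = c(rep(1, 4), rep(2, 4)), value = sample(8))

k <- 1  # rows to drop at each end of every group; use 10 for the original question

trimmed <- data %>%
  group_by(id) %>%
  filter(row_number() > k, row_number() <= n() - k) %>%
  ungroup()

trimmed
```

Groups with 2 * k rows or fewer simply end up empty, because no row number satisfies both conditions.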

How to create a column that is a group label for unique collections of other columns data table [duplicate]

I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).
EDIT: this was subsequently resolved by adding the (now-deprecated) group_indices() back in dplyr 0.4.0
a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3...
e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on.
How to do this with one mutate(), without a three-step summarize-and-self-join?
dplyr has a neat function n(), but that gives the number of elements within its group, not the overall number of the group. In data.table this would simply be called .GRP.
b) Actually what I really want is to assign a string/character label ('A','B',...).
But numbering groups by integers is good-enough, because I can then use integer_to_label(i) as below. Unless there's a clever way to merge these two? But don't sweat this part.
set.seed(1234)
# Helper fn for mapping integer 1..26 to character label
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",i,i) }
df <- tibble::as_tibble(data.frame(u=sample.int(3,10,replace=T), v=sample.int(4,10,replace=T)))
# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u,v) %>% mutate(label = n()) # WRONG: n() is number of element within its group, not overall number of group
u v
1 2 3
2 1 3
3 1 2
4 2 3
5 1 2
6 3 3
7 1 3
8 1 2
9 3 1
10 3 4
KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()) , then self-join
dplyr has a group_indices() function that you can use like this:
df %>%
mutate(label = group_indices(., u, v)) %>%
group_by(label) ...
Another approach using data.table would be
require(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]
which results in:
u v label
1: 2 1 1
2: 1 3 2
3: 2 1 1
4: 3 4 3
5: 3 1 4
6: 1 1 5
7: 3 2 6
8: 2 3 7
9: 3 2 6
10: 3 4 3
In current dplyr, cur_group_id() (introduced in version 1.0.0) replaces the older group_indices() for this use inside mutate().
Call it on the grouped data.frame:
df %>%
group_by(u, v) %>%
mutate(label = cur_group_id())
# A tibble: 10 x 3
# Groups: u, v [6]
u v label
<int> <int> <int>
1 2 2 4
2 2 2 4
3 1 3 2
4 3 2 6
5 1 4 3
6 1 2 1
7 2 2 4
8 2 4 5
9 3 2 6
10 2 4 5
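For part b) of the question, the integer id can be turned into a letter in the same mutate() by indexing the built-in LETTERS vector. This is a sketch using the ten (u, v) rows printed in the question; it assumes at most 26 groups, since LETTERS has only 26 elements.

```r
library(dplyr)

# The (u, v) pairs as printed in the question
df <- data.frame(u = c(2, 1, 1, 2, 1, 3, 1, 1, 3, 3),
                 v = c(3, 3, 2, 3, 2, 3, 3, 2, 1, 4))

labelled <- df %>%
  group_by(u, v) %>%
  mutate(id = cur_group_id(),    # integer id, groups ordered by (u, v)
         label = LETTERS[id]) %>% # map 1..26 to "A".."Z"
  ungroup()

labelled
```

The ids follow the sorted order of the groups, so (1, 2) gets "A" even though it is not the first combination to appear in the data.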
Updated answer
get_group_number = function(){
i = 0
function(){
i <<- i+1
i
}
}
group_number = get_group_number()
df %>% group_by(u,v) %>% mutate(label = group_number())
You can also consider the following slightly unreadable version
group_number = (function(){i = 0; function() i <<- i+1 })()
df %>% group_by(u,v) %>% mutate(label = group_number())
Using the iterators package:
library(iterators)
counter = icount()
df %>% group_by(u,v) %>% mutate(label = nextElem(counter))
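The counter versions above all rest on a closure that keeps its own state between calls; the answer relies on mutate() evaluating the expression once per group. The base R sketch below shows only the closure part, with make_counter as an illustrative name.

```r
# Each call to make_counter() creates a fresh environment holding its own `i`
make_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1   # modifies `i` in the enclosing environment, not a global
    i
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

first  <- counter_a()  # 1
second <- counter_a()  # 2
other  <- counter_b()  # 1, counter_b has independent state
```

Because the state persists, reusing the same counter in a second pipeline continues numbering where the first one stopped; create a new counter for each run to restart at 1.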
Updating my answer with three different ways:
A) A neat non-dplyr solution using interaction(u,v):
> df$label <- factor(interaction(df$u,df$v, drop=T))
[1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4
> match(df$label, levels(df$label)[ rank(unique(df$label)) ] )
[1] 1 2 3 4 5 4 6 6 7 7
B) Making Randy's neat fast-and-dirty generator-function answer more compact:
get_next_integer = function(){
i = 0
function(u,v){ i <<- i+1 }
}
get_integer = get_next_integer()
df %>% group_by(u,v) %>% mutate(label = get_integer())
C) Also here is a one-liner using a generator function abusing a global variable assignment from this:
i <- 0
generate_integer <- function() { return(assign('i', i+1, envir = .GlobalEnv)) }
df %>% group_by(u,v) %>% mutate(label = generate_integer())
rm(i)
I don't have enough reputation for a comment, so I'm posting an answer instead.
The solution using factor() is a good one, but it has the disadvantage that group numbers are assigned after factor() alphabetizes its levels. The same behaviour happens with dplyr's group_indices(). Perhaps you would like the group numbers to be assigned from 1 to n based on the current group order. In which case, you can use:
my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
