Summarize using different grouping variables in dplyr - r

I would like to summarize a dataframe using a different grouping variable for each summary. As an example, I have three variables (x1, x2, x3). I want to group the dataframe by x1 and get the number of observations in each group, and then do the same for x2 and x3.
I would like to accomplish this with the same block of piping but so far the only solution I have come up with is to save multiple outputs for each individual grouping I would like.
To reproduce my dataframe:
x1 <- c(0,1,1,2,2,3,3,3,4,4,5,6,6,7,8,9,9,10)
x2 <- c(0,0,1,1,0,1,2,0,0,2,1,0,3,4,2,3,0,3)
x3 <- c(0,1,0,1,2,2,1,3,4,2,4,6,3,3,6,6,9,7)
df <- data.frame(x1,x2,x3)
My expected output would look something like this, where x runs from the minimum to the maximum value found across the variables, and n_x1 to n_x3 are the number of observations equal to x when that variable is used as the grouping variable:
x n_x1 n_x2 n_x3
1 0 1 7 2
2 1 2 4 3
3 2 2 3 3
4 3 3 3 3
5 4 2 1 2
6 5 1 NA NA
7 6 2 NA 3
8 7 1 NA 1
9 8 1 NA NA
10 9 2 NA 1
11 10 1 NA NA
So far I have come up with summarizing and grouping by each variable individually and then joining them all together as a last step.
x1_count <- df %>%
group_by(x1) %>%
summarise(n_x1=n())
x2_count <- df %>%
group_by(x2) %>%
summarise(n_x2=n())
x3_count <- df %>%
group_by(x3) %>%
summarise(n_x3=n())
all_count <- full_join(x1_count, x2_count,
by=c("x1"="x2")) %>%
full_join(., x3_count,
by=c("x1"="x3")) %>%
rename("x"="x1")
Is there some type of workaround where I wouldn't have to output multiple dataframes and later join them together? I would prefer a cleaner, more elegant solution.

A simple tidyr solution:
library(dplyr) # for %>%, group_by(), summarize()
library(tidyr)
df %>%
pivot_longer(everything(),names_to="variables",values_to="values") %>%
group_by(variables,values) %>%
summarize(n_x=n()) %>%
ungroup() %>%
pivot_wider(names_from = variables,values_from=n_x)
# A tibble: 11 x 4
values x1 x2 x3
<dbl> <int> <int> <int>
1 0 1 7 2
2 1 2 4 3
3 2 2 3 3
4 3 3 3 3
5 4 2 1 2
6 5 1 NA NA
7 6 2 NA 3
8 7 1 NA 1
9 8 1 NA NA
10 9 2 NA 1
11 10 1 NA NA

We can use a simple map with full_join
library(dplyr)
library(purrr)
library(stringr) # for str_c()
map(names(df), ~ df %>%
count(!!rlang::sym(.x)) %>%
rename_at(1, ~ 'x')) %>%
reduce(full_join, by = 'x') %>%
rename_at(-1, ~ str_c('n_x', seq_along(.)))
# x n_x1 n_x2 n_x3
#1 0 1 7 2
#2 1 2 4 3
#3 2 2 3 3
#4 3 3 3 3
#5 4 2 1 2
#6 5 1 NA NA
#7 6 2 NA 3
#8 7 1 NA 1
#9 8 1 NA NA
#10 9 2 NA 1
#11 10 1 NA NA
Or using a simple base R option
t(table(c(col(df)), unlist(df)))
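To unpack the one-liner above: it tabulates every value against the column it came from. Here is a minimal base R sketch of the same idea, assuming the goal is the n_x1 to n_x3 layout shown in the question. The names vals, counts and result are illustrative and do not come from the original answers.

```r
x1 <- c(0,1,1,2,2,3,3,3,4,4,5,6,6,7,8,9,9,10)
x2 <- c(0,0,1,1,0,1,2,0,0,2,1,0,3,4,2,3,0,3)
x3 <- c(0,1,0,1,2,2,1,3,4,2,4,6,3,3,6,6,9,7)
df <- data.frame(x1, x2, x3)

# All distinct values that occur in any column, in ascending order
vals <- sort(unique(unlist(df)))

# Count each value per column; a shared set of factor levels keeps the rows aligned
counts <- sapply(df, function(col) as.vector(table(factor(col, levels = vals))))

# Assumption: absent values should show as NA (as in the question) rather than 0
counts[counts == 0] <- NA

result <- data.frame(x = vals, counts)
names(result)[-1] <- paste0("n_", names(df))
result
```

Unlike the table() one-liner, which reports 0 for values that never occur in a column, this version converts those zeros to NA to match the expected output.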

Related

Coalesce multiple columns at once

My question is similar to existing questions about coalesce, but I want to coalesce several columns by row such that NAs are pushed to the last column.
Here's an example:
If I have
a <- data.frame(A=c(2,NA,4,3,2), B=c(NA,3,4,NA,5), C= c(1,3,6,7,NA), D=c(5,6,NA,4,3), E=c(2,NA,1,3,NA))
A B C D E
1 2 NA 1 5 2
2 NA 3 3 6 NA
3 4 4 6 NA 1
4 3 NA 7 4 3
5 2 5 NA 3 NA
I would like to get
b <- data.frame(A=c(2,3,4,3,2), B=c(1,3,4,7,5), C=c(5,6,6,4,3), D=c(2,NA,1,3,NA))
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
Does anyone have any ideas for how I could do this? I would be so grateful for any tips, as my searches have come up dry.
You can use unite and separate:
library(tidyverse)
a %>%
unite(newcol, everything(), na.rm = TRUE) %>%
separate(newcol, into = LETTERS[1:4])
A B C D
1 2 1 5 2
2 3 3 6 <NA>
3 4 4 6 1
4 3 7 4 3
5 2 5 3 <NA>
If the number of new columns is not known in advance, you can use cSplit from the splitstackshape package instead of separate:
library(splitstackshape)
a %>%
unite(newcol, na.rm = TRUE) %>%
cSplit("newcol", "_", type.convert = FALSE) %>%
rename_with(~ LETTERS[seq_along(.x)]) # one letter per resulting column
This could be another solution. From what I understood, you basically just want to shift the values in each row after the first NA to the left, replacing that NA, and I don't think coalesce can help you here.
library(dplyr)
library(purrr)
a %>%
pmap_dfr(~ {x <- c(...)[-which(is.na(c(...)))[1]]
setNames(x, LETTERS[seq_along(x)])})
# A tibble: 5 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
We may use base R - loop over the rows, order based on the NA elements and remove the columns that have all NAs
a[] <- t(apply(a, 1, \(x) x[order(is.na(x))]))
a[colSums(!is.na(a)) > 0]
A B C D
1 2 1 5 2
2 3 3 6 NA
3 4 4 6 1
4 3 7 4 3
5 2 5 3 NA
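The key step in the base R answer is order(is.na(x)). is.na() gives FALSE for values and TRUE for NAs, and because order() leaves ties in their original order, the non-NA values keep their left-to-right order while the NAs move to the end. A small illustration on the second row of a (r2 and shifted are illustrative names, not part of the answer):

```r
# Second row of the example data frame `a`
r2 <- c(A = NA, B = 3, C = 3, D = 6, E = NA)

# FALSE sorts before TRUE, so non-NA positions come first, in their original order
idx <- order(is.na(r2))

# Values shifted left, NAs pushed to the right
shifted <- unname(r2[idx])
shifted
```

Applying this to every row and then dropping the columns that are entirely NA is what the two lines of the answer do.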

Is there a way to group values in a column between data gaps in R?

I want to group my data in different chunks when the data is continuous. Trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
Using rle in base R -
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
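To see why the rle approach works, it can help to run the steps one at a time. This is a sketch on the b column from the question only; the names runs and group are illustrative.

```r
b <- c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25)

# Run-length encode the "has data" indicator:
# values  TRUE FALSE TRUE FALSE TRUE
# lengths    3     3    2     1    1
runs <- rle(!is.na(b))

# cumsum() over the run values increases by one at each run of data,
# so every data run gets its own id; rep() expands it back to row length
group <- rep(cumsum(runs$values), runs$lengths)

# Rows inside a gap inherit the id of the previous run, so blank them out
group[is.na(b)] <- NA
group
```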
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transition from non-complete cases (using lag) to complete cases.
library(dplyr)
test %>%
mutate(test = complete.cases(.)) %>%
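mutate(group = cumsum(test & !lag(test, default = FALSE)), # a new group starts at each FALSE -> TRUE transition
group = replace(group, !test, NA)) # rows with missing data get no group; mutate is used because dplyr does not allow modifying a grouping variable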
Alternatively, you could add row numbers to your data.frame. Then, you could filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
Using data.table, get rleid then remove group IDs for NAs, then fix the sequence with factor to integer conversion:
library(data.table)
setDT(test)[, group1 := {
x <- complete.cases(test)
grp <- rleid(x)
grp[ !x ] <- NA
as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3

How to repeat values for different variables in cases with the same ID in R

I want to copy the values of these variables into the rows where they are NA, within rows that share the same IDs. This is an outline of the actual data structure (the identifiers are in the columns id1 to id4):
id1<-rep(3,8)
id2<-c(rep(1,3),rep(2,4),3)
id3<-rep(1,8)
id4<-c(rep(1,3),rep(2,4),3)
v1<-c(1,NA,NA,1,NA,NA,NA,2)
v2<-c(3,NA,NA,5,NA,NA,NA,1)
v3<-c(4,3,4,5,4,2,1,1)
v4<-c(4,8,2,5,4,3,1,1)
df1 <- data.frame(id1,id2,id3,id4,v1,v2,v3,v4)
This is the visualization:
> data.frame(id1,id2,id3,id4,v1,v2,v3,v4)
id1 id2 id3 id4 v1 v2 v3 v4
1 3 1 1 1 1 3 4 4
2 3 1 1 1 NA NA 3 8
3 3 1 1 1 NA NA 4 2
4 3 2 1 2 1 5 5 5
5 3 2 1 2 NA NA 4 4
6 3 2 1 2 NA NA 2 3
7 3 2 1 2 NA NA 1 1
8 3 3 1 3 2 1 1 1
How can I fill the NA values with the information in the first line of each case by ID?
Perhaps this helps
library(dplyr)
df1 %>%
group_by(across(starts_with('id'))) %>%
mutate(across(everything(), ~ replace(., is.na(.), first(.)))) %>%
ungroup
Or use fill
library(tidyr)
df1 %>%
group_by(across(starts_with('id'))) %>%
fill(everything())
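For comparison, the same "fill with the first value of each ID group" idea can be sketched in base R with ave(). This is only an illustration, assuming the data frame is called df1 as in the answers above; key, fill_first and value_cols are hypothetical names introduced here.

```r
id1 <- rep(3, 8)
id2 <- c(rep(1, 3), rep(2, 4), 3)
id3 <- rep(1, 8)
id4 <- c(rep(1, 3), rep(2, 4), 3)
v1 <- c(1, NA, NA, 1, NA, NA, NA, 2)
v2 <- c(3, NA, NA, 5, NA, NA, NA, 1)
v3 <- c(4, 3, 4, 5, 4, 2, 1, 1)
v4 <- c(4, 8, 2, 5, 4, 3, 1, 1)
df1 <- data.frame(id1, id2, id3, id4, v1, v2, v3, v4)

# One grouping key per distinct combination of the id columns
key <- interaction(df1[c("id1", "id2", "id3", "id4")], drop = TRUE)

# Hypothetical helper: replace NAs with the first value of the group
fill_first <- function(v) replace(v, is.na(v), v[1])

# Apply the helper group-wise to every value column
value_cols <- c("v1", "v2", "v3", "v4")
df1[value_cols] <- lapply(df1[value_cols], function(v) ave(v, key, FUN = fill_first))
df1
```

Note that this copies the first value of each group, like the replace() answer, whereas fill() carries the last non-NA value downward. On this data the two give the same result because only the first row of each group is populated.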

Add a new row for each id in dataframe for ALL variables

I want to add a new row after each id. I found a solution on a Stack Overflow page (Inserting a new row to data frame for each group id),
but there is one thing I want to change and I don't know how. I want to make a new row for all variables without writing down every variable (as the Stack Overflow example does). The numbers in the new row don't matter; I will change them later. If it is possible to add "base" in the new row for trt, that would be good. I want the code to work for many ids and variables, since the data I'm working with has a lot of both. Many thanks if someone can help me with this!
The example code:
set.seed(1)
id <- rep(1:3, each=4)
trt <- rep(c("A","OA", "B", "OB"),3)
pointA <- sample(1:10,12, replace=TRUE)
pointB <- sample(1:10,12, replace=TRUE)
pointC <- sample(1:10,12, replace=TRUE)
df <- data.frame(id,trt,pointA, pointB,pointC)
df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
I want it to look like:
df <- rbind(df[1:4,], df1, df[5:8,], df2, df[9:12,],df3)
df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 1 base
51 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
13 2 base
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
14 3 base
I'm trying this code:
df %>%
group_by(id) %>%
summarise(week = "base") %>%
mutate_all() %>% # want to mutate all variables
bind_rows(df, .) %>%
arrange(id)
You could bind_rows directly, it will add NAs to all other columns by default.
library(dplyr)
df %>% group_by(id) %>% summarise(trt = 'base') %>% bind_rows(df) %>% arrange(id)
# id trt pointA pointB pointC
# <int> <chr> <int> <int> <int>
# 1 1 base NA NA NA
# 2 1 A 3 7 3
# 3 1 OA 4 4 4
# 4 1 B 6 8 1
# 5 1 OB 10 5 4
# 6 2 base NA NA NA
# 7 2 A 3 8 9
# 8 2 OA 9 10 4
# 9 2 B 10 4 5
#10 2 OB 7 8 6
#11 3 base NA NA NA
#12 3 A 7 10 5
#13 3 OA 1 3 2
#14 3 B 3 7 9
#15 3 OB 2 2 7
If you want empty strings instead of NA, we can give a range of columns in mutate_at and replace NA values with empty string.
df %>%
group_by(id) %>%
summarise(trt = 'base') %>%
bind_rows(df) %>%
mutate_at(vars(pointA:pointC), ~replace(., is.na(.) , '')) %>%
arrange(id)
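Another option is to split the data by id and append the new row to each piece, either with group_split and map_dfr or with group_modify:
library(dplyr)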
library(purrr)
df %>% mutate_if(is.factor, as.character) %>%
group_split(id) %>%
map_dfr(~bind_rows(.x, data.frame(id=.x$id[1], trt="base", stringsAsFactors = FALSE)))
#Note that group_modify is Experimental
df %>% mutate_if(is.factor, as.character) %>%
group_by(id) %>%
group_modify(~bind_rows(.x, data.frame(trt="base", stringsAsFactors = FALSE)))
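If the "base" row has to come after the existing rows of each id, as in the desired output, and should cover all columns without naming them, one possible base R sketch is below. It uses a smaller made-up data frame so that the result does not depend on the random seed. The names base_rows, other_cols and out are illustrative, and filling the remaining columns with NA is an assumption.

```r
# Small illustrative data frame (not the one from the question)
df <- data.frame(id = rep(1:3, each = 2),
                 trt = rep(c("A", "B"), 3),
                 pointA = 1:6,
                 pointB = 6:1,
                 stringsAsFactors = FALSE)

# Take one template row per id, then blank out everything except id
base_rows <- df[!duplicated(df$id), ]
other_cols <- setdiff(names(df), "id")
base_rows[other_cols] <- NA
base_rows$trt <- "base"

# Append the new rows; order() is stable, so within each id the
# appended "base" row ends up after the original rows
out <- rbind(df, base_rows)
out <- out[order(out$id), ]
rownames(out) <- NULL
out
```

Because the column list is derived from names(df), the same code should work unchanged when the data has many more point columns.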

Growth rates, using the last non-NA value by groups

I have a dataframe that looks like this:
value id
1 2 A
2 5 A
3 NA A
4 7 A
5 9 A
6 1 B
7 NA B
8 NA B
9 5 B
10 6 B
And I would like to calculate growth rates of the value using the id variable to group. Usually, I would do something like this:
df <- df %>% group_by(id) %>% mutate(growth = log(value) - as.numeric(lag(value)))
To get this dataframe:
value id growth
(dbl) (chr) (dbl)
1 2 A NA
2 5 A -0.3905621
3 NA A NA
4 7 A NA
5 9 A -4.8027754
6 1 B NA
7 NA B NA
8 NA B NA
9 5 B NA
10 6 B -3.2082405
Now what I want to do is use the last non-NA value for the growth rates as well, in effect calculating the growth rates across the "NA gaps". For example, row 4 should contain the growth rate from 5 to 7, and row 9 should contain the growth rate from 1 to 5.
Thanks!
zoo::na.locf will replace NAs with the last non-NA value, so this may work for you:
df <- df %>%
group_by(id) %>%
mutate(
valuenoNA = zoo::na.locf(value, na.rm = FALSE), # na.rm = FALSE keeps leading NAs so the length is unchanged
growth = log(valuenoNA) - as.numeric(lag(valuenoNA)))
value id growth valuenoNA
1 2 A NA 2
2 5 A -0.3905621 5
3 NA A -3.3905621 5
4 7 A -3.0540899 7
5 9 A -4.8027754 9
6 1 B NA 1
7 NA B -1.0000000 1
8 NA B -1.0000000 1
9 5 B 0.6094379 5
10 6 B -3.2082405 6
We can use fill from tidyverse
library(tidyverse)
df %>%
group_by(id) %>%
fill(value) %>%
mutate(growth = log(value) - lag(value))
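Note that the formula used throughout this thread subtracts the raw lagged value from log(value), which is what produces numbers such as -0.3905621. If a log-difference growth rate is what is actually intended (an assumption, since the question does not say so), the lagged value would need to be logged as well. Below is a base R sketch under that assumption. locf and growth_fun are hypothetical helpers, and leaving the growth as NA on rows where value itself is missing is a design choice made here.

```r
df <- data.frame(value = c(2, 5, NA, 7, 9, 1, NA, NA, 5, 6),
                 id = rep(c("A", "B"), each = 5),
                 stringsAsFactors = FALSE)

# Hypothetical helper: carry the last non-NA value forward (leading NAs stay NA)
locf <- function(v) {
  idx <- cumsum(!is.na(v))
  c(NA, v[!is.na(v)])[idx + 1]
}

# Hypothetical helper: log-difference growth across NA gaps
growth_fun <- function(v) {
  filled <- locf(v)
  g <- c(NA, diff(log(filled)))  # log(current) - log(previous non-NA)
  g[is.na(v)] <- NA              # no growth rate where the value is missing
  g
}

# Apply per id group
df$growth <- ave(df$value, df$id, FUN = growth_fun)
df
```

With this version, row 4 holds log(7) - log(5) and row 9 holds log(5) - log(1), which corresponds to the "growth from 5 to 7" and "growth from 1 to 5" described in the question.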
