Summarize data table individually for multiple columns - r

I am trying to summarize data across multiple columns automatically if at all possible rather than writing code for each column independently. I would like to summarize this:
Patch Size Achmil Aciarv Aegpod Agrcap
A 10 0 1 1 0
B 2 1 0 0 0
C 2 1 0 0 0
D 2 1 0 0 0
into this
Species Presence MaxSize MeanSize Count
Achmil 0 10 10 1
Achmil 1 2 2 3
Aciarv 0 2 2 3
Aciarv 1 10 10 1
I know that I can individually run group_by and summarize for each column
achmil<-group_by(LimitArea, Achmil) %>%
summarise(SumA=mean(Size))
but is there no way to automatically run this for each column for each presence and absence using some sort of loop? Any help is appreciated.

Perhaps we need to gather into long format and then summarise:
library(tidyverse)
gather(df1, Species, Presence, Achmil:Agrcap) %>%
group_by(Species, Presence) %>%
summarise( MaxSize = max(Size), MeanSize = mean(Size), Count = n())
# A tibble: 7 x 5
# Groups: Species [?]
# Species Presence MaxSize MeanSize Count
# <chr> <int> <dbl> <dbl> <int>
#1 Achmil 0 10.0 10.0 1
#2 Achmil 1 2.00 2.00 3
#3 Aciarv 0 2.00 2.00 3
#4 Aciarv 1 10.0 10.0 1
#5 Aegpod 0 2.00 2.00 3
#6 Aegpod 1 10.0 10.0 1
#7 Agrcap 0 10.0 4.00 4
In newer versions of tidyr (1.0.0 and later), we can use pivot_longer instead of gather:
df1 %>%
pivot_longer(cols = Achmil:Agrcap, names_to = "Species",
values_to = "Presence") %>%
group_by(Species, Presence) %>%
summarise(MaxSize = max(Size), MeanSize = mean(Size), Count = n())
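For anyone who wants to run the pivot_longer approach end to end, here is a minimal sketch. It assumes the question's table is stored in a data frame called df1 (rebuilt by hand below), and it adds .groups = "drop", which is not in the original answer, so that the result comes back ungrouped.

```r
library(dplyr)
library(tidyr)

# df1 rebuilt by hand from the table in the question (an assumption about its layout)
df1 <- data.frame(
  Patch  = c("A", "B", "C", "D"),
  Size   = c(10, 2, 2, 2),
  Achmil = c(0, 1, 1, 1),
  Aciarv = c(1, 0, 0, 0),
  Aegpod = c(1, 0, 0, 0),
  Agrcap = c(0, 0, 0, 0)
)

res <- df1 %>%
  # one row per Patch/Species pair, with the 0/1 flag in Presence
  pivot_longer(cols = Achmil:Agrcap, names_to = "Species", values_to = "Presence") %>%
  group_by(Species, Presence) %>%
  # .groups = "drop" is an addition here; it returns an ungrouped tibble
  summarise(MaxSize = max(Size), MeanSize = mean(Size), Count = n(), .groups = "drop")

res
```

Agrcap is absent from every patch, so it contributes a single row (Presence = 0), which is why the result has 7 rows rather than 8.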

Here is another solution using aggregate (and reshape2::melt()):
library(reshape2)
df = melt(df[,2:ncol(df)], "Size")
aggregate(. ~ `variable`+`value`, data = df,
FUN = function(x) c(max = max(x), mean = mean(x), count = length(x)))
variable value Size.max Size.mean Size.count
1 Achmil 0 10 10 1
2 Aciarv 0 2 2 3
3 Aegpod 0 2 2 3
4 Agrcap 0 10 4 4
5 Achmil 1 2 2 3
6 Aciarv 1 10 10 1
7 Aegpod 1 10 10 1

Related

Create new columns based on 2 columns

So I have this kind of table df
Id Type QTY unit
1 A 5 1
2 B 10 2
3 C 5 3
2 A 10 4
3 B 5 5
1 C 10 6
I want to create this data frame df2
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
1 5 1 0 0 10 6
2 10 4 10 2 0 0
3 0 0 5 5 5 3
This means that I want to create a new column for every "Type's" "QTY" and "unit" for each "Id". I was thinking to use a loop to first create a new column for each Type, to get something like this :
Id Type QTY unit A_QTY A_unit B_QTY B_unit C_QTY C_unit
1 A 5 1 5 1 0 0 0 0
2 B 10 2 0 0 10 2 0 0
3 C 5 3 0 0 0 0 5 3
2 A 10 4 10 4 0 0 0 0
3 B 5 5 0 0 5 5 0 0
1 C 10 6 0 0 0 0 10 6
and then use group_by() to aggregate them, resulting in df2. But I get stuck when it comes to creating the new columns. I have tried a for loop, but my R skills are not that great yet, and I can't manage to create new columns from the existing ones.
I'll appreciate any suggestions you have for me!
You can use pivot_wider from the tidyr package:
library(dplyr)
library(tidyr)
df %>%
pivot_wider(names_from = "Type", # Columns to get the names from
values_from = c("QTY", "unit"), # Columns to get the values from
names_glue = "{Type}_{.value}", # Column naming
values_fill = 0, # Fill NAs with 0
names_vary = "slowest") # To get the right column ordering
output
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<int> <int> <int> <int> <int> <int> <int>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
library(tidyverse)
df %>%
pivot_longer(-c(Id, Type)) %>%
mutate(name = str_c(Type, name, sep = "_")) %>%
select(-Type) %>%
pivot_wider(names_from = "name", values_from = "value", values_fill = 0)
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
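A self-contained sketch of the pivot_wider approach, assuming df holds the long table from the question (rebuilt by hand below). To avoid depending on the names_vary argument, this version reorders the columns afterwards with base R's sort(); that reordering step is an addition for illustration and is not part of either answer.

```r
library(tidyr)

# df rebuilt by hand from the long table in the question
df <- data.frame(
  Id   = c(1, 2, 3, 2, 3, 1),
  Type = c("A", "B", "C", "A", "B", "C"),
  QTY  = c(5, 10, 5, 10, 5, 10),
  unit = c(1, 2, 3, 4, 5, 6)
)

df2 <- pivot_wider(
  df,
  names_from  = Type,               # one set of columns per Type
  values_from = c(QTY, unit),       # spread both value columns
  names_glue  = "{Type}_{.value}",  # e.g. A_QTY, A_unit
  values_fill = 0                   # Id/Type combinations that never occur become 0
)

# Put Id first and sort the remaining columns alphabetically (A_QTY, A_unit, B_QTY, ...)
df2 <- df2[, c("Id", sort(setdiff(names(df2), "Id")))]
df2
```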

How to count groupings of elements in base R or dplyr using multiple conditions?

I am trying to count the number of elements by groupings, subject to the condition that each grouping code ("Group") is > 0. Suppose we start with the below output DF generated via the code immediately beneath:
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 1
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 1
7 X 0 1
8 X 0 1
9 B 0 1
10 R 0 1
11 R 2 2
12 R 2 2
13 X 3 3
14 X 3 3
15 X 3 3
library(dplyr)
myDF <- data.frame(
Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
Group = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)
myDF %>% group_by(Element) %>% mutate(reSeq = match(Group, unique(Group)))
Instead, I would like the reSeq column to calculate and output as shown below with explanations to the right:
Element Group reSeq reSeq explanation
<chr> <dbl> <int>
1 R 0 1 1st instance of R (ungrouped)(Group = 0 means not grouped)
2 R 0 2 2nd instance of R (ungrouped)(Group = 0 means not grouped)
3 X 0 1 1st instance of X (ungrouped)(Group = 0 means not grouped)
4 X 1 2 2nd instance of X (grouped by Group = 1)
5 X 1 2 2nd instance of X (grouped by Group = 1)
6 X 0 3 3rd instance of X (ungrouped)
7 X 0 4 4th instance of X (ungrouped)
8 X 0 5 5th instance of X (ungrouped)
9 B 0 1 1st instance of B (ungrouped)
10 R 0 3 3rd instance of R (ungrouped)
11 R 2 4 4th instance of R (grouped by Group = 2)
12 R 2 4 4th instance of R (grouped by Group = 2)
13 X 3 6 6th instance of X (grouped by Group = 3)
14 X 3 6 6th instance of X (grouped by Group = 3)
15 X 3 6 6th instance of X (grouped by Group = 3)
Any recommendations for doing this? If possible, starting with the dplyr code I use above because I am fairly familiar with it.
If we use rowid from data.table, we can skip a couple of steps:
library(dplyr)
library(data.table)
library(tidyr)
myDF %>%
mutate(reSeq = rowid(Element) * NA^!(Group == 0 |!duplicated(Group))) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
Output:
# A tibble: 15 × 3
Element Group reSeq
<chr> <dbl> <int>
1 R 0 1
2 R 0 2
3 X 0 1
4 X 1 2
5 X 1 2
6 X 0 3
7 X 0 4
8 X 0 5
9 B 0 1
10 R 0 3
11 R 2 4
12 R 2 4
13 X 3 6
14 X 3 6
15 X 3 6
Below is what I managed to cobble together. Maybe there's a cleaner solution? Here's the code:
library(dplyr)
library(tidyr)
myDF %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup()%>%
mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0)) %>%
mutate(reSeq = na_if(reSeq, 0)) %>%
group_by(Element) %>%
fill(reSeq) %>%
mutate(reSeq = match(reSeq, unique(reSeq))) %>%
ungroup
And here's the output:
# A tibble: 15 x 4
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 3
7 X 0 5 4
8 X 0 6 5
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 6
14 X 3 8 6
15 X 3 9 6
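The counting rule described in the question can also be sketched as a grouped cumulative sum. This is an alternative to the two answers above, not a restatement of them, and it rests on one assumption: within an Element, rows that share a nonzero Group are adjacent. A row starts a new instance when it is ungrouped (Group == 0) or when its Group differs from the previous row of the same Element.

```r
library(dplyr)

myDF <- data.frame(
  Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
  Group   = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)

res <- myDF %>%
  group_by(Element) %>%
  # TRUE marks the first row of each instance; cumsum() turns the marks into a counter.
  # default = 0 makes the first row of every Element count as a new instance.
  mutate(reSeq = cumsum(Group == 0 | Group != lag(Group, default = 0))) %>%
  ungroup()

res
```

If rows of the same nonzero Group could be separated by other rows of the same Element, this sketch would count them as separate instances, so the fill-based answers above would be the safer choice in that case.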

How to count data frame elements grouped by multiple conditions in dplyr?

I am trying to use dplyr to count elements grouped by multiple conditions (columns) in a data frame. In the below example (dataframe output is at the top (except that I manually inserted the 2 right-most columns to explain what I am trying to do), and R code is underneath), I am trying to count the joint groupings of the Element and Group columns. My multiple condition grouping attempt is eleGrpCnt. Any recommendations for the correct way to do this in dplyr? I thought that group_by a combined (Element, Group) would work.
Element Group origOrder eleCnt eleGrpCnt desired_eleGrpCnt explanation
<chr> <dbl> <int> <int> <int> <comment> <comment>
1 B 0 1 1 1 1 1st grouping of B where Group = 0
2 R 0 2 1 1 1 1st grouping of R where Group = 0
3 R 1 3 2 1 2 2nd grouping of R where Group = 1
4 R 1 4 3 2 2 2nd grouping of R where Group = 1
5 B 0 5 2 2 1 1st grouping of B where Group = 0
6 X 2 6 1 1 1 1st grouping of X where Group = 2
7 X 2 7 2 2 1 1st grouping of X where Group = 2
8 X 0 8 3 1 2 2nd grouping of X where Group = 0
9 X 0 9 4 2 2 2nd grouping of X where Group = 0
10 X -1 10 5 1 3 3rd grouping of X where Group = -1
library(dplyr)
myData6 <-
data.frame(
Element = c("B","R","R","R","B","X","X","X","X","X"),
Group = c(0,0,1,1,0,2,2,0,0,-1)
)
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
group_by(Element, Group) %>%
mutate(eleGrpCnt = row_number())%>%
ungroup()
If you group by element then the numbers you are looking for are simply the matches of Group against the unique values of Group:
library(dplyr)
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
group_by(Element) %>%
mutate(eleGrpCnt = match(Group, unique(Group)))
#> # A tibble: 10 x 5
#> # Groups: Element [3]
#> Element Group origOrder eleCnt eleGrpCnt
#> <chr> <dbl> <int> <int> <dbl>
#> 1 B 0 1 1 1
#> 2 R 0 2 1 1
#> 3 R 1 3 2 2
#> 4 R 1 4 3 2
#> 5 B 0 5 2 1
#> 6 X 2 6 1 1
#> 7 X 2 7 2 1
#> 8 X 0 8 3 2
#> 9 X 0 9 4 2
#> 10 X -1 10 5 3
Created on 2022-09-11 with reprex v2.0.2
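To see why match() against unique() gives the grouping number, it may help to look at one Element in isolation. The vector below is simply the Group column for Element "X" from myData6; the variable names are made up for this illustration.

```r
# Group values for Element "X" in myData6, in original row order
grp <- c(2, 2, 0, 0, -1)

# unique() keeps the values in order of first appearance
first_seen <- unique(grp)          # 2 0 -1

# match() returns the position of each value within that list,
# i.e. "this is the n-th distinct Group seen so far"
idx <- match(grp, first_seen)      # 1 1 2 2 3
idx
```

Inside group_by(Element), the same expression is evaluated separately for each Element, which is what produces the eleGrpCnt column.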
Here's one approach; I'm sorting by Group value but if you want to change the order to match original appearance order we could add a step.
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
arrange(Element, Group) %>%
group_by(Element) %>%
mutate(eleGrpCnt = cumsum(Group != lag(Group, default = -999))) %>%
ungroup() %>%
arrange(origOrder)
# A tibble: 10 × 5
Element Group origOrder eleCnt eleGrpCnt
<chr> <dbl> <int> <int> <int>
1 B 0 1 1 1
2 R 0 2 1 1
3 R 1 3 2 2
4 R 1 4 3 2
5 B 0 5 2 1
6 X 2 6 1 3
7 X 2 7 2 3
8 X 0 8 3 2
9 X 0 9 4 2
10 X -1 10 5 1

Simple operation with lagged values

I need to calculate simple row-wise operations using lagged values, for example the sum of a variable over the previous x years.
I tried:
toy %>%
group_by(student) %>%
mutate(lag_passed = sum(lag(passed, n = 5, order_by = year, default = 0)))
toy %>%
group_by(student) %>%
arrange(year) %>%
mutate(lag_passed = lapply(passed, function(x) sum(lag(x, n = 5, default = 0))))
Reproducible example. Task: sum the number of passed tests in the previous five years.
toy <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1))
student year passed
1 A 1 0
2 A 2 0
3 A 3 0
4 A 4 1
5 A 5 2
6 A 6 0
7 A 7 0
8 A 8 1
9 A 9 0
10 A 10 1
expected <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1),
lag_passed=c(0,0,0,0,1,3,3,3,4,3))
student year passed lag_passed
1 A 1 0 0
2 A 2 0 0
3 A 3 0 0
4 A 4 1 0
5 A 5 2 1
6 A 6 0 3
7 A 7 0 3
8 A 8 1 3
9 A 9 0 4
10 A 10 1 3
runner::sum_run() will help here. Passing idx = year is optional unless some years are missing, in which case the window takes those missing years into account as well; that is not the case in the sample data. Grouping on student is added because in practice you will probably want to carry out the operation for each student.
toy <- data.frame(student = rep("A",10),
year=c(1:10),
passed=c(0,0,0,1,2,0,0,1,0,1))
library(dplyr)
library(runner)
toy %>% group_by(student) %>%
mutate(lag_passed = sum_run(x = passed,
idx = year,
k = 5,
lag = 1))
#> # A tibble: 10 x 4
#> # Groups: student [1]
#> student year passed lag_passed
#> <chr> <int> <dbl> <dbl>
#> 1 A 1 0 NA
#> 2 A 2 0 0
#> 3 A 3 0 0
#> 4 A 4 1 0
#> 5 A 5 2 1
#> 6 A 6 0 3
#> 7 A 7 0 3
#> 8 A 8 1 3
#> 9 A 9 0 4
#> 10 A 10 1 3
Created on 2021-05-15 by the reprex package (v2.0.0)
Another rolling sum solution with zoo::rollapply:
f <- function(x) {zoo::rollapply(x, 6, sum, align = 'right', partial = TRUE) - x}
expected %>%
group_by(student) %>%
arrange(year) %>%
mutate(lag_passed2 = f(passed)) %>%
ungroup()
# student year passed lag_passed lag_passed2
# <chr> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 0
# 2 A 2 0 0 0
# 3 A 3 0 0 0
# 4 A 4 1 0 0
# 5 A 5 2 1 1
# 6 A 6 0 3 3
# 7 A 7 0 3 3
# 8 A 8 1 3 3
# 9 A 9 0 4 4
# 10 A 10 1 3 3
lag_passed2, created with the helper function, is the same as lag_passed. The idea is to calculate a sliding-window sum with a window length of 6 (allowing a partial window at the beginning via partial = TRUE and align = 'right'), then subtract the passed value of the current year.
Note: the helper function f can be replaced with a simpler one by specifying the window using offsets and the default right alignment, as pointed out by @G. Grothendieck:
f <- function(x) zoo::rollapplyr(x, list(-seq(5)), sum, partial = TRUE, fill = 0)
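To make explicit what these rolling helpers compute, here is a package-free sketch in base R. The function name lag_window_sum is made up for this illustration, and the sketch assumes one row per consecutive year, already sorted by year (as in the toy data); with gaps in the years, an index-aware approach such as the idx argument of sum_run would be needed instead.

```r
# Hypothetical helper: for each position i, sum the k values before it
# (positions i-k .. i-1), using a shorter window near the start.
lag_window_sum <- function(x, k) {
  sapply(seq_along(x), function(i) {
    prev <- seq_len(i - 1)        # all earlier positions
    prev <- prev[prev >= i - k]   # keep only the last k of them
    sum(x[prev])                  # an empty window sums to 0
  })
}

passed <- c(0, 0, 0, 1, 2, 0, 0, 1, 0, 1)
lag_passed <- lag_window_sum(passed, k = 5)
lag_passed
# 0 0 0 0 1 3 3 3 4 3
```

Unlike the sum_run output above, the first element is 0 rather than NA, because an empty window is summed as 0 here.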

Aggregate rows with specific shared value

I want to aggregate my data as follows:
Aggregate only for successive rows where status = 0
Keep age and sum up points
Example data:
da <- data.frame(userid = c(1,1,1,1,2,2,2,2), status = c(0,0,0,1,1,1,0,0), age = c(10,10,10,11,15,16,16,16), points = c(2,2,2,6,3,5,5,5))
da
userid status age points
1 1 0 10 2
2 1 0 10 2
3 1 0 10 2
4 1 1 11 6
5 2 1 15 3
6 2 1 16 5
7 2 0 16 5
8 2 0 16 5
I would like to have:
da2
userid status age points
1 1 0 10 6
2 1 1 11 6
3 2 1 15 3
4 2 1 16 5
5 2 0 16 10
da %>%
mutate(grp = with(rle(status),
rep(seq_along(values), lengths)) + cumsum(status != 0)) %>%
group_by_at(vars(-points)) %>%
summarise(points = sum(points)) %>%
ungroup() %>%
select(-grp)
## A tibble: 5 x 4
# userid status age points
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 10 6
#2 1 1 11 6
#3 2 0 16 10
#4 2 1 15 3
#5 2 1 16 5
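The grp column is the core of this answer, so here is a small base R sketch of how it is built, using the status column from da. The intermediate name run_id is introduced only for this illustration.

```r
status <- c(0, 0, 0, 1, 1, 1, 0, 0)

# Number each run of identical consecutive values
run_id <- with(rle(status), rep(seq_along(values), lengths))
run_id
# 1 1 1 2 2 2 3 3

# Adding a counter that increases on every nonzero row gives each
# status != 0 row its own id, while rows in a run of zeros keep sharing one
grp <- run_id + cumsum(status != 0)
grp
# 1 1 1 3 4 5 6 6
```

Grouping by grp (together with the other columns) therefore collapses only successive status = 0 rows and leaves every status = 1 row as its own group.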
You can use group_by from dplyr:
da %>% group_by(da$userid, cumsum(da$status), da$status) %>%
summarise(age=max(age), points=sum(points))
Output:
`da$userid` `cumsum(da$status)` `da$status` age points
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 6
2 1 1 1 11 6
3 2 2 1 15 3
4 2 3 0 16 10
5 2 3 1 16 5
Exactly the same idea as above:
library(dplyr)
data1 <- da %>% group_by(userid, age, status) %>%
filter(status == 0) %>%
summarise(points = sum(points))
data2 <- da %>%
group_by(userid, age, status) %>%
filter(status != 0) %>%
summarise(points = sum(points))
data <- rbind(data1,
data2)
We need to be more careful with your specification of status equal to 0. I think Quang Hoang's code works only for your specific example.
I hope it will help.

Resources