Calculating means for multiple groups in R

Hello fellow overflowers,
currently I'm trying to calculate means for multiple groups.
My df looks like this (~600 rows):
col1 col2 col3 col4 col5
<type> <gender> <var1> <var2> <var3>
1 A 1 3 2 3
2 A 2 NA 5 NA
3 A 1 3 3 5
4 B 1 4 NA 1
5 B 2 3 4 5
Now the result should look like this:
col1 col2 col3 col4 col5
<type> <gender> <mean-var1> <mean-var2> <mean-var3>
1 A 1 3.6 4.1 4.6
2 A 2 4.1 3.8 4.2
3 B 1 3.9 4.2 3.7
4 B 2 4.3 3.2 2.7
5 C 1 3.5 4.5 3.6
6 C 2 4 3.7 4.2
...
So far, I've tried to use the group_by function:
avg_values<-data%>%
group_by(type, gender) %>%
summarize_all (mean())
So far, it didn't work out. Could you help me figure out a good way to handle this?

Does this work?
library(dplyr)
df %>% group_by(type, gender) %>% summarise(across(var1:var3, ~ mean(., na.rm = TRUE)))
`summarise()` regrouping output by 'type' (override with `.groups` argument)
# A tibble: 4 x 5
# Groups: type [2]
type gender var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 2.5 4
2 A 2 NaN 5 NaN
3 B 1 4 NaN 1
4 B 2 3 4 5
Data used:
df
# A tibble: 5 x 5
type gender var1 var2 var3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 2 3
2 A 2 NA 5 NA
3 A 1 3 3 5
4 B 1 4 NA 1
5 B 2 3 4 5
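For comparison, the same grouped means can be sketched in base R with aggregate(). This is only an illustration, not part of the answer above; the data frame is retyped from the question, and the name group_means is made up for the example. Note that the asker's attempt fails because mean() is called with no arguments, whereas a summarising function has to be passed as a function.

```r
# Example data retyped from the question (NA = missing value)
df <- data.frame(
  type   = c("A", "A", "A", "B", "B"),
  gender = c(1, 2, 1, 1, 2),
  var1   = c(3, NA, 3, 4, 3),
  var2   = c(2, 5, 3, NA, 4),
  var3   = c(3, NA, 5, 1, 5)
)

# na.action = na.pass keeps rows that contain NA so every group survives;
# na.rm = TRUE is forwarded to mean() and drops the NAs inside each group
group_means <- aggregate(
  cbind(var1, var2, var3) ~ type + gender,
  data = df, FUN = mean, na.rm = TRUE, na.action = na.pass
)
group_means
```

Groups whose values are all NA (for example var1 for type A, gender 2) come out as NaN, just as in the dplyr output.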

Related

Adding unique ID column associated to two groups R [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 7 months ago.
I have a data frame in this format:
Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8
I want to create a unique ID column which considers both group and each unique observation within it, so that it is formatted like so:
Group Observation Unique_ID
a 1 1.1
a 2 1.2
a 3 1.3
b 4 2.1
b 5 2.2
c 6 3.1
c 7 3.2
c 8 3.3
Does anyone know of any syntax or functions to accomplish this? The formatting does not need to exactly match '1.1' as long as it signifies group and each unique observation within it. Thanks in advance
Another way using cur_group_id and row_number
library(dplyr)
A <- 'Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8'
df <- read.table(textConnection(A), header = TRUE)
df |>
group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(), ".", row_number())) |>
ungroup()
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3
library(tidyverse)
df <- read_table("Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8")
df %>%
mutate(unique = Group %>%
as.factor() %>%
as.integer() %>%
paste(., Observation, sep = "."))
#> # A tibble: 8 x 3
#> Group Observation unique
#> <chr> <dbl> <chr>
#> 1 a 1 1.1
#> 2 a 2 1.2
#> 3 a 3 1.3
#> 4 b 4 2.4
#> 5 b 5 2.5
#> 6 c 6 3.6
#> 7 c 7 3.7
#> 8 c 8 3.8
Created on 2022-07-12 by the reprex package (v2.0.1)
Try this
df |> group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(),".",1:n()))
output
# A tibble: 8 × 3
# Groups: Group [3]
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3
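If dplyr is not available, the same idea (a group number plus a counter within each group) can be sketched in base R. The names group_id and within_id are made up for this illustration; match() numbers the groups in order of first appearance, which coincides with cur_group_id() here only because the groups already appear in sorted order.

```r
df <- data.frame(
  Group       = c("a", "a", "a", "b", "b", "c", "c", "c"),
  Observation = 1:8
)

# Number each group by the order in which it first appears
group_id <- match(df$Group, unique(df$Group))

# Count rows within each group: 1, 2, 3, ... restarting per group
within_id <- ave(seq_along(df$Group), df$Group, FUN = seq_along)

df$Unique_ID <- paste0(group_id, ".", within_id)
df
```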

How to group interconnected elements in R dplyr

I have a data frame that looks like this.
Elements from the col1 are connected indirectly with elements in col2.
for example 1 is connected with 2 and 3.
and 2 is connected with 3. Therefore 1 should be connected with 3 as well.
library(tidyverse)
df1 <- tibble(col1=c(1,1,2,5,5,6),
col2=c(2,3,3,6,7,7))
df1
#> # A tibble: 6 × 2
#> col1 col2
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 3
#> 3 2 3
#> 4 5 6
#> 5 5 7
#> 6 6 7
Created on 2022-03-15 by the reprex package (v2.0.1)
I want my data to look like this
#> col1 col2 col3
#> <dbl> <dbl>
#> 1 1 2 group1
#> 2 1 3 group1
#> 3 2 3 group1
#> 4 5 6 group2
#> 5 5 7 group2
#> 6 6 7 group2
I would appreciate any possible help to solve this riddle.
Thank you for your time
We may use igraph
library(igraph)
library(dplyr)
library(stringr)
g <- graph.data.frame(df1, directed = TRUE)
df1 %>%
mutate(col3 = str_c("group", clusters(g)$membership[as.character(col1)]))
-output
# A tibble: 6 × 3
col1 col2 col3
<dbl> <dbl> <chr>
1 1 2 group1
2 1 3 group1
3 2 3 group1
4 5 6 group2
5 5 7 group2
6 6 7 group2
Another igraph option
df1 %>%
mutate(
col3 =
paste0("group", {
graph_from_data_frame(.) %>%
components() %>%
membership()
}[as.character(col1)])
)
gives
# A tibble: 6 x 3
col1 col2 col3
<dbl> <dbl> <chr>
1 1 2 group1
2 1 3 group1
3 2 3 group1
4 5 6 group2
5 5 7 group2
6 6 7 group2
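To make clear what the igraph calls are computing, here is a rough base R sketch of connected components using a simple union-find. It is only meant to illustrate the idea on the example data, not to replace igraph; the names edges, parent and find_root are hypothetical.

```r
edges <- data.frame(col1 = c(1, 1, 2, 5, 5, 6),
                    col2 = c(2, 3, 3, 6, 7, 7))

# Every node starts out as the root of its own component
nodes  <- sort(unique(c(edges$col1, edges$col2)))
parent <- setNames(nodes, nodes)

# Follow parent links until reaching a node that is its own parent
find_root <- function(x) {
  while (parent[[as.character(x)]] != x) x <- parent[[as.character(x)]]
  x
}

# Merge the components of the two endpoints of every edge
for (i in seq_len(nrow(edges))) {
  r1 <- find_root(edges$col1[i])
  r2 <- find_root(edges$col2[i])
  if (r1 != r2) parent[[as.character(r2)]] <- r1
}

# Label each row by the component of its col1 element
roots <- vapply(edges$col1, find_root, numeric(1))
edges$col3 <- paste0("group", match(roots, unique(roots)))
edges
```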

Use dynamically generated column names in dplyr

I have a data frame with multiple columns. The user provides a vector of column names, and for each row I want to count the maximum number of times any single value appears across those columns.
set.seed(42)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var1", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(unlist(c(var1,var3)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
This does exactly what I want, but when I try to use a vector of variable names I can't get it to work:
df %>%
rowwise() %>%
mutate(consensus=max(unlist(table(select_vars)) )))
You can wrap it in c(!!!syms()) to get it working, and apparently you don't need the unlist. But honestly, I'm not sure what you are trying to do or why table is needed here. Do you just want to check whether var2 and var3 hold the same value, returning 2 if they do and 1 if they don't?
library(dplyr)
df <- tibble(
var1 = sample(c(1:3),10,replace=T),
var2 = sample(c(1:3),10,replace=T),
var3 = sample(c(1:3),10,replace=T)
)
select_vars <- c("var2", "var3")
df %>%
rowwise() %>%
mutate(consensus=max(table(c(!!!syms(select_vars)))))
#> # A tibble: 10 x 4
#> # Rowwise:
#> var1 var2 var3 consensus
#> <int> <int> <int> <int>
#> 1 2 3 2 1
#> 2 3 1 3 1
#> 3 3 1 1 2
#> 4 3 3 3 2
#> 5 1 1 2 1
#> 6 2 1 3 1
#> 7 3 2 3 1
#> 8 1 2 3 1
#> 9 2 1 2 1
#> 10 2 1 1 2
Created on 2021-07-22 by the reprex package (v0.3.0)
In the OP's code, we need select
library(dplyr)
df %>%
rowwise() %>%
mutate(consensus = max(table(unlist(select(cur_data(), all_of(select_vars))))))
-output
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or just subset cur_data(), which returns the data for the current group (here, the current row):
df %>%
rowwise %>%
mutate(consensus = max(table(unlist(cur_data()[select_vars]))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Or using pmap
library(purrr)
df %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
# A tibble: 10 x 4
var1 var2 var3 consensus
<int> <int> <int> <dbl>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
As these are rowwise operations, we can gain some efficiency by using functions from the collapse package
library(collapse)
tfm(df, consensus = dapply(slt(df, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
# A tibble: 10 x 4
var1 var2 var3 consensus
* <int> <int> <int> <int>
1 1 1 1 2
2 1 1 3 1
3 1 2 1 2
4 1 2 1 2
5 2 2 2 2
6 2 3 3 1
7 2 3 2 2
8 1 1 1 2
9 3 1 2 1
10 3 3 2 1
Benchmarks
As noted above, collapse is faster (run on a slightly bigger dataset)
df1 <- df[rep(seq_len(nrow(df)), 1e5), ]
system.time({
tfm(df1, consensus = dapply(slt(df1, select_vars), MARGIN = 1,
FUN = function(x) fmax(tabulate(x))))
})
#user system elapsed
# 5.257 0.123 5.323
system.time({
df1 %>%
mutate(consensus = pmap_dbl(cur_data()[select_vars], ~ max(table(c(...)))))
})
#user system elapsed
# 54.813 0.517 55.246
The rowwise operation was taking too much time, so I stopped the execution
system.time({
df1 %>%
rowwise() %>%
mutate(consensus=max(table(unlist(select(cur_data(), select_vars))) ))
})
Timing stopped at: 575.5 3.342 581.3
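What you need is all_of(). It is a selection helper rather than a verb, so on its own inside mutate() it only hands back the character vector of names and every row gets a consensus of 1. Wrap it in a selecting function such as c_across() (here with select_vars <- c("var1", "var3") as in the question):
df %>%
rowwise() %>%
mutate(consensus = max(table(c_across(all_of(select_vars)))))
# A tibble: 10 x 4
# Rowwise:
var1 var2 var3 consensus
<int> <int> <int> <int>
1 2 3 3 1
2 2 2 2 2
3 1 2 2 1
4 2 3 3 1
5 1 2 1 2
6 2 1 2 2
7 2 2 2 2
8 3 1 2 1
9 2 1 3 1
10 3 2 1 1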
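Since the column names arrive as a plain character vector, a base R sketch may be the most direct way to see what is being computed: index the data frame by name and tabulate each row. The four rows below are copied from rows 1, 2, 3 and 6 of the question's output so the result is easy to check by hand; nothing here is taken from the answers above.

```r
# Four rows copied from the question's output
df <- data.frame(
  var1 = c(1L, 1L, 1L, 2L),
  var2 = c(1L, 1L, 2L, 3L),
  var3 = c(1L, 3L, 1L, 3L)
)
select_vars <- c("var1", "var3")

# df[select_vars] picks the columns by name; apply(..., 1, ...) walks the rows
# and table() counts how often each value occurs within that row
df$consensus <- apply(df[select_vars], 1, function(row) max(table(row)))
df
```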

Erase groups based on a condition with dplyr [duplicate]

This question already has answers here:
Filter group of rows based on sum of values from different column
(2 answers)
Closed 2 years ago.
I have a data.frame that looks like this
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.0
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
I would like to find an elegant way to erase a group when its value is below 0.2 at two different time points. Those points do not have to be consecutive.
In this case, I would like to filter out group A because its value at both time point 1 and time point 2 is below 0.2.
group time value
1 B 1 1.0
2 C 1 1.0
3 B 2 10.0
4 C 2 20.0
5 B 3 20.0
6 C 3 30.0
This solution keeps only the groups that have fewer than two observations with a value below 0.2, as you requested.
library(dplyr)
data %>%
group_by(group) %>%
filter(sum(value < 0.2) < 2) %>%
ungroup()
#> # A tibble: 6 x 3
#> group time value
#> <chr> <dbl> <dbl>
#> 1 B 1 1
#> 2 C 1 1
#> 3 B 2 10
#> 4 C 2 20
#> 5 B 3 20
#> 6 C 3 30
But if you are really a fan of base R:
data[ave(data$value<0.2, data$group, FUN = function(x) sum(x)<2), ]
#> group time value
#> 2 B 1 1
#> 3 C 1 1
#> 5 B 2 10
#> 6 C 2 20
#> 8 B 3 20
#> 9 C 3 30
Try this dplyr approach, which flags groups with at least two values below 0.2 and drops them:
library(tidyverse)
#Code
data <- data %>% group_by(group) %>% mutate(Flag = sum(value < 0.2) >= 2) %>%
filter(!Flag) %>% select(-Flag)
Output:
# A tibble: 6 x 3
# Groups: group [2]
group time value
<fct> <dbl> <dbl>
1 B 1 1
2 C 1 1
3 B 2 10
4 C 2 20
5 B 3 20
6 C 3 30
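The counting step that the answers rely on can be spelled out in base R: summing a logical vector counts its TRUE values, so tapply() gives the number of low observations per group. This is a sketch on the question's data; low_counts, keep_groups and result are names made up for the illustration.

```r
data <- data.frame(
  group = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
  time  = c(rep(1, 3), rep(2, 3), rep(3, 3)),
  value = c(0, 1, 1, 0.1, 10, 20, 10, 20, 30)
)

# Number of observations below 0.2 in each group (TRUE counts as 1)
low_counts <- tapply(data$value < 0.2, data$group, sum)

# Keep the groups with fewer than two such observations
keep_groups <- names(low_counts)[low_counts < 2]
result <- data[data$group %in% keep_groups, ]
result
```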

fill gap in dataframe [duplicate]

This question already has answers here:
adding default values to item x group pairs that don't have a value (df %>% spread %>% gather seems strange)
(2 answers)
Closed 4 years ago.
Original Data
id hhcode value
1 1 4.1
1 2 4.5
1 3 3.3
10 5 3.2
Required Output
id hhcode value
1 1 4.1
1 2 4.5
1 3 3.3
1 5 0
10 1 0
10 2 0
10 3 0
10 5 3.2
What I've got so far
df <- data.frame(
id = c(1, 1, 1, 10),
hhcode = c(1, 2, 3, 5),
value = c(4.1, 4.5, 3.3, 3.2)
)
library(statar)
library(tidyverse)
df %>%
group_by(id) %>%
fill_gap(hhcode, full = TRUE)
# A tibble: 10 x 3
# Groups: id [2]
id hhcode value
<dbl> <dbl> <dbl>
1 1 1 4.1
2 1 2 4.5
3 1 3 3.3
4 1 4 NA
5 1 5 NA
6 10 1 NA
7 10 2 NA
8 10 3 NA
9 10 4 NA
10 10 5 3.2
Any hint to get the required output?
We could use complete
library(tidyverse)
complete(df, id, hhcode, fill = list(value = 0))
# A tibble: 8 x 3
# id hhcode value
# <dbl> <dbl> <dbl>
#1 1 1 4.1
#2 1 2 4.5
#3 1 3 3.3
#4 1 5 0
#5 10 1 0
#6 10 2 0
#7 10 3 0
#8 10 5 3.2
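What complete() does here can be sketched in base R as well: build every id and hhcode combination that occurs in the data, left-join the observed values onto it, and fill the gaps with 0. The names grid and filled are hypothetical.

```r
df <- data.frame(
  id     = c(1, 1, 1, 10),
  hhcode = c(1, 2, 3, 5),
  value  = c(4.1, 4.5, 3.3, 3.2)
)

# All combinations of the observed id and hhcode values
grid <- expand.grid(id = unique(df$id), hhcode = unique(df$hhcode))

# Left join the observed values; missing combinations get NA
filled <- merge(grid, df, by = c("id", "hhcode"), all.x = TRUE)

# Replace the gaps with 0 and sort for readability
filled$value[is.na(filled$value)] <- 0
filled <- filled[order(filled$id, filled$hhcode), ]
filled
```

Like complete(), this only uses hhcode values that actually occur, so hhcode 4 is not added.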
