How to calculate mean, min, and max across columns when grouping using dplyr? - r

So I have a data frame, simplified as this:
ID A B C
1 1 5 0
2 3 0 3
3 0 2 1
2 5 9 1
3 3 5 3
1 2 6 4
Simply put, I want to calculate the following for each row:
Mean
Median
Max
Min
Easy enough, but the hard part for me comes after that: how do I average each statistic so that there is one value per ID?
In other words, once I have these row-wise values, how do I show the average mean/median/max/min for each ID?
Expected output:
(1)
ID Mean Median Min Max
1 2 1 0 5
2 2 3 0 3
3 1 1 0 2
2 5 5 1 9
3 3.66 3 3 5
1 4 4 2 6
(2)
ID AvgMean AvgMedian AvgMin AvgMax
1 3 2.5 1 5.5
2 3.5 4 0.5 6
3 2.33 2 1.5 3.5

You can try something like this. Note that it pools all values of A:C within each ID, rather than averaging the per-row statistics:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(mean_ = mean(c_across(A:C), na.rm = TRUE),
            medi_ = median(c_across(A:C), na.rm = TRUE),
            max_ = max(c_across(A:C), na.rm = TRUE),
            min_ = min(c_across(A:C), na.rm = TRUE))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 5
ID mean_ medi_ max_ min_
<int> <dbl> <dbl> <int> <int>
1 1 3 3 6 0
2 2 3.5 3 9 0
3 3 2.33 2.5 5 0
For the per-row statistics (the first expected table), use rowwise():
df %>%
  rowwise() %>%
  summarise(mean_ = mean(c_across(A:C), na.rm = TRUE),
            medi_ = median(c_across(A:C), na.rm = TRUE),
            max_ = max(c_across(A:C), na.rm = TRUE),
            min_ = min(c_across(A:C), na.rm = TRUE))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 6 x 4
mean_ medi_ max_ min_
<dbl> <int> <int> <int>
1 2 1 5 0
2 2 3 3 0
3 1 1 2 0
4 5 5 9 1
5 3.67 3 5 3
6 4 4 6 2
With data:
df <- structure(list(ID = c(1L, 2L, 3L, 2L, 3L, 1L), A = c(1L, 3L,
0L, 5L, 3L, 2L), B = c(5L, 0L, 2L, 9L, 5L, 6L), C = c(0L, 3L,
1L, 1L, 3L, 4L)), class = "data.frame", row.names = c(NA, -6L
))
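To get the second expected table with dplyr, one option is to chain the two steps: compute the statistics row by row, then average them within each ID. The following is a minimal sketch, not part of the original answer. The names row_stats and id_stats are made up for illustration, and the data frame is rebuilt inline so the block runs on its own.

```r
library(dplyr)

# Same data as in the question, rebuilt so the sketch is self-contained
df <- data.frame(
  ID = c(1L, 2L, 3L, 2L, 3L, 1L),
  A  = c(1L, 3L, 0L, 5L, 3L, 2L),
  B  = c(5L, 0L, 2L, 9L, 5L, 6L),
  C  = c(0L, 3L, 1L, 1L, 3L, 4L)
)

# Step 1: one set of statistics per row
row_stats <- df %>%
  rowwise() %>%
  mutate(Mean   = mean(c_across(A:C)),
         Median = median(c_across(A:C)),
         Min    = min(c_across(A:C)),
         Max    = max(c_across(A:C))) %>%
  ungroup() %>%
  select(ID, Mean, Median, Min, Max)

# Step 2: average each per-row statistic within ID
id_stats <- row_stats %>%
  group_by(ID) %>%
  summarise(across(Mean:Max, mean, .names = "Avg{.col}"))

id_stats
```

For ID 1 this should give AvgMean 3, AvgMedian 2.5, AvgMin 1 and AvgMax 5.5, which is the average of the two ID 1 rows in the per-row table.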

Thanks for posting the expected output. I would consider using summarize() and across() together:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarize(across(A:C, mean))

In base R, the following seems to do what the question asks for.
out is a data.frame of the per-row statistics, and out2 holds their means by ID.
fun <- function(X){
f <- function(x, na.rm = FALSE){
c(
Mean = mean(x, na.rm = na.rm),
Median = median(x, na.rm = na.rm),
Min = min(x, na.rm = na.rm),
Max = max(x, na.rm = na.rm)
)
}
t(apply(X, 1, f))
}
out <- lapply(split(df[-1], df$ID), fun)
out2 <- lapply(out, colMeans)
# Repeat each ID once per row in its group; the row names of the
# combined matrix are the original row numbers, not the IDs.
ids <- rep(names(out), sapply(out, nrow))
out <- do.call(rbind, out)
out <- cbind.data.frame(ID = ids, out)
out2 <- cbind.data.frame(ID = names(out2), do.call(rbind, out2))
out
# ID Mean Median Min Max
#1 1 2.000000 1 0 5
#6 1 4.000000 4 2 6
#2 2 2.000000 3 0 3
#4 2 5.000000 5 1 9
#3 3 1.000000 1 0 2
#5 3 3.666667 3 3 5
out2
# ID Mean Median Min Max
#1 1 3.000000 2.5 1.0 5.5
#2 2 3.500000 4.0 0.5 6.0
#3 3 2.333333 2.0 1.5 3.5

Related

Create a column that returns the min/max of certain rows

I have data like these:
col1 col2 col3 col4 col5
1 3 1 7 3
4 2 8 2 5
3 1 5 1 4
I want to add columns that show the minimum and maximum by row, but only for certain columns (2 - 4, for example):
col1 col2 col3 col4 col5 min max
1 3 1 7 3 1 7
4 2 8 2 5 2 8
3 1 5 1 4 1 5
I know I could use select to subset those columns, calculate the min/max, and then use cbind to merge the result with the original data, but I feel like there is a better approach. Thanks!
Data
df <- structure(list(col1 = c(1L, 4L, 3L), col2 = 3:1, col3 = c(1L, 8L, 5L),
col4 = c(7L, 2L, 1L), col5 = c(3L, 5L, 4L)),
class = "data.frame", row.names = c(NA, -3L))
We could use pmin/pmax after selecting the columns
df$min <- do.call(pmin, c(df[2:4], na.rm = TRUE))
df$max <- do.call(pmax, c(df[2:4], na.rm = TRUE))
-output
> df
col1 col2 col3 col4 col5 min max
1 1 3 1 7 3 1 7
2 4 2 8 2 5 2 8
3 3 1 5 1 4 1 5
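As a small illustration of why do.call works here: a data frame is a list of columns, so do.call passes each selected column to pmin as a separate argument, with na.rm appended as a named argument. The sketch below compares that with the call written out by hand. The names via_do_call and spelled_out are made up for this example.

```r
# Same data as in the question, rebuilt so the sketch is self-contained
df <- data.frame(col1 = c(1L, 4L, 3L), col2 = 3:1, col3 = c(1L, 8L, 5L),
                 col4 = c(7L, 2L, 1L), col5 = c(3L, 5L, 4L))

# do.call spreads the list elements (columns 2 to 4 plus na.rm) as arguments
via_do_call <- do.call(pmin, c(df[2:4], na.rm = TRUE))

# The equivalent call with the columns named explicitly
spelled_out <- pmin(df$col2, df$col3, df$col4, na.rm = TRUE)

# Both give the element-wise (row-wise) minimum of the three columns
all.equal(as.vector(via_do_call), as.vector(spelled_out))
```

The do.call form is handy when the columns are chosen by position or by a vector of names, since the call does not need to be rewritten when the selection changes.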
Or using tidyverse, we can do the following (exec comes from rlang, so it is called with its namespace here):
library(dplyr)
df %>%
  mutate(min = rlang::exec(pmin, !!! rlang::syms(names(.)[2:4]), na.rm = TRUE),
         max = rlang::exec(pmax, !!! rlang::syms(names(.)[2:4]), na.rm = TRUE))
-output
col1 col2 col3 col4 col5 min max
1 1 3 1 7 3 1 7
2 4 2 8 2 5 2 8
3 3 1 5 1 4 1 5
data
df <- structure(list(col1 = c(1L, 4L, 3L), col2 = 3:1, col3 = c(1L,
8L, 5L), col4 = c(7L, 2L, 1L), col5 = c(3L, 5L, 4L)),
class = "data.frame", row.names = c(NA,
-3L))
With dplyr, you could use across + pmin/pmax:
library(dplyr)
df %>%
mutate(min = do.call(pmin, c(across(col2:col4), na.rm = TRUE)),
max = do.call(pmax, c(across(col2:col4), na.rm = TRUE)))
# # A tibble: 3 × 7
# col1 col2 col3 col4 col5 min max
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 3 1 7 3 1 7
# 2 4 2 8 2 5 2 8
# 3 3 1 5 1 4 1 5
or c_across + min/max:
df %>%
rowwise() %>%
mutate(min = min(c_across(col2:col4), na.rm = TRUE),
max = max(c_across(col2:col4), na.rm = TRUE)) %>%
ungroup()
Because you tagged the question with dplyr, here is a dplyr solution (shown on mtcars):
library(dplyr)
mt2 <- mtcars %>%
mutate(pmax = pmax(cyl,carb),
pmin = pmin(cyl,carb))
Here is one with rowwise() combined with c_across(), restricted to columns 2 to 4 as in the question:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(min = min(c_across(col2:col4)),
         max = max(c_across(col2:col4)))
col1 col2 col3 col4 col5 min max
<int> <int> <int> <int> <int> <int> <int>
1 1 3 1 7 3 1 7
2 4 2 8 2 5 2 8
3 3 1 5 1 4 1 5

Returning group maximum and NA using dplyr

I need a function I can use that returns both the group maximum and any NA values. Here is toy data:
df <- data.frame(id = rep(1:5,
each = 3),
score = rnorm(15))
df$score[c(3,7,10,14)] <- NA
# id score
# 1 1 -1.4666164
# 2 1 0.4392647
# 3 1 NA
# 4 2 -0.6010311
# 5 2 1.9845774
# 6 2 0.1749082
# 7 3 NA
# 8 3 -0.3089731
# 9 3 0.4427471
# 10 4 NA
# 11 4 1.7156319
# 12 4 -0.2354253
# 13 5 1.1781350
# 14 5 NA
# 15 5 0.0642082
I can use slice_max to get the maximum in each group:
df %>%
group_by(id) %>%
slice_max(score)
# id score
# <int> <dbl>
# 1 1 0.439
# 2 2 1.98
# 3 3 0.443
# 4 4 1.72
# 5 5 1.18
But how do I get the maximum plus any NAs returned?
We can group_by the id column, then use summarize with two calls to max: one with na.rm = T and one without. The call without na.rm returns NA for any group that contains an NA, and union() combines the two results while dropping duplicates.
library(dplyr)
df %>%
group_by(id) %>%
summarize(score = union(max(score, na.rm = T), max(score)))
UPDATE: The above code only works if you have at most one NA per ID. Thanks @KU99 for the reminder.
If you have more than one NA per ID, you need to combine the result of max with the records of NA found by is.na().
df %>%
group_by(id) %>%
summarize(score = c(max(score, na.rm = T), score[is.na(score)]))
Result
# A tibble: 9 × 2
# Groups: id [5]
id score
<int> <dbl>
1 1 0.735
2 1 NA
3 2 0.314
4 3 0.994
5 3 NA
6 4 0.847
7 4 NA
8 5 1.95
9 5 NA
Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), score = c(-1.05089006245306, 0.734652105895187,
NA, -1.31427279695036, -0.250038722057874, 0.314204596436828,
NA, 0.994420599790523, 0.855768431757766, NA, 0.834325037545013,
0.846790152407738, 1.95410525460771, NA, 0.971120269710021)), row.names = c(NA,
-15L), class = "data.frame")
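A note on newer dplyr versions: from dplyr 1.1.0, returning more than one row per group from summarize() is deprecated, and reframe() is the intended replacement. The sketch below shows the same idea with reframe(), assuming dplyr 1.1.0 or later is installed. The scores are the answer's data rounded to two decimals for brevity, and res is a name made up for this example.

```r
library(dplyr)

# Scores rounded from the data above; NA positions are unchanged
df <- data.frame(
  id = rep(1:5, each = 3),
  score = c(-1.05, 0.73, NA, -1.31, -0.25, 0.31, NA, 0.99, 0.86,
            NA, 0.83, 0.85, 1.95, NA, 0.97)
)

# reframe() may return any number of rows per group:
# the group maximum first, then one NA for every missing score
res <- df %>%
  group_by(id) %>%
  reframe(score = c(max(score, na.rm = TRUE), score[is.na(score)]))

res
```

Unlike summarize(), reframe() always returns an ungrouped data frame, so no trailing ungroup() is needed.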
One option would be to use slice and | to create a logical condition with is.na to return the NA rows and the max rows.
library(dplyr)
df %>%
group_by(id) %>%
slice(which(score == max(score, na.rm = T)|is.na(score)))
Another option would be to use slice_max as you did, but then use bind_rows to add the NA rows back to the data frame.
library(dplyr)
df %>%
group_by(id) %>%
slice_max(score) %>%
bind_rows(df %>% filter(is.na(score))) %>%
arrange(id)
Output
id score
<int> <dbl>
1 1 -0.161
2 1 NA
3 2 1.49
4 3 -0.451
5 3 NA
6 4 0.878
7 4 NA
8 5 -0.0652
9 5 NA
Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), score = c(-0.161217942983375, -0.456571996252207,
NA, 0.540071362460494, 1.49325799630099, -0.17985218510166, NA,
-0.451301758592, -0.839100876399644, NA, -0.0432130218441599,
0.87779273806634, -0.339260854059069, NA, -0.065177224102029)), row.names = c(NA,
-15L), class = "data.frame")
Using a custom function you could do:
library(dplyr)
set.seed(123)
slice_max_na <- function(.data, order_by, ..., n, prop, with_ties = TRUE) {
bind_rows(
slice_max(.data, order_by = {{order_by}}, ..., n = n, prop = prop, with_ties = with_ties),
filter(.data, is.na({{order_by}})),
)
}
df %>%
group_by(id) %>%
slice_max_na(score)
#> # A tibble: 9 × 2
#> # Groups: id [5]
#> id score
#> <int> <dbl>
#> 1 1 -0.230
#> 2 2 1.72
#> 3 3 -0.687
#> 4 4 1.22
#> 5 5 0.401
#> 6 1 NA
#> 7 3 NA
#> 8 4 NA
#> 9 5 NA
Here is another dplyr version, using rank:
library(dplyr)
df %>%
group_by(id) %>%
mutate(rank = rank(-score, ties.method = "random")) %>%
filter(rank == 1 | is.na(score)) %>%
select(-rank)
id score
<int> <dbl>
1 1 0.505
2 1 NA
3 2 -0.109
4 3 NA
5 3 1.45
6 4 NA
7 4 0.355
8 5 NA
9 5 -0.298

Counting occurrences in every column in R

Hello, I need to count the occurrences of every number in each column.
Example data-frame:
A B C
2 1 2
2 1 1
1 1 3
3 3 3
3 2 2
2 1 2
I want my output to look like this
how_much A B C
1 1 4 1
2 3 1 3
3 2 1 2
In tidyverse you could do:
library(tidyverse)
gather(df1) %>%
  group_by(key, value) %>%
  count() %>%
  pivot_wider(id_cols = value, names_from = key, values_from = n, values_fill = 0)
value A B C
<int> <int> <int> <int>
1 1 1 4 1
2 2 3 1 3
3 3 2 1 2
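Since gather() is superseded in current tidyr, the same reshaping can be sketched with pivot_longer(). This is an illustrative variant, not the original answer's code. The name counts is made up, and the data frame is rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, rebuilt so the sketch is self-contained
df1 <- data.frame(A = c(2L, 2L, 1L, 3L, 3L, 2L),
                  B = c(1L, 1L, 1L, 3L, 2L, 1L),
                  C = c(2L, 1L, 3L, 3L, 2L, 2L))

counts <- df1 %>%
  # long format: one row per (column name, value) pair
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  # frequency of each value within each original column
  count(key, value) %>%
  # back to wide format; combinations that never occur become 0
  pivot_wider(names_from = key, values_from = n, values_fill = 0) %>%
  arrange(value)

counts
```

The final arrange(value) is there because the row order after pivot_wider() follows the first appearance of each value in the long data, which is not guaranteed to be sorted for other inputs.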
We can use table
table(unlist(df1), names(df1)[c(col(df1))])
-output
A B C
1 1 4 1
2 3 1 3
3 2 1 2
Or loop over the columns with sapply, and apply table
sapply(df1, table)
A B C
1 1 4 1
2 3 1 3
3 2 1 2
data
df1 <- structure(list(A = c(2L, 2L, 1L, 3L, 3L, 2L), B = c(1L, 1L, 1L,
3L, 2L, 1L), C = c(2L, 1L, 3L, 3L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-6L))
To make the solution more flexible, so that it works for any set of numbers, we can use the following approach based on purrr functions.
library(dplyr)
library(purrr)
df1 %>%
map(~ unique(.x) %>% sort()) %>% reduce(~ union(..1, ..2)) %>%
bind_cols(map_dfr(., ~ map_dfc(df1, function(a) sum(a == .x)))) %>%
rename(what = ...1)
# A tibble: 3 x 4
what A B C
<int> <int> <int> <int>
1 1 1 4 1
2 2 3 1 3
3 3 2 1 2
A slightly verbose answer, but it will work on all data types.
set.seed(1234)
df1 <- data.frame(A = sample(letters[1:3], 8, T),
B = sample(letters[1:3], 8, T),
C = sample(letters[1:3], 8, T))
df1
#> A B C
#> 1 b c b
#> 2 b b a
#> 3 a b c
#> 4 c b c
#> 5 a c c
#> 6 a b a
#> 7 b b b
#> 8 b b a
library(tidyverse)
unique(unlist(apply(df1, 1, unique))) %>% as.data.frame() %>% setNames('how_much') %>%
bind_cols(map_df(unique(unlist(apply(df1, 1, unique))), ~map_int(df1, \(x) sum(x %in% .x) ) ))
#> how_much A B C
#> 1 b 4 6 2
#> 2 c 1 2 3
#> 3 a 3 0 3
Created on 2021-06-23 by the reprex package (v2.0.0)

group_by() and percentages: summarise() drops the columns I also need - R

I have this df:
> df <- data.frame(Adults = sample(0:5, 10, replace = TRUE),
+ Children = sample(0:2, 10, replace = TRUE),
+ Teens = sample(1:3, 10, replace = TRUE),
+ stringsAsFactors = FALSE)
> df
Adults Children Teens
1 5 0 1
2 5 1 2
3 5 2 3
4 5 2 2
5 0 1 2
6 5 1 3
7 0 2 3
8 4 2 1
9 4 0 1
10 1 2 1
We can see that Children doesn't have 3,4,5 values and Teens doesn't have 0,4,5 values. However, we know that Adults, Children, and Teens could have from 0 to 5.
When I use group_by() with summarise(), summarise drops the columns I'm not grouping. The code:
df %>%
group_by(Adults) %>% mutate(n_Adults = n()) %>%
group_by(Teens) %>% mutate(n_Teens = n()) %>%
group_by(Children) %>% mutate(n_Children = n())
And when I group by c(0,1,2,3,4,5) (in order to have all the possible values) it gives me this error:
Error in mutate_impl(.data, dots) : Column `c(0, 1, 2, 3, 4, 5)` must be length 10 (the number of rows) or one, not 6
I'm looking for this output:
Values n_Adults n_Children n_Teens p_Adults p_Children p_Teens
0 2 2 0 0.2 0.2 0
1 1 3 4 0.1 0.3 0.4
2 0 5 3 0 0.5 0.3
3 0 0 3 0 0 0.3
4 2 0 0 0.2 0 0
5 5 0 0 0.5 0 0
Where n_ is the count of the respective column and p_ is the percentage of the respective column.
We can gather the data into 'long' format, convert 'value' to a factor with levels 0:5, and get the frequencies with count. Then we spread back to 'wide' format, create the 'p' columns by dividing each count column by its sum, and, if needed, rename the count columns (with rename_at).
library(tidyverse)
gather(df) %>%
count(key, value = factor(value, levels = 0:5)) %>%
spread(key, n, fill = 0) %>%
mutate_at(2:4, list(p = ~./sum(.)))%>%
rename_at(2:4, ~ paste0(.x, "_n"))
data
df <- structure(list(Adults = c(1L, 1L, 4L, 3L, 3L, 5L, 1L, 4L, 4L,
1L), Children = c(1L, 1L, 2L, 2L, 0L, 2L, 0L, 0L, 1L, 0L), Teens = c(1L,
2L, 3L, 1L, 1L, 3L, 1L, 2L, 2L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
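The same table can also be sketched in base R, which makes the role of the factor levels explicit: table() on a factor with levels 0:5 reports a count for every level, including zeros. This is an illustrative alternative under that assumption, not the original answer's method. The names lv, n, p and res are made up for this example.

```r
# Same data as above, rebuilt so the sketch is self-contained
df <- data.frame(
  Adults   = c(1L, 1L, 4L, 3L, 3L, 5L, 1L, 4L, 4L, 1L),
  Children = c(1L, 1L, 2L, 2L, 0L, 2L, 0L, 0L, 1L, 0L),
  Teens    = c(1L, 2L, 3L, 1L, 1L, 3L, 1L, 2L, 2L, 1L)
)

lv <- 0:5  # every value a column could take

# Count each level per column; unused levels are reported as 0
n <- sapply(df, function(x) as.integer(table(factor(x, levels = lv))))

# Column-wise proportions: each count divided by its column total
p <- prop.table(n, margin = 2)

colnames(n) <- paste0("n_", names(df))
colnames(p) <- paste0("p_", names(df))

res <- data.frame(Values = lv, n, p)
res
```

Because the levels are fixed up front, the result always has six rows, even when some value between 0 and 5 never occurs in any column.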
library(reprex)
library(tidyverse)
set.seed(20)
df <- data.frame(Adults = sample(0:5, 10, replace = TRUE),
Children = sample(0:2, 10, replace = TRUE),
Teens = sample(1:3, 10, replace = TRUE),
stringsAsFactors = FALSE)
df
#> Adults Children Teens
#> 1 5 2 2
#> 2 4 2 1
#> 3 1 0 2
#> 4 3 2 1
#> 5 5 0 1
#> 6 5 1 1
#> 7 0 0 3
#> 8 0 0 3
#> 9 1 0 1
#> 10 2 2 3
df_adults <- df %>%
  count(Adults) %>%
  rename(n_Adults = n)
df_children <- df %>%
  count(Children) %>%
  rename(n_Children = n)
df_teens <- df %>%
  count(Teens) %>%
  rename(n_Teens = n)
df_new <- data.frame(unique_id = 0:5)
df_new <- left_join(df_new, df_adults, by = c("unique_id" = "Adults"))
df_new <- left_join(df_new, df_children, by = c("unique_id" = "Children"))
df_new <- left_join(df_new, df_teens, by = c("unique_id" = "Teens"))
df_new <- df_new %>%
replace_na(list( n_Adults=0, n_Children=0, n_Teens=0))
df_new %>%
mutate(p_Adults = n_Adults/sum(n_Adults),p_Children = n_Children/sum(n_Children), p_Teens = n_Teens/sum(n_Teens))
#> unique_id n_Adults n_Children n_Teens p_Adults p_Children p_Teens
#> 1 0 2 5 0 0.2 0.5 0.0
#> 2 1 2 1 5 0.2 0.1 0.5
#> 3 2 1 4 2 0.1 0.4 0.2
#> 4 3 1 0 3 0.1 0.0 0.3
#> 5 4 1 0 0 0.1 0.0 0.0
#> 6 5 3 0 0 0.3 0.0 0.0
Created on 2019-02-25 by the reprex package (v0.2.1)

Sorting and calculating sum and rank with new columns in R

I have 200 columns and want to calculate sums and ranks and then add them as new columns. Here is an example of the data:
df<-read.table(text="Q1a Q2a Q3b Q4c Q5a Q6c Q7b
1 2 4 2 2 0 1
3 2 1 2 2 1 1
4 3 2 1 1 1 1",h=T)
I want to sum a, b and c for each row, and then sum them together. Next I want to calculate the rank for each row. I want to generate the following table:
Q1a Q2a Q3b Q4c Q5a Q6c Q7b a b c Total Rank
1 2 4 2 2 0 1 5 5 2 12 2
3 2 1 2 2 1 1 7 2 3 12 2
4 3 2 1 1 1 1 8 3 2 13 1
library(dplyr)
df %>%
cbind(sapply(c('a', 'b', 'c'), function(x) rowSums(.[, grep(x, names(.)), drop=FALSE]))) %>%
mutate(Total = a + b + c,
Rank = match(Total, sort(Total, decreasing = T)))
Output is:
Q1a Q2a Q3b Q4c Q5a Q6c Q7b a b c Total Rank
1 1 2 4 2 2 0 1 5 5 2 12 2
2 3 2 1 2 2 1 1 7 2 3 12 2
3 4 3 2 1 1 1 1 8 3 2 13 1
Sample data:
df <- structure(list(Q1a = c(1L, 3L, 4L), Q2a = c(2L, 2L, 3L), Q3b = c(4L,
1L, 2L), Q4c = c(2L, 2L, 1L), Q5a = c(2L, 2L, 1L), Q6c = c(0L,
1L, 1L), Q7b = c(1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-3L))
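With 200 columns, matching a bare letter with grep can become fragile, so it may help to derive the group from the end of each column name instead. The sketch below assumes every name has the form Q, then a number, then a single group letter, as in the example. It also uses rank() with ties.method = "min", which gives tied totals the same rank. The names suffix, sums and out are made up for this example.

```r
# Same data as above, rebuilt so the sketch is self-contained
df <- data.frame(Q1a = c(1L, 3L, 4L), Q2a = c(2L, 2L, 3L), Q3b = c(4L, 1L, 2L),
                 Q4c = c(2L, 2L, 1L), Q5a = c(2L, 2L, 1L), Q6c = c(0L, 1L, 1L),
                 Q7b = c(1L, 1L, 1L))

# Strip the leading "Q<number>" so only the group letter remains
suffix <- sub("^Q[0-9]+", "", names(df))

# Split the columns by group letter and sum each group row by row
sums <- sapply(split.default(df, suffix), rowSums)

out <- cbind(df, sums)
out$Total <- rowSums(sums)

# Highest total gets rank 1; tied totals share the smaller rank
out$Rank <- rank(-out$Total, ties.method = "min")
out
```

On the sample data, rows 1 and 2 both total 12 and share rank 2, while row 3 totals 13 and gets rank 1.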
You can also go with the tidyverse approach. However, it is longer.
library(tidyverse)
df %>%
rownames_to_column(var = "ID") %>%
gather(question, value, -ID) %>%
# take the last character so that names like Q10a or Q200c also work
mutate(type = substr(question, nchar(question), nchar(question))) %>%
group_by(ID, type) %>%
summarise(sumType = sum(value, na.rm = TRUE)) %>%
as.data.frame() %>%
spread(type, sumType) %>%
mutate(Total = a+b+c,
Rank = match(Total, sort(Total, decreasing = T)))
Results:
ID a b c Total Rank
1 1 5 5 2 12 2
2 2 7 2 3 12 2
3 3 8 3 2 13 1
