Calculate proportions of multiple columns - r

I have a dataset that looks like this:
location = rep(c("A", "B", "C", "D"),
times = c(4, 6, 3, 7))
ID = (1:20)
Var1 = rep(c(0,2,1,1,0), times = 4)
Var2 = rep(c(2,1,1,0,2), times = 4)
Var3 = rep(c(1,1,0,2,0), times = 4)
df=as.data.frame(cbind(location, ID, Var1, Var2, Var3))
There are different locations where we evaluated variables with three levels each (score 0, 1, 2). Now I would like to get a result that contains the proportions of each score by location. The number of individuals examined (ID) is not the same at each location.
So what I did was making functions to use with lapply:
score0 = function(a){sum(a==0)}
score1 = function(a){sum(a==1)}
score2 = function(a){sum(a==2)}
And I tried this, as well as many other things:
df %>%
group_by(location) %>%
lapply(FUN = score0)
But it doesn't work. Again, what I would like to get is a data frame with the proportions of each score or level (0, 1, 2) per location. Or at least the number of occurrences of each score, so I can divide it by the number of individuals per location.
I hope this makes sense.
I also checked this question Calculate proportions of categories within groups but cannot apply the solution to my data with multiple variables.
Thanks for your help!

Something like this?
df %>%
pivot_longer(starts_with("Var"), values_to = "score") %>%
type_convert() %>%
group_by(location) %>%
count(score) %>%
mutate(frac = n / sum(n))
resulting in
# A tibble: 12 × 4
# Groups: location [4]
location score n frac
<chr> <dbl> <int> <dbl>
1 A 0 3 0.25
2 A 1 6 0.5
3 A 2 3 0.25
4 B 0 7 0.389

Thank you, #danloo, this was almost it. What I wanted was this:
df %>%
pivot_longer(starts_with("Var"), values_to = "score") %>%
type_convert() %>%
group_by(location, name) %>%
count(score) %>%
mutate(frac = n / sum(n))
with the following result:
# A tibble: 36 × 5
# Groups: location, name [12]
location name score n frac
<chr> <chr> <dbl> <int> <dbl>
1 A Var1 0 2 0.4
2 A Var1 1 2 0.4
3 A Var1 2 1 0.2
4 A Var2 0 1 0.2
5 A Var2 1 2 0.4
6 A Var2 2 2 0.4
7 A Var3 0 2 0.4
8 A Var3 1 2 0.4
9 A Var3 2 1 0.2
10 B Var1 0 2 0.4

Related

Referencing previous / next column in call to mutate()

My data looks like this:
set.seed(1234)
library(tidyverse)
df <- data.frame(Time = c(1,1,2,2,3,3),
Region = c("A", "B", "A", "B", "A", "B"),
Age_1 = round(rnorm(6, mean = 10),0),
Age_2 = round(rnorm(6, mean = 10),0),
Age_3 = round(rnorm(6, mean = 10),0),
Age_4 = round(rnorm(6, mean = 10),0),
Age_5 = round(rnorm(6, mean = 10),0))
I need to generate ratios of population change for each region and point in time. For instance, Ratio_2 for Time == 2 would be Age_2 (at Time == 2) / Age_1 (at Time == 1), grouped by Region. I could do this manually by typing:
df %>%
group_by(Region) %>%
mutate(Ratio_2 = Age_2 / dplyr::lag(Age_1, order_by = Time),
Ratio_3 = Age_3 / dplyr::lag(Age_2, order_by = Time),
Ratio_4 = Age_4 / dplyr::lag(Age_3, order_by = Time),
Ratio_5 = Age_5 / dplyr::lag(Age_4, order_by = Time))
df
# A tibble: 6 x 11
# Groups: Region [2]
Time Region Age_1 Age_2 Age_3 Age_4 Age_5 Ratio_2 Ratio_3 Ratio_4 Ratio_5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 11 8 9 9 10 NA NA NA NA
2 1 B 10 9 10 10 11 NA NA NA NA
3 2 A 9 10 9 8 12 0.909 1.12 0.889 1.33
4 2 B 9 10 9 9 9 1 1 0.9 0.9
5 3 A 8 11 9 9 12 1.22 0.9 1 1.5
6 3 B 9 9 9 9 9 1 0.9 1 1
Since my original data has lots of age groups, this procedure involves lots of manual coding. A programmatic solution in my mind could look something like this:
df %>%
group_by(Region) %>%
mutate(across(4:7, ~ . / dplyr::lag(.[?], order_by = Time), .names="Ratio_{.col}"))
The part containing dplyr::lag(.[?]) needs to reference the previous column in the data frame relative to . but I haven't found a method for doing so.
Note: This question is related to a post from yesterday, in which I was trying to solve the problem at hand with the data being in long format. Doing it in wide format is a different question though, which is why I opened this question.
Here is one option with across
library(dplyr)
library(stringr)
df %>%
group_by(Region) %>%
mutate(across(matches('^Age_[2-5]$'),
~ ./lag(get(str_replace(cur_column(), '\\d+',
as.character(readr::parse_number(cur_column())-1))), order_by = Time ),
.names = "Ratio_{.col}" )) %>%
ungroup
Or it can be done in a simplified way
library(purrr)
df[str_c('Region_', 2:5)] <- map2(df[4:7], df[3:6],
~ .x/lag(.y, order_by = df$Time))

Add levels missing in one group to summary table using dplyr

When summarizing data, some groups may have observations not present in another group. In the example below, group 2 has no males. How can I in a tidy way, insert these observations in a summary table?
data example:
a <- data.frame(gender=factor(c("m", "m", "m", "f", "f", "f", "f")), group=c(1,1,1,1,1,2,2))
gender group
1 m 1
2 m 1
3 m 1
4 f 1
5 f 1
6 f 2
7 f 2
data summary:
a %>% group_by(gender, group) %>% summarise(n=n())
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
Desired output:
gender group n
<fct> <dbl> <int>
1 f 1 2
2 f 2 2
3 m 1 3
4 m 2 0
At the end, we can use complete
library(dplyr)
library(tidyr)
a %>%
group_by(gender, group) %>%
summarise(n=n(), .groups = 'drop') %>%
complete(gender, group, fill = list(n = 0))
-output
# A tibble: 4 x 3
# gender group n
# <fct> <dbl> <dbl>
#1 f 1 2
#2 f 2 2
#3 m 1 3
#4 m 2 0
Or an option is also to reshape to wide and then back to long format
a %>%
pivot_wider(names_from = group, values_from = group,
values_fn = length, values_fill = 0) %>%
pivot_longer(cols = -gender, names_to = 'group', values_to = 'n')
It is more easier in base R
as.data.frame(table(a))

a beautiful solution to decode a table with dplyr and mutate

Dear dplyr/tidyverse companions, I am looking for a nice solution to the following problem. I only get my solutions in base R with a loop. How do you solve this cleanly in tidyverse?
I have a dataset called data, which has not useful column names and not useful values (integer).
data <- tibble(var1 = rep(c(1:3), 2),
var2 = rep(c(1:3), 2))
# A tibble: 6 x 2
var1 var2
<int> <int>
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
Additional I have a coding table, which has for every column a better name (var1 -> variable1) and a better value (1 -> "a")
coding <- tibble(variable = c(rep("var1", 3),rep("var2", 3)),
name = c(rep("variable1", 3),rep("variable2", 3)),
code = rep(c(1:3), 2),
value = rep(c("a", "b", "c"), 2))
# A tibble: 6 x 4
variable name code value
<chr> <chr> <int> <chr>
1 var1 variable1 1 a
2 var1 variable1 2 b
3 var1 variable1 3 c
4 var2 variable2 1 a
5 var2 variable2 2 b
6 var2 variable2 3 c
I'm looking for a result, which has transformed names of the columns and the real values as factors in the dataset, compare:
result <- tibble(variable1 = factor(rep(c("a", "b", "c"), 2)),
variable2 = factor(rep(c("a", "b", "c"), 2)))
# A tibble: 6 x 2
variable1 variable2
<fct> <fct>
1 a a
2 b b
3 c c
4 a a
5 b b
6 c c
Thank you for your commitment :) :) :) :)
library(dplyr)
library(tidyr)
data %>%
stack() %>%
left_join(coding, by = c(ind = "variable", values = "code")) %>%
group_by(name) %>%
mutate(j = row_number()) %>%
pivot_wider(id_cols = j, values_from = value) %>%
select(-j)
# # A tibble: 6 x 2
# variable1 variable2
# <chr> <chr>
# 1 a a
# 2 b b
# 3 c c
# 4 a a
# 5 b b
# 6 c c
A general solution for any number of columns -
create a row number column to identify each row
get data in long format
join it with coding for each value
keep only unique rows and get it back in wide format.
library(dplyr)
library(tidyr)
data %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, values_to = 'code') %>%
left_join(coding, by = 'code') %>%
select(row, name = name.y, value) %>%
distinct() %>%
pivot_wider() %>%
select(-row)
# variable1 variable2
# <chr> <chr>
#1 a a
#2 b b
#3 c c
#4 a a
#5 b b
#6 c c

Use pivot_wider on table but keep count of rows

So my problem is as follows, I have a small data frame like this:
test_df <- data.frame(id=c(1,1,2,2,2), ttype=c("D", "C", "D", "D", "C"), val=c(1, 5, 10, 5, 100))
test_df
id ttype val
1 1 A 1
2 1 B 5
3 2 A 10
4 2 A 5
5 2 B 100
Now I want to make it wider to end up like this:
id A B n
1 1 5 1 2
2 2 100 15 3
So I want to replace the ttype with a column for each value, grouped by id with the summed values of val. But my problem is that I still want to keep track of how many either A or B occurred in total for each id, which is n in this case.
Now I found a way to do this, but it is very ugly. But this way works:
test_df %>%
group_by(id, ttype) %>%
summarise(val = sum(val), n=n()) %>%
pivot_wider(names_from = ttype, values_from=c(val, n), values_fill=0) %>%
mutate(n=n_A+n_B) %>%
select(-n_A, -n_B)
results in:
# A tibble: 2 x 4
# Groups: id [2]
id val_A val_B n
<dbl> <dbl> <dbl> <int>
1 1 5 1 2
2 2 100 15 3
So here the amount of A en B is included separately, after which I sum them and remove both other columns. But this means I have to hardcode column names and makes it not really doable when there are more than 2 values in ttype.
I feel like there must be a simple way to do this, but I can't figure it out.
You can add count of id rows as new column and get data in wide format using pivot_wider by taking sum of val values.
library(dplyr)
library(tidyr)
test_df %>%
add_count(id) %>%
pivot_wider(names_from = ttype, values_from = val, values_fn = sum)
# id n D C
# <dbl> <int> <dbl> <dbl>
#1 1 2 1 5
#2 2 3 15 100

Using dplyrs group_by and summarise to find number of intersections with a different vector

I have a situation where I am trying to find the number of intersections with a vector per group in another tibble.
Data example
a <- tibble(EXPERIMENT = rep(c("a","b","c"),each =4),
ECOTYPE = rep(1:12))
b <- tibble(ECOTYPE = c(1,1,5,4,8,7,6,1,4,4,2,5,6,7,1))
I want to find the number of intersections between ECOTYPE in b and ECOTYPEper EXPERIMENT in a.
I wonder if I can use dplyr to solve this, as the group_by function seems to fit this problem, but when I run:
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, .$ECOTYPE))
I only get the total number of intersections between a and b.
Am I missing something?
Edit:
Sorry for not posting my desired output. I would like something like this:
# A tibble: 3 x 2
EXPERIMENT INTERSECTIONS
<chr> <dbl>
1 a 8
2 b 7
3 c 0
Depending how you want to count, this will give the number of rows in b matching a:
b %>% mutate(b_flag = 1) %>%
right_join(a) %>%
group_by(EXPERIMENT) %>%
summarize(INTERSECTIONS = sum(b_flag, na.rm = T))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <dbl>
# 1 a 8
# 2 b 7
# 3 c 0
I think the only problem with your code is the unnecessary .$, but it gives the counts of distinct ecotypes in b, ignoring the fact that b has three ECOTYPE = 1 rows, for example.
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, ECOTYPE)))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <int>
# 1 a 3
# 2 b 4
# 3 c 0
This is a result of how intersect works:
intersect(c(1, 2, 3), c(1, 1, 1))
# [1] 1
Join the two and count how many are left:
inner_join(a,b, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% count()
# A tibble: 2 x 2
# Groups: EXPERIMENT [2]
EXPERIMENT n
<chr> <int>
1 a 8
2 b 7
Now, if you add an indicator column to b, you can start to count absences as well:
b %>% mutate(present=TRUE) %>% right_join(a, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% summarise(n(), missing=sum(is.na(present)))
# A tibble: 3 x 3
EXPERIMENT `n()` missing
<chr> <int> <int>
1 a 9 1
2 b 7 0
3 c 4 4

Resources