Group_by and mutate by multiple columns in R

I have a data frame with the columns City, Gender, 2013, 2014 and 2015.
City Gender 2013 2014 2015
Aberdeen Female 30 40 50
Aberdeen Male 20 15 16
Aberdeenshire Female 60 80 70
Aberdeenshire Male 50 40 15
.....Includes 425 records.
I want to compute the female-to-male ratio (Female divided by Male) for each city and year. This is the result I am trying to get:
City 2013_ratio 2014_ratio 2015_ratio
Aberdeen 1.5 2.66 3.12
Aberdeenshire 1.2 2 4.66
Can anyone help me solve this? I have tried grouping by City, but I don't know how to divide the Female row by the Male row within each group.

The ratio is easier to calculate if Male and Female are in separate columns; you can reshape the data with tidyr:
library(dplyr)
library(tidyr)
df %>%
  gather(Year, Value, -City, -Gender) %>%
  spread(Gender, Value) %>%
  mutate(Ratio = Female/Male, Year = paste0(Year, "_Ratio")) %>%
  select(-Female, -Male) %>%
  spread(Year, Ratio)
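As a hedged sketch of the same reshape-then-divide idea with the newer tidyr verbs: pivot_longer() and pivot_wider() supersede gather() and spread(). The tibble below just retypes the four rows shown in the question; the names ratios, Value and Ratio and the "_ratio" suffix are illustrative choices, not something the question prescribes.

```r
library(dplyr)
library(tidyr)

# The four example rows from the question (year columns kept as plain strings)
df <- tibble(
  City   = c("Aberdeen", "Aberdeen", "Aberdeenshire", "Aberdeenshire"),
  Gender = c("Female", "Male", "Female", "Male"),
  `2013` = c(30, 20, 60, 50),
  `2014` = c(40, 15, 80, 40),
  `2015` = c(50, 16, 70, 15)
)

ratios <- df %>%
  # one row per City / Gender / Year
  pivot_longer(-c(City, Gender), names_to = "Year", values_to = "Value") %>%
  # put Female and Male side by side so they can be divided
  pivot_wider(names_from = Gender, values_from = Value) %>%
  mutate(Ratio = Female / Male, Year = paste0(Year, "_ratio")) %>%
  select(City, Year, Ratio) %>%
  # back to one row per City with one ratio column per year
  pivot_wider(names_from = Year, values_from = Ratio)

ratios
```

This assumes every city has exactly one Female and one Male row; with duplicates pivot_wider() would produce list columns instead of numbers.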

The code for Rob's suggested solution would be as follows (with an additional spread() step):
# data
df = data.frame(City = c("a", "a", "b", "b"),
Gender = c("Female", "Male", "Female", "Male"),
`2013` = c(30, 20, 60, 50),
`2014` = c(40, 15, 80, 40),
`2015` = c(50, 16, 70, 15))
# Actual process
library("dplyr")
library("tidyr")
df %>%
# Transform wide table into tidy
gather("Year", "Number", X2013:X2015) %>%
# Reshape gender columns for easier summaries
spread("Gender", "Number") %>%
# Compute ratios
group_by(City, Year) %>%
summarise(ratio = Female/(Male + Female))
#> # A tibble: 6 x 3
#> # Groups: City [?]
#> City Year ratio
#> <fct> <chr> <dbl>
#> 1 a X2013 0.6
#> 2 a X2014 0.727
#> 3 a X2015 0.758
#> 4 b X2013 0.545
#> 5 b X2014 0.667
#> 6 b X2015 0.824
Created on 2018-10-10 by the reprex package (v0.2.1)
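Note that this computes the female share, Female/(Male + Female). To reproduce the result in the question, use Female/Male inside summarise() and then apply spread() again to spread the ratios over the years: spread(Year, ratio).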

With tidyverse:
df = read.table(text="City Gender 2013 2014 2015
Aberdeen Female 30 40 50
Aberdeen Male 20 15 16
Aberdeenshire Female 60 80 70
Aberdeenshire Male 50 40 15", header = T)
library(tidyverse)
df %>%
  group_by(City) %>%
  # after sorting, Female is the first row and Male the last row in each city
  arrange(City, Gender) %>%
  summarise_at(vars(X2013:X2015), .funs = funs(ratio = first(.)/last(.)))
# A tibble: 2 x 4
City X2013_ratio X2014_ratio X2015_ratio
<fct> <dbl> <dbl> <dbl>
1 Aberdeen 1.5 2.67 3.12
2 Aberdeenshire 1.2 2 4.67
or
df %>%
group_by(City) %>%
arrange(City,Gender) %>%
summarise_at(vars(X2013:X2015), .funs = funs(ratio = .[Gender == "Female"]/.[Gender != "Female"]))
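In current dplyr releases funs() is deprecated and the summarise_at() family is superseded by across(). A minimal sketch of the second variant in that style, assuming dplyr >= 1.0.0 and the X-prefixed column names that read.table() produces; the object name ratios is only for illustration:

```r
library(dplyr)

# Same example rows as above, with the X-prefixed year columns
df <- data.frame(
  City   = c("Aberdeen", "Aberdeen", "Aberdeenshire", "Aberdeenshire"),
  Gender = c("Female", "Male", "Female", "Male"),
  X2013  = c(30, 20, 60, 50),
  X2014  = c(40, 15, 80, 40),
  X2015  = c(50, 16, 70, 15)
)

ratios <- df %>%
  group_by(City) %>%
  summarise(
    across(
      X2013:X2015,
      # pick the rows by Gender, so the result does not depend on row order
      ~ .x[Gender == "Female"] / .x[Gender == "Male"],
      .names = "{.col}_ratio"
    )
  )

ratios
```

Selecting by Gender rather than by first()/last() avoids the arrange() step, but it still assumes exactly one Female and one Male row per city.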

Related

Divide group sum by total sum

I am using the dplyr package. Let's suppose I have the below table.
Group count
A 20
A 10
B 30
B 35
C 50
C 60
My goal is to create a summary table that contains the mean of each group and also each group's mean as a share of the sum of all group means. The final table should look like this:
Group avg prcnt_of_total
A 15 .14
B 32.5 .31
C 55 .53
For example, 0.14 is the result of the following calculation: 15/(15+32.5+55)
Right now, I was only able to produce the first column code that calculates the mean for each group:
summary_df<- df %>%
group_by(Group)%>%
summarise(avg=mean(count))
I still don't know how to produce the prcnt_of_total column. Any suggestions?
You can use the following code:
df <- read.table(text="Group count
A 20
A 10
B 30
B 35
C 50
C 60", header = TRUE)
library(dplyr)
df %>%
group_by(Group) %>%
summarise(avg = mean(count)) %>%
ungroup() %>%
mutate(prcnt_of_total = prop.table(avg))
#> # A tibble: 3 × 3
#> Group avg prcnt_of_total
#> <chr> <dbl> <dbl>
#> 1 A 15 0.146
#> 2 B 32.5 0.317
#> 3 C 55 0.537
Created on 2022-07-14 by the reprex package (v2.0.1)
We can drop the group in summarise itself.
library(dplyr)
df1 %>%
group_by(Group) %>%
summarise(avg = mean(count), .groups = "drop") %>%
mutate(prcnt_of_total = avg/sum(avg))
#> # A tibble: 3 x 3
#> Group avg prcnt_of_total
#> <chr> <dbl> <dbl>
#> 1 A 15 0.146
#> 2 B 32.5 0.317
#> 3 C 55 0.537
On another note, I am not sure if getting the average divided by the sum of averages is a meaningful metric unless we are sure to have the same number of entries per group. Given that, I suggested another solution as well.
## if you always have the same number of rows between the groups
df1 %>%
group_by(Group) %>%
summarise(avg = mean(count),
prcnt_of_total = sum(count)/sum(.$count))
#> # A tibble: 3 x 3
#> Group avg prcnt_of_total
#> <chr> <dbl> <dbl>
#> 1 A 15 0.146
#> 2 B 32.5 0.317
#> 3 C 55 0.537
Data:
read.table(text = "Group count
A 20
A 10
B 30
B 35
C 50
C 60",
header = T, stringsAsFactors = F) -> df1
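To illustrate the caveat about group sizes, here is a small sketch with made-up data (df_unequal and the column names share_of_avgs and share_of_total are hypothetical, not from the question). When the groups have different numbers of rows, the share of the summed averages and the share of the grand total no longer agree:

```r
library(dplyr)

# Hypothetical data: group A has three rows, group B only one
df_unequal <- data.frame(
  Group = c("A", "A", "A", "B"),
  count = c(10, 10, 10, 30)
)

res <- df_unequal %>%
  group_by(Group) %>%
  summarise(avg = mean(count), total = sum(count), .groups = "drop") %>%
  mutate(
    share_of_avgs  = avg / sum(avg),     # mean divided by the sum of means
    share_of_total = total / sum(total)  # group sum divided by the grand total
  )

res
```

Here A and B each contribute half of the grand total, yet A gets only a quarter of the summed averages, so which metric is appropriate depends on what the percentage is meant to express.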
You can do this:
df %>%
group_by(Group) %>%
summarize(avg = mean(count), prcent_of_total = sum(count)/sum(df$count))
Output:
Group avg prcent_of_total
<chr> <dbl> <dbl>
1 A 15 0.146
2 B 32.5 0.317
3 C 55 0.537
data.table is similar:
library(data.table)
setDT(df)[,.(avg = mean(count), prcent_of_total = sum(count)/sum(df$count)),Group]

Convert rows into columns in R

I have this sample dataset and I want to convert it into a different format:
Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5,1,2,3,1,2,2.5)
df_before <- data.frame(Type, Level, Estimate)
Type Level Estimate
1 AGE 18-25 1.5
2 AGE 26-70 1.0
3 REGION London 2.0
4 REGION Southampton 3.0
5 REGION Newcastle 1.0
6 DRIVERS 1 2.0
7 DRIVERS 2 2.5
Basically, I would like to transform the dataset into the following format. I have tried the dcast() function, but it does not seem to work.
AGE Estimate_AGE REGION Estimate_REGION DRIVERS Estimate_DRIVERS
1 18-25 1.5 London 2 1 2.0
2 26-70 1.0 Southampton 3 2 2.5
3 <NA> NA Newcastle 1 <NA> NA
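library(dplyr)
library(tidyr)
df_before %>%
  group_by(Type) %>%
  # id = position of each row within its Type; Estimate as text so it can share a column with Level
  mutate(id = row_number(), Estimate = as.character(Estimate)) %>%
  pivot_longer(-c(Type, id)) %>%
  pivot_wider(id_cols = id, names_from = c(Type, name)) %>%
  type.convert(as.is = TRUE)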
# A tibble: 3 x 7
id AGE_Level AGE_Estimate REGION_Level REGION_Estimate DRIVERS_Level DRIVERS_Estimate
<int> <chr> <dbl> <chr> <int> <int> <dbl>
1 1 18-25 1.5 London 2 1 2
2 2 26-70 1 Southampton 3 2 2.5
3 3 NA NA Newcastle 1 NA NA
In data.table:
library(data.table)
setDT(df_before)
dcast(melt(df_before, 'Type'), rowid(Type, variable)~Type + variable)
Note that you will get a lot of warnings because of the type mismatch. You could use reshape2::melt to avoid this.
Anyway, your data frame is not in a standard format.
In base R >= 4.0:
transform(df_before, id = ave(Estimate, Type, FUN = seq_along)) |>
reshape(v.names = c('Level', 'Estimate'), dir = 'wide', timevar = 'Type', sep = "_")
id Level_AGE Estimate_AGE Level_REGION Estimate_REGION Level_DRIVERS Estimate_DRIVERS
1 1 18-25 1.5 London 2 1 2.0
2 2 26-70 1.0 Southampton 3 2 2.5
5 3 <NA> NA Newcastle 1 <NA> NA
In base R < 4.0:
reshape(transform(df_before, id = ave(Estimate, Type, FUN = seq_along)),
v.names = c('Level', 'Estimate'), dir = 'wide', timevar = 'Type', sep = "_")
Update:
The exact output as the desired output:
df_before %>%
group_by(Type) %>%
mutate(id = row_number()) %>%
pivot_wider(
names_from = Type,
values_from = c(Level, Estimate)
) %>%
select(AGE = Level_AGE, Estimate_AGE, REGION = Level_REGION,
Estimate_REGION, DRIVERS = Level_DRIVERS, Estimate_DRIVERS) %>%
type.convert(as.is=TRUE)
AGE Estimate_AGE REGION Estimate_REGION DRIVERS Estimate_DRIVERS
<chr> <dbl> <chr> <int> <int> <dbl>
1 18-25 1.5 London 2 1 2
2 26-70 1 Southampton 3 2 2.5
3 NA NA Newcastle 1 NA NA
First answer:
The main point is to group by Type, as already shown in Onyambu's solution. After that, a single pivot_wider is enough:
library(dplyr)
library(tidyr)
df_before %>%
group_by(Type) %>%
mutate(id = row_number()) %>%
pivot_wider(
names_from = Type,
values_from = c(Level, Estimate)
)
id Level_AGE Level_REGION Level_DRIVERS Estimate_AGE Estimate_REGION Estimate_DRIVERS
<int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 18-25 London 1 1.5 2 2
2 2 26-70 Southampton 2 1 3 2.5
3 3 NA Newcastle NA NA 1 NA
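If the only remaining issue is the column order, a hedged alternative to selecting the columns by hand is the names_vary argument of pivot_wider(). This sketch assumes tidyr >= 1.2.0, where that argument is available; the object name wide is illustrative.

```r
library(dplyr)
library(tidyr)

df_before <- data.frame(
  Type     = c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS"),
  Level    = c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2"),
  Estimate = c(1.5, 1, 2, 3, 1, 2, 2.5),
  stringsAsFactors = FALSE
)

wide <- df_before %>%
  group_by(Type) %>%
  mutate(id = row_number()) %>%  # position of each row within its Type
  ungroup() %>%
  pivot_wider(
    names_from  = Type,
    values_from = c(Level, Estimate),
    names_vary  = "slowest"      # keep each Level/Estimate pair adjacent
  )

names(wide)
```

With "slowest" the Level and Estimate columns of one Type sit next to each other, so only a rename (and dropping id) is left to match the desired layout.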
We can try this:
library(tidyverse)
Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5, 1, 2, 3, 1, 2, 2.5)
df_before <- data.frame(Type, Level, Estimate)
data <-
df_before %>% group_split(Type)
data <-
map2(
data, map(data, ~ unique(.$Type)),
~ mutate(., "{.y}" := Level, "Estimate_{.y}" := Estimate) %>%
select(-c("Type", "Level", "Estimate"))
)
#get the longest number of rows to be able to join the columns
max_rows <- map_dbl(data, nrow) %>%
max()
#add rows if needed
map_if(
data, ~ nrow(.) < max_rows,
~ rbind(., NA)
) %>%
bind_cols()
#> # A tibble: 3 × 6
#> AGE Estimate_AGE DRIVERS Estimate_DRIVERS REGION Estimate_REGION
#> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 18-25 1.5 1 2 London 2
#> 2 26-70 1 2 2.5 Southampton 3
#> 3 <NA> NA <NA> NA Newcastle 1
Created on 2021-12-07 by the reprex package (v2.0.1)
A solution based on tidyr::pivot_wider and purrr::map_dfc:
library(tidyverse)
Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5,1,2,3,1,2,2.5)
df_before <- data.frame(Type, Level, Estimate)
df_before %>%
pivot_wider(names_from=Type, values_from=c(Level, Estimate), values_fn=list) %>%
map_dfc(~ c(unlist(.x), rep(NA, max(table(df_before$Type))-length(unlist(.x)))))
#> # A tibble: 3 × 6
#> Level_AGE Level_REGION Level_DRIVERS Estimate_AGE Estimate_REGION
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 18-25 London 1 1.5 2
#> 2 26-70 Southampton 2 1 3
#> 3 <NA> Newcastle <NA> NA 1
#> # … with 1 more variable: Estimate_DRIVERS <dbl>
Another solution, based on dplyr::group_split and purrr::map_dfc:
library(tidyverse)
df_before %>%
mutate(maxn = max(table(.$Type))) %>%
group_by(Type) %>% group_split() %>%
map_dfc(
~ data.frame(c(.x$Level, rep(NA, .x$maxn[1] - nrow(.x))),
c(.x$Estimate, rep(NA, .x$maxn[1] - nrow(.x)))) %>%
set_names(c(.x$Type[1], paste0("Estimate_", .x$Type[1])))) %>%
type.convert(as.is=T)
#> AGE Estimate_AGE DRIVERS Estimate_DRIVERS REGION Estimate_REGION
#> 1 18-25 1.5 1 2.0 London 2
#> 2 26-70 1.0 2 2.5 Southampton 3
#> 3 <NA> NA NA NA Newcastle 1

Keeping factor order after gather and summarise steps in tidyverse

I have over a hundred variables for which I'm trying to calculate frequency and percent. How can I maintain the factor order of each variable's values in the output? Please note that specifying the order for each variable outside the dataset is not practical, as I have over 100 variables.
Example data:
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df
gender disease
1 male yes
2 female yes
3 male no
4 <NA> <NA>
Attempt:
df %>% gather(key, value, factor_key = T) %>%
group_by(key, value) %>%
summarise(n=n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(percent=n/sum(n))
Output:
# A tibble: 6 x 4
# Groups: key [2]
key value n percent
<fct> <chr> <int> <dbl>
1 gender female 1 0.25
2 gender male 2 0.5
3 gender NA 1 0.25
4 disease no 1 0.25
5 disease yes 2 0.5
6 disease NA 1 0.25
Desired output would order gender as male, female and disease as yes, no.
Update: if you use pivot_longer (the new gather), it retains the factor levels! You can also fine-tune the column types with arguments names_transform and values_transform in pivot_longer.
library(tidyverse)
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df %>%
pivot_longer(everything()) %>%
group_by(name, value) %>%
summarise(n=n(), .groups = "drop_last") %>%
mutate(percent=n/sum(n))
#> # A tibble: 6 x 4
#> # Groups: name [2]
#> name value n percent
#> <chr> <fct> <int> <dbl>
#> 1 disease yes 2 0.5
#> 2 disease no 1 0.25
#> 3 disease <NA> 1 0.25
#> 4 gender male 2 0.5
#> 5 gender female 1 0.25
#> 6 gender <NA> 1 0.25
Created on 2020-10-16 by the reprex package (v0.3.0)
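The names_transform argument mentioned above can also keep the variables themselves in their original column order (gender before disease) rather than alphabetical order. A minimal sketch, assuming tidyr >= 1.1.0; the helper in_order is a hypothetical name introduced only for this example:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  gender  = factor(c("male", "female", "male", NA), levels = c("male", "female")),
  disease = factor(c("yes", "yes", "no", NA), levels = c("yes", "no"))
)

# Hypothetical helper: a factor whose levels follow order of first appearance
in_order <- function(x) factor(x, levels = unique(x))

res <- df %>%
  # turn the name column into a factor ordered like the original columns
  pivot_longer(everything(), names_transform = list(name = in_order)) %>%
  count(name, value) %>%
  group_by(name) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

res
```

Because both name and value are factors here, count() sorts the rows by their level order, with NA placed last within each variable.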
Because gather drops the factor for the value variable and summarise also appears to drop data frame attributes, you'll have to re-add them. You can re-add them in a semi-automated way by reading in and combining the factor levels like this:
library(tidyverse)
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df %>%
gather(key, value, factor_key = T) %>%
group_by(key, value) %>%
summarise(n=n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(percent=n/sum(n),
value = factor(value, levels = df %>% map(levels) %>% unlist())) %>%
arrange(key, value)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> `summarise()` regrouping output by 'key' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: key [2]
#> key value n percent
#> <fct> <fct> <int> <dbl>
#> 1 gender male 2 0.5
#> 2 gender female 1 0.25
#> 3 gender <NA> 1 0.25
#> 4 disease yes 2 0.5
#> 5 disease no 1 0.25
#> 6 disease <NA> 1 0.25
Created on 2020-10-16 by the reprex package (v0.3.0)

Dplyr: Rename Tibble Output Columns With Factor Levels

I am trying to find a way to rename my factor levels (1, 2, 3) with girl, boy, other in the dplyr tibble output.
This is the code:
library(dplyr)
df1 %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())
And my result is:
# A tibble: 3 x 3
sexs percent n
<int> <dbl> <int>
1 1 52.1 731
2 2 47.1 661
3 NA 0.855 12
The desired result would be:
# A tibble: 3 x 3
sexs percent n
<int> <dbl> <int>
Girl 1 52.1 731
Boy 2 47.1 661
Other NA 0.855 12
I happen to love the forcats package because when I get done I can actually see what I did. Here is another solution that simply adds a step to the pipe before your existing code.
library(dplyr)
library(forcats)
sex <- sample(1:2, 100, replace = TRUE)
sex[[88]] <- NA
df1 <- data.frame(sex)
df1 %>%
mutate(newsex = fct_explicit_na(fct_recode(as_factor(sex),
Girl = "1",
Boy = "2" ),
na_level = "Other")) %>%
group_by(newsex, sex) %>%
summarise(percent = 100 * n() / nrow(df1), n=n())
#> # A tibble: 3 x 4
#> # Groups: newsex [3]
#> newsex sex percent n
#> <fct> <int> <dbl> <int>
#> 1 Girl 1 56 56
#> 2 Boy 2 43 43
#> 3 Other NA 1 1
Created on 2020-05-11 by the reprex package (v0.3.0)
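A note for newer installations: as far as I know, forcats 1.0.0 deprecates fct_explicit_na() in favour of fct_na_value_to_level(). The following is a sketch of the same idea under that assumption, with a small fixed data frame instead of random data so the result is reproducible; sex_label and res are illustrative names.

```r
library(dplyr)
library(forcats)

# Small deterministic stand-in for the survey data
df1 <- data.frame(sex = c(1L, 2L, 1L, NA, 2L, 1L))

res <- df1 %>%
  mutate(
    # label the known codes first ...
    sex_label = factor(sex, levels = c(1, 2), labels = c("Girl", "Boy")),
    # ... then turn the missing values into an explicit "Other" level
    sex_label = fct_na_value_to_level(sex_label, level = "Other")
  ) %>%
  count(sex_label) %>%
  mutate(percent = 100 * n / sum(n))

res
```

On older forcats versions the fct_explicit_na() call shown above remains the way to do this.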
When posting please provide some sample data to work with, it will help others test and make sure everything is working properly. This problem is relatively simple so it shouldn't be a problem.
If you want to replace the NA with literally any other number you can do this
df1 %>%
dplyr::mutate(sex = ifelse(is.na(sex), 0, sex),
sex = factor(sex,
levels = c(1,2,0),
labels = c("Girl", "Boy", "Other"))) %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())
Otherwise you can use case_when to assign the factors and then convert the column to a factor
df1 %>%
dplyr::mutate(sex = case_when(
sex == 1 ~ "Girl",
sex == 2 ~ "Boy",
is.na(sex) ~ "Other") %>%
as_factor(.)) %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())

dplyr to create aggregate percentages of factor levels

How do I use dplyr to create proportions of a level of a factor variable for each state? For example, I'd like to add a variable that indicates the percent of females within each state to the data frame.
# gen data
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)
Here's an attempt that I know is wrong, but gets me access to the information:
school.data %>%
group_by(state, gender %in%c("Female")) %>%
summarise(count = n()) %>%
mutate(test_count = count)
I have a hard time with the count and mutate functions, which makes it hard to get much further. It doesn't behave as I'd expect.
To add a new column to your existing data frame:
school.data %>%
group_by(state) %>%
mutate(pct.female = mean(gender == "Female"))
Use summarize rather than mutate if you just want one row per state rather than adding a column to the original data.
school.data %>%
group_by(state) %>%
summarize(pct.female = mean(gender == "Female"))
# # A tibble: 2 x 2
# state pct.female
# <fctr> <dbl>
# 1 Idaho 0.75
# 2 Maine 0.70
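The reason mean(gender == "Female") works is that the comparison yields a logical vector, and R treats TRUE as 1 and FALSE as 0 when averaging. A small sketch with fixed data (the six rows below are made up for illustration, as are the column names n_female and n) that shows the count and the proportion side by side:

```r
library(dplyr)

# Hypothetical fixed data so the result is reproducible
school.data <- data.frame(
  state  = c("Idaho", "Idaho", "Idaho", "Idaho", "Maine", "Maine"),
  gender = c("Female", "Female", "Female", "Male", "Female", "Male")
)

pct <- school.data %>%
  group_by(state) %>%
  summarise(
    n_female   = sum(gender == "Female"),   # TRUE counts as 1
    n          = n(),
    pct.female = mean(gender == "Female")   # same as n_female / n
  )

pct
```

If gender can contain NA, mean() returns NA for that state unless na.rm = TRUE is passed, in which case missing values are left out of the denominator.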
Gregor's answer gets to the heart of it. Here's a version that would give you counts and proportions for both genders per state:
library(dplyr)
gender.proportions <- group_by(school.data, state, gender) %>%
summarize(n = length(student.id)) %>% # count per gender
ungroup %>% group_by(state) %>%
mutate(proportion = n / sum(n)) # proportion per gender
# state gender n proportion
# <fctr> <fctr> <int> <dbl>
#1 Idaho Female 16 0.80
#2 Idaho Male 4 0.20
#3 Maine Female 11 0.55
#4 Maine Male 9 0.45
Edit:
In reference to OP's comment/request, the code below would repeat the male and female proportions for each individual student in each state:
gender.proportions <- group_by(school.data, state) %>%
mutate(prop.female = mean(gender == 'Female'), prop.male = mean(gender == 'Male'))
student.id state gender prop.female prop.male
<int> <fctr> <fctr> <dbl> <dbl>
1 479 Idaho Male 0.8 0.2
2 634 Idaho Female 0.8 0.2
3 175 Idaho Female 0.8 0.2
4 527 Idaho Female 0.8 0.2
5 368 Idaho Female 0.8 0.2
6 423 Idaho Male 0.8 0.2
7 357 Idaho Female 0.8 0.2
8 994 Idaho Female 0.8 0.2
9 479 Idaho Female 0.8 0.2
10 634 Idaho Female 0.8 0.2
# ... with 30 more rows
Here is one solution using a left_join.
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)
school.data %>%
group_by(state) %>%
mutate(gender_id = ifelse(gender == "Female", 1, 0)) %>%
summarise(female_count = sum(gender_id)) %>%
left_join(school.data %>%
group_by(state) %>%
summarise(state_count = n()),
by = c("state" = "state")
) %>%
mutate(percent_female = female_count / state_count)
