Keeping factor order after gather and summarise steps in tidyverse

I have over a hundred variables for which I'm trying to calculate frequency and percent. How can I maintain the factor order of each variable's values in the output? Please note that specifying the order for each variable outside the dataset is not practical, as I have over 100 variables.
Example data:
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df
gender disease
1 male yes
2 female yes
3 male no
4 <NA> <NA>
Attempt:
df %>% gather(key, value, factor_key = T) %>%
group_by(key, value) %>%
summarise(n=n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(percent=n/sum(n))
Output:
# A tibble: 6 x 4
# Groups: key [2]
key value n percent
<fct> <chr> <int> <dbl>
1 gender female 1 0.25
2 gender male 2 0.5
3 gender NA 1 0.25
4 disease no 1 0.25
5 disease yes 2 0.5
6 disease NA 1 0.25
Desired output would order gender as male, female and disease as yes, no.

Update: if you use pivot_longer (the new gather), it retains the factor levels! You can also fine-tune the column types with arguments names_transform and values_transform in pivot_longer.
library(tidyverse)
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df %>%
pivot_longer(everything()) %>%
group_by(name, value) %>%
summarise(n=n(), .groups = "drop_last") %>%
mutate(percent=n/sum(n))
#> # A tibble: 6 x 4
#> # Groups: name [2]
#> name value n percent
#> <chr> <fct> <int> <dbl>
#> 1 disease yes 2 0.5
#> 2 disease no 1 0.25
#> 3 disease <NA> 1 0.25
#> 4 gender male 2 0.5
#> 5 gender female 1 0.25
#> 6 gender <NA> 1 0.25
Created on 2020-10-16 by the reprex package (v0.3.0)
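The names_transform argument mentioned in the update can also keep the variables themselves in their original column order. The following is a minimal sketch, assuming tidyr >= 1.1.0 and dplyr >= 1.0.0; keep_order is a hypothetical helper name introduced here for illustration, not a tidyr function.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  gender  = factor(c("male", "female", "male", NA), levels = c("male", "female")),
  disease = factor(c("yes", "yes", "no", NA), levels = c("yes", "no"))
)

# Hypothetical helper: factor whose levels follow first appearance
keep_order <- function(x) factor(x, levels = unique(x))

res <- df %>%
  # name becomes a factor ordered like the columns; value keeps its factor levels
  pivot_longer(everything(), names_transform = list(name = keep_order)) %>%
  count(name, value) %>%
  group_by(name) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

res
```

With this, the rows should come out as gender before disease, and within each variable in the declared level order with NA last.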
Because gather drops the factor for the value variable and summarise also appears to drop data frame attributes, you'll have to re-add them. You can re-add them in a semi-automated way by reading in and combining the factor levels like this:
library(tidyverse)
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df %>%
gather(key, value, factor_key = T) %>%
group_by(key, value) %>%
summarise(n=n()) %>%
ungroup() %>%
group_by(key) %>%
mutate(percent=n/sum(n),
value = factor(value, levels = df %>% map(levels) %>% unlist())) %>%
arrange(key, value)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> `summarise()` regrouping output by 'key' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: key [2]
#> key value n percent
#> <fct> <fct> <int> <dbl>
#> 1 gender male 2 0.5
#> 2 gender female 1 0.25
#> 3 gender <NA> 1 0.25
#> 4 disease yes 2 0.5
#> 5 disease no 1 0.25
#> 6 disease <NA> 1 0.25
Created on 2020-10-16 by the reprex package (v0.3.0)

Related

Convert rows into columns in R

I have this sample dataset and I want to convert it into the following format:
Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5,1,2,3,1,2,2.5)
df_before <- data.frame(Type, Level, Estimate)
Type Level Estimate
1 AGE 18-25 1.5
2 AGE 26-70 1.0
3 REGION London 2.0
4 REGION Southampton 3.0
5 REGION Newcastle 1.0
6 DRIVERS 1 2.0
7 DRIVERS 2 2.5
Basically, I would like to transform the dataset into the following format. I have tried the function dcast(), but it does not seem to work.
AGE Estimate_AGE REGION Estimate_REGION DRIVERS Estimate_DRIVERS
1 18-25 1.5 London 2 1 2.0
2 26-70 1.0 Southampton 3 2 2.5
3 <NA> NA Newcastle 1 <NA> NA
df_before %>%
group_by(Type) %>%
mutate(id = row_number(), Estimate = as.character(Estimate))%>%
pivot_longer(-c(Type, id)) %>%
pivot_wider(id, names_from = c(Type, name))%>%
type.convert(as.is = TRUE)
# A tibble: 3 x 7
id AGE_Level AGE_Estimate REGION_Level REGION_Estimate DRIVERS_Level DRIVERS_Estimate
<int> <chr> <dbl> <chr> <int> <int> <dbl>
1 1 18-25 1.5 London 2 1 2
2 2 26-70 1 Southampton 3 2 2.5
3 3 NA NA Newcastle 1 NA NA
In data.table:
library(data.table)
setDT(df_before)
dcast(melt(df_before, 'Type'), rowid(Type, variable)~Type + variable)
Note that you will get a lot of warnings because of the type mismatch. You could use reshape2::melt to avoid this.
In any case, your data frame is not in a standard format.
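One way to sidestep the type-mismatch warnings is to skip melt entirely and cast both value columns at once. This is a sketch, assuming a data.table version whose dcast accepts several value.var columns and an expression such as rowid(Type) in the formula (as the call above already relies on); the id column is accessed by position because its generated name is not guaranteed here.

```r
library(data.table)

Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5, 1, 2, 3, 1, 2, 2.5)
dt_before <- data.table(Type, Level, Estimate)

# Cast Level and Estimate together; no melt, so no character/numeric coercion
res <- dcast(dt_before, rowid(Type) ~ Type, value.var = c("Level", "Estimate"))

# Drop the row-id column (first column) to keep only the reshaped values
res <- res[, -1]
res
```

The resulting columns are named value.var first, then Type (for example Level_AGE and Estimate_AGE), and Estimate stays numeric.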
In base R >= 4.1 (the native pipe |> was introduced in R 4.1.0):
transform(df_before, id = ave(Estimate, Type, FUN = seq_along)) |>
reshape(v.names = c('Level', 'Estimate'), dir = 'wide', timevar = 'Type', sep = "_")
id Level_AGE Estimate_AGE Level_REGION Estimate_REGION Level_DRIVERS Estimate_DRIVERS
1 1 18-25 1.5 London 2 1 2.0
2 2 26-70 1.0 Southampton 3 2 2.5
5 3 <NA> NA Newcastle 1 <NA> NA
In base R < 4.1:
reshape(transform(df_before, id = ave(Estimate, Type, FUN = seq_along)),
v.names = c('Level', 'Estimate'), dir = 'wide', timevar = 'Type', sep = "_")
Update:
The exact output as the desired output:
df_before %>%
group_by(Type) %>%
mutate(id = row_number()) %>%
pivot_wider(
names_from = Type,
values_from = c(Level, Estimate)
) %>%
select(AGE = Level_AGE, Estimate_AGE, REGION = Level_REGION,
Estimate_REGION, DRIVERS = Level_DRIVERS, Estimate_DRIVERS) %>%
type.convert(as.is=TRUE)
AGE Estimate_AGE REGION Estimate_REGION DRIVERS Estimate_DRIVERS
<chr> <dbl> <chr> <int> <int> <dbl>
1 18-25 1.5 London 2 1 2
2 26-70 1 Southampton 3 2 2.5
3 NA NA Newcastle 1 NA NA
First answer:
The main step is to group by Type, as already shown in Onyambu's solution. After that we can use a single pivot_wider:
library(dplyr)
library(tidyr)
df_before %>%
group_by(Type) %>%
mutate(id = row_number()) %>%
pivot_wider(
names_from = Type,
values_from = c(Level, Estimate)
)
id Level_AGE Level_REGION Level_DRIVERS Estimate_AGE Estimate_REGION Estimate_DRIVERS
<int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 18-25 London 1 1.5 2 2
2 2 26-70 Southampton 2 1 3 2.5
3 3 NA Newcastle NA NA 1 NA
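The column order of the desired output can also be produced by pivot_wider itself rather than by a manual select. A minimal sketch, assuming tidyr >= 1.2.0 (where the names_vary argument is available); stripping the Level_ prefix with sub() is an illustrative choice to match the requested headers.

```r
library(dplyr)
library(tidyr)

Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5, 1, 2, 3, 1, 2, 2.5)
df_before <- data.frame(Type, Level, Estimate)

wide <- df_before %>%
  group_by(Type) %>%
  mutate(id = row_number()) %>%   # position within each Type
  ungroup() %>%
  pivot_wider(
    id_cols = id,
    names_from = Type,
    values_from = c(Level, Estimate),
    names_vary = "slowest"        # keep Level_X and Estimate_X side by side
  ) %>%
  select(-id) %>%
  rename_with(~ sub("^Level_", "", .x))  # Level_AGE -> AGE, etc.

wide
```

Because Type is a character column, the Type groups appear in order of first appearance (AGE, REGION, DRIVERS).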
We can try this:
library(tidyverse)
Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5, 1, 2, 3, 1, 2, 2.5)
df_before <- data.frame(Type, Level, Estimate)
data <-
df_before %>% group_split(Type)
data <-
map2(
data, map(data, ~ unique(.$Type)),
~ mutate(., "{.y}" := Level, "Estimate_{.y}" := Estimate) %>%
select(-c("Type", "Level", "Estimate"))
)
#get the longest number of rows to be able to join the columns
max_rows <- map_dbl(data, nrow) %>%
max()
#add rows if needed
map_if(
data, ~ nrow(.) < max_rows,
~ rbind(., NA)
) %>%
bind_cols()
#> # A tibble: 3 × 6
#> AGE Estimate_AGE DRIVERS Estimate_DRIVERS REGION Estimate_REGION
#> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 18-25 1.5 1 2 London 2
#> 2 26-70 1 2 2.5 Southampton 3
#> 3 <NA> NA <NA> NA Newcastle 1
Created on 2021-12-07 by the reprex package (v2.0.1)
A solution based on tidyr::pivot_wider and purrr::map_dfc:
library(tidyverse)
Type <- c("AGE", "AGE", "REGION", "REGION", "REGION", "DRIVERS", "DRIVERS")
Level <- c("18-25", "26-70", "London", "Southampton", "Newcastle", "1", "2")
Estimate <- c(1.5,1,2,3,1,2,2.5)
df_before <- data.frame(Type, Level, Estimate)
df_before %>%
pivot_wider(names_from=Type, values_from=c(Level, Estimate), values_fn=list) %>%
map_dfc(~ c(unlist(.x), rep(NA, max(table(df_before$Type))-length(unlist(.x)))))
#> # A tibble: 3 × 6
#> Level_AGE Level_REGION Level_DRIVERS Estimate_AGE Estimate_REGION
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 18-25 London 1 1.5 2
#> 2 26-70 Southampton 2 1 3
#> 3 <NA> Newcastle <NA> NA 1
#> # … with 1 more variable: Estimate_DRIVERS <dbl>
Another solution, based on dplyr::group_split and purrr::map_dfc:
library(tidyverse)
df_before %>%
mutate(maxn = max(table(.$Type))) %>%
group_by(Type) %>% group_split() %>%
map_dfc(
~ data.frame(c(.x$Level, rep(NA, .x$maxn[1] - nrow(.x))),
c(.x$Estimate, rep(NA, .x$maxn[1] - nrow(.x)))) %>%
set_names(c(.x$Type[1], paste0("Estimate_", .x$Type[1])))) %>%
type.convert(as.is=T)
#> AGE Estimate_AGE DRIVERS Estimate_DRIVERS REGION Estimate_REGION
#> 1 18-25 1.5 1 2.0 London 2
#> 2 26-70 1.0 2 2.5 Southampton 3
#> 3 <NA> NA NA NA Newcastle 1

How do you use dplyr::pull to convert a grouped column into vectors?

I have a tibble, df, I would like to take the tibble and group it and then use dplyr::pull to create vectors from the grouped dataframe. I have provided a reprex below.
df is the base tibble. My desired output is reflected by df2. I just don't know how to get there programmatically. I have tried to use pull to achieve this output, but pull did not seem to recognize the group_by function and instead created a vector out of the whole column. Is what I'm trying to achieve possible with dplyr or base R? Note: new_col is supposed to be a vector created from the name column.
library(tidyverse)
library(reprex)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df
#> # A tibble: 12 x 3
#> group name type
#> <dbl> <chr> <dbl>
#> 1 1 Jim 1
#> 2 1 Deb 2
#> 3 1 Bill 3
#> 4 1 Ann 4
#> 5 2 Joe 3
#> 6 2 Jon 2
#> 7 2 Jane 1
#> 8 3 Jake 2
#> 9 3 Sam 3
#> 10 3 Gus 1
#> 11 3 Trixy 4
#> 12 3 Don 5
# Desired Output - New Col is a column of vectors
df2 <- tibble(group=c(1,2,3),name=c("Jim","Jane","Gus"), type=c(1,1,1), new_col = c("'Jim','Deb','Bill','Ann'","'Joe','Jon','Jane'","'Jake','Sam','Gus','Trixy','Don'"))
df2
#> # A tibble: 3 x 4
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 'Jim','Deb','Bill','Ann'
#> 2 2 Jane 1 'Joe','Jon','Jane'
#> 3 3 Gus 1 'Jake','Sam','Gus','Trixy','Don'
Created on 2020-11-14 by the reprex package (v0.3.0)
Maybe this is what you are looking for:
library(dplyr)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = paste(new_col, collapse = ","))
#> `summarise()` regrouping output by 'group', 'name' (override with `.groups` argument)
#> # A tibble: 3 x 4
#> # Groups: group, name [3]
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 Jim,Deb,Bill,Ann
#> 2 2 Jane 1 Joe,Jon,Jane
#> 3 3 Gus 1 Jake,Sam,Gus,Trixy,Don
EDIT: If new_col should be a list of vectors, you could use summarise(new_col = list(c(new_col))):
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = list(c(new_col)))
Another option would be to use tidyr::nest:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
nest(new_col = new_col)
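If the goal is literally one vector per group via pull, a possible sketch is to apply pull inside group_map, which runs a function on each group's rows; split() is a base R alternative. Both are illustrations under the assumption that a plain list of character vectors is acceptable, and the names name_vectors and name_list are made up for this example.

```r
library(dplyr)

df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
             name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
             type = c(1,2,3,4,3,2,1,2,3,1,4,5))

# dplyr: pull() is applied to each group's rows separately
name_vectors <- df %>%
  group_by(group) %>%
  group_map(~ pull(.x, name))

# base R: a named list with one character vector per group
name_list <- split(df$name, df$group)

name_vectors[[2]]
name_list[["3"]]
```

pull on a grouped tibble ignores the grouping by design, which is why the per-group call has to be made explicit.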

Dplyr: Rename Tibble Output Columns With Factor Levels

I am trying to find a way to rename my factor levels (1, 2, 3) with girl, boy, other in the dplyr tibble output.
This is the code:
library(dplyr)
df1 %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())
And my result is:
# A tibble: 3 x 3
sexs percent n
<int> <dbl> <int>
1 1 52.1 731
2 2 47.1 661
3 NA 0.855 12
The desired result would be:
# A tibble: 3 x 3
sexs percent n
<int> <dbl> <int>
Girl 1 52.1 731
Boy 2 47.1 661
Other NA 0.855 12
I happen to love the forcats package because when I get done I can actually see what I did. Here is another solution that simply adds a step to the pipe before your existing code.
library(dplyr)
library(forcats)
sex <- sample(1:2, 100, replace = TRUE)
sex[[88]] <- NA
df1 <- data.frame(sex)
df1 %>%
mutate(newsex = fct_explicit_na(fct_recode(as_factor(sex),
Girl = "1",
Boy = "2" ),
na_level = "Other")) %>%
group_by(newsex, sex) %>%
summarise(percent = 100 * n() / nrow(df1), n=n())
#> # A tibble: 3 x 4
#> # Groups: newsex [3]
#> newsex sex percent n
#> <fct> <int> <dbl> <int>
#> 1 Girl 1 56 56
#> 2 Boy 2 43 43
#> 3 Other NA 1 1
Created on 2020-05-11 by the reprex package (v0.3.0)
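In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favour of fct_na_value_to_level(). A sketch of the same idea with the newer function, assuming forcats >= 1.0.0 and a small made-up sample in place of the real df1:

```r
library(dplyr)
library(forcats)

# Made-up sample data standing in for the asker's df1
df1 <- data.frame(sex = c(1L, 2L, 1L, NA, 2L, 1L))

res <- df1 %>%
  mutate(
    # Label the known codes, then turn NA into an explicit "Other" level
    newsex = factor(sex, levels = c(1, 2), labels = c("Girl", "Boy")),
    newsex = fct_na_value_to_level(newsex, level = "Other")
  ) %>%
  count(newsex, sex) %>%
  mutate(percent = 100 * n / sum(n))

res
```

Keeping the original sex column next to newsex makes it easy to check that each code received the intended label.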
When posting, please provide some sample data to work with; it helps others test and make sure everything works properly. This problem is relatively simple, so it shouldn't be an issue here.
If you want to replace the NA with any other number, you can do this:
df1 %>%
dplyr::mutate(sex = ifelse(is.na(sex), 0, sex),
sex = factor(sex,
levels = c(1,2,0),
labels = c("Girl", "Boy", "Other"))) %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())
Otherwise you can use case_when to assign the factors and then convert the column to a factor
df1 %>%
dplyr::mutate(sex = case_when(
sex == 1 ~ "Girl",
sex == 2 ~ "Boy",
is.na(sex) ~ "Other") %>%
as_factor(.)) %>%
dplyr::group_by(sex)%>%
dplyr::summarise(percent=100*n()/nrow(df1), n=n())

Group_by and mutate by multiple columns in R

I have a data frame with the columns City, Gender, 2013, 2014 and 2015.
City Gender 2013 2014 2015
Aberdeen Female 30 40 50
Aberdeen Male 20 15 16
Aberdeenshire Female 60 80 70
Aberdeenshire Male 50 40 15
.....Includes 425 records.
I want to compute the female-to-male ratio (Female divided by Male) for each city and each year, so the result should look like this:
City 2013_ratio 2014_ratio 2015_ratio
Aberdeen 1.5 2.66 3.12
Aberdeenshire 1.2 2 4.66
Can anyone help me solve this? I have tried grouping by city, but I don't know how to pick out the values by Gender within each group.
You can calculate the ratio more easily if Male and Female are in separate columns; you can restructure the data that way with tidyr:
library(dplyr)
library(tidyr)
df %>%
gather(Year, Value, -City, - Gender) %>%
spread(Gender, Value) %>%
mutate(Ratio = Female/Male, Year = paste0(Year, "_Ratio")) %>%
select(-Female, -Male) %>%
spread(Year, Ratio)
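gather() and spread() are superseded in current tidyr, so the same reshape can be sketched with pivot_longer() and pivot_wider(). This assumes tidyr >= 1.0.0 and builds the sample data with check.names = FALSE so the year columns keep their original names.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  City = c("Aberdeen", "Aberdeen", "Aberdeenshire", "Aberdeenshire"),
  Gender = c("Female", "Male", "Female", "Male"),
  `2013` = c(30, 20, 60, 50),
  `2014` = c(40, 15, 80, 40),
  `2015` = c(50, 16, 70, 15),
  check.names = FALSE
)

ratios <- df %>%
  # long format: one row per City/Gender/Year
  pivot_longer(-c(City, Gender), names_to = "Year", values_to = "Value") %>%
  # Female and Male side by side
  pivot_wider(names_from = Gender, values_from = Value) %>%
  mutate(Ratio = Female / Male, Year = paste0(Year, "_ratio")) %>%
  select(City, Year, Ratio) %>%
  # back to one row per City
  pivot_wider(names_from = Year, values_from = Ratio)

ratios
```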
The code for Rob's suggested solution would be as follows (with an additional spread() step):
# data
df = data.frame(City = c("a", "a", "b", "b"),
Gender = c("Female", "Male", "Female", "Male"),
`2013` = c(30, 20, 60, 50),
`2014` = c(40, 15, 80, 40),
`2015` = c(50, 16, 70, 15))
# Actual process
library("dplyr")
library("tidyr")
df %>%
# Transform wide table into tidy
gather("Year", "Number", X2013:X2015) %>%
# Reshape gender columns for easier summaries
spread("Gender", "Number") %>%
# Compute ratios
group_by(City, Year) %>%
summarise(ratio = Female/(Male + Female))
#> # A tibble: 6 x 3
#> # Groups: City [?]
#> City Year ratio
#> <fct> <chr> <dbl>
#> 1 a X2013 0.6
#> 2 a X2014 0.727
#> 3 a X2015 0.758
#> 4 b X2013 0.545
#> 5 b X2014 0.667
#> 6 b X2015 0.824
Created on 2018-10-10 by the reprex package (v0.2.1)
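To get the layout you asked for, apply spread() again to spread the ratios over years: spread(Year, ratio). Note that the code above computes Female / (Male + Female), the female share; use Female / Male for the female-to-male ratio in the question.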
With tidyverse:
df = read.table(text="City Gender 2013 2014 2015
Aberdeen Female 30 40 50
Aberdeen Male 20 15 16
Aberdeenshire Female 60 80 70
Aberdeenshire Male 50 40 15", header = T)
library(tidyverse)
df %>%
group_by(City) %>%
arrange(City, Gender) %>%
summarise_at(vars(X2013:X2015), .funs = funs(ratio = first(.)/last(.)))
# A tibble: 2 x 4
City X2013_ratio X2014_ratio X2015_ratio
<fct> <dbl> <dbl> <dbl>
1 Aberdeen 1.5 2.67 3.12
2 Aberdeenshire 1.2 2 4.67
or
df %>%
group_by(City) %>%
arrange(City,Gender) %>%
summarise_at(vars(X2013:X2015), .funs = funs(ratio = .[Gender == "Female"]/.[Gender != "Female"]))
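funs() is deprecated and summarise_at() is superseded in recent dplyr, so the second variant above could be written with across() instead. A sketch, assuming dplyr >= 1.0.0 and exactly one Female and one Male row per city:

```r
library(dplyr)

df <- data.frame(
  City = c("Aberdeen", "Aberdeen", "Aberdeenshire", "Aberdeenshire"),
  Gender = c("Female", "Male", "Female", "Male"),
  `2013` = c(30, 20, 60, 50),
  `2014` = c(40, 15, 80, 40),
  `2015` = c(50, 16, 70, 15),
  check.names = FALSE
)

ratios <- df %>%
  group_by(City) %>%
  summarise(
    # For every numeric (year) column: Female value divided by Male value
    across(where(is.numeric),
           ~ .x[Gender == "Female"] / .x[Gender == "Male"],
           .names = "{.col}_ratio")
  )

ratios
```

Selecting by Gender inside the lambda avoids relying on row order, unlike first(.)/last(.).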

Calculating % of total within groups across each column and transposing

Is there a way to create the following output (assuming a lot of IDs and a lot more attributes)?
I am stuck after calculating the % of total by ATT1 within ID and then ATT2, etc.. Not sure how to go about making the rows into column headers and aggregate.
Input File (df in R):
ID ATT1 ATT2 ATT3 ATT4 Value
1 a x d i 10
1 a y d j 10
1 a y d k 10
1 b y c k 10
1 b y c l 10
2 a x c k 20
…
And I want the output file to look like (ATT4_l is cut off):
ID ATT1_a ATT1_b ATT2_x ATT2_y ATT3_d ATT3_c ATT4_i ATT4_j ATT4_k
1 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.2 0.4
...
I tried using dplyr
df %>% group_by(ID, ATT1) %>% mutate(proc = (Value/sum(Value) * 100))
But I am not sure what to do once I have all the ATT calculated to get them into columns and aggregated so that each ID only has 1 row of data.
You can do this with the two main workhorses of the tidyverse: dplyr for calculations and tidyr for reshaping data. Some of the reshaping is convoluted so I'm breaking it into steps.
library(dplyr)
library(tidyr)
...
If you gather the data from its original wide format into a long format, you'll have a column of IDs, a column of ATTx values, a column of letters (don't know the context meaning of these, so I'm literally calling it letters), and a column of values. From this format, you can group observations by combinations of ID, ATT, and letter, and you can later stick ATTs and letters together in the way you've laid out.
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
head()
#> # A tibble: 6 x 4
#> ID Value att letter
#> <int> <int> <chr> <chr>
#> 1 1 10 ATT1 a
#> 2 1 10 ATT1 a
#> 3 1 10 ATT1 a
#> 4 1 10 ATT1 b
#> 5 1 10 ATT1 b
#> 6 2 20 ATT1 a
After grouping, calculate total values for each ID/ATT/letter combo:
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: ID, att [3]
#> ID att letter group_val
#> <int> <chr> <chr> <int>
#> 1 1 ATT1 a 30
#> 2 1 ATT1 b 20
#> 3 1 ATT2 x 10
#> 4 1 ATT2 y 40
#> 5 1 ATT3 c 20
#> 6 1 ATT3 d 30
Using mutate, you can calculate the share of each observation within its larger group. summarise drops the last layer of the grouping hierarchy (letter), so the data is still grouped by ID and ATT when mutate runs, and this is the share of values for each letter within a given ID and ATT. Since you no longer need the total values, just their shares, drop that column, and stick the ATTs and letters back together with unite.
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
mutate(share = group_val / sum(group_val)) %>%
select(-group_val) %>%
unite(group, att, letter, sep = "_") %>%
head()
#> # A tibble: 6 x 3
#> # Groups: ID [1]
#> ID group share
#> <int> <chr> <dbl>
#> 1 1 ATT1_a 0.6
#> 2 1 ATT1_b 0.4
#> 3 1 ATT2_x 0.2
#> 4 1 ATT2_y 0.8
#> 5 1 ATT3_c 0.4
#> 6 1 ATT3_d 0.6
Now you have all the information you're looking for, just need to get it into a wide format, turning the values in the group column into individual columns. You do this with spread:
df %>%
gather(key = att, value = letter, -ID, -Value) %>%
group_by(ID, att, letter) %>%
summarise(group_val = sum(Value)) %>%
mutate(share = group_val / sum(group_val)) %>%
select(-group_val) %>%
unite(group, att, letter, sep = "_") %>%
spread(key = group, value = share)
#> # A tibble: 2 x 11
#> # Groups: ID [2]
#> ID ATT1_a ATT1_b ATT2_x ATT2_y ATT3_c ATT3_d ATT4_i ATT4_j ATT4_k
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.6 0.4 0.2 0.8 0.4 0.6 0.2 0.2 0.4
#> 2 2 1 NA 1 NA 1 NA NA NA 1
#> # ... with 1 more variable: ATT4_l <dbl>
Note that there are NAs filled in here where there aren't observations for combinations of ID/ATT/letter. I'm assuming you'll have more complete data than in the sample you posted.
Created on 2018-10-03 by the reprex package (v0.2.1)
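The same pipeline can be sketched with pivot_longer() and pivot_wider(), which also lets missing ID/ATT/letter combinations be filled with 0 instead of NA. This assumes tidyr >= 1.0.0 and dplyr >= 1.0.0; the sample data is retyped from the six rows shown in the question, and filling with 0 is a choice that may or may not suit the real data.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 2),
  ATT1  = c("a", "a", "a", "b", "b", "a"),
  ATT2  = c("x", "y", "y", "y", "y", "x"),
  ATT3  = c("d", "d", "d", "c", "c", "c"),
  ATT4  = c("i", "j", "k", "k", "l", "k"),
  Value = c(10, 10, 10, 10, 10, 20)
)

shares <- df %>%
  pivot_longer(starts_with("ATT"), names_to = "att", values_to = "letter") %>%
  group_by(ID, att, letter) %>%
  # keep the ID/att grouping so the share is computed within each attribute
  summarise(group_val = sum(Value), .groups = "drop_last") %>%
  mutate(share = group_val / sum(group_val)) %>%
  ungroup() %>%
  select(-group_val) %>%
  pivot_wider(names_from = c(att, letter), values_from = share,
              names_sep = "_", values_fill = 0)

shares
```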
I believe you are looking for the reshape2 package
library(reshape2)
df.new <- dcast(df,
formula = ID~ATT1,
value.var = "proc",
fun.aggregate = mean)
This will not completely fix your problem though - I recommend doing this first to make your data tidy
df.tidy <- melt(df,
id.vars = c("ID","Value"),
variable.name = "ATT1_4",
value.name = "att.factor")
df.tidy <- df.tidy %>% group_by(ID, att.factor) %>% mutate(proc = (Value/sum(Value)*100))
df.new <- dcast(df.tidy,
formula = ID~att.factor,
value.var = "proc",
fun.aggregate = mean)
NaN will be returned for any combination that isn't represented in df.tidy. You can use the fill argument to assign a value to those.
