I have a data frame with about 50 variables; the ones in the example below are the most important. My aim is to create a table with various statistics split by department and gender. The combination of dplyr's group_by and summarise gives me most of what I need, but I haven't been able to figure out how to get separate columns that show, for example, meanFemaleSalary/meanMaleSalary per department. I can get the mean salary per gender per department in separate data frames, but when I try to divide them I either get an error or just a single value.
I have searched the site and found what I believed were similar questions, but couldn't get any of the answers to work. I'd be grateful if anyone could give me a hint on how to proceed…
Thanks!
Example:
library(dplyr)
x <- data.frame(Department = rep(c("Dep1", "Dep2", "Dep3"), times=2),
Gender = rep(c("F", "M"), times=3),
Salary = seq(10,15))
This is what I have that actually works so far:
Table <- x %>% group_by(Department, Gender) %>% summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = T),
MedianSalary = median(Salary, na.rm = T))
I'd like two additional columns for AvgSalaryWomen/Men and MedianSalaryWomen/Men.
Again thanks!
If you want the new columns to be part of Table you could do something like this. But it will result in the value being repeated per department.
Table %>% group_by(Department) %>%
mutate(`AvgSalaryWomen/Men` = AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"],
`MedianSalaryWomen/Men` = MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"])
# Department Gender Count AverageSalary MedianSalary `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
# 1 Dep1 F 1 10. 10 0.769 0.769
# 2 Dep1 M 1 13. 13 0.769 0.769
# 3 Dep2 F 1 14. 14 1.27 1.27
# 4 Dep2 M 1 11. 11 1.27 1.27
# 5 Dep3 F 1 12. 12 0.800 0.800
# 6 Dep3 M 1 15. 15 0.800 0.800
If you want just one row per department simply change mutate to summarise and you'll get
# Department `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fct> <dbl> <dbl>
# 1 Dep1 0.769 0.769
# 2 Dep2 1.27 1.27
# 3 Dep3 0.800 0.800
Here is an option to get this by spreading it to wide format
library(tidyverse)
x %>%
spread(Gender, Salary) %>%
group_by(Department) %>%
summarise(`AvgSalaryWomen/Men` = mean(F)/mean(M),
`MedianSalaryWomen/Men` = median(F)/median(M))
# A tibble: 3 x 3
# Department `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fctr> <dbl> <dbl>
# 1 Dep1 0.769 0.769
# 2 Dep2 1.27 1.27
# 3 Dep3 0.800 0.800
If you want to end up with a table that has one row per department and includes all of the descriptive statistics you're computing along the way, you probably need to convert to long, unite some columns to use as a key, go back to wide, and then add your ratios. Something like...
Table <- x %>%
group_by(Department, Gender) %>%
summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = TRUE),
MedianSalary = median(Salary, na.rm = TRUE)) %>%
# convert to long form
gather(Quantity, Value, -Department, -Gender) %>%
# create a unified gender/measure column to use as the key in the next step
unite(Set, Gender, Quantity) %>%
# go back to wide, now with repeating columns by gender
spread(Set, Value) %>%
# compute the department-level quantities you want using those new cols
mutate(AverageSalaryWomenMen = F_AverageSalary/M_AverageSalary,
MedianSalaryWomenMen = F_MedianSalary/M_MedianSalary)
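With tidyr >= 1.0.0 the same result can be reached without the gather/unite/spread round trip: pivot_wider() can widen several value columns at once, and names_glue controls the generated names. A sketch (the naming is chosen to match the F_/M_ convention above):

```r
library(dplyr)
library(tidyr)

x <- data.frame(Department = rep(c("Dep1", "Dep2", "Dep3"), times = 2),
                Gender = rep(c("F", "M"), times = 3),
                Salary = seq(10, 15))

Table <- x %>%
  group_by(Department, Gender) %>%
  summarise(Count = n(),
            AverageSalary = mean(Salary, na.rm = TRUE),
            MedianSalary = median(Salary, na.rm = TRUE),
            .groups = "drop") %>%
  # widen all three summary columns at once; names_glue yields F_Count, M_Count, ...
  pivot_wider(names_from = Gender,
              values_from = c(Count, AverageSalary, MedianSalary),
              names_glue = "{Gender}_{.value}") %>%
  mutate(AverageSalaryWomenMen = F_AverageSalary / M_AverageSalary,
         MedianSalaryWomenMen = F_MedianSalary / M_MedianSalary)
```

This gives one row per department with the same ratios as above (0.769, 1.27, 0.800).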
Related
I want to use the scale function, but applied to each pair of columns rather than to each column individually, so that the mean is calculated over a pair of columns and not over a single column.
In detail, this is my example data:
phone  phone1_X  phone2  phone2_X  phone3  phone3_X
    1         2       3         4       5         6
    2         4       6         8      10        12
I want to use the scale function on each pair: phone1 + phone1_X, phone2 + phone2_X, etc.
Each pair shares a base name ("phone1"), and the second column of each pair carries an additional "_X" suffix (a different condition in the experiment).
In the end, I'd like the original table but in z-scores (where, as mentioned, the mean is calculated per pair of columns and not per single column).
Thank you so much!
There might be a more elegant way, but this is how I'd do it.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -phone) %>%
group_by(phone, name = stringr::str_extract(name, 'phone[0-9]?')) %>%
summarise(mean_value = mean(value), .groups = 'drop') %>%
pivot_wider(names_from = name, values_from = mean_value)
#> # A tibble: 2 × 4
#> phone phone1 phone2 phone3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3.5 5.5
#> 2 2 4 7 11
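The question actually asks for z-scores rather than pair means; the same reshape can feed scale() inside a grouped mutate(), standardising every value against the pooled mean and sd of its column pair. A sketch (df is reconstructed from the question's table, and the first column is treated as a row identifier, as in the answer above):

```r
library(dplyr)
library(tidyr)

# reconstructed from the question's table
df <- data.frame(phone = c(1, 2), phone1_X = c(2, 4),
                 phone2 = c(3, 6), phone2_X = c(4, 8),
                 phone3 = c(5, 10), phone3_X = c(6, 12))

scaled <- df %>%
  pivot_longer(cols = -phone) %>%
  group_by(pair = stringr::str_extract(name, 'phone[0-9]')) %>%
  # scale() here uses the pooled mean/sd of both columns in each pair
  mutate(value = as.numeric(scale(value))) %>%
  ungroup() %>%
  select(-pair) %>%
  pivot_wider(names_from = name, values_from = value)
scaled
```

The result has the original columns, but each value is a z-score relative to its pair's pooled mean and sd.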
I need to get the relative frequencies of a summarized column in R. I've used dplyr's summarize to find the total of each grouped row, like this:
data %>%
group_by(x) %>%
summarise(total = sum(dollars))
x total
<chr> <dbl>
1 expense 1 3600
2 expense 2 2150
3 expense 3 2000
But now I need to create a new column for the relative frequencies of each total row to get this result:
x total p
<chr> <dbl> <dbl>
1 expense 1 3600 46.45%
2 expense 2 2150 27.74%
3 expense 3 2000 25.81%
I've tried this:
data %>%
group_by(x) %>%
summarise(total = sum(dollars), p = scales::percent(total/sum(total)))
and this:
data %>%
group_by(x) %>%
summarise(total = sum(dollars), p = total/sum(total)*100)
but the result is always this:
x total p
<chr> <dbl> <dbl>
1 expense 1 3600 100%
2 expense 2 2150 100%
3 expense 3 2000 100%
The problem seems to be the summarized total column that may be affecting the results. Any ideas to help me? Thanks
You get 100% because of the grouping. However, summarise() drops one level of grouping after it runs, which means that if you follow it with e.g. mutate(), you get the results you need:
library(dplyr)
data <- tibble(
x = c("expense 1", "expense 2", "expense 3"),
dollars = c(3600L, 2150L, 2000L)
)
data %>%
group_by(x) %>%
summarise(total = sum(dollars)) %>%
mutate(p = total/sum(total)*100)
# A tibble: 3 x 3
x total p
<chr> <int> <dbl>
1 expense 1 3600 46.5
2 expense 2 2150 27.7
3 expense 3 2000 25.8
You get 100% because the sum is calculated within that particular group, so you need to ungroup. Assuming you want to divide by the total number of entries, just divide by nrow(data):
data %>%
group_by(x) %>%
summarise(total = sum(dollars), p = total/nrow(data)*100)
After the first sum, ungroup and create p with mutate.
iris %>%
group_by(Species) %>%
summarise(total = sum(Sepal.Length)) %>%
ungroup() %>%
mutate(p = total/sum(total)*100)
## A tibble: 3 x 3
# Species total p
# <fct> <dbl> <dbl>
#1 setosa 250. 28.6
#2 versicolor 297. 33.9
#3 virginica 329. 37.6
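For completeness, the same percentages fall out of base R's prop.table() applied to the grouped sums, here with the iris example from the answer above:

```r
totals <- tapply(iris$Sepal.Length, iris$Species, sum)
# prop.table() divides each element by the grand total
round(prop.table(totals) * 100, 1)
#    setosa versicolor  virginica
#      28.6       33.9       37.6
```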
I'm trying to get correlation matrices of an arbitrary number of factors by group, ideally using dplyr. I have no problem getting the correlation matrix by filtering by group and summarizing, but using a "group_by", I'm not sure how to pass the factor data to cor.
library(dplyr)
numRows <- 20
myData <- tibble(A = rnorm(numRows),
B = rnorm(numRows),
C = rnorm(numRows),
Group = c(rep("Group1", numRows/2), rep("Group2", numRows/2)))
# Essentially what I'm doing is trying to get these matrices, but for all groups
myData %>%
filter(Group == "Group1") %>%
select(-Group) %>%
summarize(CorMat = cor(.))
# However, I don't know what to pass into "cor". The code below fails
myData %>%
group_by(Group) %>%
summarize(CorMat = cor(.))
# Error looks like this
Error: Problem with `summarise()` column `CorMat`.
i `CorMat = cor(.)`.
x 'x' must be numeric
i The error occurred in group 1: Group = "Group1".
I've seen solutions for the grouped correlation between specific factors (Correlation matrix by group) or correlations between all factors to a specific factor (Correlation matrix of grouped variables in dplyr), but nothing for a grouped correlation matrix of all factors to all factors.
You can try using nest_by, which puts your data (without Group) into a list column called data. You can then pass that column to cor:
myData %>%
nest_by(Group) %>%
summarise(CorMat = cor(data))
Output
Group CorMat[,1] [,2] [,3]
<chr> <dbl> <dbl> <dbl>
1 Group1 1 -0.132 0.638
2 Group1 -0.132 1 -0.284
3 Group1 0.638 -0.284 1
4 Group2 1 0.429 -0.228
5 Group2 0.429 1 -0.235
6 Group2 -0.228 -0.235 1
If you want a named list of matrices, you can also try the following. You can add split (or try group_split without names) and then map to remove the Group column.
library(tidyverse)
myData %>%
nest_by(Group) %>%
summarise(CorMat = cor(data)) %>%
ungroup %>%
split(f = .$Group) %>%
map(~ .x %>% select(-Group))
Output
$Group1
# A tibble: 3 x 1
CorMat[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 1 -0.132 0.638
2 -0.132 1 -0.284
3 0.638 -0.284 1
$Group2
# A tibble: 3 x 1
CorMat[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 1 0.429 -0.228
2 0.429 1 -0.235
3 -0.228 -0.235 1
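If a plain named list of matrices is all you need, you can also skip the matrix-column step entirely: split the data frame by group and map cor() over the pieces. A sketch (myData as defined in the question; set.seed is added so the example is reproducible):

```r
library(dplyr)
library(purrr)

set.seed(1)
numRows <- 20
myData <- tibble(A = rnorm(numRows),
                 B = rnorm(numRows),
                 C = rnorm(numRows),
                 Group = c(rep("Group1", numRows/2), rep("Group2", numRows/2)))

corList <- myData %>%
  split(.$Group) %>%                  # named list of data frames
  map(~ cor(select(.x, -Group)))      # plain 3x3 correlation matrix per group
corList$Group1
```

Each element is an ordinary matrix with named rows and columns, which is often easier to work with than a matrix column inside a tibble.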
I am using the dataset birthwt.
For each age, I want to find the percentage of mothers that are white. My end goal is to display that percentage in a plot by age. How can I do this? I'm learning how to use tidyverse functions so I would prefer to do it that way if possible. Here is my work so far:
library(tidyverse)
library(tidyselect)
library("MASS")
grouped <- birthwt %>%
count(race, age) %>%
spread(key = race, value = n, fill = 0)
grouped
This gets a table where each row represents an age, and there is a column for each race representing the count of mothers of that age. This approach may or may not be on the right path.
We can group by 'age' and take the mean of a logical vector:
library(dplyr)
birthwt %>%
group_by(age) %>%
summarise(perc = mean(race == 1))
# A tibble: 24 x 2
# age perc
# <int> <dbl>
# 1 14 0.333
# 2 15 0.333
# 3 16 0.286
# 4 17 0.25
# 5 18 0.6
# 6 19 0.625
# 7 20 0.333
# 8 21 0.417
# 9 22 0.769
#10 23 0.308
# … with 14 more rows
Or an option with data.table
library(data.table)
setDT(birthwt)[, .(perc = mean(race == 1)), age]
Or using base R
birthwt$perc <- with(birthwt, ave(race == 1, age))
Or another base R option is
with(birthwt, tapply(race == 1, age, FUN = mean))
Or with aggregate
aggregate(cbind(perc = race == 1) ~ age, birthwt, FUN = mean)
Or with by
by(birthwt$race == 1, birthwt$age, FUN = mean)
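Since the stated end goal is a plot, the summarised data can be piped straight into ggplot2; a sketch with geom_col() (column names follow the dplyr answer above):

```r
library(dplyr)
library(ggplot2)

data(birthwt, package = "MASS")

white_by_age <- birthwt %>%
  group_by(age) %>%
  summarise(perc = mean(race == 1))

p <- ggplot(white_by_age, aes(x = age, y = perc)) +
  geom_col() +
  labs(x = "Mother's age", y = "Proportion of white mothers")
p
```

Using data(birthwt, package = "MASS") avoids attaching all of MASS, whose select() would otherwise mask dplyr::select().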
We can count the number of white mothers (race == 1) for each age and divide by the number of rows for that age to get the ratio.
library(dplyr)
birthwt %>%
group_by(age) %>%
summarise(perc = sum(race == 1)/n())
# A tibble: 24 x 2
# age perc
# <int> <dbl>
# 1 14 0.333
# 2 15 0.333
# 3 16 0.286
# 4 17 0.25
# 5 18 0.6
# 6 19 0.625
# 7 20 0.333
# 8 21 0.417
# 9 22 0.769
#10 23 0.308
# … with 14 more rows
In base R, we can use aggregate following the same logic
aggregate(race ~ age, birthwt, function(x) sum(x == 1)/length(x))
Or something similar to your approach using table, we could do
tab <- table(birthwt$age, birthwt$race)
tab[, "1"]/rowSums(tab)
I have a dataframe with panel structure: 2 observations for each unit from two years:
library(tidyr)
mydf <- data.frame(
id = rep(1:3, rep(2,3)),
year = rep(c(2012, 2013), 3),
value = runif(6)
)
mydf
# id year value
#1 1 2012 0.09668064
#2 1 2013 0.62739399
#3 2 2012 0.45618433
#4 2 2013 0.60347152
#5 3 2012 0.84537624
#6 3 2013 0.33466030
I would like to reshape this data to wide format which can be done easily with tidyr::spread. However, as the values of the year variable are numbers, the names of my new variables become numbers as well which makes its further use harder.
spread(mydf, year, value)
# id 2012 2013
#1 1 0.09668064 0.6273940
#2 2 0.45618433 0.6034715
#3 3 0.84537624 0.3346603
I know I can easily rename the columns. However, if I would like to reshape within a chain with other operations, it becomes inconvenient. E.g. the following line obviously does not make sense.
library(dplyr)
mydf %>% spread(year, value) %>% filter(2012 > 0.5)
The following works but is not that concise:
tmp <- spread(mydf, year, value)
names(tmp) <- c("id", "y2012", "y2013")
filter(tmp, y2012 > 0.5)
Any idea how I can change the new variable names within spread?
I know some years have passed since this question was originally asked, but for posterity I also want to highlight the sep argument of spread. When it is not NULL, it is used as the separator between the key name and the values:
mydf %>%
spread(key = year, value = value, sep = "")
# id year2012 year2013
#1 1 0.15608322 0.6886531
#2 2 0.04598124 0.0792947
#3 3 0.16835445 0.1744542
This is not exactly as wanted in the question, but sufficient for my purposes. See ?spread.
Update with tidyr 1.0.0: tidyr 1.0.0 has now introduced pivot_wider (and pivot_longer), which allows for more control in this respect via the arguments names_sep and names_prefix. The call would now be:
mydf %>%
pivot_wider(names_from = year, values_from = value,
names_prefix = "year")
# # A tibble: 3 x 3
# id year2012 year2013
# <int> <dbl> <dbl>
# 1 1 0.347 0.388
# 2 2 0.565 0.924
# 3 3 0.406 0.296
To get exactly what was originally wanted (prefixing "y" only) you can of course now get that directly by simply having names_prefix = "y".
The names_sep is used when you pivot wider over multiple columns, as demonstrated below where I have added quarters to the data:
# Add quarters to data
mydf2 <- data.frame(
id = rep(1:3, each = 8),
year = rep(rep(c(2012, 2013), each = 4), 3),
quarter = rep(c("Q1","Q2","Q3","Q4"), 3),
value = runif(24)
)
head(mydf2)
# id year quarter value
# 1 1 2012 Q1 0.8651470
# 2 1 2012 Q2 0.3944423
# 3 1 2012 Q3 0.4580580
# 4 1 2012 Q4 0.2902604
# 5 1 2013 Q1 0.4751588
# 6 1 2013 Q2 0.6851755
mydf2 %>%
pivot_wider(names_from = c(year, quarter), values_from = value,
names_sep = "_", names_prefix = "y")
# # A tibble: 3 x 9
# id y2012_Q1 y2012_Q2 y2012_Q3 y2012_Q4 y2013_Q1 y2013_Q2 y2013_Q3 y2013_Q4
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.865 0.394 0.458 0.290 0.475 0.685 0.213 0.920
# 2 2 0.566 0.614 0.509 0.0515 0.974 0.916 0.681 0.509
# 3 3 0.968 0.615 0.670 0.748 0.723 0.996 0.247 0.449
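pivot_wider() also has a names_glue argument that lets you template the new names directly, replacing the names_prefix/names_sep combination in one go (a sketch on the quarterly data above):

```r
library(dplyr)
library(tidyr)

mydf2 <- data.frame(
  id = rep(1:3, each = 8),
  year = rep(rep(c(2012, 2013), each = 4), 3),
  quarter = rep(c("Q1", "Q2", "Q3", "Q4"), 6),
  value = runif(24)
)

wide <- mydf2 %>%
  pivot_wider(names_from = c(year, quarter), values_from = value,
              # template the column names directly from the key columns
              names_glue = "y{year}_{quarter}")
names(wide)
```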
You can use backticks for column names starting with numbers and filter should work as expected
mydf %>%
spread(year, value) %>%
filter(`2012` > 0.5)
# id 2012 2013
#1 3 0.8453762 0.3346603
Another option would be to use unite to join the two columns into a single column, after creating a second column 'year1' containing the string 'y':
mydf %>%
mutate(year1='y') %>%
unite(yearN, year1, year) %>%
spread(yearN, value) %>%
filter(y_2012 > 0.5)
# id y_2012 y_2013
#1 3 0.8453762 0.3346603
Even we can change the 'year' column within mutate by using paste
mydf %>%
mutate(year=paste('y', year, sep="_")) %>%
spread(year, value) %>%
filter(y_2012 > 0.5)
Another option is to use the setNames() function as the next thing in the pipe:
mydf %>%
spread(year, value) %>%
setNames( c("id", "y2012", "y2013") ) %>%
filter(y2012 > 0.5)
The only problem using setNames is that you have to know exactly what your columns will be when you spread() them. Most of the time, that's not a problem, particularly if you're working semi-interactively.
But if you're missing a key/value pair in your original data, there's a chance it won't show up as a column, and you can end up naming your columns incorrectly without even knowing it. Granted, setNames() will throw an error if the number of names doesn't match the number of columns, so you've got a bit of error checking built in.
Still, the convenience of using setNames() has outweighed the risk more often than not for me.
Using spread()'s successor pivot_wider() we can give a prefix to the created columns :
library(tidyr)
set.seed(1)
mydf <- data.frame(
id = rep(1:3, rep(2,3)),
year = rep(c(2012, 2013), 3),
value = runif(6)
)
pivot_wider(mydf, names_from = "year", values_from = "value", names_prefix = "y")
#> # A tibble: 3 x 3
#> id y2012 y2013
#> <int> <dbl> <dbl>
#> 1 1 0.266 0.372
#> 2 2 0.573 0.908
#> 3 3 0.202 0.898
Created on 2019-09-14 by the reprex package (v0.3.0)
rename() in dplyr should do the trick
library(tidyr); library(dplyr)
mydf %>%
spread(year, value) %>%
rename(y2012 = `2012`, y2013 = `2013`) %>%
filter(y2012 > 0.5)
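If you'd rather not spell out every column, dplyr's rename_with() (available since dplyr 1.0.0) can prefix all year columns in one call; matches("^[0-9]") selects the columns whose names start with a digit. A sketch using the seeded data from the pivot_wider answer above:

```r
library(dplyr)
library(tidyr)

set.seed(1)
mydf <- data.frame(
  id = rep(1:3, rep(2, 3)),
  year = rep(c(2012, 2013), 3),
  value = runif(6)
)

res <- mydf %>%
  spread(year, value) %>%
  # prefix every column whose name starts with a digit
  rename_with(~ paste0("y", .x), matches("^[0-9]")) %>%
  filter(y2012 > 0.5)
res
#   id     y2012     y2013
# 1  2 0.5728534 0.9082078
```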