Computing differences between groups: alternative to spread for multiple computations - R

I commonly need to compute differences between groups, nested by some interval and/or additional grouping. For computing a single variable, this is easy to accomplish with spread and mutate. Here's a reproducible example with the dataset ChickWeight; don't get distracted by the calculation itself (this is just a toy example), my question is about how to handle a dataset structured like the data frame ChickSum created below.
# reproducible dataset
library(dplyr)
library(tidyr)
data(ChickWeight)
ChickSum = ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(mean.weight = mean(weight)) %>%
  ungroup()
Here is how I might go about calculating the change in average chick weight between the first and last time, stratified by diet:
# Compute change in mean weight between first and last time
ChickSum %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)
However, this doesn't work so well with multiple variables:
ChickSum2 = ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup()
I can't spread Time against both count and mean.weight at once; my current solution is to do two spread-mutate operations (once for count, again for mean.weight) and then join the results.
ChickCountChange = ChickSum2 %>%
  select(-mean.weight) %>%
  spread(Time, count) %>%
  mutate(count.change = `21` - `0`)

ChickWeightChange = ChickSum2 %>%
  select(-count) %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)

full_join(
  select(ChickWeightChange, Diet, weight.change),
  select(ChickCountChange, Diet, count.change),
  by = "Diet"
)
Is there another approach to these types of computation? I've been trying to conceive of a strategy that combines group_by and purrr::pmap in order to avoid spread but still maintain the advantages of the above approach (such as spread's fill argument for choosing how to handle missing group combinations), but I haven't figured it out. I'm open to suggestions or alternative data structures/ways of thinking about the problem.

You might try re-grouping, then using lag() to calculate the differences. Works for your toy example, but it may be better to see some of your real dataset:
ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup() %>%
  group_by(Diet) %>%
  mutate(count.change  = count - lag(count),
         weight.change = mean.weight - lag(mean.weight)) %>%
  filter(Time == max(Time))
Result:
  Diet   Time count mean.weight count.change weight.change
  <fct> <dbl> <int>       <dbl>        <int>         <dbl>
1 1        21    16        178.           -4          136.
2 2        21    10        215.            0          174
3 3        21    10        270.            0          230.
4 4        21     9        239.           -1          198.
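One caveat with the lag() approach is that it relies on the row order within each Diet group. lag() has a standard order_by argument that makes the ordering explicit; a minimal variant of the same step (my sketch, not part of the original answer):
ChickSum2 %>%
  group_by(Diet) %>%
  # order the lag explicitly by Time rather than relying on row order
  mutate(count.change  = count - lag(count, order_by = Time),
         weight.change = mean.weight - lag(mean.weight, order_by = Time)) %>%
  filter(Time == max(Time))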

So I came up with a potential/partial solution in the process of writing up a reproducible example. Essentially, we use gather to group by the variables themselves:
ChickSum2 %>%
  gather(variable, value, count, mean.weight) %>%
  spread(Time, value) %>%
  mutate(Change = `21` - `0`) %>%
  select(Diet, variable, Change) %>%
  spread(variable, Change)
This works only if the following two conditions are true:
1. All variables are the same type (e.g. both mean.weight and count are numeric).
2. The difference calculation is the same for all variables (e.g. I want to compute last - first for all variables).
I guess the second condition could be relaxed by using e.g. case_when.
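For reference, a sketch of another option (assuming tidyr 1.0 or later, which is newer than the spread/gather API used above): pivot_wider can spread several value columns at once, and its values_fill argument plays the role of spread's fill for missing group combinations:
library(tidyr)
ChickSum2 %>%
  pivot_wider(names_from = Time, values_from = c(count, mean.weight)) %>%
  mutate(count.change  = count_21 - count_0,
         weight.change = mean.weight_21 - mean.weight_0)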

Related

Reclassify attributes that are less than x% of total as 'other'

Okay, so I have data like this:
ID   Name     Job
001  Bill     Carpenter
002  Wilma    Lawyer
003  Greyson  Lawyer
004  Eddie    Janitor
I want to group these together for analysis, so any job that appears in less than x percent of the whole will be grouped into "Other".
How can I do this? Here is what I tried:
df %>%
  group_by(Job) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(count)) %>%
  drop_na()
Now I know what the percentages are, but how do I integrate this into the original data to make everything at or below X "Other"? (Let's say less than or equal to 25% is "Other".)
Maybe there's a more straightforward way....
You can try this:
library(dplyr)
df %>%
  count(Job) %>%
  mutate(n = n/sum(n)) %>%
  left_join(df, by = 'Job') %>%
  mutate(Job = replace(Job, n <= 0.25, 'Other'))
To integrate the calculation into the original data, we do a left_join and then replace the values.
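If adding a dependency is acceptable, a one-step alternative worth knowing is forcats::fct_lump_prop (a sketch, not the answer above; note it lumps levels whose share is strictly below the given proportion, so an exact "less than or equal to" cutoff may need a slightly higher threshold):
library(dplyr)
library(forcats)

df %>%
  # lump any Job whose share of rows falls below prop into "Other"
  mutate(Job = fct_lump_prop(Job, prop = 0.25, other_level = "Other"))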

Creating multiple new columns in a dataframe in an iterative way

-- Small edit made to test data. Columns are no longer grouped by round, but instead grouped by Team as is the case in the real dataset.
I have tried a variety of methods for creating multiple new columns, while minimising the repetition in the code. My initial and successful method requires excessive copy and pasting, but I would like to minimise this as much as possible. Below is example data for the problem:
library(tibble)
df <- tribble(
  ~R1TeamX, ~R2TeamX, ~R3TeamX, ~R1TeamY, ~R2TeamY, ~R3TeamY,
  10, 11, 12, 15, 19, 20,
  11, 13, 14, 25, 18, 15
)
This example data is for three rounds with the scores for both team X and team Y. I am looking to create additional columns, finding the difference between the scores of Team X and Y. The real dataset has upwards of 30 rounds.
My initial solution used mutate and works as follows:
df <- df %>%
  mutate(R1Diff = R1TeamX - R1TeamY,
         R2Diff = R2TeamX - R2TeamY,
         R3Diff = R3TeamX - R3TeamY)
While this does the job, it is not scalable. I have attempted to reduce this down to less code using str_c & mutate, but cannot identify the correct looping method to make this work for several lines of code. Below is my attempt at standardising the code so far:
library(stringr)
library(rlang)

teamx <- str_c("R", 1:3, "TeamX")
teamy <- str_c("R", 1:3, "TeamY")
round_diff <- str_c("R", 1:3, "Diff")

df <- df %>%
  mutate(!!round_diff[1] := UQ(parse_quo(teamx[1], global_env())) - UQ(parse_quo(teamy[1], global_env())),
         !!round_diff[2] := UQ(parse_quo(teamx[2], global_env())) - UQ(parse_quo(teamy[2], global_env())),
         !!round_diff[3] := UQ(parse_quo(teamx[3], global_env())) - UQ(parse_quo(teamy[3], global_env())))
While additional code is required, this standardises my input to some degree, reducing some of the legwork, but I know there must be some way to reduce this into a single line. I have explored mutate_at and for loops to no avail. I suspect this problem could also be tackled with purrr::map, but I do not have enough ability in this area to identify the correct approach.
Any help would be greatly appreciated.
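For reference, a minimal sketch of the purrr idea mentioned above (the use of map_dfc and str_c and the helper name diff_cols are my own additions, not from the question or the answers below):
library(dplyr)
library(purrr)
library(stringr)

rounds <- str_c("R", 1:3)  # 1:30 or more for the real data

# build one Diff column per round, then bind them onto the original data
diff_cols <- map_dfc(
  rounds,
  ~ tibble(!!str_c(.x, "Diff") := df[[str_c(.x, "TeamX")]] - df[[str_c(.x, "TeamY")]])
)
bind_cols(df, diff_cols)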
Whilst it's possible to do this with dplyr and tidyr functions, remember you still have some useful base R options open to you. This method uses lapply and assumes that your columns alternate between Team X and Team Y:
seq(length(df)/2) %>%
  lapply(function(x) df[[2 * x - 1]] - df[[2 * x]]) %>%  # pair columns (1,2), (3,4), ...
  as.data.frame() %>%
  setNames(paste0("R", seq(length(df)/2), "Diff")) %>%
  cbind(df, .)
#>   R1TeamX R1TeamY R2TeamX R2TeamY R3TeamX R3TeamY R1Diff R2Diff R3Diff
#> 1      10      11      12      15      19      20     -1     -3     -1
#> 2      11      13      14      25      18      15     -2    -11      3
Try it this way:
library(tidyverse)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    -id,
    names_to = c("set", ".value"),
    names_pattern = "(R\\d+Team)(X|Y)"
  ) %>%
  mutate(Diff = X - Y) %>%
  pivot_longer(-c(id, set)) %>%
  pivot_wider(id, names_from = c(set, name), values_from = value, names_sep = "")
Here's a solution that I believe is robust with respect to the number of rounds, the number of opponents of Team X and the order in which the results are stored.
First, make the data tidy: remove information about Teams and Rounds from column names.
newDF <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    -id,
    names_to = c("Round", "Team"),
    names_pattern = "R(\\d+)Team(X|Y)",
    values_to = "Score"
  )
Now calculate the differences in scores
newDF %>%
  # Calculate difference in scores
  mutate(Team = ifelse(Team == "X", Team, "Opponent")) %>%
  pivot_wider(values_from = Score, names_from = Team) %>%
  mutate(Diff = X - Opponent) %>%
  select(-Opponent) %>%
  # Bring in identity of opponent
  left_join(
    newDF %>%
      filter(Team != "X") %>%
      select(-Score) %>%
      rename(Opposition = Team),
    by = c("id", "Round")
  )
Giving
# A tibble: 6 x 5
id Round X Diff Opposition
<int> <chr> <dbl> <dbl> <chr>
1 1 1 10 -5 Y
2 1 2 11 -8 Y
3 1 3 12 -8 Y
4 2 1 11 -14 Y
5 2 2 13 -5 Y
6 2 3 14 -1 Y
based on OP's revised input data.

Generating demographic tables/frequencies of a large number of specific variables in R

I have larger data sets that for a 'first run' require basic frequencies by a group (groups are flagged with 1 or 0 in columns). The issue is that some of the basic frequencies are for a very large number of variables (180 or so) that aren't named with a specific prefix and aren't positioned say in columns 2:100. They could be in columns 2:80, and then say 90:117, etc.
I get that the basic gist of doing this would be something like this:
mtcars %>%
  filter(gear == 4) %>%
  group_by(am) %>%
  summarise(n = n()) %>%
  mutate(perc = n / sum(n) * 100)
One issue is that my 3-5 groups are all flagged in separate columns, so I would need to use filter(pop1 == 1), filter(pop2 == 1), etc. But is there a way for the group_by to go through 180 variables with an output for each variable? It's just a simple frequency of each variable's values or missing. So for the mtcars, it would be a frequency of just am, and then just vs. It would be fine for the output to be one long one with two columns.
Like this:
Variable  Value   n  Perc
am        0       4  33.3
am        1       8  66.7
vs        0       2  16.7
vs        1      10  83.3
I recognize this might involve gather or pivot_longer, but I could not figure out a way to transform only the 180 variables (out of, say, 200) that I need frequencies on into long format.
Edit:
I ended up using this to select for many columns:
positions <- c(4:176,198)
And using select(positions) to circumvent typing in all of the variables.
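As an aside (my addition, not part of the question): with newer dplyr/tidyselect, wrapping the external vector in all_of() avoids the ambiguity warning that select() raises for bare vectors; a minimal sketch:
positions <- c(4:176, 198)
df %>% select(all_of(positions))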
We can use map from purrr to separately do the grouping and apply the same code as in the OP's post
library(dplyr)
library(purrr)
map_dfr(c('am', 'vs'), ~
  mtcars %>%
    filter(gear == 4) %>%
    group_by(Variable = .x, Value = !!rlang::sym(.x)) %>%
    summarise(n = n()) %>%
    mutate(perc = n/sum(n) * 100))
# A tibble: 4 x 4
# Groups: Variable [2]
# Variable Value n perc
# <chr> <dbl> <int> <dbl>
#1 am 0 4 33.3
#2 am 1 8 66.7
#3 vs 0 2 16.7
#4 vs 1 10 83.3
Or another option is to convert to 'long' format with pivot_longer
library(tidyr)
mtcars %>%
  filter(gear == 4) %>%
  select(vs, am) %>%
  pivot_longer(everything()) %>%
  count(name, value) %>%
  group_by(name) %>%   # so percentages are computed within each variable
  mutate(perc = n/sum(n) * 100)

Summary matrix by ID combination in R

I have a df (test) like this
Now if you look at the data, the 6-to-10 combination is available in the second period but not in the first. Hence when I use this code:
a_summary <- test %>%
  group_by(from, to) %>%
  summarize(avg = mean(share, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, avg, fill = 0)
The output comes like this
Now, look at the 10-to-6 cell. It gives a value of 1 because the 10-to-6 combination only exists once. But when I take the average, I would like to consider all combinations in each period; hence the expected value of that 10-to-6 cell is 0.5, and the overall matrix column and row sums should be 1.
a_summary <- test %>%
  group_by(from, to) %>%
  summarize(count = sum(n, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, count, fill = 0)
This will give you the counts of all combinations. Now you can normalize this matrix by dividing by sum(test$n), or use prop.table().
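Another way to average over every combination in each period (a sketch only; it assumes the data has a column named period, which is a guess, since the example data isn't shown) is to fill in the missing combinations first with tidyr::complete so that mean() treats them as zeros:
library(dplyr)
library(tidyr)

test %>%
  complete(period, from, to, fill = list(share = 0)) %>%  # add missing combinations with share = 0
  group_by(from, to) %>%
  summarize(avg = mean(share, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, avg, fill = 0)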

unexpected row when going from long to wide format with dplyr and tidyr

I've got a data frame (dfdat) with two categorical variables, location and employmentstatus.
I'd like to generate a data frame with the proportions of employment status for each location.
mydf_wide (achieved outcome) is almost what I'm looking for. The problem is that employmentstatus is a variable with two levels, yet there are three rows in mydf_wide. I don't understand why that is, because I'd have expected something similar to mytable (expected outcome).
Any help would be much appreciated.
Starting point (df):
dfdat <- data.frame(
  location = c("GA","GA","MA","OH","RI","GA","AZ","MA","OH","RI"),
  employmentstatus = c(1,2,1,2,1,1,1,2,1,1)
)
Expected outcome (table):
mytable <- table(dfdat$employmentstatus, dfdat$location)
mytable <- round(100*(prop.table(mytable, 2)), 1)
Achieved outcome (df):
library(dplyr)
mydf <- dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(freq = round((n / sum(n)*100), 1))

library(tidyr)
mydf_wide <- spread(mydf, location, freq)
mydf_wide <- as.data.frame(mydf_wide)
We need to do a second group_by with 'location' to get the sum. Also, instead of grouping and then creating 'n', the count function can be used:
dfdat %>%
  count(location, employmentstatus) %>%
  group_by(location) %>%
  mutate(n = round(100*n/sum(n), 2)) %>%
  spread(location, n, fill = 0)
# A tibble: 2 x 6
# employmentstatus AZ GA MA OH RI
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 100 66.67 50 50 100
#2 2 0 33.33 50 50 0
If we are using the OP's code, then remove the 'n' column before doing the spread:
dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(freq = round((n / sum(n)*100), 1)) %>%
  select(-n) %>%
  spread(location, freq, fill = 0)
Or, update the 'n' column with the output of round and then spread. The extra 'n' column is what produced the unexpected third row in the original output: after the spread, each distinct combination of 'employmentstatus' and 'n' becomes its own row.
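A minimal sketch of that variant (my wording of the code, not the answer's original): overwrite 'n' with the rounded percentage so only one value column is left to spread:
dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(n = round(n / sum(n) * 100, 1)) %>%
  spread(location, n, fill = 0)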
