I have a dataframe with character and numeric data. I would like to use dplyr to create a summary grouped by time points and trials generating the following:
averages
standard deviations
variation
ratio between time points
(etc etc)
I feel like all of this could be done in the dplyr pipe, but I am struggling to make a ratio of averages between time points within trials.
I fully admit that I may be carrying around a hammer looking for nails, so please feel free to recommend solutions that utilize other packages or functions, but ideally I'd like simple/straight forward code for ease of use by multiple collaborators.
library(dplyr)
# creating an example DF
num <- runif(100, 50, 3200)
smpl <- 1:100
df <- data.frame( num, smpl)
df$time <- "time1"
df$time[seq(2,100,2)] <- "time2"
df$trial <- "a"
df$trial[26:50] <- "b"
df$trial[51:75] <- "c"
df$trial[75:100] <- "d"
# using the magic of pipelines to calculate useful things
df1 <- df %>%
group_by(time, trial) %>%
summarise(avg = mean(num),
var = var(num),
stdev = sd(num))
I'd love to get [the ratio time2/time1 of the avg for each trial] included in this block above, but I don't know how to call "avg" specifically by "time1" vs "time2" within the pipe.
From here on, nothing does quite what I'm hoping for...
df1 <- df1[with(df1,order(trial,time)),]
# this better ressembles my actual DF structure,
# so reordering it will make some of my next attempts to solve this make more sense
I tried to use the fact that 'every other line' is different (this is not ideal because each df will have a different number of rows, so I will either introduce NAs or it will require constantly change these #'s (or writing a function to constantly change them))
tm2 <- data.frame(x=df1$avg[seq(2,4,2)])
tm1 <- data.frame(x=df1$avg[seq(1,3,2)])
so minimally, this is the ratio I'd like included in the df, but tied to the avg & trial columns:
tm2/tm1
It doesn't matter to me 'which' time row this ratio ends up in, so long as it is consistent across all the trials (so if a column of ratios has "blank" for every "time1" and "value" for every "time2", that's fine).
# I added in a separate column to allow 'match' later
tm1$time <- "time1"
tm2$time <- "time1" # to keep them all 'in row'
df1$avg_tm1 <- tm1$x[match(df1$time, tm1$time)]
df1$avg_tm2 <- tm2$x[match(df1$time, tm2$time)]
but this fails to match by 'trial' also, since that info is lost in this new tm1 df ; this really makes me think it should all be done in dplry the first time...
Then I tried to create a new column in the tm1 df with the ratio
tm2$ratio <-tm2$x/tm1$x
and add in the ratio values only if the avg matches
df1$ratio <- tm2$ratio[match(tm2$x, df1$avg)]
This might work, but when I extract the avg values, it rounds, so the numbers do not match exactly. I'm also cautious about this because if I process ridiculous amounts of data, there's a higher and higher chance that two random averages will be similar enough to misplace these ratios.
I tried several other things that completely failed, so let's pretend that something worked and entered the ratio into the df1 as separate columns
Then any further calculations or annotations are straight forward:
df2 <- df1 %>%
mutate(ratio = avg_tm2/avg_tm1,
lost = 1- ratio,
word = paste0(round(lost*100),"%"))
But I am still stuck on 'how' to call specific cells inside the pipe or which other tools/packages to use to calculate deltas or ratios between cells in the same column.
Thanks in advance
We could group by 'trial' and mutate to create the 'ratio' column
df1 %>%
group_by(trial) %>%
mutate(ratio = last(avg)/first(avg))
# A tibble: 8 x 6
# Groups: trial [4]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 time1 a 1815. 715630. 846. 0.795
#2 time1 b 2012. 1299823. 1140. 0.686
#3 time1 c 1505. 878168. 937. 1.09
#4 time1 d 1387. 902364. 950. 1.17
#5 time2 a 1444. 998943. 999. 0.795
#6 time2 b 1380. 720135. 849. 0.686
#7 time2 c 1641. 1205778. 1098. 1.09
#8 time2 d 1619. 582418. 763. 1.17
NOTE: We used set.seed(2) for creating the dataset
Work out a separate data.frame:
set.seed(2)
# your code above to generate df1
df2 <- select(df1, time, trial, avg) %>%
spread(time, avg) %>%
mutate(ratio = time2/time1)
df2
# # A tibble: 4 × 4
# trial time1 time2 ratio
# <chr> <dbl> <dbl> <dbl>
# 1 a 1815.203 1443.731 0.7953555
# 2 b 2012.436 1379.981 0.6857266
# 3 c 1505.474 1641.439 1.0903135
# 4 d 1386.876 1619.341 1.1676176
and now you can merge the relevant column onto the original frame:
left_join(df1, select(df2, trial, ratio), by="trial")
# Source: local data frame [8 x 6]
# Groups: time [?]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 time1 a 1815.203 715630.4 845.9494 0.7953555
# 2 time1 b 2012.436 1299823.3 1140.0979 0.6857266
# 3 time1 c 1505.474 878168.3 937.1063 1.0903135
# 4 time1 d 1386.876 902363.7 949.9282 1.1676176
# 5 time2 a 1443.731 998943.3 999.4715 0.7953555
# 6 time2 b 1379.981 720134.6 848.6074 0.6857266
# 7 time2 c 1641.439 1205778.0 1098.0792 1.0903135
# 8 time2 d 1619.341 582417.5 763.1629 1.1676176
Related
I'm having a hard time making rounded percentages that add up to 100% within groups.
Consider the following example:
# Loading main library used
library(dplyr)
# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
values = c(2200,4700,3000,2000,2900,4400,2200,1000))
print(df)
# group categories values
# 1 A Cat1 2200
# 2 A Cat2 4700
# 3 A Cat3 3000
# 4 A Cat4 2000
# 5 B Cat1 2900
# 6 B Cat2 4400
# 7 B Cat3 2200
# 8 B Cat4 1000
df_with_shares = df %>%
# Calculating group totals and adding them back to the main df
left_join(df %>%
group_by(group) %>%
summarize(group_total = sum(values)),
by='group') %>%
# Calculating each category's share within the groups
mutate(group_share = values / group_total,
group_share_rounded = round(group_share,2))
# Summing the rounded shares within groups
rounded_totals = df_with_shares %>%
group_by(group) %>%
summarize(total_share = sum(group_share_rounded))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <int>
# 1 A 0.99
# 2 B 1.01
# Note how the totals do not add up to 100% as expected
I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:
df_with_rounded_shares = df %>%
mutate(
percs = values / sum(values),
percs_cumsum = cumsum(percs),
percs_cumsum_round = round(percs_cumsum, 2),
percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round,1),0),
percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)
However, the method I devised in the thread above does not work as I would like. It just calculates the shares of the values column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which need their rounded values to add up to 100% independently from every other group.
What can I do to generate a column of rounded percentages that add up to 100% by group?
PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.
The method devised in your implementation (from here) just needs a few small tweaks to make it work.
First, include a group_by statement before calculating the new columns. Also, you need to use a summarize statement instead of the mutate statement you have now.
In essence, this is what it'll look like:
# Modified version of your implementation of the rounding procedure.
# The new procedure below accommodates for grouping variables.
df_with_rounded_shares_by_group = df %>%
group_by(group) %>%
summarize(
group_share = values / sum(values),
group_share_cumsum = cumsum(group_share),
group_share_cumsum_round = round(group_share_cumsum, 2),
group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round,1),0),
group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
# Removing unnecessary temporary columns
select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)
# Verifying if the results add up to 100% within each group
rounded_totals = df_with_rounded_shares_by_group %>%
group_by(group) %>%
summarize(total_share = sum(group_share_rounded_final))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 1
# 2 B 1
# Yep, they all add up to 100% as expected!
Btw, apologies for the ridiculously long column names. I just made them enormous to make it clear what each step was really doing.
I would like to create a column in my data frame that gives the percentage of each category. The total (100%) would be the summary of the column Score.
My data looks like
Client Score
<chr> <int>
1 RP 125
2 DM 30
Expected
Client Score %
<chr> <int>
1 RP 125 80.6
2 DM 30 19.3
Thanks!
Note special character in column names is not good.
library(dplyr)
df %>%
mutate(`%` = round(Score/sum(Score, na.rm = TRUE)*100, 1))
Client Score %
1 RP 125 80.6
2 DM 30 19.4
Probably the best way is to use dplyr. I recreated your data below and used the mutate function to create a new column on the dataframe.
#Creation of data
Client <- c("RP","DM")
Score <- c(125,30)
DF <- data.frame(Client,Score)
DF
#install.packages("dplyr") #Remove first # and install if library doesn't load
library(dplyr) #If this doesn't run, install library using code above.
#Shows new column
DF %>%
mutate("%" = round((Score/sum(Score))*100,1))
#Overwrites dataframe with new column added
DF %>%
mutate("%" = round((Score/sum(Score))*100,1)) -> DF
Using base R functions the same goal can be achieved.
X <- round((DF$Score/sum(DF$Score))*100,1) #Creation of percentage
DF$"%" <- X #Storage of X as % to dataframe
DF #Check to see it exists
In base R, may use proportions
df[["%"]] <- round(proportions(df$Score) * 100, 1)
-output
> df
Client Score %
1 RP 125 80.6
2 DM 30 19.4
I have a data frame in which I would I would like to compute some extra column as a function of the existing columns, but want to specify both each new column name and the function dynamically. I have a vector of column names that are already in the dataframe df_daily:
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
The rows of the dataframe have responses to each question from each user each time they answer the questionnaire, as well as a column with the number of days since the user first answered the questionnaire (i.e. Days_From_First_Use = 0 on the very first use, = 1 if it is used the next day etc.). I want to average the responses to these questions by Days_From_First_Use . I start by by grouping my dataframe by Days_From_First_Use:
df_test <- df_daily %>%
group_by(Days_From_First_Use)
and then try averaging the responses in a loop as follows:
for(i in 1:5){
df_test <- df_test %>%
mutate(!! paste0('Avg_Score_', DAILY_QUESTIONS[i]) :=
paste0('mean(', DAILY_QUESTIONS[i], ')'))
}
Unfortunately, while my new variable names are correct ("Avg_Score_Q1_Daily", "Avg_Score_Q2_Daily", "Avg_Score_Q3_Daily", "Avg_Score_Q4_Daily", "Avg_Score_Q5_Daily"), my answers are not: every row in my data frame has a string such as "mean(Q1_Daily)" in the relevant column .
So I'm clearly doing something wrong - what do I need to do fix this and get the average score across all users on each day?
Sincerely and with many thanks in advance
Thomas Philips
I took a somewhat different approach, using summarize(across(...)) after group_by(Days_From_First_Use) I achieve the dynamic names by using rename_with and a custom function that replaces (starts with)"Q" with "Avg_Score_Q"
library(dplyr, warn.conflicts = FALSE)
# fake data -- 30 normalized "responses" from 0 to 2 days from first use to 5 questions
DAILY_QUESTIONS <- c("Q1_Daily", "Q2_Daily", "Q3_Daily", "Q4_Daily", "Q5_Daily")
df_daily <- as.data.frame(do.call('cbind', lapply(1:5, function(i) rnorm(30, i))))
colnames(df_daily) <- DAILY_QUESTIONS
df_daily$Days_From_First_Use <- floor(runif(30, 0, 3))
df_test <- df_daily %>%
group_by(Days_From_First_Use) %>%
summarize(across(.fns = mean)) %>%
rename_with(.fn = function(x) gsub("^Q","Avg_Score_Q",x))
#> `summarise()` ungrouping output (override with `.groups` argument)
df_test
#> # A tibble: 3 x 6
#> Days_From_First… Avg_Score_Q1_Da… Avg_Score_Q2_Da… Avg_Score_Q3_Da…
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 1.26 1.75 3.02
#> 2 1 0.966 2.14 3.48
#> 3 2 1.08 2.45 3.01
#> # … with 2 more variables: Avg_Score_Q4_Daily <dbl>, Avg_Score_Q5_Daily <dbl>
Created on 2020-12-06 by the reprex package (v0.3.0)
Intro
After recently taking Hadley Wickham's functional programming class I decided I'd try applying some of the lessons to my projects at work. Naturally, the first project I tried has proven to be more complicated than the examples worked demonstrated in the class. Does anyone have recommendations for a way to use the purrr package to make the task described below more efficient?
Project Background
I need to assign quintile groups to records in a spatial polygon dataframe. In addition to the record identifier there are several other variables and I need to calculate the quintile group for each.
Here's the crux of the problem: I have been asked to identify outliers in one particular variable and to omit those records from the entire analysis as long as it doesn't change the quintile composition of the first quintile group for any of the other variables.
Question
I have put together a dplyr pipeline (see the example below) that performs this checking process for a single variable, but how might I rewrite this process so that I can efficiently check each variable?
EDIT: While it is certainly possible to change the shape of the data from wide to long as an intermediary step, in the end it needs to return to its wide format so that it matches up with the #polygons slot of the spatial polygons dataframe.
Reproducible Example
You can find the complete script here: https://gist.github.com/tiernanmartin/6cd3e2946a77b7c9daecb51aa11e0c94
Libraries and Settings
library(grDevices) # boxplot.stats()
library(operator.tools) # %!in% logical operator
library(tmap) # 'metro' data set
library(magrittr) # piping
library(dplyr) # exploratory data analysis verbs
library(purrr) # recursive mapping of functions
library(tibble) # improved version of a data.frame
library(ggplot2) # dot plot
library(ggrepel) # avoid label overlap
options(scipen=999)
set.seed(888)
Load the example data and take a small sample of it
data("metro")
m_spdf <- metro
# Take a sample
m <-
metro#data %>%
as_tibble %>%
select(-name_long,-iso_a3) %>%
sample_n(50)
> m
# A tibble: 50 x 10
name pop1950 pop1960 pop1970 pop1980 pop1990
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sydney 1689935 2134673 2892477 3252111 3631940
2 Havana 1141959 1435511 1779491 1913377 2108381
3 Campinas 151977 293174 540430 1108903 1693359
4 Kano 123073 229203 541992 1349646 2095384
5 Omsk 444326 608363 829860 1032150 1143813
6 Ouagadougou 33035 59126 115374 265200 537441
7 Marseille 755805 928768 1182048 1372495 1418279
8 Taiyuan 196510 349535 621625 1105695 1636599
9 La Paz 319247 437687 600016 809218 1061850
10 Baltimore 1167656 1422067 1554538 1748983 1848834
# ... with 40 more rows, and 4 more variables:
# pop2000 <dbl>, pop2010 <dbl>, pop2020 <dbl>,
# pop2030 <dbl>
Calculate quintile groups with and without outlier records
# Calculate the quintile groups for one variable (e.g., `pop1990`)
m_all <-
m %>%
mutate(qnt_1990_all = dplyr::ntile(pop1990,5))
# Find the outliers for a different variable (e.g., 'pop1950')
# and subset the df to exlcude these outlier records
m_out <- boxplot.stats(m$pop1950) %>% .[["out"]]
m_trim <-
m %>%
filter(pop1950 %!in% m_out) %>%
mutate(qnt_1990_trim = dplyr::ntile(pop1990,5))
# Assess whether the outlier trimming impacted the first quintile group
m_comp <-
m_trim %>%
select(name,dplyr::contains("qnt")) %>%
left_join(m_all,.,"name") %>%
select(name,dplyr::contains("qnt"),everything()) %>%
mutate(qnt_1990_chng_lgl = !is.na(qnt_1990_trim) & qnt_1990_trim != qnt_1990_all,
qnt_1990_chng_dir = if_else(qnt_1990_chng_lgl,
paste0(qnt_1990_all," to ",qnt_1990_trim),
"No change"))
With a little help from ggplot2, I can see that in this example six outliers were identified and that their omission did not affect the first quintile group for pop1990.
Importantly, this information is tracked in two new variables: qnt_1990_chng_lgl and qnt_1990_chng_dir.
> m_comp %>% select(name,qnt_1990_chng_lgl,qnt_1990_chng_dir,everything())
# A tibble: 50 x 14
name qnt_1990_chng_lgl qnt_1990_chng_dir qnt_1990_all qnt_1990_trim
<chr> <lgl> <chr> <dbl> <dbl>
1 Sydney FALSE No change 5 NA
2 Havana TRUE 4 to 5 4 5
3 Campinas TRUE 3 to 4 3 4
4 Kano FALSE No change 4 4
5 Omsk FALSE No change 3 3
6 Ouagadougou FALSE No change 1 1
7 Marseille FALSE No change 3 3
8 Taiyuan TRUE 3 to 4 3 4
9 La Paz FALSE No change 2 2
10 Baltimore FALSE No change 4 4
# ... with 40 more rows, and 9 more variables: pop1950 <dbl>, pop1960 <dbl>,
# pop1970 <dbl>, pop1980 <dbl>, pop1990 <dbl>, pop2000 <dbl>, pop2010 <dbl>,
# pop2020 <dbl>, pop2030 <dbl>
I now need to find a way to repeat this process for every variable in the dataframe (i.e., pop1960 - pop2030). Ideally, two new variables would be created for each existing pop* variable and their names would be preceded by qnt_ and followed by either _chng_dir or _chng_lgl.
Is purrr the right tool to use for this? dplyr::mutate_? data.table?
It turns out this problem is solvable using tidyr::gather + dplyr::group_by + tidyr::spread functions. While #shayaa and #Gregor didn't provide the solution I was looking for, their advice helped me course-correct away from the functional programming methods I was researching.
I ended up using #shayaa's gather and group_by combination, followed by mutate to create the variable names (qnt_*_chng_lgl and qnt_*_chng_dir) and then using spread to make it wide again. An anonymous function passed to summarize_all removed all the extra NA's that the wide-long-wide transformations created.
m_comp <-
m %>%
mutate(qnt = dplyr::ntile(pop1950,5)) %>%
filter(pop1950 %!in% m_out) %>%
gather(year,pop,-name,-qnt) %>%
group_by(year) %>%
mutate(qntTrim = dplyr::ntile(pop,5),
qnt_chng_lgl = !is.na(qnt) & qnt != qntTrim,
qnt_chng_dir = ifelse(qnt_chng_lgl,
paste0(qnt," to ",qntTrim),
"No change"),
year_lgl = paste0("qnt_chng_",year,"_lgl"),
year_dir = paste0("qnt_chng_",year,"_dir")) %>%
spread(year_lgl,qnt_chng_lgl) %>%
spread(year_dir,qnt_chng_dir) %>%
spread(year,pop) %>%
select(-qnt,-qntTrim) %>%
group_by(name) %>%
summarize_all(function(.){subset(.,!is.na(.)) %>% first})
Nothing wrong with your analysis it seems to me,
After this part
m <- metro#data %>%
as_tibble %>%
select(-name_long,-iso_a3) %>%
sample_n(50)
Just melt your data and continue your analysis but with group_by(year)
library(reshape2)
library(stringr)
mm <- melt(m)
mm[,2] <- as.factor(str_sub(mm[,2],-4))
names(mm)[2:3] <- c("year", "population")
e.g.,
mm %>% group_by(year) %>%
+ mutate(qnt_all = dplyr::ntile(population,5))
I would like to randomly sample months according to a set of weights given by an index in a separate data frame, but the index changes according to the some categorical variables.
Below is an example problem:
require(dplyr)
sim.size <- 1000
# Generating the weights for each month, and category combination
class_probs <- data_frame(categoryA=rep(letters[1:3],24)
categoryB=rep(LETTERS[1:2],each=36),
Month=rep(month.name,6),
MonthIndex=runif(72))
# Generating some randomly simulated cateogories
sim.data <- data_frame(categoryA=sample(letters[1:3],size=sim.size,replace=TRUE),
categoryB=sample(LETTERS[1:2],size=sim.size,replace=TRUE))
# This is where i need help
# I would like to add an extra column called Month on the end of sim.data
# That will be sampled using the class_probs data, taking into account the
# Both categoryA and categoryB to generate the weights in MonthIndex
sim.data %>%
group_by(categoryA,categoryB) %>%
do(sample_n(class_probs[class_probs$categoryA==categoryA &
class_probs$categoryB==categoryB, ],
size=nrow(sim.data[sim.data$categoryA==categoryA &
sim.data$categoryB==categoryB]),
replace=TRUE,
weight=MonthIndex)$Month)
So for each group i would like to be able to sample the same number of occurrences of a particular combination of categoryA and categoryB, and for each occurrence i would like to sample a Month according to the MonthIndex given from the subset of the class_prob data.frame...
The chosen Month is then binded onto the original dataset sim.data as an extra column
Hopefully my code is already quite close...i just need a bit of help working out what bits need to change...
Here's an approach with a helper function to do the sampling, then a simple mutate call in dplyr to create the new column.
Helper function:
sampler <- function(x, y, df) {
tab <- sample_n(df %>% filter(categoryA==x,
categoryB==y),
size=1,
replace=TRUE,
weight=MonthIndex)
return(tab$Month)
}
Calling it to create a new variable:
sim.data %>%
rowwise() %>%
mutate(month = sampler(categoryA, categoryB, class_probs))
Result:
Source: local data frame [1,000 x 3]
Groups: <by row>
categoryA categoryB month
1 b B February
2 b A February
3 b B May
4 c B December
5 c B June
6 b A August
7 c A March
8 c A September
9 b A August
10 c A December
.. ... ... ...