Calculating conditional summaries of grouped data in dplyr - r

I have a dataset of population mortality data segregated by year, decile (ranked) of deprivation, gender, cause of death and age. Age data is broken down into categories including 0-1, 1-4, 5-9, 10-14 etc.
I am trying to coerce my dataset such that the mortality data for 0-1 and 1-4 is merged together to create age categories 0-4, 5-9, 10-14 and so on up to 90. My data is in long format.
Using dplyr I am trying to use if_else and summarise() to aggregate mortality data for 0-1 and 1-4 together, however any iteration of code I apply is merely producing the same dataset I originally had, i.e. the code is not merging my data together.
head(death_popn_long) #cause_death variable content removed for brevity
Year deprivation_decile Sex cause_death ageband deaths popn
1 2017 1 Male NA 0 0 2106
2 2017 1 Male NA 0 0 2106
3 2017 1 Male NA 0 0 2106
4 2017 1 Male NA 0 0 2106
5 2017 1 Male NA 0 0 2106
6 2017 1 Male NA 0 0 2106
#Attempt to merge ageband 0-1 & 1-4 by summarising combined death counts
test <- death_popn_long %>%
  group_by(Year, deprivation_decile, Sex, cause_death, ageband) %>%
  summarise(deaths = if_else(ageband %in% c("0", "1"), sum(deaths),
                             deaths))
I would like the deaths variable to be the combined death count (i.e. the sum of both 0-1 and 1-4) for these age bands, however the above and any alternative code I attempt merely recreates the dataset I already had.

You don't want to use ageband in your group_by statement if you intend on manipulating its groups. You'll need to create your new version of ageband and then group by that:
test <- death_popn_long %>%
  # the replacement value must be type-compatible with ageband (use "1" if ageband is character)
  mutate(new_ageband = if_else(ageband %in% c("0", "1"), 1, ageband)) %>%
  group_by(Year, deprivation_decile, Sex, cause_death, new_ageband) %>%
  summarise(deaths = sum(deaths))
If you'd like a marginally shorter version you can define new_ageband in the group_by clause instead of using a mutate verb beforehand. I just did that to be explicit.
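A minimal sketch of that shorter form, assuming ageband is stored as character. The toy data and the "0-4" label are made up for illustration and are not from the question:

```r
library(dplyr)

# Hypothetical toy data: two age bands to merge ("0" and "1") plus two others
toy <- tibble(
  Year = 2017,
  Sex = "Male",
  ageband = c("0", "1", "5", "10"),
  deaths = c(2, 3, 1, 4)
)

merged <- toy %>%
  # group_by() can create the grouping column on the fly
  group_by(Year, Sex,
           new_ageband = if_else(ageband %in% c("0", "1"), "0-4", ageband)) %>%
  summarise(deaths = sum(deaths), .groups = "drop")

merged
```

The "0" and "1" rows collapse into a single "0-4" row whose deaths value is their sum; the other bands pass through unchanged.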
Also, for future SO questions - it's very helpful to provide data in your question (using something like dput). :)

Related

In R, how to combine two data frames where column names in one equals row values in another?

Say I have one data frame of tooth brush brands and a measure of how popular they are over time:
year brand_1 brand_2
2010 0.7 0.3
2011 0.6 0.6
2012 0.4 0.9
And another that says when each tooth brush brand went electrical, with NA meaning they never did so:
brand went_electrical_year
brand_1 NA
brand_2 2011
Now I'd like to combine these to get the prevalence of electrical tooth brush brands (as a proportion of the total) each year:
year electrical_prevalence
2010 0
2011 0.5
2012 0.69
In 2010 it's 0 b/c none of the brands are electrical. In 2011 it's 0.5 b/c both are and they are equally prevalent. In 2012 it's 0.69 b/c both are but the electrical one is more prevalent.
I've wrestled with this in R but can't figure out a way to do it. Would appreciate any help or suggestions. Cheers.
Assuming your data frames are df1 and df2, you can use the following tidyverse approach.
First, use pivot_longer to put your data into a long format which will be easier to manipulate. Use left_join to add the relevant years of when the brands went electrical.
We can create an indicator mult which will be 1 if the brand has gone electrical, or zero if it hadn't.
Then, for each year, you can determine the proportion by multiplying the popularity value by mult for each brand, and then dividing by the total sum for that year.
library(tidyverse)
df1 %>%
  pivot_longer(cols = -year) %>%
  left_join(df2, by = c("name" = "brand")) %>%
  mutate(mult = ifelse(went_electrical_year > year | is.na(went_electrical_year), 0, 1)) %>%
  group_by(year) %>%
  summarise(electrical_prevalence = sum(value * mult) / sum(value))
Output
year electrical_prevalence
<int> <dbl>
1 2010 0
2 2011 0.5
3 2012 0.692

Growth Rates in Unbalanced Panel Data

I am trying to compute growth rates for some variables in an unbalanced panel, but I'm still getting results for years in which the lag does not exist.
I've been trying to compute the growth rates with dplyr, as shown here:
total_firmas_growth <- total_firmas %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  mutate(
    ing_real_growth = ((ingresos_real_2 / lag(ingresos_real_2)) - 1) * 100
  )
For instance, if a firm has a value for "ingresos_real_2" in 2008 and its next value is in 2012, the code calculates a growth rate instead of returning NA, even though the preceding year is missing (i.e. 2011 is needed to calculate the 2012 growth rate). See firma 115 (the id) in the example right below:
total_firmas_growth <-
" firma anio ingresos_real_2 ing_real_growth
1 110 2005 14000 NA
2 110 2006 15000 7.14
3 110 2007 13000 -13.3
4 115 2008 15000 NA
5 115 2012 13000 NA
6 115 2013 14000 7.69
I will really appreciate your help.
The easiest way to get NA growth rates for the gaps is to add rows of NA values for the missing firm-year combinations. expand() creates an all-by-all tibble of the variables you are interested in, and {.} takes in whatever was piped more robustly than . (by creating a copy, I believe). Joining that grid back to your data with right_join() keeps every firm-year combination, so the missing years show up as rows of NA. Since any arithmetic operation that includes an NA results in an NA, this should get you what you're after if you run your group_by, arrange, mutate code after it.
total_firmas %>%
  right_join(
    # full_seq() also fills in years that are missing for every firm, not just for some
    expand({.}, firma, anio = full_seq(anio, 1)),
    by = c("firma", "anio")
  )
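An alternative sketch that avoids adding rows: compute the growth rate only when the previous row of the same firm is exactly one year earlier. The toy data mirrors the example in the question, and prev_consecutive is a helper column name introduced here for illustration:

```r
library(dplyr)

# Toy panel copied from the question's example
total_firmas <- tibble(
  firma = c(110, 110, 110, 115, 115, 115),
  anio = c(2005, 2006, 2007, 2008, 2012, 2013),
  ingresos_real_2 = c(14000, 15000, 13000, 15000, 13000, 14000)
)

total_firmas_growth <- total_firmas %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  mutate(
    # TRUE only when the previous observation is the immediately preceding year;
    # NA for the first row of each firm
    prev_consecutive = (anio - lag(anio)) == 1,
    ing_real_growth = if_else(
      prev_consecutive,
      (ingresos_real_2 / lag(ingresos_real_2) - 1) * 100,
      NA_real_
    )
  ) %>%
  ungroup()

total_firmas_growth
```

With this data, firma 115 gets NA for 2012 (its previous observation is 2008) and a regular growth rate for 2013.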

Frequency count with multiple conditions R

I have a data frame:
Date Team Opponent Weather Outcome
2017-05-01 All Stars B Stars Rainy 1
2017-05-02 All Stars V Stars Rainy 1
2017-05-03 All Stars M Trade Sunny 0
.
.
2017-05-11 All Stars Vdronee Sunny 0
where an Outcome of 1 indicates a win. I used the table function with a condition applied to get the frequencies:
table(df$Outcome, df$Team == "All Stars")
Returns me this
FALSE TRUE
0 1005 30
1 1323 57
So the win frequency is 57/87 = 0.655.
Two questions:
Rather than calculating the win frequency manually, how do I embed this directly in a formula?
and
How do I filter based on the x most recent observations? i.e. something like
table(df$Outcome, df$Team == "All Stars" & df$date = filtering for the 5 most recent observations)
thanks
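One option is to use data.table:
library(data.table)
dt <- data.table(df)
# proportion of wins per team
dt[, .(prop = sum(Outcome) / .N), by = Team]
To get the 5 most recent observations per team, you can do the following:
# sort by date, keep the last 5 rows of each team, then compute the proportion
dt[order(Date), tail(.SD, 5), by = Team][, .(prop = sum(Outcome) / .N), by = Team]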
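A base R sketch of the same two calculations, in case data.table is not available. Because Outcome is coded 0/1, its mean is the win frequency. The toy data below is hypothetical and only stands in for the question's df:

```r
# Hypothetical toy data: eight consecutive games for one team
df <- data.frame(
  Date = as.Date("2017-05-01") + 0:7,
  Team = "All Stars",
  Outcome = c(1, 1, 0, 1, 0, 0, 1, 0)
)

# Keep one team's rows and make sure they are in date order
team_rows <- df[df$Team == "All Stars", ]
team_rows <- team_rows[order(team_rows$Date), ]

# Win frequency over all games, and over the 5 most recent games
win_rate_all <- mean(team_rows$Outcome)
win_rate_last5 <- mean(tail(team_rows$Outcome, 5))
```

Changing the 5 in tail() gives the win frequency over any other number of recent games.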

How to check if any row has negative values by leaving out selected rows?

Below is the dataframe I get by running a query. Please note that df1 is a dynamic dataframe and it might return either an empty df or partial df with not all quarters as seen below:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 299111.86
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
I would want to check the values of all the rows in Revenue column and see if any value is 0 or negative excluding 2014-Q1 row
Also, the df1 is dynamic and will contain only 12 quarters of data i.e. when I reach next qtr i.e. 2017-Q2, the Revenue associated with 2014-Q2 becomes 0 and it will look like this:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 0.00
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
14 2017-Q2 146253.64
In the above case, I would need to check all rows for the Revenue column by excluding 2014-Q1 and 2014-Q2
And this goes on as quarter progresses
Need your help to generate the code which would dynamically do all the above steps of excluding the row(s) and check only the rows that matter for a particular quarter
Currently, I am using the below code:
#Taking the first df1 into consideration which has 2017-Q1 as the last quarter
startQtr <- "2014-Q2" #This value is dynamically achieved and will change as we move ahead. Next quarter, the value changes to 2014-Q3 and so on
if(length(df1[["FISC_QTR_VAL"]][nrow(df1)-11] == startQtr) == 1){
if(nrow(df1[df1$Revenue < 0,]) == 0 & nrow(df1[df1$Revenue == 0,]) == 0){
df1 <- df1 %>% slice((nrow(df1)-11):(nrow(df1)))
}
}
The first if statement checks whether there is data in df1.
If the df is empty, the df1[["FISC_QTR_VAL"]][nrow(df1)-11] == startQtr condition returns logical(0), whose length is 0, and hence the condition fails.
If not, it goes to the next if statement and checks for negative and 0 values in the Revenue column. But it does so for all the rows. I want 2014-Q1 excluded in this case, and for future quarters I want the condition to be dynamic as explained above.
Also, I do not want to slice the dataset before the if condition as the code would throw an error if the initial dataframe df1 returns 1 row or 2 rows and we try to slice those further
Thanks
Here's a solution using a few functions from the dplyr and tidyr packages.
Here's a toy data set to work with:
library(dplyr)
library(tidyr)
d <- data.frame(
  FISC_QTR_VAL = c("2015-Q1", "2014-Q2", "2014-Q1", "2015-Q2"),
  Revenue = c(100, 200, 0, 0)
)
d
#> FISC_QTR_VAL Revenue
#> 1 2015-Q1 100
#> 2 2014-Q2 200
#> 3 2014-Q1 0
#> 4 2015-Q2 0
Notice that FISC_QTR_VAL is intentionally out of order (as a precaution).
Next, set variables for the current year and quarter (you'll see why separate in a moment):
current_year <- 2014
current_quarter <- 2
Then run the following:
d %>%
  separate(FISC_QTR_VAL, c("year", "quarter"), sep = "-Q") %>%
  arrange(year, quarter) %>%
  slice(which(year == current_year & quarter == current_quarter):n()) %>%
  filter(Revenue <= 0)
#> year quarter Revenue
#> 1 2015 2 0
First, we separate() the FISC_QTR_VAL into separate year and quarter variables for (a) a tidy data set and (b) a way to order the data in case it's out of order (as in the toy used here). We then arrange() the data so that it's ordered by year and quarter. Then, we slice() away any quarters prior to the current one, and then filter() to return all rows where Revenue <= 0.
To alternatively get, for example, a count of the number of rows that are returned, you can pipe on something like nrow().
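A minimal sketch of that count, reusing the toy data from this answer. The names n_bad and has_bad_revenue are introduced here for illustration, and convert = TRUE is added so that year and quarter become integers instead of strings:

```r
library(dplyr)
library(tidyr)

d <- data.frame(
  FISC_QTR_VAL = c("2015-Q1", "2014-Q2", "2014-Q1", "2015-Q2"),
  Revenue = c(100, 200, 0, 0)
)
current_year <- 2014
current_quarter <- 2

n_bad <- d %>%
  separate(FISC_QTR_VAL, c("year", "quarter"), sep = "-Q", convert = TRUE) %>%
  arrange(year, quarter) %>%
  # drop every quarter before the current one
  slice(which(year == current_year & quarter == current_quarter):n()) %>%
  filter(Revenue <= 0) %>%
  nrow()

# TRUE if any remaining quarter has zero or negative revenue
has_bad_revenue <- n_bad > 0
```

The logical has_bad_revenue can then be used directly in an if condition.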
Is the subset function an option for you?
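exclude.qr <- c("2014-Q1", "2014-Q2")
df <- data.frame(
  FISC_QTR_VAL = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4"),
  Revenue = c(0.00, 299111.86, 174071.98, 257655.30)
)
# %in% tests membership; != would recycle exclude.qr element by element.
# Both conditions go in one expression because subset()'s third argument selects columns.
subset(
  df,
  !(FISC_QTR_VAL %in% exclude.qr) & Revenue > 0
)
You can easily create exclude.qr dynamically, e.g. via paste and years <- 2010:END.
I hope this is helpful!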
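One way the dynamic exclusion list could be built is sketched below. quarters_before() is a hypothetical helper, not an existing function, and it assumes the labels always follow the "YYYY-Qn" pattern so that they sort correctly as strings:

```r
# Hypothetical helper: all quarter labels strictly before start_qtr
quarters_before <- function(start_qtr, first_year = 2010) {
  start_year <- as.integer(substr(start_qtr, 1, 4))
  all_qtrs <- paste0(rep(first_year:start_year, each = 4), "-Q", 1:4)
  all_qtrs[all_qtrs < start_qtr]
}

df <- data.frame(
  FISC_QTR_VAL = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4"),
  Revenue = c(0.00, 299111.86, 174071.98, 257655.30)
)

# Everything before the start quarter is excluded from the check
exclude.qr <- quarters_before("2014-Q2", first_year = 2014)
kept <- subset(df, !(FISC_QTR_VAL %in% exclude.qr))

# TRUE if any remaining quarter has zero or negative revenue
any_bad <- any(kept$Revenue <= 0)
```

Here only 2014-Q1 is excluded, so the zero in that row is ignored and any_bad is FALSE.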

Count the number of previous occurrences using a time window, not a fixed window size

I have a dataset like the following, the last column is desired output.
DX_CD AID date2 <count.occurences.1000.days>
1 272.4 1649 2007-02-10 0 or N/A
2 V58.67 1649 2007-02-10 0<- (excluding the same day). OR 1
3 787.91 1649 2010-04-14 0
4 788.63 1649 2011-03-10 1
5 493.90 4193 2007-09-13 0 or N/A #new AID
6 787.20 6954 2010-02-25 0 or N/A #new AID
.....
I want to compute the column (count.occurences.1000.days) that counts the number of previous occurrences within X days (e.g. X=1000) by AID.
The first value is 0 or N/A because there is no previous record before record #1 for AID=1649. The second value is 0 because this event occurs on the same day as record #1. Third value is 0 because there are records older than 2010-04-14, but they are beyond 1000days. Fourth value is 1 because the record #3 happened within 1000 days. Same logic goes for AID=4193 and AID=6954
Can someone provide an idea, preferably vectorized?
If I understood the question correctly, this should do it.
First, a sample of the data:
df <- data.frame(
  date2 = seq(as.Date("2008-12-30"), as.Date("2015-01-03"), by = "days"),
  AID = sample(c(1649, 4193, 6954, 3466), 2196, replace = TRUE),
  count = rep.int(1, 2196)
)
Now we group by the 1000 days from max to min
df$date.bin <- Hmisc::cut2(df$date2,
cuts=sort(seq(max(df$date2), length=10,by="-1000 days")))
Now we use cumsum on the grouped variables
library(dplyr)
res <- df %>%
  arrange(date.bin, AID) %>%
  group_by(date.bin, AID) %>%
  mutate(cumsum = cumsum(count))
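Note that fixed bins only approximate a window that trails each record. A minimal base R sketch of a per-record count is given below, assuming same-day records are excluded as in the question's second row. count_previous() is a hypothetical helper and is quadratic in the number of records per AID, so it is not fully vectorized:

```r
# Rows copied from the question's example
df <- data.frame(
  DX_CD = c("272.4", "V58.67", "787.91", "788.63", "493.90", "787.20"),
  AID = c(1649, 1649, 1649, 1649, 4193, 6954),
  date2 = as.Date(c("2007-02-10", "2007-02-10", "2010-04-14",
                    "2011-03-10", "2007-09-13", "2010-02-25"))
)

window_days <- 1000

# Hypothetical helper: for each date, count strictly earlier dates
# that fall within `window` days before it
count_previous <- function(dates, window) {
  vapply(dates, function(x) sum(dates < x & dates >= x - window), numeric(1))
}

# ave() applies the helper within each AID and keeps the original row order
df$count_prev <- ave(as.numeric(df$date2), df$AID,
                     FUN = function(d) count_previous(d, window_days))
```

For the example rows this reproduces the desired column: only the fourth record has a previous occurrence within 1000 days.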
