Calculating rolling rates and excluding NULL rows in R

I have a dataset with ~40 variables, with a row for each of 25 areas per quarter; we have data from 2019 Q1 to today, 2022 Q2. For each quarter I am creating a rate (variable / population * 10000) to allow comparison. However, we want each quarter's rate to be based on the preceding four quarters, i.e. the 2022 Q2 rate is based on the sum of the variable for 2022 Q2, 2022 Q1, 2021 Q4 and 2021 Q3. I can calculate this for all the relevant columns using the code below:
library(dplyr)

full_data_rates_pop %>%
  group_by(Area) %>%
  # sum each metric column over the four quarters and convert to a rate per 10,000
  summarise(across(4:21, ~ sum(.x, na.rm = TRUE) / mean(Population_17.24) * 10000)) %>%
  bind_rows(full_data_rates_pop) %>%
  arrange(Area, -Quarter) %>%  # sort so that the total rows come after the quarters
  mutate(Timeframe = if_else(is.na(Quarter), timeframe_value, "Quarterly"))
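If you also need the rate as a true rolling window for every quarter (each quarter plus the three before it), here is a minimal sketch using zoo::rollsumr, reusing the 4:21 column positions and Population_17.24 from the code above; the _rolling_rate naming is just for illustration:

library(dplyr)
library(zoo)

rolling_rates <- full_data_rates_pop %>%
  group_by(Area) %>%
  arrange(Quarter, .by_group = TRUE) %>%
  # rolling sum of the current quarter and the three preceding ones,
  # divided by the current population and scaled per 10,000
  mutate(across(4:21,
                ~ rollsumr(.x, k = 4, fill = NA) / Population_17.24 * 10000,
                .names = "{.col}_rolling_rate")) %>%
  ungroup()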
This does the job for my areas. However, I also want to create regional rates for each time period. Originally I just summed the variable and the population across all areas and created the rates in the same way, but I have realised that for some areas/time periods data is missing, and so the current method produces inaccurate results. For each column, I want to be able to exclude any rows where that column is NULL.
Area  Quarter  Metric_1  Metric_2  Population
A     2022.2   45        89        12000
A     2022.1   58        23        12000
A     2021.4   NULL      64        11000
A     2021.3   20        76        11000
B     2022.2   56        101       9700
B     2022.1   32        78        9700
B     2021.4   41        NULL      10100
B     2021.3   38        NULL      10100
This is a mini dummy version of my data, with just the latest four quarters. I want the new row to be calculated so that each rate uses the sum of all values and the sum of the population, excluding any rows where that metric's value is NULL:
Area  Quarter  Metric_1_rate  Metric_2_rate
ALL   2022.2   38.87          75.08
Is there a way to exclude, for a given column, any rows that have a NULL value in that column, while still using those rows for the other columns where the value is not NULL?
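A minimal sketch of one way to do this, assuming the NULLs are read into R as NA and that the metric columns share a Metric_ prefix: inside summarise, subset the population by each metric's non-NA positions, so every column gets its own denominator:

library(dplyr)

regional_rates <- full_data_rates_pop %>%
  summarise(across(starts_with("Metric_"),
                   # sum the metric over its non-NA rows, and divide by the
                   # population summed over those same rows only
                   ~ sum(.x, na.rm = TRUE) / sum(Population[!is.na(.x)]) * 10000,
                   .names = "{.col}_rate")) %>%
  mutate(Area = "ALL", .before = 1)

On the dummy data above this gives 290 / 74600 * 10000 = 38.87 for Metric_1_rate, matching the expected output.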

Related

Normalize aggregation results according to number of days per month

I have this table:
Month  nbr_of_days  aggregated_sum
1      25           120
2      28           70
3      30           130
4      31           125
My goal here is to normalize the aggregated sum to an assumed value of 30 (nbr_of_days) per month.
So for the first row, for example, the normalized aggregated_sum would be: 30*120/25=144
How to do this in R?
library(dplyr)

df <- df %>%
  mutate(normalized_aggregated_sum = 30 * aggregated_sum / nbr_of_days)

Note: while asking the question, I realized how it could be answered.

Multiplying values in a column of data frame based on the numbers found in other column

I have a data frame in the form shown below:
payment <- c("Annual", "Monthly","Monthly","Monthly","Quarterly", "Semi Annual")
number_pay <- c(7,81,85,79,16,10)
df <- data.frame(payment, number_pay)
      payment number_pay
1      Annual          7
2     Monthly         81
3     Monthly         85
4     Monthly         79
5   Quarterly         16
6 Semi Annual         10
What I want to do is to create a new column and convert all numbers to monthly numbers; for example, for the first row we should have 7 * 12 = 84 months, since an annual payment means a 12-month payment. How can I do this? I was thinking about using ifelse, but this solution doesn't look efficient:
df <- df %>%
  mutate(total = ifelse(payment == "Annual", number_pay * 12,
                 ifelse(payment == "Quarterly", number_pay * 3,
                 ifelse(payment == "Semi Annual", number_pay * 6, number_pay))))
so basically an ifelse inside an ifelse. Is there a better way to do this?
You can specify the conditions with case_when, which is cleaner than multiple nested ifelse statements.
library(dplyr)

df %>%
  mutate(number_pay = case_when(payment == 'Annual' ~ number_pay * 12,
                                payment == 'Quarterly' ~ number_pay * 3,
                                payment == 'Semi Annual' ~ number_pay * 6,
                                TRUE ~ number_pay))
#       payment number_pay
# 1      Annual         84
# 2     Monthly         81
# 3     Monthly         85
# 4     Monthly         79
# 5   Quarterly         48
# 6 Semi Annual         60

Growth Rates in Unbalanced Panel Data

I am trying to compute a growth rate for some variables in unbalanced panel data, but I am still getting results for years in which the lag does not exist.
I've been trying to get the growth rates using the dplyr library, as shown below:
library(dplyr)

total_firmas_growth <- total_firmas %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  # dplyr's lag() (lowercase) takes the previous row within each firm
  mutate(ing_real_growth = (ingresos_real_2 / lag(ingresos_real_2) - 1) * 100)
For instance, if a firm has a value for ingresos_real_2 in 2008 and the next value is in 2012, the code calculates a growth rate instead of returning NA, even though the intervening years are missing (i.e. 2011 is missing, so the 2012 growth rate should be NA). The result I am after, illustrated with firma 115 (id), is right below:
  firma anio ingresos_real_2 ing_real_growth
1   110 2005           14000              NA
2   110 2006           15000            7.14
3   110 2007           13000           -13.3
4   115 2008           15000              NA
5   115 2012           13000              NA
6   115 2013           14000            7.69
I would really appreciate your help.
The easiest way to get your original table into a format with NAs for the missing years is to join it against an all-by-all tibble of the grouping column and the years. tidyr's expand() creates that all-by-all tibble of the variables you are interested in, and {.} takes in whatever was piped, somewhat more robustly than a bare . (by creating a copy, I believe). Since any mathematical operation that includes an NA results in NA, this should get you what you're after if you run your group_by/arrange/mutate code after it.
library(dplyr)
library(tidyr)

total_firmas %>%
  # right_join keeps every firma/anio combination from the expanded grid,
  # so years with no data become NA rows
  right_join(
    expand({.}, firma, anio),
    by = c("firma", "anio")
  )
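Putting it together with the growth-rate code from the question (a sketch, with lag() spelled lowercase as in dplyr):

library(dplyr)
library(tidyr)

total_firmas_growth <- total_firmas %>%
  right_join(expand(total_firmas, firma, anio), by = c("firma", "anio")) %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  # lag() now hits the NA row wherever a year is missing,
  # so the growth rate for the following year becomes NA
  mutate(ing_real_growth = (ingresos_real_2 / lag(ingresos_real_2) - 1) * 100)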

R - replace zero values by average of non-zero ones for fixed categories

I am given a dataset of the following form
year  <- rep(1990:1999, each = 10)
age   <- rep(50:59, 10)
cat1  <- rep(c("A", "B", "C", "D", "E"), each = 100)
value <- rnorm(10 * 10 * 5)
value[c(3, 51, 100, 340, 441)] <- 0
df <- data.frame(year, age, cat1, value)
  year age cat1      value
1 1990  50    A -0.7941799
2 1990  51    A  0.1592270
3 1990  52    A  0.0000000
4 1990  53    A  1.9222384
5 1990  54    A  0.3922259
6 1990  55    A -1.2671957
I would now like to replace any zeros in the value column by the average over the column cat1 of the non-zero entries of value for the corresponding year and age. For example, for year 1990 and age 52, the entry for cat1 = A is zero; it should be replaced by the average of the non-zero entries of the remaining categories for that specific year and age.
As we have
df[df$year==1990 & df$age==52,]
    year age cat1      value
3   1990  52    A  0.0000000
103 1990  52    B -1.1325446
203 1990  52    C -1.6136773
303 1990  52    D  0.5724360
403 1990  52    E  0.2795241
we would replace the entry 0 by
sum(df[df$year==1990 & df$age==52,4])/4
[1] -0.4735654
Is there a nice and clean way to do this in general?
library(data.table)

# flag the zeros as NA, then fill each NA with the group mean of the rest
setDT(df)[value == 0, value := NA_real_]
df[, value := replace(value, is.na(value), mean(value, na.rm = TRUE)), by = .(year, age)]
Arguably 99.9% of operations on tables can be decomposed into basic, fast, optimised primitives: split, aggregation (for numeric data: sum, multiplication, etc.), filter, sort and join.
Here, left_join from dplyr is the way to go.
Just create another data frame, filtered of zeros and aggregated over value with the proper grouping, then substitute the zeros with the values from the newly joined column.
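A minimal sketch of that join approach (the mean_value column name is just for illustration):

library(dplyr)

# per year/age mean of the non-zero values
means <- df %>%
  filter(value != 0) %>%
  group_by(year, age) %>%
  summarise(mean_value = mean(value), .groups = "drop")

# join the means back and substitute wherever value is zero
df <- df %>%
  left_join(means, by = c("year", "age")) %>%
  mutate(value = if_else(value == 0, mean_value, value)) %>%
  select(-mean_value)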

Remove all rows of a category if zero in a % of cases

I have the following dataset of weekly retail data, ordered by Category (e.g. chocolate), Brand (e.g. Cadbury's) and Week (1-208). CBX is a unique global identifier for each brand.
Category  Brand  Week  Sales    Price  CBX
33        2      1     167650.  2.20   33 - 2
33        2      2     168044.  2.18   33 - 2
33        2      3     160770   2.24   33 - 2
I now want to remove the brands that have zero sales in more than 25% of the weeks (i.e. keep only brands with positive sales in at least 156 of the 208 weeks).
At first I deleted all brands with any zero sales using dplyr, but that removed too much of the data. This was the code I used:
library(dplyr)

Final_df_ <- Final_df %>%
  group_by(CBX) %>%
  filter(!any(Sales == 0 & Price == 0)) %>%
  ungroup()
Now I'm trying to change the code so it only deletes all rows belonging to a brand (CBX) if the sales of that brand are zero in more than 25% of the cases.
This is how far I've come:
Final_df_ <- Final_df %>%
  group_by(CBX) %>%
  # mean(Sales == 0) is the share of weeks with zero sales for the brand
  filter(mean(Sales == 0) <= 0.25) %>%
  ungroup()
Thank you!
