Cumsum with panel data: different start dates (R)

I am trying to find the cumsum across different types of contracts. Each has a unique stop (i.e. delivery) date, with several months of expected deliveries leading up to that date. I need to calculate the cumsum of all expected deliveries before the actual delivery date.
For some reason the cumsum/rollsum approach is not working; I have tried both data.table and dplyr versions, but both have failed.
Here is simplified data for the problem I am working on:
df <- data.frame(report_year = c(rep(2017, 10), rep(2018, 10)),
                 report_month = c(seq(1, 5, 1), seq(2, 6, 1), seq(3, 7, 1), seq(2, 6, 1)),
                 delivery_year = c(rep(2017, 10), rep(2018, 10)),
                 delivery_month = c(rep(5, 5), rep(6, 5), rep(7, 5), rep(6, 5)),
                 sum = rep(seq(100, 500, 100), 4),
                 cumsum = rep(c(100, 300, 600, 1000, 1500), 4))
The first five columns are what I currently have. I am trying to get the last column (i.e. cumsum). I am probably doing something wrong. Any help is appreciated.

The question does not specifically define which grouping columns to use, so this may have to be modified slightly depending on what you want, but the following does it without any packages:
df$cumsum <- NULL # remove the result from df shown in question
transform(df, cumsum = ave(sum, delivery_year, delivery_month, FUN = cumsum))
Note that although the above works, you may run into problems using sum and cumsum as column names because they clash with the functions of the same name, so you might want to use Sum and Cumsum, say. For example, if you don't null out cumsum as we did above, then FUN = cumsum will pick up the cumsum column, which is not a function.
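For instance, a minimal sketch of that renaming (Sum and Cumsum are just illustrative names):

# rename the conflicting column, then compute the grouped cumulative sum
names(df)[names(df) == "sum"] <- "Sum"
df$Cumsum <- ave(df$Sum, df$delivery_year, df$delivery_month, FUN = cumsum)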

Use arrange and mutate
# Import library
library(dplyr)
# Calculating cumsum
df %>%
  group_by(delivery_year, delivery_month) %>%
  arrange(sum) %>%
  mutate(cs = cumsum(sum))
Output
report_year report_month delivery_year delivery_month sum cumsum cs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 1 2017 5 100 100 100
2 2017 2 2017 6 100 100 100
3 2018 3 2018 7 100 100 100
4 2018 2 2018 6 100 100 100
5 2017 2 2017 5 200 300 300
6 2017 3 2017 6 200 300 300
7 2018 4 2018 7 200 300 300
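Since the question also mentions a data.table attempt, here is a minimal data.table sketch of the same grouped cumulative sum. It assumes the rows are already ordered by report date within each contract, as in the example data; otherwise call setorder() first.

library(data.table)

dt <- as.data.table(df)
# `sum` here resolves to the column, not the base function, because j is
# evaluated inside the data.table
dt[, cs := cumsum(sum), by = .(delivery_year, delivery_month)]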

Related

Applying own function

I am trying to apply my own function. It takes three arguments, which need to be changed for each subsequent set of columns.
# Data
library(dplyr)
df <- data.frame(
  Year = c("2000","2001","2002","2003","2004","2005","2006","2007","2008","2009"),
  Sales = c(100,200,300,400,500,600,100,300,200,200),
  # Store, Mall and Grocery
  Store = c(100,400,300,800,900,400,800,400,300,100),
  Mall = c(100,600,300,200,200,300,200,500,200,400),
  Grocery = c(100,600,300,200,200,300,200,500,200,400),
  # Building + Store, Mall and Grocery
  Building_Store = c(100,200,300,400,500,600,100,300,200,400),
  Building_Mall = c(100,400,300,800,900,400,800,400,300,600),
  Building_Grocery = c(100,600,300,200,200,300,200,500,200,400))
# Own function
my_function <- function(x, y, z) { x - (y * lag(z)) }
I applied this function with dplyr, using the code below:
estimation <- mutate(df,
                     df_Store = my_function(Store, Sales, Building_Store),
                     df_Mall = my_function(Mall, Sales, Building_Mall),
                     df_Grocery = my_function(Grocery, Sales, Building_Grocery))
In this way I applied the function by changing its arguments manually. In practice, however, I have a huge data set with dozens of such column pairs, and it is not possible to enter them all by hand.
Can someone help me apply the map function (or something similar) to get the same results automatically?
You can try this:
library(dplyr)
library(tidyr)
rename(df,
       Value_Store = Store,
       Value_Mall = Mall,
       Value_Grocery = Grocery) %>%
  pivot_longer(-c(Year, Sales), names_to = c(".value", "name"), names_sep = "_") %>%
  mutate(df = my_function(Value, Sales, Building)) %>%
  pivot_wider(values_from = c(Value, Building, df)) %>%
  select(Year, Sales, starts_with('df'))
# A tibble: 10 × 5
Year Sales df_Store df_Mall df_Grocery
<chr> <dbl> <dbl> <dbl> <dbl>
1 2000 100 NA -9900 -9900
2 2001 200 -19600 -39400 -79400
3 2002 300 -179700 -89700 -89700
4 2003 400 -119200 -159800 -319800
5 2004 500 -99100 -249800 -449800
6 2005 600 -119600 -359700 -239700
7 2006 100 -29200 -9800 -79800
8 2007 300 -59600 -89500 -119500
9 2008 200 -99700 -39800 -59800
10 2009 200 -39900 -79600 -119600
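One caveat: after pivot_longer() the Store, Mall and Grocery rows are interleaved, so the lag() inside my_function looks one row back across series rather than within each column, which is why the 2000 row above is not all NA. If the goal is to reproduce the manual mutate() calls exactly, a purrr-based sketch along these lines may be closer; the cols vector is an assumption based on the example data and would normally be derived from the column names.

library(dplyr)
library(purrr)

# columns to process; for a wide data set they could be derived instead, e.g.
# cols <- sub("^Building_", "", grep("^Building_", names(df), value = TRUE))
cols <- c("Store", "Mall", "Grocery")

# apply my_function column-wise, exactly as in the manual mutate() calls
new_cols <- map(cols, ~ my_function(df[[.x]], df$Sales, df[[paste0("Building_", .x)]]))
names(new_cols) <- paste0("df_", cols)
estimation <- bind_cols(df, as.data.frame(new_cols))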

Index a dataframe with repeated values according to a vector

I am trying to average values in different months over vectors of dates. Basically, I have a dataframe with monthly values of a variable, and I'm trying to get a representative average of the experienced values for samples that sometimes span month boundaries.
I've ended up with a dataframe of monthly values, and vectors of the representative number of "month-year" combinations of every sampling duration (e.g. if a sample was out from Jan 28, 2000 to Feb 1, 2000, the vector would show 4 values of Jan 2000, 1 value of Feb 2000). Later I'm going to average the values with these weights, so it's important that the returned variable values appear in representative numbers.
I am having trouble figuring out how to index the dataframe pulling the representative value repeatedly. See below.
library(dplyr)   # for %>%
library(tibble)  # for tribble()

# data frame of monthly values
reprex_df <-
  tribble(
    ~my, ~value,
    "2000-01", 10,
    "2000-02", 11,
    "2000-03", 15,
    "2000-04", 9,
    "2000-05", 13
  ) %>%
  as.data.frame()
# vector of month-year dates from Jan 28 to Feb 1:
reprex_vec <- c("2000-01","2000-01","2000-01","2000-01","2000-02")
# I want to index the df using the vector to get a return vector of
# January value*4, Feb value*1, or 10, 10, 10, 10, 11
# I tried this:
reprex_df[reprex_df$my %in% reprex_vec,"value"]
# but %in% only returns each value once ("10 11", not "10 10 10 10 11").
# is there a different way I should be indexing to account for repeated values?
# eventually I will take an average, e.g.:
mean(reprex_df[reprex_df$my %in% reprex_vec,"value"])
# but I want this average to equal 10.2 for mean(c(10,10,10,10,11)), not 10.5 for mean(c(10,11))
Simple tidy solution with inner_join:
dplyr::inner_join(reprex_df, data.frame(my = reprex_vec), by = "my")$value
in base R:
merge(reprex_df, list(my = reprex_vec))
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11
Perhaps use match from base R to get the index
reprex_df[match(reprex_vec, reprex_df$my),]
my value
1 2000-01 10
1.1 2000-01 10
1.2 2000-01 10
1.3 2000-01 10
2 2000-02 11
Another base R option using setNames
with(
  reprex_df,
  data.frame(
    my = reprex_vec,
    value = setNames(value, my)[reprex_vec]
  )
)
gives
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11
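Whichever variant is used, the weighted average the question is ultimately after then follows directly; for example, with the match() indexing shown above:

mean(reprex_df$value[match(reprex_vec, reprex_df$my)])
# [1] 10.2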

How to relate two different dataframes to make calculations

I know how to do math/statistics computations with one dataframe. But what happens when I have to deal with two? For example:
> df1
supervisor salesperson
1 Supervisor1 Matt
2 Supervisor2 Amelia
3 Supervisor2 Philip
> df2
month channel Matt Amelia Philip
1 Jan Internet 10 50 20
2 Jan Cellphone 20 60 30
3 Feb Internet 40 40 30
4 Feb Cellphone 30 120 40
How can I compute the sales by supervisor, grouped by channel, in an efficient and generalizable way? Is there a methodology or criterion to follow when you need to relate two or more dataframes in order to compute the data you need?
PS: The numbers are the sales made by each salesperson.
The idea is to convert to long format and merge, using the tidyverse:
library(tidyverse)
df2 %>%
  gather(salesperson, val, -c(1:2)) %>%
  left_join(., df1, by = 'salesperson') %>%
  spread(salesperson, val, fill = 0) %>%
  group_by(channel, supervisor) %>%
  summarise_at(vars(names(.)[4:6]), funs(sum))
which gives,
# A tibble: 4 x 5
# Groups: channel [?]
channel supervisor Amelia Matt Philip
<fct> <fct> <dbl> <dbl> <dbl>
1 Cellphone Supervisor1 0. 50. 0.
2 Cellphone Supervisor2 180. 0. 70.
3 Internet Supervisor1 0. 50. 0.
4 Internet Supervisor2 90. 0. 50.
NOTE: You can also add month to the group_by.
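gather(), spread(), and funs() are superseded in current tidyverse releases; a rough modern sketch of the same idea with pivot_longer(), which keeps one row per channel/supervisor total instead of one column per salesperson, might look like this:

library(dplyr)
library(tidyr)

df2 %>%
  pivot_longer(-c(month, channel), names_to = "salesperson", values_to = "sales") %>%
  left_join(df1, by = "salesperson") %>%
  group_by(channel, supervisor) %>%   # add month here as well if needed
  summarise(sales = sum(sales), .groups = "drop")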

Using custom order to arrange rows after previous sorting with arrange

I know this has already been asked, but I think my issue is a bit different (never mind that it is in Portuguese).
I have this dataset:
df <- cbind(c(rep(2012, 6), rep(2016, 6)),
            rep(c('Emp.total',
                  'Fisicas.total',
                  'Outros.total',
                  'Politicos.total',
                  'Receitas.total',
                  'Proprio.total'), 2),
            runif(12, 0, 1))
colnames(df) <- c('Year', 'Variable', 'Value')
I want to order the rows so that everything with the same year is grouped together. Within each year, I want the Variable column to be ordered like this:
Receitas.total
Fisicas.total
Emp.total
Politicos.total
Proprio.total
Outros.total
I know I could use arrange() from dplyr to sort by the year. However, I do not know how to combine this with any routine using factor and order without messing up the previous ordering by year.
Any help? Thank you
We create a custom order by converting 'Variable' into a factor with levels specified in the desired order:
library(dplyr)
df %>%
  arrange(Year, factor(Variable, levels = c('Receitas.total',
                                            'Fisicas.total', 'Emp.total', 'Politicos.total',
                                            'Proprio.total', 'Outros.total')))
# A tibble: 12 x 3
# Year Variable Value
# <dbl> <chr> <dbl>
# 1 2012 Receitas.total 0.6626196
# 2 2012 Fisicas.total 0.2248911
# 3 2012 Emp.total 0.2925740
# 4 2012 Politicos.total 0.5188971
# 5 2012 Proprio.total 0.9204438
# 6 2012 Outros.total 0.7042230
# 7 2016 Receitas.total 0.6048889
# 8 2016 Fisicas.total 0.7638205
# 9 2016 Emp.total 0.2797356
#10 2016 Politicos.total 0.2547251
#11 2016 Proprio.total 0.3707349
#12 2016 Outros.total 0.8016306
data
set.seed(24)
df <- data_frame(Year = c(rep(2012, 6), rep(2016, 6)),
                 Variable = rep(c('Emp.total',
                                  'Fisicas.total',
                                  'Outros.total',
                                  'Politicos.total',
                                  'Receitas.total',
                                  'Proprio.total'), 2),
                 Value = runif(12, 0, 1))
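The question mentions factor and order; for completeness, the same custom ordering can be done in base R (a sketch, assuming the df defined just above):

lv <- c('Receitas.total', 'Fisicas.total', 'Emp.total',
        'Politicos.total', 'Proprio.total', 'Outros.total')
df[order(df$Year, match(df$Variable, lv)), ]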

tapply based on multiple indexes in R

I have a data frame, much like this one:
ref <- rep(c("A", "B"), each = 240)
year <- rep(rep(2014:2015, each = 120), 2)
month <- rep(rep(1:12, each = 10), 4)
values <- c(rep(NA, 200), rnorm(100, 2, 1), rep(NA, 50), rnorm(40, 4, 2), rep(NA, 90))
DF <- data.frame(ref, year, month, values)
I would like to compute the maximum number of consecutive NAs per reference, per year.
I have created a function which works out the maximum number of consecutive NAs, but it can only be applied by one grouping variable at a time.
For example,
func <- function(x) {
  max(rle(is.na(x))$lengths)
}
with(DF, tapply(values, ref, func))
# A B
# 200 90
with(DF, tapply(values,year, func))
# 2014 2015
# 120 90
So there is a maximum of 200 consecutive NAs in ref A, and a maximum of 90 in ref B, which is correct. There is also a maximum of 120 consecutive NAs in 2014, and 90 in 2015.
What I'd like is a result per ref and year, such as:
A 2015 80
A 2014 120
B 2015 90
B 2014 50
There are multiple ways of doing this; one is with the plyr library:
library(plyr)
ddply(DF,c('ref','year'),summarise,NAs=max(rle(is.na(values))$lengths))
ref year NAs
1 A 2014 120
2 A 2015 80
3 B 2014 60
4 B 2015 90
Using your function, you could also try:
with(DF, tapply(values,list(ref,year), func))
which gives a slightly different output
2014 2015
A 120 80
B 60 90
By using melt() you can, however, get to the same dataframe.
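For example, a small sketch of that reshaping step (assuming reshape2, which melt() usually comes from in this context):

library(reshape2)

tab <- with(DF, tapply(values, list(ref, year), func))
melt(tab, varnames = c("ref", "year"), value.name = "NAs")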
Very similar to the tapply solution above. I find aggregate gives a better output than tapply, though.
with(DF, aggregate(list(Value = values), list(Year = year, ref = ref), func))
Year ref Value
1 2014 A 120
2 2015 A 80
3 2014 B 60
4 2015 B 90
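One caveat on func itself: max(rle(is.na(x))$lengths) takes the longest run of either NAs or non-NAs, which is why B/2014 comes out as 60 above rather than the 50 in the desired output. A sketch that restricts the maximum to the NA runs (func_na is just an illustrative name):

# longest run of NAs only (0 if there are none)
func_na <- function(x) {
  r <- rle(is.na(x))
  max(r$lengths[r$values], 0)
}

with(DF, tapply(values, list(ref, year), func_na))
#   2014 2015
# A  120   80
# B   50   90

The recipe below happens to return 50 for B/2014 because each ref/year group here contains a single block of NAs, so the total NA count equals the longest run.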
I like the recipe format
library(dplyr)
DF$values[is.na(DF$values)] <- 1
DF %>%
  filter(values == 1) %>%
  group_by(ref, year) %>%
  mutate(csum = cumsum(values)) %>%
  group_by(ref, year) %>%
  summarise(max(csum))
Source: local data frame [4 x 3]
Groups: ref [?]
ref year max(csum)
(fctr) (int) (dbl)
1 A 2014 120
2 A 2015 80
3 B 2014 50
4 B 2015 90

Resources