Creating a monthly growth rate column for different factors in R - r

Suppose I have the following data
set.seed(123)
Company <- c(rep("Company 1",5),rep("Company 2",10))
Dates <- c(seq(as.Date("2014-09-01"), as.Date("2015-01-01"), by="months"),
seq(as.Date("2011-09-01"), as.Date("2012-06-01"), by="months"))
X.1 <- sample(c(0,0,5,10,15,20,25,30,40,50),size=15,replace=TRUE)
X.2 <- sample(c(0,0,5,10,15,20,25,30,40,50),size=15,replace=TRUE)
df <- data.frame(Dates,Company,X.1,X.2)
Dates Company X.1 X.2
1 2014-09-01 Company 1 50 0
2 2014-10-01 Company 1 50 5
3 2014-11-01 Company 1 25 15
4 2014-12-01 Company 1 30 5
5 2015-01-01 Company 1 0 40
6 2011-09-01 Company 2 15 0
7 2011-10-01 Company 2 30 15
8 2011-11-01 Company 2 5 30
9 2011-12-01 Company 2 10 0
10 2012-01-01 Company 2 5 20
11 2012-02-01 Company 2 0 5
12 2012-03-01 Company 2 15 0
13 2012-04-01 Company 2 15 30
14 2012-05-01 Company 2 10 40
15 2012-06-01 Company 2 0 10
What I am trying to do is find monthly growth rates for variables X.1 and X.2
within each company and bind those columns to the right side of the dataframe. The problem here is that the date ranges for each Company are different, which is why I am having trouble with this. Also, since I have 0s in the data Inf and NAs are okay.
Thanks for your help.

#I computed the growth using log: growth=log(X(t)/X(t-1)). If you want to compute using (X(t)-X(t-1))/X(t-1), you can just use that. Also, for the first period, it will be NA.
#Assumption: the data are equally spaced for each company. You get Inf, if your last period value is 0 and -Inf if your current period value is 0 (because we used log). The growth will be 0 (when current period is zero) if you don't use log (see second method)
library(dplyr)
df%>%
group_by(Company)%>%
mutate(gx_1=log(X.1/lag(X.1,1)),gx_2=log(X.1/lag(X.2,1))
)
Source: local data frame [15 x 6]
Groups: Company
Dates Company X.1 X.2 gx_1 gx_2
1 2014-09-01 Company 1 5 40 NA NA
2 2014-10-01 Company 1 30 5 1.7917595 -0.2876821
3 2014-11-01 Company 1 15 0 -0.6931472 1.0986123
4 2014-12-01 Company 1 40 10 0.9808293 Inf
5 2015-01-01 Company 1 50 50 0.2231436 1.6094379
6 2011-09-01 Company 2 0 40 NA NA
7 2011-10-01 Company 2 20 25 Inf -0.6931472
8 2011-11-01 Company 2 40 25 0.6931472 0.4700036
9 2011-12-01 Company 2 20 50 -0.6931472 -0.2231436
10 2012-01-01 Company 2 15 25 -0.2876821 -1.2039728
11 2012-02-01 Company 2 50 30 1.2039728 0.6931472
12 2012-03-01 Company 2 15 20 -1.2039728 -0.6931472
13 2012-04-01 Company 2 25 20 0.5108256 0.2231436
14 2012-05-01 Company 2 20 5 -0.2231436 0.0000000
15 2012-06-01 Company 2 0 0 -Inf -Inf
#without using log , i.e.
df%>%
group_by(Company)%>%
mutate(gx_1=((X.1-lag(X.1,1))/lag(X.1,1)),gx_2=((X.2-lag(X.2,1))/lag(X.2,1)))

Related

Cumulative sums in R with multiple conditions?

I am trying to figure out how to create a cumulative or rolling sum in R based on a few conditions.
The data set in question is a few million observations of library loans, and the question is to determine how many copies of a given book/title would be necessary to meet demand.
So for each Title.ID, begin with 1 copy for the first instance (ID.Index). Then for each instance after, determine whether another copy is needed based on whether the REQUEST.DATE is within 16 weeks (112 days) of the previous request.
# A tibble: 15 x 3
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index
<date> <int> <int>
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5
The tricky part is that determining whether a new copy is needed is based not only on the number of request (ID.Index) and the REQUEST.DATE of some previous loan, but also on the preceding accumulating sum.
For instance, for the third request for title 2 (Title.ID 2, ID.Index 3), there are now two copies, so to determine whether a new copy is needed, you have to see whether the REQUEST.DATE is within 112 days of the first (not second) request (ID.Index 1). By contrast, for the third request for title 6 (Title.ID 6, ID.Index 3), there is only one copy available (since request 2 was not within 112 days), so determining whether a new copy is needed is based on looking back to the REQUEST.DATE of ID.Index 2.
The desired output ("Copies") would take each new request (ID.Index), then look back to the relevant REQUEST.DATE based on the number of available copies, and doing that would mean looking at the accumulating sum for the preceding calculation. (Note: The max number of copies would be 10.)
I've provided the desired output for the sample below ("Copies").
# A tibble: 15 x 4
# Groups: Title.ID [2]
REQUEST.DATE Title.ID ID.Index Copies
<date> <int> <int> <dbl>
1 2013-07-09 2 1 1
2 2013-08-07 2 2 2
3 2013-08-20 2 3 3
4 2013-09-08 2 4 4
5 2013-09-28 2 5 5
6 2013-12-27 2 6 5
7 2014-02-10 2 7 5
8 2014-03-12 2 8 5
9 2014-03-14 2 9 5
10 2014-08-27 2 10 5
11 2014-04-27 6 1 1
12 2014-08-01 6 2 2
13 2014-11-13 6 3 2
14 2015-02-14 6 4 2
15 2015-05-14 6 5 2
>
I recognize that the solution will be way beyond my abilities, so I will be extremely grateful for any solution or advice about how to solve this type of problem in the future.
Thanks a million!
*4/19 update: new examples where new copy may be added after delay, i.e., not in sequence. I've also added columns showing days since a given previous request, which helps checking whether a new copy should be added, based on how many copies there are.
Sample 2: new copy should be added with third request, since it has only been 96 days since last request (and there is only one copy)
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10680332 2013-10-17 6 1 NA days NA days NA days NA days NA days 1
2 PEN-10835735 2014-04-27 6 2 192 days NA days NA days NA days NA days 1
3 PEN-10873506 2014-08-01 6 3 96 days 288 days NA days NA days NA days 1
4 PEN-10951264 2014-11-13 6 4 104 days 200 days 392 days NA days NA days 1
5 PEN-11029526 2015-02-14 6 5 93 days 197 days 293 days 485 days NA days 1
6 PEN-11106581 2015-05-14 6 6 89 days 182 days 286 days 382 days 574 days 1
Sample 3: new copy should be added with last request, since there are two copies, and the oldest request is 45 days.
REQUEST.NUMBER REQUEST.DATE Title.ID ID.Index Days.Since Days.Since2 Days.Since3 Days.Since4 Days.Since5 Copies
<fct> <date> <int> <int> <drtn> <drtn> <drtn> <drtn> <drtn> <int>
1 BRO-10999392 2015-01-20 76 1 NA days NA days NA days NA days NA days 1
2 YAL-11004302 2015-01-22 76 2 2 days NA days NA days NA days NA days 2
3 COR-11108471 2015-05-18 76 3 116 days 118 days NA days NA days NA days 2
4 HVD-11136632 2015-07-27 76 4 70 days 186 days 188 days NA days NA days 2
5 MIT-11164843 2015-09-09 76 5 44 days 114 days 230 days 232 days NA days 2
6 HVD-11166239 2015-09-10 76 6 1 days 45 days 115 days 231 days 233 days 2
You can use runner package to apply any R function on cumulative window.
This time we execute function f using x = REQUEST.DATE. We just count number of observations which are within min(x) + 112.
library(dplyr)
library(runner)
data %>%
group_by(Title.ID) %>%
mutate(
Copies = runner(
x = REQUEST.DATE,
f = function(x) {
length(x[x <= (min(x + 112))])
}
)
)
# # A tibble: 15 x 4
# # Groups: Title.ID [2]
# REQUEST.DATE Title.ID ID.Index Copies
# <date> <int> <int> <int>
# 1 2013-07-09 2 1 1
# 2 2013-08-07 2 2 2
# 3 2013-08-20 2 3 3
# 4 2013-09-08 2 4 4
# 5 2013-09-28 2 5 5
# 6 2013-12-27 2 6 5
# 7 2014-02-10 2 7 5
# 8 2014-03-12 2 8 5
# 9 2014-03-14 2 9 5
# 10 2014-08-27 2 10 5
# 11 2014-04-27 6 1 1
# 12 2014-08-01 6 2 2
# 13 2014-11-13 6 3 2
# 14 2015-02-14 6 4 2
# 15 2015-05-14 6 5 2
data
data <- read.table(
text = " REQUEST.DATE Title.ID ID.Index
1 2013-07-09 2 1
2 2013-08-07 2 2
3 2013-08-20 2 3
4 2013-09-08 2 4
5 2013-09-28 2 5
6 2013-12-27 2 6
7 2014-02-10 2 7
8 2014-03-12 2 8
9 2014-03-14 2 9
10 2014-08-27 2 10
11 2014-04-27 6 1
12 2014-08-01 6 2
13 2014-11-13 6 3
14 2015-02-14 6 4
15 2015-05-14 6 5",
header = TRUE)
data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))
I was able to find a workable solution based on finding the max number of other requests within 112 days of a request (after creating return date), for each title.
data$RETURN.DATE <- as.Date(data$REQUEST.DATE + 112)
data <- data %>%
group_by(Title.ID) %>%
mutate(
Copies = sapply(REQUEST.DATE, function(x)
sum(as.Date(REQUEST.DATE) <= as.Date(x) &
as.Date(RETURN.DATE) >= as.Date(x)
))
)
Then I de-duplicated the list of titles, using the max number for each title, and added it back to the original data.
I still think there's a solution to the original problem, where I could go back and see at which point new copies needed to be added (for analysis based on when a title is published), but this works for now.

Group records with time interval overlap

I have a data frame (with N=16) contains ID (character), w_from (date), and w_to (date). Each record represent a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number by ID for the records that their time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 overlap.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, record#3 are all overlap.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(w_from) %>%
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2

Conditional aggregation by regarding missing data in R

I am trying to aggregate hourly data to daily data in R. The problem is missing values. I want to consider a threshold for the number of missing values before the aggregation. If the number of missing values is more than two in the given day, DO NOT compute the daily average and fill that day with NA.
My dummy data are daily data for the first day of 2005.
day hour amount amount2
1 2005-01-01 0 1 1
2 2005-01-01 1 2 NA
3 2005-01-01 2 4 4
4 2005-01-01 3 5 5
5 2005-01-01 4 11 NA
6 2005-01-01 5 4 NA
7 2005-01-01 6 NA NA
8 2005-01-01 7 2 2
9 2005-01-01 8 4 4
10 2005-01-01 9 2 2
11 2005-01-01 10 4 20
12 2005-01-01 11 12 12
13 2005-01-01 12 13 13
14 2005-01-01 13 7 7
15 2005-01-01 14 4 4
16 2005-01-01 15 12 12
17 2005-01-01 16 4 4
18 2005-01-01 17 12 12
19 2005-01-01 18 5 5
20 2005-01-01 19 11 11
21 2005-01-01 20 4 4
22 2005-01-01 21 12 12
23 2005-01-01 22 13 13
24 2005-01-01 23 7 7
What I already got
agg
day amount amount2
1 2005-01-01 6.9 7.7
what I want to have
agg
day amount amount2
1 2005-01-01 6.9 NA
Because the number of missing values in the column amount2 is more than two, I want its daily average filled by NA (not 7.7) while it is calculated for the column amount (6.9).
I have used the function "aggregate" from "stats" library.
library(stats)
amount=(c(1,2,4,5,11,4,NA,2,4,2,4,12,13,7,4,12,4,12,5,11,4,12,13,7))
amount2=(c(1,NA,4,5,NA,NA,NA,2,4,2,20,12,13,7,4,12,4,12,5,11,4,12,13,7))
day=c("2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01","2005-01-01",
"2005-01-01","2005-01-01","2005-01-01","2005-01-01")
hour=seq(0,23)
date=data.frame(day,hour)
dummy=cbind(date,amount,amount2)
agg <- aggregate(cbind(amount,amount2) ~ day, dummy, mean)

Calculating yearly growth-rates from quarterly, long form data in r

My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is to calculate the growth rate from the first quarter each year to the first quarter the second year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A, in the USA, it would be (24/20)-1=0.2
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But didn't have the r-skills to get it to work if the lag is more then one time-unit. Any suggestions?
ADDITION
So what i want is the growth-rate, that is (24/20)-1=0.2 in the example below. Not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (quarter / 5-period-lagged quarter) - 1
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
df3 = copy(df)
df3$Quarter = df3$Quarter - 4
df = merge(df,df3,c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = T)
df$growth = 1 - (df$Income_prev/df$Income
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 24 -4
2 A USA 2 21 25 -4
3 A USA 3 22 26 -4
4 A USA 4 23 27 -4
5 A USA 5 24 NA NA
6 A USA 6 25 NA NA
7 A USA 7 26 NA NA
8 A USA 8 27 NA NA
9 A UK 1 32 36 -4
10 A UK 2 33 37 -4
11 A UK 3 34 38 -4
12 A UK 4 35 39 -4
13 A UK 5 36 NA NA
14 A UK 6 37 NA NA
15 A UK 7 38 NA NA
16 A UK 8 39 NA NA
17 B USA 1 28 32 -4
18 B USA 2 29 33 -4
19 B USA 3 30 34 -4
20 B USA 4 31 35 -4
21 B USA 5 32 NA NA
22 B USA 6 33 NA NA
23 B USA 7 34 NA NA
24 B USA 8 35 NA NA
25 B UK 1 40 44 -4
26 B UK 2 41 45 -4
27 B UK 3 42 46 -4
28 B UK 4 43 47 -4
29 B UK 5 44 NA NA
30 B UK 6 45 NA NA
31 B UK 7 46 NA NA
32 B UK 8 47 NA NA
>

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers from these two previous posts but have not been able to address my specific problem in R; multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' after grouping by 'event_type'. Here, I am using data.table approach by converting the 'data.frame' to 'data.table' (setDT(df)), grouped by 'event_type', we get the diff of 'date'.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or as #Frank mentioned in the comments, we can also use shift (from version v1.9.5+ onwards) to get the lag (by default, the type='lag') of 'date' and subtract from the 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, #akrun has posted the data.tables approach, the parallel dplyr approach would be straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days

Resources