Survival dataset split by year starting at January 1st - r

I have the following survival dataset and would like to split each interval at January 1st of every year. For example, for person ID 1220, I would make the splits at 1912-01-01, 1913-01-01, 1914-01-01, and 1915-01-01. I tried survSplit, but its cut argument only accepts a numeric vector. Is there another way to do this?
In the dataset below, time = EndDate - StartDate. Here is what I have so far:
test.ts <- survSplit(Surv(time, censor) ~ .,
data = test,
cut = seq(0, 1826.25, 365.25),
episode = "tgroup")
but that only splits at each full year of follow-up time, not at calendar-year boundaries.
ID EndDate StartDate censor time status
1 1220 1915-03-01 1911-10-04 1 1244 Alive
3 4599 1906-02-15 1903-05-16 1 1006 Alive
4 6375 1899-04-10 1896-10-27 1 895 Alive
6 6386 1929-10-05 1922-01-26 0 1826 Outmigrated
7 6389 1933-12-08 1929-10-05 1 1525 Outmigrated
8 6390 1932-01-17 1927-07-24 1 1638 Dead 0-4 yrs

Not sure I understood what you wanted, but if you want to replicate the information in your data frame for each year in the range from StartDate to EndDate, you can do:
library(tidyverse)
library(lubridate)
df %>%
as_tibble() %>%
mutate(
RangeYear = map2(StartDate, EndDate, function(start, end) {
start <- `if`(day(start) == 1 && month(start) == 1,
year(start),
year(start) + 1)
seq(start, year(end))
})
) %>%
unnest(RangeYear)
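To cut the intervals themselves at each January 1st, a minimal base-R sketch is to build the cut dates per row and stack the resulting pieces. The helper name split_at_new_year is hypothetical, and the sketch assumes censor is the event indicator passed to Surv(), so only the last piece of each person keeps the original value.

```r
# Split each [StartDate, EndDate] interval at every January 1st that falls
# strictly inside it (base R only; column names follow the question).
split_at_new_year <- function(df) {
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    start <- df$StartDate[i]
    end   <- df$EndDate[i]
    years <- as.integer(format(start, "%Y")):as.integer(format(end, "%Y"))
    cuts  <- as.Date(paste0(years, "-01-01"))
    cuts  <- cuts[cuts > start & cuts < end]
    n     <- length(cuts) + 1
    data.frame(
      ID        = df$ID[i],
      StartDate = c(start, cuts),
      EndDate   = c(cuts, end),
      # assumption: the event can only happen in the last piece
      censor    = c(rep(0, n - 1), df$censor[i])
    )
  })
  out <- do.call(rbind, pieces)
  out$time <- as.numeric(out$EndDate - out$StartDate)
  out
}

# Two rows taken from the question's data
test <- data.frame(
  ID        = c(1220, 4599),
  StartDate = as.Date(c("1911-10-04", "1903-05-16")),
  EndDate   = as.Date(c("1915-03-01", "1906-02-15")),
  censor    = c(1, 1)
)
split_test <- split_at_new_year(test)
split_test
```

The pieces for each ID add up to the original time value, which is a quick check that no days were lost at the cut points.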

Related

Manipulating data for Regression Model using dplyr in R

I have data like this.
library(lubridate)
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
hourx <- hour(gen_date)
datex <- date(gen_date)
sales <- round(runif(length(datex), 10, 50), 0)*100
mydata <- data.frame(datex, hourx, sales)
head(mydata)
# datex hourx sales
#1 2021-01-01 0 2800
#2 2021-01-01 1 4100
#3 2021-01-01 2 3800
#4 2021-01-01 3 2500
#5 2021-01-01 4 3500
#6 2021-01-01 5 3800
tail(mydata)
# datex hourx sales
#6547 2021-09-30 18 3900
#6548 2021-09-30 19 3600
#6549 2021-09-30 20 3000
#6550 2021-09-30 21 4700
#6551 2021-09-30 22 4700
#6552 2021-09-30 23 3600
My task is to fit a linear regression model, but the data are tricky. Assume we have data from January to March and need to use them to forecast April. Here are the steps:
We use the January and February data as independent variables (X) and the March data as the dependent variable (Y) to build the regression model. Because February has the fewest days (28), we cut the January and March data to 28 days too.
data_jan <- mydata[1:672,]
data_feb <- mydata[745:1416,]
data_mar <- mydata[1417:2088,]
Modelling Regression using lm function
mydata_reg <- data.frame(x1 = data_jan$sales,
x2 = data_feb$sales,
y = data_mar$sales)
model_reg <- lm(y~., data = mydata_reg)
After fitting the model, we use the February and March data as the new independent variables (X):
mydata_reg_for <- data.frame(x1 = data_feb$sales,
x2 = data_mar$sales)
pred_data_apr <- predict(model_reg, newdata = mydata_reg_for)
Check the length of the month. April has 30 days but we only have 28 days of forecasts, so we need 2 more days of data to complete the forecast. February only has 28 days, so for x1 we use the first two dates of March, "2021-03-01" and "2021-03-02". March has 31 days, so for x2 we simply add "2021-03-29" and "2021-03-30".
data_feb_add <- mydata[1417:1464,]
data_mar_add <- mydata[2089:2136,]
mydata_reg_add <- data.frame(x1 = data_feb_add$sales,
x2 = data_mar_add$sales)
After that, we predict with the same model_reg as before and combine all the April forecasts.
pred_data_apr_add <- predict(model_reg, newdata = mydata_reg_add)
data_apr <- c(as.numeric(pred_data_apr), as.numeric(pred_data_apr_add))
My question is: how can we make this process run automatically every month using the dplyr package, given that every month has a different number of days? I cut to February's length here because it has the fewest days; the same rule applies to other months. Many thanks.
If you want to control the number of days after each month (or in each month) you could filter by the date not the row numbers.
I'm sure it can be tidied up more than this, but you would just need to change forcast_date <- as.Date("2021-04-01") to whichever month you want to forecast.
##set the forecast month. This should be straightforward to automate with a list or an increment
forcast_date <- as.Date("2021-04-01") # April
##get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forcast_month_length <- days_in_month(forcast_date) #30 days
##get dates for the previous 3 months
month_1_date <- forcast_date %m-% months(3)
month_2_date <- forcast_date %m-% months(2)
month_3_date <- forcast_date %m-% months(1)
##find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date),
days_in_month(month_2_date),
days_in_month(month_3_date))) #28 days
##select the first 28 days (the shortest month) for each of the months used for the variables
data_month_1 <- mydata[mydata$datex %in% month_1_date:(month_1_date + shortest_month - 1),]
data_month_2 <- mydata[mydata$datex %in% month_2_date:(month_2_date + shortest_month - 1),]
data_month_3 <- mydata[mydata$datex %in% month_3_date:(month_3_date + shortest_month - 1),]
##select the number of days needed for each month for the forecast data (30 days for april)
month_2_forecast_length <- mydata[mydata$datex %in% month_2_date:(month_2_date + forcast_month_length - 1),]
month_3_forecast_length <- mydata[mydata$datex %in% month_3_date:(month_3_date + forcast_month_length - 1),]
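The steps above can be wrapped into one function so that only the forecast month changes. This is a sketch in base R rather than a definitive implementation: the helper names days_in, take_days and forecast_month are hypothetical, and the data are regenerated here with base functions so the example runs on its own.

```r
# Number of days in the month that starts on first_day
days_in <- function(first_day) {
  next_month <- seq(first_day, by = "month", length.out = 2)[2]
  as.integer(next_month - first_day)
}

# Hourly sales for n_days consecutive days starting at first_day
take_days <- function(data, first_day, n_days) {
  wanted <- seq(first_day, by = "day", length.out = n_days)
  data$sales[data$datex %in% wanted]
}

forecast_month <- function(data, forecast_date) {
  # first days of the three months before the forecast month, oldest first
  m <- rev(seq(forecast_date, by = "-1 month", length.out = 4))[1:3]
  shortest <- min(sapply(seq_along(m), function(i) days_in(m[i])))
  train <- data.frame(
    x1 = take_days(data, m[1], shortest),
    x2 = take_days(data, m[2], shortest),
    y  = take_days(data, m[3], shortest)
  )
  model <- lm(y ~ ., data = train)
  # predictors cover as many days as the forecast month has
  n_out <- days_in(forecast_date)
  newdata <- data.frame(
    x1 = take_days(data, m[2], n_out),
    x2 = take_days(data, m[3], n_out)
  )
  as.numeric(predict(model, newdata = newdata))
}

# Same shape as the question's data, built without lubridate
set.seed(2021)
gen_time <- seq(as.POSIXct("2021-01-01 00:00", tz = "UTC"),
                as.POSIXct("2021-09-30 23:00", tz = "UTC"), by = "hour")
mydata <- data.frame(datex = as.Date(gen_time),
                     hourx = as.integer(format(gen_time, "%H")),
                     sales = round(runif(length(gen_time), 10, 50)) * 100)

pred_apr <- forecast_month(mydata, as.Date("2021-04-01"))
length(pred_apr)  # one value per hour of April
```

One caveat with this rule: when the forecast month is longer than the month before it (for example forecasting May from April), the x2 window runs into the forecast month itself, so in a real setting those days would need a different source.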
You can simply split the data by month with group_split:
mydata %>%
group_split(month(datex))
This splits mydata into a list with one data frame per month present in the data (nine here, January to September).

How can I calculate log returns for a financial year different from calendar year in R?

I want to calculate log returns for a stock in R. The issue is that my financial year is from April 1 to March 31. I have tried using packages tidyquant and tidyverse. The code I have tried is as follows:
library(tidyquant)
RIL<- tq_get("RELIANCE.NS") # download the stock price data of Reliance Industries Limited listed on NSE of India. The data is from January 2011 to May 2021.
library(tidyverse)
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1)) # This creates a new variable called FinYear, which holds the financial year. If the month is after March (Month > 3), the financial year is the calendar year + 1.
RIL_Returns<- RIL1 %>%
group_by(FinYear) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "log") #This part of the code has the problem.
From this code, I get two values for log returns per each year. This can't be true. I want a table with columns FinYear and Log_Returns, where Log_Returns is defined as ln(adjusted close price for the last trading day of given FinYear/adjusted close price for the first trading day of the given FinYear). How can I do this?
Perhaps this is not the most elegant solution, but I think it works: I obtained the first and last trading day of each year manually and computed the log returns accordingly.
# Get data
library("tibble")
library("tidyquant")
RIL<- tq_get("RELIANCE.NS")
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1))
# Get minimum and max dates in each year
start_dates = c()
end_dates = c()
for(year in format(min(RIL1$date),"%Y"):format(max(RIL1$date),"%Y")){
start_dates =
c(start_dates,
min(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
end_dates =
c(end_dates,
max(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
}
# Get filtered data
RIL2 <- RIL1[(RIL1$date %in% start_dates | RIL1$date %in% end_dates),]
# Get log returns, even indexes represent end of each year rows
end_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 == 0]
beginning_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 != 0]
log_returns = log(end_adjusted/beginning_adjusted)
# Put log returns and years in a tibble.
result = tibble(log_returns ,format(RIL2$date[1:length(RIL2$date) %% 2 == 0], "%Y"))
# Result
result
Outputs
# A tibble: 11 x 2
log_returns `format(RIL2$date[1:length(RIL2$date)%%2 == 0],…
<dbl> <chr>
1 -0.412 2011
2 0.185 2012
3 0.0739 2013
4 0.0117 2014
5 0.145 2015
6 0.0743 2016
7 0.537 2017
8 0.215 2018
9 0.306 2019
10 0.287 2020
11 0.0973 2021
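Note that the loop above groups by calendar year. For the April to March financial year asked about in the question, a minimal sketch is to group on FinYear and take the log of last over first adjusted price within each group. The prices below are a small synthetic stand-in for the downloaded data (an assumption, so the example runs offline), and the column names follow the question.

```r
# Synthetic stand-in for the downloaded prices (not real market data)
prices <- data.frame(
  date     = as.Date(c("2019-04-01", "2019-12-31", "2020-03-31",
                       "2020-04-01", "2020-12-31", "2021-03-31")),
  adjusted = c(100, 110, 120, 125, 150, 200)
)

# Financial year: January to March belong to the year ending that March
cal_year <- as.integer(format(prices$date, "%Y"))
month    <- as.integer(format(prices$date, "%m"))
prices$FinYear <- ifelse(month < 4, cal_year, cal_year + 1L)

# Sort by date so first/last within a group are the first/last trading days
prices <- prices[order(prices$date), ]
log_ret <- tapply(prices$adjusted, prices$FinYear,
                  function(p) log(p[length(p)] / p[1]))

result <- data.frame(FinYear     = as.integer(names(log_ret)),
                     Log_Returns = as.numeric(log_ret))
result
```

With real data, the first and last financial years are usually incomplete, so their returns cover only part of a year.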

R - Split quarterly data into monthly data using R

Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Row 1 of data set A will be split into rows 1, 2 and 3 of data set B for June, July and August, with the sale allocated pro rata by the number of days in each month; all other columns stay the same.
Row 2 of data set A picks up where row 1 left off (row 1 ends on 5/9/2017) to form a complete September.
Is there an efficient way to do this? The actual data is a CSV file of about 100K x 15, which will be split into a new data set of approximately 300K x 15 for monthly analysis.
Some key characteristics of the sample data:
The start day of the first quarterly sales record is the day the customer joins, so it could be any day;
All sales are quarterly, covering 90, 91 or 92 days, but incomplete quarters are also possible when a customer leaves during the quarter.
Sample Question:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commercial 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
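The per-row logic in the loop above can also be written as a small standalone function, which makes the inclusive day count easy to check against the sample answer. This is a base-R sketch; the function name split_by_month is hypothetical.

```r
# Split one [start, end] interval into calendar months and spread the
# sale pro rata over the days, counting both the first and the last day.
split_by_month <- function(start, end, sale) {
  first_of_month <- as.Date(format(start, "%Y-%m-01"))
  month_starts   <- seq(first_of_month, end, by = "month")
  # each month ends the day before the next one starts; the last ends at `end`
  month_ends     <- c(month_starts[-1] - 1, end)
  month_starts[1] <- start
  days <- as.integer(month_ends - month_starts) + 1L
  data.frame(start = month_starts,
             end   = month_ends,
             days  = days,
             sale  = sale * days / sum(days))
}

# First row of the sample question: 7/06/2017 to 5/09/2017, sale 91
res <- split_by_month(as.Date("2017-06-07"), as.Date("2017-09-05"), 91)
res
```

For this row the day counts come out as 24, 31, 31 and 5, which sum to the 91 days of the original record.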
It's not pretty, as it uses multiple functions and loops, because it consists of multiple operations:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I managed to get the answer with a small change: I replaced "rbind" with "rbind.fill" from the "plyr" package, and everything runs smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the code below:
library(tidyr)
#read data and change the start and finish to data type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop

R: converting start/end dates into data series

I have the following data frame representing user subscriptions:
User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25
...
How can I transform it into a time series that gives me the count of active monthly subscriptions over time (assuming it is active in the month if at least for one day in that month). Something like this (based on the example above, assuming only 2 records):
Month Count
2015-08 0
2015-09 1
2015-10 2
2015-11 1
2015-12 1
2016-01 0
Rem: I took some arbitrary start and end dates for the time series, to make the example clear.
Prepare the data and make sure that the date columns are actually stored as dates:
data <- read.table(text = "User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25", header = TRUE)
data$StartDate <- as.Date(data$StartDate)
data$EndDate <- as.Date(data$EndDate)
This function returns a vector with all months that fall within a subscription:
library(lubridate)
subscr_month <- function(start, end) {
start <- floor_date(start, "month")
seq <- seq(start, end, by = "1 month")
months <- format(seq, format = "%Y-%m")
return(months)
}
It uses the function floor_date() from the lubridate package. The start date must be rounded down to the first of the month, because otherwise the last month might be missing. For example, for user 2, adding two months to the start date gives 2015-12-27, which is after the end date, so no date from December would be included in seq. The last line converts the dates to character strings that contain only the year and month.
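The effect of rounding the start date down can be seen with user 2's dates. This small base-R sketch uses format() with "%Y-%m-01" in place of floor_date() so that it runs without lubridate; the variable names are only for illustration.

```r
start <- as.Date("2015-10-27")
end   <- as.Date("2015-12-25")

# Without rounding: the third step lands on 2015-12-27, past the end date
unfloored <- format(seq(start, end, by = "1 month"), "%Y-%m")
unfloored
#> [1] "2015-10" "2015-11"

# With the start moved to the first of its month, December is included
start_floor <- as.Date(format(start, "%Y-%m-01"))
floored <- format(seq(start_floor, end, by = "1 month"), "%Y-%m")
floored
#> [1] "2015-10" "2015-11" "2015-12"
```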
Now, you can apply this function to each start and end date from your data using mapply(). Afterwards, table() creates a table of counts of all dates in the resulting list:
all_month <- mapply(subscr_month, data$StartDate, data$EndDate, SIMPLIFY = FALSE)
table(unlist(all_month))
## 2015-09 2015-10 2015-11 2015-12
## 1 2 1 1
You can also convert the table to a data frame:
as.data.frame(table(unlist(all_month)))
## Var1 Freq
## 1 2015-09 1
## 2 2015-10 2
## 3 2015-11 1
## 4 2015-12 1
Your example output also includes the counts for months that do not appear in the data set. If you want to have this, you can convert the vector of months to a factor and set the levels to all the months you want to include:
month_list <- format(seq(as.Date("2015-08-01"), as.Date("2016-01-01"), by = "1 month"), format = "%Y-%m")
all_month_factor <- factor(unlist(all_month), levels = month_list)
table(all_month_factor)
## all_month_factor
## 2015-08 2015-09 2015-10 2015-11 2015-12 2016-01
## 0 1 2 1 1 0
Read in the data frame from the question:
df = structure(list(StartDate = structure(c(16681, 16735), class = "Date"),
EndDate = structure(c(16735, 16794), class = "Date")), class = "data.frame", .Names = c("StartDate",
"EndDate"), row.names = c(NA, -2L))
You could make good use of do() from the dplyr package together with seq():
library(dplyr)
df %>%
rowwise() %>% do({
w <- seq(.$StartDate,.$EndDate,by = "15 days") #for month difference less than 1 complete month
m <- format(w,"%Y-%m") %>% unique
data.frame(Month = m)
}) %>%
group_by(Month) %>%
summarise(Count = length(Month))

Splitting a row at year change

I have a large data set representing paired blocks of time, but I want a clean break across year boundaries, with each row starting and finishing in the same year.
As an example see the table below.
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 154 519 2 730
4 0 216 735 3 1095
There is no overlap between years one and two as row 3 starts on the first day of year two, however row 4 starts in year two and ends 5 days into year three. I want to split row 4 so that the table looks like the following.
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 0 519 1 365
4 1 154 519 2 730
5 0 211 524 2 730
6 0 5 735 3 1095
As can be seen there is no overlap across years as each overlapping block of time has been split up so each row starts and finishes in the same year. The way I have done this so far is as follows, however it seems clunky and I would hope there is a more elegant solution.
library(dplyr)
set.seed(808)
test <- data.frame(type = c(1,0), duration = round(runif(20, min = 100, max = 250))) %>%
mutate(cumsum = cumsum(duration), year = ceiling(cumsum/365), year.split = year*365 )
test <- rbind(test[1,],
filter(test, lag(year) == year),
filter(test, lag(year) != year) %>%
mutate( duration = cumsum - (year-1)*365),
filter(test, lag(year) != year) %>%
mutate( duration = ((year-1)*365 + duration- cumsum),
cumsum = cumsum-duration,
year = year -1,
year.split = year*365) ) %>% arrange(year, cumsum)
test <- group_by( test,type, year) %>%
summarise( duration = sum(duration)) %>% ungroup %>% arrange(year)
The final two lines of code summarise the data, as I am interested in the total amount of each type per year.
What is a better way of doing this?
This seems to work, assuming that the durations are all strictly positive:
cs<-test$cumsum
cs0<-sort(unique(c(cs,(1:floor(max(cs)/365))*365)))
data.frame(type=test$type[findInterval(cs0-0.5,cs)+1],
duration=diff(c(0,cs0)),cumsum=cs0,year=ceiling(cs0/365))
type duration cumsum year
1 1 236 236 1
2 0 129 365 1
3 1 154 519 2
4 0 211 730 2
5 0 5 735 3
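To run the idea above end to end, here is a self-contained base-R sketch on the four example rows from the question, followed by the per-type, per-year totals the question ultimately asks for. The names year_ends, pieces and totals are introduced only for this illustration, and the aggregate() step is an addition that is not part of the answer above.

```r
# The four example rows from the question
test <- data.frame(type = c(1, 0, 1, 0),
                   duration = c(236, 129, 154, 216))
test$cumsum <- cumsum(test$duration)

cs        <- test$cumsum
year_ends <- seq_len(floor(max(cs) / 365)) * 365   # 365, 730
cs0       <- sort(unique(c(cs, year_ends)))        # block ends plus year ends

# findInterval(cs0 - 0.5, cs) counts the blocks that finished before each
# new end point, so adding 1 gives the block that each piece belongs to
pieces <- data.frame(
  type     = test$type[findInterval(cs0 - 0.5, cs) + 1],
  duration = diff(c(0, cs0)),
  cumsum   = cs0,
  year     = ceiling(cs0 / 365)
)
pieces

# Total duration of each type per year
totals <- aggregate(duration ~ type + year, data = pieces, FUN = sum)
totals
```

Subtracting 0.5 works here because the durations are whole days, so every end point stays strictly inside the block it closes.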
Not sure if it's the R way you are looking for, but you can simplify your rbind call a bit:
rbind (filter(test, cumsum - duration >= (year - 1) * 365),
filter(test, cumsum - duration < (year - 1) * 365) %>%
mutate(duration = cumsum - (year - 1) * 365),
filter(test, cumsum - duration < (year - 1) * 365) %>%
mutate(year = year - 1, # I'm changing the year first so it will propagate
duration = duration - (cumsum - (year * 365)),
cumsum = (year) * 365,
year.split = year * 365)
)
As you can see, I combine three data frames:
Rows that are already correct, because the duration doesn't overlap two years
The overlapping rows, with the duration set to the number of days in the later year
The same rows again, with the values changed to fit the previous year
There are two things I don't like here: I used the same filter twice (for cases 2 and 3), and tomorrow I will need 10 to 15 minutes to understand this code (or I can add a comment like # It works, don't worry).
I think that a more verbose version of this code will be easier to maintain:
# These don't overlap
ok <- filter(test, cumsum - duration >= (year - 1) * 365)
# These do overlap! We need to split them in two
ko <- filter(test, cumsum - duration < (year - 1) * 365)
# For the most recent year, it's enough to change the duration
ko.recent <- mutate(ko,
duration = cumsum - (year - 1) * 365
)
# For the previous year, a bit more
ko.previous <- mutate(ko,
year = year - 1, # I'm changing the year first
# so it will propagate
duration = duration - (cumsum - (year * 365)),
cumsum = (year) * 365,
year.split = year * 365
)
# Let me put them back together and sort them for you
test1 <- rbind (ok,
ko.recent,
ko.previous
)
Not sure if this was the answer you were looking for, I'm just learning R.
