(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong: t increments by 1 from row to row even though the gaps between the actual dates vary:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
To me, that is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), "days"))
yield different results?
I'm open to any packages (currently using fpp, forecast, timeSeries, xts, and ts); I'm just looking for a good answer I can deploy in functional form, since these payments will be updated every week and I'd like to automate appending to this data.frame.

I think some kind of transformation to a regular (continuous) time series is a good idea.
You can use xts to transform the data (it is handy because xts objects can be used by other packages like a regular ts).
Filling the gaps
# convert myDate to POSIXct if necessary
# create xts from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), x$myDate )
ts1
# create an empty (zero-width) series covering every day in the range
ts_empty <- xts(, seq(from = start(ts1), to = end(ts1), by = "DSTday"))
# merge the empty ts to the data and fill the gap with 0
ts2 <- merge( ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge( ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo-xts ready functions are:
# na.locf - constant previous value
# na.approx - linear approximation
# na.spline - cubic spline interpolation
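A minimal end-to-end sketch applied to the question's data frame (my assumptions: d_exp_0014 carries the myDate, Amount and Count columns shown above, with myDate already a Date):
library(xts)
ts1 <- xts(d_exp_0014[, c("Amount", "Count")], order.by = d_exp_0014$myDate)
all_days <- seq(start(ts1), end(ts1), by = "day")
ts2 <- merge(ts1, xts(, all_days), fill = 0)   # zero-fill the missing days
# with a regular daily index, a simple 1:n trend is meaningful again
d_reg <- data.frame(Date = index(ts2), coredata(ts2), t = seq_len(nrow(ts2)))
fit_t <- lm(Amount ~ t + Count, data = d_reg)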
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on your newer question duplicates are quite likely. In that case you probably want to aggregate the values with sum:
ts1 <- period.apply( ts1, endpoints(ts1,'days'), sum)
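One caveat worth noting: with the two-column ts1 built above, plain sum collapses Amount and Count into a single number per day. A per-column version (my sketch, not part of the original answer) could look like:
ts1 <- period.apply(ts1, endpoints(ts1, 'days'), function(x) colSums(x, na.rm = TRUE))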

Related

Cumsum function step wise in R

I am facing one problem: I calculated the monthly interest for a mortgage, but I need to sum the results so that I get one value per year (always 12 months).
H <- 2000000 # mortgage
i.m <- 0.03/12 # rate per month
year <- 15 # years
a <- (H*i.m*(1+i.m)^(12*year)) / ((1+i.m)^(12*year)-1)
a # monthly payment
interest <- a*(1-(1/(1+i.m)^(0:(year*12))))
interest
cumsum(a*(1-(1/(1+i.m)^(0:(year*12))))) # first 12 values together, then the next 12 values plus the first ones, and so on (I want one value per year)
You may do this with tapply in base R.
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
yearly <- tapply(monthly, ceiling(seq_along(monthly)/12), sum)
I think you can use the following solution:
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
sapply(split(monthly, ceiling(seq_along(monthly) / 12)), function(x) x[length(x)])
1 2 3 4 5 6 7 8
2254.446 9334.668 21098.218 37406.855 58126.414 83126.695 112281.337 145467.712
9 10 11 12 13 14 15 16
182566.812 223463.138 268044.605 316202.434 367831.057 422828.023 481093.905 486093.905

R - Analysis of time series with semi-annual data?

I have a time series with semi-annual (half-yearly) data points.
It seems that the ts() function can't handle that as "frequency = 2" returns a very strange time series object that extends far beyond the actual time period.
Is there any way to do time series analysis of this kind of time series object in R?
EDIT: Here's an example:
dat <- seq(1, 17, by = 1)
> semi <- ts(dat, start = c(2008,12), frequency = 2)
> semi
Time Series:
Start = c(2013, 2)
End = c(2021, 2)
Frequency = 2
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
I was expecting:
> semi
s1 s2
2008 1
2009 2 3
2010 4 5
2011 6 7
2012 8 9
2013 10 11
2014 12 13
2015 14 15
2016 16 17
First let me explain why the first ts element starts at 2013 instead of 2008. The start and end specifications work in periods of the given frequency: you asked for the 12th period of 2008, which, with a frequency of 2, is the second period of 2013.
This should work for the period:
semi <- ts(dat, start = c(2008,2), frequency = 2)
semi now holds the correct time series; however, it does not know period names (such as s1/s2) for a frequency of 2, so it prints as a plain vector. If you plot the series, the correct half-yearly graph is shown.
plot.ts(semi)
In this related question someone explains the standard frequencies that ts() knows how to label.
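To convince yourself the series is right even without s1/s2 labels, you can inspect the time index directly (a small check, not from the original answer):
semi <- ts(dat, start = c(2008, 2), frequency = 2)
time(semi)   # 2008.5 2009.0 2009.5 ... one step per half-year
cycle(semi)  # 2 1 2 1 ... which half of the year each observation falls in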

Manipulating Dates with dplyr

I have longitudinal, geocoded address data and the length of time at each geocode. I then have a series of variables (I'm just calling them x here) that give characteristics of each geoid location. Below here is just two cases but I have thousands.
id<-c(1,1,1,7,7,7)
geoid<-c(53,45,45,16,18,42)
start<-c("1/1/2004","10/31/2004","1/1/2005","1/1/2005","6/1/2007","7/2/2007")
end<-c("10/30/2004","12/31/2004","12/31/2007","5/31/2007","7/1/2007","12/31/2007")
x<-c(.5,.7,.7,.3,.4,.6)
dat<-data.frame(id,geoid,x,start,end)
dat$start<-as.Date(dat$start,format='%m/%d/%Y')
dat$end<-as.Date(dat$end,format='%m/%d/%Y')
dat
id geoid x start end
1 53 0.5 2004-01-01 2004-10-30
1 45 0.7 2004-10-31 2004-12-31
1 45 0.7 2005-01-01 2007-12-31
7 16 0.3 2005-01-01 2007-05-31
7 18 0.4 2007-06-01 2007-08-01
7 42 0.6 2007-08-02 2007-12-31
I need to end up with a single value for each year (2004, 2005, 2006, 2007) and for each case (1, 7) that is weighted by the length of time at each address. So case 1 moves from geoid 53 to 45 in 2004 and case 7 moves from geoid 16 to 18 to 42 in 2007. So I calculate the percent of the year at each geoid (and eventually I will multiply that by x and take the mean for each year to get a weighted average). Cases staying put for a whole year will get a weight of 1.
#calculate the percentage of year at each address for id 1
(as.Date("10/31/2004",format='%m/%d/%Y')-as.Date("1/1/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.8323066
(as.Date("12/31/2004",format='%m/%d/%Y')-as.Date("10/31/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.1670089
#calculate the percentage of year at each address for id 7
(as.Date("05/31/2007",format='%m/%d/%Y')-as.Date("1/1/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4106776
(as.Date("07/01/2007",format='%m/%d/%Y')-as.Date("06/01/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.08213552
(as.Date("12/31/2007",format='%m/%d/%Y')-as.Date("07/02/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4982888
I can do this by brute force, looking at each year individually and calculating the percent of the year spent at each address; then I would multiply each weight by the x values and take the mean for that year. But that is not feasible with thousands of cases. Any ideas on how to do this more efficiently would be much appreciated. It seems like it might be doable with dplyr slice, but I'm stalled out at the moment. The key is separating out each year.
As eipi10 mentioned, some of your data spans more than a year. It also looks inconsistent with the data you used in your time difference calculations, which are all within the same year.
Assuming that your start and end dates would actually be in the same year, you can do something like the following:
library(dplyr)
library(lubridate)  # for year()
foo <- dat %>%
  mutate(start_year = year(start),
         end_year = year(end),
         same_year = (start_year == end_year),
         year_frac = as.numeric(end - start)/365.25,
         wtd_x = year_frac * x)
This gives you:
id geoid x start end start_year end_year same_year year_frac wtd_x
1 1 53 0.5 2004-01-01 2004-10-31 2004 2004 TRUE 0.83230664 0.41615332
2 1 45 0.7 2004-10-31 2004-12-31 2004 2004 TRUE 0.16700890 0.11690623
3 1 45 0.7 2005-01-01 2007-12-31 2005 2007 FALSE 2.99520876 2.09664613
4 7 16 0.3 2007-01-01 2007-05-31 2007 2007 TRUE 0.41067762 0.12320329
5 7 18 0.4 2007-06-01 2007-07-01 2007 2007 TRUE 0.08213552 0.03285421
6 7 42 0.6 2007-07-02 2007-12-31 2007 2007 TRUE 0.49828884 0.29897331
You can then group and summarise the data using:
bar <- foo %>%
  group_by(start_year, id) %>%
  summarise(sum(wtd_x))
to give you the answer:
start_year id sum(wtd_x)
(dbl) (dbl) (dfft)
1 2004 1 0.5330595 days
2 2005 1 2.0966461 days
3 2007 7 0.4550308 days
Hopefully this will get you started. I wasn't sure how you wanted to deal with cases where the period from start to end spans more than one year or crosses calendar years.
library(dplyr)
dat %>%
mutate(fractionOfYear = as.numeric(end - start)/365.25)
id geoid x start end fractionOfYear
1 1 53 0.5 2004-01-01 2004-10-30 0.82956879
2 1 45 0.7 2004-10-31 2004-12-31 0.16700890
3 1 45 0.7 2005-01-01 2007-12-31 2.99520876
4 7 16 0.3 2005-01-01 2007-05-31 2.40930869
5 7 18 0.4 2007-06-01 2007-07-01 0.08213552
6 7 42 0.6 2007-07-02 2007-12-31 0.49828884
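For the spans that cross calendar years, one hedged sketch (my assumption, not part of the answer above) is to split each stay into one row per year before computing the fractions:
split_rows <- lapply(seq_len(nrow(dat)), function(i) {
  yrs <- seq(as.numeric(format(dat$start[i], "%Y")),
             as.numeric(format(dat$end[i], "%Y")))
  data.frame(id = dat$id[i], geoid = dat$geoid[i], x = dat$x[i], year = yrs,
             start = pmax(dat$start[i], as.Date(paste0(yrs, "-01-01"))),
             end   = pmin(dat$end[i],   as.Date(paste0(yrs, "-12-31"))))
})
by_year <- do.call(rbind, split_rows)
by_year$weight <- as.numeric(by_year$end - by_year$start + 1) / 365.25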
I was able to find some local help that led us to a simple function. We're still stuck on how to use apply with dates but this overall handles it.
#made up sample address data
id<-c(1,1,1,7,7,7)
geoid<-c(53,45,45,16,18,42)
start<-c("1/31/2004","10/31/2004","1/1/2005","1/1/2005","6/1/2007","7/2/2007")
end<-c("10/30/2004","12/31/2004","12/31/2007","5/31/2007","7/1/2007","12/31/2007")
dat <- data.frame(id,geoid,start,end)
#format addresses
dat$start<-as.Date(dat$start,format='%m/%d/%Y')
dat$end<-as.Date(dat$end,format='%m/%d/%Y')
#function to create proportion of time at each address
prop_time <- function(drange, year){
  start <- drange[[1]]; end <- drange[[2]]
  # start year and end year
  syear <- as.numeric(format(start, '%Y'))
  eyear <- as.numeric(format(end, '%Y'))
  # only compute a proportion when the stay overlaps the requested year
  if(syear <= year & year <= eyear){
    byear <- as.Date(paste("1/1", sep="/", year), format='%m/%d/%Y')
    eyear <- as.Date(paste("12/31", sep="/", year), format='%m/%d/%Y')
    astart <- max(byear, start)
    aend <- min(eyear, end)
    prop <- as.numeric((aend - astart))/as.numeric((eyear - byear))
  } else prop <- 0  # if there is no overlap with that year, the proportion is 0
  prop
}
#a second function to apply prop_time to multiple cases
prop_apply <- function(dat_times, year){
  out <- NULL
  for(i in 1:dim(dat_times)[1]){
    out <- rbind(out, prop_time(dat_times[i, ], year))
  }
  out
}
#create new data frame to populate years
dat <- data.frame(dat, y2004=0, y2005=0, y2006=0, y2007=0)
dat_times <- dat[,c("start", "end")]
#run prop_apply in a loop across cases and selected years
for(j in 2004:2007){
  newdate <- paste("y", j, sep="")
  dat[, newdate] <- prop_apply(dat_times, j)
}
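A hedged sketch of the last step described in the question (weight x by these proportions and average per id and year); it assumes an x column is put back onto dat, which the made-up sample above dropped:
dat$x <- c(.5, .7, .7, .3, .4, .6)
for(j in 2004:2007){
  w <- dat[, paste0("y", j)]
  # weighted mean of x; NaN where an id has no time in that year
  cat("year", j, ":\n")
  print(tapply(w * dat$x, dat$id, sum) / tapply(w, dat$id, sum))
}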

How to make all the months to have an equal number of days (for example 22 days) for a MIDAS regression in R

This is a follow up question for these two posts.
How to deal with impossible dates for midasr package
https://stats.stackexchange.com/questions/77495/what-can-i-do-with-these-two-time-series
I need to use the mls function in the midasr package to transform the high frequency (daily) financial data for use with the low frequency (quarterly) macroeconomic data.
The author @mpiktas mentioned
You must make all the months to have an equal number of days. And then
set frequency to that number. You can achieve that by discarding data,
padding NAs or extrapolating.
and
You could use zoo objects to make the padding easier, but in the end
simple numeric vector should be passed.
I tried different approaches and did not find an easy way to implement this.
I used dplyr to pad each month to 31 days, which leaves 7-11 NAs per month.
# generate the date vector
library(midasr)
library(dplyr)
library(quantmod)
tsxdate <- as.Date( paste(1979, rep(1:12, each=31), 1:31, sep="-") )
for (year in 1980:2015){
  tsxdate <- c(tsxdate, as.Date( paste(year, rep(1:12, each=31), 1:31, sep="-") ))
}
# transform to dataframe
tsxdate.df <- as.data.frame(tsxdate)
# get the stock market index from yahoo
tsxindex <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# merge two data frame to get each month with 31 days
tsx.df <- left_join(tsxdate.df, tsxindex)
I suspect this causes a problem because of too many NAs.
I put the new daily data into a MIDAS regression in R. It did not work; none of the weight functions worked.
# since each month has 31 days. one quarter yy correspond to 93 days data.
midas_r(midas_r(yy~trend+fmls(zz,30,93,nealmon) ,start=list(zz=rep(0,4))), Ofunction="nls")
Could you tell me how to make all the months to have an equal number of days?
update:
Finally, I found a way using the zoo package with aggregate and the first function. It is not perfect, but it works and is fast. first pads with NAs according to its second argument.
I still need to figure out how to fit it into a MIDAS regression.
# get data
tsx <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# subset
# generate a zoo object
library(zoo)
tsx.zoo <- zoo(tsx$GSPTSE.Adjusted)
# group by yearmonth and take first 22 days data.
days <-aggregate(tsx.zoo, as.yearmon, first, 22)
It looks like this: each row is one month with 22 days data.
Jun 1979 1614.29 NA NA NA NA NA NA NA NA NA
Jul 1979 1614.29 1598.73 1579.88 1582.57 1582.27 1576.19 1559.23 1529.81 1533.50 1547.66
Aug 1979 1554.14 1556.94 1553.84 1553.84 1551.95 1561.23 1562.52 1571.00 1578.08 1580.28
Sep 1979 1685.11 1657.58 1690.10 1720.92 1716.53 1711.34 1722.71 1714.63 1727.50 1724.51
Oct 1979 1749.05 1767.40 1775.98 1786.35 1800.12 1800.12 1735.88 1685.21 1681.52 1670.65
Nov 1979 1599.33 1606.81 1596.54 1592.94 1574.49 1569.20 1583.97 1608.70 1611.00 1619.78
Jun 1979 NA NA NA NA NA NA NA NA NA NA
Jul 1979 1556.94 1546.86 1548.46 1553.54 1542.07 1543.17 1552.85 1566.01 1573.99 1564.12
Aug 1979 1596.64 1602.82 1615.09 1636.53 1653.09 1660.97 1657.78 1665.46 1674.44 1674.64
Sep 1979 1714.73 1717.53 1732.59 1736.48 1731.19 1732.49 1746.75 1754.33 1747.45 NA
Oct 1979 1639.03 1613.19 1616.29 1635.34 1593.44 1533.40 1522.12 1534.49 1517.24 1523.92
Nov 1979 1628.55 1621.57 1624.36 1627.56 1620.27 1647.51 1677.93 1683.81 1690.70 1698.97
Jun 1979 NA NA
Jul 1979 1554.14 NA
Aug 1979 1674.24 1675.43
Sep 1979 NA NA
Oct 1979 1538.68 1552.25
update again:
@mpiktas gives a better and correct way to do it:
1. NAs should be padded at the beginning of each period.
2. Data should be gathered at the frequency of the response variable; in my case, quarterly.
His function can be used inside zoo's aggregate. I guess it does the same job as group_by plus do in dplyr: split, operate, and give back the combined results. I tried this:
tsxdaily <- aggregate(tsx.zoo, as.yearqtr, padd_nas, 66)
as.yearqtr matches the frequency of the response variable (quarterly).
Here is one possible way of how to add NAs.
First, note that MIDAS regression puts the emphasis on the last values of the period, so you need to put NAs in front, not in the back.
Suppose that we have the following dummy data:
> dt <- data.frame(Day=1:10,Quarter=c(rep(1,6),rep(2,4)),value=1:10)
> dt
Day Quarter value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
7 7 2 7
8 8 2 8
9 9 2 9
10 10 2 10
In this example there are two quarters, the first one has 6 days, the second one 4. Suppose we want to harmonize the data, so that the quarter has 7 days (for example).
Define a simple function which adds NAs at the beginning of the data:
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if(n < desired_length) {
    c(rep(NA, desired_length - n), x)
  } else {
    tail(x, desired_length)
  }
}
Here is an example illustrating how this function works:
> padd_nas(1:4,7)
[1] NA NA NA 1 2 3 4
>
Now add NAs for each quarter and make sure that the data is ordered by day:
library(dplyr)
pdt <- dt %>% arrange(Day) %>% group_by(Quarter) %>% do(pv = padd_nas(.$value, 7))
> pdt
Source: local data frame [2 x 2]
Groups: <by row>
Quarter pv
1 1 <int[7]>
2 2 <int[7]>
To get the padded result simply use unlist on column pv:
> pv <- pdt$pv %>% unlist
> pv
[1] NA 1 2 3 4 5 6 NA NA NA 7 8 9 10
Now we can prepare this for MIDAS regression with mls. Suppose that only the last 3 days of each quarter are relevant:
> library(midasr)
> mls(pv, 0:2, 7)
X.0/m X.1/m X.2/m
[1,] 6 5 4
[2,] 10 9 8
Compare this with original data dt.
This approach can be generalized for any low and high frequency data configuration.
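A hedged sketch (my assumptions, not from the answer itself) tying this back to the updated question: flatten the per-quarter padded zoo rows into one numeric vector and feed, say, the last 3 days of each quarter to mls:
tsxdaily <- aggregate(tsx.zoo, as.yearqtr, padd_nas, 66)  # 66 padded days per quarter, as in the update
hf <- as.numeric(t(coredata(tsxdaily)))                   # one long vector, quarter by quarter
library(midasr)
mls(hf, 0:2, 66)                                          # last 3 padded days of each quarter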

as.Date is throwing a row number mismatch, but all vectors are same length

The following (CSV) dataset has 3133 rows of expenses by day between 7/1/2000 and 12/31/2014:
head(d_exp_0014)
2000 7 6 792078.595 9
2000 7 7 140065.5 9
2000 7 11 190553.2 9
2000 7 12 119208.65 9
2000 7 16 1068156.293 9
2000 7 17 0 9
2000 7 21 457828.8033 9
2000 7 26 661445.0775 9
2000 7 28 211122.82 9
2000 8 2 273575.1733 8
The columns here are Year, Month, Day, Expense, and Count (how many days of each month had an expense).
I am trying to forecast out to the end of 2015, and need to deal with these messy date columns so I can slice and dice xts (?) objects with dplyr. The ISOdate and as.Date functions are throwing this error:
> exp <- data.frame(data = d_exp_0014, Date = as.Date(paste(Year, Month, Day), format = "m%/d%/Y%"), Amount = Amount, Count = Count, t = c(1:3133))
Error in data.frame(data = d_exp_0014, Date = as.Date(paste(Year, Month, :
arguments imply differing number of rows: 3133, 3134
> length(d_exp_0014$Year)
[1] 3133
> length(d_exp_0014$Month)
[1] 3133
> length(d_exp_0014$Day)
[1] 3133
What am I doing wrong? And should I instead build a vector of 5296 continuous dates between 7/1/2000 and 12/31/2014 and merge my 3133 rows of observations to this table (thus effectively inserting '0' in the Amount column for days on which there were no payments)?
Several errors here (but not from paste): I'm guessing you were taught to use attach; that is probably the source of this particular error. Start with:
detach(d_exp_0014)
d_exp_0014 <- cbind(d_exp_0014,
                    myDate = with(d_exp_0014,
                                  as.Date(paste(Year, Month, Day, sep="/"),
                                          format = "%Y/%m/%d")  # note: % first, then the letter
                    )
)
Then you can add further columns as needed.
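For example (a sketch, assuming the amount and count columns are named Amount and Count as in your data.frame call above), the trend index can be added the same way, referencing columns through the data frame rather than attach:
d_exp_0014$t <- seq_len(nrow(d_exp_0014))                 # trend index, one per row
head(d_exp_0014[, c("myDate", "Amount", "Count", "t")])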
