How to make all months have an equal number of days (for example, 22 days) for a MIDAS regression in R

This is a follow-up question to these two posts:
How to deal with impossible dates for midasr package
https://stats.stackexchange.com/questions/77495/what-can-i-do-with-these-two-time-series
I need to use the mls function from the midasr package in R to convert high-frequency (daily) financial data to the low (quarterly) frequency of the macroeconomic data.
The package author, @mpiktas, mentioned:
You must make all the months to have an equal number of days. And then
set frequency to that number. You can achieve that by discarding data,
padding NAs or extrapolating.
and
You could use zoo objects to make the padding easier, but in the end
simple numeric vector should be passed.
I searched in several ways but did not find an easy way to implement this.
I used dplyr to give each month 31 days, which leaves 7-11 NAs per month.
# generate the date vector
library(midasr)
library(dplyr)
library(quantmod)
tsxdate <- as.Date( paste(1979, rep(1:12, each=31), 1:31, sep="-") )
for (year in 1980:2015){
tsxdate <- c(tsxdate,as.Date( paste(year, rep(1:12, each=31), 1:31, sep="-") ))
}
# transform to dataframe
tsxdate.df <- as.data.frame(tsxdate)
# get the stock market index from yahoo
tsxindex <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# merge two data frame to get each month with 31 days
tsx.df <- left_join(tsxdate.df, tsxindex)
I suspect the large number of NAs caused a problem.
I passed the new daily data to a MIDAS regression in R, but it did not work. None of the weight functions worked.
# Each month has 31 days, so one quarter of yy corresponds to 93 days of data.
midas_r(midas_r(yy~trend+fmls(zz,30,93,nealmon) ,start=list(zz=rep(0,4))), Ofunction="nls")
Could you tell me how to make all the months have an equal number of days?
Update:
I finally found a way using aggregate from the zoo package together with first. It is not perfect, but it works and is fast. first adds NAs according to its parameter.
I still need to figure out how to fit the result into a MIDAS regression.
# get data
tsx <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# subset
# generate a zoo object
library(zoo)
tsx.zoo <- zoo(tsx$GSPTSE.Adjusted)
# group by yearmonth and take first 22 days data.
days <-aggregate(tsx.zoo, as.yearmon, first, 22)
The result looks like this; each row is one month with 22 days of data:
Jun 1979 1614.29 NA NA NA NA NA NA NA NA NA
Jul 1979 1614.29 1598.73 1579.88 1582.57 1582.27 1576.19 1559.23 1529.81 1533.50 1547.66
Aug 1979 1554.14 1556.94 1553.84 1553.84 1551.95 1561.23 1562.52 1571.00 1578.08 1580.28
Sep 1979 1685.11 1657.58 1690.10 1720.92 1716.53 1711.34 1722.71 1714.63 1727.50 1724.51
Oct 1979 1749.05 1767.40 1775.98 1786.35 1800.12 1800.12 1735.88 1685.21 1681.52 1670.65
Nov 1979 1599.33 1606.81 1596.54 1592.94 1574.49 1569.20 1583.97 1608.70 1611.00 1619.78
Jun 1979 NA NA NA NA NA NA NA NA NA NA
Jul 1979 1556.94 1546.86 1548.46 1553.54 1542.07 1543.17 1552.85 1566.01 1573.99 1564.12
Aug 1979 1596.64 1602.82 1615.09 1636.53 1653.09 1660.97 1657.78 1665.46 1674.44 1674.64
Sep 1979 1714.73 1717.53 1732.59 1736.48 1731.19 1732.49 1746.75 1754.33 1747.45 NA
Oct 1979 1639.03 1613.19 1616.29 1635.34 1593.44 1533.40 1522.12 1534.49 1517.24 1523.92
Nov 1979 1628.55 1621.57 1624.36 1627.56 1620.27 1647.51 1677.93 1683.81 1690.70 1698.97
Jun 1979 NA NA
Jul 1979 1554.14 NA
Aug 1979 1674.24 1675.43
Sep 1979 NA NA
Oct 1979 1538.68 1552.25
Update again:
@mpiktas gives a better, correct way to do it.
1. NAs should be padded at the beginning of each period.
2. Data should be gathered at the frequency of the response variable. In my case, that is quarterly.
His function can be used with the aggregate function in zoo. I guess it does the same job as group_by plus do in dplyr: split, operate, and return a list of results. I tried this:
tsxdaily <- aggregate(tsx.zoo, as.yearqtr, padd_nas, 66)
as.yearqtr groups the data by quarter, which is the frequency of the response variable.

Here is one possible way to add NAs.
First, note that MIDAS regression puts the emphasis on the last values of the period, so you need to put NAs in front, not in the back.
Suppose that we have the following dummy data:
> dt <- data.frame(Day=1:10,Quarter=c(rep(1,6),rep(2,4)),value=1:10)
> dt
Day Quarter value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
7 7 2 7
8 8 2 8
9 9 2 9
10 10 2 10
In this example there are two quarters, the first one has 6 days, the second one 4. Suppose we want to harmonize the data, so that the quarter has 7 days (for example).
Define a simple function which adds NAs at the beginning of the data:
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) {
    c(rep(NA, desired_length - n), x)
  } else {
    tail(x, desired_length)
  }
}
Here is an example illustrating how this function works:
> padd_nas(1:4,7)
[1] NA NA NA 1 2 3 4
>
Now add NAs for each quarter and make sure that the data is ordered by day:
library(dplyr)
pdt <- dt %>% arrange(Day) %>% group_by(Quarter) %>% do(pv = padd_nas(.$value, 7))
> pdt
Source: local data frame [2 x 2]
Groups: <by row>
Quarter pv
1 1 <int[7]>
2 2 <int[7]>
To get the padded result simply use unlist on column pv:
> pv <- pdt$pv %>% unlist
> pv
[1] NA 1 2 3 4 5 6 NA NA NA 7 8 9 10
Now we can prepare this for MIDAS regression with mls. Suppose that only the last 3 days are relevant for each quarter:
> library(midasr)
> mls(pv, 0:2, 7)
X.0/m X.1/m X.2/m
[1,] 6 5 4
[2,] 10 9 8
Compare this with original data dt.
This approach can be generalized for any low and high frequency data configuration.
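As a cross-check of the steps above, here is a minimal base R sketch that pads each quarter without dplyr and then picks the last three values per quarter by reshaping the padded vector into a matrix. It is only an illustration on the same dummy data; the matrix step is a stand-in for what mls(pv, 0:2, 7) returned above, not a replacement for midasr.

```r
# Same helper as in the answer: pad with NAs in front, or keep the last values.
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) {
    c(rep(NA, desired_length - n), x)
  } else {
    tail(x, desired_length)
  }
}

# Dummy data from the answer: quarter 1 has 6 days, quarter 2 has 4 days.
dt <- data.frame(Day = 1:10, Quarter = c(rep(1, 6), rep(2, 4)), value = 1:10)
dt <- dt[order(dt$Day), ]

# Split the values by quarter, pad each piece to 7 days, and glue them back together.
pv <- unlist(lapply(split(dt$value, dt$Quarter), padd_nas, desired_length = 7),
             use.names = FALSE)

# One row per quarter, one column per (padded) day.
m <- matrix(pv, ncol = 7, byrow = TRUE)

# Last three days of each quarter, most recent first.
last3 <- m[, 7:5]
last3
```

Replacing 7 with the harmonized number of high-frequency observations per low-frequency period (for example 66 trading days per quarter) gives the general case.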

Related

How to use Sys.Date() To Extract Current Year? [duplicate]

I have manually split my dataset (discrete_8) into two datasets (data and data2). data contains the rows from the current year (2021), whereas data2 contains the rows from previous years. The split is hard-coded to 2021, but I want to automate it so that when 2022 comes I will not have to edit the script to change 2021 to 2022. Should I use Sys.Date() to get the current year? How would I go about incorporating Sys.Date() to partition the dataset?
Here is my code so far, where I partition the dataset:
data <- discrete_8 %>% filter(PS_DATE >= as.POSIXct("2021-01-01"))#current year
data2 <- discrete_8 %>% filter(PS_DATE < as.POSIXct("2021-01-01"))#past years
Here is what discrete_8 looks like:
X PS_DATE PS_NAME Control.Parameters.Cell.Return.Flow.Rate Control.Parameters.Harvest.Flow.Rate Control.Parameters.Microsparger.Total.Gas.Flow.Rate
1 0 2014-02-06 123 NA NA 1
2 1 2014-02-07 124 NA NA 1
3 2 2014-02-08 125 NA NA 1
4 3 2014-02-09 126 1.5 NA 1
5 4 2014-02-10 127 1.5 NA 1
6 5 2014-02-11 128 1.5 NA 1
There is a somewhat tedious bug still present in that trunc(Sys.Date(), "year") does not give you Jan 01 of the current year, although it does in R-devel.
But you can build yourself a helper such as this:
> firstDay <- function() { d <- Sys.Date(); d - as.POSIXlt(d)$yday }
> firstDay()
[1] "2021-01-01"
and you can use that to compare. (Also, in the code you posted, as.Date() is simpler as you ignore hours/minutes/seconds here.)
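A minimal sketch of how such a helper could be used for the partition, in base R. The function name first_day_of_year, the toy rows, and the fixed reference date are assumptions made for illustration; in the real script the default Sys.Date() would be used.

```r
# First day of the year that contains d (yday is 0-based, so Jan 1 has yday 0).
first_day_of_year <- function(d = Sys.Date()) {
  d - as.POSIXlt(d)$yday
}

# Hypothetical stand-in for discrete_8 with only two of its columns.
discrete_8 <- data.frame(
  PS_DATE = as.Date(c("2014-02-06", "2020-12-31", "2021-01-01", "2021-06-15")),
  PS_NAME = 123:126
)

# A fixed date keeps the example reproducible; use first_day_of_year() in practice.
cutoff <- first_day_of_year(as.Date("2021-08-01"))

data  <- discrete_8[discrete_8$PS_DATE >= cutoff, ]  # current year
data2 <- discrete_8[discrete_8$PS_DATE <  cutoff, ]  # past years
```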
One option is the lubridate::floor_date() function:
lubridate::floor_date(Sys.Date(), unit = "years")
[1] "2021-01-01"
I use substr(Sys.Date(),1,4) to get the current year. In your code, you can replace as.POSIXct("2021-01-01") with
as.POSIXct(paste0(substr(Sys.Date(),1,4),"-01-01"))
This gives January 1st of the current year in your datetime format.
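A related sketch that skips the cutoff date and compares the year of each row with the current year using format(). The toy rows and the fixed today value are assumptions for illustration. Note that here data2 collects every row from any other year, which matches "past years" only as long as the data contains no future dates.

```r
# Hypothetical stand-in for discrete_8 with a POSIXct date column.
discrete_8 <- data.frame(
  PS_DATE = as.POSIXct(c("2014-02-06", "2020-12-31", "2021-03-01"), tz = "UTC"),
  PS_NAME = c(123, 124, 125)
)

# Stand-in for Sys.Date() so that the example is reproducible.
today <- as.Date("2021-08-01")
this_year <- format(today, "%Y")

# TRUE for rows whose year matches the current year.
is_current <- format(discrete_8$PS_DATE, "%Y") == this_year

data  <- discrete_8[is_current, ]   # current year
data2 <- discrete_8[!is_current, ]  # all other years
```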

Replace values based on months in a dataframe with values in another column in r, using apply functions

I am working with a time series of precipitation data and attempting to use median imputation to replace every 0 value with the median of all data points for the month in which that 0 value was recorded.
I have two data frames, one with the original precipitation data:
> head(df.m)
prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
And one with the median monthly values:
> medians
Group.1 x
1 01 135.90680
2 02 123.52613
3 03 113.09841
4 04 98.10044
5 05 75.21976
6 06 57.47287
7 07 54.16667
8 08 45.57653
9 09 77.87740
10 10 103.25179
11 11 124.36795
12 12 131.30695
Below is the current solution that I have come up with utilizing the 1st answer here:
df.m[,"prcp"] <- sapply(df.m[,"prcp"], function(y) ifelse(y==0, medians$x,y))
This has not worked, as it only applies the first value of medians$x, which is the median for January (01). How can I make sure the median from the corresponding month is applied?
Another way I have attempted a solution is via the below:
df.m[,"prcp"] <- sapply(medians$Group.1, function(y)
ifelse(df.m[format.Date(df.m$date, "%m") == y &
df.m$prcp == 0, "prcp"], medians[medians$Group.1 == y,"x"],
df.m[,"prcp"]))
Description of the above function - this function tests and returns the amount of zeros for every month that there is a zero value in df.m[,"prcp"]
Same issue here as the 1st solution, but it does return all of the 0 values by month (if just executing the sapply() portion).
How can I replace all 0 in df.m$prcp with their corresponding medians from the medians df based on the month of the data?
Apologies if this is a basic question, I'm somewhat of a newbie here. Any and all help would be greatly appreciated.
Consider merging the two dataframes by month/group and then calculating with ifelse:
# MERGE TWO FRAMES
df.m$month <- format(df.m$date, "%m")
df.merge <- merge(df.m, medians, by.x="month", by.y="Group.1")
# CONDITIONAL CALCULATION
df.merge$prcp <- ifelse(df.merge$prcp == 0, df.merge$x, df.merge$prcp)
# RETURN BACK TO ORIGINAL STRUCTURE
df.m <- df.merge[names(df.m)]
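Note that merge() sorts the result by the key, so the rows come back ordered by month rather than by date. A minimal base R sketch of the same lookup with match(), which leaves the row order untouched, could look like this. The toy rows with zeroes are modified test data, not the original series.

```r
# Toy precipitation data with two zeroes (modified for illustration).
df.m <- data.frame(
  prcp = c(121.00485, 0, 82.74026, 0),
  date = as.Date(c("1975-01-31", "1975-02-28", "1975-03-31", "1976-02-29"))
)

# Monthly medians as shown in the question.
medians <- data.frame(
  Group.1 = sprintf("%02d", 1:12),
  x = c(135.90680, 123.52613, 113.09841, 98.10044, 75.21976, 57.47287,
        54.16667, 45.57653, 77.87740, 103.25179, 124.36795, 131.30695)
)

# Look up the median of each row's month without reordering df.m.
month_key <- format(df.m$date, "%m")
monthly_median <- medians$x[match(month_key, medians$Group.1)]

# Replace only the zero entries.
is_zero <- df.m$prcp == 0
df.m$prcp[is_zero] <- monthly_median[is_zero]
df.m
```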
A dplyr version, which does not rely on the original order. This uses slightly modified test data to show replacement of zeroes across multiple years:
library(dplyr)
## test data with zeroes - extended for additional years
df.m <- read.delim(text="
i prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
7 0 1976-06-30
8 0 1976-07-31
9 70 1976-08-31
", sep="", stringsAsFactors = FALSE)
medians <- read.delim(text="
i month x
1 01 135.90680
2 02 123.52613
3 03 113.09841
4 04 98.10044
5 05 75.21976
6 06 57.47287
7 07 54.16667
8 08 45.57653
9 09 77.87740
10 10 103.25179
11 11 124.36795
12 12 131.30695
", sep = "", stringsAsFactors = FALSE, strip.white = TRUE)
# extract the month as integer
df.m$month = as.integer(substr(df.m$date,6,7))
# match to medians by joining
result <- df.m %>%
inner_join(medians, by='month') %>%
mutate(prcp = ifelse(prcp == 0, x, prcp)) %>%
select(prcp, date)
result
yields
prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
7 57.47287 1976-06-30
8 54.16667 1976-07-31
9 70.00000 1976-08-31
I created small datasets with some zero values and added one line of code:
#create sample data
prcp <- c(1.5,0.0,0.0,2.1)
date <- c(01,02,03,04)
x <- c(1.11,2.22,3.33,4.44)
df <- data.frame(prcp,date)
grp <- data.frame(x,date)
#Make the assignment
df[df$prcp == 0,]$prcp <- grp[df$prcp == 0,]$x

Use dplyr to compute lagging difference

My data frame consists of three columns: state name, year, and the tax receipt for each year and each state. Below is an example for just one state.
year RealTaxRevs
1 1971 8335046
2 1972 9624026
3 1973 10498935
4 1974 10052305
5 1975 8708381
6 1976 8911262
7 1977 10759032
I'd like to compute the change in tax receipt from one year to the next, for each state. I used the following code:
data %>% group_by(state) %>% summarise(diff(RealTaxRevs, lag = 1, differences = 1))
but it gives me "Error: expecting a single value".
Could anyone explain this error message, and help me do this correctly using dplyr? Thank you.
If you want to use a diff-like function, consider using the zoo package as well. Then you can write code like the following:
library(zoo)
library(dplyr)
diff(as.zoo(1:4), na.pad = TRUE)
In a data frame setting, it would look like this:
dat <- data.frame(a=c(8335046, 9624026, 10498935, 10052305, 8708381, 8911262, 10759032))
dat %>% mutate(b = diff(as.zoo(a), na.pad = TRUE))
# a b
# 1 8335046 NA
# 2 9624026 1288980
# 3 10498935 874909
# 4 10052305 -446630
# 5 8708381 -1343924
# 6 8911262 202881
# 7 10759032 1847770
This way you can easily increase the number of lags without manually adding NAs:
dat %>% mutate(b2 = diff(as.zoo(a), lag = 2, na.pad = TRUE))
# a b2
# 1 8335046 NA
# 2 9624026 NA
# 3 10498935 2163889
# 4 10052305 428279
# 5 8708381 -1790554
# 6 8911262 -1141043
# 7 10759032 2050651
We can use data.table:
library(data.table)
# shift() returns the lagged vector; the difference is computed within each state
setDT(data)[, Diffs := RealTaxRevs - shift(RealTaxRevs), by = state]
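The error in the question most likely arises because summarise() expected one value per group, while diff() returns one value fewer than the number of rows in the group. A minimal base R sketch of a grouped lagged difference that keeps one row per year uses ave(). The second state and its values are made up for illustration.

```r
# Toy data: state "A" uses values from the question, state "B" is hypothetical.
data <- data.frame(
  state = rep(c("A", "B"), each = 3),
  year = rep(1971:1973, times = 2),
  RealTaxRevs = c(8335046, 9624026, 10498935, 100, 150, 130)
)

# Make sure the years are in order within each state before differencing.
data <- data[order(data$state, data$year), ]

# ave() applies FUN within each state and returns a vector aligned with the rows;
# padding with NA keeps the result the same length as the group.
data$Diff <- ave(data$RealTaxRevs, data$state,
                 FUN = function(x) c(NA, diff(x)))
data
```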

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong, as t increments in different amounts between the dates:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
Which to me is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- as.data.frame(seq(as.Date("2000/07/01"), as.Date("2014/12/31"), "days"), colnames = "Date")
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to a regular (continuous) time series is a good idea.
You can use xts to transform time series data (it is handy because the result can be used in other packages as a regular ts).
Filling the gaps
# convert myDate to POSIXct if necessary
# create xts from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), x$myDate )
ts1
# create empty time series
ts_empty <- seq( from = start(ts1), to = end(ts1), by = "DSTday")
# merge the empty ts to the data and fill the gap with 0
ts2 <- merge( ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge( ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo-xts ready functions are:
# na.locf - constant previous value
# na.approx - linear approximation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on your newer question they are very likely. I think you want to aggregate the values with the sum function:
ts1 <- period.apply( ts1, endpoints(ts1,'days'), sum)
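For comparison, a minimal base R sketch of the gap-filling idea, using merge() against a full daily calendar and then building the trend on the regular grid. The toy rows are copied from head(d_exp_0014); treating missing days as zero payments is an assumption that may or may not suit the real data.

```r
# Toy data mirroring head(d_exp_0014) from the question.
d <- data.frame(
  myDate = as.Date(c("2000-07-06", "2000-07-07", "2000-07-11",
                     "2000-07-12", "2000-07-16", "2000-07-17")),
  Amount = c(792078.6, 140065.5, 190553.2, 119208.6, 1068156.3, 0),
  Count  = 9
)

# Full daily calendar covering the observed range.
calendar <- data.frame(myDate = seq(min(d$myDate), max(d$myDate), by = "day"))

# Left join: days without a payment get NA, replaced by 0 here (an assumption).
full <- merge(calendar, d, by = "myDate", all.x = TRUE)
full$Amount[is.na(full$Amount)] <- 0

# The trend now advances by exactly one unit per calendar day.
full$t <- seq_len(nrow(full))

# Alternative without filling gaps: days elapsed since the first observation.
d$t <- as.numeric(d$myDate - min(d$myDate)) + 1
```

Count stays NA on the filled days in this sketch, so lm(Amount ~ t + Count, full) would drop those rows. The second variant gives the same t values for the observed days while leaving the data frame unchanged.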

R: How to arrange a daily time series of rows and columns to a single column?

I have a dataset containing a daily time series, and I want to arrange it into a single column. This is my data:
Date Day 1 Day 2 Day 3 Day 4 Day 5 Day 6 .... Day 31
01/01/1964 0 0 0 0 0 0 3
01/02/1964 NA NA NA NA NA NA ...
01/03/1964 195 445 329 121 61,6 44 ...
01/04/1964 17,2 14,9 17,1 102 54,3 9,33 ...
I want this:
Day1 0
Day2 0
.
.
.
Day31 3
I am having problems because leap years have 366 days. I tried the code below, but without success. Thanks in advance.
EDIT:
I finally got it working, but if anyone knows an easier way, using some package or function, I would be grateful. Otherwise I will create my own function.
EDIT 2:
Now I have a problem when the data does not start in the first month of a year.
rm(list = ls())
cat("\014")
setwd("C:/")
require(XLConnect)
# Load Streamflow Gauging Station
wb <- loadWorkbook("rainfall.xls")
Data<- readWorksheet(wb, sheet = "rainfall",header = FALSE,region = "B02:AF517")
R<- Data; ##1964 - 2006
sum(R[is.na(R)==FALSE])
# Number of days in each month
Ny<- c(31,28,31,30,31,30,31,31,30,31,30,31); # Normal Year
Ly<- c(31,29,31,30,31,30,31,31,30,31,30,31); # Leap/bissextile Year
S1<- c(1,0,0,0) # Leap year, normal year...
S2<- c(0,1,0,0) # Normal year, leap year...
S3<- c(0,0,1,0) #...
S4<- c(0,0,0,1) #...
Iab<- rep(S1,times=ceiling((nrow(R)/12)/4)); # Index of years
Iab<- Iab[1:(nrow(R)/12)];
Rnew<- matrix(numeric(0), 0,0);
# Organize the data into a single column
for(i in 1:(nrow(R)/12)){
for(j in 1:12){
if(Iab[i]==0){
Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ny[j]]))
}else{
Rnew<-c(Rnew, t(R[12*(i-1)+j,1:Ly[j]]))
}
}
}
sum(R[is.na(R)==FALSE])==sum(Rnew[is.na(Rnew)==FALSE]) # Test that the reorganization succeeded
sum(R[is.na(R)==FALSE])
sum(Rnew[is.na(Rnew)==FALSE])
I have a similar problem. In a way it is even worse, since I have discharge data (from a Brazilian ANA station) with several interruptions lasting months or years. Vazao01 is the discharge on the first day of the month, Vazao02 on the second, and the data frame goes up to Vazao31 (which is obviously NA for months with fewer days, but can also be NA for existing days without a record). The data frame is called ANAday and looks like this:
Date Vazao01 Vazao02 Vazao03...
20 01.05.1989 3463.00 3476.500 3463.000
21 01.06.1989 1867.70 1835.900 1809.400
22 01.07.1989 809.90 798.200 774.800
23 01.08.1989 344.60 308.700 297.900
24 01.11.1989 376.50 388.100 391.000
25 01.12.1989 279.00 289.800 319.500
26 01.01.1990 1715.00 1649.000 1573.200
27 01.02.1990 1035.20 1005.800 972.200
28 01.03.1990 2905.60 2962.100 NA
29 01.06.1990 NA NA NA
30 01.07.1990 297.90 284.400 271.200
31 01.08.1990 228.00 223.200 218.400
32 01.08.1999 NA NA 144.000
33 01.09.1999 20.74 18.620 16.500
34 01.10.1999 119.85 111.450 95.385
35 01.11.1999 11.20 23.705 48.370
36 01.12.1999 160.10 179.000 187.400
37 01.01.2000 843.00 865.300 914.500
38 01.02.2000 1331.30 1368.900 1387.800
39 01.04.2000 1823.60 1808.000 1789.800
40 01.05.2000 1579.00 1524.100 1445.700
I made a list of the months with data:
ANAm=as.Date(ANAday[,1], format="%d.%m.%Y")
format(ANAm, format="%Y-%m")
Then I used the monthDays function of the Hmisc package to list the number of days in each month:
require(Hmisc)
nodm=monthDays(ANAm)
Nodm=cbind.data.frame(ANAm,nodm)
I prepared a data.frame for the data I want to have with 3 columns for "YEAR MONTH", "DAY" and "DISCHARGE"
ANATS=array(NA,c(1,3))
colnames(ANATS)=c("mY","d","Q")
And used a simple "for" loop to extract the data into one column according to the number of days in each month
for(i in 1:nrow(Nodm)){
  # ANAd is assumed to hold only the discharge columns of ANAday (e.g. ANAday[, -1])
  selectANA=as.vector(ANAd[i,1:(Nodm[i,2]) ])
  selectANA=as.vector(t(selectANA)) # to generate a simple vector
  dayANA=c(1:(Nodm[i,2]))
  monthANA=rep(format(as.Date(Nodm[i,1]),format="%Y-%m"),times=as.numeric(Nodm[i,2]))
  ANAts=cbind(monthANA,dayANA,selectANA)
  ANATS<-rbind(ANATS,ANAts)
}
ANATS can then be converted into a time series:
combine.date=as.character(paste(ANATS[,1],ANATS[,2],sep="-"))
DATE=as.Date(combine.date, format="%Y-%m-%d")
rownames(ANATS)=as.character(DATE)
ANATS=ANATS[-1,]
ANAXTS=as.xts(ANATS)
Maybe I'm having trouble understanding exactly what you're looking for, but are you trying to transpose the data?
t(data)
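A plain transpose does not drop the impossible dates (for example day 30 or 31 of February), so here is a minimal base R sketch that builds the calendar dates per row and keeps only those that fall in the row's month. The function name wide_to_daily and the toy matrix are hypothetical. The input is assumed to be one row per month with 31 day columns plus a vector of month start dates. Because each row is handled on its own, leap years, gaps between months, and a series that does not start in January need no special casing.

```r
# Flatten a month-by-day table (31 columns) into a long daily series.
# month_start: Date vector with the first day of each row's month.
# values: matrix or data frame with 31 columns (day 1 to day 31).
wide_to_daily <- function(month_start, values) {
  values <- as.matrix(values)
  stopifnot(ncol(values) == 31, length(month_start) == nrow(values))
  pieces <- lapply(seq_along(month_start), function(i) {
    dates <- month_start[i] + 0:30
    # Keep only the days that still belong to the same month.
    keep <- format(dates, "%Y-%m") == format(month_start[i], "%Y-%m")
    data.frame(date = dates[keep],
               value = unname(values[i, keep]),
               row.names = NULL)
  })
  do.call(rbind, pieces)
}

# Toy example: January 1964, February 1964 (leap year), February 1965.
starts <- as.Date(c("1964-01-01", "1964-02-01", "1965-02-01"))
vals <- matrix(seq_len(3 * 31), nrow = 3, byrow = TRUE)
daily <- wide_to_daily(starts, vals)
head(daily)
```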
