zoo object aggregation - r

Dear Community,
the data I receive will be in a data frame:
Var_1 Var_2 Date VaR_3 VaR_4 VaR_5 Var_6
1 4 2010-01-18 7 apple 10 sweet
2 5 2010-07-19 8 orange 11 sour
3 6 2010-01-18 9 kiwi 12 juicy
... ... ... ... ... ... ...
I would like to use zoo, since it seems to be a flexible object class. I'm just starting with R and I tried to read the description (vignettes) for the package.
Questions:
Given the above data as a data frame, which method is recommended to convert the complete df into a zoo object, telling zoo that it shall use the third column as date column (dates can occur multiple times in the data)?
How do I aggregate all other columns monthly, except columns 4 and 6 using zoo built-in functions? Is zoo able to automatically discard categorical variables and just use those columns that are suited for aggregation?
How do I aggregate all numeric columns monthly, for each category in column 4 (column 6 shall not be included, since it is non-numeric).
Thanks for your support.

zoo objects are time series and are normally numeric vectors or matrices. It seems that what you really have is a bunch of different time series where column 5 identifies which series it is. That is, there is an apple series, an orange series, a kiwi series, etc., and each of them has several columns.
Dropping the last column since it's not numeric, using the third column as the index, and splitting on column 5 (VaR_4), we have:
# create test data
Lines <- "Var_1 Var_2 Date VaR_3 VaR_4 VaR_5 Var_6
1 4 2010-01-18 7 apple 10 sweet
2 5 2010-07-19 8 orange 11 sour
3 6 2010-01-18 9 kiwi 12 juicy"
cat(Lines, "\n", file = "data.txt")
library(zoo)
z <- read.zoo("data.txt", header = TRUE, index = 3, split = "VaR_4",
colClasses = c(Var_6 = "NULL"))
The result is:
> z
Var_1.apple Var_2.apple VaR_3.apple VaR_5.apple Var_1.kiwi
2010-01-18 1 4 7 10 3
2010-07-19 NA NA NA NA NA
Var_2.kiwi VaR_3.kiwi VaR_5.kiwi Var_1.orange Var_2.orange
2010-01-18 6 9 12 NA NA
2010-07-19 NA NA NA 2 5
VaR_3.orange VaR_5.orange
2010-01-18 NA NA
2010-07-19 8 11
The above assumes that the dates are unique for a given value of column 5. If that is not the case, pass aggregate = mean (or some other aggregation function) to read.zoo.
To now aggregate it into a monthly zoo series we have:
aggregate(z, as.yearmon, mean)
It would also be possible to convert it straight away to monthly by using the FUN = as.yearmon argument:
zm <- read.zoo("data.txt", header = TRUE, index = "Date", split = "VaR_4",
FUN = as.yearmon, colClasses = c(Var_6 = "NULL"), aggregate = mean)
See ?read.zoo, vignette("zoo-read"), ?aggregate.zoo and the other vignettes and help files as well.
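Since the question starts from a data frame rather than a file, here is a minimal sketch of the same conversion done in memory. It assumes the data frame is called df (a name that does not appear in the original) and that Date is already of class Date. Var_6 is dropped by subsetting, because colClasses only applies when read.zoo reads from a file.

```r
library(zoo)

# Hypothetical in-memory copy of the sample data (the name df is an assumption)
df <- data.frame(
  Var_1 = c(1, 2, 3),
  Var_2 = c(4, 5, 6),
  Date  = as.Date(c("2010-01-18", "2010-07-19", "2010-01-18")),
  VaR_3 = c(7, 8, 9),
  VaR_4 = c("apple", "orange", "kiwi"),
  VaR_5 = c(10, 11, 12),
  Var_6 = c("sweet", "sour", "juicy"),
  stringsAsFactors = FALSE
)

# Drop the non-numeric Var_6 by hand
df_num <- df[, names(df) != "Var_6"]

# Index by Date, one set of columns per VaR_4 value, monthly means
zm <- read.zoo(df_num, index.column = "Date", split = "VaR_4",
               FUN = as.yearmon, aggregate = mean)
zm
```

zoo does not discard categorical columns on its own here, so any non-numeric column other than the split column has to be removed before the call.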

Related

Using tapply and cumsum function for multiple vectors in R

I have a data frame with four columns.
country date pangolin_lineage n cum_country
1 Albania 2020-09-05 B.1.236 1 1
2 Algeria 2020-03-02 B.1 2 2
3 Algeria 2020-03-08 B.1 1 3
4 Algeria 2020-06-09 B.1.1.119 1 4
5 Algeria 2020-06-15 B.1 1 5
6 Algeria 2020-06-15 B.1.36 1 6
I wished to calculate the cumulative sum of n across country and date. I was able to do that with this code:
date_country$cum_country <- as.numeric(unlist(tapply(date_country$n, date_country$country, cumsum)))
I now, however, would like to do the same thing, but the cumulative sum across country, pangolin_lineage, and date. I have tried to add another vector into the above function, but it seems you can only input one index input and one vector input for tapply. I get this error:
date_country$cum_country_pangol <- as.numeric(unlist(tapply(date_country$n, date_country$country, date_country$pangolin_lineage, cumsum)))
Error in match.fun(FUN) :
'date_country$pangolin_lineage' is not a function, character or symbol
Does anyone have any ideas how to use cumsum in tapply across multiple vectors (country, pangolin_lineage, date)?
If there is more than one grouping variable, wrap them in a list. Note, however, that tapply is a summarising function, so with a function like cumsum, which returns one value per row rather than one per group, the result comes back split up by group.
tapply(date_country$n, list(date_country$country, date_country$pangolin_lineage), cumsum)
This is much easier with ave, i.e. if we want to create a new column, we can avoid the hassle of unlist etc. by just using ave:
ave(date_country$n, date_country$country,
date_country$pangolin_lineage, FUN = cumsum)
#[1] 1 2 3 1 4 1
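As a self-contained sketch of the ave approach, the snippet below rebuilds the six sample rows and stores the result in a new column. The column name cum_country_pangol is taken from the question; sorting by country and date first is an added assumption, so that the running total follows date order within each group.

```r
# Sample rows copied from the question
date_country <- data.frame(
  country = c("Albania", rep("Algeria", 5)),
  date = as.Date(c("2020-09-05", "2020-03-02", "2020-03-08",
                   "2020-06-09", "2020-06-15", "2020-06-15")),
  pangolin_lineage = c("B.1.236", "B.1", "B.1", "B.1.1.119", "B.1", "B.1.36"),
  n = c(1, 2, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Make sure rows are in date order within each country (assumption)
date_country <- date_country[order(date_country$country, date_country$date), ]

# ave returns a vector aligned with the original rows, so no unlist is needed
date_country$cum_country_pangol <- ave(date_country$n,
                                       date_country$country,
                                       date_country$pangolin_lineage,
                                       FUN = cumsum)
date_country
```

Because ave keeps each value in its original row position, the result can be assigned directly, whereas unlist(tapply(...)) returns values grouped by level and only lines up if the data happen to be sorted that way.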

Type 'double' with column and row information in console does not appear correctly when using View()

Something strange (to me) is going on.
I have time series data collected by running the commands
data.ts = ts(1:10, frequency = 4, start = c(1959, 2))
D = decompose(data.ts)
df = D$trend
I have what I thought was a data frame (but is actually of type double), df, that when executed in the console, looks like
>df
     Qtr1 Qtr2 Qtr3 Qtr4
1959        NA   NA    3
1960    4    5    6    7
1961    8   NA   NA
However, when using View(df), the data looks like the following below (and does not have the years or quarter information with it):
>View(df)
z
1 NA
2 NA
3 3
4 4
5 5
6 6
7 7
8 8
9 NA
10 NA
I have been trying to convert this type double (it is not a ts object) to a data frame that looks like the result I'm currently getting in the console, but as.data.frame(df) converts df to a data frame that looks like the two-column View() output above.
What is going on exactly?
Bonus: How do I create a data frame out of df while keeping the months and years intact?
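One way to look into this, offered as a sketch rather than a definitive answer: typeof() reports the storage type, while class() reports how the object is dispatched, and the two can differ. The snippet checks both for the trend component and then builds a data frame from the time attributes using time() and cycle() from base R. The names trend and trend_df are made up for illustration.

```r
data.ts <- ts(1:10, frequency = 4, start = c(1959, 2))
trend <- decompose(data.ts)$trend

typeof(trend)  # storage type: "double"
class(trend)   # dispatch class: "ts"

# as.data.frame() keeps only the values, so rebuild year and quarter
# from the time attributes. ts.eps guards against floating point error.
trend_df <- data.frame(
  year    = as.integer(floor(time(trend) + getOption("ts.eps"))),
  quarter = as.integer(cycle(trend)),
  trend   = as.numeric(trend)
)
trend_df
```

The console shows the year by quarter layout because print uses the ts attributes, whereas View() and as.data.frame() only see the underlying vector of values.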

Turning one row into multiple rows in r [duplicate]

This question already has answers here:
Combine Multiple Columns Into Tidy Data [duplicate]
(3 answers)
Closed 5 years ago.
In R, I have data where each person has multiple session dates, and the scores on some tests, but this is all in one row. I would like to change it so I have multiple rows with the persons info, but only one of the session dates and corresponding test scores, and do this for every person. Also, each person may have completed different number of sessions.
Ex:
ID Name Session1Date Score Score Session2Date Score Score
23 sjfd 20150904 2 3 20150908 5 7
28 addf 20150905 3 4 20150910 6 8
To:
ID Name SessionDate Score Score
23 sjfd 20150904 2 3
23 sjfd 20150908 5 7
28 addf 20150905 3 4
28 addf 20150910 6 8
You can use melt from the devel version of data.table, i.e. v1.9.5. It can take multiple 'measure' columns as a list. Instructions to install are here
library(data.table)#v1.9.5+
melt(setDT(df1), measure = patterns("Date$", "Score(\\.2)*$", "Score\\.[13]"))
# ID Name variable value1 value2 value3
#1: 23 sjfd 1 20150904 2 3
#2: 28 addf 1 20150905 3 4
#3: 23 sjfd 2 20150908 5 7
#4: 28 addf 2 20150910 6 8
Or, using reshape from base R, we can specify the direction as 'long' and varying as a list of column indices:
res <- reshape(df1, idvar=c('ID', 'Name'), varying=list(c(3,6), c(4,7),
c(5,8)), direction='long')
res
# ID Name time Session1Date Score Score.1
#23.sjfd.1 23 sjfd 1 20150904 2 3
#28.addf.1 28 addf 1 20150905 3 4
#23.sjfd.2 23 sjfd 2 20150908 5 7
#28.addf.2 28 addf 2 20150910 6 8
If needed, the rownames can be changed
row.names(res) <- NULL
Update
If the columns follow a specific order, i.e. 3rd grouped with 6th, 4th with 7th, 5th with 8th, we can create a matrix of column indices and then split it by row to get the list for the varying argument in reshape.
m1 <- matrix(3:8,ncol=2)
lst <- split(m1, row(m1))
reshape(df1, idvar=c('ID', 'Name'), varying=lst, direction='long')
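To make the base R route reproducible, here is a small sketch that builds df1 from the two example rows and runs reshape on it. The v.names values (SessionDate, Score1, Score2) are illustrative names chosen here, not part of the original answer; without them reshape names the long columns after the first column in each group.

```r
# Sample data from the question; data.frame() makes the repeated
# Score names unique (Score, Score.1, Score.2, Score.3)
df1 <- data.frame(
  ID = c(23, 28),
  Name = c("sjfd", "addf"),
  Session1Date = c(20150904, 20150905),
  Score = c(2, 3),
  Score = c(3, 4),
  Session2Date = c(20150908, 20150910),
  Score = c(5, 6),
  Score = c(7, 8),
  stringsAsFactors = FALSE
)

# Column 3 pairs with 6, 4 with 7, 5 with 8
m1 <- matrix(3:8, ncol = 2)
lst <- unname(split(m1, row(m1)))

res <- reshape(df1, idvar = c("ID", "Name"), varying = lst,
               v.names = c("SessionDate", "Score1", "Score2"),
               direction = "long")
row.names(res) <- NULL
res
```

If people have different numbers of sessions, the missing sessions would presumably show up as NA columns in the wide data and as all-NA rows after reshaping, which can then be filtered out.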
If your data frame is named data, use this:
data1 <- data[1:5]
data2 <- data[c(1,2,6,7,8)]
# rbind matches columns by name, so give both pieces the same names first
names(data2) <- names(data1)
newdata <- rbind(data1,data2)
This works for the example you've given. For other layouts, adjust the column indices and names in data1 and data2 so that rbind lines them up.

How to make all the months to have an equal number of days (for example 22 days) for a MIDAS regression in R

This is a follow up question for these two posts.
How to deal with impossible dates for midasr package
https://stats.stackexchange.com/questions/77495/what-can-i-do-with-these-two-time-series
I need to use the mls function in the midasr package in R to transform the high frequency (daily) financial data to match low frequency (quarterly) macroeconomic data.
The author @mpiktas mentioned
You must make all the months to have an equal number of days. And then
set frequency to that number. You can achieve that by discarding data,
padding NAs or extrapolating.
and
You could use zoo objects to make the padding easier, but in the end
simple numeric vector should be passed.
I tried different ways to search and did not find an easy way to implement.
I use dplyr to get each month to have 31 days with 7-11 NA.
# generate the date vector
library(midasr)
library(dplyr)
library(quantmod)
tsxdate <- as.Date( paste(1979, rep(1:12, each=31), 1:31, sep="-") )
for (year in 1980:2015){
tsxdate <- c(tsxdate,as.Date( paste(year, rep(1:12, each=31), 1:31, sep="-") ))
}
# transform to dataframe
tsxdate.df <- as.data.frame(tsxdate)
# get the stock market index from yahoo
tsxindex <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# merge two data frame to get each month with 31 days
tsx.df <- left_join(tsxdate.df, tsxindex)
I suspect this caused a problem because of too many NAs.
I put the new daily data into a MIDAS regression in R. It did not work. None of the weight functions worked.
# since each month has 31 days. one quarter yy correspond to 93 days data.
midas_r(midas_r(yy~trend+fmls(zz,30,93,nealmon) ,start=list(zz=rep(0,4))), Ofunction="nls")
Could you tell me how to make all the months to have an equal number of days?
update:
Finally, I found a way with the zoo package, using aggregate together with the first function. It is not perfect, but it works and is fast. first adds NAs according to its parameter.
I still need to figure out how to fit it into a MIDAS regression.
# get data
tsx <- getSymbols("^GSPTSE",src="yahoo", from = '1977-01-01', auto.assign = FALSE)
# subset
# generate a zoo object
library(zoo)
tsx.zoo <- zoo(tsx$GSPTSE.Adjusted)
# group by yearmonth and take first 22 days data.
days <-aggregate(tsx.zoo, as.yearmon, first, 22)
It looks like this: each row is one month with 22 days data.
Jun 1979 1614.29 NA NA NA NA NA NA NA NA NA
Jul 1979 1614.29 1598.73 1579.88 1582.57 1582.27 1576.19 1559.23 1529.81 1533.50 1547.66
Aug 1979 1554.14 1556.94 1553.84 1553.84 1551.95 1561.23 1562.52 1571.00 1578.08 1580.28
Sep 1979 1685.11 1657.58 1690.10 1720.92 1716.53 1711.34 1722.71 1714.63 1727.50 1724.51
Oct 1979 1749.05 1767.40 1775.98 1786.35 1800.12 1800.12 1735.88 1685.21 1681.52 1670.65
Nov 1979 1599.33 1606.81 1596.54 1592.94 1574.49 1569.20 1583.97 1608.70 1611.00 1619.78
Jun 1979 NA NA NA NA NA NA NA NA NA NA
Jul 1979 1556.94 1546.86 1548.46 1553.54 1542.07 1543.17 1552.85 1566.01 1573.99 1564.12
Aug 1979 1596.64 1602.82 1615.09 1636.53 1653.09 1660.97 1657.78 1665.46 1674.44 1674.64
Sep 1979 1714.73 1717.53 1732.59 1736.48 1731.19 1732.49 1746.75 1754.33 1747.45 NA
Oct 1979 1639.03 1613.19 1616.29 1635.34 1593.44 1533.40 1522.12 1534.49 1517.24 1523.92
Nov 1979 1628.55 1621.57 1624.36 1627.56 1620.27 1647.51 1677.93 1683.81 1690.70 1698.97
Jun 1979 NA NA
Jul 1979 1554.14 NA
Aug 1979 1674.24 1675.43
Sep 1979 NA NA
Oct 1979 1538.68 1552.25
update again:
@mpiktas gives a better and correct way to do it.
1. NAs should be padded at the beginning of each period.
2. Data should be gathered at the frequency of the response variable. In my case, that is quarterly.
His function can be used with the aggregate function in zoo. I guess it does the same job as group_by plus do in dplyr: split, operate, and give back a list of results. I tried this:
tsxdaily <- aggregate(tsx.zoo, as.yearqtr, padd_nas, 66)
as.yearqtr maps each date to its quarter, which is the frequency of the response variable.
Here is one possible way of how to add NAs.
First, note that MIDAS regression puts the emphasis on the last values of the period, so you need to put NAs in front, not in the back.
Suppose that we have the following dummy data:
> dt <- data.frame(Day=1:10,Quarter=c(rep(1,6),rep(2,4)),value=1:10)
> dt
Day Quarter value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 5 1 5
6 6 1 6
7 7 2 7
8 8 2 8
9 9 2 9
10 10 2 10
In this example there are two quarters, the first one has 6 days, the second one 4. Suppose we want to harmonize the data, so that the quarter has 7 days (for example).
Define a simple function which adds NAs at the beginning of the data:
padd_nas <- function(x, desired_length) {
n <- length(x)
if(n < desired_length) {
c(rep(NA,desired_length-n),x)
} else {
tail(x,desired_length)
}
}
Here is an example illustrating how this function works:
> padd_nas(1:4,7)
[1] NA NA NA 1 2 3 4
>
Now add NAs for each quarter and make sure that the data is ordered by day:
library(dplyr)
pdt <- dt %>% arrange(Day) %>% group_by(Quarter) %>% do(pv = padd_nas(.$value, 7))
> pdt
Source: local data frame [2 x 2]
Groups: <by row>
Quarter pv
1 1 <int[7]>
2 2 <int[7]>
To get the padded result simply use unlist on column pv:
> pv <- pdt$pv %>% unlist
> pv
[1] NA 1 2 3 4 5 6 NA NA NA 7 8 9 10
Now we can prepare this for MIDAS regression with mls. Suppose that only the last 3 days are relevant for each quarter:
> library(midasr)
> mls(pv, 0:2, 7)
X.0/m X.1/m X.2/m
[1,] 6 5 4
[2,] 10 9 8
Compare this with original data dt.
This approach can be generalized for any low and high frequency data configuration.
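For readers who want to check the padding step without dplyr or midasr installed, the same result can be sketched in base R. The split/lapply pipeline and the matrix used to pick the last three values per quarter are substitutes chosen here for illustration; they mirror, but do not call, the group_by/do and mls steps above.

```r
# Pad at the front with NAs, or keep only the last desired_length values
padd_nas <- function(x, desired_length) {
  n <- length(x)
  if (n < desired_length) {
    c(rep(NA, desired_length - n), x)
  } else {
    tail(x, desired_length)
  }
}

dt <- data.frame(Day = 1:10,
                 Quarter = c(rep(1, 6), rep(2, 4)),
                 value = 1:10)
dt <- dt[order(dt$Day), ]

# Split by quarter, pad each piece to 7 values, and flatten again
pv <- unlist(lapply(split(dt$value, dt$Quarter), padd_nas,
                    desired_length = 7),
             use.names = FALSE)
pv

# One row per quarter; columns 7, 6, 5 are lags 0, 1, 2 of the daily data
m <- matrix(pv, ncol = 7, byrow = TRUE)
last3 <- m[, 7:5]
last3
```

The rows of last3 hold the same values as the mls(pv, 0:2, 7) output shown above, which is a quick way to confirm that the NAs ended up at the start of each quarter.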

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong, as t increments in different amounts between the dates:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
Which to me is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- as.data.frame(seq(as.Date("2000/07/01"), as.Date("2014/12/31"), "days"), colnames = "Date")
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to regular (continuous) time series is a good idea.
You can use xts to transform time series data (it is handy, because it can be used in other packages as regular ts)
Filling the gaps
# convert myDate to POSIXct if necessary
# create xts from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), x$myDate )
ts1
# create empty time series
ts_empty <- seq( from = start(ts1), to = end(ts1), by = "DSTday")
# merge the empty ts to the data and fill the gap with 0
ts2 <- merge( ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge( ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo-xts ready functions are:
# na.locf - constant previous value
# na.approx - linear approximation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on your new question they are very likely. I think you want to aggregate the values with the sum function:
ts1 <- period.apply( ts1, endpoints(ts1,'days'), sum)
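As a package-free sketch of the same idea, the trend variable can also be built directly from the dates. The snippet uses the six sample rows from the question; the names d, all_days and d_full are made up here, and filling missing days with an Amount of 0 is an assumption about what an absent row means.

```r
# Sample rows from the question (Year/Month/Day columns omitted)
d <- data.frame(
  myDate = as.Date(c("2000-07-06", "2000-07-07", "2000-07-11",
                     "2000-07-12", "2000-07-16", "2000-07-17")),
  Amount = c(792078.6, 140065.5, 190553.2, 119208.6, 1068156.3, 0.0),
  Count  = 9
)

# Option 1: keep the rows as they are and let t count elapsed days,
# so gaps between payment dates are reflected in the trend
d$t <- as.integer(d$myDate - min(d$myDate)) + 1L

# Option 2: expand to one row per calendar day
all_days <- data.frame(myDate = seq(min(d$myDate), max(d$myDate), by = "day"))
d_full <- merge(all_days, d[, c("myDate", "Amount", "Count")],
                by = "myDate", all.x = TRUE)
d_full$Amount[is.na(d_full$Amount)] <- 0  # assumption: no row means no payment
d_full$t <- seq_len(nrow(d_full))

# Trend fitted on the original rows with the day-based t
fit_t <- lm(Amount ~ t, data = d)
coef(fit_t)
```

Option 1 changes only the regressor, while Option 2 also changes the response by adding zero-payment days, so the two can give different coefficients. Which one is appropriate depends on whether days without a row really are days with no expenditure.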

Resources