Dealing with apply functions of xts object in R

I have a sample xts object with some data:
dates <- seq.Date(from = as.Date("2010-01-01", format = "%Y-%m-%d"),
to = as.Date("2013-12-01", format = "%Y-%m-%d"), by = "month")
sample_data <- cbind(1:length(dates),length(dates):1)
xts_object <- xts(x = sample_data, order.by = dates)
I then use apply.yearly on it with the function cumsum:
apply.yearly(x = xts_object, FUN = cumsum)
The output is a transposed matrix, which is not what I intended.
I would expect the snippet above to return the same output as:
rbind(apply(X = xts_object[1:12],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[13:24],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[25:36],MARGIN = 2,FUN = cumsum),
apply(X = xts_object[37:48],MARGIN = 2,FUN = cumsum))
The problem with using apply is that it returns a matrix and not an xts object. While I could solve this by using as.xts, I would like to know if there is something I am missing, or if I am using apply.yearly incorrectly. Using plain apply seems more prone to hard-to-catch errors and bugs.

This might not be the most elegant solution, but it works:
# Split xts_object by year
xts_list = split(xts_object, "years")
# cumsum for each year
cumsum_list = lapply(xts_list, FUN = cumsum)
# rbind them together
do.call(rbind, cumsum_list)
# [,1] [,2]
# 2010-01-01 1 48
# 2010-02-01 3 95
# 2010-03-01 6 141
# 2010-04-01 10 186
# 2010-05-01 15 230
# 2010-06-01 21 273
# 2010-07-01 28 315
# 2010-08-01 36 356
# 2010-09-01 45 396
# 2010-10-01 55 435
# 2010-11-01 66 473
# 2010-12-01 78 510
# 2011-01-01 13 36
# 2011-02-01 27 71
# 2011-03-01 42 105
# 2011-04-01 58 138
# 2011-05-01 75 170
# 2011-06-01 93 201
# 2011-07-01 112 231
# 2011-08-01 132 260
# 2011-09-01 153 288
# 2011-10-01 175 315
# 2011-11-01 198 341
# 2011-12-01 222 366
# 2012-01-01 25 24
# 2012-02-01 51 47
# 2012-03-01 78 69
# 2012-04-01 106 90
# 2012-05-01 135 110
# 2012-06-01 165 129
# 2012-07-01 196 147
# 2012-08-01 228 164
# 2012-09-01 261 180
# 2012-10-01 295 195
# 2012-11-01 330 209
# 2012-12-01 366 222
# 2013-01-01 37 12
# 2013-02-01 75 23
# 2013-03-01 114 33
# 2013-04-01 154 42
# 2013-05-01 195 50
# 2013-06-01 237 57
# 2013-07-01 280 63
# 2013-08-01 324 68
# 2013-09-01 369 72
# 2013-10-01 415 75
# 2013-11-01 462 77
# 2013-12-01 510 78
class(do.call(rbind, cumsum_list))
# [1] "xts" "zoo"
The resulting object is still an "xts" object.
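If splitting and re-binding feels heavy, a grouped cumulative sum can also be computed in one pass. The sketch below is an alternative and not part of the answer above: it assumes the xts package is installed and uses base R's ave() to restart cumsum at each calendar year. The names yr, cum_mat and res are illustrative.

```r
library(xts)

dates <- seq(as.Date("2010-01-01"), as.Date("2013-12-01"), by = "month")
sample_data <- cbind(seq_along(dates), rev(seq_along(dates)))
xts_object <- xts(x = sample_data, order.by = dates)

# Grouping key: the calendar year of each observation
yr <- format(index(xts_object), "%Y")

# Per-year cumsum for every column; ave() keeps the original row order
cum_mat <- apply(coredata(xts_object), 2,
                 function(col) ave(col, yr, FUN = cumsum))

# Re-attach the original time index so the result is an xts object again
res <- xts(cum_mat, order.by = index(xts_object))
class(res)
```

Because the index is re-attached explicitly, there is no need to convert a bare matrix back with as.xts afterwards.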

Related

How do I use the suffixes argument when merging multiple dataframes in R?

First merge of two datasets:
Agg_Bpm <- merge(agg_bpm_mean, agg_bpm_median, by = "Group.1", all = TRUE, suffixes = c(".mean",".median"))
Output:
Group.1 x.mean x.median
1 2022484408 80.23686 76
2 2026352035 93.77631 95
3 2347167796 76.72279 73
4 4020332650 82.30058 83
5 4388161847 66.13300 62
Now I want to merge another dataset:
> agg_bpm_max
Group.1 x
1 2022484408 203
2 2026352035 125
3 2347167796 195
4 4020332650 191
5 4388161847 180
So I try this:
Agg_Bpm <- merge(Agg_Bpm, agg_bpm_max, by = "Group.1", all = TRUE, suffixes = c(c(".mean",".median"),".max"))
The merge works, but how do I suffix the new column properly?
> Agg_Bpm
Group.1 x.mean x.median x
1 2022484408 80.23686 76 203
2 2026352035 93.77631 95 125
3 2347167796 76.72279 73 195
4 4020332650 82.30058 83 191
5 4388161847 66.13300 62 180
I decided a simple solution was to just not use the suffixes argument and just rename columns after the fact:
> names(Agg_Bpm) <- c("id","mean","median","max","sd")
> Agg_Bpm
id mean median max sd
1 2022484408 80.23686 76 203 17.59559
2 2026352035 93.77631 95 125 12.61053
3 2347167796 76.72279 73 195 15.51571
4 4020332650 82.30058 83 191 15.91159
5 4388161847 66.13300 62 180 15.84728
Could you try:
Agg_Bpm <- merge(Agg_Bpm, agg_bpm_max, by = "Group.1", all = TRUE, suffixes = c("",".max"))
Next time, please provide a reproducible dataset. You can do this with the function dput(Agg_Bpm).
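One detail worth checking here: merge() only applies suffixes to non-key columns whose names clash between the two inputs. After the first merge the columns are already called x.mean and x.median, so a new column x no longer clashes and stays unsuffixed. The sketch below uses a two-row excerpt of the data shown above; the object names agg_mean, agg_median, agg_max, step1, step2, stats, renamed and combined are illustrative, and renaming before merging is just one possible workaround.

```r
agg_mean   <- data.frame(Group.1 = c(2022484408, 2026352035), x = c(80.23686, 93.77631))
agg_median <- data.frame(Group.1 = c(2022484408, 2026352035), x = c(76, 95))
agg_max    <- data.frame(Group.1 = c(2022484408, 2026352035), x = c(203, 125))

# First merge: both inputs have a column "x", so the suffixes are used
step1 <- merge(agg_mean, agg_median, by = "Group.1",
               suffixes = c(".mean", ".median"))

# Second merge: "x" no longer clashes with anything, so no suffix is added
step2 <- merge(step1, agg_max, by = "Group.1", suffixes = c("", ".max"))
names(step2)

# Workaround: give every value column its final name first, then merge all
stats    <- list(mean = agg_mean, median = agg_median, max = agg_max)
renamed  <- Map(function(d, nm) setNames(d, c("Group.1", paste0("x.", nm))),
                stats, names(stats))
combined <- Reduce(function(a, b) merge(a, b, by = "Group.1", all = TRUE),
                   renamed)
names(combined)
```

Renaming up front scales to any number of summary tables, whereas suffixes only ever distinguishes the two inputs of a single merge() call.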

How to use mutate_at() with two sets of variables, in R

Using dplyr, I want to divide a column by another one, where the two columns have a similar pattern.
I have the following data frame:
My_data = data.frame(
var_a = 101:110,
var_b = 201:210,
number_a = 1:10,
number_b = 21:30)
I would like to create a new variable: var_a_new = var_a/number_a, var_b_new = var_b/number_b and so on if I have c, d etc.
My_data %>%
mutate_at(
.vars = c('var_a', 'var_b'),
.funs = list( new = function(x) x/(.[,paste0('number_a', names(x))]) ))
I did not get an error, but the result was wrong. I think the problem is that I don't understand what 'x' is. Is it one of the strings in .vars? Is it a column in My_data? Something else?
One option could be:
bind_cols(My_data,
My_data %>%
transmute(across(starts_with("var"))/across(starts_with("number"))) %>%
rename_all(~ paste0(., "_new")))
var_a var_b number_a number_b var_a_new var_b_new
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
You can do this directly, provided the columns are correctly ordered, meaning "var_a" is the first column in the "var" group, "number_a" is the first column in the "number" group, and so on for the other pairs.
var_cols <- grep('var', names(My_data), value = TRUE)
number_cols <- grep('number', names(My_data), value = TRUE)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols]/My_data[number_cols]
My_data
# var_a var_b number_a number_b var_a_new var_b_new
#1 101 201 1 21 101.00000 9.571429
#2 102 202 2 22 51.00000 9.181818
#3 103 203 3 23 34.33333 8.826087
#4 104 204 4 24 26.00000 8.500000
#5 105 205 5 25 21.00000 8.200000
#6 106 206 6 26 17.66667 7.923077
#7 107 207 7 27 15.28571 7.666667
#8 108 208 8 28 13.50000 7.428571
#9 109 209 9 29 12.11111 7.206897
#10 110 210 10 30 11.00000 7.000000
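The ordering assumption in the base R answer above can be avoided by pairing the columns by their shared suffix instead of by position. This is a minimal sketch under the assumption that every var_<suffix> column has a matching number_<suffix> column; the names suffixes and s are illustrative.

```r
My_data <- data.frame(
  var_a = 101:110,
  var_b = 201:210,
  number_a = 1:10,
  number_b = 21:30)

# Extract the suffixes ("a", "b", ...) from the var_ columns
suffixes <- sub("^var_", "", grep("^var_", names(My_data), value = TRUE))

# Fail early if a matching number_ column is missing
stopifnot(all(paste0("number_", suffixes) %in% names(My_data)))

# Divide each var_ column by the number_ column with the same suffix
for (s in suffixes) {
  My_data[[paste0("var_", s, "_new")]] <-
    My_data[[paste0("var_", s)]] / My_data[[paste0("number_", s)]]
}
head(My_data)
```

Matching by name costs a few more lines but keeps the result correct even if the columns are later reordered.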
The function across() has replaced scope variants such as mutate_at(), summarize_at() and others. For more details, see vignette("colwise") or https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html. Based on tmfmnk's answer, the following works well:
My_data %>%
mutate(
new = across(starts_with("var"))/across(starts_with("number")))
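Because the result of the division is a data frame assigned to new, it is stored as a single data-frame column, which prints with the prefix "new." in front of each variable name.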
var_a var_b number_a number_b new.var_a new.var_b
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000

Subtracting similar column names R

I have a dataframe with columns that have 'x1' and 'x1_fit' with the numbers going up to 5 in some cases.
date <- seq(as.Date('2019-11-04'), by = "days", length.out = 7)
x1 <- c(100,120,111,152,110,112,111)
x1_fit <- c(150,142,146,148,123,120,145)
x2 <- c(110,130,151,152,150,142,161)
x2_fit <- c(170,172,176,178,173,170,175)
df <- data.frame(date,x1,x1_fit,x2,x2_fit)
How can I compute x1_fit - x1, and so on? The number of x's will change every time.
You can select those columns with regular expressions (suppose the columns are in the appropriate order):
> df[, grep('^x\\d+_fit$', colnames(df))] - df[, grep('^x\\d+$', colnames(df))]
x1_fit x2_fit
1 50 60
2 22 42
3 35 25
4 -4 26
5 13 23
6 8 28
7 34 14
If you want to assign the differences to the original df:
df[, paste0(grep('^x\\d+$', colnames(df), value = TRUE), '_diff')] <-
df[, grep('^x\\d+_fit$', colnames(df))] - df[, grep('^x\\d+$', colnames(df))]
# > df
# date x1 x1_fit x2 x2_fit x1_diff x2_diff
# 1 2019-11-04 100 150 110 170 50 60
# 2 2019-11-05 120 142 130 172 22 42
# 3 2019-11-06 111 146 151 176 35 25
# 4 2019-11-07 152 148 152 178 -4 26
# 5 2019-11-08 110 123 150 173 13 23
# 6 2019-11-09 112 120 142 170 8 28
# 7 2019-11-10 111 145 161 175 34 14
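The regular-expression approach above relies on the two groups of columns appearing in the same order. A small variation, sketched here as an assumption-labeled alternative, derives the fitted column names from the base names so that the pairing is by name; base_cols and fit_cols are illustrative names.

```r
date <- seq(as.Date('2019-11-04'), by = "days", length.out = 7)
x1 <- c(100,120,111,152,110,112,111)
x1_fit <- c(150,142,146,148,123,120,145)
x2 <- c(110,130,151,152,150,142,161)
x2_fit <- c(170,172,176,178,173,170,175)
df <- data.frame(date, x1, x1_fit, x2, x2_fit)

# Base columns are x1, x2, ...; each fitted column is the base name + "_fit"
base_cols <- grep('^x\\d+$', names(df), value = TRUE)
fit_cols  <- paste0(base_cols, '_fit')

# Fail early if any base column has no fitted counterpart
stopifnot(all(fit_cols %in% names(df)))

# Subtract pairwise and store the result as x1_diff, x2_diff, ...
df[paste0(base_cols, '_diff')] <- df[fit_cols] - df[base_cols]
df
```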
The solution from @mt1022 is straightforward. However, since you have tagged this as dplyr, here is one approach where we convert the data to long format, subtract the corresponding values, and convert the data back to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -date) %>%
mutate(name = sub('_.*', '', name)) %>%
group_by(date, name) %>%
summarise(diff = diff(value)) %>%
pivot_wider(names_from = name, values_from = diff) %>%
rename_at(-1, ~paste0(., "_diff")) %>%
left_join(df, by = "date")
# date x1_diff x2_diff x1 x1_fit x2 x2_fit
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2019-11-04 50 60 100 150 110 170
#2 2019-11-05 22 42 120 142 130 172
#3 2019-11-06 35 25 111 146 151 176
#4 2019-11-07 -4 26 152 148 152 178
#5 2019-11-08 13 23 110 123 150 173
#6 2019-11-09 8 28 112 120 142 170
#7 2019-11-10 34 14 111 145 161 175
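In base R, you could loop over the unique column names and take the difference with the fitted column (note that this computes x - x_fit, so the signs are flipped relative to the other answers) using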
> lapply(setNames(nm = unique(gsub("_.*", "", names(df)))), function(nm) {
fit <- paste0(nm, "_fit")
diff <- df[, nm] - df[, fit]
})
# $x1
# [1] -50 -22 -35 4 -13 -8 -34
#
# $x2
# [1] -60 -42 -25 -26 -23 -28 -14
Here, I set the Date column as the row names and removed the column using
df <- data.frame(date,x1,x1_fit,x2,x2_fit)
row.names(df) <- df$date
df$date <- NULL
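but you could just loop over the column names without the Date column.
We can also do this with a split in base R:
# split the non-date columns into one data frame per variable (x1, x2, ...)
out <- sapply(split.default(df[-1], sub("_.*", "", names(df)[-1])),
# within each pair, column 2 is the fitted value and column 1 the original
function(x) x[, 2] - x[, 1])
# out is a matrix with one column per variable; add it back as *_diff columns
df[paste0(colnames(out), "_diff")] <- out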
df
# date x1 x1_fit x2 x2_fit x1_diff x2_diff
#1 2019-11-04 100 150 110 170 50 60
#2 2019-11-05 120 142 130 172 22 42
#3 2019-11-06 111 146 151 176 35 25
#4 2019-11-07 152 148 152 178 -4 26
#5 2019-11-08 110 123 150 173 13 23
#6 2019-11-09 112 120 142 170 8 28
#7 2019-11-10 111 145 161 175 34 14

Finding Fitted Forecast with ARIMA / Exponential Smoothing in R

I've written some code to sort claim dates, count them per month/year and I am attempting to forecast them with either ARIMA / exponential smoothing parameters.
See claims list:
2012-01-31 82
2012-02-29 65
2012-03-31 64
2012-04-30 73
2012-05-31 71
2012-06-30 79
2012-07-31 72
2012-08-31 82
2012-09-29 64
2012-10-31 72
2012-11-30 63
2012-12-31 80
2013-01-31 67
2013-02-27 65
2013-03-31 84
2013-04-30 68
2013-05-31 68
2013-06-29 66
2013-07-30 64
2013-08-31 69
2013-09-29 66
2013-10-31 65
2013-11-30 56
2013-12-31 76
2014-01-31 75
2014-02-28 58
2014-03-29 80
2014-04-30 76
2014-05-31 80
2014-06-28 68
2014-07-31 82
2014-08-30 79
2014-09-27 60
2014-10-31 85
2014-11-30 60
2014-12-31 76
2015-01-31 75
2015-02-28 84
2015-03-31 77
2015-04-30 79
2015-05-30 91
2015-06-30 82
2015-07-31 98
2015-08-31 65
2015-09-30 77
2015-10-31 115
2015-11-30 79
2015-12-31 80
2016-01-30 91
2016-02-29 105
2016-03-31 77
2016-04-30 107
2016-05-31 85
2016-06-30 89
2016-07-30 112
2016-08-31 88
2016-09-30 90
2016-10-30 79
2016-11-30 85
2016-12-31 66
The issue I'm facing with my code is that I am getting a mean forecast rather than my desired fitted data similar to this example: https://stats.stackexchange.com/questions/115506/forecasting-a-seasonal-time-series-in-r
Please see the R code:
Sorting the claim dates and counting them
library(forecast)
library(ggplot2)
library(xts)
library(reshape2)
library(zoo)
library(lubridate)
data = read.csv('Claims1.csv')
data$DISABILITYDATE <- as.Date(data$DISABILITYDATE, "%m/%d/%Y")
data
str(data)
as.Date(data[,1])
xts(x=data[,-1], order.by = data[,1])
data = read.csv('Claims1.csv')
data$DISABILITYDATE <- as.Date (data$DISABILITYDATE, "%m/%d/%Y")
df <- xts(rep(1,length(data$DISABILITYDATE)),order.by=data$DISABILITYDATE)
df1 <- apply.monthly(df,function(x) length(x))
df1
t(df1)
str(df1)
df2 <- data.frame(df1=c("Jan 2012","Feb 2012","Mar 2012","Apr 2012","May 2012","Jun 2012","Jul 2012","Aug 2012","Sep 2012","Oct 2012","Nov 2012","Dec 2012","Jan 2013","Feb 2013","Mar 2013","Apr 2013","May 2013","Jun 2013","Jul 2013","Aug 2013","Sep 2013","Oct 2013","Nov 2013","Dec 2013","Jan 2014","Feb 2014","Mar 2014","Apr 2014","May 2014","Jun 2014","Jul 2014","Aug 2014","Sep 2014","Oct 2014","Nov 2014","Dec 2014","Jan 2015","Feb 2015","Mar 2015","Apr 2015","May 2015","Jun 2015","Jul 2015","Aug 2015","Sep 2015","Oct 2015","Nov 2015","Dec 2015","Jan 2016","Feb 2016","Mar 2016","Apr 2016","May 2016","Jun 2016","Jul 2016","Aug 2016","Sep 2016","Oct 2016","Nov 2016","Dec 2016"),score=c(df1))
df2
t(df2)
df2[-1]
2.1 Forecasting with ETS (Exponential Smoothing)
library(forecast)
x.ts <- as.ts(df2[2])
x.ts
x.ets <- ets(x.ts)
x.ets
x.fore <- forecast(x.ets$fitted, h=12)
x.fore
x <- ts(df2[2], start = 2012, frequency = 12)
plot(forecast(ets(x), 24))
x
plot(forecast(x, h=12))
date1 <- ymd("2012-01-01","2013-01-01","2014-01-01","2015-01-01","2016-01-01","2017-01-01")
abline(v=decimal_date(date1), col="blue")
2.2 Forecasting with ARIMA
ARIMAfit = auto.arima(x, approximation=FALSE,trace=FALSE)
summary(ARIMAfit)
plot(ARIMAfit)
pred = predict(ARIMAfit, n.ahead = 48)
round(as.numeric(pred$fitted,0))
pred
library(TSPred)
plotarimapred(pred$pred,x, xlim=c(2012, 2020), range.percent = 0.05)
My output is this: [image not preserved]
Example of desired output: [image not preserved]
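For readers comparing the two kinds of output, the following sketch separates in-sample fitted values from out-of-sample forecasts using only base R's stats package. It is not the asker's code: the ARIMA(0,1,1) order is an assumption chosen for illustration, and the names claims, fit, fitted_vals and fc are hypothetical. Note that predict() on an arima fit returns only pred and se, so there is no fitted element to extract from it.

```r
# Monthly claim counts from the list above, 2012-01 to 2016-12
claims <- ts(c(82, 65, 64, 73, 71, 79, 72, 82, 64, 72, 63, 80,
               67, 65, 84, 68, 68, 66, 64, 69, 66, 65, 56, 76,
               75, 58, 80, 76, 80, 68, 82, 79, 60, 85, 60, 76,
               75, 84, 77, 79, 91, 82, 98, 65, 77, 115, 79, 80,
               91, 105, 77, 107, 85, 89, 112, 88, 90, 79, 85, 66),
             start = c(2012, 1), frequency = 12)

# Assumed model order, for illustration only
fit <- arima(claims, order = c(0, 1, 1))

# In-sample fitted values: observed series minus one-step-ahead residuals
fitted_vals <- claims - residuals(fit)

# Out-of-sample point forecasts and their standard errors
fc <- predict(fit, n.ahead = 12)
names(fc)

# Observed series, fitted values over the sample, forecasts after it
ts.plot(claims, fitted_vals, fc$pred, col = c("black", "blue", "red"))
```

A model without trend or seasonal terms, such as this one, produces the same point forecast at every horizon, which is one common reason a forecast plot looks like a flat mean line. With the forecast package loaded, fitted(model) gives the in-sample values and forecast(model, h = 12) the future ones when called on the fitted model object.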

Generate entries in time series data

I want to generate a row (with ammount set to zero) for each missing month, up to the current month, in the following data frame. Could you please give me a hand with this? Thanks!
trans_date ammount
1 2004-12-01 2968.91
2 2005-04-01 500.62
3 2005-05-01 434.30
4 2005-06-01 549.15
5 2005-07-01 276.77
6 2005-09-01 548.64
7 2005-10-01 761.69
8 2005-11-01 636.77
9 2005-12-01 1517.58
10 2006-03-01 719.09
11 2006-04-01 1231.88
12 2006-05-01 580.46
13 2006-07-01 1468.43
14 2006-10-01 692.22
15 2006-11-01 505.81
16 2006-12-01 1589.70
17 2007-03-01 1559.82
18 2007-06-01 764.98
19 2007-07-01 964.77
20 2007-09-01 405.18
21 2007-11-01 112.42
22 2007-12-01 1134.08
23 2008-02-01 269.72
24 2008-03-01 208.96
25 2008-04-01 353.58
26 2008-05-01 756.00
27 2008-06-01 747.85
28 2008-07-01 781.62
29 2008-09-01 195.36
30 2008-10-01 424.24
31 2008-12-01 166.23
32 2009-02-01 237.11
33 2009-04-01 110.94
34 2009-07-01 191.29
35 2009-11-01 153.42
36 2009-12-01 222.87
37 2010-09-01 1259.97
38 2010-11-01 375.61
39 2010-12-01 496.48
40 2011-02-01 360.07
41 2011-03-01 324.95
42 2011-04-01 566.93
43 2011-06-01 281.19
44 2011-08-01 428.04
'data.frame': 44 obs. of 2 variables:
$ trans_date : Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount: num 2969 501 434 549 277 ...
You can use seq.Date and merge:
> str(df)
'data.frame': 44 obs. of 2 variables:
$ trans_date: Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount : num 2969 501 434 549 277 ...
> mns <- data.frame(trans_date = seq.Date(min(df$trans_date), max(df$trans_date), by = "month"))
> df2 <- merge(mns, df, all = TRUE)
> df2$ammount <- ifelse(is.na(df2$ammount), 0, df2$ammount)
> head(df2)
trans_date ammount
1 2004-12-01 2968.91
2 2005-01-01 0.00
3 2005-02-01 0.00
4 2005-03-01 0.00
5 2005-04-01 500.62
6 2005-05-01 434.30
And if you need months up to the current date, use this:
mns <- data.frame(trans_date = seq.Date(min(df$trans_date), Sys.Date(), by = "month"))
Note that it is sufficient to call simply seq instead of seq.Date if the arguments are of class Date.
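The same gap filling can be sketched without merge by building the full monthly grid first and writing the known values into it with match(). This is a minimal illustration on the first three rows of the data above; it assumes every trans_date falls on the first of a month, and the names all_months and filled are illustrative.

```r
df <- data.frame(
  trans_date = as.Date(c("2004-12-01", "2005-04-01", "2005-05-01")),
  ammount = c(2968.91, 500.62, 434.30))

# Every month between the first and last observation
all_months <- seq(min(df$trans_date), max(df$trans_date), by = "month")

# Start from a grid of zeros, then overwrite the months that have data
filled <- data.frame(trans_date = all_months, ammount = 0)
filled$ammount[match(df$trans_date, all_months)] <- df$ammount
filled
```

To extend the grid to the present, replace max(df$trans_date) with Sys.Date(), as in the answer above.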
If you're using xts objects, you can use timeBasedSeq and merge.xts. Assuming your original data is in an object Data:
# create xts object:
# no comma on the first subset (Data['ammount']) keeps column name;
# as.Date needs a vector, so use comma (Data[,'trans_date'])
x <- xts(Data['ammount'],as.Date(Data[,'trans_date']))
# create a time-based vector from 2004-12-01 to 2011-08-01. The "m" denotes
# monthly time-steps. By default this returns a yearmon class. Use
# retclass="Date" to return a Date vector.
d <- timeBasedSeq(paste(start(x),end(x),"m",sep="/"), retclass="Date")
# merge x with an "empty" xts object, xts(,d), filling with zeros
y <- merge(x,xts(,d),fill=0)
