Overlay line plots in ggplot2 - r

I've created a multi-line graph using ggplot2 in which each line represents a year plotted against month. Volume is shown on the y-axis.
Here is the code I used to produce the plot:
library(ggplot2)
library(scales)  # provides comma() for the y-axis labels
# cPalette is a vector of colours defined elsewhere
ggplot(data=df26, aes(x=Month, y=C1, group=Year, colour=factor(Year))) +
  geom_line(size=.75) + geom_point() +
  scale_x_discrete(limits=c("Jan","Feb","Mar","Apr","May","Jun","Jul",
                            "Aug","Sep","Oct","Nov","Dec")) +
  scale_y_continuous(labels=comma) +
  scale_colour_manual(values=cPalette, name="Year") +
  ylab("Volume")
Question: How do I add another line to the plot that represents the mean volume within each month, with the ability to modify the thickness and color of that mean line? So far, all of my attempts at producing the right code have been unsuccessful (most likely because I am still fairly new to R). Any help is much appreciated!
Edit: Dataframe df26 is provided below (as requested by a commenter):
Year Month C1
2010 Jan NA
2010 Feb NA
2010 Mar NA
2010 Apr NA
2010 May NA
2010 Jun NA
2010 Jul NA
2010 Aug 183.6516764
2010 Sep 120.6303348
2010 Oct 85.31007613
2010 Nov 13.7347988
2010 Dec 20.93950545
2011 Jan 13.35780833
2011 Feb 14.16910945
2011 Mar 9.786319721
2011 Apr 41.24848885
2011 May 122.3014387
2011 Jun 422.4012809
2011 Jul 539.8569592
2011 Aug 527.6301222
2011 Sep 385.8199781
2011 Oct 201.7846973
2011 Nov 27.91934061
2011 Dec 7.919004379
2012 Jan 10.22724424
2012 Feb 10.64391791
2012 Mar 88.06585438
2012 Apr 124.0320675
2012 May 325.1399457
2012 Jun 465.938168
2012 Jul 567.2273488
2012 Aug 459.769634
2012 Sep 333.8636373
2012 Oct 102.0607986
2012 Nov 23.18822051
2012 Dec 15.64841121
2013 Jan 7.458238256
2013 Feb 4.34972039
2013 Mar 26.2019396
2013 Apr 38.82781323
2013 May 257.0920645
2013 Jun 357.594195
2013 Jul 383.2780483
2013 Aug 456.469314
2013 Sep 319.3616298
2013 Oct NA
2013 Nov NA
2013 Dec 17.01748185

You need to calculate the means. Then you can plot them.
Using dplyr:
library(dplyr)
df26means <- df26 %>%
  group_by(Month) %>%
  summarize(C1 = mean(C1, na.rm = TRUE))
Then add it to your plot:
ggplot(data=df26, aes(x=Month, y=C1, group=Year, colour=factor(Year))) +
  geom_line(size=.75) + geom_point() +
  scale_x_discrete(limits=c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  scale_y_continuous(labels=comma) +
  scale_colour_manual(values=cPalette, name="Year") +
  ylab("Volume") +
  geom_line(data = df26means, aes(group = 1), size = 1.25, color = "black")
I'd recommend using annotate to add a nice piece of text on the plot identifying that line as the mean line. To get it in the legend, you'd probably need to set df26means$Year = "Mean", convert df26$Year to a character, rbind the two dataframes together, then convert Year to a factor. The plot code would be simpler, but the data wrangling is a bit more complicated.
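The legend approach described above could look roughly like the following sketch. It uses a small made-up stand-in for df26 (two years, three months, rounded values) and base R's aggregate() rather than dplyr. The colour values are arbitrary choices, not the asker's cPalette, and the plotting step is skipped if ggplot2 is not installed.

```r
# Small stand-in for df26 (assumption: two years, three months, rounded values)
df26 <- data.frame(
  Year  = rep(c(2011, 2012), each = 3),
  Month = rep(c("Jan", "Feb", "Mar"), times = 2),
  C1    = c(13.4, 14.2, 9.8, 10.2, 10.6, 88.1)
)

# Monthly means; the formula interface drops rows with NA by default
df26means <- aggregate(C1 ~ Month, data = df26, FUN = mean)

# Label the means as their own "year" and stack them under the yearly data
df26means$Year <- "Mean"
df26$Year <- as.character(df26$Year)
combined <- rbind(df26, df26means[, c("Year", "Month", "C1")])
combined$Year <- factor(combined$Year, levels = c("2011", "2012", "Mean"))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(combined, aes(x = Month, y = C1, group = Year, colour = Year)) +
    # yearly lines first, then a thicker mean line; both share the colour legend
    geom_line(data = subset(combined, Year != "Mean"), size = 0.75) +
    geom_line(data = subset(combined, Year == "Mean"), size = 1.25) +
    scale_x_discrete(limits = c("Jan", "Feb", "Mar")) +
    scale_colour_manual(
      values = c("2011" = "steelblue", "2012" = "darkorange", "Mean" = "black"),
      name = "Year"
    ) +
    ylab("Volume")
  print(p)
}
```

Drawing the mean in its own geom_line layer keeps its thickness independent of the yearly lines while the shared colour scale puts it in the legend. In ggplot2 3.4.0 and later, linewidth replaces size for lines.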

Related

Applying custom function to a list of DFs, taking another list as an input - R

I have a list of dfs and a list of annual budgets.
Each df represents one business year, and each budget represents a total spend for that year.
# the business year starts from Feb and ends in Jan.
# the budget column is first populated with the % of annual budget allocation
df <- data.frame(monthly_budget=c(0.06, 0.13, 0.07, 0.06, 0.1, 0.06, 0.06, 0.09, 0.06, 0.06, 0.1, 0.15),
month=month.abb[c(2:12, 1)])
# dfs for 3 years
df2019_20 <- df
df2020_21 <- df
df2021_22 <- df
# budgets for 3 years
budget2019_20 <- 6000000
budget2020_21 <- 7000000
budget2021_22 <- 8000000
# into lists
df_list <- list(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
I've written the following function to assign the right year to Jan and fill in the rest by deparsing the name of the respective df.
It works perfectly if I supply a single df and a single budget.
library(dplyr)
library(stringr)
budget_func <- function(df, budget){
  df_name <- deparse(substitute(df))
  df <- df %>%
    mutate(year=ifelse(month=="Jan",
      as.numeric(str_sub(df_name, -2)) + 2000,
      as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
    )
  for (i in 1:12){
    df[i,1] <- df[i,1] * budget
    i <- i+1
  }
  return(df)
}
To speed things up I want to pass both lists as arguments to mapply. However I don't get the results I want - what am I doing wrong?
final_budgets <- mapply(budget_func, df_list, budget_list)
deparse(substitute(df)) works when a single dataset is passed by name, but inside mapply the function receives a list element, so the expression it captures is no longer the object's name. Instead, add an argument that carries the name. The list must then be named as well: use list(df2019_20 = df2019_20, ...), setNames, or, more simply, dplyr::lst, which names each element after the object passed in.
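library(dplyr)
library(stringr)
budget_func <- function(df, budget, nm1){
  df_name <- nm1
  df <- df %>%
    mutate(year=ifelse(month=="Jan",
      # Jan falls in the second calendar year of the business year
      as.numeric(str_sub(df_name, -2)) + 2000,
      as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
    )
  # convert the allocation shares in the first column to amounts (vectorized)
  df[[1]] <- df[[1]] * budget
  return(df)
}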
Testing:
df_list <- dplyr::lst(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
Map(budget_func, df_list, budget_list, names(df_list))
Output:
$df2019_20
monthly_budget month year
1 360000 Feb 2019
2 780000 Mar 2019
3 420000 Apr 2019
4 360000 May 2019
5 600000 Jun 2019
6 360000 Jul 2019
7 360000 Aug 2019
8 540000 Sep 2019
9 360000 Oct 2019
10 360000 Nov 2019
11 600000 Dec 2019
12 900000 Jan 2020
$df2020_21
monthly_budget month year
1 420000 Feb 2020
2 910000 Mar 2020
3 490000 Apr 2020
4 420000 May 2020
5 700000 Jun 2020
6 420000 Jul 2020
7 420000 Aug 2020
8 630000 Sep 2020
9 420000 Oct 2020
10 420000 Nov 2020
11 700000 Dec 2020
12 1050000 Jan 2021
$df2021_22
monthly_budget month year
1 480000 Feb 2021
2 1040000 Mar 2021
3 560000 Apr 2021
4 480000 May 2021
5 800000 Jun 2021
6 480000 Jul 2021
7 480000 Aug 2021
8 720000 Sep 2021
9 480000 Oct 2021
10 480000 Nov 2021
11 800000 Dec 2021
12 1200000 Jan 2022
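To see why deparse(substitute()) stops working once the function is called through Map or mapply, a minimal base R sketch may help. The helper show_name is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: returns the expression the caller used for its argument
show_name <- function(df) deparse(substitute(df))

df2019_20 <- data.frame(x = 1)

# Called directly, the captured expression is the object's name
direct <- show_name(df2019_20)

# Called through Map, the function sees an internal indexing expression
# generated by mapply, not the original object name
via_map <- Map(show_name, list(df2019_20))[[1]]

print(direct)   # "df2019_20"
print(via_map)  # some internal expression, not "df2019_20"
```

Because the name is lost inside the loop, passing names(df_list) as an explicit extra argument, as done above, is the more dependable route.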

How to change negative values to 0 of forecasts in R?

Because the data is rainfall, I want to replace the negative values in both the point forecasts and the intervals with 0. How can this be done in R? I am looking for R code that makes the required changes.
The Forecast values obtained in R using an ARIMA model are given below
> Predictions
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2021 -1.6625108 -165.62072 162.2957 -252.41495 249.0899
Feb 2021 0.8439712 -165.57869 167.2666 -253.67752 255.3655
Mar 2021 35.9618300 -130.53491 202.4586 -218.67297 290.5966
Apr 2021 53.4407679 -113.05822 219.9398 -201.19746 308.0790
May 2021 206.7464927 40.24744 373.2455 -47.89184 461.3848
Jun 2021 436.2547446 269.75569 602.7538 181.61641 690.8931
Jul 2021 408.2814434 241.78239 574.7805 153.64311 662.9198
Aug 2021 431.7649076 265.26585 598.2640 177.12657 686.4032
Sep 2021 243.5520546 77.05300 410.0511 -11.08628 498.1904
Oct 2021 117.4581047 -49.04095 283.9572 -137.18023 372.0964
Nov 2021 25.0773401 -141.42171 191.5764 -229.56098 279.7157
Dec 2021 28.9468415 -137.55188 195.4456 -225.69098 283.5847
Jan 2022 -0.4912674 -171.51955 170.5370 -262.05645 261.0739
Feb 2022 2.2963271 -168.86759 173.4602 -259.47630 264.0690
Mar 2022 43.3561613 -127.81187 214.5242 -218.42275 305.1351
Apr 2022 48.6538398 -122.51431 219.8220 -213.12526 310.4329
May 2022 228.4762035 57.30805 399.6444 -33.30290 490.2553
Jun 2022 445.3540781 274.18592 616.5222 183.57497 707.1332
Jul 2022 441.8287867 270.66063 612.9969 180.04968 703.6079
Aug 2022 592.5766086 421.40845 763.7448 330.79751 854.3557
Sep 2022 220.6996396 49.53148 391.8678 -41.07946 482.4787
Oct 2022 158.7952154 -12.37294 329.9634 -102.98389 420.5743
Nov 2022 29.9052184 -141.26288 201.0733 -231.87380 291.6842
Dec 2022 25.9432583 -145.22303 197.1095 -235.83298 287.7195
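If Predictions is a numeric matrix or data frame, try:
Predictions[Predictions < 0] <- 0
This replaces every value below 0 with 0. If Predictions is a forecast object from the forecast package, convert it first, for example with as.data.frame(Predictions). The assignment is vectorized, so no for loop is needed; loops are generally discouraged in R where a vectorized operation is available.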
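If Predictions is the forecast object itself rather than a matrix, an alternative is to clip its components directly. The sketch below uses a plain list as a stand-in, assuming the usual mean, lower and upper components of a forecast object; the numbers are rounded from the first three rows of the question's output.

```r
# Stand-in for a forecast object (assumption: components mean, lower, upper)
Predictions <- list(
  mean  = c(-1.66, 0.84, 35.96),
  lower = cbind(`80%` = c(-165.62, -165.58, -130.53),
                `95%` = c(-252.41, -253.68, -218.67)),
  upper = cbind(`80%` = c(162.30, 167.27, 202.46),
                `95%` = c(249.09, 255.37, 290.60))
)

# pmax() compares element-wise with 0 and keeps the matrix shape
Predictions$mean  <- pmax(Predictions$mean, 0)
Predictions$lower <- pmax(Predictions$lower, 0)
Predictions$upper <- pmax(Predictions$upper, 0)

print(Predictions$mean)
print(Predictions$lower)
```

Clipping the components in place keeps the object's structure, so later steps that expect the original layout should still find it.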

how to plot a expenditure vs year in r

I have a dataset which has about 100,000 datapoints.
I want to plot two columns.
Y axis - Year
X axis - Sales
Sample Data:
Sales Year
22 2016
10 2016
3.99 2017
8.99 2017
12.99 2017
8.00 2016
12.00 2017
5.00 2016
22 2017
50 2016
53 2017
I'm using the following code:
plot(subset_4$SALES ~ subset_4$YEAR)
But the plot doesn't look great. Is there any nicer way of doing this?
Update: plot(subset_4$SALES ~ subset_4$WEEKS)
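You can try the ggplot2 library:
library(ggplot2)
# build the plotting data from the columns used in the question
df <- data.frame(sales = subset_4$SALES, year = subset_4$YEAR)
ggplot(df, aes(x = sales, y = year, color = factor(year))) +
  geom_point() +
  xlab("Sales") +
  ylab("Year") +
  labs(color = "Year")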

Can this time series forecasting model (in R) be further improved?

I am trying to build this forecasting model but can't get impressive results. I believe the small number of records available to train the model is one reason for the poor results, so I am seeking help.
Here is the time series matrix of predictor variables. The Paidts7 variable is Paidts6 lagged by one month.
XREG =
Paidts2 Paidts6 Paidts7 Paidts4 Paidts5 Paidts8
Jan 2014 32932400 29703000 58010000 21833 38820 102000.0
Feb 2014 33332497 35953000 29703000 10284 38930 104550.0
Mar 2014 35811723 40128000 35953000 11132 39840 104550.0
Apr 2014 28387000 29167000 40128000 13171 40010 104550.0
May 2014 27941601 27942000 29167000 9192 39640 104550.0
Jun 2014 34236746 35010000 27942000 8766 39430 104550.0
Jul 2014 22986887 26891000 35010000 11217 39060 104550.0
Aug 2014 31616679 31990000 26891000 8118 38840 104550.0
Sep 2014 41839591 46052000 31990000 10954 38380 104550.0
Oct 2014 36945266 36495000 46052000 14336 37920 104550.0
Nov 2014 44026966 41716000 36495000 12362 36810 104550.0
Dec 2014 57689000 60437000 41716000 14498 36470 104550.0
Jan 2015 35150678 35263000 60437000 22336 34110 104550.0
Feb 2015 33477565 33749000 35263000 12188 29970 107163.8
Mar 2015 41226928 41412000 33749000 11122 28580 107163.8
Apr 2015 31031405 30588000 41412000 12605 28970 107163.8
May 2015 31091543 29327000 30588000 9520 27820 107163.8
Jun 2015 38212015 35818000 29327000 10445 28880 107163.8
Jul 2015 32523660 32102000 35818000 12006 28730 107163.8
Aug 2015 33749299 33482000 32102000 9303 27880 107163.8
Sep 2015 48275932 44432000 33482000 10624 25950 107163.8
Oct 2015 32067045 32542000 44432000 15324 25050 107163.8
Nov 2015 46361434 40862000 32542000 10706 25190 107163.8
Dec 2015 68206802 71005000 40862000 14499 24670 107163.8
Jan 2016 34847451 29226000 71005000 23578 23100 107163.8
Feb 2016 34249625 43835001 29226000 13520 21430 109842.9
Mar 2016 45707923 56087003 43835001 15247 19980 109842.9
Apr 2016 33512366 37116000 56087003 18797 20900 109842.9
May 2016 33844153 42902002 37116000 11870 21520 109842.9
Jun 2016 40251630 53203010 42902002 14374 23150 109842.9
Jul 2016 33947604 38411008 53203010 18436 24230 109842.9
Aug 2016 35391779 38545003 38411008 11654 24050 109842.9
Sep 2016 49399281 55589008 38545003 13448 23510 109842.9
Oct 2016 36463617 45751005 55589008 19871 23940 109842.9
Nov 2016 45182618 51641006 45751005 14998 24540 109842.9
Dec 2016 64894588 79141002 51641006 18143 24390 109842.9
Here is the Y variable (to be predicted)
Jan Feb Mar Apr May Jun
2014 1266757.8 1076023.4 1285495.7 1026840.2 910148.8 1111744.5
2015 1654745.7 1281946.6 1372669.3 1017266.6 841578.4 1353995.5
2016 1062048.8 1860531.1 1684564.3 1261672.0 1249547.7 1829317.9
Jul Aug Sep Oct Nov Dec
2014 799973.1 870778.9 1224827.3 1179754.0 1186726.3 1673259.5
2015 1127006.2 779374.9 1223445.6 925473.6 1460704.8 1632066.2
2016 1410316.4 1276771.1 1668296.7 1477083.3 1466419.2 2265343.3
I tried the forecast::Arima and forecast::nnetar models with external regressors but couldn't bring the MAPE below 7. I am targeting a MAPE below 3 and an RMSE under 50000. You are welcome to use any other package or function.
Here is the test data: XREG =
Paidts2test Paidts6test Paidts7test Paidts4test
Jan 2017 31012640 36892000 79141002 27912
Feb 2017 33009746 39020000 36892000 9724
Mar 2017 39296653 52787000 39020000 11335
Apr 2017 36387649 36475000 52787000 17002
May 2017 40269571 41053000 36475000 11436
Paidts5test Paidts8test
Jan 2017 25100 109842.9
Feb 2017 25800 112589.0
Mar 2017 25680 112589.0
Apr 2017 25540 112589.0
May 2017 25830 112589.0
Y =
1627598 1041766 1381536 1346429 1314992
If you find that removing one or more of the predictor variables improves the result significantly, please go ahead. Your help will be greatly appreciated; please suggest solutions in R only, not in another tool.
-Thanks
Try auto.arima from the forecast package; it also accepts external regressors through its xreg argument.
https://www.rdocumentation.org/packages/forecast/versions/8.1/topics/auto.arima
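Whichever model is tried, the two targets named in the question (MAPE and RMSE) can be checked against the held-out values with a few lines of base R. In this sketch the helper names mape and rmse are my own, actual holds the test Y values from the question, and predicted is a purely hypothetical set of forecasts used only to show the call.

```r
# Mean absolute percentage error, in percent
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Root mean squared error, in the units of the series
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Test-period Y values from the question (Jan to May 2017)
actual <- c(1627598, 1041766, 1381536, 1346429, 1314992)

# Hypothetical forecasts, NOT produced by any fitted model
predicted <- c(1600000, 1100000, 1350000, 1400000, 1300000)

cat("MAPE:", mape(actual, predicted), "\n")
cat("RMSE:", rmse(actual, predicted), "\n")
```

With only five test points, both measures are sensitive to a single bad month, so it may be worth looking at the individual errors as well.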

Fill in missing year in ordered list of dates

I have collected some time series data from the web and the timestamp that I got looks like below.
24 Jun
21 Mar
20 Jan
10 Dec
20 Jun
20 Jan
10 Dec
...
The interesting part is that the year is missing in the data, however, all the records are ordered, and you can infer the year from the record and fill in the missing data. So the data after imputing should be like this:
24 Jun 2014
21 Mar 2014
20 Jan 2014
10 Dec 2013
20 Jun 2013
20 Jan 2013
10 Dec 2012
...
Before rolling up my sleeves and writing a for loop with nested logic, is there an easy way that works out of the box in R to impute the missing year?
Thanks a lot for any suggestion!
Here's one idea
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"))
## Convert each month-day combo to its corresponding "julian date"
datestring <- paste("2012", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 2014 - cumsum(diff(c(julian[1], julian))>0)
## Check that it worked
df
# day month year
# 1 24 Jun 2014
# 2 21 Mar 2014
# 3 20 Jan 2014
# 4 10 Dec 2013
# 5 20 Jun 2013
# 6 20 Jan 2013
# 7 10 Dec 2012
The OP has requested to complete the years in descending order starting in 2014.
Here is an alternative approach which works without date conversion and fake dates. Furthermore, this approach can be modified to work with fiscal years which start on a different month than January.
# create sample dataset
df <- data.frame(
day = c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
"Jan", "Jan", "Jun"))
df$year <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) > 0))
df
day month year
1 24 Jun 2014
2 21 Mar 2014
3 20 Jan 2014
4 10 Dec 2013
5 20 Jun 2013
6 20 Jan 2013
7 21 Jan 2012
8 10 Dec 2011
9 30 Jan 2011
10 10 Jan 2011
11 10 Jan 2011
12 7 Jun 2010
Completion of fiscal years
Let's assume the business has decided to start its fiscal year on February 1. Thus, January lies in a different fiscal year than February or March of the same calendar year.
To handle fiscal years, we only need to shuffle the factor levels accordingly:
df$fy <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb[c(2:12, 1)])) + df$day) > 0))
df
day month year fy
1 24 Jun 2014 2014
2 21 Mar 2014 2014
3 20 Jan 2014 2013
4 10 Dec 2013 2013
5 20 Jun 2013 2013
6 20 Jan 2013 2012
7 21 Jan 2012 2011
8 10 Dec 2011 2011
9 30 Jan 2011 2010
10 10 Jan 2011 2010
11 10 Jan 2011 2010
12 7 Jun 2010 2010
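Both variants can be wrapped in one small function. This is a sketch under the same assumptions as above (records in descending date order, English month abbreviations); the name fill_years and its parameters start_year and first_month are introduced here for illustration only.

```r
# Fill in years for day/month records sorted from newest to oldest.
# first_month is the month number that starts the (fiscal) year.
fill_years <- function(day, month, start_year, first_month = 1L) {
  # rotate the month order so the year starts at first_month
  lev <- month.abb[c(first_month:12, seq_len(first_month - 1L))]
  # one sortable key per record: month position * 100 + day
  key <- 100L * as.integer(factor(month, levels = lev)) + as.integer(day)
  # going back in time, a rising key means we crossed into the previous year
  start_year - cumsum(c(0L, diff(key) > 0))
}

day   <- c(24, 21, 20, 10, 20, 20, 10)
month <- c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec")

calendar_year <- fill_years(day, month, start_year = 2014)
fiscal_year   <- fill_years(day, month, start_year = 2014, first_month = 2L)

print(data.frame(day, month, calendar_year, fiscal_year))
```

As with the original code, two consecutive records that are exactly one year apart (same day and month) cannot be told apart, because the key does not rise between them.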
