Extraction of year issue in covid-19 dataset from JHU - r

I'm trying to follow the worked exercise "COVID-19 Data Analysis with R - Worldwide" by Yanchang Zhao, a COVID-19 data analysis performed in R.
During the data-loading step I ran into an issue with the date format. The script is supposed to print the date range of the data set as
"2020-01-22" "2020-06-15"
but when I run it, I get
"2002-10-10" "2020-10-09"
I'm following the exercise script exactly, so I'm confused about why the earliest year is "2002" instead of "2020". Has anyone in the community experienced this issue?
Thank you!
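One plausible cause, offered as a guess since the exercise script is not shown here: if the dates are taken from the column names with a fixed-width substring, names with a two-digit month and day lose the last digit of the year. The sketch below assumes column names of the form X1.22.20, as read.csv would typically produce from headers such as 1/22/20; the names and the base-R parsing are illustrative and not taken from the exercise.

```r
# Hypothetical column names, assumed to come from headers like 1/22/20
cols <- c("X1.22.20", "X10.10.20")

# A fixed-width substring is only long enough for single-digit months:
# the second name loses the final digit of its year
truncated <- substr(cols, 2, 8)
print(truncated)

# Stripping the leading "X" instead keeps the whole date string
full <- sub("^X", "", cols)
dates <- as.Date(full, format = "%m.%d.%y")
print(range(dates))
```

If this is what is happening, the truncated year "2" is what ends up being read as 2002 once the data extends past early October 2020.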

Related

Forecasting Hospital Bed Demand Using Daily Observations

Basically, my task for the next 3 months is to forecast bed demand and a couple of other variables in a hospital's emergency department. The data is 5 years worth of daily observations of these variables. The data is complete with no missing values.
The goal is to improve the prediction accuracy of the current tool, which is an Excel workbook.
I have not taken any time series or optimization courses in college so far, so imagine my horror when I realised I had no clue how to approach this project and that I would be working entirely alone. I was told no one in the department has any experience and no one would be able to help me.
I'm using RStudio, but I'm not very proficient since it was self-taught.
From trying out the questions asked on here as well as YouTube tutorials to learn the appropriate syntax and functions, what I have managed to find out is:
1) My data is a time series and I should apply forecasting models to predict future values based on the historical data I have.
2) Daily observations over a long period have weekly and annual seasonality, so I should define the data as a multi-seasonal time series.
I first tried defining my data with ts(), then msts(). One of the answers here mentioned zoo() would be more appropriate for daily observations, so I tried that too. The forecasting models I've tried are snaive, ets, auto.arima and TBATS.
I would like to present plots of the values/forecasts by day of the week rather than for all 365 days of the year, which is the only output I have managed to plot. I tried using frequency = 365 and 7, with start = c(2014, 1) and end = c(2018, 365), but I haven't had any luck.
I would really appreciate any advice and help I could get from anyone. Thank you!
Without looking at your data, have you tried starting with some basic ARIMA modeling and seeing what results you get from that? It's a fairly friendly way to get started with time series forecasting, depending on your data. I was forecasting by the hour, but the frequency can be adjusted to whatever interval you need to forecast at. As you mentioned, you are looking to change the frequency. Sometimes it's easier to see a pattern at larger time intervals, so you can aggregate your data to larger intervals.
For example, this converts daily observations to monthly totals.
library(xts)
# Daily date index; assumes beds has exactly one row per day in this range
dates <- seq(as.Date('2012-01-01'), as.Date('2019-03-31'), by='days')
beds$date.formatted <- dates
beds.xts <- xts(x=beds$neds.count, order.by=as.POSIXct(paste(beds$date.formatted)))
# Sum the daily counts within each calendar month
end.month <- endpoints(beds.xts, 'months')
beds.month <- period.apply(beds.xts, end.month, sum)
beds.monthly.df <- data.frame(date=index(beds.month), coredata(beds.month))
colnames(beds.monthly.df) <- c('Date','Sessions')
# Monthly ts built from the data frame created above
beds.monthly <- ts(beds.monthly.df$Sessions, start=c(2012,1), end=c(2019,3), frequency=12)
plot(beds.monthly)
I'm not sure if that answers your question, but since you mentioned you are self-taught and starting out, I can share a script with an example to help you get started. It goes through the whole process: checking that you have read your data in as a time series, what time series data is, how to check for non-stationary data and seasonality trends, plots that are useful for this, modeling, prediction, plotting actual vs predicted, accuracy, and further issues with the data that could be hindering your model. The video tutorial series is scripted in Python, but you can follow the end-to-end process of forecasting with ARIMA using the equivalent R script for this tutorial: https://code.datasciencedojo.com/rebeccam/tutorials/blob/master/Time%20Series/r_time_series_example.R
https://tutorials.datasciencedojo.com/time-series-python-reading-data/
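For the day-of-the-week view the question asks about, one option is to label each observation with its weekday and summarise per weekday. This is a minimal base-R sketch on synthetic data: the data frame beds and its column count are made-up stand-ins, not the asker's data, and the same grouping could be applied to forecast values.

```r
# Synthetic stand-in for five years of daily observations (hypothetical values)
set.seed(1)
dates <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "day")
beds <- data.frame(date = dates, count = rpois(length(dates), lambda = 40))

# ISO weekday number: 1 = Monday ... 7 = Sunday, independent of locale
beds$dow <- as.integer(format(beds$date, "%u"))

# Average demand per weekday
dow.mean <- tapply(beds$count, beds$dow, mean)

barplot(dow.mean,
        names.arg = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
        ylab = "Mean daily count", main = "Average demand by day of week")
```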

ARIMA Time Series Graph in R

I am just starting out learning R programming and have a question, if anyone can help me.
When I create a graph (i.e. plot()) with my date data on the x-axis and polling data on the y-axis, all works well.
But when I convert the polling data with arima and try to plot the converted data against my dates (which worked beforehand), the message
Error: unexpected symbol in "plot(x,y)"
appears, where it did not beforehand.
Here is the code I am using:
Politicalpollingdata <- arima(politicalparty, order=c(0,1,1))
Futurepoliticalforecast <- forecast(Politicalpollingdata, h=20)
plot(Datedata, Futurepoliticalforecast, main = "Political Party’s Polling Data",
ylab = "% of Votes", xlab = "Years / Months")
Does anyone know
A) why this occurs after I convert the polling data with the arima command?
B) whether there is a way to use time series data on the x-axis (preferably "y/m/d")?
Apologies if this is a simple fix, but I am new to R programming and I have spent hours and hours trying to find a solution with no luck!
Thanks in advance
Here's your problem:
A) The data was not converted by ARIMA; you fitted a model. When you ask it for a forecast, it returns an object of class forecast (a list), which includes a lot more information than just the fitted values. You can access the fitted values with Futurepoliticalforecast$fitted.
B) Yes, there are many. You have not provided your data, so I don't know what the problem is. However, x-axis labels behave very well when the input is of class Date; beyond that, I can't tell what the problem could be for your data.
Administrative notes about your question:
Please make sure you provide a reproducible example (data and all code; more here: https://stackoverflow.com/help/how-to-ask).
Also, please use the correct format for your code.
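As an illustration of both points, here is a minimal sketch that fits the model, extracts the forecast values and plots them against Date values. It runs on synthetic data and deliberately uses stats::predict() in place of forecast() so that it needs no extra packages; the variable names are made up. With the forecast package, the point forecasts would be in the $mean element of the returned object.

```r
# Synthetic monthly polling series (hypothetical values)
set.seed(123)
dates   <- seq(as.Date("2015-01-01"), by = "month", length.out = 60)
polling <- 30 + cumsum(rnorm(60))

# Fit the ARIMA(0,1,1) model; this is a model object, not transformed data
fit <- arima(polling, order = c(0, 1, 1))

# Forecast 20 steps ahead; pred$pred holds the point forecasts
pred   <- predict(fit, n.ahead = 20)
future <- seq(max(dates), by = "month", length.out = 21)[-1]

# Plot history and forecast on a Date x-axis
plot(dates, polling, type = "l",
     xlim = range(c(dates, future)),
     ylim = range(c(polling, as.numeric(pred$pred))),
     main = "Political Party's Polling Data",
     ylab = "% of Votes", xlab = "Years / Months")
lines(future, as.numeric(pred$pred), lty = 2)
```

Because the x values are of class Date, the axis labels are drawn as dates without further work.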

How to construct dataframe for time series data using ensemble learning methods

I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t which can all be calculated from the open, high, low, close and volume values from the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks, but when using ensemble methods like random forests (RF) and Boosting, I heard that it is necessary to re-arrange the data frame in some way, because ensemble methods draw repeated RANDOM samples from the training data, in which case the sequence of the Bitcoin time series will be ruined. So, is there a way to re-arrange the data frame in some way such that the time series will still be in chronological order every time repeated samples are drawn from the training data?
I was provided with an explanation of how to construct the data frame here and possibly here, too, but unfortunately I didn't really understand these explanations, because I didn't see a visual example of the to-be-constructed data frame and because I wasn't able to identify the relevant line of code. So, if someone could show me how to re-arrange the data frame using an example data frame, I would be very thankful. As an example data frame, you might consider using the built-in airquality data frame in R (I think it contains time series data), the data I provided above, or any other data frame you think is best.
Many thanks!
There is no problem with resampling for ML algorithms. To capture (auto)correlation, just add columns with lagged values of the time series. E.g. in the case of a univariate time series x[t], where t is time in minutes, you add columns x[t - 1], x[t - 2], ..., x[t - n] with lagged values. The more lags you add, the more history is taken into account when training the model. Each row then carries its own history, so the order in which rows are sampled no longer matters.
You can find a very basic working example here: Prediction using neural networks
More advanced stuff with Keras is here: Time series prediction using RNN
However, just for your information, here is a special message from Mr Chollet and Mr Allaire from the above-mentioned article ;):
NOTE: Markets and machine learning
Some readers are bound to want to take the techniques we’ve introduced
here and try them on the problem of forecasting the future price of
securities on the stock market (or currency exchange rates, and so
on). Markets have very different statistical characteristics than
natural phenomena such as weather patterns. Trying to use machine
learning to beat markets, when you only have access to publicly
available data, is a difficult endeavor, and you’re likely to waste
your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not
a good predictor of future returns – looking in the rear-view mirror
is a bad way to drive. Machine learning, on the other hand, is
applicable to datasets where the past is a good predictor of the
future.
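The lagged-column construction described in the answer can be sketched with embed() from base R. This is an illustration under assumptions rather than the answerer's code: it uses the Temp column of the built-in airquality data frame (suggested in the question), three lagged predictors and a target five steps ahead; the column names are made up.

```r
x <- airquality$Temp   # built-in daily series, no missing values in Temp
h <- 5                 # forecast horizon: predict x[t + h]
n <- 3                 # number of predictor columns: x[t], x[t-1], x[t-2]

# Each row of embed(x, d) is a window x[i+d-1], ..., x[i], newest first
m <- embed(x, h + n)

lagged <- data.frame(
  target = m[, 1],      # x[t + 5]
  lag0   = m[, h + 1],  # x[t]
  lag1   = m[, h + 2],  # x[t - 1]
  lag2   = m[, h + 3]   # x[t - 2]
)
head(lagged)
```

Any resampling method can draw rows from lagged at random, because every row already contains the past values it needs alongside its target.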

band filtering based on PSD, to filter out frequency domains in r, probably using "buttord" from signal

I'm still a novice in R. I have read quite a few posts and discussions on how to filter out frequency domains in a time series, but none of them quite matched my problem.
I would like to ask for your suggestions about the following:
I calculated the wavelet coherence for two annually measured time series and looked at the wavelet coherence PSD graph:
The purple line (i.e. the 8-year period) marks the border below which I would like to filter out the frequency domain, not in the PSD but in the original input data.
I thought about using the butter function from the signal package, but it was overcomplicated for my purposes.
So I approached the problem with the bwfilter function of the mFilter package to pass through the part of the data above the 8-year period, which corresponds to 2.37E-7 Hz.
name="dta OAK.resid Tair "
adat=read.table(file=paste(name,".csv", sep=""), sep=";", header=T)
dta=adat$ya
highpass <- bwfilter(dta, freq=8,drift=FALSE)
plot(highpass)
However, the results do not seem to be correct: the filter seems to remove too much from the data, and the trend is too closely aligned with the original time series.
Do you have any idea what may have gone wrong? The measurement unit maybe?
Any help is appreciated and if any additional details are needed I am happy to provide them!
Thank you!
The data can be found here
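On the measurement-unit question, a small arithmetic sketch may help. It assumes, as the question states, that the series is sampled once per year, and it uses a 365.25-day year for the conversion to Hz; the variable names are illustrative.

```r
period.years <- 8                # cut-off period from the coherence plot
fs           <- 1                # sampling rate: one observation per year

f.cut   <- 1 / period.years      # cut-off in cycles per year (0.125)
nyquist <- fs / 2                # highest resolvable frequency (0.5 cycles/year)

# Cut-off as a fraction of the Nyquist frequency, which is the unit
# signal::butter() expects for its W argument
w.norm <- f.cut / nyquist

# The same cut-off expressed in Hz, assuming a 365.25-day year
f.hz <- f.cut / (365.25 * 24 * 3600)
print(c(cycles.per.year = f.cut, nyquist.fraction = w.norm, hz = f.hz))
```

Under these assumptions the 8-year period comes out at roughly 3.96e-9 Hz, which differs from the figure quoted above, so the unit conversion may be worth rechecking before looking further at the filter settings.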

R Forecasting for highly seasonal revenue data

I have three years of daily revenue data. There is fairly constant growth per year, but the data is highly seasonal, with huge peaks in Q4 (Black Friday, the pre-Christmas frenzy, etc.) and intra-week seasonality (high revenue on Monday, less and less during the week, lowest on Saturday, starting to pick up again on Sunday).
Instead of using a boring spreadsheet with linear forecasting, I'd like an R script that takes three years' worth of daily data as input and applies an algorithm to forecast daily revenue for the next 6 months. I'd love for the input to be just a CSV file with dates and revenue numbers.
I heard ARIMA is good, but an economist friend of mine who has seen my data thinks that forecasting with Kalman filters would yield very good results.
Could someone post a script to show me how to apply either the ARIMA algorithm or the Kalman filter algorithm to forecast my data? Thanks!
While R certainly has tools that implement these analyses, they are power tools, and it would probably be best if you read up on them and how they work ... (Venables and Ripley's Modern Applied Statistics in S might be a reasonable starting point, although I don't know if it discusses Kalman filters). In the meantime:
??arima
??kalman
?arima
?KalmanLike
Or, having installed the sos package:
library("sos")
findFn("arima forecast")
findFn("kalman forecast")
Or just Google "kalman filter R" (!!) -- I did and found that the first 8 (!) hits looked highly useful (the 9th was an introduction to Kalman filters in MATLAB :-) )
Others may feel differently, but I will generally spend more effort helping someone work their way through an analysis when I can see that they have tried tackling it for themselves ...
This should be solved using regression. You would have 6 dummy variables for the day-of-the-week effects, 11 monthly dummy variables for the seasonality, and a dummy variable for each of the holidays.
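The regression approach from the last answer can be sketched with lm(), which creates the 6 weekday and 11 month dummy variables automatically when those columns are factors. This is only a sketch on synthetic data: the data frame sales, its columns and the linear trend term are assumptions for illustration, holiday dummies are left out, and real data would be loaded with something like read.csv() on the asker's file instead.

```r
# Synthetic three years of daily revenue (hypothetical values)
set.seed(42)
dates <- seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "day")
sales <- data.frame(date = dates)
sales$dow   <- factor(format(sales$date, "%u"))   # 1 = Monday ... 7 = Sunday
sales$month <- factor(format(sales$date, "%m"))   # "01" ... "12"
sales$trend <- seq_along(dates)
sales$revenue <- 1000 + 0.5 * sales$trend +
  ifelse(sales$month %in% c("11", "12"), 400, 0) -
  30 * as.integer(as.character(sales$dow)) +
  rnorm(nrow(sales), sd = 50)

# Factors expand into dummy variables: 6 for weekday, 11 for month
fit <- lm(revenue ~ trend + dow + month, data = sales)

# Build the next ~6 months with the same factor levels, then predict
future <- data.frame(date = seq(max(dates) + 1, by = "day", length.out = 182))
future$dow   <- factor(format(future$date, "%u"), levels = levels(sales$dow))
future$month <- factor(format(future$date, "%m"), levels = levels(sales$month))
future$trend <- nrow(sales) + seq_len(nrow(future))
future$forecast <- predict(fit, newdata = future)
head(future[, c("date", "forecast")])
```

A holiday effect could be added the same way, as an extra 0/1 column present in both the training data and the future data frame.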
