`ts()` has converted my daily data to a wrong daily series [closed] - r

This is the CCMP Index. R produces a wrong graph. It never took a dip. The graph in Excel shows it correctly. The data in CSV file has no problem, so what did I do wrong?
ccmp <- read.csv("/Users/jackconnors/Downloads/yeet.csv")
ccmp$time=as.Date(ccmp$date, format ="%m/%d/%Y")
ccmp=ccmp[order(ccmp$time), ]
### Find the Date Range
ccmp_min_date = min(ccmp$time)
ccmp_max_date=max(ccmp$time)
### TS variable
ccmp_ts=ts(ccmp$price ,start=c(2012, 7), end=c(2022, 7), frequency=365)
View(ccmp_ts)
plot(ccmp_ts, xlab="Year", ylab="Price", main="CCMP Price", lwd=2.5)

This can be diagnosed even without access to your data; knowing the actual length of your series, i.e., length(ccmp$price), is sufficient. The problem is the following call to ts():
ccmp_ts <- ts(ccmp$price, start = c(2012, 7), end = c(2022, 7), frequency = 365)
By specifying start = c(2012, 7) and frequency = 365, you tell ts() that there are 365 observations in each year and that the series starts from day 7 of 2012. By specifying end = c(2022, 7), you tell ts() that you want 365 * (2022 - 2012) + 1 = 3651 values out of your series ccmp$price. Check length(ccmp_ts) to verify this.
But what if ccmp$price has fewer values than this? Well, the data will be recycled. This is what happened to you. The figure clearly shows that data in 2019 ~ 2022 are identical to data in 2012 ~ 2015.
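The recycling is easy to reproduce on a toy series. This is a minimal sketch, assuming a made-up 10-value vector and a monthly frequency (neither comes from the question's data):

```r
# Toy data: 10 observations, but start/end/frequency together demand 13.
x <- 1:10
y <- ts(x, start = c(2012, 7), end = c(2013, 7), frequency = 12)

# Requested length is (2013 - 2012) * 12 + 1, which exceeds length(x).
n_requested <- length(y)

# The extra slots at the end are filled by restarting from the top of x.
recycled <- as.numeric(y)[11:13]

print(n_requested)
print(recycled)
```

Comparing length(y) with length(x) in this way is a quick check for silent recycling in real data.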
We usually do not specify both start and end when doing y <- ts(x, ...), because together with frequency they fix the length of the resulting series.
If y is shorter than x, then x will be truncated, which is fine;
If y is longer than x, x will be recycled, which causes problems.
If you omit either start or end, the omitted one is determined automatically from the other, frequency and length(x). All data in x are kept: no truncation or recycling. The resulting y holds exactly the values of x.
So, to make your code run without problems, you can drop either start = c(2012, 7) or end = c(2022, 7).
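As a small illustration of giving only start, here is a sketch with five hypothetical values (not the CCMP prices):

```r
# Hypothetical values standing in for a price column.
x <- c(5, 3, 8, 6, 7)

# Only start is given; ts() derives the end from frequency and length(x).
y <- ts(x, start = c(2012, 7), frequency = 365)

same_length <- length(y) == length(x)   # no truncation, no recycling
derived_end <- end(y)                   # computed by ts(), not supplied

print(same_length)
print(derived_end)
```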
But working code is not everything. Believe it or not, although you can pass any positive value as frequency, only 1 (evenly spaced series), 4 (quarterly series) and 12 (monthly series) have a natural interpretation. When you pass other values, you need to make sure they represent a sensible period. Here, 365 is not a good choice for day of year, because leap years have 366 days.
I can only imagine two situations where using ts() for daily time series is reasonable.
Daily series grouped by week, i.e., frequency = 7. So time can be interpreted as Monday, Tuesday, ..., Sunday.
Daily series with no grouping, i.e., frequency = 1. So time is simply interpreted as day 1, day 2, etc.
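The two cases above can be sketched as follows, using three weeks of simulated daily values (the numbers are made up for illustration):

```r
set.seed(1)
daily <- round(rnorm(21, mean = 100, sd = 5), 1)  # 21 made-up daily values

# Grouped by week: the "season" is the position of the day within its week.
weekly_ts <- ts(daily, frequency = 7)
day_in_week <- cycle(weekly_ts)   # 1, 2, ..., 7, 1, 2, ...

# No grouping: time is simply day 1, day 2, ...
plain_ts <- ts(daily, frequency = 1)
day_index <- time(plain_ts)

print(day_in_week)
print(day_index)
```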
If you want to identify daily series with full time information, like year, month, etc, you have to use package zoo or xts to create a "zoo" object or "xts" object.
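A minimal sketch of the zoo route, assuming the zoo package is installed and using hypothetical weekday dates and prices rather than the real CCMP data:

```r
library(zoo)

# Hypothetical calendar: ten consecutive days with weekends removed.
dates <- as.Date("2012-07-02") + 0:9
dates <- dates[!format(dates, "%u") %in% c("6", "7")]  # %u: 1 = Monday ... 7 = Sunday

# Made-up prices, one per remaining date.
price <- seq(100, by = 0.5, length.out = length(dates))

# A zoo object keeps the actual dates as its index, gaps included.
z <- zoo(price, order.by = dates)
second_week <- window(z, start = as.Date("2012-07-09"))

print(z)
print(second_week)
```

Because the index is a real Date vector, plot(z) puts calendar dates on the x axis without any frequency argument.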

Related

Is it possible to use difftime function in R with years ONLY (i.e. no DD/MM)?

For example:
Year A    Year B
1990      2021
1980      2021
Thanks in advance.
It depends on what you expect.
If you only want to use the difftime() function with date objects built from years alone (as below), it will work (the day and month are set to today's for the calculation).
> a = as.Date("2021", format = "%Y")
> b = as.Date("2010", format = "%Y")
> difftime(a,b)
Time difference of 4018 days
But if you want to get the difference in years, that is not possible, as the function documentation clearly states that the unit of the return value must be one of: units = c("auto", "secs", "mins", "hours", "days", "weeks")
You might find better way to handle date/time data with the lubridate package.
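One way lubridate can express a difference in years is sketched below. It assumes the lubridate package is installed and, as an illustrative choice not made in the question, pins each year to 1 January:

```r
library(lubridate)

# Assumption: each bare year is represented by its 1 January.
a <- ymd("2021-01-01")
b <- ymd("2010-01-01")

# interval() spans b to a; time_length() measures that span in years.
years_between <- time_length(interval(b, a), unit = "years")

print(years_between)
```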
difftime requires date objects; I tried reproducing this using only years but was unable to.
Why not simply use the absolute value (abs()) of the difference, if you're only interested in the year difference?
Here is an example that adds the difference as a new column:
Year_A <- c(1990, 1980)
Year_B <- c(2021, 2021)
df <- data.frame(Year_A, Year_B)
df$diff <- abs(Year_A - Year_B)
P.S. I noticed the answer above was added while I was writing mine, and I can't comment on it due to low reputation. I see you can't use "years" as a unit there, the largest being weeks, but you can convert from days or weeks to years if that's what you're after.
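The conversion from days to years can be sketched in base R. Dividing by 365.25 is only an approximation of the average year length, and the 1 January dates are an assumption made for the example:

```r
# Assumption: each bare year is represented by its 1 January.
a <- as.Date("2021-01-01")
b <- as.Date("2010-01-01")

# difftime() can report days; years are not an available unit.
d <- difftime(a, b, units = "days")

# Approximate conversion: 365.25 is an average year length, not exact.
approx_years <- as.numeric(d) / 365.25

print(d)
print(round(approx_years, 2))
```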

Define different timeseries for different columns

I have a dataframe where some of the columns are starting later than the other. Please find a reproducible example.
set.seed(354)
df <- data.frame(Product_Id = rep(1:100, each = 50),
                 Date = seq(from = as.Date("2014/1/1"),
                            to = as.Date("2018/2/1"),
                            by = "month"),
                 Sales = rnorm(100, mean = 50, sd = 20))
df <- df[-c(251:256, 301:312, 2551:2562, 2651:2662, 2751:2762), ]
library(zoo)
z <- read.zoo(df, index = "Date", split = "Product_Id", FUN = as.yearmon)
tt <- as.ts(z)
Now for this dataframe for the columns 6,7,52,54 and 56 I want to define them as timeseries starting from a different date as compared to the rest of the dataframe. Supposedly the data begins from Jan 2000, column 6 will begin from July 2000, column 7 from Jan 2001 and so on. How should I proceed to do this?
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a seperate dataframe and do the forecasting. Or can I convert each column to a different timeseries object that starts from the first non NA value?
"Now for this dataframe for the columns 6,7,52,54 and 56 I want to define them as timeseries starting from a different date as compared to the rest of the dataframe. Supposedly the data begins from Jan 2000, column 6 will begin from July 2000, column 7 from Jan 2001 and so on. How should I proceed to do this?"
There is, AFAIK, no way to do this in R in a time series matrix. And if each column started at a different date, then (since each column has the same number of entries), each column would also need to end at a different date. Is this really what you need? A collection of time series that all happen to be of the same length (so they can fit into a matrix), but that start and end with offsets? I struggle to understand where something like this would be useful, outside a kind of forecasting competition.
If you really need this, then I would recommend you put your time series into a list structure. Then each one can start and end at any date, and they can be the same or different lengths. Take inspiration from Mcomp::M3.
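A minimal sketch of such a list structure, using a toy two-column monthly matrix in place of the real tt. The helper name trim_leading_na is hypothetical, and each series is made to start at its first non-NA value:

```r
# Toy monthly matrix: series "b" has six leading NAs, so it starts later.
a <- ts(1:24, start = c(2000, 1), frequency = 12)
b <- ts(c(rep(NA, 6), 1:18), start = c(2000, 1), frequency = 12)
tt_demo <- cbind(a, b)

# Hypothetical helper: drop everything before the first non-NA value.
trim_leading_na <- function(y) {
  first <- which(!is.na(y))[1]
  window(y, start = time(y)[first])
}

# One ts object per column; each keeps its own start date and length.
series_list <- lapply(seq_len(ncol(tt_demo)),
                      function(j) trim_leading_na(tt_demo[, j]))
names(series_list) <- colnames(tt_demo)

print(start(series_list$a))
print(start(series_list$b))
```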
"Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a seperate dataframe and do the forecasting. Or can I convert each column to a different timeseries object that starts from the first non NA value?"
Since your tt is already a time series object, the simplest way would be simply to iterate over its columns:
library(forecast)
fcst <- matrix(nrow = 10, ncol = ncol(tt))
for (ii in seq_len(ncol(tt))) fcst[, ii] <- forecast(ets(tt[, ii]), h = 10)$mean
Note that most modeling functions in forecast will throw a warning and do something reasonable on encountering NA values. Here, e.g.:
1: In ets(tt[, ii]) :
Missing values encountered. Using longest contiguous portion of time series
Of course, you could do something yourself inside the loop, e.g., search for the last NA and start the time series for modeling right after that (but make sure you fail gracefully if the last entry is NA).
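The "start right after the last NA" idea can be sketched in base R. The helper name after_last_na is hypothetical, and returning NULL is just one possible way to fail gracefully:

```r
# Hypothetical helper: keep only the part of a ts after its last NA.
after_last_na <- function(y) {
  na_pos <- which(is.na(y))
  if (length(na_pos) == 0) return(y)        # nothing to trim
  last_na <- max(na_pos)
  if (last_na == length(y)) return(NULL)    # last entry is NA: nothing usable
  window(y, start = time(y)[last_na + 1])
}

# Toy monthly series with a gap in the second month.
y_ok <- ts(c(1, NA, 3, 4, 5), start = c(2000, 1), frequency = 12)
y_bad <- ts(c(1, 2, NA), start = c(2000, 1), frequency = 12)

trimmed <- after_last_na(y_ok)
failed <- after_last_na(y_bad)

print(trimmed)
print(is.null(failed))
```

Inside the forecasting loop, a NULL result could be used to skip the column instead of fitting a model.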

creating interval object in r using lubridate package [duplicate]

Hi, I have data from Uber about pickups in NYC.
I'm trying to add a column to the raw data that indicates, for each row, which time interval it belongs to (each interval is represented by the single timepoint at its beginning).
I want to:
Create a vector containing all relevant timepoints (i.e. every 15 minutes).
Use the int_diff function from the lubridate package on this vector to create an interval object.
Run a loop over all the time points in the raw data and, for each data point, indicate which interval (represented by the single timepoint at its beginning) it belongs to.
I tried looking for explanations of how to use the int_diff function, but I don't understand how my vector should look or how the syntax of int_diff works.
Thanks for the help :)
Is this what you have in mind?
library(lubridate)
start <- mdy_hm('4/11/2014 0:00') # start of the period
end <- mdy_hm('5/12/2015 0:00') # end of the period
time_seq <- seq(from = start, to = end, by = '15 mins') # sequence of 15-minute timepoints
times <- mdy_hm(c('4/11/2014 0:12', '4/11/2014 1:24')) # times to find intervals for
dat <- data.frame(times)
dat$intervals <- cut(times, breaks = time_seq) # assign each time to an interval
intervals_cols <- model.matrix(~ -1 + intervals, dat) # one indicator column per interval; a 1 marks the interval the observation falls into
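Since the question asks about int_diff specifically, here is a hedged sketch of how it could fit in. It assumes lubridate is installed, uses a short two-hour window instead of the full period, and replaces the loop with a vectorised lookup via base R's findInterval():

```r
library(lubridate)

# Two hours of 15-minute timepoints (a shortened, illustrative window).
time_seq <- seq(ymd_hm("2014-04-11 00:00"), ymd_hm("2014-04-11 02:00"),
                by = "15 mins")

# int_diff() turns n timepoints into n - 1 consecutive intervals.
ints <- int_diff(time_seq)

# Made-up pickup times to classify.
times <- ymd_hm(c("2014-04-11 00:12", "2014-04-11 01:24"))

# Index of the interval each time falls into, then that interval's start.
idx <- findInterval(as.numeric(times), as.numeric(time_seq))
interval_start <- int_start(ints)[idx]

print(interval_start)
```

For fixed-width bins, floor_date(times, "15 minutes") yields the same interval starts without building the interval object at all.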

Generate a sequence of time using R and lubridate

Is there an efficient way to generate a time-sequence vector with tidyverse and lubridate? I know the two can work with seq() when one uses a number of days as the interval. For example, with the input:
seq(today(), today()+dyears(1), 60)
one can get a series of dates with a 60-day interval
"2017-02-14" "2017-04-15" "2017-06-14" "2017-08-13" "2017-10-12" "2017-12-11" "2018-02-09"
However, is there any way that this can work for weeks, months and years as well? Perhaps something similar to the code below, which I thought would work but did not:
seq(as_date(2000-01-01), as_date(2017-01-01), dyears(1))
Error: Incompatible duration classes (Duration, numeric). Please coerce with as.duration.
I know it is possible to change dyears(1) into 365 or 30 if one only needs an approximation for a year or a month, but I was wondering whether there are smarter ways to take leap years and varying month lengths into account.
To provide more context, I would like to generate a date vector so I can customize the scale_x_date in ggplot. Instead of letting the waiver() display 2000, 2003, 2006, 2009, I want the plot to have all individual years or even every three-month period if possible.
You could try using seq.Date():
seq.Date(from=as.Date("2000-01-01"), to=as.Date("2010-01-01"), by="month")
or:
seq(as.Date("2000/1/1"), by = "month", length.out = 12)
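The same by argument also accepts years and multiples of a unit, which covers the yearly and three-month cases from the question. A short base R sketch with illustrative date ranges:

```r
# One date per calendar year; leap years are handled by seq() itself.
yearly <- seq(as.Date("2000-01-01"), as.Date("2017-01-01"), by = "year")

# One date every three months, written as a multiple of "month".
quarterly <- seq(as.Date("2000-01-01"), as.Date("2001-01-01"), by = "3 months")

print(yearly)
print(quarterly)
```

Either vector could then be supplied as the breaks argument of scale_x_date() in ggplot2.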

What does the ts function do in R

I have downloaded the historical prices between Jan-1-2010 and Dec-31-2014 for Twitter, Inc. (TWTR) -NYSE from YAHOO! FINANCE in a twitter.csv file.
I then loaded it into RStudio using:
x = read.csv("Z:/path/to/file/twitter.csv", header=T,stringsAsFactors=F)
Here is what table x looks like:
View(x)
Then I used ts function to get the time series of Adj.Close:
x.ts = ts(x$Adj.Close, frequency = 12, start=c(2010,1), end=c(2014,12))
x.ts
How were the previous results obtained? They are really different from the data in table x. Do they need any adjustments?
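Your problem is the scale at which the data are read. With frequency = 12, start=c(2010,1), end=c(2014,12) you are telling the function that you have one number per month. If you have one number per day, as in your case, you should try:
x.ts = ts(x$Adj.Close, frequency = 365, start=c(2010,1), end=c(2014,365))
Note that this call asks for 5 * 365 = 1825 values; if x$Adj.Close has fewer (for example, because only trading days are present), ts() recycles the data to fill the span, and dropping end avoids that.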
Firstly, frequency should be set to 365 if you deal with daily data, 12 if monthly, etc.
Secondly, I think you need to sort the data in ascending chronological order before using the ts() function.
The function blindly follows exactly what you tell it. For example, the data in the table start with the value 35.87 on 2014-12-31, but the start date in the code is January 2010, so that value will be attributed to Jan-2010.
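This behaviour can be seen on a tiny example. The data frame below is made up for illustration (three rows in descending date order, with placeholder prices), and the monthly frequency is chosen only to keep the labels readable:

```r
# Made-up data in descending date order, as in a freshly downloaded file.
x_demo <- data.frame(
  date = as.Date(c("2014-12-31", "2014-12-30", "2014-12-29")),
  Adj.Close = c(3, 2, 1)   # placeholder values, not real prices
)

# ts() never looks at the date column: the first value gets the start label.
unsorted <- ts(x_demo$Adj.Close, start = c(2010, 1), frequency = 12)

# Sorting chronologically first puts the oldest value at the start.
sorted <- ts(x_demo$Adj.Close[order(x_demo$date)],
             start = c(2010, 1), frequency = 12)

print(unsorted)
print(sorted)
```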
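library(dplyr) # provides %>%, mutate() and arrange()
x <- x %>%
  mutate(date = as.Date(date)) %>% # assumes a `date` column in a format as.Date() can parse
  arrange(date)
first_day <- as.POSIXlt(min(x$date)) # earliest date, used to build start = c(year, day of year)
ts.x <- ts(x$Adj.Close, frequency = 365, start = c(first_day$year + 1900, first_day$yday + 1))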
