R - Analysis of time series with semi-annual data?

I have a time series with semi-annual (half-yearly) data points.
It seems that the ts() function can't handle this: "frequency = 2" returns a very strange time series object that extends far beyond the actual time period.
Is there any way to do time series analysis of this kind of time series object in R?
EDIT: Here's an example:
dat <- seq(1, 17, by = 1)
> semi <- ts(dat, start = c(2008,12), frequency = 2)
> semi
Time Series:
Start = c(2013, 2)
End = c(2021, 2)
Frequency = 2
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
I was expecting:
> semi
     s1 s2
2008     1
2009  2  3
2010  4  5
2011  6  7
2012  8  9
2013 10 11
2014 12 13
2015 14 15
2016 16 17

First, let me explain why the series starts at 2013 instead of 2008. The start and end arguments are given as c(year, period), where a period is 1/frequency of a year. start = c(2008, 12) therefore asks for the 12th period counting from the first period of 2008, i.e. 2008 + 11/2 = 2013.5, which is the second period of 2013 when the frequency is 2.
This should work for the period:
semi <- ts(dat, start = c(2008,2), frequency = 2)
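semi is now the correct time series. However, print() only lays the values out in a labelled table for quarterly and monthly series (frequency 4 or 12), so with a frequency of 2 you get a plain vector rather than half-year columns. If you plot the time series, the correct half-yearly graph is shown.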
plot.ts(semi)
In this question someone explained the standard frequencies that ts() knows about.
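If you want the labelled layout from the question, one option is to build it by hand. This is only a sketch in base R: the object name semi_table and the column labels s1 and s2 are made up for illustration, and the padding assumes the series should be aligned to whole years.

```r
dat <- 1:17
semi <- ts(dat, start = c(2008, 2), frequency = 2)

# start() and end() report c(year, period)
start(semi)
end(semi)

# Pad with NA so the values fill whole years, then lay them out
# with one row per year and one column per half-year.
lead  <- start(semi)[2] - 1               # empty periods before the first value
trail <- frequency(semi) - end(semi)[2]   # empty periods after the last value
padded <- c(rep(NA, lead), as.vector(semi), rep(NA, trail))

semi_table <- matrix(
  padded,
  ncol = frequency(semi), byrow = TRUE,
  dimnames = list(start(semi)[1]:end(semi)[1], c("s1", "s2"))
)
semi_table
```

The underlying ts object is unchanged; the matrix is only for display.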

Related

How to query NOAA for historical daily temperature averages using rnoaa?

I'm trying to find the historical average temperature between a range of dates using NOAA data and comparing to the long term average temperatures.
I'm using the rnoaa package and have hit a bit of a snag. For long term averages, I have been successful using the following syntax:
library('rnoaa')
start_date = "2010-01-15"
end_date = "2010-11-14"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_DLY', stationid=paste0('GHCND:',station_id),
datatypeid='dly-tavg-normal',
startdate = start_date, enddate = end_date,limit=365)
This lets me parse weather_data$data for the long term average temperatures for that given station between January 15th and November 14th.
However, I can't seem to find the right dataset or datatype for historical average temperatures. I'd like to get the same data as the code above except with the actual daily average temperatures for those days. Any idea how to query this? I've been at it for a few hours and have had no luck.
Something I tried was the following:
weather_data <- ncdc(datasetid='GHCND', stationid=paste0('GHCND:',station_id),
startdate = start_date, enddate = end_date,limit=365)
uniq_d_types = unique(weather_data$data$datatype)
View(uniq_d_types)
This let me see the unique data types in the GHCND dataset but none of the data types seemed to be daily average temperatures. Any thoughts?
In order to obtain average daily actual temperatures from the NOAA data using the rnoaa package, one must use the hourly data and aggregate it by day. Hourly NOAA data is in the NORMAL_HLY data set, and the required data type is HLY-TEMP-NORMAL.
library('rnoaa')
library(lubridate)
options(noaakey = "obtain key from NOAA website")
start_date = "2010-01-15"
end_date = "2010-01-31"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_HLY', stationid=paste0('GHCND:',station_id),
datatypeid = "HLY-TEMP-NORMAL",
startdate = start_date, enddate = end_date,limit=500)
data <- weather_data$data
data$year <- year(data$date)
data$month <- month(data$date)
data$day <- day(data$date)
# summarize to average daily temps
aggregate(value ~ year + month + day,mean,data = data)
...and the output:
> aggregate(value ~ year + month + day,mean,data = data)
year month day value
1 2010 1 15 323.5417
2 2010 1 16 322.8750
3 2010 1 17 323.4167
4 2010 1 18 323.7500
5 2010 1 19 323.2083
6 2010 1 20 321.0833
7 2010 1 21 318.4167
8 2010 1 22 317.6667
9 2010 1 23 319.0000
10 2010 1 24 321.0833
11 2010 1 25 323.5417
12 2010 1 26 326.0833
13 2010 1 27 328.4167
14 2010 1 28 330.9583
15 2010 1 29 333.2917
16 2010 1 30 335.7917
17 2010 1 31 308.0000
>
Note that temperatures are stored in tenths of degrees in this data set, so for the period between January 15th and 31st 2010, the average daily temperatures at the Dulles International Airport weather station were between 30.8 degrees and 33.6 degrees.
Also note that to calculate the average by station across multiple weather stations, simply add station to the formula in the aggregate() call.
> # summarize to average daily temps by station
> aggregate(value ~ station + year + month + day,mean,data = data)
station year month day value
1 GHCND:USW00093738 2010 1 15 323.5417
2 GHCND:USW00093738 2010 1 16 322.8750
3 GHCND:USW00093738 2010 1 17 323.4167
4 GHCND:USW00093738 2010 1 18 323.7500
5 GHCND:USW00093738 2010 1 19 323.2083
6 GHCND:USW00093738 2010 1 20 321.0833
7 GHCND:USW00093738 2010 1 21 318.4167
8 GHCND:USW00093738 2010 1 22 317.6667
9 GHCND:USW00093738 2010 1 23 319.0000
10 GHCND:USW00093738 2010 1 24 321.0833
11 GHCND:USW00093738 2010 1 25 323.5417
12 GHCND:USW00093738 2010 1 26 326.0833
13 GHCND:USW00093738 2010 1 27 328.4167
14 GHCND:USW00093738 2010 1 28 330.9583
15 GHCND:USW00093738 2010 1 29 333.2917
16 GHCND:USW00093738 2010 1 30 335.7917
17 GHCND:USW00093738 2010 1 31 308.0000
>
The answer is to grab historical weather data (meaning the actual values on the day specified, not long-term averages) from NOAA's ISD database. USAF and WBAN values can be found by looking through the isd-history.csv file found here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Here's an example query.
out <- isd(usaf='724030', wban = '93738', year=2018)
This will grab a year's worth of roughly hourly weather data from ISD. You can then parse and process this data however you see fit (e.g. to get daily average temperatures, as I did).
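The daily-averaging step could look something like the following base R sketch. It uses synthetic hourly data: the data frame hourly and its columns time and temperature are hypothetical and are not the actual layout returned by isd(), so the column names would need to be adapted to the real output.

```r
# Synthetic hourly records for two days (hypothetical column names)
hourly <- data.frame(
  time = seq(as.POSIXct("2018-01-01 00:00", tz = "UTC"),
             by = "hour", length.out = 48),
  temperature = c(rep(10, 24), rep(20, 24))
)

# Derive the calendar day of each reading, then average within each day
hourly$date <- as.Date(hourly$time, tz = "UTC")
daily <- aggregate(temperature ~ date, data = hourly, FUN = mean)
daily
```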

How to get maximum high or minimum low for each week in a daily timeseries data then make it a weekly timeseries?

I have daily OHLC data for US stocks. I would like to derive a weekly time series from it and compute the SMA and EMA. To do that, I first need to create one weekly time series from the maximum high per week and another from the minimum low per week. After that, I would compute their SMA and EMA and assign the values to every day of the following week (one period forward). So, first problem first: how do I get the weekly series from the daily one using R (any package)? Better still, can you show me an algorithm for it in any language, preferably Golang? I can rewrite it in Golang if needed.
Date         High  Low  Week(High)  Week(Low)  WkSMAHigh 2DP  WkSMALow 2DP
                                               (one period forward)
Dec 24 Fri   6     3    8           3          5.5            1.5
Dec 23 Thu   7     5                           5.5            1.5
Dec 22 Wed   8     5                           5.5            1.5
Dec 21 Tue   4     4                           5.5            1.5
Assume Holiday (Dec 20)
Dec 17 Fri   4     3    6           2          None
Dec 16 Thu   4     3
Dec 15 Wed   5     2
Dec 14 Tue   6     4
Dec 13 Mon   6     4
Dec 10 Fri   5     1    5           1          None
Dec 9 Thu    4     3
Dec 8 Wed    3     2
Assume Holiday (Dec 6 & 7)
I'd start by generating a column which specifies which week it is.
You could use the lubridate package to do this, that would require converting your dates into Date types. It has a function called week which returns the number of full 7 day periods that have passed since Jan 1st + 1. However I don't know if this data goes over several years or not. Plus I think there's a simpler way to do this.
The example I'll give below will simply do it by creating a column which just repeats an integer 7 times up to the length of your data frame.
Pretend your data frame is called ohlcData.
# Create a week index that repeats each integer 7 times, up to the end of the data frame.
# The sequence is truncated to nrow(ohlcData) so that a final partial week
# doesn't make the vectors uneven lengths
ohlcData$Week <- rep(seq_len(ceiling(nrow(ohlcData) / 7)), each = 7)[seq_len(nrow(ohlcData))]
With that created we can then go ahead and use the plyr package, which has a really useful function called ddply. This function applies a function to columns of data grouped by another column of data. In this case we will apply the max and min functions to your data based on its grouping by our new column Week.
library(plyr)
weekMax <- ddply(ohlcData[,c("Week", "High")], "Week", numcolwise(max))
weekMin <- ddply(ohlcData[,c("Week", "Low")], "Week", numcolwise(min))
That will then give you the min and max of each week. The dataframe returned for both weekMax and weekMin will have 2 columns, Week and the value. Combine these however you see fit. Perhaps weekExtreme <- cbind(weekMax, weekMin[,2]). If you want to be able to marry up date ranges to the week numbers it will just be every 7th date starting with whatever your first date was.
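Because trading data skips weekends and holidays, grouping by calendar week may be more robust than counting rows. The following is a sketch in base R under some assumptions: the dates are taken to be in December 2021 (the question gives no year, but Dec 24 falls on a Friday in 2021), the week key uses ISO week labels, and the names ohlc, weekly and sma2 are made up for illustration.

```r
# Sample data from the question, oldest first (the year 2021 is an assumption)
ohlc <- data.frame(
  Date = as.Date(c("2021-12-08", "2021-12-09", "2021-12-10",
                   "2021-12-13", "2021-12-14", "2021-12-15",
                   "2021-12-16", "2021-12-17",
                   "2021-12-21", "2021-12-22", "2021-12-23", "2021-12-24")),
  High = c(3, 4, 5, 6, 6, 5, 4, 4, 4, 8, 7, 6),
  Low  = c(2, 3, 1, 4, 4, 2, 3, 3, 4, 5, 5, 3)
)

# ISO year and week number, so holidays and short weeks group correctly
ohlc$Week <- format(ohlc$Date, "%G-W%V")

# Weekly maximum high and minimum low
weekly <- merge(
  aggregate(High ~ Week, data = ohlc, FUN = max),
  aggregate(Low ~ Week, data = ohlc, FUN = min),
  by = "Week"
)

# 2-period simple moving average, then shifted one week forward
sma2 <- function(x) as.numeric(stats::filter(x, rep(1 / 2, 2), sides = 1))
weekly$WkSMAHigh <- c(NA, head(sma2(weekly$High), -1))
weekly$WkSMALow  <- c(NA, head(sma2(weekly$Low), -1))

# Attach the weekly values to every day of the matching week
daily <- merge(ohlc, weekly[, c("Week", "WkSMAHigh", "WkSMALow")], by = "Week")
weekly
```

With this sample the weekly highs are 5, 6 and 8, the weekly lows are 1, 2 and 3, and the last week receives the shifted averages 5.5 and 1.5, which agrees with the table in the question.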

R: How to plot multiple series when the series is included as a variable?

I want to plot multiple lines to one graph of five different time series. The problem is that my data frame is arranged like so:
Series Time Price ...
1 Dec 2003 5
2 Dec 2003 10
3 Dec 2003 2
1 Jan 2004 10
2 Jan 2004 10
3 Jan 2004 5
This is a simplified version, and there are many other variables for each observation. I'd like to be able to plot time vs price and use the first variable as the indicator for which series.
The time period is 77 months long, so I'm not sure if there's an easy way to reshape the data to look like:
Series Dec.2003.Price Jan.2004.Price ...
1 5 10
2 10 10
3 2 5
or a way to graph these like I said without reshaping.
You can try
library(lattice)
xyplot(Price ~ Time, groups=Series, data=df, type="l")
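If you would rather reshape first, here is a rough base R sketch. It rebuilds the sample from the question as df and assumes the Time column holds text labels such as "Dec 2003" with English month abbreviations; the helper names parts, mon, yr and wide are made up for illustration.

```r
# Sample data from the question in long format
df <- data.frame(
  Series = rep(1:3, times = 2),
  Time   = rep(c("Dec 2003", "Jan 2004"), each = 3),
  Price  = c(5, 10, 2, 10, 10, 5),
  stringsAsFactors = FALSE
)

# Turn "Mon YYYY" labels into real dates (first day of the month)
parts <- strsplit(df$Time, " ")
mon <- match(sapply(parts, `[`, 1), month.abb)
yr  <- as.integer(sapply(parts, `[`, 2))
df$Date <- as.Date(sprintf("%d-%02d-01", yr, mon))

# Long to wide: one row per date, one column per series
wide <- tapply(df$Price, list(df$Date, df$Series), mean)

# One line per series on a shared time axis
dates <- as.Date(rownames(wide))
matplot(dates, wide, type = "l", lty = 1, xaxt = "n",
        xlab = "Time", ylab = "Price")
axis.Date(1, at = dates, format = "%b %Y")
legend("topleft", legend = colnames(wide), col = seq_len(ncol(wide)), lty = 1)
```

The wide matrix has dates in rows rather than columns, which is the orientation matplot() expects; t(wide) gives the layout sketched in the question.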

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong, as t increments in different amounts between the dates:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
Which to me is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), by = "day"))
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to regular (continuous) time series is a good idea.
You can use xts to transform time series data (it is handy, because it can be used in other packages as regular ts)
Filling the gaps
library(xts)
# convert myDate to Date or POSIXct if necessary
# create xts from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), order.by = x$myDate)
ts1
# create an empty (zero-column) xts covering every day in the range
ts_empty <- xts(, seq(from = start(ts1), to = end(ts1), by = "DSTday"))
# merge the empty ts to the data and fill the gap with 0
ts2 <- merge(ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge(ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo-xts ready functions are:
# na.locf - constant previous value
# na.approx - linear approximation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on your newer question they are very likely. I think you want to aggregate the values for each day with a sum:
# colSums keeps the amount and count columns separate
ts1 <- period.apply(ts1, endpoints(ts1, 'days'), colSums)
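The same idea can be sketched in base R without xts. This is only an illustration built on the six sample rows from the question: the data frame d, the origin date 2000-07-01 and the choice to treat missing days as zero payments are assumptions to adjust to the real data.

```r
# Sample rows from the question
d <- data.frame(
  myDate = as.Date(c("2000-07-06", "2000-07-07", "2000-07-11",
                     "2000-07-12", "2000-07-16", "2000-07-17")),
  Amount = c(792078.6, 140065.5, 190553.2, 119208.6, 1068156.3, 0.0),
  Count  = 9
)

# Option 1: keep the gaps, but measure the trend in elapsed days
origin <- as.Date("2000-07-01")
d$t <- as.numeric(d$myDate - origin) + 1

# Option 2: merge onto a continuous daily index and fill missing days with 0
full <- data.frame(myDate = seq(origin, max(d$myDate), by = "day"))
full <- merge(full, d[, c("myDate", "Amount", "Count")],
              by = "myDate", all.x = TRUE)
full$Amount[is.na(full$Amount)] <- 0
full$t <- seq_len(nrow(full))

head(full, 8)
```

Option 1 may already be enough for lm(Amount ~ t + Count, ...), since the trend variable then advances by the true number of days between payments. Option 2 is the one to use if days without a payment should enter the model as zeros.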

insert NA into a time series object in r

I want to sum the months for all the years in a time series that looks like
Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec
2006 4 4 3 4 4 5 5 3 3
2007 3 3 2 2 4 3 3 2 2 5 5
2008 3 3 3 2 2 4 4 3
by using
window(the time series object,start=c(2006,3),end=c(2008,3),frequency=1)
This line gives you a new ts object with just the March values of 2006-2008. However, this does not work when a month has no values in it. Is there any way to replace the gaps with NA? I have seen questions like this before, but I don't think they answer it for a ts object.
Assuming that
the_time_series_object <- ts(1:31, frequency = 12, start = c(2006, 3))
then:
window(the_time_series_object, start = c(2006,3), end = c(2008,3), frequency = 12)
Your frequency should be 12 instead of 1. There's no NA problem; it's just that one argument that you have wrong.
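If the goal is to pad the series with NA and then total each calendar month across years, something along these lines may help. It is a sketch in base R using the example series above; the names padded, marches and month_sums are made up for illustration.

```r
the_time_series_object <- ts(1:31, frequency = 12, start = c(2006, 3))

# Extend the series to whole calendar years; missing months become NA
padded <- window(the_time_series_object,
                 start = c(2006, 1), end = c(2008, 12), extend = TRUE)

# cycle() gives the month position (1 to 12) of every observation,
# so this picks out all the March values
marches <- padded[cycle(padded) == 3]

# Sum each calendar month across years, ignoring the NA padding
month_sums <- tapply(as.vector(padded), as.vector(cycle(padded)),
                     sum, na.rm = TRUE)
names(month_sums) <- month.abb
month_sums
```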

Resources