R: How to plot multiple series when the series is included as a variable?

I want to plot five different time series as multiple lines on one graph. The problem is that my data frame is arranged like so:
Series  Time      Price  ...
1       Dec 2003  5
2       Dec 2003  10
3       Dec 2003  2
1       Jan 2004  10
2       Jan 2004  10
3       Jan 2004  5
This is a simplified version; there are many other variables for each observation. I'd like to plot Time vs. Price and use the first variable to indicate which series each observation belongs to.
The time period is 77 months long, so I'm not sure if there's an easy way to reshape the data to look like:
Series  Dec.2003.Price  Jan.2004.Price  ...
1       5               10
2       10              10
3       2               5
or a way to graph these like I said without reshaping.

You can try xyplot() from the lattice package:
library(lattice)
xyplot(Price ~ Time, groups = Series, data = df, type = "l")
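If you prefer ggplot2, a minimal equivalent sketch (assuming Time is, or has been converted to, a proper Date so the x-axis sorts chronologically):
library(ggplot2)
# mapping Series to colour draws one line per series
ggplot(df, aes(x = Time, y = Price, colour = factor(Series))) +
  geom_line()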

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column <- c("A", "A", "A", "A", "B", "B", "B", "B")
Date_column <- c(20040131, 20041231, 20051231, 20061231, 20051231, 20061231, 20071231, 20081231)
Price_column <- c(12, 13, 17, 19, 35, 38, 39, 41)
Data <- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim would be to convert the Date column, which is currently in numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry, as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to use the plm package for panel-data regression, but when applying pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be tagged A, 2004). To solve this, I would like to delete row 1 from the original data and keep only the newer observation from 2004. That would then provide me with unique ID/Time pairs across the whole data set.
I was therefore hoping someone could help me out with a loop or a package suggestion that keeps only the row with the newer/later observation within a year whenever this occurs, and that also works on larger data sets. I believe a loop could check whether the first four digits of consecutive date observations are identical and, if so, drop the row with the "smaller" date, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation, and creating a new column that holds only the year, since you want the latest observation within each year.
library(dplyr)
Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>%
  arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
  ID_column Date_column Price_column year
1         A  2004-12-31           13 2004
2         A  2005-12-31           17 2005
3         B  2005-12-31           35 2005
4         A  2006-12-31           19 2006
5         B  2006-12-31           38 2006
6         B  2007-12-31           39 2007
7         B  2008-12-31           41 2008
Since we arranged by the actual date in descending order, the row dropped for each unique combination of ID and year is guaranteed to be the older one. Reverse the arrangement if you want to keep the oldest occurrence instead.
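The same deduplication can also be written with an explicit group_by(); a minimal sketch, assuming the year column created above and dplyr >= 1.0 for slice_max():
library(dplyr)
Data_dedup <- Data %>%
  group_by(ID_column, year) %>%
  slice_max(Date_column, n = 1) %>%  # keep the latest date within each ID/year pair
  ungroup()
The deduplicated frame should then pass into pdata.frame(Data_dedup, index = c("ID_column", "year")) without the duplicate ID/Time error.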

How to get maximum high or minimum low for each week in a daily timeseries data then make it a weekly timeseries?

I have daily OHLC data for US stocks. I would like to derive a weekly time series from it and compute an SMA and EMA. To do that, the requirement is to create one weekly time series from the maximum High per week, and another from the minimum Low per week. I would then compute their SMA and EMA and assign the values to every day of the week (one period forward). So, first problem first: how do I get the weekly series from the daily one using R (any package)? Better yet, can you show me an algorithm for it in any language? Golang is preferred, but I can rewrite it in Go if needed.
Date        High  Low  Week(High)  Week(Low)  WkSMAHigh 2DP  WkSMALow 2DP
                                              (assigned one period forward)
Dec 24 Fri  6     3    8           3          5.5            1.5
Dec 23 Thu  7     5                           5.5            1.5
Dec 22 Wed  8     5                           5.5            1.5
Dec 21 Tue  4     4                           5.5            1.5
(Holiday: Dec 20)
Dec 17 Fri  4     3    6           2          None
Dec 16 Thu  4     3
Dec 15 Wed  5     2
Dec 14 Tue  6     4
Dec 13 Mon  6     4
Dec 10 Fri  5     1    5           1          None
Dec 9  Thu  4     3
Dec 8  Wed  3     2
(Holiday: Dec 6 & 7)
I'd start by generating a column which specifies which week it is.
You could use the lubridate package for this, though that requires converting your dates into Date objects. Its week() function returns the number of complete 7-day periods that have passed since January 1st, plus one. However, I don't know whether your data spans several years, and I think there's a simpler way anyway.
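For illustration, the lubridate route would look something like this; a sketch assuming ohlcData$Date has already been converted to a Date:
library(lubridate)
# week() numbers the complete 7-day periods since Jan 1st, starting at 1
ohlcData$Week <- week(ohlcData$Date)
# if the data spans several years, pair it with the year to keep weeks distinct
ohlcData$Week <- paste(year(ohlcData$Date), week(ohlcData$Date), sep = "-")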
The example I'll give below will simply do it by creating a column which just repeats an integer 7 times up to the length of your data frame.
Pretend your data frame is called ohlcData.
# Create a sequence 7 at a time all the way up to the end of the data frame
# I limit the sequence to the length nrow(ohlcData) so the rounding error
# doesn't make the vectors uneven lengths
ohlcData$Week <- rep(seq(1, ceiling(nrow(ohlcData)/7)), each = 7)[1:nrow(ohlcData)]
With that created we can then go ahead and use the plyr package, which has a really useful function called ddply. This function applies a function to columns of data grouped by another column of data. In this case we will apply the max and min functions to your data based on its grouping by our new column Week.
library(plyr)
weekMax <- ddply(ohlcData[,c("Week", "High")], "Week", numcolwise(max))
weekMin <- ddply(ohlcData[,c("Week", "Low")], "Week", numcolwise(min))
That will then give you the min and max of each week. The dataframe returned for both weekMax and weekMin will have 2 columns, Week and the value. Combine these however you see fit. Perhaps weekExtreme <- cbind(weekMax, weekMin[,2]). If you want to be able to marry up date ranges to the week numbers it will just be every 7th date starting with whatever your first date was.
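Note that the rep() trick assumes a fixed number of rows per week, which weekends and holidays (like Dec 20 above) will break. An alternative sketch with the xts package, which groups by actual calendar weeks, assuming ohlcData$Date is a real Date column:
library(xts)
highs <- xts(ohlcData$High, order.by = ohlcData$Date)
lows  <- xts(ohlcData$Low,  order.by = ohlcData$Date)
# apply.weekly() splits on calendar-week boundaries, so missing days don't shift the grouping
weekHigh <- apply.weekly(highs, max)
weekLow  <- apply.weekly(lows, min)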

R - Analysis of time series with semi-annual data?

I have a time series with semi-annual (half-yearly) data points.
It seems that the ts() function can't handle that as "frequency = 2" returns a very strange time series object that extends far beyond the actual time period.
Is there any way to do time series analysis of this kind of time series object in R?
EDIT: Here's an example:
dat <- seq(1, 17, by = 1)
semi <- ts(dat, start = c(2008,12), frequency = 2)
semi
Time Series:
Start = c(2013, 2)
End = c(2021, 2)
Frequency = 2
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
I was expecting:
semi
     s1 s2
2008     1
2009  2  3
2010  4  5
2011  6  7
2012  8  9
2013 10 11
2014 12 13
2015 14 15
2016 16 17
First, let me explain why the series starts at 2013 instead of 2008. The start and end arguments work in units of periods (the frequency). You asked for the 12th period counting from the start of 2008, which, with a frequency of 2, is the second period of 2013.
This should work for the period:
semi <- ts(dat, start = c(2008,2), frequency = 2)
semi now covers the correct time span; however, print() does not know period names for a frequency of 2, so you won't get the s1/s2 layout you expected. If you plot the time series, the correct half-yearly graph will be shown.
plot.ts(semi)
In this related question someone explains the standard frequencies (4 for quarterly, 12 for monthly) for which ts() knows the period names.
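If you want readable half-year labels anyway, the zoo package is one option; a minimal sketch, assuming the series starts in the second half of 2008:
library(zoo)
dat <- seq(1, 17)
# with frequency = 2, a numeric start of 2008.5 means the second half of 2008
semi_z <- zooreg(dat, start = 2008.5, frequency = 2)
time(semi_z)  # 2008.5 2009.0 2009.5 ...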

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong: t increments by exactly 1 per row, while the underlying dates advance by varying amounts:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
To me this is the exact opposite of a linear time trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), by = "days"))
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to a regular (continuous) time series is a good idea.
You can use xts to transform the data; it is handy because xts objects can be used in other packages like a regular ts.
Filling the gaps
library(xts)
# convert myDate to POSIXct if necessary
# create an xts object from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), x$myDate)
ts1
# create an empty series with one index entry per day in the range
ts_empty <- xts(, seq(from = start(ts1), to = end(ts1), by = "DSTday"))
# merge the empty series into the data and fill the gaps with 0
ts2 <- merge(ts1, ts_empty, fill = 0)
# or interpolate instead, for example:
ts2 <- merge(ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo/xts-ready gap-filling functions are:
# na.locf   - carry the previous value forward
# na.approx - linear interpolation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on your other question they are very likely. I think you want to aggregate such values with sum:
ts1 <- period.apply( ts1, endpoints(ts1,'days'), sum)
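Tying this back to the regression: once the gaps are filled, the row index is a genuinely continuous daily trend. A minimal sketch, assuming ts2 from above with zero-filled gaps:
# back to a data frame with one row per calendar day
d <- data.frame(Amount = as.numeric(ts2$a), Count = as.numeric(ts2$c))
d$t <- seq_len(nrow(d))  # t now advances one unit per day
fit_t <- lm(Amount ~ t + Count, data = d)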

Flowstrates and R: extracting and reshaping data in required format

I'm working on transforming a large dataset into the format required for analysis with the flowstrates package.
What I currently have is a large file (600k trips) with origin and destination points.
Format is sort of like this:
tripID  Month  start_pt  end_pt
1       June   1         3
2       June   1         3
3       July   1         5
4       July   1         7
5       July   1         7
What I need to be able to generate is a file that has trip counts by unit time (let's say months) in a format like this:
start_pt  end_pt  June  July  August  ...  December
1         3       2     0     5       ...  9
1         5       0     1     4       ...  4
1         7       0     2     0       ...  0
It's easy enough to pre-segment the data by month and then generate counts for each origin-destination pair, but then putting it all back together causes all sorts of problems since each of the pre-segmented chunks of data have very different sizes. So it seems that I'd need to do this for the entire dataset at once.
Are there any packages for doing this type of processing? Would it be easier to do this in something like SQL or SQLite?
Thanks in advance for any help.
You can use the reshape2 package to do this fairly easily.
If your data is dat,
library("reshape2")
dcast(dat, start_pt+end_pt~Month, value.var="tripID", fun.aggregate=length)
This gives a single entry for each start_pt/end_pt/Month combination, the value of which is how many cases had that combination (the length of tripID for that set).
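If you are on current dplyr/tidyr instead, the same pivot can be sketched as follows (assumes dat as above):
library(dplyr)
library(tidyr)
dat %>%
  count(start_pt, end_pt, Month) %>%  # trip counts per origin-destination pair and month
  pivot_wider(names_from = Month, values_from = n, values_fill = 0)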
