Specifying start date of timeseries data in R as Q2 - r

I have time series data that is seasonal by the quarter. However, the data starts in the 2nd quarter of the first year but all other years have all four quarters.
> EquifaxData
DATE EQFXSUBPRIME013045
1 2014-04-01 42.58513
2 2014-07-01 43.15483
3 2014-10-01 43.55090
4 2015-01-01 42.59218
5 2015-04-01 41.47105
6 2015-07-01 41.53640
7 2015-10-01 41.82020
8 2016-01-01 40.98760
9 2016-04-01 40.51305
10 2016-07-01 39.91170
11 2016-10-01 40.15402
I then converted the Date column to a date as follows:
> EquifaxData$DATE <- as.Date(EquifaxData$DATE)
Now comes the issue. I want to convert this data to a time series, but I need to specify the start date as the beginning of Q2 2014, not the beginning of 2014. As you can see from my attempt below, the resulting time series shown by head has all the values shifted one quarter back because it starts from the beginning of 2014.
> EquifaxTs <- ts(EquifaxData$EQFXSUBPRIME013045, start=2014, frequency = 4)
> head(EquifaxTs)
Qtr1 Qtr2 Qtr3 Qtr4
2014 42.58513 43.15483 43.55090 42.59218
2015 41.47105 41.53640
>
How can I define EquifaxTs to correctly start in Q2 2014 and still remain seasonal with a frequency of 4 per year?

I think this solves it:
EquifaxTs <- ts(EquifaxData$EQFXSUBPRIME013045, start = c(2014, 2), frequency = 4)
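As a quick check of the start = c(year, quarter) form, here is a minimal sketch that derives the quarter from the first date instead of hard-coding it. The names first_date, values, start_year and start_qtr are illustrative only; the numbers are the first six rows of the question's data.

```r
# First date and first six values copied from the question's EquifaxData
first_date <- as.Date("2014-04-01")
values <- c(42.58513, 43.15483, 43.55090, 42.59218, 41.47105, 41.53640)

# Derive c(year, quarter) from the first date (months 1-3 -> Q1, 4-6 -> Q2, ...)
start_year <- as.integer(format(first_date, "%Y"))
start_qtr  <- (as.integer(format(first_date, "%m")) - 1L) %/% 3L + 1L

EquifaxTs <- ts(values, start = c(start_year, start_qtr), frequency = 4)
print(EquifaxTs)         # the first value should now sit under Qtr2 of 2014
print(cycle(EquifaxTs))  # quarter position of each observation
```

Deriving the start from the data avoids the off-by-one-quarter shift if the file is later refreshed with a different first observation.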

Related

Formatting date column with different formats (including missing day information) - lubridate

I'm relatively new to R. I downloaded a dataset about clinical trial data, but it occurred to me, that the format of the dates in the relative column are mixed up: most of them are like "September 1, 2012", but some are missing the day information (e.g. October 2015).
I want to express them all in the same way (e.g. yyyy-mm-dd) so that I can work with them. That part went fine; the only remaining problem is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col" through which I can pass the intended name for the created (formatted) column, but the column is always named output_col.
Do you know how I could handle this, i.e. how to pass the intended name of the output column right into the function?
Is there a better way to solve my problem?
-> I even tried a more complex orders argument for lubridate::parse_date_time like
parse_date_time(input_col, orders="mdy", "my")
but this didn't work.
Here's the code:
library("tidyverse")
library("lubridate")
Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
"June 24, 2010","January 2005")
df_dates <- data.frame(Observation, Date_original)
# looking for a comma in the cell
comma_detect <- function(a_string){
str_detect(a_string, ",")
}
# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
if_else(comma_detect(input_col),
parse_date_time(input_col, orders="mdy"),
parse_date_time(input_col, orders="my"))
}
# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
mutate(df, output_col = date_correction_row(input_col))
}
df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()
OUTPUT
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
In the code below we assume that output_col equals "Date". They all set the column name, give no warnings and use Date class.
1) Try each format and take the one that does not give NA. This uses only base R.
output_col <- "Date"
within(df_dates, assign(output_col, pmin(na.rm = TRUE,
as.Date(Date_original, "%B %d, %Y"),
as.Date(paste(Date_original, 1), "%B %Y %d"))))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
2) This can also be done in lubridate. It is important that my is the first rather than the second argument to coalesce: my returns NA for values that do not match its format, whereas mdy returns a wrong date for them, so if mdy were first, coalesce would never get to my. This approach is shorter than (3), but you might prefer the robustness of (3), since it does not depend on what is returned for non-matching dates.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := coalesce(my(Date_original, quiet = TRUE),
mdy(Date_original)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
3) If you prefer your own method of first checking for comma here is a variation of that which is more compact. It uses my and mdy instead of parse_date_time since my and mdy give Date class results which are more appropriate here than the POSIXct of parse_date_time given that there are no times.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := if_else(grepl(",", Date_original),
mdy(Date_original), my(Date_original, quiet = TRUE)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
When the date structure is known, I like to explicitly correct the date structure first, then parse. Here I use regex to sub in 1 when the day is missing, then we just parse like normal.
library(tidyverse)
df_dates %>%
mutate(
output_col = gsub("(?<!,)\\s(?=\\d{4})", " 1, ", Date_original, perl = TRUE) %>%
as.Date(., format = '%B %d, %Y')
)
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
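Coming back to the original question of passing the output column name into a function: a minimal base R sketch, assuming both column names are passed as strings, is to assign with df[[output_col]]. The function body is an illustration built on the two formats from (1), not the asker's original helper, and %B relies on English month names in the current locale.

```r
# Sketch: both column names are passed as character strings (an assumption)
date_correction <- function(df, input_col, output_col) {
  x <- as.character(df[[input_col]])
  # Try the full "Month day, year" format first
  parsed <- as.Date(x, format = "%B %d, %Y")
  # For rows that failed, assume "Month year" and use day 1
  month_only <- as.Date(paste(x, 1), format = "%B %Y %d")
  missing_day <- is.na(parsed)
  parsed[missing_day] <- month_only[missing_day]
  # [[<- uses the *value* of output_col as the column name
  df[[output_col]] <- parsed
  df
}

df_dates <- data.frame(
  Observation = 1:5,
  Date_original = c("October 2014", "August 2014", "June 2013",
                    "June 24, 2010", "January 2005")
)
result <- date_correction(df_dates, "Date_original", "date_formatted")
print(result)
```

Inside mutate() the equivalent is the !!output_col := form used in (2) and (3); a bare output_col = ... is taken literally as the column name, which is why the original function always produced a column called output_col.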

How to divide monthly totals by the seasonal monthly ratio in R

I am trying to de-seasonalize my data by dividing my monthly totals by the average seasonality ratio for that month. I have two data frames: avgseasonality, which has 12 rows holding the average seasonality ratio per month, and df1, which holds the monthly order totals. The problem is that avgseasonality has only 12 rows (one averaged ratio per month), while the order total data frame has 157 rows.
deseasonlize <- transform(avgseasonalityratio, deseasonlizedtotal =
df1$OrderTotal / avgseasonality$seasonalityratio)
This runs but it does not pair the months appropriately. It uses the first ratio of april and runs it on the first ordertotal of december.
> avgseasonality
Month seasonalityratio
1 April 1.0132557
2 August 1.0054602
3 December 0.8316988
4 February 0.9813396
5 January 0.8357475
6 July 1.1181648
7 June 1.0439899
8 March 1.1772450
9 May 1.0430667
10 November 0.9841149
11 October 0.9595041
12 September 0.8312318
> df1
# A tibble: 157 x 3
DateEntLabel OrderTotal `d$Month`
<dttm> <dbl> <chr>
1 2005-12-01 00:00:00 512758. December
2 2006-01-01 00:00:00 227449. January
3 2006-02-01 00:00:00 155652. February
4 2006-03-01 00:00:00 172923. March
5 2006-04-01 00:00:00 183854. April
6 2006-05-01 00:00:00 239689. May
7 2006-06-01 00:00:00 237638. June
8 2006-07-01 00:00:00 538688. July
9 2006-08-01 00:00:00 197673. August
10 2006-09-01 00:00:00 144534. September
# ... with 147 more rows
I need each order total divided by the ratio for its own month. For example, for December the calculation would be 512758/0.8316988 = 616518.864762. The results should go into a new column aligned with the corresponding month and order total. Any help is greatly appreciated!
The easiest way would be to merge() your data first, then do the operation. You can use the base R merge() function, though I will show it here using the tidyverse left_join() function. I see that one of your columns has a strange name, d$Month; renaming this to Month will simplify the merge!
Reproducible example:
library(tidyverse)
df_1 <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1,2))
df_2 <- data.frame(Month = rep(c("Jan", "Feb"),each=2), OrderTotal = 1:4)
df_1 %>%
left_join(df_2, by = "Month") %>%
mutate(deseasonlizedtotal = OrderTotal / seasonalityratio)
#> Month seasonalityratio OrderTotal deseasonlizedtotal
#> 1 Jan 1 1 1.0
#> 2 Jan 1 2 2.0
#> 3 Feb 2 3 1.5
#> 4 Feb 2 4 2.0
Created on 2019-01-30 by the reprex package (v0.2.1)
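If you would rather keep the order totals in their original row order and avoid a join, a base R lookup with match() is another option. This is only a sketch: the two small data frames reuse a few values from the question, and the column name Month in df1 assumes the d$Month column has been renamed.

```r
# A few ratios and order totals copied from the question
avgseasonality <- data.frame(
  Month = c("April", "December", "January"),
  seasonalityratio = c(1.0132557, 0.8316988, 0.8357475)
)
df1 <- data.frame(
  Month = c("December", "January", "April"),  # assumes d$Month was renamed
  OrderTotal = c(512758, 227449, 183854)
)

# For each row of df1, find the row of avgseasonality with the same month
idx <- match(df1$Month, avgseasonality$Month)

# Divide each total by the ratio of its own month
df1$deseasonalizedtotal <- df1$OrderTotal / avgseasonality$seasonalityratio[idx]
print(df1)
```

Any month in df1 that has no entry in avgseasonality gets NA here, which makes mismatched month spellings easy to spot.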

FRED data: aggregate quarterly data into annual

I need to convert quarterly data into yearly data by summing over the 4 quarters in each year. Searching stackoverflow.com, I found that using a function to sum over periods seems to work. However, the date format of the result did not match, so I couldn't use the converted annual series together with my other series.
For example, annual data in FRED looks as follows:
2009-01-01 12126.078
2010-01-01 12739.542
2011-01-01 13352.255
2012-01-01 14061.878
2013-01-01 14444.823
However, when I changed the data using the following function:
library("quantmod")
library(zoo)
library(mFilter)
library(nleqslv)
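fredsym <- c("PROPINC")
getSymbols(fredsym, src = "FRED") # assumed download step: creates the PROPINC object used below
quarter.proprietors_income <- PROPINC
## convert to annual
as.year <- function(x) as.integer(as.yearqtr(x)) # map each date to its calendar year
annual.proprietors_income <- aggregate(quarter.proprietors_income, as.year, sum) # sum over the quarters of each year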
it changes from this:
2016-01-01 1327.613
2016-04-01 1339.493
2016-07-01 1346.067
2016-10-01 1354.560
2017-01-01 1380.221
2017-04-01 1378.637
2017-07-01 1381.911
2017-10-01 1403.114
to this:
2011 4574.669
2012 4965.486
2013 5138.968
2014 5263.208
2015 5275.225
2016 5367.733
2017 5543.883
What I need is annual data in the original YYYY-MM-DD format, with each annual value dated 01-01 of its year. Otherwise it doesn't line up with the other annual data.
Is there any way to solve this issue?
Using DF from the Note below, use cut as shown:
aggregate(DF["value"], list(year = as.Date(cut(as.Date(DF$Date), "year"))), sum)
giving:
year value
1 2016-01-01 5367.733
2 2017-01-01 5543.883
Note
Lines <- "Date value
2016-01-01 1327.613
2016-04-01 1339.493
2016-07-01 1346.067
2016-10-01 1354.560
2017-01-01 1380.221
2017-04-01 1378.637
2017-07-01 1381.911
2017-10-01 1403.114"
DF <- read.table(text = Lines, header = TRUE)
I found that the aggregate command converts the result to class zoo, so it is no longer an xts time series.
Alternatively, apply.yearly seems to work.
annual.proprietors_income <- apply.yearly(xts(quarter.proprietors_income),sum)
This is now an xts object. But the index of each annual value is the date of the last quarter, YYYY-10-01. How can I make it YYYY-01-01?
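One way to get January 1 labels regardless of which quarter the aggregation lands on is to rebuild each date with format(..., "%Y-01-01"). The sketch below uses base R only and the quarterly values from the Note above; the data frame name quarterly and the column year_start are illustrative.

```r
# Quarterly values copied from the Note above
quarterly <- data.frame(
  Date = as.Date(c("2016-01-01", "2016-04-01", "2016-07-01", "2016-10-01",
                   "2017-01-01", "2017-04-01", "2017-07-01", "2017-10-01")),
  value = c(1327.613, 1339.493, 1346.067, 1354.560,
            1380.221, 1378.637, 1381.911, 1403.114)
)

# Map every date to January 1 of its own year
quarterly$year_start <- as.Date(format(quarterly$Date, "%Y-01-01"))

# Sum the four quarters of each year; the group label stays a Date
annual <- aggregate(value ~ year_start, data = quarterly, FUN = sum)
print(annual)
```

For the xts result of apply.yearly, the same idea should carry over by replacing the index, e.g. index(x) <- as.Date(format(index(x), "%Y-01-01")), though that line is untested here.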

Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
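To tie this back to the corpus in the question, here is a small base R sketch that builds a proper date from the day/month/year columns and computes the share of tweets per day that contain a hashtag. The three rows are taken from the question's sample; the column names has_lfc and date are illustrative.

```r
# Three sample rows modelled on the question's corpus
tweets <- data.frame(
  day = c(20, 20, 1), month = c(2, 2, 3), year = 2016,
  what = c("Great goal by #lfc", "Can't wait for the weekend",
           "Generic boring tweet")
)

# A real Date sorts correctly, unlike a pasted number such as 2.13 vs 2.7
tweets$date <- as.Date(ISOdate(tweets$year, tweets$month, tweets$day))
print(as.Date("2016-02-13") > as.Date("2016-02-07"))

# Flag tweets containing the hashtag (fixed = TRUE: literal match, no regex)
tweets$has_lfc <- grepl("#lfc", tweets$what, fixed = TRUE)

# Mean of a logical = share of matching tweets per day (already normalised)
daily <- aggregate(has_lfc ~ date, data = tweets, FUN = mean)
print(daily)
```

The resulting daily data frame can be passed straight to plot(daily$date, daily$has_lfc, type = "l") or to ggplot, since the x values are genuine dates.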

Choose specific date with strptime in r

I have a text file dataset with headers
YEAR MONTH DAY value
which runs hourly from 1/6/2010 to 14/7/2012. I open and plot the data with the following commands:
data=read.table('example.txt',header=T)
time = strptime(paste(data$DAY,data$MONTH,data$YEAR,sep="-"), format="%d-%m-%Y")
plot(time,data$value)
However, when the data are plotted, the x axis only shows 2011 and 2012. How can I keep the 2011 and 2012 labels but also add some specific months, e.g. March, June and September?
I have made the data available on this link
https://dl.dropbox.com/u/107215263/example.txt
You need to use the function axis.POSIXct to format and place your date labels as you wish:
plot(time,data$value,xaxt="n") #Skip the x-axis here
axis.POSIXct(1, at=pretty(time), format="%B %Y")
To see all possible formats, see ?strptime.
You can of course play with parameter at to place your ticks wherever you want, for instance:
axis.POSIXct(1, at=seq(time[1],time[length(time)],"3 months"),
format="%B %Y")
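For the specific case asked about (ticks only at March, June and September, plus the year labels), one possible sketch is to generate all month starts and keep the wanted ones. The time and value vectors below are placeholders standing in for the real file, and fixing the time zone to UTC is an assumption made to keep the month arithmetic simple.

```r
# Placeholder daily series covering the period described in the question
time  <- seq(as.POSIXct("2010-06-01", tz = "UTC"),
             as.POSIXct("2012-07-14", tz = "UTC"), by = "day")
value <- seq_along(time)  # dummy values, not the real data

# All month starts over the covered years
month_starts <- seq(as.POSIXct("2010-01-01", tz = "UTC"),
                    as.POSIXct("2012-12-01", tz = "UTC"), by = "month")
month_num <- as.integer(format(month_starts, "%m"))
in_range  <- month_starts >= min(time) & month_starts <= max(time)

# Keep March, June, September for month ticks and January for year ticks
ticks      <- month_starts[month_num %in% c(3, 6, 9) & in_range]
year_ticks <- month_starts[month_num == 1 & in_range]

plot(time, value, xaxt = "n")                       # suppress default axis
axis.POSIXct(1, at = ticks, format = "%b %Y")       # selected months
axis.POSIXct(1, at = year_ticks, format = "%Y")     # year labels
```

If the labels crowd each other, shortening the format (for example "%b") or passing las = 2 to rotate them are the usual adjustments.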
While this doesn't answer the question directly, I would suggest using the xts package for any time series analysis. It makes time series analysis very convenient.
require(xts)
DF <- read.table("https://dl.dropbox.com/u/107215263/example.txt", header = TRUE)
head(DF)
## YEAR MONTH DAY value
## 1 2010 6 1 95.3244
## 2 2010 6 2 95.3817
## 3 2010 6 3 100.1968
## 4 2010 6 4 103.8667
## 5 2010 6 5 104.5969
## 6 2010 6 6 107.2666
#Get Index for xts object which we will create in next step
DFINDEX <- ISOdate(DF$YEAR, DF$MONTH, DF$DAY)
#Create xts timeseries
DF.XTS <- xts(x = DF$value, order.by = DFINDEX, tzone = "GMT")
head(DF.XTS)
## [,1]
## 2010-06-01 12:00:00 95.3244
## 2010-06-02 12:00:00 95.3817
## 2010-06-03 12:00:00 100.1968
## 2010-06-04 12:00:00 103.8667
## 2010-06-05 12:00:00 104.5969
## 2010-06-06 12:00:00 107.2666
#plot xts
plot(DF.XTS)
