I need to convert quarterly data into annual data by summing the four quarters in each year. When I searched stackoverflow.com, I found that using a function to sum over periods seemed to work. However, the resulting date format did not match, so I couldn't use the converted annual array together with my other arrays.
For example, annual data in FRED looks as follows:
2009-01-01 12126.078
2010-01-01 12739.542
2011-01-01 13352.255
2012-01-01 14061.878
2013-01-01 14444.823
However, when I converted the data using the following code:
library("quantmod")
library(zoo)
library(mFilter)
library(nleqslv)
fredsym <- c("PROPINC")
getSymbols(fredsym, src = "FRED")  # fetch the quarterly PROPINC series from FRED
quarter.proprietors_income <- PROPINC
## convert to annual
as.year <- function(x) as.integer(as.yearqtr(x))  # map each quarterly date to its year
annual.proprietors_income <- aggregate(quarter.proprietors_income, as.year, sum)  # sum the four quarters of each year
it changes from this:
2016-01-01 1327.613
2016-04-01 1339.493
2016-07-01 1346.067
2016-10-01 1354.560
2017-01-01 1380.221
2017-04-01 1378.637
2017-07-01 1381.911
2017-10-01 1403.114
to this:
2011 4574.669
2012 4965.486
2013 5138.968
2014 5263.208
2015 5275.225
2016 5367.733
2017 5543.883
What I need is annual data that keeps the original YYYY-MM-DD format, appearing as YYYY-01-01 for each year. Otherwise it won't line up with my other annual data.
Is there any way to solve this issue?
Using DF from the Note below, use cut as shown:
aggregate(DF["value"], list(year = as.Date(cut(as.Date(DF$Date), "year"))), sum)
giving:
year value
1 2016-01-01 5367.733
2 2017-01-01 5543.883
Note
Lines <- "Date value
2016-01-01 1327.613
2016-04-01 1339.493
2016-07-01 1346.067
2016-10-01 1354.560
2017-01-01 1380.221
2017-04-01 1378.637
2017-07-01 1381.911
2017-10-01 1403.114"
DF <- read.table(text = Lines, header = TRUE)
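If you need the result back as an xts series rather than a data frame (so it lines up with other xts objects), here is a minimal sketch reusing the aggregate() call above; the names ann and annual.xts are just illustrative:
library(xts)
ann <- aggregate(DF["value"], list(year = as.Date(cut(as.Date(DF$Date), "year"))), sum)
# the grouping column is already a Date of the form YYYY-01-01,
# so it can serve directly as the xts index
annual.xts <- xts(ann$value, order.by = ann$year)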
I found that the aggregate command changes the class to zoo, so the result is no longer an xts time series.
Alternatively, apply.yearly seems to work:
annual.proprietors_income <- apply.yearly(xts(quarter.proprietors_income), sum)
This is now in xts, but the index shows the quarter-end date, YYYY-10-01, for each year. How can I make it YYYY-01-01?
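One way to get YYYY-01-01 stamps (a sketch, assuming annual.proprietors_income is the xts object returned by apply.yearly above):
# replace each period-end stamp (YYYY-10-01) with the first day of its year
index(annual.proprietors_income) <- as.Date(
  paste0(format(index(annual.proprietors_income), "%Y"), "-01-01"))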
Related
I have time series data that is seasonal by the quarter. However, the data starts in the 2nd quarter of the first year but all other years have all four quarters.
> EquifaxData
DATE EQFXSUBPRIME013045
1 2014-04-01 42.58513
2 2014-07-01 43.15483
3 2014-10-01 43.55090
4 2015-01-01 42.59218
5 2015-04-01 41.47105
6 2015-07-01 41.53640
7 2015-10-01 41.82020
8 2016-01-01 40.98760
9 2016-04-01 40.51305
10 2016-07-01 39.91170
11 2016-10-01 40.15402
I then converted the Date column to a date as follows:
> EquifaxData$DATE <- as.Date(EquifaxData$DATE)
Now comes the issue. I want to convert this data to a time series, but I need to specify my start date as the beginning of Q2 2014, not the beginning of 2014. As you can see below from what I have tried, the resulting time series shown by head has all the values shifted one quarter back, because it starts from the beginning of 2014.
> EquifaxTs <- ts(EquifaxData$EQFXSUBPRIME013045, start=2014, frequency = 4)
> head(EquifaxTs)
Qtr1 Qtr2 Qtr3 Qtr4
2014 42.58513 43.15483 43.55090 42.59218
2015 41.47105 41.53640
>
How can I define EquifaxTs to correctly start in Q2 2014 and still remain seasonal with a frequency of 4 per year?
I think this solves it:
EquifaxTs <- ts(EquifaxData$EQFXSUBPRIME013045, start = c(2014, 2), frequency = 4)
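Printing the series confirms the alignment; the first value now falls in Qtr2 of 2014:
> EquifaxTs
        Qtr1     Qtr2     Qtr3     Qtr4
2014          42.58513 43.15483 43.55090
2015 42.59218 41.47105 41.53640 41.82020
2016 40.98760 40.51305 39.91170 40.15402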
So I have an xts time series over the year with time zone "UTC". The time interval between each row is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want data over one hour to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoint vector but somehow it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not. I don't want to change the names of my columns, I want to sum over a different interval.
Here a reproducible example:
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
I would like it to always sum over the same full-hour interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use a startpoints function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
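With the example data, the aggregated series now starts on the hour and every interval sums to 4; a sketch of the expected output:
head(agg_aligned)
2015-01-01 00:00:00 4
2015-01-01 01:00:00 4
2015-01-01 02:00:00 4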
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea):
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots where the frequency of string x is plotted against time.
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"), as.POSIXct("2016-12-31"), length.out = 10000),
                 what = sample(LETTERS, 10000, replace = TRUE))
tweet.summary = dat %>%
  group_by(day = date(time)) %>%  # To summarise by month: group_by(month = month(time, label=TRUE))
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label = TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var = "Month"),
       aes(Month, value, colour = variable, group = variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits = c(0, 0.06), labels = percent_format()) +
  labs(colour = "", y = "")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day = sample(1:28, 10), month = sample(1:12, 10), year = 2016,
                  time = paste0(sample(c(paste0(0, 0:9), 10:12), 10), ":", sample(10:50, 10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
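As a quick check of the ordering problem mentioned in the question (Feb 13 as '2.13' sorting before Feb 7 as '2.7'), a small sketch showing that Date objects compare correctly:
d <- as.Date(c("2016-02-07", "2016-02-13"))
as.numeric(d)  # 16838 16844: days since 1970-01-01
d[1] < d[2]    # TRUE: February 7 now sorts before February 13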
I have a text file dataset with headers
YEAR MONTH DAY value
which runs daily from 1/6/2010 to 14/7/2012. I open and plot the data with the following commands:
data=read.table('example.txt',header=T)
time = strptime(paste(data$DAY,data$MONTH,data$YEAR,sep="-"), format="%d-%m-%Y")
plot(time,data$value)
However, when the data are plotted, the x axis only shows 2011 and 2012. What can I do to keep the 2011 and 2012 labels but also add some specific months, e.g. March, June & September?
I have made the data available on this link
https://dl.dropbox.com/u/107215263/example.txt
You need to use the function axis.POSIXct to format and place your date labels as you wish:
plot(time,data$value,xaxt="n") #Skip the x-axis here
axis.POSIXct(1, at=pretty(time), format="%B %Y")
To see all possible formats, see ?strptime.
You can of course play with parameter at to place your ticks wherever you want, for instance:
axis.POSIXct(1, at=seq(time[1],time[length(time)],"3 months"),
format="%B %Y")
While this doesn't answer the question directly, I would suggest using the xts package for any time series analysis. It makes time series analysis very convenient:
require(xts)
DF <- read.table("https://dl.dropbox.com/u/107215263/example.txt", header = TRUE)
head(DF)
## YEAR MONTH DAY value
## 1 2010 6 1 95.3244
## 2 2010 6 2 95.3817
## 3 2010 6 3 100.1968
## 4 2010 6 4 103.8667
## 5 2010 6 5 104.5969
## 6 2010 6 6 107.2666
#Get Index for xts object which we will create in next step
DFINDEX <- ISOdate(DF$YEAR, DF$MONTH, DF$DAY)
#Create xts timeseries
DF.XTS <- .xts(x = DF$value, index = DFINDEX, tzone = "GMT")
head(DF.XTS)
## [,1]
## 2010-06-01 12:00:00 95.3244
## 2010-06-02 12:00:00 95.3817
## 2010-06-03 12:00:00 100.1968
## 2010-06-04 12:00:00 103.8667
## 2010-06-05 12:00:00 104.5969
## 2010-06-06 12:00:00 107.2666
#plot xts
plot(DF.XTS)
My time series data has timestamps like 8/18/2012 11:18:00 PM and spans six months. How can I subset it by month and average a variable within each month (using R)?
Thank you so much
You can use the xts package. First I generate some sample data: here I create six months of half-daily observations.
dat <- data.frame(date = as.POSIXct('8/18/2012 11:18:00',
                                    format = '%m/%d/%Y %H:%M:%S') +
                         seq(0, by = 60*60*12, length.out = 365),
                  value = rnorm(365))
Then I create an xts object
library(xts)
dat.xts <- xts(x= dat$value,order.by = dat$date)
Finally I use the handy function apply.monthly to compute the mean within each month. Each result is stamped with the date of the last observation in that month, which is why the final, incomplete month ends at 2013-02-16:
apply.monthly(dat.xts,mean)
2012-08-31 23:18:00 0.03415933
2012-09-30 23:18:00 0.02884122
2012-10-31 22:18:00 -0.27767240
2012-11-30 22:18:00 -0.15614567
2012-12-31 22:18:00 -0.02595911
2013-01-31 22:18:00 -0.23284335
2013-02-16 10:18:00 0.14537790
You can format the dates and compute the averages with aggregate (thanks to @agstudy for the sample data):
aggregate(value~format(date,"%Y-%m"),dat,FUN=mean)
format(date, "%Y-%m") value
1 2012-08 -0.31409786
2 2012-09 -0.37585310
3 2012-10 -0.04552703
4 2012-11 -0.05726177
5 2012-12 0.04822608
6 2013-01 0.03412790
7 2013-02 -0.10157931
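If you need the month labels back as real dates for plotting, zoo's as.yearmon can parse the "%Y-%m" strings; here is a minimal sketch (the name monthly is just illustrative):
library(zoo)
monthly <- aggregate(value ~ format(date, "%Y-%m"), dat, FUN = mean)
names(monthly) <- c("month", "value")
monthly$month <- as.Date(as.yearmon(monthly$month))  # first day of each month
plot(monthly$month, monthly$value, type = "l")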