How to find out how many trading days in each month in R? - r

I have a dataframe like this. The time span is 10 years. Because it's Chinese market data, and China has Lunar Holidays. So each year have different holiday times in terms of the western calendar.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the least number of trading days, and most importantly, what number is that.
There are not repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925

If there are not repeated days, you can count days per month and year by:
library(data.table) "maxx"))), .Names = c("X2005", "X2006", "X2007", "X2008"))
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]

The chron and bizdays packages deal with business days but neither actually contains a usable calendar of holidays limiting their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20

Related

How to calculate the average year

I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply but these return 1 value, which is the 20-year total average
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588

How to query NOAA for historical daily temperature averages using rnoaa?

I'm trying to find the historical average temperature between a range of dates using NOAA data and comparing to the long term average temperatures.
I'm using the rnoaa package and have hit a bit of a snag. For long term averages, I have been successful using the following syntax:
library('rnoaa')
start_date = "2010-01-15"
end_date = "2010-11-14"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_DLY', stationid=paste0('GHCND:',station_id),
datatypeid='dly-tavg-normal',
startdate = start_date, enddate = end_date,limit=365)
This lets me parse weather_data$data for the long term average temperatures for that given station between January 15th and November 14th.
However, I can't seem to find the right dataset or datatype for historical average temperatures. I'd like to get the same data as the code above except with the actual daily average temperatures for those days. Any idea how to query this? I've been at it for a few hours and have had no luck.
Something I tried was the following:
weather_data <- ncdc(datasetid='GHCND', stationid=paste0('GHCND:',station_id),
startdate = start_date, enddate = end_date,limit=365)
uniq_d_types = unique(weather_data$data$datatype)
View(uniq_d_types)
This let me see the unique data types in the GHCND dataset but none of the data types seemed to be daily average temperatures. Any thoughts?
In order to obtain average daily actual temperatures from the NOAA data using the rnoaa package, one must use the hourly data and aggregate it by day. Hourly NOAA data is in the NORMAL_HLY data set, and the required data type is HLY-TEMP-NORMAL.
library('rnoaa')
library(lubridate)
options(noaakey = "obtain key from NOAA website")
start_date = "2010-01-15"
end_date = "2010-01-31"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_HLY', stationid=paste0('GHCND:',station_id),
datatypeid = "HLY-TEMP-NORMAL",
startdate = start_date, enddate = end_date,limit=500)
data <- weather_data$data
data$year <- year(data$date)
data$month <- month(data$date)
data$day <- day(data$date)
# summarize to average daily temps
aggregate(value ~ year + month + day,mean,data = data)
...and the output:
> aggregate(value ~ year + month + day,mean,data = data)
year month day value
1 2010 1 15 323.5417
2 2010 1 16 322.8750
3 2010 1 17 323.4167
4 2010 1 18 323.7500
5 2010 1 19 323.2083
6 2010 1 20 321.0833
7 2010 1 21 318.4167
8 2010 1 22 317.6667
9 2010 1 23 319.0000
10 2010 1 24 321.0833
11 2010 1 25 323.5417
12 2010 1 26 326.0833
13 2010 1 27 328.4167
14 2010 1 28 330.9583
15 2010 1 29 333.2917
16 2010 1 30 335.7917
17 2010 1 31 308.0000
>
Note that temperatures are stored in tenths of degrees in this data set, so for the period between January 15th and 31st 2010, the average daily temperatures at the Dulles International Airport weather station were between 30.8 degrees and 33.5 degrees.
Also note that to calculate the average by stationId and run across multiple weather stations, simply add station to the aggregate() function.
> # summarize to average daily temps by station
> aggregate(value ~ station + year + month + day,mean,data = data)
station year month day value
1 GHCND:USW00093738 2010 1 15 323.5417
2 GHCND:USW00093738 2010 1 16 322.8750
3 GHCND:USW00093738 2010 1 17 323.4167
4 GHCND:USW00093738 2010 1 18 323.7500
5 GHCND:USW00093738 2010 1 19 323.2083
6 GHCND:USW00093738 2010 1 20 321.0833
7 GHCND:USW00093738 2010 1 21 318.4167
8 GHCND:USW00093738 2010 1 22 317.6667
9 GHCND:USW00093738 2010 1 23 319.0000
10 GHCND:USW00093738 2010 1 24 321.0833
11 GHCND:USW00093738 2010 1 25 323.5417
12 GHCND:USW00093738 2010 1 26 326.0833
13 GHCND:USW00093738 2010 1 27 328.4167
14 GHCND:USW00093738 2010 1 28 330.9583
15 GHCND:USW00093738 2010 1 29 333.2917
16 GHCND:USW00093738 2010 1 30 335.7917
17 GHCND:USW00093738 2010 1 31 308.0000
>
The answer is to grab historical (meaning actual, on the day specified-- not long term average) weather data from the NOAA's ISD database. USAF and WBAN values can be found by looking through the isd-history.csv file found here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Here's an example query.
out <- isd(usaf='724030', wban = '93738', year=2018)
This will grab a years worth of ~hourly weather data from ISD mapping. You can then parse/process this data however you see fit (e.g. for daily average temperatures like I did).

Weekends in a Month in R

I am trying to prepare an xreg serie for my Arima model and I will use number of weekends in a month for it. I can find results for a year but when it is longer than a year, it usually is, I couldn't find a way. Here is what I do so far.
dates <- seq(from=as.Date("2001-01-01"), to=as.Date("2010-12-31"), by = "day")
wd <- weekdays(dates)
aylar <- months(dates[which(wd == "Sunday" | wd == "Satuday")])
table(aylar)
What I want is gathering all months' weekends not based on only months but also years. So that I can have the same length of serie with my original forecast serie.
Here is my solution:
library(chron)
library(dplyr)
library(lubridate)
month <- months(dates[chron::is.weekend(dates)])
day <- dates[chron::is.weekend(dates)]
# create data.frame
df <- data.frame(date = day, month = month, year = chron::years(day))
df %>% group_by(year, month) %>% summarize(weekends = floor(n()/2))
# year month weekends
# <dbl> <fctr> <dbl>
#1 2001 April 4
#2 2001 August 4
#3 2001 Dezember 5
#4 2001 Februar 4
#5 2001 Januar 4
#6 2001 Juli 4
#7 2001 Juni 4
#8 2001 Mai 4
#9 2001 März 4
#10 2001 November 4
## ... with 110 more rows
I hope this is a starting point for your work.

Aggregate count of timeseries values which exceed threshold, by year-month

I am now learning R and using the SEAS package to help me with some calculation in R and data is the same format as SEAS package likes. It is a time series
require(seas)
data(mscdata)
dat.int <- (mksub(mscdata, id=1108447))
the heading of the data and it is 20 years of data
year yday date t_max t_min t_mean rain snow precip
However, I now need to calculate the number of days in each month rainfall is >= 1.0mm . So at the end of it. I would have two columns ( each month in each year and total # of days in each month rainfall>= 1.0mm )
I'm not certain how to write this code and any help would be appreciated
Thank you
Lam
I now need to calculate the number of days in each month rainfall is >= 1.0mm. So at the end of it. I would have two columns ( each month in each year and total # of days in each month rainfall>= 1.0mm )
1) So dat.int$date is a Date object. First step is you need to create a new column dat.int$yearmon extracting the year-month, e.g. using zoo::yearmon
Extract month and year from a zoo::yearmon object
require(zoo)
dat.int$yearmon <- as.yearmon(dat.int$date, "%b %y")
2) Second, you need to do a summarize operation (recommend you use plyr or the newer dplyr) on rain>=1.0 aggregated by yearmon. Let's name our resulting column rainy_days.
If you want to store rainy_days column back into the dat.int dataframe, you use a transform instead of a summarize:
ddply(dat.int, .(yearmon), transform, rainy_days=sum(rain >= 1.0) )
or else if you really just want a new summary dataframe:
require(plyr)
rainydays_by_yearmon <- ddply(dat.int, .(yearmon), summarize, rainy_days=sum(rain >= 1.0) )
print.data.frame(rainydays_by_yearmon)
yearmon rainy_days
1 Jan 1975 14
2 Feb 1975 12
3 Mar 1975 13
4 Apr 1975 6
5 May 1975 6
6 Jun 1975 5
...
355 Jul 2004 3
356 Aug 2004 7
357 Oct 2004 14
358 Nov 2004 16
359 Dec 2004 19
Note: you can do the above with plain old R, without using zoo or plyr/dplyr packages. But might as well teach you nicer, more scalable, maintainable code idioms.

return final row of dataframe - recurring variable names

I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseasons column is designated as a factor with 4 levels, whereas this is not the case in my original dataframe.
In your ddply code, you only forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
group_by(year, allseason) %>%
do(tail(.,1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
And just to add to #beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452

Resources