How to query NOAA for historical daily temperature averages using rnoaa? - r

I'm trying to find the historical average temperature over a range of dates using NOAA data and compare it to the long-term average temperatures.
I'm using the rnoaa package and have hit a bit of a snag. For long term averages, I have been successful using the following syntax:
library('rnoaa')
start_date = "2010-01-15"
end_date = "2010-11-14"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_DLY', stationid=paste0('GHCND:',station_id),
datatypeid='dly-tavg-normal',
startdate = start_date, enddate = end_date,limit=365)
This lets me parse weather_data$data for the long term average temperatures for that given station between January 15th and November 14th.
However, I can't seem to find the right dataset or datatype for historical average temperatures. I'd like to get the same data as the code above except with the actual daily average temperatures for those days. Any idea how to query this? I've been at it for a few hours and have had no luck.
Something I tried was the following:
weather_data <- ncdc(datasetid='GHCND', stationid=paste0('GHCND:',station_id),
startdate = start_date, enddate = end_date,limit=365)
uniq_d_types = unique(weather_data$data$datatype)
View(uniq_d_types)
This let me see the unique data types in the GHCND dataset but none of the data types seemed to be daily average temperatures. Any thoughts?

One way to get average daily temperatures from NOAA data with the rnoaa package is to use hourly data and aggregate it by day. Hourly data is in the NORMAL_HLY data set, and the required data type is HLY-TEMP-NORMAL. Note that, as the name indicates, NORMAL_HLY holds hourly climate normals (long-term averages) rather than the readings observed on a given day.
library('rnoaa')
library(lubridate)
options(noaakey = "obtain key from NOAA website")
start_date = "2010-01-15"
end_date = "2010-01-31"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_HLY', stationid=paste0('GHCND:',station_id),
datatypeid = "HLY-TEMP-NORMAL",
startdate = start_date, enddate = end_date,limit=500)
data <- weather_data$data
data$year <- year(data$date)
data$month <- month(data$date)
data$day <- day(data$date)
# summarize to average daily temps
aggregate(value ~ year + month + day,mean,data = data)
...and the output:
> aggregate(value ~ year + month + day,mean,data = data)
year month day value
1 2010 1 15 323.5417
2 2010 1 16 322.8750
3 2010 1 17 323.4167
4 2010 1 18 323.7500
5 2010 1 19 323.2083
6 2010 1 20 321.0833
7 2010 1 21 318.4167
8 2010 1 22 317.6667
9 2010 1 23 319.0000
10 2010 1 24 321.0833
11 2010 1 25 323.5417
12 2010 1 26 326.0833
13 2010 1 27 328.4167
14 2010 1 28 330.9583
15 2010 1 29 333.2917
16 2010 1 30 335.7917
17 2010 1 31 308.0000
>
Note that temperatures are stored in tenths of degrees in this data set, so for the period between January 15th and 31st 2010, the average daily temperatures at the Dulles International Airport weather station were between 30.8 degrees and 33.6 degrees.
Also note that to calculate the average by stationId and run across multiple weather stations, simply add station to the aggregate() function.
> # summarize to average daily temps by station
> aggregate(value ~ station + year + month + day,mean,data = data)
station year month day value
1 GHCND:USW00093738 2010 1 15 323.5417
2 GHCND:USW00093738 2010 1 16 322.8750
3 GHCND:USW00093738 2010 1 17 323.4167
4 GHCND:USW00093738 2010 1 18 323.7500
5 GHCND:USW00093738 2010 1 19 323.2083
6 GHCND:USW00093738 2010 1 20 321.0833
7 GHCND:USW00093738 2010 1 21 318.4167
8 GHCND:USW00093738 2010 1 22 317.6667
9 GHCND:USW00093738 2010 1 23 319.0000
10 GHCND:USW00093738 2010 1 24 321.0833
11 GHCND:USW00093738 2010 1 25 323.5417
12 GHCND:USW00093738 2010 1 26 326.0833
13 GHCND:USW00093738 2010 1 27 328.4167
14 GHCND:USW00093738 2010 1 28 330.9583
15 GHCND:USW00093738 2010 1 29 333.2917
16 GHCND:USW00093738 2010 1 30 335.7917
17 GHCND:USW00093738 2010 1 31 308.0000
>
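The year/month/day split above can also be collapsed into a single Date column, with the tenths-of-a-degree scaling applied in the same step. The sketch below uses a small made-up data frame in place of weather_data$data (the values and the ISO-style timestamp format are assumptions for illustration only), so it runs without an API key.

```r
# Made-up hourly readings in tenths of a degree, standing in for
# weather_data$data (assumed columns: date as an ISO timestamp, value)
hourly <- data.frame(
  date  = c("2010-01-15T00:00:00", "2010-01-15T12:00:00",
            "2010-01-16T00:00:00", "2010-01-16T12:00:00"),
  value = c(300, 340, 310, 350)
)

# Keep only the calendar day, then average per day
hourly$day <- as.Date(substr(hourly$date, 1, 10))
daily <- aggregate(value ~ day, data = hourly, FUN = mean)

# Convert tenths of a degree to degrees
daily$degrees <- daily$value / 10
daily
```

Grouping on one Date column keeps the result sorted chronologically, which the separate year, month and day columns do not guarantee once a query spans several months.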

The answer is to grab historical weather data (meaning the actual values on the day specified, not long-term averages) from NOAA's ISD database. USAF and WBAN values can be found by looking through the isd-history.csv file found here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Here's an example query.
out <- isd(usaf='724030', wban = '93738', year=2018)
This will grab a year's worth of roughly hourly weather data from ISD. You can then parse and process this data however you see fit (e.g. to compute daily average temperatures, as I did).

Related

Create time series in R with weekly measurements for 30 years period

I have a set of weekly data for 30 years (1991 - 2020). The data was collected weekly between 5 May and 10 October every year. This gives me 23 weeks of data every year for 30 years.
I want to create a time series in R with this data. How do I do that, please? It should have just 690 entries in the output, but it is generating 1531 entries. See my code and data below.
I saw a similar question HERE, but mine repeats for 30 years.
myts <- ts(df$Kc_Kamble, start = c(1991, 1), end = c(2020, 23), frequency = 52)
Output in R:
Time Series:
Start = c(1991, 1)
End = c(2020, 23)
Frequency = 52
Sample data:
Year Week Kc_Kamble
1991 1 0.357445197
1991 2 0.36902168
1991 3 0.383675947
1991 4 0.400703221
1991 5 0.418901921
1991 6 0.437049406
1991 7 0.453742803
1991 8 0.467291036
1991 9 0.475942834
1991 10 0.476898402
1991 11 0.464632341
1991 12 0.436298927
1991 13 0.396338825
1991 14 0.352731819
1991 15 0.313539638
1991 16 0.283932169
1991 17 0.2627343
1991 18 0.247373874
1991 19 0.235647483
1991 20 0.225655859
1991 21 0.216663659
1991 22 0.208550065
1991 23 0.203605036
1992 1 0.336754943
1992 2 0.334735193
1992 3 0.342654691
1992 4 0.363520428
1992 5 0.397733301
1992 6 0.4399758
1992 7 0.483592219
1992 8 0.521920773
1992 9 0.548597061
1992 10 0.560150059
1992 11 0.557210705
1992 12 0.542114151
1992 13 0.5173071
1992 14 0.485236257
1992 15 0.448348321
1992 16 0.409089999
1992 17 0.369907993
1992 18 0.333162073
1992 19 0.300014261
1992 20 0.270225988
1992 21 0.243406301
1992 22 0.219247646
1992 23 0.204966601
Let me suggest the following steps to set up and start analyzing your time series.
Initialize your time series by creating a 'dates' sequence and a 'data' vector (set to NA). Use the xts library to create the time series.
library(xts)
dates <- seq(as.Date("1991-01-01"), as.Date("2020-12-31"), by = "weeks") # must extend past 10 October 2020 to cover the last season
data <- rep(NA, length(dates))
myxts <- xts(x = data, order.by = dates)
str(myxts); head(myxts); tail(myxts)
Collect your data.
Data is collected weekly between 5 May and 10 October every year.
Let's read the data and work with Weekly Total Precipitation for year 2014.
ts_data <- read.table("https://www.dropbox.com/s/k2cxpja3cpsyoyc/ts_data.txt?dl=1", header =TRUE, sep="\t")
year.2014 <- ts_data[which(ts_data$Year == 2014),]
year.2014 # 23 rows of data for 2014.
start <- as.Date("2014-5-5"); end <- as.Date("2014-10-10")
collect <- which ( index(myxts) >= start & index(myxts) <= end )
myxts[collect] <- year.2014$PRPtot
# year.2014 and collect must have the same number of rows
Verify the collected data. You should see data inside each time window, and NA outside the time windows.
myxts2 <- window(myxts, start=start-50, end=end+50)
str(myxts2); myxts2
Visualize the collected data. You could view the complete time series (i.e. myxts). Note that autoplot drops all NAs.
library(ggplot2)
autoplot(myxts2, geom = "point")
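For the entry count in the question itself, one possible alternative is to stay with ts() and treat each 23-week season as one cycle. The sketch below uses placeholder values (1 to 690) in place of df$Kc_Kamble, which is an assumption for illustration. With frequency = 52 and end = c(2020, 23), ts() spans 29 full years of 52 periods plus 23, that is 29 * 52 + 23 = 1531 slots, and recycles the data to fill them; with frequency = 23 the 690 values fit exactly.

```r
# Placeholder for the 690 weekly observations (30 seasons x 23 weeks)
kc <- seq_len(30 * 23)

# Original call: 52 periods per year stretches the series to 1531 slots
wrong <- ts(kc, start = c(1991, 1), end = c(2020, 23), frequency = 52)

# One cycle = one 23-week season, so the series holds exactly the data supplied
right <- ts(kc, start = c(1991, 1), frequency = 23)

length(wrong)
length(right)
end(right)
```

This treats the seasons as if they followed one another without a gap, so the time index no longer corresponds to calendar weeks; whether that is acceptable depends on the analysis.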

Is there a way I can get the maximum value for each group after a double group_by in R?

I am trying to extract the team with the maximum number of wins each year in women's college basketball, and I am currently stuck with having the number of wins for each year for each team, and I want only the team with the maximum number of wins in each year.
winsbyyear <- WomenCBnewdf %>%
group_by(Year,Team)%>%
summarise(totalwinsyr = sum(Outcome))
The output currently looks like this, but I am expecting to see each year only once, with the team with the maximum number of wins in the subsequent columns:
Year Team totalwinsyr
<fct> <chr> <dbl>
1 2014 AbileneChristian 10
2 2014 AirForce 0
3 2014 Akron 18
4 2014 Alabama 10
5 2014 AlabamaAM 3
6 2014 AlabamaHuntsville 0
7 2014 AlabamaMobile 0
8 2014 AlabamaSt 15
9 2014 AlaskaAnchorage 1
10 2014 AlbanyNY 16
How to select the rows with maximum values in each group with dplyr?
I have already looked here but I could not find any resources to help with a group_by() with multiple values
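Sum the wins per year and team, then regroup by year only and keep the rows with the maximum. (Filtering while still grouped by both Year and Team returns every row, because each group then contains a single total.)
library(dplyr)
winsbyyear <- WomenCBnewdf %>%
group_by(Year, Team) %>%
summarise(totalwinsyr = sum(Outcome), .groups = "drop") %>%
group_by(Year) %>%
filter(totalwinsyr == max(totalwinsyr))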
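The same two-step idea (total per year and team, then the maximum within each year) can be sketched in base R. The data frame below is made up for illustration and only borrows the column names Year, Team and Outcome from the question.

```r
# Made-up game results: one row per game, Outcome = 1 for a win
games <- data.frame(
  Year    = c(2014, 2014, 2014, 2014, 2015, 2015, 2015),
  Team    = c("Akron", "Akron", "Alabama", "Alabama",
              "Akron", "Alabama", "Alabama"),
  Outcome = c(1, 1, 1, 0, 0, 1, 1)
)

# Step 1: total wins per year and team
wins <- aggregate(Outcome ~ Year + Team, data = games, FUN = sum)

# Step 2: keep the rows whose total equals the maximum within their year
top <- wins[wins$Outcome == ave(wins$Outcome, wins$Year, FUN = max), ]
top <- top[order(top$Year), ]
top
```

If two teams tie for the most wins in a year, both rows are kept, which matches the behaviour of filter(totalwinsyr == max(totalwinsyr)).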

How to find out how many trading days in each month in R?

I have a dataframe like this. The time span is 10 years. Because it's Chinese market data and China has lunar holidays, the holidays fall on different dates of the western calendar each year.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the fewest trading days and, most importantly, what that number is.
There are no repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925
If there are no repeated days, you can count days per month and year by:
library(data.table)
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]
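A base R sketch of the same count, for readers without data.table. The January dates are the six rows shown in the question; the two February dates are made up so that there is a second month to compare against.

```r
# Trading dates: first six from the question, last two are made up
df <- data.frame(date = as.Date(c(
  "1995-01-03", "1995-01-04", "1995-01-05",
  "1995-01-06", "1995-01-09", "1995-01-10",
  "1995-02-06", "1995-02-07"
)))

# One row per trading day, so counting rows per year-month counts trading days
counts <- table(format(df$date, "%Y-%m"))

# Month(s) with the fewest trading days, and the count itself
fewest <- counts[counts == min(counts)]
fewest
```

Because the rows are the days on which the market actually traded, no holiday calendar is needed for this count.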
The chron and bizdays packages deal with business days, but neither actually contains a usable calendar of holidays, which limits their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20

Aggregating based on previous year and this year

I have this data set:
month Year Rain
10 2010 376.8
11 2010 282.78
12 2010 324.58
1 2011 73.51
2 2011 225.89
3 2011 22.96
I used
df2prnext<-
aggregate(Rain~Year, data = subdataprnext, mean)
but I need the mean value of 217.53.
I am not getting the expected result. Thank you for your help.
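One possible reading of the question is that October to March should be averaged as a single season that straddles the year boundary, which aggregate(Rain ~ Year, ...) cannot do because it splits the rows at 1 January. The sketch below makes that assumption and labels each season by the year in which it ends; the season column is a hypothetical helper, not something from the question. For the six rows shown, the mean works out to about 217.75, close to but not the same as the 217.53 quoted, so the quoted figure may come from a typo or from different data.

```r
# The six rows from the question
rain <- data.frame(
  month = c(10, 11, 12, 1, 2, 3),
  Year  = c(2010, 2010, 2010, 2011, 2011, 2011),
  Rain  = c(376.8, 282.78, 324.58, 73.51, 225.89, 22.96)
)

# Assumption: October-December belong to the season ending in the next year
rain$season <- ifelse(rain$month >= 10, rain$Year + 1, rain$Year)

# Mean rainfall per season instead of per calendar year
season_mean <- aggregate(Rain ~ season, data = rain, FUN = mean)
season_mean
```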

R - Analysis of time series with semi-annual data?

I have a time series with semi-annual (half-yearly) data points.
It seems that the ts() function can't handle that as "frequency = 2" returns a very strange time series object that extends far beyond the actual time period.
Is there any way to do time series analysis of this kind of time series object in R?
EDIT: Here's an example:
dat <- seq(1, 17, by = 1)
> semi <- ts(dat, start = c(2008,12), frequency = 2)
> semi
Time Series:
Start = c(2013, 2)
End = c(2021, 2)
Frequency = 2
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
I was expecting:
> semi
s1 s2
2008 1
2009 2 3
2010 4 5
2011 6 7
2012 8 9
2013 10 11
2014 12 13
2015 14 15
2016 16 17
First, let me explain why the first ts element starts at 2013 instead of 2008. The start and end arguments are expressed in periods at the given frequency. With start = c(2008, 12) you selected the 12th period counting from the start of 2008, which is the second period of 2013 when the frequency is 2.
This should work for the period:
semi <- ts(dat, start = c(2008,2), frequency = 2)
semi now holds the correct time series; however, ts() has no built-in period names for a frequency of 2, so it is not printed in the year-by-half layout. If you plot the time series, the correct half-yearly graph will be shown.
plot.ts(semi)
In this problem someone explained about the standard frequencies, which ts() knows.
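A short check of the corrected call, as a sketch: start() and end() report the (year, period) pairs, and print() accepts calendar = TRUE to request a year-by-period layout even when the frequency is not 4 or 12.

```r
dat <- seq(1, 17, by = 1)

# Second half of 2008 as the first observation, two periods per year
semi <- ts(dat, start = c(2008, 2), frequency = 2)

start(semi)  # year and period of the first observation
end(semi)    # year and period of the last observation

# Ask for a table with one row per year and one column per period
print(semi, calendar = TRUE)
```

With 17 observations starting in the second half of 2008, the series ends in the second half of 2016, which matches the range in the expected output.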
