I have to pull different data sets from the same API regularly but for different reasons, so I have to write out the code for many different pulls. I'd like to create some functions to help with this, but I need some help.
I haven't been able to figure out how to set up the function so that I can change the data set but still pull from the same column each time. In this example, I have 3 columns with timestamps that mean different things (made up in this data). I need to change the timezone here to my local time zone. The column name will remain the same in all of my datasets, but the name of the dataset will change. I have a few places in my code where I need to do this, and I haven't been able to figure it out, so any suggestions would be much appreciated!
The second section of this example code is not included in the actual code, but it is there to set the data up correctly. The data comes out of the API in the format shown as GMT.
df <- data.frame(col_1 = c(1, 2, 3, 4),
time_1 = c("2021-01-20 23:58:21", "2021-01-20 21:21:00", "2021-01-20 17:14:04", "2021-01-20 01:05:18"),
time_2 = c("2021-01-19 23:58:21", "2021-01-19 21:21:00", "2021-01-19 17:14:04", "2021-01-19 01:05:18"),
time_3 = c("2021-01-18 23:46:21", "2021-01-18 16:21:00", "2021-01-18 15:14:04", "2021-01-18 01:05:18"),
time_4 = c("2021-01-17 23:58:21", "2021-01-17 20:21:00", "2021-01-17 18:14:04", "2021-01-17 02:05:18"))
# Not part of actual code
df$time_1 <- as.POSIXlt(df$time_1, tz = "GMT")
df$time_2 <- as.POSIXlt(df$time_2, tz = "GMT")
df$time_3 <- as.POSIXlt(df$time_3, tz = "GMT")
df$time_4 <- as.POSIXlt(df$time_4, tz = "GMT")
# What I want it to do
# df$time_1 <- lubridate::with_tz(df$time_1, tz = "America/Los_Angeles")
# df$time_2 <- lubridate::with_tz(df$time_2, tz = "America/Los_Angeles")
# df$time_3 <- lubridate::with_tz(df$time_3, tz = "America/Los_Angeles")
# df$time_4 <- lubridate::with_tz(df$time_4, tz = "America/Los_Angeles")
# Attempted function
timezone_cleanup <- function(my_df){
my_df$time_1 <- lubridate::with_tz(my_df$time_1, tz = "America/Los_Angeles")
my_df$time_2 <- lubridate::with_tz(my_df$time_2, tz = "America/Los_Angeles")
my_df$time_3 <- lubridate::with_tz(my_df$time_3, tz = "America/Los_Angeles")
my_df$time_4 <- lubridate::with_tz(my_df$time_4, tz = "America/Los_Angeles")
}
# how I'd like to use this function. Not working now. Even if I wrap it with data.frame(), it's not what I wanted.
new_df <- timezone_cleanup(df)
I think you need to return my_df in your function to get the changed dataframe back. However, you can use lapply or across to apply the same function to multiple columns.
library(dplyr)
timezone_cleanup <- function(my_df){
my_df %>%
mutate(across(starts_with('time'),
lubridate::with_tz, tz = "America/Los_Angeles"))
}
new_df <- timezone_cleanup(df)
By the way, I do receive a warning message while using this: Unrecognized time zone 'America/Los_Angeles'. Are you sure you are using the correct tz value?
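The lapply route mentioned in the answer could look roughly like this in base R. This is only a sketch, not the answerer's code: the helper name timezone_cleanup_base, its tz argument, and the two-row toy data frame are made up for illustration, and it assumes the timestamp columns all start with "time" and arrive as GMT. Setting the "tzone" attribute of a POSIXct column changes how the instant is displayed, not the instant itself, which is the same effect lubridate::with_tz() has.

```r
# Hypothetical helper: convert every column whose name starts with "time".
timezone_cleanup_base <- function(my_df, tz = "America/Los_Angeles") {
  time_cols <- grep("^time", names(my_df), value = TRUE)
  my_df[time_cols] <- lapply(my_df[time_cols], function(col) {
    col <- as.POSIXct(col, tz = "GMT")  # assumption: the API delivers GMT
    attr(col, "tzone") <- tz            # same instant, shown in another zone
    col
  })
  my_df  # return the modified data frame explicitly
}

# Toy data standing in for one API pull (values are made up).
toy <- data.frame(col_1 = 1:2,
                  time_1 = c("2021-01-20 23:58:21", "2021-01-20 01:05:18"),
                  stringsAsFactors = FALSE)
new_toy <- timezone_cleanup_base(toy)
str(new_toy)
```

Because the function ends with my_df, the caller gets the whole data frame back, which is the piece missing from the attempted function in the question.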
I have been using the tbats and nnetar functions from the forecast package to produce an hourly electric load forecast with a forecasting horizon of a week and a month, and both models perform satisfactorily. My data set comprises hourly values from January 2017 up to early May 2022 (46848 values). However, when I try to make an hourly load forecast up to the end of the year (07/05/2022-31/12/2022, 5736 hourly values), the results are either flat or lose seasonality. Does anyone have any idea why the long-term forecast gives such poor results? Any ideas on either model would be highly appreciated. I apologise for the very large data set.
I have uploaded the data set to GitHub:
df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/LOAD/LOAD_2017_2022.csv", sep=";")
#fix datetime
df$TIME<- with(df, sprintf("%02d:00", TIME-1))
df$DATE<-as.Date(df$DATE, "%d/%m/%Y")
df$TIME <- paste(df$TIME, ':00', sep = '')
View(df)
library(ggpubr)
library(chron)
df$TIME <- chron(times=df$TIME)
DATETIME<-as.POSIXct(paste(df$DATE, df$TIME), origin = "1970-01-01 00:00:00", tz="UTC", usetz=TRUE)
my_df <- data.frame(timestamp = as.POSIXct(DATETIME, format = "%d.%m.%Y %H:%M", origin = "1970-01-01 00:00:00", tz = "UTC"), input = df[,3])
my_df <- setNames(my_df, c("DATETIME","LOAD"))
Particularly the TBATS model results lose seasonality and seem strange. The code I used is the following:
library(ggplot2)
library(forecast)
library(tseries)
library(dplyr)
Load = ts(my_df[, c('LOAD')])
my_df$Clean_Load = tsclean(Load)
Clean_Load = ts(my_df[, c('Clean_Load')])
load_ts = ts(Clean_Load)
msts <- msts(load_ts, seasonal.periods=c(24,168,8760), start=c(2017,01))
plot(msts, main="Load", xlab="Year", ylab="MWh")
s <- tbats(msts)
sp<- predict(s,h=5736)
The results are also flat when I run the nnetar function, with or without temperature as an external regressor. I have tried different lambdas, but none seems to work:
#create dataframe for temperature historical values
Temperature_history <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/LOAD/Temperature_history.csv", sep=";")
DATETIME<-as.POSIXct(Temperature_history$Datetime, format = "%d/%m/%Y %H:%M", tz="UCT", usetz=TRUE)
Temperature_df <- data.frame(timestamp = as.POSIXct(DATETIME, format = "%d/%m/%Y %H:%M", tz = "UCT"), input = Temperature_history$Temperature)
Temperature_df<- setNames(Temperature_df, c("DATETIME","TEMPERATURE"))
#create dataframe for temperature forecasted values
Temperature_forecast <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/LOAD/Temperature_forecast.csv", sep=";")
DATETIME2<-as.POSIXct(Temperature_forecast$datehour, format = "%d/%m/%Y %H:%M", tz="UCT", usetz=TRUE)
Temp_forecast <- data.frame(timestamp = as.POSIXct(DATETIME2, format = "%d/%m/%Y %H:%M", tz = "UCT"), input = Temperature_forecast$TEMP_FORECAST)
View(Temp_forecast)
Temp_forecast <- setNames(Temp_forecast, c("DATETIME","TEMPERATURE"))
View(Temp_forecast)
#define and run NN model
library(forecast)
myts = ts(my_df$LOAD, frequency = 24)
fit2 = nnetar(myts,xreg = Temperature_df$TEMPERATURE, lambda = 0.5, P=1, MaxNWts=1177)
nnetforecast <- forecast(fit2, xreg = Temp_forecast$TEMPERATURE, h = 5736, PI = F, npaths=100, bootstrap = TRUE)
autoplot(nnetforecast, h = 5736)
First, your code won't work because the GitHub link does not point to the CSV file. Replace the first line as follows:
df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/LOAD/LOAD_2017_2022.csv", sep=";")
Then running your code, I get reasonable results for the tbats model for the first few weeks:
sp <- forecast(s,h=14*24)
autoplot(sp, include=14*24)
Using a time series model to forecast much further ahead makes little sense here.
In any case, there are well-developed models for electricity demand that will do better than either TBATS or NNETAR. For a simple starting point, try Tao Hong's vanilla model, described in Section 2.2 of https://doi.org/10.1016/j.ijforecast.2015.09.006. It's just a linear regression, but it will do better than any of these models you are trying.
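To give a feel for what such a regression looks like, here is a rough sketch in R. It follows my reading of the vanilla benchmark (a trend, calendar factors, and a cubic in temperature interacted with month and hour); check the paper for the exact specification. Everything else is an assumption: the data are synthetic, and the names toy, train, test, and vanilla are made up for this example.

```r
# Synthetic hourly data for one year; load and temperature are made up.
set.seed(42)
n <- 24 * 365
time <- seq(as.POSIXct("2021-01-01 00:00:00", tz = "UTC"),
            by = "hour", length.out = n)
hour_num <- as.integer(format(time, "%H"))
day_num  <- as.integer(format(time, "%j"))

toy <- data.frame(
  trend = seq_len(n),
  month = factor(format(time, "%m")),
  wday  = factor(format(time, "%u")),  # 1 = Monday ... 7 = Sunday
  hour  = factor(format(time, "%H")),
  temp  = 15 + 10 * sin(2 * pi * (day_num - 100) / 365) +
    4 * sin(2 * pi * (hour_num - 9) / 24) + rnorm(n)
)
toy$load <- 5000 + 0.01 * toy$trend + 3 * (toy$temp - 18)^2 +
  400 * sin(2 * pi * (hour_num - 6) / 24) + rnorm(n, sd = 50)

# Hold out the last week to mimic a forecast period.
train <- toy[seq_len(n - 168), ]
test  <- toy[(n - 167):n, ]

# Calendar effects plus a cubic in temperature interacted with month and hour.
vanilla <- lm(
  load ~ trend + month + wday * hour +
    (temp + I(temp^2) + I(temp^3)) * (month + hour),
  data = train
)
pred <- predict(vanilla, newdata = test)
mean(abs(pred - test$load))  # mean absolute error on the held-out week
```

With real data, the temperature column for the forecast period would have to come from a temperature forecast, just like the xreg argument in the nnetar attempt.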
Sadly, this answer here does not seem to work for me.
From what I saw in the documentation, the major.format parameter has been removed in the latest version, 0.10-1, whereas previous versions such as 0.9-7 still had it, and it would easily have solved my problem.
It seems like too major a feature to drop. Is there a new way to do this? It seems simple, but I've been digging into this issue for hours without success.
In case the issue lies in my code, here is a snippet of what I'm using.
merra2 = read.table("C:/merra2.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
merra2$utc = as.POSIXct(merra2$utc, format = "%Y-%m-%d %H:%M:%S", tz="UTC")
merra2$m2_power = as.xts(x=merra2[,"m2_power"],order.by=merra2[,"utc"])
merra2$doy = as.xts(x=merra2[,"doy"],order.by=merra2[,"utc"])
plot.xts(merra2$m2_power, col="blue", lwd = 2, major.ticks="weeks", subset="2012-04-01/2014-04-01")
plot.xts(merra2$m2_power, col="blue", lwd = 2, major.ticks="months", subset="2012-04-01/2014-04-01")
And the input file contains something like:
utc,m2_power,doy
"1980-01-01 00:00:00",643.000,181.5000
"1980-01-01 01:00:00",643.000,181.4583
"1980-01-01 02:00:00",354.000,181.4167
If I add the major.format parameter, nothing changes; the axis stays the same.
Here is a reproducible example:
# Generate a sequence of Dates
StartDate<-"2017-07-01"
EndDate<- "2018-07-05"
dates<-seq(as.POSIXct(StartDate, format="%Y-%m-%d", tz="UTC")
, as.POSIXct(EndDate, format="%Y-%m-%d", tz="UTC")
, by='mins')
# Generate a sequence of x
x <- seq(1, length(dates))
# Create a dataframe, renaming columns
df <- as.data.frame(cbind(as.character(dates,format="%Y-%m-%d", tz="UTC"),x))
colnames(df) <- c("Dates","x")
# Redefine format
df$Dates <- as.POSIXct(df$Dates,format="%Y-%m-%d", tz="UTC")
df$x2 <- as.xts(x= as.numeric(df$x),order.by=df$Dates )
# Plot results
plot.xts(df$x2
, col="blue"
, lwd = 2
, major.ticks="weeks"
, major.format = TRUE
, subset="2017-08-01/2017-08-30")
If you change "major.ticks", the axis changes... Have you taken a look at the "utc" variable? What is the complete time interval?
Link to the data set, which has date and time columns along with electricity usage columns:
https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip
power1 <- read.csv(file = "c:/datasets/household_power_consumption.txt", stringsAsFactors=F, header = TRUE,
sep=";", dec = ".", na.strings="?", col.names = c("date1","time1","Global_active_power", "Global_reactive_power",
"Voltage","Global_intensity","Sub_metering_1","Sub_metering_2",
"Sub_metering_3"))
power1$date1 <- as.Date(power1$date1, format="%d/%m/%Y")
power2 <- subset(power1, subset=(date1 >= "2007-02-01" & date1 <= "2007-02-02"))
datetime1 <- paste(as.Date(power2$date1), power2$time1)
power2$Datetime <- as.POSIXct(datetime1)
plot(power2$Global_active_power~power2$Datetime, type="l", ylab="Global Active Power (kilowatts)", xlab="")
When I run the above, I get the graph I'm supposed to, with the days of the week on the x-axis, even though I don't see anything about days of the week in the data when I run summary(), head(), or str().
I tried to add my own day column with mutate, but it didn't work.
It also didn't work when I subset it as follows. The subset was correct and contained only the data I needed, but it wouldn't plot with the date1 column or with the day-of-week column I created via mutate:
power2 <- subset(power1, subset=(as.Date(date1, format = "%d/%m/%Y") >= "2007-02-01"
& as.Date(date1, format = "%d/%m/%Y") <= "2007-02-02"))
I know that as.POSIXct will have all the metadata in there, but I don't understand why the graph is labelled by day of the week, without my asking, only when I combine the date and time columns into their own column.
When I run it like this, the combined date and time column is corrupted with the wrong year:
power11 <- read.csv(file = "c:/datasets/household_power_consumption.txt", stringsAsFactors=F, header = TRUE,
sep=";", dec = ".", col.names = c("date1","time1","Global_active_power", "Global_reactive_power",
"Voltage","Global_intensity","Sub_metering_1","Sub_metering_2",
"Sub_metering_3"))
#colClasses = c("Date", "character", "factor", "numeric","numeric","numeric","numeric","numeric","numeric"))
power22 <- subset(power11, subset=(as.Date(date1, format = "%d/%m/%Y") >= "2007-02-01"
& as.Date(date1, format = "%d/%m/%Y") <= "2007-02-02"))
datetime1 <- paste(as.Date(power22$date1), power22$time1)
power22$Datetime <- as.POSIXct(datetime1)
Maybe this link would be helpful:
http://earlh.com/blog/2009/07/07/plotting-with-custom-x-axis-labels-in-r-part-5-in-a-series/
Add an argument to your plot() call: xaxt='n'
plot(power2$Global_active_power~power2$Datetime, type="l", ylab="Global Active Power (kilowatts)", xlab="", xaxt='n')
That tells plot not to add x-axis labels. Then add an axis() call on the same data you plotted:
axis(side=1, at=power2$Datetime, labels=format(power2$Datetime, '%b-%y'))
I used '%b-%y' here, because that's what I saw on the site I referenced, but you would want to use the format code appropriate to your needs.
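As far as I understand, the weekday labels in the question appear because the x values are POSIXct and span only a couple of days, so plot() picks a day-of-week format for the axis by itself. The sketch below shows how to control that explicitly with one tick per midnight. The data are synthetic, and the names datetime, power, and ticks are made up for the example; "%a" prints the abbreviated weekday name in the current locale.

```r
# Two days of minute-level timestamps, like the subset in the question.
datetime <- seq(as.POSIXct("2007-02-01 00:00:00", tz = "UTC"),
                as.POSIXct("2007-02-02 23:59:00", tz = "UTC"),
                by = "min")
power <- 1 + sin(seq_along(datetime) / 200)^2  # made-up values

# One tick at each midnight, including the end of the second day.
ticks <- seq(as.POSIXct("2007-02-01", tz = "UTC"), by = "day", length.out = 3)

plot(datetime, power, type = "l", xaxt = "n", xlab = "",
     ylab = "Global Active Power (kilowatts)")
axis.POSIXct(side = 1, at = ticks, format = "%a")  # swap in "%d %b" for dates
```

Placing ticks at a handful of chosen instants, rather than at every observation, keeps the labels from piling on top of each other.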
I have an instrument that exports data in an unruly time format. I need to combine the date and time vectors into a new datetime vector in the following POSIXct format: %Y-%m-%d %H:%M:%S. Out of curiosity, I attempted to do this in three different ways, using as.POSIXct(), strftime(), and strptime(). When using my example data below, only the as.POSIXct() and strftime() functions work, but I am curious as to why strptime() is producing NAs? Also, I cannot convert the strftime() output into a POSIXct object using as.POSIXct()...
When trying these same functions on my real data (of which I've only provided you with the first four rows), I am running into an entirely different problem. Only the strftime() function is working. For some reason the as.POSIXct() function is also producing NAs, and it is the only command I actually need for converting my datetime into a POSIXct object...
It seems like there are subtle differences between these functions, and I want to know how to use them more effectively. Thanks!
Reproducible Example:
## Creating dataframe:
date <- c("2017-04-14", "2017-04-14","2017-04-14","2017-04-14")
time <- c("14:24:24.992000","14:24:25.491000","14:24:26.005000","14:24:26.511000")
value <- c("4.106e-06","4.106e-06","4.106e-06","4.106e-06")
data <- data.frame(date, time)
data <- data.frame(data, value) ## I'm sure there is a better way to combine three vectors...
head(data)
## Creating 3 different datetime vectors:
## This works in my example code, but not with my real data...
data$datetime1 <- as.POSIXct(paste(data$date, data$time), format = "%Y-%m-%d %H:%M:%S",tz="UTC")
class(data$datetime1)
## This is producing NAs, and I'm not sure why:
data$datetime2 <- strptime(paste(data$date, data$time), format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
class(data$datetime2)
## This is working just fine
data$datetime3 <- strftime(paste(data$date, data$time), format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
class(data$datetime3)
head(data)
## Since I cannot get the as.POSIXct() function to work with my real data, I tried this workaround. Unfortunately I am running into trouble...
data$datetime4 <- as.POSIXct(data$datetime3, format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
Link to real data:
here
Example using real_data.txt:
## Reading in the file:
fpath <- "~/real_data.txt"
x <- read.csv(fpath, skip = 1, header = FALSE, sep = "", stringsAsFactors = FALSE)
names(x) <- c("date","time","bscat","scat_coef","pressure_mbar","temp_K","CH1","CH2") ## This is data from a Radiance Research Integrating Nephelometer Model M903 for anyone who is interested!
## If anyone could get this to work that would be awesome!
x$datetime1 <- as.POSIXct(paste(x$date, x$time), format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
## This still doesn't work...
x$datetime2 <- strptime(paste(x$date, x$time), format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
## This works:
x$datetime3 <- strftime(paste(x$date, x$time), format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
## But I cannot convert from strftime character to POSIXct object, so it doesn't help me at all...
x$datetime4 <- as.POSIXct(x$datetime3, format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
head(x)
Solution:
I was not providing the as.POSIXct() function with the correct format string. Once I changed %Y-%m-%d %H:%M%:%S to %Y-%m-%d %H:%M:%S, the data$datetime2, data$datetime4, x$datetime1 and x$datetime2 were working properly! Big thanks to PhilC for debugging!
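A minimal check of that fix, using a single timestamp from the example data (the variable names stamp, bad, and good are made up). strptime() parses text into a date-time, so a format string it cannot match gives NA, whereas strftime() goes the other way and formats a date-time as text.

```r
stamp <- "2017-04-14 14:24:24.992000"

# The stray "%" before the second ":" makes the format unparseable.
bad  <- strptime(stamp, format = "%Y-%m-%d %H:%M%:%S", tz = "UTC")
good <- strptime(stamp, format = "%Y-%m-%d %H:%M:%S",  tz = "UTC")

is.na(bad)               # the malformed format yields NA
is.na(good)              # the corrected format parses
class(good)              # strptime() returns POSIXlt
class(as.POSIXct(good))  # convert when a POSIXct column is needed
```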
For your real data issue, replace the %M%: with %M: in the format string:
## Reading in the file:
fpath <- "c:/r/data/real_data.txt"
x <- read.csv(fpath, skip = 1, header = FALSE, sep = "", stringsAsFactors = FALSE)
names(x) <- c("date","time","bscat","scat_coef","pressure_mbar","temp_K","CH1","CH2") ## This is data from a Radiance Research Integrating Nephelometer Model M903 for anyone who is interested!
## issue was the stray % after %M - fixed
x$datetime1 <- as.POSIXct(paste(x$date, x$time), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
## Here too - fixed
x$datetime2 <- strptime(paste(x$date, x$time), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
head(x)
There was a format string error causing the NAs; try this:
## This is no longer producing NAs:
data$datetime2 <- strptime(paste(data$date, data$time), format = "%Y-%m-%d %H:%M:%S",tz="UTC")
class(data$datetime2)
The format "%Y-%m-%d %H:%M:%OS" handles seconds that have a fractional part. To display fractional seconds to a specific number of decimals, set the digits.secs option, e.g.:
options(digits.secs=6) # Display seconds with up to 6 decimal places
data$datetime1 <- lubridate::parse_date_time(paste(data$date, data$time), "%Y-%m-%d %H:%M:%OS")
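The same idea works in base R without lubridate. The sketch below uses two timestamps from the example data to compare %S, which keeps only whole seconds, with %OS, which keeps the fraction; the variable names are made up for illustration.

```r
stamps <- c("2017-04-14 14:24:24.992000", "2017-04-14 14:24:25.491000")

# %S reads whole seconds and ignores the trailing fraction.
whole <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S",  tz = "UTC")
# %OS reads the seconds together with their fractional part.
fract <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

gap_whole <- as.numeric(difftime(whole[2], whole[1], units = "secs"))
gap_fract <- as.numeric(difftime(fract[2], fract[1], units = "secs"))
gap_whole  # 1 second apart once the fractions are dropped
gap_fract  # about half a second apart with the fractions kept

# Printing still hides the fraction unless asked, e.g. with %OS3 or digits.secs.
format(fract, "%Y-%m-%d %H:%M:%OS3")
```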
I would like to download daily data from yahoo for the S&P 500, the DJIA, and 30-year T-Bonds, map the data to the proper time zone, and merge them with my own data. I have several questions.
My first problem is getting the tickers right. From yahoo's website, it looks like the tickers are: ^GSPC, ^DJI, and ^TYX. However, ^DJI fails. Any idea why?
My second problem is that I would like to constrain the time zone to GMT (I would like to ensure that all my data is on the same clock, and GMT seems like a neutral choice), but I couldn't get it to work.
My third problem is that I would like to merge the yahoo data with my own data, obtained by other means and available in a different format. It is also daily data.
Here is my attempt at constraining the data to the GMT time zone. Executed at the top of my R script.
Sys.setenv(TZ = "GMT")
# > Sys.getenv("TZ")
# [1] "GMT"
# the TZ variable is properly set
# but does not affect the time zone in zoo objects, why?
Here is my code to get the yahoo data:
library("tseries")
library("xts")
date.start <- "1999-12-31"
date.end <- "2013-01-01"
# tickers <- c("GSPC","TYX","DJI")
# DJI Fails, why?
# http://finance.yahoo.com/q?s=%5EDJI
tickers <- c("GSPC","TYX") # proceed without DJI
z <- zoo()
index(z) <- as.Date(format(time(z)),tz="")
for ( i in 1:length(tickers) )
{
cat("Downloading ", i, " out of ", length(tickers) , "\n")
x <- try(get.hist.quote(
instrument = paste0("^",tickers[i])
, start = date.start
, end = date.end
, quote = "AdjClose"
, provider = "yahoo"
, origin = "1970-01-01"
, compression = "d"
, retclass = "zoo"
, quiet = FALSE )
, silent = FALSE )
print(x[1:4]) # check that it's not empty
colnames(x) <- tickers[i]
z <- try( merge(z,x), silent = TRUE )
}
Here is the dput(head(df)) of my dataset:
df <- structure(list(A = c(-0.011489000171423, -0.00020300000323914,
0.0430639982223511, 0.0201549995690584, 0.0372899994254112, -0.0183669999241829
), B = c(0.00110999995376915, -0.000153000000864267, 0.0497750006616116,
0.0337960012257099, 0.014121999964118, 0.0127800004556775), date = c(9861,
9862, 9863, 9866, 9867, 9868)), .Names = c("A", "B", "date"
), row.names = c("0001-01-01", "0002-01-01", "0003-01-01", "0004-01-01",
"0005-01-01", "0006-01-01"), class = "data.frame")
I'd like to merge the data in df with the data in z. I can't seem to get it to work.
I am new to R and very much open to your advice about efficiency, best practice, etc.. Thanks.
EDIT: SOLUTIONS
On the first problem: following GSee's suggestions, the Dow Jones Industrial Average data may be downloaded with the quantmod package: thus, instead of the "^DJI" ticker, which is no longer available from yahoo, use the "DJIA" ticker. Note that there is no caret in the "DJIA" ticker.
On the second problem, Joshua Ulrich points out in the comments that "Dates don't have timezones because days don't have a time component."
On the third problem: The data frame appears to have corrupted dates, as pointed out by agstudy in the comments.
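To illustrate the second point with a small base R sketch (the variable names are made up): a time zone only matters at the moment a date-time is turned into a Date, because the same instant can fall on different calendar days in different zones. Once the value is a Date, no zone is stored with it.

```r
# One instant, early in the morning UTC on New Year's Day.
instant <- as.POSIXct("2013-01-01 02:30:00", tz = "UTC")

day_utc <- as.Date(instant, tz = "UTC")                  # 2013-01-01
day_la  <- as.Date(instant, tz = "America/Los_Angeles")  # 2012-12-31

# The resulting Date objects carry no time zone attribute.
attributes(day_utc)
```

So for daily series, the thing to keep consistent is which calendar day each observation is assigned to, not a time zone setting on the index.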
My solutions rely on the quantmod package and the attached zoo/xts packages:
library(quantmod)
Here is the code I have used to get proper dates from my csv file:
toDate <- function(x){ as.Date(as.character(x), format = "%Y%m%d") }
dtz <- read.zoo("myData.csv"
, header = TRUE
, sep = ","
, FUN = toDate
)
dtx <- as.xts(dtz)
The dates in the csv file were stored in a single column in the format "19861231". The key to getting correct dates was to wrap the date in "as.character()". Part of this code was inspired from R - Stock market data from csv to xts. I also found the zoo/xts manuals helpful.
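One plausible reason the as.character() wrapper matters, sketched in base R: if the date column is read as a number, as.Date() treats it as a count of days since an origin rather than as digits to be parsed. The variable names below are made up, and the origin is written out only to make the numeric interpretation explicit.

```r
raw_date <- 19861231  # what a column holding "19861231" looks like when read as numeric

# Numeric input is interpreted as days since the origin: far in the future.
wrong <- as.Date(raw_date, origin = "1970-01-01")

# Character input is parsed against the format string.
right <- as.Date(as.character(raw_date), format = "%Y%m%d")

as.numeric(wrong)  # still 19861231 days after the origin
right              # 1986-12-31
```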
I then extract the date range from this dataset:
date.start <- start(dtx)
date.end <- end(dtx)
I will use those dates with quantmod's getSymbols function so that the other data I download will cover the same period.
Here is the code I have used to get all three tickers.
tickers <- c("^GSPC","^TYX","DJIA")
data <- new.env() # the data environment will store the data
do.call(cbind, lapply( tickers
, getSymbols
, from = date.start
, to = date.end
, env = data # data saved inside an environment
)
)
ls(data) # see what's inside the data environment
data$GSPC # access a particular ticker
Also note, as GSee pointed out in the comments, that the option auto.assign=FALSE cannot be used in conjunction with the option env=data (otherwise the download fails).
A big thank you for your help.
Yahoo doesn't provide historical data for ^DJI. Currently, it looks like you can get the same data by using the ticker "DJIA", but your mileage may vary.
It does work in this case because you're only dealing with Dates
The df object you provided is yearly data beginning in the year 0001, so that's probably not what you wanted.
Here's how I would fetch and merge those series (or use an environment and only make one call to getSymbols)
library(quantmod)
do.call(cbind, lapply(c("^GSPC", "^TYX"), getSymbols, auto.assign=FALSE))