tsibble -- how do you get around implicit gaps when there are none

I am new to the tsibble package. I have monthly data that I coerced to a tsibble in order to use the fable package. A few issues I am having:
1. It appears the index variable (from my testing) is not of class Date, even though I applied lubridate's ymd() function to it.
2. The has_gaps() function returns FALSE, but when I model on the data, I get the error that ".data contains implicit gaps in time".
library(dplyr)
library(fable)
library(lubridate)
library(tsibble)
test <- data.frame(
  YearMonth = c(20160101, 20160201, 20160301, 20160401, 20160501, 20160601,
                20160701, 20160801, 20160901, 20161001, 20161101, 20161201),
  Claims = c(13032647, 1668005, 24473616, 13640769, 17891432, 11596556,
             23176360, 7885872, 11948461, 16194792, 4971310, 18032363),
  Revenue = c(12603367, 18733242, 5862766, 3861877, 15407158, 24534258,
              15633646, 13720258, 24944078, 13375742, 4537475, 22988443)
)
test_ts <- test %>%
  mutate(YearMonth = ymd(YearMonth)) %>%
  as_tsibble(
    index = YearMonth,
    regular = FALSE # because it picks up gaps when I set it to TRUE
  )
# Are there any gaps?
has_gaps(test_ts, .full = T)
model_new <- test_ts %>%
  model(
    snaive = SNAIVE(Claims)
  )
Warning messages:
1: 1 error encountered for snaive
[1] .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.
Any help will be appreciated.

You have a daily index, but you want a monthly index. The simplest way is to use the tsibble::yearmonth() function, but you will need to convert the YearMonth column to character first.
library(dplyr)
library(tsibble)
test <- data.frame(
  YearMonth = c(20160101, 20160201, 20160301, 20160401, 20160501, 20160601,
                20160701, 20160801, 20160901, 20161001, 20161101, 20161201),
  Claims = c(13032647, 1668005, 24473616, 13640769, 17891432, 11596556,
             23176360, 7885872, 11948461, 16194792, 4971310, 18032363),
  Revenue = c(12603367, 18733242, 5862766, 3861877, 15407158, 24534258,
              15633646, 13720258, 24944078, 13375742, 4537475, 22988443)
)
test_ts <- test %>%
  mutate(YearMonth = yearmonth(as.character(YearMonth))) %>%
  as_tsibble(index = YearMonth)
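A quick check (a sketch assuming the packages loaded in the question, with the model call carried over from there) confirms that the index is now recognized as monthly and the model fits without the implicit-gaps error:
library(fable)
# The interval is now reported as monthly (1M) rather than daily/irregular
interval(test_ts)
# The seasonal naive model from the question now runs cleanly
test_ts %>%
  model(snaive = SNAIVE(Claims))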

Looks like as_tsibble() isn't able to recognize the interval properly in the YearMonth column because it is a Date class object. It's hidden in the 'Index' section of the help page that this might be the problem:
For a tbl_ts of regular interval, a choice of index representation has to be made. For example, a monthly data should correspond to time index created by yearmonth or zoo::yearmon, instead of Date or POSIXct.
As that excerpt suggests, you can get around the problem with yearmonth(). But that requires a little string manipulation first to get the column into a format that will parse properly.
test_ts <- test %>%
  mutate(
    YearMonth = gsub("(.{2})01$", "-\\1", YearMonth) %>%
      yearmonth()
  ) %>%
  as_tsibble(
    index = YearMonth
  )
Now the model should run error free! Not sure why the has_gaps() test is saying everything is okay in your example...
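As an aside, yearmonth() also accepts Date input, so another sketch (assuming lubridate is loaded, as in the question) is to keep the original ymd() call and wrap it:
test_ts <- test %>%
  mutate(YearMonth = yearmonth(ymd(YearMonth))) %>%
  as_tsibble(index = YearMonth)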

Related

Tidysynth error -- Please specify only one treated unit

I am trying to calculate a synthetic control using the tidysynth package. I'm fairly new to the package and the data, but here is my code:
#Import data
synth <- read.csv("https://raw.githubusercontent.com/FDobkin/coal_paper/main/synth_data.csv")
#Convert year to date
synth$year <- strptime(synth$year, format = "%Y")
synth_out <- synth %>%
  synthetic_control(
    outcome = saleprice,
    unit = fips,
    time = year,
    i_unit = 47145,
    i_time = 2009-10-19,
    generate_placebos = T
  ) %>%
  generate_predictor(
    time_window = 2000-10-19:2009-10-19,
    population = pop,
    white = white_p,
    age = age65p_p,
    rucc = rucc_code,
    income = median_income,
    unemploy = unemprate,
    laborforce = lfrate
  ) %>%
  generate_weights(
    optimization_window = 2000-10-19:2009-10-19, # time to use in the optimization task
    margin_ipop = .02, sigf_ipop = 7, bound_ipop = 6 # optimizer options
  ) %>%
  generate_control()
The error is:
Error in synth_method(treatment_unit_covariates = treatment_unit_covariates, :
Please specify only one treated unit.
The error seems to be coming from the generate_weights() statement. I am specifying the specific county that receives the treatment in the synthetic_control() statement. What is the error saying is wrong?
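One thing worth noting when reproducing this (an observation about how R parses the snippet, not a confirmed diagnosis of the tidysynth error): unquoted values such as 2009-10-19 are evaluated as arithmetic rather than as dates, so i_time and time_window may not contain what was intended:
2009-10-19              # evaluates to 1980
2000-10-19:2009-10-19   # evaluates to the sequence 1971:1980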

Advice on filtering by yearmonth

I'm solving the exercises from the book Forecasting: Principles and Practice (3rd edition).
For Chapter 7, exercise 1, I want to filter the January 2014 observations from tsibbledata::vic_elec and summarise the data by day. Here's the code:
jan14_vic_elec <- vic_elec %>%
  filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
This chunk gives an error on the filter() call:
Error: Problem with filter() input ..1.
i Input ..1 is yearmonth(Time) == yearmonth("2014 Jan").
x function 'Rcpp_precious_remove' not provided by package 'Rcpp'
Can somebody help?
Open a new R session and run the code below; it should work!
The main issue is that you have some package clashes, hence starting in a fresh session.
library(fpp3)
jan14_vic_elec <- vic_elec %>%
  filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
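If a fresh session alone does not clear the 'Rcpp_precious_remove' message, a common cause is an installed Rcpp that is older than the version your other packages were built against, so reinstalling it and restarting R (a suggestion beyond the original answer) is worth trying:
install.packages("Rcpp")
# then restart R and reload fpp3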

Cut time series into specific bins and label each chunk

I have some experimental data on CO2 values in a room over a few days, which are date- and time-stamped. I would like to break it up into a series of "experiments" based on an experiment list of when each experiment happened.
e.g.
Data
df <- data.frame(CO2.ppm. = runif(10),
                 Date.time. = as.POSIXct("2019-2-08 07:00:00") + runif(n = 10, min = 0, max = 3600))
List of experiments with start and stop times:
ExpertimentList <- data.frame(
  StartTime = c("2019-2-08 07:10:00", "2019-2-08 07:15:00", "2019-2-08 08:30:00"),
  StopTime = c("2019-2-08 07:12:00", "2019-2-08 07:16:00", "2019-2-08 08:15:00"),
  ExptID = c(1, 2, 3)
)
Note there is time when CO2 is measured but no experiment is happening. E.g. between 07:12:00 and 07:15:00.
I would like to split df$Date.time. by ExperimentList's StartTime and StopTime
So far I've converted everything to integers:
df$Date.time. <- as.integer(df$Date.time.)
ExperimentList$StartTime <- as.integer(ExperimentList$StartTime)
ExperimentList$StopTime <- as.integer(ExperimentList$StopTime)
Then I looked at cut():
breakz <- dplyr::arrange(paste(Experiment_List$StartTime, Experiment_List$StopTime) %>% as_tibble())
cut(df$Dev.Date.Time, breaks = unique(breakz$value))
But I can't filter out the data when no experiment was taking place. Any thoughts are much appreciated.
Expected output:
set.seed(143)
data.frame(CO2.ppm. = runif(10),
           Date.time. = sort(as.POSIXct("2019-2-08 07:00:00") + runif(n = 10, min = 0, max = 3600)),
           ExptID = c(NA, NA, NA, 1, NA, NA, NA, NA, NA, NA))
ANSWER:
I found that I would run out of memory with @Ronak's answer, so I chunked the data.frame into 10000-row segments:
df <- split(df, (as.numeric(rownames(df)) - 1) %/% 10000)
Then, based on @Ronak's answer, I popped the code into a function and used mclapply from the parallel package.
library(data.table)
library(fuzzyjoin)
library(parallel)
# Do a left join to attach the ExptID of the experiment (if any) each row belongs to
fuzzyJoinFunction <- function(a){
  a <- fuzzy_left_join(a, Experiment_List,
                       by = c('Dev.Date.Time' = 'StartTime', 'Dev.Date.Time' = 'StopTime'),
                       match_fun = c(`>=`, `<=`))
  a
}
df <- rbindlist(mclapply(X = df, FUN = fuzzyJoinFunction, mc.cores = 4))
We can use fuzzyjoin::fuzzy_inner_join to keep only the rows which are in range.
library(dplyr)
library(fuzzyjoin)
# All the datetime values should be of type POSIXct.
ExpertimentList %>%
  mutate(across(c(StartTime, StopTime), lubridate::ymd_hms)) -> ExpertimentList
fuzzy_inner_join(df, ExpertimentList,
                 by = c('Date.time.' = 'StartTime', 'Date.time.' = 'StopTime'),
                 match_fun = c(`>=`, `<=`))
To get all df values in the final output with NA for ExptID use fuzzy_left_join.
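For completeness, that left-join variant is the same call with only the join function swapped (a sketch reusing the objects defined above):
fuzzy_left_join(df, ExpertimentList,
                by = c('Date.time.' = 'StartTime', 'Date.time.' = 'StopTime'),
                match_fun = c(`>=`, `<=`))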

How to create a variable that uses the previous (lagged) instance of its own value?

There is a panel dataset. I have to compute a new variable:
cases_new = 3^(lag(cases, n=1L))
for the first date alone. Then, from the 2nd date onwards, it uses the lag (i.e. the previous instance) of its own value, i.e.:
cases_new = 3^lag(cases_new, n=1L)
I tried to implement using the following code:
df2 <- df1 %>%
  mutate(cases_new = if_else(date == "2020-01-01", 3^(lag(cases, n=1L)), 3^(lag(cases_new, n=1L))))
But it threw an error stating that object 'cases_new' not found. I even tried initializing cases_new prior to the above assignment by:
df1$cases_new <- NA
But this did not update cases_new correctly (I get all NAs). Can someone help me out with this particular recursive implementation?
You could solve your initialization problem like this:
df2 <- df1 %>%
  mutate(
    cases_new = 3^(lag(cases, n=1L)),
    cases_new = if_else(date == "2020-01-01", cases_new, 3^(lag(cases_new, n=1L)))
  )
BUT, I don't think that this will give you the values you want. You will have to iterate over the rows like this:
# Seed the first value, then fill the rest recursively from the value just computed
cases_new <- c(3^(df1$cases[1]), rep(0, nrow(df1) - 1))
for (i in 2:nrow(df1)) {
  cases_new[i] <- 3^(cases_new[i - 1])
}
df1$cases_new <- cases_new
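For what it's worth, this kind of recursion can also be written without an explicit loop using purrr::accumulate (a sketch under the same assumptions as the loop above, i.e. df1 holds a single series in date order and the first value seeds the recursion):
library(purrr)
df1$cases_new <- accumulate(
  seq_len(nrow(df1) - 1),           # one dummy step for each row after the first
  function(prev, step) 3^prev,      # each value is 3^ the previous cases_new
  .init = 3^(df1$cases[1])          # seed, as in the loop above
)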

Nested Loop not Working to gather data from NOAA

I'm using the R package rnoaa (along with its required companion packages) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a second.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset = 'wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
  for(bid in lid){
    bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
    bid.year.data <- data.frame(bid_range$data)
    write.csv(bid.year.data, file = 'cwind/bid_range.csv')
  }
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but it does not apply the loop variables to the file name; it just names the file bid_range.csv.
This error message shows that there is no data for the given station id in 1990. Because you were using a for loop, once it hits an error, it stops.
Here I introduce the use of the tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "Grid" containing all combination of id and year
The expand function from tidyr can create the combination of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
  select(id) %>%
  expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data.
Also make this function suitable for the map2 function
Because we will use map2 to loop through all the combination of id and year using the map2 function by its .x and .y argument. We modified the sequence of argument to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now when it meets error, it will store the error message and moves to the next one rather than breaks.
# Modify the buoy function so that buoyid and year come first
buoy_modify <- function(buoyid, year, dataset, ...){
  buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign names to the elements of the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data are stored in wlevel_data. Each element of wlevel_data has two parts: $result contains the data if the download was successful, otherwise it is NULL; $error is NULL if the download was successful, otherwise it contains the error message.
Step 4: Access the data
transpose can turn a list "inside out", so wlevel_data2 now has two elements, result and error, and we can pull these out to access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
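As a small follow-up (not part of the original answer), the error list can also be used to see which id/year combinations failed, since the elements were named in Step 3:
# Names of the combinations whose download produced an error
failed <- names(wlevel_error)[!map_lgl(wlevel_error, is.null)]
failed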
