Reconstruct seasonally (and non-seasonally) differenced data in R

I've achieved stationary data for use in ARIMA forecasts using seasonal and non-seasonal differencing. Now how do I revert back to the original data using the differenced series?
library(forecast) # for nsdiffs()/ndiffs()
raw <- read.csv("https://raw.githubusercontent.com/thistleknot/Python-Stock/master/data/combined_set.csv",row.names=1,header=TRUE)
temp <- raw$CSUSHPINSA
season <- 4 # seasonal period implied by frequency=4 below
#tells me to seasonally difference 1 time
print(nsdiffs(ts(temp,frequency=4)))
temp_1 <- temp-dplyr::lag(temp,1*season)
#tells me I need to difference it once more
print(ndiffs(temp_1))
temp_2 <- temp_1-dplyr::lag(temp_1,1)
#shows data is somewhat stationary
plot(temp_2)
#gives me back the original dataset if I only had seasonal differencing
na.omit(dplyr::lag(raw$CSUSHPINSA,4)+temp_1)
#how to do this with temp_2?
Some references
Pandas reverse of diff()
Reverse Diff function in R

Never mind, I got it:
dplyr::lag(raw$CSUSHPINSA ,4) + dplyr::lag(temp_1,1)+temp_2
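As a quick sanity check (a sketch using the same objects as above), the reconstruction should reproduce the original series once the five leading NAs from the two lags are dropped:
reconstructed <- dplyr::lag(raw$CSUSHPINSA,4) + dplyr::lag(temp_1,1) + temp_2
#first 5 values are NA (4 seasonal lags + 1 regular lag)
all.equal(reconstructed[-(1:5)], raw$CSUSHPINSA[-(1:5)])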
More complete examples
temp <- raw$MSPUS
#print(nsdiffs(ts(temp,frequency=4)))
#temp_1 <- temp-dplyr::lag(temp,1*season)
temp_1 <- temp-dplyr::lag(temp,1)
print(ndiffs(temp_1))
temp_2 <- temp_1-dplyr::lag(temp_1,1)
#forecast values of temp_2
temp_3 <- dplyr::lag(temp_1,1)+temp_2
temp_4 <- dplyr::lag(raw$MSPUS,1) + temp_3
new_temp_2_values <- c(8000,10000)
extended <- c(temp_4,tail(c(c(temp_3),tail(temp_4,1)+cumsum(tail(temp_3,1)+cumsum(new_temp_2_values))),length(new_temp_2_values)))
print(extended)
Wrote a more involved version here
https://gist.github.com/thistleknot/eeaf1631f736e20806c37107f344d50e
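For reference, the same kind of reconstruction can be sketched with stats::diffinv(), which inverts diff(). This is only a sketch, not the poster's method: it assumes the differencing is done with diff() (equivalent to the dplyr::lag() subtractions above, minus the leading NAs), and new_d2 stands in for hypothetical forecast values on the twice-differenced scale.
temp <- raw$CSUSHPINSA
d1 <- diff(temp, lag = 4)  # seasonal difference
d2 <- diff(d1, lag = 1)    # regular difference
new_d2 <- c(8000, 10000)   # hypothetical forecasts of the twice-differenced series
d1_full <- diffinv(c(d2, new_d2), lag = 1, xi = d1[1])    # undo the regular difference
temp_full <- diffinv(d1_full, lag = 4, xi = temp[1:4])    # undo the seasonal difference
tail(temp_full, length(new_d2))  # forecasts back on the original scale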

Related

How can I make my script for correcting a logger's seasonal drift in R more efficient?

I have installed a bunch of CO2 loggers in water that log CO2 every hour for the open water season. I have characterized the loggers at 3 different concentrations of CO2 before and after installing them.
I assume that the seasonal drift in error will be linear
I assume that the error between my characterization points will be linear
My script is based on a for loop that goes through each timestamp and corrects the value. This works but is unfortunately not fast enough. I know that this can be done within a second, but I am not sure how, so I would be grateful if someone could show me.
Reproducible example based on base R:
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#generate dummy dataframe
dummy <- data.frame(dt,co2)
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
diff.pre <- measured.pre-actual#diff at precharacterization
diff.post <- measured.post-actual#diff at post
#linear interpolation of how deviance from actual values change throughout the season
#I assume that the temporal drift is linear
diff.0 <- seq(diff.pre[1],diff.post[1],length.out=length(dummy$dt))
diff.400 <- seq(diff.pre[2],diff.post[2],length.out = length(dummy$dt))
diff.1000 <- seq(diff.pre[3],diff.post[3],length.out = length(dummy$dt))
#creates a data frame with the assumed drift at each increment throughout the season
dummy <- data.frame(dummy,diff.0,diff.400,diff.1000)
#this loop makes a 3-point calibration at each day in the dummy data set
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
  print(paste0("row: ",i)) #to show the progress of the loop
  diff.0 <- dummy$diff.0[i] #get the differences at characterization increments
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  #values below are only used for encompassing the range of measured values in the characterization
  #this is based on the interpolated difference at the given time point and the known concentrations used
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  #linear difference between calibration at 0 and 400
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  #linear difference between calibration at 400 and 1000
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  #bind them together to get one vector
  correction.ppm <- c(seg1,seg2)
  #the complete range of measured co2 in the characterization.
  #in reality it can not be below 0 and thus it can not be below the minimum measured in the range
  measured.co2.range <- round(seq(measured.0,measured.1000,length.out=length(correction.ppm)))
  #generate a table from which we can characterize the measured values from
  correction.table <- data.frame(measured.co2.range,correction.ppm)
  co2 <- dummy$co2[i] #measured co2 at the current row
  #find the measured value in the table and extract the difference
  diff <- correction.table$correction.ppm[match(co2,correction.table$measured.co2.range)]
  #correct the value and save it to vector
  co2.corrected[i] <- co2-diff
}
#generate column with calibrated values
dummy$co2.corrected <- co2.corrected
This is what I understand after reviewing the code. You have a series of CO2 concentration readings, but they need to be corrected based on characterization measurements taken at the beginning of the timeseries and at the end of the timeseries. Both sets of characterization measurements were made using three known concentrations: 0, 400, and 1000.
Your code appears to be applying bilinear interpolation (over time and concentration) to compute the needed correction. This is easy to vectorize:
set.seed(1)
start <- as.POSIXct("2022-08-01 00:00:00")#time when logger is installed
stop <- as.POSIXct("2022-09-01 00:00:00")#time when retrieved
dt <- seq.POSIXt(start,stop,by=3600)#generate datetime column, measured hourly
#generate a bunch of values within my measured range
co2 <- round(rnorm(length(dt),mean=600,sd=100))
#actual values used in characterization
actual <- c(0,400,1000)
#measured in the container by the instruments being characterized
measured.pre <- c(105,520,1150)
measured.post <- c(115,585,1250)
# interpolate the reference concentrations over time
cref <- mapply(seq, measured.pre, measured.post, length.out = length(dt))
#generate dummy dataframe with corrected values
dummy <- data.frame(
  dt,
  co2,
  co2.corrected = ifelse(
    co2 < cref[,2],
    actual[1] + (co2 - cref[,1])*(actual[2] - actual[1])/(cref[,2] - cref[,1]),
    actual[2] + (co2 - cref[,2])*(actual[3] - actual[2])/(cref[,3] - cref[,2])
  )
)
head(dummy)
#> dt co2 co2.corrected
#> 1 2022-08-01 00:00:00 537 416.1905
#> 2 2022-08-01 01:00:00 618 493.2432
#> 3 2022-08-01 02:00:00 516 395.9776
#> 4 2022-08-01 03:00:00 760 628.2707
#> 5 2022-08-01 04:00:00 633 507.2542
#> 6 2022-08-01 05:00:00 518 397.6533
I do not know exactly what you are calculating (I suspect this could be done differently), but you can increase the speed by:
removing the print() call, which takes a lot of time inside the loop
removing the data.frame creation in each iteration, which is slow and not needed here
This loop should be faster:
for(i in 1:nrow(dummy)){
  diff.0 <- dummy$diff.0[i]
  diff.400 <- dummy$diff.400[i]
  diff.1000 <- dummy$diff.1000[i]
  measured.0 <- diff.0+0
  measured.400 <- diff.400+400
  measured.1000 <- diff.1000+1000
  seg1 <- seq(diff.0,diff.400,length.out=measured.400-measured.0)
  seg2 <- seq(diff.400,diff.1000,length.out=measured.1000-measured.400)
  correction.ppm <- c(seg1,seg2)
  s <- seq(measured.0,measured.1000,length.out=length(correction.ppm))
  measured.co2.range <- round(s)
  co2 <- dummy$co2[i]
  diff <- correction.ppm[match(co2, measured.co2.range)]
  co2.corrected[i] <- co2-diff
}
P.S. In my testing the slowest remaining part is now round(s); maybe that can be removed or rewritten. One possibility is sketched below.
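A sketch of that idea (not tested against real logger data): skip the rounded lookup table entirely and interpolate the correction directly with approx(), which performs the same piecewise-linear mapping between the three calibration points. The names measured and corrections are just illustrative helpers.
co2.corrected <- vector()
for(i in 1:nrow(dummy)){
  #measured values and corresponding corrections at the three calibration points for this timestamp
  measured <- c(dummy$diff.0[i], dummy$diff.400[i] + 400, dummy$diff.1000[i] + 1000)
  corrections <- c(dummy$diff.0[i], dummy$diff.400[i], dummy$diff.1000[i])
  co2 <- dummy$co2[i]
  #piecewise-linear interpolation of the correction at the measured value (rule = 2 clamps values outside the range)
  co2.corrected[i] <- co2 - approx(measured, corrections, xout = co2, rule = 2)$y
}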

R Function for Handling Survival Data in intervals

Hello, I am learning about survival analysis and I was curious whether I could use the survival package on survival data that comes as aggregated counts per interval.
Here is some code to generate data in this form:
start_interval <- seq(0, 13)
end_interval <- seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)
From my use of the survival package so far, it seems to expect one survival time per individual, but I might be misreading the documentation of the Surv function. If survival will not work, what other packages are out there for this type of data?
If there is no package or function to estimate the survival function easily, I can calculate it myself by hand (the proportion still alive at the start of each interval).
Since the survival package needs one observation per survival time, we need to do some transformations using the simulated data.
Simulated Data:
library(survival)
start_interval <- seq(0, 13)
end_interval <- seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)
Transform the data by duplicating each row by the number dead in that interval:
duptimes <- df$dead_in_interval
rid <- rep(1:nrow(df), duptimes)
df.t <- df[rid,]
Using the Surv Function
test <- Surv(time = df.t$start_interval,
             time2 = df.t$end_interval,
             event = rep(1, nrow(df.t)), #Every Observation is a death
             type = "interval")
Fitting the survival curve
summary(survfit(test ~ 1))
Comparing with the by-hand calculation from the original data:
df$living_at_start/max(df$living_at_start)
They match.
Questions
When using the survfit function, why is the number at risk 1001 at time 0 when there are only 1000 people in the data?
length(test)

Simple DLNM in R

What I am trying to do is find the relative risk of mortality at the 10th, 50th and 90th percentiles of diurnal temperature range and its additive effects at lags of 0, 1, 3 and 5 days. I'm doing this for a subset of months May-Sept (called subset here for mortality; temperature is already subsetted when read in). I have code that works below, but no matter what city and what lag I introduce, I get an RR of essentially 1.0, so I believe that something is off or I am missing an argument somewhere. If anyone has more experience with these problems than I do, your help would be greatly appreciated.
library('dlnm')
library('splines')
mortdata <- read.table('STLmort.txt', sep="\t", header=T)
morts <- subset(mortdata, Month %in% 5:9)
deaths <- morts$AllMort
tempdata <- read.csv('STLRanges.csv',sep=',',header=T)
temp <- tempdata$Trange
HI <- tempdata$HIrange
#basis.var <- onebasis(1:5, knots=3)
#mklagbasis(maxlag=5, type="poly", degree=3)
basis.temp <- crossbasis(temp,vardegree=3,lag=5)
summary(basis.temp)
model <- glm (deaths ~ basis.temp, family=quasipoisson())
pred.temp <- crosspred(basis.temp, model, at=quantile(temp,c(.10,.50,.90),na.rm=TRUE) , cumul=T)
plot(pred.temp, "slices", var=c(quantile(temp, c(.10, .50, .90),na.rm=TRUE)) ,lag=c(0,1,5))
The problem is that you did not include any time variables to control for long-term and seasonal trends in the time series when fitting the DLNM. A rough sketch of what that can look like is below.
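This is only a minimal sketch, not the poster's exact specification: it adds a natural cubic spline of a simple time index to the model (the df value is a placeholder; in practice it is chosen per year of data, and day of week is often added as well), then re-runs crosspred(). It reuses deaths, temp and basis.temp from the question, and ns() comes from the splines package already loaded above.
time <- seq_along(deaths) # simple time index for the May-Sept subset
model2 <- glm(deaths ~ basis.temp + ns(time, df = 8), family = quasipoisson())
pred.temp2 <- crosspred(basis.temp, model2, at = quantile(temp, c(.10, .50, .90), na.rm = TRUE), cumul = TRUE)
plot(pred.temp2, "slices", var = quantile(temp, c(.10, .50, .90), na.rm = TRUE), lag = c(0, 1, 5))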

auto.arima MAPE is coming as infinity

I have daily sales data for around 500 stores. I am trying to fit an ARIMA model using the auto.arima function in R, but every time I run the code I get a MAPE that is either as high as 4000+ or infinite. Please help me figure out where I am going wrong.
Below is a snippet of the code:
# SALES DATA
library(forecast) # for auto.arima() and accuracy()
# Import the Raw Data
raw <- read.delim('clipboard', header=T)
raw.copy <- raw
class(raw) # data.frame
names(raw) # imported properly
sapply(raw, class)
# Separate out the Dates column
dates <- as.Date(raw$Dates, format="%m/%d/%y")
raw$Dates <- NULL
# Put store sales in a list
storeSales <- lapply(raw, function(x) data.frame(Date=dates, Sales=ts(x, start=c(2012, 3), frequency=365)))
# Accessing data in the list
stats::plot.ts(storeSales[[1]]$Sales)
# -------------------------------------------------------------------
# AUTO ARIMA
# Write the function to retrieve autoArima predictions and MAPE values
wmArima <- function(df){
  modArima <- auto.arima(df$Sales)
  p <- predict(modArima, n.ahead=365)
  m <- accuracy(modArima)[,'MAPE']
  return(list(Predicted=p$pred, MAPE=m))
}
# Call the function and retrieve list of predictions + MAPE scores
ArimaResults <- lapply(storeSales[1:15], wmArima)
Have you tried visualizing the results or having a look at the true/predicted sales values?
There are two likely possibilities:
You have sales values that are zero or close to zero, which leads to an infinite or huge MAPE, since MAPE divides each error by the actual value
ARIMA is producing wild predictions because of your data or some other factor
It is better to eliminate the first possibility before coming to SO for help on the second (a quick check is sketched below), and a small reproducible example would be very helpful.
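A quick check for the first possibility, sketched against the storeSales list from the question; if zeros are present, a scale-free measure such as MASE (also returned by accuracy()) avoids the division by zero:
# count zero sales values in each store's series
sapply(storeSales[1:15], function(df) sum(df$Sales == 0, na.rm = TRUE))
# MASE is reported alongside MAPE and is not affected by zero actuals
accuracy(auto.arima(storeSales[[1]]$Sales))[, "MASE"]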

Dual seasonal cycles in ts object

I want to strip out seasonality from a ts. This particular ts is daily, and has both yearly and weekly seasonal cycles (frequency 365 and 7).
In order to remove both, I have tried running stl() on the ts with frequency set to 365, extracting the trend and remainder, setting the frequency of the resulting ts to 7, and repeating.
This doesn't seem to be working very well and I am wondering whether it's my approach, or something inherent to the ts which is causing me problems. Can anyone critique my methodology, and perhaps recommend an alternate approach?
There is a very easy way to do it using a TBATS model implemented in the forecast package. Here is an example assuming your data are stored as x:
library(forecast)
x2 <- msts(x, seasonal.periods=c(7,365))
fit <- tbats(x2)
x.sa <- seasadj(fit)
Details of the model are described in De Livera, Hyndman and Snyder (JASA, 2011).
An approach that can handle not only seasonal components (cyclically recurring events) but also trends (slow shifts in the norm) admirably is stl(), specifically as implemented by Rob J Hyndman.
The decomp function Hyndman gives there (reproduced below) is very helpful for checking for seasonality and then decomposing a time series into seasonal (if one exists), trend, and residual components.
decomp <- function(x, transform=TRUE)
{
  #decomposes time series into seasonal and trend components
  #from http://robjhyndman.com/researchtips/tscharacteristics/
  require(forecast)
  # Transform series
  if(transform & min(x, na.rm=TRUE) >= 0)
  {
    lambda <- BoxCox.lambda(na.contiguous(x))
    x <- BoxCox(x, lambda)
  }
  else
  {
    lambda <- NULL
    transform <- FALSE
  }
  # Seasonal data
  if(frequency(x) > 1)
  {
    x.stl <- stl(x, s.window="periodic", na.action=na.contiguous)
    trend <- x.stl$time.series[,2]
    season <- x.stl$time.series[,1]
    remainder <- x - trend - season
  }
  else #Nonseasonal data
  {
    require(mgcv)
    tt <- 1:length(x)
    trend <- rep(NA, length(x))
    trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
    season <- NULL
    remainder <- x - trend
  }
  return(list(x=x, trend=trend, season=season, remainder=remainder,
              transform=transform, lambda=lambda))
}
As you can see, it uses stl() (which uses loess) if there is seasonality and penalized regression splines if there is no seasonality.
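A hypothetical usage sketch for the dual-seasonality series in the question, reusing the x from the TBATS example above and assuming at least two full years of daily data (stl() needs two complete cycles); transform = FALSE keeps everything on the original scale, and x.sa2 is just an illustrative name:
d7 <- decomp(ts(x, frequency = 7), transform = FALSE)
deseas7 <- d7$x - d7$season # weekly seasonality removed
d365 <- decomp(ts(as.numeric(deseas7), frequency = 365), transform = FALSE)
x.sa2 <- d365$x - d365$season # both seasonal cycles removed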
Check if this is useful:
The start and end values depend on your data; change the frequency value accordingly.
splot <- ts(Data1, start=c(2010, 2), end=c(2013, 9), frequency=12)
Additive trend, seasonal, and irregular components can be decomposed using the stl() function:
fit <- stl(splot, s.window="period")
monthplot(splot)
library(forecast)
vi <-seasonplot(splot)
vi should give separate values for the seasonal indices.
Also check the one below:
splot.stl <- stl(splot,s.window="periodic",na.action=na.contiguous)
trend <- splot.stl$time.series[,2]
season <- splot.stl$time.series[,1]
remainder <- splot - trend - season
