What I am trying to do is find the relative risk of mortality at the 10th, 50th, and 90th percentiles of diurnal temperature range, and its additive effects at lags of 0, 1, 3, and 5 days. I'm doing this for a subset of months, May-September (the mortality data are subset below; the temperature data are already subset when read in). The code below runs, but no matter which city or lag I use, I get a RR of essentially 1.0, so I believe something is off or I am missing an argument somewhere. If anyone has more experience with these problems than I do, your help would be greatly appreciated.
library('dlnm')
library('splines')
mortdata <- read.table('STLmort.txt', sep="\t", header=T)
morts <- subset(mortdata, Month %in% 5:9)
deaths <- morts$AllMort
tempdata <- read.csv('STLRanges.csv',sep=',',header=T)
temp <- tempdata$Trange
HI <- tempdata$HIrange
#basis.var <- onebasis(1:5, knots=3)
#mklagbasis(maxlag=5, type="poly", degree=3)
basis.temp <- crossbasis(temp, lag=5, argvar=list(fun="poly", degree=3))
summary(basis.temp)
model <- glm(deaths ~ basis.temp, family=quasipoisson())
pred.temp <- crosspred(basis.temp, model, at=quantile(temp, c(.10,.50,.90), na.rm=TRUE), cumul=TRUE)
plot(pred.temp, "slices", var=quantile(temp, c(.10,.50,.90), na.rm=TRUE), lag=c(0,1,5))
The problem is that you did not include any time variables to control for the long-term and seasonal trends in the time series when fitting the DLNM.
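For example, the usual time-series DLNM setup adds a smooth function of time to the quasi-Poisson model. A minimal sketch, assuming the May-September records are in chronological order; the sequential day index and the 4 df are illustrative stand-ins, and with several years of data you would typically also add a year term:
# hypothetical sequential day index standing in for calendar time
time <- seq_along(deaths)
model <- glm(deaths ~ basis.temp + ns(time, df = 4), family = quasipoisson())
pred.temp <- crosspred(basis.temp, model,
                       at = quantile(temp, c(.10, .50, .90), na.rm = TRUE), cumul = TRUE)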
I've achieved stationary data for use in ARIMA forecasts using seasonal and non-seasonal differencing. Now how do I revert to the original scale from the differenced data?
raw <- read.csv("https://raw.githubusercontent.com/thistleknot/Python-Stock/master/data/combined_set.csv",row.names=1,header=TRUE)
temp <- raw$CSUSHPINSA
library(forecast)  # for nsdiffs()/ndiffs()
#tells me to seasonally difference once
print(nsdiffs(ts(temp, frequency = 4)))
season <- 4  # quarterly data
temp_1 <- temp - dplyr::lag(temp, 1 * season)
#tells me I need to difference once more
print(ndiffs(na.omit(temp_1)))
temp_2 <- temp_1 - dplyr::lag(temp_1, 1)
#shows data is somewhat stationary
plot(temp_2)
#gives me back the original dataset if I only had seasonal differencing
na.omit(dplyr::lag(raw$CSUSHPINSA, 4) + temp_1)
#how to do this with temp_2?
Some references
Pandas reverse of diff()
Reverse Diff function in R
Never mind, I got it:
dplyr::lag(raw$CSUSHPINSA, 4) + dplyr::lag(temp_1, 1) + temp_2
More complete examples
temp <- raw$MSPUS
#print(nsdiffs(ts(temp, frequency = 4)))
#temp_1 <- temp - dplyr::lag(temp, 1 * season)
temp_1 <- temp - dplyr::lag(temp, 1)
print(ndiffs(na.omit(temp_1)))
temp_2 <- temp_1 - dplyr::lag(temp_1, 1)
#forecast values of temp_2
temp_3 <- dplyr::lag(temp_1, 1) + temp_2
temp_4 <- dplyr::lag(raw$MSPUS, 1) + temp_3
new_temp_2_values <- c(8000, 10000)
extended <- c(temp_4,
              tail(c(temp_3, tail(temp_4, 1) + cumsum(tail(temp_3, 1) + cumsum(new_temp_2_values))),
                   length(new_temp_2_values)))
print(extended)
Wrote a more involved version here
https://gist.github.com/thistleknot/eeaf1631f736e20806c37107f344d50e
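As an aside, base R's diff()/diffinv() pair expresses the same round trip without the leading NAs that dplyr::lag() introduces; a sketch on the quarterly series above:
y  <- na.omit(raw$CSUSHPINSA)
d1 <- diff(y, lag = 4)   # seasonal (quarterly) difference
d2 <- diff(d1, lag = 1)  # regular first difference
# invert in reverse order: undo the regular difference, then the seasonal one
d1.back <- diffinv(d2, lag = 1, xi = d1[1])
y.back  <- diffinv(d1.back, lag = 4, xi = y[1:4])
all.equal(as.numeric(y.back), as.numeric(y))  # TRUE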
This is my first question on Stack Overflow.
Situation: I have 2 time series. Both series have the same values, but the second series has 5 NAs at the start; hence the first series has 105 observations while the second has 110. I have fitted an ARIMA(0,1,0) model using the Arima function to each series separately, and then used the forecast package to predict 10 steps into the future.
Issue: Even though the ARIMA coefficients for both series are the same, the projections (10 steps) differ. I am uncertain why this is the case. Has anyone come across this before? Any guidance is highly appreciated.
Tried: I tried setting a seed, creating the index manually, and using auto.arima for the model fitting. However, none of these steps helped me reconcile the difference.
I have added a picture to show what I see. Please note I have hidden the middle part of the series so that you can see the start and the end. The yellow highlighted cells are the projection outputs from the forecast package. I manually added the year index after extracting the results from R.
[Image: the projected and base time series in Excel]
Rates <- read.csv("Rates_for_ARIMA.csv")
set.seed(123)
#ARIMA with NA
Simple_Arima <- Arima(
  ts(Rates$Rates1),
  order = c(0,1,0),
  include.drift = TRUE)
fcasted_Arima <- forecast(Simple_Arima, h = 10)
fcasted_Arima$mean
#ARIMA Without NA
Rates2 <- as.data.frame(Rates$Rates2)
## Remove the final empty rows from the CSV
Rates2 <- Rates2[-c(106,107,108,109,110),]
Simple_Arima2 <- Arima(
  ts(Rates2),
  order = c(0,1,0),
  include.drift = TRUE)
fcasted_Arima2 <- forecast(Simple_Arima2, h = 10)
fcasted_Arima2$mean
The link to data is here, CSV format
Could you share your data and code so that others can check whether there is an issue with them?
I tried to come up with an example and got the same results for both series, one that includes NAs and one that doesn't.
library(forecast)
library(xts)
set.seed(123)
ts1 <- arima.sim(model = list(order = c(0, 1, 0)), n = 105)
ts2 <- ts(c(rep(NA, 5), ts1), start = 1)
fit1 <- forecast::Arima(ts1, order = c(0, 1, 0))
fit2 <- forecast::Arima(ts2, order = c(0, 1, 0))
pred1 <- forecast::forecast(fit1, 10)
pred2 <- forecast::forecast(fit2, 10)
forecast::autoplot(pred1)
forecast::autoplot(pred2)
> all.equal(as.numeric(pred1$mean), as.numeric(pred2$mean))
[1] TRUE
I want to run an out-of-sample forecast experiment using the auto.arima function, applying time series cross-validation with a fixed rolling window. The goal is to obtain forecasts at horizons of 1, 3 and 6 steps ahead.
library(forecast)
library(tseries)
#the time series
y1 = 2+ 0.15*(1:20) + rnorm(20,2)
y2 = y1[20]+ 0.3*(1:30) + rnorm(30,2)
y = as.ts(c(y1,y2))
#10 obs in test set, 40 obs in training set
ntest <- 10
ntrain <- length(y)-ntest
#auto.arima with some preferred specifications
farima <- function(x, h){
  forecast(auto.arima(x, ic = "aic", test = c("adf"), seasonal = FALSE,
                      stepwise = FALSE, approximation = FALSE, method = c("ML")),
           h = h)
}
# executing the following function, gives the forecast errors in a matrix for each one-step forecast
e <- tsCV(y,farima,h = 6,window=40)
The predicted values are given by subtracting the error from the true value:
#predicted values
fc1 <- c(NA,y[2:50]-e[1:49,1])
fc1 <- fc1[41:50]
fc3 <- c(NA,y[2:50]-e[1:49,3])
fc3 <- fc3[41:50]
fc6 <- c(NA,y[2:50]-e[1:49,6])
fc6 <- fc6[41:50]
However, I'm curious whether the predicted values for the 3-step-ahead forecasts are coded correctly, since the first 3-step-ahead forecast should be the prediction of the 43rd observation.
Also, I don't understand why the matrix e has a value in the 3rd column (the 3-step-ahead errors) at observation 40. I thought the first 3-step-ahead forecast is obtained for observation 43, so there shouldn't be an error recorded at observation 40.
Always read the help file:
Value
Numerical time series object containing the forecast errors as a vector (if h=1) and a matrix otherwise. The time index corresponds to the last period of the training data. The columns correspond to the forecast horizons.
So tsCV() returns errors in a matrix where the (i,j)th entry contains the error for forecast origin i and forecast horizon j. So the value in row 40 and column 3 is a 3-step error made at forecast origin 40, i.e. for time period 43.
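In other words, for a fixed horizon h the forecasts can be recovered in one step; a sketch using the objects defined above:
h <- 3
origins <- 40:(length(y) - h)           # the first full 40-observation window ends at time 40
fc_h <- y[origins + h] - e[origins, h]  # forecasts of y[43], ..., y[50]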
Thanks for your help!
So for the h=1,2,3 steps ahead the predicted values are the following:
#predicted values
#h=1: forecasts of observations 41-50, made at origins 40-49
fc1 <- y[41:50] - e[40:49, 1]
#h=2: forecasts of observations 42-50, made at origins 40-48
fc2 <- y[42:50] - e[40:48, 2]
#h=3: forecasts of observations 43-50, made at origins 40-47
fc3 <- y[43:50] - e[40:47, 3]
Is that correct?
I am trying to forecast three variables using R, but I am running into issues with how to deal with correlation.
The three variables I am trying to forecast are Revenue, Subscriptions, and Price.
My initial approach was to do two independent time series forecasts of Subscriptions and Price and multiply the outcomes to generate the Revenue forecast.
I wanted to understand whether this approach makes sense, as there is an inherent correlation between price and subscribers, and that is the part I do not know how to deal with.
# Load packages.
library(forecast)
# Read data
data <- read.csv("data.csv")
data.train <- data[1:57,]
data.test <- data[58:72,]
# Create time series for variables of interest
data.subs <- ts(data.train$subs, start=c(2014,1), frequency = 12)
data.price <- ts(data.train$price, start=c(2014,1), frequency = 12)
#Create model
subs.stlm <- stlm(data.subs)
price.stlm <- stlm(data.price)
#Forecast
subs.pred <- forecast(subs.stlm, h = 15, level = c(0.6, 0.75, 0.9))
price.pred <- forecast(price.stlm, h = 15, level = c(0.6, 0.75, 0.9))
Any help is greatly appreciated!
It looks like you can use a vector autoregression (VAR) model. Take a look at the description and the code provided here:
https://otexts.org/fpp2/VAR.html
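For instance, here is a minimal sketch using the vars package (my choice of package and lag order, not something from the question), fitting Subscriptions and Price jointly and deriving the Revenue forecast from the joint point forecasts:
library(vars)
# bind the two training series into one multivariate monthly series
z <- ts(cbind(subs = data.train$subs, price = data.train$price),
        start = c(2014, 1), frequency = 12)
var.fit  <- VAR(z, p = 2, type = "const")  # p = 2 is illustrative; VARselect(z) can guide the choice
var.pred <- predict(var.fit, n.ahead = 15)
# revenue point forecast as the product of the two point forecasts
rev.pred <- var.pred$fcst$subs[, "fcst"] * var.pred$fcst$price[, "fcst"]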
I'm trying to generate estimates of the percent of Catholics within a given municipality in a country and I'm using multilevel regression and post-stratification of survey data.
The approach fits a multilevel logit and generates predicted probabilities of the dependent variable. It then weights the probabilities using poststratification of the sample to census data.
I can generate the initial estimates (which are essentially just the predicted probabilities of being Catholic for the individuals in the survey data). However, when I try to take the average with the last line of code below, it returns only NAs for each municipality. The initial cell predictions have some missing values, but nowhere near a majority.
I don't understand why I can't generate municipal weighted averages, as I've followed this procedure with different data. Any help would be greatly appreciated.
rm(list=ls(all=TRUE))
library("arm")
library("foreign")
#read in megapoll
ES.data <- read.dta("ES4.dta", convert.underscore = TRUE)
#read in municipal-level dataset
munilevel <- read.dta("election.dta",convert.underscore = TRUE)
munilevel <- munilevel[order(munilevel$municode),]
#read in Census data
Census <- read.dta("poststratification4.dta",convert.underscore = TRUE)
Census <- Census[order(Census$municode),]
Census$municode <- match(Census$municode, munilevel$municode)
#Create index variables
#At level of megapoll
ES.data$ur.female <- (ES.data$female *2) + ES.data$ur
ES.data$age.edr <- 6 * (ES.data$age -1) + ES.data$edr
#At census level (same coding as above for all variables)
Census$cur.cfemale <- (Census$cfemale *2) + Census$cur
Census$cage.cedr <- 6 * (Census$cage -1) + Census$cedr
##Municipal level variables
Census$c.arena<- munilevel$c.arena[Census$municode]
Census$c.fmln <- munilevel$c.fmln[Census$municode]
#run individual-level opinion model
individual.model1 <- glmer(formula = catholic ~ (1|ur.female) + (1|age)
                           + (1|edr) + (1|age.edr) + (1|municode) + p.arena + p.fmln,
                           data = ES.data, family = binomial(link = "logit"))
display(individual.model1)
#examine random effects and standard errors for urban-female
ranef(individual.model1)$ur.female
se.ranef(individual.model1)$ur.female
#create vector of municipality ranefs and then fill in missing ones
muni.ranefs <- array(NA,c(66,1))
dimnames(muni.ranefs) <- list(c(munilevel$municode),"effect")
for(i in munilevel$municode){
  muni.ranefs[i,1] <- ranef(individual.model1)$municode[i,1]
}
muni.ranefs[,1][is.na(muni.ranefs[,1])] <- 0 #set municipalities with missing REs (b/c not in data) to zero
#create a prediction for each cell in Census data
cellpred1 <- invlogit(fixef(individual.model1)["(Intercept)"]
+ranef(individual.model1)$ur.female[Census$cur.cfemale,1]
+ranef(individual.model1)$age[Census$cage,1]
+ranef(individual.model1)$edr[Census$cedr,1]
+ranef(individual.model1)$age.edr[Census$cage.cedr,1]
+muni.ranefs[Census$municode,1]
+(fixef(individual.model1)["p.fmln"] *Census$c.fmln) # municipal level
+(fixef(individual.model1)["p.arena"] *Census$c.arena)) # municipal level
#weights the prediction by the freq of cell
cellpredweighted1 <- cellpred1 * Census$cpercent.muni
#calculates the percent within each municipality (weighted average of responses)
munipred <- 100* as.vector(tapply(cellpredweighted1, Census$municode, sum))
munipred
The extensive amount of code is redundant without the data! I suspect you have NAs in the object cellpredweighted1: by default sum() propagates NAs, because if one or more elements of a vector is NA then, by definition, the sum of those elements is also NA.
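A quick illustration of that default:
sum(c(1, 2, NA))                # NA
sum(c(1, 2, NA), na.rm = TRUE)  # 3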
If that is what is happening here, then simply adding na.rm = TRUE to the tapply() call should solve the problem.
tapply(cellpredweighted1, Census$municode, sum, na.rm = TRUE)
You should also be asking yourself why there are NAs at this stage, and whether they result from errors earlier in the process.