Alternative to brute force estimation of parameter in an ecological time series model - r

I am modeling a hydrologic process (water levels [stage] in lakes, measured in mm) that can be described as:
S_t = S_{t-1} + Beta1 * X_t + Beta2 * S_{t-1}
where Beta1 is estimated from a different model and used as a constant here (1.071663 in the code below), X_t is a daily driver series, and Beta2 is the unknown, expected to lie between (-0.001, 0.001). The starting value of S doesn't matter so long as it is greater than 10 m (10000 mm). The model runs on a daily time step. I have observed stage from multiple lakes and fit each lake independently.
Currently, I am brute-force identifying the parameter value by:
1. Creating a 100-value sequence of parameter values spanning (-0.001, 0.001).
2. Predicting stage using the above equation and estimating RMSE between modeled and observed data (there are significantly fewer observations than modeled data points).
3. Identifying the Beta2 with the lowest RMSE and selecting the Beta2 values on either side of it to create a new sequence of parameter values to search over.
Steps 2 and 3 are repeated until RMSE decreases by less than 0.01 or increases.
The code for the brute force approach I've been using is below along with the data associated with a single lake.
Is there an alternative approach to estimating the unknown parameter Beta2 given the model above and the fact that I only have observed data for a limited number of days?
Thanks!
library(tidyverse)
library(lubridate)
library(Metrics)
#The Data
dat <- read_csv("https://www.dropbox.com/s/skg8wfpu9274npb/driver_data.csv?dl=1")
observeddata <- read_csv("https://www.dropbox.com/s/bhh27g5rupoqps3/observeddata.csv?dl=1") %>% select(Date,Value)
#Setup initial values and vectors
S = matrix(nrow = nrow(dat),ncol = 1) #create an empty matrix for predicted values
S[1,1] = 10*1000 #set initial value (mm)
rmse.diff <- 10^100 #arbitrarily high starting value for the difference in minimum RMSE
#between successive parameter searches
b.levels <- seq(from = -0.001, to = 0.001, length.out = 100) #initial parameter grid that should
#contain the value being estimated
n = 0 # counter
#Loop to bruteforce search for best parameter estimate
while(rmse.diff > 0.01) {
  rmse.vec <- rep(NA, length(b.levels))
  for (t in 1:length(b.levels)) {
    for (z in 2:nrow(S)) {
      S[z, 1] <- S[(z-1), 1] + (1.071663 * (dat$X[z])) + (b.levels[t] * (S[(z-1), 1])) #-1.532236
    } #end of time series loop
    extrap_level <- data.frame(Date = dat$Date, level = S) # predicted lake levels
    #calculate an offset to center observed data on extrapolated data
    dat.offset <- observeddata %>% left_join(extrap_level) %>%
      mutate(offset = level - Value) %>% drop_na()
    offset <- mean(dat.offset$offset)
    dat.compare <- observeddata %>% left_join(extrap_level) %>%
      mutate(Value = Value + offset) %>% drop_na()
    #calculate RMSE between observed and extrapolated values
    rmse.vec[t] <- rmse(actual = dat.compare$Value, predicted = dat.compare$level)
    #plot the data to watch how parameter choice influences fit while looping
    #(plots have a hard time keeping up)
    if (t == 1 | t == 50 | t == 100) {
      plot(extrap_level$Date, extrap_level$level, type = "l")
      lines(dat.compare$Date, dat.compare$Value, col = "red")
    }
  }
  #find the minimum RMSE value
  min.rmse <- which(rmse.vec == min(rmse.vec))
  if (n == 0) rmse.best <- rmse.vec[min.rmse] else rmse.best <- c(rmse.best, rmse.vec[min.rmse])
  if (n >= 1) rmse.diff <- (rmse.best[n] - rmse.best[n+1])
  if (rmse.diff < 0) break
  best.b <- b.levels[min.rmse]
  #take the parameter values on either side of the best prior RMSE and use those as the new search area
  b.levels <- seq(from = (b.levels[min.rmse-1]), to = (b.levels[min.rmse+1]), length.out = 100)
  n <- n + 1
}
rmse.best #vector of RMSE for each parameter search
best.b #Last identified best parameter value
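For example, would it be reasonable to wrap the same RMSE calculation in a function of Beta2 and hand it to a one-dimensional optimiser such as optimize()? A rough sketch of what I have in mind (reusing dat, observeddata, and the offset logic above; not yet tested against the full dataset):
# Sketch: RMSE of the extrapolated series as a function of the unknown Beta2.
rmse_for_b <- function(b, dat, observeddata) {
  S <- numeric(nrow(dat))
  S[1] <- 10 * 1000                               # initial stage (mm)
  for (z in 2:nrow(dat)) {
    S[z] <- S[z - 1] + 1.071663 * dat$X[z] + b * S[z - 1]
  }
  extrap_level <- data.frame(Date = dat$Date, level = S)
  dat.compare <- observeddata %>%
    left_join(extrap_level, by = "Date") %>%
    drop_na() %>%
    mutate(Value = Value + mean(level - Value))   # same centering offset as above
  rmse(actual = dat.compare$Value, predicted = dat.compare$level)
}
fit <- optimize(rmse_for_b, interval = c(-0.001, 0.001),
                dat = dat, observeddata = observeddata)
fit$minimum   # candidate estimate of Beta2
fit$objective # RMSE at that value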

Related

Passing a user defined function to `dplyr::summarize()` when 'data' is an argument of user defined function

I am attempting to calculate a forestry biometric called top height for a dataset containing several forest stands, each with numerous plots. This biometric requires finding the largest-diameter trees representing 40 trees per acre in a plot or stand, calculating the cumulative trees per acre they represent and their cumulative height, then dividing the cumulative height by the cumulative trees per acre. This requires a user-defined function, which I have created.
My function accepts five arguments: data (a data.frame of tree biometric data), dbh (the column giving the diameter at breast height for individual trees), ht (the column giving the height of individual trees), tpa (the trees per acre each individual tree represents), and n (the number of trees per acre to consider in the calculation; by default 40, a forest biometrics standard in imperial units). As part of the function, I need to order the trees within a plot or stand in descending order of dbh.
I am attempting to use dplyr::group_by() %>% summarize() to apply this function to each plot and stand combination. However, when I use the "piping" method to pass the data from group_by() to summarize(), the data do not get passed. R throws the following error:
Error in `summarize()`:
! Problem while computing `TOP_HT = topht(dbh = dbh, ht = ht, tpa =
tpa, n = 40)`.
ℹ The error occurred in group 1: groups = "A".
Caused by error:
! argument "data" is missing, with no default
Run `rlang::last_error()` to see where the error occurred.
The obvious answer would simply be to take out the data argument and just define the function on the tree biometric arguments. However, this won't work, as I need to order all of the variables by descending order of dbh. Is there a way I can pass the grouped data to the data argument within my call to summarize()? Below is my reproducible example with fake data:
##Loading Necessary Package##
library(dplyr)
##Setting Random Number Seed for Reproducibility##
set.seed(55)
##Generating Some Fake Data##
groups<-c(rep("A", 5), rep("B", 5))
ht<-rnorm(10, 125, 20)
tpa<-rnorm(10, 150, 60)
dbh<-rnorm(10, 20, 2)
DF<-data.frame(groups=groups, dbh=dbh, ht=ht, tpa=tpa)
##Defining the topht function##
topht <- function(data, dbh = NULL, ht = NULL, tpa = NULL, n = 40){ #function parameters
  ##evaluate function parameters in the data environment
  tmp <- eval(substitute(dbh), envir = data)
  odata <- data[base::order(tmp, decreasing = TRUE), ]
  ht <- eval(substitute(ht), envir = odata)
  tpa <- eval(substitute(tpa), envir = odata)
  #creating variables for cumulative trees per acre and cumulative height calculations#
  cumtpa <- 0
  cumht <- 0
  #beginning a loop to calculate top height#
  for(i in 1:nrow(odata)){ #setting looping range
    if(cumtpa < n){ #only run cumulative adding when cumulative trees per acre is less than n
      cumtpa <- tpa[i] + cumtpa
      cumht <- (ht[i] * tpa[i]) + cumht
    }#Close conditional
    if(cumtpa == n){ #End the loop if cumulative tpa = n
      break
    }#End Conditional
    if(cumtpa > n){ #Adjust final tree's weight when cumulative tpa exceeds n and end loop
      delta <- cumtpa - n
      cumtpa <- cumtpa - delta
      cumht <- cumht - (delta * ht[i])
      break
    }#End Conditional
    if(cumtpa > 0){ #Define calculation of top height when trees per acre > 0
      topht <- cumht / cumtpa
    }else{ #Define complement of conditional
      topht <- 0
    }#Close conditional
  }#Close loop
  return(topht) #Output top height
}#Close function
##Attempting to run top height function independently for groups A and B##
out <- as.data.frame(DF %>% group_by(groups) %>% summarize(TOP_HT = topht(dbh = dbh, ht = ht, tpa = tpa, n = 40))) #Throws error
I tried to repair your function and apply it to your data:
library(dplyr)
topht <- function(data, dbh = NULL, ht = NULL, tpa = NULL, n = 40){
  ##evaluate function parameters in the data environment
  tmp <- data %>% pull({{ dbh }})
  odata <- data[base::order(tmp, decreasing = TRUE), ]
  ht <- odata %>% pull({{ ht }})
  tpa <- data %>% pull({{ tpa }})
  #creating variables for cumulative trees per acre and cumulative height calculations#
  cumtpa <- 0
  cumht <- 0
  outcome <- 0
  for(i in 1:nrow(odata)) {
    if(cumtpa < n){
      cumtpa <- tpa[i] + cumtpa
      cumht <- (ht[i] * tpa[i]) + cumht
    } else if(cumtpa == n){
      break
    } else {
      delta <- cumtpa - n
      cumtpa <- cumtpa - delta
      cumht <- cumht - (delta * ht[i])
      break
    }
    if(cumtpa > 0) {
      outcome <- cumht / cumtpa
    } else {
      outcome <- 0
    }
  }
  outcome
}
Now we apply this function to each group:
DF %>%
  group_by(groups) %>%
  group_modify(~ .x %>% summarize(TOP_HT = topht(., dbh = dbh, ht = ht, tpa = tpa, n = 40))) %>%
  ungroup() %>%
  as.data.frame()
We want to apply topht to each group, so we use group_modify (it's like purrr's little sister). This returns
groups TOP_HT
1 A 88.75246
2 B 123.01531
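A slightly more direct variant (a sketch, assuming the repaired topht above) skips the inner summarize and lets group_modify build the one-row result itself:
DF %>%
  group_by(groups) %>%
  group_modify(~ tibble(TOP_HT = topht(.x, dbh = dbh, ht = ht, tpa = tpa, n = 40))) %>%
  ungroup() %>%
  as.data.frame()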
A few words of explanation:
- Since your function is named topht, you really should not use topht as a variable name (even inside this function). I changed it to outcome.
- outcome should be defined / initialised with some value. I chose 0; NA or something else might also be possible.
- return() at the end of a function is unnecessary. Just use the variable name.
- To evaluate the function's arguments (like dbh = dbh) you need the curly-curly operator. As a reference: https://www.r-bloggers.com/2019/06/curly-curly-the-successor-of-bang-bang/
- Your first if-construction should be packed together into an if / else if / else construction.
- To improve readability, you can use some spacing (see http://adv-r.had.co.nz/Style.html).
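As a standalone illustration of the curly-curly operator (col_mean() here is a hypothetical helper, not part of your code): the caller passes an unquoted column name and {{ }} forwards it to pull().
library(dplyr)
col_mean <- function(data, col) {
  data %>% pull({{ col }}) %>% mean()
}
col_mean(mtcars, mpg)  # equivalent to mean(mtcars$mpg)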

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
  data.frame(
    id = 1:nPerson,
    BornYear = floor(runif(length(1:nPerson), min = BornMin, max = BornMax + 1)),
    BornMonth = ceiling(runif(length(1:nPerson), min = 0, max = 12))
  )
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 people, but hours for thousands). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > -error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > -error. I use xb = xbsmall + 4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but knowing that is not necessary to solve the computation-speed problem).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for(t in (AgeMonthMin+1):AgeMonthMax){
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
    # xb = xbsmall + 4.47 if Outcome was 1 last month
    # Otherwise, xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > -error
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
  }
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computer time:
- Something with cumsum()
- Some wonderful panel data function that I do not know about
- Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::gather_by()
- Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
- Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates, not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
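If this still isn't fast enough at the full sample size, one possible further refinement (a sketch, not benchmarked here) is to run the recursion on plain vectors and write the columns back once at the end, which avoids repeated data-frame subassignment inside the loop:
generate_outcome2 <- function(x) {
  xb      <- x$xbsmall
  err     <- x$error
  outcome <- x$Outcome
  for (i in 2:nrow(x)) {
    xb[i]      <- x$xbsmall[i] + 4.47 * outcome[i - 1]
    outcome[i] <- as.numeric(xb[i] > -err[i])
  }
  x$xb <- xb
  x$Outcome <- outcome
  x
}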

value at risk estimation using fGarch package in R

I am trying to carry out an analysis similar to McNeil & Frey in their paper 'Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach', but I am stuck with a problem when implementing the models.
The approach is to fit an AR(1)-GARCH(1,1) model in order to estimate the one-day-ahead forecast of the VaR using a rolling window of 1000 observations.
I have simulated data that should work fine with my model, and I assume that if I were doing this correctly, the observed coverage rate should be close to the theoretical one. However, it is always below the theoretical coverage rate, and I don't know why.
I believe the estimated VaR is calculated as
VaR_hat = mu_hat + sigma_hat * qnorm(alpha)
but I might be wrong. I have tried to find related questions here on Stack but have not found any.
My approach can be summarized in three steps:
1. Simulate 2000 AR(1)-GARCH(1,1) observations, fit a corresponding model, and extract the one-day-ahead prediction of the conditional mean and standard deviation using a rolling window of 1000 observations (thereby making 1000 predictions).
2. Use the predicted values and the normal quantile to calculate the VaR at the desired confidence level.
3. Check whether the coverage rate is close to the theoretical one.
If someone could help me I would be extremely thankful, and if I'm unclear in my formulation please just tell me and I'll try to come up with a better explanation of the problem.
The code I'm using is attached below.
Thank you in advance
library(fGarch)
nObs <- 2000 # Number of observations.
quantileLevel <- 0.95 # Since we expect 5% exceedances.
from <- seq(1,1000) # Lower index vector for observations in model.
to <- seq(1001,2000) # Upper index vector for observations in model.
VaR_vec <- rep(0,(nObs-1000)) # Empty vector for storage of 1000 VaR estimates.
# Specs for simulated data (including AR(1) component and all components for GARCH(1,1)).
spec = garchSpec(model = list(omega = 1e-6, alpha = 0.08, beta = 0.91, ar = 0.10),
                 cond.dist = 'norm')
# Simulate 2000 data points.
data_sim <- c(garchSim(spec, n = nObs, n.start = 1000))
for (i in 1:1000){
  # The rolling window of 1000 observations.
  data_insert <- data_sim[from[i]:to[i]]
  # Fitting an AR(1)-GARCH(1,1) model with normal cond.dist.
  fitted_model <- garchFit(~ arma(1,0) + garch(1,1), data_insert,
                           trace = FALSE,
                           cond.dist = "norm")
  # One day ahead forecast of conditional mean and standard deviation.
  prediction_model <- predict(fitted_model, n.ahead = 1)
  mu_pred <- prediction_model$meanForecast
  sigma_pred <- prediction_model$standardDeviation
  # Calculate VaR forecast
  VaR_vec[i] <- mu_pred + sigma_pred*qnorm(quantileLevel)
  if (length(to)-i != 0){
    print(c('Countdown, just',(length(to) - i),'iterations left'))
  } else {
    print(c('Done!'))
  }
}
# Extract only the estimates related to the forecasts.
compare_data_sim <- data_sim[1001:length(data_sim)]
hit <- rep(0,length(VaR_vec))
# Count the number of exceedances.
for (i in 1:length(VaR_vec)){
  hit[i] <- sum(VaR_vec[i] <= compare_data_sim[i])
}
plot(data_sim[1001:2000], type = 'l',
     ylab = 'Simulated data', main = 'Illustration of one day ahead prediction of 95%-VaR')
lines(VaR_vec, col = 'red')
cover_prop <- sum(hit)/length(hit)
print(sprintf("Diff theoretical level and VaR coverage = %f", (1-quantileLevel) - cover_prop))
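(Equivalently, the empirical coverage can be computed in one line from the objects above; the exceedance rate should be close to 1 - quantileLevel = 0.05.)
cover_prop <- mean(compare_data_sim >= VaR_vec)
c(observed = cover_prop, expected = 1 - quantileLevel)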

How to find an optimal adstock decay factor for an independent variable in panel analysis in R?

I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to find a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code returns an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) for all other variables.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate if someone could take a look and see what is wrong.
Sample Data
# Set panel data structure
alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm + 0
# Create 1 month time lag for TRx
alldata <- ddply(
  alldata, .(DMA), transform,
  # This assumes that the data is sorted
  Mkt_TRx_norm_prev = c(NA, Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <- function(decay, period, data_vector, pool_vector = 0){
  data_vector <- alldata$var
  pool_vector <- alldata$DMA
  data2 <- data_vector
  l <- length(data_vector)
  #if no pool apply zero to vector
  if(length(pool_vector) == 1) pool_vector <- rep(0, l)
  #outer loop: extract data to decay from observation i
  for(i in 1:l){
    x <- data_vector[i]
    #inner loop: apply decay onto following observations after i
    for(j in 1:min(period, l)){
      #constrain decay to same pool (if data is pooled)
      if(pool_vector[i] == pool_vector[min(i+j, l)]){ data2[(i+j)] <- data2[(i+j)] + (x*(decay)^j) }
    }
  }
  #reduce length of edited data to equal length of initial data
  data2 <- data2[1:l]
  #regression - excludes NA values
  alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
  var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2, data = alldata, model = "within", na.action = na.exclude)
  se <- summary(var_fe)$coefficients["data2", "Std. Error"]
  return(se)
}
# Optimize decay for adstock variable
result <- optimize(adstockreg, interval=c(0,1), period = 6)
print(result)
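For reference, a vectorized way to build a geometric adstock within each DMA (a sketch only; it drops the period cutoff used in the loop above and assumes rows are sorted by month within each DMA) is stats::filter() with a recursive filter:
# y[t] = x[t] + decay * y[t-1], computed separately for each DMA.
# var_adstock is a hypothetical new column; decay = 0.5 is just a placeholder
# for whatever candidate value the optimizer is evaluating.
adstock_geometric <- function(x, decay) {
  as.numeric(stats::filter(x, filter = decay, method = "recursive"))
}
alldata$var_adstock <- ave(alldata$var, alldata$DMA,
                           FUN = function(x) adstock_geometric(x, decay = 0.5))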

Tapply only producing missing values

I'm trying to generate estimates of the percent of Catholics within a given municipality in a country and I'm using multilevel regression and post-stratification of survey data.
The approach fits a multilevel logit and generates predicted probabilities of the dependent variable. It then weights the probabilities using poststratification of the sample to census data.
I can generate the initial estimates (which are essentially just the predicted probability of being Catholic for a given individual in the survey data). However, when I try to take the average with the last line of code below, it only returns NAs for each of the municipalities. The initial cell predictions have some missing values, but nowhere near a majority.
I don't understand why I can't generate municipal weighted averages as I've followed the procedure using different data. Any help would be greatly appreciated.
rm(list=ls(all=TRUE))
library("arm")
library("foreign")
#read in megapoll and attach
ES.data <- read.dta("ES4.dta", convert.underscore = TRUE)
#read in municipal-level dataset
munilevel <- read.dta("election.dta",convert.underscore = TRUE)
munilevel <- munilevel[order(munilevel$municode),]
#read in Census data
Census <- read.dta("poststratification4.dta",convert.underscore = TRUE)
Census <- Census[order(Census$municode),]
Census$municode <- match(Census$municode, munilevel$municode)
#Create index variables
#At level of megapoll
ES.data$ur.female <- (ES.data$female *2) + ES.data$ur
ES.data$age.edr <- 6 * (ES.data$age -1) + ES.data$edr
#At census level (same coding as above for all variables)
Census$cur.cfemale <- (Census$cfemale *2) + Census$cur
Census$cage.cedr <- 6 * (Census$cage -1) + Census$cedr
##Municipal level variables
Census$c.arena<- munilevel$c.arena[Census$municode]
Census$c.fmln <- munilevel$c.fmln[Census$municode]
#run individual-level opinion model
individual.model1 <- glmer(formula = catholic ~ (1|ur.female) + (1|age)
                           + (1|edr) + (1|age.edr) + (1|municode) + p.arena + p.fmln,
                           data = ES.data, family = binomial(link = "logit"))
display(individual.model1)
#examine random effects and standard errors for urban-female
ranef(individual.model1)$ur.female
se.ranef(individual.model1)$ur.female
#create vector of municipal ranefs and then fill in missing ones
muni.ranefs <- array(NA, c(66, 1))
dimnames(muni.ranefs) <- list(c(munilevel$municode), "effect")
for(i in munilevel$municode){
  muni.ranefs[i, 1] <- ranef(individual.model1)$municode[i, 1]
}
muni.ranefs[,1][is.na(muni.ranefs[,1])] <- 0 #set municipalities with missing REs (b/c not in data) to zero
#create a prediction for each cell in Census data
cellpred1 <- invlogit(fixef(individual.model1)["(Intercept)"]
                      + ranef(individual.model1)$ur.female[Census$cur.cfemale, 1]
                      + ranef(individual.model1)$age[Census$cage, 1]
                      + ranef(individual.model1)$edr[Census$cedr, 1]
                      + ranef(individual.model1)$age.edr[Census$cage.cedr, 1]
                      + muni.ranefs[Census$municode, 1]
                      + (fixef(individual.model1)["p.fmln"] * Census$c.fmln)    # municipal level
                      + (fixef(individual.model1)["p.arena"] * Census$c.arena)) # municipal level
#weights the prediction by the freq of cell
cellpredweighted1 <- cellpred1 * Census$cpercent.muni
#calculates the percent within each municipality (weighted average of responses)
munipred <- 100* as.vector(tapply(cellpredweighted1, Census$municode, sum))
munipred
The extensive amount of code is totally redundant without the data! I suppose you have NAs in the object cellpredweighted1, and by default sum() propagates NAs to the answer, because if one or more elements of a vector are NA then by definition the sum of those elements is also NA.
If that is the case here, then simply adding na.rm = TRUE to the tapply() call should solve the problem.
tapply(cellpredweighted1, Census$municode, sum, na.rm = TRUE)
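A tiny illustration of why a single NA poisons a group's sum (made-up values):
sum(c(0.2, NA, 0.4))                # NA
sum(c(0.2, NA, 0.4), na.rm = TRUE)  # 0.6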
You should be asking yourself why there are NAs at this stage and whether they result from errors earlier in the process.
