I'm trying to estimate the price elasticity of demand (PED) for each customer using a GAM, with a model like this:
\ln D = \ln P + \ln P \cdot \sum_{i=1}^{20} f(X_i)
PED = \frac{\partial \ln D}{\partial \ln P} = 1 + \sum_{i=1}^{20} f(X_i)
where $D$ is demand, $P$ is the rate (price), $PED$ is the price elasticity of demand, and the $X_i$ are a set of customer-level variables.
Since $PED$ is not observable, I want to estimate it from the GAM fitted to log demand, but I'm having difficulty working out how to do that.
I tried to extract the individual splines to calculate PED, but I failed. I know there is a package called gratia with a derivatives() function, but I don't understand how to use it to calculate PED.
Once the demand model is fitted, I will need to estimate the price elasticity of demand for each customer, but for these customers I don't have the rate variable, only the 20 personal variables.
I read some links:
https://stats.stackexchange.com/questions/495775/first-derivative-of-fitted-gam-changes-according-to-specified-model-distribution
https://stats.stackexchange.com/questions/590167/how-can-i-calculate-a-derivative-of-a-global-smooth-and-group-level-smooths-with
https://stats.stackexchange.com/questions/32013/what-is-the-mathematical-model-formula-corresponding-to-this-gam-model-fit-in-r
I'd really appreciate any explanation, advice, or alternative ways to model my data.
Thanks
EDIT
What I've tried:
# packages
library(mgcv)
library(marginaleffects)

#create the dataset
A <- sample(x = 0:1000, size = 5000, replace = TRUE)
B <- sample(x = 0:1000, size = 5000, replace = TRUE)
C <- sample(x = 0:1000, size = 5000, replace = TRUE)
D <- sample(x = 0:1000, size = 5000, replace = TRUE)
log.R <- log(rbeta(5000, 5, 10) * 10)       # log rate
log.Y <- log(rgamma(5000, 10, 20) * 10000)  # log demand
mydata <- data.frame(A, B, C, D, log.R, log.Y)

# the model: each smooth of a customer variable multiplies log.R (varying-coefficient terms)
model <- gam(log.Y ~ s(A, by = log.R) + s(B, by = log.R) + s(C, by = log.R) + s(D, by = log.R),
             data = mydata, method = "REML")

mfx <- marginaleffects(model, variables = "log.R", eps = 10^-5)
head(mfx)
mfx returns a 'dydx' column; is this the elasticity for the data used to fit the model?
And when I apply this model to newdata, I get an error:
newdat = data.frame(A = 750, B = 500, C = 398, D = 740)
marginaleffects(model, variables = "log.R", eps = 10^-5, newdata= newdat, slope = 'dydx')
Error: There is no valid predictor variable. Please change the `variables` argument or supply a new data frame to the `newdata` argument.
What should I do?
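For what it's worth, here is one possible workaround, sketched under two assumptions: that the fitted model above is the one of interest, and that log.R can be set to an arbitrary placeholder in newdata. The second assumption holds because every term has the form f(X) * log.R, so the derivative with respect to log.R does not involve log.R at all. Note also that the fitted model has no standalone log.R term, so its derivative is just the sum of the smooths; the "1 +" part of the PED formula would require adding log.R as a separate parametric term.
# Sketch: recover d(log.Y)/d(log.R) = sum_i f(X_i) from the varying-coefficient GAM.
# log.R = 1 is a placeholder so each term s(X, by = log.R) evaluates to f(X) * 1.
newdat <- data.frame(A = 750, B = 500, C = 398, D = 740, log.R = 1)
term_preds <- predict(model, newdata = newdat, type = "terms")
ped_hat <- rowSums(term_preds)  # estimated sum of f(X_i) for this customer
ped_hat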
I am trying to use kfold CV as a means of evaluating a model run using brms and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob , type = 'l')
sim_data <-
  expand_grid(fish_id = seq_len(n_fish_per_length),
              length = length) %>%
  mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
  mutate(is_carp = rbinom(n = n(), size = 1, prob = prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method = "glm", formula = y ~ x + I(x^2),
              method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
                          data = sim_data,
                          family = bernoulli(link = "logit"),
                          warmup = 2500,
                          iter = 5000,
                          chains = 4,
                          inits = "0",
                          cores = 4,
                          seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0) and my reading suggests that I need to use stratified k-fold CV to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect but I'm not sure what the best way is to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify this to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error. Likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing/misinterpreting something here? I'm assuming that there is a very simple solution here that I am overlooking but appreciate any thoughts.
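One route that may work (a sketch based on my reading of the brms documentation, not verified on this model): kfold() can also take a vector of fold indices through its folds argument, so the stratified assignment from loo can be passed in directly.
# Sketch: hand the stratified fold assignment to brms::kfold() directly.
# Assumes `folds` accepts an integer vector of fold indices (per the brms docs).
fold_ids <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
kfold_strat <- kfold(Bayes_Model_Binary, folds = fold_ids, save_fits = TRUE)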
After some additional digging and learning how to access information about each fold in the analysis, I was able to determine that the structure of the data (proportion of 0s and 1s in the response) is maintained using the default settings in the kfold() function. To do this I used the following code.
First, save the kfold CV analysis as an object.
kfold1 <- kfold(Bayes_Model_Binary, K = 10, save_fits = T)
kfold1$fits is a list of the model fitting results and the observations used in the test data set (omitted) for each fold.
From this information, I created a loop to print the proportion of observations in each training data set where is_carp = 1 (could also do this for each test data set) with the following code.
for (i in 1:10) {
  print(length(which(sim_data$is_carp[-kfold1$fits[i, ]$omitted] == 1)) /
          nrow(sim_data[-kfold1$fits[i, ]$omitted, ]))
}
[1] 0.1859186
[1] 0.1925193
[1] 0.1991199
[1] 0.1914191
[1] 0.1881188
[1] 0.1848185
[1] 0.1936194
[1] 0.1980198
[1] 0.190319
[1] 0.1870187
and it's easy to then compare these proportions with the proportion of observations where is_carp = 1 from the original data set.
length(which(sim_data$is_carp == 1)) / nrow(sim_data)
[1] 0.1910891
I am attempting to fit a GAM with interactions between days (the tt variable) and lagged predictors (lagged by 2 periods via Lag(..., k = 2)), using 10 basis functions per smooth.
library(mgcv)
library(quantmod)  # assumed source of Lag(x, k = ...); swap in whichever package your Lag() comes from

# Example data
data <- data.frame(
  tt = 1:107,  # days
  pol = (sample.int(101, size = 107, replace = TRUE) - 1) / 100,
  at_rec = sample.int(101, size = 107, replace = TRUE),
  w_cas = sample.int(2000, size = 107, replace = TRUE)
)

# model
gam1 <- gam(pol ~ s(tt, k = 10) +
              s(tt, by = Lag(at_rec, k = 2), k = 10) +
              s(tt, by = Lag(w_cas, k = 2), k = 10),
            data = data, method = "GACV.Cp")
summary(gam1)
# while making newdata
newdata <- data.frame(tt = c(12, 22), at_rec = c(44, 34), w_cas = c(2011, 2455))
# and prediction
predict(gam1, newdata = newdata, se.fit = TRUE)
I got this error:
"Error in PredictMat(object$smooth[[k]], data) : Can't find by variable"
How to predict such a model with new data?
I'm 99.9% sure that the predict method can't find the by terms because they are functions of variables and it's looking for variables with exactly the names you provided: "Lag(at_rec, k = 2)".
Try adding those lagged variables to your data frame as explicit variables, then refit the model, and it should work:
data <- transform(data,
                  lag_at_rec = Lag(at_rec, k = 2),
                  lag_w_cas = Lag(w_cas, k = 2))

gam1 <- gam(pol ~ s(tt, k = 10) +
              s(tt, by = lag_at_rec, k = 10) +
              s(tt, by = lag_w_cas, k = 10),
            data = data, method = "GACV.Cp")
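A sketch of the prediction step after that refit (the lagged values below are arbitrary placeholders): newdata must now contain the explicit lagged columns, since after refitting they are ordinary model variables.
# Sketch: predict with newdata that carries the explicit lagged variables.
# The lag_at_rec / lag_w_cas values are placeholders for illustration only.
newdata <- data.frame(tt = c(12, 22),
                      lag_at_rec = c(44, 34),
                      lag_w_cas = c(2011, 2455))
predict(gam1, newdata = newdata, se.fit = TRUE)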
I'm trying to estimate an Okun's law equation with a dlm using the dlm package in R. I can estimate the non-time varying model using nls as follows:
const_coef <- nls(formula = dur ~ b1*dur_lag1 + b2*(d2lgdp - b0) + b3*d2lrulc_lag2,
                  start = list(b0 = 0.1, b1 = 0.1, b2 = 0.1, b3 = 0.1),
                  data = mod_data)
The dlm model I want to estimate allows b0 and b1 above to follow random walks. I can do this in EViews by declaring the measurement equation and appending the states (below is some code provided by the authors of the original paper, which I can replicate):
'==========================
' SPECIFY THE KALMAN FILTER
'==========================
'Priors on state variables
vector(2) mprior
mprior(1) = 4 'Prior on starting value for trend GDP growth (annual average GDP growth over 1950s)
mprior(2) = 0 'Prior on starting value for lagged dependent variable
sym(2) vprior
vprior(1,1) = 5 'Prior on variance of trend GDP growth (variance of annual GDP growth over 1950s)
vprior(2,2) = 1 'Prior on variance of lagged dependent variable
'Specify coefficient vector
coef(8) ckf
'Declare state space
sspace ss1
ss1.append dur = lag*dur(-1) + ckf(2)*(d2lgdp-trend)+ckf(3)*D2LRULC(-2)+[var=exp(ckf(4))] 'Measurement equation
ss1.append #state trend = 1*trend(-1) + [var = exp(ckf(5))] 'State equation for trend GDP growth (random walk)
ss1.append #state lag = 1*lag(-1) + [var = exp(ckf(6))] 'State equation for lagged dependent variable (random walk)
'Apply priors to state space
ss1.append #mprior mprior
ss1.append #vprior vprior
'Set parameter starting values
param ckf(2) -0.0495 ckf(3) 0.01942 ckf(4) -2.8913 ckf(5) -4.1757 ckf(6) -6.2466 'starting values for parameters
'=====================
' ESTIMATE THE MODEL
'=====================
'Estimate state space
smpl %estsd %ested 'Estimation sample
ss1.ml(m=500,showopts) 'Estimate Kalman filter by maximum likelihood
freeze(mytab) ss1.stats
I'm really not sure how to do this with the dlm package. I've tried the following:
buildSS <- function(v){
  dV <- exp(v[1])            # Variance of the measurement equation (ckf4)
  dW <- c(exp(v[2]),         # variance of the lagged dep (ckf6)
          0,                 # variance of the coef on d2lgdp, ckf(2), set to 0
          0,                 # variance of the coef on d2lrulc, ckf(3), set to 0
          exp(v[3]))         # variance of the random walk intercept (ckf5)
  beta.vec <- c(1, v[4], v[5], 1)  # Params ckf(2), ckf(3)
  okuns <- dlmModReg(mod_data.tvp[,-1], addInt = TRUE, dV = dV, dW = dW, m0 = beta.vec)
  return(okuns)
}
# Set parameter starting values
ckf4Guess <- -2.8913
ckf2guess <- -0.0495
ckf3guess <- 0.01942
ckf5guess <- -4.1757
ckf6guess <- -6.2466

params <- c(ckf4Guess,
            ckf5guess,
            ckf6guess,
            ckf2guess,
            ckf3guess)

tvp_mod.mle <- dlmMLE(mod_data.tvp[,"dur"], parm = params, build = buildSS)
tvp_mod <- buildSS(tvp_mod.mle$par)
tvp_filter <- dlmFilter(mod_data$dur, tvp_mod)
The above code runs, but the outputs are not correct. I am not specifying the states properly. Does anyone have experience building DLMs with multivariate regression in R?
I think I have gotten to a solution: I've managed to recreate the estimates in the paper (which estimates this model in EViews), and checked the results against EViews directly.
#--------------------------------------------------------------------------------------------------------------------------
# tvp model full model - dur = alpha*dur(-1)+ beta(dgdp-potential) + gamma*wages
#--------------------------------------------------------------------------------------------------------------------------
# Construct DLM
OkunsDLMfm <- dlm(
FF = matrix(c(1,1,1,1),ncol = 4, byrow = TRUE),
V = matrix(1),
GG = matrix(c(1,0,0,0,
0,1,0,0,
0,0,1,0,
0,0,0,1), ncol = 4, byrow = TRUE),
W = matrix(c(1,0,0,0,
0,1,0,0,
0,0,1,0,
0,0,0,1), ncol = 4, byrow = TRUE),
JFF = matrix(c(1,2,3,0),ncol = 4, byrow = TRUE),
X = cbind(mod_data$dur_lag1,mod_data$d2lgdp, mod_data$d2lrulc_lag2), # lagged dep var, dgdp, wages.
m0 = c(0,0,0,0),
C0 = matrix(c(1e+07,0,0,0,
0,1e+07,0,0,
0,0,1e+07,0,
0,0,0,1e+07), ncol = 4, byrow = TRUE)
)
buildOkunsFM <- function(p){
  V(OkunsDLMfm) <- exp(p[2])
  GG(OkunsDLMfm)[1,1] <- 1
  GG(OkunsDLMfm)[2,2] <- 1
  GG(OkunsDLMfm)[3,3] <- 1
  GG(OkunsDLMfm)[4,4] <- 1
  W(OkunsDLMfm)[1,1] <- exp(p[3])
  W(OkunsDLMfm)[2,2] <- 0
  W(OkunsDLMfm)[3,3] <- 0
  W(OkunsDLMfm)[4,4] <- exp(p[4])
  m0(OkunsDLMfm) <- c(0, 0, 0, p[1]*4)
  C0(OkunsDLMfm)[1,1] <- 1
  C0(OkunsDLMfm)[4,4] <- 5
  return(OkunsDLMfm)
}
okuns.estfm <- dlmMLE(y = mod_data$dur, parm = c(-0.049,-1.4,-6,-5), build = buildOkunsFM)
OkunsDLM1fm <- buildOkunsFM(okuns.estfm$par)
The time-varying level, the estimate of potential output, is derived by dividing the fourth element of the state vector by the second and multiplying by negative one.
Not sure if this is the best way to specify the DLM, but the results from the model are very close (within 0.01) to the results from EViews. That being said, I'm very open to other specifications.
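A sketch of that last calculation, assuming the smoothed states are what's wanted (the object names are the ones defined above; dlmSmooth and dropFirst come from the dlm package):
# Sketch: recover the time-varying potential output estimate described above.
# dropFirst() removes the pre-sample state at time 0.
okuns.smooth <- dlmSmooth(mod_data$dur, OkunsDLM1fm)
states <- dropFirst(okuns.smooth$s)
potential <- -states[, 4] / states[, 2]  # fourth state divided by second, times -1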
I am using the 'KFAS' package in R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = Z_t * x_t + \epsilon_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \epsilon_t ~ N(0, H_t) and \eta_t ~ N(0, Q_t).
So, I want to estimate the variances H_t and Q_t, but also T_t, the AR(1) coefficient. My code is as follows:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
fit <- fitSSM(ss_model, inits = c(0,0.6,0), method = 'L-BFGS-B')
But it returns: "Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)),: System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07"
The NA definitions for the variances work well, as documented in the package's paper. However, it seems this cannot be done for the AR coefficient. Does anyone know how I can do this?
Note that I am aware of the SSMarima function, which eases the definition of the transition equation for ARIMA models. Although I am able to estimate the AR(1) coefficient and Q_t this way, I still cannot estimate the variance of \epsilon_t (H_t). Moreover, I am migrating my Kalman filter code from EViews to R, so I need to learn SSMcustom for other, more complicated models.
Thanks!
It seems that you are missing something in your example, as your error message comes from the function fitSSM. If you want to use fitSSM for estimating general state-space models, you need to provide your own model-updating function; the default behaviour can only handle NAs in the covariance matrices H and Q. The main goal of fitSSM is just to get you started with simple cases. For complex models and/or large data, I would recommend writing your own objective function (with the help of the logLik method) and using your favourite numerical optimization routine manually for maximum performance. Something like this:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
objf <- function(pars, model, estimate = TRUE) {
  model$H[1] <- pars[1]
  model$T[1] <- pars[2]
  model$Q[1] <- pars[3]
  if (estimate) {
    -logLik(model)
  } else {
    model
  }
}

opt <- optim(c(1, 0.5, 1), objf, method = "L-BFGS-B",
             lower = c(0, -0.99, 0), upper = c(100, 0.99, 100), model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)
Same with fitSSM:
updatefn <- function(pars, model) {
  model$H[1] <- pars[1]
  model$T[1] <- pars[2]
  model$Q[1] <- pars[3]
  model
}

fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
              lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
identical(ss_model_opt, fit$model)
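As a quick sanity check (a sketch; the reference values come from the simulation above, where the AR coefficient is 0.6 and the state innovation variance is 0.5):
# Sketch: inspect the estimated system matrices of the fitted model.
fit$model$T[1]  # estimated AR(1) coefficient, roughly 0.6
fit$model$Q[1]  # estimated state innovation variance, roughly 0.5
fit$model$H[1]  # estimated measurement noise variance
out <- KFS(fit$model)  # run the Kalman filter and smoother on the fitted model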
I am trying to use XGBoost to model claims frequency of data generated from unequal length exposure periods, but have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset - are you able to do this in XGBoost?
(A similar question was posted here: xgboost, offset exposure?)
To illustrate the issue, the R code below generates some data with the fields:
x1, x2 - factors (either 0 or 1)
exposure - length of policy period on observed data
frequency - mean number of claims per unit exposure
claims - number of observed claims ~Poisson(frequency*exposure)
The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.
Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.
The code tries to predict this using XGBoost by:
Setting exposure as a weight in the model matrix
Setting log(exposure) as an offset
Below these, I've shown how I would handle the situation for a tree (rpart) or gbm.
set.seed(1)
size<-10000
d <- data.frame(
  x1 = sample(c(0,1), size, replace = T, prob = c(0.5, 0.5)),
  x2 = sample(c(0,1), size, replace = T, prob = c(0.5, 0.5)),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)
#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
"objective" = "count:poisson"
, "eval_metric" = "logloss"
, "eta" = 1
, "subsample" = 1
, "colsample_bytree" = 1
, "min_child_weight" = 1
, "max_depth" = 2
)
## 1 - set weight in xgb.Matrix
xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_1 <- predict(xgb, xgtrain)
## 2 - set as offset in xgb.Matrix
xgtrain.mf <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain <- xgb.DMatrix(xgtrain.m,label = d$claims)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_2 <- predict(xgb, xgtrain)
#### Fit a tree
require(rpart)
d[,"tree_response"] <- cbind(d$exposure,d$claims)
tree <- rpart(tree_response ~ x1 + x2,
data = d,
method = "poisson")
d$Tree_F <- predict(tree, newdata = d)
#### Fit a GBM
require(gbm)
gbm <- gbm(claims~x1+x2+offset(log(exposure)),
data = d,
distribution = "poisson",
n.trees = 1,
shrinkage=1,
interaction.depth=2,
bag.fraction = 0.5)
d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")
At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weight=exposure. That is, normalize your count by exposure to get frequency, and model frequency with exposure as the weight. Your estimated coefficients should be the same in both cases when using glm for Poisson regression. Try it for yourself using a sample data set.
I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect setting your target variable as frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures are varying.
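A sketch of that suggestion applied to the simulated data above, reusing param0 (I haven't verified that count:poisson with a non-integer label reproduces the weighted-frequency GLM exactly, so treat this as an experiment rather than a confirmed recipe):
## Sketch: model frequency (claims / exposure) with exposure as the weight.
xgtrain_freq <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]),
                            label = d$claims / d$exposure,
                            weight = d$exposure)
xgb_freq <- xgb.train(nrounds = 1, params = param0, data = xgtrain_freq)
d$XGB_P_freq <- predict(xgb_freq, xgtrain_freq)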
I have now worked out how to do this using setinfo to change the base_margin attribute to be the offset (as a linear predictor), i.e.:
setinfo(xgtrain, "base_margin", log(d$exposure))