Related
I'm trying to estimate the price elasticity of demand for each customer using a GAM, with a model like this:
$$\ln D = \ln P + \ln P \cdot \sum_{i=1}^{20} f(X_i)$$
$$PED = \frac{\partial \ln D}{\partial \ln P} = 1 + \sum_{i=1}^{20} f(X_i)$$
where $D$ is demand, $P$ is the rate (price), $PED$ is the price elasticity of demand, and $X_i$ is a set of customer variables.
Since $PED$ is not observable, I want to estimate it from the GAM fitted to log demand, but I am having some difficulty working out how to do that.
I tried to extract the individual splines to calculate $PED$, but failed. I know there is a package called gratia with a derivatives() function, but I don't understand how to use it to calculate $PED$.
Once the demand model is created, I will need to estimate the price elasticity of demand for each customer, but for these customers I don't have the rate variable, only the 20 personal variables. (If I read my formula correctly, $PED = 1 + \sum_{i=1}^{20} f(X_i)$ depends only on those variables, so in principle the rate should not be needed.)
I have read some related links:
https://stats.stackexchange.com/questions/495775/first-derivative-of-fitted-gam-changes-according-to-specified-model-distribution
https://stats.stackexchange.com/questions/590167/how-can-i-calculate-a-derivative-of-a-global-smooth-and-group-level-smooths-with
https://stats.stackexchange.com/questions/32013/what-is-the-mathematical-model-formula-corresponding-to-this-gam-model-fit-in-r
I'd really appreciate any explanation, advice, or other way to model my data.
Thanks
EDIT
What I've tried:
library(mgcv)
library(marginaleffects)

# create the dataset
A <- sample(x = 0:1000, size = 5000, replace = TRUE)
B <- sample(x = 0:1000, size = 5000, replace = TRUE)
C <- sample(x = 0:1000, size = 5000, replace = TRUE)
D <- sample(x = 0:1000, size = 5000, replace = TRUE)
log.R <- log(rbeta(5000, 5,10)*10) #log rate
log.Y <- log(rgamma(5000, 10, 20)*10000) #log demand
mydata <- data.frame(A, B, C, D, log.R, log.Y)
#the model
model <- gam(log.Y ~ s(A, by=log.R) + s(B, by=log.R) + s(C, by=log.R) + s(D, by=log.R), data = mydata, method = "REML")
mfx <- marginaleffects(model, variables = "log.R", eps = 10^-5)
head(mfx)
mfx returns a 'dydx' column. Is this the elasticity for the data used to fit the model?
And when I apply this model to new data, I get an error:
newdat = data.frame(A = 750, B = 500, C = 398, D = 740)
marginaleffects(model, variables = "log.R", eps = 10^-5, newdata= newdat, slope = 'dydx')
Error: There is no valid predictor variable. Please change the `variables` argument or supply a new data frame to the `newdata` argument.
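I suspect newdata must contain every variable used in the model, including log.R; a guess (the log.R value here is arbitrary, not from my data):
newdat <- data.frame(A = 750, B = 500, C = 398, D = 740, log.R = 1)
marginaleffects(model, variables = "log.R", eps = 10^-5, newdata = newdat)
but I'm not sure this is the right fix.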
What should I do?
I am attempting to fit a GAM with interactions between days (the tt variable) and lagged predictors (lag k = 2), using k = 10 basis functions per smooth.
library(mgcv)
library(quantmod)  # assumed source of Lag(); Hmisc also provides a Lag()
# Example data
data <- data.frame(
  tt = 1:107,  # days
  pol = (sample.int(101, size = 107, replace = TRUE) - 1) / 100,
  at_rec = sample.int(101, size = 107, replace = TRUE),
  w_cas = sample.int(2000, size = 107, replace = TRUE)
)
# model
gam1 <- gam(pol ~ s(tt, k = 10) +
              s(tt, by = Lag(at_rec, k = 2), k = 10) +
              s(tt, by = Lag(w_cas, k = 2), k = 10),
            data = data, method = "GACV.Cp")
summary(gam1)
# while making newdata
newdata <- data.frame(tt = c(12, 22), at_rec = c(44, 34), w_cas = c(2011, 2455))
# and prediction
predict(gam1, newdata = newdata, se.fit = TRUE)
I got this error
"Error in PredictMat(object$smooth[[k]], data) : Can't find by variable"
How to predict such a model with new data?
I'm 99.9% sure that the predict method can't find the by terms because they are functions of variables and it's looking for variables with exactly the names you provided: "Lag(at_rec, k = 2)".
Try adding those lagged variables to your data frame as explicit variables, then refit the model; it should work:
data <- transform(data,
                  lag_at_rec = Lag(at_rec, k = 2),
                  lag_w_cas = Lag(w_cas, k = 2))
gam1 <- gam(pol ~ s(tt, k = 10) +
              s(tt, by = lag_at_rec, k = 10) +
              s(tt, by = lag_w_cas, k = 10),
            data = data, method = "GACV.Cp")
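Then include the lagged variables explicitly in newdata when predicting. A sketch, reusing the values from your example (illustrative lagged values, not computed from the series):
newdata <- data.frame(tt = c(12, 22),
                      lag_at_rec = c(44, 34),
                      lag_w_cas = c(2011, 2455))
predict(gam1, newdata = newdata, se.fit = TRUE)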
I am trying to implement lasso linear regression. I train my model, but when I try to make predictions on unknown data it gives me the following error:
Error in cbind2(1, newx) %*% nbeta :
invalid class 'NA' to dup_mMatrix_as_dgeMatrix
A summary of my data is below. I want to predict the unknown percent_gc; I initially train my model using data for which percent_gc is known:
library(tibble)
library(dplyr)   # for %>% and select() used below
library(glmnet)

set.seed(1)
### training data
data.all <- tibble(description = c('Xylanimonas cellulosilytica XIL07, DSM 15894','Teredinibacter turnerae T7901',
'Desulfotignum phosphitoxidans FiPS-3, DSM 13687','Brucella melitensis bv. 1 16M'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('63.93','69.372','3.493.11','44.393.704'),
Longitude = c('-22.1','88.235','134.082.527','-0.130781'),
genome_size = c(8361599,2469596,2158157,3207552),
percent_gc = c(34,24,55,44),
percent_psuedo = c(0.0032987747,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
###data for prediction
data.prediction <- tibble(description = c('Liberibacter crescens BT-1','Saprospira grandis Lewin',
'Sinorhizobium meliloti AK83','Bifidobacterium asteroides ATCC 25910'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('39.53','69.372','5.493.12','44.393.704'),
Longitude = c('20.1','-88.235','134.082.527','-0.130781'),
genome_size = c(474832,2469837,2158157,3207552),
percent_gc = c(NA,NA,NA,NA),
percent_psuedo = c(0.0074639239,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
x <- model.matrix(percent_gc ~ ., data.all)
y <- data.all$percent_gc
cv.out <- cv.glmnet(x, y, alpha = 1, family = "gaussian")
best.lambda <- cv.out$lambda.min
fit <- glmnet(x, y, alpha = 1)
I then want to make predictions for the rows where percent_gc is not known.
newX = matrix(data = data.prediction %>% select(-percent_gc))
data.prediction$percent_gc <-
predict(object = fit ,type="response", s=best.lambda, newx=newX)
And this generates the error I mentioned above.
I don't understand what format newX should be in to get rid of this error. Insights would be appreciated.
I could not really figure out how to construct an appropriate matrix, but the glmnetUtils package provides functionality to fit a formula directly on a data frame and to predict from it. With this I got it to predict values:
library(glmnetUtils)
fit <- glmnet(percent_gc ~ ., data.all, alpha = 1)
cv.out <- cv.glmnet(percent_gc ~ ., data.all, alpha = 1, family = "gaussian")
best.lambda <- cv.out$lambda.min
predict(object = fit, data.prediction, s = best.lambda)
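If you want to stay with base glmnet instead, a minimal sketch (assuming, as here, that the training and prediction frames have the same columns) is to build one model matrix over the combined data, so both parts get identical dummy coding, and then split it:
# sketch: one shared model matrix keeps training/prediction columns aligned
all.data <- rbind(data.all, data.prediction)
X <- model.matrix(~ . - percent_gc, data = all.data)
x.train <- X[seq_len(nrow(data.all)), ]
x.new <- X[-seq_len(nrow(data.all)), ]
fit2 <- glmnet(x.train, data.all$percent_gc, alpha = 1)
predict(fit2, newx = x.new, s = best.lambda)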
I'm trying to estimate an Okun's law equation with a dlm using the dlm package in R. I can estimate the non-time varying model using nls as follows:
const_coef <- nls(formula = dur ~ b1*dur_lag1 + b2*(d2lgdp - b0) + b3*d2lrulc_lag2,
                  start = list(b0 = 0.1, b1 = 0.1, b2 = 0.1, b3 = 0.1),
                  data = mod_data)
The dlm model I want to estimate allows b1 and b0 above to follow random walks. I can do this in Eviews by declaring the measurement equation and appending the states (below is some code provided by the authors of the original paper, whose results I can replicate):
'==========================
' SPECIFY THE KALMAN FILTER
'==========================
'Priors on state variables
vector(2) mprior
mprior(1) = 4 'Prior on starting value for trend GDP growth (annual average GDP growth over 1950s)
mprior(2) = 0 'Prior on starting value for lagged dependent variable
sym(2) vprior
vprior(1,1) = 5 'Prior on variance of trend GDP growth (variance of annual GDP growth over 1950s)
vprior(2,2) = 1 'Prior on variance of lagged dependent variable
'Specify coefficient vector
coef(8) ckf
'Declare state space
sspace ss1
ss1.append dur = lag*dur(-1) + ckf(2)*(d2lgdp-trend)+ckf(3)*D2LRULC(-2)+[var=exp(ckf(4))] 'Measurement equation
ss1.append #state trend = 1*trend(-1) + [var = exp(ckf(5))] 'State equation for trend GDP growth (random walk)
ss1.append #state lag = 1*lag(-1) + [var = exp(ckf(6))] 'State equation for lagged dependent variable (random walk)
'Apply priors to state space
ss1.append #mprior mprior
ss1.append #vprior vprior
'Set parameter starting values
param ckf(2) -0.0495 ckf(3) 0.01942 ckf(4) -2.8913 ckf(5) -4.1757 ckf(6) -6.2466 'starting values for parameters
'=====================
' ESTIMATE THE MODEL
'=====================
'Estimate state space
smpl %estsd %ested 'Estimation sample
ss1.ml(m=500,showopts) 'Estimate Kalman filter by maximum likelihood
freeze(mytab) ss1.stats
I'm really not sure how to do this with the dlm package. I've tried the following:
buildSS <- function(v){
  dV <- exp(v[1])    # variance of the measurement equation (ckf4)
  dW <- c(exp(v[2]), # variance of the lagged dep (ckf6)
          0,         # variance of the coef on d2lgdp, ckf(2), set to 0
          0,         # variance of the coef on d2lrulc, ckf(3), set to 0
          exp(v[3])  # variance of the random walk intercept (ckf5)
  )
  beta.vec <- c(1, v[4], v[5], 1) # params ckf(2), ckf(3)
  okuns <- dlmModReg(mod_data.tvp[,-1], addInt = TRUE, dV = dV, dW = dW, m0 = beta.vec)
  return(okuns)
}
# Set parameter starting values
ckf4guess <- -2.8913
ckf2guess <- -0.0495
ckf3guess <- 0.01942
ckf5guess <- -4.1757
ckf6guess <- -6.2466
params <- c(ckf4guess, ckf5guess, ckf6guess, ckf2guess, ckf3guess)
tvp_mod.mle <- dlmMLE(mod_data.tvp[,"dur"] , parm = params, build = buildSS)
tvp_mod <- buildSS(tvp_mod.mle$par)
tvp_filter <- dlmFilter(mod_data$dur,tvp_mod)
The above code runs, but the outputs are not correct; I am not specifying the states properly. Does anyone have experience building dlms with multivariate regression in R?
I think I have gotten to a solution: I've managed to recreate the estimates in the paper, which fits this model in Eviews (and I checked my results against Eviews).
#--------------------------------------------------------------------------------------------------------------------------
# tvp model full model - dur = alpha*dur(-1)+ beta(dgdp-potential) + gamma*wages
#--------------------------------------------------------------------------------------------------------------------------
# Construct DLM
OkunsDLMfm <- dlm(
FF = matrix(c(1,1,1,1),ncol = 4, byrow = TRUE),
V = matrix(1),
GG = matrix(c(1,0,0,0,
0,1,0,0,
0,0,1,0,
0,0,0,1), ncol = 4, byrow = TRUE),
W = matrix(c(1,0,0,0,
0,1,0,0,
0,0,1,0,
0,0,0,1), ncol = 4, byrow = TRUE),
JFF = matrix(c(1,2,3,0),ncol = 4, byrow = TRUE),
X = cbind(mod_data$dur_lag1,mod_data$d2lgdp, mod_data$d2lrulc_lag2), # lagged dep var, dgdp, wages.
m0 = c(0,0,0,0),
C0 = matrix(c(1e+07,0,0,0,
0,1e+07,0,0,
0,0,1e+07,0,
0,0,0,1e+07), ncol = 4, byrow = TRUE)
)
buildOkunsFM <- function(p){
V(OkunsDLMfm) <- exp(p[2])
GG(OkunsDLMfm)[1,1] <- 1
GG(OkunsDLMfm)[2,2] <- 1
GG(OkunsDLMfm)[3,3] <- 1
GG(OkunsDLMfm)[4,4] <- 1
W(OkunsDLMfm)[1,1] <- exp(p[3])
W(OkunsDLMfm)[2,2] <- 0
W(OkunsDLMfm)[3,3] <- 0
W(OkunsDLMfm)[4,4] <- exp(p[4])
m0(OkunsDLMfm) <- c(0,0,0,p[1]*4)
C0(OkunsDLMfm)[1,1] <- 1
C0(OkunsDLMfm)[4,4] <- 5
return(OkunsDLMfm)
}
okuns.estfm <- dlmMLE(y = mod_data$dur, parm = c(-0.049,-1.4,-6,-5), build = buildOkunsFM)
OkunsDLM1fm <- buildOkunsFM(okuns.estfm$par)
The time-varying level, i.e. the estimate of potential output, is derived by dividing the fourth element of the state vector by the second and multiplying by negative one.
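In code, that is something like the following sketch (using the objects above; dlmSmooth returns the smoothed state estimates, and dropFirst removes the time-zero prior):
okuns.smooth <- dlmSmooth(mod_data$dur, OkunsDLM1fm)
states <- dropFirst(okuns.smooth$s)
potential <- -states[, 4] / states[, 2]  # time-varying potential output estimate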
I'm not sure this is the best way to specify the DLM, but the results from the model are very close (within 0.01) to the results from Eviews. That said, I'm very open to other specifications.
I am trying to use XGBoost to model claims frequency of data generated from unequal length exposure periods, but have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset - are you able to do this in XGBoost?
(A similar question was posted here: xgboost, offset exposure?)
To illustrate the issue, the R code below generates some data with the fields:
x1, x2 - factors (either 0 or 1)
exposure - length of policy period on observed data
frequency - mean number of claims per unit exposure
claims - number of observed claims ~Poisson(frequency*exposure)
The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.
Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.
The code tries to predict this using XGBoost by:
1. setting exposure as a weight in the model matrix;
2. setting log(exposure) as an offset.
Below these, I've shown how I would handle the situation for a tree (rpart) or a gbm.
set.seed(1)
size<-10000
d <- data.frame(
x1 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
x2 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
exposure = runif(size, 1, 10)*0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)
#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
"objective" = "count:poisson"
, "eval_metric" = "logloss"
, "eta" = 1
, "subsample" = 1
, "colsample_bytree" = 1
, "min_child_weight" = 1
, "max_depth" = 2
)
## 1 - set weight in xgb.Matrix
xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_1 <- predict(xgb, xgtrain)
## 2 - set as offset in xgb.Matrix
xgtrain.mf <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain <- xgb.DMatrix(xgtrain.m,label = d$claims)
xgb = xgb.train(
nrounds = 1
, params = param0
, data = xgtrain
)
d$XGB_P_2 <- predict(xgb, xgtrain)
#### Fit a tree
require(rpart)
d[,"tree_response"] <- cbind(d$exposure,d$claims)
tree <- rpart(tree_response ~ x1 + x2,
data = d,
method = "poisson")
d$Tree_F <- predict(tree, newdata = d)
#### Fit a GBM
require(gbm)
gbm <- gbm(claims~x1+x2+offset(log(exposure)),
data = d,
distribution = "poisson",
n.trees = 1,
shrinkage=1,
interaction.depth=2,
bag.fraction = 0.5)
d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")
At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family = poisson(link = 'log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family = poisson(link = 'log') and weight = exposure. That is, normalize your count by exposure to get frequency, and model frequency with exposure as the weight. Your estimated coefficients should be the same in both cases when using glm for Poisson regression. Try it for yourself using a sample data set:
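For example, on the simulated data d from the question (a quick sketch; glm will warn about non-integer responses in the weighted version, but the estimates are unaffected):
glm.offset <- glm(claims ~ x1 + x2 + offset(log(exposure)),
                  family = poisson(link = "log"), data = d)
glm.weight <- glm(I(claims / exposure) ~ x1 + x2,
                  family = poisson(link = "log"), weights = exposure, data = d)
cbind(coef(glm.offset), coef(glm.weight))  # the two columns should match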
I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect setting your target variable as frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures are varying.
I have now worked out how to do this, using setinfo to change the base_margin attribute to the offset (on the scale of the linear predictor), i.e.:
setinfo(xgtrain, "base_margin", log(d$exposure))
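A minimal sketch of the full workflow on the question's data (my understanding is that predictions then include the base margin, so the response-scale prediction is frequency times exposure; divide by exposure to recover frequency):
xgtrain <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]), label = d$claims)
setinfo(xgtrain, "base_margin", log(d$exposure))
xgb.off <- xgb.train(nrounds = 1, params = param0, data = xgtrain)
d$XGB_P_3 <- predict(xgb.off, xgtrain) / d$exposure  # predicted frequency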