How to find an optimal adstock decay factor for an independent variable in panel analysis in R?

I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to find the decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function that generates the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code does return an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) for all the other variables as well.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate if someone could take a look and see what is wrong.
Sample Data
# Load required packages
library(plm)
library(plyr)

# Set panel data structure
alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm + 0

# Create 1-month time lag for TRx
alldata <- ddply(
  alldata, .(DMA), transform,
  # This assumes that the data is sorted
  Mkt_TRx_norm_prev = c(NA, Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <- function(decay, period, data_vector, pool_vector = 0){
  data_vector <- alldata$var
  pool_vector <- alldata$DMA
  data2 <- data_vector
  l <- length(data_vector)
  # If no pool, apply zero to vector
  if (length(pool_vector) == 1) pool_vector <- rep(0, l)
  # Outer loop: extract data to decay from observation i
  for (i in 1:l){
    x <- data_vector[i]
    # Inner loop: apply decay onto following observations after i
    for (j in 1:min(period, l)){
      # Constrain decay to same pool (if data is pooled)
      if (pool_vector[i] == pool_vector[min(i + j, l)]) {
        data2[(i + j)] <- data2[(i + j)] + (x * (decay)^j)
      }
    }
  }
  # Reduce length of edited data to equal length of initial data
  data2 <- data2[1:l]
  # Regression - excludes NA values
  alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
  var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2,
                data = alldata, model = "within", na.action = na.exclude)
  se <- summary(var_fe)$coefficients["data2", "Std. Error"]
  return(se)
}

# Optimize decay for adstock variable
result <- optimize(adstockreg, interval = c(0, 1), period = 6)
print(result)
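For context, the kind of carry-over I am trying to capture is the usual recursive adstock within each DMA, roughly like the sketch below (this is a simplified recursive form rather than my windowed loop above; adstock_vec and the decay value 0.5 are just illustrative):
# Minimal sketch of a recursive geometric adstock applied within each DMA.
# Assumes rows are already sorted by DMA and Month_Num; adstock_vec is only
# an illustrative helper, not part of the code above.
adstock_vec <- function(x, decay) {
  out <- x
  if (length(x) > 1) {
    for (t in 2:length(x)) out[t] <- x[t] + decay * out[t - 1]
  }
  out
}
alldata <- ddply(alldata, .(DMA), transform,
                 SEM_Br_act_norm_adstock = adstock_vec(SEM_Br_act_norm, 0.5))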

Related

Failing regression loop in R

I wish to regress certain factor exposures calculated for a portfolio on that portfolio's returns over 550 months (monthly observations), i.e. a Fama-MacBeth regression.
So essentially, I am regressing 550 return observations on the constant factor exposures calculated previously. My loop thus far is as follows:
# Second regression
library(sandwich)
library(broom)
library(tibble)

# Choose file containing factor exposures (note: data columns = exposures, rows = portfolios)
f <- file.choose()
betas <- read.csv(f)
BTM2R <- betas[1, 3]
BIPR  <- betas[1, 4]
BInfR <- betas[1, 5]
BUnR  <- betas[1, 6]
BOilR <- betas[1, 7]

# Choose file containing return data (columns = portfolios, rows = monthly return observations)
f <- file.choose()
retur <- read.csv(f)

for (i in 1:nrow(retur)){
  mod <- lm(data = retur, retur[[i, 1]] ~ BTM2R + BIPR + BInfR + BUnR + BOilR)
  print(mod$coefficients)
}
(I also wish to develop this loop further so that, after the regression has been run for one portfolio, it is run for the next portfolio, i.e. also looping over the column number j in retur. But I will first address my current problem.)
My current problem is that when I run the regression, all coefficient values come back "NA" apart from the intercept, which returns a value. To confuse matters further, the intercept does not equal the return value at time t, which is what one would expect if all the other coefficients are "NA" (even though the regression results would still obviously be incorrect).
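For context, the end result I am after is a month-by-month cross-sectional regression of portfolio returns on the exposures, roughly along the lines of the sketch below (this assumes columns 3 to 7 of betas hold the five exposures, one row per portfolio, and that the columns of retur are the same portfolios in the same order; those are assumptions on my part, not something guaranteed by my files):
# Sketch only: each month's cross-section of portfolio returns is regressed
# on the portfolios' factor exposures (columns 3:7 of `betas` by assumption).
second_pass <- t(sapply(seq_len(nrow(retur)), function(t) {
  month_t <- data.frame(ret = as.numeric(retur[t, ]), betas[, 3:7])
  coef(lm(ret ~ ., data = month_t))
}))
head(second_pass)  # one row of estimated coefficients per month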

How to conduct parametric bootstrapping in R?

I am working with the orings data set in the faraway package in R. I have written the following grouped binomial model:
orings_model <- glm(cbind(damage, 6-damage) ~ temp, family = binomial, data = orings)
summary(orings_model)
I then constructed the Chi-Square test statistic and calculated the p-value:
pchisq(orings_model$null.deviance, orings_model$df.null,lower=FALSE)
First, I would like to generate data under the null distribution for this test statistic using rbinom with the average proportion of damaged o-rings (i.e., the variable "damage"). Second, I would like to recompute the above test statistic with this new data. I am not sure how to do this.
Finally, I want to repeat the process above 1000 times, saving the test statistic
each time. I am also not sure how to do this. My inclination is to use a for loop, but I am not sure how to set it up. Any help would be really appreciated!
It is not completely clear what you're looking to do here, but we can at least show some quick principles of how we can achieve this, and then hopefully you can get to your goal.
1) Simulating the null model
It is not entirely clear that you would like to simulate the null model here. It seems more like you're interested in simulating the actual model fit. Note that the null model is the model with form cbind(damage, 6-damage) ~ 1, and the null deviance and df are from this model. Either way, we can simulate data from the model using the simulate function in base R.
sims <- simulate(orings_model, 1000)
If you want to go the manual way, estimate the mean vector of your model and use it for the probabilities in your call to rbinom:
nsim <- 1000 * nrow(orings)
probs <- predict(orings_model, type = 'response')
sims_man <- matrix(rbinom(nsim, 6, probs),
                   ncol = 1000)
# Check that the simulated damage proportions track the fitted probabilities:
# rowMeans(sims_man) / 6 - probs
In the first version we get a data.frame with 1000 columns, each holding an n x 2 matrix (damage vs. not damage). In the latter we only simulate the damage counts.
2) Perform the bootstrapping
You could do this manually with the data above.
# Data from simulate()
statfun <- function(x){
  data <- orings_model$data
  data$damage <- if (length(dim(x)) > 1) x[, 1] else x
  newmod <- update(orings_model, data = data)
  pchisq(newmod$null.deviance, newmod$df.null, lower = FALSE)
}
sapply(sims, statfun)

# Data from the manual method
apply(sims_man, 2, statfun)
Alternatively, one could take a bit of time with the boot function, which gives a standardized way to perform the bootstrap:
library(boot)
# See help("boot")
ran_gen <- function(data, mle){
  data$damage <- simulate(orings_model)[[1]][, 1]
  data
}
boot_metric <- function(data, w){
  model <- glm(cbind(damage = damage, not_damage = 6 - damage) ~ temp,
               family = binomial, data = data)
  pchisq(model$null.deviance,
         model$df.null,
         lower = FALSE)
}
boots <- boot(orings, boot_metric,
              R = 1000,
              sim = 'parametric',
              ran.gen = ran_gen,
              mle = pchisq(orings_model$null.deviance,
                           orings_model$df.null,
                           lower = FALSE))
At this point we have the bootstrap statistics in boots$t and the observed statistic in boots$t0, so a simple summary can be computed as sum(boots$t > boots$t0) / boots$R (R being the number of replications).
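As a quick illustration of that comparison, using the objects created above (a sketch):
# Compare the observed statistic (boots$t0) with its simulated distribution.
hist(boots$t, breaks = 50, main = "Bootstrap distribution of the statistic")
abline(v = boots$t0, col = "red", lwd = 2)
sum(boots$t > boots$t0) / boots$R  # proportion of simulated statistics above the observed one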

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[, -1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[, 10]  # prediction based on 10 components
confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in what get.BER returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction results for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
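For instance, a sketch of that fuller variant, using the same objects as above (the helper is renamed confmat_full here just to keep it distinct):
confmat_full <- function(x) {
  prediction    <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  list(prediction = prediction,
       confusion  = confusion.mat,
       BER        = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat_full)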

Constructing components from PLSR loadings in R

I want to compute the components for a set of variables using the loadings (weights) from a PLSR using the plsr function.
I thought that the components were computed by summing the values of each variable multiplied by the estimated loading (weight).
However, using the output from plsr and doing that doesn't give me the expected values:
Example:
library("pls")
data(oliveoil)
sens.pcr <- plsr(sensory ~ chemical, ncomp = 4, scale = F, data = oliveoil)
Extract loadings/weights:
df <- cbind(sens.pcr$loadings[,1],sens.pcr$loadings[,2],sens.pcr$loadings[,3],sens.pcr$loadings[,4])
One test observation:
firstrow <- oliveoil$chemical[1,]
Extract the components (scores):
scores <- sens.pcr$scores
Do the linear combination:
sum(firstrow*df[,1])
[1] -12.81924
Which is not equal to the first score scores[1,1] = 0.5100166
What is it that I am missing?
Using sens.pcr$loading.weights instead didn't make any big difference either.
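I also wondered whether the scores are built from the mean-centred predictors and the projection matrix rather than from the raw data and the loadings, i.e. something like the check below, but I am not sure this is the right relation:
# Sketch of a possible check (my assumption, not something I have confirmed):
# reconstruct the X-scores from the centred predictors and the projection
# matrix stored on the fitted model (scale = FALSE was used above).
Xc <- scale(oliveoil$chemical, center = TRUE, scale = FALSE)
head(Xc %*% sens.pcr$projection)  # compare with head(sens.pcr$scores)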

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamma distributions. One distribution has (shape, scale) = (shapeRef, scaleRef) while the other has (shape, scale) = (shapeTarget, scaleTarget). I try to do an analysis of variance with the following code:
n <- 10000
x <- rgamma(n, shape = shapeRef, scale = scaleRef)
y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
glmm1 <- gam(y ~ x, family = Gamma(link = log))
anova(glmm1)
The resulting p values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I use the following code instead
f <- gl(2, n)
x <- rgamma(n, shape = shapeRef, scale = scaleRef)
y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)), test = "F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n = 100, shapeRef = 2, shapeTarget = 2,
                   scaleRef = 1, scaleTarget = 2) {
  f <- gl(2, n)
  x <- rgamma(n, shape = shapeRef, scale = scaleRef)
  y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
  xy <- c(x, y)
  data.frame(xy, f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
  aa <- anova(glm(xy ~ f, family = Gamma(link = log), data = d), test = "F")
  aa["f", "Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)
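And just to illustrate the uniform-under-the-null point made above, the same machinery can be re-run with identical parameters for both groups (a quick sketch):
# Under the null (same shape and scale for x and y) the p-values should be
# roughly uniform on (0, 1).
set.seed(102)
r_null <- replicate(500, sumfun(simfun(shapeTarget = 2, scaleTarget = 1)))
hist(r_null, col = "gray", breaks = 20)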
