How to efficiently run negative binomial regression with many variables in R

My data follow a negative binomial distribution, so I want to use negative binomial regression to analyze them.
I followed the instructions described on this website: https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
It worked well and I was able to analyze my data.
However, I have many variables to analyze and I do not want to write the call out as
linear <- glm(V1 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 ... V100, data = df1)
Say I have 100 variables to analyze; how can I write the regression code efficiently and save myself some time?
It works if I simply add everything like + V2 + V3 + V4 ... all the way to the end, but I really do not want to.
Any comments should be helpful.
Thank you.

as.formula and paste to the rescue
> Vmax=10
> as.formula(paste0("V1~",paste0("V",2:(Vmax-1),sep="+",collapse=""),"V",Vmax,collapse=""))
V1 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10
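A slightly more compact alternative (just a sketch, assuming the predictors really are named V2 through V100 and that you fit the negative binomial model with MASS::glm.nb as in the linked UCLA tutorial) is to build the formula with reformulate:

library(MASS)   # glm.nb() for negative binomial regression

# Build the right-hand side programmatically instead of typing every term
f <- reformulate(paste0("V", 2:100), response = "V1")
fit <- glm.nb(f, data = df1)

# If every other column of df1 is a predictor, the dot shortcut also works:
# fit <- glm.nb(V1 ~ ., data = df1)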

Related

How to compute differences in intercept and slope in linear mixed effect regression?

I am trying to perform a simple linear mixed-effects regression for the day of green-up in the Arctic.
I would like to find the effect that weather has on the green-up day within each region (16 pan-Arctic regions), whether the green-up day (intercept) differs significantly between regions (ANOVA), and which regions actually differ (post-hoc test).
My data, after centering and scaling, look as follows:
library(readr)
library(blme)
data <- read_csv("data.csv")
data.sc <- data.frame(scale(data))
head(data.sc)
regions year greenup V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 1 -1.687957 0.9336382 0.85187965 0.71761508 3.16360255 1.6670476 -0.68640856 -0.655334953 -1.799065 1.5577978 3.0125643 -0.5009276
2 1 -1.601395 0.2370278 -0.30385233 0.59926804 0.04203437 1.3281044 -0.62391808 -0.592095987 -1.590912 0.6735797 1.2016789 -0.5955858
3 2 -1.514833 0.8030237 -0.05341494 0.05770376 -0.03230812 0.4116001 -0.34697632 -0.311837094 -1.428521 0.2459790 0.7906408 -0.7097304
4 2 -1.428272 -0.1548155 -0.52504749 0.21499928 -0.53588809 1.3991222 -0.70552873 -0.674684140 -1.491065 -0.5336937 0.1615725 -0.6405863
5 3 -1.341710 0.4111804 -0.55526813 0.41495915 -0.28220429 1.0430792 -0.69965688 -0.668741960 -1.440828 -0.1372223 0.3667727 -0.8735882
6 3 -1.255148 1.7717476 0.37503107 0.49582150 3.25563398 1.1052932 -0.06417199 -0.005461234 -1.859388 1.6666640 3.6155112 -0.6626455
I therefore fit a linear mixed model using the Bayesian package blme to avoid the boundary (singular) warning. I make sure I get the global intercept and slopes, but also the deviation for each of the regions. Hence, my model looks as follows:
reg <- blmer(greenup ~ V1+V2+V3+V4+V5+V6+V7+V8+(1+V1+V2+V3+V4+V5+V6+V7+V8|regions), data = data.sc, REML = F)
ranef(reg)
(Intercept) V1 V2 V3 V4 V5 V6 V7 V8
1 -1.5088397 -0.678780774 -0.28176494 0.03297057 0.14762373 0.135628050 0.03870133 -0.16881483 0.01049209
2 1.4277155 0.002705333 0.63878199 0.04527960 -0.29932021 -0.092542944 -0.05871342 0.12192872 0.16659908
3 -1.1971171 -0.643673842 -0.36725321 -0.00437876 0.09999908 -0.195088268 0.02960333 -0.07320157 0.21971584
4 -1.8725315 -0.785756089 -0.38010681 -0.02575994 0.28474568 0.840094435 -0.13247112 -0.05102267 -0.77963082
5 1.0132713 -0.429509626 -0.31425681 0.29430628 -0.19690372 0.282548814 0.39303350 -0.12090616 0.69670371
6 -2.4881956 -1.692458471 -0.09449177 0.01172166 0.06284188 -0.844244673 0.79118303 -0.34123995 1.04264827
7 3.1058386 0.826832058 1.11178582 0.02043532 -0.47408031 -0.338428604 -0.33133031 0.31261663 0.10124955
8 1.9322901 3.317059011 -0.30085324 -0.22707802 0.28574031 -0.569026041 -0.42553220 0.17865665 -0.47420030
9 -1.2204600 -2.438705399 1.87732120 -0.07449666 -0.38793703 0.188712028 -0.38356492 0.11440041 -0.52707496
10 2.0227413 1.395666163 0.67593119 -0.14830203 -0.16997007 -0.385349601 -0.34510500 0.26727338 -0.24816029
11 -1.9223249 -0.767426078 -0.23096931 -0.10032745 0.14049704 0.003901228 0.01478041 -0.07175487 -0.12908571
12 0.0149983 -0.492942263 -0.40391390 0.16250771 -0.06522357 0.566016764 0.35380727 -0.11917404 0.19111680
13 -0.7919515 -0.731003237 -0.34630359 0.17115953 0.10735851 0.940255206 -0.07730108 -0.11185924 -0.37309296
14 4.7216459 4.388771043 -1.08519982 0.01923765 0.10673146 -0.205195431 -0.03391026 0.21661697 0.12766651
15 0.1163438 -0.623755530 -0.09248300 0.15112399 -0.08205915 0.193081476 0.03723803 -0.03449855 0.24479648
16 -3.3534245 -0.647022299 -0.40622380 -0.32839946 0.43995636 -0.520362438 0.12958139 -0.11902088 -0.26974330
How do I proceed to investigate how the intercept (greenup day) differs between the random effect (regions)?
And how do I look into whether the variables have a different influence on the green-up day in different regions?
This may end up being better for CrossValidated.
Probably the easiest way to extract the random effects values with confidence intervals (see caveats below) is
library(broom.mixed)
tt <- (tidy(reg, effects = "ran_vals", conf.int = TRUE)
|> dplyr::filter(term == "(Intercept)")
|> dplyr::select(level, estimate, conf.low, conf.high)
)
This is (obviously) just looking at the intercept term.
You can look at the estimated variances of the different random-effect components and compare their magnitudes.
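For example (a minimal sketch; blmer fits inherit the lme4 accessors, so VarCorr() should work on reg):

# Estimated standard deviations / correlations of the random-effect terms
vc <- as.data.frame(VarCorr(reg))
vc[vc$grp == "regions", c("var1", "var2", "sdcor")]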
If you run lattice::dotplot(ranef(reg)) you will see all of the random-effects conditional modes, with their conditional standard deviations and with regions ordered by their intercept. You might want lattice::dotplot(ranef(reg), scales = "free") instead.
If you know ggplot you can use the output of tidy above to plot more flexibly.
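Something along these lines, for example (a sketch built on the level/estimate/conf.low/conf.high columns selected above):

library(ggplot2)

# Caterpillar plot of the per-region intercept deviations (conditional modes)
ggplot(tt, aes(x = reorder(level, estimate), y = estimate,
               ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = "region", y = "intercept deviation")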
Formal testing is more difficult.
You could fit a model without the intercept variation:
reduced <- update(reg, . ~
V1+V2+V3+V4+V5+V6+V7+V8+(0 + V1+V2+V3+V4+V5+V6+V7+V8|regions))
anova(reduced, reg)
This won't work sensibly if any of your covariates is a factor (removing the intercept will just reparameterize the model, not actually drop the intercept variation).
I would not be surprised if the variation in all of the other effects could compensate for the restriction that the intercept isn't allowed to vary.
You could do this for any of the random-effects terms.
The idea of post-hoc testing to figure out which of the regions differs significantly in its intercept is right out the window, for theoretical reasons: when you estimate variation as a random effect, the predicted values for individual levels (called "best linear unbiased predictors" [BLUPs] or "conditional modes") are no longer parameter estimates in the formal sense, with sampling distributions, p-values, and so on. In the frequentist world, this is one of the things you sacrifice by using a random effect.

Multivariate Time-Series Forecasting in R

I have a data set that consists of a combination of daily and non-daily data that are location specific. The location is specified by longitude and latitude (columns V1 and V2). Column V3 represents a location-specific value, a heart-disease prevalence indicator. Columns V4 and V5 represent location-specific daily wind-speed recordings for days 1-3 and 2-4, respectively. Column V6 is the daily number of ER deaths for days 1-3; V7 is the deaths for days 2-4; V8 is the deaths for days 3-5.
I would like to build a multiple linear regression model that can predict the new deaths for the third day in every location, given the values in V1 to V7.
This is an example of what the data set looks like and what I am trying to do:
d <- as.data.frame(matrix(c(1,1,1,-2,-2,-2,14,14,14,90,90,90,103,103,103,-6,-6,-6,50,50,50,70,70,70,112,112,112,11,11,11,8,8,8,26,26,26,1.2,1.2,1.2,0.8,0.8,0.8,1.3,1.3,1.3,0.7,0.7,0.7,1.7,1.7,1.7,2,2,2,10,20,17,20,25,26,60,70,70,10,12,13,109,117,120,61,67,63,20,17,18,25,26,24,70,70,90,12,13,11,117,120,110,67,63,64,0,4,5,1,4,6,5,7,9,12,23,4,7,6,5,8,9,12,4,5,6,4,6,9,7,9,13,23,4,12,6,5,25,9,12,40,5,6,16 ,6,9,30 ,9,13,32 ,4,12,23 ,5,25,32 ,12,40,61 ), nrow = 18, ncol = 8))
dtrain <- d[1:12,]
dtest <- d[13:18,][1:7]
l <- lm(V8 ~ V1 +V2 + V3 + V4 + V5 + V6 + V7 , data = dtrain)
p <- predict(l,dtest)
I would like to know whether having non-daily and daily data mixed together will affect the accuracy of a regression model in such a case. I am new to machine learning and am unsure about the shape of my dataset and how to tackle multivariate regression (in this case a time series, too).

Simulated Maximum Likelihood in R, MaxLik

I am trying to estimate a model by simulated maximum likelihood via the maxLik package in R. Unfortunately, with increasing data size, I am running into serious performance problems. Can anyone advise on the following:
Is there a way to speed up my code (it's already vectorized, so I am kind of clueless how to improve it further)?
Is there a way to implement the optimization process via Rcpp in order to speed it up?
Is there any smarter way to implement simulated maximum likelihood with a custom made likelihood function?
I have already tried doParallel on an AWS instance, but that does not significantly speed up the process.
I have created a reproducible example and commented the most important parts:
#create data:
#Binary DV (y), 10 IDV (V3 - V12), 50 groups (g), with 100 sequential observations each (id)
library(data.table)  # needed for as.data.table() and the data.table syntax below
set.seed(123)
n <- 5000
p <- 10
x <- matrix(rnorm(n * p), n)
g <- rep(seq(1:(n/100)),each=100)
id <- rep(seq(1:(n/max(g))),max(g))
beta <- runif(p)
xb <- c(x %*% beta)
p <- exp(xb) / (1 + exp(xb))
y <- rbinom(n, 1, p)
data <- as.data.table(cbind(id,y,x,g))
#Find starting values for MaxLik via regular glm
standard <- glm(y ~ V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12,
                data = data, family = binomial(link = "logit"))
summary(standard)
#set starting values for MaxLik
b <- c(standard$coefficients,sd_V3=0.5,sd_V4=0.5)
#draw 50 x # of groups random values from a normal distribution
draws <- 50
#for each g in the data, 50 random values are drawn
rands <- as.data.table(cbind(g=rep(g,each=draws),matrix(rnorm(length(g)*draws,0,1),length(g)*draws,2)))
colnames(rands) <- c("g","SD_V3","SD_V4")
#merge random draws to each group, so every observation is repeated x # of draws
data <- merge(data,rands,by="g",all=T,allow.cartesian=T)
#the likelihood function (for variables V3 and V4, a mean (b[2] & b[3]) and an SD (b[12] & b[13]) are estimated)
loglik1 <- function(b){
#I want the standard deviations to vary only across groups (g), but all other parameters to vary across all observations, which is why I am taking the mean across g and id (remember, every observation is a cartesian product with the random draws per group)
ll <- data[,.(gll=mean(((1/(1+exp(-(b[1]+
(b[2]+b[12]*SD_V3)*V3 +
(b[3]+b[13]*SD_V4)*V4 +
(b[4])*V5 +
(b[5])*V6 +
(b[6])*V7 +
(b[7])*V8 +
(b[8])*V9 +
(b[9])*V10 +
(b[10])*V11 +
(b[11])*V12))))^y)*
(1-(1/(1+exp(-(b[1]+
(b[2])*V3 +
(b[3])*V4 +
(b[4])*V5 +
(b[5])*V6 +
(b[6])*V7 +
(b[7])*V8 +
(b[8])*V9 +
(b[9])*V10 +
(b[10])*V11 +
(b[11])*V12)))))^(1-y))),by=.(g,id)]
return(log(ll[,gll]))
}
co <- maxLik::maxControl(gradtol=1e-04,printLevel=2)
maxlik <- maxLik::maxLik(loglik1,start=b,method="bfgs",control=co)
summary(maxlik)
Thank you very much for your advice
I was able to decrease optimization time dramatically (hours to minutes) by changing the inside of loglik1 <- function(b){ ... } to
return(data[,.(g,id,y,logit=1/(1+exp(-(b[1]+
(b[2]+b[12]*SD_V3)*V3 +
(b[3]+b[13]*SD_V4)*V4 +
(b[4])*V5 +
(b[5])*V6 +
(b[6])*V7 +
(b[7])*V8 +
(b[8])*V9 +
(b[9])*V10 +
(b[10])*V11 +
(b[11])*V12))))][,mean(y*log(logit)+(1-y)*log(1-logit)),by=.(g,id)][,sum(V1)])
However, this only partially solves the problem, since the estimation time increases once again as the data size grows :(
I will probably have to deal with this, unless someone has an elegant solution?
EDIT: To pick this up after a while, in case anyone runs into the problem in the future... The reason the script takes so long lies in the maxLik package and the time it spends computing the Hessian matrix. If you don't need the Hessian, you can tell maxLik not to compute it. Since I do need it, I decided to compute it via Rcpp.
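For reference, a sketch of how skipping the Hessian might look (this assumes the finalHessian argument of the underlying maxLik optimizers is passed through maxLik(); without the Hessian, summary() will not be able to report standard errors):

# Skip the expensive numerical Hessian at the optimum
maxlik_fast <- maxLik::maxLik(loglik1, start = b, method = "bfgs",
                              finalHessian = FALSE, control = co)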

Memory-efficient representation of model.matrix

Assume we have a large data.table object with model variables:
library(data.table)
library(magrittr)
library(pryr)
library(caret)
df <- rnorm(10000000, 0, 1) %>% matrix(., ncol = 10) %>% as.data.table
df[,factor_vars:=LETTERS[sample(1:26, 1000000, replace = T)]]
df[,factor_vars2:=LETTERS[sample(1:5, 1000000, replace = T)]]
I'm looking for an efficient way of making a model variable matrix from the data. At the moment the best way I've found is by using caret::dummyVars in the following manner:
dd_object <- dummyVars(~ -1 + V1 + V2 + V3 + V4 + V5 + V6 + V7 +
V8 + V9 + V10 + I(as.character(factor_vars)) +
I(as.character(factor_vars2)),
data = df)
Note that this creates a very convenient object for exporting and recreating without the original data.
object_size(dd_object)
R> 17.3 kB
On the other hand, just as with base::model.matrix, it retains the inefficiency of a dense matrix when dealing with many zeroes, i.e.:
MM1 <- predict(dd_object, newdata = df)
object_size(MM1)
R> 392 MB
object_size(df)
R> 96 MB
Note that the sizes can blow up very easily as more dummy variables are added; this is just for demonstration purposes.
My question: I want to use the same model-matrix object with various modelling packages (glm, glmnet, xgboost, etc.). The sparse matrix representation from the Matrix package does sound nice and efficient, but not every package is able to work with it, and the as.matrix(.) transformation is a pain in that case.
Are there any known solutions for my case? I'm looking for something with greater efficiency than the base matrix (possibly like sparse matrices) and the capability of forming a storable model.matrix object, just like caret::dummyVars is able to do.
The desired workflow could be something along the lines of
fread %>% predict(dummyVars_object, newdata =.) %>% predict(some_Model, newdata =.)
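As a concrete illustration of the sparse route mentioned above (only a sketch; it assumes the downstream consumers, e.g. glmnet or xgboost, accept a dgCMatrix, which base glm does not):

library(Matrix)

# Sparse design matrix; the dummy columns for the two factors are stored
# in dgCMatrix form, so the many zeroes take essentially no memory
MM_sparse <- sparse.model.matrix(
  ~ -1 + V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 +
    factor_vars + factor_vars2,
  data = df)
object_size(MM_sparse)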

Looping to extract coefficients from multiply imputed mer objects

I am having a hard time wrapping my head around this problem. I have a list, results4, which contains 5 elements, all of which are mer objects from the Zelig package. The mer objects are the result of ls.mixed regressions on each of five imputed datasets. I am trying to combine the results using Rubin's rules for multiple imputation.
I can extract the coefficients and standard errors using summary(results4[[1]])@coefs, which returns a 16x3 matrix (16 variables, each with a point estimate, standard error, and t-statistic).
I am trying to loop over the five sets of results and automate the process of combining the point estimates and standard errors, but unfortunately I seem to be staring at it with no solution arising. Any suggestions?
The code that produces the mer objects follows (variable names changed):
for (i in 1:5) {
results4[i] <- zelig(DV ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 +
V9 + V10 + V11 + V12 + V13 + V14 + V15 + tag(1 | L2),
data = as.data.frame(w4[,,i]), model = "ls.mixed", REML = FALSE)
}
I'm not going to take the time to code up the multiple-imputation rules (someone who wants the credit can take what I show here and build on it), but I think you should be able to do what you want by building a 16x3x5 array containing the results:
resultsList <- lapply(results4, function(x) summary(x)@coefs)
library(abind)
resultsArr <- abind(resultsList,along=3)
and then using apply appropriately across the margins.
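For the pooling step itself, a rough sketch of Rubin's rules over that array (assuming, as in the @coefs output above, that column 1 holds the point estimates and column 2 the standard errors):

est <- resultsArr[, 1, ]      # 16 x 5 matrix of point estimates
se  <- resultsArr[, 2, ]      # 16 x 5 matrix of standard errors
m   <- dim(resultsArr)[3]     # number of imputations

pooled_est <- rowMeans(est)               # pooled point estimates
W <- rowMeans(se^2)                       # within-imputation variance
B <- apply(est, 1, var)                   # between-imputation variance
pooled_se  <- sqrt(W + (1 + 1/m) * B)     # pooled standard errors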
There's probably a plyr-based solution as well.
You could also do this less fancily by just defining the array up front and filling it in as you go:
sumresults <- array(dim=c(16,3,5))
for (...) {
...
sumresults[,,i] <- summary(results4[[i]])@coefs
}
