"variable lengths differ" error while running regressions in a loop - r

I am trying to run regressions in a loop, based on code that I found in a previous answer (How to Loop/Repeat a Linear Regression in R), but I keep getting an error. My outcomes (dependent variables) are 940 metabolites, and my exposures (independent variables) are "bmi", "Age", "sex", "lpa2c", and "smoking", where BMI and Age are continuous. BMI is the main exposure; the other variables are covariates that I am controlling for.
So I'm testing the effect of BMI on 940 metabolites.
Also, I would like to know how to extract the coefficient, p-value, standard error, and confidence interval for BMI only, and only when it is significant.
This is the code I have used:
y<- c(1653:2592) # response
x1<- c("bmi","Age", "sex","lpa2c", "smoking") # predictor
for (i in x1){
model <- lm(paste("y ~", i[[1]]), data= QBB_clean)
print(summary(model))
}
And this is the error:
Error in model.frame.default(formula = paste("y ~", i[[1]]), data = QBB_clean, :
variable lengths differ (found for 'bmi').
y1 y2 y3 y4 bmi age sex lpa2c smoking
1 0.2875775201 0.59998896 0.238726027 0.784575267 24 18 1 0.470681834 1
2 0.7883051354 0.33282354 0.962358936 0.009429905 12 20 0 0.365845473 1
3 0.4089769218 0.48861303 0.601365726 0.779065883 18 15 0 0.121272054 0
4 0.8830174040 0.95447383 0.515029727 0.729390652 16 21 0 0.046993681 0
5 0.9404672843 0.48290240 0.402573342 0.630131853 18 28 1 0.262796304 1
6 0.0455564994 0.89035022 0.880246541 0.480910830 13 13 0 0.968641168 1
7 0.5281054880 0.91443819 0.364091865 0.156636851 11 12 0 0.488495482 1
8 0.8924190444 0.60873498 0.288239281 0.008215520 21 23 0 0.477822030 0
9 0.5514350145 0.41068978 0.170645235 0.452458394 18 17 1 0.748792881 0
10 0.4566147353 0.14709469 0.172171746 0.492293329 20 15 1 0.667640231 1

If you want to loop over responses you will want something like this:
respvars <- names(QBB_clean[1653:2592])
predvars <- c("bmi","Age", "sex","lpa2c", "smoking")
results <- list()
for (v in respvars) {
  form <- reformulate(predvars, response = v)
  results[[v]] <- lm(form, data = QBB_clean)
}
You can then print the results with something like lapply(results, summary), extract coefficients, etc. (I have a little trouble seeing how it's going to be useful to just print the results of 940 regressions... are you really going to inspect them all?)
If you want coefficients etc. for BMI, I think this should work (not tested):
t(sapply(results, function(m) coef(summary(m))["bmi",]))
Or for confidence intervals:
t(sapply(results, function(m) confint(m)["bmi",]))
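To keep only the BMI row and then filter for significance, the two calls above can be combined into one table. The following is a minimal sketch on simulated data: the data frame `dat`, the response names `y1` to `y3`, the reduced predictor set and the 0.05 threshold are all assumptions made for illustration, not part of the original data.

```r
set.seed(1)
n <- 100
# Simulated stand-in for QBB_clean (hypothetical columns)
dat <- data.frame(bmi = rnorm(n, 25, 4), Age = rnorm(n, 40, 10),
                  sex = rbinom(n, 1, 0.5))
dat$y1 <- 0.5 * dat$bmi + rnorm(n)   # depends strongly on bmi
dat$y2 <- rnorm(n)                   # pure noise
dat$y3 <- rnorm(n)                   # pure noise

respvars <- c("y1", "y2", "y3")
predvars <- c("bmi", "Age", "sex")

# Fit one model per response; reformulate() builds e.g. y1 ~ bmi + Age + sex
fits <- lapply(setNames(respvars, respvars), function(v) {
  lm(reformulate(predvars, response = v), data = dat)
})

# One row per response: estimate, SE, t value, p-value and 95% CI for bmi
bmi_tab <- t(sapply(fits, function(m) {
  c(coef(summary(m))["bmi", ], confint(m)["bmi", ])
}))
bmi_tab <- as.data.frame(bmi_tab)
names(bmi_tab) <- c("estimate", "std_error", "t_value", "p_value",
                    "ci_lower", "ci_upper")

# Keep only responses where bmi is significant at the (assumed) 0.05 level
sig <- bmi_tab[bmi_tab$p_value < 0.05, ]
print(sig)
```

With 940 tests, an unadjusted 0.05 cut-off will flag some metabolites by chance alone, so it may be worth filtering on something like p.adjust(bmi_tab$p_value, method = "BH") instead.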

Related

Problem in creating a model.matrix of quantitative predictors in R

I need to run a Lasso regression with the package glmnet, and I am having problems generating my x model.matrix.
My data.frame: 108 observations, Y response variable, 24 predictors, here is an overview:
CONVENTIONAL_HUmin CONVENTIONAL_HUmean CONVENTIONAL_HUstd CONVENTIONAL_HUmax
1 37.9400539686119 63.4903779286635 11.7592095845857 85.2375439991287
2 23.8400539686119 80.5903779286635 15.0592095845857 125.837543999129
3 19.3035945249441 73.2764716205565 12.8816244173147 130.24141901586
CONVENTIONAL_HUQ1 CONVENTIONAL_HUQ2 CONVENTIONAL_HUQ3 HISTO_Skewness HISTO_Kurtosis
1 54.9938390994964 65.4873070322704 72.8863025473031 -0.203420585259268 2.25208159159488
2 70.8938390994964 80.3873070322704 91.4863025473031 -0.117420585259268 2.91208159159488
3 64.4689755423307 73.8666609177099 81.7351818199415 -0.0908104900456161 2.8751327713366
HISTO_ExcessKurtosis HISTO_Entropy_log10 HISTO_Entropy_log2 HISTO_Energy...Uniformity.
1 -0.751917020142877 0.701345471328916 2.32782599847774 0.219781577333287
2 -0.0887170201428774 0.793345471328916 2.63782599847774 0.184781577333287
3 -0.127231561113029 0.738530858918985 2.45445652190669 0.206887426065656
GLZLM_SZE GLZLM_LZE GLZLM_LGZE GLZLM_HGZE GLZLM_SZLGE
1 0.366581916604228 35.7249100350856 8.7285612359045e-05 11497.6407737833 3.22615226279017e-05
2 0.693581916604228 984.424910035086 8.5685612359045e-05 11697.6407737833 5.98615226279017e-05
3 0.622711792823853 1103.10288991619 8.5573088970709e-05 11571.7421733917 5.33303855950858e-05
GLZLM_SZHGE GLZLM_LZLGE GLZLM_LZHGE GLZLM_GLNU GLZLM_ZLNU
1 4164.91570215061 0.00314512237564268 405585.990838764 2.66964898745512 2.47759091065361
2 8064.91570215061 0.0835651223756427 11581585.9908388 12.9796489874551 38.5375909106536
3 7295.45317481887 0.0949686480587339 12926109.9421091 15.0930512668698 37.6083347285291
GLZLM_ZP Y
1 0.219643444043173 1
2 0.112643444043173 0
3 0.104031438564764 0
My code for the model.matrix
x=model.matrix(Y~.,data=data.det)
It generates a very large model.matrix with 244728 elements! It seems that each of the 24 predictors has been duplicated about a hundred times!
Here's an overview of the model.matrix:
(Intercept) CONVENTIONAL_HUmin-10.5599460313881
CONVENTIONAL_HUmin-117.359946031388 CONVENTIONAL_HUmin-13.0599460313881
CONVENTIONAL_HUmin-154.359946031388 CONVENTIONAL_HUmin-17.6599460313881
CONVENTIONAL_HUmin-18.3599460313881 CONVENTIONAL_HUmin-2.87994603138811
CONVENTIONAL_HUmin-21.281710504529 CONVENTIONAL_HUmin-28.3599460313881
CONVENTIONAL_HUmin-3.44994603138811 CONVENTIONAL_HUmin-3.89640547505594
CONVENTIONAL_HUmin-67.0599460313881 CONVENTIONAL_HUmin-682.359946031388
CONVENTIONAL_HUmin-9.08171050452898 CONVENTIONAL_HUmin1.04428949547101
CONVENTIONAL_HUmin1.63928949547101 CONVENTIONAL_HUmin10.8400539686119
CONVENTIONAL_HUmin10.968289495471 CONVENTIONAL_HUmin11.5400539686119
CONVENTIONAL_HUmin11.618289495471 CONVENTIONAL_HUmin11.6400539686119
CONVENTIONAL_HUmin12.518289495471 CONVENTIONAL_HUmin12.5400539686119
CONVENTIONAL_HUmin13.4400539686119 CONVENTIONAL_HUmin13.6400539686119
CONVENTIONAL_HUmin13.7400539686119 CONVENTIONAL_HUmin13.818289495471
CONVENTIONAL_HUmin14.5400539686119 CONVENTIONAL_HUmin14.6693017607572
CONVENTIONAL_HUmin14.8400539686119 CONVENTIONAL_HUmin16.9400539686119
CONVENTIONAL_HUmin17.0400539686119 CONVENTIONAL_HUmin17.618289495471
CONVENTIONAL_HUmin18.2400539686119 CONVENTIONAL_HUmin18.8400539686119
CONVENTIONAL_HUmin19.3035945249441 CONVENTIONAL_HUmin20.0400539686119
CONVENTIONAL_HUmin20.818289495471 CONVENTIONAL_HUmin21.0400539686119
CONVENTIONAL_HUmin21.118289495471 CONVENTIONAL_HUmin21.3400539686119
CONVENTIONAL_HUmin21.5400539686119 CONVENTIONAL_HUmin21.9400539686119
...
attr(,"contrasts")$CONVENTIONAL_HUmin
[1] "contr.treatment"
This is not convenient at all, because I end up with many more predictors in the input x for the Lasso regression, which makes spurious selection of predictors even more likely.
Do you have an idea of the source of the problem? Any suggestions for fixing it?
Try this: you want a plain matrix, not a model matrix...
library(glmnet)
# make a matrix of your predictors minus your outcome (column 25 is Y)
x <- as.matrix(data.det[-25])
# put the y column in a vector
y <- data.det$Y
# run it
fit.lasso <- glmnet(x, y, family = "binomial", alpha = 1)
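The "contr.treatment" attribute in the question's output suggests that CONVENTIONAL_HUmin is stored as a factor (or character) column, which is why model.matrix creates one dummy column per distinct value. If that is the case, as.matrix would return a character matrix, so converting the columns to numeric first may be needed. The sketch below illustrates the effect on a toy data frame; the column names `a` and `b` and the values are made up for illustration.

```r
# Toy data: numeric values that were (hypothetically) read in as factors
df <- data.frame(a = c("1.5", "2.5", "3.5", "4.5"),
                 b = c("10", "20", "30", "40"),
                 Y = c(1, 0, 0, 1),
                 stringsAsFactors = TRUE)

# Factor predictors are expanded into one dummy column per non-reference level
x_bad <- model.matrix(Y ~ ., data = df)
ncol(x_bad)   # intercept + 3 dummies for a + 3 dummies for b

# Convert via character so the printed values (not the level codes) are used
pred_cols <- setdiff(names(df), "Y")
df[pred_cols] <- lapply(df[pred_cols],
                        function(col) as.numeric(as.character(col)))

# Now each predictor is a single numeric column
x_good <- as.matrix(df[pred_cols])
ncol(x_good)
is.numeric(x_good)
```

Checking str(data.det) on the real data would show which columns, if any, are not numeric.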

DFFITs for Beta Regression

I am trying to calculate DFFITS for a GLM-type model where the response follows a Beta distribution, using the betareg R package. However, I think this package doesn't support influence.measures(), because calling dffits() fails.
Code
require(betareg)
df<-data("ReadingSkills")
y<-ReadingSkills$accuracy
n<-length(y)
bfit<-betareg(accuracy ~ dyslexia + iq, data = ReadingSkills)
DFFITS<-dffits(bfit, infl=influence(bfit, do.coef = FALSE))
DFFITS
it yields
Error in if (model$rank == 0) { : argument is of length zero
I am a newbie in R and don't know how to resolve this problem. Kindly help me solve it, or give me some tips, with R code, on how to calculate DFFITS manually.
Regards
dffits are not implemented for "betareg" objects, but you could try to calculate them manually.
According to this Stack Overflow Q/A we could write this function:
dffits1 <- function(x1, bres.type="response") {
  stopifnot(class(x1) %in% c("lm", "betareg"))
  sapply(1:length(x1$fitted.values), function(i) {
    x2 <- update(x1, data=x1$model[-i, ]) # leave one out
    h <- hatvalues(x1)  # leverages of the full model
    nm <- rownames(x1$model[i, ])
    # change in the prediction for observation i when it is left out
    num_dffits <- suppressWarnings(predict(x1, x1$model[i, ]) -
                                     predict(x2, x1$model[i, ]))
    residx <- if (class(x1) == "betareg") {
      betareg:::residuals.betareg(x2, type=bres.type)
    } else {
      x2$residuals
    }
    # leave-one-out residual scale times the square root of the leverage
    denom_dffits <- sqrt(c(crossprod(residx)) / x2$df.residual*h[i])
    return(num_dffits / denom_dffits)
  })
}
It works well for lm:
fit <- lm(mpg ~ hp, mtcars)
dffits1(fit)
stopifnot(all.equal(dffits1(fit), dffits(fit)))
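For the lm case there is also a closed form that avoids refitting: DFFITS equals the externally studentized residual times sqrt(h / (1 - h)), where h is the leverage. The short sketch below checks this against dffits(); it applies to linear models only, and no claim is made here that the same shortcut carries over to betareg.

```r
fit <- lm(mpg ~ hp, data = mtcars)

h <- hatvalues(fit)                                  # leverages
dffits_closed <- rstudent(fit) * sqrt(h / (1 - h))   # closed-form DFFITS

# Should agree with the built-in implementation
stopifnot(all.equal(dffits_closed, dffits(fit)))
head(dffits_closed)
```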
Now let's try betareg:
library(betareg)
data("ReadingSkills")
bfit <- betareg(accuracy ~ dyslexia + iq, data=ReadingSkills)
dffits1(bfit)
# 1 2 3 4 5 6 7
# -0.07590185 -0.21862047 -0.03620530 0.07349169 -0.11344968 -0.39255172 -0.25739032
# 8 9 10 11 12 13 14
# 0.33722706 0.16606198 0.10427684 0.11949807 0.09932991 0.11545263 0.09889406
# 15 16 17 18 19 20 21
# 0.21732090 0.11545263 -0.34296030 0.09850239 -0.36810187 0.09824013 0.01513643
# 22 23 24 25 26 27 28
# 0.18635669 -0.31192106 -0.39038732 0.09862045 -0.10859676 0.04362528 -0.28811277
# 29 30 31 32 33 34 35
# 0.07951977 0.02734462 -0.08419156 -0.38471945 -0.43879762 0.28583882 -0.12650591
# 36 37 38 39 40 41 42
# -0.12072976 -0.01701615 0.38653773 -0.06440176 0.15768684 0.05629040 0.12134228
# 43 44
# 0.13347935 0.19670715
The results look plausible.
Notes:
Even if this works in code, you should check whether it meets your statistical requirements!
I've used suppressWarnings around the predict calls in dffits1. predict(bfit, ReadingSkills) drops the contrasts somehow, whereas predict(bfit) does not (the two should practically be the same). However, the results are identical according to all.equal(predict(bfit, ReadingSkills), predict(bfit)), so ignoring the warnings should be safe.

Feeding new data into predict() for multiple regression? [duplicate]

Assume I have fit a regression model with multiple predictor variables in R, like in the following toy example:
n <- 20
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n)
m <- lm(z ~ x + y + I(y^2))
Now I have new data, consisting of x and y values, and I want to predict the corresponding z values:
x.new <- rnorm(5)
y.new <- rnorm(5)
Question: How should I best call predict to apply the fitted model to the new data?
Here are a few things I tried, which do not work:
Attempt 1. Trying to use the x.new and y.new as the columns of a new data frame:
> predict(m, data=data.frame(x=x.new, y=y.new))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
This gave 20 values instead of 5, so cannot be right.
Attempt 2: Maybe predict got confused because the y^2 values were not supplied? Try to use model.frame to provide data in the correct form.
> predict(m, model.frame(~ x.new + y.new + I(y.new^2)))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
Warning message:
'newdata' had 5 rows but variables found have 20 rows
Again, this results in 20 values (plus a warning), so cannot be right.
The parameter is newdata (not data) when telling predict what to predict for.
predict(m, newdata = data.frame(x = x.new, y = y.new))
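The reason Attempt 1 fails silently is that predict.lm has a `...` argument, so the unknown argument `data` is accepted and ignored, and predict falls back to the fitted values of the 20 training rows. The sketch below reproduces both calls on the toy example and checks the newdata result by hand; the seed is an arbitrary choice so the run is repeatable.

```r
set.seed(42)
n <- 20
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n)
m <- lm(z ~ x + y + I(y^2))

x.new <- rnorm(5)
y.new <- rnorm(5)

# 'data' is not an argument of predict.lm: it is swallowed by '...' and ignored
p_wrong <- predict(m, data = data.frame(x = x.new, y = y.new))
length(p_wrong)   # one value per training row

# 'newdata' is the right argument; I(y^2) is computed from y automatically
p_new <- predict(m, newdata = data.frame(x = x.new, y = y.new))
length(p_new)     # one value per new row

# Manual check against the fitted coefficients
b <- coef(m)
manual <- b[1] + b[2] * x.new + b[3] * y.new + b[4] * y.new^2
stopifnot(all.equal(unname(p_new), unname(manual)))
```

Note that the column names in newdata have to match the variable names used in the model formula (x and y here, not x.new and y.new).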

How to read the indexes from the prediction output of predict.ranger, R

Using the ranger package I run the following script:
rf <- ranger(Surv(time, Y) ~ ., data = train_frame[1:50000, ], write.forest = TRUE, num.trees = 100)
test_frame <- train_frame[50001:100000, ]
preds <- predict(rf, test_frame)
chfs <- preds$chf
plot(chfs[1, ])
The cumulative hazard function has indexes 1 - 36 on the X-axis. Obviously this corresponds with time, but I'm not sure how: my time of observation variable ranges from a minimum of 0 to a maximum of 399. What is the mapping between the original data and the predicted output from predict.ranger, and how can I operationalize this to quantify degree of risk for a given subject after a given length of time?
Here's a sample of what my time/event data looks like:
Y time
<int> <dbl>
1 1 358
2 0 90
3 0 162
4 0 35
5 0 307
6 0 69
7 0 184
8 0 24
9 0 366
10 0 33
And here's what the CHF of the first subject looks like:
[Figure: plot of chfs[1, ], with indexes 1 to 36 on the x-axis]
Can anyone help me connect the dots? There are no row or columns names on the "matrix" object that is preds$chf.
The prediction object contains a vector called unique.death.times, which holds the time points at which the CHF and survival estimates are computed. The chf matrix has observations in the rows and these time points in the columns; the same holds for survival.
Reproducible example:
library(survival)
library(ranger)
## Split the data
n <- nrow(veteran)
idx <- sample(n, 2/3*n)
train <- veteran[idx, ]
test <- veteran[-idx, ]
## Grow RF and predict
rf <- ranger(Surv(time, status) ~ ., train, write.forest = TRUE)
preds <- predict(rf, test)
## Example CHF plot
plot(preds$unique.death.times, preds$chf[1, ])
## Example survival plot
plot(preds$unique.death.times, preds$survival[1, ])
Setting importance = "impurity" for survival forests should throw an error.
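To read off the risk for one subject at an arbitrary time, the columns can be matched to unique.death.times with a simple lookup. The sketch below uses made-up stand-ins for preds$unique.death.times and one row of preds$chf, and the helper `chf_at` is hypothetical. It assumes the estimate is treated as a step function that keeps its value until the next time point.

```r
# Hypothetical stand-ins for preds$unique.death.times and preds$chf[1, ]
times   <- c(5, 20, 60, 150, 399)
chf_row <- c(0.01, 0.05, 0.20, 0.60, 1.10)

# CHF at time t, treating the estimate as a right-continuous step function
chf_at <- function(times, chf_row, t) {
  idx <- findInterval(t, times)       # number of time points <= t
  if (idx == 0) 0 else chf_row[idx]   # before the first time point: nothing accumulated
}

chf_at(times, chf_row, 100)   # value at the last time point <= 100
chf_at(times, chf_row, 2)     # earlier than every time point
```

With a real prediction object the call would look like chf_at(preds$unique.death.times, preds$chf[1, ], 100), and the same lookup could be applied to a row of preds$survival.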

JAGS Runtime Error: Cannot insert node into X[ ]. Dimension Mismatch

I'm trying to add a bit of code to a data-augmentation capture-recapture model and am coming up with some errors I haven't encountered before. In short, I want to estimate a series of survivorship phases that each last more than a single time interval. I want the model to estimate the length of each survivorship phase and use that to improve the capture recapture model. I tried and failed with a few different approaches, and am now trying to accomplish this using a switching state array for the survivorship phases:
for (t in 1:(n.occasions-1)){
phi1switch[t] ~ dunif(0,1)
phi2switch[t] ~ dunif(0,1)
phi3switch[t] ~ dunif(0,1)
phi4switch[t] ~ dunif(0,1)
psphi[1,t,1] <- 1-phi1switch[t]
psphi[1,t,2] <- phi1switch[t]
psphi[1,t,3] <- 0
psphi[1,t,4] <- 0
psphi[1,t,5] <- 0
psphi[2,t,1] <- 0
psphi[2,t,2] <- 1-phi2switch[t]
psphi[2,t,3] <- phi2switch[t]
psphi[2,t,4] <- 0
psphi[2,t,5] <- 0
psphi[3,t,1] <- 0
psphi[3,t,2] <- 0
psphi[3,t,3] <- 1-phi3switch[t]
psphi[3,t,4] <- phi3switch[t]
psphi[3,t,5] <- 0
psphi[4,t,1] <- 0
psphi[4,t,2] <- 0
psphi[4,t,3] <- 0
psphi[4,t,4] <- 1-phi4switch[t]
psphi[4,t,5] <- phi4switch[t]
psphi[5,t,1] <- 0
psphi[5,t,2] <- 0
psphi[5,t,3] <- 0
psphi[5,t,4] <- 0
psphi[5,t,5] <- 1
}
So this creates a [5,t,5] array where the survivorship state can only switch to the subsequent state and not backwards (e.g. 1 to 2, 4 to 5, but not 4 to 3). Now I create a vector where the survivorship state is defined:
PhiState[1] <- 1
for (t in 2:(n.occasions-1)){
# State process: draw PhiState(t) given PhiState(t-1)
PhiState[t] ~ dcat(psphi[PhiState[t-1], t-1,])
}
We start in state 1 always, and then take a categorical draw at each time step 't' for remaining in the current state or moving on to the next one given the probabilities within the array. I want a maximum of 5 states (assuming that the model will be able to functionally produce fewer by estimating the probability of moving from state 3 to 4 and onwards near 0, or making the survivorship value of subsequent states the same or similar if they belong to the same survivorship value in reality). So I create 5 hierarchical survival probabilities:
for (a in 1:5){
mean.phi[a] ~ dunif(0,1)
phi.tau[a] <- pow(phi_sigma[a],-2)
phi.sigma[a] ~ dunif(0,20)
}
Now this next step is where the errors start. Now that I've assigned values 1-5 to my PhiState vector it should look something like this:
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 5
or maybe
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
and I now want to assign a mean.phi[] to my actual phi[] term, which feeds into the model:
for(t in 1:(n.occasions-1)){
phi[t] ~ dnorm(mean.phi[PhiState[t]],phi.tau[PhiState[t]])
}
However, when I try to run this I get the following error:
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Cannot insert node into mean.phi[1:5]. Dimension mismatch
It's worth noting that the model works just fine when I use the following phi[] determinations:
phi[t] ~ dunif(0,1) #estimate independent annual phi's
or
phi[t] ~ dnorm(mean.phi,phi_tau) #estimate hierarchical phi's from a single mean.phi
or
#Set fixed survival periods (this works the best, but I don't want to have to tell it when
#the periods start/end and how many there are, hence the current exercise):
for (a in 1:21){
surv[a] ~ dnorm(mean.phi1,phi1_tau)
}
for (b in 22:30){
surv[b] ~ dnorm(mean.phi2,phi2_tau)
}
for (t in 1:(n.occasions-1)){
phi[t] <- surv[t]
}
I did read this post: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/36c48f25/
but I don't see where I'm redefining variables in this case... Any help fixing this or advice on a better approach would be most welcome!
Many thanks,
Josh
I'm a bit confused as to what your actual data are (the phi[t]?), but the following might give you a starting point:
nt <- 29
nstate <- 5
M <- function() {
  phi_state[1] <- 1
  for (t in 2:nt) {
    up[t-1] ~ dbern(p[t-1])
    p[t-1] <- ifelse(phi_state[t-1]==nstate, 0, p_[t-1])
    p_[t-1] ~ dunif(0, 1)
    phi_state[t] <- phi_state[t-1] + equals(up[t-1], 1)
  }
  for (k in 1:nstate) {
    mean_phi[k] ~ dunif(0, 1)
    phi_sigma[k] ~ dunif(0, 20)
  }
  for(t in 1:(nt-1)){
    phi[t] ~ dnorm(mean_phi[phi_state[t]], phi_sigma[phi_state[t]]^-2)
  }
}
library(R2jags)
fit <- jags(list(nt=nt, nstate=nstate), NULL,
            c('phi_state', 'phi', 'mean_phi', 'phi_sigma', 'p'),
            M, DIC=FALSE)
Note that above, p is a vector of probabilities of moving up to the next (adjacent) state.
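To see what this state process produces before involving JAGS, it can be simulated forward in plain R. The sketch below draws one realisation from the same rules (start in state 1, move up by at most one state per step, never move past nstate); the seed and the variable name `p_raw` are arbitrary choices for illustration.

```r
set.seed(123)
nt <- 29
nstate <- 5

phi_state <- integer(nt)
phi_state[1] <- 1L           # always start in state 1
p_raw <- runif(nt - 1)       # prior draws for the move-up probabilities

for (t in 2:nt) {
  # once the last state is reached, the move-up probability is forced to 0
  p  <- if (phi_state[t - 1] == nstate) 0 else p_raw[t - 1]
  up <- rbinom(1, 1, p)      # 1 = move to the next state, 0 = stay
  phi_state[t] <- phi_state[t - 1] + up
}

phi_state
```

Because the move-up probabilities are drawn from a flat prior, a forward simulation like this tends to reach the last state quickly; in the fitted model the data would be what pulls the switch points apart.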
