How to resolve "number of items to replace is not a multiple of replacement length" error in a bootstrapped regression? - r

I am trying to fit a bootstrapped regression model using code from Andy Field's textbook Discovering Statistics Using R.
I am struggling to interpret an error message that I receive when running the boot() function. From reading other forum posts I understand that it is telling me there is a mismatch in the number of items between two objects, but I don't understand what this means in my context or how to resolve it.
You can download my data here (a publicly available dataset of Airbnb listings); my code and the full error message are below. I am using a mixture of factored dummy variables and continuous variables as predictors. Thanks in advance for any help!
Code:
bootReg <- function(formula, data, i) {
  d <- data[i, ]
  fit <- lm(formula, data = d)
  return(coef(fit))
}
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ instant_bookable + cancellation_policy +
host_since_cat + host_location_cat + host_response_time +
host_is_superhost + host_listings_cat + property_type + room_type +
accommodates + bedrooms + beds + price + security_deposit +
cleaning_fee + extra_people + minimum_nights + amenityBreakfast +
amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting +
amenitySmoking + amenityPets + amenityWifi + amenityTV,
data = listingsRating, R = 2000)
Error:
Error in t.star[r, ] <- res[[r]] :
number of items to replace is not a multiple of replacement length
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation

The Problem
The problem is your factor variables. boot::boot() refits lm() on a fresh resample of your data in every replicate, and each resample only yields coefficients for the factor levels it actually contains. Rare levels drop out of some resamples, so the coefficient vectors returned by bootReg can differ in length from one replicate to the next, while boot() requires them all to be the same length. This can be reproduced if you do
debug(boot)
set.seed(123)
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ instant_bookable + cancellation_policy +
host_since_cat + host_location_cat + host_response_time +
host_is_superhost + host_listings_cat + property_type + room_type +
accommodates + bedrooms + beds + price + security_deposit +
cleaning_fee + extra_people + minimum_nights + amenityBreakfast +
amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting +
amenitySmoking + amenityPets + amenityWifi + amenityTV,
data = listingsRating, R = 2)
which will allow you to move through the function call one line at a time. After you run the line
res <- if (ncpus > 1L && (have_mc || have_snow)) {
    if (have_mc) {
        parallel::mclapply(seq_len(RR), fn, mc.cores = ncpus)
    }
    else if (have_snow) {
        list(...)
        if (is.null(cl)) {
            cl <- parallel::makePSOCKcluster(rep("localhost", ncpus))
            if (RNGkind()[1L] == "L'Ecuyer-CMRG")
                parallel::clusterSetRNGStream(cl)
            res <- parallel::parLapply(cl, seq_len(RR), fn)
            parallel::stopCluster(cl)
            res
        }
        else parallel::parLapply(cl, seq_len(RR), fn)
    }
} else lapply(seq_len(RR), fn)
Then try
setdiff(names(res[[1]]), names(res[[2]]))
# [1] "property_typeBarn" "property_typeNature lodge"
Two factor levels are present in the first resample but not in the second, so the two coefficient vectors have different lengths. This is what triggers the error.
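Here is a toy sketch of the mechanism (simulated data, not the Airbnb set; it assumes the predictor is stored as a character vector, as read.csv() can produce): lm() rebuilds the factor levels from each subset, so a rare level can vanish and coef() gets shorter.
# the rare level "c" occurs once; dropping that row shortens the coefficient vector
set.seed(42)
toy <- data.frame(y = rnorm(10),
                  f = c(rep("a", 5), rep("b", 4), "c"),
                  stringsAsFactors = FALSE)
names(coef(lm(y ~ f, data = toy)))         # "(Intercept)" "fb" "fc"
names(coef(lm(y ~ f, data = toy[1:9, ])))  # "(Intercept)" "fb" -- no "fc"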
The Solution
Use model.matrix() to expand your factors beforehand (following this Stack Overflow post):
df2 <- model.matrix( ~ review_scores_rating + instant_bookable + cancellation_policy +
host_since_cat + host_location_cat + host_response_time +
host_is_superhost + host_listings_cat + property_type + room_type +
accommodates + bedrooms + beds + price + security_deposit +
cleaning_fee + extra_people + minimum_nights + amenityBreakfast +
amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting +
amenitySmoking + amenityPets + amenityWifi + amenityTV - 1, data = listingsRating)
undebug(boot)
set.seed(123)
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ .,
data = as.data.frame(df2), R = 2)
(Note that throughout I reduce R to 2 just for faster runtime during debugging).
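A follow-up note on this trick (not from the original answer): model.matrix() produces one fixed column per factor level, so every replicate now returns a coefficient vector of the same length. Some of those column names contain spaces (e.g. "property_typeNature lodge"), and as.data.frame() converts them into syntactically valid names, which is why the review_scores_rating ~ . formula works.
# quick check of the sanitised column names
head(colnames(as.data.frame(df2)))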

Both the way you define bootReg and the way you call it are wrong.
First, you must keep to the argument order that boot() expects of the function passed as statistic, in this case bootReg: the first argument is the dataset and the second is the vector of indices; any other, optional arguments come after.
bootReg <- function(data, i, formula) {
  d <- data[i, ]
  fit <- lm(formula, data = d)
  return(coef(fit))
}
Second, in the call, the optional arguments are passed through the dots (...) argument, so once again keep to the order of arguments as defined in help("boot"), section Usage.
bootResults <- boot(data = iris, statistic = bootReg, R = 2000,
formula = Sepal.Length ~ Sepal.Width)
colMeans(bootResults$t)
#[1] 6.5417719 -0.2276868
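A sketch of a typical next step (not part of the original answer): once boot() runs cleanly, boot.ci() gives per-coefficient confidence intervals.
# percentile bootstrap CI for the second coefficient (the Sepal.Width slope)
boot.ci(bootResults, type = "perc", index = 2)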

Related

How do I fix "object loglik not found" error when using glmmLasso for model selection?

I am using glmmLasso for model selection. I get to the point where I try to implement my tuned lambda, only to get the error:
Error in logLik.glmmLasso(y = y, yhelp = yhelp, mu = mu, family = family, :
object 'loglik' not found
I have searched but found no solution and am honestly not even sure what this error means.
[Image: the data before scaling]
BB3[, c(6,16,20,24,28,29,32:38)] <- scale(BB3[, c(6,16,20,24,28,29,32:38)], center = TRUE, scale = TRUE)
[Image: the data after centering and scaling]
Below is the code I was using:
# packages used below: MASS provides glmmPQL(), nlme provides VarCorr(),
# glmmLasso provides glmmLasso()
library(MASS)
library(nlme)
library(glmmLasso)

BB3 <- data.frame(BB3)
lambda <- seq(500, 0, by = -5)
family <- inverse.gaussian(link = "1/mu^2")
ID <- as.factor(BB3$ID)
BIC_vec <- rep(Inf, length(lambda))
Coeff_ma <- NULL                       # collects the coefficient estimates for each lambda
Devianz_ma <- rep(NA, length(lambda))  # collects the deviance for each lambda
PQL <- glmmPQL(BBSDI ~ 1, random = ~1 | ID, family = family, data = BB3)
Delta.start <- c(as.numeric(PQL$coef$fixed), rep(0, 11), as.numeric(t(PQL$coef$random$ID)))
Q.start <- as.numeric(VarCorr(PQL)[1, 1])
for (j in 1:length(BIC_vec)) {
  print(paste("Iteration ", j, sep = ""))
  glm1 <- try(glmmLasso(BBSDI ~ AvgTemp + AvgJulianDate + AvgCloud +
                          AvgBeaufort + TreeCover + PropObsBV + as.factor(Quality) +
                          NH + HHSDI + PCNum,
                        rnd = list(ID = ~1),
                        family = family, data = BB3, lambda = lambda[j],
                        switch.NR = FALSE, final.re = FALSE), silent = TRUE)
  if (!inherits(glm1, "try-error")) {
    BIC_vec[j] <- glm1$bic
    Coeff_ma <- cbind(Coeff_ma, glm1$coefficients)
    # save error (deviance) values
    y.hat <- predict(glm1, BB3)
    Devianz_ma[j] <- sum(family$dev.resids(BB3$BBSDI, y.hat, wt = rep(1, length(y.hat))))
  }
}
opt <- which.min(BIC_vec)
glm1_Final <- glmmLasso(BBSDI ~ AvgTemp + AvgJulianDate + AvgCloud +
                          AvgBeaufort + TreeCover + PropObsBV + as.factor(Quality) +
                          NH + HHSDI + PCNum,
                        rnd = list(ID = ~1),
                        family = family, data = BB3, lambda = lambda[opt],
                        switch.NR = FALSE, final.re = FALSE,
                        control = list(start = Delta.start, q_start = Q.start))
This final fit is where I get the error, and I can't even figure out what it means.
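Since every glmmLasso() call inside the loop is wrapped in try(), one thing worth checking first (a hedged debugging sketch, not an answer from the original thread): if every fit fails silently, BIC_vec stays all Inf and which.min() picks a lambda that never produced a model.
# check which lambda values actually produced a fit;
# if this is empty, every glmmLasso() call failed silently inside try()
which(is.finite(BIC_vec))
# re-run one failing value without try() to surface the real error message
glmmLasso(BBSDI ~ AvgTemp + AvgJulianDate + AvgCloud + AvgBeaufort +
            TreeCover + PropObsBV + as.factor(Quality) + NH + HHSDI + PCNum,
          rnd = list(ID = ~1), family = family, data = BB3,
          lambda = lambda[opt], switch.NR = FALSE, final.re = FALSE)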

svyglm - how to code for a logistic regression model across all variables?

In R, using glm you can include all variables simply with a ., as shown in How to succinctly write a formula with many variables from a data frame?
For example:
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
However, I am struggling to do this with svydesign. I have many explanatory variables plus an ID and a weight variable, so first I create my survey design:
des <- svydesign(ids = ~id, weights = ~wt, data = df)
Then I try creating my binomial model using weights:
binom <- svyglm(y ~ ., design = des, family = "binomial")
But I get the error:
Error in svyglm.survey.design(y ~ ., design = des, family = "binomial") :
all variables must be in design = argument
What am I doing wrong?
You typically wouldn't want to do this, because "all the variables" would include design metadata such as weights, cluster indicators, stratum indicators, etc.
You can use colnames() to extract all the variable names from a design object and then reformulate(), probably after subsetting the names, e.g. with the api example in the package:
> all_the_names <- colnames(dclus1)
> all_the_actual_variables <- all_the_names[c(2, 11:37)]
> reformulate(all_the_actual_variables,"y")
y ~ stype + pcttest + api00 + api99 + target + growth + sch.wide +
comp.imp + both + awards + meals + ell + yr.rnd + mobility +
acs.k3 + acs.46 + acs.core + pct.resp + not.hsg + hsg + some.col +
col.grad + grad.sch + avg.ed + full + emer + enroll + api.stu
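A runnable version of that idea (a sketch: the response "y" above is a placeholder, so here api00 is used as the response, which is my assumption, and removed from the predictor list):
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
all_the_names <- colnames(dclus1)
# keep the substantive variables, then drop the chosen response from the predictors
all_the_actual_variables <- setdiff(all_the_names[c(2, 11:37)], "api00")
f <- reformulate(all_the_actual_variables, response = "api00")
mod <- svyglm(f, design = dclus1)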

solve TWANG package svyglm.survey.design error using mnps function

I have been trying to run the mnps() function in R but I receive this error message each time:
Error in svyglm.survey.design(x ~ t, design) :
all variables must be in design= argument
This is my code:
mnps_1 <- mnps(M02M_CONGPAT ~ M2_M_PPD + M00M2_COUPLE +
M00M2_PEREACC + M00X_AUTPATHO + M0_M_age + M0_P_age + M0_M_nation +
M0_P_nation + mother_diploma + father_diploma + mom_profession +
dad_profession + fchild + mother_medicine + M2_P_PPD +
distress_mod_parent + distress_sev_parent + number_household +
child_living_sit + Fwanted_child + Mwanted_child + M2_conflict_ab +
M2_conflict_dp + M2_conflict_bp + M2_conflict + M0_zone +
M0_siblingbis + cdi_pere + relative_poverty + relationship_mother,
data = impute_1,
n.trees = 10000,
interaction.depth = 3,
shrinkage = 0.01,
perm.test.iters = 0,
stop.method = c("es.mean", "ks.max"),
estimand = "ATE",
verbose = FALSE)
if("pair" != "none"){
baltab<-bal.table(mnps_1, collapse.to="pair")
write.table(baltab,file='propensity_scores_impute_1.csv',row.names=FALSE,col.names=TRUE,sep=',',na='.')}
Can someone help me? Thank you!
This is probably because your data variable, impute_1, is a tibble. bal.table() doesn't work properly when a tibble is supplied to mnps(). I was able to replicate this error by supplying a tibble to mnps(), and this is a common problem in older packages, but there may be a different cause.
If this is what's going on, try re-running mnps() with impute_1 replaced by as.data.frame(impute_1).
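A minimal sketch of that fix (assuming the tibble diagnosis is correct):
# convert the tibble to a plain data.frame before it ever reaches mnps()
impute_1 <- as.data.frame(impute_1)
# then re-run the mnps() call above unchanged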

how to insert more densely sampled auxiliary variables into the georob package in RStudio

I need to predict a soil variable as a function of auxiliary variables using the georob package.
My soil dataset has 200 observations and my auxiliary-variable set has 19,940 points; however, I can't work out how to pass the coordinates of the auxiliary variables as prediction points in the code.
dat <- read.csv("malhas amostrais/solo_200.csv", sep = ",")
covar <- read.csv("../dados/csv/variaveis_auxiliares.csv", sep = ";")
ku_georob_cpeso <- georob(argila ~ CV + CH + dist_bebedouros + Eca_0.5m + Eca_1m +
                            elevacao + IH_0.5m + sd_ndvi_01 + sd_ndvi_02 + twi +
                            S_P_T + sd_b4 + sd_b5 + sd_b6 + sd_b7,
                          data = dat,
                          locations = ~ x + y,
                          variogram.model = "RMexp",
                          param = c(variance = 200, nugget = 600, scale = 150),
                          verbose = 3,
                          psi.func = "huber")
ku_georob_cpeso <- georob(argila ~ CV + CH + dist_bebedouros + Eca_0.5m + Eca_1m +
                            elevacao + IH_0.5m + sd_ndvi_01 + sd_ndvi_02 + twi +
                            S_P_T + sd_b4 + sd_b5 + sd_b6 + sd_b7,
                          data = dat1,
                          subset = cova,
                          locations = ~ x + y,
                          variogram.model = "RMexp",
                          param = c(variance = 200, nugget = 600, scale = 150),
                          verbose = 3,
                          psi.func = "huber")
I receive the error:
Error in xj[i] : invalid subscript type 'list'
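The error itself is informative: subset expects a vector of row indices into data, so passing a whole data frame makes R complain about a 'list' subscript. A hedged sketch of the usual georob workflow (assuming covar contains x, y and every covariate in the formula) instead fits on the soil data and hands the auxiliary grid to predict() as newdata:
library(georob)
fit <- georob(argila ~ CV + CH + dist_bebedouros + Eca_0.5m + Eca_1m + elevacao +
                IH_0.5m + sd_ndvi_01 + sd_ndvi_02 + twi + S_P_T + sd_b4 + sd_b5 +
                sd_b6 + sd_b7,
              data = dat, locations = ~ x + y,
              variogram.model = "RMexp",
              param = c(variance = 200, nugget = 600, scale = 150),
              psi.func = "huber")
pred <- predict(fit, newdata = covar)  # kriging predictions at the 19940 auxiliary points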

neural network in R

Hi, I am trying to use the neuralnet function in R to predict an integer outcome (meaning) from the rest of the variables.
Here is the code I have used:
library("neuralnet")
I am going to put 2/3 from the data for neural network learning and the rest
for test
ind<-sample(1:nrow(Data),6463,replace=FALSE)
Train<-Data[ind,]
Test<-Data[-ind,]
m <- model.matrix(
  ~ meaning +
    firstLevelAFFIRM + firstLevelDAT.PRSN + firstLevelMODE +
    firstLevelO.DEF + firstLevelO.INDIV + firstLevelS.AGE.INDIV +
    secondLevelV.BIN + secondLevelWord1 + secondLevelWord2 +
    secondLevelWord3 + secondLevelWord4 + thirdLevelP.TYPE,
  data = Train[, -1])  # the first column is ID, which I am not going to use
PredictorVariables <- paste("m[,", 3:ncol(m), "]", sep = "")
Formula <- formula(paste("meaning ~ ", paste(PredictorVariables, collapse = " + ")))
net <- neuralnet(Formula, data = m, hidden = 3, threshold = 0.05)
m.test <- model.matrix(
  ~ meaning +
    firstLevelAFFIRM + firstLevelDAT.PRSN + firstLevelMODE +
    firstLevelO.DEF + firstLevelO.INDIV + firstLevelS.AGE.INDIV +
    secondLevelV.BIN + secondLevelWord1 + secondLevelWord2 +
    secondLevelWord3 + secondLevelWord4 + thirdLevelP.TYPE,
  data = Test[, -1])
net.results <- compute(net, m.test[, -c(1, 2)])  # first column is ID, second is the outcome I am predicting
output <- cbind(round(net.results$net.result), Test$meaning)
mean(round(net.results$net.result) != Test$meaning)
The misclassification rate I got was around 0.01, which is great, but my question is: why is the outcome I get (net.results$net.result) not an integer?
I assume that your output is linear: neuralnet() defaults to linear.output = TRUE, so the output neuron returns unbounded continuous values. Try setting linear.output = FALSE, which applies the activation function to the output layer as well.
net <- neuralnet(Formula, data = m, hidden = 3, threshold = 0.05, linear.output = FALSE)
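If meaning is coded 0/1 (an assumption; the post only says "integer"), the logistic outputs then lie in (0, 1), and the round() call already used above maps them to classes:
# sketch reusing the objects defined earlier (assumes `meaning` is 0/1)
net.results <- compute(net, m.test[, -c(1, 2)])
mean(round(net.results$net.result) != Test$meaning)  # misclassification rate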
