GBM: Object 'p' not found [duplicate]

I am using gbm to predict a binary response.
When I set cv.folds=0, everything works well. However, when cv.folds > 1, I get the error Error in object$var.levels[[i]] : subscript out of bounds when the first iteration of cross-validation finishes. Someone said this could be because some factor variables have missing levels in the training or testing data, but I tried using only numeric variables and still get this error.
> gbm.fit <- gbm(model.formula,
+ data=dataall_train,
+ distribution = "adaboost",
+ n.trees=10,
+ shrinkage=0.05,
+ interaction.depth=2,
+ bag.fraction = 0.5,
+ n.minobsinnode = 10,
+ train.fraction=0.5,
+ cv.folds=3,
+ verbose=T,
+ n.cores=1)
CV: 1
CV: 2
CV: 3
Error in object$var.levels[[i]] : subscript out of bounds
Anyone have some insights on this? Thanks!
Answering my own question:
Problem solved. This is caused by a bug in the function: the input data cannot contain variables other than those used in the model.

I second this solution: the input data in the R function gbm() cannot include variables (columns) that are not used in your model.
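A minimal sketch of that workaround, assuming `model.formula` and `dataall_train` from the question: keep only the columns the formula actually uses before calling gbm().

```r
library(gbm)

# all.vars() extracts the response and predictor names from the formula,
# so the data frame passed to gbm() contains no extra columns.
model.vars  <- all.vars(model.formula)
train.small <- dataall_train[, model.vars]

gbm.fit <- gbm(model.formula,
               data = train.small,        # only model variables
               distribution = "adaboost",
               n.trees = 10,
               shrinkage = 0.05,
               interaction.depth = 2,
               bag.fraction = 0.5,
               n.minobsinnode = 10,
               train.fraction = 0.5,
               cv.folds = 3,
               n.cores = 1)
```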

Related

How does the weights argument in glm work in R?

I'm really puzzled by the weights argument in glm. I realise that this question has been asked before, but I'm still confused about what the weights argument does or how it works. For example, in the code below my dependent variable PCL_Sum2 is binary and highly imbalanced. I would like both levels to be equally weighted. How would I accomplish this?
Final_Frame.df <- read.csv("no_subset.csv")
Omitted_Nas.df<-na.omit(Final_Frame.df)
This yields 278 remaining observations. Then when I go ahead and perform the regression:
prelim_model<-glm(PCL_Sum2~Mean_social_combined +
Mean_traditional_time+
Mean_Passive_Use_Updated+
factor(Gender)+
factor(Ethnicity)+
factor(Age)+
factor(Location)+
factor(Income)+
factor(Education)+
factor(Working_Home)+
Perceived_Fin_Risk+
Anxiety_diagnosed+
Depression_diagnosed+
Lived_alone+
Mean_Active_Use_Updated, data=Omitted_Nas.df, weights=???, family = binomial())
summary(prelim_model)
I've tried setting weights = 0.5, 0.5 but I always get the following error:
Error in model.frame.default(formula = PCL_Sum2 ~ Mean_social_combined + : variable lengths differ (found for '(weights)')
Any help would be greatly appreciated!
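The error arises because weights in glm() must be a vector with one entry per row of the data, not one value per class. A hedged sketch of inverse-frequency weights that balance a binary outcome, assuming PCL_Sum2 is coded 0/1 and using a shortened version of the question's formula:

```r
# One weight per observation: each class contributes equally overall.
tab <- table(Omitted_Nas.df$PCL_Sum2)
w   <- ifelse(Omitted_Nas.df$PCL_Sum2 == 1,
              sum(tab) / (2 * tab["1"]),
              sum(tab) / (2 * tab["0"]))

prelim_model <- glm(PCL_Sum2 ~ Mean_social_combined + factor(Gender),
                    data = Omitted_Nas.df,
                    weights = w,
                    family = binomial())
```

Note that glm() will warn about non-integer successes when a binomial model is given fractional weights; with weights used in this balancing sense, that warning is expected.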

LME error in model.frame.default ... variable lengths differ

I am trying to run a random-effects model with lme. It is part of a larger function, and I want it to be flexible so that I can pass the fixed (and ideally random) effects variable names to the lme function as variables. get() worked great for this when I started with lm, but lme only throws the ambiguous "Error in model.frame.default(formula = ~var1 + var2 + ID, data = list( : variable lengths differ (found for 'ID')". I'm stumped: the data are the same lengths, and there are no NAs in this data or the real data, ...
set.seed(12345) #because I got scolded for not doing this previously
var1="x"
var2="y"
exdat<-data.frame(ID=c(rep("a",10),rep("b",10),rep("c",10)),
x = rnorm(30,100,1),
y = rnorm(30,100,2))
#exdat<-as.data.table(exdat) #because the data are actually in a dt, but that doesn't seem to be the issue
This works great:
lm(log(get(var1))~log(get(var2)),data=exdat)
lme(log(y)~log(x),random=(~1|ID), data=exdat)
This does not work:
lme(log(get(var1,pos=exdat))~log(get(var2)),random=(~1|ID), data=exdat)
This does not work either, and throws a new error: "Error in model.frame.default(formula = ~var1 + var2 + rfac + exdat, data = list( : invalid type (list) for variable 'exdat'"
rfac="ID"
lme(log(get(var1))~log(get(var2)),random=~1|get(rfac,pos=exdat), data=exdat)
Part of the problem seems to be with the nlme package. If you can consider using lme4 instead, the desired results can be obtained with:
lme4::lmer(log(get(var1)) ~ log(get(var2)) + (1 | ID),
data = exdat)
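If staying with nlme is required, an alternative sketch (an assumption-based workaround, not from the original answer) is to build the formulas as strings instead of using get(), so that lme sees ordinary variable names:

```r
library(nlme)

set.seed(12345)
var1 <- "x"; var2 <- "y"; rfac <- "ID"
exdat <- data.frame(ID = rep(c("a", "b", "c"), each = 10),
                    x = rnorm(30, 100, 1),
                    y = rnorm(30, 100, 2))

# Paste the variable names into formula strings, then convert with
# as.formula(); lme() then stores clean formulas that model.frame()
# can evaluate without get().
fixed.f  <- as.formula(paste0("log(", var1, ") ~ log(", var2, ")"))
random.f <- as.formula(paste0("~ 1 | ", rfac))

lme(fixed.f, random = random.f, data = exdat)
```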

Resolving symmetry in gnls model

I'm trying to fit a logistic growth model in R, using gnls in the nlme package.
I have previously successfully fit a model:
mod1 <- gnls(Weight ~ I(A/(1+exp(b + v0*Age + v1*Sum.T))),
data = df,
start = c(A= 13.157132, b= 3, v0= 0.16, v1= -0.0059),
na.action=na.omit)
However, I now wish to constrain b so that it is not fitted by the model, so have tried fitting a second model:
mod2 <- gnls(Weight ~ I(A/(1+exp(log((A/1.022)-1) + v0*Age + v1*Sum.T))),
data = df,
start = c(A= 13.157132, v0= 0.16, v1= -0.0059),
na.action=na.omit)
This model returned the error:
Error in gnls(Weight ~ A/(1 + exp(log((A/1.022) - 1) + v0 * Age + :
approximate covariance matrix for parameter estimates not of full rank
Warning messages:
1: In log((A/1.022) - 1) : NaNs produced
Searching the error suggests that the problem is caused by symmetry in the model, and solutions to specific questions involve adapting the formula with different parameters. Unfortunately, my statistical knowledge is not good enough to (a) fully understand the problem or (b) adapt the formula myself.
As for the warning messages (there were 15 in all, all the same), I can't see why they arise, because this section of the model works on its own (with example numbers).
Any help with any of these queries would be greatly appreciated.
It may be informative for users to know that I finally solved this with what turned out to be a simple solution (with help from a friend).
Since exp(a+b) = exp(a)*exp(b), the equation can be rewritten as:
Weight ~ I(A/(1+((A/1.022)-1) * exp(v0*Age + v1*Sum.T)))
which fits without any problems. In general, rewriting the equation in a different form would seem to be the answer.
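Putting the rewritten formula back into the gnls call from the question (start values copied from mod2; df, Weight, Age and Sum.T are the asker's data, so this is a sketch rather than a runnable example):

```r
library(nlme)

# Same model as mod2, but with the exp() term factored so the constrained
# intercept ((A/1.022) - 1) appears as a multiplier instead of inside a
# log(), avoiding the NaN warnings and the rank-deficiency error.
mod3 <- gnls(Weight ~ I(A / (1 + ((A/1.022) - 1) * exp(v0*Age + v1*Sum.T))),
             data = df,
             start = c(A = 13.157132, v0 = 0.16, v1 = -0.0059),
             na.action = na.omit)
```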

Select Features for Naive Bayes Classification in R

I want to use a naive Bayes classifier to make some predictions.
So far I can make the prediction with the following (sample) code in R:
library(klaR)
library(caret)
Faktor<-sample( LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
alter<-abs(rnorm(10000,30,5))
HF<-abs(rnorm(10000,1000,200))
Diffalq<-rnorm(10000)
Geschlecht<-sample(c("Mann","Frau", "Firma"),10000,replace=TRUE)
data<-data.frame(Faktor,alter,HF,Diffalq,Geschlecht)
set.seed(5678)
flds<-createFolds(data$Faktor, 10)
train<-data[-flds$Fold01 ,]
test<-data[flds$Fold01 ,]
features <- c("HF","alter","Diffalq", "Geschlecht")
formel<-as.formula(paste("Faktor ~ ", paste(features, collapse= "+")))
nb<-NaiveBayes(formel, train, usekernel=TRUE)
pred<-predict(nb,test)
test$Prognose<-as.factor(pred$class)
Now I want to improve this model by feature selection. My real data has about 100 features.
So my question is: what would be the best way to select the most important features for naive Bayes classification?
Is there any paper for reference?
I tried the following line of code, but unfortunately it did not work:
rfe(train[, 2:5],train[, 1], sizes=1:4,rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))
EDIT: It gives me the following error message
Fehler in { : task 1 failed - "nicht-numerisches Argument für binären Operator"
Calls: rfe ... rfe.default -> nominalRfeWorkflow -> %op% -> <Anonymous>
(The error message is in German; it translates to "non-numeric argument to binary operator". You may wish to reproduce this on your machine.)
How can I adjust the rfe() call to get a recursive feature elimination?
This error appears to be due to the ldaFuncs. Apparently they do not like factors when using matrix input. The main problem can be re-created with your test data using
mm <- ldaFuncs$fit(train[2:5], train[,1])
ldaFuncs$pred(mm,train[2:5])
# Error in FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...) :
# non-numeric argument to binary operator
And this only seems to happen if you include the factor variable. For example
mm <- ldaFuncs$fit(train[2:4], train[,1])
ldaFuncs$pred(mm,train[2:4])
does not return the same error (and appears to work correctly). Again, this only appears to be a problem when you use the matrix syntax. If you use the formula/data syntax, you don't have the same problem. For example
mm <- ldaFuncs$fit(Faktor ~ alter + HF + Diffalq + Geschlecht, train)
ldaFuncs$pred(mm,train[2:5])
appears to work as expected. This means you have a few different options. Either you can use the rfe() formula syntax like
rfe(Faktor ~ alter + HF + Diffalq + Geschlecht, train, sizes=1:4,
rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))
Or you could expand the dummy variables yourself with something like
train.ex <- cbind(train[,1], model.matrix(~.-Faktor, train)[,-1])
rfe(train.ex[, 2:6],train.ex[, 1], ...)
But this doesn't remember which variables are paired in the same factor so it's not ideal.

R: NA/NaN/Inf in X error

I am trying to perform a negative binomial regression using R. When I am executing the following command:
DV2.25112013.nb <- glm.nb(DV2.25112013~ Bcorp.Geographic.Proximity + Dirty.Industry +
Clean.Industry + Bcorp.Industry.Density + State + Dirty.Region +
Clean.Region + Bcorp.Geographic.Density + Founded.As.Bcorp + Centrality +
Bcorp.Industry.Density.Squared + Bcorp.Geographic.Density.Squared +
Regional.Institutionalization + Sales + Any.Best.In.Class +
Dirty.Region.Heterogeneity + Clean.Region.Heterogeneity +
Ind.Dirty.Heterogeneity+Ind.Clean.Heterogeneity + Industry,
data = analysis25112013DF6)
R gives the following error:
Error in glm.fitter(x = X, y = Y, w = w, etastart = eta, offset = offset, :
NA/NaN/Inf in 'x'
In addition: Warning message:
step size truncated due to divergence
I do not understand this error, since my data matrix does not contain any NA/NaN/Inf values. How can I fix this?
Thank you!
I think the most likely cause of this error is negative values or zeros in the data, since the default link in glm.nb is 'log'. It would be easy enough to test by changing to link="identity". I also think you need to try smaller models, maybe a quarter of those variables to start. That also lets you add related variables as bundles, since it looks from the names like you have a potentially severe risk of collinearity with categorical variables.
We really need a data description. I wondered about Dirty.Industry + Clean.Industry: that is the sort of dichotomy that is better handled with a factor variable that has those levels, which prevents the collinearity if Clean = not-Dirty. Perhaps similarly with your "Heterogeneity" variables. (I'm not convinced that @BenBolker's comment is correct; I think it very possible that you first need statistical consultation before addressing coding issues.)
require(MASS)
data(quine) # following example in ?glm.nb page
> quine$Days[1] <- -2
> quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine, link = "identity")
Error in eval(expr, envir, enclos) :
negative values not allowed for the 'Poisson' family
> quine$Days[1] <- 0
> quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine, link = "identity")
Error: no valid set of coefficients has been found: please supply starting values
In addition: Warning message:
In log(y/mu) : NaNs produced
I have resolved this issue by passing the control argument into the model with maxit = 10 or lower; the default is 25 iterations. Perhaps it works for you with a different number of iterations. Just try it.
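A minimal sketch of that suggestion, using the quine example from ?glm.nb (the maxit value is illustrative; glm.nb forwards control to glm.control()):

```r
library(MASS)
data(quine)

# Cap the IWLS iterations via glm.control(); when the fit diverges,
# a lower maxit stops it before the step size blows up.
quine.nb <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine,
                   control = glm.control(maxit = 10))
```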
