R: NA/NaN/Inf in 'x' error

I am trying to perform a negative binomial regression in R. When I execute the following command:
DV2.25112013.nb <- glm.nb(DV2.25112013~ Bcorp.Geographic.Proximity + Dirty.Industry +
Clean.Industry + Bcorp.Industry.Density + State + Dirty.Region +
Clean.Region + Bcorp.Geographic.Density + Founded.As.Bcorp + Centrality +
Bcorp.Industry.Density.Squared + Bcorp.Geographic.Density.Squared +
Regional.Institutionalization + Sales + Any.Best.In.Class +
Dirty.Region.Heterogeneity + Clean.Region.Heterogeneity +
Ind.Dirty.Heterogeneity+Ind.Clean.Heterogeneity + Industry,
data = analysis25112013DF6)
R gives the following error:
Error in glm.fitter(x = X, y = Y, w = w, etastart = eta, offset = offset, :
NA/NaN/Inf in 'x'
In addition: Warning message:
step size truncated due to divergence
I do not understand this error, since my data matrix does not contain any NA/NaN/Inf values. How can I fix this?
thank you,

I think the most likely cause of this error is negative values or zeros in the data, since the default link in glm.nb is 'log'. It would be easy enough to test by changing to link = "identity". I also think you need to try smaller models, maybe a quarter of those variables to start. That also lets you add related variables as bundles, and judging from the names, you have potentially severe collinearity among your categorical variables.
We really need a data description. I wondered about Dirty.Industry + Clean.Industry. That is the sort of dichotomy that is better handled with a single factor variable that has those levels, which prevents the collinearity if Clean = not-Dirty. Perhaps the same applies to your "Heterogeneity" variables. (I'm not convinced that @BenBolker's comment is correct. I think it very possible that you first need statistical consultation before addressing coding issues.)
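As a minimal sketch of that suggestion (made-up data; the column names are assumptions taken from the question), the Dirty/Clean dichotomy could be collapsed into one factor so the model matrix is not rank-deficient:

```r
# Hypothetical sketch: if Clean = not-Dirty, two 0/1 dummies are
# perfectly collinear; a single factor with those levels avoids that.
df <- data.frame(Dirty.Industry = c(1, 0, 1, 0),
                 Clean.Industry = c(0, 1, 0, 1))
df$Industry.Type <- factor(ifelse(df$Dirty.Industry == 1, "dirty", "clean"))
# use Industry.Type in the model formula instead of the two dummies
```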
> require(MASS)
> data(quine)  # following the example on the ?glm.nb page
> quine$Days[1] <- -2
> quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine, link = "identity")
Error in eval(expr, envir, enclos) :
  negative values not allowed for the 'Poisson' family
> quine$Days[1] <- 0
> quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine, link = "identity")
Error: no valid set of coefficients has been found: please supply starting values
In addition: Warning message:
In log(y/mu) : NaNs produced

I resolved this issue by passing a control argument into the model call with maxit = 10 or lower (the argument is maxit in glm.control; the default is 25 iterations). Perhaps it will work for you with a few more or fewer iterations. Just try it.
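A sketch of that suggestion, reusing the quine data from ?glm.nb (the specific cap of 10 and the simplified formula are illustrative choices, not the asker's model):

```r
library(MASS)
data(quine)
# Cap the IWLS iterations via glm.control(); maxit is the actual
# argument name accepted by glm.control().
fit <- glm.nb(Days ~ Sex + Age, data = quine,
              control = glm.control(maxit = 10))
```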


Why do I keep getting "argument is of length zero" error?

I'm trying to run a lasso regression on my large dataset but I keep getting the following error messages:
Error in if (is.null(np) | (np[2] <= 1)) stop("x should be a matrix with 2 or more columns") :
  argument is of length zero
Error in elnet(x, is.sparse, ix, jx, y, weights, offset, type.gaussian, :
  (list) object cannot be coerced to type 'double'
My dataset contains a travel index (GTI) for determining 'safe' LGBT travel. I'm trying to use the other variables in the dataset to fit a model and predict the GTI.
Here is the code I have used thus far:
gaydata <- read.csv(file = 'GayData.csv')
# sample of data and headers
names(gaydata)[names(gaydata) == "Total"] <- "GTI"
lasso_1 = glmnet(GTI ~ Anti.Discrimination.Legislation + Marriage.Civil.Partnership + Adoption.Allowed +
                   Transgender.Rights + Intersex.3rd.Option + Equal.Age.of.Consent +
                   X.Conversion.Therapy + LGBT.Marketing + Religious.Influence +
                   HIV.Travel.Restrictions + Anti.Gay.Laws + Homosexuality.Illegal +
                   Pride.Banned + Locals.Hostile + Prosecution + Murders + Death.Sentences, data = gaydata)
# or
lasso_2 = glmnet(x = gaydata, y = gaydata$GTI, alpha = 1)
# Removing 'Country' since it is categorical data that may be causing an issue
gaydata = subset(gaydata, select = -Country)
# Trying to identify what is causing the "argument is of length zero" error
sapply(gaydata, is.null)
sapply(gaydata, is.factor)
sum(is.null(gaydata))
In my research trying to find a solution to this issue, I've seen that nulls, incorrect column names, and issues with factor variables typically cause this error. However, my data does not have those problems, so I'm lost. My data is a copy and paste from the
Just figured it out with the help of a statistician:
Apparently I needed to change my dataset into a matrix
gaydata = as.matrix(gaydata)
and use the following format
lasso_0 = glmnet(y=gaydata[,2], x=gaydata[,-2])
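A related base-R step that often helps here (a sketch with toy data, since the real CSV isn't shown): model.matrix() builds the fully numeric matrix glmnet expects, dummy-coding any factor columns along the way.

```r
# Toy stand-in for the real data; glmnet needs numeric matrices,
# not a data frame.
df <- data.frame(GTI    = c(10, 20, 30, 40),
                 score  = c(1.5, 2.0, 3.5, 4.0),
                 region = factor(c("a", "b", "a", "b")))
X <- model.matrix(GTI ~ . - 1, data = df)  # predictors only, factors dummy-coded
y <- df$GTI
# then: glmnet(x = X, y = y, alpha = 1)
```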

Error in rep(" ", spaces1) : invalid 'times' argument

I'm trying to carry out covariate balancing using the ebal package. The basic code is:
W1 <- weightit(Conformidad ~ SexoCon + DurPetFiscPrisión1 +
Edad + HojaHistPen + NacionCon + AnteVivos +
TipoAbog + Reincidencia + Habitualidad + Delitos,
data = Suspension1,
method = "ebal", estimand = "ATT")
I then want to check the balance using the summary function:
summary(W1)
This originally worked fine but now I get the error message:
Error in rep(" ", spaces1) : invalid 'times' argument
It's the same dataset and the same code, except that I changed some of the covariates. But now, even when I go back to the original covariates, I get the same error. Any ideas would be much appreciated!
I'm the author of WeightIt. That looks like a bug. I'll take a look at it. Are you using the most updated version of WeightIt?
Also, summary() doesn't assess balance. To do that, you need to use cobalt::bal.tab(). summary() summarizes the distribution of the weights, which is less critical than examining balance. bal.tab() also displays the effective sample size, which is probably the most important statistic produced by summary().
I encountered the same error message. It happens when the treatment variable passed to weightit() is coded as a factor or character rather than as numeric.
To make summary() work, recode the treatment as 1s and 0s.
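A minimal sketch of that recode (made-up treatment vector; the real variable in the question is Conformidad):

```r
# Recode a factor/character treatment to numeric 0/1 so that
# summary() on the weightit object works.
treat   <- factor(c("control", "treated", "treated", "control"))
treat01 <- as.numeric(treat == "treated")  # 1 = treated, 0 = control
```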

GBM: Object 'p' not found [duplicate]

I am using gbm to predict binary response.
When I set cv.folds = 0, everything works well. However, when cv.folds > 1, I get the error Error in object$var.levels[[i]] : subscript out of bounds after the first iteration of cross-validation finishes. Someone said this could be because some factor variables have missing levels in the training or testing data, but I tried using only numeric variables and still get this error.
> gbm.fit <- gbm(model.formula,
+ data=dataall_train,
+ distribution = "adaboost",
+ n.trees=10,
+ shrinkage=0.05,
+ interaction.depth=2,
+ bag.fraction = 0.5,
+ n.minobsinnode = 10,
+ train.fraction=0.5,
+ cv.folds=3,
+ verbose=T,
+ n.cores=1)
CV: 1
CV: 2
CV: 3
Error in object$var.levels[[i]] : subscript out of bounds
Anyone have some insights on this? Thanks!
Answering myself:
Problem solved. This is due to a bug in the function: the input data cannot contain variables other than the variables in the model.
I second this solution: the input data passed to gbm() cannot include columns that will not be used in your model.
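A sketch of that fix (the formula and column names here are invented stand-ins): all.vars() pulls exactly the variables a formula mentions, so the data frame passed to gbm() contains nothing else.

```r
model.formula <- y ~ x1 + x2                       # stand-in formula
dataall_train <- data.frame(y  = c(0, 1, 0, 1),
                            x1 = c(1.2, 3.4, 5.6, 7.8),
                            x2 = c(2, 4, 6, 8),
                            unused = letters[1:4]) # column not in the model
# Keep only the modelled columns before calling gbm().
train_sub <- dataall_train[, all.vars(model.formula)]
# then: gbm(model.formula, data = train_sub, cv.folds = 3, ...)
```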

R - Error: variable lengths differ (yet I can't see how they could) - require(boot)

I am running code to plot 10-fold cross-validation errors (something I learnt in an online Stanford course). The goal is to find out which polynomial degree (1 to 5) gives the best fit for out-of-sample predictions.
The initial coding is:
require(ISLR)
require(boot)
degree=1:5
cv.error10=rep(0,5)
for(d in degree){
  glm.fit=glm(mpg~poly(horsepower,d), data=Auto)
  cv.error10[d]=cv.glm(Auto,glm.fit,K=10)$delta[1]
}
plot(degree,cv.error10,type="b",col="red")
This works fine. Now I am trying to do the same for my data (I run a negative binomial)
glm.fit2<-glm(abs_pb_t ~ RI1 + RA1 + pacl_to + abs_pb + pbf3 + pbf32 + pbs + pbc + suc_inc + cdval + cdval:RI1 + cdval:RA1 + pbf3:RI1 + pbf3:RA1 + pbf32:RI1 + pbf32:RA1 + Rfirm, data=p.data, family="quasipoisson")
cv.error10=rep(0,5)
for(d in degree){
glm.fit2=update(glm.fit2 , ~. + poly(yy,d), data=p.data, na.action="na.exclude")
cv.error10[d]=cv.glm(p.data,glm.fit2,K=10)$delta[1]
}
I added the exclusion of NA values because people had suggested this in other SO questions (here, here and here).
I get the following error:
Error in model.frame.default(formula = abs_pb_t ~ RI1 + RA1 + lnRnD + :
variable lengths differ (found for 'poly(yy, d)')
In my update formula, the variable yy is a count variable whose length exactly matches my data frame (592 observations):
yy<-rep(seq(1:16),times=37) ; poly(yy,1) ; poly(yy,5)
According to the help file for poly, missing values are not allowed, so I do not understand why this variable would suddenly generate missing values when used in the polynomial. I checked, and the polynomial does not create NA values, so something else must explain this error.
Any ideas?
Thanks in advance
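One common cause worth checking (a guess, illustrated with toy data): if yy lives only in the global environment, its length stays 592 while the model frame built by glm() can end up with fewer rows (for example after NA handling), which triggers exactly this "variable lengths differ" error. Keeping yy as a column of p.data keeps the lengths in sync.

```r
set.seed(1)
# Toy stand-in for p.data (592 = 16 x 37 in the question; 32 = 16 x 2 here).
p.data <- data.frame(abs_pb_t = rpois(32, 2), RI1 = rnorm(32))
p.data$yy <- rep(seq_len(16), times = 2)   # yy travels with the data frame
fit <- glm(abs_pb_t ~ RI1 + poly(yy, 2), data = p.data,
           family = "quasipoisson")
```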

RandomForest error code

I am trying to run a rather simple randomForest, but I keep getting an error that does not make any sense to me. See the code below.
test.data<-data.frame(read.csv("test.RF.data.csv",header=T))
attach(test.data)
head(test.data)
Depth<-Data1
STemp<-Data2
FPT<-Sr_hr_15
Stage<-stage_feet
Q<-discharge_m3s
V<-vel_ms
Turbidity<-turb_ntu
Day_Night<-day_night
FPT.rf <- randomForest(FPT ~ Depth + STemp + Q + V + Stage + Turbidity + Day_Night, data = test.data, mtry=1, importance=TRUE, na.action=na.omit)
Error in randomForest.default(m, y, ...) : data (x) has 0 rows
In addition: Warning message:
In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I then checked the dimensions to make sure R does in fact see the data:
dim(test.data)
[1] 77 15
This is a subset of the complete data set I ran just to test if I could get it to run since I got the same error with the complete data set.
Why is it telling me data (x) has 0 rows when clearly there are rows?
Thanks
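One thing worth checking (a hypothetical diagnostic, since the CSV isn't shown): na.action = na.omit drops every row that has an NA in any model variable, so a single badly-read column (for instance, one that came in entirely empty) can leave 0 rows even though dim() reports 77.

```r
# Toy frame where one column is entirely NA, e.g. a misread CSV column.
toy <- data.frame(FPT   = 1:5,
                  Depth = c(1, NA, 3, 4, 5),
                  STemp = NA_real_)
nrow(na.omit(toy))                       # every row has an NA in STemp
sapply(toy, function(x) sum(is.na(x)))   # count NAs per column to find it
```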
