How to fix "variable length differ" error in cv.zipath? - r

I am trying to run a cross-validation of a zero-inflated Poisson model using cv.zipath from the mpath package.
Fitting the LASSO
fit.lasso = zipath(estimation_sample_nomiss ~ . | .,
                   data = missings,
                   nlambda = 100,
                   family = "poisson",
                   link = "logit")
Cross validation
n <- dim(docvisits)[1]
K <- 10
set.seed(197)
foldid <- split(sample(1:n), rep(1:K, length = n))
fitcv <- cv.zipath(F_time_unemployed ~ . | .,
                   data = estimation_sample_nomiss, family = "poisson",
                   nlambda = 100, lambda.count = fit.lasso$lambda.count[1:30],
                   lambda.zero = fit.lasso$lambda.zero[1:30], maxit.em = 300,
                   maxit.theta = 1, theta.fixed = FALSE, penalty = "enet",
                   rescale = FALSE, foldid = foldid)
I encounter the following error:
Error in model.frame.default(formula = F_time_unemployed ~ . + ., data = list(: variable lengths differ (found for '(weights)')
I have cleaned the sample of all NAs but still encounter the error.

The solution turns out to be that cv.zipath() does not accept tibbles - at least in this instance (no guarantee as to how far this generalises). Having prepared the data with dplyr commands, one needs to convert it back to a plain data frame, so the fix is as simple as a call to as.data.frame().
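A minimal sketch of the fix, reusing the object names from the question and trimming the cv.zipath() arguments for brevity (the assumption being that estimation_sample_nomiss became a tibble through the dplyr pipeline):
# convert the tibble back to a plain data frame before cross-validating
estimation_sample_nomiss <- as.data.frame(estimation_sample_nomiss)
fitcv <- cv.zipath(F_time_unemployed ~ . | .,
                   data = estimation_sample_nomiss,
                   family = "poisson",
                   nlambda = 100,
                   foldid = foldid)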

Related

nlme error with lme function: function evaluation limit reached without convergence

I am fitting a linear mixed-effects model with random intercepts and slopes using the lme function from the nlme package in R.
library(readr)
library(nlme)
#Data
post = read_csv("C:/data.csv")
post$regions <- as.numeric(as.factor(post$regions))
post.sc = data.frame(scale(post))
post.sc = na.omit(post.sc)
#Regressions
lm.reg <- lm(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc)
lmm.null <- lme(V1 ~ 1, data = post.sc, random = ~1 | regions, method = 'ML')
lmm.reg <- lme(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc, random = ~1 | regions, method = 'ML')
lmm.reg.slope <- lme(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc,
                     random = ~ 1 + V2+V3+V4+V5+V6+V7+V8+V9+V10 | regions,
                     method = 'ML',
                     control = lmeControl(maxIter = 10000, msMaxIter = 10000))
I get the following error message:
Error in lme.formula(V1 ~ V2+V3+V4+V5+V6+V7+ :
nlminb problem, convergence error code = 1
message = function evaluation limit reached without convergence (9)
I have not found existing questions describing a similar problem.
Is there a way to fix this?
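The message comes from nlminb running out of function evaluations, so one commonly tried adjustment (a sketch under that assumption, not a guaranteed fix) is to raise the evaluation limit with msMaxEval, or to switch the optimizer with opt, both of which are lmeControl arguments; the model call simply reuses the variables above:
ctrl <- lmeControl(maxIter = 10000, msMaxIter = 10000,
                   msMaxEval = 10000,  # the limit nlminb reports as exhausted
                   opt = "optim")      # alternatively keep the default "nlminb"
lmm.reg.slope <- lme(V1 ~ V2+V3+V4+V5+V6+V7+V8+V9+V10, data = post.sc,
                     random = ~ 1 + V2+V3+V4+V5+V6+V7+V8+V9+V10 | regions,
                     method = 'ML', control = ctrl)
If the fit still does not converge, the random-effects structure (nine correlated random slopes) may simply be too rich for the data and usually has to be simplified.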

I cannot get the gam.check function to work

I used the model below to predict marine survival in both SAS and R. Here is the R code:
library(mgcv)
gam_mod <- gam(recruits/smolt ~ s(max, k = 6) + s(medB, k = 10),
               weights = smolt, data = GAMdata, method = "REML",
               family = binomial("logit"))
However when I use gam.check(gam_mod) I get the error message below:
Error in dm[, i] <- sort(residuals(object, type = type)) :
number of items to replace is not a multiple of replacement length

Cannot Produce a Variable Importance Plot When Using Argument 'importance = TRUE'

When I try to run the following code:
reg <- randomForest(max_orders ~ ., data = df[-c(1:3)], ntree = 100, importance = T)
varImpPlot(reg, sort = T)
I get the error:
Error in plot.window(xlim = xlim, ylim = ylim, log = "") :
need finite 'xlim' values
But if I run:
reg <- randomForest(max_orders ~ ., data = df[-c(1:3)], ntree = 100)
varImpPlot(reg, sort = T)
Everything's fine and dandy!
I'm legitimately about to lose my sanity. I've made MSE variable importance plots countless times; I don't know what the issue is here. Here's my original regression data (df[-c(1:3)]):
EDIT: R has officially gone full-blown schizophrenic on me:
> # Test Variables
> reg <- randomForest(max_orders ~ release_age, data = df[-c(1:3)], ntree =
100, importance = T)
> varImpPlot(reg, sort = T)
> # Test Variables
> reg <- randomForest(max_orders ~ release_age, data = df[-c(1:3)], ntree =
100, importance = T)
> varImpPlot(reg, sort = T)
Error in plot.window(xlim = xlim, ylim = ylim, log = "") :
need finite 'xlim' values
HOW DOES R RUN THE EXACT SAME CODE WORD FOR WORD AND HAVE AN ERROR ONE TIME AND NOT ANOTHER?! Well, I guess I fixed the problem by just rerunning the code until a plot finally showed up, but I still want to know the reason behind this enigma.
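One way to investigate (a sketch, not a definitive diagnosis): varImpPlot takes its axis limits from the fitted object's importance values, so a NaN or Inf appearing in the permutation importance on some runs would explain an intermittent "need finite 'xlim' values" error. The values can be inspected before plotting, using the reg object from the question:
imp <- importance(reg)   # %IncMSE and IncNodePurity for each predictor
print(imp)
any(!is.finite(imp))     # TRUE points to a NaN/Inf importance value
if (all(is.finite(imp))) varImpPlot(reg, sort = TRUE)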

Random Forest using R

I'm working on building a predictive model for breast cancer data using R. After performing gcrma normalization, I generated the potential predictor variables. Now when I run the RF algorithm I encounter the following error:
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.
Full code:
library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)
setwd("E:/kavya's project_work/final")
datafile<-"trainset_gcrma.txt"
clindatafile<-read.csv("mod clinical_details.csv")
outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"
data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=clindatafile
clincaldata_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clincaldata_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]
#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names
target= clindata[,"relapse"]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)
tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
error:"Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories."
As I'm new to machine learning, I'm unable to proceed; kindly do the needful.
Thanks in advance.
It is hard to say without knowing the data. Run class or summary on all your predictor variables to ensure that they are not accidentally interpreted as characters or factors. If you really do have more than 53 levels, you will have to convert them to binary variables. Example:
mtcars$automatic <- mtcars$am == 0
mtcars$manual <- mtcars$am == 1
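A quick way to run that check over every predictor, and to expand a many-level factor into indicator columns (a sketch: pred_df and some_factor are placeholder names standing in for the question's predictor data and the offending column):
sapply(pred_df, class)                       # how each predictor is stored
sapply(Filter(is.factor, pred_df), nlevels)  # how many levels each factor has
# expand one many-level factor into binary indicator columns
dummies <- model.matrix(~ some_factor - 1, data = pred_df)
pred_df <- cbind(pred_df[setdiff(names(pred_df), "some_factor")], dummies)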

random forest variable lengths differ

I am trying to run RF using a feature as the response variable, and I am having trouble passing the feature's name as a string to be used as the response. First I try running randomForest with the name produced by colnames() as the response and get a "variable lengths differ" error. After that, I try just typing the actual feature name as the response and it works fine. Can you shed some light on why the variable lengths are said to differ? Thanks.
> colnames(Data[1])
[1] "feature1"
> rf.file = randomForest(formula =colnames(Data[1])~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
Error in model.frame.default(formula = colnames(Data[1]) ~ ., :
variable lengths differ (found for 'feature1')
Enter a frame number, or 0 to exit
1: randomForest(formula = colnames(Data[1]) ~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
2: randomForest.formula(formula = colnames(Data[1]) ~ ., data = brainDataTrim, proximity = T, importance = T, ntree = 500, nodesize = 3)
3: eval(m, parent.frame())
4: eval(expr, envir, enclos)
5: model.frame(formula = colnames(Data[1]) ~ ., data = Data, na.action = function (object, ...)
6: model.frame.default(formula = colnames(Data[1]) ~ ., data = Data, na.action = function (object, ...)
Selection: 0
> rf.file = randomForest(formula =feature1~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
> rf.file
Call:
randomForest(formula = feature1 ~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 0.1536834
% Var explained: 34.21
>
You are simply misunderstanding how formulas work. Basically, your first attempt isn't supposed to work.
Formulas should consist of names of variables, possibly wrapped in simple functions, e.g.
var1 ~ var2
var1 ~ log(var2)
Note the lack of quotes. If you didn't quote it, it's not a string, it's a symbol.
So avoid raw strings and weird evaluation demands (like Data[1], or any use of $) in your formulas. To construct a formula from strings, paste the pieces together and then call as.formula on the resulting string, as sketched below.
Keep in mind that the whole point of a formula is that you have provided a symbolic representation of the model, and R will then go look for the specific columns you named in the data frame provided.
I think some functions will coerce a string representation of a formula (e.g. "var1 ~ var2") for you, but I wouldn't count on it or expect it.
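A minimal sketch of that paste()/as.formula() approach, reusing the Data object and the randomForest call from the question:
resp <- colnames(Data)[1]               # "feature1"
fml  <- as.formula(paste(resp, "~ ."))  # feature1 ~ .
rf.file <- randomForest(formula = fml, data = Data, proximity = TRUE,
                        importance = TRUE, ntree = 500, nodesize = 3)
Base R's reformulate(".", response = resp) builds the same formula without manual pasting.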
