random forest variable lengths differ - r

I am trying to run RF using a feature as the response variable, and I am having trouble passing a string held in a variable as the response. First I run RF with the string passed through a variable as the response, and I get a "variable lengths differ" error. Then I type the actual feature name directly as the response and it works fine. Can you shed some light on why the variable lengths differ? Thanks.
> colnames(Data[1])
[1] "feature1"
> rf.file = randomForest(formula =colnames(Data[1])~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
Error in model.frame.default(formula = colnames(Data[1]) ~ ., :
variable lengths differ (found for 'feature1')
Enter a frame number, or 0 to exit
1: randomForest(formula = colnames(Data[1]) ~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
2: randomForest.formula(formula = colnames(Data[1]) ~ ., data = brainDataTrim, proximity = T, importance = T, ntree = 500, nodesize = 3)
3: eval(m, parent.frame())
4: eval(expr, envir, enclos)
5: model.frame(formula = colnames(Data[1]) ~ ., data = Data, na.action = function (object, ...)
6: model.frame.default(formula = colnames(Data[1]) ~ ., data = Data, na.action = function (object, ...)
Selection: 0
> rf.file = randomForest(formula =feature1~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
> rf.file
Call:
randomForest(formula = feature1 ~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 0.1536834
% Var explained: 34.21
>

You are simply misunderstanding how formulas work. Basically, your first attempt isn't supposed to work.
Formulas should consist of names of variables, possibly simple functions of them. e.g.
var1 ~ var2
var1 ~ log(var2)
Note the lack of quotes. If you didn't quote it, it's not a string, it's a symbol.
So avoid raw strings and unusual evaluation demands (like Data[1], or any use of $) in your formulas. To construct a formula from strings, paste the pieces together and then call as.formula on the resulting string.
Keep in mind that the whole point of a formula is that you have provided a symbolic representation of the model, and R will then go look for the specific columns you named in the data frame provided.
I think some functions will coerce a string representation of a formula for you (e.g. "var1 ~ var2"), but I wouldn't count on it or expect it.
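As a rough illustration of the advice above, here is a minimal sketch using only base R. The data frame is a stand-in for the asker's Data (built from mtcars, which is an assumption for the example), and lm is used in place of randomForest so the snippet runs without extra packages. In the failing call, colnames(Data[1]) is evaluated as an expression and yields a length-1 character vector, while every predictor has one value per row, which is where the length mismatch comes from.

```r
# Stand-in for the asker's Data (assumption: any data frame would do)
Data <- mtcars[, c("mpg", "wt", "hp")]

response <- colnames(Data)[1]   # a length-1 character vector holding the name

# The expression on the left-hand side is evaluated as-is, giving a vector
# of length 1, while the predictors have nrow(Data) values each.
err <- tryCatch(
  model.frame(colnames(Data[1]) ~ ., data = Data),
  error = function(e) conditionMessage(e)
)
print(err)

# Build the formula from the string instead: two equivalent base R routes
f1 <- as.formula(paste(response, "~ ."))
f2 <- reformulate(".", response = response)

# lm stands in for randomForest here; the formula is passed the same way
fit <- lm(f1, data = Data)
print(f1)
```

The same formula object could then be handed to randomForest(f1, data = Data, ...) in place of the hand-typed feature1 ~ . formula.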

Related

Error in glmnet if I specify a variable to be a factor

I have a dataset in R on which I would like to run glmnet. The y variable is an originally numeric variable which, however, takes only the values 0 and 1. If I specify the latter to be a factor variable as follows
df_ML_1976[,names] <- lapply(df_ML_1976[,names] , factor)
and then apply glmnet after dividing into training and test set:
library("dplyr")
df_ML_1976 %>%
select(where(~ any(. != 0)))
#df_ML_1976 <- subset(df_ML_1976, select = -c(X))
library("caret")
default_idx = createDataPartition(df_ML_1976$y_tr4, p = 0.75, list = FALSE)
default_trn = df_ML_1976[default_idx, ]
default_tst = df_ML_1976[-default_idx, ]
## Fitting elasticnet:
cv_5 = trainControl(method = "cv", number = 5)
def_elnet = train(
y_tr4 ~ ., data = default_trn,
method = "glmnet",
trControl = cv_5
)
def_elnet
an error occurs:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'drop': non-conformable arguments
which does not appear if I do not specify
df_ML_1976[,names] <- lapply(df_ML_1976[,names] , factor)
Why is that?
Thank you

I cannot get the gam.check function to work

I used the model below to predict marine survival in both SAS and R. The R code is below.
gam_mod<-gam(recruits/smolt ~ s(max, k = 6) + s(medB, k = 10),
weights = smolt, data = GAMdata, method = "REML", family = binomial("logit"))
However when I use gam.check(gam_mod) I get the error message below:
Error in dm[, i] <- sort(residuals(object, type = type)) :
number of items to replace is not a multiple of replacement length

Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, : variable lengths differ (found for 'Sepal.Length')

I've tried to look at similar questions but can't figure out my problem.
I was already able to complete my analysis with random forest (using caret), tuning parameters separately. Now I'm trying to create a function that will perform my analysis all at once.
I created a function with two inputs, the dataset, and variable to be classified.
For now I'm using the iris dataset for simplicity.
RF <- function(data, classvariable) {
# Best mtry
trControl <- trainControl(method = "cv", number = 10,
search = "grid")
set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 3))
RF_mtry <- train(classvariable ~.,
data = dataset,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
ntree = 100)
print(RF_mtry)
mtry = 0
for (i in 1:nrow(RF_mtry$results)) {
if (RF_mtry$results[i,2] > mtry) mtry <-
RF_mtry$results[i,2]
}
trial_mtry <- c(1:3)
best_mtry <- trial_mtry[i]
best_mtry
}
Once I run the function
RF(data = iris, classvariable = Species)
I get the error
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
I tried running the code without putting it in a function, writing iris directly instead of dataset and Species instead of classvariable, and it works.
Previously I was getting the error
Error in model.frame.default(form = classvariable ~ ., data = trainingDataset, :
variable lengths differ (found for 'Sepal.Length')
Does anybody have an idea why it does not work?
Thank you very much.
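One possible way to approach this, sketched under assumptions rather than as a confirmed fix: inside a function, classvariable ~ . refers to a column literally named classvariable, not to whatever was passed in. A hypothetical helper (build_formula is not part of caret or the original post) can take the column name as a string and build the formula from it, using only base R.

```r
# Hypothetical helper: takes the response column name as a string
# and returns the formula "<classvariable> ~ ."
build_formula <- function(data, classvariable) {
  stopifnot(is.character(classvariable),
            length(classvariable) == 1,
            classvariable %in% names(data))
  reformulate(".", response = classvariable)
}

f <- build_formula(iris, "Species")
print(f)

# model.frame now finds the response and all predictors in the data
mf <- model.frame(f, data = iris)
str(mf)

# Inside RF() the formula would then be passed on, for example:
#   train(f, data = data, method = "rf", ...)
# (left as a comment so the sketch runs without caret installed)
```

With this shape the function would be called with a quoted name, e.g. RF(data = iris, classvariable = "Species"), and its body would need to use the data argument consistently.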

How to fix "variable length differ" error in cv.zipath?

I am trying to run a cross-validation of a zero-inflated Poisson model using cv.zipath from the mpath package.
Fitting the LASSO
fit.lasso = zipath(estimation_sample_nomiss ~ .| .,
data = missings,
nlambda = 100,
family = "poisson",
link = "logit")
Cross validation
n <- dim(docvisits)[1]
K <- 10
set.seed(197)
foldid <- split(sample(1:n), rep(1:K, length = n))
fitcv <- cv.zipath(F_time_unemployed~ . | .,
data = estimation_sample_nomiss, family = "poisson",
nlambda = 100, lambda.count = fit.lasso$lambda.count[1:30],
lambda.zero = fit.lasso$lambda.zero[1:30], maxit.em = 300,
maxit.theta = 1, theta.fixed = FALSE, penalty = "enet",
rescale = FALSE, foldid = foldid)
I encounter the following error:
Error in model.frame.default(formula = F_time_unemployed ~ . + ., data = list(: variable lengths differ (found for '(weights)')
I have cleaned the sample of all NAs but still encounter the error message.
The solution turns out to be that cv.zipath() does not accept tibbles, at least in this instance (no guarantee as to how far this statement can be generalised). After using dplyr commands, one needs to convert back to a plain data frame. Thus, the solution is as simple as calling as.data.frame() on the data.
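The conversion can be sketched in base R as follows. To keep the snippet free of package dependencies, the tibble is only mimicked by attaching the class vector that tibbles carry; the toy data and object names are assumptions for illustration, not the asker's data.

```r
# Toy data (hypothetical values, for illustration only)
df <- data.frame(y = c(0, 1, 3), x = c(1.2, 0.4, 2.2))

# Mimic a tibble's class vector without requiring the tibble package
tb <- structure(df, class = c("tbl_df", "tbl", "data.frame"))
print(class(tb))

# Convert back to a plain data frame before calling the modelling function
plain <- as.data.frame(tb)
print(class(plain))
```

The converted object would then be what is passed as the data argument to cv.zipath().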

Cannot use a Variable Importance Plot when Using Argument 'Importance = TRUE'

When I try to run the following code:
reg <- randomForest(max_orders ~ ., data = df[-c(1:3)], ntree = 100, importance = T)
varImpPlot(reg, sort = T)
I get the error:
Error in plot.window(xlim = xlim, ylim = ylim, log = "") :
need finite 'xlim' values
But if I run:
reg <- randomForest(max_orders ~ ., data = df[-c(1:3)], ntree = 100)
varImpPlot(reg, sort = T)
Everything's fine and dandy!
I'm legitimately about to lose my sanity. I've made the MSE variable importance plots countless times, and I don't know what the issue is here. Here's my original regression data (df[-c(1:3)]):
EDIT: R has officially gone full-blown schizophrenic on me:
> # Test Variables
> reg <- randomForest(max_orders ~ release_age, data = df[-c(1:3)], ntree = 100, importance = T)
> varImpPlot(reg, sort = T)
> # Test Variables
> reg <- randomForest(max_orders ~ release_age, data = df[-c(1:3)], ntree = 100, importance = T)
> varImpPlot(reg, sort = T)
Error in plot.window(xlim = xlim, ylim = ylim, log = "") :
need finite 'xlim' values
HOW DOES R RUN THE EXACT SAME CODE WORD FOR WORD AND HAVE AN ERROR ONE TIME AND NOT ANOTHER?! Well, I guess I fixed the problem: I just kept rerunning the code until a plot finally showed up. I still want to know the reason behind this enigma.
