Error coming while using Random Forest using R - r

I am using a dataset in which the column mvar_1 contains the name of one of 5 parties that a citizen voted for last year. The other variables are demographic variables, such as the number of rallies attended for each party, and other things.
When I use the following code:
data.model.rf = randomForest(mvar_1 ~ mvar_2 + mvar_3 + mvar_4 + mvar_5 +
mvar_6 + mvar_7 + mvar_8 + mvar_9 + mvar_10 +
mvar_11 + mvar_15 + mvar_17 + mvar_18 + mvar_21 +
mvar_22 + mvar_23 + mvar_24 + mvar_25 + mvar_26 +
mvar_28, data=data.train, ntree=20000, mtry=15,
importance=TRUE, na.action = na.omit )
This error message appears:
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.

One of your mvar is a factor with more than 53 levels.
You may have a categorical variable with lots of levels, like demographic group, and you should aggregate it into fewer levels to use this package. (See here for the best way of doing it)
More likely, you have a non-categorical variable incorrectly typed as a factor. In this case you should fix it by typing your variable correctly. E.g. to get a numeric from a factor, you call as.numeric(as.character(myfactor)).
If you don't know what a factor is, the second option is probably it. You should do a summary of data.train; this will help you see which mvar are incorrectly typed. If the mvar is typed as numeric, you will see min, max, mean, median, etc. If a numeric variable is incorrectly typed as a factor, you will not see those statistics; instead you will see the number of occurrences of each level.
In any case, calling summary will help you because it shows the number of levels for each factor. The variables with >53 levels are causing the issue.
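To make the check above concrete, here is a minimal base R sketch. The data frame toy and its columns are made up for illustration (they stand in for data.train), so adapt the names to your own data.

```r
# Toy stand-in for data.train; the column names are hypothetical
set.seed(1)
toy <- data.frame(
  party  = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  income = factor(round(runif(100, 1000, 5000)))  # numeric mis-typed as factor
)

# Number of levels per factor column; non-factor columns give NA
n_levels <- sapply(toy, function(col) {
  if (is.factor(col)) nlevels(col) else NA_integer_
})

# Names of the factor columns that exceed the 53-level limit
too_many <- names(n_levels)[!is.na(n_levels) & n_levels > 53]
print(too_many)

# Convert via character so the labels are recovered;
# as.numeric() on the factor directly would return the internal level codes
toy$income <- as.numeric(as.character(toy$income))
```

Running the same sapply() call on data.train should point at the offending mvar columns.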

I had the same problem, but solved it after seeing that I had imported the data frame with comma separators without indicating it.
After importing the table using read.table(data, dec=",") the problem was solved!
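A small sketch of what probably happened here, using inline text instead of a file. The column names are invented, and stringsAsFactors = TRUE is set explicitly to mimic the default of older R versions (recent versions would give a character column instead of a factor).

```r
# Semicolon-separated text with decimal commas (made-up example data)
txt <- "id;income\n1;10,5\n2;20,25\n3;7,0"

# Without dec = ",", the income column cannot be parsed as numeric
bad <- read.table(text = txt, header = TRUE, sep = ";",
                  stringsAsFactors = TRUE)

# With dec = ",", the same column is read as numeric
good <- read.table(text = txt, header = TRUE, sep = ";", dec = ",")

str(bad$income)   # a factor with one level per distinct string
str(good$income)  # a numeric vector
```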

I encountered the same problem - error message suggesting >53 factor levels, but none of my variables were like that.
Upon further inspection, I found I had some factor variables with empty levels.
I used the forcats function fct_drop to remove these, then everything worked!
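For anyone who would rather not add a dependency, base R's droplevels() does a similar job to fct_drop. A minimal sketch with a made-up factor:

```r
# A factor that keeps an unused level after subsetting
f <- factor(c("a", "b", "c", "a"))
sub <- f[f != "c"]
nlevels(sub)          # still counts the empty level "c"

# droplevels() removes levels that no longer occur
sub_clean <- droplevels(sub)
nlevels(sub_clean)

# droplevels() also accepts a whole data frame and cleans every factor column
df <- data.frame(g = sub, x = 1:3)
df <- droplevels(df)
```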

As antoine-sac pointed out, in my case this error was caused by numeric variables appearing as factors. The difference is that R made the conversion itself while importing my (numeric) file.
Casting the factors to numerics didn't work. What did work was using strip.white = TRUE when importing the dataset. (I found this solution here.)

This error occurs when you train your model on the entire dataset rather than on the training data. Try fitting the model on the training data and use the test data to perform prediction.

Related

Puzzling error in svyby of survey package

I am using "svyby" function from survey R package, and get an error I don't know how to deal with.
At first, I used variable cntry as a grouping, next, I used essround as grouping, and it all worked smoothly. But when I use their combination ~cntry+essround it returns an error.
I am puzzled how it can work separately for each grouping but doesn't work for combined grouping.
This is somehow related to omitted data, as when I drop all the empty cells (i.e. using na.omit(dat) instead of dat for defining survey design) it starts working. But I don't want to drop all the missings. I thought na.rm argument of svymean should deal with it. Note that variables cntry and essround do not contain any missing values.
library("survey")
s.w <- svydesign(ids = ~1, data = dat, weights = dat[,weight])
svyby(~ Security, by=~ essround, s.w, svymean, na.rm=T) # Works
svyby(~ Security, by=~ cntry, s.w, svymean, na.rm=T) # Also works
svyby(~ Security, by=~ essround+cntry, s.w, svymean, na.rm=T) # Gives an error
Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
arguments must have same length
So my question is - how to make it work?
UPDATE.
Sorry, I misread the documentation. The problem is solved by adding na.rm.all = TRUE to the svyby function.
Forgive me for the late answer, but I was just looking for a solution to a similar problem and solved it for myself just now. Check whether you have any empty cells in your cross-tabulation of essround, cntry, and Security (using table()). If you do, try transforming the grouping variables into ordered factors with ordered(), explicitly naming your levels with the levels argument of the function, before you run svyby(). Ordered factors will show a frequency of 0 in a cross-tabulation, while regular factors will drop empty cells.
I don't know exactly why, but here's how I resolved the same issue. It seems to have something to do with the way svyby deals with NA data, even if you specify na.rm=T. I made subsets of my data frame and found that the error occurs if the subset is smaller than a certain threshold (it was 500 in my case, but the exact value is to be determined) AND contains NA. It works well for other subsets, such as those bigger than 10,000 with NA or smaller than 500 without NA. In your case, there should be a subset with essround==x & cntry==y which is small and where Security is NA. So, clean the data so that it has no NA BEFORE you run svyby (by removal, estimation, or separate grouping, it's up to you) and then try once again. It worked for me.
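A quick way to look for the problem cells described above, sketched in base R on a made-up data frame. The variable names follow the question, the values are invented, and this only diagnoses the data; it does not call svyby.

```r
# Toy data: Security is entirely missing for cntry "BE" in essround 1
dat_toy <- data.frame(
  cntry    = c("AT", "AT", "BE", "BE", "AT", "BE"),
  essround = c(1, 1, 1, 1, 2, 2),
  Security = c(3, 4, NA, NA, 2, 5)
)

# Number of non-missing Security values in each cntry x essround cell.
# A 0 means the cell exists but has only NA; an NA in this table would
# mean the combination does not occur in the data at all.
n_obs <- with(dat_toy,
              tapply(!is.na(Security), list(cntry, essround), sum))
print(n_obs)

# Row/column positions of the cells with no usable observations
empty_cells <- which(n_obs == 0, arr.ind = TRUE)
print(empty_cells)
```

Cells that show up here are the ones to handle before svyby, or the ones that na.rm.all = TRUE is meant to deal with.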

R mlogit() function: Error in if (abs(x - oldx) < ftol) { : missing value where TRUE/FALSE needed

I am having trouble with mlogit() function. I am trying to predict which variables in a given set are the most preferred amongst people who took our survey. I am trying to predict the optimal combination of variables to create the most preferred option. Basically, we are measuring "Name", "Logo Size", "Design", "Theme","Flavor", and "Color".
To do this, we have a large data set and are trying to run it through mlogit.data() and mlogit(), although we keep getting the same error:
Error in if (abs(x - oldx) < ftol) { :
missing value where TRUE/FALSE needed
None of my data is negative or missing, so this is very confusing. My syntax is:
#Process data in mlogit.data()
data2 <-
mlogit.data(data=data, choice="Choice",
shape="long", varying=5:10,
alt.levels=paste("pos",1:3))
#Make character columns factors and "choice" column (the one we are
#measuring) a numeric.
data2$Name <- as.factor(data2$Name)
data2$Logo.Size <- as.factor(data2$Logo.Size)
data2$Design <- as.factor(data2$Design)
data2$Theme <- as.factor(data2$Theme)
data2$Color <- as.factor(data2$Color)
data2$Choice <- as.numeric(as.character(data2$Choice))
##### RUN MODEL #####
m1 <- mlogit(Choice ~ 0 + Name + Logo.Size + Design + Theme + Flavor
+ Color, data = data2)
m1
Does it look like there is a problem with my syntax, or is it likely my data that is the problem?
In a panel setting, it is potentially the case that one or more of your choice cards does not have a TRUE value. One fix would be to drop choice cards that are missing a choice.
## Use data.table
library(data.table)
## Convert the data frame; avoid naming the object "data.table" itself
dt <- as.data.table(data)
## Drop choice cards that received no choice
dt[, full := sum(Choice), by = Choice_id]
dt.full <- dt[full != 0]
This is an issue specific to mlogit(). For example, Stata's mixed logit approach ignores missing response variables, whereas R treats them as an issue that needs to be addressed.
I had the same error. It got resolved when I arranged the data by unique ID and alternative ID. For some reason, mlogit requires all the choice instances to be stacked together.
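The reordering can be done with order() in base R. This is only a sketch on invented data; the column names id and alt are assumptions and should be replaced with your own respondent/choice-situation ID and alternative ID.

```r
# Long-format toy data with rows in scrambled order
long <- data.frame(
  id     = c(2, 1, 2, 1, 1, 2),
  alt    = c("pos 1", "pos 2", "pos 3", "pos 1", "pos 3", "pos 2"),
  Choice = c(0, 1, 1, 0, 0, 0)
)

# Sort by ID first, then by alternative, so that the rows of each
# choice instance sit next to each other
long_sorted <- long[order(long$id, long$alt), ]
rownames(long_sorted) <- NULL
print(long_sorted)
```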
Error in if (abs(x - oldx) < ftol) { : missing value where TRUE/FALSE needed
This suggests that if your response variable is binary (i.e. 1/0), then one or more of the values is something other than 1/0.
Look at table(data2$Choice) to see if this is the case.
I had a similar issue, but eventually figured it out. In my case, it was due to missing values in the covariates, not in the choice response.
I had this problem when my data included choice situations (questions that participants were asked) in which none of the choices was selected. Removing those rows fixed the problem.
Just in case others have the same issue: I got this error when I ran my choice model (a maximum difference scaling) with partially missing choices, e.g. when two choices per task/set had to be made by the respondent, but only one choice was made.
I could solve this issue in the long-format data set by dropping the observations that belonged to the missing choice while keeping the observations where a valid choice was made.
E.g. assume I have a survey with 9 tasks/sets and in each task/set 5 alternatives are provided. In each task my respondents had to make two choices, i.e. selecting one of the 5 alternatives as "most important" and one of the alternatives as "least important". This results in a data set that has 5*9*2 = 90 rows per respondent. There are exactly 5 rows per task*choice combination (e.g. 5 rows for task 1 containing the alternatives, where exactly one of these 5 rows is coded as 1 in the response variable in case it was chosen as the most (or least) important alternative).
Now imagine a respondent only provides a choice for "most important", but not for "least important". In such a case the 5 rows for "least important" would all have a 0 in the response variable. Excluding these 5 rows from the data solves the above error and, by the way, leads to the exact same results as other tools would provide (e.g. Sawtooth's Lighthouse software).
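The row-dropping described above can be sketched in base R with ave(). The data frame and the column name task are made up for illustration; in a real max-diff data set the grouping would be the respondent, task and choice type together.

```r
# Three choice blocks with 3 alternatives each; block 2 has no choice made
cards <- data.frame(
  task   = rep(1:3, each = 3),
  Choice = c(0, 1, 0,
             0, 0, 0,
             1, 0, 0)
)

# Number of chosen alternatives in each block, repeated on every row
n_chosen <- ave(cards$Choice, cards$task, FUN = sum)

# Blocks with a count other than 1 are the suspicious ones
print(unique(cards$task[n_chosen != 1]))

# Keep only the blocks where exactly one alternative was chosen
cards_ok <- cards[n_chosen == 1, ]
```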
Re (1)
"data2 <- mlogit.data(data=data, choice="Choice",
shape="long", varying=5:10,
**alt.levels=paste("pos",1:3))**"
and (2)
"m1 <- mlogit(**Choice** ~ 0 + Name + Logo.Size + Design + Theme + Flavor + Color, data = data2)"
In addition to making sure all of the data is filled in, I would just highlight that: (1) The level names need to exactly match the part of the variable name after the separator. And, (2) The DV in the model needs to be the variable name appearing before the separator.
Example: original variable "Media" with 5 categories -> 5 dummy variables "Med_Radio", "Med_TV", etc: The level names need to be "Radio", "TV", etc., exactly as written. And you must put "Med" into the model, not "Media", as DV.
This fixed the problem for me.

CCA Analysis: Error in weighted.mean.default(newX[, i], ...) : 'x' and 'w' must have the same length

I'm very new to R and this might be a very silly question to ask but I'm quite stuck right now.
I'm currently trying to do a Canonical Correspondence Analysis on my data to see which environmental factors have more weight on community distribution. I'm using the vegan package. My data consists of a table for the environmental factors (dataset EFamoA) and another for an abundance matrix (dataset AmoA). I have 41 soils, with 39 environmental factors and 334 species.
After cleaning my data of any variables which are not numerical, I try to perform the cca analysis using the formula notation:
CCA.amoA <- cca (AmoA ~ EFamoA$PH + EFamoA$LOI, data = EFamoA,
scale = TRUE, na.action = na.omit)
But then I get this error:
Error in weighted.mean.default(newX[, i], ...) :
'x' and 'w' must have the same length
I don't really know where to go from here and haven't found much regarding this problem anywhere (which leads me to think that it must be some sort of very basic mistake I'm making). My environmental factor data is not standardized, as I read in the cca help file that the algorithm does it, but maybe I should standardize it before? (I've also read that scale = TRUE is only for species.) Should I convert the data into matrices?
I hope I made my point clear enough as I've been struggling with this for a while now.
Edit: My environmental data has NA values
Alright, so I was able to figure it out by myself, and it was indeed a silly thing: my abundance data had soils as columns and species as rows, while the environmental factor (EF) data had soils as rows and EFs as columns.
Using t(), I transposed my data.frame (which also converted it into a matrix), and cca() worked (because the "length" was now the same, I assume). Transposing the data separately and loading it already transposed works too.
The t() approach saves creating a whole new file (if your data is organized with different rows, as in my case), but it converts the data into a matrix, which might not be desired in some cases. Either way, this turned out to be a very simple and obvious thing to solve (it took me a while, though).
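For reference, a minimal base R sketch of the transposition, with tiny invented tables standing in for AmoA and EFamoA. Wrapping t() in as.data.frame() is one way to keep a data frame instead of a matrix.

```r
# Abundance table with species as rows and soils as columns (wrong way round)
abund <- as.data.frame(matrix(
  1:12, nrow = 4,
  dimnames = list(paste0("sp", 1:4), paste0("soil", 1:3))
))

# Environmental table with soils as rows
env <- data.frame(
  PH  = c(5.1, 6.3, 7.0),
  LOI = c(2.1, 3.4, 1.9),
  row.names = paste0("soil", 1:3)
)

# t() returns a matrix; as.data.frame() turns it back into a data frame
abund_t <- as.data.frame(t(abund))

# Both tables now have one row per soil, in the same order
dim(abund_t)
all(rownames(abund_t) == rownames(env))
```

Checking that the row counts (and ideally the row names) of the two tables agree before calling cca() would have caught the problem early.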

What is an alternative to leaps() that can handle NAs?

I need to apply the branch-and-bound method to choose the best model. leaps() from the leaps package works well only if the data has no NA values; otherwise it throws an error:
#dummy data
x<-matrix(rnorm(100),ncol=4)
#convert to 0,1,2 - this is a genetic data, NA=NoCall
x<-matrix(round(runif(100)*10) %% 3,ncol=4)
#introduce NA=NoCall
x[1,1] <-NA
#response, case or control
y<-rep(c(0,1,1,0,1),5)
leaps(x,y)
Error in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = NCOL(x) + int, :
NA/NaN/Inf in foreign function call (arg 4)
Using only complete.cases() is not an option as I lose 80% of data.
What is an alternative to leaps() that can handle NAs? I am writing my own function to do something similar, but it is getting big and clunky, and I feel like I am reinventing the wheel...
UPDATE:
I have tried using stepAIC(), facing the same problem of missing data:
Error in stepAIC(fit) :
number of rows in use has changed: remove missing values?
You may try bestglm::bestglm, where the branch-and-bound method can be specified. The NAs can be handled by the na.action argument, as in glm. See here for additional information:
http://cran.r-project.org/web/packages/bestglm/vignettes/bestglm.pdf
This is a stats problem, as AIC can't compare models built with different data sets. So to compare models with and without certain variables, you need to remove the rows with missing values for those variables. You may need to "reconsider your modeling strategy", to quote Ben Bolker.
Otherwise you may also want to look into variants of AIC; a quick Google search brings up a recent JASA article that might be a good starting point.
- Aaron
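To illustrate the point about comparing models on the same rows, here is a small base R sketch on simulated data (variable names and values are invented). It does not get around the data loss the question complains about; it only shows how to make the AIC comparison valid.

```r
set.seed(42)
d <- data.frame(
  y  = rbinom(50, 1, 0.5),
  x1 = rnorm(50),
  x2 = rnorm(50)
)
d$x2[1:5] <- NA  # introduce some missing values in one predictor

# Restrict to rows that are complete for ALL candidate variables,
# so that every candidate model is fitted to the same observations
vars <- c("y", "x1", "x2")
d_cc <- d[complete.cases(d[, vars]), ]

fit1 <- glm(y ~ x1,      family = binomial, data = d_cc)
fit2 <- glm(y ~ x1 + x2, family = binomial, data = d_cc)

# Same number of observations in both fits, so the AICs are comparable
c(nobs(fit1), nobs(fit2))
AIC(fit1, fit2)
```

Fitting fit1 on the full d instead would use 50 rows against 45 for fit2, which is exactly the situation stepAIC() complains about.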
