randomForest Categorical Predictor Limits

I understand and appreciate that R's randomForest function can only handle categorical predictors with fewer than 54 categories. However, when I trim my categorical predictor down to fewer than 54 categories, I still get the error. The only questions I've seen about categorical predictor limits on Stack Overflow ask how to get around the limit, but I'm trying to trim my number of categories to fit within the function's limitations, and I still get the error.
The following script creates a data frame so we can predict 'profession'. Understandably, I get the "Can not handle categorical predictors with more than 53 categories" error when trying to run randomForest() on 'df' due to the 'college_id' variable.
But when I trim my data set to only include the top 40 college IDs, I get the same error. Am I missing some basic data frame concept that retains all of the categories even though only 40 are populated in the 'df2' data frame? What workaround can I use?
library(dplyr)
library(randomForest)

# create data frame
df <- data.frame(profession = sample(c("accountant", "lawyer", "dentist"), 10000, replace = TRUE),
                 zip = sample(c("32801", "32807", "32827", "32828"), 10000, replace = TRUE),
                 salary = sample(c(50000:150000), 10000, replace = TRUE),
                 college_id = as.factor(c(sample(c(1001:1040), 9200, replace = TRUE),
                                          sample(c(1050:9999), 800, replace = TRUE))))

# results in error, as expected
rfm <- randomForest(profession ~ ., data = df)

# arrange college_ids by count and retain the top 40 in the 'df' data frame
sdf <- df %>%
  dplyr::group_by(college_id) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n))
sdf <- sdf[1:40, ]
df2 <- dplyr::inner_join(df, sdf, by = "college_id")
df2$n <- NULL

# confirm that df2 only contains 40 categories of 'college_id'
nrow(df2[which(!duplicated(df2$college_id)), ])

# THIS IS WHAT I WANT TO RUN, BUT STILL RESULTS IN ERROR
rfm2 <- randomForest(profession ~ ., data = df2)

I think you still have all of the original factor levels in your variable: subsetting a factor keeps unused levels around, and randomForest counts levels rather than the values actually present. Try adding this line before you fit the forest again:
df2$college_id <- factor(df2$college_id)
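A quick way to verify this, as a sketch: compare the level count before and after re-factoring (droplevels() is an equivalent base-R alternative):
nlevels(df2$college_id)                 # far more than 40: unused levels persist
df2$college_id <- droplevels(df2$college_id)   # same effect as factor(df2$college_id)
nlevels(df2$college_id)                 # now 40
rfm2 <- randomForest(profession ~ ., data = df2)  # runs without the error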

Related

mice - "sample" imputation method not working correctly

I am using mice to impute missing data in a large dataset (24k obs, 98 vars). I am using the "sample" imputation method to impute some variables (and other methods for the others, many of which are categorical). When I check my imputed data, the variables I've applied "sample" to are not always imputed, and I still have missingness in them. I know for sure that I'm applying "sample" to them (I double-checked the methods), and I made sure to remove all predictors of them in the prediction matrix. From my understanding, where they sit in the visit sequence shouldn't matter (but I made sure they're immediately after variables with no missingness).
I can't give you a reprex because when I try to recreate the problem, it doesn't happen and everything is imputed just fine. I tried simulating my own data and I tried subsetting the dataset to a group of the variables that I want to use the sample method on. That's part of why I'm so stumped - I coded everything the same and it worked with the subset. I didn't think that the sample method would be at all dependent on the presence of any other vars.
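For reference, a minimal sketch of what the "sample" method should do on toy data (an illustration, not the asker's data): every missing cell is filled with a random draw from the observed values of that variable, so no missingness should remain afterwards.
library(mice)

set.seed(1)
toy <- data.frame(a = rnorm(100), b = rnorm(100))
toy$b[sample(100, 20)] <- NA                # introduce missingness in b

meth <- make.method(toy)                    # default method per column
meth["b"] <- "sample"                       # impute b by resampling observed values
imp <- mice(toy, method = meth, m = 2, printFlag = FALSE)

sum(is.na(mice::complete(imp, 1)))          # expect 0: everything imputed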
EDIT:
This is the code I'm using:
library(mice)
library(dplyr)

# produce prediction matrix
# (quickpred_ext() is not exported by mice; presumably an extended quickpred())
pred1 <- quickpred_ext(data1, mincor = 0.08, include = "age")
pred2 <- pred1

# for vars not to be imputed, set all predictors to 0
data_no_impute <- data1 %>%
  select(contains(c("exp_", "outcome_"))) %>%
  select(sort(names(.))) %>%
  names()
data_level3 <- data1 %>%
  select(contains(c("f4", "f5", "f6")),
         k22) %>%
  select(sort(names(.))) %>%
  names()
pred2[data_no_impute, ] <- 0
pred2[data_level3, ] <- 0

# produce initial methods and visit sequence
initial <- mice(data1, max = 0, print = F, vis = "monotone",
                defaultMethod = c("pmm", "logreg", "polyreg", "polr"))

# edit methods to be blank for vars I don't want to impute, "sample" for level 3
meth1 <- initial$meth
meth2 <- meth1
meth2[data_level3] <- "sample"
meth2[data_no_impute] <- ""
visits1 <- initial$visitSequence
visits2 <- visits1
visits2 <- append(visits2, data_level3, 22)

# run mice test
mice_test <- mice(data1, m = 2, print = F,
                  predictorMatrix = pred2,
                  method = meth2,
                  vis = visits2,
                  nnet.MaxNWts = 3000)

# pull second completed dataset
imput1 <- mice::complete(mice_test, 2, include = F)

# look at missingness patterns
missingness_pattern2 <- md.pattern(imput1, plot = F)

How to use R package `caret` to run `pls::plsr()` with multiple responses

caret::train() does not seem to accept y if y is a matrix with multiple columns.
Thanks for any help!
That's correct. Perhaps you want the tidymodels package? Kuhn has said there would be support for a multivariate response in it. Here's evidence in favor of my suggestion: https://www.tidymodels.org/learn/models/pls/
Do a search of that document for plsr:
library(tidymodels)
library(pls)

get_var_explained <- function(recipe, ...) {
  # Extract the predictors and outcomes into their own matrices
  y_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_outcomes())
  x_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_predictors())

  # The pls package prefers the data in a data frame where the outcome
  # and predictors are in _matrices_. To make sure this is formatted
  # properly, use the `I()` function to inhibit `data.frame()` from making
  # all the individual columns. `pls_format` should have two columns.
  pls_format <- data.frame(
    endpoints = I(y_mat),
    measurements = I(x_mat)
  )

  # Fit the model
  mod <- plsr(endpoints ~ measurements, data = pls_format)

  # Get the proportion of the predictor variance that is explained
  # by the model for different numbers of components.
  xve <- explvar(mod) / 100

  # Doing the same for the outcome is more complex. This code
  # was extracted from pls:::summary.mvr.
  explained <-
    drop(pls::R2(mod, estimate = "train", intercept = FALSE)$val) %>%
    # transpose so that components are in rows
    t() %>%
    as_tibble() %>%
    # Add the predictor proportions
    mutate(predictors = cumsum(xve) %>% as.vector(),
           components = seq_along(xve)) %>%
    # Put into a tidy format that is tall
    pivot_longer(
      cols = c(-components),
      names_to = "source",
      values_to = "proportion"
    )
}
# We compute this data frame for each resample and save the results in
# the different columns.
folds <-
  folds %>%
  mutate(var = map(recipes, get_var_explained),
         var = unname(var))

# To extract and aggregate these data, simple row binding can be used to
# stack the data vertically. Most of the action happens in the first 15
# components, so let's filter the data and compute the average proportion.
variance_data <-
  bind_rows(folds[["var"]]) %>%
  filter(components <= 15) %>%
  group_by(components, source) %>%
  summarize(proportion = mean(proportion))
Note that this is not a fully reproducible block on its own: it assumes the `folds` object (resamples with a `recipes` column) built earlier in the linked article.
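For completeness, here is a sketch of the missing setup, based on the same linked article (assumptions: it uses the meats data from the modeldata package and a normalizing recipe, as that page does):
library(tidymodels)

# meats has three outcomes (water, fat, protein) and spectral predictors
data(meats, package = "modeldata")
norm_rec <- recipe(water + fat + protein ~ ., data = meats) %>%
  step_normalize(everything())

# resample, attaching a prepped recipe to each split
set.seed(123)
folds <- vfold_cv(meats, repeats = 10) %>%
  mutate(recipes = map(splits, prepper, recipe = norm_rec))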

Run svymean on all variables [duplicate]

This question already has an answer here:
Is there a better alternative than string manipulation to programmatically build formulas?
(1 answer)
Closed 2 years ago.
------ Short story ------
I would like to run svymean on all variables in the dataset (assuming they are all numeric). I've pulled this narrative from this guide: https://stylizeddata.com/how-to-use-survey-weights-in-r/
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long (they are all numeric), and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)
Any ideas?
------ Long explanation with real data ------
library(haven)
library(survey)
library(dplyr)
Import NHANES demographic data
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
Copy and rename variables so they are more intuitive. "fpl" is the percent of
the federal poverty level. It ranges from 0 to 5.
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
Since there are 47 variables, we will select only the variables we will use in
this analysis.
nhanesAnalysis <- nhanesDemo %>%
  select(fpl,
         age,
         gender,
         persWeight,
         psu,
         strata)
Survey Weights
Here we use "svydesign" to assign the weights. We will use this new design
variable "nhanesDesign" when running our analyses.
nhanesDesign <- svydesign(id = ~psu,
                          strata = ~strata,
                          weights = ~persWeight,
                          nest = TRUE,
                          data = nhanesAnalysis)
Here we use "subset" to tell "nhanesDesign" that we want to look only at a
specific subpopulation (i.e., those aged between 18 and 79 years). This is
important to do: if you skip it and just restrict the data in a different way,
your estimates won't have correct SEs.
ageDesign <- subset(nhanesDesign, age > 17 & age < 80)
Statistics
We will use "svymean" to calculate the population mean for age. The na.rm
argument "TRUE" excludes missing values from the calculation. We see that
the mean age is 45.648 and the standard error is 0.5131.
svymean(~age, ageDesign, na.rm = TRUE)
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long, and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)
You can use reformulate to construct the formula dynamically.
library(survey)
svymean(reformulate(names(nhanesAnalysis)), ageDesign, na.rm = TRUE)
#                  mean        SE
# fpl            3.0134    0.1036
# age           45.4919    0.5273
# gender         1.5153    0.0065
# persWeight 80773.3847 5049.1504
# psu            1.5102    0.1330
# strata       126.1877    0.1506
This gives the same output as specifying each column individually in the function.
svymean(~age + fpl + gender + persWeight + psu + strata, ageDesign, na.rm = TRUE)
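If the design variables themselves shouldn't be averaged, a small variant (a sketch, assuming you only want the substantive columns) drops them before building the formula:
# Drop the design variables (psu, strata, persWeight) before building
# the formula; their weighted means are rarely of interest.
vars <- setdiff(names(nhanesAnalysis), c("psu", "strata", "persWeight"))
svymean(reformulate(vars), ageDesign, na.rm = TRUE)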

How to get around error "factor has new levels" in cross-validation glm?

My goal is to use cross-validation to evaluate the performance of a linear model.
My problem is that my training and testing sets might not always have the same variable levels.
Here is a reproducible data example:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)
summary(data)
Now let's make a glm model:
model_glm <- glm(x~., data = data)
And let's use cross-validation on this model:
library(boot)
cross_validation_glm <- cv.glm(data = data, glmfit = model_glm, K = 10)
And this is the kind of error output that you will get:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor z has new levels F
If you don't get this error, re-run the cross-validation and at some point you will get a similar one.
The nature of the problem is that when you do cross-validation, the train and test subsets might not have exactly the same factor levels. Here our variable z has three levels (D, E, F), and in the data as a whole there are many more D's than E's and F's.
Thus, whenever you take a small subset of the whole data (to do cross-validation), there is a very good chance that your z values are all at the D level, so the E and F levels get dropped and we get the error (this answer is helpful for understanding the problem: https://stackoverflow.com/a/51555998/10972294).
My question is: how to avoid the drop in the first place?
If it is not possible, what are the alternatives?
(Keep in mind that this a reproducible example, the actual data I am using has many variables like z, I would like to avoid deleting them.)
To answer your question in the comment, I don't know if there is a function for this or not. Most likely there is one, but I have no idea which package would contain it. For this example, this function should work:
set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A", "B"), times = c(500, 500))
z <- rep(x = c("D", "E", "F"), times = c(997, 2, 1))
data <- data.frame(x, y, z)

# optional: tag rows for later identification
# data$rowid <- 1:nrow(data)

stratified <- function(df, column, percent) {
  # split dataframe into groups based on column
  listdf <- split(df, df[[column]])
  testsubgroups <- lapply(listdf, function(x) {
    # pick the number of samples per group, rounding up
    numsamples <- ceiling(percent * nrow(x))
    # select the rows
    whichones <- sample(1:nrow(x), numsamples, replace = FALSE)
    testsubgroup <- x[whichones, ]
  })
  # combine the subgroups into one data frame
  testgroup <- do.call(rbind, testsubgroups)
  testgroup
}

testgroup <- stratified(data, "z", 0.8)
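A quick check that every level of z survives the stratified split (since ceiling() guarantees at least one row per group, the per-level counts with percent = 0.8 are deterministic):
table(testgroup$z)
#   D   E   F
# 798   2   1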
This will just split the initial data by column z. If you are interested in grouping by multiple columns, this could be extended using the group_by function from the dplyr package, but that would be another question.
A comment on the statistics: if you only have a few examples of any particular factor level, what type of fit do you expect? A poor fit with wide confidence limits.

"Error in model.frame.default(data = train, formula = cost ~ .) : variable lengths differ", but all variables are length 76?

I'm modeling burrito prices in San Diego to determine whether some burritos are over- or under-priced (according to the model). I'm attempting to use regsubsets() to determine the best linear model, chosen by BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that variable lengths differ, so the model can't be fit.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs
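A likely cause, as an assumption since the cleaned CSV isn't shown: the formula cost ~ . picks up the free-standing cost vector (76 values, extracted from the full data above), while the . expands to the columns of the 50-row train data frame, so the model sees variables of different lengths. Referring to the column as it is named inside train avoids the mismatch:
# Hypothetical fix: use the column name as it exists inside `train`
# (assumed to be "Cost"), so every model variable comes from the same
# 50-row data frame.
full <- regsubsets(Cost ~ ., data = train)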
