Error: The data used by step_impute_linear() did not have any rows where the imputation values were all complete

I am using the recipe() function and get an error when using step_impute_linear() inside the recipe to impute NAs. Note that step_impute_median() and step_impute_mean() work without a problem. It also does not matter whether I use:
step_impute_linear(all_predictors()) or
step_impute_linear(all_numeric(), .) etc.
None of the combinations work.
Also note that other methods like:
step_impute_knn(all_nominal(), impute_with = all_predictors(), -has_role("ID"))
fail too.
I also checked the data: not all of the rows contain missing data, and neither do all of the columns.
dt_rec <- recipe(OFFER_STATUS ~ ., data = dt_training) %>%
  # 1. Define roles
  update_role(MO_ID, new_role = "ID") %>%
  update_role(SO_ID, new_role = "ID") %>%
  # turn dates into decimals
  step_mutate_at(where(is.Date), fn = decimal_date) %>%
  # 2. Impute
  # impute all numeric columns with their median
  # step_impute_median(all_numeric(), -has_role("ID")) %>%
  step_impute_linear(all_numeric(), impute_with = ., -has_role("ID")) %>%
  # 3. Handle factor levels
  # ignoring novel factors
  step_novel(all_predictors(), -all_numeric()) %>%
  # impute all other nominal (character + factor) columns with the value "none"
  step_unknown(all_nominal(), new_level = "none") %>%
  step_string2factor(all_nominal(), -all_outcomes(), -has_role("ID")) %>%
  # remove constant columns
  step_zv(all_predictors()) %>%
  # 4. Remove variables that have a high correlation with each other,
  # as this would lead to multicollinearity
  step_corr(all_numeric(), threshold = 0.99) %>%
  # 5. Normalization: centering and scaling numeric variables
  # (mean = 0 and SD = 1)
  step_normalize(all_numeric()) %>%
  # 6. Dummy variables: create dummy variables for nominal predictors
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # 7. Multivariate transformation
  step_pca(all_numeric_predictors())

dt_rec
dt_rec %>% summary()

When you use a function like step_impute_linear(), you are saying "impute the values for my variable with these other variables". If some of those other variables also have missing data, the model is not going to be able to fit successfully. If you have a set of variables, say x, y, and z, that all have some missing data and that you want to impute using each other, I recommend that you:
1. impute one or more of the variables (say x) with a method that depends only on that variable, such as the median, and
2. impute the other variables using only predictors that are now complete, with no missing data (say, impute y and z based on x).
It's not going to work out if you try to fit a whole set of linear models for a set of variables using each other when all of them have missing data.
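Here is a minimal sketch of that two-step approach; the data frame dat, the outcome, and the columns x, y, and z are hypothetical placeholders, not objects from the question:
library(recipes)
# dat is a hypothetical data frame whose numeric columns x, y, and z
# all contain scattered NAs
rec <- recipe(outcome ~ ., data = dat) %>%
  # step 1: impute x with a method that depends only on x itself
  step_impute_median(x) %>%
  # step 2: impute y and z with linear models that use only the
  # now-complete x as the predictor
  step_impute_linear(y, z, impute_with = imp_vars(x))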

Related

How to use R package `caret` to run `pls::plsr()` with multiple responses

caret::train() does not seem to accept y if y is a matrix with multiple columns.
Thanks for any help!
That's correct. Perhaps you want the tidymodels package? Kuhn has said there would be support for multivariate response in it. Here's evidence in favor of my suggestion: https://www.tidymodels.org/learn/models/pls/
Do a search of that document for plsr:
library(tidymodels)
library(pls)
get_var_explained <- function(recipe, ...) {
  # Extract the predictors and outcomes into their own matrices
  y_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_outcomes())
  x_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_predictors())

  # The pls package prefers the data in a data frame where the outcome
  # and predictors are in _matrices_. To make sure this is formatted
  # properly, use the `I()` function to inhibit `data.frame()` from making
  # all the individual columns. `pls_format` should have two columns.
  pls_format <- data.frame(
    endpoints = I(y_mat),
    measurements = I(x_mat)
  )

  # Fit the model
  mod <- plsr(endpoints ~ measurements, data = pls_format)

  # Get the proportion of the predictor variance that is explained
  # by the model for different numbers of components.
  xve <- explvar(mod) / 100

  # Doing the same for the outcome is more complex. This code
  # was extracted from pls:::summary.mvr.
  explained <-
    drop(pls::R2(mod, estimate = "train", intercept = FALSE)$val) %>%
    # transpose so that components are in rows
    t() %>%
    as_tibble() %>%
    # Add the predictor proportions
    mutate(predictors = cumsum(xve) %>% as.vector(),
           components = seq_along(xve)) %>%
    # Put into a tidy format that is tall
    pivot_longer(
      cols = c(-components),
      names_to = "source",
      values_to = "proportion"
    )
}
# We compute this data frame for each resample and save the results
# in the different columns.
folds <-
  folds %>%
  mutate(var = map(recipes, get_var_explained),
         var = unname(var))

# To extract and aggregate these data, simple row binding can be used to
# stack the data vertically. Most of the action happens in the first 15
# components, so let's filter the data and compute the average proportion.
variance_data <-
  bind_rows(folds[["var"]]) %>%
  filter(components <= 15) %>%
  group_by(components, source) %>%
  summarize(proportion = mean(proportion))
This might not be a reproducible code block on its own; it may need additional data or packages from the linked article.

How to plot interactions of predictors when the response data is ordinal in r?

I have a dataset with an ordinal target variable. I need to draw an interaction plot of one continuous and one categorical variable to see whether their interaction matters.
Here, I will use the built-in diamonds dataset for reproducibility. Let's pretend that cut is the target variable. I tried to use this:
interaction.plot(diamonds$carat, diamonds$clarity, diamonds$cut)
which gives me this error:
Error in plot.window(...) : need finite 'ylim' values
It works for continuous target variables, but my data has an ordinal response variable. Should I recode my target variable for this function, or is there a better way of plotting it?
The plot should look like the example here: https://www.statology.org/interaction-plot-r/
You have to make the target binary and numeric for interaction.plot() to work:
library(tidyverse)
data("diamonds")
str(diamonds)

df1 <- diamonds %>%
  filter(cut %in% c("Fair", "Good")) %>%       # make it binary
  mutate(cut = ifelse(cut == "Fair", 0, 1))    # make it numeric

# check it
str(df1)

# plot it
interaction.plot(x.factor = df1$color,
                 trace.factor = df1$clarity,
                 response = df1$cut,
                 trace.label = "Clarity")
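If you would rather keep all of the ordinal levels instead of collapsing to a binary outcome, one option is to plot the mean of the underlying integer codes; note this treats the levels as equally spaced, which is a modeling judgment call:
# use the integer codes of the ordered factor (1-5 for cut) as the response
interaction.plot(x.factor = diamonds$color,
                 trace.factor = diamonds$clarity,
                 response = as.numeric(diamonds$cut),
                 trace.label = "Clarity")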

Tidymodels. step_impute_linear(), can it be used when every column contains NAs

My data contain >100 columns, and every one of them contains NAs. When I try to use step_impute_linear(), it fails with:
Warning message:
There were missing values in the predictor(s) used to impute;
imputation did not occur.
Can I somehow make it work?
I think you'll need to use at least two steps of imputation.
First you will need to choose some variables to impute with something very simple, like the median or mode. I would choose the variables with lower rates of missingness for this.
Next you can choose some variables to impute with linear models, using only complete variables (the ones you imputed first with, say, the median). I would choose variables with higher rates of missingness for this, I think.
Here is an example analysis where I took this approach:
bb_rec <-
  recipe(is_home_run ~ launch_angle + launch_speed + plate_x + plate_z +
           bb_type + bearing + pitch_mph +
           is_pitcher_lefty + is_batter_lefty +
           inning + balls + strikes + game_date,
         data = bb_train
  ) %>%
  step_date(game_date, features = c("week"), keep_original_cols = FALSE) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_impute_median(all_numeric_predictors(), -launch_angle, -launch_speed) %>%
  step_impute_linear(launch_angle, launch_speed,
                     impute_with = imp_vars(plate_x, plate_z, pitch_mph)
  ) %>%
  step_nzv(all_predictors())
If you want to try out different strategies for imputation, I suggest setting up workflowsets and testing them on resampling folds, along the lines of the sketch below.
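A minimal sketch of such a comparison, assuming two recipes rec_median and rec_linear (identical except for their imputation steps) and resampling folds bb_folds already exist; none of these objects come from the answer above:
library(tidymodels)
# build one workflow per recipe/model combination and fit each on the
# same resamples (rec_median, rec_linear, and bb_folds are assumed
# to exist already)
imputation_set <-
  workflow_set(
    preproc = list(median = rec_median, linear = rec_linear),
    models = list(logistic = logistic_reg())
  ) %>%
  workflow_map("fit_resamples", resamples = bb_folds, seed = 1)

collect_metrics(imputation_set)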

Run svymean on all variables [duplicate]

This question already has an answer here:
Is there a better alternative than string manipulation to programmatically build formulas?
(1 answer)
Closed 2 years ago.
------ Short story ------
I would like to run svymean() on all variables in the dataset (assuming they are all numeric). The example below is adapted from this guide: https://stylizeddata.com/how-to-use-survey-weights-in-r/
I know I can run svymean() on all the variables by listing them out like this:
svymean(~age + gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long (they are all numeric), and I need to get all the means at once, more efficiently. I tried the following, but it does not work:
svymean(~., ageDesign, na.rm = TRUE)
Any ideas?
------ Long explanation with real data ------
library(haven)
library(survey)
library(dplyr)
Import NHANES demographic data
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
Copy and rename variables so they are more intuitive. "fpl" is the percent of
the federal poverty level. It ranges from 0 to 5.
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
Since there are 47 variables, we will select only the variables we will use in
this analysis.
nhanesAnalysis <- nhanesDemo %>%
  select(fpl,
         age,
         gender,
         persWeight,
         psu,
         strata)
Survey Weights
Here we use "svydesign" to assign the weights. We will use this new design
variable "nhanesDesign" when running our analyses.
nhanesDesign <- svydesign(id = ~psu,
                          strata = ~strata,
                          weights = ~persWeight,
                          nest = TRUE,
                          data = nhanesAnalysis)
Here we use "subset" to tell "nhanesDesign" that we want to only look at a
specific subpopulation (i.e., those age between 18-79 years). This is
important to do. If you don't do this and just restrict it in a different way
your estimates won't have correct SEs.
ageDesign <- subset(nhanesDesign, age > 17 &
age < 80)
Statistics
We will use "svymean" to calculate the population mean for age. The na.rm
argument "TRUE" excludes missing values from the calculation. We see that
the mean age is 45.648 and the standard error is 0.5131.
svymean(~age, ageDesign, na.rm = TRUE)
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long, and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)
You can use reformulate to construct the formula dynamically.
library(survey)
svymean(reformulate(names(nhanesAnalysis)), ageDesign, na.rm = TRUE)
#                  mean         SE
# fpl            3.0134     0.1036
# age           45.4919     0.5273
# gender         1.5153     0.0065
# persWeight 80773.3847  5049.1504
# psu            1.5102     0.1330
# strata       126.1877     0.1506
This gives the same output as specifying each column individually in the function.
svymean(~age + fpl + gender + persWeight + psu + strata, ageDesign, na.rm = TRUE)
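For reference, reformulate() builds a one-sided formula from a character vector of term labels:
reformulate(c("age", "gender"))
#> ~age + gender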

randomForest Categorical Predictor Limits

I understand and appreciate that R's randomForest function can only handle categorical predictors with fewer than 54 categories. However, when I trim my categorical predictor down to fewer than 54 categories, I still get the error. The only questions I've seen on Stack Overflow about categorical predictor limits ask how to get around the limit, whereas I am trying to trim my number of categories to fit within the function's limitations and am still getting the error.
The following script creates a data frame so we can predict 'profession'. Understandably, I get the "Can not handle categorical predictors with more than 53 categories" error when trying to run randomForest() on 'df' due to the 'college_id' variable.
But when I trim my data set to only include the top 40 college IDs, I get the same error. Am I missing some basic data frame concept that retains all of the categories even though only 40 are now populated in the 'df2' data frame? What is a workaround option that I can use?
library(dplyr)
library(randomForest)
# create data frame
df <- data.frame(profession = sample(c("accountant", "lawyer", "dentist"), 10000, replace = TRUE),
                 zip = sample(c("32801", "32807", "32827", "32828"), 10000, replace = TRUE),
                 salary = sample(c(50000:150000), 10000, replace = TRUE),
                 college_id = as.factor(c(sample(c(1001:1040), 9200, replace = TRUE),
                                          sample(c(1050:9999), 800, replace = TRUE))))
# results in error, as expected
rfm <- randomForest(profession ~ ., data = df)
# arrange college_ids by count and retain the top 40 in the 'df' data frame
sdf <- df %>%
  dplyr::group_by(college_id) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n))
sdf <- sdf[1:40, ]
df2 <- dplyr::inner_join(df, sdf, by = "college_id")
df2$n <- NULL
# confirm that df2 only contains 40 categories of 'college_id'
nrow(df2[which(!duplicated(df2$college_id)), ])
# THIS IS WHAT I WANT TO RUN, BUT STILL RESULTS IN ERROR
rfm2 <- randomForest(profession ~ ., data = df2)
I think you still have all of the original factor levels in your variable; subsetting or joining a data frame keeps unused factor levels around. Try adding this line before you fit the forest again:
df2$college_id <- factor(df2$college_id)
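Equivalently, droplevels() drops the unused levels left over from subsetting:
df2$college_id <- droplevels(df2$college_id)
# this now runs without the category-limit error
rfm2 <- randomForest(profession ~ ., data = df2)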
