Tidymodels: can step_impute_linear() be used when every column contains NAs?

My data contain more than 100 columns, and every one of them contains NAs. When I try to use step_impute_linear(), it returns a warning:
Warning message:
There were missing values in the predictor(s) used to impute;
imputation did not occur.
Can I somehow make it work?

I think you'll need to use at least two steps of imputation.
First you will need to choose some variables to impute with something very simple, like the median or mode. I would choose the variables with lower rates of missingness for this.
Next you can choose some variables to impute with linear models, using only complete variables (the ones you imputed first with, say, the median). I would choose variables with higher rates of missingness for this, I think.
Here is an example analysis where I took this approach:
bb_rec <-
  recipe(is_home_run ~ launch_angle + launch_speed + plate_x + plate_z +
           bb_type + bearing + pitch_mph +
           is_pitcher_lefty + is_batter_lefty +
           inning + balls + strikes + game_date,
         data = bb_train
  ) %>%
  step_date(game_date, features = c("week"), keep_original_cols = FALSE) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_impute_median(all_numeric_predictors(), -launch_angle, -launch_speed) %>%
  step_impute_linear(launch_angle, launch_speed,
                     impute_with = imp_vars(plate_x, plate_z, pitch_mph)
  ) %>%
  step_nzv(all_predictors())
If you want to try out different imputation strategies, I suggest setting up a workflow set (workflowsets package) and testing the candidates on resampling folds.
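To decide which variables go into the simple step and which into the linear step, it can help to rank the columns by their share of missing values first. The following is only a minimal sketch in base R: the data frame `toy`, its column names, and the 0.2 cut-off are made up for illustration and are not part of the analysis above.

```r
# Made-up data; column names are placeholders.
toy <- data.frame(
  a = c(1, NA, 3, 4, 5, 6),
  b = c(NA, NA, 3, NA, 5, 6),
  c = c(1, 2, 3, 4, 5, NA)
)

# Share of missing values per column, lowest first.
miss_rate <- sort(colMeans(is.na(toy)))

# Arbitrary cut-off (an assumption): columns at or below it are candidates
# for median/mode imputation, the rest for step_impute_linear().
cutoff <- 0.2
simple_vars <- names(miss_rate)[miss_rate <= cutoff]
model_vars  <- names(miss_rate)[miss_rate > cutoff]

simple_vars
model_vars
```

The two name vectors could then be passed to the respective recipe steps, for example with all_of(simple_vars).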

Related

Error: The data used by step_impute_linear() did not have any rows where the imputation values were all complete

I am using the recipe() function and get an error when I use step_impute_linear() inside the recipe to impute NAs. Note that step_impute_median() and step_impute_mean() work without a problem. It also does not matter whether I use:
step_impute_linear(all_predictors()) or
step_impute_linear(all_numeric(),.) etc.
None of the combinations work.
Also note that other methods, like
step_impute_knn(all_nominal(),impute_with = all_predictors(),-has_role("ID"))
fail too.
I also checked the data: not all of the rows contain missing data, and neither do all of the columns.
dt_rec <- recipe(OFFER_STATUS ~ ., data = dt_training) %>%
  # 1. Define Role
  update_role(MO_ID, new_role = "ID") %>%
  update_role(SO_ID, new_role = "ID") %>%
  # turn dates into decimals
  step_mutate_at(where(is.Date), fn = decimal_date) %>%
  # impute all numeric columns with their median
  # 2. Impute
  # step_impute_median(all_numeric(),-has_role("ID"))%>%
  step_impute_linear(all_numeric(), impute_with = ., -has_role("ID")) %>%
  # ignoring novel factors
  # 3. Handle factor levels
  step_novel(all_predictors(), -all_numeric()) %>%
  # impute all other nominal (character + factor) columns with the value "none"
  step_unknown(all_nominal(), new_level = "none") %>%
  step_string2factor(all_nominal(), -all_outcomes(), -has_role("ID")) %>%
  # remove constant columns
  step_zv(all_predictors()) %>%
  # 4. Discretize
  # remove variables that have a high correlation with each other
  # as this will lead to multicollinearity
  step_corr(all_numeric(), threshold = 0.99) %>%
  # normalization --> centering and scaling numeric variables
  # mean = 0 and Sd = 1
  step_normalize(all_numeric()) %>%
  # 5. Dummy variables
  # creating dummy variables for nominal predictors
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # 6. Normalization
  # 7. Multivariate transformation
  step_pca(all_numeric_predictors())
dt_rec
dt_rec %>% summary()
When you use a function like step_impute_linear(), you are saying "impute the values for my variable with other variables". If some of those other variables also have missing data, the model is not going to be able to fit successfully. If you have a set of variables, say x, y, and z, that all have some missing data and that you want to impute using each other, I recommend that you:
first, impute one or more of the variables (say x) with a method that depends only on that variable, such as the median;
then, impute the other variables using only the predictors that are now complete, with no missing data (say, impute y and z based on x).
It's not going to work out if you try to fit a whole set of linear models for a set of variables using each other, all of which have missing data.
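The two-stage idea for x, y, and z can be sketched as follows. This is an illustration on simulated data, not the asker's recipe: the data frame `toy`, the column `outcome`, and the positions of the missing values are all assumptions.

```r
library(recipes)

# Simulated data (hypothetical): y and z are both related to x.
set.seed(1)
n <- 50
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)
toy$z <- -toy$x + rnorm(n)
toy$outcome <- rnorm(n)

# Knock out a few values in every predictor.
toy$x[c(1, 2)] <- NA
toy$y[c(3, 4, 5)] <- NA
toy$z[c(6, 7, 8)] <- NA

rec <- recipe(outcome ~ x + y + z, data = toy) %>%
  # Stage 1: x is imputed from its own median only.
  step_impute_median(x) %>%
  # Stage 2: y and z are modelled from x, which is complete by this point.
  step_impute_linear(y, z, impute_with = imp_vars(x))

imputed <- bake(prep(rec), new_data = NULL)

# Count of remaining missing values per column.
colSums(is.na(imputed))
```

The order of the steps matters here: the linear step only sees a complete x because the median step runs before it.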

Issue with multiple logistic regression code with multiple predictors

I am trying to perform multiple logistic regression with some of the variables that came out as statistically significant for a disease condition in univariate analysis. We took p<0.2 as the cut-off, since our sample size was ~300. I made a new data frame for these variables:
regression1df <- data.frame(dgfcriteria, recipientage, ESRD_dx,bmirange,graftnumber, dsa_class_1, organ_tx, transfuse01m, transfuse1yr, readmission1yr, citrange1, switrange, anastamosisrange, donorage, donorgender, donorcriteria, donorionotrope, intubaterange, kdpirange, kdrirange, eptsrange, proteinuria, terminalurea, na.rm=TRUE)
I'm using these variables to predict the disease condition, which is DGF (dgfcriteria==1); non-disease is no DGF (dgfcriteria==0).
Here is the structure of the data.
When I tried to run the entire list of variables with the glm code I got:
predictors1 <- glm(dgfcriteria ~ .,
                   data = predictors1df,
                   family = "binomial")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
But when I run it with only some of the variables of the dataframe, there is an output.
predictors1 <- glm(dgfcriteria ~ recipientage + ESRD_dx + bmirange + graftnumber + dsa_class_1 + organ_tx + transfuse01m + transfuse1yr + readmission1yr +citrange1 +switrange + anastamosisrange+ donorage+ donorgender + donorcriteria + donorionotrope,
data = predictors1df,
family = "binomial" )
This output looks really strange though, with a lot of NAs.
Where have I gone wrong?
Looking at your data structure, you've got a lot of missing values. Quite a few of your variables look to have only 2 or 3 non-missing values in the first 10 rows. When you run regression on data with missing values, the default is to drop all rows that have any missing values.
Apparently some of your data has bad overlaps, so that when all the rows with missing values are dropped (see na.omit(your_data) for what is left over), some variables only have one level left and are therefore no longer fit for regression. Of course, when you only use some variables, fewer rows will be dropped and you may be in a better situation.
So, you'll have to decide what to do with your missing values. This should depend on your goals and your understanding of the reasons for missingness. Common possibilities include omission, imputation, creating new "missing" levels, and taking level of missingness into account in your variable selection.
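One way to see which variables are affected is to apply na.omit() and count the distinct values left in each column. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `outcome`, `x`, and `group`), not the data from the question.

```r
# Made-up data: 'group' has two levels overall, but every row with
# level "b" is also missing 'x'.
toy <- data.frame(
  outcome = c(0, 1, 0, 1, 1, 0),
  x       = c(1.2, 0.7, 3.1, NA, NA, 2.2),
  group   = factor(c("a", "a", "a", "b", "b", "a"))
)

# Rows that survive listwise deletion (the default na.action).
complete <- na.omit(toy)

# Number of distinct observed values per column after deletion.
n_distinct <- sapply(complete, function(col) length(unique(col)))

# Columns left with a single value can no longer serve as predictors.
single_valued <- names(n_distinct)[n_distinct < 2]
single_valued
```

Any factor that shows up in `single_valued` is a likely source of the contrasts error when the full formula is used.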

Run svymean on all variables

------ Short story--------
I would like to run svymean on all variables in the dataset (assuming they are all numeric). I've pulled this narrative from this guide over here: https://stylizeddata.com/how-to-use-survey-weights-in-r/
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long (they are all numeric), and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)
Any ideas?
--------- Long explanation with real data-----
library(haven)
library(survey)
library(dplyr)
# Import NHANES demographic data
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
# Copy and rename variables so they are more intuitive. "fpl" is percent of the
# federal poverty level. It ranges from 0 to 5.
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
# Since there are 47 variables, we will select only the variables we will use in
# this analysis.
nhanesAnalysis <- nhanesDemo %>%
  select(fpl,
         age,
         gender,
         persWeight,
         psu,
         strata)
# Survey Weights
# Here we use "svydesign" to assign the weights. We will use this new design
# variable "nhanesDesign" when running our analyses.
nhanesDesign <- svydesign(id = ~psu,
                          strata = ~strata,
                          weights = ~persWeight,
                          nest = TRUE,
                          data = nhanesAnalysis)
# Here we use "subset" to tell "nhanesDesign" that we want to only look at a
# specific subpopulation (i.e., those age between 18-79 years). This is
# important to do. If you don't do this and just restrict it in a different way
# your estimates won't have correct SEs.
ageDesign <- subset(nhanesDesign, age > 17 &
                      age < 80)
# Statistics
# We will use "svymean" to calculate the population mean for age. The na.rm
# argument "TRUE" excludes missing values from the calculation. We see that
# the mean age is 45.648 and the standard error is 0.5131.
svymean(~age, ageDesign, na.rm = TRUE)
I know I can run svymean on all the variables by listing them out like this:
svymean(~age+gender, ageDesign, na.rm = TRUE)
However, my real dataset is 500 variables long, and I need to get the means all at once more efficiently. I tried the following but it does not work.
svymean(~., ageDesign, na.rm = TRUE)
You can use reformulate to construct the formula dynamically.
library(survey)
svymean(reformulate(names(nhanesAnalysis)), ageDesign, na.rm = TRUE)
#                  mean        SE
#fpl             3.0134    0.1036
#age            45.4919    0.5273
#gender          1.5153    0.0065
#persWeight  80773.3847 5049.1504
#psu             1.5102    0.1330
#strata        126.1877    0.1506
This gives the same output as specifying each column individually in the function.
svymean(~age + fpl + gender + persWeight + psu + strata, ageDesign, na.rm = TRUE)
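If the weights and design variables should not appear among the means, the names can be filtered before the formula is built. A minimal sketch, assuming the column names from the example above; `design_vars` and `analysis_vars` are names introduced here for illustration only.

```r
# Column names as in the example; in practice these would come from names(nhanesAnalysis).
all_vars <- c("fpl", "age", "gender", "persWeight", "psu", "strata")

# Drop the design variables, assuming their means are not of interest.
design_vars <- c("persWeight", "psu", "strata")
analysis_vars <- setdiff(all_vars, design_vars)

# Build a one-sided formula from the remaining names.
f <- reformulate(analysis_vars)
f
```

The resulting formula could then be used as svymean(f, ageDesign, na.rm = TRUE).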

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame: datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look at the effects of the environmental variables on a different, composite, variable
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been through several tutorials and tried many iterations of each issue. What I have provided here is, I think, the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single-column data frame to a vector).
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
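The effect of drop = FALSE can be checked on a small made-up data frame. The values in `toy` below are placeholders and have nothing to do with the real data file.

```r
# Placeholder data with the same column name as in the question.
toy <- data.frame(ALL = c(0.1, 0.9, 0.5), Depth = c(10, 20, 30))

as_vector <- toy[, "ALL"]                # dimensions dropped: plain numeric vector
as_frame  <- toy[, "ALL", drop = FALSE]  # still a one-column data frame

is.data.frame(as_vector)
is.data.frame(as_frame)
dim(as_frame)
```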
The problem is that rda expects a separate df for the first part of the formula (ALL in your code), and does not use the one in the data = argument.
As mentioned above, you can create a new df with the variable needed for analysis, but here's a one-line solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin Simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and its rowid function to convert the categorical variables to integers or, preferably, not use them at all. I also wanted to set the ID vector as site names, but did not get around to it.
library(data.table)
library(vegan)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers doesn't do this for ID (Why?).
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data = df, scale = TRUE)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and depth have the highest variation. Almost logical, since there are only three vectors used. The assignment of integers to the categorical vector has probably no meaning at all. The function assigns from top to bottom unique integers to the following unique character string. I am also not really sure which question you want to answer. Based on this you can organize the data frame.

vegan package cca error: rowsum(X) must be >0: missing value where TRUE/FALSE needed

I am trying to run a Canonical Correspondence Analysis on diet composition data (prey.counts) with respect to a suite of environmental variables (envvar). Every row and every column sums to greater than 0, but I keep getting this error message:
diet <- cca(prey.counts, envvar$SL + envvar$Month + envvar$water.temp +
envvar$salinity + envvar$DO)
Error in if (any(rowSums(X) <= 0)) stop("All row sums must be >0 in the community data matrix") :
missing value where TRUE/FALSE needed
I have double and triple checked the prey.counts dataframe for NAs or empty columns/rows and none of them sum to zero or are missing values. R, RStudio, and all packages are fully up to date. Any help would be appreciated!
Meredith
The problem is how you are calling the function: you seem to be mixing the default and formula interfaces (and abusing the formula notation whilst you are at it).
Does this help:
diet <- cca(prey.counts ~ SL + Month + water.temp + salinity + DO, data = envvar)
Alternatively, if the named variables are the only ones in envvar, you could do either of
diet <- cca(prey.counts ~ ., data = envvar)
or
diet <- cca(prey.counts, envvar)
with the latter using the less flexible but simple default method for cca().
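The two interfaces can be compared on the example data sets that ship with vegan. This is a sketch under assumptions: `varespec` and `varechem` stand in for prey.counts and envvar, and the three constraints (Al, P, K) are chosen arbitrarily.

```r
library(vegan)

data(varespec)  # community data shipped with vegan
data(varechem)  # matching environmental data

# Formula interface: constraint names are looked up in 'data'.
ord_formula <- cca(varespec ~ Al + P + K, data = varechem)

# Default interface: the second argument is a data frame of constraints.
ord_default <- cca(varespec, varechem[, c("Al", "P", "K")])

# Both calls should give the same constrained inertia.
all.equal(ord_formula$CCA$tot.chi, ord_default$CCA$tot.chi)
```

In neither call are the constraints written as envvar$... terms joined by +; the formula interface takes bare column names, and the default interface takes a whole data frame.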
