Extracting the relative influence from a gbm.fit object - r

I am trying to extract the relative influence of each variable from a gbm.fit object but it is coming up with the error below:
> summary(boost_cox, plotit = FALSE)
Error in data.frame(var = object$var.names[i], rel.inf = rel.inf[i]) :
row names contain missing values
The boost_cox object itself is fitted as follows:
boost_cox = gbm.fit(x = x,
                    y = y,
                    distribution = "coxph",
                    verbose = FALSE,
                    keep.data = TRUE)
I have to use the gbm.fit function rather than the standard gbm function due to the large number of predictors (26k+).

I have now solved this issue myself.
The relative.influence() function can be used and works for objects created with both gbm() and gbm.fit(). However, it does not produce the plot that the summary() function provides.
I hope this helps anyone else looking in the future.
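For example, here is a minimal sketch assuming the boost_cox fit from above; relative.influence() returns a named numeric vector, which you can sort or plot yourself in place of summary():
library(gbm)
rel_inf <- relative.influence(boost_cox, n.trees = boost_cox$n.trees)
rel_inf <- sort(rel_inf, decreasing = TRUE)  # largest influence first
head(rel_inf, 10)                            # top predictors by relative influence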

Related

Error with svyglm function in survey package in R: "all variables must be in design=argument"

I'm new to Stack Overflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and a binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then I get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data, I was able to get your model to run by setting rescale = TRUE. The documentation states:
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use FALSE to not rescale
weights.
So one possible solution is simply to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
                         ElUnder = sample(0:1, 1000, replace = TRUE),
                         NHISPID = paste0("id", 1:1000),
                         SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign = svydesign(ids = ~NHISPID,
                       data = Under_narm,
                       weights = ~SAMPWEIGHT)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(), # this family avoids warnings
                       rescale = TRUE) # weights rescaled to sum to the sample size
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking at the code for this function's method with survey:::svyglm.survey.design, it seems like there may be a bug. I could be wrong, but by my read, when rescale is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
    g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
    data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a workaround if you assign a vector of numeric values to .survey.prob.weights in the global environment. No idea what these values should be, but the error goes away if you do something like the following. (With this sample data, .survey.prob.weights needs length 2000, matching the number of rows: rep(1:2, 1000) yields 2000 values, and data.frame() recycles the shorter columns to that length.)
SAMPdesign = svydesign(ids = ~NHISPID,
                       data = Under_narm,
                       weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(),
                       rescale = FALSE)
summary(El_under_glm, correlation = TRUE)

Can't give a subset when using randomForest inside a function

I want to create a function that uses the randomForest function from the randomForest package within it. randomForest takes a subset argument, which is a vector of row numbers of the data frame to use for training. However, if I use this argument when calling randomForest inside another user-defined function, I get the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
Here is a reproducible example, where we attempt to train a random forest to classify a response "type" either "A" or "B", based on three numerical predictors:
library(randomForest)
# define a random data frame to train with
train.data = data.frame(
  type = rep(NA, times = 500),
  x = runif(500),
  y = runif(500),
  z = runif(500)
)
train.data$type[runif(500) >= 0.5] = "A"
train.data$type[is.na(train.data$type)] = "B"
train.data$type = as.factor(train.data$type)
# define the training range
training.range = sample(500)[1:300]
# formula to use
tr_form = formula(type ~ x + y + z)
# Function that includes the randomForest function
train_rf = function(form, all_data, tr_subset) {
  p = randomForest(
    formula = form,
    data = all_data,
    subset = tr_subset,
    na.action = na.omit
  )
  return(p)
}
# test the new defined function
test_tree = train_rf(form = tr_form, all_data = train.data, tr_subset = training.range)
Running this gives the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
If, however, subset = tr_subset is removed from the randomForest call, and tr_subset is removed from train_rf's arguments, this code runs fine; however, the whole data set is then used for training!
It should be noted that using the subset argument in randomForest works completely fine when the call is not wrapped in another function, and this is the intended usage of the argument, as described in the vignette linked above.
I know that in the meantime I could just define another training set containing only the rows required and train on all of that, but is there a reason why my original code doesn't work, please?
Thanks.
EDIT: I conjecture that, as subset() is a base R function, R is getting confused and thinking I want the base R function rather than the subset argument of randomForest. I'm not an expert, though, so I may be wrong.
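Here is a minimal sketch of the workaround mentioned above, using the train.data, tr_form, and training.range objects defined earlier: subset the data frame explicitly inside the wrapper instead of passing subset through to randomForest, which sidesteps its non-standard evaluation (the name train_rf2 is just illustrative).
train_rf2 = function(form, all_data, tr_subset) {
  # subset the rows up front, so randomForest never has to evaluate `subset`
  p = randomForest(formula = form,
                   data = all_data[tr_subset, , drop = FALSE],
                   na.action = na.omit)
  return(p)
}
test_tree2 = train_rf2(form = tr_form, all_data = train.data, tr_subset = training.range)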

Not finding 'weightedMean' object for numFun using kNN in VIM package for R

I'm getting an error stating that the object 'weightedMean' is not found when passing it to the numFun parameter of the kNN imputation function in the VIM R package. I'm attempting to impute data in a fairly large dataset, and I want to use kNN with 5 neighbors using weighted means.
Here is my code:
df.imputed <- kNN(df, variable = c(...), dist_var = c(...), numFun = weightedMean, k = 5, weightDist = TRUE, trace = TRUE, imp_var = TRUE)
And the exact error is:
Error in args(numFun) : object 'weightedMean' not found
Based on the documentation (https://cran.r-project.org/web/packages/VIM/VIM.pdf page 29) it seems like this should work.
Try putting it this way: numFun = weighted.mean.
At least it worked for me.
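A minimal sketch of that fix, leaving out the variable and dist_var selections that were elided in the original call:
library(VIM)
# base R's weighted.mean is always available, so no extra package is needed
df.imputed <- kNN(df, numFun = weighted.mean, k = 5,
                  weightDist = TRUE, trace = TRUE, imp_var = TRUE)
# If you specifically want weightedMean, it comes from the laeken package
# (an assumption here), so running library(laeken) first, or writing
# numFun = laeken::weightedMean, should also make the
# "object 'weightedMean' not found" error go away.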

Grid tuning xgboost with missing data

It seems like the expected method of grid tuning the xgboost model is using the caret package, as clearly displayed here: https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees
However, I struggle to make sense of the case with missing data. When creating the model without caret, I set the missing argument to NA.
dtrain = xgb.DMatrix(data = data.matrix(train$data),label = train$label,missing = NA)
That allows me to create the model like so:
bst = xgboost(data = dtrain,depth = 4,eta =.3,nthread = 2,
nround = 43, print.every.n = 5,
objective = "binary:logistic",eval_metric = "auc",verbose = TRUE
)
This works very nicely; however, caret does not take this kind of object.
This is what I'm trying:
xgbtrain = train(x = train$data,y = as.factor(make.names(train$label)),
trControl = trControl, tuneGrid = my_grid,method = "xgbTree")
But for every iteration it is telling me this:
Error in xgb.DMatrix(as.matrix(x), label = y) : can not open file "NA"
That's the same error message I was getting before with plain xgboost when I didn't set missing to NA. The xgb.DMatrix is not a subsettable object that I could take the data from, and it is also not possible to convert it to a data frame. How do I get around this?
EDIT
Figured it out. In the end, it had nothing to do with missing data, but with having factors in the dataset. Instead of using xgboost's function to convert to a sparse matrix, I used regular model.matrix() and was able to successfully plug the new matrix into caret's train function.
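A minimal sketch of that approach, assuming a data frame train_df with factor predictors and a 0/1 outcome column label (these names are illustrative, not from the original post); trControl and my_grid are the objects already used above:
library(caret)
# model.matrix() expands factor columns into dummy variables and returns a
# plain numeric matrix, which caret's train() accepts directly
x_mat <- model.matrix(label ~ . - 1, data = train_df)
y_fac <- as.factor(make.names(train_df$label))
xgbtrain <- train(x = x_mat, y = y_fac,
                  trControl = trControl, tuneGrid = my_grid,
                  method = "xgbTree")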

Error message not understood

I am trying to compute a function in R, but I am getting a weird error message that gives me no hint about what might be wrong.
G2M1$data is just a matrix containing data.
library(klaR)
out <- NaiveBayes(x = G2M1$data, grouping = G2M1$labels, usekernel = TRUE, density(G2M1$data, bw = "nrd0", adjust = 1,kernel = "gaussian"))
error message:
Error in sum(prior) : invalid 'type' (list) of argument
I am not sure why, since I am not defining any prior.
The first step in asking a question on Stack Overflow is to create a reproducible example: a small example that users can run on their own computers to test, diagnose, and solve your issue. It not only helps others, it also lets you properly assess your problem and potentially find a solution while creating the example.
Example
G2M1 <- list(data=as.matrix(iris[-5]), labels=iris[[5]])
This is an example dataset with the same structure and names as in your question, built from the iris dataset.
Recreate error
Let's run your expression as is to see the error:
library(klaR)
out <- NaiveBayes(x = G2M1$data, grouping = G2M1$labels, usekernel = TRUE, density(G2M1$data, bw = "nrd0", adjust = 1,kernel = "gaussian"))
#Error in sum(prior) : invalid 'type' (list) of argument
Now we have reproduced the error with our example. Let's investigate why it's happening by looking at the density expression and saving it to a variable:
den <- density(G2M1$data, bw = "nrd0", adjust = 1,kernel = "gaussian")
class(den)
#[1] "density"
typeof(den)
#[1] "list"
It is a list. It has not only the densities but also other information, like the call used and the coordinates, that we do not need for the model. Where are the densities themselves? We look to the documentation:
y: the estimated density values.
Let's subset the variable with y to see the densities:
head(den$y)
#[1] 0.0003561307 0.0004076448 0.0004647614 0.0005300218 0.0006043244 0.0006864581
This is what the model is looking for. We substitute den$y into the model call:
out <- NaiveBayes(x = G2M1$data, grouping = G2M1$labels, usekernel = TRUE, den$y)
Success. In the future, remember to create an example for everyone, and use these basic troubleshooting techniques. Good luck!