Plotting using xyplot() - r

I am having trouble understanding the xyplot() function to create plots in R. Below, I have an example of R code that does create a nice plot
install.packages("mice")
library("mice")
data <- airquality[, c("Ozone", "Solar.R")]
# Applies regression imputation to Ozone w.r.t Solar.R
imp <- mice(data, method = "norm.predict", seed = 1,
m = 1, print = FALSE)
xyplot(imp, Ozone ~ Solar.R)
The above code creates this image as desired:
code1 output
But the code below does not create a nice plot and instead gives me the error message "Error in UseMethod("xyplot") :
no applicable method for 'xyplot' applied to an object of class "data.frame""
airquality2 <- tidyr::fill(airquality, Ozone)
xyplot(airquality2, Ozone ~ Day)
Why does this occur? I am confused as applying the typeof() function to both "imp" and "airquality2" return "list", so I believe I have the object types correct. Thank you!

Remember that in R, generic functions call a specific method depending on the "class" attribute of the object passed as the first argument. This is known as S3 dispatch. The "class" of an object is not the same thing as its storage mode or internal type, which is what typeof returns. The fact that typeof(imp) == typeof(airquality2) is therefore irrelevant.
xyplot is a generic function, borrowed from the lattice package. The lattice package itself only defines methods for the classes formula and ts. It has no method for data frames.
The reason why xyplot works with imp passed as the first argument is that imp is an object of class "mids", and the method xyplot.mids is defined as an (unexported) function in mice, so there is an available method for it.
The upshot is that, since a method is available for objects of class “formula”, you can easily plot airquality2 by passing the formula as the first argument:
xyplot(Ozone ~ Day, airquality2)
Or explicitly naming the data argument:
xyplot(data = airquality2, Ozone ~ Day)
Both of which result in:

Related

Writing a function to produce a Kaplan-Meier curve

I am trying to write a function that spits out a KM survival curve. I am going to use this in a ShineyApp which is why I want to write a function so I can easily pass in arguments from a dropdown menu (which will input as a string into the strata argument). Here is a simplified version of what I need:
survival_function <- function(data_x, strata_x="1"){
survFormula <- Surv(data_x$time, data_x$status)
my_survfit <- survfit(data=data_x, as.formula(paste("survFormula~", {{strata_x}})))
ggsurvplot(my_survfit, data = data_x, pval=T)
}
survival_function(inputdata, "strata_var")
I get an error:
Error in paste("survFormula1~", { : object 'strata_x' not found
I'm at a loss because
as.formula(paste("~", {{arg}}))
has worked in other functions I've written to produce plots using ggplot to easily change variables to facet by, but this doesn't even seem to recognize strata_x as an argument.
Your function needs a couple of tweaks to get it working with ggsurvplot. It would be best to create the Surv object as a new column in the data frame and use this column in your formula. You also need to make sure you have an actual symbolic formula as the $call$formula member of the survfit object, otherwise ggsurvplot will fail to work due to non-standard evaluation deep within its internals.
library(survival)
library(survminer)
survival_function <- function(data_x, strata_x) {
data_x$s <- Surv(data_x$time, data_x$status)
survFormula <- as.formula(paste("s ~", strata_x))
my_survfit <- survfit(survFormula, data = data_x)
my_survfit$call$formula <- survFormula
ggsurvplot(my_survfit, data = data_x)
}
We can test this on the included lung data set:
survival_function(lung, "sex")
Created on 2022-08-03 by the reprex package (v2.0.1)

No applicable method for 'mae' applied to an object of class "c('double', 'numeric') when mapping using map2_dbl?

I am trying to calculate the MAE for a model I created and I receiving the following error:
x no applicable method for 'mae' applied to an object of class "c('double', 'numeric')
My mapping looks as such using map2_dbl:
cv_eval_rf <- cv_model_rf %>%
mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))
I am confused because when I examine the class of validate_actual & validate_predicted I get the following:
My tibble looks as such:
I am attempting to create another column named validate_mae as you see above in my calculation. Quite simply all I want to do is calculate the MAE for each tibble and attach it to this object so I can evaluate the best performing training/validation data.
Many different packages implement a function called mae. You're using an mae function from the wrong package.
Your code should work if you use Metrics::mae.
cv_eval_rf <- cv_model_rf %>%
mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~Metrics::mae(actual = .x, predicted = .y)))
yardstick::mae should also work with slightly different syntax
cv_eval_rf <- cv_model_rf %>%
yarstick::mae(validate_actual, validate_predicted)

Can't give a subset when using randomForest inside a function

I'm wanting to create a function that uses within it the randomForest function from the randomForest package. This takes the "subset" argument, which is a vector of row numbers of the data frame to use for training. However, if I use this argument when calling the randomForest function in another defined function, I get the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
Here is a reproducible example, where we attempt to train a random forest to classify a response "type" either "A" or "B", based on three numerical predictors:
library(randomForest)
# define a random data frame to train with
test.data = data.frame(
type = rep(NA, times = 500),
x = runif(500),
y = runif(500),
z = runif(500)
)
train.data$type[runif(500) >= 0.5] = "A"
train.data$type[is.na(test.data$type)] = "B"
train.data$type = as.factor(test.data$type)
# define the training range
training.range = sample(500)[1:300]
# formula to use
tr_form = formula(type ~ x + y + z)
# Function that includes the randomForest function
train_rf = function(form, all_data, tr_subset) {
p = randomForest(
formula = form,
data = all_data,
subset = tr_subset,
na.action = na.omit
)
return(p)
}
# test the new defined function
test_tree = train_rf(form = tr_form, all_data = train.data, tr_subset = training.range)
Running this gives the error:
Error in eval(substitute(subset), data, env) :
object 'tr_subset' not found
If, however, subset = tr_subset is removed from the randomForest function, and tr_subset is removed from the train_rf function, this code runs fine, however the whole data set is used for training!
It should be noted that using the subset argument in randomForest when not defined in another function works completely fine, and is the intended method for the function, as described in the vignette linked above.
I know in the mean time I could just define another training set that has just the row numbers required, and train using all of that, but is there a reason why my original code doesn't work please?
Thanks.
EDIT: I conjecture that, as subset() is a base R function, R is getting confused and thinking you're wanting to use the base R function rather than defining an argument of the randomForest function. I'm not an expert, though, so I may be wrong.

S4 object creation in R

I am busy with comparing different machine learning techniques in R.
This is the case: I made several functions that, in an automated way
are able to create each a different prediction model (e.g: logistic regression, random forest, neural network, hybrid ensemble , etc.) , predictions, confusion matrices, several statistics (e.g AUC and Fscore) ,and different plots.
Now I would like to create a list of S4 (or S3?) objects in R, where each object contains the model, predictions, the plots, confusion matrix , auc and fscore.
The idea is that each function creates such object and then append it to the object list in the return statement.
How should I program such class? And how can I define that each model can be of some different type (I suppose that all models that I create are S3 objects, so how do can I define this in my S4 class?
The end result should be able to do something like this: modelList[i]#plot should for example summon the requested plot. and names(modelList[i]) should give the name of the used model (if this is not possible, modelList[i]#name will do). Also, it should be possible to select the best model out of the list, based on a parameter, such as AUC.
I am not experienced in creating such object, so this is the code / idea I have at the moment:
modelObject <- setClass(
# Set the name for the class
"modelObject",
# Define the slots
slots = c(
modelName = "character"
model = #should contain a glm, neural network, random forest , etc model
predictions = #should contain a matrix or dataframe of custid and prediction
rocCurve = #when summoned, the ROC curve should be plotted
plotX = #when summoned, plot X should be plotted
AUC = "numeric" #contains the value of the AUC
confusionMatrix = "matrix" #prints the confusion matrix in the console
statX = "numeric"#contains statistic X about the confusion matrix e.g. Fscore
),
# Set the default values for the slots. (optional)
prototype=list(
# I guess i can assign NULL to each variable of the S4 object
),
# Make a function that can test to see if the data is consistent.
# This is not called if you have an initialize function defined!
validity=function(object)
{
#not really an idea how to handle this
}
return(TRUE)
}
)
Use setOldClass() to promote each S3 class to it's S4 equivalent
setOldClass("lm")
setOldClass(c("glm", "lm"))
setOldClass(c("nnet.formula", "nnet"))
setOldClass("xx")
Use setClassUnion() to insert a common base class in the hierarchy
setClassUnion("lmORnnetORxx", c("lm", "nnet", "xx"))
.ModelObject <- setClass("ModelObject", slots=c(model="lmORnnetORxx"))
setMethod("show", "ModelObject", function(object) {
cat("model class: ", class(object#model), "\n")
})
In action:
> library(nnet)
> x <- y <- 1:10
> .ModelObject(model=lm(x~y))
model class: lm
> .ModelObject(model=glm(x~y))
model class: glm lm
> .ModelObject(model=nnet(x~y, size=10, trace=FALSE))
model class: nnet.formula nnet
I think that you would also like to implement a Models object that contains a list where all elements are ModelObject; the constraint would be imposed by a validity method (see ?setValidity).
What I would do, is for each slot you want in your modelObject class, determine the range of expected values. For example, your model slot has to support all the possible classes of objects that can be returned by model training functions (e.g. lm(), glm(), nnet(), etc.). In the example case, you see the following objects returned:
```
x <- y <- 1:10
class(lm(x~y))
class(glm(x~y))
class(nnet(x~y, size=10))
```
Since there is no common class among the objects returned, it might make more sense to use an S3, which has less rigorous syntax and would allow you to assign various classes of output to the same field name. Your question is actually quite tough to answer, given that there are so many different approaches to take with R's myriad OO systems.

Extracting predictions from a GAM model with splines and lagged predictors

I have some data and am trying to teach myself about utilize lagged predictors within regression models. I'm currently trying to generate predictions from a generalized additive model that uses splines to smooth the data and contains lags.
Let's say I have the following data and have split the data into training and test samples.
head(mtcars)
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
Great, let's train the gam model on the training set.
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(lag(disp, 1), bs="cr"), data=mtcars[Train,])
summary(f_gam)
When I go to predict on the holdout sample, I get an error message.
f_gam.pred <- predict(f_gam, mtcars[-Train,]); f_gam.pred
Error in ExtractData(object, data, NULL) :
'names' attribute [1] must be the same length as the vector [0]
Calls: predict ... predict.gam -> PredictMat -> Predict.matrix3 -> ExtractData
Can anyone help diagnose the issue and help with a solution. I get that lag(__,1) leaves a data point as NA and that is likely the reason for the lengths being different. However, I don't have a solution to the problem.
I'm going to assume you're using gam() from the mgcv library. It appears that gam() doesn't like functions that are not defined in "base" in the s() terms. You can get around this by adding a column which include the transformed variable and then modeling using that variable. For example
tmtcars <- transform(mtcars, ldisp=lag(disp,1))
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(ldisp, bs="cr"), data= tmtcars[Train,])
summary(f_gam)
predict(f_gam, tmtcars[-Train,])
works without error.
The problem appears to be coming from the mgcv:::get.var function. It tires to decode the terms with something like
eval(parse(text = txt), data, enclos = NULL)
and because they explicitly set the enclosure to NULL, variable and function names outside of base cannot be resolved. So because mean() is in the base package, this works
eval(parse(text="mean(x)"), data.frame(x=1:4), enclos=NULL)
# [1] 2.5
but because var() is defined in stats, this does not
eval(parse(text="var(x)"), data.frame(x=1:4), enclos=NULL)
# Error in eval(expr, envir, enclos) : could not find function "var"
and lag(), like var() is defined in the stats package.

Resources