I've got the following code:
breast.svr=ksvm(Diagnosis~.,data=breast.train,kernel="rbfdot",C=4)
pred.svr=predict(breast.svr,newdata=breast.test)
tabel <- table(breast.test[,1],pred.svr)/nrow(breast.test)
tabel[1,2] + tabel[2,1]
The result is:
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 4
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.149121426557224
Number of Support Vectors : 99
Objective Function Value : -143.4679
Training error : 0.028947
I know that I can extract a lot of information from this model on the following manner:
coef(breast.svr)
But I don't know what to do with it. How can I interpret it? How can I turn it into an explicit model like f(x) = ...? More specifically, how can I tell which predictor variables are important?
A kernel SVM is by its very nature not very interpretable. Each kernel combines many predictor variables, so it is hard to say which individual predictor is important. If you care about interpretability, consider linear regression or other interpretable models instead.
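That said, if you refit with a linear kernel, the model does reduce to an explicit f(x) = w·x + b, and the weight vector gives a rough importance ranking. A sketch, assuming the breast.train data from the question (accessor names are from kernlab; treat this as an illustration, not a definitive recipe):

```r
library(kernlab)

# Refit with a linear kernel so the decision function is f(x) = <w, x> - b
fit <- ksvm(Diagnosis ~ ., data = breast.train, kernel = "vanilladot", C = 4)

# Recover the primal weight vector from the support vectors:
# each support vector's row is scaled by its coefficient (alpha_i * y_i)
w <- colSums(coef(fit)[[1]] * xmatrix(fit)[[1]])

# Predictors with large |w| drive the decision boundary most
sort(abs(w), decreasing = TRUE)
```

With an RBF kernel no such weight vector exists, which is exactly why the fitted model above resists a simple f(x) = ... reading.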
Related
My question is closely related to this previous one: Simulation-based hypothesis testing on spatial point pattern hyperframes using "envelope" function in spatstat
I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I study its envelope to compare it to my observations?
I fitted my model as such:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp objects and are independent realizations of the same experiment.
However, the envelope function does not accept inputs of type mppm:
> envelope(model, Kcross.inhom, nsim=10)
Error in UseMethod("envelope") :
no applicable method for 'envelope' applied to an object of class "c('mppm', 'list')"
The answer provided to the previously mentioned question indicates how to plot global envelopes for each pattern and how to use the product rule for multiple testing. However, my fitted model implies that my 3 ppp objects are statistically equivalent and are independent realizations of the same experiment (i.e. no different covariates between them). I would thus like to obtain one single plot comparing my fitted model to my 3 datasets. The following code:
gamma= 1 - 0.95^(1/3)
nsims=round(1/gamma-1)
sims <- simulate(model, nsim=2*nsims)
SIMS <- list()
for(i in 1:nrow(sims)) SIMS[[i]] <- as.solist(sims[i,,drop=TRUE])
Hplus <- cbind(data, hyperframe(Sims=SIMS))
EE1 <- with(Hplus, envelope(Points, Kcross.inhom, nsim=nsims, simulate=Sims))
pool(EE1[1],EE1[2],EE1[3])
leads to the following error:
Error in pool.envelope(`1` = list(r = c(0, 0.78125, 1.5625, 2.34375, 3.125, :
Arguments 2 and 3 do not belong to the class “envelope”
Wrong type of subset index. Use
pool(EE1[[1]], EE1[[2]], EE1[[3]])
or just
pool(EE1)
These would have given an error message saying that the envelope commands should have been called with savefuns=TRUE. So you just need to change that step as well.
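Putting those corrections together, the last two steps of the code in the question would become something like this (a sketch reusing Hplus, Sims and nsims from above):

```r
# savefuns = TRUE keeps the simulated summary functions,
# which pool() needs in order to combine the envelopes
EE1 <- with(Hplus,
            envelope(Points, Kcross.inhom, nsim = nsims,
                     simulate = Sims, savefuns = TRUE))

# Pool the three envelopes into a single one and plot it
E <- pool(EE1)   # equivalently: pool(EE1[[1]], EE1[[2]], EE1[[3]])
plot(E)
```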
However, statistically this procedure makes little sense. You have already fitted a model, which allows for rigorous statistical inference using anova.mppm and other tools. Instead of this, you are generating simulated data from the fitted model and performing a Monte Carlo test, with all the fraught issues of multiple testing and low power. There are additional problems with this approach - for example, even if the model is "the same" for each row of the hyperframe, the patterns are not statistically equivalent unless the windows of the point patterns are identical, and so on.
I am having trouble getting the Brier Score for my Machine Learning Predictive models. The outcome "y" was categorical (1 or 0). Predictors are a mix of continuous and categorical variables.
I have created four models with different predictors, I will call them "model_1"-"model_4" here (except predictors, other parameters are the same). Example code of my model is:
Model_1=rfsrc(y~ ., data=TrainTest, ntree=1000,
mtry=30, nodesize=1, nsplit=1,
na.action="na.impute", nimpute=3,seed=10,
importance=T)
When I run Model_1 in R, I get the following results:
My question is: how can I get the predicted probability for those 412 people? And how do I find the observed probability for each person? Do I need to calculate it by hand? I found the function BrierScore() in the DescTools package, but when I tried BrierScore(Model_1), it gave me no result.
Code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
I tried several variants of the code, but all of them give error messages.
One assumption I am making is that you want to compute the Brier score on the data you trained your model on, which is usually not the correct approach (google "train-test split" if you need more background). You should therefore first reflect on whether your approach is correct there.
The BrierScore function in DescTools only has a dedicated method for glm models; otherwise it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you need to do is predict on your data using:
prediction = predict(model_1, TrainTest, na.action="na.impute")
and then compute the brier score using
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note that we transform TrainTest$y into a numeric vector of 0s and 1s in order to compute the Brier score.)
Note: The randomForestSRC package also prints a normalized brier score when you call print(prediction).
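For reference, the Brier score itself is just the mean squared difference between the predicted probability of the event and the observed 0/1 outcome, so it is easy to compute by hand (toy numbers, for illustration only):

```r
# Observed 0/1 outcomes and predicted probabilities P(y = 1)
y    <- c(1, 0, 1, 1, 0)
phat <- c(0.9, 0.2, 0.6, 0.8, 0.3)

# Brier score: mean squared difference between prediction and outcome
mean((phat - y)^2)  # 0.068
```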
In general, using one of the available machine-learning frameworks in R (mlr3, tidymodels, caret) might simplify this for you and prevent a lot of errors of this kind; this is good practice, especially if you are less experienced in ML. See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))
I am developing a prediction model in R. It uses the restricted cubic spline of an important continuous predictor that is a priori likely to have a nonlinear relationship to the outcome. To do this I used rms::rcs() and specified the number of knots, but allowed rcs() to 'decide' the location.
I want to extract the coefficients for all the predictors to use in an external application, the purpose of which is to predict Y given new input data. However, to do this I need to be able to find the location of the knots that were used by rcs().
The relevant code within rcs() is
if (!length(knots)) {
xd <- rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc,
fractied = fractied)
knots <- attr(xd, "knots")
}
In my case, pc == 0 and fractied == 0.05
How can I find the location of the knots?
rms::specs(model) will provide the knot locations for the splines.
A different and better approach is the built-in Function() function, which provides the equation for the fitted model along with all the knot locations.
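As a sketch of both approaches (fit is a hypothetical rms model; specs() and Function() are from the rms package):

```r
library(rms)

# Hypothetical fit with a 4-knot restricted cubic spline
# fit <- ols(y ~ rcs(x, 4) + z, data = d)

specs(fit, long = TRUE)  # lists the knot locations for each rcs() term
Function(fit)            # returns an R function for the fitted equation,
                         # with the knot locations embedded as constants
```

Printing the result of Function(fit) gives an expression you can port to an external application directly.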
As hinted in the question, one possible way to accomplish this is to run rcspline.eval() with the predictor and the relevant parameters set to those that were used in the model-fitting procedure. An example would be
rcspline.eval(
data.frame$predictor,
nk = number_of_knots_used,
inclx = TRUE,
pc = 0, # whatever was used
fractied = 0.05 # whatever was used
)
The output of this contains attr(, "knots") which gives the location of the knots.
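To make the last step concrete, the attribute can be read off like this (a sketch; the predictor and settings are placeholders, as in the answer above):

```r
library(Hmisc)  # rcspline.eval() lives in Hmisc

xd <- rcspline.eval(d$predictor,  # same predictor as in the model fit
                    nk = 4,       # number of knots that were used
                    inclx = TRUE,
                    pc = 0, fractied = 0.05)  # same settings as the fit
attr(xd, "knots")  # numeric vector of knot locations
```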
I have a GAM with many covariates and I would like to simplify it (find the minimal model).
I used a dsm function to model the density of a species across line transect segments as a function of covariates. And it works fine!
But it was the maximal model with too many covariates, and I would like to reduce their number automatically. So I tried using the gam::step.Gam function. (I also used the gam.scope function to make sure I do everything correctly.)
DSM code:
GamModel = dsm(
ddf.obj=PreparedDdf,
formula = D ~ x + y + Cov1 + Cov2 + ... + Covn + factor1 + factor2 + ... + factorn,
family=gaussian(link='identity'),
group=FALSE,
engine='gam',
convert.units=1,
segment.data=segment.df,
observation.data=observation.df
)
step.Gam code:
GamScope=gam.scope(segment.df[,c(5:6,11:16)], response=1, smoother="s", arg=NULL, form=TRUE)
MinModel = step.Gam(GamModel, GamScope, trace=TRUE, direction="backward")
I hoped to get the minimal model, instead it gives me the following error:
Error in gam(formula = D ~ x + Cov1 + Cov2 + Cov3, : invalid `method': REML
And I don't understand why this happens! I tried different methods (GACV.Cp, ML) and I get the same kind of error (invalid method: GACV.Cp, etc.)
Why is this happening? Is it because it's gam model produced by the dsm function?
And more importantly, how can I minimize the model automatically??
(When I use engine='glm' instead of 'gam' in the dsm function and try the stats::step function to find the minimal model, it works, but the results seem a bit sketchy... so I want to use the gam engine.)
The gam package doesn't fit models using REML or the other options you state. Those are options to the gam() function in the mgcv package.
The only allowed options for the method argument in gam::gam() are:
"glm.fit", which is the default, and
"model.frame", which doesn't really do anything as it instructs the function to just spit out the model frame resulting from the formula.
It is quite important to differentiate between these two packages that both provide a gam() function. They are very different approaches to the estimation of GAMs.
As you are using dsm(), you'll be fitting with mgcv::gam(), not gam::gam(), and in that case you cannot apply the gam::step.Gam() function to the model.
I believe that the authors of dsm() recommend that you use the select = TRUE argument to mgcv::gam(), which you can supply when calling dsm() and which will get passed on to gam(). This will add extra penalties to the smooth terms in the model such that they can be shrunk out of the model.
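A sketch of what that could look like, reusing the placeholder objects from the question; select = TRUE is simply added to the dsm() call and passed through to mgcv::gam():

```r
GamModel <- dsm(
  ddf.obj = PreparedDdf,
  formula = D ~ s(x) + s(y) + s(Cov1) + s(Cov2),
  family = gaussian(link = "identity"),
  engine = "gam",
  segment.data = segment.df,
  observation.data = observation.df,
  select = TRUE  # extra shrinkage penalties on each smooth term
)
summary(GamModel)  # smooths shrunk to edf near 0 have effectively dropped out
```

This replaces explicit stepwise selection: instead of removing terms one by one, the penalties shrink unneeded smooths toward zero during fitting.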
From other posts on this platform, I found that the Li-Mak test on the standardised residuals is more appropriate for testing a fitted GARCH model than the Ljung-Box test. The Weighted.LM.test() from the WeightedPortTest package in R is used for it.
I'm trying this code but I'm getting an error. Since it is a univariate test, I have extracted the standardised residuals and cvar from the mfit slot:
std.resid1 <- dccfit@mfit$stdresid[,1]
cvar1 <- dccfit@mfit$cvar[,1]
Weighted.LM.test(std.resid1, cvar1, lag=10)
Error in Weighted.LM.test(std.resid1, cvar1, lag = 10) :
Length of x and h.t must match
How do I get this to work? Any help is very much appreciated.
Firstly, you should not use the standardized residuals; instead of dccfit@mfit$stdresid[,1], take dccfit@model$residuals[,1].
Then, the documentation of Weighted.LM.test says that h.t should be a numeric vector of the conditional variances, so take instead:
dccfit@model$sigma[,1]^2
run the test:
Weighted.LM.test(dccfit@model$residuals[,1], dccfit@model$sigma[,1]^2, lag = 2, type = c("correlation", "partial"), fitdf = 1, weighted = FALSE)
Please correct me if I am wrong.