Find AUC with tree package - binary response - r

I am attempting to get an ROC curve and AUC for a CART decision tree built with the "tree" package.
Here is the structure of my tree:
> str(pruned.tree7)
'data.frame': 13 obs. of 6 variables:
$ var : Factor w/ 15 levels "","Age",..: 15 10 1 11 11 5 1 1 15 1 ...
$ n : num 383 158 29 129 110 38 20 18 72 7 ...
$ dev : num 461.1 218.6 29.6 174 141.8 ...
$ yval : Factor w/ 2 levels "Negative","Positive": 2 2 1 2 2 1 2 1 2 1 ...
$ splits: chr [1:13, 1:2] "<19.5" "<81.5" "" "<65" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "cutleft" "cutright"
$ yprob : num [1:13, 1:2] 0.29 0.475 0.793 0.403 0.345 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Negative" "Positive"
Referencing the above structure, I have written (many variations of) the following code:
> preds <- prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob))
Error in prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob)) :
Number of predictions in each run must be equal to the number of labels for each run.
> preds <- prediction(pruned.tree7$frame$yprob, dimnames)
Error in prediction(pruned.tree7$frame$yprob, dimnames) :
Format of labels is invalid.
> preds <- prediction(pruned.tree7$frame$yprob, "dimnames")
Error in prediction(pruned.tree7$frame$yprob, "dimnames") :
Number of cross-validation runs must be equal for predictions and labels.
> preds <- prediction(pruned.tree7$frame$yprob, names(yprob))
Error in is.data.frame(labels) : object 'yprob' not found
> preds <- prediction(pruned.tree7$frame$yprob, names(pruned.tree7$frame$yprob))
Error in prediction(pruned.tree7$frame$yprob, names(pruned.tree7$frame$yprob)) :
Format of labels is invalid.
> preds <- prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob))
Error in prediction(pruned.tree7$frame$yprob, dimnames(pruned.tree7$frame$yprob)) :
Number of predictions in each run must be equal to the number of labels for each run.
I have searched and found this link: ROCR Package Documentation. It mentions the topic of cross-validation, but that does not make sense to me in this context.
Thank you in advance!!
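For what it's worth, ROCR's prediction() expects one predicted score per observation together with the matching vector of true labels, not the node-level yprob matrix stored in the tree's frame. A minimal sketch, assuming a held-out data frame test.data whose true response column is test.data$Outcome (both names are hypothetical placeholders for the questioner's own data):

```r
library(tree)
library(ROCR)

# predict() on a classification tree returns a matrix of class probabilities,
# one row per observation; keep the "Positive" column to get one score each.
probs <- predict(pruned.tree7, newdata = test.data)[, "Positive"]

pred <- prediction(probs, test.data$Outcome)
perf <- performance(pred, "tpr", "fpr")
plot(perf)  # ROC curve

auc <- performance(pred, "auc")@y.values[[1]]
```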

MLR: Extracting the names of the covariates with non-zero coefficients in CoxBoost

I am using mlr to create a survival model with the learner cv.CoxBoost and 5-fold cross-validation. (Yes, I realise cv.CoxBoost has CV built in, but I am adding another level to be consistent with the other learners I am comparing it against.) I need to extract the names of the covariates with non-zero coefficients from the final model, as you would if you were using the lasso. However, I only seem to be able to get the output from the individual runs of CoxBoost, not from cv.CoxBoost.
This is what I have tried:
library(survival)
library(mlr)
set.seed(24601)
data(veteran)
task_id = "MAS_MEDEXAM"
surv.task <- makeSurvTask(id = task_id, data = veteran, target = c("time", "status"))
cindex.sd = setAggregation(cindex, test.sd)
surv.measures = list(cindex, cindex.sd)
cboostcv.lrn <- makeLearner(cl="surv.cv.CoxBoost", id = "CoxBoostCV", predict.type="response")
outer = makeResampleDesc("CV", iters=5, stratify=TRUE)
learners = list(cboostcv.lrn)
bmr = benchmark(learners, surv.task, outer, surv.measures, show.info = TRUE)
mods = getBMRModels(bmr, learner.ids = c('CoxBoostCV'))
mod = mods$MAS$CoxBoostCV[[1]]$learner.model
str(mod, max.level=1)
Which produced the following output:
List of 16
$ time : num [1:109] 87 123 182 97 83 100 103 164 30 10 ...
$ status : num [1:109] 0 0 0 0 0 0 0 1 1 1 ...
$ stepno : num 43
$ penalty : num [1:9] 918 918 918 918 918 918 918 918 918
$ xnames : chr [1:9] "trt" "karno" "diagtime" "age" ...
$ n : int 109
$ p : int 9
$ event.times : num [1:81] 1 2 3 4 7 8 10 11 12 13 ...
$ coefficients :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
$ linear.predictor: num [1:44, 1:109] 0 -0.0607 -0.1171 -0.1695 -0.218 ...
$ meanx : num [1:9] 1.49 57.24 8.95 58.41 3.03 ...
$ sdx : num [1:9] 0.502 20.489 11.31 10.768 4.616 ...
$ standardize : logi TRUE
$ Lambda : num [1:44, 1:81] 0.0184 0.0184 0.0183 0.0182 0.0181 ...
$ scoremat : num [1:43, 1:9] 0.0404 0.0503 0.0604 0.0704 0.0802 ...
$ logplik : num -357
- attr(*, "class")= chr "CoxBoost"
- attr(*, "mlr.train.info")=List of 5
..- attr(*, "class")= chr "FixDataInfo"
This is consistent with output from CoxBoost, but cv.CoxBoost is supposed to return the following:
mean.logplik
se.logplik
optimal.step
folds
How do I extract this information?
EDIT:
After contacting Prof Harald Binder and examining the mlr code for the cv.CoxBoost learner I realised that I had misunderstood its operation. Prof Binder's response was
cv.CoxBoost only determines the number of boosting steps to be performed. You have to fit a model (using a CoxBoost call) afterwards, using that number of steps.
The mlr learner cv.CoxBoost does exactly that - it first calls cv.CoxBoost to find the optimal number of steps, then calls CoxBoost using that number of steps.
So my question now is, will the following code give me the names of the covariates with non-zero coefficients in the final model?
mods = getBMRModels(bmr, learner.ids = c('CoxBoostCV'))
for (i in 1:5) {
  mod = mods[[task_id]]$CoxBoostCV[[i]]$learner.model
  print(mod$xnames[mod$coefficients[mod$stepno + 1, ] != 0])
}
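Judging from the str() output above, that should work: mod$coefficients holds the full coefficient path as a (stepno + 1) x p matrix, with row 1 corresponding to step 0, so the last row is the final model. A sketch for a single fitted model (using the mod object extracted above):

```r
# mod$coefficients: one row per boosting step (row 1 = step 0, so
# stepno + 1 rows in total), one column per covariate; the last row
# therefore holds the coefficients of the final model.
final.coefs <- mod$coefficients[mod$stepno + 1, ]

# names of the covariates that survived with non-zero coefficients
mod$xnames[final.coefs != 0]
```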

How to get the result of a package function into a data frame in R

I am at the learning stage of R.
I am using library(usdm) in R, where I call vifcor(vardata, th=0.4, maxobservations=50000) to find the variables that are not multicollinear. I need to get the result of vifcor(vardata, th=0.4, maxobservations=50000) into a structured data frame for further analysis.
Data reading process I am using:
performdata <- read.csv('F:/DGDNDRV_FINAL/OutputTextFiles/data_blk.csv')
vardata <- performdata[, names(performdata)[5:length(names(performdata))-2]]
Content of the csv file:
pointid grid_code Blocks_line_dst_CHT GrowthCenter_dst_CHT Roads_nationa_dst_CHT Roads_regiona_dst_CHT Settlements_CHT_line_dst_CHT Small_Hat_Bazar_dst_CHT Upazilla_lin_dst_CHT resp
1 6 150 4549.428711 15361.31836 3521.391846 318.9043884 3927.594727 480 1
2 6 127.2792206 4519.557617 15388.68457 3500.24292 342.0526123 3902.883545 480 1
3 2 161.5549469 4484.473145 15391.6377 3436.539063 335.4101868 3844.216553 540 1
My tries:
r<-vifcor(vardata,th=0.2,maxobservations =50000) returns
2 variables from the 6 input variables have collinearity problem:
Roads_regiona_dst_CHT GrowthCenter_dst_CHT
After excluding the collinear variables, the linear correlation coefficients ranges between:
min correlation ( Small_Hat_Bazar_dst_CHT ~ Roads_nationa_dst_CHT ): -0.04119076963
max correlation ( Small_Hat_Bazar_dst_CHT ~ Settlements_CHT_line_dst_CHT ): 0.1384278434
---------- VIFs of the remained variables --------
Variables VIF
1 Blocks_line_dst_CHT 1.026743892
2 Roads_nationa_dst_CHT 1.010556752
3 Settlements_CHT_line_dst_CHT 1.038307666
4 Small_Hat_Bazar_dst_CHT 1.026943711
class(r) returns
[1] "VIF"
attr(,"package")
[1] "usdm"
mode(r) returns "S4"
I need Roads_regiona_dst_CHT and GrowthCenter_dst_CHT in one data frame, and the VIFs of the remaining variables in another data frame!
But nothing has worked!
Basically, the returned result is an S4 object, and you can extract its slots via the @ operator:
library(usdm)
example(vifcor) # creates 'v2'
str(v2)
# Formal class 'VIF' [package "usdm"] with 4 slots
# ..@ variables: chr [1:10] "Bio1" "Bio2" "Bio3" "Bio4" ...
# ..@ excluded : chr [1:5] "Bio5" "Bio10" "Bio7" "Bio6" ...
# ..@ corMatrix: num [1:5, 1:5] 1 0.0384 -0.3011 0.0746 0.7102 ...
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:5] "Bio1" "Bio2" "Bio3" "Bio8" ...
# .. .. ..$ : chr [1:5] "Bio1" "Bio2" "Bio3" "Bio8" ...
# ..@ results :'data.frame': 5 obs. of 2 variables:
# .. ..$ Variables: Factor w/ 5 levels "Bio1","Bio2",..: 1 2 3 4 5
# .. ..$ VIF : num [1:5] 2.09 1.37 1.25 1.27 2.31
So you can extract the results and the excluded slot now via:
v2@excluded
# [1] "Bio5" "Bio10" "Bio7" "Bio6" "Bio4"
v2@results
# variables VIF
# 1 Bio1 2.086186
# 2 Bio2 1.370264
# 3 Bio3 1.253408
# 4 Bio8 1.267217
# 5 Bio9 2.309479
You should be able to use the command below to get the information in the 'results' slot into a data frame. You can then split the information out into separate data frames using the usual methods:
df <- r@results
Note that r@results[1:2, 2] would give you the VIFs for the first two rows.
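Putting it together with the same example object, a short sketch that builds the two requested data frames (substitute the questioner's r for v2):

```r
library(usdm)
example(vifcor)  # creates 'v2', as above

# excluded (collinear) variables into one data frame ...
excluded.df <- data.frame(Variables = v2@excluded)

# ... and the VIFs of the remaining variables into another
vif.df <- v2@results
```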

Trouble converting list into factor in R

I am having problems creating a boxplot of my data, because one of my variables is in the form of a list.
I am trying to create a boxplot:
boxplot(dist~species, data=out)
and received the following error:
Error in model.frame.default(formula = dist ~ species, data = out) :
invalid type (list) for variable 'species'
I have been unsuccessful in forcing 'species' into the form of a factor:
out[species]<- as.factor(out[[out$species]])
and receive the following error:
Error in .subset2(x, i, exact = exact) : invalid subscript type 'list'
How can I convert my 'species' column into a factor which I can then use to create a boxplot? Thanks.
EDIT:
str(out)
'data.frame': 4570 obs. of 6 variables:
$ GridRef : chr "NT73" "NT80" "NT85" "NT86" ...
$ pred : num 154 71 81 85 73 99 113 157 92 85 ...
$ pred_bin : int 0 0 0 0 0 0 0 0 0 0 ...
$ dist : num 20000 10000 9842 14144 22361 ...
$ years_since_1990: chr "21" "16" "21" "20" ...
$ species :List of 4570
..$ : chr "C.splendens"
..$ : chr "C.splendens"
..$ : chr "C.splendens"
.. [list output truncated]
It's hard to imagine how you got the data into this form in the first place, but it looks like
out <- transform(out,species=unlist(species))
should solve your problem.
set.seed(101)
f <- as.list(sample(letters[1:5],replace=TRUE,size=100))
## need I() to make a wonky data frame ...
d <- data.frame(y=runif(100),f=I(f))
## 'data.frame': 100 obs. of 2 variables:
## $ y: num 0.125 0.0233 0.3919 0.8596 0.7183 ...
## $ f:List of 100
## ..$ : chr "b"
## ..$ : chr "a"
boxplot(y~f,data=d) ## invalid type (list) ...
d2 <- transform(d,f=unlist(f))
boxplot(y~f,data=d2)
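Equivalently, the list column can be flattened in place without transform(), which for the original question would be out$species <- factor(unlist(out$species)). A sketch using the same toy data, assuming every list element has length 1 (as the str() output shows):

```r
set.seed(101)
f <- as.list(sample(letters[1:5], replace = TRUE, size = 100))
d <- data.frame(y = runif(100), f = I(f))

# flatten the list column to an atomic vector, then make it a factor
d$f <- factor(unlist(d$f))
boxplot(y ~ f, data = d)
```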

How to get fitted values from ar() method model in R

I want to retrieve the fitted values from an ar() function output model in R. When using Arima() method, I get them using fitted(model.object) function, but I cannot find its equivalent for ar().
It does not store a fitted vector but does have the residuals. An example of using the residuals from the ar-object to reconstruct the predictions from the original data:
data(WWWusage)
arf <- ar(WWWusage)
str(arf)
#====================
List of 14
$ order : int 3
$ ar : num [1:3] 1.175 -0.0788 -0.1544
$ var.pred : num 117
$ x.mean : num 137
$ aic : Named num [1:21] 258.822 5.787 0.413 0 0.545 ...
..- attr(*, "names")= chr [1:21] "0" "1" "2" "3" ...
$ n.used : int 100
$ order.max : num 20
$ partialacf : num [1:20, 1, 1] 0.9602 -0.2666 -0.1544 -0.1202 -0.0715 ...
$ resid : Time-Series [1:100] from 1 to 100: NA NA NA -2.65 -4.19 ...
$ method : chr "Yule-Walker"
$ series : chr "WWWusage"
$ frequency : num 1
$ call : language ar(x = WWWusage)
$ asy.var.coef: num [1:3, 1:3] 0.01017 -0.01237 0.00271 -0.01237 0.02449 ...
- attr(*, "class")= chr "ar"
#===================
str(WWWusage)
# Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...
png(); plot(WWWusage)
lines(seq(WWWusage),WWWusage - arf$resid, col="red"); dev.off()
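The same reconstruction can be wrapped in a small helper, relying on the identity used above (fitted = observed - residual):

```r
data(WWWusage)
arf <- ar(WWWusage)

# fitted values = observed series minus one-step residuals; the first
# 'order' values are NA because ar() cannot predict them
fitted_ar <- function(object, x) x - object$resid
head(fitted_ar(arf, WWWusage))
```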
The simplest way to get the fits from an AR(p) model would be to use auto.arima() from the forecast package, which does have a fitted() method. If you really want a pure AR model, you can constrain the differencing via the d parameter and the MA order via the max.q parameter.
> library(forecast)
> fitted(auto.arima(WWWusage,d=0,max.q=0))
Time Series:
Start = 1
End = 100
Frequency = 1
[1] 91.68778 86.20842 82.13922 87.60576 ...

Feature selection using the penalizedLDA package

I am trying to use the penalizedLDA package to run a penalized linear discriminant analysis in order to select the "most meaningful" variables. I have searched here and on other sites for help in accessing the output from the penalized model, to no avail.
My data comprise 400 variables and 44 groups. The code I used and the results I got thus far:
yy.m<-as.matrix(yy) #Factors/groups
xx.m<-as.matrix(xx) #Variables
cv.out<-PenalizedLDA.cv(xx.m,yy.m,type="standard")
## apply the penalty
out <- PenalizedLDA(xx.m,yy.m,lambda=cv.out$bestlambda,K=cv.out$bestK)
To get the structure of the output from the analysis:
> str(out)
List of 10
$ discrim: num [1:401, 1:4] -0.0234 -0.0219 -0.0189 -0.0143 -0.0102 ...
$ xproj : num [1:100, 1:4] -8.31 -14.68 -11.07 -13.46 -26.2 ...
$ K : int 4
$ crits :List of 4
..$ : num [1:4] 2827 2827 2827 2827
..$ : num [1:4] 914 914 914 914
..$ : num [1:4] 162 162 162 162
..$ : num [1:4] 48.6 48.6 48.6 48.6
$ type : chr "standard"
$ lambda : num 0
$ lambda2: NULL
$ wcsd.x : Named num [1:401] 0.0379 0.0335 0.0292 0.0261 0.0217 ...
..- attr(*, "names")= chr [1:401] "R400" "R405" "R410" "R415" ...
$ x : num [1:100, 1:401] 0.147 0.144 0.145 0.141 0.129 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:401] "R400" "R405" "R410" "R415" ...
$ y : num [1:100, 1] 2 2 2 2 2 1 1 1 1 1 ...
- attr(*, "class")= chr "penlda"
I am interested in obtaining a list or matrix of the top 20 variables for feature selection, most likely based on the coefficients of the linear discriminants.
I realized I would have to sort the coefficients in descending order and match the variable names to them. So the output I would expect is something like this imaginary example:
V1 V2
R400 0.34
R1535 0.22...
Can anyone provide any pointers (not necessarily the R code). Thanks in advance.
Your out$K is 4, which means you have 4 discriminant vectors. If you want the top 20 variables according to, say, the 2nd vector, try this:
# get the data frame of variable names and coefficients
var.coef = data.frame(colnames(xx.m), out$discrim[, 2])
# sort by the absolute value of the coefficients (a large negative loading
# matters as much as a large positive one) and keep only the top 20
var.coef.top = var.coef[order(abs(var.coef[, 2]), decreasing = TRUE)[1:20], ]
var.coef.top is what you want.
