How to plot MASS::qda scores in R

Following on from this question, I was wondering whether it's possible to extract the quadratic discriminant analysis (QDA) scores and reuse them afterwards, the way one can with PCA scores.
library(MASS) ## for qda()
## follow example from ?lda
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
set.seed(1) ## for reproducibility; remove this line for a different random subsample
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your counts will differ if you removed the set.seed() line
## c s v
## 22 23 30
Now fit the QDA:
z <- qda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
## get the whole prediction object
pred <- predict(z)
## show first few sample scores on LDs
Here you can see that this doesn't work: the prediction object carries no scores.
head(pred$x)
# NULL
plot(LD2 ~ LD1, data = pred$x)
# Error in eval(expr, envir, enclos) : object 'LD2' not found
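This is because predict() for MASS's qda returns only the predicted classes and the posterior probabilities; unlike lda, qda does not construct discriminant axes to project onto:
names(pred)
## [1] "class"     "posterior"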

You may want to try the rrcov package:
library(rrcov)
z <- QdaCov(Sp ~ ., Iris[train,], prior = c(1,1,1)/3)
pred <- predict(z)
str(pred)
## Formal class 'PredictQda' [package "rrcov"] with 4 slots
## ..@ classification: Factor w/ 3 levels "c","s","v": 2 2 2 1 3 2 2 1 3 2 ...
## ..@ posterior : num [1:41, 1:3] 5.84e-45 5.28e-50 1.16e-25 1.00 1.48e-03 ...
## ..@ x : num [1:41, 1:3] -97.15 -109.44 -54.03 2.9 -3.37 ...
## ..@ ct : 'table' int [1:3, 1:3] 13 0 1 0 16 0 0 0 11
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ Actual : chr [1:3] "c" "s" "v"
## .. .. ..$ Predicted: chr [1:3] "c" "s" "v"
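Given that str() output, the sample scores live in the x slot, so you can plot them much as you would PCA scores. A minimal sketch (my addition; it assumes the columns of the @x slot can be treated as per-class scores, and colours points by predicted class):
## scatter the first two score columns from the S4 slot
plot(pred@x[, 1], pred@x[, 2],
     col = as.integer(pred@classification),
     xlab = "Score 1", ylab = "Score 2")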
It also has robust PCA methods that may be useful.
Unfortunately, not every model in R exposes the same object structure or API; qda() is not a linear model, so its fit object is unlikely to follow the conventions of linear-model fits.
There's an example of how to visualize qda results at http://ramhiser.com/2013/07/02/a-brief-look-at-mixture-discriminant-analysis/
And, you can do:
library(klaR)
partimat(Sp ~ ., data=Iris, method="qda", subset=train)
for a partition plot of the qda results.


How to model the marginals of a copula as student t distributions in R

I am trying to model the performance of a portfolio consisting of a basket of ETFs, using a t copula. For now, I have specified the marginals (i.e. the performance of the individual ETFs) as normal; however, I want to use a Student t distribution instead.
I have looked into the fit.st() method from the QRM package, but I am unsure how to combine this with the copula package.
I know how to implement normally distributed margins:
mv.NE <- mvdc(normalCopula(0.75), c("norm", "norm"),
              list(list(mean = 0, sd = 2), list(mean = 0, sd = 2)))
How can I do the same thing, but with a t-distribution?
All you need to do is use tCopula instead of normalCopula: set the copula parameter and the degrees of freedom, and specify both margins.
Hence, here we replace normalCopula with tCopula, where df = 5 is the degrees of freedom. Both margins are normal (as you want):
mv.NE <- mvdc(tCopula(0.75, df = 5), c("norm", "norm"),
              list(list(mean = 0, sd = 2), list(mean = 0, sd = 2)))
The result is:
Multivariate Distribution Copula based ("mvdc")
@ copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 5.00
@ margins:
[1] "norm" "norm"
with 2 (not identical) margins; with parameters (@ paramMargins)
List of 2
$ :List of 2
..$ mean: num 0
..$ sd : num 2
$ :List of 2
..$ mean: num 0
..$ sd : num 2
For t margins, use this:
mv.NE <- mvdc(tCopula(0.75), c("t", "t"),
              list(list(df = 5), list(df = 5)))
Multivariate Distribution Copula based ("mvdc")
@ copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 4.00
@ margins:
[1] "t" "t"
with 2 (not identical) margins; with parameters (@ paramMargins)
List of 2
$ :List of 1
..$ df: num 5
$ :List of 1
..$ df: num 5
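Either way, you can sanity-check the resulting object by sampling from it. In recent versions of the copula package the sampler is rMvdc (this check is my addition, not part of the original answer):
set.seed(123)
samp <- rMvdc(1000, mv.NE) ## 1000 draws from the copula model with t margins
plot(samp, xlab = "margin 1", ylab = "margin 2")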

Unable to get Residuals for AOV in R

I have a dataframe in R.
This is part of the head of the DF:
Sujet T E O P meanTR
1 1 0 0 0 0.97
1 1 0 0 0 1.44
2 0 1 0 1 0.94
Sujet : from 1 to 12
T , E , O , P : 1 or 0
meanTR : Numeric
I want to get the anova table, so I tried this:
model_all <- aov(meanTR ~ E*O*P + Error(Sujet/E*O*P), data = df)
After that, I want to extract the residuals of my models to plot them
So I tried this:
res <- residuals(model_all) ## returns NULL
So I found people online suggesting this solution:
model_all.pr <- proj(model_all)
res <- model_all.pr[[3]][, "Residuals"]
But this returns "subscript out of bounds". I also tried:
res <- model_all.pr[[3]]["Residuals"]
But this returns NA.
I don't know what I'm doing wrong; I'm really confused.
Any help would be appreciated.
The main goal is to be able to run this:
plot(res)
qqnorm(res)
With aov(), you'll get a top-level $residuals attribute for some fits but not others.
For example, with a simple model like the following, you can access the residuals directly (use str() to see the structure of an object, including which attributes can be accessed):
fit1 <- aov(Sepal.Length ~ Sepal.Width, data=iris)
str(fit1$residuals)
## Named num [1:150] -0.644 -0.956 -1.111 -1.234 -0.722 ...
## - attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
But in the more complex model specification you're using (i.e. with an explicit/custom error term), there are separate residual values in each of the top-level attributes:
fit2 <- aov(Sepal.Length ~ Sepal.Width + Error(Species/Sepal.Length), data=iris)
fit2$residuals # NULL
names(fit2)
## [1] "(Intercept)" "Species" "Sepal.Length:Species" "Within"
fit2$Species$residuals
## 2 3
## -1.136219 5.179749
str(fit2$Within$residuals)
## Named num [1:144] -1.83e-15 -2.49e-15 -1.90e-15 -2.55e-15 -2.89e-15 ...
## - attr(*, "names")= chr [1:144] "7" "8" "9" "10" ...
## ...
I haven't thought about the statistics behind this enough to say why, but I'm sure it is reasonable.
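For your Error(Sujet/...) model, the proj() route from your question should then work once you index the stratum by name rather than by position (a sketch; check names(model_all.pr) for the exact stratum names in your fit):
model_all.pr <- proj(model_all)
names(model_all.pr) ## list the error strata, e.g. "(Intercept)", "Sujet", ..., "Within"
res <- model_all.pr[["Within"]][, "Residuals"] ## residuals from the within-subject stratum
plot(res)
qqnorm(res)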
Hope that helps!

R - Extract ns spline object from lmer model and predict on new data

I'm looking to predict 'terms', especially ns splines, from an lmer model. I've replicated the problem with the mtcars dataset (technically poor example, but works to get the point across).
Here is what I'm trying to do with a linear model:
library(splines) ## for ns()
data(mtcars)
mtcarsmodel <- lm(wt ~ ns(drat, 2) + hp + as.factor(gear), data = mtcars)
summary(mtcarsmodel)
coef(mtcarsmodel)
test <- predict(mtcarsmodel, type = "terms")
Perfect. However, there is no equivalent 'terms' option for lmer predict (unresolved issue here).
library(lme4)
mtcarsmodellmer <- lmer(wt ~ ns(drat, 2) + (hp | as.factor(gear)), data = mtcars)
summary(mtcarsmodellmer)
coef(mtcarsmodellmer)
ranef(mtcarsmodellmer)
Given there is no equivalent 'predict, terms' function, I was going to extract the fixed and random coefficients above and apply them to the mtcars data, but I have no idea how to extract an ns spline object from a model and 'predict' it on new data. The same goes for a 'poly'-transformed variable, e.g. poly(drat, 2); extra kudos if you can get this as well.
It is not difficult to do it yourself.
library(lme4)
library(splines)
X <- with(mtcars, ns(drat, 2)) ## design matrix for splines (without intercept)
## head(X)
# 1 2
#[1,] 0.5778474 -0.1560021
#[2,] 0.5778474 -0.1560021
#[3,] 0.5738625 -0.1792162
#[4,] 0.2334329 -0.1440232
#[5,] 0.2808520 -0.1704002
#[6,] 0.0000000 0.0000000
## str(X)
# ns [1:32, 1:2] 0.578 0.578 0.574 0.233 0.281 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:2] "1" "2"
# - attr(*, "degree")= int 3
# - attr(*, "knots")= Named num 3.7
# ..- attr(*, "names")= chr "50%"
# - attr(*, "Boundary.knots")= num [1:2] 2.76 4.93
# - attr(*, "intercept")= logi FALSE
# - attr(*, "class")= chr [1:3] "ns" "basis" "matrix"
fit <- lmer(wt ~ X + (hp|gear), data= mtcars)
beta <- coef(fit)
#$gear
# hp (Intercept) X1 X2
#3 0.010614406 2.455403 -2.167337 -0.9246454
#4 0.014601363 2.455403 -2.167337 -0.9246454
#5 0.006342761 2.455403 -2.167337 -0.9246454
#
#attr(,"class")
#[1] "coef.mer"
If we want to predict the ns term, just do
## use `predict.ns`; read `?predict.ns`
x0 <- seq(1, 5, by = 0.2) ## example `newx`; values outside the boundary knots are linearly extrapolated
Xp <- predict(X, newx = x0) ## prediction matrix
b <- with(beta$gear, c(X1[1], X2[1])) ## coefficients for spline
y <- Xp %*% b ## predicted mean
plot(x0, y, type = "l")
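The poly() case from your question works the same way, because poly objects also have a predict method (predict.poly). A sketch under the same setup (my addition, not part of the original answer):
P <- with(mtcars, poly(drat, 2)) ## orthogonal polynomial basis
fitp <- lmer(wt ~ P + (hp | gear), data = mtcars)
bp <- with(coef(fitp)$gear, c(P1[1], P2[1])) ## coefficients for the two basis columns
Pp <- predict(P, x0) ## evaluate the same basis at the new values
yp <- Pp %*% bp ## predicted mean for the poly term
plot(x0, yp, type = "l")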

Extract distances from hclust (hierarchical clustering) object

I would like to calculate how well my cluster analysis solution fits the actual distance scores. To do that, I need to extract the distances between the stimuli I am clustering. I know that when looking at the dendrogram I can read off a distance, for example between stimuli 5 and 14 it is .219 (the height at which they are connected), but is there an automatic way of extracting these distances from the information in the hclust object?
List of 7
$ merge : int [1:14, 1:2] -5 -1 -6 -4 -10 -2 1 -9 -12 -3 ...
$ height : num [1:14] 0.219 0.228 0.245 0.266 0.31 ...
$ order : int [1:15] 3 11 5 14 4 1 8 12 10 15 ...
$ labels : chr [1:15] "1" "2" "3" "4" ...
$ method : chr "ward.D"
$ call : language hclust(d = as.dist(full_naive_eucAll, diag = F, upper = F), method = "ward.D")
$ dist.method: NULL
- attr(*, "class")= chr "hclust"
Yes.
You are asking about the cophenetic distance.
d_USArrests <- dist(USArrests)
hc <- hclust(d_USArrests, "ave")
par(mfrow = c(1,2))
plot(hc)
plot(cophenetic(hc) ~ d_USArrests)
cor(cophenetic(hc), d_USArrests)
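To read off the height for one specific pair, as you were doing from the dendrogram, index the cophenetic matrix directly (illustrated here with the USArrests row labels):
cm <- as.matrix(cophenetic(hc))
cm["Alabama", "Alaska"] ## height at which these two observations are joined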
The same measure can also be used to compare two hierarchical clusterings, and this is implemented in the dendextend R package (the function makes sure the two distance matrices are ordered to match). For example:
# install.packages('dendextend')
library("dendextend")
d_USArrests <- dist(USArrests)
hc1 <- hclust(d_USArrests, "ave")
hc2 <- hclust(d_USArrests, "single")
cor_cophenetic(hc1, hc2)
# 0.587977

Evaluating a statistical model in R

I have a very big data set (ds). One of its columns is Popularity, a factor with levels 'High'/'Low'.
I split the data to 70% and 30% in order to create a training set (ds_tr) and a test set (ds_te).
I have created the following model using a Logistic regression:
mdl <- glm(popularity ~ . - url, family = "binomial", data = ds_tr)
Then I created a vector of predicted probabilities (for ds_te I will pass newdata = ds_te):
y_hat <- predict(mdl, type = "response") ## fitted probabilities for ds_tr
I want to find the precision value which corresponds to a cutoff threshold of 0.5 and find the recall value which corresponds to a cutoff threshold of 0.5, so I did:
library(ROCR)
pred <- prediction(y_hat, ds_tr$popularity)
perf <- performance(pred, "prec", "rec")
The result is a table of many values
str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..@ x.name : chr "Recall"
..@ y.name : chr "Precision"
..@ alpha.name : chr "Cutoff"
..@ x.values :List of 1
.. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
..@ y.values :List of 1
.. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
..@ alpha.values:List of 1
.. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...
How do I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?
Access the slots of the performance object (S4 slots are accessed with @; each slot holds a list).
We create a dataset with all possible values:
probab.cuts <- data.frame(cut = perf@alpha.values[[1]], prec = perf@y.values[[1]], rec = perf@x.values[[1]])
You can view all associated values
probab.cuts
If you want to select the requested values, it is trivial to do:
tail(probab.cuts[probab.cuts$cut > 0.5, ], 1) ## last cutoff above 0.5, i.e. the one closest to the 0.5 threshold
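Alternatively, pick whichever row's cutoff is nearest to 0.5 (a small variant on the same data frame):
probab.cuts[which.min(abs(probab.cuts$cut - 0.5)), ]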
Manual check (assuming, as ROCR does by default, that the second factor level is the "positive" class):
tab <- table(ds_tr$popularity, y_hat > 0.5) ## rows: actual class; columns: predicted
tab[4]/(tab[4] + tab[2]) # recall = TP / (TP + FN)
tab[4]/(tab[4] + tab[3]) # precision = TP / (TP + FP)
