Extract distances from hclust (hierarchical clustering) object in R

I would like to calculate how well my cluster analysis solution fits the actual distance scores. To do that, I need to extract the distances between the stimuli I am clustering. I know that when looking at the dendrogram I can read off a distance, for example the height at which leaves 5 and 14 are first joined is .219, but is there an automatic way of extracting these distances from the information in the hclust object?
List of 7
$ merge : int [1:14, 1:2] -5 -1 -6 -4 -10 -2 1 -9 -12 -3 ...
$ height : num [1:14] 0.219 0.228 0.245 0.266 0.31 ...
$ order : int [1:15] 3 11 5 14 4 1 8 12 10 15 ...
$ labels : chr [1:15] "1" "2" "3" "4" ...
$ method : chr "ward.D"
$ call : language hclust(d = as.dist(full_naive_eucAll, diag = F, upper = F), method = "ward.D")
$ dist.method: NULL
- attr(*, "class")= chr "hclust"

Yes.
You are asking about the cophenetic distance.
d_USArrests <- dist(USArrests)       # original pairwise distances
hc <- hclust(d_USArrests, "ave")     # average-linkage clustering
par(mfrow = c(1, 2))
plot(hc)                             # dendrogram
plot(cophenetic(hc) ~ d_USArrests)   # cophenetic vs. original distances
cor(cophenetic(hc), d_USArrests)     # cophenetic correlation: goodness of fit
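To read off the fitted distance between two particular stimuli, coerce the cophenetic distances to a matrix and index by label; a small sketch using the USArrests example above (your own stimulus labels would replace the state names):
coph <- as.matrix(cophenetic(hc))
coph["Alabama", "Alaska"]  # height at which these two leaves are first merged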
The same measure can also be used to compare two hierarchical clustering results, and is implemented in the dendextend R package (the function makes sure the two distance matrices are ordered to match). For example:
# install.packages('dendextend')
library("dendextend")
d_USArrests <- dist(USArrests)
hc1 <- hclust(d_USArrests, "ave")
hc2 <- hclust(d_USArrests, "single")
cor_cophenetic(hc1, hc2)
# 0.587977
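dendextend also offers Baker's Gamma as a second way to compare two trees; a brief sketch (hedged: cor_bakers_gamma expects dendrogram objects, so we convert the hclust results first):
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
cor_bakers_gamma(dend1, dend2)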

Related

I'm missing the second line in a ggplot; there should be test and train lines present

I'm trying to use ggplot2 in R to graph the train and test curves for the iterative error rates of a neural network. There should be two lines, but I'm only seeing the test line. Does anyone know what happened? When I use head(error_df), every row's type is labelled as test for some reason.
Edit: even with just error_df, without any subsets, it's still not showing the line for the training set's error; this also happens with various ranges such as error_df[2500:5000, ] and error_df[7500:10000, ].
Here's the ggplot graph (image not reproduced here).
Here's the code, and this is a link to a public Google spreadsheet of the data:
library(Rcpp)
library(RSNNS)
library(ggplot2)
library(plotROC)
library(tidyr)
setwd("**set working directory**")
data <- read.csv("WDBC.csv", header=T)
data <- data[,1:4]
data <- scale(data) # normalizes the data
numHneurons3 = 3
DecTargets = decodeClassLabels(data[,4])
train.test3 <- splitForTrainingAndTest(data, DecTargets,ratio = 0.50) # split
model3_02 <- mlp(train.test3$inputsTrain, train.test3$targetsTrain,   # build model3
                 size = numHneurons3, learnFuncParams = c(0.02), maxit = 10000,
                 inputsTest = train.test3$inputsTest,
                 targetsTest = train.test3$targetsTest)
#--------------------------------------
# GGPlots of the Iterative Error:
#--------------------------------------
str(model3_02)
test_error <- model3_02$IterativeTestError
train_error <- model3_02$IterativeFitError
error_df <- data.frame(iter = c(seq_along(test_error),
                                seq_along(train_error)),
                       Error = c(test_error, train_error),
                       type = c(rep("test", length(test_error)),
                                rep("train", length(train_error))))
ggplot(error_df[5000:10000,], aes(iter, Error, color = type, each = length(test_error))) + geom_line()
Here's also a snippet of the data, model, and data frame:
> head(data, 10)
PatientID radius texture perimeter
[1,] -0.2361973 1.0960995 -2.0715123 1.26881726
[2,] -0.2361956 1.8282120 -0.3533215 1.68447255
[3,] 0.4313615 1.5784992 0.4557859 1.56512598
[4,] 0.4317407 -0.7682333 0.2535091 -0.59216612
[5,] 0.4318215 1.7487579 -1.1508038 1.77501133
[6,] -0.2361855 -0.4759559 -0.8346009 -0.38680772
[7,] -0.2361809 1.1698783 0.1605082 1.13712450
[8,] 0.4326197 -0.1184126 0.3581350 -0.07280278
[9,] -0.2361759 -0.3198854 0.5883121 -0.18391855
[10,] 0.4329621 -0.4731182 1.1044669 -0.32919213
> str(model3_02)
List of 17
$ nInputs : int 4
$ maxit : num 10000
$ IterativeFitError : num [1:10000] 18838 4468 2365 1639 1278 ...
$ IterativeTestError : num [1:10000] 7031 3006 1916 1431 1161 ...
$ fitted.values : num [1:284, 1:522] 0.00386 0.00386 0.00387 0.00387 0.00386 ...
$ fittedTestValues : num [1:285, 1:522] 0.00387 0.00387 0.00387 0.00387 0.00387 ...
$ nOutputs : int 522
- attr(*, "class")= chr [1:2] "mlp" "rsnns"
> head(error_df)
iter Error type
1 1 7031.3101 test
2 2 3006.4253 test
3 3 1915.8997 test
4 4 1430.6152 test
5 5 1160.6987 test
6 6 990.2686 test
You created a data frame (error_df) with three columns by concatenating two variables into one column, so the rows were filled one series after the other: the test errors occupy rows 1 to 10000 and the train errors occupy rows 10001 to 20000. However, you're limiting your plot to rows 5000 to 10000 of the data, which contain only test values.
ggplot(error_df[c(5000:10000, 15000:20000), ], aes(iter, Error, color = type)) + geom_line()
should show both curves.
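An alternative sketch that avoids the row arithmetic entirely: since iter restarts at 1 for the train series, you can filter on the iteration number itself and both series are kept automatically.
ggplot(subset(error_df, iter >= 5000), aes(iter, Error, color = type)) +
  geom_line()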

How to model the marginals of a copula as student t distributions in R

I am trying to model the performance of a portfolio consisting of a basket of ETFs, using a t-copula. For now, I have specified the marginals (i.e. the performance of the individual ETFs) as normal; however, I want to use a Student t-distribution instead of a normal distribution.
I have looked into the fit.st() method from the QRM package, but I am unsure how to combine this with the copula package.
I know how to implement normally distributed margins:
mv.NE <- mvdc(normalCopula(0.75), c("norm"),
              list(list(mean = 0, sd = 2)))
How can I do the same thing, but with a t-distribution?
All you need to do is use tCopula instead of normalCopula. You need to set the parameter and the degrees of freedom of the t-copula, and you need to specify the margins as well.
Hence, here we replace normalCopula with tCopula, where df = 5 is the degrees of freedom. Both margins are normal (as you want).
mv.NE <- mvdc(tCopula(0.75, df = 5), c("norm", "norm"),
              list(list(mean = 0, sd = 2), list(mean = 0, sd = 2)))
The result is:
Multivariate Distribution Copula based ("mvdc")
@ copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 5.00
@ margins:
[1] "norm" "norm"
with 2 (not identical) margins; with parameters (@ paramMargins)
List of 2
$ :List of 2
..$ mean: num 0
..$ sd : num 2
$ :List of 2
..$ mean: num 0
..$ sd : num 2
For t margins, use this (each margin takes its own df parameter):
mv.NE <- mvdc(tCopula(0.75), c("t", "t"),
              list(list(df = 5), list(df = 5)))
Multivariate Distribution Copula based ("mvdc")
@ copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 4.00
@ margins:
[1] "t" "t"
with 2 (not identical) margins; with parameters (@ paramMargins)
List of 2
$ :List of 1
..$ df: num 5
$ :List of 1
..$ df: num 5
(The copula's own df defaults to 4 here; pass df explicitly to tCopula if you want to control it.)
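As a quick usage check, you can sample from the resulting mvdc object and inspect the joint behaviour; a sketch (rMvdc is the sampling function in current versions of the copula package):
library(copula)
set.seed(1)
x <- rMvdc(500, mv.NE)  # 500 draws from the joint distribution
plot(x, xlab = "margin 1", ylab = "margin 2")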

tf-idf document term matrix and LDA: Error messages in R

Can we input a tf-idf document-term matrix into Latent Dirichlet Allocation (LDA)? If yes, how?
It does not work in my case; the LDA function requires a 'term frequency' document-term matrix.
Thank you.
(I've made the question as concise as possible, so if you need more details, I can add them.)
##########################################################################
TF-IDF Document matrix construction
##########################################################################
> DTM_tfidf <- DocumentTermMatrix(corpora, control = list(weighting =
+     function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow : int 64
$ ncol : int 297
$ dimnames:List of 2
..$ Docs : chr [1:64] "1" "2" "3" "4" ...
..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document
frequency" "tf-idf"
##########################################################################
LDA section
##########################################################################
> LDA_results <- LDA(DTM_tfidf, k, method = "Gibbs",
+                    control = list(nstart = nstart, seed = seed, best = best,
+                                   burnin = burnin, iter = iter, thin = thin))
##########################################################################
Error messages
##########################################################################
Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart, :
  The DocumentTermMatrix needs to have a term frequency weighting
If you explore the documentation for LDA topic modeling with the topicmodels package, for example by typing ?LDA in the R console, you'll see that this modeling procedure expects a frequency-weighted document-term matrix, not a tf-idf-weighted one:
"Object of class "DocumentTermMatrix" with term-frequency weighting or an object coercible..."
So the answer is no, you cannot use a tf-idf-weighted DTM directly in this function. LDA is a probabilistic model of word counts, so it needs raw term frequencies; tf-idf weights cannot simply be relabelled back into counts. If you already have a tf-idf-weighted DTM, rebuild it with the default term-frequency weighting, and if you are building a document-term matrix from scratch, simply don't weight it by tf-idf.
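A minimal sketch of the fix, assuming the corpora object from above is available (k and the control values are illustrative placeholders): rebuild the DTM with the default term-frequency weighting and pass that to LDA.
library(tm)
library(topicmodels)
DTM_tf <- DocumentTermMatrix(corpora)  # default weighting: plain term frequency
LDA_results <- LDA(DTM_tf, k = 5, method = "Gibbs",
                   control = list(seed = 1234, burnin = 1000, iter = 2000))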

How to plot MASS::qda scores

From this question, I was wondering whether it's possible to extract the quadratic discriminant analysis (QDA) scores and reuse them afterwards, like PCA scores.
## follow example from ?lda
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
set.seed(1) ## remove this line if you want a different random sample each run
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your answer may differ
## c s v
## 22 23 30
Using QDA here:
z <- qda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
## get the whole prediction object
pred <- predict(z)
## show first few sample scores on LDs
Here you can see that it doesn't work:
head(pred$x)
# NULL
plot(LD2 ~ LD1, data = pred$x)
# Error in eval(expr, envir, enclos) : object 'LD2' not found
NOTE: Too long/formatted for a comment. NOT AN ANSWER
You may want to try the rrcov package:
library(rrcov)
z <- QdaCov(Sp ~ ., Iris[train,], prior = c(1,1,1)/3)
pred <- predict(z)
str(pred)
## Formal class 'PredictQda' [package "rrcov"] with 4 slots
## ..# classification: Factor w/ 3 levels "c","s","v": 2 2 2 1 3 2 2 1 3 2 ...
## ..# posterior : num [1:41, 1:3] 5.84e-45 5.28e-50 1.16e-25 1.00 1.48e-03 ...
## ..# x : num [1:41, 1:3] -97.15 -109.44 -54.03 2.9 -3.37 ...
## ..# ct : 'table' int [1:3, 1:3] 13 0 1 0 16 0 0 0 11
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ Actual : chr [1:3] "c" "s" "v"
## .. .. ..$ Predicted: chr [1:3] "c" "s" "v"
It also has robust PCA methods that may be useful.
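If reusing the scores is the goal, the @x slot above holds one column of quadratic scores per class; a hedged sketch plotting the first two columns, coloured by predicted class:
scores <- pred@x
plot(scores[, 1], scores[, 2], col = as.integer(pred@classification),
     xlab = "class 1 score", ylab = "class 2 score")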
Unfortunately, not every model in R conforms to the same object structure/API, and QDA is not a linear model, so its predict output does not expose LDA-style discriminant scores (there is no $x component).
There's an example of how to visualize the qda results here: http://ramhiser.com/2013/07/02/a-brief-look-at-mixture-discriminant-analysis/
And, you can do:
library(klaR)
partimat(Sp ~ ., data=Iris, method="qda", subset=train)
for a partition plot of the qda results.

Evaluating a statistical model in R

I have a very big data set (ds). One of its columns is Popularity, of type factor ('High'/'Low').
I split the data to 70% and 30% in order to create a training set (ds_tr) and a test set (ds_te).
I have created the following model using a Logistic regression:
mdl <- glm(formula = popularity ~ . -url , family= "binomial", data = ds_tr )
then I created a predict object (will do it again for ds_te)
y_hat = predict(mdl, data = ds_tr - url , type = 'response')
I want to find the precision and recall values that correspond to a cutoff threshold of 0.5, so I did:
library(ROCR)
pred <- prediction(y_hat, ds_tr$popularity)
perf <- performance(pred, "prec", "rec")
The result is an object holding many values:
str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..# x.name : chr "Recall"
..# y.name : chr "Precision"
..# alpha.name : chr "Cutoff"
..# x.values :List of 1
.. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
..# y.values :List of 1
.. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
..# alpha.values:List of 1
.. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...
How do I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?
Access the slots of the performance object (through the combination of @ and [[):
We create a data frame with all cutoff values and their associated precision and recall:
probab.cuts <- data.frame(cut = perf@alpha.values[[1]],
                          prec = perf@y.values[[1]],
                          rec = perf@x.values[[1]])
You can view all associated values
probab.cuts
If you want the values for a cutoff of 0.5, note that the cutoffs are stored in decreasing order, so the last row with cut > 0.5 is the one closest to the threshold:
tail(probab.cuts[probab.cuts$cut > 0.5, ], 1)
Manual check (the positive class is the second factor level, as modelled by glm):
tab <- table(ds_tr$popularity, y_hat > 0.5)  # rows: actual, columns: predicted
tab[4] / (tab[4] + tab[2])  # recall    = TP / (TP + FN)
tab[4] / (tab[4] + tab[3])  # precision = TP / (TP + FP)
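Equivalently, a one-liner sketch against the perf object built above: pick the index whose cutoff is nearest 0.5 and read precision and recall at that index.
i <- which.min(abs(perf@alpha.values[[1]] - 0.5))
c(precision = perf@y.values[[1]][i], recall = perf@x.values[[1]][i])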
