tf-idf document term matrix and LDA: Error messages in R

Can we input a tf-idf document term matrix into Latent Dirichlet Allocation (LDA)? If yes, how?
It does not work in my case; the LDA function requires a term-frequency document term matrix.
Thank you
(I have made the question as concise as possible, so if you need more details, I can add them.)
##########################################################################
TF-IDF Document matrix construction
##########################################################################
> DTM_tfidf <- DocumentTermMatrix(corpora, control = list(weighting =
+     function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow : int 64
$ ncol : int 297
$ dimnames:List of 2
..$ Docs : chr [1:64] "1" "2" "3" "4" ...
..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document
frequency" "tf-idf"
##########################################################################
LDA section
##########################################################################
> LDA_results <- LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,
+                    seed = seed, best = best,
+                    burnin = burnin, iter = iter, thin = thin))
##########################################################################
Error messages
##########################################################################
Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart =
nstart, :
The DocumentTermMatrix needs to have a term frequency weighting

If you explore the documentation for LDA topic modeling in the topicmodels package (for example, by typing ?LDA in the R console), you'll see that this modeling procedure expects a frequency-weighted document-term matrix, not a tf-idf-weighted one.
"Object of class "DocumentTermMatrix" with term-frequency weighting or an object coercible..."
So the answer is no, you cannot use a tf-idf-weighted DTM directly in this function. If you already have a tf-idf-weighted DTM, you can re-weight it with tm::weightTf() to get the necessary term-frequency weighting. If you are building a document-term matrix from scratch, just don't weight it by tf-idf.
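As a minimal sketch (assuming corpora is the corpus from the question; the topic count k = 5 is arbitrary), building the DTM with the default term-frequency weighting lets LDA() run without the error:

library(tm)
library(topicmodels)

# the default weighting is plain term frequency, which is what LDA() expects
DTM_tf <- DocumentTermMatrix(corpora)

LDA_results <- LDA(DTM_tf, k = 5, method = "Gibbs")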

Related

mk.test() results to table/matrix in R

I want to apply mk.test() to a large dataset and get the results in a table/matrix.
My data look something like this:

Column A   Column B   ...   ColumnXn
1          2          ...   5
...        ...        ...   ...
3          4          ...   7
So far I have managed to run mk.test() on all columns and print the results:
for(i in 1:ncol(data)) {
  print(mk.test(as.numeric(unlist(data[ , i]))))
}
I got all the results printed:
.....
Mann-Kendall trend test
data: as.numeric(unlist(data[, i]))
z = 4.002, n = 71, p-value = 6.28e-05
alternative hypothesis: true S is not equal to 0
sample estimates:
S varS tau
7.640000e+02 3.634867e+04 3.503154e-01
Mann-Kendall trend test
data: as.numeric(unlist(data[, i]))
z = 3.7884, n = 71, p-value = 0.0001516
alternative hypothesis: true S is not equal to 0
sample estimates:
S varS tau
7.240000e+02 3.642200e+04 3.283908e-01
....
However, I was wondering if it is possible to get the results in a table/matrix format that I could save as an Excel file.
Something like this:

Column     z        p-value     S             varS          tau
Column A   4.002    6.28e-05    7.640000e+02  3.634867e+04  3.503154e-01
...        ...      ...         ...           ...           ...
ColumnXn   3.7884   0.0001516   7.240000e+02  3.642200e+04  3.283908e-01
Is it possible to do so?
I would really appreciate your help.
Instead of printing the test results, you can store them in a variable. This variable holds the various test statistics and values. To find the names of its components, you can run the test on the first column and inspect the structure via a string conversion:
testres = mk.test(as.numeric(unlist(data[ , 1])))
str(testres)
List of 9
$ data.name : chr "as.numeric(unlist(data[, 1]))"
$ p.value : num 0.296
$ statistic : Named num 1.04
..- attr(*, "names")= chr "z"
$ null.value : Named num 0
..- attr(*, "names")= chr "S"
$ parameter : Named int 3
..- attr(*, "names")= chr "n"
$ estimates : Named num [1:3] 3 3.67 1
..- attr(*, "names")= chr [1:3] "S" "varS" "tau"
$ alternative: chr "two.sided"
$ method : chr "Mann-Kendall trend test"
$ pvalg : num 0.296
- attr(*, "class")= chr "htest"
Here you can see, for example, that the z-value is stored in testres$statistic, and similarly for the other components. The values of S, varS and tau are not separate components; they are grouped together in the list testres$estimates.
In the code you can create an empty data frame and, inside the loop, append the results of each run to it. At the end, you can write everything to a csv file using write.csv().
library(trend)

# sample data
mydata = data.frame(ColumnA = c(1,3,5), ColumnB = c(2,4,1), ColumnXn = c(5,7,7))

# empty dataframe to store results
results = data.frame(matrix(ncol = 6, nrow = 0))
colnames(results) <- c("Column", "z", "p-value", "S", "varS", "tau")

for(i in 1:ncol(mydata)) {
  # store test results in variable
  testres = mk.test(as.numeric(unlist(mydata[ , i])))
  # extract elements of result
  testvars = c(colnames(mydata)[i],    # column
               testres$statistic,      # z
               testres$p.value,        # p-value
               testres$estimates[1],   # S
               testres$estimates[2],   # varS
               testres$estimates[3])   # tau
  # add to results dataframe
  results[nrow(results)+1,] <- testvars
}

write.csv(results, "mannkendall.csv", row.names = FALSE)
The resulting csv file can be opened in Excel.
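As a side note, a more compact sketch of the same idea (using the same mydata as above) builds the table with sapply() instead of growing the data frame row by row:

# run the test on each column and collect the statistics into a matrix
results2 <- t(sapply(1:ncol(mydata), function(i) {
  testres <- mk.test(as.numeric(unlist(mydata[ , i])))
  c(z = unname(testres$statistic),
    `p-value` = testres$p.value,
    testres$estimates)   # S, varS, tau
}))
rownames(results2) <- colnames(mydata)
write.csv(results2, "mannkendall2.csv")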

Why does the caret train() function always return one .outcome when the model can have more than one dependent variable?

This is quite tricky and is related to other questions I have asked here (1, 2, 3).
Basically, I am comparing two neural network trainings:
One using directly the function neuralnet() from the neuralnet package.
One using the function train() from the caret package, and including method = "neuralnet".
The same formula is used in both models.
In both cases, this is the formula passed as an argument:
f <- as.formula(paste("DC1 + DC2 + DC3 ~", paste(n[!n %in% c("DC1","DC2","DC3")], collapse = " + ")))
If you inspect its structure, you can see that there are three dependent variables (DC1, DC2, DC3):
str(f)
Class 'formula' language DC1 + DC2 + DC3 ~ DC4 + DC5 + DC6 + DC7 + DC8 + DC9 + DC10
..- attr(*, ".Environment")=<environment: R_GlobalEnv>
Using neuralnet() (neuralnet)
This is the code:
BestModel <- neuralnet(formula = f,
                       data = train_df,    # available in one of the links above
                       hidden = c(3,2,4),  # random neurons per layer
                       learningrate = 0.01,
                       threshold = 0.01,
                       stepmax = 50000
                       )
Using train() (caret)
This is the code (all the parameters are available in this question):
model <- train(f, data = predict(pre_mdl_df, train_df),
               method = "neuralnet",
               tuneGrid = tune.grid.neuralnet,
               metric = "RMSE",
               stepmax = 100000,
               learningrate = 0.01,
               threshold = 0.01,
               act.fct = softplus,
               trControl = caret::trainControl(
                 method = "repeatedcv",
                 number = 2,          # number of folds of the cv
                 repeats = 1,         # number of cv repetitions
                 verboseIter = TRUE,
                 savePredictions = TRUE,
                 allowParallel = TRUE))
Here we store the final and best model obtained in the training as finalModel:
finalModel <- model$finalModel
Comparison of the outputs from neuralnet() and train()
Different structures
neuralnet(): str(BestModel) is a List of 14.
train(): str(finalModel) is a List of 19.
Different responses
As you can see, the response from neuralnet() is chr [1:3] "DC1" "DC2" "DC3", whereas the one from train() is chr ".outcome". I suspect this will impact the results later, because the predictions will return a different number of columns.
# neuralnet()
str(BestModel$response)
num [1:26, 1:3] 4 5 6 8 11 11 11 6 8 5 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:26] "165" "167" "168" "172" ...
..$ : chr [1:3] "DC1" "DC2" "DC3"
# train()
str(finalModel$response)
num [1:26, 1] 0.83 0.509 1.199 1.353 1.55 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:26] "X165" "X167" "X168" "X172" ...
..$ : chr ".outcome"
Different model lists
# neuralnet()
str(BestModel$model.list)
List of 2
$ response : chr [1:3] "DC1" "DC2" "DC3"
$ variables: chr [1:7] "DC4" "DC5" "DC6" "DC7" ...
# train()
str(finalModel$model.list)
List of 2
$ response : chr ".outcome"
$ variables: chr [1:7] "DC4" "DC5" "DC6" "DC7" ...
Different plots
From neuralnet(): as expected, with three outputs representing the three dependent variables (DC1, DC2, DC3).
From train(): only one output, .outcome.
Different prediction variables
I expect to predict the three variables simultaneously, as happens with neuralnet() but not with train(), which returns only one.
# neuralnet()
head(predict(BestModel, train_df))
[,1] [,2] [,3]
165 9.384303 10.38448 10.07678
167 9.384303 10.38448 10.07678
168 9.384303 10.38448 10.07678
172 9.384303 10.38448 10.07678
174 9.384303 10.38448 10.07678
176 9.384303 10.38448 10.07678
# train()
head(predict(finalModel, train_df))
[,1]
165 0.8641611
167 1.3874627
168 0.8641579
172 1.1318320
174 0.8639903
176 1.7193986
Key questions
After comparing both models, which were supposed to behave similarly, several questions come to mind:
How can I make train() work like neuralnet()? This includes the plots and predictions.
Is there any way to modify the code from train()?
Is there a bug in train()?
How can I change .outcome to have the three dependent variables?
Please, I need help!
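A note on the root cause (not from the original post): train() is built around a single outcome, so its formula interface collapses the left-hand side into one .outcome column. A minimal workaround sketch, assuming the train_df data frame and DC* variable names from the question and omitting the tuning setup for brevity, is to fit one single-outcome model per dependent variable:

outcomes   <- c("DC1", "DC2", "DC3")
predictors <- c("DC4", "DC5", "DC6", "DC7", "DC8", "DC9", "DC10")

# fit one caret model per dependent variable
models <- lapply(outcomes, function(y) {
  caret::train(reformulate(predictors, response = y),
               data = train_df, method = "neuralnet")
})

# combine the single-column predictions back into a three-column matrix
preds <- sapply(models, predict, newdata = train_df)
colnames(preds) <- outcomes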

How to model the marginals of a copula as Student t distributions in R

I am trying to model the performance of a portfolio consisting of a basket of ETFs. To do this, I am using a t copula. For now, I have specified the marginals (i.e. the performance of the individual ETFs) as normal, but I want to use a Student t-distribution instead of a normal distribution.
I have looked into the fit.st() method from the QRM package, but I am unsure how to combine this with the copula package.
I know how to implement normally distributed margins:
mv.NE <- mvdc(normalCopula(0.75), c("norm"),
list(list(mean = 0, sd =2)))
How can I do the same thing, but with a t-distribution?
All you need to do is use tCopula instead of normalCopula, set its parameter and degrees of freedom, and specify both margins.
So here we replace normalCopula with tCopula, where df = 5 is the degrees of freedom. Both margins are normal (as you want):
mv.NE <- mvdc(tCopula(0.75, df = 5), c("norm", "norm"),
              list(list(mean = 0, sd = 2), list(mean = 0, sd = 2)))
The result is:
Multivariate Distribution Copula based ("mvdc")
# copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 5.00
# margins:
[1] "norm" "norm"
with 2 (not identical) margins; with parameters (# paramMargins)
List of 2
$ :List of 2
..$ mean: num 0
..$ sd : num 2
$ :List of 2
..$ mean: num 0
..$ sd : num 2
For t-margins, use this:
mv.NE <- mvdc(tCopula(0.75), c("t", "t"), list(t = 5, t = 5))
Multivariate Distribution Copula based ("mvdc")
# copula:
t-copula, dim. d = 2
Dimension: 2
Parameters:
rho.1 = 0.75
df = 4.00
# margins:
[1] "t" "t"
with 2 (not identical) margins; with parameters (# paramMargins)
List of 2
$ t: Named num 5
..- attr(*, "names")= chr "df"
$ t: Named num 5
..- attr(*, "names")= chr "df"
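As a follow-up sketch (not part of the original answer): with the copula's df set explicitly and the margin parameters written in the usual list-of-lists form, the resulting mvdc object can be sampled from and evaluated with rMvdc() and dMvdc():

library(copula)

# t copula (rho = 0.75, df = 5) with Student t margins (df = 5 each)
mv.t <- mvdc(tCopula(0.75, df = 5), c("t", "t"),
             list(list(df = 5), list(df = 5)))

set.seed(42)
x  <- rMvdc(1000, mv.t)      # random draws from the joint distribution
d0 <- dMvdc(c(0, 0), mv.t)   # joint density at the origin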

How to plot MASS::qda scores

From this question, I was wondering whether it is possible to extract the quadratic discriminant analysis (QDA) scores and reuse them afterwards, like PCA scores.
## follow example from ?lda
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
set.seed(1) ## remove this line if you want it to be pseudo random
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your answer may differ
## c s v
## 22 23 30
Using QDA here:
z <- qda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
## get the whole prediction object
pred <- predict(z)
## show first few sample scores on LDs
Here you can see that it does not work:
head(pred$x)
# NULL
plot(LD2 ~ LD1, data = pred$x)
# Error in eval(expr, envir, enclos) : object 'LD2' not found
NOTE: Too long/formatted for a comment. NOT AN ANSWER
You may want to try the rrcov package:
library(rrcov)
z <- QdaCov(Sp ~ ., Iris[train,], prior = c(1,1,1)/3)
pred <- predict(z)
str(pred)
## Formal class 'PredictQda' [package "rrcov"] with 4 slots
## ..# classification: Factor w/ 3 levels "c","s","v": 2 2 2 1 3 2 2 1 3 2 ...
## ..# posterior : num [1:41, 1:3] 5.84e-45 5.28e-50 1.16e-25 1.00 1.48e-03 ...
## ..# x : num [1:41, 1:3] -97.15 -109.44 -54.03 2.9 -3.37 ...
## ..# ct : 'table' int [1:3, 1:3] 13 0 1 0 16 0 0 0 11
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ Actual : chr [1:3] "c" "s" "v"
## .. .. ..$ Predicted: chr [1:3] "c" "s" "v"
It also has robust PCA methods that may be useful.
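Since the question is about reusing the scores like PCA scores, here is a minimal sketch (using the pred object from the QdaCov call above) that pulls them out of the x slot and plots them:

# the scores live in the "x" slot of the S4 PredictQda object
scores <- pred@x
plot(scores[, 1], scores[, 2],
     col = pred@classification,   # color points by predicted class
     xlab = "Score 1", ylab = "Score 2")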
Unfortunately, not every model in R conforms to the same object structure/API, and QDA is not a linear model, so it is unlikely to conform to the structure of linear-model fit objects.
There's an example of how to visualize qda results here: http://ramhiser.com/2013/07/02/a-brief-look-at-mixture-discriminant-analysis/
And you can do:
library(klaR)
partimat(Sp ~ ., data=Iris, method="qda", subset=train)
for a partition plot of the qda results.

Extract distances from hclust (hierarchical clustering) object

I would like to calculate how well my cluster analysis solution fits the actual distance scores. To do that, I need to extract the distances between the stimuli I am clustering. I know that, looking at the dendrogram, I can read off the distance (for example, between 5 and -14 it is .219, the height at which they are connected), but is there an automatic way of extracting the distances from the information in the hclust object?
List of 7
$ merge : int [1:14, 1:2] -5 -1 -6 -4 -10 -2 1 -9 -12 -3 ...
$ height : num [1:14] 0.219 0.228 0.245 0.266 0.31 ...
$ order : int [1:15] 3 11 5 14 4 1 8 12 10 15 ...
$ labels : chr [1:15] "1" "2" "3" "4" ...
$ method : chr "ward.D"
$ call : language hclust(d = as.dist(full_naive_eucAll, diag = F, upper = F), method = "ward.D")
$ dist.method: NULL
- attr(*, "class")= chr "hclust"
Yes.
You are asking about the cophenetic distance.
d_USArrests <- dist(USArrests)       # original pairwise distances
hc <- hclust(d_USArrests, "ave")     # average-linkage clustering
par(mfrow = c(1, 2))
plot(hc)                             # dendrogram
plot(cophenetic(hc) ~ d_USArrests)   # cophenetic vs. original distances
cor(cophenetic(hc), d_USArrests)     # cophenetic correlation
The same idea can also be used to compare two hierarchical clusterings, and it is implemented in the dendextend R package (the function makes sure the two distance matrices are ordered to match). For example:
# install.packages('dendextend')
library("dendextend")
d_USArrests <- dist(USArrests)
hc1 <- hclust(d_USArrests, "ave")
hc2 <- hclust(d_USArrests, "single")
cor_cophenetic(hc1, hc2)
# 0.587977
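And to look up the height for a specific pair, as the question asks, a small sketch continuing the USArrests example above converts the cophenetic distances into a labeled matrix:

coph <- as.matrix(cophenetic(hc1))   # labeled matrix of cophenetic distances
coph["Alabama", "Alaska"]            # height at which the two first join a cluster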
