Calculate intertopic distances from the LDAvis package in R

The LDAvis package produces beautiful intertopic distance maps:
serVis(json_lda, out.dir = 'vis', open.browser = FALSE) # writes the LDAvis visualization to the 'vis' directory
(Screenshot of the resulting intertopic distance map omitted.)
How can I go about producing a matrix or data frame of all of the pairwise distances between the topics?
I have access to the Document Term Matrix, Corpus, LDA model object, and json_lda used to output the visualization.
I've uploaded RDS files for testing here. They can be loaded using:
library(lsa)
library(tm)
library(slam)
library(LDAvis)
library(topicmodels)
DTM <- readRDS("dtm.RDS")
ldamodel <- readRDS("ldamodel.RDS")
json_lda <- readRDS("json_lda.RDS")
corpus <- readRDS("new.corpus.RDS")

unzip("dtm.zip")
readRDS("json_lda.rds") -> k
library(jsonlite)
fromJSON(k) -> z
cbind(z$mdsDat$x, z$mdsDat$y) -> q
rownames(q) <- z$mdsDat$topics
dist(q) -> r
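Note that these are distances in the 2-D MDS projection. If distances in the original topic space are wanted instead, here is a sketch (an assumption, not part of the original answer): LDAvis's default scaling is based on the Jensen-Shannon divergence between the topic-word distributions, which can be computed directly from the fitted model loaded above.
# Sketch: pairwise Jensen-Shannon divergence between topic-word distributions,
# computed from the topicmodels object `ldamodel` loaded earlier.
library(topicmodels)
phi <- posterior(ldamodel)$terms                    # K x V matrix, one row per topic
js_div <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}
K <- nrow(phi)
jsd <- matrix(0, K, K, dimnames = list(rownames(phi), rownames(phi)))
for (i in seq_len(K)) {
  for (j in seq_len(K)) {
    jsd[i, j] <- js_div(phi[i, ], phi[j, ])
  }
}
jsd                                                 # K x K matrix of intertopic divergences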

Related

Unable to make scaled heatmap for differential gene analysis

I'm new to R, so be easy on me. I'm having trouble generating a heatmap for my genes. I performed differential gene expression analysis using the DESeq2 package and found the 30 most downregulated genes with FDR < 0.05 across cell lines. I was trying to create a heatmap with the pheatmap package, but I couldn't generate it the way I want: a heatmap of my top 30 genes for each of the 8 cell lines.
Here's my code:
dds <- DESeqDataSetFromMatrix(countData = GSM_subset,
                              colData = subset,
                              design = ~ Condition)
d_analysis <- DESeq(dds)
res <- results(d_analysis)
res
nrow(dds)
dds <- dds[rowSums(counts(dds)) > 1,]
nrow(dds)
mcols(res, use.names = TRUE)
summary(res)
resLFC1 <- results(d_analysis, lfcThreshold=3)
table(resLFC1$padj<0.05)
resLFC1 <- resLFC1[complete.cases(resLFC1),]
resLFC1
resSig <- subset(resLFC1, log2FoldChange <= -3 & padj < 0.05)  # downregulated and significant
top30 <- head(resSig[order(resSig$log2FoldChange), ], 30)      # 30 most downregulated genes
top30 <- as.data.frame(top30)
library(pheatmap)
pheatmap(top30)
Heatmaps in the genomics context usually use the scaled (that is, Z-transformed) normalized counts on the log2 scale, or a similar transformation such as vst or rlog from the DESeq2 package.
Given you already use DESeq2 you can do with dds being your DESeqDataSet:
vsd <- assay(vst(dds)) # log-normalized and variance-stabilized counts
Z <- t(scale(t(vsd))) # z-transformation
Z.select <- Z[your.genes.of.interest,] # subset to genes of interest
...and from there use the heatmap package of your choice.
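For example, a minimal sketch, assuming top30 from the question holds the selected genes and that its row names are gene IDs present in dds:
# Sketch (assumption): Z-scored, variance-stabilized counts for the selected genes
library(DESeq2)
library(pheatmap)
vsd <- assay(vst(dds))              # log-scale, variance-stabilized counts
Z <- t(scale(t(vsd)))               # row-wise Z-transformation
Z.select <- Z[rownames(top30), ]    # restrict to the 30 genes of interest
pheatmap(Z.select)                  # columns are the samples / cell lines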

Perform feature selection over document-term matrix in R

I have a matrix with 99,814 items containing reviews and their respective polarities (positive or negative), and I would like to do some feature selection over the terms of the corpus, keeping only those that are most informative for predicting each polarity, before passing the matrix to a model.
The problem is I am currently working with 16,554 terms, so trying to transform the document-term matrix into a sparse matrix so I can apply something like chi-squared to the terms is getting me a "Cholmod error out of memory" message.
So my question is: is there any feasible way I can get the chi-squared value of all terms with the matrix in its more "memory efficient" format? Or am I out of luck?
Here's some sample code that should give one an idea of what I am trying to do. I am using the text2vec library to do the transformation on the text.
library(text2vec)
review_matrix <- data.frame(id = c(1, 2, 3),
                            review = c('This review is negative',
                                       'This review is positive',
                                       'This review is positive'),
                            sentiment = c('Negative', 'Positive', 'Positive'))
tokenizer <- word_tokenizer
tokens <- tokenizer(review_matrix$review)
iterator <- itoken(tokens,
                   ids = review_matrix$id,  # the id column of review_matrix
                   progressbar = FALSE)
vocabulary <- create_vocabulary(iterator)
vectorizer <- vocab_vectorizer(vocabulary)
document_term_matrix <- create_dtm(iterator, vectorizer)
model_tf_idf <- TfIdf$new()
document_term_matrix <- model_tf_idf$fit_transform(document_term_matrix)
# This is where I am trying to do the chisq.test
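One memory-friendly route (a sketch under the assumption that a presence/absence view of the terms is acceptable; it is not from the original thread): for a binary label, the per-term chi-squared statistic only needs the 2x2 contingency counts, and those can be obtained from sparse column sums without ever densifying the matrix.
# Sketch: chi-squared score per term from a sparse DTM, no dense conversion.
# The (TF-IDF weighted) DTM is binarized first: a term either occurs in a document or not.
library(Matrix)
chisq_scores <- function(dtm_bin, labels) {
  labels <- as.factor(labels)
  stopifnot(nlevels(labels) == 2)
  pos <- labels == levels(labels)[2]
  n_pos <- sum(pos); n_neg <- sum(!pos)
  a <- Matrix::colSums(dtm_bin[pos, , drop = FALSE])   # term present, class 2
  b <- Matrix::colSums(dtm_bin[!pos, , drop = FALSE])  # term present, class 1
  c <- n_pos - a                                       # term absent, class 2
  d <- n_neg - b                                       # term absent, class 1
  n <- n_pos + n_neg
  n * (a * d - b * c)^2 / ((a + b) * (c + d) * (a + c) * (b + d))
}
scores <- chisq_scores(document_term_matrix > 0, review_matrix$sentiment)
head(sort(scores, decreasing = TRUE))                  # most discriminative terms first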

could not find function "FUNcluster" in R

I want to run k-means clustering on my data and show the plot using the code below. The elbow method is used to choose the number of clusters k.
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visual
library(NbClust) #use zip file to install it
wss <- function(k) {
  kmeans(df_km, k, nstart = 25)$tot.withinss
}
# Compute and plot wss for k = 1 to k = 32
k.values <- 1:32
wss_values <- map_dbl(k.values, wss)
set.seed(123)
fviz_nbclust(df_km, FUNcluster = kmeans, method = "wss")
The last line of code produces this error:
Error in FUNcluster(x, i, ...) : could not find function "FUNcluster"
I tried restarting the R session, and I installed "factoextra" from a .zip file, from CRAN, and also via this link:
devtools::install_github("kassambara/factoextra")
But I still get the error. Is there any way to solve it?
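As a workaround sketch (an assumption, not from the question itself): the wss_values already computed above with map_dbl can be plotted directly with base graphics, which avoids fviz_nbclust entirely.
# Sketch (assumption): base-R elbow plot from the wss_values computed above
plot(k.values, wss_values,
     type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")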

Mclust in R: How to output cluster centers

I'm currently using RStudio to do text mining on support tickets, clustering them by their description (free text). For this, I compare kmeans to the EM algorithm. I prepared the data with the tm package, and now I am trying to apply clustering algorithms to the data matrix.
With the kmeans() function, I can use the following code snippet to output the 5 most frequent terms in the text clusters (kmeans21):
for (i in 1:num_cluster) {
  cat(paste("cluster ", i, ": ", sep = ""))
  s <- sort(kmeans21$centers[i, ], decreasing = TRUE)
  cat(names(s)[1:5], "\n")
}
So far I couldn't find a function to do the same within the mclust package. My data has the following format:
> bic21 <- MclustBIC(m1, G=21)
> emmodel21 <- summary(bic21, data = m1)
With the command
> emmodel21$classification
I can see the cluster for each support ticket, but is there also a way to output the most frequent terms, as in the first code block for kmeans?
I think you can try
summary(mod1, parameters = TRUE)
I just tried the same example from the link:
library(mclust)
data(diabetes)
X <- diabetes[,-1]
BIC <- mclustBIC(X)
mod1 <- Mclust(X, x = BIC)
summary(mod1, parameters = TRUE)
Slightly altering the first example in the vignette:
data(diabetes)
X <- diabetes[,-1]
mod <- Mclust(X)                 # Mclust() fits the model (mclust is the package name)
means <- mod$parameters$mean     # matrix of component means
The means object is now a matrix whose columns contain the means of the clusters.
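Tying this back to the original question, a minimal sketch (an assumption, not from the answers above): the same top-terms loop used for kmeans21 can be applied to the columns of this mean matrix, assuming m1 is the document-term matrix from the question and its columns are the terms.
# Sketch: top 5 terms per mclust component, mirroring the kmeans loop above
mod21 <- Mclust(m1, G = 21)
means21 <- mod21$parameters$mean            # terms x clusters matrix of component means
rownames(means21) <- colnames(m1)           # carry the term names over
for (i in 1:ncol(means21)) {
  cat(paste("cluster ", i, ": ", sep = ""))
  s <- sort(means21[, i], decreasing = TRUE)
  cat(names(s)[1:5], "\n")
}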

Using tm and rpart in R: decision tree for textual data?

I am using the tm package in R to create a corpus of text documents, and I would like to build a decision tree with rpart for classification purposes. However, I can't find any examples online of using textual data with rpart. Is it even possible, or are there other packages I could use?
Here's a starter:
library(tm)
library(rpart)
docs <- c(txt1="Hello world", txt2="lorem ipsum")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)), control = list(weighting = weightBin))
m <- as.matrix(dtm)
train <- as.data.frame(m)
train$Docs <- factor(rownames(m), labels=names(docs))
fit <- rpart(Docs~.,data=train, control = rpart.control(minsplit=1))
test <- data.frame(hello=c(1,0),world=c(0,0),ipsum=c(0,1),lorem=c(0,0), row.names=letters[1:2])
predict(fit, newdata=test, type="class")
# a b
# txt1 txt2
# Levels: txt1 txt2
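To adapt this to an actual classification task, here is a small sketch (the documents and labels are made up, not from the original answer) where the target is a separate class label rather than the document identity:
# Sketch (assumption): same tm + rpart pattern, but predicting a class label
library(tm)
library(rpart)
docs   <- c("win money now", "team meeting at noon",
            "cheap money win now", "project meeting notes")
labels <- factor(c("spam", "ham", "spam", "ham"))
dtm   <- DocumentTermMatrix(Corpus(VectorSource(docs)),
                            control = list(weighting = weightBin))
train <- as.data.frame(as.matrix(dtm))
train$label <- labels
fit <- rpart(label ~ ., data = train,
             control = rpart.control(minsplit = 1, cp = 0))
predict(fit, newdata = train, type = "class")   # predicted class for each document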

Resources