I would like to append two Document-Term Matrices together. I have one row of data and would like to apply different control functions to its parts (an n-gram tokenizer, stopword removal, and wordLengths bounds for the text fields; none of these for my non-text fields).
When I use tm_combine, c(dtm_text, dtm_inputs), it adds the second set as a new row. I want to append these terms to the same row instead.
library("tm")
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
# NGram tokenize text data
dtm_text <- DocumentTermMatrix(Corpus(VectorSource(txt_fields)),
                               control = list(
                                 tokenize = BigramTokenizer,
                                 stopwords = TRUE,
                                 wordLengths = c(2, Inf),
                                 bounds = list(global = c(1, Inf))))
# Do not perform tokenization of other inputs
dtm_inputs <- DocumentTermMatrix(Corpus(VectorSource(other_inputs)),
                                 control = list(
                                   bounds = list(global = c(1, Inf))))
# DESIRED OUTPUT
<<DocumentTermMatrix (documents: 1, terms: 12)>>
Non-/sparse entries: 12/0
Sparsity           : 0%
Maximal term length: 13
Weighting          : term frequency (tf)

    Terms
Docs am happy happy like like your love love your products products am store store love
   1        1     1    1         1    1         1        1           1     1          1
    Terms
Docs your products your store cd1_abc cd2_555 cd3_7654
   1             1          1       1       1        1
I suggest using text2vec (but I'm biased, since I'm the author).
library(text2vec)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
stopwords = tm::stopwords("en")
# tokenize by whitespace
txt_tokens = strsplit(txt_fields, ' ', TRUE)
vocab = create_vocabulary(itoken(txt_tokens), ngram = c(1, 2), stopwords = stopwords)
# if you need word-length bounds:
# vocab$vocab = vocab$vocab[nchar(terms) > 1]
# but note, this will not remove "i_am", etc.
# you can add the word "i" to stopwords to remove such terms
txt_vectorizer = vocab_vectorizer(vocab)
dtm_text = create_dtm(itoken(txt_fields), vectorizer = txt_vectorizer)
# also tokenize by whitespace, but bigrams won't be created in the next step
other_tokens = strsplit(other_inputs, ' ', TRUE)
vocab_other = create_vocabulary(itoken(other_tokens))
other_vectorizer = vocab_vectorizer(vocab_other)
dtm_other = create_dtm(itoken(other_tokens), vectorizer = other_vectorizer)
# combine and convert back to a tm DocumentTermMatrix
dtm_combined = as.DocumentTermMatrix(cbind(dtm_text, dtm_other), weighting = weightTf)
inspect(dtm_combined)
# <<DocumentTermMatrix (documents: 1, terms: 8)>>
# Non-/sparse entries: 8/0
# Sparsity : 0%
# Maximal term length: 8
# Weighting : term frequency (tf)
#
# Terms
# Docs happy like love products store cd1_abc cd2_555 cd3_7654
# 1 1 1 1 1 1 1 1 1
But it will give wrong results if you have the same terms in dtm_text and dtm_other. These terms won't be combined and will appear twice in dtm_combined.
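If that is a concern, one workaround (a minimal sketch, not part of the answer above; it assumes the combined matrix is small enough to densify) is to collapse duplicate column names by summing their counts before converting back to a tm object:
# Hedged sketch: merge columns that share a term name by summing their counts
m <- as.matrix(cbind(dtm_text, dtm_other))
m_merged <- t(rowsum(t(m), group = colnames(m)))
dtm_combined <- as.DocumentTermMatrix(m_merged, weighting = weightTf)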
I have a document-term matrix dtm, for example:
dtm
<<DocumentTermMatrix (documents: 50, terms: 50)>>
Non-/sparse entries: 220/2497
Sparsity : 100%
Maximal term length: 7
Weighting : term frequency (tf)
Now I want to transform it into a list of matrices, each representing a document. This is the format required by the stm package:
[[1]]
     [,1] [,2] [,3] [,4]
[1,]   23   33   42  117
[2,]    2    1    3    1

[[2]]
     [,1] [,2] [,3] [,4]
[1,]    2   19   93  168
[2,]    2    2    1    1
I am thinking of finding all the non-zero entries in dtm and turning them into matrices, one row at a time, so:
mat = matrix()
dtm.to.mat = function(x) {
  mat[1,] = x[x != 0]
  mat[2,] = colnames(x[x != 0])
  return(mat)
}
matrix = list(apply(dtm, 1, dtm.to.mat))
However,
x[x != 0]
just won't work. The error says:
$ operator is invalid for atomic vectors
I was wondering why this is the case. If I change x to a matrix beforehand, it won't give me this error. However, I actually have a dtm with approximately 2,500,000 rows, so I fear this will be very inefficient.
Me again!
I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can pass the documents as raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish.
Suppose you have a dataframe df with a column df$documents containing your raw text and df$meta containing your covariate:
processed <- textProcessor(df$documents, metadata = df$meta, lowercase = TRUE,
                           removestopwords = TRUE, removenumbers = TRUE,
                           removepunctuation = TRUE, stem = TRUE,
                           wordLengths = c(3, Inf))

stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
              K = 50, prevalence = ~ meta, init.type = "Spectral", seed = 57468)
This will run a 50 topic STM.
textProcessor will deal with empty documents and their associated metadata.
Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.
Also, the metadata argument can take a dataframe if you have multiple covariates. In that case you would also need to modify the prevalence argument in the stm() call accordingly.
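For example, with two covariates (the column names rating and day here are purely hypothetical, just to illustrate the shape of the call):
# Hypothetical covariates, only to illustrate a multi-covariate prevalence formula
meta_df <- data.frame(rating = df$rating, day = df$day)
processed <- textProcessor(df$documents, metadata = meta_df)
stm_fit <- stm(documents = processed$documents, vocab = processed$vocab,
               K = 50, prevalence = ~ rating + day,
               data = processed$meta, init.type = "Spectral")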
If you have something tricky like this, I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
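A minimal sketch of the quanteda route (the example text is made up; the converter is quanteda::convert):
library(quanteda)
library(stm)
txt <- c(d1 = "some raw example text", d2 = "more raw example text here")
qdfm <- dfm(tokens(txt))                # document-feature matrix
stm_input <- convert(qdfm, to = "stm")  # list with $documents, $vocab, $meta
str(stm_input$documents[[1]])           # a 2-row matrix: term indices and counts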
I have the following dataset (obtained here):
          item survivalpoints weight
1  pocketknife             10      1
2        beans             20      5
3     potatoes             15     10
4       unions              2      1
5 sleeping bag             30      7
6         rope             10      5
7      compass             30      1
I can cluster this dataset into three clusters with kmeans() using a binary string as my initial choice of centers. For example:
## 1 represents the initial centers
chromosome = c(1,1,1,0,0,0,0)
## exclude first column (kmeans only supports continuous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$cluster
# [1] 1 3 3 1 2 1 2
Using this fundamental concept, I tried it out with the GA package to conduct the search, where I am trying to optimize (minimize) the Davies-Bouldin (DB) index.
library(GA)         ## for the ga() function
library(clusterSim) ## for the index.DB() function

## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
  ## converting matrix to vector to access each row
  binary_rep <- split(x, row(x))
  ## evaluate the fitness of each chromosome
  for (each in 1:nrow(x)) {
    cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
    dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
    ## minimizing DB
    return(-dbi)
  }
}

g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
Of course (I have no idea what's happening), I received the error message:
Warning messages:
Error in row(x) : a matrix-like object is required as argument to 'row'
Here are my questions:
How can I correctly use the GA package to solve my problem?
How can I make sure the randomly generated chromosomes contain the same number of 1s as the number of clusters k (e.g. if k = 3 then each chromosome must contain exactly three 1s)?
I can't comment on the sense of combining k-means with GA, but I can point out that you had an issue in your fitness function. Also, errors are produced when all genes are on or off, so fitness is only calculated when that is not the case:
DBI <- function(x) {
  if (sum(x) == nrow(dataset) | sum(x) == 0) {
    score <- 0
  } else {
    cl <- kmeans(dataset[, -1], dataset[x == 1, -1])
    dbi <- index.DB(dataset[, -1], cl = cl$cluster, centrotypes = "centroids")
    score <- dbi$DB
  }
  return(score)
}

g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)
g@solution
g@fitnessValue
It looks like several gene combinations produced the same "best" fitness value.
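On the second question (forcing exactly k ones per chromosome), which the answer above does not address, one simple option is to leave the GA operators alone and give any invalid chromosome a very poor fitness so it never survives selection. A sketch, assuming k = 3 and negating DB so that maximising fitness minimises the index:
k <- 3   # desired number of clusters (assumption for this sketch)
DBI_k <- function(x) {
  if (sum(x) != k) return(-1e6)   # discard chromosomes without exactly k centers
  cl  <- kmeans(dataset[, -1], dataset[x == 1, -1])
  dbi <- index.DB(dataset[, -1], cl = cl$cluster, centrotypes = "centroids")
  -dbi$DB                         # ga() maximises, so negate to minimise DB
}
g <- ga(type = "binary", fitness = DBI_k, popSize = 100, nBits = nrow(dataset))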
I have a corpus of 11 text documents. I have found word associations using the commands:
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
findAssocs(dtms, "corruption", corlimit=0.9)
dtm is a document term matrix.
dtm <- DocumentTermMatrix(docs)
where docs is the corpus.
dtms is the document term matrix after removing 10% sparse terms.
dtms <- removeSparseTerms(dtm, 0.1)
I would like to plot the correlated terms I got against (i) two specific words and (ii) one specific word.
I tried following this post : Plot highly correlated words against a specific word of interest
toi <- "corruption" # term of interest
corlimit <- 0.9 # lower correlation bound limit.
cor_0.9 <- data.frame(corr = findAssocs(dtm, toi, corlimit)[,1],terms=row.names(findAssocs(dtm, toi, corlimit)))
But unfortunately the code :
cor_0.9 <- data.frame(corr = findAssocs(dtm, toi, corlimit)[,1],terms=row.names(findAssocs(dtm, toi, corlimit)))
gives me an error:
Error in findAssocs(dtm, toi, corlimit)[, 1] : incorrect number of dimensions
This is the structure of the document term matrix:
dtm
<<DocumentTermMatrix (documents: 11, terms: 1847)>>
Non-/sparse entries: 8024/12293
Sparsity : 61%
Maximal term length: 23
Weighting : term frequency (tf)
and in the environment it is of the form:
dtm  List of 6
 $ i       : int [1:8024] 1 1 1 1 1 ...
 $ j       : int [1:8024] 17 29 34 43 47 ...
 $ v       : num [1:8024] 9 4 9 5 5 ...
 $ nrow    : int 11
 $ ncol    : int 1847
 $ dimnames: List of 2
  ..$ Docs : chr [1:11] "character (0)" "character (0)" "character (0)" ...
  ..$ Terms: chr [1:1847] "campaigning"| __truncated__ "a"| __truncated__ ...
 - attr(*, "class")    = chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
How do I plot word correlations for a single word and multiple words? Please help.
Here is the output of
findAssocs(dtm, c("youngster","campaign"), corlimit=0.9)
$youngster
character colleges controversi expect corrupt much
1.00 1.00 1.00 1.00 0.99 0.99
okay saritha existing leads satisfi social
0.99 0.99 0.98 0.98 0.98 0.98
$campaign
basic make lack internal general method satisfied time
0.95 0.95 0.94 0.93 0.92 0.92 0.92 0.92
A slightly different approach is required for two words, here's a quick attempt:
require(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
# Compute correlations and store in data frame...
toi1 <- "oil" # term of interest
toi2 <- "winter"
corlimit <- 0.7 # lower correlation bound limit.
corr1 <- findAssocs(tdm, toi1, corlimit)[[1]]
corr1 <- cbind(read.table(text = names(corr1), stringsAsFactors = FALSE), corr1)
corr2 <- findAssocs(tdm, toi2, corlimit)[[1]]
corr2 <- cbind(read.table(text = names(corr2), stringsAsFactors = FALSE), corr2)
# join them together
library(dplyr)
two_terms_corrs <- full_join(corr1, corr2)
# gather for plotting
library(tidyr)
two_terms_corrs_gathered <- gather(two_terms_corrs, term, correlation, corr1:corr2)
# insert the actual terms of interest so they show up on the legend
two_terms_corrs_gathered$term <- ifelse(two_terms_corrs_gathered$term == "corr1", toi1, toi2)
# Draw the plot...
require(ggplot2)
ggplot(two_terms_corrs_gathered, aes(x = V1, y = correlation, colour = term ) ) +
geom_point(size = 3) +
ylab(paste0("Correlation with the terms ", "\"", toi1, "\"", " and ", "\"", toi2, "\"")) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
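For the single-word case (point (ii) in the question), the same idea works without the join; a quick sketch reusing tdm and corlimit from above:
toi   <- "oil"
corrs <- findAssocs(tdm, toi, corlimit)[[1]]
df1   <- data.frame(term = names(corrs), correlation = unname(corrs))
ggplot(df1, aes(x = term, y = correlation)) +
  geom_point(size = 3) +
  ylab(paste0("Correlation with the term ", "\"", toi, "\"")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))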
I have 2 vectors of dimension 6 and I would like to get a number between 0 and 1.
a=c("HDa","2Pb","2","BxU","BuQ","Bve")
b=c("HCK","2Pb","2","09","F","G")
Can anyone explain what I should do?
Using the lsa package, following the manual for this package:
# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
EDIT: show what the myMatrix object looks like
myMatrix
#myMatrix
# docs
# terms D1 D2
# 2 1 1
# 2pb 1 1
# buq 1 0
# bve 1 0
# bxu 1 0
# hda 1 0
# 09 0 1
# f 0 1
# g 0 1
# hck 0 1
# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
You need a dictionary of possible terms first, and then convert your vectors to binary vectors with a 1 in the positions of the corresponding terms and 0 elsewhere. If you name the new vectors a2 and b2, you can approximate the cosine similarity with cor(a2, b2), but note that the correlation ranges between -1 and 1. You could map it to [0,1] with something like 0.5*cor(a2, b2) + 0.5.
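A minimal sketch of that idea, computing the cosine directly (dot product over the norms) rather than via cor(); with the vectors from the question it reproduces the 0.333 obtained with lsa above:
a <- c("HDa","2Pb","2","BxU","BuQ","Bve")
b <- c("HCK","2Pb","2","09","F","G")
dict <- union(a, b)               # dictionary of all possible terms
a2 <- as.integer(dict %in% a)     # binary indicator vectors
b2 <- as.integer(dict %in% b)
sum(a2 * b2) / (sqrt(sum(a2^2)) * sqrt(sum(b2^2)))
# [1] 0.3333333  (2 shared terms out of 6 in each vector)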
library(tm)
CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
This may be the better option for a larger data set.
An advanced form of embedding might help you get better output. Please check the following code.
It is a Universal Sentence Encoder model that generates sentence embeddings using a transformer-based architecture.
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model([input])

paragraph = [
    "Universal Sentence Encoder embeddings also support short paragraphs. ",
    "Universal Sentence Encoder support paragraphs"]
messages = [paragraph]

print(np.inner(embed(paragraph[0]), embed(paragraph[1])))
I am creating a trigram and a quadgram model using RWeka. I notice an odd behavior.
For the trigram:
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))
> dim(tdm)
[1] 1540099 3
> tdm
<<TermDocumentMatrix (terms: 1540099, documents: 3)>>
Non-/sparse entries: 1548629/3071668
Sparsity : 66%
Maximal term length: 180
Weighting : term frequency (tf)
When I remove sparse terms, it shrinks the above ~1.5 million rows to 8307:
> b <- removeSparseTerms(tdm, 0.66)
> dim(b)
[1] 8307 3
For the quadgram, removal does not affect it at all:
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer))
<<TermDocumentMatrix (terms: 1427403, documents: 3)>>
Non-/sparse entries: 1427936/2854273
Sparsity : 67%
Maximal term length: 185
Weighting : term frequency (tf)
> dim(tdm)
[1] 1427403 3
> tdm <- removeSparseTerms(tdm, 0.67)
> dim(tdm)
[1] 1427403 3
It still has ~1.4 million terms after removal of sparse terms.
This does not look right.
Please let me know if I am doing something wrong.
Regards
Ganesh
This is weird. The logical behaviour would be that removing sparse terms removes a lot in both cases, as trigrams and quadgrams are less common than single grams. Do you have any other QuadgramTokenizer object in your session? Your original function is defined with a small "q", quadgramTokenizer, but the control list uses QuadgramTokenizer. I am wondering why it did not show an error; it might have treated the tokenizer as empty.
I think it must be something as simple as this. Check this out, and if not, I'll run it with a data sample and see what could be wrong here.
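If that is indeed the cause, a minimal sketch of the fix is simply to make the name used in the control list match the function you actually defined (or vice versa):
# Define the tokenizer under the exact name used in the control list
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tdm4 <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer))
tdm4 <- removeSparseTerms(tdm4, 0.67)
dim(tdm4)   # should now shrink, as it did in the trigram case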