I am working on complaints data analysis, where I am adapting a text summarization technique to strip out unnecessary text and keep only the useful parts.
I have used LDA (Latent Dirichlet Allocation) in R for text summarization, but I am not able to get it working to its full potential.
library(textmineR) # provides CreateTcm, CreateDtm, FitLdaModel, CalcHellingerDist
library(igraph)
library(iterators)
# create a TCM using skip-grams, with a 10-word window
tcm <- CreateTcm(doc_vec = datacopy$Text,skipgram_window = 10,
verbose = FALSE,cpus = 2)
# LDA to get embeddings into probability space
embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 300,
burnin = 180, alpha = 0.1,beta = 0.05, optimize_alpha = TRUE,
calc_likelihood = FALSE,calc_coherence = FALSE, calc_r2 = FALSE,cpus=2)
#Summarizer function
summarizer <- function(doc, gamma) {
# handle multiple docs at once
if (length(doc) > 1 )
return(sapply(doc, function(d) try(summarizer(d, gamma))))
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we know index and order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2)
# remove any documents with 2 or fewer words
#e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e), colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result<-sent[names(ev$vector)[order(ev$vector,decreasing=TRUE)[1:3]]]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")
}
docs <- datacopy$Text[1:10]
names(docs) <- datacopy$Reference[1:10]
sums <- summarizer(docs,gamma = embeddings$gamma)
sums
Error -
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
argument is of length zero
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
argument is of length zero
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
argument is of length zero
Actual text:
it is the council’s responsibility to deal with the loose manhole cover.
Could you provide an update on the next steps taken by the council.
** Trail Mails Text follows - about 50 lines of text**
summarized text:
it is the council’s responsibility to deal with the loose manhole cover.I have read the email thread, please get in contact with the numbers provided by ABC"
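For reference, the nearest-neighbour step inside summarizer() can be looked at in isolation. The following is a minimal sketch in base R on a made-up 4 x 4 similarity matrix (the values and k = 2 are assumptions for illustration; the code above keeps 3 neighbours). It shows that apply() over rows returns the result transposed, and that the pointwise maximum with the transpose restores symmetry.

```r
# Toy similarity matrix for 4 sentences (made-up values, diagonal already 0)
g <- matrix(c(0, 9, 5, 1,
              9, 0, 7, 2,
              5, 7, 0, 8,
              1, 2, 8, 0), nrow = 4, byrow = TRUE)
k <- 2  # neighbours to keep per sentence (assumption for this toy example)

# Zero out everything below each row's k-th largest similarity.
# apply() over rows returns each processed row as a column (transposed).
nn <- apply(g, 1, function(x) {
  x[x < sort(x, decreasing = TRUE)[k]] <- 0
  x
})

# The pointwise max with the transpose makes the matrix symmetric again
nn_sym <- pmax(nn, t(nn))
print(nn_sym)
```

Note that the threshold is taken from position k of the sorted row, so the step implicitly assumes each document has at least k sentences.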
I would like to check the convergence of Sobol' sensitivity indices, using the sensobol library, by re-computing the sensitivity indices using sub-samples of decreasing size extracted from the original sample.
Here, I present an example code using the Ishigami function as model. Since computing the model output takes very long with the model I actually use, I want to avoid recomputing the model output for different sample sizes, but want to use sub-samples of my overall sample for this check.
I have written code that runs, but the result seems to be incorrect as soon as the sample size is not equal to the initial sample size.
Initial set-up
library(sensobol)
# Define settings
matrices <- c("A", "B", "AB", "BA")
N <- 1000
params <- paste("X", 1:3, sep = "")
first <- total <- "azzini"
order <- "first"
R <- 10
type <- "percent"
conf <- 0.95
# Create sample matrix using Sobol' (1967) quasi-random numbers
mat <- sobol_matrices(matrices = matrices, N = N, params = params, order = order, type = "QRN")
# Compute model output using Ishigami function as model
Y <- ishigami_Fun(mat)
Correct Sobol' indices as benchmark result
# Compute and bootstrap Sobol' indices for entire sample N
ind <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
Y = Y,
N = N,
params = params,
boot = TRUE,
first = "azzini",
total = "azzini",
order = "first",
R = R,
type = type,
conf = conf)
cols <- colnames(ind)[1:length(params)]
ind[ , (cols):= round(.SD, 3), .SDcols = (cols)]
Check for convergence
Now, to analyze whether convergence is reached, I want to re-compute the sensitivity indices using sub-samples of decreasing size extracted from the original sample:
# function to compute sensitivity indices, depending on the sample size and the model output vector
fct_conv <- function(N, Y) {
# compute how many model runs are performed in the case of the Azzini estimator
nr_model_runs <- 2*N*(length(params)+1) # length(params) = k
# extract sub-sample of model output
y_sub <- Y[1:nr_model_runs]
# compute and bootstrap Sobol' indices
ind_sub <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
Y = y_sub,
N = N,
params = params,
boot = TRUE,
first = "azzini",
total = "azzini",
order = "first",
R = R,
type = type,
conf = conf)
cols <- colnames(ind_sub)[1:length(params)]
ind_sub[ , (cols):= round(.SD, 3), .SDcols = (cols)]
return(ind_sub)
}
Let's compare the benchmark result (ind) to two other outputs: Running fct_conv with the full sample (ind_full_sample) and running fct_conv with a very slightly reduced sample (ind_red_sample).
ind_full_sample <- fct_conv(1000, Y)
ind_red_sample <- fct_conv(999, Y)
ind
ind_full_sample
ind_red_sample
It seems that as soon as the sample size is reduced, the result doesn't make sense. Why is that? I'd be glad for any hints or ideas!
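The results do not make sense because you are sub-sampling without considering the ordering of the sample matrix: Y is stacked block by block (one block of N values per matrix A, B, AB, BA), so taking the first elements of Y does not select a subset of rows. Try the following: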
# Load the required packages:
library(sensobol)
library(data.table)
library(ggplot2)
# Create function to swiftly check convergence (you do not need bootstrap)
sobol_convergence <- function(Y, N, sample.size, seed = 666) {
dt <- data.table(matrix(Y, nrow = N))
set.seed(seed) # To permit replication
subsample <- unlist(dt[sample(.N, sample.size)], use.names = FALSE)
ind <- sobol_indices(matrices = matrices,
Y = subsample,
N = sample.size,
params = params,
first = first,
total = total,
order = order)
return(ind)
}
# Define sequence of sub-samples at which you want to check convergence
sample.size <- seq(100, 1000, 50) # every 50
# Run function
ind.list <- lapply(sample.size, function(n)
sobol_convergence(Y = Y, N = N, sample.size = n))
# Extract total number of model runs C and results in each run
Cost <- indices <- list()
for(i in 1:length(ind.list)) {
Cost[[i]] <- ind.list[[i]]$C
indices[[i]] <- ind.list[[i]]$results
}
names(indices) <- Cost
# Final dataset
final.dt <- rbindlist(indices, idcol = "Cost")[, Cost:= as.numeric(Cost)]
# Plot results
ggplot(final.dt, aes(Cost, original, color = sensitivity)) +
geom_line() +
labs(x = "Total number of model runs", y = "Sobol' indices") +
facet_wrap(~parameters) +
theme_bw()
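To see why the row-wise sub-sampling matters, here is a minimal sketch in base R that does not need sensobol. It assumes, as the matrix(Y, nrow = N) call in the answer implies, that Y holds one block of N values per sample matrix; the block labels and toy sizes are made up for illustration.

```r
# Toy stand-in for the model output: one block of N values per sample matrix
N <- 5
blocks <- c("A", "B", "AB1", "BA1")  # hypothetical block labels
Y <- paste0(rep(blocks, each = N), "_row",
            rep(seq_len(N), times = length(blocks)))

n <- 3  # reduced sample size

# Naive truncation: the first n * (number of blocks) elements of Y
naive <- Y[seq_len(n * length(blocks))]

# Row-wise selection: reshape to N rows (one column per block), keep n rows
Ymat <- matrix(Y, nrow = N)
rowwise <- as.vector(Ymat[seq_len(n), ])

print(naive)    # all of A, all of B, part of AB1, nothing from BA1
print(rowwise)  # rows 1..n of every block
```

With the naive truncation the later blocks are partly or entirely missing, so the estimator ends up pairing values that belong to different matrices.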
I'm trying the same code as in https://thiloshon.wordpress.com/2018/03/11/build-your-own-word-sentence-prediction-application-part-02/ to do word-level prediction. The input text data is also available at that link, and I use the en_US.news.txt file as my only input file.
library(quanteda)
library(data.table)
#read the .txt file
df=readLines('en_US.news.txt')
#take a sample of the df
sampleHolderNews <- sample(length(df), length(df) * 0.1)
US_News_Sample <- df[sampleHolderNews]
#build the corpus of the data
corp <- corpus(US_News_Sample)
#Preprocessing
master_Tokens <- tokens(x = tolower(corp), remove_punct = TRUE,
remove_numbers = TRUE, remove_hyphens = TRUE, remove_symbols = TRUE)
stemed_words <- tokens_wordstem(master_Tokens, language = "english")
#tokenization#
bi_gram <- tokens_ngrams(stemed_words, n = 2)
tri_gram <- tokens_ngrams(stemed_words, n = 3)
uni_DFM <- dfm(stemed_words)
bi_DFM <- dfm(bi_gram)
tri_DFM <- dfm(tri_gram)
uni_DFM <- dfm_trim(uni_DFM, 3)
bi_DFM <- dfm_trim(bi_DFM, 3)
tri_DFM <- dfm_trim(tri_DFM, 3)
sums_U <- colSums(uni_DFM)
sums_B <- colSums(bi_DFM)
sums_T <- colSums(tri_DFM)
# Create data tables with individual words as columns
uni_words <- data.table(word_1 = names(sums_U), count = sums_U)
bi_words <- data.table(
word_1 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 2),
count = sums_B)
tri_words <- data.table(
word_1 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 2),
word_3 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 3),
count = sums_T)
#indexing#
setkey(uni_words, word_1)
setkey(bi_words, word_1, word_2)
setkey(tri_words, word_1, word_2, word_3)
######## Finding Bi-Gram Probability #################
discount_value <- 0.75
# Finding number of bi-gram words
numOfBiGrams <- nrow(bi_words[.(word_1, word_2)])
# Dividing number of times word 2 occurs as second part of bigram, by total number of bigrams.
# Finding probability for a word given the number of times it was second word of a bigram
ckn <- bi_words[, .(Prob = ((.N) / numOfBiGrams)), by = word_2]
setkey(ckn, word_2)
# Assigning the probabilities as second word of bigram, to unigrams
uni_words[, Prob := ckn[word_1, Prob]]
uni_words <- uni_words[!is.na(uni_words$Prob)]
# Finding number of times word 1 occurred as word 1 of bi-grams
n1wi <- bi_words[, .(N = .N), by = word_1]
setkey(n1wi, word_1)
# Assigning total times word 1 occurred to bigram cn1
bi_words[, Cn1 := uni_words[word_1, count]]
# Kneser-Ney Algorithm
bi_words[, Prob := ((count - discount_value) / Cn1 + discount_value / Cn1 *
n1wi[word_1, N] * uni_words[word_2, Prob])]
######## End of Finding Bi-Gram Probability #################
######## Finding Tri-Gram Probability #################
# Finding count of word1-word2 combination in bigram
tri_words[, Cn2 := bi_words[.(word_1, word_2), .N]]
n1w12 <- tri_words[, .N, by = .(word_1, word_2)]
setkey(n1w12, word_1, word_2)
# Kneser-Ney Algorithm
tri_words[, Prob := ((count - discount_value) / Cn2 + discount_value / Cn2 *
n1w12[.(word_1, word_2), .N] * bi_words[.(word_1, word_2), Prob])]
Here I get the following error from the Kneser-Ney algorithm for trigrams:
Error in `[.data.table`(tri_words, , `:=`(Prob, ((count - discount_value)/Cn2 + :
Supplied 13867 items to be assigned to 3932 items of column 'Prob'. If you wish to 'recycle'
the RHS please use rep() to make this intent clear to readers of your code.
In addition: Warning messages:
1: In discount_value/Cn2 * n1w12[list(word_1, word_2), .N] * bi_words[list(word_1, :
longer object length is not a multiple of shorter object length
2: In (count - discount_value)/Cn2 + discount_value/Cn2 * n1w12[list(word_1, :
longer object length is not a multiple of shorter object length
I could find some similar questions related to this data.table error, but I can't understand how I should fix it in my code.
The problem is in your attempt to multiply the quantities in the last line. This expression:
(count - discount_value) / Cn2 + discount_value / Cn2
is length 20, like tri_words. But the next expression
n1w12[.(word_1, word_2), .N]
is length 19. Then the last part,
bi_words[.(word_1, word_2), Prob]
is length 155 (and contains a lot of NAs).
The error messages are saying that the shorter item cannot be recycled into the longer item because the longer item's length is not a multiple of the length of the shorter item. To fix this, you need to implement this algorithm more carefully.
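One way to keep every piece the same length is to make each lookup return exactly one value per trigram row. The following is a minimal sketch in base R using match() on toy tables; the counts are made up and the column names mirror the question. It is not a full Kneser-Ney implementation, only an illustration of an aligned lookup.

```r
# Toy trigram and bigram tables (made-up counts, for illustration only)
tri <- data.frame(word_1 = c("a", "a", "b"),
                  word_2 = c("b", "b", "c"),
                  word_3 = c("c", "d", "a"),
                  count  = c(4, 2, 3),
                  stringsAsFactors = FALSE)
bi <- data.frame(word_1 = c("a", "b", "c"),
                 word_2 = c("b", "c", "a"),
                 count  = c(6, 3, 5),
                 stringsAsFactors = FALSE)

# Build one key per row, then find the matching bigram row for each trigram
tri_key <- paste(tri$word_1, tri$word_2, sep = "_")
bi_key  <- paste(bi$word_1, bi$word_2, sep = "_")
idx <- match(tri_key, bi_key)  # exactly one index (or NA) per trigram row

# The looked-up vector now has the same length as the trigram table
tri$Cn2 <- bi$count[idx]
print(tri)
```

Because match() returns one position per element of its first argument, the result can be combined element-wise with the other columns of the trigram table without any recycling.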
I have three dataframes created from different ngram counts (Uni, Bi , Tri) each data frame contains the separated ngram, frequency counts (n) and have added probability using smoothing.
I have written three functions to look through the tables and return the most probable word based on an input string, and have bound them together in a single function:
##Prediction Model
trigramwords <- function(FirstWord, SecondWord, n = 5 , allow.cartesian =TRUE) {
probword <- trigramtable[.(FirstWord, SecondWord), allow.cartesian = TRUE][order(-Prob)]
if(any(is.na(probword)))
return(bigramwords(SecondWord, n))
if(nrow(probword) > n)
return(probword[1:n, ThirdWord])
count <-nrow(probword)
bgramwords <- bigramwords(SecondWord, n)[1:(n - count)]
return(c(probword[, ThirdWord], bgramwords))
}
bigramwords <- function(FirstWord, n = 5 , allow.cartesian = TRUE){
probword <- bigramtable[FirstWord][order(-Prob)]
if(any(is.na(probword)))
return(Unigramword(n))
if (nrow(probword) > n)
return(probword[1:n, SecondWord])
count <- nrow(probword)
word1 <- Unigramword(n)[1:(n - count)]
return(c(probword[, SecondWord], word1))
}
##Back off Model
Unigramword <- function(n = 5, allow.cartesian = TRUE){
return(sample(UnigramTable[, FirstWord], size = n))
}
## Bind Functions
predictword <- function(str) {
require(quanteda)
tokens <- tokens(x = char_tolower(str))
tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
words <- trigramwords(tokens[1], tokens[2], 5)
chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")
print(words[1])
}
However, I receive the following warning message and the output is always the same word. If I use only the bigramwords function it works fine, but when I add the trigram function I get the warning message. I believe it is because 1:n is not defined correctly.
Warning message:
In 1:n : numerical expression has 5718534 elements: only the first used
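A possible explanation, assuming the tables really do contain a frequency column named n as described above: inside the square brackets of a data.table, column names are looked up before function arguments, so 1:n would use the whole column n rather than the argument n = 5. The sketch below reproduces that kind of name masking in base R with with(), which is only a stand-in for data.table's scoping; the table and the helper top_words are hypothetical.

```r
# Toy table with a frequency column called 'n' (as described in the question)
tab <- data.frame(word = c("the", "of", "and", "to"),
                  n    = c(40L, 30L, 20L, 10L),
                  stringsAsFactors = FALSE)

top_words <- function(tab, n = 2) {
  # with() resolves names in the table's columns first, so 'n' below is the
  # column (length 4), not the argument; 1:n then uses only its first element
  masked <- suppressWarnings(with(tab, 1:n))
  # head() is given the function argument n directly, so nothing is masked
  list(masked_length = length(masked), words = head(tab, n)$word)
}

res <- top_words(tab, n = 2)
print(res)
```

If this is indeed the cause, renaming either the count column or the function argument, or selecting rows with head(), should avoid the clash.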
I have a function that needs a matrix and a vector as arguments. I will extract the matrices and vectors from a data.matrix() and a data.frame(), respectively.
for (i in 1:3) {
assign(paste("vavc", i, sep = ""),as.numeric(inputvar[i,-1]));
assign(paste("cor", i, sep = ""),matrix(input[which(ArtID ==i),-1],nrow = 2 ))
}
What I want now is to apply the resulting variables to the function cor2cov() (the function is pasted below the divider in the code section at the end of this post, where you can also find the input for creating a minimum reproducible example).
For example: cor2cov(cor1,vavc1)
I tried to incorporate the following code into the for-loop
cor2cov(noquote(paste("cor", i, sep = "")),noquote(paste("vavc", i, sep = "")))
Which gives me an error:
#Error in cor2cov(noquote(paste("cor", 1, sep = "")), noquote(paste("vavc", :
'corMat must be a matrix
Basically the function doesn't get the right arguments.
Any help is appreciated.
Function and minimum code follows now:
ArtID = c(1,2,3)
AC_AC = c(1,1,1)
MKT_AC = c(0.5,0.6,0.2)
AC_MKT = c(0.5,0.6,0.2)
MKT_MKT = c(1,1,1)
input = data.frame(ArtID, AC_AC, MKT_AC, AC_MKT, MKT_MKT)
input <- data.matrix(input)
#Now we need to create the variance vectors
#Create data.frame for testing the varvector-creation loop
ArtIDv = c(1,2,3)
Varvec1 = c(0.3, 0.6, 0.35)
Varvec2 = c(0.15, 0.19, 0.21)
inputvar = data.frame(ArtIDv,Varvec1,Varvec2)
for (i in 1:3) {
assign(paste("vavc", i, sep = ""),as.numeric(inputvar[i,-1]));
assign(paste("cor", i, sep = ""),matrix(input[which(ArtID ==i),-1],nrow = 2 ))
}
-------------------------
2) Incorporate the cor2cov()-Function into R by copy-pasting the following code:
# Goal: convert a correlation matrix and variance vector
# into the corresponding covariance matrix
#
# Input:
# 'corMat' is a square matrix with 1's on the diagonal
# and valid correlations on the off-diagonal
# 'varVec' is a valid variance vector, with length
# matching the dimension of 'corMat'. A single
# row or single column matrix is also allowed.
# Output:
# the covariance matrix
#
# A warning is given if the covariance matrix is not
# positive definite.
#
cor2cov = function(corMat, varVec) {
# test the input
if (!is.matrix(corMat)) stop("'corMat must be a matrix")
n = nrow(corMat)
if (ncol(corMat) != n) stop("'corMat' must be square")
if (mode(corMat) != "numeric") stop("'corMat must be numeric")
if (mode(varVec) != "numeric") stop("'varVec must be numeric")
if (!is.null(dim(varVec))) {
if (length(dim(varVec)) != 2) stop("'varVec' should be a vector")
if (!any(dim(varVec)==1)) stop("'varVec' cannot be a matrix")
varVec = as.numeric(varVec) # convert row or col matrix to a vector
}
if (!all(diag(corMat) == 1)) stop("correlation matrices have 1 on the diagonal")
if (any(corMat < -1 | corMat > +1))
stop("correlations must be between -1 and 1")
if (any(varVec <= 0)) stop("variances must be positive")
if (length(varVec) != n) stop("length of 'varVec' does not match 'corMat' size")
# Compute the covariance
sdMat = diag(sqrt(varVec))
rtn = sdMat %*% corMat %*% t(sdMat)
if (det(rtn)<=0) warning("covariance matrix is not positive definite")
return(rtn)
}
#The cor2cov-Function will now be available in your global environment.
I think you want to use get(...) instead of noquote(...) to refer to the variables dynamically:
> cor2cov(get(paste("cor", i, sep = "")), get(paste("vavc", i, sep = "")))
[,1] [,2]
[1,] 0.35000000 0.05422177
[2,] 0.05422177 0.21000000
The get() function takes a string and returns the R variable or function with that name, if it exists. By default it starts searching in the environment it is called from and then moves up through the enclosing environments.
> x = 'ls'
> class(ls)
[1] "function"
> class(get('ls'))
[1] "function"
The noquote() function, on the other hand, returns a string that is merely printed without quotes:
> noquote('ls')
[1] ls
> class(noquote('ls'))
[1] "noquote"
> noquote('ls') == 'ls'
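As an alternative to assign() and get(), a common R idiom is to keep the per-article objects in lists and iterate over them with Map(). The following is a minimal sketch using the data from the question; to_cov is a hypothetical stand-in that repeats only the core computation of cor2cov() without its input checks, so that the example is self-contained.

```r
# Data as in the question
input <- data.matrix(data.frame(ArtID = c(1, 2, 3),
                                AC_AC = c(1, 1, 1),
                                MKT_AC = c(0.5, 0.6, 0.2),
                                AC_MKT = c(0.5, 0.6, 0.2),
                                MKT_MKT = c(1, 1, 1)))
inputvar <- data.frame(ArtIDv = c(1, 2, 3),
                       Varvec1 = c(0.3, 0.6, 0.35),
                       Varvec2 = c(0.15, 0.19, 0.21))

# Hypothetical stand-in for cor2cov(): core computation only, no checks
to_cov <- function(corMat, varVec) {
  sdMat <- diag(sqrt(varVec))
  sdMat %*% corMat %*% t(sdMat)
}

ids <- 1:3
# One correlation matrix and one variance vector per article, held in lists
cor_list  <- lapply(ids, function(i) matrix(input[input[, "ArtID"] == i, -1], nrow = 2))
vavc_list <- lapply(ids, function(i) as.numeric(inputvar[i, -1]))

# Apply the function pairwise over both lists
cov_list <- Map(to_cov, cor_list, vavc_list)
print(cov_list[[3]])
```

In real use, cor2cov could be passed to Map() in place of to_cov; keeping the objects in lists avoids building variable names from strings altogether.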
Normally when topic modeling I use something along the lines of:
matrix <- create_matrix(cbind(as.vector(lda_data)), language="english", removeNumbers=TRUE, weighting=weightTf)
k <- 20 #Hardcoded temp value
lda <- LDA(matrix, k, method = "Gibbs", control = list(iter = 1000, burnin = 1000))
Terms <- terms(lda, 20)
But with a mid sized data set (3.2M rows) I get the following error calculating the matrix:
Warning message:
In nr * nc : NAs produced by integer overflow
Error in as.matrix(textColumns) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
Is there a different library/approach that avoids this error? (The code works fine on small data sets)
Alternatively, when using a TermDocumentMatrix as the matrix for the LDA, my resulting Terms are entirely numerical. Is there a way to get strings (words) instead?
I've used an alternate approach to creating the matrix which works on the large data set:
dtm <- DocumentTermMatrix(donation_message,
control = list(stemming = TRUE, stopwords = TRUE,
removeNumbers = TRUE, removePunctuation = TRUE))
dtm <- removeSparseTerms(dtm, 0.99)
rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each Document
dtm <- dtm[rowTotals> 0, ] #Remove all docs without words
k <- 20 #Hardcoded temp value
lda <- LDA(dtm, k, method = "Gibbs", control = list(iter = 1000, burnin = 1000)) #seed = 1000, thin = 100
Terms <- terms(lda, 20)
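For context on the original warning: "NAs produced by integer overflow" in nr * nc arises when the number of cells of a dense matrix exceeds R's largest integer. A minimal sketch in base R follows; the row count is taken from the question, while the vocabulary size of 1000 is a made-up assumption.

```r
nr <- 3200000L  # number of documents (rows), from the question
nc <- 1000L     # hypothetical vocabulary size (columns)

# Integer multiplication overflows and yields NA (with a warning)
cells_int <- suppressWarnings(nr * nc)

# The same product in double precision is fine, but it shows that a dense
# matrix would need more cells than .Machine$integer.max
cells_dbl <- as.numeric(nr) * nc

print(cells_int)
print(cells_dbl > .Machine$integer.max)
```

This suggests why keeping the document-term matrix sparse, and dropping rare terms with removeSparseTerms(), sidesteps the problem.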