data.table error and warnings for finding trigram probability

I'm trying the same code as in https://thiloshon.wordpress.com/2018/03/11/build-your-own-word-sentence-prediction-application-part-02/ to do word-level prediction. The input textual data is also in the mentioned link and I use en_US.news.txt file as my only input file.
library(quanteda)
library(data.table)
#read the .txt file
df=readLines('en_US.news.txt')
#take a sample of the df
sampleHolderNews <- sample(length(df), length(df) * 0.1)
US_News_Sample <- df[sampleHolderNews]
#build the corpus of the data
corp <- corpus(US_News_Sample)
#Preprocessing
master_Tokens <- tokens(x = tolower(corp),remove_punct =
TRUE,remove_numbers = TRUE,remove_hyphens = TRUE,remove_symbols = TRUE)
stemed_words <- tokens_wordstem(master_Tokens, language = "english")
#tokenization#
bi_gram <- tokens_ngrams(stemed_words, n = 2)
tri_gram <- tokens_ngrams(stemed_words, n = 3)
uni_DFM <- dfm(stemed_words)
bi_DFM <- dfm(bi_gram)
tri_DFM <- dfm(tri_gram)
uni_DFM <- dfm_trim(uni_DFM, 3)
bi_DFM <- dfm_trim(bi_DFM, 3)
tri_DFM <- dfm_trim(tri_DFM, 3)
sums_U <- colSums(uni_DFM)
sums_B <- colSums(bi_DFM)
sums_T <- colSums(tri_DFM)
# Create data tables with individual words as columns
uni_words <- data.table(word_1 = names(sums_U), count = sums_U)
bi_words <- data.table(
word_1 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 2),
count = sums_B)
tri_words <- data.table(
word_1 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 2),
word_3 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 3),
count = sums_T)
#indexing#
setkey(uni_words, word_1)
setkey(bi_words, word_1, word_2)
setkey(tri_words, word_1, word_2, word_3)
######## Finding Bi-Gram Probability #################
discount_value <- 0.75
# Finding number of bi-gram words
numOfBiGrams <- nrow(bi_words[.(word_1, word_2)])
# Dividing number of times word 2 occurs as second part of bigram, by total number of bigrams.
# Finding probability for a word given the number of times it was second word of a bigram
ckn <- bi_words[, .(Prob = ((.N) / numOfBiGrams)), by = word_2]
setkey(ckn, word_2)
# Assigning the probabilities as second word of bigram, to unigrams
uni_words[, Prob := ckn[word_1, Prob]]
uni_words <- uni_words[!is.na(uni_words$Prob)]
# Finding number of times word 1 occurred as word 1 of bi-grams
n1wi <- bi_words[, .(N = .N), by = word_1]
setkey(n1wi, word_1)
# Assigning the total number of times word 1 occurred to bigram column Cn1
bi_words[, Cn1 := uni_words[word_1, count]]
# Kneser-Ney algorithm
bi_words[, Prob := ((count - discount_value) / Cn1 + discount_value / Cn1 *
n1wi[word_1, N] * uni_words[word_2, Prob])]
######## End of Finding Bi-Gram Probability #################
######## Finding Tri-Gram Probability #################
# Finding count of word1-word2 combination in bigram
tri_words[, Cn2 := bi_words[.(word_1, word_2), .N]]
n1w12 <- tri_words[, .N, by = .(word_1, word_2)]
setkey(n1w12, word_1, word_2)
# Kneser-Ney algorithm
tri_words[, Prob := ((count - discount_value) / Cn2 + discount_value / Cn2 *
n1w12[.(word_1, word_2), .N] * bi_words[.(word_1, word_2), Prob])]
Here I get the following error from the Kneser-Ney step for trigrams:
Error in `[.data.table`(tri_words, , `:=`(Prob, ((count - discount_value)/Cn2 + :
Supplied 13867 items to be assigned to 3932 items of column 'Prob'. If you wish to 'recycle'
the RHS please use rep() to make this intent clear to readers of your code.
In addition: Warning messages:
1: In discount_value/Cn2 * n1w12[list(word_1, word_2), .N] * bi_words[list(word_1, :
longer object length is not a multiple of shorter object length
2: In (count - discount_value)/Cn2 + discount_value/Cn2 * n1w12[list(word_1, :
longer object length is not a multiple of shorter object length
I found some similar questions about this data.table error, but I can't work out how to fix it in this code.

The problem is in your attempt to multiply the quantities in the last line. This expression:
(count - discount_value) / Cn2 + discount_value / Cn2
is length 20, like tri_words. But the next expression
n1w12[.(word_1, word_2), .N]
is length 19. Then the last part,
bi_words[.(word_1, word_2), Prob]
is length 155 (and contains a lot of NAs).
The error and warnings say that the shorter vectors cannot be recycled into the longer one, because the longer length is not a multiple of the shorter length. To fix this, compute each term so that it yields exactly one value per row of tri_words before combining them.
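One way to get one value per trigram row is an update join, sketched below on made-up toy tables. The lengths differ in the original because, inside bi_words[...], the names word_1 and word_2 resolve to bi_words' own columns, so the lookup returns one value per bigram rather than per trigram. The sketch assumes Cn2 is meant to be the count of the (word_1, word_2) bigram; the column names BiProb and N12 are hypothetical.

```r
library(data.table)

# Toy stand-ins for bi_words / tri_words (all values are invented).
bi_words <- data.table(word_1 = c("a", "a", "b"),
                       word_2 = c("b", "c", "c"),
                       count  = c(5, 3, 4),
                       Prob   = c(0.5, 0.3, 0.4))
tri_words <- data.table(word_1 = c("a", "a"),
                        word_2 = c("b", "b"),
                        word_3 = c("c", "d"),
                        count  = c(2, 1))
discount_value <- 0.75

# Update join: copy the bigram count and probability onto each matching
# trigram row. The i. prefix refers to columns of bi_words.
tri_words[bi_words, on = .(word_1, word_2),
          `:=`(Cn2 = i.count, BiProb = i.Prob)]

# Number of distinct trigrams that share the same two-word prefix.
tri_words[, N12 := .N, by = .(word_1, word_2)]

# Every term now has exactly nrow(tri_words) elements.
tri_words[, Prob := (count - discount_value) / Cn2 +
                    discount_value / Cn2 * N12 * BiProb]
```

Because every column used in the final line lives in tri_words, no recycling is needed and the length mismatch cannot occur.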

I am having an issue adding vectors to a list [duplicate]

I have a list of matrices constructed by the following loops:
# Set up Row and Column Names for prediction coefficients.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
"actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
"unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
"covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
"yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
"yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
"yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")
# Build Matrix for dummy coefficient values.
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))
matrix1
# Extract each variable type into own matrix (i.e. all "actsBreaks{x}")
#
Beta_names <- list()
betabreaks <- unique(gsub("[0-9]*", "", rows))
for (bc in betabreaks)
{
Breaks <- grep(paste0(bc, "[0-9]*"), rows)
Beta_names[[bc]] <- matrix1[Breaks, ,drop = FALSE]
Beta_names[[bc]] <- data.matrix(unlist(Beta_names[[bc]])) #, byrow = TRUE)
}
# Set up matrices for excluded/test data
one_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,2,0,10)
two_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,3,0,10)
three_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,4,10,0)
four_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,5,0,10)
five_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,6,0,10)
six_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,7,0,10)
seven_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,8,0,10)
eight_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,9,0,10)
nine_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,1,0,10)
ten_column <-c(1,1,2,3,3,3,4,4,4,4,4,5,6,9,9,8,7,5,4,7,7,8,0,0,10)
DF1 <- data.frame (one_column ,two_column ,three_column ,
four_column ,five_column ,six_column ,
seven_column ,eight_column ,nine_column ,
ten_column )
paralength <- 5
Xnames <- list()
datindc <- 1
while ( datindc <= paralength )
{
Xbreaks <- factor(DF1[[datindc]],levels=sort(unique.default(DF1[[datindc]]),decreasing=FALSE))
Xnames[[datindc]] <- data.frame(model.matrix(~Xbreaks -1), stringsAsFactors = FALSE)
datindc <- datindc + 1
}
#
Xlngth <- length(Xnames)
BtaXind <- 1
BetaiXi <- list()
while ( BtaXind <= Xlngth )
{
BetaiXi[[BtaXind]] <- (Beta_names[[BtaXind + 1]] * Xnames[[BtaXind]])
BtaXind <- (BtaXind + 1)
}
I need to add each of those matrices' rows to each other, which I am trying to do by turning each matrix into a vector
BiXilngth <- length(BetaiXi)
BetaiXiTr <- list()
BtaiXiTrd <- 1
while (BtaiXiTrd <= BiXilngth)
{
Var1 <- c(t(BetaiXi[[BtaiXiTrd]]))
BetaiXiTr[BtaiXiTrd] <- Var1
BtaiXiTrd <- BtaiXiTrd + 1
}
and adding the vectors, effectively transposing the matrices. However, when I tried to convert the first matrix BetaiXi[[1]] to a vector and add it to the list with this command BetaiXiTr[BtaiXiTrd] <- c(t(BetaiXi[[BtaiXiTrd]])) I got the following message:
Warning message:
In BetaiXiTr[BtaiXiTrd] <- c(t(BetaiXi[[BtaiXiTrd]])) :
number of items to replace is not a multiple of replacement length
I then tried using unlist():
> BetaiXiTr[BtaiXiTrd] <-unlist(c(t(BetaiXi[[1]])))
Warning message:
In BetaiXiTr[BtaiXiTrd] <- unlist(c(t(BetaiXi[[1]]))) :
number of items to replace is not a multiple of replacement length
with the same result. Finally, I tried assigning the first vector to a variable, Var1 <- c(t(BetaiXi[[BtaiXiTrd]])), and then assigning that variable to the list, BetaiXiTr[BtaiXiTrd] <- Var1, with, yet again, the same warning:
Warning message:
In BetaiXiTr[BtaiXiTrd] <- Var1 :
number of items to replace is not a multiple of replacement length
I searched for the warning message to find out what exactly I was being warned about, but ended up more confused. Most people who reproduced or encountered it were trying to replace a vector of many elements with a vector of fewer, while (to my understanding) I am simply trying to add a vector to a list. Am I going about this the wrong way?

I was using [ ] and [[ ]] incorrectly in BetaiXiTr[BtaiXiTrd]. It needs to be BetaiXiTr[[BtaiXiTrd]], which allows the whole vector to be stored as a single list element.
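A minimal sketch of the difference, using two small made-up matrices in place of BetaiXi (the names mats and flat are hypothetical):

```r
# Two toy matrices standing in for the elements of BetaiXi.
mats <- list(matrix(1:4, nrow = 2), matrix(5:10, nrow = 2))

flat <- list()
for (k in seq_along(mats)) {
  # [[k]] addresses the k-th element itself, so a vector of any length fits.
  # flat[k] <- ... would address a one-slot sub-list and warn, keeping only
  # the first value of the vector.
  flat[[k]] <- c(t(mats[[k]]))  # row-major flattening of the matrix
}

# Equivalent without an explicit loop.
flat2 <- lapply(mats, function(m) c(t(m)))
```

Both forms store each flattened matrix as one list element, whatever its length.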

Use sum of binary values as constraint coefficient in integer programming constraint

I am trying to formulate a shelf-optimization integer programming algorithm in lpsolveAPI and wish to add a constraint whereby the same number of each product is on each selected shelf (S):
[equation image not preserved in the source]
My difficulty is accessing and using the sums as constraints (specifically Xij)
I can reformulate it to make it linear without too many problems (please excuse the pseudocode):
sum(X_ij)*F_ij - sum(F_ij) = 0
This operation could affect the choice of X itself, so I need it to be dynamic (otherwise I could just change the values post-solve). How can I access these values, or code the F values to be equal?
The linear program creates a binary solution through a series of constraints to place products on shelves (it may place on one shelf or two shelves at the moment), there are four shelves. Then there is a second set which allows a number of products on those shelves which are non-zero, constrained by the width of the products against the length of the shelves with a given maximum of products on all shelves (8 for most of them, though this is somewhat arbitrary), maximising the product profit. All this works as expected. However, I wish to add a constraint such that the number of products on two or more shelves are the same i.e. four on one and four on the other. Given that the number of shelves used can be 1 or 2, I cannot simply divide the values. Further, as which shelves are occupied are decided by constraints, I cannot simply use P1S1 = P1S2 (unless I could select the occupied shelves, which I am failing to do)
Here is a minimal example of what I am trying to do (please excuse the inelegant code as I am doing this for the first time) the dataset is here:
library(lpSolveAPI)
shelves <- data.frame(Sl_i = c(151, 200, 180, 218),
Sh_i = c(30, 30, 30, 36))
mydata <- read.csv("~/Desktop/sales/datatable.txt", sep="") # named mydata to match the rest of the script
S = 4 # number of shelves
P = 40 # number of products
Shelf_choice <-
make.lp(0, nrow(mydata) * 2) # create the lp object with twice as many decision variables as rows in the data.frame
#### SET OBJECTIVE FUNCTION ####
#### Set controls for the model ####
lp.control(Shelf_choice,
sense = "max",
timeout = 10,
presolve = "none") ## timeout prevents getting stuck
set.objfn(Shelf_choice, c(rep(rep(0, nrow(
mydata
)), 1), mydata$Pu_j)) # maximize profit (Pu_j)
set.type(Shelf_choice, 1:nrow(mydata), "binary") # present on shelf or not
set.type(Shelf_choice, 1:nrow(mydata) + nrow(mydata), "integer") # number of product j on shelf
### Assure that each product appears on minimum number of shelves (1 in this case)
Add_productShelf_constraint <- function (prod_index) {
cargo_cols <-
(0:(S - 1)) * P + prod_index # # index of products by column (eg. 1,41,81,121)
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = cargo_cols,
type = ">=",
rhs = mydata$smin_j[prod_index]
) # value of minimum number of shelves
}
lapply(1:P, Add_productShelf_constraint) # list apply this for every product
### Assure that product appears no more than the number of shelves permitted (2 in this case)
Add_productShelfMAX_constraint <- function (prod_index) {
cargo_cols <-
(0:(S - 1)) * P + prod_index # index of products by column (eg. 1,41,81,121)
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = cargo_cols,
type = "<=",
rhs = mydata$smax_j[prod_index]
) # value of maximum number of shelves
}
lapply(1:P, Add_productShelfMAX_constraint) # list apply this for every product
### Third Constraint: Products too tall for a shelf are excluded
Add_height_constraint <-
function (prod_index) {
# this needs to be improved
add.constraint(Shelf_choice,
1,
indices = prod_index,
type = "=",
rhs = 0)
}
lapply(which(mydata$height == 0), Add_height_constraint) # Here we select the colums which have 0 (don't fit), and set the value to 0
## Products are on consecutive shelves - this currently only works for two shelves
Add_nextshelf_constraint <- function (prod_index) {
mat1 <- combn(1:S, 2)[, which(combn(1:S, 2)[2, ] - combn(1:S, 2)[1, ] != 1)]
cargo_cols <- (0:(S - 1)) * P + prod_index
result <- matrix(cargo_cols[mat1], nrow = 2)
for (i in 1:ncol(result)) {
add.constraint(
Shelf_choice,
c(1, 1),
indices = result[, i],
type = "<=",
rhs = 1
)
}
}
lapply(1:P, Add_nextshelf_constraint)
### Product facings only appear on selected shelves (where Xij = 1)
Add_FF_constraint1 <- function (prod_index) {
Y01col <- prod_index
print(Y01col)
FF_col <- prod_index + nrow(mydata)
add.constraint(
Shelf_choice,
c(1, -100),
indices = c(FF_col, Y01col),
type = "<=",
rhs = 0
)
}
lapply(1:nrow(mydata), Add_FF_constraint1) #
Add_FF_constraint2 <- function (prod_index) {
Y01col <- prod_index
FF_col <- prod_index + nrow(mydata)
add.constraint(
Shelf_choice,
c(1, -1),
indices = c(FF_col, Y01col),
type = ">=",
rhs = 0
)
}
lapply(1:nrow(mydata), Add_FF_constraint2) #
#### Sum of product widths on shelves does not exceed shelf length
Add_FijShelflength_constraint <- function (shelf_index) {
shelf_cols_mydata <- ((1:(P)) + (shelf_index - 1) * P)
FF_shelf_cols <- ((1:(P)) + (shelf_index - 1) * P) + nrow(mydata)
add.constraint(
Shelf_choice,
c(mydata$Pw_j[shelf_cols_mydata]),
# width of each product
indices = c(FF_shelf_cols),
# indices of products per shelf in Fij Matrix
rhs = shelves$Sl_i[shelf_index]
) # length of each shelf
}
lapply(1:S, Add_FijShelflength_constraint) # list apply this by shelf index
## add minimum number of total facings
Add_min_facings_constraint <- function (prod_index) {
FjSi_cols <-
(0:(S - 1)) * P + prod_index + nrow(mydata) # index of the products by column in out table
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = FjSi_cols,
# index of products by column (eg. 1,41,81,121)
type = ">=",
rhs = mydata$Fmin_j[prod_index]
) # value of minimum number of products
}
lapply(1:P, Add_min_facings_constraint) # list apply this for every product
## add maximum number of facings
Add_max_facings_constraint <- function (prod_index) {
FjSi_cols <-
(0:(S - 1)) * P + prod_index + nrow(mydata)
add.constraint(
Shelf_choice,
rep(1, S),
# repeat value the same number of times as shelves
indices = FjSi_cols,
# index of products by column (eg. 1,41,81,121)
type = "<=",
rhs = mydata$Fmax_j[prod_index]
) # value of maximum number of products
}
lapply(1:P, Add_max_facings_constraint) # list apply this for every product
solve(Shelf_choice)
get.objective(Shelf_choice) # gives the total value of the facings
### Tabulates the results ####
test <- matrix(get.variables(Shelf_choice),
ncol = S * 2,
byrow = F)
rownames(test) <- paste0("Product", 1:40)
colnames(test) <- c(rep(paste0("Shelf", 1:4), 2))
test[, 5:8] # shows the product placements (uneven products between shelves)
#
Results:
Product  Shelf 1  Shelf 2  Shelf 3  Shelf 4
P1       0        0        0        2
-        -        -        -        -
P11      1        7        0        0
P16      2        2        0        0
I need, for example, that product 11 has the same number of products on each shelf (4 on each)
I have tried to create a constraint such as :
Sum_Xshelf_constraint <- function (prod_index) {
binary_sum <-
sum(get.variables(Shelf_choice)[(0:(S - 1)) * P + prod_index])
total_Fij <-
sum(get.variables(Shelf_choice)[(0:(S - 1)) * P + prod_index + nrow(mydata)])
total_cols <-
(0:(S - 1)) * P + prod_index + nrow(mydata) # index of the products in Fij
for (i in 1:length(total_cols)) {
add.constraint(
Shelf_choice,
c(binary_sum),
indices = c(total_cols[i]),
type = "<=",
rhs = total_Fij
) # value of minimum number which is Fmin_j
}
}
### At least one product on a shelf
lapply(1:P, Sum_Xshelf_constraint)
This unsurprisingly will not work before solving, and once solved it has no effect.
Any ideas how to achieve this? Thank you in advance.

If your decision variables are PiSj (the number of the ith product on the jth shelf), then a series of chained equations could work:
P1S1 = P1S2
P1S2 = P1S3
P1S3 = P1S4
That kind of simplicity can sometimes markedly improve performance.
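The chained equalities tie all shelves together. If only the shelves that are actually selected should match, one commonly used linearisation is a big-M relaxation: the equality is enforced only when both binary indicators are 1. The sketch below is not from this thread. It uses a made-up two-shelf, one-product model, and the bound M, the capacities and the variable layout are all assumptions.

```r
library(lpSolveAPI)

# Columns: 1 = X1, 2 = X2 (shelf selected), 3 = F1, 4 = F2 (facings).
M  <- 8                      # assumed upper bound on facings per shelf
lp <- make.lp(0, 4)
invisible(lp.control(lp, sense = "max"))
set.objfn(lp, c(0, 0, 1, 1)) # maximise total facings (toy objective)
set.type(lp, 1:2, "binary")
set.type(lp, 3:4, "integer")

# Facings only on selected shelves: F_k - M * X_k <= 0
add.constraint(lp, c(-M, 1), "<=", 0, indices = c(1, 3))
add.constraint(lp, c(-M, 1), "<=", 0, indices = c(2, 4))

# Toy capacities: shelf 1 holds 5, shelf 2 holds 7, at most 8 in total.
add.constraint(lp, 1, "<=", 5, indices = 3)
add.constraint(lp, 1, "<=", 7, indices = 4)
add.constraint(lp, c(1, 1), "<=", 8, indices = c(3, 4))

# Big-M pair: when X1 = X2 = 1 these reduce to F1 - F2 <= 0 and
# F2 - F1 <= 0, i.e. F1 = F2; otherwise they are slack.
add.constraint(lp, c(M, M,  1, -1), "<=", 2 * M, indices = 1:4)
add.constraint(lp, c(M, M, -1,  1), "<=", 2 * M, indices = 1:4)

status  <- solve(lp)             # 0 means an optimal solution was found
facings <- get.variables(lp)[3:4]
```

In the full model the same pair of constraints would be added for every product and every pair of shelves it may occupy, with M set to that product's maximum number of facings.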

I have to answer my own problem here:
It is not possible.
Reading carefully it seems that this particular constraint is quadratic... (http://lpsolve.sourceforge.net/5.5/)
"Suppose that xj must take an integer value i:
yj / y0 = i
or
yj = i * y0
Unfortunately, this constraint can't be handled by lpsolve since it is quadratic."
I will edit this answer with the quadratic solution when it is complete. Thought I had better post this now so that others don't put too much effort into trying to solve it.
Thanks!

R for loop to calculate wilcox.test

I am trying to write a code that would automatically calculate Wilcoxon test p-value for several comparisons.
Data used: two data sets with the same structure, representing two groups of participants who completed the same 5 tasks. Each table therefore contains 5 columns (tasks) and X rows of task scores.
data_17_18_G2 # first data set (in data.table format)
data_18_20_G2 # second data set (in data.table format)
Both data sets have identical column names, which are to be used in the Wilcoxon test as follows:
wilcox.test(Group1Task1, Group2Task1, paired = F)
wilcox.test(Group1Task2, Group2Task2, paired = F)
and so on.
The inputs (e.g., Group1Task1) are two vectors of task scores (the first from data_17_18_G2 and the other from data_18_20_G2).
Desired output: a data table with a column of p-values
The problem I faced is that no matter how I manipulated the empty val1 and val2 objects, in the second and third lines of the loop the right-hand side "as.numeric(unlist(data_17_18_G2[, ..i]))" gives the correct output (a numeric vector), but the left-hand side "val1[i]" always receives only one value from that vector. That gave me the idea that the main problem lies in how the empty vector is created; however, I wasn't able to solve it.
Empty objects:
result <- data.table(matrix(ncol=2))
val1 <- as.numeric() # here I also tried functions "numeric" and "vector"
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
For loop
for (i in 1:5) {
val1[i] <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2[i] <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
Output:
Error in `[<-.data.table`(`*tmp*`, i, 2, value = NULL) :
When deleting columns, i should not be provided
1: In val1[i] <- as.numeric(unlist(data_17_18_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
2: In val2[i] <- as.numeric(unlist(data_18_20_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
3: In res[i] <- wilcox.test(val1[i], val2[i], paired = F) :
number of items to replace is not a multiple of replacement length
Alternative:
I changed the second and the third lines
for (i in 1:5) {
val1[i] <- as.numeric(data_17_18_G2[ , ..i])
val2[i] <- as.numeric(data_18_20_G2[ , ..i])
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
And got this
Error in as.numeric(data_17_18_G2[, ..i]) :
(list) object cannot be coerced to type 'double'
which means that the function wilcox.test cannot interpret this type of input.
How can I improve the code so that I get a data table of p-values?

There appear to be some bugs in the code. I have rewritten it using the cars dataset as an example.
## use the cars dataset as an example (replace with the appropriate data)
data(cars)
data_17_18_G2 <- as.data.table(cars)
data_18_20_G2 <- data_17_18_G2[,2:1]
## Fixed code
result <- data.table(matrix(as.numeric(), nrow=ncol(data_17_18_G2), ncol=2))
val1 <- as.numeric()
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
for (i in 1:ncol(data_17_18_G2)) {
val1 <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2 <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[[i]] <- wilcox.test(val1, val2, paired = F)
result[i, 1] <- as.numeric(i)
result[i, 2] <- as.numeric(res[[i]]$p.value)
}
Hope this gives you the output you are after.
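As an alternative sketch, the loop and the pre-allocated objects can be avoided by iterating over matching columns with mapply. The toy tables g1 and g2 below are hypothetical stand-ins for data_17_18_G2 and data_18_20_G2, filled with random numbers.

```r
library(data.table)

set.seed(1)
# Toy stand-ins for the two groups (same column names, different row counts).
g1 <- data.table(Task1 = rnorm(20), Task2 = rnorm(20))
g2 <- data.table(Task1 = rnorm(25, mean = 0.5), Task2 = rnorm(25))

# A data.table is a list of columns, so mapply pairs Task1 with Task1,
# Task2 with Task2, and so on, returning one p-value per task.
p_values <- mapply(
  function(a, b) wilcox.test(a, b, paired = FALSE)$p.value,
  g1, g2
)

result <- data.table(task = names(g1), p_value = p_values)
```

This relies on both tables having their columns in the same order; if that is not guaranteed, reorder the second table by the first table's column names before pairing.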

How to summarize email text using LDA in R

I am analysing complaints data and adapting a text-summarization technique to strip unnecessary text and keep only the useful parts.
I have used LDA (Latent Dirichlet Allocation) in R for the summarization, but I cannot get it to work fully.
library(textmineR) # provides CreateTcm, FitLdaModel, CreateDtm, CalcHellingerDist
library(igraph)
library(iterators)
# create a TCM using skip-grams with a 10-word window
tcm <- CreateTcm(doc_vec = datacopy$Text,skipgram_window = 10,
verbose = FALSE,cpus = 2)
# LDA to get embeddings into probability space
embeddings <- FitLdaModel(dtm = tcm, k = 50, iterations = 300,
burnin = 180, alpha = 0.1,beta = 0.05, optimize_alpha = TRUE,
calc_likelihood = FALSE,calc_coherence = FALSE, calc_r2 = FALSE,cpus=2)
#Summarizer function
summarizer <- function(doc, gamma) {
# handle multiple docs at once
if (length(doc) > 1 )
return(sapply(doc, function(d) try(summarizer(d, gamma))))
# parse it into sentences
sent <- stringi::stri_split_boundaries(doc, type = "sentence")[[ 1 ]]
names(sent) <- seq_along(sent) # so we know index and order
# embed the sentences in the model
e <- CreateDtm(sent, ngram_window = c(1,1), verbose = FALSE, cpus = 2)
# remove any documents with 2 or fewer words
#e <- e[ rowSums(e) > 2 , ]
vocab <- intersect(colnames(e), colnames(gamma))
e <- e / rowSums(e)
e <- e[ , vocab ] %*% t(gamma[ , vocab ])
e <- as.matrix(e)
# get the pairwise distances between each embedded sentence
e_dist <- CalcHellingerDist(e)
# turn into a similarity matrix
g <- (1 - e_dist) * 100
# we don't need sentences connected to themselves
diag(g) <- 0
# turn into a nearest-neighbor graph
g <- apply(g, 1, function(x){
x[ x < sort(x, decreasing = TRUE)[ 3 ] ] <- 0
x
})
# by taking pointwise max, we'll make the matrix symmetric again
g <- pmax(g, t(g))
g <- graph.adjacency(g, mode = "undirected", weighted = TRUE)
# calculate eigenvector centrality
ev <- evcent(g)
# format the result
result<-sent[names(ev$vector)[order(ev$vector,decreasing=TRUE)[1:3]]]
result <- result[ order(as.numeric(names(result))) ]
paste(result, collapse = " ")
}
docs <- datacopy$Text[1:10]
names(docs) <- datacopy$Reference[1:10]
sums <- summarizer(docs,gamma = embeddings$gamma)
sums
Error -
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
argument is of length zero
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
argument is of length zero
Error in if (nrow(adjmatrix) != ncol(adjmatrix)) { :
argument is of length zero
Actual text:
it is the council’s responsibility to deal with the loose manhole cover.
Could you provide an update on the next steps taken by the council.
** Trail Mails Text follows - about 50 lines of text**
summarized text:
it is the council’s responsibility to deal with the loose manhole cover. I have read the email thread, please get in contact with the numbers provided by ABC
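The post does not say which emails fail. One plausible trigger, offered here only as an assumption, is a document with too few sentences for the "keep the top 3 neighbours" step. The sketch below isolates that step in a hypothetical helper, knn_sparsify, which caps k at the number of available neighbours and is run on a made-up similarity matrix.

```r
# Hypothetical helper: keep each row's k largest similarities, zero the rest,
# then symmetrise. Guards against matrices with fewer than k + 1 rows.
knn_sparsify <- function(sim, k = 3) {
  stopifnot(is.matrix(sim), nrow(sim) == ncol(sim))
  diag(sim) <- 0                 # no self-links
  k <- min(k, nrow(sim) - 1)     # cannot keep more neighbours than exist
  if (k < 1) return(sim)         # a single sentence has no neighbours
  kept <- apply(sim, 1, function(x) {
    x[x < sort(x, decreasing = TRUE)[k]] <- 0
    x
  })
  pmax(kept, t(kept))            # pointwise max restores symmetry
}

# Toy similarity matrix for four sentences (values are invented).
sim <- matrix(c(0, 9, 5, 1,
                9, 0, 4, 2,
                5, 4, 0, 8,
                1, 2, 8, 0), nrow = 4, byrow = TRUE)
g <- knn_sparsify(sim, k = 2)
```

Inside summarizer, a similar early check on length(sent) before building the graph would let short emails be returned unchanged instead of raising an error.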

Txt Prediction Model Numerical Expression Warning

I have three data frames built from n-gram counts (uni-, bi- and trigrams). Each data frame contains the separated n-gram and its frequency count (n), and I have added a probability column using smoothing.
I have written three functions that look through the tables and return the most probable next word for an input string, and I have combined them:
##Prediction Model
trigramwords <- function(FirstWord, SecondWord, n = 5 , allow.cartesian =TRUE) {
probword <- trigramtable[.(FirstWord, SecondWord), allow.cartesian = TRUE][order(-Prob)]
if(any(is.na(probword)))
return(bigramwords(SecondWord, n))
if(nrow(probword) > n)
return(probword[1:n, ThirdWord])
count <-nrow(probword)
bgramwords <- bigramwords(SecondWord, n)[1:(n - count)]
return(c(probword[, ThirdWord], bgramwords))
}
bigramwords <- function(FirstWord, n = 5 , allow.cartesian = TRUE){
probword <- bigramtable[FirstWord][order(-Prob)]
if(any(is.na(probword)))
return(Unigramword(n))
if (nrow(probword) > n)
return(probword[1:n, SecondWord])
count <- nrow(probword)
word1 <- Unigramword(n)[1:(n - count)]
return(c(probword[, SecondWord], word1))
}
##Back off Model
Unigramword <- function(n = 5, allow.cartesian = TRUE){
return(sample(UnigramTable[, FirstWord], size = n))
}
## Bind Functions
predictword <- function(str) {
require(quanteda)
tokens <- tokens(x = char_tolower(str))
tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
words <- trigramwords(tokens[1], tokens[2], 5)
chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")
print(words[1])
}
However, I receive the following warning message and the output is always the same word. If I use only the bigramwords function it works fine, but when I add the trigram function I get the warning. I believe it is because 1:n is not defined correctly.
Warning message:
In 1:n : numerical expression has 5718534 elements: only the first used
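A plausible cause, stated as an assumption because the tables are not shown: the frequency column is described as n, and inside data.table's [ the expression 1:n is evaluated with the table's columns in scope. In that case n would be the whole count column rather than the function argument, which matches a warning about an expression with millions of elements. The sketch below uses a made-up table and a hypothetical helper, top_words, whose argument is renamed to avoid the clash.

```r
library(data.table)

# Toy table; the column name `n` mirrors the frequency column in the question.
bigram_toy <- data.table(SecondWord = c("the", "of", "and", "to"),
                         n    = c(3L, 9L, 4L, 7L),
                         Prob = c(0.4, 0.3, 0.2, 0.1))

# Hypothetical helper: the argument is called n_top so it cannot be
# shadowed by the column `n`.
top_words <- function(tbl, n_top = 5) {
  ranked <- tbl[order(-Prob)]
  # head() is called outside the table's [ scope, so n_top is the argument;
  # it also copes with tables shorter than n_top.
  head(ranked, n_top)[["SecondWord"]]
}

top_words(bigram_toy, 2)
```

Renaming either the argument or the count column in trigramwords and bigramwords should have the same effect.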
