Corpus bleu score calculation - math

I am trying to calculate the BLEU score for text summarization. Before that, I need to know how corpus BLEU calculates a score between given references and candidates. I have 3 references and 2 candidates, where I am calling corpus BLEU on the first two references with candidate 1 and on reference 3 with candidate 2. What kind of math is involved? Does it take an average or a geometric mean, or something else?
It would be helpful if someone could show the calculation.
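For reference, here is a minimal sketch of the corpus-level BLEU arithmetic in R, written from the standard BLEU definition rather than from any particular package (real implementations typically also use 3-grams, 4-grams and optional smoothing). The key point: clipped n-gram counts are pooled over all candidate/reference pairs before the precisions are computed, and the final score is the geometric mean of those precisions times a brevity penalty, not an average of per-sentence BLEU scores. The sentences and whitespace tokenisation below are hypothetical.
# Helper: all n-grams of a token vector, as strings
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

# Corpus BLEU with unigram and bigram precisions only (standard BLEU uses up to 4-grams)
corpus_bleu <- function(references, candidates, max_n = 2) {
  clipped <- numeric(max_n); total <- numeric(max_n)   # pooled over the whole corpus
  cand_len <- 0; ref_len <- 0
  for (i in seq_along(candidates)) {
    cand_tok <- strsplit(candidates[i], " ")[[1]]
    ref_toks <- lapply(references[[i]], function(r) strsplit(r, " ")[[1]])
    cand_len <- cand_len + length(cand_tok)
    ref_lens <- sapply(ref_toks, length)
    ref_len  <- ref_len + ref_lens[which.min(abs(ref_lens - length(cand_tok)))]  # closest reference length
    for (n in seq_len(max_n)) {
      cand_ng <- table(ngrams(cand_tok, n))
      if (length(cand_ng) == 0) next
      ref_ng  <- lapply(ref_toks, function(t) table(ngrams(t, n)))
      # clip each candidate n-gram count by its maximum count in any reference
      max_ref <- sapply(names(cand_ng), function(g)
        max(sapply(ref_ng, function(tab) if (g %in% names(tab)) tab[[g]] else 0)))
      clipped[n] <- clipped[n] + sum(pmin(as.numeric(cand_ng), max_ref))
      total[n]   <- total[n] + sum(cand_ng)
    }
  }
  precisions <- clipped / total                        # corpus-level modified precisions
  bp <- if (cand_len > ref_len) 1 else exp(1 - ref_len / cand_len)  # brevity penalty
  bp * exp(mean(log(precisions)))                      # geometric mean of the precisions
}

# Hypothetical example: references 1 and 2 belong to candidate 1, reference 3 to candidate 2
refs  <- list(c("the cat is on the mat", "there is a cat on the mat"),
              c("the dog sleeps on the sofa"))
cands <- c("the cat sat on the mat", "a dog sleeps on the sofa")
corpus_bleu(refs, cands)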

Related

LDA - How do I normalise and "add the smoothing constant" to raw document-topic allocation counts?

Context
Thanks in advance for your help. Right now, I have run a dataset through the LDA function in Jonathan Chang's 'lda' package (N.B. this is different from the 'topicmodels' package). Below is a replicable example, which uses the cora dataset that comes automatically when you install and load the 'lda' package.
library(lda)
data(cora.documents) #list of words contained in each of the 2,410 documents
data(cora.vocab) #vocabulary list of words that occur at least once across all documents
Thereafter, I conduct the actual LDA by setting the different parameters and running the code.
#parameters for LDA algorithm
K <- 20 #number of topics to be modelled from the corpus, "K"
G <- 1000 #number of iterations to cover - the higher this number, the more likely the data converges, "G"
alpha <- 0.1 #Dirichlet prior on document-topic distributions, "alpha"
beta <- 0.1 #Dirichlet prior on topic-term distributions, "beta/eta"
#creates an LDA fit based on above parameters
lda_fit <- lda.collapsed.gibbs.sampler(cora.documents, cora.vocab, K = K,
                                       num.iterations = G, alpha = alpha, eta = beta)
Following which, we examine one component of the LDA output called document_sums. This component shows, for each individual document, how many of its words are allocated to each of the 20 topics (based on the K value I chose). For instance, one document may have 4 words allocated to Topic 3 and 12 words allocated to Topic 19, in which case the document is assigned to Topic 19.
#gives raw figures for no. of times each document (column) had words allocated to each of 20 topics (rows)
document_sums <- as.data.frame(lda_fit$document_sums)
document_sums[1:20, 1:20]
Question
However, what I want to do is essentially use the principle of fuzzy membership. Instead of allocating each document to the topic it contains the most words from, I want to extract the probability that each document gets allocated to each topic. document_sums comes quite close to this, but I still have to do some processing on the raw data.
Jonathan Chang, the creator of the 'lda' package, himself says this in this thread:
n.b. If you want to convert the matrix to probabilities just row normalize and add the smoothing constant from your prior. The function here just returns the raw number of assignments in the last Gibbs sampling sweep.
Separately, another reply on another forum reaffirms this:
The resulting document_sums will give you the (unnormalized) distribution over topics for the test documents. Normalize them, and compute the inner product, weighted by the RTM coefficients to get the predicted link probability (or use predictive.link.probability)
And thus, my question is: how do I normalise my document_sums and 'add the smoothing constant'? I am unsure about these two steps.
As asked: You need to add the prior to the matrix of counts and then divide each row by its total. For example
theta <- t(document_sums + alpha)  # document_sums is topics x documents, so transpose to get one row per document
theta <- theta / rowSums(theta)    # each row now sums to 1 and gives P(topic | document)
You'll need to do something similar for the matrix of counts relating words to topics.
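For the word-topic side, a hedged sketch assuming the lda_fit object from lda.collapsed.gibbs.sampler above, whose $topics component holds the K x vocabulary matrix of word-to-topic assignment counts:
phi <- lda_fit$topics + beta   # add the prior on the topic-term distributions (beta/eta above)
phi <- phi / rowSums(phi)      # each row (topic) now sums to 1 and gives P(word | topic)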
However, if you're using LDA, may I suggest you check out textmineR? It does this normalization (and other useful things) for you. I originally wrote it as a wrapper for the 'lda' package, but have since implemented my own Gibbs sampler to enable other features. Details on using it for topic modeling are in the third vignette.

Multiclass text classification using R

I am working on a multiclass text classification problem and have built a gradient boosting model for it.
About the dataset:
The dataset has two columns: "Test_Name" and "Description".
There are six labels in the "Test_Name" column and their corresponding descriptions in the "Description" column.
My approach towards the problem
DATA PREPARATION
Create a word vector for the description.
Build a corpus using the word vector.
Pre-processing tasks such as removing numbers, whitespace and stopwords, and converting to lower case.
Build a document term matrix (dtm).
Remove sparse words from the above dtm.
The above step leads to a count frequency matrix showing the frequency of each word in its corresponding column.
Transform the count frequency matrix into a binary instance matrix, which records the occurrence of a word in a document as either 0 or 1: 1 for present and 0 for absent.
Append the label column from the original notes dataset to the transformed dtm. The label column has 6 labels. (A sketch of these steps appears below.)
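A minimal sketch of this preparation, assuming the tm package and a hypothetical data frame notes with the Test_Name and Description columns (the question does not say which text-mining package was actually used):
library(tm)

corpus <- VCorpus(VectorSource(notes$Description))           # hypothetical 'notes' data frame
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stopwords
corpus <- tm_map(corpus, stripWhitespace)                    # remove extra whitespace

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)   # drop very sparse terms (the threshold is a hypothetical choice)

m <- as.matrix(dtm)
m[m > 0] <- 1                         # binary instance matrix: 1 = word present, 0 = absent

train_df <- data.frame(m, Test_Name = as.factor(notes$Test_Name))  # append the label column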
Model Building
Using the H2O package, build a GBM model, for example as sketched below.
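A hedged sketch of the modelling step, assuming the hypothetical train_df frame built in the sketch above (column names and settings are illustrative, not the question's actual code):
library(h2o)
h2o.init()

train <- as.h2o(train_df, destination_frame = "train")
gbm_fit <- h2o.gbm(x = setdiff(colnames(train), "Test_Name"),  # all term columns as predictors
                   y = "Test_Name",                            # the 6-class label
                   training_frame = train)
h2o.confusionMatrix(gbm_fit, train = TRUE)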
Results obtained
Four of the class labels are classified well, but the other two are classified poorly.
Below is the output:
Extract training frame with `h2o.getFrame("train")`
MSE: (Extract with `h2o.mse`) 0.1197392
RMSE: (Extract with `h2o.rmse`) 0.3460335
Logloss: (Extract with `h2o.logloss`) 0.3245868
Mean Per-Class Error: 0.3791268
Confusion Matrix (extract with `h2o.confusionMatrix(<model>, train = TRUE)`); per-class errors are shown as misclassified / total:
Body Fluid Analysis = 401 / 2,759
Cytology Test = 182 / 1,087
Diagnostic Imaging = 117 / 3,907
Doctors Advice = 32 / 752
Organ Function Test = 461 / 463
Patient Related = 101 / 113
Totals = 1,294 / 9,081
The misclassification errors for Organ Function Test and Patient Related are relatively high. How can I fix this?
Just some quick things you can do to improve this:
Look at performance metrics on a validation set, including the confusion matrix
Perhaps try hyperparameter tuning to improve performance for your task, using h2o.grid: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html (see the sketch below)
Consider using h2o.word2vec for feature generation (example: https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.word2vec.craigslistjobtitles.R)
If you provide more details and a working example there is more that can be done to help you.
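For the second point, a hedged sketch of a grid search with h2o.grid; the hyperparameter values are hypothetical and the train frame and Test_Name label come from the GBM sketch above:
grid <- h2o.grid("gbm",
                 grid_id = "gbm_grid",
                 x = setdiff(colnames(train), "Test_Name"),
                 y = "Test_Name",
                 training_frame = train,
                 hyper_params = list(max_depth  = c(3, 5, 7),
                                     learn_rate = c(0.05, 0.1),
                                     ntrees     = c(100, 200)))
# inspect models sorted by mean per-class error (lower is better)
h2o.getGrid("gbm_grid", sort_by = "mean_per_class_error", decreasing = FALSE)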

What is the probability of a TERM for a specific TOPIC in Latent Dirichlet Allocation (LDA) in R

I'm working in R with the "topicmodels" package, and I'm trying to work through and better understand the code/package. In most of the tutorials and documentation I'm reading, I see people define topics by their 5 or 10 most probable terms.
Here is an example:
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], k = 5)
topics(lda)
terms(lda)
terms(lda,5)
So the last part of the code returns the 5 most probable terms associated with each of the 5 topics I've defined.
In the lda object, I can access the gamma element, which contains, per document, the probability of belonging to each topic. Based on this I can extract the topics with a probability greater than any threshold I prefer, instead of having the same number of topics for every document.
But my second step would then be to know which words are most strongly associated with the topics. I can use the terms(lda) function to pull this out, but it only gives me the top N terms.
In the output I've also found the
lda@beta
which contains the beta per word per topic, but this is a beta value which I'm having a hard time interpreting. They are all negative values, and though I see some values around -6 and others around -200, I can't interpret this as a probability or as a measure of which words are associated with a topic and how strongly. Is there a way to pull out or calculate anything that can be interpreted as such a measure?
many thanks
Frederik
The beta matrix gives you a matrix of dimension #topics x #terms. The values are log-likelihoods, so you exponentiate them. The resulting probabilities are of the type
P(word | topic), and these probabilities only add up to 1 if you take the sum over the words, not over the topics: the sum over all words of P(word | topic) is 1, but the sum over all topics of P(word | topic) is NOT 1.
What you are searching for is P(topic | word), but I actually do not know how to access or calculate it directly in this context. You will need P(word) and P(topic), I guess. P(topic) should be:
colSums(lda@gamma)/sum(lda@gamma)
This becomes more obvious if you look at the gamma matrix, which is #documents x #topics. The given probabilities are P(topic | document) and can be interpreted as "what is the probability of topic x given document y". The sum over all topics should be 1, but not the sum over all documents.
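A hedged sketch of one way to get there, building on the example above and on Bayes' rule (this is an addition, not part of the original answer; it assumes the topicmodels LDA object lda):
phi <- exp(lda@beta)                             # topics x terms; each row sums to 1: P(word | topic)
# topicmodels also exposes this as posterior(lda)$terms

p_topic <- colSums(lda@gamma) / sum(lda@gamma)   # rough estimate of P(topic), as suggested above
p_word  <- colSums(phi * p_topic)                # P(word) = sum over topics of P(word | topic) * P(topic)

p_topic_given_word <- t(phi * p_topic) / p_word  # Bayes' rule: P(topic | word), terms x topics
rowSums(p_topic_given_word)                      # each row should sum to ~1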

Why is the objective function in a nonlinear programming (NLP) solver Rsolnp not honored?

The case:
I have a universe of 3 regions (Brazil, New Zealand, USA); my actual problem is much larger, with 31 regions. These three regions are connected through migration. For example, if 10 people move from Brazil to the USA (BRZ-USA), we have immigration to the USA (inflow of people) and emigration from Brazil (outflow of people). I have a data set of migration rates for all possible migration flows in the given universe (3*2 = 6). In addition, I have a data set of the population in each region.
When I multiply the migration rates by the population, I obtain migrant counts. I can then compute for each region the number of immigrants and the number of emigrants. Subtracting the emigrants from the immigrants results in net migration counts (which can be positive or negative). However, since we have a balanced system (inflows equal outflows for each region), the sum of the net migrants across all regions should be zero.
In addition to the migration rates and the population, I also have net migrant counts from a hypothetical future scenario for each region. But the scenario net migrant counts are different from the counts that I can compute from my data. So I would like to scale the 6 migration rates up and down (by adding or subtracting a fixed number) so that the resulting net migration counts meet the scenario values. I use the nonlinear programming (NLP) solver Rsolnp to accomplish this task (see the example script below).
The problem:
I have specified the objective function in the form of a least squares equation because the goal is to force the 6 scalars to be as close to zero as possible. In addition, I am using an equality constraint function to meet the scenario values. This all works fine, and the solver provides scalars that I can add to the migration rates so that the resulting migrant counts perfectly match the scenario values (see the script part "Test if target was reached").
However, I would also like to apply weights (variable: w) to the objective function so that higher values on certain scalars are penalized more strongly. Regardless of how I specify the weights, I always obtain the same solution (see "Example results for different weights"). So it appears that the solver does not honor the objective function. Does anyone have an idea why that is the case and how I could change the objective function so that the use of weights is possible? Thanks a lot for any help!
library(Rsolnp)
# Regions
regUAll=c("BRZ","NZL","USA") # "BRZ"=Brazil; "NZL"=New Zealand; "USA"=United States
#* Generate unique combinations of regions
uCombi=expand.grid(regUAll,regUAll,stringsAsFactors=F)
uCombi=uCombi[uCombi$Var1!=uCombi$Var2,] # remove same region combination (e.g., BRZ-BRZ)
uCombi=paste(uCombi$Var2,uCombi$Var1,sep="-")
#* Generate data frames
# Migration rates - rows represent major age groups (row1=0-25 years, row2=26-50 years, row3=51-75 years)
dfnm=data.frame(matrix(rep(c(0.01,0.04,0.02),length(uCombi)),ncol=length(uCombi),nrow=3),stringsAsFactors=F) # generate df of migration rates, one column per flow
names(dfnm)=uCombi # assign variable names
# Population (number of people) in region of origin
pop=c(rep(c(20,40,10),2),rep(c(4,7,2),2),rep(c(30,70,50),2))
dfpop=data.frame(matrix(pop,ncol=length(uCombi),nrow=3),stringsAsFactors=F) # generate df of origin populations, one column per flow
names(dfpop)=uCombi # assign variable names
#* Objective function for optimization
# Note: Least squares method to keep the additive scalars as close to 0 as possible
# The sum expression allows for flexible numbers of scalars to be included but is identical to: w[1]*(scal[1]-0)^2 + w[2]*(scal[2]-0)^2 + w[3]*(scal[3]-0)^2 + w[4]*(scal[4]-0)^2 + w[5]*(scal[5]-0)^2 + w[6]*(scal[6]-0)^2
f.main=function(scal,nScal,w,dfnm,dfpop,regUAll){
  sum(w*(scal[1:nScal]-0)^2)
}
#* Equality constraint function
f.equal=function(scal,nScal,w,dfnm,dfpop,regUAll){
  #* Adjust net migration rates by scalar
  for(s in 1:nScal){
    dfnm[,s]=dfnm[,s]+scal[s]
  }
  #* Compute migration population from data
  nmp=sapply(dfpop*dfnm,sum) # sums migration population across age groups
  nmd=numeric(length(regUAll)); names(nmd)=regUAll # generate named vector to be filled with values
  for(i in 1:length(regUAll)){
    colnEm=names(nmp)[grep(paste0("^",regUAll[i],"-.*"),names(nmp))] # emigration columns
    colnIm=names(nmp)[grep(paste0("^.*","-",regUAll[i],"$"),names(nmp))] # immigration columns
    nmd[regUAll[i]]=sum(nmp[colnIm])-sum(nmp[colnEm]) # compute net migration population = immigration - emigration
  }
  nmd=nmd[1:(length(nmd)-1)] # remove the last equality constraint value - not needed because we have a closed system in which global net migration=0
  return(nmd)
}
#* Set optimization parameters
cpar2=list(delta=1,tol=1,outer.iter=10,trace=1) # optimizer settings
nScal=ncol(dfnm) # number of scalars to be used
initScal=rep(0,nScal) # initial values of additive scalars
lowScal=rep(-1,nScal) # lower bounds on scalars
highScal=rep(1,nScal) # upper bounds on scalars
nms=c(-50,10) # target values: BRZ=-50, NZL=10, USA=40; last target value does not need to be included since we deal with a closed system in which global net migration sums to 0
w=c(1,1,1,1,1,1) # unity weights
#w=c(1,1,2,2,1,1) # double weight on NZL
#w=c(5,1,2,7,1,0.5) # mixed weights
#* Perform optimization using solnp
solRes=solnp(initScal,fun=f.main,eqfun=f.equal,eqB=nms,LB=lowScal,UB=highScal,control=cpar2,
nScal=nScal,w=w,dfnm=dfnm,dfpop=dfpop,regUAll=regUAll)
scalSol=solRes$pars # return optimized values of scalars
# Example results for different weights
#[1] 0.101645349 0.110108019 -0.018876993 0.001571639 -0.235945755 -0.018134294 # w=c(1,1,1,1,1,1)
#[1] 0.101645349 0.110108019 -0.018876993 0.001571639 -0.235945755 -0.018134294 # w=c(1,1,2,2,1,1)
#[1] 0.101645349 0.110108019 -0.018876993 0.001571639 -0.235945755 -0.018134294 # w=c(5,1,2,7,1,0.5)
#*** Test if target was reached
# Adjust net migration rates using the optimized scalars
for(s in 1:nScal){
  dfnm[,s]=dfnm[,s]+scalSol[s]
}
# Compute new migration population
nmp=sapply(dfpop*dfnm,sum) # sums migration population across age groups
nmd=numeric(length(regUAll)); names(nmd)=regUAll # generate named vector to be filled with values
for(i in 1:length(regUAll)){
  colnEm=names(nmp)[grep(paste0("^",regUAll[i],"-.*"),names(nmp))] # emigration columns
  colnIm=names(nmp)[grep(paste0("^.*","-",regUAll[i],"$"),names(nmp))] # immigration columns
  nmd[regUAll[i]]=sum(nmp[colnIm])-sum(nmp[colnEm]) # compute net migration population = immigration - emigration
}
nmd # should be -50,10,40 if scalars work correctly
Try starting values (initScal) other than zero: at scal = 0 the objective is sum(w * 0^2) = 0 for every choice of w, so the weights cannot distinguish themselves at the starting point of the search.
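As a hedged illustration of that suggestion, this simply re-runs the question's own solnp call with a non-zero, random start inside the stated bounds:
set.seed(123)
initScal=runif(nScal,-0.5,0.5) # random non-zero start within the [-1, 1] bounds
solRes=solnp(initScal,fun=f.main,eqfun=f.equal,eqB=nms,LB=lowScal,UB=highScal,control=cpar2,
             nScal=nScal,w=w,dfnm=dfnm,dfpop=dfpop,regUAll=regUAll)
solRes$pars # with a non-zero start, different weight vectors w can lead to different solutions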

mathematical model to build a ranking/ scoring system

I want to rank a set of sellers. Each seller is defined by parameters var1, var2, var3, ..., var20. I want to score each of the sellers.
Currently I am calculating the score by assigning weights to these parameters (say 10% to var1, 20% to var2, and so on), and these weights are determined based on my gut feeling.
My score equation looks like
score = w1*var1 + w2*var2 + ... + w20*var20
score = 0.1*var1 + 0.5*var2 + 0.05*var3 + ... + 0.0001*var20
My score equation could also look like
score = w1^2*var1 + w2*var2 + ... + w20^5*var20
where var1, var2, ..., var20 are normalized.
Which equation should I use?
What are the methods to scientifically determine what weights to assign?
I want to optimize these weights using some data-oriented approach, so that the revamped scoring mechanism produces a more relevant score.
Example
I have the following features for sellers:
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order (the shorter the time taken, the better the seller) [numeric]
5] Price competitiveness
Are there better algorithms or approaches for solving this problem and calculating the score? That is, I linearly added the various features; is there a better approach to building the ranking system?
How do I come up with the values for the weights?
Apart from the above features, a few more that I can think of are the ratio of positive to negative reviews, the rate of damaged goods, etc. How will these fit into my score equation?
As a disclaimer, I don't think this is a concise answer, but your question is quite broad. This has not been tested; it is the approach I would most likely take given a similar problem.
As a possible direction to go, consider the multivariate Gaussian density (written out in R-style notation, since Stack Overflow does not render LaTeX):
f(x) = (2*pi)^(-n/2) * det(Sigma)^(-1/2) * exp(-0.5 * t(x - mu) %*% solve(Sigma) %*% (x - mu))
The idea is that each parameter is in its own dimension and could therefore be weighted by importance. Example:
Sigma = diag(c(1, 2, 3)); for a vector (x1, x2, x3), x1 would then have the greatest importance, because it has the smallest variance, so deviations in x1 reduce the density the most.
The covariance matrix Sigma takes care of the scaling in each dimension. To achieve this, simply put the weights on the diagonal elements of an n x n diagonal matrix; you are not really concerned with the cross terms.
Mu is a vector: the average of all the records (logs) in your data across your sellers.
x is the mean of every category for a particular seller, taken as a vector x = (x1, x2, x3, ..., xn). This value is continuously updated as more data are collected.
The parameters of the function, estimated from the total dataset, should evolve as well. That way, biased voting, especially in the "feelings"-based categories, can be weeded out.
After that setup, the evaluation of the function f(x) can be played with to give the desired results. It is a probability density function, but its utility is not restricted to statistics.
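A minimal sketch of this idea in R (an addition, not part of the original answer), assuming the mvtnorm package and a few hypothetical, already-normalized features; sellers closer to the average profile get a higher density and hence a better rank:
library(mvtnorm)

set.seed(1)
sellers <- data.frame(fulfillment = runif(5, 0.7, 1.0),   # hypothetical normalized features
                      cancel_rate = runif(5, 0.0, 0.3),
                      rating      = runif(5, 0.0, 1.0))

mu    <- colMeans(sellers)   # the "Mu" above: the average profile across all sellers
w     <- c(1, 2, 3)          # diagonal variances: a smaller value means greater importance, as in the example above
Sigma <- diag(w)             # weights enter only through the diagonal; cross terms are ignored

scores <- dmvnorm(as.matrix(sellers), mean = mu, sigma = Sigma)  # density value per seller
rank(-scores)                # rank 1 = highest density = best-scoring seller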
