semantic matching strings - using word2vec or s-match? - semantic-analysis

I have this problem of matching two strings for 'more general', 'less general', 'same meaning', 'opposite meaning' etc.
The strings can be from any domain. Assume that the strings can be from people's emails.
To give an example,
String 1 = "movies"
String 2 = "Inception"
Here I should know that Inception is less general than movies (sort of is-a relationship)
String 1 = "Inception"
String 2 = "Christopher Nolan"
Here I should know that Inception is less general than Christopher Nolan
String 1 = "service tax"
String 2 = "service tax 2015"
At a glance it appears to me that S-match will do the job. But I am not sure if S-match can be made to work on knowledge bases other than WordNet or GeoWordNet (as mentioned in their page).
If I use word2vec or dl4j, I guess it can give me the similarity scores. But does it also support telling a string is more general or less general than the other?
But I do see word2vec can be based on a training set or large corpus like wikipedia etc.
Can some one throw light on the way to go forward?

The current usage of machine learning methods such as word2vec and dl4j for modelling words are based on distributional hypothesis. They train models of words and phrases based on their context. There is no ontological aspects in these word models. At its best trained case a model based on these tools can say if two words can appear in similar contexts. That is how their similarity measure works.
The Mikolov papers (a, b and c) which suggests that these models can learn "Linguistic Regularity" doesn't have any ontological test analysis, it only suggests that these models are capable of predicting "similarity between members of the word pairs". This kind of prediction doesn't help your task. These models are even incapable of recognising similarity in contrast with relatedness (e.g. read this page SimLex test set).
I would say that you need an ontological database to solve your problem. More specifically about your examples, it seems for String 1 and String 2 in your examples:
String 1 = "a"
String 2 = "b"
You are trying to check entailment relations in sentences:
(1) "c is b"
(2) "c is a"
(3) "c is related to a".
Where:
(1) entails (2)
or
(1) entails (3)
In your two first examples, you can probably use semantic knowledge bases to solve the problem. But your third example will probably need a syntactical parsing before understanding the difference between two phrases. For example, these phrases:
"men"
"all men"
"tall men"
"men in black"
"men in general"
It needs a logical understanding to solve your problem. However, you can analyse that based on economy of language, adding more words to a phrase usually makes it less general. Longer phrases are less general comparing to shorter phrases. It doesn't give you a precise tool to solve the problem, but it can help to analyse some phrases without special words such as all, general or every.

Related

BERT sentence embeddings from transformers

I'm trying to get sentence vectors from hidden states in a BERT model. Looking at the huggingface BertModel instructions here, which say:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
So first note, as it is on the website, this does /not/ run. You get:
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable
But it looks like a minor change fixes it, in that you don't call the tokenizer directly, but ask it to encode the input:
encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)
OK, that aside, the tensors I get, however, have a different shape than I expected:
>>> output[0].shape
torch.Size([1,11,768])
This is a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? I have the goal of being able to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.
I see that the popular bert-as-a-service project appears to use [0]
Is this correct? Is there documentation for what each of the layers are?
While the existing answer of Jindrich is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings and the short answer to this question is none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for BERT weights released by the original authors. Jacob Devlin
(one of the authors of the BERT paper) wrote:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).
However, that does not mean you can not use BERT for such a task. It just means that you can not use the pre-trained weights out-of-the-box. You can either train a classifier on top of BERT which learns which sentences are similar (using the [CLS] token) or you can use sentence-transformers which can be used in an unsupervised scenario because they were trained to produce meaningful sentence representations.
I don't think there is single authoritative documentation saying what to use and when. You need to experiment and measure what is best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.
I think the rule of thumb is:
Use the last layer if you are going to fine-tune the model for your specific task. And finetune whenever you can, several hundred or even dozens of training examples are enough.
Use some of the middle layers (7-th or 8-th) if you cannot finetune the model. The intuition behind that is that the layers first develop a more and more abstract and general representation of the input. At some point, the representation starts to be more target to the pre-training task.
Bert-as-services uses the last layer by default (but it is configurable). Here, it would be [:, -1]. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special (so-called [CLS]) token is considered to be the sentence embedding. This where the [0] comes from in the snipper you refer to.
As mentioned in other answers, BERT was not meant to produce sentence level embeddings. Now, let's work on the how we can leverage power of BERT for computing context-sensitive sentence level embeddings.
BERT does carry the context at word level, here is an example:
This is a wooden stick.
Stick to your work.
Above two sentences carry the word 'stick', BERT does a good job in computing embeddings of stick as per sentence(or say, context).
Now, let's move to one another example:
--What is your age?
--How old are you?
Above two sentences are contextually very similar, so, we need a model that can accept a sentence or text chunk or paragraph and produce right embeddings collectively. Here is how it can be achieved.
Method 1:
Use pre-trained sentence_transformers, here is link to huggingface hub.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")
embd_a = model.encode("What is your age?")
embd_b = model.encode("How old are you?")
sim_score = cos_sim(embd_a, embd_b)
print(sim_score)
output: tensor([[0.8648]])
Now, there may be a question on how can we train our on sentence_transformer, specific to a domain. Here we go,
Supervised approach:
A common challenge for Data Scientist or MLEngineers is to get rightly annotated data, mostly it is hard to get it in good volume, but say, if you have it here is how we can train our on sentence_transformer (don't worry, there is an unsupervised approach too).
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
More details here.
Tip: If you have a set of sentences that are similar to each other, say, you have a CSV, where column A and B contains sentences similar to each other(I mean each row will have a pair of sentences which are similar to each other), just load the csv and assign random values between 0.85 to 0.95 as similarity score and proceed.
Unsupervised approach
Say you don't have a huge set of annotated data, but you want to train a domain specific sentence_transformer, here is how we do it. Even for unsupervised training, data will be required, i.e. list of sentences/paragraphs, but need not to be annotated. Say, you don't have any data at all, still there is a work round (please visit last part of the answer).
Multiple approaches are available for unsupervised training, listing two of the most prominent ones. To see list of all available approaches, please visit here.
TSDAE link to research paper.
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Create the special denoising dataset that adds noise on-the-fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
# DataLoader to batch your data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
# Call the fit method
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
weight_decay=0,
scheduler='constantlr',
optimizer_params={'lr': 3e-5},
show_progress_bar=True
)
model.save('output/tsdae-model')
SimCSE link to research paper
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import models, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'distilroberta-base'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[s, s]) for s in train_sentences]
# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)
# Use the denoising auto-encoder loss
train_loss = losses.MultipleNegativesRankingLoss(model)
# Call the fit method
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
show_progress_bar=True
)
model.save('output/simcse-model')
Tip: If you carefully observer, major difference is in the loss function used. To see a list of all the loss function applicable to such training scenarios, visit here. Also, with all the experiments I did, I found that TSDAE is more useful, when you want decent precision and good recall. However, SimCSE can be used when you want very high precision and low recall.
Now, if you don't have sufficient data to fine tune the model, but you find a BERT model trained on your domain, you can directly leverage that by adding pooling and dense layers. Please do research on what is 'pooling', to have better understanding on what you are doing.
from sentence_transformers import SentenceTransformer, models
from torch import nn
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
Tip: With above approach, if you start getting extreme high cosine score, it is an alarm to do negative testing. Sometime, simply adding pooling layers may not help, you must take few examples and check similarity scores for the inputs that are not similar (it is possible that even for dissimilar sentences, this may show good similarity, and that is the time you should stop and try to collect some data and do unsupervised training)
People who are interested in going deeper, here is a list of topics that may help you.
Pooling
Siamese Networks
Contrastive Loss
:) :)

How to extract sentences between point and brackets with R?

I Have:
Stringa=" This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009). "
Desidered Output:
[1]This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).
[2]Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)
[3] Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)
[4]Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientific conclusions without further experimentation” (Prensky, 2009)
I use:unlist(str_extract_all(string =Stringa, pattern = "\\. [A-Za-z][^()]+ \\("))
But it doesn't work
I don’t want extract ‘Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. ‘ and ‘Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. ‘
If there are no abbreviations in the text, you may use
regmatches(Stringa, gregexpr("[^.?!\\s][^.!?]*?\\([^()]*\\)", Stringa, perl=TRUE))
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)"
[2] "Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)"
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)"
[4] "Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009)"
See the regex demo and the R demo.
Details
[^.?!\\s] - any char but ., ?, ! and whitespace
[^.!?]*? - any 0+ chars other than ., ?, ! as few as possible
\([^()]*\) - a (, 0+ chars other than ( and ) and then a ).
We can handle this using grepexpr and regmatches, using the following regex pattern:
.*?\([^)]+\).*?(?=\w|$)
This will capture any content up to the first parenthesis, followed by a (...) term. The script below will capture all such matches in the source text.
m <- gregexpr(".*?\\([^)]+\\).*?(?=\\w|$)", x, perl=TRUE)
regmatches(x, m)
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)."
[2] "Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). "
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). "
[4] "Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation”(Prensky, 2009). "

Deep learning for chatbot training

We are trying to create an intelligent chatbot for customer service. We have a corpus of customer service questions and answers, with a flagged intention of each conversation. We are exploring to use Deep Learning to train our models but we encounter a couple of issues:
1 - How to do feature engineering to train models on text data. Specifically, how do you turn language into vectors ?
2 - How to use non-word features that you use as input for the intent recognition deep learning classifier? How do you accommodate e.g. client product names?
3 - How to choose a neural network architecture for Deep Learning with text input?
4 - How can we deal with situations where we do not have enough data? Use Bayesian techniques?
Cool.. great start !!.
before you make jump to implementation, i would suggest please do learn some basics.
anyway , here are answers to your questions. !!
feature engineering : as name suggests , in your data there are something that may reduce accuracy of your model. like words mixed with small and capital character, digits ,special character , lines ends with some special character.. etc. which after feature engineering gives more accuracy!! but again it's required all depends on what type of data you have !!
language into vectors : any type of language , at the end it is text (here in your case). we can give vector representation to word or character. this vector representation can be get by one hot vector or using pre-built methods like word2vec or glove.
one hot vector :- let's say you have 100 words from your training dataset . then create k-dimensional vector for each word. where k is total number of words. sord word by their character position. and based on thire sorted order create vector with keeping their index position 1 and rest as 0.
ex: [1 0 0 0 0 ....] - word1
[0 1 0 0 0 ....] - word2
[0 0 0 0 0 ...1] - word100
non-word features : follow same rule as word-features
client product name :- create one hot vector as they are not usually used in text. and they don't have meaning in real life.
how to choose NN :- it depends on what you want to achieve. NN can be used in many ways for many purpose.
not enough data :- it again depends on your data. !! if your data has more common pattern and in future data also these patterns going to come !! then it's still okay to use NN. else i don't recommend to use NN.
Good Luck !!
Some additions to the previous answer from Achyuta nanda sahoo. (Numbering according to your questions)
As he said, use some pretrained word embedding layers (Fasttext, word2vec)
U can find pretrained Models e.g. Here:
https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md
U can particularly find client product names using Named Entity Recognition. U can e.g. start off with the following repo
https://github.com/guillaumegenthial/tf_ner
U can start with some simple question answering matching according to cosine similarity, as done here:
https://github.com/sachinbiradar9/Question-Answer-Selection
Even if u initially do not have enough data, u may start with a deep neural net by pretraining on a huge dataset that comes from a similar question answering data distribution. There should be tons of websites, where u can find these question answering scenarios ready for scraping :-)
Best

R Constained Combination generation

Say I have an set of string:
x=c("a1","b2","c3","d4")
If I have a set of rules that must be met:
if "a1" and "b2" are together in group, then "c3" cannot be in that group.
if "d4" and "a1" are together in a group, then "b2" cannot be in that group.
I was wondering what sort of efficient algorithm are suitable for generating all combinations that meet those rules? What research or papers or anything talk about these type of constrained combination generation problems?
In the above problem, assume its combn(x,3)
I don't know anything about R, so I'll just address the theoretical aspect of this question.
First, the constraints are really boolean predicates of the form "a1 ^ b2 -> ¬c3" and so on. That means that all valid combinations can be represented by one binary decision diagram, which can be created by taking each of the constraints and ANDing them together. In theory you might make an exponentially large BDD that way (that usually doesn't happen, but depends on the structure of the problem), but that would mean that you can't really list all combinations anyway, so it's probably not too bad.
For example the BDD generated for those two constraints would be (I think - not tested - just to give an idea)
But since this is really about a family of sets, a ZDD probably works even better. The difference, roughly, between a BDD and a ZDD is that a BDD compresses nodes that have equal sub-trees (in the total tree of all possibilities), while the ZDD compresses nodes where the solid edge (ie "set this variable to 1") goes to False. Both re-use equal sub-trees and thus form a DAG.
The ZDD of the example would be (again not tested)
I find ZDDs a bit easier to manipulate in code, because any time a variable can be set, it will appear in the ZDD. In contrast, in a BDD, "skipped" nodes have to be detected, including "between the last node and the leaf", so for a BDD you have to keep track of your universe. For a ZDD, most operations are independent of the universe (except complement, which is rarely needed in the family-of-sets scenario). A downside is that you have to be aware of the universe when constructing the constraints, because they have to contain "don't care" paths for all the variables not mentioned in the constraint.
You can find more information about both BDDs and ZDDs in The Art of Computer Programming volume 4A, chapter 7.1.4, there is an old version available for free here.
These methods are in particular nice to represent large numbers of such combinations, and to manipulate them somehow before generating all the possibilities. So this will also work when there are many items and many constraints (such that the final count of combinations is not too large), (usually) without creating intermediate results of exponential size.

Recursive hypothesis-building with ambiguites - what's it called?

There's a problem I've encountered a lot (in the broad fields of data analyis or AI). However I can't name it, probably because I don't have a formal CS background. Please bear with me, I'll give two examples:
Imagine natural language parsing:
The flower eats the cow.
You have a program that takes each word, and determines its type and the relations between them. There are two ways to interpret this sentence:
1) flower (substantive) -- eats (verb) --> cow (object)
using the usual SVO word order, or
2) cow (substantive) -- eats (verb) --> flower (object)
using a more poetic world order. The program would rule out other possibilities, e.g. "flower" as a verb, since it follows "the". It would then rank the remaining possibilites: 1) has a more natural word order than 2), so it gets more points. But including the world knowledge that flowers can't eat cows, 2) still wins. So it might return both hypotheses, and give 1) a score of 30, and 2) a score of 70.
Then, it remembers both hypotheses and continues parsing the text, branching off. One branch assumes 1), one 2). If a branch reaches a contradiction, or a ranking of ~0, it is discarded. In the end it presents ranked hypotheses again, but for the whole text.
For a different example, imagine optical character recognition:
** **
** ** *****
** *******
******* **
* ** **
** **
I could look at the strokes and say, sure this is an "H". After identifying the H, I notice there are smudges around it, and give it a slightly poorer score.
Alternatively, I could run my smudge recognition first, and notice that the horizontal line looks like an artifact. After removal, I recognize that this is ll or Il, and give it some ranking.
After processing the whole image, it can be Hlumination, lllumination or Illumination. Using a dictionary and the total ranking, I decide that it's the last one.
The general problem is always some kind of parsing / understanding. Examples:
Natural languages or ambiguous languages
OCR
Path finding
Dealing with ambiguous or incomplete user imput - which interpretations make sense, which is the most plausible?
I'ts recursive.
It can bail out early (when a branch / interpretation doesn't make sense, or will certainly end up with a score of 0). So it's probably some kind of backtracking.
It keeps all options in mind in light of ambiguities.
It's based off simple rules at the bottom can_eat(cow, flower) = true.
It keeps a plausibility ranking of interpretations.
It's recursive on a meta level: It can fork / branch off into different 'worlds' where it assumes different hypotheses when dealing with the next part of data.
It'll forward the individual rankings, probably using bayesian probability, to dependent hypotheses.
In practice, there will be methods to train this thing, determine ranking coefficients, and there will be cutoffs if the tree becomes too big.
I have no clue what this is called. One might guess 'decision tree' or 'recursive descent', but I know those terms mean different things.
I know Prolog can solve simple cases of this, like genealogies and finding out who is whom's uncle. But you have to give all the data in code, and it doesn't seem convienent or powerful enough to do this for my real life cases.
I'd like to know, what is this problem called, are there common strategies for dealing with this? Is there good literature on the topic? Are there libraries for ideally C(++), Python, were you can just define a bunch of rules, and it works out all the rankings and hypotheses?
I don't think there is one answer that fits all the bullet points you have. But I hope my links will lead you closer to an answer or might give you a different question.
I think the closest answer is Bayesian network since you have probabilities affecting each other as I understand it, it is also related to Conditional probability and Fuzzy Logic
You also describe a bit of genetic programming as well as Artificial Neural Networks
I can name drop some more topics which might be related:
http://en.wikipedia.org/wiki/Rule-based_programming
http://en.wikipedia.org/wiki/Expert_system
http://en.wikipedia.org/wiki/Knowledge_engineering
http://en.wikipedia.org/wiki/Fuzzy_system
http://en.wikipedia.org/wiki/Bayesian_inference

Resources