SyntaxNet - Can I get the count of a given bigram/trigram from SyntaxNet?

I need to get the frequency of occurrence of a given bigram or trigram - is this possible with SyntaxNet?

I'm not sure, but why would you use SyntaxNet for that when you can do the same with something much less complex, like CWB or NLTK?
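For example, here is a minimal NLTK sketch of the counting itself (the sample text is a placeholder; substitute your own tokenized corpus):

from collections import Counter
from nltk.util import ngrams

# Placeholder sample; replace with your own corpus tokens.
tokens = "the quick brown fox jumps over the lazy dog the quick brown fox".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Frequency of one specific bigram/trigram:
print(bigram_counts[("quick", "brown")])           # 2
print(trigram_counts[("the", "quick", "brown")])   # 2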

Related

Checking for similarity of text in two text strings

I have two strings of text (typically two paragraphs). I am looking to check for the "similarity" between them, e.g. check if one paragraph is a plagiarised version of the other. Ideally I need a similarity score, as well as an indication of where the similarities are. I prefer to do this fully in R. Any suggestions please?
The difference between strings can be measured with the Levenshtein distance (or concepts that build on top of it). The main idea is to quantify the "editing distance" between strings: how many letters need to be inserted/deleted/changed, etc. (depending on the algorithm, more or fewer types of edits are allowed). A package in R for this task would be fuzzyjoin.
To look up the similarities, you could cut both texts (the original and the suspected plagiarised version) into sentences and build fuzzy joins on those; then you can filter for the best matches. The topic is a bit tricky, so I recommend trying out different algorithms (Jaccard distance, Damerau-Levenshtein, etc.). A start into the topic can be found here: https://cran.r-project.org/web/packages/fuzzyjoin/readme/README.html
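To make the edit-distance idea concrete, here is a minimal Python sketch of the classic Levenshtein dynamic program (fuzzyjoin wraps the same idea in R, so this is just an illustration of the concept):

def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions,
    # and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3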

How to detect grammatical elements in a corpus

I'm working with a big corpus in RStudio, and the next phase of our research includes detecting certain grammatical elements and their frequency in the texts. We want to detect the frequency of occurrence of things like abstract nouns or deontic modalities, which include the auxiliary verbs ‘must’, ‘have to’, ‘may’, ‘can’, ‘should’, ‘ought to’, etc. I would also like to capture their possible conjugations, i.e., not only 'she has to' but also 'she had to'; not only 'he can' but also 'he could'. I guess it could be done using some simple regexes such as
[Ss]he ha(s|d|ve) to
[Hh]e c(an|ould)
...right? The problem is that 1) I'm not sure whether this can be done (I guess it can) and 2) I don't know which library I should use to do it.
I have thought about building a dictionary and running it against the whole corpus, but questions 1) and 2) still stand.
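I can't say which R library is best, but the regex idea itself works. Here is a minimal sketch in Python for illustration (the patterns and sample text are made up; in R the same patterns can be applied with, e.g., stringr::str_count):

import re

# Illustrative patterns covering a few conjugations; extend the
# alternations to cover your full list of modals.
patterns = {
    "have to": r"\b[Ss]he ha(?:s|d|ve) to\b",
    "can/could": r"\b[Hh]e c(?:an|ould)\b",
}

# Placeholder corpus snippet; replace with your own texts.
corpus = "She had to leave early. He could not stay. She has to go."

for name, pattern in patterns.items():
    print(name, len(re.findall(pattern, corpus)))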

MS Access - Calculated Column for Distinct Count in Table Rather than a Query

I'd like to have a Calculated Column in a table that counts the instances of a concatenation.
I get the following error when entering Abs(Count([concat])) as the column's expression: "The expression Abs(Count([concat])) cannot be used in a calculated column."
Is there any other way to do it without a query? I'm pretty sure it can't be done, but I figured I'd ask anyway since I didn't see any other posts about it.
No, and even if there were, you should create and use a query for this.
Besides, applying Abs to a count doesn't make much sense, as a count cannot be negative.
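For illustration, a GROUP BY query of the kind meant here, run from Python via pyodbc (the driver string, file path, and table name are all placeholder assumptions):

import pyodbc  # requires the Microsoft Access ODBC driver

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\example.accdb"  # hypothetical database path
)

# Count rows per distinct concatenation in a query instead of a
# calculated column (MyTable is a placeholder name).
sql = "SELECT [concat], COUNT(*) AS occurrences FROM MyTable GROUP BY [concat]"
for row in conn.cursor().execute(sql):
    print(row.concat, row.occurrences)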

n-gram sentence similarity with cosine similarity measurement

I have been working on a project about sentence similarity. I know it has been asked many times on SO, but I just want to know whether my problem can be solved with the method I am using, the way I am using it, or whether I should change my approach to the problem. Roughly speaking, the system is supposed to split all sentences of an article and find similar sentences among the other articles that are fed to the system.
I am using cosine similarity with tf-idf weights, and here is how I did it:
1- First, I split all the articles into sentences, then I generate trigrams for each sentence and sort them (should I?).
2- I compute the tf-idf weights of trigrams and create vectors for all sentences.
3- I calculate the dot product and magnitudes of the original sentence and of the sentence being compared, then calculate the cosine similarity.
However, the system does not work as I expected, and I have some questions in mind.
As far as I have read about tf-idf weights, I gather they are more useful for finding similar "documents". Since I am working on sentences, I modified the algorithm a little by changing some variables in the tf and idf definitions (instead of documents, I came up with sentence-based definitions):
tf = number of occurrences of trigram in sentence / number of all trigrams in sentence
idf = number of all sentences in all articles / number of sentences where trigram appears
Do you think it is ok to use such a definition for this problem?
Another question is that I have seen normalization mentioned many times in connection with computing the cosine similarity. I am guessing this is important because the trigram vectors might not be the same size (and they rarely are, in my case). If one trigram vector has size x and the other x+1, then I treat the first vector as if it were of size x+1, with the last value being 0. Is this what is meant by normalization? If not, how do I do the normalization?
Besides this, if I have chosen the wrong algorithm, what else can be used for such a problem (preferably with an n-gram approach)?
Thank you in advance.
I am not sure why you are sorting the trigrams for every sentence. All you need to care about when computing cosine similarity is whether the same trigram occurred in the two sentences, and with what frequencies. Conceptually speaking, you define a fixed, common order over all possible trigrams. Remember the order has to be the same for all sentences. If the number of possible trigrams is N, then for each sentence you obtain a vector of dimensionality N. If a certain trigram does not occur, you set the corresponding value in the vector to zero. You don't really need to store the zeros, but you have to take care of them when you define the dot product.
Having said that, trigrams are not a good choice, as the chances of a match are a lot sparser. For high k you will get better results from bags of k consecutive words than from k-grams. Note that the ordering does not matter inside a bag; it's a set. You are using k=3 k-grams, which seems to be on the high side, especially for sentences. Either drop down to bigrams or use bags of different lengths, starting from 1. Preferably use both.
I am sure you have noticed that sentences that do not share an exact trigram have 0 similarity in your method. Bags of k words will alleviate the situation somewhat, but not solve it completely, because you still need sentences to share actual words, and two sentences may be similar without using the same words. There are a couple of ways to fix this: either use LSI (Latent Semantic Indexing), or cluster the words and use the cluster labels to define your cosine similarity.
To compute the cosine similarity between vectors x and y, you compute the dot product and divide by the product of the norms of x and y.
The 2-norm of a vector x can be computed as the square root of the sum of its squared components. However, you should also try your algorithm without any normalization, for comparison. Usually it works fine, because you are already taking care of the relative sizes of the sentences when you compute the term frequencies (tf).
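A minimal sketch of that recipe, using sparse trigram count vectors (plain counts here; tf-idf weights can be substituted in the same structure):

import math
from collections import Counter

def trigrams(sentence):
    # Sparse vector of word-trigram counts; zeros stay implicit.
    words = sentence.lower().split()
    return Counter(zip(words, words[1:], words[2:]))

def cosine(u, v):
    # The dot product only needs the keys present in both vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = trigrams("the cat sat on the mat")
b = trigrams("the cat sat on the rug")
print(cosine(a, b))  # 0.75: three of the four trigrams are shared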
Hope this helps.

Find HEX patterns and number of occurrences

I'd like to find patterns in a hex file I have and sort them by number of occurrences.
I am not looking for a specific pattern; I just want to compute some statistics on the occurrences found there and sort them.
DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40DB50DB87DBB0DBA1DBABDBA0DB9ADBA6DBACDBA0DB96DB95DBB7DBCFDBCBDBD6DB9CDBB5DB9DDB9FDBA3DB88DB89DB93DBA5DB9CDBC1DBC1DBC6DBC3DBC9DBB3DBB8DBB6DBC8DBA8DBB6DBA2DB98DBA9DBB9DBDBDBD5DBD9DBC3DB9BDBA2DB84DB83DB7DDB6BDB58DB4EDB42DB16DB0DDB01DB02DAFCDAE9DAE5DAD9DAE2DAB7DA9BDAA6DA9EDAAADAC9DACADAC4DA92DA90DA84DA89DA93DAA9DA8CDA7FDA62DA53DA6EDA
That's an excerpt of the HEX file, and as an example I'd like to get:
XX occurrences of BDBDBD
XX occurrences of B93D
Is there a way to mine the file to generate that output?
Sure. Use a sliding window to create the counts (the linked example is in Perl, but it is general enough to understand the algorithm). Your patterns are called N-grams. You will have to cap the maximum pattern length, though.
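A minimal sliding-window sketch in Python (the window lengths and the truncated input are illustrative; read the full file instead):

from collections import Counter

hex_data = "DB0DDAEEDAF7DAF5DB1FDB1D"  # excerpt; use the whole file

counts = Counter()
for n in (4, 6):  # pattern lengths to tally, in hex digits
    for i in range(len(hex_data) - n + 1):
        counts[hex_data[i:i + n]] += 1

# Most frequent patterns first
for pattern, count in counts.most_common(10):
    print(f"{count} occurrences of {pattern}")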
This is a pretty classic CS problem. The code is non-trivial to implement, as it requires at least one full pass over the sequence and, depending on your efficiency and memory/processor constraints, might require several. See here.
You will need to partition your input string in some way to ensure that you get good subsequence coverage across it.
If there is a specific problem, we might be able to help more, but the general strategy is in the Wikipedia article above.
You can use regular expressions to build a pattern to search for.
The regex needed would be very simple: just use the exact phrase you're searching for. Then there should be a regular-expression function in the language you're using (you didn't specify) that can count the number of matches.
Use that to create a simple counter.
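For example, in Python (the file name is a placeholder; the pattern is one of the asker's own examples), counting overlapping matches with a lookahead:

import re

hex_data = open("dump.hex").read()  # placeholder file name

# A lookahead matches at every position where the pattern starts,
# so overlapping occurrences are counted too.
print(len(re.findall(r"(?=B93D)", hex_data)))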
