Lucene Feature Vector

I want to construct a feature vector for each document in a Lucene index.
I also have a set of keywords, and want to construct a feature vector for them.
Then I will try to match documents to the keywords according to the similarity of their feature vectors.
So, any hints on how Lucene can help me address these three tasks?
Much thanks.

As bmargulies says, you can use Mahout. Here's some documentation on it: https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text#CreatingVectorsfromText-FromLucene
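
To make the matching step concrete, here is a minimal sketch that does not use Lucene or Mahout at all: it builds term-frequency vectors from raw text and compares them with cosine similarity. The function names and the whitespace tokenization are assumptions made for this illustration; in practice the term counts would come from the index.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a sparse term-frequency vector (term -> count) from raw text."""
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc = term_vector("lucene builds an inverted index of terms")
    keywords = term_vector("lucene index")
    print(cosine_similarity(doc, keywords))
```

Raw counts are the simplest weighting; a TF-IDF weighting would plug into the same similarity function.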


gremlin shortestPath() with step

Where is the documentation for the Gremlin with() step?
https://tinkerpop.apache.org/docs/current/reference/#with-step
https://tinkerpop.apache.org/javadocs/current/full/org/apache/tinkerpop/gremlin/process/computer/traversal/step/map/ShortestPath.html
There is no example I can use.
I want to know all of the with() options (like ShortestPath.edges with Direction.OUT) for shortestPath().
I found the following:
g.withComputer().V(xxx).shortestPath().with(ShortestPath.edges, Direction.OUT).with(ShortestPath.distance, 'cost').with(ShortestPath.target, hasId(bbb))
I want to know all the options I can use.
The with()-step is not really a "step". It is a step modulator. Its context is bound to the step that it is modifying and therefore you won't find "all the with() configurations" in one place. You can only find them in the documentation related to the steps that they modulate. Using shortestPath() as an example, note that if you look at the shortestPath() step documentation all of the options are present.
You may also need to consult the documentation of your graph provider as they may provide their own configuration keys for certain steps which can help optimize or otherwise modify traversal operations.

Use multiple dictionaries for cmu sphinx

For my project, the default dictionary provided by Sphinx is not sufficient.
I need to use another custom dictionary along with the provided dictionary.
Is there any way of specifying multiple dictionary files to Sphinx, or do I need to combine both dictionaries into a single big dictionary file?
Thanks in advance :)
You have to combine the dictionaries into a single one.
If you want multiple dictionaries with Sphinx, what I did was make my program delete the previous dictionary and write another one to the same file path, containing the different words I wanted to use. This can be done back and forth as many times as you like, to give the impression of multiple dictionaries.
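
Combining the dictionaries can be scripted. The sketch below assumes the usual one-entry-per-line text format (a word followed by its phonemes, separated by whitespace); the function names and file paths are hypothetical. It keeps the first occurrence of each entry and preserves order.

```python
def merge_entries(*dictionaries):
    """Merge several iterables of dictionary lines, dropping exact duplicates."""
    seen = set()
    merged = []
    for lines in dictionaries:
        for line in lines:
            # Normalize whitespace so "hello\tHH AH" and "hello HH AH" compare equal.
            entry = " ".join(line.split())
            if entry and entry not in seen:
                seen.add(entry)
                merged.append(entry)
    return merged


def merge_dictionary_files(paths, out_path):
    """Read each dictionary file in `paths` and write the merged result."""
    sources = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            sources.append(handle.readlines())
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(merge_entries(*sources)) + "\n")


if __name__ == "__main__":
    default = ["hello HH AH L OW\n", "world W ER L D\n"]
    custom = ["hello HH AH L OW\n", "sphinx S F IH NG K S\n"]
    for entry in merge_entries(default, custom):
        print(entry)
```

This only removes identical lines. If both dictionaries define the same word with different pronunciations, both entries are kept as written, so they may still need to be renamed by hand as alternate pronunciations.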

How to find the depth of a word in Wordnet with JWNL

I am using the JWNL library to access the WordNet lexical database. I am trying to find the similarity between two words, and for that I need to find the depth of a word from the root. I am new to WordNet. Can someone please explain how to get the depth of a word as described above?
http://web.stanford.edu/class/archive/cs/cs276a/cs276a.1032/projects/docs/jwnl/javadoc/
Hope this helps.
If you want a ready-made similarity between two words, I suggest using the WS4J library. It implements many similarity measures.
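
As a rough illustration of what "depth from the root" means, the sketch below walks hypernym links in a tiny hand-made hierarchy. It does not use JWNL or real WordNet data: the `HYPERNYMS` table and the `depth` function are made up for this example, and taking the shortest path when a node has several hypernyms is just one possible convention.

```python
# Toy hypernym table (child -> list of parents). This is NOT real WordNet data.
HYPERNYMS = {
    "entity": [],
    "organism": ["entity"],
    "animal": ["organism"],
    "canine": ["animal"],
    "dog": ["canine"],
    "cat": ["animal"],
}


def depth(node, hypernyms=HYPERNYMS):
    """Number of hypernym links on the shortest path from `node` to a root."""
    parents = hypernyms[node]
    if not parents:
        return 0  # a node without hypernyms is a root
    return 1 + min(depth(parent, hypernyms) for parent in parents)


if __name__ == "__main__":
    for word in ("entity", "cat", "dog"):
        print(word, depth(word))
```

With a real WordNet library, the same idea applies to synsets rather than plain strings, following the hypernym pointers the library exposes.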

SQLite FTS necessary?

I am using Python with sqlite3.
Is there an advantage to using FTS3 or FTS4 if I only want to search for words in one column, or could I just use LIKE "%word%"?
While you could handle the simplest cases with LIKE '%word%', FTS allows user queries like clown NOT circus (matches rows talking about clowns but not circuses). With the porter tokenizer it handles stemming (in English, at least), so that break would match breaks but not breakdance. It also builds its indexes differently, so searching is much faster for this sort of query, and it can return not just where the match happened but also a snippet of the matched text.
Finally, all the parsing of those potentially-complex user queries is done for you; that's code you don't have to write at all.
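
A small sketch of the stemming difference, assuming your Python's bundled SQLite was compiled with FTS4 (most are, but it is not guaranteed). The table name `docs`, the column `body`, and the sample rows are made up for this example.

```python
import sqlite3


def search_demo():
    """Compare an FTS4 MATCH (porter stemming) with a plain LIKE search."""
    conn = sqlite3.connect(":memory:")
    # tokenize=porter enables English stemming; the default tokenizer does not stem.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=porter)")
    conn.executemany(
        "INSERT INTO docs(body) VALUES (?)",
        [("He breaks the glass",), ("She loves breakdance",)],
    )
    fts_rows = [
        row[0]
        for row in conn.execute(
            "SELECT body FROM docs WHERE body MATCH ? ORDER BY rowid", ("break",)
        )
    ]
    like_rows = [
        row[0]
        for row in conn.execute(
            "SELECT body FROM docs WHERE body LIKE ? ORDER BY rowid", ("%break%",)
        )
    ]
    conn.close()
    return fts_rows, like_rows


if __name__ == "__main__":
    fts_rows, like_rows = search_demo()
    print("MATCH:", fts_rows)
    print("LIKE: ", like_rows)
```

The MATCH query finds only the row containing "breaks", while the LIKE pattern also picks up "breakdance" because it matches any substring.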

Is there a way to verify if a word exists in a given language?

I have some word lists that were extracted from an aspell dictionary.
The problem is that some of the words that aspell returns aren't valid words.
I'd like to know if there is a way to check whether the returned words exist in a given language or not.
Thanks!
You could validate them using the Google spelling API.
