IR with Indri - how to get smoothing values and document data from an index

I have an index (I didn't build it, so I don't have the documents) and I want to get these values from it:
1. What documents is the index based on, and what are their lengths?
2. Can I get the bag-of-words values for each document? I know I can get the values for the whole corpus with RunQuery_tfidf.xml, but I want the values per document.
3. Is there a way to get the smoothing values?

Related

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function for recommending products. When R finishes, the return value is a string containing all product details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line is separated by \n, indicating end of line.)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The above data reaches Tableau through a calculated field as a single string. I need to split it into three columns on the pipe character ('|').
I used the SPLIT function on the calculated field:
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives? I googled for best practices for integrating R and Tableau, and all I could find were simple k-means clustering examples.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script, and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and clicking "Specific Dimensions". The fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky to map your problem onto this infrastructure. Not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer and get back one vector with one result per Manufacturer (or whatever combination of fields partition your data for the table calc). Sounds like you are expecting an arbitrary length list of recommendations. It shouldn’t be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
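For instance, here is a rough sketch (plain R, made-up data) of turning that single delimited string into a per-row vector before handing it back to Tableau; the returned vector has to line up with the partition, one element per row, in the same order:

# Hypothetical example: the string your recommender currently returns
result_string <- "ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA"

rows <- strsplit(result_string, "\n", fixed = TRUE)[[1]][-1]        # drop the header row
recommended <- sapply(strsplit(rows, "|", fixed = TRUE), `[`, 3)    # third pipe-separated field
recommended
# [1] "PROD008" "NA"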
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
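A minimal sketch of that contract, with hypothetical measures, in plain R rather than the full SCRIPT_REAL call (Tableau would hand these in as .arg1 and .arg2):

# One value per Product (N = 4); Tableau would pass these via SCRIPT_REAL as .arg1 / .arg2
sales  <- c(120, 40, 310, 95)
margin <- c(0.22, 0.10, 0.35, 0.18)

# Placeholder scoring: the result must also have length N, one score per Product
score <- 0.7 * as.numeric(scale(sales)) + 0.3 * as.numeric(scale(margin))
score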
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.

If I want 5 values that are in columns to the right of each key, what is the ideal way to train the Form Recognizer?

I have a column of numbers at the far left as my keys; each entry has 5 design values I'm trying to pair with it. To train the model, I've used 15 completed PDF files, most of which were not scans. I also edited 3 of those, deleting the values but leaving the keys, and saved them with the same file name as the original, suffixed with "Empty".
The results returned from the model have no problem finding any of the numbers or their locations, but they are not in key-value pairs of any kind. I get that a key-value "pair" excludes any possibility of grabbing both the column header and the row, but just the row and position relative to the others would make things easy enough. Just hoping for some insight on how to train it to reuse the same key as it looks across the row.
I'm exporting the data to Word format and tabulating the values with a light border. I have no experience with machine learning. For the empty form, would there be any benefit to adding DocVariable fields to each of the 5 value columns, with the variable name being a combination of the row and column key names?
Actually, it's not necessary to delete these keys from your sample data to train the Form Recognizer model; it's even incorrect to do that, because Form Recognizer needs to learn what the keys in your sample data are.
So you just need to follow the official tutorial Build a training data set for a custom model to train the model with more samples of a similar form layout with different keys and values. Then you can follow my answer to the SO thread How to improve the accuracy of Form Recognizer? to draw the keys and values and extract the values you want from the JSON result by their boundingBox values.
Yes, what I said means you need to design an algorithm to classify these keys and values by their boundingBox geometry values.
For example, you can try to draw several horizontal or vertical lines linking the upper-left points of keys and values, to find the geometric pattern for classifying these form cells.
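A rough sketch of that idea in R, assuming you have already pulled each recognized token out of the JSON into a data frame with its text and the x/y of its boundingBox's upper-left corner (the column names and numbers here are made up):

# One row per recognized token: text plus the upper-left corner of its boundingBox
tokens <- data.frame(
  text = c("1001", "12", "34", "56", "78", "90",
           "1002", "11", "22", "33", "44", "55"),
  x    = rep(c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5), times = 2),
  y    = rep(c(1.0, 2.0), each = 6)
)

# Tokens whose y coordinates are (roughly) equal belong to the same form row
tokens$row_id <- round(tokens$y, 1)

pairs <- lapply(split(tokens, tokens$row_id), function(row) {
  row <- row[order(row$x), ]                     # left to right within the row
  data.frame(key    = row$text[1],               # leftmost token is the key
             values = I(list(row$text[-1])))     # the 5 values to its right
})
do.call(rbind, pairs)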

Optimize contains query of numbers to exact match query

I am looking to optimize my contains query. I have a pipe-separated list of numbers in one of my Aerospike bins (columns), something like 234|235|236|
These numbers may vary from 1 to 2^14
Currently I am applying a contains query to find 235| in this column, but it is getting slow. Is there any math or strategy I can apply to convert this contains query to an exact match?
TIA,
Karan
Did you try using a List type for this bin? You can then build a secondary index on the List values (indextype = LIST, type = NUMERIC) and get all records that match the value of interest in the list using a secondary index query.

association from term document matrix

Is there a way to find associated words from a term-document matrix, other than using findAssocs() in R? My objective is to find all words with a chosen frequency (let's say all words with a frequency of more than 200) and then find the words that appear together with these words.
findAssocs() has a terms argument; feed it the words which have a frequency greater than 200. You can find those words by using findFreqTerms().
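A short sketch with the tm package (the toy corpus and the correlation limit are just placeholders):

library(tm)

# Replace this toy corpus with your own documents / term-document matrix
docs <- VCorpus(VectorSource(c("apple banana apple",
                               "banana cherry",
                               "apple cherry banana")))
tdm <- TermDocumentMatrix(docs)

freq_terms <- findFreqTerms(tdm, lowfreq = 2)    # on your real matrix, use lowfreq = 200
assocs     <- findAssocs(tdm, terms = freq_terms, corlimit = 0.3)
assocs     # a named list: for each frequent term, the terms correlated above corlimit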

What are the pre-processing requirements on cosine similarity?

The input to cosine similarity is two vectors representing the two pieces of data I want to compare. Is there any requirement on the semantics of the vectors? Can they simply be the byte representations of each file, with the frequency of each byte computed? Does that make sense? Or should the file be vectorized so that each dimension is not a raw piece of data from the file but some derived feature, such as the frequency of each term (for text files) or a tf-idf encoding? To put it another way: for cosine similarity to be "correct", does it require a complex pre-processing step, or can I give it as input integer values that represent each byte of my data, or just the frequency of each byte, without text in mind?
The "semantics" of the data is critical. For example, say you are comparing English text documents. For large documents, the frequency of occurence of the various letters will be roughly the same, so if the elements of your vector represent the counts of letters, you will have trouble distinguishing documents. If the elements of your vector represent the counts of words, you will get better results. If the elements of your vector represent the counts of "stemmed" words, even better. Etc.
Cosine similarity is a "dumb" statistical measure - it is up to you to give it something meaningful to compare.
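A small illustration in plain R of how much the choice of representation matters (toy strings, hand-rolled cosine):

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

d1 <- "the cat sat on the mat"
d2 <- "a recession hit the stock market hard"

# Word-count vectors over a shared vocabulary: the documents share almost nothing
vocab <- union(strsplit(d1, " ")[[1]], strsplit(d2, " ")[[1]])
w1 <- as.numeric(table(factor(strsplit(d1, " ")[[1]], levels = vocab)))
w2 <- as.numeric(table(factor(strsplit(d2, " ")[[1]], levels = vocab)))
cosine(w1, w2)    # low similarity

# Letter-count vectors: English letter frequencies look alike, so the similarity is inflated
letter_counts <- function(s) as.numeric(table(factor(strsplit(gsub(" ", "", s), "")[[1]], levels = letters)))
cosine(letter_counts(d1), letter_counts(d2))    # much higher similarity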
