Partial specification in Information Retrieval - information-retrieval

Hello, I got an assignment on Information Retrieval and I could not figure out how to create that partial specification, i.e. the values of the words, like here: http://nlp.stanford.edu/IR-book/html/htmledition/finite-automata-and-language-models-1.html
the = 0.2
a = 0.1
frog = 0.01 ... and so on. I would be thankful if someone could explain how to calculate these values.
Learn about Language models!
a) Explain the idea!
b) Consider the following document collection:
D1: Today is sunny. Sunny Berlin! To be or not to be.
D2: She is in Berlin today. She is a sunny girl. Berlin is always exciting!
Calculate the corresponding Unigram Language Model for each document! Assume
the stop probability to be fixed across models (and equal to 0.2). Use these models
to rank the documents given the query "sunny Berlin"!

The values of those words are not calculated there on the page. They are obtained from statistics or from the definition of the model.
For example, if you look at the picture below, there are two different models with different probabilities for each word. As the designer of your model, you will need to define the probabilities yourself.
If it is not clear what a language model is, here is a simple example:
Imagine that people living in London have one language model M1 and people living in NY have another language model M2.
Based on some statistics, we know that people in London use the word "sunny" twice as often as people in NY (for whatever reason), so in M1 the probability of "sunny" will be 0.04 and in M2 "sunny" = 0.02. Referring to other texts (TV, magazines and so on), we can define with what probability people in London (M1) and NY (M2) use other words, and we create a table like the one shown above.
Now we have a sentence "She is a sunny girl" and we don't know whether it comes from a person in London or in NY.
Referring to the table, we can guess it is more likely from a Londoner (M1), because they use this word more!
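For part (b) of the assignment, here is a minimal sketch in base R of how such values can be estimated with maximum likelihood (term count divided by document length). The fixed stop probability of 0.2 is a constant factor shared by all documents, so it does not change the ranking:
# Hedged sketch (base R): maximum-likelihood unigram models, P(w | M_d) = count(w in d) / length(d)
d1 <- c("today", "is", "sunny", "sunny", "berlin", "to", "be", "or", "not", "to", "be")
d2 <- c("she", "is", "in", "berlin", "today", "she", "is", "a", "sunny",
        "girl", "berlin", "is", "always", "exciting")
unigram_lm <- function(tokens) {
  tab <- table(tokens)
  setNames(as.numeric(tab) / length(tokens), names(tab))
}
# Query likelihood: product of P(w | M_d) over the query terms, multiplied by
# the fixed stop probability (0.2), which is the same for every document
score_query <- function(query, model, stop_prob = 0.2) {
  p <- prod(sapply(query, function(w) if (w %in% names(model)) model[[w]] else 0))
  p * stop_prob
}
m1 <- unigram_lm(d1)
m2 <- unigram_lm(d2)
c(D1 = score_query(c("sunny", "berlin"), m1),
  D2 = score_query(c("sunny", "berlin"), m2))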

Related

Classification Task - Feature importance/selection in R (n-gram)

I am still working on a dataset with customer satisfaction and their online paths/journeys (here as mini-sequences/bi-grams). The customers were classified according to their usability satisfaction (4 classes).
Please find some example lines of my current data frame below.
| bi-gram | satisfaction_class |
|:---- |:----:|
| "openPage1", "writeText" | 1 |
| "writeText", "writeText" | 2 |
| "openPage3", "writeText" | 4 |
| "writeText", "openPage1" | 3 |
...
Now I would like to know which bi-gram is a significant/robust predictor for certain classes. Is it possible with only bigrams or do I need the whole customer path?
I have read that one can use TF-IDF or chi-square but I could not find the perfect code. :(
Thank you so much!
Marius
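One possible starting point, as a hedged sketch in base R: a chi-square test of association between the bi-grams and the satisfaction class. The data frame below is a toy stand-in for the real one (the column names bigram and satisfaction_class are my assumptions), and a real analysis needs many observations per bi-gram for the test to be trustworthy:
# Hedged sketch (base R): chi-square test of bi-gram vs. satisfaction class on toy data
df <- data.frame(
  bigram = c("openPage1_writeText", "writeText_writeText",
             "openPage3_writeText", "writeText_openPage1"),
  satisfaction_class = factor(c(1, 2, 4, 3))
)
# Contingency table: rows = bi-gram, columns = satisfaction class
tab <- table(df$bigram, df$satisfaction_class)
# Overall test of independence; standardized residuals point to the cells
# (bi-gram/class combinations) that drive the association
fit <- chisq.test(tab)
fit
fit$stdres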

A weird word appears with topic analysis in R

I have a paragraph:
disgusting do at was horrific we have stayed please to at traveler photos ironic i did post those witnessed each every thing in pictures gave us fist free then moved us to rooms were any better we slept with clothes on entire there never once took off shoes to walk on carpet shower etc holes in wall stains on bedding curtains couch chair no working electric in lamps cords nothing could be plugged in when we called down to fix it so we no lighting except bathroom light tv toilets constantly plugged up shower drain.
It appears a little grammatically weird because I cleaned the paragraph. I use the following code to extract word frequencies.
library(tm)   # stemDocument also requires the SnowballC package to be installed
# 'example' holds the paragraph above as a character string
# create corpus
docs <- Corpus(VectorSource(example))
# stem document
docs <- tm_map(docs, stemDocument)
# create document-term matrix
dtm <- DocumentTermMatrix(docs)
# convert row names
rownames(dtm) <- "example"
# collapse matrix by summing over columns
freq <- colSums(as.matrix(dtm))
# length should be total number of terms
length(freq)
# create sort order (descending)
ord <- order(freq, decreasing = TRUE)
# list all terms in decreasing order of freq
freq[ord]
Then freq[ord] includes a term ani. I am wondering why the word ani appears here; apparently, ani is not in my paragraph. Thanks.
I just figured out the problem: the following code transforms any to ani. Does anyone know how to avoid that?
docs<-tm_map(docs,stemDocument)
It's the word "any" after having been stemmed. The (in this case faulty) logic of the underlying function, wordStem, which uses Dr. Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball, changed the y to an i.
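One way around it, sketched below: instead of calling stemDocument directly, wrap the stemmer in a custom content_transformer that leaves a list of protected words (here just "any") untouched. The stem_except name and the protect argument are my own additions, not part of tm:
library(tm)
library(SnowballC)
# Hedged sketch: stem every word except those listed in 'protect'
stem_except <- content_transformer(function(x, protect) {
  words <- unlist(strsplit(x, "\\s+"))
  stemmed <- ifelse(words %in% protect, words, wordStem(words, language = "english"))
  paste(stemmed, collapse = " ")
})
docs <- tm_map(docs, stem_except, protect = c("any"))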

Nested design in R using aov and lmer

I am trying to figure out the correct R code for this problem.
There are 10 golfers from the US and 10 golfers from the UK. Each player plays 5 rounds of golf in the UK and 5 rounds in the US. There are three specific questions:
Do golfers from the UK or US have higher scores? (is there an effect of nationality)
Do scores differ when playing in the US or UK course? (is there an effect of course.location)
Do golfers have a 'home field advantage', where the effect of playing in the US or UK depends on if the golfer comes from the US or the UK (ie, is there an interaction between nationality and course.location)?
The data are coded with the variable 'golfer' taking values 1-20, as opposed to values 1-10 within each nationality.
First off, is this a nested design? That is, am I correct in saying that golfer is nested within nationality? If so, is this the correct anova formula:
aov(score ~ nationality + course.location + nationality*course.location + Error(nationality/golfer))
What confuses me a bit is that if I were to do this using a mixed model framework, this formula makes the most sense to me:
lmer(score ~ nationality + course.location + nationality*course.location + (1|course.location:golfer))
Here, the random effect of golfer depends on whether that golfer is playing in the US or the UK.
So are these analogous? Is one wrong? Are both wrong? In the lmer framework I don't see a need to use the nesting formulation, but in the anova framework I feel like I am missing the fact that the random effect of golfer depends on course location.
Any thoughts? Thanks.
(as a bonus question, what exactly is the difference between the random effect terms (1 + course.location | golfer) and (1 | course.location:golfer) when course.location is binary?)
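For experimenting with the two specifications side by side, a hedged sketch on simulated data (20 golfers, 2 course locations, 5 rounds each; the scores are random noise, purely for illustration):
library(lme4)
set.seed(1)
d <- expand.grid(golfer = factor(1:20),
                 course.location = factor(c("US", "UK")),
                 round = 1:5)
d$nationality <- factor(ifelse(as.integer(d$golfer) <= 10, "US", "UK"))
d$score <- 72 + rnorm(nrow(d), sd = 3)
# Classical ANOVA with golfer nested in nationality as the error stratum
summary(aov(score ~ nationality * course.location +
              Error(nationality/golfer), data = d))
# Mixed models: random intercept per golfer, and the per-golfer-per-course
# variant from the question, for comparison
m1 <- lmer(score ~ nationality * course.location + (1 | golfer), data = d)
m2 <- lmer(score ~ nationality * course.location +
             (1 | course.location:golfer), data = d)
summary(m1)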

Network directed graph optimization package in R

I have used the R package lpSolve in the past, but I feel it is not perfect for my current problem.
I want to optimize the problem below.
I have nodes and links as depicted in the diagram. I start from New York and I want to ship fruits to the customer on day 4. Each node consists of 4 parts: physical location, item, site type, and time. You can say that the node name is a combination of the above 4 fields.
I can take 2 paths. My objective is to meet the customer demand and to send all fruits to the sink at minimum cost.
The transportation cost for each fruit and the time taken to travel on a lane are given by the text on the transportation route.
My New York location is the only input and it gets 50 fruits on day 1. The customer is the only output location, and in this case the customer is looking for 30 fruits on day 4.
In the current scenario the solution is to send 30 fruits along the New York, New Mexico, customer lane and 20 fruits along the New York, Arizona, customer lane. For 20 fruits we choose the New York, Arizona, customer lane because the Arizona to customer lane has a lower cost (90 USD) compared to the New Mexico to customer lane (100 USD).
To provide the input to the model, I create a sink to New York link and send 50 fruits on that lane. Direct transportation lanes from Arizona and New Mexico to the sink are very costly, and because of the high cost my optimization will avoid them as much as possible.
As of now I am building all links and nodes using SQL. I am also using SQL to populate the quantity that New York gets and the quantity that the customer wants. Then I optimize my network using IBM ILOG.
I want to replace the IBM ILOG optimization part with an R package. Which package should I use?
My constraints are:
the input quantity to each node has to be equal to the output quantity from each node;
New York gets 50 fruits on day 1;
the customer wants 30 fruits on day 4, and we cannot give more to the customer;
to make the optimization easy I create a sink to New York link, which I have shown by a dotted line.
In ILOG I can create TUPLEs and then write my optimization code. I guess I can solve this problem with the R package lpSolve too, but creating the constraints and the objective would involve writing many loops. In my actual network I have 10000+ nodes, and I was wondering if there is any R package specially designed for this purpose.
Would it be possible to provide simple code to solve this problem in R?
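A hedged sketch with the lpSolve package of the kind of minimum-cost-flow formulation involved. The costs on the New York legs, and the omission of the time dimension and the sink, are my simplifications, not taken from the diagram:
library(lpSolve)
# Hedged sketch: tiny minimum-cost flow on four arcs.
# Arc order: NY->NM, NY->AZ, NM->Customer (100 USD), AZ->Customer (90 USD)
cost <- c(10, 10, 100, 90)          # costs on the NY legs are assumed
# Flow balance: at most 50 fruits leave NY, conservation at NM and AZ,
# exactly 30 fruits arrive at the customer
A <- rbind(
  NY       = c( 1,  1,  0,  0),     # flow leaving New York
  NM       = c( 1,  0, -1,  0),     # inflow minus outflow at New Mexico
  AZ       = c( 0,  1,  0, -1),     # inflow minus outflow at Arizona
  Customer = c( 0,  0,  1,  1)      # flow arriving at the customer
)
dir <- c("<=", "=", "=", "=")
rhs <- c(50, 0, 0, 30)
sol <- lp("min", cost, A, dir, rhs)
setNames(sol$solution, c("NY_NM", "NY_AZ", "NM_Cust", "AZ_Cust"))
For a network with 10000+ nodes, the same constraint matrix would be generated programmatically from the node and link tables rather than written by hand.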

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes, etc.), which works to some extent, but unfortunately not well enough.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, Soundex might help distinguish the T and R in this case, but the names might as well be 400T and 400R); the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification. You start out with one node per group, and then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation in it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. The recursion ends with a single supernode at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
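A hedged sketch of that bottom-up grouping in base R: hierarchical clustering over pairwise Levenshtein distances, with the cutoff height as a tunable assumption rather than a derived value:
# Hedged sketch (base R): cluster product names by edit distance
product_names <- c("Canon PowerShot a20IS",
                   "NEW powershot A20 IS from Canon",
                   "Digital Camera Canon PS A20IS",
                   "Lenovo T400",
                   "Lenovo R400",
                   "New Lenovo T-400, Core 2 Duo")
d  <- adist(tolower(product_names))          # pairwise Levenshtein distances
hc <- hclust(as.dist(d), method = "average") # build the tree of "supernodes"
plot(hc, labels = product_names)             # inspect the subtrees
cutree(hc, h = 15)                           # assumed cutoff height; tune per data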
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm used to implement an index, but I have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
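A hedged base-R sketch of one simple way to realize that idea, character trigrams plus Jaccard similarity (a real trigram search would index the trigrams rather than compare pairs on the fly):
# Hedged sketch (base R): character-trigram Jaccard similarity between two names
trigrams <- function(x) {
  x <- gsub("[^a-z0-9]", "", tolower(x))
  unique(substring(x, 1:(nchar(x) - 2), 3:nchar(x)))
}
trigram_sim <- function(a, b) {
  ta <- trigrams(a); tb <- trigrams(b)
  length(intersect(ta, tb)) / length(union(ta, tb))
}
trigram_sim("Canon PowerShot a20IS", "Digital Camera Canon PS A20IS")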
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages
Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance or something like the cosine distance that compares the number of common words.
Turn an array of distance scores into a probability that a pair of records are truly about the same thing.
Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
I don't have any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
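A hedged base-R sketch of that naive token-overlap matching, scored as the fraction of query tokens found in each candidate:
# Hedged sketch (base R): share of query tokens present in each candidate name
tokenize <- function(x) unique(strsplit(tolower(x), "[^a-z0-9]+")[[1]])
query_tokens <- tokenize("Canon PowerShot A20 IS")
candidates   <- c("Canon PowerShot a20IS",
                  "NEW powershot A20 IS from Canon",
                  "Lenovo T400")
sapply(candidates, function(s)
  length(intersect(tokenize(s), query_tokens)) / length(query_tokens))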
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to come up with satisfactory results, i.e. work with words as the unit instead of characters.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length and make that a word key (see the sketch below).
When a suspect word comes up, look for words with the same or similar word key.
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
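A hedged base-R sketch of the word-key idea (first letter plus word length):
# Hedged sketch (base R): "word key" = first letter plus word length
word_key <- function(w) paste0(tolower(substr(w, 1, 1)), nchar(w))
word_key(c("PowerShot", "Powershoot", "Lenovo"))   # "p9" "p10" "l6"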
That is exactly the problem I'm working on in my spare time. What I came up with is:
Based on keywords, narrow down the scope of the search:
in this case you could have some hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for the type,
"Canon" for the company, and then you'd be left with a much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on exactly the same thing in the past. What I did was use an NLP method, a TF-IDF vectorizer, to assign weights to each word. For example in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care about and which to ignore. I had quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS, so you may consider using some kind of regex to filter such cases.
After that, you can use a numeric calculation like cosine similarity.
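A hedged base-R sketch of that pipeline, TF-IDF weighting followed by cosine similarity; the weights it produces will differ from the illustrative numbers above:
# Hedged sketch (base R): TF-IDF weights per token, then cosine similarity
product_names <- c("Canon PowerShot a20IS",
                   "NEW powershot A20 IS from Canon",
                   "Digital Camera Canon PS A20IS")
tokens <- strsplit(tolower(product_names), "[^a-z0-9]+")
vocab  <- unique(unlist(tokens))
# Term-frequency matrix: one row per name, one column per vocabulary token
tf  <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
idf <- log(length(product_names) / colSums(tf > 0))   # rarer token = higher weight
tfidf <- sweep(tf, 2, idf, "*")
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(tfidf[1, ], tfidf[2, ])   # similarity of the first two names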
