Term relatedness algorithm - information-retrieval

For an assignment I have to suggest an algorithm to calculate the degree of relatedness between two terms given a document. I don't know where to start creating an algorithm like that. This is all in the area of Information Retrieval, and we are currently studying the binary and vector space models, etc.
If anyone could put me in the right direction at least, that would be great! Or any links that would help.

A key problem in text mining is the extraction of relations between terms. Hand-crafted lexical resources such as WordNet have limitations when it comes to special text corpora. Distributional approaches to the problem of automatic construction of thesauri from large corpora have been proposed, making use of sophisticated Natural Language Processing techniques, which makes them language-specific and computationally intensive. It is conjectured that in a number of applications it is not necessary to determine the exact nature of term relations; it is sufficient to capture and exploit the frequent co-occurrence of terms. Such an application is tag recommendation.
Collaborative tagging systems are social data repositories, in which users manage web resources by assigning to them descriptive keywords (tags). An important element of collaborative tagging systems is the tag recommender, which proposes a set of tags to a user who is posting a resource. In this talk we explore the potential of three tag sources: resource content (including metadata fields, such as the title), resource profile (the set of tags assigned to the resource by all users that tagged it) and user profile (the set of tags the user assigned to all the resources she tagged). The content-based tag set is enriched with related tags in the tag-to-tag and title-word-to-tag graphs, which capture co-occurrences of words as tags and/or title words. The resulting tag set is further enriched with tags previously used to describe the same resource (resource profile). The resource-based tag set is checked against user profile tags - a rich, but imprecise source of information about user interests. The result is a set of tags related both to the resource and user.
(And if you copy that word-for-word into your report, the prof is bound to discover that you got it from a simple Google search, like I did.)
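If you want a concrete starting point rather than a literature pointer: one common baseline is to count how often the two terms co-occur within a sliding window over the document and normalise by their individual frequencies, e.g. with pointwise mutual information (PMI). A minimal Python sketch, with the window size and the crude tokenisation as assumptions, not a prescribed solution:

import math
import re
from collections import Counter

def pmi_relatedness(text, term_a, term_b, window=10):
    # Score the relatedness of two terms in one document as the PMI of
    # their co-occurrence within a sliding window (illustrative baseline).
    tokens = re.findall(r"\w+", text.lower())
    n = len(tokens)
    term_a, term_b = term_a.lower(), term_b.lower()
    freq = Counter(tokens)
    if n == 0 or freq[term_a] == 0 or freq[term_b] == 0:
        return 0.0
    # Count positions of term_a whose surrounding window also contains term_b.
    co_occur = sum(
        1
        for i in range(n)
        if tokens[i] == term_a
        and term_b in tokens[max(0, i - window): i + window + 1]
    )
    if co_occur == 0:
        return 0.0
    p_a, p_b, p_ab = freq[term_a] / n, freq[term_b] / n, co_occur / n
    return math.log(p_ab / (p_a * p_b))

# Example: pmi_relatedness(document_text, "tag", "recommender")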

Related

Customised tokens annotation in R

Currently I'm working on an NLP project. It's totally new to me, which is why I'm really struggling with the implementation of NLP techniques in R.
Generally speaking, I need to extract machine entities from descriptions. I have a dictionary of machines which contains two columns: Manufacturer and Model.
To train the extraction model, I have to have an annotated corpus. That's where I'm stuck. How do I annotate machines in the text? Here is an example of the text:
The Skyjack 3219E electric scissor lift is a self-propelled device powered by 4 x 6 V batteries. The machine is easy to charge, just plug it into the mains. This unit can be used in construction, manufacturing and maintenance operations as a working installation on any flat paved surface. You can use it both indoors and outdoors. Thanks to its non-marking tyres, the machine does not leave any visible tracks on floors. The machine can be driven at full height and is very easy to operate. The S3219E has a 250 kg platform payload capacity. It can handle two people when operating indoors and one outdoors. Discover our trainings via Heli Safety Academy.
Skyjack 3219E - this is a machine which has to be identified and tagged.
I want to get results similar to POS tagging, but instead of nouns and verbs, manufacturer and model. All the other words might be tagged as irrelevant.
Manual annotation is very expensive and not an option as usually descriptions are really long and messy.
Is there a way to adapt a POS tagger and use a customised dictionary for tagging? Any help is appreciated!
Edit: (At the end of writing this I realized you plan on using R; all my algorithmic suggestions are based on Python implementations, but I hope you can still get some ideas from the answer.)
In general this is considered an NER (named entity recognition) problem. I am doing work on a similar problem at my job.
Is there any general structure to the text?
For example, does the entity name generally occur in the first sentence? This may be a way to simplify a heuristic search or a search based on a dictionary (of known products, for instance).
Is annotation that prohibitive?
A week's worth of tagging could be all you need, given that you essentially have only one label that you care about. I was working on discovering brand names in unstructured sentences; we did quite well with a week's worth of annotation and training a CRF (Conditional Random Fields) model. See pycrfsuite, a good Python wrapper around a fast C++ implementation of CRF.
[EDIT]
For annotation I used a variant of the BIO tagging scheme.
This is what a typical sentence like "We would love a victoria's secret in our neighborhood" would look like when tagged:
We O
would O
love O
a O
victoria B-ENT
's I-ENT
secret I-ENT
O represented words that are Outside of the entities I cared about (brands), B represented the Beginning of an entity phrase, and I represented the Inside of an entity phrase.
In your case you seem to want to separate the manufacturer and the model item. So you can use tags like B-MAN, I-MAN, B-MOD, I-MOD. Here is an example of annotating:
The O
Skyjack B-MAN
3219E B-MOD
electric O
scissor O
lift O
etc.
Of course, a manufacturer or a model can have multiple words in its name, so use the I-MOD and I-MAN tags to capture that (see the example from my work above).
See this link (an IPython notebook) for a full example of how tagged sequences look for me. I based my work on this.
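To make that concrete, here is a minimal pycrfsuite training/tagging sketch; the feature set and the toy annotated sentence are placeholders, not what I actually used:

import pycrfsuite

def word_features(sent, i):
    # Deliberately tiny, illustrative feature set for token i of a sentence.
    word = sent[i]
    return [
        "word.lower=" + word.lower(),
        "word.isdigit=%s" % word.isdigit(),
        "word.istitle=%s" % word.istitle(),
        "BOS" if i == 0 else "prev.lower=" + sent[i - 1].lower(),
    ]

# Toy annotated corpus: parallel lists of tokens and BIO labels.
train_sents = [["The", "Skyjack", "3219E", "electric", "scissor", "lift"]]
train_labels = [["O", "B-MAN", "B-MOD", "O", "O", "O"]]

trainer = pycrfsuite.Trainer(verbose=False)
for tokens, labels in zip(train_sents, train_labels):
    trainer.append([word_features(tokens, i) for i in range(len(tokens))], labels)
trainer.train("machines.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("machines.crfsuite")
test = ["The", "S3219E", "has", "a", "250", "kg", "payload"]
print(tagger.tag([word_features(test, i) for i in range(len(test))]))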
Build a big dictionary
We scraped the internet, used our own data and got databases from partners, and built a huge dictionary that we used as features in our CRF and for general searches. See pyahocorasick for a fast trie-based keyword search in Python.
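For illustration, a minimal pyahocorasick sketch for matching dictionary entries in a description; the entries and labels below are placeholders:

import ahocorasick

# Hypothetical dictionary: lowercased surface form -> label.
dictionary = {"skyjack": "MAN", "3219e": "MOD"}

automaton = ahocorasick.Automaton()
for surface, label in dictionary.items():
    # Store the keyword itself so the match span can be recovered later.
    automaton.add_word(surface, (surface, label))
automaton.make_automaton()

text = "The Skyjack 3219E electric scissor lift is a self-propelled device.".lower()
for end_index, (surface, label) in automaton.iter(text):
    start_index = end_index - len(surface) + 1
    # Note: this matches substrings; add word-boundary checks for real use.
    print(text[start_index:end_index + 1], "->", label)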
Hope some of this helps!

Practical usage for linked data

I've been reading about linked data and I think I understand the basics of publishing linked data, but I'm trying to find real-world practical (and best-practice) usage for linked data. Many books and online tutorials talk a lot about RDF and SPARQL, but not about dealing with other people's data.
My question is, if I have a project with a bunch of data that I output as RDF, what is the best way to enhance (or correctly use) other people's data?
If I create an application for animals and I want to use data from the BBC wildlife page (http://www.bbc.co.uk/nature/life/Snow_Leopard), what should I do? Crawl the BBC wildlife page for RDF and save the contents to my own triplestore, query the BBC with SPARQL (I'm not sure that this is actually possible with the BBC), or take the URI for my animal (owl:sameAs) and curl the content from the BBC website?
This also raises the question: can you programmatically add linked data? I imagine you would have to crawl the BBC wildlife page unless they provide an index of all the content.
If I wanted to add extra information such as location for these animals (http://www.geonames.org/2950159/berlin.html), again, what is considered the best approach? owl:habitat (fake predicate) Brazil, and curl the RDF for Brazil from the GeoNames site?
I imagine that linking to the original author is the best way, because your data can then be kept up to date, which, judging from these slides from a BBC presentation (http://www.slideshare.net/metade/building-linked-data-applications), is what the BBC does. But what if the author's website goes down or is too slow? And if you were to index the author's RDF, I imagine your owl:sameAs would point to a local RDF.
Here's one potential way of creating and consuming linked data.
If you are looking for an entity (i.e., a 'Resource' in Linked Data terminology) online, see if there is a Linked Data description of it. One easy place to find this is DBpedia. For Snow Leopard, one URI that you can use is http://dbpedia.org/page/Snow_leopard. As you can see from the page, there are several object and property descriptions. You can use them to create a rich information platform.
You can use SPARQL in two ways. Firstly, you can directly query a SPARQL endpoint on the web where there might be some data. BBC had one for music; I'm not sure if they do for other information. DBpedia can be queried using snorql. Secondly, you can retrieve the data you need from these endpoints and load it into your triple store using the INSERT and INSERT DATA features of SPARQL 1.1. To access the SPARQL endpoints from your triple store, you will need to use the SERVICE feature of SPARQL. The second approach protects you from being unable to execute your queries when a publicly available endpoint is down for maintenance.
To programmatically add the data to your triplestore, you can use one of the predesigned libraries. In Python, RDFlib is useful for such applications.
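As an illustration of the retrieve-and-store approach in Python, here is a rough sketch using SPARQLWrapper to query the public DBpedia endpoint and RDFlib to hold the results locally; the query and the in-memory graph are simplifications (a real setup would load into a persistent store such as Fuseki or Virtuoso):

from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph, URIRef

# 1. Ask DBpedia for facts about the snow leopard.
endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?p ?o
    WHERE { <http://dbpedia.org/resource/Snow_leopard> ?p ?o }
    LIMIT 100
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# 2. Load the results into a local RDFlib graph (URI objects only, for brevity).
g = Graph()
subject = URIRef("http://dbpedia.org/resource/Snow_leopard")
for row in results["results"]["bindings"]:
    if row["o"]["type"] == "uri":
        g.add((subject, URIRef(row["p"]["value"]), URIRef(row["o"]["value"])))
print(len(g), "triples loaded locally")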
To enrich the data with that sourced from elsewhere, there can again be two approaches. The standard way of doing it is using existing vocabularies. So, you'd have to look for the habitat predicate and just insert this statement:
dbpedia:Snow_leopard prefix:habitat geonames:Berlin .
If no appropriate ontologies are found to contain the property (which is unlikely in this case), one needs to create a new ontology.
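Programmatically, adding such a statement with RDFlib might look like the sketch below; the habitat predicate and its namespace are made up, exactly as in the example above, so substitute a real vocabulary if you find one:

from rdflib import Graph, Namespace, URIRef

g = Graph()
DBR = Namespace("http://dbpedia.org/resource/")
EX = Namespace("http://example.org/vocab/")          # placeholder vocabulary
berlin = URIRef("http://sws.geonames.org/2950159/")  # GeoNames URI for Berlin

g.add((DBR["Snow_leopard"], EX["habitat"], berlin))  # fake predicate, as above
print(g.serialize(format="turtle"))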
If you want to keep your information current, then it makes sense to periodically run your queries. Using something such as DBpedia Live is useful in this regard.

StatsD/Graphite Naming Conventions for Metrics

I'm beginning the process of instrumenting a web application, and using StatsD to gather as many relevant metrics as possible. For instance, here are a few examples of the high-level metric names I'm currently using:
http.responseTime
http.status.4xx
http.status.5xx
view.renderTime
oauth.begin.facebook
oauth.complete.facebook
oauth.time.facebook
users.active
...and there are many, many more. What I'm grappling with right now is establishing a consistent hierarchy and set of naming conventions for the various metrics, so that the current ones make sense and that there are logical buckets within which to add future metrics.
My question is twofold:
What relevant metrics are you gathering that you have found indispensable?
What naming structure are you using to categorize metrics?
This is a question that has no definitive answer but here's how we do it at Datadog (we are a hosted monitoring service so we tend to obsess over these things).
1. Which metrics are indispensable? It depends on the beholder. But at a high level: for each team, any metric that is as close to their goals as possible (which may not be the easiest to gather).
System metrics (e.g. system load, memory, etc.) are trivial to gather but seldom actionable, because it is too hard to reliably connect them to a probable cause.
On the other hand, the number of completed product tours matters to anyone tasked with making sure new users are happy from the first minute they use the product. StatsD makes this kind of stuff trivially easy to collect.
We have also found that the core set of key metrics for any team changes as the product evolves, so there is a continuous editorial process.
Which in turn means that anyone in the company needs to be able to pick and choose which metrics matter to them. No permissions asked, no friction to get to the data.
2. Naming structure The highest level of hierarchy is the product line or the process. Our web frontend is internally called dogweb so all the metrics from that component are prefixed with dogweb.. The next level of hierarchy is the sub-component, e.g. dogweb.db., dogweb.http., etc.
The last level of hierarchy is the thing being measured (e.g. renderTime or responseTime).
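As a sketch of what that hierarchy looks like at the instrumentation call sites, using the common Python statsd client (the component names below are examples, not our actual metrics):

import statsd

# The prefix encodes the top of the hierarchy: the product line / process.
metrics = statsd.StatsClient("localhost", 8125, prefix="dogweb")

# Then sub-component, then the thing being measured.
metrics.incr("http.status.4xx")            # -> dogweb.http.status.4xx
metrics.timing("http.responseTime", 123)   # -> dogweb.http.responseTime (ms)
with metrics.timer("db.renderTime"):       # -> dogweb.db.renderTime
    pass  # ... do the work being timed ...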
The unresolved issue in Graphite is the encoding of metric metadata in the metric name (and selection using *, e.g. dogweb.http.browser.*.renderTime). It's clever but can get in the way.
We ended up implementing explicit metadata in our data model, but this is not in statsd/graphite so I will leave the details out. If you want to know more, contact me directly.

Is there a hierarchical representation of the Freebase types?

For example, if some topic (e.g. Texas) is of type /location/citytown, I also see that there is a type /location/location attached to the same topic. In addition, since the topic here is the name of a city or town, it is also by default a general location, right? So, does that mean that if a topic has the type /location/citytown, then by default it also has /location/location as a type associated with it?
In summary, does Freebase have a hierarchical representation of the types in a way that lets us understand that if something is a /location/citytown, then it is also a /location/location, and so on for other cases too?
There isn't a hierarchical representation as such, but types have a /freebase/type_hints/included_types property which specifies the types which the Freebase web client will automatically include when a type is asserted. You can see these listed in the web client or fetch them with an MQL query.
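For example, the MQL query for those hints has roughly the following shape; the sketch below sends it to the Google-hosted mqlread endpoint as I remember it (availability and any API-key requirements are assumptions on my part):

import json
import urllib.parse
import urllib.request

# Ask which types are hinted as included for /location/citytown.
query = {
    "id": "/location/citytown",
    "/freebase/type_hints/included_types": [],
}
url = ("https://www.googleapis.com/freebase/v1/mqlread?query="
       + urllib.parse.quote(json.dumps(query)))
with urllib.request.urlopen(url) as response:
    print(json.loads(response.read())["result"])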
Important points to note here are that these are hints only: nothing enforces the fact that a /location/citytown must be a /location/location, and that it is only the web client which automatically adds the included types - if you are creating topics by any other means, you'll have to add the included types yourself.

Decision tables in Enterprise Architect?

I'm trying to model a business rule set in EA.
The rules are easily described in a decision table: a column is a matching condition, a row is a rule, if all the conditions are matched in a row then the rule matched. More info is available in the Drools docs, for example.
These rules are an integral part of the application, even if on a different level than the technology details (classes, database tables, etc.). So naturally I would like to add the decision table to my documentation in EA.
I found no way to do this. EA doesn't even know about a "table" or a "spreadsheet", let alone decision tables. I would be happy to simply insert my XLS as an "attachment" to the model, but I didn't find a way to do that either.
Any ideas are appreciated.
There currently seems to be no way to do this short of taking a screenshot of the decision table and pasting it into the generated report after the fact. I believe it is on Sparx Systems' road-map to implement, but no immediate time-frame has been given.
You could try submitting a feature request via their official forms; it can do nothing but add more ammunition to the request. At the very least they should notify you when it's available.
Update 1: You could always paste that screenshot into the linked document (Ctrl+Alt+D) of the parent element that contains the business rules matrix. This could then be automatically included in the auto-generated report. At least then it is still contained in the model and can be used in many places.
Update 2: Just rereading your OP: are you actually using EA's Business Rules engine, or are you just after a matrix that can be included in the reporting? If it is the latter, then you have two options.
The first is the Relationship Matrix (View -> Relationship matrix). This can be included automatically in RTF and HTML generated reports, and it also has the option to export to CSV or save as a PNG or metafile.
The second option is to shoehorn the State Machine Table (from a State Machine Diagram, right-click and select State Chart Editor - Table). Both of these options will allow you to lay out a grid-style table where you can compare your business rules.
I hope this helps
