Google prediction API - Building classifier training data - bigdata

EDIT: I'm trying to classify new user reviews into a predefined set of tags. Each review can have multiple tags associated with it.
I've mapped my DB user reviews to 15 categories. The following example shows the review text and the reasoning behind the mapped categories:
USER_REVIEWS | CATEGORIES
"Best pizza ever, we really loved this place, our kids ..." | "food,family"
"The ATV tour was extreme and the nature was beautiful ..." | "active,family"
pizza:food
our kids:family
The ATV tour was extreme:active
nature was beautiful:nature
EDIT:
I tried 2 approaches to building the training data:
The first includes all categories in a single file like so:
"food","Best pizza ever, we really loved this place, our kids..."
"family","Best pizza ever, we really loved this place, our kids..."
The second approach was splitting the training data into 15 separate files, like so:
family_training_data.csv:
"true" , "Best pizza ever, we really loved this place, our kids..."
"false" , "The ATV tour was extreme and the nature was beautiful ..."
None of the above was conclusive, and tagging was missed most of the time.
Here are some questions that came up while I was experimenting:
Some of my reviews are very long (more than 300 words). Should I limit the word count in my training data files so that it matches the average review word count (80)?
Is it best to separate the data into 15 training data files with a TRUE/FALSE option (i.e., is the review text of a specific category?), or to mix all categories in one training data file?
How can I train the model to find synonyms or related keywords, so that it can tag "The motorbike ride was great" as active even though the training data only had a record for an ATV ride?
I've tried the approaches described above, without any good results.
Q: What training data format would give the best results?

I'll start with the parts I can answer with the given information. Maybe we can refine your questions from there.
Question 3: You can't train a model to recognize a new vocabulary word without supporting context. It's not just that "motorbike" is not in the training set, but that "ride" is not in the training set either, and the other words in the review do not relate to transportation. The cognitive information you seek is simply not in the data you present.
Question 2: This depends on the training method you're considering. You can give each tag its own feature column with a true/false value. This is functionally equivalent to 15 separate data files, each with a single true/false value. The one-file method gives you the chance to later add some contextual support between categories.
Question 1: The length, itself, is not particularly relevant, except that cutting out unproductive words will help focus the training -- you won't get nearly as many spurious classifications from incidental correlations. Do you have a way to reduce the size programmatically? Can you apply that to the new input you want to classify? If not, then I'm not sure it's worth the effort.
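To make that equivalence concrete, both layouts can be generated from the same labeled rows; here is a small sketch (the rows and tag names are just the examples from this question, and nothing here is a required Prediction API format):

```python
import csv
import io

# The tags and rows below are just the examples from the question.
TAGS = ["food", "family", "active", "nature"]
rows = [
    ("Best pizza ever, we really loved this place, our kids ...", {"food", "family"}),
    ("The ATV tour was extreme and the nature was beautiful ...", {"active", "nature"}),
]

# One-file layout: one true/false column per tag.
one_file = io.StringIO()
writer = csv.writer(one_file)
writer.writerow(["text"] + TAGS)
for text, tags in rows:
    writer.writerow([text] + [str(tag in tags).lower() for tag in TAGS])

# Per-tag layout: one true/false file per tag -- the same information.
per_tag_files = {}
for tag in TAGS:
    buf = io.StringIO()
    w = csv.writer(buf)
    for text, tags in rows:
        w.writerow([str(tag in tags).lower(), text])
    per_tag_files[tag] = buf.getvalue()

print(one_file.getvalue())
```

Either layout can be regenerated from the other, which is why the choice is mostly about what your training tool expects, not about information content.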
OPEN ISSUES
What empirical evidence do you have that 80% accuracy is possible with the given data? If the training data do not contain the theoretical information needed to accurately tag that data, then you have no chance to get the model you want.
Does your chosen application have enough intelligence to break the review into words? Is there any cognizance of word order or semantics -- and do you need that?

After facing similar problems, here are my insights regarding your questions:
According to the Watson Natural Language Classifier documentation, it is best to limit the length of input text to fewer than 60 words, so trimming your reviews toward your 80-word average (or below) should produce better results.
You can go either way, but separate files will produce clearer results.
Creating a synonym graph, as suggested, would be a good place to start; Watson is aimed at more complex cognitive solutions.
Some other helpful tips from the Watson guidelines:
Limit the length of input text to fewer than 60 words.
Limit the number of classes to several hundred. Support for larger numbers of classes might be included in later versions of the service.
When each text record has only one class, make sure that each class is matched with at least 5 - 10 records to provide enough training on that class.
It can be difficult to decide whether to include multiple classes for a text. Two common reasons drive multiple classes:
When the text is vague, identifying a single class is not always clear.
When experts interpret the text in different ways, multiple classes support those interpretations.
However, if many texts in your training data include multiple classes, or if some texts have more than three classes, you might need to refine the classes. For example, review whether the classes are hierarchical. If they are, include the leaf node as the class.
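If you decide to trim long reviews toward that 60-word guideline, the preprocessing could look like the following sketch (the CSV layout mirrors the one-file-per-row format above; the sample row is made up):

```python
import csv
import io

MAX_WORDS = 60  # the Watson guideline quoted above

def truncate(text, limit=MAX_WORDS):
    """Keep only the first `limit` words of a review."""
    return " ".join(text.split()[:limit])

# An invented over-long review row in (category, text) form.
rows = [("food", "Best pizza ever, " + "really great tasty food " * 30)]

out = io.StringIO()
writer = csv.writer(out)
for category, text in rows:
    writer.writerow([category, truncate(text)])

first_row = next(csv.reader(io.StringIO(out.getvalue())))
print(len(first_row[1].split()))  # → 60
```

The same `truncate` step must be applied to new reviews at classification time, so that training and prediction inputs have the same shape.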

Related

Customised tokens annotation in R

Currently I'm working on an NLP project. It's totally new for me, which is why I'm really struggling with the implementation of NLP techniques in R.
Generally speaking, I need to extract machine entities from descriptions. I have a dictionary of machines which contains 2 columns: Manufacturer and Model.
To train the extraction model, I have to have an annotated corpus. That's where I'm stuck. How do I annotate machines in text? Here is an example of the text:
The Skyjack 3219E electric scissor lift is a self-propelled device powered by 4 x 6 V batteries. The machine is easy to charge, just plug it into the mains. This unit can be used in construction, manufacturing and maintenance operations as a working installation on any flat paved surface. You can use it both indoors and outdoors. Thanks to its non-marking tyres, the machine does not leave any visible tracks on floors. The machine can be driven at full height and is very easy to operate. The S3219E has a 250 kg platform payload capacity. It can handle two people when operating indoors and one outdoors. Discover our trainings via Heli Safety Academy.
Skyjack 3219E - this is a machine which has to be identified and tagged.
I want to have results similar to POS tagging, but instead of nouns and verbs: manufacturer and model. All the other words might be tagged as irrelevant.
Manual annotation is very expensive and not an option as usually descriptions are really long and messy.
Is there a way to adapt POS tagger and use a customised dictionary for tagging? Any help is appreciated!
Edit: (At the end of writing this I realized you plan on using R; all my algorithmic suggestions are based on Python implementations, but I hope you can still get some ideas from the answer.)
In general this is considered an NER (named entity recognition) problem. I am doing work on a similar problem at my job.
Is there any general structure to the text?
For example, does the entity name generally occur in the first sentence? This may be a way to simplify a heuristic search, or a search based on a dictionary (of known products, for instance).
Is annotation that prohibitive?
A week's worth of tagging could be all you need, given that you essentially have only one label that you care about. I was working on discovering brand names in unstructured sentences; we did quite well with a week's worth of annotation and training a CRF (Conditional Random Fields) model. See pycrfsuite, a good Python wrapper of a fast C++ implementation of CRF.
[EDIT]
For annotation I used a variant of the BIO tagging scheme.
This is what a typical sentence like "We would love a victoria's secret in our neighborhood" would look like when tagged:
We O
would O
love O
a O
victoria B-ENT
's I-ENT
secret I-ENT
O represents words that are Outside the entities I cared about (brands), B represents the Beginning of an entity phrase, and I represents the Inside of an entity phrase.
In your case you seem to want to separate the manufacturer and the model item. So you can use tags like B-MAN, I-MAN, B-MOD, I-MOD. Here is an example of annotating:
The O
Skyjack B-MAN
3219E B-MOD
electric O
scissor O
lift O
etc..
Of course, a manufacturer or a model can have multiple words in its name, so use the I-MOD, I-MAN tags to capture that (see the example from my work above).
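The dictionary-driven annotation can be bootstrapped mechanically. Here is a minimal sketch that projects known manufacturer/model phrases onto token sequences as BIO tags (the dictionary entries are assumptions, and a real pipeline would handle tokenization, casing, and overlaps more carefully):

```python
# Assumed dictionary entries, stored as lowercase token tuples.
MANUFACTURERS = {("skyjack",)}
MODELS = {("3219e",), ("s3219e",)}

def bio_tag(tokens):
    """Tag tokens with B-MAN/I-MAN, B-MOD/I-MOD, or O via dictionary matching."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        for label, phrases in (("MAN", MANUFACTURERS), ("MOD", MODELS)):
            for phrase in phrases:
                n = len(phrase)
                if tuple(lowered[i:i + n]) == phrase:
                    tags[i] = "B-" + label          # beginning of entity phrase
                    for j in range(i + 1, i + n):
                        tags[j] = "I-" + label      # inside of entity phrase
                    i += n
                    matched = True
                    break
            if matched:
                break
        if not matched:
            i += 1
    return tags

sentence = "The Skyjack 3219E electric scissor lift".split()
print(list(zip(sentence, bio_tag(sentence))))
# → [('The', 'O'), ('Skyjack', 'B-MAN'), ('3219E', 'B-MOD'),
#    ('electric', 'O'), ('scissor', 'O'), ('lift', 'O')]
```

Tags produced this way can seed a training set that you then correct by hand, which is much cheaper than annotating from scratch.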
See this link ( ipython notebook) for a full example of how tagged sequences look for me. I based my work on this.
Build a big dictionary
We scraped the internet, used our own data, and got databases from partners, and built a huge dictionary that we used as features in our CRF and for general searches. See ahocorasick for a fast trie-based keyword search in Python.
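As a toy illustration of trie-based keyword search (a simplified stand-in: the real Aho-Corasick automaton in pyahocorasick adds failure links so the scan is a single linear pass):

```python
# A simplified trie-based multi-keyword scanner, for illustration only.
def build_trie(keywords):
    root = {}
    for word in keywords:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node["$"] = word  # terminal marker stores the matched keyword
    return root

def find_keywords(trie, text):
    """Return (start_index, keyword) for every dictionary hit in text."""
    text = text.lower()
    hits = []
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$" in node:
                hits.append((start, node["$"]))
    return hits

trie = build_trie(["Skyjack", "Skyjack 3219E", "Genie"])
print(find_keywords(trie, "The Skyjack 3219E electric scissor lift"))
# → [(4, 'Skyjack'), (4, 'Skyjack 3219E')]
```

With a dictionary of thousands of entries, the matched spans make good features for the CRF in addition to serving as a direct lookup.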
Hope some of this helps!

Categorical Clustering of Users Reading Habits

I have a data set with a set of users and a history of documents they have read, all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Does anyone have advice on how to do this, especially in R (like specific packages)? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and the choice of distance metric, so no single result is trustworthy; most results you get out are outright useless. It's about as reliable as a proof by example.
They should only be used for explorative analysis, i.e. to find patterns that you then need to study with other methods.
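That said, a common first exploratory step is to aggregate each user's clicks into a per-category profile vector; once users are vectors of proportions, numeric clustering (or even a naive dominant-category assignment) becomes applicable. A stdlib-only sketch with made-up click data:

```python
from collections import Counter

# Hypothetical click logs: user -> list of the categorical attribute values
# of the documents they read (7 possible values, as in the question).
clicks = {
    "alice": ["politics", "politics", "sports", "politics"],
    "bob": ["tech", "tech", "science"],
}
CATEGORIES = ["politics", "sports", "tech", "science", "arts", "travel", "finance"]

def profile(history):
    """Turn a categorical reading history into a vector of proportions."""
    counts = Counter(history)
    total = len(history)
    return [counts[c] / total for c in CATEGORIES]

profiles = {user: profile(h) for user, h in clicks.items()}

# A trivial "clustering": assign each user to their dominant category.
dominant = {u: CATEGORIES[max(range(len(CATEGORIES)), key=v.__getitem__)]
            for u, v in profiles.items()}
print(dominant)  # → {'alice': 'politics', 'bob': 'tech'}
```

Whether the resulting groups mean anything still has to be checked with other methods, per the caveat above.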

Why does Hyperloglog work and which real-world problems?

I know how Hyperloglog works but I want to understand in which real-world situations it really applies i.e. makes sense to use Hyperloglog and why? If you've used it in solving any real-world problems, please share. What I am looking for is, given the Hyperloglog's standard error, in which real-world applications is it really used today and why does it work?
("Applications for cardinality estimation": too broad? I would like to add this simply as a comment, but it won't fit.)
I would suggest you turn to the numerous academic papers on the subject; academic papers usually contain some information on "prior research on the subject" as well as "applications for which the subject has been used". You could start by traversing the references of interest cited by the following article:
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, by P. Flajolet et al.
... This problem has received a great deal of attention over the past two decades, finding an ever growing number of applications in networking and traffic monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service), and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various time slices is an important indication for detecting attacks and monitoring traffic, as it records the number of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of different connections, and though they may well pass unnoticed amongst a huge traffic, their activity becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts—natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators.
At my work, HyperLogLog is used to estimate the number of unique users or unique devices hitting different code paths in online services. For example, how many users are affected by each type of service error? How many users use each feature? There are MANY interesting questions HyperLogLog allows us to answer.
Stack Overflow might use HyperLogLog to count the views of each question. Stack Overflow wants to make sure that one user can only contribute one view per item, so every view is unique.
It could be implemented with a set: every question would have a set that stores the usernames:
question#ID121e={username1,username2...}
For each question, creating a set would take up some space, and consider how many questions have been asked on this platform: the total amount of space needed to keep track of every view per user would be huge. But HyperLogLog uses about 12 kB of memory per key no matter how many usernames are added, even 10 million views.
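To make the trade-off concrete, here is a minimal, illustrative HyperLogLog (1024 registers here; production implementations such as the one Redis uses keep more registers, about 12 kB of state, plus extra bias corrections):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal illustrative HyperLogLog; real implementations add more corrections."""

    def __init__(self, p=10):
        self.p = p                # 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit deterministic hash derived from SHA-1.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)    # small-range correction
        return est

hll = HyperLogLog()
for i in range(10000):
    hll.add("user%d" % i)
print(round(hll.count()))  # close to 10000, within a few percent
```

The registers never grow with the number of distinct items added, which is exactly why the per-question memory stays bounded regardless of view counts.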

How to design a physics problem Database?

I have both problems and solutions to over twenty years of physics PhD qualifying exams that I would like to make more accessible, searchable, and useful.
The problems on the Quals are organized into several different categories. The first category is Undergraduate or Graduate problems. (The first day of the exam is Undergraduate, the second day is Graduate). Within those categories there are several subjects that are tested: Mechanics, Electricity & Magnetism, Statistical Mechanics, Quantum Mechanics, Mathematical Methods, and Miscellaneous. Other identifying features: Year, Season, and Problem number.
I'm specifically interested in designing a web-based database system that can store the problem and solution and all the identifying pieces of information in some way so that the following types of actions could be done.
Search and return all Electricity & Magnetism problems.
Search and return all graduate Statistical Mechanics problems.
Create a random qualifying exam — meaning a new 20-question test randomly picking 2 Undergrad mechanics problems, 2 Undergrad E&M problems, etc. from past qualifying exams (over some restricted date range).
Have the option to hide or display the solutions on results.
Any suggestions or comments on how best to do this project would be greatly appreciated!
I've written up some more details here if you're interested.
For your situation, it seems that implementing the interface is the more important part, rather than the data storage. To store the data, you can use a database table or tags. Each record in the database (or tag) should have the following properties:
Year
Season
Undergraduate or Graduate
Subject: CM, EM, QM, SM, Mathematical Methods, and Miscellaneous
Problem number (is it necessary?)
Question
Answer
Search and return all Electricity & Magnetism problems.
Directly query the database and you will get an array; then display some or all of the questions.
Create a random qualifying exam — meaning a new 20-question test randomly picking 2 Undergrad mechanics problems, 2 Undergrad E&M problems, etc. from past qualifying exams (over some restricted date range).
To generate a random exam, you should first outline the number of questions for each category and the years they are drawn from. For example, if you want 2 undergraduate E&M questions: query the database for all undergraduate E&M questions, then randomly shuffle the question array. Finally, select the first two and display them to the student. Continue with the other categories and you will get a complete random exam paper.
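The query-shuffle-select step can even be pushed into the database itself; a sketch with SQLite (the schema, column names, and sample rows are all illustrative):

```python
import sqlite3

# Illustrative schema and data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE problems (
    year INTEGER, season TEXT, level TEXT, subject TEXT,
    number INTEGER, question TEXT, answer TEXT)""")
rows = [(2000 + y, "Fall", "Undergrad", subj, n, "Q%d" % n, "A%d" % n)
        for y in range(10) for subj in ("EM", "CM") for n in (1, 2, 3)]
conn.executemany("INSERT INTO problems VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Pick 2 random undergraduate E&M problems from a restricted date range.
exam = conn.execute(
    """SELECT year, subject, question FROM problems
       WHERE level = 'Undergrad' AND subject = 'EM'
         AND year BETWEEN 2003 AND 2008
       ORDER BY RANDOM() LIMIT 2""").fetchall()
print(exam)  # two random (year, 'EM', question) rows
```

Repeating the query with different `subject`/`level` filters and concatenating the results yields the full 20-question exam.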
Have the option to hide or display the solutions on results.
It is your job to determine whether you want the students to see the answers; it should be controlled by a single variable.
Are "Electricity & Magnetism" and "Statistical Mechanics" mutually exclusive categorizations along the same dimension? Are there multiple dimensions of categories you want to search by?
If the answer is yes to both, then I would suggest you look into multidimensional data modeling. As a physicist, you've got a leg up on most people when it comes to evaluating the number of dimensions to the problem. Analyzing reality in a multidimensional way is one of the things physicists do.
Sometimes obtaining and learning an MDDB tool is overkill. Once you've looked into multidimensional modeling, you may decide you like the modeling concept, but you still want to implement using relational databases that use the SQL interface.
In that case, the next thing to look into is star schema design. Star schema is quite different from normalization as a design principle, and it doesn't offer the same advantages and limitations. But it's worth knowing in the case where the problem is really a multidimensional one.
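As a toy illustration of what a star schema could look like here, with one fact table referencing small dimension tables (all names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_subject (subject_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_exam (exam_id INTEGER PRIMARY KEY,
                       year INTEGER, season TEXT, level TEXT);
-- The fact table references each dimension by key.
CREATE TABLE fact_problem (problem_id INTEGER PRIMARY KEY,
                           subject_id INTEGER REFERENCES dim_subject,
                           exam_id INTEGER REFERENCES dim_exam,
                           number INTEGER, question TEXT, answer TEXT);
INSERT INTO dim_subject VALUES (1, 'Electricity & Magnetism');
INSERT INTO dim_exam VALUES (1, 2005, 'Fall', 'Graduate');
INSERT INTO fact_problem VALUES (1, 1, 1, 3, 'Find the field ...', '...');
""")

# "All graduate E&M problems" becomes a join across the star.
found = conn.execute("""
    SELECT f.question FROM fact_problem f
    JOIN dim_subject s ON s.subject_id = f.subject_id
    JOIN dim_exam e ON e.exam_id = f.exam_id
    WHERE s.name = 'Electricity & Magnetism' AND e.level = 'Graduate'
""").fetchall()
print(found)  # → [('Find the field ...',)]
```

Each search dimension (subject, level, year/season) gets its own dimension table, which keeps the searches in the question straightforward joins.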

Is information a subset of data?

I apologize as I don't know whether this is more of a math question that belongs on mathoverflow or if it's a computer science question that belongs here.
That said, I believe I understand the fundamental difference between data, information, and knowledge. My understanding is that information carries both data and meaning. One thing that I'm not clear on is whether information is data. Is information considered a special kind of data, or is it something completely different?
The words data, information, and knowledge are value-based concepts used to categorize, in a subjective fashion, the general "conciseness" and "usefulness" of a particular information set.
These words have no precise meaning because they are relative to the underlying purpose and methodology of information processing. In the field of information theory they have no meaning at all, because all three are the same thing: a collection of "information" (in the information-theoretic sense).
Yet they are useful, in context, to summarize the general nature of an information set as loosely explained below.
Information is obtained (or sometimes induced) from data, but it can be richer, as well as cleaner (whereby some values have been corrected) and "simpler" (whereby some irrelevant data has been removed). So in the set-theory sense, information is not a subset of data, but a separate set [which typically intersects somewhat with the data, but can also have elements of its own].
Knowledge (sometimes called insight) is yet another level up; it is based on information but, again, is not a [set-theory] subset of information. Indeed, knowledge typically doesn't refer directly to information elements, but rather tells a "meta story" about the information / data.
The unfounded idea that, along the Data -> Information -> Knowledge chain, the higher levels are subsets of the lower ones probably stems from the fact that there is [typically] a reduction in the volume of [IT-sense] information. But qualitatively this info is different, hence no real [set-theory] subset relationship.
Example:
Raw stock exchange data from Wall Street is ... Data
A "sea of data"! Someone has a hard time finding what he/she needs directly from this data. This data may need to be normalized. For example, the price info may sometimes be expressed as a text string with 1/32nd-of-a-dollar precision; in other cases prices may come as a true binary integer with 1/8th-of-a-dollar precision. Also, the field which indicates, say, the buyer ID or seller ID may include typos, and hence point to the wrong seller/buyer, etc.
A spreadsheet made from the above is ... Information
Various processes were applied to the data:
-cleaning / correcting various values
-cross-referencing (for example, looking up associated codes, such as adding a column to display the actual name of the individual/company next to the Buyer ID column)
-merging: when duplicate records pertaining to the same event (but, say, from different sources) are used to corroborate each other, and are combined into one single record
-aggregating: for example, making the sum of all transaction values for a given stock (rather than showing all the individual transactions)
All this (and then some) turned the data into Information, i.e. a body of [IT sense] Information that is easily usable, where one can quickly find some "data", such as say the Opening and Closing rate for the IBM stock on June 8th 2009.
Note that while it is more convenient to use, in part more exact/precise, and also boiled down, there is no real [IT-sense] information in there which couldn't be located or computed from the original by relatively simple (if painstaking) processes.
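The normalization step described above (prices as 1/32nd-dollar text strings vs. 1/8th-dollar integers) plus a per-stock aggregation can be sketched as follows; the formats and trades are invented for illustration:

```python
from fractions import Fraction

def normalize_price(raw):
    """Convert a raw price quote to a decimal dollar value.

    Two invented source formats, echoing the example above:
      - text like '101 17/32' (dollars plus a fractional part)
      - an integer count of 1/8 dollars, e.g. 812 -> 101.5
    """
    if isinstance(raw, int):
        return float(Fraction(raw, 8))
    dollars, _, frac = raw.partition(" ")
    value = Fraction(int(dollars))
    if frac:
        num, den = frac.split("/")
        value += Fraction(int(num), int(den))
    return float(value)

# Aggregation: total traded value per stock, as in the spreadsheet example.
trades = [("IBM", "101 17/32", 100), ("IBM", 812, 50), ("XYZ", "20", 10)]
totals = {}
for ticker, price, qty in trades:
    totals[ticker] = totals.get(ticker, 0.0) + normalize_price(price) * qty

print(totals)
```

The totals are more usable than the raw quotes, yet everything in them is recomputable from the original data, which is the point being made above.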
A financial analyst's report may contain ... knowledge
For example, if the report indicates [bogus example] that whenever the price of oil goes past a certain threshold, the value of gold starts declining, but then quickly spikes again around the time the prices of coffee and tea stabilize, this particular insight constitutes knowledge. This knowledge may have been hidden in the data all along, but only became apparent when one applied some fancy statistical analysis, and/or required the help of a human expert to find or confirm such patterns.
By the way, in the information-theory sense of the word, "data", "information" and "knowledge" all contain [IT-sense] information.
One could possibly get on the slippery slope of stating that "as we go up the chain, the entropy decreases", but that is only loosely true, because:
entropy decrease is not directly or systematically tied to "usefulness for humans" (a typical example is that a zipped text file has less entropy, yet is no fun to read)
there is effectively a loss of information (in addition to the entropy loss) (for example, when data is aggregated, the [IT-sense] information about individual records gets lost)
there is, particularly in the case of Information -> Knowledge, a change in the level of abstraction
A final point (if I haven't confused everybody yet...) is the idea that the data->info->knowledge chain is effectively relative to the intended use/purpose of the [IT-sense] Information.
ewernli, in a comment below, provides the example of the spell checker: when the focus is on English orthography, the most insightful paper from a Wall Street genius is merely a string of words, effectively "raw data", some of it in need of improvement (along the orthography purpose chain).
Similarly, a linguist using thousands of newspaper articles, which typically (we can hope...) contain at least some insight/knowledge (in the general sense), may just consider these articles raw data, which will help him/her automatically create a French-German lexicon (this will be information); and as he works on the project, he may discover a systematic semantic shift in the use of common words between the two languages, and hence gain insight into the distinct cultures.
Define information and data first, very carefully.
What is information and what is data is very dependent on context. An extreme example is a picture of you at a party which you email. For you it's information, but for the ISP it's just data to be passed on.
Sometimes just adding the right context changes data to information.
So, to answer your question: no, information is not a subset of data. It could be at least any of the following:
A superset, when you add context
A subset, in the needle-in-a-haystack case
A function of the data, e.g. in a digest
There are probably more situations.
This is how I see it...
Data is dirty and raw. You'll probably have too much of it.
... Jason ... 27 ... Denton ...
Information is the data you need, organised and meaningful.
Jason.age=27
Jason.city=Denton
Knowledge is why there are wikis, blogs: to keep track of insights and experiences. Note that these are human (and community) attributes. Except for maybe a weird science project, no computer is on Facebook telling people what it believes in.
information is an enhancement of data:
data is inert
information is actionable
note that information without data is merely an opinion ;-)
Information could be data if you had some way of representing the additional content that makes it information. A program that tries to 'understand' written text might transform the input text into a format that allows for more complex processing of the meaning of that text. This transformed format is a kind of data that represents information, when understood in the context of the overall processing system. From outside the system it appears as data, whereas inside the system it is the information that is being understood.
