I am working on a dataset of restaurant reviews and wish to filter out the reviews about hygiene specifically.
I figured out that I can come up with a bunch of words that are hygiene-related (clean, dirty, soap, germs …) and match them against the reviews. Is there a specific library that I can use for that? Or perhaps a more sophisticated method to do this? Thank you so much :)
I have tried LDA, which gave me topic models, but I did not find any hygiene-related topics. Perhaps they are not salient enough.
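For reference, the word-list matching you describe needs no special library; a minimal sketch in plain Python (seed terms beyond the ones you listed are my own guesses):

```python
import re

# Hypothetical seed list of hygiene-related terms; extend as needed
HYGIENE_TERMS = {"clean", "dirty", "soap", "germs", "filthy", "sanitary", "hygiene"}

def mentions_hygiene(review: str) -> bool:
    # Lowercase and keep letter runs only, so "Dirty!" still matches "dirty"
    tokens = set(re.findall(r"[a-z]+", review.lower()))
    return not tokens.isdisjoint(HYGIENE_TERMS)

reviews = [
    "The tables were dirty and the restroom had no soap.",
    "Great pasta, friendly staff.",
]
print([r for r in reviews if mentions_hygiene(r)])  # only the first review matches
```

A more sophisticated variant of the same idea is to expand the seed list with synonyms (e.g. from WordNet) before matching.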
TL;DR: I'm currently creating a cross-platform mobile news aggregator, which will identify news articles from different publishers that are about the same topic, e.g. a celebrity passing away.
I believe I found an appropriate paper that can guide me through the steps: 'Document Clustering with grouping and chaining algorithms'.
(https://www.aclweb.org/anthology/I05-1025.pdf)
However, many of the steps are confusing me, such as:
1) Document clustering
2) Grouping and chaining algorithms
3) Understanding equations such as the one below that I'll need to compute.
Any help on the matter, or a brief description of the steps would be greatly appreciated.
Thanks for the help.
I'm also interested in any experts in this field, and would love to use your knowledge as qualitative evidence for my project. If you'd be up for it please DM, or drop a comment. Thanks again!
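As a rough illustration of step 1 (document clustering): the sketch below groups articles by TF-IDF cosine similarity. This is a generic approach under my own assumptions, not the specific grouping-and-chaining algorithm from the paper, and the headlines and threshold are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headlines = [
    "Famous actor passes away at 67",
    "Hollywood mourns death of famous actor",
    "Central bank raises interest rates",
]

# Represent each article as a TF-IDF vector and compare all pairs
X = TfidfVectorizer(stop_words="english").fit_transform(headlines)
sim = cosine_similarity(X)

# Greedy grouping: same cluster if similarity exceeds an arbitrary threshold
THRESHOLD = 0.2
clusters, assigned = [], [False] * len(headlines)
for i in range(len(headlines)):
    if assigned[i]:
        continue
    group, assigned[i] = [i], True
    for j in range(i + 1, len(headlines)):
        if not assigned[j] and sim[i, j] > THRESHOLD:
            group.append(j)
            assigned[j] = True
    clusters.append(group)

print(clusters)  # the two actor stories should land in one cluster
```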
I have a set of 20'000 words and simple phrases. I need to pick each word and define its general concept, or category.
So if I take "hockey" it should fall into a large "Sports" category. If it's "Barack Obama" then it's "Politics". Here is a sample from my word list:
israel
illness
face
experts
throat
tory
moments
numerous
All the weird stuff can fall into "General" category.
That's my problem. What follows are my thoughts, which you could probably ignore, because I have no good clue how to deal with the problem.
Probably I am looking for some kind of open dictionary or API that can define the general concept of a word. I was thinking of taking a simple dictionary and running every word through it, parsing its subject labels (such as "Economics"), but not all words have one.
I could point you to http://dbpedia.org/. It's an ontology of the data from many Wikipedia infoboxes, and it has a SPARQL endpoint for queries. I used it two years ago, but the API seems to have changed, so I can't give you an example right now. But it has pretty good documentation.
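For what it's worth, here is a minimal sketch of querying the endpoint from Python with the SPARQLWrapper package. The endpoint URL and the dct:subject category property are my assumptions about the current API, so verify them against the documentation:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
# Ask for the Wikipedia categories attached to the resource for "hockey"
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?category WHERE {
        <http://dbpedia.org/resource/Hockey> dct:subject ?category .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["category"]["value"])  # e.g. ...Category:Hockey
```

Mapping a category like "Category:Hockey" up to a coarse bucket like "Sports" would still be your job, e.g. by following skos:broader links.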
It sounds like you want to do topic modeling. The R packages quanteda, Snowball, and tm are good places to start. A resource for doing topic modeling with the mallet package is here:
http://www.matthewjockers.net/materials/dh-2014-introduction-to-text-analysis-and-topic-modeling-with-r/
The general idea of topic modeling is that your words came from documents that were themselves about a certain topic. Topic modeling checks which words occur together in the same documents, and assumes that, over many documents, those words are probably about the same topic. Hopefully this helps.
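Purely as an illustration of that idea, a minimal sketch in Python with gensim (rather than the R packages named above; the toy documents are made up):

```python
from gensim import corpora, models

# Toy documents; in practice these would be the texts your words came from
texts = [
    ["hockey", "goal", "ice", "team"],
    ["election", "senate", "policy", "vote"],
    ["hockey", "team", "season"],
    ["policy", "vote", "election"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fit a 2-topic LDA model; each topic is a weighted list of words
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # roughly a "sports" topic and a "politics" topic
```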
Besides BM25, what other ranking functions exist? Where can I find information on this topic?
BM25 is one of the term-based ranking algorithms. Nowadays there are concept-based algorithms as well.
BM25 is the state of the art in term-based information retrieval; however, there are some challenges that term-based models cannot overcome, such as relating synonyms, matching abbreviations, or recognizing homonyms.
Here are the examples:
synonym: "buy" and "purchase"
abbreviation: "Professor" and "Prof."
homonym:
bow – a long wooden stick with horse hair that is used to play certain string instruments such as the violin
bow – to bend forward at the waist in respect (e.g. "bow down")
To deal with these problems, some are using concept-based models such as this article and this article.
Concept-based models mostly use dictionaries or external terminologies to identify concepts, and each has its own representation of concepts or weighting algorithm.
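For reference, the BM25 function discussed above is usually stated as (standard textbook form, not quoted from either answer):

$$\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, avgdl is the average document length in the collection, and k_1 and b are free parameters (commonly k_1 between 1.2 and 2.0, and b = 0.75).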
Vanilla tf-idf is what is often used. If you want to learn about these things, the best place to start is this book.
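As a quick illustration of ranking with vanilla tf-idf, a minimal sketch using scikit-learn (the query and documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "how to buy a house",
    "purchase a new car",
    "history of the violin bow",
]
query = ["buy a car"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(round(float(scores[i]), 3), docs[i])
```

Note that "buy" in the query contributes nothing to the "purchase" document: exactly the synonym problem the previous answer describes.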
My teammates and I have a very challenging new project to do, and we are supposed to submit it next week. We don't have a single clue about how to do it, and really need help. We are undergraduate students, new to Information Retrieval and AI, and really need your ideas.
The project is roughly:
When an expert is cited in a document,
find an expert with an opposing
opinion & find out what he/she says
about that topic.
We are free to use any programming language, but we are not concerned with the programming. We would like help to get us started. Please give us a rough idea on how to design such a system and how to retrieve information on the internet. How should we get his opinion, then find an opposite opinion?
Simple: use Amazon's Mechanical Turk.
Without that (or an equivalent) you're in trouble. If there are no further constraints on the problem then you will need a full-blown AI, the kind that doesn't yet exist. If there are severe constraints then you might have a chance of doing this in a week.
If the expert can be in any field (medicine, politics, history, fashion, science, comic books, etc.) then there will be no single, well-organized repository of essays. You'll have to use Google to find Dr. X's opinion. Once you find Dr. X's writing (and let's pray it's text, not audio) you'll have to do some kind of natural language processing to get the thrust of it, even if you're lucky enough to find a descriptive title ("Digital Photography Is Absolutely Great"). Then you have to figure out its opposite. What's the opposite of "Neil Gaiman draws on folklore for his story ideas"? Figuring out what opinion you're looking for will be a serious problem. After that, things actually get easier: you can google for the subject and use the same magic tools to find the one you're looking for.
So what do you have a chance of solving? A search for opinions that someone else has already organised into "pro" and "con". Some online political forums are organised that way. Wikipedia cites opposing views in a special section in some of its articles. Science journals print letters of rebuttal. Look around, you might find a site even more cut-and-dried. Choose a small enough arena and you'll have a tractable problem.
EDIT: Damn, Ben Dunlap beat me to all my major points in a comment. Sigh
Sounds like an NLP problem to me. As for information about documents and citations, http://citeseerx.ist.psu.edu should be a good starting point.
For each paper, there are several citations which refer to it. At the very minimum, you have to scan the abstract of the paper and those of the citations, and run your own algorithm to figure out whether any citation takes an opposing opinion. Maybe your professor can give you hints on some approximate heuristic, but as far as I know it is a really hard problem.
I would be watching this thread for more interesting approaches.
Automatically submit a Google search request similar to "expert_name sucks", "expert_name wrong", or something like that. Find the first result that has "PhD" with a document link in the same sentence and return the link.
I think you might be blowing this up a little too big... as an undergraduate project, I would approach it a little more small scale.
Unless your specification says you must use actual internet resources, you would be better off creating your own database of custom short documents. Add metadata to each document stating the points they make about certain topics.
Next, I would create a list of citations which link to each document and add some metadata representing that expert's stance on the topic. When someone reads a document, I would augment the list of citations with lists of links to documents which have alternative views on that topic.
Basically it would consist of these tables:
Document (id, data)
DocumentPoints (documentId, topic, stance)
Citation (documentId, topic, stance)
And when someone loads up a document, the citations are pulled up as well. For each citation, you search DocumentPoints for the same topics with different stances. The most difficult part of this project would be creating the 5 or 6 documents you need to have data in your database. After that the solution is trivial.
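A minimal sketch of that lookup with sqlite3 (table and column names follow the answer; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE DocumentPoints (documentId INTEGER, topic TEXT, stance TEXT);
    CREATE TABLE Citation (documentId INTEGER, topic TEXT, stance TEXT);
""")

# Sample data: document 1 cites an expert who is "pro" on topic "gmo";
# document 2 takes the "con" side on the same topic
conn.executemany("INSERT INTO Document VALUES (?, ?)",
                 [(1, "essay citing Dr. X"), (2, "rebuttal by Dr. Y")])
conn.execute("INSERT INTO DocumentPoints VALUES (2, 'gmo', 'con')")
conn.execute("INSERT INTO Citation VALUES (1, 'gmo', 'pro')")

# For each citation in the loaded document, find documents that cover
# the same topic with a different stance
rows = conn.execute("""
    SELECT d.id, d.data
    FROM Citation c
    JOIN DocumentPoints p ON p.topic = c.topic AND p.stance <> c.stance
    JOIN Document d ON d.id = p.documentId
    WHERE c.documentId = ?
""", (1,)).fetchall()

print(rows)  # [(2, 'rebuttal by Dr. Y')]
```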
On a side note, most of these other answers are telling you to use some existing solution... don't do that unless the assignment tells you to. You'll be much better off understanding the problem and various ways to solve it (this is definitely not the only/best one) if you work through the entire problem yourself. When the teacher asks you to do something not supported by whatever product you chose to implement your solution on, you won't be able to fix it. If you had written it yourself, you could just as easily implement the new spec as well.
I'm interested in learning more about pattern recognition. I know that's somewhat of a broad field, so I'll list some specific types of problems I would like to learn to deal with:
Finding patterns in a seemingly random set of bytes.
Recognizing known shapes (such as circles and squares) in images.
Noticing movement patterns given a stream of positions (Vector3)
This is a new area of experimentation for me personally, and to be honest, I simply don't know where to start :-) I'm obviously not looking for the answers to be provided to me on a silver platter, but some search terms and/or online resources where I can start to acquaint myself with the concepts of the above problem domains would be awesome.
Thanks!
P.S. For extra credit, it would be grand if said resources provided code examples/discussion in C# :-) but they don't need to.
Hidden Markov Models are a great place to look, as well as Artificial Neural Networks.
Edit: You could take a look at NeuronDotNet, it's open source and you could poke around the code.
Edit 2: You can also take a look at ITK, it's also open source and implements a lot of these types of algorithms.
Edit 3: Here's a pretty good intro to neural nets. It covers a lot of the basics and includes source code (albeit in C++). He implemented an unsupervised learning algorithm; I think you may be looking for a supervised backpropagation algorithm to train your network.
Edit 4: Another good intro, avoids really heavy math, but provides references to a lot of that detail at the bottom, if you want to dig into it. Includes pseudo-code, good diagrams, and a lengthy description of backpropagation.
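Since the edits above point at supervised backpropagation, here is a minimal numpy sketch of a two-layer network trained on XOR (in Python rather than C#; the layer size, learning rate, and iteration count are arbitrary choices for the toy problem):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR: the classic tiny supervised-learning problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for _ in range(20000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: squared-error gradient pushed back through each layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(3))  # should approach [[0], [1], [1], [0]]
```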
This is kind of like saying "I'd like to learn more about electronics... anyone tell me where to start?" Pattern recognition is a whole field - there are hundreds, if not thousands, of books out there, and any university has at least several (probably 10 or more) courses at the grad level on this. There are numerous journals dedicated to this as well that have been publishing for decades, and conferences too.
You might start with Wikipedia:
http://en.wikipedia.org/wiki/Pattern_recognition
This is kind of an old question, but it's relevant so I figured I'd post it here :-) Stanford began offering an online Machine Learning class here - http://www.ml-class.org
OpenCV has some functions for pattern recognition in images.
You might want to look at this: http://opencv.willowgarage.com/documentation/pattern_recognition.html. (Broken link: the closest thing in the new docs is http://opencv.willowgarage.com/documentation/cpp/ml__machine_learning.html, although it is no longer what I'd call helpful documentation for a beginner - see other answers.)
However, I also recommend starting with MATLAB, because OpenCV is not intuitive to use.
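As a concrete example of the shape-recognition part (circles, as mentioned in the question), a minimal sketch with OpenCV's Python bindings; the image path is made up, and the HoughCircles parameters usually need tuning per image:

```python
import cv2
import numpy as np

img = cv2.imread("coins.png")  # hypothetical input image
if img is None:
    raise SystemExit("image not found")

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)  # smoothing reduces spurious detections

# Hough transform for circles; param1/param2 are edge/accumulator thresholds
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=50, param2=30, minRadius=5, maxRadius=100)

if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        cv2.circle(img, (x, y), r, (0, 255, 0), 2)  # outline each detection
    cv2.imwrite("coins_detected.png", img)
```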
Lots of useful links on this page on computer-vision-related pattern recognition. Some of the links seem to be broken now, but you may find it useful.
I am not an expert on this, but reading about Hidden Markov Models is a good way to start.
Beware false patterns! For any decently large data set you will find subsets that appear to have a pattern, even if it is a data set of coin flips. No good process for pattern recognition should be without statistical techniques to assess confidence that the detected patterns are real. When possible, run your algorithms on random data to see what patterns they detect. These experiments will give you a baseline for the strength of a pattern that can be found in random (a.k.a. "null") data. This kind of technique can help you assess the "false discovery rate" for your findings.
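A minimal sketch of that null-data baseline idea, using a longest-run statistic on coin flips (the observed sequence is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def longest_run(bits):
    # Length of the longest run of identical consecutive values
    best = cur = 1
    for prev, nxt in zip(bits, bits[1:]):
        cur = cur + 1 if prev == nxt else 1
        best = max(best, cur)
    return best

# The "pattern" statistic observed in your data (hypothetical sequence)
observed = longest_run([1, 1, 1, 1, 0, 1, 1, 1, 0, 0])

# Null baseline: the same statistic on pure coin flips of the same length
null = [longest_run(rng.integers(0, 2, size=10)) for _ in range(10_000)]
p_value = np.mean([n >= observed for n in null])
print(observed, p_value)  # runs this long turn out to be common in random data
```

If the statistic you find in real data is no stronger than what the null runs produce, the "pattern" is probably noise.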
Learning pattern recognition is easier in MATLAB: there are several examples, and there are ready-made functions to use. It is good for understanding the concepts and for experiments.
I would recommend starting with some MATLAB toolbox. MATLAB is an especially convenient place to start playing around with stuff like this due to its interactive console. A nice toolbox I personally used and really liked is PRTools (http://prtools.org); they have an implementation of pretty much every pattern recognition tool and also some other machine learning tools (Neural Networks, etc.). But the nice thing about MATLAB is that there are many other toolboxes you can try out as well (there is even a proprietary toolbox from MathWorks).
Once you feel comfortable enough with the different tools (and have found out which classifier is performing best for your problem), you can start thinking about implementing the machine learning in a different application.