I'm working with a big corpus in RStudio, and the next phase of our research involves detecting certain grammatical elements and their frequency in the texts. We want to measure the frequency of occurrence of things like abstract nouns or deontic modality, which includes auxiliary verbs such as ‘must’, ‘have to’, ‘may’, ‘can’, ‘should’, ‘ought to’, etc. I would like to capture their possible conjugations as well, i.e., not only 'she has to' but also 'she had to'; not only 'he can' but also 'he could'. I guess it could be done using some simple regexes such as
She ha(ve|d) to
He c(an|ould)
...right? The problem is that 1) I'm not sure whether this can be done (I guess it can be), and 2) I don't know which library I should use to do it.
I have also thought I could build a dictionary and run it against the whole corpus, but questions 1) and 2) still stand.
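To make it concrete, something like this minimal sketch is what I'm imagining, assuming the corpus were loaded as a character vector and stringr were the right library for this (the pattern list is just illustrative):

library(stringr)

# Illustrative dictionary of deontic markers; extend as needed
patterns <- c(
  have_to = "\\b(has|have|had|having) to\\b",
  can     = "\\b(can|could)\\b",
  must    = "\\bmust\\b",
  should  = "\\bshould\\b",
  ought   = "\\bought to\\b"
)

# Total count of each marker across the corpus, case-insensitively
sapply(patterns, function(p) {
  sum(str_count(corpus, regex(p, ignore_case = TRUE)))
})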
I am pretty green when it comes to R and coding in general. I've recently been working on a CS project for a linguistics course in which I'm finding the words that surround various natural-landscape words in The Lord of the Rings. For instance, I'm interested in finding the descriptive words used around words like "stream", "mountain", etc.
Anyhow, to extract all of these words from the text, I've been working off of this post. When running this command by itself, it works:
stringr::str_extract_all(text, "([^\\s]+\\s){4}stream(\\s[^\\s]+){6}")
where "stream" is the specific word I'm going after. The numbers before and after specify how many words before and after I want to extract along with it.
However, I'm interested in combining this (and some other things) into a single function, where all you need to plug in is the text you want to search and the word you want to get context for. As far as I've tinkered, though, I can't get anything other than a specific, literal word to work in the above code. Would there be a way, in the context of writing a function in R, to include the above code but with a variable input, for instance
stringr::str_extract_all(text, "([^\\s]+\\s){4}WORD(\\s[^\\s]+){6}")
where WORD is whatever you specify in the overall function:
function(text,WORD)
I apologize for the generally apparent newb-ness of this post. I am very new to all of this but would greatly appreciate any help you could offer.
This is what you are looking for, if I understood you correctly:
my_fun <- function(input_text, word) {
  stringr::str_extract_all(
    string  = input_text,
    # Splice the chosen word into the middle of the context pattern
    pattern = paste("([^\\s]+\\s){4}", word, "(\\s[^\\s]+){6}", sep = "")
  )
}
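For example, with a made-up sentence:

text <- "the old man walked past the quiet stream and sat down to rest his feet"
my_fun(text, "stream")
# [[1]]
# [1] "walked past the quiet stream and sat down to rest his"

One caveat: word is spliced into a regular expression, so if it could ever contain regex metacharacters you would want to escape them first.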
May the light of Eärendil ever shine upon you!
I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for some English dictionary file, but I don't know how to use one. There are so many distinct cases to consider that I don't think it's an easy task to accomplish: is there some similar work or whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: for words that do not exist in the dictionary, try to match them against existing dictionary words by edit distance.
II) The key feature of many of your examples is that they are only one character away from the correct spelling. So, for those words that you fail to match with a dictionary entry, try adding each English character to the front or back and look the resulting word up in the dictionary. This is very expensive in the beginning, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spellings. A sketch of the lookup table follows this list.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus and, for the words the model treats as unknown (i.e. ones it hasn't seen during training), check which word it considers most probable. Most likely the model's top-10 predictions will include the correctly spelled word.
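To illustrate II: a minimal sketch of that lookup table, in R since no language was fixed above (the entries are just examples, to be grown over time as described):

# Lookup table of known non-standard forms -> standard forms
lookup <- c(
  "'re"   = "are",
  "'s"    = "is",
  "havin" = "having",
  "sayin" = "saying"
)

normalize <- function(tokens) {
  hit <- tokens %in% names(lookup)
  tokens[hit] <- lookup[tokens[hit]]  # replace known forms, keep the rest
  tokens
}

normalize(c("they", "'re", "sayin", "something"))
# [1] "they"      "are"       "saying"    "something"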
I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it were parsed into an R structure. There are several existing examples related to this:
the fortunes package
bibtex entries
Rd files
each with some desirable features.
In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command line (i.e. through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from the bibtex and fortunes packages, we could conceive a new system where:
FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), or faq() #surprise me!, faq(51), faq(package="ggplot2").
Packages can provide their own FAQ.rda, the format of which is not clear yet (see below).
Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
QUESTION
I'm not sure what the best input format is, however, either for converting the existing FAQ or for adding new entries.
It is rather cumbersome to use R syntax with a tree of nested lists (or an ad hoc S3/S4/reference class or structure), e.g.
\list(title = "Something to be \\escaped", entry = "long text with quotes, links and broken characters", category = c("windows", "mac", "test"))
Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions rather than FAQ entries. Its syntax is not ideal either; I think a more modern markup, something like Markdown, would be more readable.
Is there something else out there, maybe examples of parsing Markdown files into R structures? Or an example of bending Rd files away from their intended purpose?
To summarise, I would like to come up with:
1- a good design for an R structure (class, perhaps) that would extend the fortunes package to more general entries such as FAQ items (a minimal sketch follows below)
2- a more convenient format for entering new FAQs (rather than the current texinfo format)
3- a parser, written either in R or in some other language (bison?), to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.
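To make point 1 concrete, here is a minimal sketch of what such a structure could look like; every name is hypothetical, and the fields simply mirror the \list() example above:

# Hypothetical S3 class for a single FAQ entry
faq_entry <- function(title, entry, category = character()) {
  structure(
    list(title = title, entry = entry, category = category),
    class = "faq_entry"
  )
}

print.faq_entry <- function(x, ...) {
  cat("Q: ", x$title, "\n", x$entry, "\n", sep = "")
  invisible(x)
}

faq_entry(
  title    = "Why doesn't my lattice plot appear?",
  entry    = "Lattice objects must be explicitly print()ed, e.g. inside loops.",
  category = c("graphics", "lattice")
)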
Update 2: in the last two days of the bounty period I got two answers, both interesting but completely different. Because the question is quite broad (arguably ill-posed), neither answer provides a complete solution, so I will not (for now, anyway) accept one. As for the bounty, I'll award it to the most up-voted answer before the bounty expires, wishing there were a way to split it more equally.
(This addresses point 3.)
You can convert the texinfo file to XML
wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi
and then read it with the XML package.
library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply(doc, "//node", function(u) {
  # Keep each node's title together with its raw contents
  list(list(
    title    = xpathSApply(u, "nodename", xmlValue),
    contents = as(u, "character")
  ))
})
free(doc)
But it is much easier to convert it to text
makeinfo --plaintext R-FAQ.texi > R-FAQ.txt
and parse the result manually.
doc <- readLines("R-FAQ.txt")

# Split the document into questions,
# i.e., around lines like ****** or ======.
i <- grep("[*=]{5}", doc) - 1
i <- c(1, i)
j <- rep(seq_along(i)[-length(i)], diff(i))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)

# Clean the result: since the questions
# are in the subsections, we can discard the sections.
faq <- faq[sapply(faq, function(u) length(grep("[*]", u[2])) == 0)]

# Use the result
cat(faq[[sample(seq_along(faq), 1)]], sep = "\n")
I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format that R can manipulate, presumably so that one can write R routines to extract information from the documentation more effectively.
There seem to be three assumptions here.
1) That it will be easy to convert these different document formats (texinfo, Rd files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.
2) That R is the right language in which to write such document processing tools; I suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development environment to get better at working with R. I'm not an R expert, but I think R is mainly a numerical language that does not offer any special help for string handling, pattern recognition, natural language parsing or inference, all of which I'd expect to play an important part in extracting information from the converted documents, which largely contain natural language. I'm not suggesting a specific alternative language (Prolog??), but if you succeed with the conversion to normal form (task 1), you might be better off carefully choosing the target language for processing.
3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "Information Retrieval" and "Data Fusion" methods. But in fact, reasoning about informal documents has defeated most attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception, but even there it isn't clear what Watson "knows"; would you want Watson to answer the question "Should the surgeon open you with a knife?" no matter how much raw text you gave it?). The point is that you might succeed in converting the data, but it isn't clear what you can successfully do with it.
All that said, most markup systems on text have markup structure and raw text. One can "parse" those into tree-like structures (or graph-like structures, if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing them. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but this doesn't change the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed to building separate parsers, O(N) of them, to cover all N markup styles; how else would a tool know what the markup (and therefore the parse) should be? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup.) Once this parsing into a uniform notation is done, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or, if R isn't the right answer, parse it with whatever the right answer is.
There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody already built. Our DMS Software Reengineering Toolkit is another; after parsing, DMS will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.
But I think your real problem will be deciding what you want to extract/do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, doing all the up front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend" but those words can hide a lot) that's more doable.
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
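In base R terms, the cleaning step I have in mind looks something like this minimal sketch (with a single example synonym rule):

clean_names <- function(x) {
  x <- tolower(x)
  x <- gsub("\\blimited\\b", "ltd", x)  # one example synonym rule
  x <- gsub("[^a-z ]", "", x)           # strip non-alphabetic characters
  trimws(x)
}

clean_names(c("Some Company Limited", "SOME COMPANY LTD!"))
# [1] "some company ltd" "some company ltd"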
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the jarowinkler(), levenshteinSim(), and soundex() functions in RecordLinkage to write my own function that uses my own weighting scheme (also, as things stand, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe the first three letters in each name, or the first three letters of the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
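For instance, a stripped-down sketch of that scoring step (the vectors and weights here are purely illustrative):

library(RecordLinkage)

a <- c("some company ltd", "acme widgets")
b <- c("some company ltd", "acme widget co")

# Blend two similarity measures; the weights are arbitrary here
score <- 0.7 * jarowinkler(a, b) + 0.3 * levenshteinSim(a, b)
data.frame(a, b, score)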
Maybe Google Refine could help. It seems like a better fit if you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case, something like an edit-distance calculation would probably work, but if you need to find near-duplicates in larger text-based documents, you can try
http://www.softcorporation.com/products/neardup/
I'd like to find patterns in a hex file I have and sort them by number of occurrences.
I am not looking for any specific pattern, just to gather statistics on the occurrences in the file and sort them.
DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40DB50DB87DBB0DBA1DBABDBA0DB9ADBA6DBACDBA0DB96DB95DBB7DBCFDBCBDBD6DB9CDBB5DB9DDB9FDBA3DB88DB89DB93DBA5DB9CDBC1DBC1DBC6DBC3DBC9DBB3DBB8DBB6DBC8DBA8DBB6DBA2DB98DBA9DBB9DBDBDBD5DBD9DBC3DB9BDBA2DB84DB83DB7DDB6BDB58DB4EDB42DB16DB0DDB01DB02DAFCDAE9DAE5DAD9DAE2DAB7DA9BDAA6DA9EDAAADAC9DACADAC4DA92DA90DA84DA89DA93DAA9DA8CDA7FDA62DA53DA6EDA
That's an excerpt of the HEX file, and as an example I'd like to get:
XX occurrences of BDBDBD
XX occurrences of B93D
Is there a way to mine the file to generate that output?
Sure. Use a sliding window to create the counts (the link is for Perl, but it should be general enough to convey the algorithm). Your patterns are called N-grams. You will have to limit the maximal pattern length, though.
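Since you didn't specify a language, here is what the sliding window might look like in R (the window width n is in characters, and hex is assumed to hold the file contents as a single string):

count_ngrams <- function(hex, n) {
  starts <- seq_len(nchar(hex) - n + 1)
  grams  <- substring(hex, starts, starts + n - 1)  # all windows of width n
  sort(table(grams), decreasing = TRUE)             # most frequent first
}

hex <- "DB0DDAEEDAF7DAF5DB1FDB1D"  # excerpt of the dump above
head(count_ngrams(hex, 4))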
This is a pretty classic CS problem. The code is non-trivial to implement, as it will require at least one full pass over the sequence, and depending on your efficiency and memory/processor constraints it might require several. See here.
You will need to partition your input string in some way to ensure that you get a good subsequence across it.
If there is a specific problem we might be able to help more, but the general strategy is in the Wikipedia article above.
You can use regular expressions to make a pattern to search for.
The regex needed would be very simple: just use the exact phrase you're searching for. Then there should be a regular-expression function in the language you're using (you didn't specify) that can count the number of matches.
Use that to create a simple counter.
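Since no language was specified, here is what that might look like in R, using gregexpr to count non-overlapping literal matches:

count_pattern <- function(text, pattern) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]  # match positions
  if (hits[1] == -1) 0L else length(hits)             # -1 means no match
}

count_pattern("DB0DDAEEDAF7DAF5DB1FDB1D", "DB1")
# [1] 2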