How to count the number of sentences in a text in R?

I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text have as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have a function that can help me do that? Thank you very much!

Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do it:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by #gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and seperates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R; a short introduction and the package's documentation are both worth a look.
Three more languages are supported by the package; you just need to install and load the corresponding model files:
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
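For example, a minimal sketch for Spanish, assuming the Spanish model follows the same pattern as the English one above (I have not tested this myself):
install.packages("openNLPmodels.es") ## Installs the Spanish model files
library(openNLPmodels.es) ## Loads the Spanish model files
sentDetect("Me llamo Ana. Vivo en Madrid.", language = "es") ## Same sentDetect() call as above, with the Spanish language code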

What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (sentences like "I met Dr. Bennett, the ex-husband of Mrs. Johnson." can contain full stops that do not end the sentence).
R is definitely not the best choice for natural language processing. If you are proficient in Python, I suggest you have a look at the nltk module, which covers this and many other topics; there are also plenty of published examples that do both sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are already able to count characters. One way of doing it with a regular expression:
text <- 'Hello world!! Here are two sentences for you...'
## Count the positions where a letter, digit or space is immediately
## followed by ., ! or ?; runs of punctuation are counted only once.
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
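For the example above this returns 2, one count per sentence even though each sentence ends in repeated punctuation. (Note that gregexpr() returns -1 when there is no match at all, so a text with no sentence-ending punctuation would still be reported as having 1 sentence.)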

Related

Converting journal titles to their abbreviated form

Good morning my hero!
I have a list of journal titles in English, Spanish and Portuguese that I want to convert to their abbreviated form. The official abbreviation dictionary for journal titles is the List of Title Word Abbreviations found on the ISSN website.
# example of my data
journal_names <- c("peste revista psicanalise sociedade", "abanico veterinario",
                   "abcd arquivos brasileiros cirurgia digestiva sao paulo",
                   "academo asuncion", "accion psicologica", "acimed",
                   "acta academica", "acta amazonica", "acta bioethica",
                   "acta bioquimica clinica latinoamericana")
I have split each title into a list of single words. So currently I have a list of lists, where each title is a list of its individual words.
[[1]]
[1] "peste" "revista" "psicanalise" "sociedade"
[[2]]
[1] "abanico" "veterinario"
Once I remove the stop words (as seen above), I need to match any relevant words to the suffixes or prefixes in the LTWA and then convert them to the abbreviation. I have converted the LTWA prefixes/suffixes into regular expressions so that matches can be found easily with a package like stringi.
# this is an excerpt from the data frame I created with the LTWA:
# ABBREVIATIONS_NA replaces "n.a." with the original word, and REXP holds the prefix/suffix as a regular expression
WORDS,ABBREVIATIONS,LANGUAGES,REXP,ABBREVIATIONS_NA
proofreader,proofread.,eng,proofreader,proofread.
prophylact-,prophyl.,eng,^prophylact.*,prophyl.
propietario,prop.,spa,propietario,prop.
propriedade,propr.,por,propriedade,propr.
prostético,prostét.,spa,prostético,prostét.
protecção,prot.,por,protecção,prot.
proteccion-,prot.,spa,^proteccion.*,prot.
prototyping,prototyp.,eng,prototyping,prototyp.
provisional,n.a.,eng,provisional,provisional
provisóri-,n.a.,por,^provisóri.*,provisóri-
proyección,proyecc.,spa,proyección,proyecc.
psicanalise,psicanal.,por,psicanalise,psicanal.
psicoeduca-,psicoeduc.,spa,^psicoeduca.*,psicoeduc.
psicosomat-,psicosom.,spa,^psicosomat.*,psicosom.
psicotecni-,psicotec.,spa,^psicotecni.*,psicotec.
psicoterap-,psicoter.,spa,^psicoterap.*,psicoter.
psychedelic,n.a.,eng,psychedelic,psychedelic
psychoanal-,psychoanal.,eng,^psychoanal.*,psychoanal.
psychodrama,n.a.,eng,psychodrama,psychodrama
psychopatha,n.a.,por,psychopatha,psychopatha
pteridolog-,pteridol.,eng,^pteridolog.*,pteridol.
publicitar-,public.,spa,^publicitar.*,public.
puericultor,pueric.,spa,puericultor,pueric.
Puerto Rico,P. R.,spa,Puerto Rico,P. R.
The search and conversion needs to be done from largest prefix/suffix to smallest prefix/suffix, and words that have already been processed cannot be processed again.
The issue: I would like to convert each title word to its proper abbreviation. However, a word like 'latinoamericano' should only match the prefix 'latinoameri-' and be converted to 'latinoam.'. The problem is that it will also match 'latin-' and then get converted to 'latin.'. How can I make sure that each word is only processed once?
Also note that my LTWA database only has about 12,000 words in total, so there will be words that don't have a match at all.
I have gotten up to this point but am not sure where to go from here. So far I have only come up with very clunky solutions that do not work perfectly; a rough sketch of what I am attempting is below.
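This is only a rough sketch of the longest-match-first idea (ltwa stands for the data frame above, title_words for one title already split into words, and abbreviate_title() is just a placeholder name, not a tested solution):
library(stringi)
abbreviate_title <- function(title_words, ltwa) {
  ## order the LTWA patterns from longest to shortest prefix/suffix
  ltwa <- ltwa[order(nchar(ltwa$WORDS), decreasing = TRUE), ]
  done <- rep(FALSE, length(title_words)) ## words that have already been converted
  for (i in seq_len(nrow(ltwa))) {
    hits <- !done & stri_detect_regex(title_words, ltwa$REXP[i])
    title_words[hits] <- ltwa$ABBREVIATIONS_NA[i] ## replace with the abbreviation
    done <- done | hits ## never touch these words again
  }
  title_words
}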
Thank you!

Use R to search for a specific text pattern and return entire sentence(s) where pattern appears

So I scanned in a physical document, converted it to a TIFF image, and used the tesseract package to import it into R. However, I need R to look for specific keywords, find them in the text, and return the entire sentence each keyword is in.
For example, if I had the text file:
This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”.
And I tell R to search for the keyword "straightforward", how do I get it to return "This is also straightforward...see if that matches the"?
Here is a solution using the quanteda package that breaks the text into sentences, and then uses grep() to return the sentence containing the word "straightforward".
aText <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
library(quanteda)
aCorpus <- corpus(aText)
theSentences <- tokens(aCorpus,what="sentence")
grep("straightforward",theSentences,value=TRUE)
and the output:
> grep("straightforward",theSentences,value=TRUE)
text1
"This is also straightforward."
To search for multiple keywords, add them to the grep() call via the "or" operator |.
grep("straightforward|exceeds",theSentences,value=TRUE)
...and the output:
> grep("straightforward|exceeds",theSentences,value=TRUE)
text1
"This is also straightforward."
<NA>
"It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a \"5\"."
Here is one base R option:
text <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
lst <- unlist(strsplit(text, "(?<=[a-z]\\.\\s)", perl=TRUE))
lst[grepl("\\bstraightforward\\b", lst)]
I am splitting your text on the pattern (?<=[a-z]\\.\\s), which uses a lookbehind for a lowercase letter followed by a full stop and a space. This should work well most of the time. There is the issue of abbreviations, but those are usually a capital letter followed by a dot, and they rarely end sentences.
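For the example text above, the last line should return just the first sentence; because the split is zero-width, the full stop and trailing space stay attached to it:
[1] "This is also straightforward. "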

R - Tokenization - single and two letter words in a TermDocumentMatrix

I am currently trying to do a little bit of text processing and I would like to get the one and two letter words in a TermDocumentMatrix.
The issue is that it seems to keep only words of three letters or more.
library(tm)
library(RWeka)
test<-'This is a test.'
testmyCorpus<-Corpus(VectorSource(test))
testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))
inspect(testTDF)
Only the words "this" and "test" are displayed. Any ideas?
Thanks a lot for your help!
Robert
In short, you should add the option control = list(wordLengths = c(1, Inf)) to TermDocumentMatrix(); by default, tm drops words shorter than three characters.
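Applied to the code above, a sketch of the fixed call would look like this (wordLengths is a standard tm control option; its first value is the minimum word length to keep):
library(tm)
library(RWeka)
test <- 'This is a test.'
testmyCorpus <- Corpus(VectorSource(test))
## keep words of any length instead of tm's default minimum of three characters
testTDF <- TermDocumentMatrix(testmyCorpus,
                              control = list(tokenize = AlphabeticTokenizer,
                                             wordLengths = c(1, Inf)))
inspect(testTDF)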

Which function should I use to read unstructured text file into R? [closed]

This is my first ever question here and I'm new to R, trying to figure out my first steps in data processing, so please keep it easy : )
I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.
Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?
Thank you in advance.
PN
P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd get around this problem.
read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.
To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as there are lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newlines on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. A line of text is therefore not defined by the width of your software window; it can run over many visual rows. In effect, a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs:
> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy, I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\"
As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, a backslash appears in front of each quotation mark. Since R delimits the printed strings with quotation marks, it needs to distinguish those delimiters from the quotation marks that are part of the original text, so it "escapes" the latter. Read about escaping on Wikipedia.
readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning does nothing but suppress the warning message.
If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():
mytext <- readLines("textfile.txt")
Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
I would strongly advise you to write your text in a .txt file in a text editor like Vim, Notepad, TextWrangler etc., and not to compose it in a word processor like MS Word. Word files contain more than the text you see on screen or in print, and R will read that extra content too. You can try and see what you get, but for good results you should either save your file as a .txt file from Word or compose it in a text editor.
You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:
myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."
Note how pressing Return does not cause R to execute the command until I close the string with "); R just replies with +, telling me that I can continue typing. I did not type those plus signs. Try it. Note also that the newlines are now part of your string of text. (I'm on a Mac, so my newline is \n.)
If you input your text manually, I would load the whole text as one string into a vector:
x <- c("The text of your book.")
You could load different chapters into different elements of this vector:
y <- c("Chapter 1", "Chapter 2")
For better reference, you can name the elements:
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
Now you can split the elements of any of these vectors:
sentences <- strsplit(z, "[.!?] *")
Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument is a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by an optional space (if you don't include the space here, the resulting "sentences" will be preceded by a space).
sentences now contains:
> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"
$ch2
[1] "This is the text of the second chapter" "It is even shorter"
You can access the individual sentences by indexing:
> sentences$ch1[2]
[3] "It is not long"
R will be unable to know on its own that it should not split after "Mr."; you must define exceptions in your regular expression. A full treatment is beyond the scope of this question, but a small sketch follows.
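As a minimal sketch (using a hard-coded and far from complete list of abbreviations), a perl-style negative lookbehind keeps "Mr.", "Mrs." and "Dr." from ending a sentence:
w <- c(ch1 = "Mr. Smith met Dr. Jones. They talked. It was short!")
## split on sentence punctuation unless it directly follows Mr, Mrs or Dr
strsplit(w, "(?<!\\bMr|\\bMrs|\\bDr)[.!?] *", perl = TRUE)
This keeps "Mr. Smith met Dr. Jones" together as one sentence while the remaining full stops still split as before.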
How you would tell R how to recognize subjects or objects, I have no idea.

Filtering out non-English words from a corpus using `textcat`

Similar to another question on SO, I've been looking for a simple package in R that filters out words that are not English. For example, I might have a list of words that looks like this:
Flexivel
eficaz
gut-wrenching
satisfatorio
apropiado
Benutzerfreundlich
interessante
genial
cool
marketing
clients
internet
My end goal is to simply filter out the non-English words from the corpus so that my list is simply:
gut-wrenching
cool
marketing
clients
internet
I've read in the data as a data.frame, although it will subsequently be transformed into a corpus and then a TermDocumentMatrix in order to create a wordcloud using wordcloud and tm.
I am currently using the package textcat to filter by language. The documentation is a bit above my head, but seems to indicate that you can run the command textcat on lists. For example, if the data above was in a data.frame called df with a single column called "words", I'd run the command:
library(textcat)
textcat(c(df$word))
However, this has the effect of reading the entire list of words as a single document, rather than looking at each row and determining its language. Please help!
For a dictionary search you could use aspell:
txt <- c("Flexivel", "eficaz", "gut-wrenching", "satisfatorio", "apropiado",
"Benutzerfreundlich", "interessante", "genial", "cool", "marketing",
"clients", "internet")
fn <- tempfile()     ## write the words to a temporary file, one per line,
writeLines(txt, fn)  ## because aspell() works on files
result <- aspell(fn) ## run the spell checker; words not in the dictionary are flagged
result$Original gives the non-matching words. From those you can select the matching (English) words:
> result$Original
[1] "Flexivel" "eficaz" "satisfatorio"
[4] "apropiado" "interessante" "Benutzerfreundlich"
> english <- txt[!(txt %in% result$Original)]
> english
[1] "gut-wrenching" "genial" "cool" "marketing"
[5] "clients" "internet"
However, as Carl Witthoft indicates, you cannot be sure that these are actually English words; 'cool', 'marketing' and 'internet' are also valid Dutch words, for example.
