Converting journal titles to their abbreviated form - r

Good morning my hero!
I have a list of journal titles in English, Spanish and Portuguese that I want to convert to their abbreviated form. The official abbreviation dictionary for journal titles is the List of Title Word Abbreviations (LTWA) found on the ISSN website.
# example of my data
journal_names <- c("peste revista psicanalise sociedade", "abanico veterinario", "abcd arquivos brasileiros cirurgia digestiva sao paulo", "academo asuncion", "accion psicologica", "acimed", "acta academica", "acta amazonica", "acta bioethica", "acta bioquimica clinica latinoamericana")
I have split each title into a list of single words. So currently I have a list of lists, where each title is a list of its individual words.
[[1]]
[1] "peste" "revista" "psicanalise" "sociedade"
[[2]]
[1] "abanico" "veterinario"
Once I remove the stop words (as seen above), I need to match any relevant words to the suffixes or prefixes in the LTWA and then convert them to the abbreviation. I have converted the LTWA words so that they have regular expressions and can be used to search for a match easily with a package like stringi.
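For example, the conversion I used turns a trailing hyphen in an LTWA word into an anchored prefix pattern (a simplified sketch of what produced the REXP column below, assuming the data frame is called ltwa):
ltwa$REXP <- ifelse(grepl("-$", ltwa$WORDS),
                    paste0("^", sub("-$", "", ltwa$WORDS), ".*"),  # prefix entry, e.g. "prophylact-" -> "^prophylact.*"
                    ltwa$WORDS)                                    # full-word entry is used as-is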
# this is an excerpt from the dataframe I created with the LTWA
The ABBREVIATIONS_NA column replaces "n.a." with the original word, and the REXP column holds the prefix/suffix converted to a regular expression:
WORDS,ABBREVIATIONS,LANGUAGES,REXP,ABBREVIATIONS_NA
proofreader,proofread.,eng,proofreader,proofread.
prophylact-,prophyl.,eng,^prophylact.*,prophyl.
propietario,prop.,spa,propietario,prop.
propriedade,propr.,por,propriedade,propr.
prostético,prostét.,spa,prostético,prostét.
protecção,prot.,por,protecção,prot.
proteccion-,prot.,spa,^proteccion.*,prot.
prototyping,prototyp.,eng,prototyping,prototyp.
provisional,n.a.,eng,provisional,provisional
provisóri-,n.a.,por,^provisóri.*,provisóri-
proyección,proyecc.,spa,proyección,proyecc.
psicanalise,psicanal.,por,psicanalise,psicanal.
psicoeduca-,psicoeduc.,spa,^psicoeduca.*,psicoeduc.
psicosomat-,psicosom.,spa,^psicosomat.*,psicosom.
psicotecni-,psicotec.,spa,^psicotecni.*,psicotec.
psicoterap-,psicoter.,spa,^psicoterap.*,psicoter.
psychedelic,n.a.,eng,psychedelic,psychedelic
psychoanal-,psychoanal.,eng,^psychoanal.*,psychoanal.
psychodrama,n.a.,eng,psychodrama,psychodrama
psychopatha,n.a.,por,psychopatha,psychopatha
pteridolog-,pteridol.,eng,^pteridolog.*,pteridol.
publicitar-,public.,spa,^publicitar.*,public.
puericultor,pueric.,spa,puericultor,pueric.
Puerto Rico,P. R.,spa,Puerto Rico,P. R.
The search and conversion needs to be done from largest prefix/suffix to smallest prefix/suffix, and words that have already been processed cannot be processed again.
The issue: I would like to convert each title word to its proper abbreviation. However, a word like 'latinoamericano' should only match the longer prefix 'latinoameri-' and be converted to 'latinoam.' The problem is that it also matches the shorter prefix 'latin-' and then gets converted to 'latin.' How can I make it so that each word is only processed once?
Also note that my LTWA database only has about 12,000 words in total, so there will be words that don't have a match at all.
I have gotten up to this point, but I am not sure where to go from here. So far, I have only come up with very clunky solutions that do not work reliably.
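For context, this is roughly the direction I have been exploring, as a sketch (assuming ltwa is the data frame excerpted above and title_words is one title's word vector with stop words removed):
library(stringi)

abbreviate_title <- function(title_words, ltwa) {
  # try the longest LTWA entries first, so 'latinoameri-' wins over 'latin-'
  ltwa <- ltwa[order(-nchar(ltwa$WORDS)), ]
  done <- rep(FALSE, length(title_words))  # marks words that were already converted
  for (i in seq_len(nrow(ltwa))) {
    hits <- !done & stri_detect_regex(title_words, ltwa$REXP[i])
    title_words[hits] <- ltwa$ABBREVIATIONS_NA[i]
    done <- done | hits
  }
  title_words  # unmatched words are returned unchanged
}

abbreviate_title(c("acta", "bioquimica", "clinica", "latinoamericana"), ltwa)
Here the done flag is what would keep each word from being processed twice, and ordering by nchar(WORDS) handles the largest-to-smallest requirement, but this loops over the whole dictionary per title and feels clunky.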
Thank you!

Related

Extract text based on character position returned from gregexpr

I'm working in R, trying to prepare text documents for analysis. Each document is stored in a column (aptly named "document") of a data frame called "metaDataFrame." The documents are strings containing articles and their BibTex citation info. The data frame looks like this:
[1] filename document doc_number
[2] lithuania2016 Commentary highlights Estonian... 1
[3] lithuania2016 Norwegian police, immigration ... 2
[4] lithuania2016 Portugal to deply over 1,000 m... 3
I want to extract the BibTex information from each document into a new column. The citation information begins with "Credit:" but some articles contain multiple "Credit:" instances, so I need to extract all of the text after the last instance. Unfortunately, the string is only sometimes preceded by a new line.
My solution so far has been to find all of the instances of the string and save the location of the last instance of "Credit:" in each document in a list:
locate.last.credit <- lapply(gregexpr('Credit:', metaDataFrame$document), tail, 1)
This provides a list of integer locations of the last "Credit:" string in each document or a value of "-1" where no instance is found. (Those missing values pose a separate but related problem I think I can tackle after resolving this issue).
I've tried variations of strsplit, substr, stri_match_last, and rm_between...but can't figure out a way to use the character position in lieu of regular expression to extract this part of the string.
How can I use the location of characters to manipulate a string instead of regular expressions? Is there a better approach to this (perhaps with regex)?
How about like this:
test_string <- " Portugal to deply over 1,000 m Credit: mike jones Credit: this is the bibliography"
gsub(".*Credit:\\s*(.*)", "\\1", test_string, ignore.case = TRUE)
[1] "this is the bibliography"
The regex pattern is looking for Credit:, but because it's preceded by .*, it's going to find the last instance of the word (if you wanted the first instance of Credit:, you'd use .*?). \\s* matches 0 or more whitespace characters after Credit: and before the rest of the text. We then capture the remainder of each document with (.*) as capture group 1, and return it with \\1. Also, I use ignore.case = TRUE so credit, CREDIT, and Credit will all be matched.
And with your object it would be:
gsub(".*Credit:\\s*(.*)", "\\1", metaDataFrame$document, ignore.case = TRUE)

Regular expression for extracting UK postcode from an address is not ordered

I'm trying to extract UK postcodes from address strings in R, using the regular expression provided by the UK government here.
Here is my function:
address_to_postcode <- function(addresses) {
  # 1. Convert addresses to upper case
  addresses <- toupper(addresses)
  # 2. Regular expression for UK postcodes:
  pcd_regex <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})"
  # 3. Check if a postcode is present in each address (TRUE if present, else FALSE)
  present <- grepl(pcd_regex, addresses)
  # 4. Extract postcodes matching the regular expression for a valid UK postcode
  postcodes <- regmatches(addresses, regexpr(pcd_regex, addresses))
  # 5. Return NA where an address does not contain a (valid format) UK postcode
  postcodes_out <- list()
  postcodes_out[present] <- postcodes
  postcodes_out[!present] <- NA
  # 6. Return the results in a vector (same length as the input vector)
  return(do.call(c, postcodes_out))
}
According to the guidance document, the logic this regular expression looks for is as follows:
"GIR 0AA" OR one letter followed by either one or two numbers OR one letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e. not I) and then followed by either one or two numbers OR one letter followed by one number and then another letter OR a two-part postcode where the first part must be one letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e. not I) and then followed by one number and optionally a further letter after that, AND the second part (separated by a space from the first part) must be one number followed by two letters. A combination of upper and lower case characters is allowed. Note: the length is determined by the regular expression and is between 2 and 8 characters.
My problem is that this logic is not completely preserved when using the regular expression without the ^ and $ anchors (as I have to do in this scenario because the postcode could be anywhere within the address strings); what I'm struggling with is how to preserve the order and number of characters for each segment in a partial (as opposed to complete) string match.
Consider the following example:
> address_to_postcode("1A noplace road, random city, NR1 2PK, UK")
[1] "NR1 2PK"
According to the logic in the guideline, the second letter in the postcode cannot be 'z' (and there are some other exclusions too); however look what happens when I add a 'z':
> address_to_postcode("1A noplace road, random city, NZ1 2PK, UK")
[1] "Z1 2PK"
... whereas in this case I would expect the output to be NA.
Adding the anchors (for a different usage case) doesn't seem to help as the 'z' is still accepted even though it is in the wrong place:
> grepl("^[Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$", "NZ1 2PK")
[1] TRUE
Two questions:
1. Have I misunderstood the logic of the regular expression?
2. If not, how can I correct it (i.e. why aren't the specified letter and character ranges exclusive to their position within the regular expression)?
Edit
Since posting this answer, I dug deeper into the UK government's regex and found even more problems. I posted another answer here that describes all the issues and provides alternatives to their poorly formatted regex.
Note
Please note that I'm posting the raw regex here. You'll need to escape certain characters (like backslashes \) when porting to R.
Issues
You have many issues here, all of which are caused by whoever created the document you're retrieving your regex from or the coder that created it.
1. The space character
My guess is that when you copied the regular expression from the link you provided it converted the space character into a newline character and you removed it (that's exactly what I did at first). You need to, instead, change it to a space character.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
(the fix is the literal space before the final [0-9][A-Za-z]{2})
2. Boundaries
You need to remove the anchors ^ and $ as these indicate start and end of line. Instead, wrap your regex in (?:) and place a \b (word boundary) on either end as the following shows. In fact, the regex in the documentation is incorrect (see Side note for more information) as it will fail to anchor the pattern properly.
See regex in use here
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(the \b(?: at the start and the closing )\b at the end)
3. Character class oversight
There's a missing - in the character class, as pointed out by @deadcrab in his answer here.
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(the - restored in what was [AZa-z] in the ([A-Za-z][0-9][A-Za-z]) alternative)
4. They made the wrong character class optional!
In the documentation it clearly states:
A two part post code where the first part must be:
One letter followed by a second letter that must be one of ABCDEFGHJKLMNOPQRSTUVWXY (i.e..not I) and then followed by one number and optionally a further letter after that
They made the wrong character class optional!
\b(?:([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))\b
(the ? currently makes [0-9] optional in the last alternative; it should instead make the trailing [A-Za-z] optional)
5. The whole thing is just awful...
There are so many things wrong with this regex that I just decided to rewrite it. It can very easily be simplified to perform a fraction of the steps it currently takes to match text.
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? [0-9][A-Za-z]{2}|[Gg][Ii][Rr] 0[Aa]{2})\b
Answer
As mentioned in the comments below my answer, some postcodes are missing the space character. For missing spaces in the postcodes (e.g. NR12PK), simply add a ? after the spaces as shown in the regex below:
\b(?:[A-Za-z][A-HJ-Ya-hj-y]?[0-9][0-9A-Za-z]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})\b
(the two ? added after the spaces)
You may also shorten the regex above with the following and use the case-insensitive flag (ignore.case(pattern) or ignore_case = TRUE in R, depending on the method used):
\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\b
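For example, in R the shortened pattern might be applied like this (a sketch; note the doubled backslashes for \b, and perl = TRUE so the non-capturing group is supported):
pc_regex <- "\\b(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]? ?[0-9][A-Z]{2}|GIR ?0A{2})\\b"
# extract the first postcode-like match from each address
regmatches(addresses, regexpr(pc_regex, addresses, ignore.case = TRUE, perl = TRUE))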
Note
Please note that regular expressions only validate the possible format(s) of a string and cannot actually identify whether or not a postcode legitimately exists. For this, you should use an API. There are also some edge-cases where this regex will not properly match valid postcodes. For a list of these postcodes, please see this Wikipedia article.
The regex below additionally matches the following (make it case-insensitive to match lowercase variants as well):
British Overseas Territories
British Forces Post Office
Although these have recently been changed to align with the British postcode system (BF followed by a number, starting with BF1), they're considered optional alternative postcodes
Special cases outlined in that article (as well as SAN TA1 - a valid postcode for Santa!)
See this regex in use here.
\b(?:(?:[A-Z][A-HJ-Y]?[0-9][0-9A-Z]?|ASCN|STHL|TDCU|BBND|[BFS]IQ{2}|GX11|PCRN|TKCA) ?[0-9][A-Z]{2}|GIR ?0A{2}|SAN ?TA1|AI-?[0-9]{4}|BFPO[ -]?[0-9]{2,3}|MSR[ -]?1(?:1[12]|[23][135])0|VG[ -]?11[1-6]0|[A-Z]{2} ?[0-9]{2}|KY[1-3][ -]?[0-2][0-9]{3})\b
I would also recommend anyone implementing this answer to read this StackOverflow question titled UK Postcode Regex (Comprehensive).
Side note
The documentation you linked to (Bulk Data Transfer: Additional Validation for CAS Upload - Section 3. UK Postcode Regular Expression) actually has an improperly written regular expression.
As mentioned in the Issues section, they should have:
Wrapped the entire expression in (?:) and placed the anchors around the non-capturing group. Their regular expression, as it stands, will fail for some cases, as seen here.
The regular expression is also missing - in one of the character classes
It also made the wrong character class optional.
Here is my regular expression:
txt <- "0288, Bishopsgate, London Borough of Tower Hamlets, London, Greater London, England, EC2M 4QP, United Kingdom"
# extract every substring that looks like a UK postcode
matches <- regmatches(txt, gregexpr("[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}", txt))[[1]]

Use R to search for a specific text pattern and return entire sentence(s) where pattern appears

So I scanned in a physical document, changed it to a tiff image and used the package Tesseract to import it into R. However, I need R to look for specific keywords, find it in the text file and return the entire line that the keyword is in.
For example, if I had the text file:
This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”.
And I tell R to search for the keyword "straightforward", how do I get it to return "This is also straightforward...see if that matches the"?
Here is a solution using the quanteda package that breaks the text into sentences, and then uses grep() to return the sentence containing the word "straightforward".
aText <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
library(quanteda)
aCorpus <- corpus(aText)
theSentences <- tokens(aCorpus,what="sentence")
grep("straightforward",theSentences,value=TRUE)
and the output:
> grep("straightforward",theSentences,value=TRUE)
text1
"This is also straightforward."
To search for multiple keywords, add them to the grep() pattern via the OR operator |.
grep("straightforward|exceeds",theSentences,value=TRUE)
...and the output:
> grep("straightforward|exceeds",theSentences,value=TRUE)
text1
"This is also straightforward."
<NA>
"It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a \"5\"."
Here is one base R option:
text <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
lst <- unlist(strsplit(text, "(?<=[a-z]\\.\\s)", perl=TRUE))
lst[grepl("\\bstraightforward\\b", lst)]
I am splitting your text on the pattern (?<=[a-z]\\.\\s), which uses a lookbehind for a lowercase letter followed by a full stop and a space. This should work well most of the time. There is the issue of abbreviations, but most of the time they take the form of a capital letter followed by a dot, and most of the time they do not end sentences.

Filtering out non-English words from a corpus using `textcat`

Similar to this SO member, I've been looking for a simple package in R that filters out words that are non-English. For example, I might have a list of words that looks like this:
Flexivel
eficaz
gut-wrenching
satisfatorio
apropiado
Benutzerfreundlich
interessante
genial
cool
marketing
clients
internet
My end goal is to simply filter out the non-English words from the corpus so that my list is simply:
gut-wrenching
cool
marketing
clients
internet
I've read in the data as a data.frame, although it will subsequently be transformed into a corpus and then a TermDocumentMatrix in order to create a wordcloud using wordcloud and tm.
I am currently using the package textcat to filter by language. The documentation is a bit above my head, but seems to indicate that you can run the command textcat on lists. For example, if the data above was in a data.frame called df with a single column called "words", I'd run the command:
library(textcat)
textcat(c(df$word))
However, this has the effect of reading the entire list of words as a single document, rather than looking at each row and determining its language. Please help!
For a dictionary search you could use aspell:
txt <- c("Flexivel", "eficaz", "gut-wrenching", "satisfatorio", "apropiado",
"Benutzerfreundlich", "interessante", "genial", "cool", "marketing",
"clients", "internet")
fn <- tempfile()
writeLines(txt, fn)
result <- aspell(fn)
result$Original gives the non-matching words. From those you can select the matching words:
> result$Original
[1] "Flexivel" "eficaz" "satisfatorio"
[4] "apropiado" "interessante" "Benutzerfreundlich"
> english <- txt[!(txt %in% result$Original)]
> english
[1] "gut-wrenching" "genial" "cool" "marketing"
[5] "clients" "internet"
However, as Carl Witthoft indicates, you cannot be sure that these are actually English words. 'cool', 'marketing' and 'internet' are also valid Dutch words, for example.
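If you would rather stick with textcat, one option is to apply it element-wise and keep only the rows classified as English. A sketch, using the df and word column from the question (note that language identification on single words is quite unreliable):
library(textcat)
langs <- sapply(df$word, textcat)               # classify each word individually
english_words <- df$word[which(langs == "english")]  # drop NA and non-English classifications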

How to count the number of sentences in a text in R?

I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text have as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now, I need to know the number of sentences in the whole text. Does R have any function which can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do that:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by #gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and seperates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R and you can find a good and short intro to it here or you can check out the package's documentation here.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (sentences like "I met Dr. Bennett, the ex husband of Mrs. Johson." can contain full stops).
R is definitely not the best choice for natural language processing. If you are Python proficient, I suggest you have a look at the nltk module, which covers this and many other topics. You can also copy the code from this blog post, which does sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
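One caveat to keep in mind (my addition, not part of the original answer): gregexpr() returns -1 when there is no match, so length() still reports 1 for text with no end-of-sentence characters. A small guard, as a sketch:
count_sentences <- function(text) {
  m <- gregexpr("[[:alnum:] ][.!?]", text)[[1]]
  if (m[1] == -1) 0L else length(m)  # -1 means no end-of-sentence characters were found
}
count_sentences("Hello world!! Here are two sentences for you...")
# [1] 2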
