Extracting university names from affiliation in PubMed data with R

I've been using the extremely useful rentrez package in R to get author, article ID and author affiliation information from the PubMed database. This works fine, but now I would like to extract information from the affiliation field. Unfortunately, the affiliation field is a largely unstructured, non-standardized string holding various types of information (the name of the university, the name of the department, the address and more) delimited by commas. A text-mining approach is therefore necessary to get anything useful out of this field.
I tried the easyPubmed package in combination with rentrez, and even though easyPubmed can extract some information from the affiliation field (e.g. the email address, which is very useful), to my knowledge it cannot extract the university name. I also tried the pubmed.mineR package, but unfortunately it does not provide university name extraction either. I started to experiment with grep and regex functions, but as I am no R expert I could not make this work.
I was able to find very similar threads solving the issue in Python:
Regex for extracting names of colleges, universities, and institutes?
How to extract university/school/college name from string in python using regular expression?
But unfortunately I do not know how to convert the Python regex to an R regex, as I am not familiar with Python.
Here is some example data:
PMID = c(121,122,123,124,125)
author=c("author1","author2","author3","author4","author5")
Affiliation = c("blabla,University Ghent,blablabla", "University Washington, blabla, blablabla, blablabalbalba","blabla,University of Florence,blabla", "University Chicago, Harvard University", "Oxford University")
df = as.data.frame(cbind(PMID,author,Affiliation))
df
PMID author Affiliation
1 121 author1 blabla,University Ghent,blablabla
2 122 author2 University Washington, blabla, blablabla, blablabalbalba
3 123 author3 blabla,University of Florence,blabla
4 124 author4 University Chicago, Harvard University
5 125 author5 Oxford University
What I would like to get:
  PMID  author Affiliation                                               University
1  121 author1 blabla,University Ghent,blablabla                         University Ghent
2  122 author2 University Washington, blabla, blablabla, blablabalbalba  University Washington
3  123 author3 blabla,University of Florence,blabla                      University of Florence
4  124 author4 University Chicago, Harvard University                    University Chicago, Harvard University
5  125 author5 Oxford University                                         Oxford University
Apologies if there is already a solution online, but I honestly googled a lot and did not find any clear solution for R. I would be very thankful for any hints and solutions for this task.

In general, regex patterns can be ported to R with some changes. For example, using the php link you included, you can create a new variable with the extracted text using that regex, changing only the escape character ("\\" instead of "\"). So, using the dplyr and stringr packages:
library(dplyr)
library(stringr)
df <- df %>%
  mutate(Organization = str_extract(
    Affiliation,
    "([A-Z][^\\s,.]+[.]?\\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\\d]*(?=,|\\d)"
  ))

Related

Pairing a gsub function and text file for corpus cleaning

I have a large sample of tweets that I am trying to clean up before I analyze them. I have the tweets in a dataframe where each cell has the contents of one tweet (e.g. "i love san francisco" and "proud member of the air force"). However, there are some words in each bio that should be combined when I analyze the text in a network visualization. I want to also combine common two-word phrases (e.g. "new york", "san francisco", and "air force"). I have already compiled the list of terms that need to be combined, and have used gsub to combine a few of them with this line of code:
twitterdata_cleaning$bio = gsub('air force','airforce',twitterdata_cleaning$bio)
The line of code above turns "proud member of the air force" into "proud member of the airforce". I have been able to successfully do this with dozens of two-word phrases.
However, I have hundreds of two-word phrases in the bios, and I want to keep better track of them, so I've moved all of these terms into two columns in an Excel file. I would like to find a way to use the above formula with a txt or Excel file: identify terms in the dataframe that match those in the first column of the file and change them to the corresponding terms in the second column.
For example, I have xlsx and txt files that look like this:
column1             column2
san francisco sanfrancisco
new york newyork
las vegas lasvegas
san diego sandiego
new hampshire newhampshire
good bye goodbye
air force airforce
video game videogame
high school school
middle school school
elementary school school
I would like to use the gsub command in a formula that searches the dataframe for all the terms in column 1 and turns them into the terms in column 2, using something like this:
twitterdata_df$tweet = gsub('textfile$column1','textfile$column2',twitterdata_df$tweet)
to get something like this in the cells:
i love sanfrancisco
can not wait to go to newyork
what happens in lasvegas stays there
at the beach in sandiego
can beat the autumn leave in newhampshire
so done with all the drama goodbye
proud member of the airforce
love this videogame so much
playing at the school tonight
so sick of school
school was the best and i miss it
Any help would be very greatly appreciated.
Generalized Solution
You can feed a named vector into str_replace_all() from the stringr package to accomplish this. In my example, df has a column with old values to be replaced and a column with the new values. This is, I assume, what you mean by having an Excel file to track them.
library(stringr)
df <- data.frame(old = c("five", "six", "seven"),
new = as.character(5:7),
stringsAsFactors = FALSE)
text <- c("I am a vector with numbers six and other text five",
"another vector seven six text five")
str_replace_all(text, setNames(df$new, df$old))
Result:
[1] "I am a vector with numbers 6 and other text 5" "another vector 7 6 text 5"
Specific Example
Data
Read in the text file with the replacements.
textfile <- read.csv(textConnection("column1,column2
san francisco,sanfrancisco
new york,newyork
las vegas,lasvegas
san diego,sandiego
new hampshire,newhampshire
good bye,goodbye
air force,airforce
video game,videogame
high school,school
middle school,school
elementary school,school"), stringsAsFactors = FALSE)
Load a data frame with tweets in the column tweet.
twitterdata_df <- data.frame(id = 1:11)
twitterdata_df$tweet <- c("i love san francisco",
"can not wait to go to new york",
"what happens in las vegas stays there",
"at the beach in san diego",
"can beat the autumn leave in new hampshire",
"so done with all the drama goodbye",
"proud member of the air force",
"love this video game so much",
"playing at the high school tonight",
"so sick of middle school",
"elementary school was the best and i miss it")
Replace
twitterdata_df$tweet2 <- str_replace_all(twitterdata_df$tweet, setNames(textfile$column2, textfile$column1))
Result
As you can see, the replacements were made in tweet2.
id tweet tweet2
1 1 i love san francisco i love sanfrancisco
2 2 can not wait to go to new york can not wait to go to newyork
3 3 what happens in las vegas stays there what happens in lasvegas stays there
4 4 at the beach in san diego at the beach in sandiego
5 5 can beat the autumn leave in new hampshire can beat the autumn leave in newhampshire
6 6 so done with all the drama goodbye so done with all the drama goodbye
7 7 proud member of the air force proud member of the airforce
8 8 love this video game so much love this videogame so much
9 9 playing at the high school tonight playing at the school tonight
10 10 so sick of middle school so sick of school
11 11 elementary school was the best and i miss it school was the best and i miss it
Thanks for your help, but I found out how to do it. I decided to use a loop that goes through my table of two columns, searches for each term in the first column and replaces it with the term in the second column.
for (i in 1:nrow(compoundterms)) {
  twitterdata_df$tweet = gsub(compoundterms[i, 1], compoundterms[i, 2], twitterdata_df$tweet)
}
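One caveat worth adding (my note, not part of the original question or answer): gsub() treats the search term as a regular expression, so phrases containing characters such as "." or "(" would be interpreted as metacharacters. Passing fixed = TRUE makes the match literal:

# same loop, but matching the phrases literally rather than as regex
# (assumes compoundterms has the old phrases in column 1 and the replacements in column 2)
for (i in seq_len(nrow(compoundterms))) {
  twitterdata_df$tweet <- gsub(compoundterms[i, 1], compoundterms[i, 2],
                               twitterdata_df$tweet, fixed = TRUE)
}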

NLP: Extracting only specific sentence of whole text in R

I have multiple rows of text data (each row a different document), and each row has around 60-70 lines of text (more than 50,000 characters). Of these, my area of interest is only 1-2 lines, based on keywords. I want to extract only those sentences where the keyword/group of words is present. My hypothesis is that by extracting only that piece of information I can get better POS tagging and understand sentence context better, since I am only looking at the sentences I need. Is my understanding correct, and how can we accomplish this in R apart from using regex and full stops? That might be computationally intensive.
Eg:
The Boy lives in Miami and studies in the st. Martin School.The boy has a heiht of 5.7" and weights 60 Kg's. He has intrest in the Arts and crafts; and plays basketball..............................................
..................................................................
I just want to extract the sentence "The Boy lives in Miami and studies in the st. Martin School" based on the keyword "study" (stemmed keyword).
For this example, I used three packages: NLP and openNLP (for sentence splitting) and SnowballC (for stemming). I did not use the tokenizers package mentioned above because I was not familiar with it. The NLP and openNLP packages interface with the Apache OpenNLP toolkit, which is well known and widely used by the community.
First, use the code below to install the packages mentioned. If you have the packages installed, skip to the next step:
## List of used packages
list.of.packages <- c("NLP", "openNLP", "SnowballC")
## Returns a not installed packages list
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
## Installs new packages
if(length(new.packages))
install.packages(new.packages)
Next, load used packages:
library(NLP)
library(openNLP)
library(SnowballC)
Next, convert the text to a String with the NLP package's as.String(). This is necessary because the openNLP package works with the String type. In this example, I used the same text that you provided in your question:
example_text <- paste0("The Boy lives in Miami and studies in the St. Martin School. ",
"The boy has a heiht of 5.7 and weights 60 Kg's. ",
"He has intrest in the Arts and crafts; and plays basketball. ")
example_text <- as.String(example_text)
#output
> example_text
The Boy lives in Miami and studies in the St. Martin School. The boy has a heiht of 5.7 and weights 60 Kg's. He has intrest in the Arts and crafts; and plays basketball.
Next, we use the openNLP package to generate a sentence annotator that computes the annotations through a sentence detector:
sent_annotator <- Maxent_Sent_Token_Annotator()
annotation <- annotate(example_text, sent_annotator)
Next, using the annotations computed on the text, we can extract the sentences:
splited_text <- example_text[annotation]
#output
splited_text
[1] "The Boy lives in Miami and studies in the St. Martin School."
[2] "The boy has a heiht of 5.7 and weights 60 Kg's. "
[3] "He has intrest in the Arts and crafts; and plays basketball. "
Finally, we use the wordStem function from the SnowballC package, which has support for English. This function reduces a word or a vector of words to its stem (common base form). We then use base R's grep function to find the sentences that contain the keyword we are looking for:
stemmed_keyword <- wordStem("study", language = "english")
sentence_index <- grep(stemmed_keyword, splited_text)
#output
splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the St. Martin School."
Note
Note that I have changed the example text you provided from "... st. Martin School." to "... St. Martin School.". If the letter "s" remained lowercase, the sentence detector would treat the period in "st." as the end of a sentence, and the vector with the split sentences would be as follows:
> splited_text
[1] "The Boy lives in Miami and studies in the st." "Martin School."
[3] "The boy has a heiht of 5.7 and weights 60 Kg's." "He has intrest in the Arts and crafts; and plays basketball."
Consequently, when searching for your keyword in this vector, your output would be:
> splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the st."
I also tested the tokenizers package mentioned above and it has the same problem. So notice that this is an open problem in NLP annotation tasks. However, the logic and algorithm above work correctly.
I hope this helps.
For each document, you could first apply SnowballC::wordStem to stem the words, and then use tokenizers::tokenize_sentences to split the document into sentences. You could then use grepl to find the sentences that contain the keywords you are looking for.
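A minimal sketch of that suggestion (my reading of it, assuming the tokenizers and SnowballC packages are installed; here the document is split into sentences first and the stemmed keyword is then matched with grepl). The same caveat about "st." versus "St." discussed above applies to this tokenizer as well:

library(tokenizers)
library(SnowballC)

text <- paste0("The Boy lives in Miami and studies in the St. Martin School. ",
               "The boy has a heiht of 5.7 and weights 60 Kg's. ",
               "He has intrest in the Arts and crafts; and plays basketball.")

# split the document into sentences (one list element per input document)
sentences <- tokenize_sentences(text)[[1]]

# stem the keyword ("study" -> "studi") and keep the sentences that contain it
stemmed_keyword <- wordStem("study", language = "english")
sentences[grepl(stemmed_keyword, sentences, ignore.case = TRUE)]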

Webscraping with R JSON

I want to get the names of the companies in two columns, Region and Name of role-player. I have already found the JSON links on each page, but it did not work with RJSONIO. It collects the data, but how can I get it into a readable form? Could anybody help? Thanks.
Here is the link
I tried this code from another similar question on Stack Overflow:
library(RJSONIO)
library(RCurl)
# grab the data
raw_data <- getURL("http://www.milksa.co.za/admin/settings/mis_rest/webservicereceive/GET/index/page:1/regionID:7.json")
# Then convert from JSON into a list in R
data <- fromJSON(raw_data)
length(data)
final_data <- do.call(rbind, data)
head (final_data)
My personal preference for this is to use the jsonlite package and not RJSONIO's fromJSON at all:
require(jsonlite)
data<-jsonlite::fromJSON(raw_data, simplifyDataFrame = TRUE)
finalData<-data.frame(cbind(data$rolePlayers$RolePlayer$orgName, data$rolePlayers$Region$RegionNameEng))
colnames(finalData)<-c("Name", "Region")
Which gives you the following data frame:
Name Region
GoodHope Cheese (Pty) Ltd Western Cape
Jay Chem (Pty) Ltd Western Cape
Coltrade International cc Western Cape
GC Rieber Compact South Africa (Pty) Ltd Western Cape
Latana Cheese Pty Ltd Western Cape
Marco Frischknecht Western Cape
A great way to visualize how to query and what is in your JSON string can be found here: Chris Photo JSON viewer.
You can just cut and paste it in there from raw_data (removing the outer quotation marks). From there it becomes easy to see how to structure your data, using addressing like you would with a traditional data frame and the $ operator.
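If you prefer to stay inside R, str() gives a similar overview of the parsed structure before you decide which elements to pull out (a small sketch; it assumes raw_data has already been fetched as above):

library(jsonlite)

data <- jsonlite::fromJSON(raw_data, simplifyDataFrame = TRUE)

# show the nesting a couple of levels deep; the $ paths used above
# (data$rolePlayers$RolePlayer$orgName, ...) can be read off this output
str(data, max.level = 2)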

Can't import this excel file into R

I'm having trouble importing a file into R. The file was obtained from this website: https://report.nih.gov/award/index.cfm, where I clicked "Import Table" and downloaded a .xls file for the year 1992.
This image might help describe how I retrieved the data
Here's what I've tried typing into the console, along with the results:
Input:
> library('readxl')
> data1992 <- read_excel("1992.xls")
Output:
Not an excel file
Error in eval(substitute(expr), envir, enclos) :
Failed to open /home/chrx/Documents/NIH Funding Awards, 1992 - 2016/1992.xls
Input:
> data1992 <- read.csv ("1992.xls", sep ="\t")
Output:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I'm not sure whether or not this is relevant, but I'm using GalliumOS (linux). Because I'm using Linux, Excel isn't installed on my computer. LibreOffice is.
Why bother with getting the data in and out of a .csv if it's right there on the web page for you to scrape?
# note the query parameters in the url when you apply a filter, e.g. fy=
url <- 'http://report.nih.gov/award/index.cfm?fy=1992'
library('rvest')
library('magrittr')
library('dplyr')
df <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="orgtable"]') %>%
html_table()%>%
extract2(1) %>%
mutate(Funding = as.numeric(gsub('[^0-9.]','',Funding)))
head(df)
returns
Organization City State Country Awards Funding
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 356221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 1097158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 629946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 1757241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 2161146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 450411
If you need to loop through years 1992 to present, or something similar, this programmatic approach will save you a lot of time versus handling a bunch of flat files.
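For example, here is a hedged sketch of such a loop (it assumes the fy= query parameter and the orgtable id work the same way for every year, which is worth verifying):

library(rvest)
library(magrittr)
library(dplyr)

years <- 1992:1995   # extend to the range you need

nih_by_year <- lapply(years, function(y) {
  paste0('http://report.nih.gov/award/index.cfm?fy=', y) %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="orgtable"]') %>%
    html_table() %>%
    extract2(1) %>%
    mutate(Funding = as.numeric(gsub('[^0-9.]', '', Funding)),
           Year = y)
})

nih_all <- bind_rows(nih_by_year)
head(nih_all)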
This works for me
library(gdata)
dat1 <- read.xls("1992.xls")
If you're on 32-bit Windows this will also work (note that odbcConnectExcel only opens a connection; the sheet is then read with sqlFetch):
require(RODBC)
con <- odbcConnectExcel("1992.xls")
dat1 <- sqlFetch(con, "Sheet1")  # replace "Sheet1" with the actual sheet name
odbcClose(con)
For several more options that rely on rJava-based packages like xlsx you can check out this link.
As someone mentioned in the comments it's also easy to save the file as a .csv and read it in that way. This will save you the trouble of dealing with the effects of strange formatting or metadata on your imported file:
dat1 <- read.csv("1992.csv")
head(dat1)
ORGANIZATION CITY STATE COUNTRY AWARDS FUNDING
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 $356,221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 $1,097,158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 $629,946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 $1,757,241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 $2,161,146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 $450,411
Converting to .csv is also usually the fastest way, in my opinion (though speed only really matters with very large files).
