How to remove the first few characters in a column in R?

My data (csv file) has a column that contains uninformative characters (e.g. special characters, random lowercase letters), and I want to remove them.
df <- data.frame(Affiliation = c(". Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","**Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","aas Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","ac Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))
The number of characters I want to remove per line (e.g. ".", "**", "aas", "ac") varies, as shown above.
Expected output:
df <- data.frame(Affiliation = c("Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))
I was thinking of using dplyr's mutate function, but I'm not sure how to go about it.

If we assume that the valid text starts from the first uppercase letter onwards, the following works:
library(tidyverse)
df %>%
  mutate(Affiliation = str_extract(Affiliation, "[:upper:].+"))

Base R regex solution:
df$cleaned_str <- gsub("^\\w+ |^\\*+|^\\. ", "", df$Affiliation)
Tidyverse regex solution:
library(tidyverse)
df %>%
  mutate(Affiliation = str_replace(Affiliation, "^\\w+ |^\\*+|^\\. ", ""))
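If you'd rather not enumerate every possible junk prefix, a base R sketch of the same "keep from the first uppercase letter" idea is to strip everything before the first capital (this assumes the valid text always starts with an A-Z letter):
# assumption: the real affiliation always starts with an uppercase A-Z letter
df$Affiliation <- sub("^[^A-Z]*", "", df$Affiliation)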

Related

Extracting university names from affiliation in Pubmed data with R

I've been using the extremely useful rentrez package in R to get information about author, article ID and author affiliation from the Pubmed database. This works fine, but now I would like to extract information from the affiliation field. Unfortunately the affiliation field is a largely unstructured, non-standardized string with various types of information, such as the name of the university, the name of the department, the address and more, delimited by commas. Therefore a text-mining approach is necessary to get any useful information from this field.
I tried the package easyPubmed in combination with rentrez, and even though the easyPubmed package can extract some information from the affiliation field (e.g. the email address, which is very useful), to my knowledge it cannot extract the university name. I also tried the package pubmed.mineR, but unfortunately this also does not provide university name extraction. I started to experiment with grep and regex functions, but as I am no R expert I could not make this work.
I was able to find very similar threads solving the issue with Python:
Regex for extracting names of colleges, universities, and institutes?
How to extract university/school/college name from string in python using regular expression?
But unfortunately I do not know how to convert the Python regex to an R regex, as I am not familiar with Python.
Here is some example data:
PMID = c(121,122,123,124,125)
author=c("author1","author2","author3","author4","author5")
Affiliation = c("blabla,University Ghent,blablabla", "University Washington, blabla, blablabla, blablabalbalba","blabla,University of Florence,blabla", "University Chicago, Harvard University", "Oxford University")
df = as.data.frame(cbind(PMID,author,Affiliation))
df
PMID author Affiliation
1 121 author1 blabla,University Ghent,blablabla
2 122 author2 University Washington, blabla, blablabla, blablabalbalba
3 123 author3 blabla,University of Florence,blabla
4 124 author4 University Chicago, Harvard University
5 125 author5 Oxford University
What I would like to get:
PMID author Affiliation University
1 121 author1 blabla,University Ghent,blablabla University Ghent
2 122 author2 University Washington,ba, bla, bla University Washington
3 123 author3 blabla,University Florence,blabla University of Florence
4 124 author4 University Chicago, Harvard Univ University Chicago, Harvard University
5 125 author5 Oxford University Oxford University
My apologies if there is already a solution online, but I honestly googled a lot and did not find any clear solution for R. I would be very thankful for any hints and solutions to this task.
In general, regular expressions can be ported to R with some changes. For example, using the pattern from the links you included, you can create a new variable with the extracted text, changing only the escape character ("\\" instead of "\"). So, using the dplyr and stringr packages:
library(dplyr)
library(stringr)
df <- df %>%
  mutate(Organization = str_extract(Affiliation,
    "([A-Z][^\\s,.]+[.]?\\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\\d]*(?=,|\\d)"))

Creating Tidy Text

I am using R for text analysis. I used the 'readtext' function to pull in text from a PDF. However, as you can imagine, it is pretty messy. I used 'gsub' to replace text for different purposes. The general goal is to use one type of delimiter, '%%%%%', to split records into rows, and another delimiter, '#', to split them into columns. I accomplished the first but am at a loss as to how to accomplish the latter. A sample of the data found in the dataframe is as follows:
895 "The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […"
896 "Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"
I want to take this data and split the #Published, #Authors, #Journal, #URL into columns -- c("Published", "Authors", "Journal", "URL").
Any suggestions?
Thanks in advance!
This seems to work OK:
dfr <- data.frame(TEXT=c("The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […",
"Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"),
stringsAsFactors = FALSE)
library(magrittr)
do.call(rbind, strsplit(dfr$TEXT, "#Published::|#Authors:|#Country:|#Journal:")) %>%
  as.data.frame %>%
  setNames(nm = c("Preamble", "Published", "Authors", "Country", "Journal"))
Basically, this splits the text on one of the four field markers (note the double :: after Published!), row-binds the result, converts it to a data frame, and gives it some names.
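A tidyr alternative, sketched under the same assumption that every record contains all four markers in this order, is to separate() on the same delimiters:
library(tidyr)
separate(dfr, TEXT,
         into = c("Preamble", "Published", "Authors", "Country", "Journal"),
         sep = "#Published::|#Authors:|#Country:|#Journal:")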

Using regex on a vector list in R

I have a vector with character strings in it. The vector has over 6000 rivers in it but we can use the following as an example:
Names <- c("Baker R", "Colorado R", "Missouri R")
I am then matching these river names to a list that contains their full names. As an example, the other list contains names such as:
station_nm <- c("North Creek River", "Baker River at Wentworth", "Lostine River at Baker Road", "Colorado River at North Street", "Missouri River")
In order to find the full names of the stations for the river names in "Names" I have:
station_nm <- grep(paste(Names, collapse = "|"), ALLsites$station_nm, ignore.case = TRUE, perl = TRUE, value = TRUE)
Continuing with the example, this returns: Baker River at Wentworth, Lostine River at Baker Road, Colorado River at North Street, Missouri River. It does not return North Creek River, as this is not listed in the "Names" vector. This is what I want.
However, I want to restrict the rivers that it returns to only Baker River at Wentworth, Colorado River at North Street, Missouri River. I don't want to include names for which there is something before it, i.e. Lostine River at Baker Road.
I believe this should involve some sort of negative lookbehind, but I don't know how to write this with the vector "Names".
Thank you for any help!
You just have to prepend a ^ (meaning "has to start with") to the values in Names:
grep(paste0("^", Names, "iver", collapse = "|"), station_nm,
ignore.case = TRUE, value = TRUE)
# [1] "Baker River at Wentworth" "Colorado River at North Street" "Missouri River"

Insert special character in R based on existing words

I was looking for an intuitive solution to a problem of mine.
I have a huge list of words, in which I have to insert a special character based on some criteria.
So if a two- or three-letter word appears in a cell, I want to add "+" to the right and left of it.
Example
global b2b banking would transform to global +b2b+ banking
how to finance commercial ale estate would transform to how +to+ finance commercial +ale+ estate
Here is sample data set:
sample <- c("commercial funding",
"global b2b banking"
"how to finance commercial ale estate"
"opening a commercial account",
"international currency account",
"miami imports banking",
"hsbc supply chain financing",
"international business expansion",
"grow business in Us banking",
"commercial trade Asia Pacific",
"business line of credits hsbc",
"Britain commercial banking",
"fx settlement hsbc",
"W Hotels")
data <- data.frame(sample)
Additionally, is it possible to drop a row which contains a word of length 1?
Example:
W Hotels
For all the one-letter words, I tried removing them with gsub:
gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample)
Such rows should be removed from the data set.
Any help is highly appreciated.
Edit 1
Thanks for the help, I added a few more lines to it:
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data
sample
1 commercial++funding
2 global+++b2b+++banking
3 how++++to+++finance++commercial+++ale+++estate
4 international++currency++account
5 miami++imports++banking
6 hsbc++supply++chain++financing
7 international++business++expansion
8 grow++business+++in++++us+++banking
9 commercial++trade++asia++pacific
10 business++line+++of+++credits++hsbc
11 britain++commercial++banking
12 fx+++settlement++hsbc
Somehow I am unable to replace "+," with "+" using gsub. What am I doing wrong?
So "fx+,settlement,hsbc" should become "fx+settlement,hsbc", but instead the , is being replaced with additional ++.
You need to do that in 2 steps: remove the items with 1-letter whole words, and then add + around 2-3 letter words.
Use
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)]
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
data <- data.frame(sample)
data
The sample[!grepl("\\b[[:alnum:]]\\b", sample)] line removes the items that contain a word boundary (\b), a single alphanumeric character ([[:alnum:]]), and another word boundary.
The gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) line replaces all 2-3-character whole words with the same words enclosed in +.
Result:
sample
1 commercial funding
2 global +b2b+ banking
3 +how+ +to+ finance commercial +ale+ estate
4 international currency account
5 miami imports banking
6 hsbc supply chain financing
7 international business expansion
8 grow business +in+ +Us+ banking
9 commercial trade Asia Pacific
10 business line +of+ credits hsbc
11 Britain commercial banking
12 +fx+ settlement hsbc
Note that W Hotels and opening a commercial account got filtered out.
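If you want to apply the same two steps to the data frame rather than the raw vector, a dplyr/stringr sketch could look like this (the as.character() call is only there in case the column was created as a factor):
library(dplyr)
library(stringr)
data %>%
  mutate(sample = as.character(sample)) %>%                    # in case the column is a factor
  filter(!str_detect(sample, "\\b[[:alnum:]]\\b")) %>%         # drop rows containing 1-letter words
  mutate(sample = str_replace_all(sample, "\\b([[:alnum:]]{2,3})\\b", "+\\1+"))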
Answer to the EDIT
You added some more replacement operations to the code, but you are making literal string replacements, so you just need to pass the fixed=TRUE argument:
sample <- gsub(" ",",",sample, fixed=TRUE)
sample <- gsub("+,","+",sample, fixed=TRUE)
sample <- gsub(",+","+",sample, fixed=TRUE)
Otherwise, the + is treated as a regex quantifier and must be escaped to be matched as a literal plus symbol.
Also, if you need to remove all + from the start of the string, use
sample <- sub("^\\++", "", sample)

Categorizing a data frame column in R using grep on a list of a list

I have a column of a data frame that I want to categorize.
> df$orgName
[1] "Hank Rubber" "United Steel of Chicago"
[3] "Muddy Lakes Solar" "West cable"
I want to categorize the column using the categories list below that contains a list of subcategories.
metallurgy <- c('steel', 'iron', 'mining', 'aluminum', 'metal', 'copper' ,'geolog')
energy <- c('petroleum', 'coal', 'oil', 'power', 'petrol', 'solar', 'nuclear')
plastics <- c('plastic', 'rubber')
wiring <- c('wire', 'cable')
categories = list(metallurgy, energy, plastics, wiring)
So far I've been able to use a series of nested ifelse statements to categorize the column as shown below, but the number of categories and subcategories keeps increasing.
df$commSector <-
  ifelse(grepl(paste(metallurgy, collapse = "|"), df$orgName, ignore.case = TRUE), 'metallurgy',
  ifelse(grepl(paste(energy, collapse = "|"), df$orgName, ignore.case = TRUE), 'energy',
  ifelse(grepl(paste(plastics, collapse = "|"), df$orgName, ignore.case = TRUE), 'plastics',
  ifelse(grepl(paste(wiring, collapse = "|"), df$orgName, ignore.case = TRUE), 'wiring', ''))))
I've thought about using a set of nested lapply statements, but I'm not too sure how to execute it.
Lastly, does anyone know of any R libraries that have functions to do this?
Thanks a lot for everyone's time.
Cheers.
One option would be to get the vectors as a named list using mget, then paste the elements together (as shown by the OP), use grep to find the elements of 'orgName' that match (or use value = TRUE to extract them directly), and stack the result to create a data.frame.
res <- setNames(stack(lapply(mget(c("metallurgy", "energy", "plastics", "wiring")),
                             function(x) df$orgName[grep(paste(x, collapse = "|"),
                                                         tolower(df$orgName))])),
                c("orgName", "commSector"))
res
# orgName commSector
#1 United Steel of Chicago metallurgy
#2 Muddy Lakes Solar energy
#3 Hank Rubber plastics
#4 West cable wiring
If we have other columns in 'df', do a merge
merge(df, res, by = "orgName")
# orgName commSector
#1 Hank Rubber plastics
#2 Muddy Lakes Solar energy
#3 United Steel of Chicago metallurgy
#4 West cable wiring
data
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
"Muddy Lakes Solar", "West cable"), stringsAsFactors=FALSE)
