How do I extract certain items from unstructured text? - r

I have an extremely unstructured data frame (df) in R, which includes a text column.
An example of the df$text looks like this
John Smith 3.8 GPA johnsmith#gmail.com, https://link.com
I am trying to extract the GPA out of the field and save to a new column called df$GPA but am unable to get it to work.
I have tried:
df$gpa <- sub('[0-9].[0-9] GPA',"\\1", df$text)
But that returns the whole block of text.
I am also trying to extract the URL but am unsure how to do that as well. Does anybody have any suggestions?

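Here's a solution using a positive lookahead, (?=\\sGPA), and str_extract from the package stringr:
library(stringr)
# the whitespace sits inside the lookahead so it is not part of the extracted value
df$GPA <- str_extract(df$text, "\\d+\\.\\d+(?=\\sGPA)")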
A sub solution with a backreference would be this:
df$GPA <- sub(".*(\\d+\\.\\d+).*", "\\1", df$text)
Result:
df
text GPA
1 John Smith 3.8 GPA johnsmith#gmail.com, https://link.com 3.8
Data:
df <- data.frame(text = "John Smith 3.8 GPA johnsmith#gmail.com, https://link.com")

We can use a regex lookaround to extract the numeric part
library(stringr)
df$GPA <- str_extract(df$text, "[0-9.]+(?=\\s*GPA)")
df$GPA
#[1] "3.8"
Or in base R with regmatches/regexpr
regmatches(df$text, regexpr("[0-9.]+(?=\\s*GPA)", df$text, perl = TRUE))
data
df <- data.frame(text = 'John Smith 3.8 GPA johnsmith#gmail.com, https://link.com', stringsAsFactors = FALSE)
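The question also asks about the URL, which none of the answers above address. The following is a minimal base R sketch, not a definitive implementation. It assumes that a URL starts with http:// or https:// and runs up to the next whitespace or comma; the helper name extract_first and the second sample row are made up for illustration. Unlike a bare regmatches() call, the helper keeps one result per row, so it can be assigned straight to a column even when some rows have no match.

```r
# Hypothetical helper: first match of `pattern` in each element of `x`,
# NA where there is no match (regmatches() alone would drop those rows).
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1L] <- regmatches(x, m)
  out
}

# The second row is invented to show the no-match case.
df <- data.frame(
  text = c("John Smith 3.8 GPA johnsmith#gmail.com, https://link.com",
           "Jane Doe, no details given"),
  stringsAsFactors = FALSE
)

# GPA: digits, a dot, digits, followed by optional whitespace and "GPA"
df$GPA <- as.numeric(extract_first(df$text, "\\d+\\.\\d+(?=\\s*GPA)"))
# URL: assumed to end at the next whitespace or comma
df$url <- extract_first(df$text, "https?://[^\\s,]+")
```

Rows without a GPA or URL come back as NA rather than shifting the remaining values up.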

Related

Remove non-english string from Rows: R

I have a couple of variables whose data (rows) contain an English string followed by a non-English translation (Hindi).
E.g. Carpenter (Hindi word for carpenter)
Is there a way to strip the rows so that they contain only the English part? The Hindi text causes problems when applying functions, so I want it removed.
Here is another option using base R's iconv(), which drops every non-ASCII character (the empty parentheses remain):
s <- 'Carpenter (बढ़ई)'
iconv(s, "latin1", "ASCII", sub="")
# [1] "Carpenter ()"
Applying to a data frame:
df <- data.frame(rbind('Carpenter (बढ़ई)',
'Cat (बिल्ली)'))
sapply(df,iconv, from="latin1", to="ASCII",sub="")
# [1,] "Carpenter ()"
# [2,] "Cat ()"
I managed to extract the English part of the text by using regular expressions (regex) and the stringr package. Below is an example data frame and the resulting output.
library(tidyverse)
library(stringr)
df <- tibble(
complete_wrd = c(
"carpenter (Hindi word for carpenter)",
"cat (Hindi word for cat)",
"dog (Hindi word for dog)"))
df %>%
mutate(engl_wrd = stringr::str_extract(complete_wrd, "^.*?\\S*"))
# A tibble: 3 x 2
complete_wrd engl_wrd
<chr> <chr>
1 carpenter (Hindi word for carpenter) carpenter
2 cat (Hindi word for cat) cat
3 dog (Hindi word for dog) dog
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(hindi_text = str_remove(hindi_text, "\\(.*\\)"))
# hindi_text
# 1 Construction Labourer
# 2 Other
Data
df <- data.frame(hindi_text = c("Construction Labourer(सभी प्रकार के निर्माण मजदूर)", "Other(उपरोक्त के अतिरिक्त) "))
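The same idea can be sketched in base R without extra packages. This assumes, as in the examples above, that the translation always sits in parentheses after the English text; the vector name x is only for illustration.

```r
# Sample strings taken from the answers above
x <- c("Carpenter (बढ़ई)", "Cat (बिल्ली)", "Other(उपरोक्त के अतिरिक्त) ")

# Drop the parenthesised part (and any whitespace before it),
# then trim whatever whitespace is left at the ends.
english <- trimws(sub("\\s*\\(.*\\)", "", x))
english
```

Unlike the iconv() approach, this also removes the parentheses themselves, but it does nothing for non-English text that is not wrapped in parentheses.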

Str_split is returning only half of the string

I have a tibble and the vectors within the tibble are character strings with a mix of English and Mandarin characters. I want to split the tibble into two, with one column returning the English, the other column returning the Mandarin. However, I had to resort to the following code in order to accomplish this:
tb <- tibble(x = c("I我", "love愛", "you你")) #create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = T) #split string when R reads a character that is not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = T) #split string after R reads all the a-z characters
tb <- tb %>%
mutate(EN = en[,1],
CH = ch[,2]) %>%
select(-x)#subset the matrices created above, because the matrices create a column of blank/"" values and also remove x column
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
Or we can use
library(stringr)
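# stringr's [[:alpha:]] matches any Unicode letter, Chinese characters included,
# so an explicit ASCII range is used to pick out the English part
tb$en <- str_extract(tb$x,"[A-Za-z]+")
tb$ch <- str_extract(tb$x,"[^A-Za-z]+")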
We can use str_match and get data for English and rest of the characters separately.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
A simple solution using str_extract from package stringr:
library(stringr)
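tb$en <- str_extract(tb$x,"[A-Za-z]+")
tb$ch <- str_extract(tb$x,"[^A-Za-z]")
In case there's more than one Chinese character, just add + to [^A-Za-z]. The range [A-z] is best avoided here, because it also matches the punctuation characters that sit between Z and a in ASCII, such as [ and _.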
Alternatively, use gsub and a backreference:
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution matches printable ASCII characters with [ -~]+ and everything else with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
x en ch
<chr> <chr> <chr>
1 I我 I 我
2 love愛 love 愛
3 you你 you 你
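The question asks for a single call that returns both columns. One way to sketch that in base R is strcapture() from the utils package, which turns capture groups into data frame columns. This assumes every string starts with ASCII letters and that everything after them belongs in the second column; the column names en and ch follow the answers above.

```r
x <- c("I我", "love愛", "you你")

# Group 1: leading ASCII letters; group 2: the rest of the string.
# `proto` defines the names and types of the resulting columns.
parts <- strcapture(
  "^([A-Za-z]+)(.*)$", x,
  proto = data.frame(en = character(), ch = character(),
                     stringsAsFactors = FALSE)
)
parts
```

Strings that do not match the pattern at all produce a row of NA values rather than an error.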

splitting surnames from fullnames

I've used this:
String <- unlist(str_split(Invname,"[ ]",n=2))
To split the names that I have into Surnames and First Names, since the surnames come first. But I cannot figure out how to reassign the split Invname into two lists, so that I can use only the surnames for the rest of my project. Right now I have this:
" [471] "KRUEGER" "MARCUS" "
And I would like to have the left side only assigned to a new variable, so that I can work further with mining the surnames for information.
Using the data in nate.edwinton's answer, there is no need to unlist.
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
String <- stringr::str_split(Invnames, "[ ]", n = 2)
Surnames <- sapply(String, '[', 1)
Firstnames <- sapply(String, '[', 2)
data.frame(Surnames, Firstnames)
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
As mentioned in the comments, it would be easier to help if you provided some data. Anyway, here is a possible solution:
Assuming that Invnames is a vector in which every entry has exactly one last name followed by exactly one first name, you could do the following
# data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
# extraction
String <- unlist(stringr::str_split(Invnames,"[ ]",n=2))
# saving first and last names
lastNames <- String[seq(1,length(String),2)]
firstNames <- String[seq(2,length(String),2)]
# yields
> cbind(lastNames,firstNames)
lastNames firstNames
[1,] "Krueger" "Markus"
[2,] "Doe" "John"
[3,] "Tatum" "Jayson"
Here is some sample data and a suggested solution. Data modified from @Rui Barradas' answer:
Invnames <- c("Krueger.$Markus","Doe.John","Tatum.Jayson")
# split on runs of non-word characters and keep the first piece (the surname)
sapply(strsplit(Invnames, "\\W+"), "[", 1)
Again using data from an earlier answer, with tidyr's separate() this time (loaded here via tidyverse)
library(tidyverse)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
Invnames <- data.frame(Invnames)
Invnames %>%
separate(Invnames, c('Surname', 'FirstName'), sep=" ")
Surname FirstName
1 Krueger Markus
2 Doe John
3 Tatum Jayson
With base R, we can make use of read.table/read.csv to separate the string into columns
read.table(text = Invnames, header = FALSE, col.names = c("Surnames", "Firstnames"))
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
If only names were so straightforward! If the strings had few complications, then yes, the other answers are good options. In my experience, name lists contain hyphenated names (in both "first" and "last" names), middle names, titles and shortened forms (Dr., Mr, Md), and many other variants. I first try to clean the strings before any splitting.
Here is just one idea using dplyr (explicit code provided for clarity)
library(dplyr)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson", "Taylor - Cline Jeff", "Davis - Freud Melvin- John")
df <- data.frame(Invnames = Invnames, stringsAsFactors = FALSE) %>%
mutate(Invnames2 = gsub("- ","-",Invnames)) %>%
mutate(Invnames2 = gsub(" -","-",Invnames2)) %>%
mutate(surname = gsub(" .*", "", Invnames2))
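The cleaning idea above also works in base R. The sketch below assumes that, once the spaces around hyphens are removed, the surname is everything before the first space and the remainder is the first name; the variable names cleaned, surname and firstname are illustrative only. Titles and middle names would need further rules.

```r
Invnames <- c("Krueger Markus", "Taylor - Cline Jeff",
              "Davis - Freud Melvin- John")

# Collapse "A - B", "A- B" and "A -B" into "A-B"
cleaned <- gsub("\\s*-\\s*", "-", Invnames)

# Surname: everything before the first space
surname <- sub(" .*", "", cleaned)
# First name(s): everything after the first run of non-space characters
firstname <- sub("^\\S+\\s+", "", cleaned)

data.frame(surname, firstname)
```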

Extracting specific strings patterns from one column

I would like to extract specific strings with the pattern gene=something from one column in R.
An example of input:
df <- 'V1
ID=gene92;DbX;gene=BH1;genePro
ID=gene91;DbY;gene=BH2;genePro;inf2
ID=gene90;DbY;gene=BH3;genePro;inf2'
df <- read.table(text=df, header=T)
The example of the expected output:
dfout <- 'V1
gene=BH1
gene=BH2
gene=BH3'
dfout <- read.table(text=dfout, header=T)
Any ideas on how to accomplish that?
library(stringr)
str_extract(df$V1, 'gene=BH[0-9]+')
#[1] "gene=BH1" "gene=BH2" "gene=BH3"
You may also use
gsub(".*(gene=.*?)(;|$).*", "\\1", df$V1)
# [1] "gene=BH1" "gene=BH2" "gene=BH3"
so that we match only the part gene=... that follows anything, .*, and is followed by ; or the end of the string, ;|$.
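If the gene names do not all start with BH, a slightly more general base R sketch is to take everything from gene= up to the next semicolon. It assumes each row contains the literal text gene= at most once before the value of interest; the vector v1 stands in for df$V1.

```r
v1 <- c("ID=gene92;DbX;gene=BH1;genePro",
        "ID=gene91;DbY;gene=BH2;genePro;inf2",
        "ID=gene90;DbY;gene=BH3;genePro;inf2")

# "gene=" followed by one or more characters that are not ";"
genes <- regmatches(v1, regexpr("gene=[^;]+", v1))
genes
```

Note that regmatches() silently drops rows without a match, so the result can be shorter than the input if some rows lack a gene= field.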

Splitting unseparated string and numerical variables in R

I have converted a PDF to a text file, and I have a data set which is constructed as follows:
data=c("Paris21London3Tokyo51San Francisco38")
And I would like to obtain the following structure:
matrix(c("Paris","London","Tokyo","San Francisco",21,3,51,38),4,2)
Does anyone have a method to do this? Thanks
You could try strsplit with regex lookahead and lookbehind
v1 <- strsplit(data, '(?<=[^0-9])(?=[0-9])|(?<=[0-9])(?=[^0-9])',
perl=TRUE)[[1]]
indx <- c(TRUE, FALSE)
data.frame(Col1= v1[indx], Col2=v1[!indx])
Update
Including decimal numbers as well
data1=c("Paris21.53London3Tokyo51San Francisco38.2")
v2 <- strsplit(data1, '(?<=[^0-9.])(?=[0-9])|(?<=[0-9])(?=[^0-9.])',
perl=TRUE)[[1]]
indx <- c(TRUE, FALSE)
data.frame(Col1= v2[indx], Col2=v2[!indx])
# Col1 Col2
#1 Paris 21.53
#2 London 3
#3 Tokyo 51
#4 San Francisco 38.2
Regular expressions are the right tool here, but contrary to what the other answer shows, strsplit is not well suited for the job.
It is better to use regular expression matches, with two separate expressions for words and numbers:
words = '[a-zA-Z ]+'
numbers = '[+-]?\\d+(\\.\\d+)?'
word_matches = gregexpr(words, data)
number_matches = gregexpr(numbers, data)
result = cbind(regmatches(data, word_matches)[[1]],
regmatches(data, number_matches)[[1]])
This recognises any number with an optional decimal point, and an optional sign. It does not recognise numbers in scientific (exponential) notation. This can be trivially added, if necessary.
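As a rough sketch of the scientific notation extension, the number pattern can be given an optional exponent. One caveat: the letter e in an exponent would also match the word pattern above, so this variant takes the words as whatever is left between the numbers (regmatches with invert = TRUE) instead of matching them separately. The sample string is made up, and the sketch assumes that words and numbers strictly alternate, starting with a word.

```r
# Made-up input with one value in exponential notation
data <- "Paris2.1e3London3Tokyo51San Francisco38"

# Optional sign, digits, optional decimals, optional exponent
numbers <- "[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?"
m <- gregexpr(numbers, data, perl = TRUE)

values <- as.numeric(regmatches(data, m)[[1]])

# invert = TRUE returns the pieces between the matches;
# drop the empty piece after the last number
words <- regmatches(data, m, invert = TRUE)[[1]]
words <- words[nzchar(words)]

result <- data.frame(city = words, value = values, stringsAsFactors = FALSE)
result
```

A city name that ends in a digit followed by e and more digits would be misread as part of a number, so real data may need a stricter pattern.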
