Extract words from text and create a vector from them - r

Suppose, I have a txt file that contains such text:
Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
apple,
passion fruit,
mango
Documents: NDA
Export: 2.10
I read this file with readLines function.
Then, I want to get a vector that looks like this:
x <- c(fruits, apple, passion fruit, mango)
So, I want to extract the word after "Type:" and all words between "Products:" and "Documents:".
How can I do this? Thanks!

If it's not subject to change, it looks close to yaml format e.g. using package of the same name
library(yaml)
info <- yaml::read_yaml("your file.txt")
# strsplit - split either side of the commas
# unlist - convert to vector
# trimws - remove trailing and leading white space
out <- trimws(unlist(strsplit(info$Products, ",")))
You will get the other entries as list elements in info of the required name e.g. info$Type

Maybe there is a more elegant solution, in case you can try this, if you got a vector like this:
vec <- readLines("path\\file.txt")
And in the file there is the text you posted, you can try this:
# replace biggest spaces
gsub(" "," ",
# replace the first space
sub(" ",", ",
# pattern to extract words
gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
# collapse in one vector
paste0(vec, collapse = " "))))
[1] "fruits, apple, passion fruit, mango"
If you dput(vec) to make code reproducible:
c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
"Products:", " apple,", " passion fruit,", " mango", "Documents: NDA",
"Export: 2.10")

Related

How do I collapse on a specific pattern in a text?

I have some strings of text (example below). As you can see each string was split at a period or question mark.
[1]"I am a Mr."
[2]"asking for help."
[3]"Can you help?"
[4]"Thank you ms."
[5]"or mr."
I want to collapse where the string ends with an abbreviation like mr., mrs. so the end result would be the desired output below.
[1]"I am a Mr. asking for help."
[2]"Can you help?"
[3]"Thank you ms. or mr."
I already created a vector (called abbr) containing all my abbreviations in the following format:
> abbr
[1] "Mr|Mrs|Ms|Dr|Ave|Blvd|Rd|Mt|Capt|Maj"
but I can't figure out how to use it in paste function to collapse. I have also tried using gsub (didn't work) to replace \n following abbreviation with a period with a space like this:
lines<-gsub('(?<=abbr\\.\\n)(?=[A-Z])', ' ', lines, perl=FALSE)
We can use tapply to collapse string and grepl to create groups to collapse.
x <- c("I am a Mr.", "asking for help.","Can you help?","Thank you ms.", "or Mr.")
#Include all the abbreviations with proper cases
#Note that "." has a special meaning in regex so you need to escape it.
abbr <- 'Mr\\.|Mrs\\.|Ms\\.|Dr\\.|mr\\.|ms\\.'
unname(tapply(x, c(0, head(cumsum(!grepl(abbr, x)), -1)), paste, collapse = " "))
#[1] "I am a Mr. asking for help." "Can you help?" "Thank you ms. or mr."

extract a text string after spaces and a symbol with gsub R

I need to extract the first word (German) from the following text string
substr(details[1],0,50)%>%
+ gsub("[^a-z/A-Z/,/ ]","" ,.)%>%
+ gsub("A-Z.*" , "", .)
[1] " , German, European, Central European"
For many combinations I try with gsub I can't extract it
Thank you very much
Assuming your string is s <- " , German, European, Central European", maybe you can use the following code to get the word German:
w <- gsub("\\s+,\\s+([[:alpha:]]+),.*","\\1",s)
or
w <- trimws(unlist(strsplit(s,split = ","))[2])

How to extract text using delimiters when some delimiters missing

I am trying to extract text according to the headers in a semi-structured text document.
Input
Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
The output here is
Order Subject Name Grade Report Conclusion
1223442 History Bilbo Johnson Bad Need to complete Dud
I can achieve this with the following (messy but it works) function:
dataframeIn<-data.frame(Column,stringsAsFactors=FALSE)
delim<-c("Order","Subject","Name","Grade","Report","Conclusion")
Extractor <- function(dataframeIn, Column, delim) {
dataframeInForLater<-dataframeIn
ColumnForLater<-Column
Column <- rlang::sym(Column)
dataframeIn <- data.frame(dataframeIn)
dataframeIn<-dataframeIn %>%
tidyr::separate(!!Column, into = c("added_name",delim),
sep = paste(delim, collapse = "|"),
extra = "drop", fill = "right")
names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)
dataframeIn<-data.frame(dataframeIn)
#Add the original column back in so have the original reference
dataframeIn<-cbind(dataframeInForLater[,ColumnForLater],dataframeIn)
dataframeIn<-data.frame(dataframeIn)
return(dataframeIn)
}
Extractor(dataframeIn, "Column", delim)
However, sometimes the delimiters are missing eg
Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud
In which case the desired output is
Order Subject Name Grade Conclusion
1223442 History Bilbo Johnson Bad Dud
but the actual output becomes:
Order Subject Name Grade Report Conclusion
:1223442 :History Bilbo Johnson : Bad : Dud <NA>
How can I account for missing delimiters although they are in the same order (including delimiters that are missing in the middle of the text as well as the end as in the example above) ?
We may do the following (it's only text extraction, I leave constructing the output for you):
library(stringr)
Extractor <- function(x, delim) {
pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
trimws(str_match(x, pattern)[, 2])
}
Extractor(Column1, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" NA "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA "History" "Bilbo Johnson" NA NA NA
Since we have NA's it's clear what delimiters were missing and what weren't.
The way it works in your case is that we have a series of patterns
pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
Then str_match nice extracts the (.*?) part to the second output columns and we get rid of any spaces with trimws. Ah and we use lazy matching in (.*?) as not to match too much.

Regex to pull sub-path

I have a dataframe df with a place field containing strings that looks like so:
countryName0 / provinceName0 / countyName0 / cityName0
countryName1 / provinceName1
Using this code I can pull out the finest resolution place identifier:
df$shortplace <- trimws(basename(df$place))
or:
df$shortplace <- gsub(".*/ ", "", df$place)
e.g.
cityName0
provinceName1
I can then use ggmap library to extract geocodes for cityName0 and provinceName1:
df$geo <- geocode(df$shortplace)
Result looks like this:
geo.lat geo.long
-33.789 147.909
-29.333 133.819
Unfortunately, some city names are not unique e.g. Perth is the capital of Western Australia, a town in Tasmania, and a city in Scotland. What I need to do is extract not the place identifier after the last "/" but the second last "/" (and replace the "/" with a " " to provide more information for the geocode() function. How do I scan to second last "/" and extract highest and second highest order place names? E.g.
shortplace
countyName0 cityName0
countryName1 provinceName1
There are other ways, but strsplit() seems the most straightforward to me here. Give this a try:
x = "countryName0 / provinceName0 / countyName0 / cityName0"
x_split = strsplit(x, " / ")[[1]] # Somewhat confusingly, result of strsplit() is a list; [[1]] pulls out the one and only entry here
n_terms = length(x_split)
result = paste(x_split[n_terms - 1], x_split[n_terms], sep = ", ")
result
# [1] "countyName0, cityName0"
One option is sub to match the alpha numeric characters followed by one or more spaces, / followed by space (\\s+), then another set of alpha numeric characters until the end of the string ($), capture as a group and replace with the backreferences (\\1 \\2) of the capture groups
df$shortplace <- sub(".*\\b([[:alnum:]]+)\\s+\\/\\s+([[:alnum:]]+)$", "\\1 \\2", df$place)
df$shortplace
#[1] "countyName0 cityName0" "countryName1 provinceName1"
This worked for me in the end:
df$shortplace <- gsub("((?:/[^/\r\n]*){2})$", "\1", df$place)
df$shortplace <- gsub("\\ / ", ", ", df$place)
Not super elegant but it does the job.

Extract last word in string before the first comma

I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to sort to first name, last names and titles. With this data, titles always start after the first comma, if there is a title.
I am trying to sort the list into:
FirstName LastName Titles
Mark Owens M.D.,M.P.H
Lara Kraft -
Dale Good C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
You can see that with this code, Mr. Owens is not behaving. His title starts after the last comma, and the last name begins from P. You can tell that I referred to Extract last word in string in R, Extract 2nd to last word in string and Extract last word in a string after comma if there are multiple words else the first word
You were off to a good start so you should pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list. Inside of the sub function is another that eliminates everything after the first comma. The last name will then be the final word in the string. For titles there is a two-step process of first eliminating everything before the first comma, then replacing non-matched strings with a hyphen -.
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,", "", namelist)
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )
firstnames lastnames titles
1 Mark Owens M.D., M.P.H.
2 Dale Good C.P.A
3 Lara Kraft -
4 Roland Bass III
This should do the trick, at least on test data:
x=strsplit(namelist,split = ",")
x=rapply(object = x,function(x) gsub(pattern = "^ ",replacement = "",x = x),how="replace")
names=sapply(x,function(y) y[[1]])
titles=sapply(x,function(y) if(length(unlist(y))>1){
paste(na.omit(unlist(y)[2:length(unlist(y))]),collapse = ",")
}else{""})
names=strsplit(names,split=" ")
firstnames=sapply(names,function(y) y[[1]])
lastnames=sapply(names,function(y) y[[3]])
names <- data.frame(firstnames, lastnames, titles )
names
In cases like this, when the structure of strings is always the same, it is easier to use functions like strsplit() to extract desired parts

Resources