Parse a character string and later re-assemble it [r]

I am trying to parse a character string into its parts, check whether each part exists in a separate vocabulary, and later re-assemble only those strings whose parts are all in the vocabulary. The vocabulary is a vector of words and is created separately from the strings I want to compare. The final goal is to create a data frame with only those strings whose word parts are all in the vocabulary.
I have written a piece of code to parse out the data into strings, but cannot figure out how to make the comparison. If you believe that parsing out the data is not the optimal solution, please let me know.
Here is an example:
Assume that I have three character strings:
"The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue"
and my vocabulary consists of the words:
cat, **the**, **elephant**, hippo,
**in**, run, **is**, bike,
walk, **room, is, blue, cannot**
In this case I will pick only the first and third strings, because every one of their word parts matches a word in my vocabulary. I will not select the second string, because the words "dog" and "swim" are not in the vocabulary.
Thank you!
Per request, attached is the code I have written so far to clean the strings, and parse them into unique words:
animals <- c("The elephant in the room is blue", "The dog cannot swim", "The cat is blue")
animals2 <- toupper(animals)
animals2 <- gsub("[[:punct:]]", " ", animals2)
animals2 <- gsub(" +", " ", animals2)  ## collapse runs of spaces
animals2 <- trimws(animals2)           ## drop leading/trailing spaces
## Parse the characters and select unique words only
animals2 <- unlist(strsplit(animals2," "))
animals2 <- unique(animals2)

Here is how I would do it:
1. Read the data.
2. Clean vocab to remove the extra spaces and *.
3. Loop over the strings, using setdiff.
My code is:
## read your data
tt <- c("The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue")
vocab <- scan(textConnection('cat, **the**, **elephant**, hippo,
**in**, run, **is**, bike,
walk, **room, is, blue, cannot**'),sep=',',what='char')
## polish vocab
vocab <- gsub('\\s+|[*]+','',vocab)
vocab <- vocab[nchar(vocab) >0]
##
sapply(tt, function(x){
  x.words <- tolower(unlist(strsplit(x, ' ')))  ## lower-case so "the" == "The"
  length(setdiff(x.words, vocab)) == 0
})
The elephant in the room is blue The dog cannot swim The cat is blue
TRUE FALSE TRUE
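Equivalently, a base-R sketch of the same idea using all() with %in% instead of setdiff(), with the same tt and vocab as above; subsetting by the logical result then gives the data frame the question asks for:

```r
tt <- c("The elephant in the room is blue",
        "The dog cannot swim",
        "The cat is blue")
vocab <- c("cat", "the", "elephant", "hippo", "in", "run", "is",
           "bike", "walk", "room", "blue", "cannot")
## TRUE when every word of the string is found in vocab
keep <- sapply(tt, function(x) {
  x.words <- tolower(unlist(strsplit(x, " ")))
  all(x.words %in% vocab)
})
tt[keep]
## [1] "The elephant in the room is blue" "The cat is blue"
```

data.frame(text = tt[keep]) then produces the final data frame of matching strings.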

Related

Extract words from text and create a vector from them

Suppose, I have a txt file that contains such text:
Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
apple,
passion fruit,
mango
Documents: NDA
Export: 2.10
I read this file with readLines function.
Then, I want to get a vector that looks like this:
x <- c("fruits", "apple", "passion fruit", "mango")
So, I want to extract the word after "Type:" and all words between "Products:" and "Documents:".
How can I do this? Thanks!
If the format is not subject to change, it looks close to YAML, so you can use the package of the same name:
library(yaml)
info <- yaml::read_yaml("your file.txt")
# strsplit - split either side of the commas
# unlist - convert to vector
# trimws - remove trailing and leading white space
out <- trimws(unlist(strsplit(info$Products, ",")))
You will get the other entries as list elements in info of the required name e.g. info$Type
Maybe there is a more elegant solution, but you can try this. Suppose you read the file into a vector:
vec <- readLines("path\\file.txt")
With the text you posted in the file, you can then try:
# collapse runs of spaces
gsub(" +", " ",
     # insert a comma after the first word ("fruits")
     sub(" ", ", ",
         # pattern to extract the words between the field labels
         gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
              # collapse the vector into one string
              paste0(vec, collapse = " "))))
[1] "fruits, apple, passion fruit, mango"
If you dput(vec) to make code reproducible:
c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
"Products:", " apple,", " passion fruit,", " mango", "Documents: NDA",
"Export: 2.10")
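Since the readLines() vector keeps one element per line, another base-R sketch (assuming the product list always sits between the "Products:" and "Documents:" lines) is to index those lines directly:

```r
vec <- c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
         "Products:", " apple,", " passion fruit,", " mango", "Documents: NDA",
         "Export: 2.10")
type <- sub("^Type:\\s*", "", grep("^Type:", vec, value = TRUE))
i <- grep("^Products:", vec)   # line where the product list starts
j <- grep("^Documents:", vec)  # line where it ends
items <- trimws(sub(",$", "", vec[(i + 1):(j - 1)]))  # drop trailing commas, spaces
x <- c(type, items)
x
## [1] "fruits"        "apple"         "passion fruit" "mango"
```

This avoids collapsing and re-splitting the whole file, at the cost of assuming the two label lines are present.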

Extract text according to delimiters but miss out missing entries

I have some text as follows:
inputString <- "Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000 Date of Procedure:01/01/2010 Endoscopist:Dr. Sebastian Zeki: Nurses:Anthony Nurse , Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 Extent of Exam:second part of duodenum Visualization:Good Tolerance: Good Complications: None Co-morbidity:None INDICATIONS FOR EXAMINATION Illness Stomach pain. PROCEDURE PERFORMED Gastroscopy (OGD) FINDINGS Things found and biopsied DIAGNOSIS Biopsy of various RECOMMENDATIONS Chase for histology. FOLLOW UP Return Home"
I want to extract parts of the test in to their own columns according to some text boundaries I have set:
myWords <- c("Patient Name","Date of Birth","Hospital Number","Date of Procedure","Endoscopist","Second Endoscopist","Trainee","Referring Physician","Nurses","Medications")
Not all of the delimiter words are in the text (but they are always in the same order).
I have a function that should separate them out (with the column title as the start of the word boundary):
delim<-myWords
inputStringdf <- data.frame(inputString,stringsAsFactors = FALSE)
inputStringdf <- inputStringdf %>%
tidyr::separate(inputString, into = c("added_name",delim),
sep = paste(delim, collapse = "|"),
extra = "drop", fill = "right")
However, when there is no finding between two delimiters, or if the delimiters do not exist, rather than place NA in the column, it just fills it with the next text found between two delimiters. How can I make sure that the correct columns are filled with the correct text as defined by the delimiters?
Using the input shown in the Note at the end, transform it into DCF format and then read it in using read.dcf, which converts the input lines into a character matrix m. See ?read.dcf for more info. No packages are used.
pat <- sprintf("(%s)", paste(myWords, collapse = "|"))
g <- gsub(pat, "\n\\1", paste0(Lines, "\n"))
m <- read.dcf(textConnection(g))
Here are the first three columns:
m[, 1:3]
## Patient Name Date of Birth Hospital Number
## [1,] "MRS Comfor Atest" "23/02/1981" "000000"
## [2,] "MRS Comfor Atest" NA "000000"
Note
The input is assumed to have one record per patient like this example which has two records. We have just repeated the first patient for simplicity in synthesizing an input data set except we have omitted the Date of Birth in the second record.
Lines <- c(inputString, sub("Date of Birth:23/02/1981 ", "", inputString))
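To see the mechanics on a tiny synthetic record (the field names and values below are made up for illustration, not taken from the report text): the gsub() step puts a newline before each known field name, which turns the flat string into exactly the "Field: value" lines read.dcf() expects.

```r
txt    <- "Name:Alice Smith Dept:Radiology"
fields <- c("Name", "Dept")
pat <- sprintf("(%s)", paste(fields, collapse = "|"))
g   <- gsub(pat, "\n\\1", paste0(txt, "\n"))  # newline before each field name
m   <- read.dcf(textConnection(g))            # one row, columns Name and Dept
```

Each field name becomes a column of the matrix, with the text up to the next field name as its value.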

Substituting all instances of words in one vector with words specified in a second vector

I am trying to find an efficient way to remove all instances of a set of words in an input list with the words in the removal list.
vectorOfWordsToRemove <- c('cat', 'monkey', 'wolf', 'mouses')
vectorOfPhrases <- c('the cat and the monkey walked around the block', 'the wolf and the mouses ate lunch with the monkey', 'this should remain unmodified')
remove_strings <- function(a, b) { stringr::str_replace_all(a,b, '')}
remove_strings(vectorOfPhrases, vectorOfWordsToRemove)
What I would like as output is
vectorOfPhrases <- c('the and the walked around the block', 'the and the ate lunch with the', 'this should remain unmodified')
That is, every instance of all the words in the vector - vectorOfWordsToRemove should be eliminated in vectorOfPhrases.
I could do this with for loops but it's pretty slow and it seems like there should be a vectorized way to do this efficiently.
Thanks
First I make a vector of empty strings to replace with:
vectorOfNothing <- rep('', 4)
And then use the qdap library to replace a vector of patterns with a vector of replacements:
library(qdap)
vectorOfPhrases <- qdap::mgsub(vectorOfWordsToRemove,
vectorOfNothing,
vectorOfPhrases)
> vectorOfPhrases
[1] "the and the walked around the block" "the and the ate lunch with the"
[3] "this should remain unmodified"
You can use gsubfn():
library(gsubfn)
replaceStrings <- as.list(rep("", 4))
newPhrases <- gsubfn("\\S+", setNames(replaceStrings, vectorOfWordsToRemove), vectorOfPhrases)
> newPhrases
[1] "the and the walked around the block" "the and the ate lunch with the"
[3] "this should remain unmodified"
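A base-R sketch of the same idea: build one alternation pattern with word boundaries (so that, say, "doggie" would not be clipped by a pattern word "dog"), remove all matches in a single gsub(), and tidy the leftover spacing:

```r
vectorOfWordsToRemove <- c("cat", "monkey", "wolf", "mouses")
vectorOfPhrases <- c("the cat and the monkey walked around the block",
                     "the wolf and the mouses ate lunch with the monkey",
                     "this should remain unmodified")
## \b...\b limits matches to whole words; one pass handles all pattern words
pat <- paste0("\\b(", paste(vectorOfWordsToRemove, collapse = "|"), ")\\b")
out <- trimws(gsub(" +", " ", gsub(pat, "", vectorOfPhrases)))
out
## [1] "the and the walked around the block" "the and the ate lunch with the"
## [3] "this should remain unmodified"
```

Because gsub() is vectorized over the phrases, this stays fast even for long input vectors, though a 400-word alternation pattern is worth benchmarking against the package-based approaches.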

Text Replacement -- Pattern is a set list of strings [r]

I have a string variable in a large data set that I want to cleanse based on a set list of strings. ex. pattern <- c("dog","cat") but my list will be about 400 elements long.
vector_to_clean == a
black Dog
white dOG
doggie
black CAT
thatdamcat
Then I want to apply a function to yield
new
dog
dog
dog
cat
cat
I have tried str_extract, grep, grepl etc., since I can pick a pattern based on one string at a time. I think what I want is to use lapply with one of these text-cleansing functions. Unfortunately, I'm stuck. Below is my latest attempt. Thank you for your help!
new <- vector()
lapply(pattern, function(x){
where<- grep(x,a,value = FALSE, ignore.case = TRUE)
new[where]<-x
})
We paste the 'pattern' vector together to create a single string, use that to extract the words from 'vec1' after we change it to lower case (tolower(vec1)).
library(stringr)
str_extract(tolower(vec1), paste(pattern, collapse='|'))
#[1] "dog" "dog" "dog" "cat" "cat"
data
pattern <- c("dog","cat")
vec1 <- c('black Dog', 'white dOG', 'doggie','black CAT', 'thatdamcat')
Another way using base R is:
#data
vec <- c('black Dog', 'white dOG', 'doggie','black CAT','thatdamcat')
#regexpr finds the locations of cat and dog ignoring the cases
a <- regexpr( 'dog|cat', vec, ignore.case=TRUE )
#regmatches returns the above locations from vec (here we use tolower in order
#to convert to lowercase)
regmatches(tolower(vec), a)
[1] "dog" "dog" "dog" "cat" "cat"
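One caveat with this approach, sketched below: regexpr() returns -1 for strings that match neither word, and regmatches() silently drops those, so the result can be shorter than the input. If you need the output aligned with the input (here with a hypothetical extra entry "goldfish bowl"), fill an NA vector at the matching positions:

```r
vec <- c("black Dog", "white dOG", "doggie", "black CAT", "thatdamcat",
         "goldfish bowl")
a   <- regexpr("dog|cat", vec, ignore.case = TRUE)
out <- rep(NA_character_, length(vec))       # placeholder for non-matches
out[a != -1] <- regmatches(tolower(vec), a)  # fill only the matching slots
out
## [1] "dog" "dog" "dog" "cat" "cat" NA
```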

Extract last word in string before the first comma

I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to sort to first name, last names and titles. With this data, titles always start after the first comma, if there is a title.
I am trying to sort the list into:
FirstName LastName Titles
Mark      Owens    M.D., M.P.H.
Lara      Kraft    -
Dale      Good     C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
You can see that with this code, Mr. Owens is not behaving. His title starts after the last comma, and the last name begins from P. You can tell that I referred to Extract last word in string in R, Extract 2nd to last word in string and Extract last word in a string after comma if there are multiple words else the first word
You were off to a good start so you should pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list. Inside of the sub function is another that eliminates everything after the first comma. The last name will then be the final word in the string. For titles there is a two-step process of first eliminating everything before the first comma, then replacing non-matched strings with a hyphen -.
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,", "", namelist)
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )
firstnames lastnames titles
1 Mark Owens M.D., M.P.H.
2 Dale Good C.P.A
3 Lara Kraft -
4 Roland Bass III
This should do the trick, at least on test data:
x=strsplit(namelist,split = ",")
x=rapply(object = x,function(x) gsub(pattern = "^ ",replacement = "",x = x),how="replace")
names=sapply(x,function(y) y[[1]])
titles=sapply(x,function(y) if(length(unlist(y))>1){
paste(na.omit(unlist(y)[2:length(unlist(y))]),collapse = ",")
}else{""})
names=strsplit(names,split=" ")
firstnames=sapply(names,function(y) y[[1]])
lastnames=sapply(names,function(y) y[[3]])
names <- data.frame(firstnames, lastnames, titles )
names
In cases like this, when the structure of the strings is always the same, it is easier to use functions like strsplit() to extract the desired parts.
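Along the same lines, if every name really has the shape "First M. Last[, titles]", base R's strcapture() (available since R 3.4) can pull all three columns with one regex. The pattern below is a sketch that assumes exactly that shape, including a one-letter middle initial:

```r
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A",
              "Lara T. Kraft", "Roland G. Bass, III")
## groups: first word, last word after the middle initial, optional titles
names_df <- strcapture("^(\\w+)\\s+\\w\\.?\\s+(\\w+),?\\s*(.*)$", namelist,
                       proto = data.frame(firstnames = character(),
                                          lastnames  = character(),
                                          titles     = character()))
names_df  # 4 rows: Mark/Owens, Dale/Good, Lara/Kraft (empty titles), Roland/Bass/III
```

Names without a middle initial would need the pattern loosened, but for uniform input this keeps the whole split in a single call.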
