How to extract text using delimiters when some delimiters missing - r

I am trying to extract text according to the headers in a semi-structured text document.
Input
Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
The output here is
Order   Subject Name          Grade Report           Conclusion
1223442 History Bilbo Johnson Bad   Need to complete Dud
I can achieve this with the following (messy but it works) function:
library(dplyr)
library(tidyr)

dataframeIn <- data.frame(Column, stringsAsFactors = FALSE)
delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

Extractor <- function(dataframeIn, Column, delim) {
  dataframeInForLater <- dataframeIn
  ColumnForLater <- Column
  Column <- rlang::sym(Column)
  dataframeIn <- data.frame(dataframeIn)
  dataframeIn <- dataframeIn %>%
    tidyr::separate(!!Column, into = c("added_name", delim),
                    sep = paste(delim, collapse = "|"),
                    extra = "drop", fill = "right")
  names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)
  dataframeIn <- data.frame(dataframeIn)
  # Add the original column back in so we keep the original reference
  dataframeIn <- cbind(dataframeInForLater[, ColumnForLater], dataframeIn)
  dataframeIn <- data.frame(dataframeIn)
  return(dataframeIn)
}
Extractor(dataframeIn, "Column", delim)
However, sometimes the delimiters are missing, e.g.
Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud
In which case the desired output is
Order   Subject Name          Grade Conclusion
1223442 History Bilbo Johnson Bad   Dud
but the actual output becomes:
Order    Subject  Name          Grade Report Conclusion
:1223442 :History Bilbo Johnson : Bad : Dud  <NA>
How can I account for missing delimiters, given that they always appear in the same order (including delimiters missing in the middle of the text as well as at the end, as in the example above)?

We may do the following (it's only text extraction, I leave constructing the output for you):
library(stringr)
Extractor <- function(x, delim) {
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  trimws(str_match(x, pattern)[, 2])
}
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
Extractor(Column1, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" NA "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA "History" "Bilbo Johnson" NA NA NA
Since we have NA's it's clear what delimiters were missing and what weren't.
The way it works in your case is that we have a series of patterns
pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
Then str_match neatly extracts the (.*?) part into the second output column, and we strip any surrounding spaces with trimws. Note that we use lazy matching in (.*?) so as not to match too much.
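Since the answer leaves constructing the output as an exercise, here is one minimal sketch (my addition, not part of the original answer) that builds the one-row data frame from the extracted vector, dropping the columns whose delimiters were missing so it matches the desired output:
ExtractToFrame <- function(x, delim) {
  vals <- Extractor(x, delim)
  out <- as.data.frame(t(vals), stringsAsFactors = FALSE)
  names(out) <- delim
  out[, !is.na(vals), drop = FALSE]  # omit columns for missing delimiters
}
ExtractToFrame(Column2, delim)
#     Order Subject          Name Grade Conclusion
# 1 1223442 History Bilbo Johnson   Bad        Dud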

Related

Parsing text in r without separator

I need help with ideas for parsing this text.
I want to do it in the most automatic way possible.
This is the text
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
I need this result:
a           b
JOHN DEERE  PMWF2126
NEW HOLLAND 441702A1
HIFI        WE 2126
CUMMINS     4907485
This is just an example; the real data has different makes and item ids.
I tried:
str_split(text, " ")
[[1]]
 [1] "JOHN"     "DEERE:"   "PMWF2126" "NEW"      "HOLLAND:" "441702A1" "HIFI:"    "WE"       "2126"
[10] "CUMMINS:" "4907485"
Thanks!
Edit:
Thanks for your answers, very helpful.
But there is another case, where the id can end with a letter too:
text <- "LANSS: EF903R DARMET: VP-2726/S CASE: 133721A1 JOHN DEERE: RE68049 JCB: 32917302 WIX: 46490 TURBO: TR25902 HIFI: SA 16080 CATERPILLAR: 4431570 KOMATSU: Z7602BXK06 KOMATSU: Z7602BX106 KOMATSU: YM12991012501 KOMATSU: YM12991012500 KOMATSU: YM11900512571 KOMATSU: 6001851320 KOMATSU: 6001851300 KOMATSU: 3EB0234790 KOMATSU: 11900512571"
We can use separate_rows and separate from tidyr for this task:
library(tidyverse)
data.frame(text) %>%
  # separate into rows:
  separate_rows(text, sep = "(?<=\\d)\\s") %>%
  # separate into columns:
  separate(text,
           into = c("a", "b"),
           sep = ":\\s")
# A tibble: 4 × 2
a b
<chr> <chr>
1 JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI WE 2126
4 CUMMINS 4907485
The split point for separate_rows uses the look-behind (?<=\\d) to assert that the whitespace \\s on which the string is broken must be preceded by a digit.
Data:
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
This solution assumes (as in your sample data) that the second value always ends with a number and the first column does not.
If this is not the case, you'll have to adapt the regex part (?<=[0-9] )(?=[A-Z]) so that the splitting point lies between the two round-bracketed parts.
text <- "JOHN DEERE: PMWF2126 NEW HOLLAND: 441702A1 HIFI: WE 2126 CUMMINS: 4907485"
lapply(
  strsplit(
    unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)),
    ":"),
  trimws)
[[1]]
[1] "JOHN DEERE" "PMWF2126"
[[2]]
[1] "NEW HOLLAND" "441702A1"
[[3]]
[1] "HIFI" "WE 2126"
[[4]]
[1] "CUMMINS" "4907485"
The key part is strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE).
It looks for occurrences where, after a numeric value followed by a space ((?<=[0-9] )), a new part starts with a capital letter ((?=[A-Z])).
These positions are then used as splitting points.
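If you want that list of pairs as the two-column data frame from the question, one possible follow-up (my addition, not part of the answer) is to rbind the pairs into a matrix and convert:
parts <- lapply(
  strsplit(
    unlist(strsplit(text, "(?<=[0-9] )(?=[A-Z])", perl = TRUE)),
    ":"),
  trimws)
df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(df) <- c("a", "b")
df
#             a        b
# 1  JOHN DEERE PMWF2126
# 2 NEW HOLLAND 441702A1
# 3        HIFI  WE 2126
# 4     CUMMINS  4907485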
Since the second field always ends in a digit and the first field does not, replace a digit followed by space with that digit and a newline and then use read.table with a colon separator.
text |>
gsub("(\\d) ", "\\1\n", x = _) |>
read.table(text = _, sep = ":", strip.white = TRUE)
giving
V1 V2
1 JOHN DEERE PMWF2126
2 NEW HOLLAND 441702A1
3 HIFI WE 2126
4 CUMMINS 4907485
If in your data the second field can contain a digit but the first cannot, and the digit is not necessarily at the end of the last word in field two but could be anywhere in that word, then we can use this variation, which gives the same result here. gsubfn is like gsub except that the second argument can be a function instead of a replacement string; the function takes the capture group as input, and the entire match is replaced by the function's output. The function can be expressed in formula notation, as is done here.
library(gsubfn)

text |>
  gsubfn("\\w+", ~ if (grepl("[0-9]", x)) paste(x, "\n") else x, x = _) |>
  read.table(text = _, sep = ":", strip.white = TRUE)
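As a tiny standalone demo of the formula-replacement mechanism described above (my sketch, separate from the answer), here every \\w+ match is passed to the formula function as x and the match is replaced by whatever the function returns:
library(gsubfn)
# upper-case every word that contains a digit, leave the rest untouched
gsubfn("\\w+", ~ if (grepl("[0-9]", x)) toupper(x) else x, "abc a1b 22")
# [1] "abc A1B 22"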

Extract text according to delimiters but miss out missing entries

I have some text as follows:
inputString <- "Patient Name:MRS Comfor Atest Date of Birth:23/02/1981 Hospital Number:000000 Date of Procedure:01/01/2010 Endoscopist:Dr. Sebastian Zeki: Nurses:Anthony Nurse , Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 Extent of Exam:second part of duodenum Visualization:Good Tolerance: Good Complications: None Co-morbidity:None INDICATIONS FOR EXAMINATION Illness Stomach pain. PROCEDURE PERFORMED Gastroscopy (OGD) FINDINGS Things found and biopsied DIAGNOSIS Biopsy of various RECOMMENDATIONS Chase for histology. FOLLOW UP Return Home"
I want to extract parts of the text into their own columns according to some text boundaries I have set:
myWords<-c("Patient Name","Date of Birth","Hospital Number","Date of Procedure","Endoscopist","Second Endoscopist","Trainee","Referring Physician","Nurses"."Medications")
Not all of the delimiter words are in the text (but they are always in the same order).
I have a function that should separate them out (with the column title as the start of the word boundary):
library(dplyr)
library(tidyr)

delim <- myWords
inputStringdf <- data.frame(inputString, stringsAsFactors = FALSE)
inputStringdf <- inputStringdf %>%
  tidyr::separate(inputString, into = c("added_name", delim),
                  sep = paste(delim, collapse = "|"),
                  extra = "drop", fill = "right")
However, when there is no finding between two delimiters, or if the delimiters do not exist, rather than place NA in the column, it just fills it with the next text found between two delimiters. How can I make sure that the correct columns are filled with the correct text as defined by the delimiters?
Using the input shown in the Note at the end, transform it into DCF format and then read it in using read.dcf, which converts the input lines into a character matrix m. See ?read.dcf for more info. No packages are used.
pat <- sprintf("(%s)", paste(myWords, collapse = "|"))
g <- gsub(pat, "\n\\1", paste0(Lines, "\n"))
m <- read.dcf(textConnection(g))
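To see what the transformation produces, here is the intermediate DCF text for the first record (my illustration; the leading blank line and the tail of the Medications field are truncated here):
cat(g[1])
# Patient Name:MRS Comfor Atest
# Date of Birth:23/02/1981
# Hospital Number:000000
# Date of Procedure:01/01/2010
# Endoscopist:Dr. Sebastian Zeki:
# Nurses:Anthony Nurse ,
# Medications:Medication A 50 mcg, Another drug 2.5 mg Instrument:D111 ...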
Here are the first three columns:
m[, 1:3]
## Patient Name Date of Birth Hospital Number
## [1,] "MRS Comfor Atest" "23/02/1981" "000000"
## [2,] "MRS Comfor Atest" NA "000000"
Note
The input is assumed to have one record per patient, like this example, which has two records. We have simply repeated the first patient to synthesize an input data set, except that we have omitted the Date of Birth in the second record.
Lines <- c(inputString, sub("Date of Birth:23/02/1981 ", "", inputString))

How do I 'efficiently' replace a vector of strings with another (pairwise) in a large text corpus

I have a large corpus of text in a vector of strings (approx. 700,000 strings). I'm trying to replace specific words/phrases within the corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.
I'm looking for an efficient way of solving the problem.
I can do it in a for loop, looping through each pattern + replacement, but it scales badly (3 days or so!).
I've also tried qdap::mgsub(), but it seems to scale badly as well:
txt <- c("this is a random sentence containing bca sk",
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns",
"this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c("#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt")
#either
txt2 <- qdap::mgsub(patterns, replacements, txt)
#or
for(i in 1:length(patterns)){
txt <- gsub(patterns[i], replacements[i], txt)
}
Both solutions scale badly for my data with approx. 40,000 patterns/replacements and 700,000 txt strings.
I figure there must be a more efficient way of doing this?
If you can tokenize the texts first, then vectorized replacement is much faster. It's also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.
Here's how to do all that in the quanteda package. The last line pastes the tokens back into a single "document" as a character vector, if that is what you want.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda_options(threads = 4)
txt <- c(
  "this is a random sentence containing bca sk",
  "another sentence with bc a but also with zqx tt",
  "this sentence contains none of the patterns",
  "this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
  "#a-specfic-tag-#abc sk",
  "#a-specfic-tag-#bc a",
  "#a-specfic-tag-#zqx tt"
)
This will tokenize the texts and then use fast replacement of the hashed types, using a fixed pattern match (but you could have used valuetype = "regex" for regular expression matching). By wrapping patterns inside the phrase() function, you are telling tokens_replace() to look for token sequences rather than individual matches, so this solves the multi-word issue.
toks <- tokens(txt) %>%
  tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this" "is" "a" "random" "sentence"
## [6] "containing" "bca" "sk"
##
## text2 :
## [1] "another" "sentence"
## [3] "with" "#a-specfic-tag-#bc a"
## [5] "but" "also"
## [7] "with" "#a-specfic-tag-#zqx tt"
##
## text3 :
## [1] "this" "sentence" "contains" "none" "of" "the"
## [7] "patterns"
##
## text4 :
## [1] "this" "sentence" "contains"
## [4] "only" "#a-specfic-tag-#bc a"
Finally, if you really want to put this back into character format, convert it to a list of character types and then paste them together:
sapply(as.list(toks), paste, collapse = " ")
## text1
## "this is a random sentence containing bca sk"
## text2
## "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## text3
## "this sentence contains none of the patterns"
## text4
## "this sentence contains only #a-specfic-tag-#bc a"
You'll have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
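To act on that suggestion, here is a hedged way to compare timings on an inflated sample before committing to the full corpus (my sketch; the rep() factor is arbitrary, and system.time is base R):
sample_txt <- rep(txt, 1000)  # stand-in for a slice of the real corpus
system.time(
  quanteda_out <- sapply(as.list(tokens_replace(tokens(sample_txt),
                                                phrase(patterns), replacements,
                                                valuetype = "fixed")),
                         paste, collapse = " ")
)
system.time({
  loop_out <- sample_txt
  for (i in seq_along(patterns)) {
    # fixed = TRUE mirrors the answer's advice to avoid regex matching
    loop_out <- gsub(patterns[i], replacements[i], loop_out, fixed = TRUE)
  }
})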
Create a vector of all words in each phrase
txt1 = strsplit(txt, " ")
words = unlist(txt1)
Use match() to find the index of words to replace, and replace them
idx <- match(words, patterns)
words[!is.na(idx)] = replacements[idx[!is.na(idx)]]
Re-form the phrases and paste together
phrases = relist(words, txt1)
updt = sapply(phrases, paste, collapse = " ")
I guess this won't work if patterns can have more than one word...
Create a map between the old and new values
map <- setNames(replacements, patterns)
Create a pattern that contains all patterns in a single regular expression
pattern = paste0("(", paste0(patterns, collapse="|"), ")")
Find all matches, and extract them
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
Unlist, map, and relist the matches to their replacement values, and update the original vector
regmatches(txt, ridx) <- relist(map[unlist(m)], m)
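For completeness (my own check, assuming the original txt, patterns, and replacements from the question), the updated vector then looks like this:
txt
# [1] "this is a random sentence containing bca sk"
# [2] "another senctence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
# [3] "this sentence contains non of the patterns"
# [4] "this sentence contains only #a-specfic-tag-#bc a"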

gsubfn function not giving desired output when ignore.case = TRUE

I am trying to substitute multiple patterns within a character vector with their corresponding replacement strings. After doing some research I found the package gsubfn, which I think is able to do what I want. However, when I run the code below I don't get my expected output (see the end of the question for the results versus what I expected to see).
library(gsubfn)
# Our test data that we want to search through (while ignoring case)
test.data<- c("1700 Happy Pl","155 Sad BLVD","82 Lolly ln", "4132 Avent aVe")
# A list data frame which contains the patterns we want to search for
# (again ignoring case) and the associated replacement strings we want to
# exchange any matches we come across with.
frame<- data.frame(pattern= c(" Pl"," blvd"," LN"," ave"), replace= c(" Place", " Boulevard", " Lane", " Avenue"),stringsAsFactors = F)
# NOTE: I added spaces in front of each of our replacement terms to make
# sure we only grab matches that are their own word (for instance if an
# address was 45 Splash Way we would not want to replace "pl" inside of
# "Splash" with "Place
# The following paste lines are supposed to keep the substitution from
# grabbing instances like the first " Ave" found directly after "4132"
# inside "4132 Avent aVe", which we don't want converted to " Avenue".
pat <- paste(paste(frame$pattern,collapse = "($|[^a-zA-Z])|"),"($|[^a-zA-Z])", sep = "")
# Here is the gsubfn function I am calling
gsubfn(x = test.data, pattern = pat, replacement = setNames(as.list(frame$replace),frame$pattern), ignore.case = T)
Output being received:
[1] "1700 Happy" "155 Sad" "82 Lolly" "4132 Avent"
Output expected:
[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
My working theory is that the matches don't line up with the names of the list I am passing to gsubfn's replacement argument because of case discrepancies (e.g. the match found in "155 Sad BLVD" doesn't equal " blvd", even though ignore.case let it match in the first place). Can someone confirm that this is the issue or point me to what else might be going wrong, and perhaps suggest a fix that doesn't require expanding my pattern vector to include all case permutations?
Seems like stringr has a simple solution for you:
library(stringr)
str_replace_all(test.data,
                regex(paste0('\\b', frame$pattern, '$'), ignore_case = TRUE),
                frame$replace)
#[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
Note that I had to alter the regex to look only for words at the end of the string because of the tricky 'Avent aVe'. But of course there are other ways to handle that too.
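One such alternative (a sketch of mine, not from the answer above): trim the stored patterns and bound them with \\b on both sides so matches are standalone words anywhere in the string, looping so that every pattern is applied to every address:
library(stringr)

out <- test.data
for (i in seq_along(frame$pattern)) {
  # "\\bave\\b" matches the standalone "aVe" but not the "Ave" inside "Avent"
  out <- str_replace_all(out,
                         regex(paste0("\\b", trimws(frame$pattern[i]), "\\b"),
                               ignore_case = TRUE),
                         trimws(frame$replace[i]))
}
out
# [1] "1700 Happy Place"   "155 Sad Boulevard"  "82 Lolly Lane"      "4132 Avent Avenue"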

matching words in strings with variables in R

I have a data set, like the following:
cp<-data.frame("name"=c("billy", "jean", "jean", "billy","billy", "dawn", "dawn"),
"answer"=c("michael jackson is my favorite", "I like flowers", "flower is red","hey michael",
"do not touch me michael","i am a girl","girls have hair"))
Every name has strings attached to it, stored in the variable answer. I would like to find out which specific words, parts of words, or whole sentences in the answer variable are common across the entries for each name in name:
For example, the name "billy" would have "michael" connected to it.
EDIT:
A data frame called ddd with the following variables:
name: debby answer: "did you go to dallas?"
name: debby answer: "debby did dallas"
I am looking for something like
function(name = debby, data = ddd) {...}
which gives the output "did debby dallas".
Here's a (not very efficient) function I've made that uses pmatch in order to find partial matches. The problem with it is that it will also match "a" and "am", or "i" and "is", because they are also very close.
freqFunc <- function(x){
  temp <- tolower(unlist(strsplit(as.character(x), " ")))
  temp2 <- length(temp)
  temp3 <- lapply(temp, function(x){
    temp4 <- na.omit(temp[pmatch(rep(x, temp2), temp)])
    temp4[length(temp4) > 1]
  })
  list(unique(unlist(temp3)))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean i,is,flower,flowers
# 3: dawn a,am,girl,girls
If you are satisfied with just exact matches, this can be simplified considerably, which also improves performance (I also added tolower so it will match different cases too):
freqFunc2 <- function(x){
  temp <- table(tolower(unlist(strsplit(as.character(x), " "))))
  list(names(temp[temp > 1]))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc2), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean
# 3: dawn
With the caveat that I may have misunderstood, I think this is what you're looking for. It doesn't handle plurals of words though, as David mentioned; it just finds words that are exactly the same.
billyAnswers<-cp$answer[cp$name=="billy"]
#output of billyAnswers
#[1] "michael jackson is my favorite" "hey michael"
#[3] "do not touch me michael"
Now we get all the words
allWords <- unlist(strsplit(billyAnswers, " "))
#output of allWords
# [1] "michael" "jackson" "is" "my" "favorite" "hey"
# [7] "michael" "do" "not" "touch" "me" "michael"
We can find the common ones
common<-allWords[duplicated(allWords)]
#output of common
#[1] "michael" "michael"
Of course there are two michaels because there are multiple instances of michael in billy's answers! So let's pare it down once more.
unique(common)
#[1] "michael"
And there you go: apply that to all names and you've got it (one way to do that is sketched below, after the word lists).
For jean and dawn, there are no common words in their answers, so this method returns two character vectors of length 0:
#jean's words
#[1] "I" "like" "flowers" "flower" "is" "red"
#dawn's words
#[1] "i" "am" "a" "girl" "girls" "have" "hair"
