How to pattern match for a list of strings - r

I am trying to find and replace some text based on fuzzy matching as follows.
Aim
I want to do this for a list of find-and-replace pairs. I don't know how to extend the current function to allow this.
Input
Input text
df <- data.frame(textcol=c("In this substring would like to find the radiofrequency ablation of this HALO",
"I like to do endoscopic submuocsal resection and also radifrequency ablation",
"No match here","No mention of this radifreq7uency ablati0on thing"))
The attempt
##### Lower case the text ##########
df$textcol<-tolower(df$textcol)
#Need to define the pattern to match and what to replace it with
matchPattern <- "radiofrequency ablation"
findAndReplace<-function(matchPattern,rawText,replace)
{
positions <- aregexec(matchPattern, rawText, max.distance = 0.1)
# use the function argument here rather than the global df$textcol
res <- regmatches(rawText, positions)
res[lengths(res)==0] <- "XXXX" # deal with 0 length matches somehow
#################### Term mapping ####################
out <- Vectorize(gsub)(unlist(res), replace, rawText)
unname(out)
}
matchPatternRFA <- c("radiofrequency ablation")
# pass the text column explicitly; rawText was never defined outside the function
repRF<-findAndReplace(matchPatternRFA,df$textcol,"RFA")
repRF
The problem
The above works fine for the replacement of one term, but what if I also want to replace 'endoscopic submucosal resection' with 'EMR' and 'HALO' with 'catheter'?
Ideally I'd like to create a list of terms to match but then how do I also specify how to replace them?

Define asub to replace approximate matches with a replacement string and define a matching list L that for each name defines its replacement. Then run Reduce to perform the replacements.
asub <- function(pattern, replacement, x, fixed = FALSE, ...) {
m <- aregexec(pattern, x, fixed = fixed, ...) # forward extra arguments such as max.distance
r <- regmatches(x, m)
lens <- lengths(r)
if (all(lens == 0)) return(x) else
replace(x, lens > 0, mapply(sub, r[lens > 0], replacement, x[lens > 0]))
}
L <- list("radiofrequency ablation" = "RFA",
"endoscopic submucosal resection" = "EMR",
"HALO" = "catheter")
Reduce(function(x, nm) asub(nm, L[[nm]], x), init = df$textcol, names(L))
giving:
[1] "In this substring would like to find the RFA of this catheter"
[2] "I like to do EMR and also RFA"
[3] "No match here"
[4] "No mention of this RFA thing"
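The approach above calls sub() with the matched text as a pattern, which could misbehave if a match happened to contain regex metacharacters. A minimal sketch of an alternative, assuming a hypothetical helper named asub_pos that is not part of the original answer, splices the replacement in by position instead. It also forwards ignore.case (a real aregexec argument), which may help if the text has been lower-cased as in the question.

```r
# Hypothetical helper (not from the original answer): replace the first
# approximate match of `pattern` in each element of `x`, working by position.
asub_pos <- function(pattern, replacement, x, ...) {
  m <- aregexec(pattern, x, ...)
  # start of the overall match (-1 when there is no match)
  start <- vapply(m, function(mi) as.integer(mi[1]), integer(1))
  # length of the overall match
  len <- vapply(m, function(mi) as.integer(attr(mi, "match.length")[1]), integer(1))
  hit <- start > 0
  x[hit] <- paste0(
    substr(x[hit], 1, start[hit] - 1),                        # text before the match
    replacement,
    substr(x[hit], start[hit] + len[hit], nchar(x[hit]))      # text after the match
  )
  x
}

txt <- c("no match here", "The Radifrequency Ablation works")
res <- asub_pos("radiofrequency ablation", "RFA", txt, ignore.case = TRUE)
res
```

Elements without a match are returned unchanged, so the helper can be chained with Reduce in the same way as asub.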

You can create a lookup table with patterns and necessary replacements:
library(data.table)
dt <-
data.table(
textcol = c(
"In this substring would like to find the radiofrequency ablation of this HALO",
"I like to do endoscopic submuocsal resection and also radifrequency ablation",
"No match here",
"No mention of this radifreq7uency ablati0on thing"
)
)
dt_gsub <- data.table(
textcol = c("submucosal resection",
"HALO",
"radiofrequency ablation"),
textcol2 = c("EMR", "catheter", "RFA")
)
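# gsub is vectorised over the text column, so only the lookup rows need a loop
# (note: gsub matches exactly, so misspelled variants are left unchanged)
for (j in seq_len(nrow(dt_gsub)))
dt[, textcol := gsub(dt_gsub$textcol[j], dt_gsub$textcol2[j], textcol)]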
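The same lookup-table idea can be sketched in base R with a named character vector. The helper name replace_exact is hypothetical, and, like the gsub loop in this answer, it replaces exact matches only, so it would not catch the misspelled terms.

```r
# Lookup table as a named vector: names are patterns, values are replacements.
lookup <- c(
  "submucosal resection"    = "EMR",
  "HALO"                    = "catheter",
  "radiofrequency ablation" = "RFA"
)

# Hypothetical helper: apply every exact replacement in turn.
replace_exact <- function(x, lookup) {
  for (pat in names(lookup)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(pat, lookup[[pat]], x, fixed = TRUE)
  }
  x
}

out <- replace_exact(c("radiofrequency ablation of this HALO", "No match here"), lookup)
out
# [1] "RFA of this catheter" "No match here"
```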


How to use the `purrr` package in R instead of for-loop to iterate over indices

I have a list of S4 objects, and I'm trying to iterate a function over these lists where I select an index position, and then from that position extract keywords I'm interested in. I am able to do a for loop and apply the function successfully, but is there a way this could be done using the purrr package? I'm not sure how to replicate the S4 object exactly, so I've included a very high level example just to get an idea of my process.
list_1 <- list("Sample", "test", "test Date")
list_2 <- list("test", "sample", "test Date")
listoflists <- list(list_1, list_2)
I created a list of indices of "Sample":
groupList <- map(listoflists,~which(toupper(.) == "SAMPLE"))
As well as a list of keywords that I'd like to extract:
keywordsList <- list(c("One test", "two test"), c("one test", "two test"))
I have a function that takes the S4 objects, selects the index where "sample" is found, and from that extracts the keywords.
for(i in seq_along(listoflists){
output[[i]] <- some_function(listoflists[[i]], index = groupList[[i]], keywords = keywordsList[[i]]) }
I tried using imap, but it seems like when I do this, the output's sublist only has 1 keyword (say "One test" in first list and "two test" in second list) instead of 3:
output <- listoflists %>% imap(~some_function(.x,index = groupList[[.y]], keywords = keywordsList[[.y]])
You are missing a closing bracket in your for loop, but other than that your code should work. I am going to define a trivial some_function() to demonstrate:
library(purrr)
some_function <- function(x, index, keywords) {
c(x[[index]], keywords)
}
loop_output <- vector(mode = "list", length = length(listoflists))
for (i in seq_along(listoflists)) {
loop_output[[i]] <- some_function(listoflists[[i]], index = groupList[[i]], keywords = keywordsList[[i]])
}
purr_output <- imap(
listoflists,
~ some_function(
.x,
index = groupList[[.y]],
keywords = keywordsList[[.y]]
)
)
identical(loop_output, purr_output)
# TRUE
If, even with the brackets corrected, your example works in a loop but not with imap, I doubt that the use of S4 objects is the cause.
You can be tripped up if you have a named list. From the imap docs:
imap_xxx(x, ...), an indexed map, is short hand for map2(x, names(x), ...) if x has names, or map2(x, seq_along(x), ...) if it does not.
See for example:
listoflists <- list(list_1, list_2)
imap(listoflists, ~.y)
# [[1]]
# [1] 1
# [[2]]
# [1] 2
listoflists <- list(l1 = list_1, l2 = list_2)
imap(listoflists, ~.y)
# $l1
# [1] "l1"
# $l2
# [1] "l2"
Make sure you are looping over the indices rather than the names and the output should be identical.
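As a minimal sketch of iterating over positions rather than names, the base R version below uses seq_along(), so it gives the same result whether or not the outer list is named. The data and some_function are the toy versions from this thread, not the real S4 objects.

```r
list_1 <- list("Sample", "test", "test Date")
list_2 <- list("test", "sample", "test Date")
# a named outer list, the case that trips up imap()
listoflists <- list(l1 = list_1, l2 = list_2)

# index of "sample" (case-insensitive) in each sublist
groupList <- lapply(listoflists, function(l) which(toupper(unlist(l)) == "SAMPLE"))
keywordsList <- list(c("One test", "two test"), c("one test", "two test"))

# trivial stand-in for the real function
some_function <- function(x, index, keywords) c(x[[index]], keywords)

# iterate over integer positions, so names never leak into the indexing
by_index <- lapply(seq_along(listoflists), function(i) {
  some_function(listoflists[[i]], index = groupList[[i]], keywords = keywordsList[[i]])
})
by_index
```

Each element of by_index holds the matched entry followed by both keywords for that sublist.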
You could also do this with purrr::pmap(), which maps in parallel over an arbitrary number of lists (passed within a super-list):
output <-
pmap(.l = list(listoflists, index = groupList, keywords = keywordsList),
.f = some_function)

Find sequences of elements in vectors

I need some pointers on this. Actually, I don't necessarily need a fully-fledged solution here - some pointers to functions and/or packages would be great.
The problem: I want to find specific sequences in a character vector. The sequences can be somewhat "underspecified". That means that some of the elements should be fixed, but for some elements it does not matter how long they are or what they are exactly.
An example: Suppose I want to find the following pattern in a character vector:
1. The sequence should begin with "Out of" or "out of".
2. The sequence should end with "reasons".
3. In between, there should be other elements. But it does not matter how many elements there are (zero would also be OK) or what exactly they are.
4. In between 1. and 2., there shouldn't be a ".", "!" or "?".
5. There should be a parameter that controls how long the sequence in 3. can maximally be to still produce a result.
6. Return value of the function should be the intervening elements and/or their indices in the vector.
So, the function should "behave" like this:
c("Out", "of", "specific", "reasons", ".") Return "specific"
c("Out", "of", "very", "specific", "reasons", ".") Return c("very", "specific")
c("out", "of", "curiosity", ".", "He", "had", "his", "reasons") Return "" or NA or NULL, which one doesn't matter - just a signal that there is no result.
As I said: I don't need a full solution. Any pointers to packages that already implement such functionality are appreciated!
Optimally, I don't want to rely on a solution that first pastes the text and then uses regex for matching.
Thanks a lot!
I would be really curious to learn of a package that serves your needs. My inclination would be to collapse the strings and use regular expressions or find a programmer or use perl. But here's one extensible solution in R with a few more cases to experiment on. Not very elegant, but see if this has some utility.
# Recreate data as a list with a few more edge cases
txt1 <- c(
"Out of specific reasons.",
"Out of very specific reasons.",
"Out of curiosity. He had his reasons.",
"Out of reasons.",
"Out of one's mind.",
"For no particular reason.",
"Reasons are out of the ordinary.",
"Out of time and money and for many good reasons, it seems.",
"Out of a box, a car, and for random reasons.",
"Floop foo bar.")
txt2 <- strsplit(txt1, "[[:space:]]+") # split each line on whitespace
txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE) # split punctuation off each token
txt <- lapply(txt3, unlist) # create list of tokens from each line
# Define characters to exclude: [. ! and ?] but not [,]
exclude <- "[.!?]"
# Assign acceptable limit to separation
lim <- 5 # try 7 and 12 to experiment
# Create indices identifying each of the enumerated conditions
fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
index1 <- lapply(txt, fun1, "out")
index2 <- lapply(txt, fun1, "of")
index3 <- lapply(txt, fun1, "reasons")
index4 <- lapply(txt, fun1, exclude)
# Create logical vectors from indices satisfying the conditions
fun2 <- function(set, val) val[1] %in% set
cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)
cond2 <- sapply(index3, "[", 1) < lim + 2 + 2 # position of 'of' + 2
cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf)
# Combine logical vectors to a single logical vector
valid <- cond1 & cond2 & cond3
valid <- ifelse(is.na(valid), FALSE, valid)
# Examine selected original lines
print(txt1[valid])
# Helper function to extract the starting and the ending element
fun3 <- function(index2, index3, valid) {
found <- rep(list(NULL), length(index2))
found[valid] <- Map(seq, index2[valid], index3[valid])
found <- lapply(found, tail, -1)
found <- lapply(found, head, -1)
}
# Extract starting and ending element from valid list members
idx <- fun3(index2, index3, valid)
# Return the results or "" for no intervening text or NULL for no match
ans <- Map(function(x, i) {
if (is.null(i)) NULL # no match found
else if (length(i) == 0) "" # no intervening elements
else x[i]}, # all intervening elements <= lim
txt, idx)
# Show found (non-NULL) values
ans[!sapply(ans, is.null)]
So let's assume your example
x <- c("Out", "of", "very", "specific", "reasons", ".")
We first need to get the beginning of the indicator
i_Beginning <- as.numeric(grep("Out|out", x))
and the ending
i_end <- as.numeric(grep("reasons", x))
Need to also check that Out is followed by of
Is_Of <- grepl("Of|of", x[i_Beginning +1])
And if this is true we extract the other elements
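if(Is_Of){
# take every element strictly between "of" and "reasons", not only the two endpoints
extraction <- x[seq_len(i_end - i_Beginning - 2) + i_Beginning + 1]
print(extraction)
}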
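The requirements listed in the question can also be checked directly on the token vector, without pasting and regex matching. The following is only a sketch under assumptions: find_between and max_len are hypothetical names, the tokens are already split as in the question's examples, and NULL is used as the "no result" signal.

```r
# Hypothetical helper: return the elements (and indices) between
# "Out of"/"out of" and "reasons", or NULL if there is no valid sequence.
find_between <- function(tokens, max_len = 5) {
  n <- length(tokens)
  if (n < 3) return(NULL)
  # positions where "Out"/"out" is immediately followed by "of"
  starts <- which(tokens[-n] %in% c("Out", "out") & tokens[-1] == "of")
  for (s in starts) {
    first <- s + 2                                  # first possible intervening position
    if (first > n) next
    for (e in first:min(n, first + max_len)) {
      if (tokens[e] %in% c(".", "!", "?")) break    # sentence ended before "reasons"
      if (tokens[e] == "reasons") {
        idx <- seq_len(e - first) + first - 1       # empty when nothing intervenes
        return(list(index = idx, elements = tokens[idx]))
      }
    }
  }
  NULL                                              # no match
}

r1 <- find_between(c("Out", "of", "specific", "reasons", "."))
r2 <- find_between(c("Out", "of", "very", "specific", "reasons", "."))
r3 <- find_between(c("out", "of", "curiosity", ".", "He", "had", "his", "reasons"))
r1$elements  # "specific"
r2$elements  # "very" "specific"
is.null(r3)  # TRUE
```

Lowering max_len makes the search give up earlier, so a long gap between "of" and "reasons" produces NULL rather than a result.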

How to change certain words into the tagged forms of training lists

I tried to change certain strings into the forms of tagged words listed in train.
train = c('love/POS','happy/POS','sad/NEG','fearsome/NEG','lazy/NEG')
test = c('I love you', 'I am so happy now', 'You look sad somehow', 'the lazy boy look so fearsome')
With them, I wanted to make an outcome like
[1]'I love/POS you' 'I am so happy/POS now' 'You look sad/NEG somehow' 'the lazy/NEG boy look so fearsome/NEG'
Of course, I can use gsub like this as a primitive way
part1 = gsub('love', 'love/POS', test)
part2 = gsub('happy', 'happy/POS', part1)
.......
However, this way is not productive at all when I have bigger training lists.
In order to make it possible in a more effective way, I tried
process1 = unlist(strsplit(test, '[[:space:]]+'))
mgsub <- function(pattern, replacement, x, ...) {
if (length(pattern)!=length(replacement)) {
stop("pattern and replacement do not have the same length.")
}
result <- x
for (i in 1:length(pattern)) {
result <- gsub(pattern[i], replacement[i], result, ...)
}
result
}
trainedtest = mgsub(process1, train, test)
trainedtest
In fact, it does not work at all because process1 and train do not have the same length. Technically, I should write a program that selects the words to change into the tagged forms from train by calculating similarities between process1 and train.
Is there any way to make it possible?
Here is a base R solution using match with nomatch = 0 (i.e. return nothing for no match - default is NA)
v1 <- sub('/.*', '', train)
sapply(strsplit(test, ' '), function(i)
{i[grepl(paste(v1, collapse = '|'), i)] <- train[match(i, v1, nomatch = 0)];
paste(i, collapse = ' ')})
#[1] "I love/POS you" "I am so happy/POS now" "You look sad/NEG somehow"
#[4] "the lazy/NEG boy look so fearsome/NEG"
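One thing to be aware of in the snippet above is that grepl() matches partial words while match() requires whole words, so a word such as "lovely" would be selected by one and not found by the other. A minimal sketch that uses whole-word lookup on both sides, assuming words are separated by single spaces and with the hypothetical helper name tag_sentence, could look like this:

```r
train <- c("love/POS", "happy/POS", "sad/NEG", "fearsome/NEG", "lazy/NEG")
test <- c("I love you", "I am so happy now", "You look sad somehow",
          "the lazy boy look so fearsome")

# Named lookup: bare word -> tagged form
tags <- setNames(train, sub("/.*", "", train))

# Hypothetical helper: tag every word of one sentence that appears in `tags`.
tag_sentence <- function(s) {
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  hit <- w %in% names(tags)       # whole-word matches only
  w[hit] <- tags[w[hit]]
  paste(w, collapse = " ")
}

tagged <- vapply(test, tag_sentence, character(1), USE.NAMES = FALSE)
tagged
```

Words that are not in the training list, including longer words that merely contain a training word, are left untouched.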
If you want to replace multiple patterns with desired strings use gsubfn:
require(gsubfn)
input = c("I love you", "I am so happy now")
toreplace<-list("love" = "love/POS", "happy" = "happy/POS")
gsubfn(paste(names(toreplace),collapse="|"),toreplace, input)

String splitting in R Programming

Currently the script below splits a combined item code into specific item codes.
rule2 <- c("MR")
df_1 <- test[grep(paste("^",rule2,sep="",collapse = "|"),test$Name.y),]
SpaceName_1 <- function(s){
num <- str_extract(s,"[0-9]+")
if(nchar(num) >3){
former <- substring(s, 1, 4)
latter <- strsplit(substring(s,5,nchar(s)),"")
latter <- unlist(latter)
return(paste(former,latter,sep = "",collapse = ","))
}
else{
return (s)
}
}
df_1$Name.y <- sapply(df_1$Name.y, SpaceName_1)
Example,
Combined item code: Room 324-326 is splitting into MR324 MR325 MR326.
However for this particular Combined item code: Room 309-311 is splitting into MR309 MR300 MR301.
How should I amend the script to give me MR309 MR310 MR311?
You can try something along these lines:
range <- "324-326"
x <- as.numeric(unlist(strsplit(range, split="-")))
paste0("MR", seq(x[1], x[2]))
[1] "MR324" "MR325" "MR326"
I assume that you can obtain the numerical room sequence by some means, and then use the snippet I gave you above.
If your combined item codes always have the form Room xxx-yyy, then you can extract the range using gsub:
range <- gsub("Room ", "", "Room 324-326")
If your item codes were in a vector called codes, then you could obtain a vector of ranges using:
ranges <- sapply(codes, function(x) gsub("Room ", "", x))
We can also evaluate the string after replacing the - with : and then paste the prefix "MR".
paste0("MR", eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)", "\\1:\\2", range))))
#[1] "MR324" "MR325" "MR326"
Wrap it as a function for convenience
fChange <- function(prefixStr, RangeStr){
paste0(prefixStr, eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)",
"\\1:\\2", RangeStr))))
}
fChange("MR", range)
fChange("MR", range1)
#[1] "MR309" "MR310" "MR311"
For multiple elements, just loop over and apply the function
sapply(c(range, range1), fChange, prefixStr = "MR")
data
range <- "Room 324-326"
range1 <- "Room 309-311"
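If evaluating parsed strings feels too heavy, the range bounds can be pulled out as numbers instead. This is a sketch only: expand_rooms is a hypothetical name, and it assumes each code contains exactly two numbers, the lower bound first.

```r
# Hypothetical helper: expand "Room 309-311" into MR309, MR310, MR311.
expand_rooms <- function(code, prefix = "MR") {
  # extract every run of digits from the code
  bounds <- as.integer(regmatches(code, gregexpr("[0-9]+", code))[[1]])
  paste0(prefix, seq(bounds[1], bounds[2]))
}

rooms <- expand_rooms("Room 309-311")
rooms
# [1] "MR309" "MR310" "MR311"

# several codes at once: returns a list with one vector per code
lapply(c("Room 324-326", "Room 309-311"), expand_rooms)
```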

compare string in R

I'm currently working on a programming project in R (for school) and I'm using a data set made of a large quantity of LastFm users (an application that collects data when you're using a media player).
I want to work on an eventual link between 2 variables present in the dataset which are the "nickname" and the "real name". To do so, I would like to compute a variable that represents the rate of similarity between the characters.
As an example take one individual (regardless of the other variables):
name = 'chris meller'
nickname = 'mellertime'
So far, I have tried to sort the strings in order to check for identical characters one by one, but I'm stuck here. What I found is just a way to check if "name" is present inside "nickname" with different kinds of functions.
>paste(sort(unlist(strsplit(name, ""))), collapse = "")
[1] " ceehillmrrs"
>paste(sort(unlist(strsplit(nickname, ""))), collapse = "")
[1] "eeeillmmrt"
What I would like to know is if there is a way to count the number of identical letters between 2 character strings, regardless of the order?
I would like to end with something like this:
function(a,b)
[1] 0.63
# a,b are 2 character strings
where the result is the number of identical characters between the two strings divided by the number of characters in the real name.
Try this:
SimilarityRatio <- function(wholeName, nickname, matchCase) {
n1 <- sort(strsplit(paste(strsplit(wholeName, " ")[[1]], collapse = ""), "")[[1]])
n2 <- sort(strsplit(paste(strsplit(nickname, " ")[[1]], collapse = ""), "")[[1]])
if (!matchCase) {
n1 <- tolower(n1)
n2 <- tolower(n2)
}
MyLen <- tempLen <- length(n1)
j <- 1L
numMatch <- 0L
while (j <= tempLen) {
test1 <- n1[j] %in% n2
if (test1) {
myRemove <- min(which(n2 %in% n1[j]))
n1 <- n1[-j]
n2 <- n2[-myRemove]
numMatch <- numMatch + 1L
tempLen <- tempLen - 1L
} else {
j <- j+1L
}
}
numMatch/MyLen
}
Below are some test cases:
> SimilarityRatio("chris meller", "mellertime", FALSE)
[1] 0.6363636
> SimilarityRatio("SuperMan3000", "The3Musketeers", FALSE)
[1] 0.5
> SimilarityRatio("SuperMan3000", "The3Musketeers", TRUE)
[1] 0.4166667
> SimilarityRatio("should a garbage collection be performed immediately", "same expression can vary considerably depending on whether", FALSE)
[1] 0.7608696
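The same ratio can be sketched without the while loop by counting each character in both strings and summing the smaller of the two counts. The function name similarity_ratio and its argument names are hypothetical; spaces are dropped, as in the answer above.

```r
# Hypothetical vectorised variant: shared characters (order ignored)
# divided by the number of non-space characters in `name`.
similarity_ratio <- function(name, nickname, match_case = FALSE) {
  clean <- function(s) {
    s <- gsub(" ", "", s, fixed = TRUE)            # drop spaces
    if (!match_case) s <- tolower(s)
    strsplit(s, "", fixed = TRUE)[[1]]             # split into single characters
  }
  a <- clean(name)
  b <- clean(nickname)
  chars <- union(a, b)
  # per-character counts, aligned on the same character set
  count_a <- tabulate(match(a, chars), nbins = length(chars))
  count_b <- tabulate(match(b, chars), nbins = length(chars))
  sum(pmin(count_a, count_b)) / length(a)
}

similarity_ratio("chris meller", "mellertime")
# [1] 0.6363636
similarity_ratio("SuperMan3000", "The3Musketeers", match_case = TRUE)
# [1] 0.4166667
```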
