I'm using regex in R on an echocardiographic dataset. I want to detect cases where a phenomenon called "SAM" is seen, and obviously I want to exclude cases like "no SAM",
so I wrote these lines:
pattern_sam <- regex("(?<!no )sam", ignore_case = TRUE)
str_view_all(echo_1_lvot$description_echo, pattern_sam, match = TRUE)
It effectively removes 99.9% of the cases with "no SAM", yet for some reason I still get 3 cases of "no SAM" (see the following image).
Now the weird thing is that if I simply copy-paste these strings into a new dataset, the problem goes away...
sam_test <- tibble(description_echo = c(
"There is asymmetric septal hypertrophy severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compatible with type III HCM",
"-Normal LV size with mild to moderate systolic dysfunction,EF=45%,severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compa"
))
str_view_all(sam_test$description_echo, pattern_sam)
The same thing happens when I try to detect other patterns.
Does anyone have any idea what the underlying problem is and how it can be fixed?
P.S.:
Here is the .xls file (I only included the problematic string), if you want to see for yourself.
The funny thing is that when I manually remove the "no SAM" from the .xls and retype it in exactly the same place, the problem goes away. I still have no idea what is wrong; could it be the text format?
You can match any whitespace, even Unicode whitespace characters, with \s, since you are using the ICU regex flavor (it is used by all stringr/stringi regex functions):
pattern_sam <- regex("(?<!no\\s)sam", ignore_case = TRUE)
To match any non-word characters, including some non-printable characters, use
regex("(?<!no\\W)sam", ignore_case = TRUE)
Besides, if there can be several such characters between the words, you may use a constrained-width lookbehind (available in ICU and Java):
pattern_sam <- regex("(?<!no\\s{1,10})sam", ignore_case = TRUE)
pattern_sam <- regex("(?<!no\\W{1,10})sam", ignore_case = TRUE)
Here, from 1 to 10 such characters are allowed between no and sam.
And if you need to match whole words, add \b, the word boundary:
pattern_sam <- regex("(?<!\\bno\\s{1,10})sam\\b", ignore_case = TRUE)
pattern_sam <- regex("(?<!\\bno\\W{1,10})sam\\b", ignore_case = TRUE)
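The usual cause of the original symptom is an invisible character, for example a non-breaking space (U+00A0) carried over from Excel, sitting between "no" and "SAM". A minimal base-R sketch of the diagnosis (made-up strings; perl = TRUE is used here for the lookbehind, while in stringr ICU's \s would cover the NBSP directly):

```r
# A normal space vs. a non-breaking space (U+00A0) between "no" and "SAM";
# both print identically in the console.
strings <- c("no SAM seen", "no\u00a0SAM seen")

# Lookbehind with a literal space: the NBSP string slips through
grepl("(?<!no )sam", strings, perl = TRUE, ignore.case = TRUE)
# -> FALSE  TRUE

# Allowing the NBSP explicitly catches both
grepl("(?<!no[ \u00a0])sam", strings, perl = TRUE, ignore.case = TRUE)
# -> FALSE FALSE
```

This is also why retyping "no SAM" by hand fixed the problem: the retyped version contains an ordinary space.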
Before I get started, I would like you to know that I am completely new to coding in R. For a group assignment our professor set up a database by scraping data from Amazon. Within the database, which is called 'dat', there is a column named 'product_name'. We were given a set group of utilitarian words, and I think you can guess where this is going: within the column 'product_name' we have to find, for each product name, whether any of the utilitarian words appears and, if yes, how many times. We were given the following code by our professor to use for this assignment:
nb_words <- function(lexicon, corpus){
  rowSums(sapply(lexicon, function(x) grepl(x, corpus)))
}
after which I created the following code:
uti_words <- c("additives", "antioxidant", "artificial", "busy", "calcium", "calories", "carb", "carbohydrates", "chemicals", "cholesterol", "convenient", "dense", "diet", "fast")
sentences <- dat$product_name
nb_words(lexicon = uti_words, corpus = sentences)
When I ran nb_words, however, I noticed something went wrong. A sentence contained the word 'breakfast', and my code counted this as a match because the word 'fast' from 'uti_words' matched it. I don't want this to happen; does anyone know how to make it so that I only get exact matches and no partial matches?
We may add word boundaries (\\b) to avoid partial matches:
uti_words <- paste0("\\b", trimws(uti_words), "\\b")
Another option is to add fixed = TRUE to the grepl call, which treats each word as a literal string rather than a regex (useful if a word contains regex metacharacters). Note, however, that it does not prevent partial matches such as "fast" in "breakfast" on its own, and it cannot be combined with the \\b patterns above, since the backslashes would then be matched literally:
nb_words <- function(lexicon, corpus){
  rowSums(sapply(lexicon, function(x) grepl(x, corpus, fixed = TRUE)))
}
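A quick check of the word-boundary fix on the 'breakfast' case, using toy sentences and a shortened word list:

```r
nb_words <- function(lexicon, corpus){
  rowSums(sapply(lexicon, function(x) grepl(x, corpus)))
}

sentences <- c("quick breakfast", "low carb diet")
uti_words <- c("fast", "carb", "diet")

nb_words(uti_words, sentences)
# -> 1 2   ("breakfast" is wrongly counted as a hit for "fast")

nb_words(paste0("\\b", trimws(uti_words), "\\b"), sentences)
# -> 0 2   (only whole-word matches remain)
```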
Is it possible to use a pattern like this (see below) in grepl()?
(poverty OR poor) AND (eradicat OR end OR reduc OR alleviat) AND extreme
The goal is to determine if a sentence meets the pattern using
ifelse(grepl(pattern, x, ignore.case = TRUE), "Yes", "No")
For example, if x = "end extreme poverty in the country", it will return "Yes", while if x = "end poverty in the country", it will return "No".
An earlier post here works only for single words like poor AND eradicat AND extreme, but does not work for my case. Is there any way to achieve my goal?
I tried pattern = "(?=.*poverty|poor)(?=.*eradicat|end|reduce|alleviate)(?=.*extreme)", but it does not work: the error is 'Invalid regexp'.
To apply all 3 assertions, you can group the alternatives using a non-capturing group:
^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+
^ Start of string
(?=.*(?:poverty|poor)) Assert either poverty OR poor
(?=.*extreme) Assert extreme
(?=.*(?:eradicat|end|reduc|alleviat)) Assert either eradicat OR end OR reduc OR alleviat
.+ Match the whole line
For grepl, you have to use perl = TRUE to enable PCRE, which supports the lookaheads:
grepl('^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+', v, perl = TRUE)
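Checking the pattern against the two example sentences from the question:

```r
v <- c("end extreme poverty in the country", "end poverty in the country")
pattern <- "^(?=.*(?:poverty|poor))(?=.*extreme)(?=.*(?:eradicat|end|reduc|alleviat)).+"

ifelse(grepl(pattern, v, perl = TRUE, ignore.case = TRUE), "Yes", "No")
# -> "Yes" "No"   (the second sentence lacks "extreme")
```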
I'm fairly new to R after switching from SPSS, but I need to use R for this project. I am reading in data from an Excel file of people, and the unique identifier for each person is their UK National Insurance Number (NINO), but I need to delete any rows that don't contain the NINO in the correct format, i.e. AB123456A.
Here are some of the "NINOs" listed in the data which I need to remove, as they don't match the format exactly:
******69B
cms1234
BCN8888855555
AB 123456 A
NA
I found this regex online to validate the format of the NINO.
/^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$/i
I've tried running it in the code below, but while no error messages are displayed, it doesn't remove any rows from the dataset either.
DEP_Programmes %>%
filter(!grepl("/^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$/i", DEP_Programmes$NiNo)) %>%
count(Programme)
Any suggestions? Please and thanks.
From Wikipedia:
The format of the [National Insurance] number is two prefix letters,
six digits and one suffix letter. [...] Neither of the first two letters can be D, F, I, Q, U
or V. The second letter also cannot be O. The prefixes BG, GB, NK, KN,
TN, NT and ZZ are not allocated. [...] The suffix letter is either A,
B, C, or D
(source).
So the regex you found is almost correct, but the leading and trailing characters (the /.../ delimiters and the i flag) should be removed. A little test:
regex <- "^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$"
test <- c("AB123456A", "******69B", "cms1234", "BCN8888855555",
"AB 123456 A", NA, "QQ123456C")
grepl(regex, test)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Check this cheatsheet for reference.
Inside your original code this should look like that:
DEP_Programmes %>%
filter(grepl("^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$", NiNo)) %>% # omit DEP_Programmes$ inside dplyr pipe
count(Programme)
Note that within filter, TRUE values are kept and FALSE values are removed. By adding a leading ! you invert the selection, meaning that your TRUE values are removed (which I understand you don't want). That is also why your original code removed nothing: the pattern was not the R flavour of regex but carried another language's delimiters, so every string returned FALSE, and inverting that kept them all.
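To see the keep/invert behaviour concretely, here is a base-R sketch with the example values from the question (no dplyr needed):

```r
NiNo <- c("AB123456A", "******69B", "cms1234", "BCN8888855555",
          "AB 123456 A", NA)
nino_re <- "^[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z][0-9]{6}[A-D]$"

keep <- grepl(nino_re, NiNo)  # NA yields FALSE, so NA rows are dropped as well
NiNo[keep]   # what filter(grepl(...)) keeps: "AB123456A"
NiNo[!keep]  # what filter(!grepl(...)) would keep instead: everything else
```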
The regex passed to grepl does not take delimiters, and if you want case-insensitive behaviour you should use the ignore.case argument rather than the /i flag:
DEP_Programmes %>%
filter(!grepl("^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$", NiNo, ignore.case=TRUE)) %>%
count(Programme)
Note: your current regex, with its leading / and trailing /i, looks a lot like PHP or JavaScript syntax.
I have text as follows:
".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"
I would like any full stop followed by a letter (upper or lower case) to be replaced by a full stop, a newline and then the letter, so that the output should be:
".\nOESOPHAGUS: inflammation. .\nSTOMACH: Lots of information here. .\nDUODENUM: Some more information. .\nENDOSCOPIC DIAGNOSIS blabla"
I tried:
gsub("\\..*?([A-Za-z])","\\.\n\\1",MyData$Algo)
but this gives me:
".\nESOPHAGUS: inflammation.\nTOMACH: Lots of information here.DUODENUM: Some more information.\nNDOSCOPIC DIAGNOSIS blabla"
The problem seems to be in the matching of the ranges as specified. Is there a way to do this find-and-replace? I am not tied to gsub.
Perl-compatible regular expressions (PCRE, enabled with perl = TRUE) work well in this example:
a = ".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"
gsub("\\..*?([A-Za-z])", ".\n\\1", a, perl = TRUE)
# output:
".\nOESOPHAGUS: inflammation.\nSTOMACH: Lots of information here.\nDUODENUM: Some more information.\nENDOSCOPIC DIAGNOSIS blabla"
I am unsure why the lazy matching acts as it does when perl = FALSE; base R's default TRE engine is known to handle lazy quantifiers differently from PCRE.
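For reference, a minimal reproduction on a shortened string: with perl = TRUE the lazy .*? stops at the first letter after the full stop, as intended.

```r
x <- ".OESOPHAGUS: inflammation. STOMACH: here.DUODENUM: more."

gsub("\\..*?([A-Za-z])", ".\n\\1", x, perl = TRUE)
# -> ".\nOESOPHAGUS: inflammation.\nSTOMACH: here.\nDUODENUM: more."
```

Note that .*? also consumes the space between the full stop and the letter, which is why ". S" becomes ".\nS" with no space.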
I'm not sure why you want . . instead of just .\n; this works for the latter:
gsub('[.]\\s*([a-zA-Z])', '.\n\\1', str)
# [1] ".\nOESOPHAGUS: inflammation.\nSTOMACH: Lots of information here.\nDUODENUM: Some more information.\nENDOSCOPIC DIAGNOSIS blabla"
When printed to console with cat, this looks like:
cat(gsub('[.]\\s*([a-zA-Z])', '.\n\\1', str))
# .
# OESOPHAGUS: inflammation.
# STOMACH: Lots of information here.
# DUODENUM: Some more information.
# ENDOSCOPIC DIAGNOSIS blabla
I can't explain either why .*? isn't doing what you want. But there's no reason to use . in this case, since you do have restrictions on the type of character you'd like to match between the full stop and the letter (I assumed whitespace, \s, would suffice).
I am trying to get the part of a sentence that comes before certain words.
For example, the sentences are:
x <- c("Ace Bayou Reannounces Recall of Bean Bag Chairs Due to Low Rate of Consumer Response; Two Child Deaths Previously Reported; Consumers Urged to Install Repair", "Panasonic Recalls Metal Cutter Saws Due to Laceration Hazard")
Now I want to get the part in front of Recall or Recalls. I have tried various R functions like grep, grepl, pmatch and str_split; however, I could not get exactly what I want.
What I need is only:
"Ace Bayou Reannounces"
"Panasonic"
Any help would be appreciated.
If you want to use regular expressions:
gsub(pattern = "(.*(?=Recall(s*)))(.*)", replacement = "\\1", x = x, perl = TRUE)
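Note that the captured group keeps the trailing space before "Recall"; wrapping the call in trimws() removes it. On shortened versions of the example strings:

```r
x <- c("Ace Bayou Reannounces Recall of Bean Bag Chairs",
       "Panasonic Recalls Metal Cutter Saws")

trimws(gsub("(.*(?=Recall(s*)))(.*)", "\\1", x, perl = TRUE))
# -> "Ace Bayou Reannounces" "Panasonic"
```

A lookahead-free alternative with the same effect here would be sub("\\s*Recalls?.*", "", x), which simply deletes everything from "Recall"/"Recalls" onward.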