How to find and replace for ranges with gsub

I have text as follows:
".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"
I would like to replace every full stop that is followed by a letter (upper or lower case) with a full stop, a newline, and then the letter, so that the output is:
".\nOESOPHAGUS: inflammation. .\nSTOMACH: Lots of information here. .\nDUODENUM: Some more information. .\nENDOSCOPIC DIAGNOSIS blabla"
I tried:
gsub("\\..*?([A-Za-z])","\\.\n\\1",MyData$Algo)
but this gives me:
".\nESOPHAGUS: inflammation.\nTOMACH: Lots of information here.DUODENUM: Some more information.\nNDOSCOPIC DIAGNOSIS blabla"
The problem seems to be in how the ranges are matched. Is there a way to do this find-and-replace? I am not tied to gsub.

Perl Compatible Regular Expressions (PCRE) should work well in this example.
a <- ".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"
gsub("\\..*?([A-Za-z])", "\\.\n\\1", a, perl = TRUE)
#output:
".\nOESOPHAGUS: inflammation.\nSTOMACH: Lots of information here.\nDUODENUM: Some more information.\nENDOSCOPIC DIAGNOSIS blabla"
I am unsure why the lazy matching behaves as it does when perl = FALSE.

I'm not sure why you want . . instead of just .\n, this works for the latter:
gsub('[.]\\s*([a-zA-Z])', '.\n\\1', str)
# [1] ".\nOESOPHAGUS: inflammation.\nSTOMACH: Lots of information here.\nDUODENUM: Some more information.\nENDOSCOPIC DIAGNOSIS blabla"
When printed to console with cat, this looks like:
cat(gsub('[.]\\s*([a-zA-Z])', '.\n\\1', str))
# .
# OESOPHAGUS: inflammation.
# STOMACH: Lots of information here.
# DUODENUM: Some more information.
# ENDOSCOPIC DIAGNOSIS blabla
I can't explain either why .*? isn't doing what you want. But there's no reason to use . in this case, since you do have restrictions on the type of character you'd like to match between the full stop and the letter (I assumed white space \s would suffice).
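A minimal sketch of the same idea written with a lookahead, assuming perl = TRUE is acceptable. The variable name report is made up for illustration. The lookahead (?=[A-Za-z]) checks that a letter follows without consuming it, so the replacement needs no capture group:

```r
# Sample text from the question (the name `report` is illustrative).
report <- ".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"

# Match a full stop plus optional whitespace, but only when a letter follows.
# The lookahead does not consume the letter, so it stays where it is.
with_lookahead <- gsub("\\.\\s*(?=[A-Za-z])", ".\n", report, perl = TRUE)

# The capture-group version from the answer above, for comparison.
with_capture <- gsub("[.]\\s*([a-zA-Z])", ".\n\\1", report)

# Both approaches should give the same string.
print(identical(with_lookahead, with_capture))
cat(with_lookahead)
```

Either form avoids the lazy .*? entirely, which sidesteps the difference between the default and PCRE engines.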

Related

How can I dynamically get words surrounding a keyword?

I have a sentence that may contain keywords. I search for them and, if one is found, I want the word before and the word after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names function but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to work without typing the keywords in directly?
Final note: I do not completely understand the syntax used to get the surrounding words. Basic R books have not covered this. Can someone explain, please?

You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
  targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
  unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
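Since the question also asks what the pattern syntax means, here is a hedged sketch that builds the same pattern from named pieces. The names word_before and word_after are made up for illustration, and base R's regexpr with perl = TRUE stands in for str_extract:

```r
text <- "this failed to increase incomes and production did not improve"

# One run of non-whitespace characters followed by whitespace: the word before.
word_before <- "([^\\s]+\\s+){1}"
# Whitespace followed by one run of non-whitespace characters: the word after.
word_after <- "(\\s+[^\\s]+){1}"

# Glue the pieces around a single keyword; {1} just means "exactly once".
pattern <- paste0(word_before, "did not", word_after)

# Base R counterpart of str_extract(); perl = TRUE so that \s works inside [ ].
hit <- regmatches(text, regexpr(pattern, text, perl = TRUE))
print(hit)
```

Because the keyword is pasted in as an ordinary string, swapping "did not" for each element of cont gives the vector of patterns used above.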

Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"

Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.
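One possible way to form that expression programmatically, sketched in base R. It assumes the keywords contain no regex metacharacters and that the text is lower case, as in the example. The two sub() calls are just one way to pull out the capture groups:

```r
cont <- c("could not", "would not", "does not", "will not",
          "do not", "were not", "was not", "did not")

# Join the keywords into a single alternation: could not|would not|...
alternation <- paste(cont, collapse = "|")
pattern <- paste0("\\b([a-z]+) +(?:", alternation, ") +([a-z]+)\\b")

text <- "she could not believe that production did not improve"

# All full matches in the string.
matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# Re-apply the pattern to each match to read out capture groups 1 and 2.
word_before <- sub(pattern, "\\1", matches, perl = TRUE)
word_after  <- sub(pattern, "\\2", matches, perl = TRUE)

print(matches)
print(word_before)
print(word_after)
```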

Delete rows that don't match National Insurance Number format using regex

I'm fairly new to R after switching from SPSS, but I need to use R for this project. I am reading in data on people from an Excel file, and the unique identifier for each person is their UK National Insurance Number (NINO). I need to delete any rows that don't contain the NINO in the correct format, i.e. AB123456A.
Some of the "NINOs" listed in the data, which I need to remove because they don't match the format exactly:
******69B
cms1234
BCN8888855555
AB 123456 A
NA
I found this regex online to validate the format of the NINO.
/^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$/i
I've tried running it in the code below, but while no error messages are displayed, it doesn't remove any rows from the dataset either.
DEP_Programmes %>%
  filter(!grepl("/^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$/i", DEP_Programmes$NiNo)) %>%
  count(Programme)
Any suggestions? Please and thanks.

From Wikipedia:
The format of the [National Insurance] number is two prefix letters,
six digits and one suffix letter. [...] Neither of the first two letters can be D, F, I, Q, U
or V. The second letter also cannot be O. The prefixes BG, GB, NK, KN,
TN, NT and ZZ are not allocated. [...] The suffix letter is either A,
B, C, or D
(source).
So the regex you found is almost correct, but the leading / and the trailing /i should be removed. A little test:
regex <- "^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$"
test <- c("AB123456A", "******69B", "cms1234", "BCN8888855555",
"AB 123456 A", NA, "QQ123456C")
grepl(regex, test)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Check this cheatsheet for reference.
Inside your original code it should look like this:
DEP_Programmes %>%
  filter(grepl("^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$", NiNo)) %>% # omit DEP_Programmes$ inside dplyr pipe
  count(Programme)
Note that filter keeps the rows where the condition is TRUE and removes those where it is FALSE. A leading ! inverts the selection, so the rows that match would be removed (which I understand you don't want). That is also why your original code removed nothing: the pattern was written in another language's regex literal syntax rather than R's, so grepl returned FALSE for every string, and inverting that kept every row.

The regex passed into grepl does not take any delimiters, and also if you want case insensitive behavior you should use the ignore.case option rather than /i:
DEP_Programmes %>%
  filter(grepl("^[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-D]{1}$", NiNo, ignore.case=TRUE)) %>% # no leading !: keep only the rows that match
  count(Programme)
Note: Your current regex looks a lot like either PHP or JavaScript.
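A small base R sketch of the same filtering step, for checking the pattern without dplyr. The data frame people and its contents are invented for illustration. Only the column name NiNo comes from the question, and the {1} quantifiers are dropped because they are redundant:

```r
# Hypothetical data; only the column name NiNo is taken from the question.
people <- data.frame(
  NiNo = c("AB123456A", "ab123456a", "******69B", "AB 123456 A", NA),
  Programme = c("X", "X", "Y", "Y", "Z"),
  stringsAsFactors = FALSE
)

# No delimiters and no /i flag: R takes the bare pattern plus ignore.case.
nino_regex <- "^[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z][0-9]{6}[A-D]$"
valid <- grepl(nino_regex, people$NiNo, ignore.case = TRUE)

# grepl() returns FALSE for NA, so missing values are dropped as well.
clean <- people[valid, ]
print(clean)
```

Inspecting valid before subsetting is a quick way to confirm which rows would be kept.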

Extract first letter in each word in R

I had a data.frame with some categorical variables. Let's suppose sentences is one of these variables:
sentences <- c("Direito à participação e ao controle social",
"Direito a ser ouvido pelo governo e representantes",
"Direito aos serviços públicos",
"Direito de acesso à informação")
For each value, I would like to extract just the first letter of each word, ignoring words with 4 letters or fewer (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:
[1] "DPCS", "DOGR", "DSP", "DAI"
I tried to subset with stringr using a regex pattern found here:
library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)
But I got an error when creating the pattern object:
Error: '\w' is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"
What am I doing wrong?
Thanks in advance for any help.

You can use gsub to delete all the unwanted characters and keep the ones you want. From the expected output, it seems you are still using characters from words that are 3 characters long:
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DSOPGR" "DASP" "DAI"
But if we were to ignore the words you indicated then it would be:
gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DOGR" "DSP" "DAI"

@Onyambu's answer is great, though as a regular expression beginner it took me a long time to understand it well enough to modify it for my own needs.
Here is my understanding of gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE).
I am posting it in the hope that it helps others.
Background information:
\\b: boundary of word
\\pL matches any kind of letter from any language
{4,} is an occurrence indicator
{m}: The preceding item is matched exactly m times.
{m,}: The preceding item is matched m or more times, i.e., m+
{m,n}: The preceding item is matched at least m times, but not more than n times.
| is the OR (alternation) operator
. represents any one character except newline.
\\U\\1 in the replacement text reinserts the text captured by group 1 and converts it to upper case. Note that parentheses () create a numbered capturing group in the pattern.
With all this background, the command can be read as:
replace each word matching \\b(\\pL)\\pL{4,} (a word of 5 or more letters) with its first letter, upper-cased
replace any other single character with "", because the . branch captures nothing, so \\1 is empty
Here are two great places I learned all these backgrounds.
https://www.regular-expressions.info/rlanguage.html
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
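To see the two branches at work, here is a small sketch on an ASCII-only phrase. The phrase is made up so that the result does not depend on encoding. Changing the number in {4,} moves the minimum word length:

```r
# Invented ASCII phrase, loosely modelled on the sentences in the question.
phrase <- "right to be heard by the government"

# Branch 1: a word of 5+ letters is replaced by its first letter (group 1).
# Branch 2: any other single character is replaced by an empty group 1.
long_words <- gsub("\\b(\\pL)\\pL{4,}|.", "\\U\\1", phrase, perl = TRUE)

# With {2,} words of 3+ letters qualify, so "the" now contributes a letter.
medium_words <- gsub("\\b(\\pL)\\pL{2,}|.", "\\U\\1", phrase, perl = TRUE)

print(long_words)
print(medium_words)
```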

You can use this pattern: (?<=^| )\S(?=\pL{4,})
I used a positive lookbehind to make sure the matches are preceded by either a space or the beginning of the line. Then I match one character, only if it is followed by 4 or more letters, hence the positive lookahead.
I suggest you don't use \w for non-English languages, because it won't match any characters with accents. Instead, \pL matches any letter from any language.
Once you have your matches, you can just concatenate them to create your strings (dpcs, dogr, etc...)
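A hedged sketch of how this pattern might be applied in base R, again on an invented ASCII phrase. gregexpr with perl = TRUE finds the matches, and paste and toupper do the concatenation:

```r
# Invented ASCII phrase for illustration.
phrase <- "right to be heard by the government"

# Lookbehind: preceded by start of string or a space.
# Lookahead: followed by at least 4 letters, so the word has 5+ characters.
pattern <- "(?<=^| )\\S(?=\\pL{4,})"
first_letters <- regmatches(phrase, gregexpr(pattern, phrase, perl = TRUE))[[1]]

# Concatenate the matched letters and upper-case them.
acronym <- toupper(paste(first_letters, collapse = ""))
print(acronym)
```

For a vector of sentences, the same steps could be wrapped in sapply or vapply.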

how to only use sub on when there are multiple values

So this is a short example of a dataframe:
x<- c("WB (16)","CT (14)WB (15)","ET (13)CITG-TILm (16)EE-SS (17)TN-SE (17)")
My question is how to get sub(".*?)", "", x) (or a different function) to work so that this is the result:
x<-c("WB (16)","WB (15)","TN-SE(17)")
instead of
x<-c("","WB (15)")
The codes are not limited to WB, CT and TN-SE; there are others, such as:
"NBIO(15)" "CITG-TP(08)" "BK-AR(10)"
So it should be a general function...
Thanks!

Could you please try following.
sub(".*[0-9]+[^)]\\)?([^)$])", "\\1", x)
Output will be as follows.
[1] "WB (16)" "WB (15)" "TN-SE (17)"
Where Input will be as follows.
> x
[1] "WB (16)" "CT (14)WB (15)"
[3] "ET (13)CITG-TILm (16)EE-SS (17)TN-SE (17)"
Explanation: the pattern is broken down below for explanation purposes only.
sub(" ##sub(pattern, replacement, x) from base R replaces the first match in each element.
.* ##Greedily matches everything up to the last place where the rest of the pattern can still match.
[0-9]+[^)] ##At least one digit followed by one more character that is not ); on this data that is the number inside an earlier pair of parentheses, e.g. 14.
\\)? ##An optional literal ), the one that closes that earlier pair.
([^)$])", ##Capture group 1: a single character that is neither ) nor a literal $, i.e. the first character of the last entry.
"\\1", ##Replace the whole match with group 1, so everything before the last entry is dropped.
x) ##The vector to work on. Elements with a single entry do not match and are returned unchanged.

I think that I understand what you want. This certainly works on your example.
sub(".*?([^()]+\\(\\d+\\))$", "\\1", x)
[1] "WB (16)" "WB (15)" "TN-SE (17)"
Details: This looks for something of the form SomeStuff (Numbers) at the end of the string and throws away anything before it. SomeStuff is not allowed to contain parentheses.
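A quick check of this pattern against the other code styles mentioned in the question (no space before the parenthesis). The vector codes is made up from those examples, and perl = TRUE is used here as an assumption so that the lazy .*? follows PCRE semantics:

```r
# Invented combinations of the code styles listed in the question.
codes <- c("NBIO(15)", "CT (14)CITG-TP(08)", "ET (13)EE-SS (17)BK-AR(10)")

# Keep only the final "SomeStuff(Numbers)" entry of each element.
last_entry <- sub(".*?([^()]+\\(\\d+\\))$", "\\1", codes, perl = TRUE)
print(last_entry)
```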

R Regex to identify and replace characters between multiple dots

I have the following codes
"ABC.A.SVN.10.10.390.10.UDGGL"
"XYZ.Z.SVN.11.12.111.99.ASDDL"
and I need to replace the characters that exist between the 2nd and the 3rd dot. In this case it is SVN but it may well be any combination of between A and ZZZ, so really the only way to make this work is by using the dots.
The required outcome would be:
"ABC.A..10.10.390.10.UDGGL"
"XYZ.Z..11.12.111.99.ASDDL"
I tried variants of grep("^.+(\\.\\).$", "ABC.A.SVN.10.10.390.10.UDGGL") but I get an error.
Some examples of what I have tried with no success :
Link 1
Link 2
EDIT
I tried @Onyambu's first method and ran into a variant I had not accounted for: "ABC.A.AB11.1.12.112.1123.UDGGL". The part to be replaced can also contain digits. The desired outcome is "ABC.A..1.12.112.1123.UDGGL", and I get it using sub("\\.\\w+.\\B.",".",x) from the second part of his answer!

See code in use here
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("^(?:[^.]*\\.){2}\\K[^.]*", "", x, perl = TRUE)
^ Assert position at the start of the line
(?:[^.]*\.){2} Match the following exactly twice
[^.]*\. Match any character except . any number of times, followed by .
\K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
[^.]* Match any character except . any number of times
Results in [1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
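Because the pattern only counts dots, it should also cope with the variant from the question's edit, where the third field contains digits. A short sketch, with the replacement text "NEW" chosen arbitrarily to show that the field can be substituted as well as emptied:

```r
x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "ABC.A.AB11.1.12.112.1123.UDGGL")

# Skip the first two dot-terminated fields, then match the third one.
third_field <- "^(?:[^.]*\\.){2}\\K[^.]*"

# Empty the third field.
emptied <- sub(third_field, "", x, perl = TRUE)
# Or put something else in its place ("NEW" is an arbitrary example value).
swapped <- sub(third_field, "NEW", x, perl = TRUE)

print(emptied)
print(swapped)
```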

x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sub("([A-Z]+)(\\.\\d+)","\\2",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"
([A-Z]+) Capture a run of one or more capital letters A-Z
(\\.\\d+) The run above must be followed by a dot (\\.) and then by digits (\\d+). This is the second capture group.
For the string "ABC.A.SVN.10.10.390.10.UDGGL" the match is SVN.10, captured as SVN (group 1) and .10 (group 2). The backreference \\2 replaces the whole of SVN.10 with the second group, .10.
Another logic that will work:
sub("\\.\\w+.\\B.",".",x)
[1] "ABC.A..10.10.390.10.UDGGL" "XYZ.Z..11.12.111.99.ASDDL"

Not exactly regex but here is one more approach
#DATA
S = c("ABC.A.SVN.10.10.390.10.UDGGL", "XYZ.Z.SVN.11.12.111.99.ASDDL")
sapply(X = S,
       FUN = function(str){
           ind = unlist(gregexpr("\\.", str))[2:3]
           paste(c(substring(str, 1, ind[1]),
                   "SUBSTITUTION",
                   substring(str, ind[2], )), collapse = "")
       },
       USE.NAMES = FALSE)
#[1] "ABC.A.SUBSTITUTION.10.10.390.10.UDGGL" "XYZ.Z.SUBSTITUTION.11.12.111.99.ASDDL"
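Along the same non-regex lines, a sketch that splits on the literal dot and overwrites one field by position. The helper name replace_field is made up for illustration. One caveat is that strsplit drops a trailing empty field, so strings ending in a dot would lose that final dot:

```r
# Hypothetical helper: replace the n-th dot-separated field of each string.
replace_field <- function(x, n, replacement = "") {
  vapply(strsplit(x, ".", fixed = TRUE), function(parts) {
    parts[n] <- replacement          # overwrite the chosen field
    paste(parts, collapse = ".")     # glue the fields back together
  }, character(1))
}

x <- c("ABC.A.SVN.10.10.390.10.UDGGL", "ABC.A.AB11.1.12.112.1123.UDGGL")
result <- replace_field(x, 3)
print(result)
```

Using fixed = TRUE treats the dot as a plain character, so no escaping is needed.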
