Extracting string between punctuation, when present - r

I'm trying to extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
The current code is able to do between : and ; OR after : but cannot handle ; alone or : alone.
Also, gsub(|(OF 100); SEE NOTE) isn't working, and I'm not sure why the initial : isn't being excluded and needs the gsub at all.
test<-c("Score (ABC): 2 (of 100); see note","Amount of ABC; 30%","Presence of ABC: negative","ABC not tested")
#works for :/;
toupper((regmatches(toupper(test), gregexpr(":\\s* \\K.*?(?=;)", toupper(test), perl=TRUE))))
#works for :
test<-toupper((regmatches(toupper(test), gregexpr(":\\s* (.*)", toupper(test), perl=TRUE))))
#removes extra characters:
test<-gsub(": |(OF 100); SEE NOTE|%|; ","",test)
#Negative to numeric:
test[grepl("NEGATIVE|<1",test)]<-0
test
Expected result: 2 30 0

Here are some solutions.
The first two are base. The first only uses very simple regular expressions. The second is shorter and the regular expression is only a bit more complicated. In both cases we return NA if there is no match but you can replace NAs with 0 (using ifelse(is.na(x), 0, x) where x is the answer with NAs) afterwards if that is important to you.
The third is almost the same as the second but uses strapply in gsubfn. It returns 0 instead of NA.
1) read.table Replace all colons with semicolons and read it in as semicolon-separated fields. Pick off the second such field and remove the first non-digit and everything after it. Then convert what is left to numeric.
DF <- read.table(text = gsub(":", ";", test),
  as.is = TRUE, fill = TRUE, sep = ";", strip.white = TRUE)
as.numeric(sub("\\D.*", "", DF$V2))
## [1]  2 30 NA NA
2) strcapture Match from the start characters which are not colon or semicolon and then match a colon or semicolon and then match a space and finally capture digits. Return the captured digits converted to numeric.
strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num
## [1]  2 30 NA NA
3) strapply Using the same pattern as in (2) convert the match to numeric and return 0 if the match is empty.
library(gsubfn)
strapply(test, "^[^:;]+[;:] (\\d+)", as.numeric, simplify = TRUE, empty = 0)
## [1]  2 30  0  0
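The NA-to-0 replacement mentioned at the top of this answer can be sketched as follows. This is a minimal example that reuses the strcapture call from (2); the names x and result are only illustrative, and it assumes an R version that provides strcapture.

```r
test <- c("Score (ABC): 2 (of 100); see note", "Amount of ABC; 30%",
          "Presence of ABC: negative", "ABC not tested")

# Capture the digits that follow the first ": " or "; "; NA where nothing matches
x <- strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num

# Replace the NAs with 0, as suggested above
result <- ifelse(is.na(x), 0, x)
result
```

Elements without a delimiter, or without digits right after it, end up as 0.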

Another approach:
out <- gsub('(^.+?[;:][^0-9]+)(\\d+)(.*$)|^.+', '\\2', test)
out[out == ''] <- 0
as.numeric(out)
## [1]  2 30  0  0

Per the OP's description (italics is mine):
extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
I think some of the other suggestions may miss that italicized criteria. So here is the OP's test set with one extra condition at the end to test that:
test <- c("Score (ABC): 2 (of 100); see note",
          "Amount of ABC; 30%",
          "Presence of ABC: negative",
          "...and before a ; if the second punctuation is present, then remove everything after a ; if present [so 666 should not be returned]")
One-liner to return results as requested:
sub(pattern = '.+?[:;]\\D*?[^;](\\d*).*?;*.*',
    replacement = '\\1',
    x = test, perl = TRUE)
Results matching OP's request:
[1] "2" "30" "" ""
If the OP really wants an integer with zeros where no match is found, set the sub() replacement to '0\\1' and wrap the call in as.integer() as follows:
as.integer(sub(pattern = '.+?[:;]\\D*?[^;](\\d*).*?;*.*',
               replacement = '0\\1',
               x = test, perl = TRUE))
Result:
[1] 2 30 0 0
Fully working online R (R 3.3.2) example:
https://ideone.com/TTuKzG
Regexp explanation
OP wants to find just one match in a string so the sub() function works just fine.
Technique for using sub() is to make a pattern that matches all strings, but use a capture group in the middle to capture zero or more digits if conditions around it are met.
The pattern .+?[:;]\\D*?[^;](\\d*).*?;*.* is read as follows
.+? Match any character (except for line terminators) + between one and unlimited times ? as few times as possible, expanding as needed (lazy)
[:;] Match a single character in the list between the square brackets, in this case : or ;
\\D Match any character that's NOT a digit (equal to [^0-9])
*? Quantifier * Matches between zero and unlimited times ? as few times as possible, expanding as needed (lazy)
[^;] The ^ hat as first character between square brackets means: Match a single character NOT present in the list between the square brackets, in this case match any character NOT ;
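(\\d*) Everything between parentheses is a capturing group; this is the 1st capturing group: \\d* matches a digit (equal to [0-9]) between zero and unlimited times, as many times as possible (greedy)
.*? Match any character * between zero and unlimited times ? as few times as possible, expanding as needed (lazy)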
;* Match the ; character * between zero and unlimited times [so ; does not have to be present but is matched if it is there: This is the key to excluding anything after the second delimiter as the OP requested]
.* Match any character * between zero and unlimited times, as many times as possible (greedy) [so picks up everything to the end of the line]
The replacement = \\1 refers to the 1st capture group in our pattern. We replace everything that was matched by the pattern with what we found in the capture group. \\d* can match no digits, so will return an empty string if there is no number found where we are expecting it.
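One consequence of this technique is worth checking: sub() returns its input unchanged when the pattern does not match at all, for example when a string contains neither : nor ;. Here is a minimal sketch, assuming you want an empty string (and then 0) in that case. The names pattern, x, out, matched and nums are illustrative only.

```r
pattern <- ".+?[:;]\\D*?[^;](\\d*).*?;*.*"
x <- c("Amount of ABC; 30%", "Presence of ABC: negative", "ABC not tested")

out <- sub(pattern, "\\1", x, perl = TRUE)
# The third element has no delimiter, so sub() leaves it untouched
out

# Blank out the elements where the pattern did not match at all
matched <- grepl(pattern, x, perl = TRUE)
out[!matched] <- ""

# Prefix "0" so that empty strings convert to 0 rather than NA
nums <- as.integer(paste0("0", out))
nums
```

Checking with grepl() first keeps unmatched input from leaking into the result as text.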

Related

Remove one number at position n of the number in a string of numbers separated by slashes

I have a character column with this configuration:
data <- data.frame(
id = 1:3,
codes = c("08001301001", "08002401002 / 08002601003 / 17134604034", "08004701005 / 08005101001"))
I want to remove the 6th digit of any number within the string. The numbers are always 10 characters long.
My code works. However I believe it might be done easier using RegEx, but I couldn't figure it out.
library(stringr)
remove_6_digit <- function(x){
  idxs <- str_locate_all(x,"/")[[1]][,1]
  for (idx in c(rev(idxs+7), 6)){
    str_sub(x, idx, idx) <- ""
  }
  return(x)
}
result <- sapply(data$codes, remove_6_digit, USE.NAMES = FALSE)
You can use
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
This will remove the 6th digit from the start of a digit sequence.
Details:
\b - word boundary
(\d{5}) - Capturing group 1 (\1): five digits
\d - a digit.
While word boundary looks enough for the current scenario, a digit boundary is also an option in case the numbers are glued to word chars:
gsub("(?<!\\d)(\\d{5})\\d", "\\1", data$codes, perl=TRUE)
where perl=TRUE enables the PCRE regex syntax and (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location.
And if you must only change numeric char sequences of 10 digits (no shorter and no longer) you can use
gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{4}(?!\\d))", "\\1", data$codes, perl=TRUE)
One remark though: your numbers consist of 11 digits, so you need to replace \\d{4} with \\d{5}.
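Applied to the sample data, the 11-digit variant might look like the sketch below. It assumes every code is exactly 11 digits, as in the example; the data frame is copied from the question, and perl = TRUE is an optional choice made here, not something the answer requires.

```r
data <- data.frame(
  id = 1:3,
  codes = c("08001301001",
            "08002401002 / 08002601003 / 17134604034",
            "08004701005 / 08005101001"))

# Keep digits 1-5 and 7-11 of every standalone 11-digit number, dropping digit 6
result <- gsub("\\b(\\d{5})\\d(\\d{5})\\b", "\\1\\2", data$codes, perl = TRUE)
result
```

Numbers of any other length are left unchanged, because both word boundaries must sit exactly 11 digits apart.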
Another possible solution, using stringr::str_replace_all and lookaround :
library(tidyverse)
data %>%
  mutate(codes = str_replace_all(codes, "(?<=\\d{5})\\d(?=\\d{5})", ""))
#> id codes
#> 1 1 0800101001
#> 2 2 0800201002 / 0800201003 / 1713404034
#> 3 3 0800401005 / 0800501001

How to find substrings flanked by a specific character and replace with text of the same length in R?

In R, what is the best way of finding dots flanked by asterisks and replace them with asterisks?
input:
"AG**...**GG*.*.G.*C.C"
desired output:
"AG*******GG***.G.*C.C"
I tried the following function, but it is not elegant to say the least.
library(stringr)
replac <- function(my_string) {
  m <- str_locate_all(my_string, "\\*\\.+\\*")[[1]]
  if (nrow(m) == 0) return(my_string)
  split_s <- unlist(str_split(my_string, ""))
  for (i in 1:nrow(m)) {
    st <- m[i, 1]
    en <- m[i, 2]
    split_s[st:en] <- rep("*", length(st:en))
  }
  paste(split_s, collapse = "")
}
I have edited the input string and expected output after #TheForthBird's answer below to make clear that dots not flanked by asterisks should not be changed, and that letters other than "A" and "G" may occur.
You might use gsub with perl = TRUE and make use of the \G anchor to assert the position at the end of the previous match.
You could match AG or GG using a character class [AG]G or [A-Z]+ to match 1+ uppercase characters.
In the replacement use *
(?:[A-Z]+\*+|\G(?!^))\K\.(?=[^*]*\*)
That will match
(?: Non capturing group
[A-Z]+\*+ Match 1+ times an uppercase char A-Z, then 1+ times *
| Or
\G(?!^) Assert position at the end of previous match, not at the start
) Close non capturing group
\K Forget what is currently matched
\. Match a dot literally
(?= Positive lookahead, assert what is on the right is
[^*]*\* Match 0+ times any char except *, then match *
) Close lookahead
For example:
gsub("(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)", "*", "AG**...**GG*.*.G.*C.C", perl = TRUE)
Result
[1] "AG*******GG***.G.*C.C"
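For comparison, here is a base R sketch of a different technique (not the \G approach above): locate each run of dots that has an asterisk on both sides and overwrite it with the same number of asterisks via regmatches<-. The variable names are illustrative. The lookahead leaves the closing asterisk unconsumed, so it can also open the next run.

```r
s <- "AG**...**GG*.*.G.*C.C"

# Find "*" followed by 1+ dots, provided another "*" comes right after the dots
m <- gregexpr("\\*\\.+(?=\\*)", s, perl = TRUE)

# Replace every hit with a run of asterisks of the same length
regmatches(s, m) <- lapply(regmatches(s, m),
                           function(hit) strrep("*", nchar(hit)))
s
```

Dots that lack an asterisk on either side, such as those around the final G and C, are not matched and stay as they are.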
Try this code, it's still not wrapped, but at least is a bit shorter than yours and works for all the cases, not only the ones without other occurrences of dots in the string:
replac_v2 <- function(my_string){
  b <- my_string #Just a shorter name
  while(TRUE){
    df<-as.data.frame(str_locate(b,"\\*\\.+\\*"))
    add<-as.numeric(df[2]-df[1])+1
    if(is.na(add)){return(b)}
    b<-str_replace(b,"\\*\\.+\\*",paste(rep("*",add),collapse=""))
  }
}

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column which has 75 rows of variables such as the col above. I am not quite sure how to use gsub or sub in order to get up until the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)
Following may help you here too.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where Input for data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: Following is only for explanation purposes.
sub(" ##sub() replaces only the first match in each element (gsub() would replace every match).
([^:]*) ##The parentheses form capture group 1, which holds everything up to the first colon.
: ##A literal colon (:).
([^:]*) ##Capture group 2, which holds everything up to the next colon.
.*" ##.* matches the rest of the string, from the second colon onward.
,"\\1:\\2" ##The replacement: the contents of group 1, a colon, then the contents of group 2.
,df$dat) ##The input: the dat column of the data frame df.
You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, it should be used with the PCRE regex engine, pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")
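To see how the sub() patterns above differ, the following sketch runs three of them on the sample values plus one extra value with a trailing letter. The vector col and the result names are illustrative; the patterns are copied from the answers.

```r
col <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456", "WBU-ARFU*03:456b")

# Keep the text before the first colon and the field after it
by_fields <- sub("([^:]*):([^:]*).*", "\\1:\\2", col)

# Drop a trailing ":digits" only when it ends the string
by_suffix <- sub("(\\d:\\d+):\\d+$", "\\1", col)

# Keep everything up to the digits after the first colon
by_prefix <- sub("^(.*?:\\d+).*", "\\1", col, perl = TRUE)

data.frame(col, by_fields, by_suffix, by_prefix)
```

The three agree on the original four values. Only the last value separates them: the first two patterns keep the trailing b, while the third strips it.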

Inserting character dynamically into string in R

I'm trying to insert a "+" symbol into the middle of a postcode. The postcodes follow a pattern of AA111AA or AA11AA. I want the "+" to be inserted before the final number, so an output of either AA11+1AA or AA1+1AA. I've found a way to do this using stringr, but it feels like there's an easier way to do this than how I'm currently doing it. Below is my code.
pc <- "bt43xx"
pc <- str_c(
str_sub(pc, start = 1L, end = -4L),
"+",
str_sub(pc, start = -3L, end = -1L)
)
pc
[1] "bt4+3xx"
Here are some alternatives. All solutions work if pc is a scalar or vector. No packages are needed. Of them (3) seems particularly short and simple.
1) Match everything (.*) up to the last digit (\\d) and then replace that with the first capture (i.e. the match to the part within the first set of parens), a plus and the second capture (i.e. a match to the last digit).
sub("(.*)(\\d)", "\\1+\\2", pc)
2) An alternative which is even shorter is to match a digit followed by a non-digit and replace that with a plus followed by the match:
sub("(\\d\\D)", "+\\1", pc)
## [1] "bt4+3xx"
3) This one is even shorter than (2). It matches the last 3 characters replacing the match with a plus followed by the match:
sub("(...)$", "+\\1", pc)
## [1] "bt4+3xx"
4) This one splits the string into individual characters, inserts a plus in the appropriate position using append and puts the characters back together.
sapply(Map(append, strsplit(pc, ""), after = nchar(pc) - 3, "+"), paste, collapse = "")
## [1] "bt4+3xx"
If pc were known to be a scalar (as is the case in the question) it could be simplified to:
paste(append(strsplit(pc, "")[[1]], "+", nchar(pc) - 3), collapse = "")
[1] "bt4+3xx"
This regular expression with sub and two back references should work.
sub("(\\d?)(\\d\\D*)$", "\\1+\\2", pc)
[1] "bt4+3xx"
\\d? matches 1 or 0 numeric characters, 0-9, and is captured by (). It will match if at least two numeric characters are present.
\\d\\D* matches a numeric character followed by all non-numeric characters, and is captured by (). \\D is used rather than [^\\d] because the default regex engine may not treat \\d as a digit class inside square brackets.
$ anchors the regular expression to the end of the string
"\\1+\\2" replaces the matched elements in the first two points with themselves and a "+" in the middle.
sub('(\\d)(?=\\D+$)', '+\\1', pc, perl = TRUE)
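Since the question mentions both AA11AA and AA111AA postcodes, a quick sketch like the following can confirm that a pattern handles both lengths. The second postcode, bt143xx, is a made-up value for illustration, and the result names are hypothetical.

```r
pc <- c("bt43xx", "bt143xx")

# Insert "+" before the last digit
res_last_digit <- sub("(.*)(\\d)", "\\1+\\2", pc)

# Insert "+" before the last three characters
res_last_three <- sub("(...)$", "+\\1", pc)

# Insert "+" before the digit that is followed only by non-digits
res_lookahead <- sub("(\\d)(?=\\D+$)", "+\\1", pc, perl = TRUE)

rbind(res_last_digit, res_last_three, res_lookahead)
```

All three rows should agree for postcodes that end in one digit followed by two letters.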

Replace all characters between the 3rd occurrence of “-” and the ":" in each element of a vector

Here is what I am trying to do:
Given a string, I want to remove everything between the third occurrence of '-' and the ':' character, assuming there is a third occurrence, which there may not be.
This is my expected result :
Initial string
yy-aa-bbb-cccc1:HYT => yy-aa-bbb:HYT
yy-aa-vvv-vv:ZTR => yy-aa-vvv:ZTR
yy-aa-ddd:YTLM => yy-aa-ddd:YTLM
Any help?
gsub('(.*-.*-.*)\\-.*(\\:.*)','\\1\\2',string)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
We match two instances of characters that are not a - followed by a - (([^-]+-){2}), then another run of characters that are not a - ([^-]+), and capture all of that as group 1. Next comes a - and a run of characters that are not a : ([^:]+), and finally the third capture group, which starts at the : ((:.*)). The whole match is replaced with the backreferences to groups 1 and 3. The third - is required, so strings without one are left unchanged.
sub("(([^-]+-){2}[^-]+)-[^:]+(:.*)", "\\1\\3", str1)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
data
str1 <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Match the first two fields and everything afterwards up to the colon, and replace that with the first two fields and a colon. Note that \w matches any word character and the \ needs to be doubled inside "..." :
sub("(\\w+-\\w+)-.+:", "\\1:", xx)
## [1] "yy-aa:HYT" "yy-aa:ZTR" "yy-aa:YTLM"
Note: The input xx in reproducible form is:
xx <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Just throwing a stringi solution in there.
library(stringi)
sub('_.*:' ,':', stri_replace_last_fixed(x, '-', '_'))
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa:YTLM"
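Because the answers here produce slightly different results, it can help to compare each candidate against the expected output from the question. A minimal sketch follows; expected, three_fields and two_fields are illustrative names, and the first pattern is the gsub answer above with the unnecessary escapes dropped.

```r
str1 <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
expected <- c("yy-aa-bbb:HYT", "yy-aa-vvv:ZTR", "yy-aa-ddd:YTLM")

# Requires three dashes before the colon, so the last element is left alone
three_fields <- sub("(.*-.*-.*)-.*(:.*)", "\\1\\2", str1)
identical(three_fields, expected)

# Keeps only the first two fields, so it differs from the expected result
two_fields <- sub("(\\w+-\\w+)-.+:", "\\1:", str1)
identical(two_fields, expected)
```

The first comparison should come out TRUE and the second FALSE, which shows at a glance which pattern reproduces the requested output.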
