Search every n characters in a string for a pattern - r

Let's say I have the string ABCC321BB321A. I want to search for a pattern that consists of ABC...321, where ... can be any character(s). However, I want to only return results in which characters in the substring can be grouped into sets of 3.
E.g., I don't want ABCC321 (ABC - C32 - 1), but I do want ABCC321BB321 (ABC - C32 - 1BB - 321).
How would I do this in R? Is it possible to achieve using regular expressions? I guess I could possibly split the string up into a list containing groups of 3 or use conditionals to only return matches that are divisible by 3 to get the answer I want, but I'm assuming there's a more efficient method.

Try this:
x <- "ABCC321BB321A"
threes <- regmatches(x, gregexpr(".{3}", x))[[1]]
threes
paste(threes, collapse = "-")
which produces:
[[1]]
[1] "ABC" "C32" "1BB" "321"
and
[1] "ABC-C32-1BB-321"

Related

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1
Note that sub and gsub replace the found text(s), thus, in order to extract a part of the input string with a capturing group + a backreference, you need to actually match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
See the regex demo and the R online demo.
Note that gsub matches multiple occurrences, you need a single one, so it makes sense to replace gsub with sub.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.
You could try using the string look behind conditional "(?<=)" and str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add leading zero if the number is less than 5 digits. However, some numbers have "HQ" in the end, some don't.(I did check other posts, they dont have similar problem in the "HQ" part)
so the finally desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four numbers in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course if there is 5 numbers, then the regex isn't matched, so it doesn't change.
I was going to use the sprintf approach, but found the the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using #thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
in the above code we need to find the number of r's(R and r) in rquote
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two gsubs on the original string ('rquote'). First one removes the characters starting with u until the end of the string (u.$) and the second matches all characters except R, r ([^Rr]) and replace it with ''. We can use nchar to get count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

R grepl - Matching Pattern to String

I am using grepl() in R to match patterns to a string.
I need to match multiple strings to a common string and return TRUE if they all match.
For example:
a <- 'DEARBORN TRUCK INCDBA'
b <- 'DEARBORN TRUCK INC DBA'
I want to see if all words in variable b are also in variable a.
I can't just use grepl(b, a) because the patterns (spaces) aren't the same.
It seems like it should be something like this:
grepl('DEARBORN&TRUCK&INC&DBA', a)
or
grepl('DEARBORN+TRUCK+INC+DBA', a)
but neither work. I need to compare each individual word in b to a. In this case, since all the words exist in a, it should return TRUE.
Thanks!
Use strsplit to split b into words and then use sapply to perform a grepl on each such word. The result will be a logical vector and if its all TRUE then return TRUE:
all(sapply(strsplit(b, " ")[[1]], grepl, a))
giving:
[1] TRUE
Note: If you are only looking to determine if a and b are the same aside from spaces then remove the spaces from both and compare what is left:
gsub(" ", "", a) == gsub(" ", "", b)

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with gsub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
>
I used sub instead of gsub to assure I only got the first match. .? says any single character and its optional (similar to just . but then it wouldn't match the case without the leading P). The () signify a group that I reference in the replacement '\\1'. If there were multiple sets of () I could reference them too with '\\2'. Inside the group, and you had the syntax correct, I want only numbers and I want exactly 4 of them. The final piece says zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces:
What's going on in the above line is as follows:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"

Resources