Extract first X Numbers from Text Field using Regex - r

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!

You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"

You can do this with gsub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
>
I used sub instead of gsub to assure I only got the first match. .? says any single character and its optional (similar to just . but then it wouldn't match the case without the leading P). The () signify a group that I reference in the replacement '\\1'. If there were multiple sets of () I could reference them too with '\\2'. Inside the group, and you had the syntax correct, I want only numbers and I want exactly 4 of them. The final piece says zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.

This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces:
What's going on in the above line is as follows:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)

Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"

Related

How to group files in a list based on name?

I have 4 files:
MCD18A1.A2001001.h15v05.061.2020097222704.hdf
MCD18A1.A2001001.h16v05.061.2020097221515.hdf
MCD18A1.A2001002.h15v05.061.2020079205554.hdf
MCD18A1.A2001002.h16v05.061.2020079205717.hdf
And I want to group them by name (date: A2001001 and A2001002) inside a list, something like this:
[[MCD18A1.A2001001.h15v05.061.2020097222704.hdf, MCD18A1.A2001001.h16v05.061.2020097221515.hdf], [MCD18A1.A2001002.h15v05.061.2020079205554.hdf, MCD18A1.A2001002.h16v05.061.2020079205717.hdf]]
I did this using Python, but I don't know how to do with R:
# Seperate files by date
MODIS_files_bydate = [list(i) for _, i in itertools.groupby(MODIS_files, lambda x: x.split('.')[1])]
Is this what you are looking for?
g <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)
split(s, g)
#$A2001001
#[1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf"
#[2] "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
#
#$A2001002
#[1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf"
#[2] "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
regex explained
The regex is divided in three parts.
^[^\\.]*\\.
^ first circumflex marks the beginning of the string;
^[^\\.] at the beginning, a class negating a dot (the second ^). The dot is a meta-character and, therefore, must be escaped, \\.;
the sequence with no dots at the beginning repeated zero or more times (*);
the previous sequence ends with a dot, \\..
([^\\.]+) is a capture group.
[^\\.] the class with no dots, like above;
[^\\.]+ repeated at least one time (+).
\\..*$"
\\. starting with one dot
\\..*$ any character repeated zero or more times until the end ($).
What sub is replacing is the capture group, what is between parenthesis, by itself, \\1. This discards everything else.
Data
s <- "
MCD18A1.A2001001.h15v05.061.2020097222704.hdf
MCD18A1.A2001001.h16v05.061.2020097221515.hdf
MCD18A1.A2001002.h15v05.061.2020079205554.hdf
MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
s <- scan(text = s, what = character())
How would you like the outcome organized?
This is a solution:
files <- c("MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
"MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
"MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
"MCD18A1.A2001002.h16v05.061.2020079205717.hdf")
unique_date <- unique(sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", files))
# (credit to Rui Barradas for the nice regular expression)
grouped_files <- lapply(unique_date, function(x){files[grepl(x, files)]})
names(grouped_files) <- unique_date
> grouped_files
# $A2001001
# [1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf" "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
# $A2001002
# [1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf" "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"

R padding 0's inside a string after the hypen

I have the following data
GT_BUC-01_BUCST-19
ADT_BURC-1_BUCST-09
BT_BUDDC-1_BUDSCST-29
CAST_BUC-31_BUCST-9
CAST_BUC-1_BUCST-9
How do I use R to make the numbers after both hyphens to pad leading zeros so it will have Two digits? The resulting format should look like this:
GT_BUC-01_BUCST-19
ADT_BURC-01_BUCST-09
BT_BUDDC-01_BUDSCST-29
CAST_BUC-31_BUCST-09
CAST_BUC-01_BUCST-09
One option would be to use stringr::str_replace_all
x <- c('GT_BUC-01_BUCST-19', 'ADT_BURC-1_BUCST-09',
'BT_BUDDC-1_BUDSCST-29', 'CAST_BUC-31_BUCST-9', 'CAST_BUC-1_BUCST-9')
stringr::str_replace_all(x, '\\d+', function(m) sprintf('%02s', m))
#[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09"
#[3] "BT_BUDDC-01_BUDSCST-29" "CAST_BUC-31_BUCST-09"
#[5] "CAST_BUC-01_BUCST-09"
You could try using gsub as follows:
x <- gsub("-(\\d)(?!\\d)", "-0\\1", x, perl=TRUE)
x
[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09" "BT_BUDDC-01_BUDSCST-29"
[4] "CAST_BUC-31_BUCST-09" "CAST_BUC-01_BUCST-09"
Data:
x <- c("GT_BUC-01_BUCST-19",
"ADT_BURC-1_BUCST-09",
"BT_BUDDC-1_BUDSCST-29",
"CAST_BUC-31_BUCST-9",
"CAST_BUC-1_BUCST-9")
The regex pattern used here matches dash followed by a single number only. In this case, we then replace by prepending a zero to this single number.

How to match distinct repeated characters

I'm trying to come up with a regex in R to match strings in which there is repetition of two distinct characters.
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
This regex matches all of the above, including strings such as "mmmm" and "ohhhh" where the repeated letter is the same in the first and the second repetition:
grep(".*([a-z])\\1.*([a-z])\\2", x, value = T)
What I'd like to match in x are these strings where the repeated letters are distinct:
"cooee","helloee","oooaaah","sshh","vroomm","whoopee","yippee"
How can the regex be tweaked to make sure the second repeated character is not the same as the first?
You may restrict the second char pattern with a negative lookahead:
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
# ^^^^^
See the regex demo.
(?!\\1)([a-z]) means match and capture into Group 2 any lowercase ASCII letter if it is not the same as the value in Group 1.
R demo:
x <- c("aaaaaaah" ,"aaaah","ahhhh","cooee","helloee","mmmm","noooo","ohhhh","oooaaah","ooooh","sshh","ummmmm","vroomm","whoopee","yippee")
grep(".*([a-z])\\1.*(?!\\1)([a-z])\\2", x, value=TRUE, perl=TRUE)
# => "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"
If you can avoid regex altogether, then I think that's the way to go. A rough example:
nrep <- sapply(
strsplit(x, ""),
function(y) {
run_lengths <- rle(y)
length(unique(run_lengths$values[run_lengths$lengths >= 2]))
}
)
x[nrep > 1]
# [1] "cooee" "helloee" "oooaaah" "sshh" "vroomm" "whoopee" "yippee"

extract substring in R

Suppose I have list of string "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of string that contains only numbers with bracket like eg. [+229][+57].
Is there a convenient way in R to do this?
Using base R, then try it with
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add leading zero if the number is less than 5 digits. However, some numbers have "HQ" in the end, some don't.(I did check other posts, they dont have similar problem in the "HQ" part)
so the finally desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four numbers in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course if there is 5 numbers, then the regex isn't matched, so it doesn't change.
I was going to use the sprintf approach, but found the the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using #thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")

Resources