Regex: Match first two digits of a four digit number - r

I have:
'30Jun2021'
I want to skip/remove the first two digits of the four digit number (or any other way of doing this):
'30Jun21'
I have tried:
^.{0,5}
https://regex101.com/r/hAJcdE/1
I have the first 5 characters but I have not figured out how to skip/remove the '20'

Dates are better handled with the dedicated date/time functions.
You can convert the variable to a Date and use format to get the output in any format. Note that %b matches the abbreviated month name in the current locale.
x <- '30Jun2021'
format(as.Date(x, '%d%b%Y'), '%d%b%y')
#[1] "30Jun21"
You can also use lubridate::dmy(x) to convert x to date.

You don't even need regex for this. Just use substring operations:
x <- '30Jun2021'
paste0(substr(x, 1, 5), substr(x, 8, 9))
[1] "30Jun21"
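The fixed positions above assume a two-digit day. A minimal sketch that counts from the end of the string instead, assuming every input ends in a four-digit year (the second sample value is made up for illustration):

```r
# Hypothetical inputs: one two-digit day, one single-digit day
x <- c("30Jun2021", "1Jul2021")
n <- nchar(x)

# Keep everything before the last four characters, then the last two
short <- paste0(substr(x, 1, n - 4), substr(x, n - 1, n))
short
# [1] "30Jun21" "1Jul21"
```

This only does position arithmetic, so it does not check that the last four characters are actually digits.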

Use sub
sub('\\d{2}(\\d{2})$', "\\1", x)
[1] "30Jun21"
or with str_remove
library(stringr)
str_remove(x, "\\d{2}(?=\\d{2}$)")
[1] "30Jun21"
data
x <- '30Jun2021'
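As a rough check, both patterns can be run on a small vector. The extra sample values are invented for illustration, and the lookahead version is written here with base R's perl = TRUE instead of stringr:

```r
# Made-up inputs: two with a four-digit year, one already short, one without digits
x <- c("30Jun2021", "1Jul1999", "30Jun21", "n/a")

# Capture group: keep only the last two of the four trailing digits
res_sub <- sub("\\d{2}(\\d{2})$", "\\1", x)

# Lookahead: remove two digits only when two more digits end the string
res_look <- sub("\\d{2}(?=\\d{2}$)", "", x, perl = TRUE)

res_sub
# [1] "30Jun21" "1Jul99"  "30Jun21" "n/a"
identical(res_sub, res_look)
# [1] TRUE
```

Strings that do not end in four digits are left unchanged by both versions.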

You could also match the format of the string with 2 capture groups, where you would match the part that you want to omit and capture what you want to keep.
\b(\d+[A-Z][a-z]+)\d\d(\d\d)\b
sub("\\b(\\d+[A-Z][a-z]+)\\d\\d(\\d\\d)\\b", "\\1\\2", "30Jun2021")
Output
[1] "30Jun21"

Related

R padding 0's inside a string after the hyphen

I have the following data
GT_BUC-01_BUCST-19
ADT_BURC-1_BUCST-09
BT_BUDDC-1_BUDSCST-29
CAST_BUC-31_BUCST-9
CAST_BUC-1_BUCST-9
How do I use R to pad the numbers after both hyphens with leading zeros so that each has two digits? The resulting format should look like this:
GT_BUC-01_BUCST-19
ADT_BURC-01_BUCST-09
BT_BUDDC-01_BUDSCST-29
CAST_BUC-31_BUCST-09
CAST_BUC-01_BUCST-09
One option would be to use stringr::str_replace_all
x <- c('GT_BUC-01_BUCST-19', 'ADT_BURC-1_BUCST-09',
'BT_BUDDC-1_BUDSCST-29', 'CAST_BUC-31_BUCST-9', 'CAST_BUC-1_BUCST-9')
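# str_pad() is used because zero-padding strings with sprintf('%02s') is platform-dependent
stringr::str_replace_all(x, '\\d+', function(m) stringr::str_pad(m, 2, pad = '0'))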
#[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09"
#[3] "BT_BUDDC-01_BUDSCST-29" "CAST_BUC-31_BUCST-09"
#[5] "CAST_BUC-01_BUCST-09"
You could try using gsub as follows:
x <- gsub("-(\\d)(?!\\d)", "-0\\1", x, perl=TRUE)
x
[1] "GT_BUC-01_BUCST-19" "ADT_BURC-01_BUCST-09" "BT_BUDDC-01_BUDSCST-29"
[4] "CAST_BUC-31_BUCST-09" "CAST_BUC-01_BUCST-09"
Data:
x <- c("GT_BUC-01_BUCST-19",
"ADT_BURC-1_BUCST-09",
"BT_BUDDC-1_BUDSCST-29",
"CAST_BUC-31_BUCST-9",
"CAST_BUC-1_BUCST-9")
The regex pattern used here matches a dash followed by exactly one digit: the negative lookahead (?!\\d) rejects the match when another digit follows. We then replace the match by prepending a zero to that single digit.
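To see what the lookahead contributes, here is a small sketch on made-up strings that compares the pattern with and without it:

```r
# Hypothetical inputs with one-digit and two-digit numbers after the hyphen
x <- c("A-1_B-9", "A-12_B-3")

# With the negative lookahead, only single digits get a zero
with_lookahead <- gsub("-(\\d)(?!\\d)", "-0\\1", x, perl = TRUE)
with_lookahead
# [1] "A-01_B-09" "A-12_B-03"

# Without it, the first digit of "12" is padded as well
without_lookahead <- gsub("-(\\d)", "-0\\1", x)
without_lookahead
# [1] "A-01_B-09"  "A-012_B-03"
```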

Split a comma separated string into defined number of pieces in R

I have a string of comma separated values that I'd like to split into several pieces based on the number of commas.
E.g.: Split the following string every 5 values or commas:
txt = "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
The result would be:
[1] 120923,120417,120416,105720,120925
[2] 120790,120792,120922,120928,120930
[3] 120918,120929,61065,120421
We could split the text on the comma (',') and divide the values into groups of 5.
temp <- strsplit(txt, ",")[[1]]
split(temp, rep(seq_along(temp), each = 5, length.out = length(temp)))
#$`1`
#[1] "120923" "120417" "120416" "105720" "120925"
#$`2`
#[1] "120790" "120792" "120922" "120928" "120930"
#$`3`
#[1] "120918" "120929" "61065" "120421"
If you want each group as a single concatenated string, we can use by (note that toString separates the values with ", "):
as.character(by(temp, rep(seq_along(temp), each = 5,
length.out = length(temp)), toString))
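If the chunks should be joined with a bare comma, as in the question's expected output, one possible variant pastes each group itself. This is a sketch; the grouping via ceiling() is an alternative to the rep() call above, not part of the original answer:

```r
txt <- "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
temp <- strsplit(txt, ",")[[1]]

# Group index: 1 for values 1-5, 2 for values 6-10, and so on
grp <- ceiling(seq_along(temp) / 5)

# Paste each group back together with "," and drop the group names
chunks <- unname(vapply(split(temp, grp), paste, character(1), collapse = ","))
chunks
# [1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
# [3] "120918,120929,61065,120421"
```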
One base R option would be to use gregexpr with the following regex pattern:
\d+(?:,\d+){0,4}
This pattern matches one number, followed greedily by zero to four more comma-separated numbers. Because the quantifier is greedy, each match takes as many of the remaining numbers as it can, up to five in total.
txt <- "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
regmatches(txt,gregexpr("\\d+(?:,\\d+){0,4}",txt))
[1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
[3] "120918,120929,61065,120421"
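The lower bound of zero matters when the last chunk holds a single value. A small sketch with an invented 11-value string, using perl = TRUE for the non-capturing group:

```r
# Hypothetical input: 11 values, so the last chunk has only one element
txt11 <- paste(1:11, collapse = ",")

pat0 <- "\\d+(?:,\\d+){0,4}"  # zero to four following values
pat1 <- "\\d+(?:,\\d+){1,4}"  # at least one following value required

with_zero <- regmatches(txt11, gregexpr(pat0, txt11, perl = TRUE))[[1]]
with_zero
# [1] "1,2,3,4,5"  "6,7,8,9,10" "11"

with_one <- regmatches(txt11, gregexpr(pat1, txt11, perl = TRUE))[[1]]
with_one
# [1] "1,2,3,4,5"  "6,7,8,9,10"
```

With a lower bound of 1, the lone trailing "11" is silently dropped.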
Using str_extract_all (with a lower bound of 0 so that a lone trailing value is still matched)
library(stringr)
str_extract_all(txt, "\\d+(,\\d+){0,4}")[[1]]
#[1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
#[3] "120918,120929,61065,120421"

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add a leading zero if the number has fewer than 5 digits. However, some numbers have "HQ" at the end and some don't. (I did check other posts; they don't have a similar problem with the "HQ" part.)
So the desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
Any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four digits in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course, if there are five digits, the regex doesn't match, so the string is left unchanged.
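One thing worth checking with this pattern is that the replacement adds exactly one zero. A short sketch, where "123" is an extra value not in the question's data:

```r
# Made-up inputs: "123" is added to show the one-zero limit
my_strings <- c("9453", "55489", "7833HQ", "123")

padded <- gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2", my_strings)
padded
# [1] "09453"   "55489"   "07833HQ" "0123"
```

Four-digit values come out at five digits, but a three-digit value is only padded to four, so this fits data where values have either four or five digits.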
I was going to use the sprintf approach, but found that the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
x
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using @thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")

formatting the date in R

I have a date value as follows
"'2015-10-24'"
of class character.
I am trying to format this value so that it looks like this: '10/24/2015'
I know how to use the noquote function to strip the quotes and the gsub function to replace the - with /, but I am not sure how to reorder the year, month and day so that it looks like '10/24/2015'.
Any help is much appreciated.
We can convert to Date class after removing the ' with gsub, and then use format to get the expected output
format(as.Date(gsub("'", '', v1)), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
Or without using the gsub to remove ', we can specify the ' also in the format within as.Date
format(as.Date(v1, "'%Y-%m-%d'"), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
This can be made more compact if we are using library(lubridate)
library(lubridate)
format(ymd(v1), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
If we don't need the ' in the output, we don't have to specify that in the format,
format(ymd(v1), "%m/%d/%Y")
#[1] "10/24/2015" "10/25/2015"
Or we can do this using only gsub by capturing the characters as a group. In the below code, we capture the first 4 characters (.{4}) as a group by wrapping with parentheses followed by matching the -, then capturing the next two characters, followed by -, and capturing the last two characters. In the replacement, we can shuffle the capture groups as per the requirement. In this case, the second capture group should come first (\\2) followed by /, then the third (\\3) and so on...
gsub('(.{4})-(.{2})-(.{2})', '\\2/\\3/\\1', v1)
#[1] "'10/24/2015'" "'10/25/2015'"
To avoid the quotes,
gsub('.(.{4})-(.{2})-(.{2}).', '\\2/\\3/\\1', v1)
#[1] "10/24/2015" "10/25/2015"
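One difference between the two families of solutions is validation. In this sketch the second value is a deliberately invalid date, made up for illustration:

```r
# Hypothetical inputs: a valid date and an impossible one (month 13, day 45)
v <- c("'2015-10-24'", "'2015-13-45'")

# Parsing as a Date yields NA for the impossible date
via_date <- format(as.Date(v, "'%Y-%m-%d'"), "%m/%d/%Y")
via_date
# [1] "10/24/2015" NA

# Pure string reshuffling passes it through unchecked
via_gsub <- gsub(".(.{4})-(.{2})-(.{2}).", "\\2/\\3/\\1", v)
via_gsub
# [1] "10/24/2015" "13/45/2015"
```

Which behaviour is preferable depends on whether bad input should be flagged or passed along.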
In addition, there are other ways such as splitting the string
vapply(strsplit(v1, "['-]"), function(x) paste(x[c(3,4,2)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
or extracting the numeric part with str_extract_all and pasting as before.
library(stringr)
vapply(str_extract_all(v1, '\\d+'), function(x)
paste(x[c(2,3,1)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
data
v1 <- c("'2015-10-24'", "'2015-10-25'")
You can also use the function strftime to get the result
d <- "'2015-10-24'"
strftime(as.Date(gsub("'", "", d)), "%m/%d/%Y")
# [1] "10/24/2015"

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with sub too.
sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character, optionally (a plain . would fail on the element without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of (), I could reference them too with '\\2' and so on. Inside the group (you had that part of the syntax correct), I want only digits, and exactly 4 of them. The final .* matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself, so you wound up with the same output.
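Another base R way to pull out a match, rather than replacing everything around it, is regexpr with regmatches. A minimal sketch, where the last input value is invented to show the caveat:

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() locates the first run of four digits; regmatches() extracts it
first_four <- regmatches(x, regexpr("[0-9]{4}", x))
first_four
# [1] "2134" "0983" "8723"

# Caveat: elements without a match are dropped, so lengths can differ
y <- c(x, "no-digits-here")  # hypothetical extra value
length(regmatches(y, regexpr("[0-9]{4}", y)))
# [1] 3
```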
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking the above line down into pieces:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this keeps the positions of the first four digits in each element
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
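If the goal is just "the first n digits, wherever they are", a shorter route is to delete all non-digits and then take a substring. This is a sketch with a hypothetical helper name (first_digits) and made-up inputs:

```r
# Hypothetical helper: strip non-digits, then keep at most n characters
first_digits <- function(s, n = 4) substr(gsub("\\D", "", s), 1, n)

res <- first_digits(c("a1b2c3d4e5", "ab12", "P2134.x"))
res
# [1] "1234" "12"   "2134"
```

Unlike the position-based version, an element with fewer than four digits simply returns the digits it has.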
Another group-capturing approach that doesn't assume exactly four digits.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
