Inserting prefix of 19 into a string date - r

I have a vector of birth dates as character strings formatted "10-Feb-85".
When I use the as.Date() function in R it assumes the two digit year is after 2000 (none of these birth dates are after the year 2000).
example:
as.Date(x = "10-Feb-52", format = "%d-%b-%y")
returns: 2052-02-10
I'm not proficient in regular expressions, but I think this is an occasion for one: insert a "19" after the second "-" or before the last two digits.
I've found a regex that counts forward three characters and inserts a letter:
gsub(pattern = "^(.{3})(.*)$", replacement = "\\1d\\2", x = "abcefg")
But I'm not sure how to count two from the end.
Any help is appreciated.

insert a "19" after the second "-" or before the last two digits.
Before the last two digits:
gsub(pattern = "-(\\d{2})$", replacement = "-19\\1", x = "10-Feb-52")
See the R demo. Here, - is matched first, then 2 digits ((\\d{2})) - that are at the end of string ($) - are matched and captured into Group 1.
After the second -:
gsub(pattern = "^((?:[^-]*-){2})", replacement = "\\119", x = "10-Feb-52")
See another demo. Here, two sequences ({2}) of zero or more chars other than - ([^-]*), each followed by a -, are matched from the start of the string (^) and captured into Group 1. The \\1 backreference in the replacement restores the captured text, and the 19 is written right after it.
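To tie the answer back to the original goal, here is a minimal sketch that applies the first pattern to a whole vector and then parses the result with a four-digit year. The vector dob and the names dob_long and birth_dates are made up for illustration, and the sketch assumes every year belongs to the 1900s and that the session locale uses English month abbreviations (as.Date with %b is locale-dependent).

```r
# Hypothetical vector of two-digit-year birth dates (all assumed to be 19xx)
dob <- c("10-Feb-85", "10-Feb-52", "01-Dec-99")

# Insert "19" before the final two digits; sub() is enough because
# there is only one replacement per string
dob_long <- sub("-(\\d{2})$", "-19\\1", dob)
dob_long
# [1] "10-Feb-1985" "10-Feb-1952" "01-Dec-1999"

# %Y reads the four-digit year; %b relies on the locale's month abbreviations
birth_dates <- as.Date(dob_long, format = "%d-%b-%Y")
```

Since the pattern is anchored with $, strings that do not end in a hyphen plus exactly two digits are left unchanged.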

Related

Select numeric string with dots and colon

I have this string
string <- c("Hospitalization from 25.1.2018 to 26.1.2018", "Date of hospitalization was from 28.8.2019 8:15", "Date of arrival 30.6.2018 20:30 to hospital")
And I would like to extract the numeric parts of each string (with dots and colons) to get this
print(dates)
c("25.1.2018", "26.1.2018", "28.8.2019 8:15", "30.6.2018 20:30")
I have tried
dates <- gsub("([0-9]+).*$", "\\1", string)
But it gives me just the first number before the first dot.
You can use
library(stringr)
unlist(str_extract_all(string, "\\d{1,2}\\.\\d{1,2}\\.\\d{4}(?:\\s+\\d{1,2}:\\d{1,2})?"))
# => [1] "25.1.2018" "26.1.2018" "28.8.2019 8:15" "30.6.2018 20:30"
Details
\d{1,2} - one or two digits
\. - a dot
\d{1,2}\.\d{4} - one or two digits, a dot and four digits
(?:\s+\d{1,2}:\d{1,2})? - an optional occurrence of
\s+ - one or more whitespaces
\d{1,2}:\d{1,2} - one or two digits, : and one or two digits.
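If adding stringr is not an option, the same pattern should also work in base R. The following is a sketch rather than part of the original answer: it uses regmatches with gregexpr, and passes perl = TRUE so that \\d, \\s and the non-capturing group are handled by the PCRE engine.

```r
string <- c("Hospitalization from 25.1.2018 to 26.1.2018",
            "Date of hospitalization was from 28.8.2019 8:15",
            "Date of arrival 30.6.2018 20:30 to hospital")

# Same pattern as in the stringr call: d.m.yyyy with an optional h:mm part
pattern <- "\\d{1,2}\\.\\d{1,2}\\.\\d{4}(?:\\s+\\d{1,2}:\\d{1,2})?"

# gregexpr() finds every match per element; regmatches() extracts them as a list
dates <- unlist(regmatches(string, gregexpr(pattern, string, perl = TRUE)))
dates
# [1] "25.1.2018"       "26.1.2018"       "28.8.2019 8:15"  "30.6.2018 20:30"
```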
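Alternatively, use sapply with a simpler character class. Note that this returns one string per input element, so the two dates in the first element are pasted together: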
sapply(str_extract_all(string, "[0-9.:]+"), paste0, collapse = " ")
[1] "25.1.2018 26.1.2018" "28.8.2019 8:15" "30.6.2018 20:30"

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
We can use sub to capture one or more digits (\\d+) at the end ($) of the string, preceded by any characters (.*) and a non-digit ([^0-9]). The replacement is the backreference (\\1) to the captured group.
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
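Because the question mentions a data frame column, here is a small sketch of how the last approach might be applied there. The data frame df and its columns code and suffix are hypothetical names, not taken from the question.

```r
# Hypothetical data frame holding the strings from the question
df <- data.frame(
  code = c("AB ABC 19012301927 / XX - 4625",
           "BC - AB / 827 / 9765",
           "XXXX-9276"),
  stringsAsFactors = FALSE
)

# Drop everything up to the last non-digit, then convert to integer.
# A value with no trailing digits would become "" and therefore NA.
df$suffix <- as.integer(sub(".*\\D", "", df$code))
df$suffix
# [1] 4625 9765 9276
```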

Using gsub to replace last occurrence of string in R

I have the following character vector that I need to modify with gsub.
strings <- c("x", "pm2.5.median", "rmin.10000m", "rmin.2500m", "rmax.5000m")
Desired output of filtered strings:
"x", "pm2.5.median", "rmin", "rmin", "rmax"
My current attempt works for everything except the pm2.5.median string which has dots that need to be preserved. I'm really just trying to remove the buffer size that is appended to the end of each variable, e.g. 1000m, 2500m, 5000m, 7500m, and 10000m.
gsub("\\..*m$", "", strings)
"x", "pm2", "rmin", "rmin", "rmax"
Match a dot, one or more digits, m and the end of string, and replace that with the empty string. Note that we prefer sub to gsub here because we are only interested in one replacement per string.
sub("\\.\\d+m$", "", strings)
## [1] "x" "pm2.5.median" "rmin" "rmin" "rmax"
The .* pattern matches any 0 or more chars, as many as possible. The \..*m$ pattern matches the first (leftmost) . in the string and then grabs all the text after it, provided the string ends with m.
You need
> sub("\\.[^.]*m$", "", strings)
[1] "x" "pm2.5.median" "rmin" "rmin" "rmax"
Here, \.[^.]*m$ matches ., then 0 or more chars other than a dot and then m at the end of the string.
Details
\. - a dot (must be escaped since it is a special regex char otherwise)
[^.]* - a negated character class matching any char but . 0 or more times
m - an m char
$ - end of string.
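The two suggested patterns agree on the sample data but are not equivalent in general. The sketch below adds one hypothetical name, "rain.sum", which is not in the question, to show where they differ: the digit-based pattern only strips suffixes such as .2500m, while the negated-class pattern strips any final dot-separated piece ending in m.

```r
# Original vector plus one made-up name that ends in "m" without digits
strings <- c("x", "pm2.5.median", "rmin.10000m", "rmin.2500m",
             "rmax.5000m", "rain.sum")

# Requires digits between the dot and the trailing m
strict <- sub("\\.\\d+m$", "", strings)
strict
# [1] "x"            "pm2.5.median" "rmin"         "rmin"         "rmax"         "rain.sum"

# Accepts any non-dot characters between the dot and the trailing m
loose <- sub("\\.[^.]*m$", "", strings)
loose
# [1] "x"            "pm2.5.median" "rmin"         "rmin"         "rmax"         "rain"
```

If the suffix is always a buffer size, the stricter pattern is probably the safer choice.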

Extract Between Parts of a String

I have a string of names in the following format:
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
I am trying to extract the single digit after the second hyphen. There are instances where there will be a third hyphen and an additional digit at the end of the name. The desired output is:
1, 2, 1, 2
I assume that I will need to use sub/gsub but am not sure where to start. Any suggestions?
We can use sub. From the start of the string (^), the pattern matches zero or more characters that are not a - ([^-]*), then a -, then again zero or more non-hyphen characters and a -, and captures the digit that follows as a group. The trailing .* consumes the rest of the string, so only the replacement, the backreference to the captured group (\\1), remains.
as.integer(sub("^[^-]*-[^-]*-(\\d).*", "\\1", names))
#[1] 1 2 1 2
Or this can be modified to
as.integer(sub("^([^-]*-){2}(\\d).*", "\\2", names))
#[1] 1 2 1 2
Here's an alternative using stringr
library("stringr")
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
output = str_split_fixed(names, pattern = "-", n = 4)[,3]
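For comparison, the same splitting idea can be sketched in base R without stringr. The variable names ids, parts and third are illustrative only; ids is used in place of names to avoid masking the base function of that name.

```r
ids <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")

# Split each string on literal hyphens; fixed = TRUE skips regex interpretation
parts <- strsplit(ids, "-", fixed = TRUE)

# Take the third piece of every split and convert it to integer
third <- as.integer(vapply(parts, function(p) p[3], character(1)))
third
# [1] 1 2 1 2
```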

Inserting character dynamically into string in R

I'm trying to insert a "+" symbol into the middle of a postcode. The postcodes follow a pattern of AA111AA or AA11AA. I want the "+" to be inserted before the final number, so an output of either AA11+1AA or AA1+1AA. I've found a way to do this using stringr, but it feels like there's an easier way to do this than how I'm currently doing it. Below is my code.
library(stringr)
pc <- "bt43xx"
pc <- str_c(
str_sub(pc, start = 1L, end = -4L),
"+",
str_sub(pc, start = -3L, end = -1L)
)
pc
[1] "bt4+3xx"
Here are some alternatives. All solutions work if pc is a scalar or vector. No packages are needed. Of them (3) seems particularly short and simple.
1) Match everything (.*) up to the last digit (\\d) and then replace that with the first capture (i.e. the match to the part within the first set of parens), a plus and the second capture (i.e. a match to the last digit).
sub("(.*)(\\d)", "\\1+\\2", pc)
2) An alternative which is even shorter is to match a digit followed by a non-digit and replace that with a plus followed by the match:
sub("(\\d\\D)", "+\\1", pc)
## [1] "bt4+3xx"
3) This one is even shorter than (2). It matches the last 3 characters replacing the match with a plus followed by the match:
sub("(...)$", "+\\1", pc)
## [1] "bt4+3xx"
4) This one splits the string into individual characters, inserts a plus in the appropriate position using append and puts the characters back together.
sapply(Map(append, strsplit(pc, ""), after = nchar(pc) - 3, "+"), paste, collapse = "")
## [1] "bt4+3xx"
If pc were known to be a scalar (as is the case in the question) it could be simplified to:
paste(append(strsplit(pc, "")[[1]], "+", nchar(pc) - 3), collapse = "")
[1] "bt4+3xx"
This regular expression with sub and two back references should work.
sub("(\\d?)(\\d[^\\d]*)$", "\\1+\\2", pc)
[1] "bt4+3xx"
\\d? matches 1 or 0 numeric characters, 0-9, and is captured by (). It will match if at least two numeric characters are present.
\\d[^\\d]* matches a numeric character followed by all non numeric characters, and is captured by ()
$ anchors the regular expression to the end of the string
"\\1+\\2" replaces the matched elements in the first two points with themselves and a "+" in the middle.
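Another option is a lookahead, which requires perl = TRUE. It matches a digit that is followed only by non-digits up to the end of the string:
sub('(\\d)(?=\\D+$)', '+\\1', pc, perl = TRUE)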
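Since the question mentions two postcode shapes, here is a quick sketch that runs two of the base R patterns above on a vector containing both. The second value, "bt123xx", is a made-up example of the AA111AA shape.

```r
# One postcode of each shape; "bt123xx" is a hypothetical AA111AA example
pc <- c("bt43xx", "bt123xx")

# Approach (3): put a plus in front of the last three characters
by_position <- sub("(...)$", "+\\1", pc)
by_position
# [1] "bt4+3xx"  "bt12+3xx"

# Approach (1): put a plus in front of the last digit
by_digit <- sub("(.*)(\\d)", "\\1+\\2", pc)
by_digit
# [1] "bt4+3xx"  "bt12+3xx"
```

The two agree as long as every postcode ends in one digit followed by two letters; the position-based version does not check that the character after the plus is actually a digit.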

Resources