Select numeric string with dots and colon - r

I have this string
string <- c("Hospitalization from 25.1.2018 to 26.1.2018", "Date of hospitalization was from 28.8.2019 8:15", "Date of arrival 30.6.2018 20:30 to hospital")
And I would like to extract the numeric parts of each string (with dots and colons) to get this:
print(dates)
c("25.1.2018", "26.1.2018", "28.8.2019 8:15", "30.6.2018 20:30")
I have tried
dates <- gsub("([0-9]+).*$", "\\1", string)
But it gives me only the first number, before the first dot.

You can use
library(stringr)
unlist(str_extract_all(string, "\\d{1,2}\\.\\d{1,2}\\.\\d{4}(?:\\s+\\d{1,2}:\\d{1,2})?"))
# => [1] "25.1.2018" "26.1.2018" "28.8.2019 8:15" "30.6.2018 20:30"
See the regex demo.
Details
\d{1,2} - one or two digits
\. - a dot
\d{1,2}\.\d{4} - one or two digits, a dot and four digits
(?:\s+\d{1,2}:\d{1,2})? - an optional occurrence of
\s+ - one or more whitespaces
\d{1,2}:\d{1,2} - one or two digits, : and one or two digits.
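If stringr is not available, the same pattern can be tried with base R. This is a minimal sketch, assuming perl = TRUE is acceptable for the non-capturing group; the variable names pattern and dates are only illustrative.

```r
# Input vector from the question
string <- c("Hospitalization from 25.1.2018 to 26.1.2018",
            "Date of hospitalization was from 28.8.2019 8:15",
            "Date of arrival 30.6.2018 20:30 to hospital")

# Date (d.m.yyyy) with an optional time part (h:mm)
pattern <- "\\d{1,2}\\.\\d{1,2}\\.\\d{4}(?:\\s+\\d{1,2}:\\d{1,2})?"

# gregexpr finds every match per element; regmatches pulls the matched text
dates <- unlist(regmatches(string, gregexpr(pattern, string, perl = TRUE)))
print(dates)
```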

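Use sapply (with stringr loaded). Note that this returns one string per input element, so several dates found in the same element are pasted together: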
sapply(str_extract_all(string, "[0-9.:]+"), paste0, collapse = " ")
[1] "25.1.2018 26.1.2018" "28.8.2019 8:15" "30.6.2018 20:30"

Related

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
We can use sub to capture one or more digits (\\d+) at the end ($) of the string, preceded by a non-digit ([^0-9]) and any other characters (.*). In the replacement, specify the backreference (\\1) of the captured group.
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
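The sub-based answers return text even when a string has no trailing digits (the input is returned unchanged, or an empty string). As a hedged sketch, the base R version below yields NA in that case and converts the result to numeric; the fourth element of x and the names m, hit and trailing are added here for illustration only.

```r
x <- c("AB ABC 19012301927 / XX - 4625",
       "BC - AB / 827 / 9765",
       "XXXX-9276",
       "no digits here")          # hypothetical extra case without trailing digits

# Position of the run of digits at the very end of each string (-1 if none)
m <- regexpr("[0-9]+$", x)
hit <- as.vector(m) > 0

# Start from NA and fill in only the elements that matched
trailing <- rep(NA_character_, length(x))
trailing[hit] <- regmatches(x, m)

trailing_num <- as.numeric(trailing)
print(trailing_num)
```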

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
See the regex demo
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my solution to return the leftmost one, use a lazy quantifier with the first .*: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the _ must appear both before and after the 4 digits, then use a regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
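When a string may hold more than one 4-digit number enclosed in _, all of them can be collected at once. A minimal base R sketch with the same lookbehind and lookahead, assuming perl = TRUE (lookarounds need the PCRE engine); the name years is illustrative.

```r
some_string <- "1_2018_start_2007_3_end"

# Every run of exactly 4 digits that has _ immediately before and after it
years <- regmatches(some_string,
                    gregexpr("(?<=_)\\d{4}(?=_)", some_string, perl = TRUE))[[1]]
print(years)
```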
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
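A variant of the split approach that avoids the coercion warning from as.numeric and keeps leading zeros is sketched below. It checks each piece as text instead of converting first; the names parts and year are illustrative, and the digit check does use a small regex.

```r
some_string <- "1_2_start_2007_3_end"

# Split on the literal underscore
parts <- strsplit(some_string, "_", fixed = TRUE)[[1]]

# Keep only the pieces made of exactly four digits
year <- parts[grepl("^[0-9]{4}$", parts)]
print(as.integer(year))
```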
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means,
a non-space character followed by any number of characters (\S.*),
then _
followed by four digits (\d{4})
the four digits above are your year, captured using ()
another _
a non-space character followed by any other trailing characters
Since you mentioned you're new, please test this and all the other answers at https://regex101.com/. It is a good place to learn regex and explains in depth what your regex is actually doing.
If you only care about the year, then the regex below is enough:
_(\d{4})_
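To use this shorter pattern from R and get only the captured year without the underscores, one option is regexec, which reports capture groups. This is a sketch under that assumption; m and year are illustrative names.

```r
some_string <- "1_2_start_2007_3_end"

# regexec returns the full match plus each capture group
m <- regmatches(some_string, regexec("_(\\d{4})_", some_string))

# Element 1 is the whole match ("_2007_"), element 2 is capture group 1
year <- m[[1]][2]
print(year)
```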

Combining gsub calls and remove character after last instance of a string

I have the following string:
time <- "2017-05-30T09:20:00-08:00"
I want to use gsub to produce this:
"2017-05-30 09:20:00"
Here is what I have so far:
time2 <- gsub("T", " ", time)
gsub("\\-.*", "", time2)
Two questions -
How do I remove all characters after the last instance of -?
How do I combine these two statements into one?
Use a single call to a sub with a spelled out regex to capture the parts you are interested in, and just match everything else. Then, use replacement backreferences \1 and \2 in the replacement pattern to re-insert those two captured subparts:
^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}).*
See the regex demo.
Details:
^ - start of a string
(\d{4}-\d{2}-\d{2}) - Group 1: 4 digits, -, 2 digits, - and then 2 digits
T - a T letter
(\d{2}:\d{2}:\d{2}) - Group 2: 2 digits, :, 2 digits, : and 2 digits
.* - any 0+ chars up to the string end.
R online demo:
time_s <- "2017-05-30T09:20:00-08:00"
sub("^(\\d{4}-\\d{2}-\\d{2})T(\\d{2}:\\d{2}:\\d{2}).*", "\\1 \\2", time_s)
## => [1] "2017-05-30 09:20:00"
It may be better to use functions that convert to DateTime
library(anytime)
format(anytime(time), "%Y-%m-%d %H:%M:%S")
#[1] "2017-05-30 09:20:00"
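A base R alternative without extra packages is sketched below. It assumes the UTC offset at the end can simply be discarded (as in the desired output), so only the first 19 characters are parsed; tz = "UTC" is used here just to keep the wall-clock time unchanged when formatting.

```r
time_s <- "2017-05-30T09:20:00-08:00"

# Keep "YYYY-MM-DDTHH:MM:SS" and drop the trailing offset
parsed <- as.POSIXct(substr(time_s, 1, 19),
                     format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

out <- format(parsed, "%Y-%m-%d %H:%M:%S")
print(out)
```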

Inserting prefix of 19 into a string date

I have a vector of birth dates as character strings formatted "10-Feb-85".
When I use the as.Date() function in R it assumes the two digit year is after 2000 (none of these birth dates are after the year 2000).
example:
as.Date(x = "10-Feb-52", format = "%d-%b-%y")
returns: 2052-02-10
I'm not proficient in regular expressions but
I think that this is an occasion for a regular expression to insert a "19" after the second "-" or before the last two digits.
I've found a regex that counts forward three characters and inserts a letter:
gsub(pattern = "^(.{3})(.*)$", replacement = "\\1d\\2", x = "abcefg")
But I'm not sure how to count two from the end.
Any help is appreciated.
insert a "19" after the second "-" or before the last two digits.
Before the last two digits:
gsub(pattern = "-(\\d{2})$", replacement = "-19\\1", x = "10-Feb-52")
See the R demo. Here, - is matched first, then 2 digits ((\\d{2})) - that are at the end of string ($) - are matched and captured into Group 1.
After the second -:
gsub(pattern = "^((?:[^-]*-){2})", replacement = "\\119", x = "10-Feb-52")
See another demo. Here, 2 sequences ({2}) of 0+ chars other than - ([^-]*), each followed by a -, are matched from the start of the string (^) and captured into group 1. The replacement contains a backreference that restores the captured text in the replacement result, followed by 19.
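Once the century is inserted, the strings can be parsed with a four-digit year. A short sketch applied to a vector; the second birth date is a made-up example, and %b depends on the locale, so the month abbreviations are assumed to match an English locale.

```r
x <- c("10-Feb-52", "10-Feb-85")   # second value is a hypothetical extra example

# Insert "19" before the final two digits
x4 <- sub("-(\\d{2})$", "-19\\1", x)
print(x4)

# Parse with %Y (four-digit year); %b is locale-dependent
birth_dates <- as.Date(x4, format = "%d-%b-%Y")
print(birth_dates)
```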

In R is there a way to extract data based on the beginning and end of a pattern but not the middle data?

For example, if the following were in a single cell
(1) Number = '1111111111, 0000000000' Text =....
(2) Number = '0000000000' Text =....
it would result in:
(1) 1111111111, 0000000000
(2) 0000000000
I tried:
x1<-str_match(x,"(?<=Number'\\s\\=\\s\\')(\\d|\\s|\\,)\\d\\'")
but that doesn't work.
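We can try with str_extract_all (this assumes the leading (1), (2) labels are not part of x; otherwise their digits are extracted too)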
library(stringr)
sapply(str_extract_all(x, "[0-9]+"), toString)
#[1] "1111111111, 0000000000" "0000000000"
You may use a PCRE regex to extract the numbers after Number=' from your input text:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*)\K\d+
See the regex demo.
Pattern details:
(?:Number\s*=\s*'|\G(?!\A)\s*,\s*) - either of the two alternatives:
Number\s*=\s*' - Number, a = enclosed with 0+ whitespaces, and a '
| - or
\G(?!\A)\s*,\s* - end of the previous successful match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*)
\K - omit the text matched so far
\d+ - 1+ digits (returned as a match)
See the R demo:
> x <- c("(1) Number = '1111111111, 0000000000' Text =....", "(2) Number = '0000000000' Text =....")
> regmatches(x, gregexpr("(?:Number\\s*=\\s*'|\\G(?!\\A)\\s*,\\s*)\\K\\d+", x, perl=TRUE))
[[1]]
[1] "1111111111" "0000000000"
[[2]]
[1] "0000000000"
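If the goal is just the text between the quotes after Number =, as in the expected output from the question, a single sub with one capture group may be enough. This is a sketch that assumes every element contains exactly one Number = '...' part; the name numbers is illustrative.

```r
x <- c("(1) Number = '1111111111, 0000000000' Text =....",
       "(2) Number = '0000000000' Text =....")

# Capture everything between the quotes that follow "Number ="
numbers <- sub(".*Number\\s*=\\s*'([^']*)'.*", "\\1", x)
print(numbers)
```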
