Combining gsub calls and remove character after last instance of a string - r

I have the following string:
time <- "2017-05-30T09:20:00-08:00"
I want to use gsub to produce this:
"2017-05-30 09:20:00"
Here is what I have so far:
time2 <- gsub("T", " ", time)
gsub("\\-.*", "", time2)
Two questions -
How do I remove all characters after the last instance of -?
How do I combine these two statements into one?

Use a single call to sub with a spelled-out regex that captures the parts you are interested in and simply matches everything else. Then use the backreferences \1 and \2 in the replacement pattern to re-insert the two captured subparts:
^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}).*
See the regex demo.
Details:
^ - start of a string
(\d{4}-\d{2}-\d{2}) - Group 1: 4 digits, -, 2 digits, - and then 2 digits
T - a T letter
(\d{2}:\d{2}:\d{2}) - Group 2: 2 digits, :, 2 digits, : and 2 digits
.* - any 0+ chars up to the string end.
R online demo:
time_s <- "2017-05-30T09:20:00-08:00"
sub("^(\\d{4}-\\d{2}-\\d{2})T(\\d{2}:\\d{2}:\\d{2}).*", "\\1 \\2", time_s)
## => [1] "2017-05-30 09:20:00"
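For the first question on its own, one option is to anchor the pattern at the end of the string so that only the last - and whatever follows it are removed. A minimal sketch in base R that nests the two steps (the intermediate variable names are only for illustration):

```r
time <- "2017-05-30T09:20:00-08:00"

# "-[^-]*$" matches a "-" followed only by non-"-" characters up to the end,
# i.e. the last "-" in the string and everything after it
no_offset <- sub("-[^-]*$", "", time)

# Replace the literal "T" separator with a space
result <- sub("T", " ", no_offset, fixed = TRUE)
result
```

The two steps cannot be merged into one alternation such as "T|-[^-]*$", because each part needs a different replacement; nesting the calls or using capture groups avoids that.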

It may be better to use functions that convert to DateTime
library(anytime)
format(anytime(time), "%Y-%m-%d %H:%M:%S")
#[1] "2017-05-30 09:20:00"
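If adding a package is not desirable, a similar result can probably be had with base R alone. The sketch below assumes the timestamp always starts with 19 characters in the form YYYY-MM-DDTHH:MM:SS and that the offset should be dropped rather than applied; tz = "UTC" is an arbitrary choice made only so that no shift happens when formatting.

```r
time <- "2017-05-30T09:20:00-08:00"

# Keep the first 19 characters ("YYYY-MM-DDTHH:MM:SS"), dropping the offset
local_part <- substr(time, 1, 19)

# Parse as wall-clock time; an invalid timestamp would become NA here
parsed <- as.POSIXct(local_part, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Format back to the desired layout
formatted <- format(parsed, "%Y-%m-%d %H:%M:%S")
formatted
```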

Related

Remove one number at position n of the number in a string of numbers separated by slashes

I have a character column with this configuration:
data <- data.frame(
id = 1:3,
codes = c("08001301001", "08002401002 / 08002601003 / 17134604034", "08004701005 / 08005101001"))
I want to remove the 6th digit of any number within the string. The numbers are always 10 characters long.
My code works. However, I believe it could be done more easily with a regex, but I couldn't figure out how.
library(stringr)
remove_6_digit <- function(x){
idxs <- str_locate_all(x,"/")[[1]][,1]
for (idx in c(rev(idxs+7), 6)){
str_sub(x, idx, idx) <- ""
}
return(x)
}
result <- sapply(data$codes, remove_6_digit, USE.NAMES = F)
You can use
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
See the regex demo. This will remove the 6th digit from the start of a digit sequence.
Details:
\b - word boundary
(\d{5}) - Capturing group 1 (\1): five digits
\d - a digit.
While word boundary looks enough for the current scenario, a digit boundary is also an option in case the numbers are glued to word chars:
gsub("(?<!\\d)(\\d{5})\\d", "\\1", data$codes, perl=TRUE)
where perl=TRUE enables the PCRE regex syntax and (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location.
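To illustrate the difference between the two boundaries, here is a small sketch with a made-up input in which the number is glued to letters ("ID08001301001" is a hypothetical value, not from the question). Both calls use perl = TRUE so that the same regex engine is compared.

```r
glued <- "ID08001301001"

# Word boundary: there is no \b between "D" and "0" (both are word characters),
# so the pattern does not match and the string is returned unchanged
with_word_boundary <- gsub("\\b(\\d{5})\\d", "\\1", glued, perl = TRUE)

# Digit boundary: only requires that no digit precedes the match,
# so the 6th digit of the run is removed
with_digit_boundary <- gsub("(?<!\\d)(\\d{5})\\d", "\\1", glued, perl = TRUE)

with_word_boundary
with_digit_boundary
```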
And if you must only change numeric char sequences of 10 digits (no shorter and no longer) you can use
gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{4}(?!\\d))", "\\1", data$codes, perl=TRUE)
One remark though: your numbers consist of 11 digits, so you need to replace \\d{4} with \\d{5}, see this regex demo.
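Applied to the sample data, the 11-digit adjustment might look like the sketch below. It reuses the data frame from the question; the name cleaned is only illustrative, and perl = TRUE is added here as a precaution rather than a requirement.

```r
data <- data.frame(
  id = 1:3,
  codes = c("08001301001",
            "08002401002 / 08002601003 / 17134604034",
            "08004701005 / 08005101001"),
  stringsAsFactors = FALSE
)

# Five digits (kept), one digit (dropped), five digits (kept),
# with word boundaries so that only 11-digit runs are changed
cleaned <- gsub("\\b(\\d{5})\\d(\\d{5})\\b", "\\1\\2", data$codes, perl = TRUE)
cleaned
```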
Another possible solution, using stringr::str_replace_all and lookaround :
library(tidyverse)
data %>%
mutate(codes = str_replace_all(codes, "(?<=\\d{5})\\d(?=\\d{5})", ""))
#> id codes
#> 1 1 0800101001
#> 2 2 0800201002 / 0800201003 / 1713404034
#> 3 3 0800401005 / 0800501001

Select numeric string with dots and colon

I have this string
string <- c("Hospitalization from 25.1.2018 to 26.1.2018", "Date of hospitalization was from 28.8.2019 8:15", "Date of arrival 30.6.2018 20:30 to hospital")
And I would like to get only the numeric part of the string (with dots and colons) to have this
print(dates)
c("25.1.2018", "26.1.2018", "28.8.2019 8:15", "30.6.2018 20:30")
I have tried
dates <- gsub("([0-9]+).*$", "\\1", string)
But it gives me just first number before first dot
You can use
library(stringr)
unlist(str_extract_all(string, "\\d{1,2}\\.\\d{1,2}\\.\\d{4}(?:\\s+\\d{1,2}:\\d{1,2})?"))
# => [1] "25.1.2018" "26.1.2018" "28.8.2019 8:15" "30.6.2018 20:30"
See the regex demo.
Details
\d{1,2} - one or two digits
\. - a dot
\d{1,2}\.\d{4} - one or two digits, a dot and four digits
(?:\s+\d{1,2}:\d{1,2})? - an optional occurrence of
\s+ - one or more whitespaces
\d{1,2}:\d{1,2} - one or two digits, : and one or two digits.
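The same pattern should also work without stringr. A sketch using base R's gregexpr and regmatches, assuming the string vector from the question; perl = TRUE is used so that \d, \s and the non-capturing group are handled by PCRE:

```r
string <- c("Hospitalization from 25.1.2018 to 26.1.2018",
            "Date of hospitalization was from 28.8.2019 8:15",
            "Date of arrival 30.6.2018 20:30 to hospital")

# Date (d.m.yyyy) with an optional time part (h:mm)
pattern <- "\\d{1,2}\\.\\d{1,2}\\.\\d{4}(?:\\s+\\d{1,2}:\\d{1,2})?"

# gregexpr finds every match per element; regmatches extracts the matched text
dates <- unlist(regmatches(string, gregexpr(pattern, string, perl = TRUE)))
dates
```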
Use sapply:
sapply(str_extract_all(string, "[0-9.:]+"), paste0, collapse = " ")
[1] "25.1.2018 26.1.2018" "28.8.2019 8:15" "30.6.2018 20:30"

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A
We can use sub to capture one or more digits (\\d+) at the end ($) of the string that follows a non-digit ([^0-9]) and other characters (.*), in the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
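Since the question mentions a data frame column and numeric results, the last approach could be applied as in the sketch below. The data frame df and the column names label and trailing_number are hypothetical, made up for illustration.

```r
df <- data.frame(
  label = c("AB ABC 19012301927 / XX - 4625",
            "BC - AB / 827 / 9765",
            "XXXX-9276"),
  stringsAsFactors = FALSE
)

# Drop everything up to and including the last non-digit,
# then convert the remaining digits from character to integer
df$trailing_number <- as.integer(sub(".*\\D", "", df$label))
df$trailing_number
```

Note that as.integer would overflow to NA for very long digit runs; as.numeric or leaving the values as character may be safer in that case.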

Replace all characters except expression using gsub only

Given strings:
smple_paths <- c("/path/path/path/abc22/path/path",
"/apath/apath/paath/abc11/something/path")
I would like to replace all characters excluding phrase abc\\d{2}
Attempt
gsub(
pattern = "(?!abc\\d{2})",
replacement = "",
x = smple_paths,
perl = TRUE
)
# [1] "/path/path/path/abc22/path/path"
# [2] "/apath/apath/paath/abc11/something/path"
Desired results
abc22
abc11
Notes
I'm not looking for stringr::str_extract based solution or any other solution not based on gsub
If you do not care about the abc\d{2} context, you may use
sub(".*(abc\\d{2}).*", "\\1", smple_paths)
See this regex demo and this R demo.
If you care about the context, you may match and capture abc + 2 digits after / and before / or end of the string, while matching any text before and after this pattern using
sub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", smple_paths)
See the R demo and a regex demo.
Details
^ - start of the string (not necessary here, but kept for the sake of clarity)
.* - any 0+ chars, as many as possible
/ - a / char
(abc\\d{2}) - Group 1: abc and 2 digits
(?:/.*)? - an optional (1 or 0) occurrence of a / followed with any 0+ chars as many as possible
$ - end of string.
The \1 placeholder in the replacement pattern inserts the captured text back into the result.
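To see where the two patterns differ, consider a made-up path (not from the question) in which abc + 2 digits also appears inside a longer path segment. In this sketch both calls use perl = TRUE so that they run on the same engine:

```r
tricky_path <- "/p/abc11/xabc22y/q"

# Context-free: the greedy .* runs as far as possible,
# so the last "abc" + 2 digits anywhere in the string is captured
no_context <- sub(".*(abc\\d{2}).*", "\\1", tricky_path, perl = TRUE)

# Context-aware: the captured text must be a whole path segment,
# i.e. preceded by "/" and followed by "/" or the end of the string
with_context <- sub("^.*/(abc\\d{2})(?:/.*)?$", "\\1", tricky_path, perl = TRUE)

no_context
with_context
```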

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
See the regex demo
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my solution to return the leftmost one, use a lazy quantifier with the first .*: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the pattern requires a _ both before and after the 4 digits, then use a regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
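The lookahead matters when a longer digit run comes first. A sketch with a made-up input ("a_12345_2007_b" is hypothetical), written with base R's regexpr and regmatches as an assumed stand-in for str_extract so that no package is needed:

```r
tricky <- "a_12345_2007_b"

# Lookbehind only: the first four digits of "12345" qualify,
# because nothing constrains what follows them
lookbehind_only <- regmatches(tricky, regexpr("(?<=_)\\d{4}", tricky, perl = TRUE))

# Lookbehind and lookahead: the four digits must sit between two "_"
both_sides <- regmatches(tricky, regexpr("(?<=_)\\d{4}(?=_)", tricky, perl = TRUE))

lookbehind_only
both_sides
```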
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means:
a non-space character followed by any number of characters (\S.*),
then _
followed by four digits (\d{4});
these four digits are your year, captured using ()
another _
a non-space character followed by any other trailing characters (\S.*)
Since you mentioned you're new, please test this and all the other answers at https://regex101.com/. It is a good place to learn regex, because it explains in depth what your regex is actually doing.
If you just care about (year) then below regex is enough:
_(\d{4})_
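In R, one way to pull out just the captured group from this short pattern is regexec with regmatches. A minimal base R sketch, assuming the sample string from the question (the names m and year are illustrative):

```r
some_string <- "1_2_start_2007_3_end"

# regexec returns the position of the full match and of each capture group;
# regmatches turns those positions into text
m <- regmatches(some_string, regexec("_(\\d{4})_", some_string))

# Element 1 is the full match ("_2007_"), element 2 is capture group 1
year <- m[[1]][2]
year
```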

Resources