Extract date from a given string in R

Here is a string that I have
"7MA_S_VE_MS_FB_MEASURE_P1_2013-08-21_17-42-19.BMP"
I am trying to extract dates this way:
library(stringr)
as.Date(str_extract(test,"[0-9]{4}/[0-9]{2}/[0-9]{2}"),"%Y-%m-%d")
I am getting NA for this.
Desired output is
2013-08-21
Can someone point me in the right direction?

You have replaced your dash - with a slash / in your regular expression.
as.Date(str_extract(string, "[0-9]{4}-[0-9]{2}-[0-9]{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"
You can also replace the [0-9] bits with \d, which matches the same digits for this input. I'm not sure why, but regex pros seem to always use the \d version (note that in an R string you have to escape the backslash with another backslash):
as.Date(str_extract(string, "\\d{4}-\\d{2}-\\d{2}"), format="%Y-%m-%d")
# [1] "2013-08-21"
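To see why the original attempt returned NA, it can help to test both patterns against the string directly. The following is a minimal base R sketch (no stringr needed); the variable names x, slash_pattern, dash_pattern and extracted are made up for illustration. When the pattern does not match, str_extract returns NA, and as.Date of NA is NA.

```r
# Example string from the question.
x <- "7MA_S_VE_MS_FB_MEASURE_P1_2013-08-21_17-42-19.BMP"

slash_pattern <- "[0-9]{4}/[0-9]{2}/[0-9]{2}"  # pattern used in the question
dash_pattern  <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"  # corrected pattern

grepl(slash_pattern, x)  # FALSE: the string contains no slashes
grepl(dash_pattern, x)   # TRUE

# Base R equivalent of str_extract: locate the first match, then pull it out.
m <- regexpr(dash_pattern, x)
extracted <- as.Date(regmatches(x, m), format = "%Y-%m-%d")
format(extracted)
# [1] "2013-08-21"
```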

If the date is always at a fixed position:
as.Date(strsplit(str1, "_")[[1]][8])
#[1] "2013-08-21"

Related

regex to find the position of the first four concurrent unique values

I've solved 2022 Advent of Code day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
In R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck.

You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, each (.) captures any char into consecutively numbered groups, and the (?!...) negative lookaheads make sure each subsequent . does not match the already captured char(s).
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that stringr::str_match (like stringr::str_extract) takes the input string as the first argument and the regex as the second argument.
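Since the question asks for the position rather than the matched text, here is a hedged base R sketch that reuses the same pattern with regexpr and perl = TRUE (the lookaheads with backreferences need the PCRE engine). The names pattern, m and marker_end are illustrative only.

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pattern <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# regexpr returns the start of the first match; its length is an attribute.
m <- regexpr(pattern, txt, perl = TRUE)

# Position of the last character of the first block of four distinct characters.
marker_end <- as.integer(m) + attr(m, "match.length") - 1L
marker_end
# [1] 5
```

The match "vwbj" starts at position 2 and is 4 characters long, so it ends at position 5.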

How to extract just the date from a file path string that also includes time with the format 2020.04.12.10.30.10?

I have a filepath string that looks like this:
\\\\server\file\path\string\10X_blah.2020.04.12.10.30.10.xls
I need to extract just 2020.04.12
I've tried (?<=\.).*(?=\.), but it matches the whole date and time. I am having trouble working out how to limit it to just the first part of the match, the part that corresponds to the date.
I'm using R and mutate(date = str_extract(filepath, pattern)) to make a new column in my data frame. I just don't know the regex to find just the date.

All you need is this pattern:
\\d{4}\\.\\d{2}\\.\\d{2}
In R:
stringr::str_extract_all(my_string,"\\d{4}\\.\\d{2}\\.\\d{2}")
[[1]]
[1] "2020.04.12"
Explanation:
\\d{4}\\. - four-digit year followed by a dot
\\d{2}\\. - two-digit month followed by a dot
\\d{2} - two-digit day
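If an actual Date value is wanted rather than the text, the extracted piece can be parsed with a dotted format string. This is a minimal base R sketch under the assumption that the path looks like the one in the question; filepath, date_chr and date_val are illustrative names.

```r
# The path from the question, written as an R string (backslashes doubled).
filepath <- "\\\\server\\file\\path\\string\\10X_blah.2020.04.12.10.30.10.xls"

# First run of four digits, dot, two digits, dot, two digits.
m <- regexpr("\\d{4}\\.\\d{2}\\.\\d{2}", filepath, perl = TRUE)
date_chr <- regmatches(filepath, m)
date_chr
# [1] "2020.04.12"

# Parse with a format that uses dots as separators.
date_val <- as.Date(date_chr, format = "%Y.%m.%d")
format(date_val)
# [1] "2020-04-12"
```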
This is not a very robust solution because it relies on knowing the length of the date before taking the substring. A lookahead might be better. Nevertheless, we can do:
my_string<- readClipboard() # copy the file path
substring(stringr::str_remove_all(my_string,"\\D.*(?=\\d{4,})"),
1,10)
[1] "2020.04.12"
If you know the exact pattern (i.e., that 10 always follows a . and that 10 always exists), then maybe:
stringr::str_remove_all(my_string,"\\D.*(?=\\d{4,})|\\.10.*")
[1] "2020.04.12"

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)

If you are willing to use stringr, the package that str_replace comes from, just use str_extract:
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
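The regex ^\\d{1,3} matches between 1 and 3 digits at the very start of the string. str_extract, as the name implies, extracts and returns these matches. Note that this keeps at most the first three digits of every element, so a 5-digit string such as "12345" would also be shortened to "123".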
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
See the R online demo and the regex demo.
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
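The same idea can be applied in two passes to cover both requests in the question: first shorten 10-digit strings to three digits, then shorten 5-digit strings to two. This is a sketch using the four-element vector from the question; step1 and step2 are illustrative names.

```r
new <- c("111", "1234567891", "12", "12345")

# Pass 1: strings of exactly 10 digits keep their first 3 digits.
step1 <- sub("^(\\d{3})\\d{7}$", "\\1", new)
step1
# [1] "111"   "123"   "12"    "12345"

# Pass 2: strings of exactly 5 digits keep their first 2 digits.
step2 <- sub("^(\\d{2})\\d{3}$", "\\1", step1)
step2
# [1] "111" "123" "12"  "12"
```

Because both patterns are anchored with ^ and $, elements of any other length pass through unchanged.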
You can use:
as.numeric(substring(my_vec,1,3))
#[1] 111 123 12

How to extract text inside the brackets in R?

How can I extract all brackets which include a name AND a year?
string="testo(antonio.2018).testo(antonio).testo(giovanni,2018).testo(2018),testo(libero 2019)"
the desired output would look like this:
"(antonio.2018)" "(giovanni,2018)" "(libero 2019)"
I do not want to extract (2018) and (antonio)
You can use str_extract_all from the stringr package with this regex pattern:
stringr::str_extract_all(string,
"\\(\\w+([[:punct:]]{1}|[[:blank:]]{1})[[:digit:]]+\\)")
# [[1]]
# [1] "(antonio.2018)" "(giovanni,2018)" "(libero 2019)"
A short description of the regex:
\\w matches any word character
+ means the preceding item has to match at least once
[[:punct:]] matches any punctuation character
{1} requires exactly one occurrence
(....|....) means one pattern OR the other has to match
[[:blank:]] matches a space or tab
[[:digit:]] matches any digit
\\( and \\) match literal parentheses, which have to be escaped.

@loki's answer is great! You can also try this; I hope it works for you :)
x<-regmatches(string, gregexpr("(?=\\().*?(?<=\\))", string, perl=T))[[1]]
>x
[1] "(antonio.2018)" "(antonio)" "(giovanni,2018)" "(2018)" "(libero 2019)"
#Extract every nth value.
>x[seq_along(x) %% 2 > 0]
[1] "(antonio.2018)" "(giovanni,2018)" "(libero 2019)"
Note: I'm unsure of your complete dataset, i.e. whether the structure will always follow this pattern. If the wanted brackets are always every 2nd value, starting with the first, this will work on a large scale.
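If the position of the wanted brackets cannot be relied on, one alternative is to extract every bracketed group and then filter by content. The sketch below uses base R only and assumes that "a name" means at least one letter and "a year" means four consecutive digits; both assumptions, and the names all_groups, has_name and has_year, are mine rather than from the question.

```r
string <- "testo(antonio.2018).testo(antonio).testo(giovanni,2018).testo(2018),testo(libero 2019)"

# Every "(...)" group that contains no nested parentheses.
all_groups <- regmatches(string, gregexpr("\\([^()]*\\)", string))[[1]]

has_name <- grepl("[[:alpha:]]", all_groups)  # assumption: a name has a letter
has_year <- grepl("[0-9]{4}", all_groups)     # assumption: a year is 4 digits

all_groups[has_name & has_year]
# [1] "(antonio.2018)"  "(giovanni,2018)" "(libero 2019)"
```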

Keep the last 9 digits of an alphanumeric string in R

R gurus, how can I keep the last 9 digits of an alphanumeric string?
For example:
LA XAN 000262999444
RA XAN 000263000507
WA XAN 000263268038
SA XAN 000263000464
000263000463
000263000476
I only want to get
262999444
263000507
263268038
263000464
263000463
263000476
Thanks a lot.

It's pretty easy in stringr because str_sub interprets negative indices as offsets from the end of the string.
library(stringr)
str_sub(xx, -9, -1)
If you just want the last 9 positions, you could just use substr:
substr(xx,nchar(xx) - 8,nchar(xx))
assuming that your character vector is stored in xx. Also, as Hadley notes below, nchar will return unexpected things if xx is a factor, not a character vector. His solution using stringr is definitely preferable.
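To guard against the factor issue mentioned above, one option is to convert to character first. A minimal base R sketch, using three of the values from the question stored as a factor for illustration; xx_chr, last9 and last9_regex are illustrative names.

```r
# Three of the example values, deliberately stored as a factor.
xx <- factor(c("LA XAN 000262999444", "RA XAN 000263000507", "000263000463"))

# Convert to character before measuring string lengths.
xx_chr <- as.character(xx)

# Last 9 characters by position.
last9 <- substr(xx_chr, nchar(xx_chr) - 8, nchar(xx_chr))
last9
# [1] "262999444" "263000507" "263000463"

# Alternative: the 9 digits at the end of each string (assumes each has at least 9).
last9_regex <- regmatches(xx_chr, regexpr("[0-9]{9}$", xx_chr))
```

The positional version keeps whatever the last 9 characters are, while the regex version only matches when those characters are all digits.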
Assuming the input is a vector named "strgs":
sub(".*(.........)$", "\\1", strgs)
#[1] "262999444" "263000507" "263268038" "263000464"
?sub
?regex

Not really sure which language you are looking for, but here is a C# implementation.
The logic would be something like:
string s = "WA XAN 000263268038";
s = s.Substring(s.Length - 9); // keep the last 9 characters
Hope this helps!
