R Question: Extracting Numeric Characters from End of String - r

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A

We can use sub to capture one or more digits (\\d+) at the end ($) of the string that follows a non-digit ([^0-9]) and other characters (.*), in the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"

Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I aware of gsub but it removes all of underscores. Thank you for any suggestions.
gsub function do the same, to remove starting of the string symbol ^ used
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurence of _ and it will not limit based on the position i.e. at the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example have _ in the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::string_remove:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
> str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"

R Returning all characters after the first underscore

Sample DATA
x=c("AG.av08_binloop_v6","TL.av1_binloopv2")
Sample ATTEMPT
y=gsub(".*_","",x)
Sample DESIRED
WANT=c("binloop_v6","binloopv2")
Basically I aim to extract all the characters AFTER the first underscore value.
In the pattern, we can change the zero or more any characters (.* - here . is metacharacter that can match any character) to zero or more characters that is not a _ ([^_]*) from the start (^) of the string.
sub("^[^_]*_", "", x)
#[1] "binloop_v6" "binloopv2"
If we don't specify it as such, the _ will match till the last _ in the string and uptill that substring will be lost returning 'v6' and 'binloopv2'
An easier option would be word from stringr
library(stringr)
word(x, 2, sep = "_")
#[1] "binloop" "binloopv2"
regexpr gives the position of first match (in this case _). Then substring can be used to extract the part of x from relevant position to the end (nchar(x))
substring(x, regexpr("_", x) + 1, nchar(x))
#[1] "binloop_v6" "binloopv2"

String replace with regex condition

I have a pattern that I want to match and replace with an X. However, I only want the pattern to be replaced if the preceding character is either an A, B or not preceeded by any character (beginning of string).
I know how to replace patterns using the str_replace_all function but I don't know how I can add this additional condition. I use the following code:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("XXXX")
replacement <- str_replace_all(string, pattern, paste0("XXXX"))
Result:
[1] "XXXXAXXXXBXXXXCXXXXDXXXXEXXXXAXXXX"
Desired result:
Replacement only when preceding charterer is A, B or no character:
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
You may use
gsub("(^|[AB])0000", "\\1XXXX", string)
See the regex demo
Details
(^|[AB]) - Capturing group 1 (\1): start of string (^) or (|) A or B ([AB])
0000 - four zeros.
R demo:
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("XXXX")
gsub("(^|[AB])0000", "\\1XXXX", string)
## -> [1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"
Could you please try following. Using positive lookahead method here.
string <- "0000A0000B0000C0000D0000E0000A0000"
gsub(x = string, pattern = "(^|A|B)(?=0000)((?i)0000?)",
replacement = "\\1xxxx", perl=TRUE)
Output will be as follows.
[1] "xxxxAxxxxBxxxxC0000D0000E0000Axxxx"
Thanks to Wiktor Stribiżew for the answer! It also works with the stringr package:
library(stringr)
string <- "0000A0000B0000C0000D0000E0000A0000"
pattern <- c("0000")
replace <- str_replace_all(string, paste0("(^|[AB])",pattern), "\\1XXXX")
replace
[1] "XXXXAXXXXBXXXXC0000D0000E0000AXXXX"

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
See the regex demo
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my my solution to return the leftmost one use a lazy quantifier with the first .: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the pattern also shows - both before and after the 4 digits, then use regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means,
any number of non-space characters,
then _
followed by four digits (d{4})
above four digits is your year captured using ()
another _
any other gibberish non space string
Since, you mentioned you're new, please test this and all other answers at https://regex101.com/, pretty good to learn regex, it explains in depth what your regex is actually doing.
If you just care about (year) then below regex is enough:
_(\d{4})_

How to substring a char vector using patterns in R?

I have this kind of char vector:
"MODIS.evi.2013116.yL2.BOKU.tif"
The number in the middle of the vector is gonna change. And the evi word will change to ndvi some times.
I want to use substr (or other function, maybe) to sub-string the vector after the second point: ., ie, just take the 2013116.yL2.BOKU.tif, even when the string is MODIS.evi.2013116.yL2.BOKU.tif or MODIS.ndvi.2013116.yL2.BOKU.tif.
We can use sub to match two instance of one or more characters that are not a . followed by a . from the start (^) of the string and replace it with blank ("")
sub("^([^.]+\\.){2}", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
If the pattern to keep always start with numbers, then the above can be simplified to match only one or more non-numeric characters and replace it with blank from the start (^) of the string
sub("^\\D+", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
data
str1 <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
This deletes all leading non-digit characters in s :
sub("^\\D*", "", s)
If s is as in the Note at the end then the result of running the above is:
[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
Note:
s <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")
l = c("MODIS.evi.2013116.yL2.BOKU.tif","MODIS.ndvi.2013116.yL2.BOKU.tif")
sapply(l, function(x) strsplit(x, "vi.", fixed = T)[[1]][2])

Resources