How to substring a char vector using patterns in R? - r

I have this kind of char vector:
"MODIS.evi.2013116.yL2.BOKU.tif"
The number in the middle of the vector is gonna change. And the evi word will change to ndvi some times.
I want to use substr (or other function, maybe) to sub-string the vector after the second point: ., ie, just take the 2013116.yL2.BOKU.tif, even when the string is MODIS.evi.2013116.yL2.BOKU.tif or MODIS.ndvi.2013116.yL2.BOKU.tif.

We can use sub to match two instance of one or more characters that are not a . followed by a . from the start (^) of the string and replace it with blank ("")
sub("^([^.]+\\.){2}", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
If the pattern to keep always start with numbers, then the above can be simplified to match only one or more non-numeric characters and replace it with blank from the start (^) of the string
sub("^\\D+", "", str1)
#[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
data
str1 <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")

This deletes all leading non-digit characters in s :
sub("^\\D*", "", s)
If s is as in the Note at the end then the result of running the above is:
[1] "2013116.yL2.BOKU.tif" "2013116.yL2.BOKU.tif"
Note:
s <- c("MODIS.evi.2013116.yL2.BOKU.tif", "MODIS.ndvi.2013116.yL2.BOKU.tif")

l = c("MODIS.evi.2013116.yL2.BOKU.tif","MODIS.ndvi.2013116.yL2.BOKU.tif")
sapply(l, function(x) strsplit(x, "vi.", fixed = T)[[1]][2])

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I aware of gsub but it removes all of underscores. Thank you for any suggestions.
gsub function do the same, to remove starting of the string symbol ^ used
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurence of _ and it will not limit based on the position i.e. at the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example have _ in the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::string_remove:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
> str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"

Cut a string from right to left until a certain character is met R

I am looking to manipulate/cut a character string from right to left until a particular character is met
I want to take this:
a <- "L1.L2.L3.L4.L5"
And output this: a <- "L5"
I have specifically worded this problem as needing to cut the string from right to left because the strings can be of variable length and the output string can be of variable length as well
For example the code needs to work on:
b <- "L1.L555"
c <- "L1.L2.L3.L4.L5.L6.LLLL"
We can use sub to match characters (.*) until a . (. is a metacharacter for any character. So we escape (\\) to evaluate it literally) and replace it with blank ("")
sub(".*\\.", "", a)
#[1] "L5"
sub(".*\\.", "", b)
#[1] "L555"
sub(".*\\.", "", c)
#[1] "LLLL"
Or using trimws
trimws(a, whitespace = ".*\\.")
#[1] "L5"

R Question: Extracting Numeric Characters from End of String

I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A
We can use sub to capture one or more digits (\\d+) at the end ($) of the string that follows a non-digit ([^0-9]) and other characters (.*), in the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"

R Returning all characters after the first underscore

Sample DATA
x=c("AG.av08_binloop_v6","TL.av1_binloopv2")
Sample ATTEMPT
y=gsub(".*_","",x)
Sample DESIRED
WANT=c("binloop_v6","binloopv2")
Basically I aim to extract all the characters AFTER the first underscore value.
In the pattern, we can change the zero or more any characters (.* - here . is metacharacter that can match any character) to zero or more characters that is not a _ ([^_]*) from the start (^) of the string.
sub("^[^_]*_", "", x)
#[1] "binloop_v6" "binloopv2"
If we don't specify it as such, the _ will match till the last _ in the string and uptill that substring will be lost returning 'v6' and 'binloopv2'
An easier option would be word from stringr
library(stringr)
word(x, 2, sep = "_")
#[1] "binloop" "binloopv2"
regexpr gives the position of first match (in this case _). Then substring can be used to extract the part of x from relevant position to the end (nchar(x))
substring(x, regexpr("_", x) + 1, nchar(x))
#[1] "binloop_v6" "binloopv2"

How to take only that part of a string which occurs before a pattern of 2 dots?

I used a code of regular expressions which only took stuff before the 2nd occurrence of a dot. The following is the code:-
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now i realized i wanted to take the stuff before the first occurrence of a pattern of 2 dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But it didn't work
The example to try on
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks
We could use sub to match 2 or more dots followed by other characters and replace it with blank
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter implying any character. So, we need to escape (\\.) to get the literal meaning of the character
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also include the dot at the end with #akrun's answer, one can do:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."

Resources