Extract last digit [duplicate] - r

How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?

I'm not aware of anything in base R, but it's straight-forward to make a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as @mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"
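A quick look at the edge cases, as a sketch that simply repeats the substrRight helper from this answer so it runs on its own. substr clamps a start position below 1 to the start of the string, so asking for more characters than the string contains should return the whole string, and n = 0 should return an empty string.

```r
# Same helper as in the answer, repeated so the snippet is self-contained
substrRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}

two   <- substrRight("abc", 2)   # last two characters
whole <- substrRight("abc", 10)  # n exceeds the length: the whole string
none  <- substrRight("abc", 0)   # start is past stop: empty string
print(c(two, whole, none))
```

If a different behaviour is wanted for oversized n (an error, say), it would have to be checked explicitly before calling substr.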

If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
library(stringr)
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"

Use the stri_sub function from the stringi package.
To get a substring counted from the end, use negative indices.
Some examples:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
The package is available on CRAN; to install it, simply type
install.packages("stringi")
The development version is on GitHub: https://github.com/Rexamine/stringi

str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
[1] "example"

Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=TRUE)
That is, "delete everything that is followed by exactly one character at the end of the string". To grab more characters off the end, add that many dots to the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=TRUE)
where .{2} is shorthand for .., or "any two characters", so the pattern means "delete everything that is followed by the last two characters".
sub('.*(?=.{3}$)', '', string, perl=TRUE)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable's value into the regular expression string:
n = 3
sub(paste0('.*(?=.{', n, '}$)'), '', string, perl=TRUE)
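The lookahead idea can be wrapped in a small function. This is only a sketch: last_n_regex is a hypothetical name, and the pattern is built with sprintf and anchored with $ so that exactly the last n characters are kept. Strings shorter than n do not match and are returned unchanged.

```r
# Hypothetical helper: keep the last n characters by deleting everything before them
last_n_regex <- function(x, n) {
  # e.g. n = 6 gives the pattern ".*(?=.{6}$)"
  pattern <- sprintf(".*(?=.{%d}$)", as.integer(n))
  sub(pattern, "", x, perl = TRUE)
}

res <- last_n_regex(c("some text in a string", "abc"), 6)
print(res)
```

Note that with perl = TRUE the dot does not match a newline by default, so strings containing line breaks may need a different pattern.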

UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.

A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This works because substring() is essentially substr() underneath, but its end argument (last) defaults to 1,000,000, so it can be left out.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"

Try this:
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It should give (note that starting at nchar(x)-n returns n + 1 = 6 characters):
[1] "string"
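The off-by-one between the two start positions used in this thread can be checked directly. A minimal sketch, using the same example string; the variable names are only for illustration:

```r
x <- "some text in a string"
n <- 6

# Starting at nchar(x) - n includes one extra character (n + 1 in total)
one_extra <- substr(x, nchar(x) - n, nchar(x))
# Starting at nchar(x) - n + 1 returns exactly n characters
exact <- substr(x, nchar(x) - n + 1, nchar(x))

print(nchar(c(one_extra, exact)))
```

So nchar(x) - n with n = 5 and nchar(x) - n + 1 with n = 6 select the same six characters.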

An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)

I take a different route and use strsplit instead of substr. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each of these characters can be accessed by indexing the result of tail(), e.g. tail(splits[[1]], n=6)[i], where i is 1 to 6.

Someone above uses a similar solution to mine, but I find it easier to think of it as below:
> text<-"some text in a string" # we want only the last word, "string", which has 6 letters
> n<-5 # the character at position nchar(text) is already included, so we subtract one less than 6
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This returns the last n + 1 characters, here "string", as desired.

For those coming from Microsoft Excel or Google Sheets, you will have seen functions like LEFT(), RIGHT(), and MID(). I have created a package called forstringr; its development version is currently on GitHub.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
str_left(): counts from the left and extracts n characters
str_right(): counts from the right and extracts n characters
str_mid(): extracts characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"

I used the following code to get the last character of a string.
substr(stringOfInterest, nchar(stringOfInterest), nchar(stringOfInterest))
You can adjust the start position, e.g. nchar(stringOfInterest) - n + 1, to get the last n characters.

A small modification of @Andrie's solution also gives the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. And it suggests a counterpart for the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"

Just in case a range of characters needs to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"

Related

Replace last characters of a string if it meets criteria

I have a vector of strings:
asdf <- c("a^sdf^", "asdf^^")
Now I want to remove the last character of each string, but only if that last character is a ^, resulting in:
[1] "a^sdf" "asdf"
I tried:
function1 <- function(x){
while(any(substr(x, nchar(x) - 1 + 1, nchar(x)) == "^")){
x <- gsub(".{1}$", "", x)
}
return(x)
}
function1(asdf)
[1] "a^sd" "asdf"
As you can see, the first string loses more than just the ^ at the end. I tried experimenting with if conditions in combination with the while loop, but it didn't work out. What is missing so that only the ^ gets removed?
A possible solution, using stringr::str_remove:
library(stringr)
str_remove(asdf, "\\^+$")
#> [1] "a^sdf" "asdf"
We can treat the ^ characters as whitespace and use trimws from base R (the whitespace argument requires R 3.6.0 or later):
trimws(asdf, which = "right", whitespace = "\\^")
# [1] "a^sdf" "asdf"
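For completeness, a base R sketch that needs neither stringr nor the whitespace argument of trimws. The loop in the question goes wrong because any() stays TRUE as long as one string still ends in ^, and gsub then strips the last character of every string, including those that no longer end in ^. The function name strip_carets is made up for illustration.

```r
asdf <- c("a^sdf^", "asdf^^")

# "\\^+$" matches one or more literal carets at the end of each string
trimmed <- sub("\\^+$", "", asdf)

# Loop variant: only the elements that still end in "^" are shortened
strip_carets <- function(x) {
  ends <- endsWith(x, "^")
  while (any(ends)) {
    x[ends] <- substr(x[ends], 1, nchar(x[ends]) - 1)
    ends <- endsWith(x, "^")
  }
  x
}

looped <- strip_carets(asdf)
print(trimmed)
print(looped)
```

The anchored sub call is the simpler of the two; the loop is shown only to make the difference from the original attempt visible.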

Regular expression to extract specific part of a URL

I have a vector of URLs and need to extract a certain part of it. I've tried using a regex tester to see if my attempts worked, but they were no good.
The URLs I have are in this format: https://www.baseball-reference.com/teams/MIL/1976.shtml
I need to extract the three letters after "teams/" (so for the example above, I need "MIL").
Does anyone have any idea how to get the correct regular expression to get this working? Thanks.
1) basename/dirname Try this:
u <- "https://www.baseball-reference.com/teams/MIL/1976.shtml" # input data
basename(dirname(u))
## [1] "MIL"
2) sub Or use sub with a regular expression:
sub(".*teams/(.*?)/.*", "\\1", u)
## [1] "MIL"
3) strsplit Split the string on / and take the second last component.
s <- strsplit(u, "/")[[1]]
s[length(s) - 1]
## [1] "MIL"
4) gsub Since the required substring is all upper case and no other characters in the input are, this gsub, which removes all characters that are not upper case letters, would work:
gsub("[^A-Z]", "", u)
## [1] "MIL"
There are many different ways to achieve this using regular expressions. Here's one:
url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
gsub(".+teams/(\\w{3}).+$", "\\1", url);
#[1] "MIL"
Or
x <- c('https://www.baseball-reference.com/teams/MIL/1976.shtml')
pattern <- "/teams/([^/]+)"
m <- regexec(pattern, x)
res = regmatches(x, m)[[1]]
res[2]
which yields
[1] "MIL"
Consider using the stringr package to simplify your code when handling strings.
Use a regular expression with a positive lookbehind to capture the upper-case letters following the string "teams/":
stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
In your case, if the URLs literally all begin with the same string https://www.baseball-reference.com/teams/ then you can avoid regex entirely and use a simple substring to get the three-letter code which follows:
stringr::str_sub(url, 42, 44)
Here are the results:
> url <- "https://www.baseball-reference.com/teams/MIL/1976.shtml"
>
> stringr::str_extract(url, "(?<=teams\\/)[A-Z]*")
[1] "MIL"
>
> stringr::str_sub(url, 42, 44)
[1] "MIL"
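Since the question mentions a vector of URLs, here is a sketch showing that two of the approaches above work element-wise. The second and third URLs are made up for illustration and only follow the same format as the one in the question.

```r
# Only the first URL comes from the question; the others are illustrative
urls <- c(
  "https://www.baseball-reference.com/teams/MIL/1976.shtml",
  "https://www.baseball-reference.com/teams/NYY/1998.shtml",
  "https://www.baseball-reference.com/teams/BOS/2004.shtml"
)

# Path-based: name of the directory that contains the file
teams_path <- basename(dirname(urls))

# Regex-based: capture whatever sits between "/teams/" and the next "/"
teams_regex <- sub(".*/teams/([^/]+)/.*", "\\1", urls)

print(teams_path)
print(teams_regex)
```

The regex version does not assume the code is exactly three letters long, which may or may not be desirable depending on the data.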

Extracting and matching regular expressions in R

I have a list of strings, an example is shown below (the actual list has a much bigger variety in format)
[1] "AB-123"
[2] "AB-312"
[3] "AB-546"
[4] "ZXC/123456"
Assuming [1] is the correct format, I want to extract the regular expression from [1] and match it against the rest to detect that [4] is inconsistent. Is there a method to do this or is there a better way to achieve the same outcome?
*EDIT - I found something close to what I require; does anyone know of any packages that do this?
Given a string, generate a regex that can parse *similar* strings
We may need grep
grepl(sub("-.*", "", v1[1]), v1[-1])
data
v1 <- c( "AB-123" , "AB-312" , "AB-546" , "ZXC/123456")
Here's an attempt at making a function which classifies each character of a value as a Character, Digit or Other and compares the result to that of the first value. It is a bit rough but I'm sure this can be expanded upon to match exactly what you want:
test <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")
compare_1st <- function(x) {
x <- toupper(x)
chars <- list("A",1,"-")
repl <- c("[A-Z]", "[0-9]", "[^0-9A-Z]")
for(i in seq_along(repl)) x <- gsub(repl[i], chars[i], x)
out <- x[1] == x
attr(out, "values") <- chartr("A1-", "CDO", x)
out
}
compare_1st(test)
#[1] TRUE TRUE TRUE FALSE
#attr(,"values")
#[1] "CCODDD" "CCODDD" "CCODDD" "CCCODDDDDD"
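A related sketch that builds an actual regular expression from the first string, one character class per character, and then matches the remaining strings against it. shape_regex is a hypothetical name, and the three classes (letter, digit, anything else) are an assumption about what "same format" means here.

```r
# Hypothetical helper: turn a template string into an anchored regex of character classes
shape_regex <- function(s) {
  chars <- strsplit(s, "")[[1]]
  classes <- ifelse(grepl("[A-Za-z]", chars), "[A-Za-z]",
             ifelse(grepl("[0-9]", chars), "[0-9]", "[^A-Za-z0-9]"))
  paste0("^", paste(classes, collapse = ""), "$")
}

v1 <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")
pattern <- shape_regex(v1[1])
consistent <- grepl(pattern, v1)
print(pattern)
print(consistent)
```

This is stricter than it may need to be: it fixes the length of every run, so "ABC-123" would also be flagged as inconsistent.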

Use substr until condition is met

I have a vector from which I just need the first word. The words have different lengths. Words are separated by a symbol (. or _). How can I use the substr() function to get a new vector with just the first word?
I was thinking of something like this
x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
y <- substr(x,0, ???)
I think sub with some regular expressions would be the easiest solution:
sub(pattern = "[._].*", replacement = "", x = x)
# [1] "wooombel" "mugran" "friendly" "hungry"
Try:
sapply(strsplit(x,'[._]'), function(x) x[1])
[1] "wooombel" "mugran" "friendly" "hungry"
You could also use package stringr. It has some really handy functions for string manipulation.
One that comes to mind for this problem is word. It has a sep argument that allows the use of a regular expression.
> x <- c("wooombel.ab","mugran.cd","friendly_ef.ab","hungry_kd.xy")
> library(stringr)
> word(x, sep = "[._]")
# [1] "wooombel" "mugran" "friendly" "hungry"
Another option that allows you to continue to use substr is str_locate. It returns the start and end positions of the first match, so if we subtract 1 from the start position, we get the end of the desired first word.
> substr(x, 1, str_locate(x, "[._]")[, "start"] - 1)
# [1] "wooombel" "mugran" "friendly" "hungry"
An extraction approach with stringi:
library(stringi)
stri_extract_first_regex(x, "[a-z]+(?=[._])")
## [1] "wooombel" "mugran" "friendly" "hungry"
Though "^[a-z]+(?=[._])" may be more explicit.
Regex explanation:
^        the beginning of the string
[a-z]+   any character of: 'a' to 'z' (1 or more times)
(?=      look ahead to see if there is:
[._]     any character of: '.', '_'
)        end of look-ahead
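The same extraction can also be done in base R alone. A minimal sketch, assuming every element contains at least one character before the first separator: regexpr finds the leading run of characters that are neither . nor _, and regmatches pulls it out.

```r
x <- c("wooombel.ab", "mugran.cd", "friendly_ef.ab", "hungry_kd.xy")

# "^[^._]+" = from the start of the string, one or more characters that are not "." or "_"
first_word <- regmatches(x, regexpr("^[^._]+", x))
print(first_word)
```

One caveat: regmatches drops elements without a match, so the result can be shorter than x if some strings start with a separator or are empty.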

How to split a string from right-to-left, like Python's rsplit()?

Suppose a vector:
xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")
I want to get a new vector split from the right, with only one split per string. The expected result would be:
c("zz_ZZ_uu", "d", "II_OO", "d").
It would be like Python's rsplit() function. My current idea is to reverse the string and split it with str_split() from stringr.
Any better solutions?
update
Here is my solution returning n splits, depending on stringr, stringi and purrr. It would be nice if someone provided a version with base functions.
rsplit <- function (x, s, n) {
cc1 <- unlist(stringr::str_split(stringi::stri_reverse(x), s, n))
cc2 <- rev(purrr::map_chr(cc1, stringi::stri_reverse))
return(cc2)
}
Negative lookahead:
unlist(strsplit(xx.1, "_(?!.*_)", perl = TRUE))
# [1] "zz_ZZ_uu" "d" "II_OO" "d"
Where a(?!b) says to find such an a which is not followed by a b. In this case .*_ means that no matter how far (.*) there should not be any more _'s.
However, it seems to be not that easy to generalise this idea. First, note that it can be rewritten as positive lookahead with _(?=[^_]*$) (find _ followed by anything but _, here $ signifies the end of a string). Then a not very elegant generalisation would be
rsplit <- function(x, s, n) {
p <- paste0("[^", s, "]*")
rx <- paste0(s, "(?=", paste(rep(paste0(p, s), n - 1), collapse = ""), p, "$)")
unlist(strsplit(x, rx, perl = TRUE))
}
vec <- c("a_b_c_d_e_f_g", "a_b") # input inferred from the outputs shown below
rsplit(vec, "_", 1)
# [1] "a_b_c_d_e_f" "g" "a" "b"
rsplit(vec, "_", 3)
# [1] "a_b_c_d" "e_f_g" "a_b"
where e.g. in case n=3 this function uses _(?=[^_]*_[^_]*_[^_]*$).
Another two. In both I use "(.*)_(.*)" as the pattern to capture both parts of the string. Remember that * is greedy so the first (.*) will match as many characters as it can.
Here I use regexec to capture where your substrings start and end, and regmatches to reconstruct them:
unlist(lapply(regmatches(xx.1, regexec("(.*)_(.*)", xx.1)),
tail, -1))
And this one is a little less academic but easy to understand:
unlist(strsplit(sub("(.*)_(.*)", "\\1###\\2", xx.1), "###"))
What about just pasting it back together after it's split?
rsplit <- function( x, s ) {
spl <- strsplit( x, s, fixed=TRUE )[[1]]
res <- paste( spl[-length(spl)], collapse=s, sep="" )
c( res, spl[length(spl)] )
}
> rsplit("zz_ZZ_uu_d", "_")
[1] "zz_ZZ_uu" "d"
I also thought about a very similar approach to that of Ari
> res <- lapply(strsplit(xx.1, "_"), function(x){
c(paste0(x[-length(x)], collapse="_" ), x[length(x)])
})
> unlist(res)
[1] "zz_ZZ_uu" "d" "II_OO" "d"
This gives exactly what you want and is the simplest approach:
require(stringr)
as.vector(t(str_match(xx.1, '(.*)_(.*)') [,-1]))
[1] "zz_ZZ_uu" "d" "II_OO" "d"
Explanation:
str_split() is not the droid you're looking for, because it only splits left-to-right, and splitting and then re-pasting all the (n-1) leftmost pieces is a total waste of time. So use str_match() with a regex with two capture groups. Note that the first (.*)_ will greedily match everything up to the last occurrence of _, which is what you want. (This will fail if there isn't at least one _, and return NAs.)
str_match() returns a matrix where the first column is the entire string, and subsequent columns are individual capture groups. We don't want the first column, so drop it with [,-1]
as.vector() will unroll that matrix column-wise, which is not what you want, so we use t() to transpose it to unroll row-wise
str_match(string, pattern) is vectorized over both string and pattern, which is neat
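Since the question asks for a version with base functions, here is one possible sketch that stays vectorised and also handles strings without a separator. rsplit_once is a hypothetical name; it treats the separator as a fixed string, splits only at its last occurrence, and returns a two-column matrix with NA on the right when there is nothing to split.

```r
# Hypothetical helper: split each string once at the last occurrence of sep
rsplit_once <- function(x, sep = "_") {
  # Start position of the last occurrence of sep in each string (-1 if absent)
  pos <- vapply(gregexpr(sep, x, fixed = TRUE),
                function(m) m[length(m)], integer(1))
  found <- pos > 0
  left  <- ifelse(found, substr(x, 1, pos - 1), x)
  right <- ifelse(found, substr(x, pos + nchar(sep), nchar(x)), NA_character_)
  cbind(left, right)
}

xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")
parts <- rsplit_once(xx.1)
flat <- as.vector(t(parts))  # unroll row-wise, as in the answer above
print(flat)
```

Keeping the result as a matrix makes it easy to tell which pieces belong to which input; flattening it is only needed to reproduce the exact output requested in the question.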

Resources