How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?
I'm not aware of anything in base R, but it's straightforward to write a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as @mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"
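As a small sketch of how the helper above behaves at the edges (the inputs here are illustrative, not from the original answer): substr treats a start position below 1 as 1, and returns an empty string when the start is past the stop.

```r
# Same helper as in the answer above, repeated so this block runs on its own
substrRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}

too_long <- substrRight("abc", 5)  # n exceeds the string length: whole string comes back
zero_len <- substrRight("abc", 0)  # start (4) is past stop (3): empty string
too_long
zero_len
```

So asking for more characters than the string holds does not raise an error; it simply returns the full string.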
If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"
Use the stri_sub function from the stringi package.
To get a substring from the end, use negative indices.
See the examples below:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
You can install this package from GitHub: https://github.com/Rexamine/stringi
It is also available on CRAN now. Simply type
install.packages("stringi")
to install this package.
str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
[1] "example"
Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=T)
In words: "get rid of everything that is followed by exactly one final character". To grab more characters off the end, put a repetition count on the dot in the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=T)
where .{2} means .., or "any two characters", so the pattern reads "get rid of everything that is followed by two final characters". Likewise,
sub('.*(?=.{3}$)', '', string, perl=T)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable value into the regular expression string:
n = 3
sub(paste('.*(?=.{', n, '}$)', sep=''), '', string, perl=T)
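The pattern-building step above can be wrapped in a function. This is only a sketch: last_n_regex is a hypothetical name, and it uses sprintf rather than paste to splice n into the pattern.

```r
# Hypothetical helper: keep the last n characters via a lookahead regex.
last_n_regex <- function(string, n) {
  # e.g. n = 3 gives the pattern ".*(?=.{3}$)"
  pattern <- sprintf(".*(?=.{%d}$)", as.integer(n))
  sub(pattern, "", string, perl = TRUE)
}

res <- last_n_regex(c("some text in a string", "abc", "ab"), 3)
res
```

Strings shorter than n do not match the pattern at all, so sub leaves them unchanged.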
UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.
A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This works because substring() is essentially substr() underneath, but its end position defaults to 1,000,000, so only the start position needs to be supplied.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"
Try this:
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It should give (note that with start = nchar(x)-n this returns n + 1 characters, here 6):
[1] "string"
An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)
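A self-contained sketch of the split-and-tail idea above, since the one-liner assumes x already exists. The input vector and the helper name last_n_split are made up for illustration.

```r
# Hypothetical wrapper around the strsplit/tail approach
last_n_split <- function(x, n) {
  sapply(strsplit(x, ""), function(chars) paste(tail(chars, n), collapse = ""))
}

x <- c("some text in a string", "abc")
out <- last_n_split(x, 2)
out
```

Because tail() simply returns what is available, strings shorter than n come back whole.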
I take a different approach, using strsplit and tail. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each of these characters can be accessed by indexing the result, e.g. tail(splits[[1]], n=6)[i], where i is 1 to 6.
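A short sketch of the same steps with the result stored, so that single characters can be indexed and the pieces pasted back into one string. The variable name last6 is an assumption for illustration.

```r
splits <- strsplit("Give me your food.", split = "")

# Store the last 6 characters as a character vector
last6 <- tail(splits[[1]], n = 6)

second <- last6[2]                     # one character from the tail
joined <- paste(last6, collapse = "")  # back to a single string
second
joined
```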
Someone above posted a similar solution, but I find it easier to think of it as below:
> text<-"some text in a string" # we want only the last word, "string", which has 6 letters
> n<-5 # the character at position nchar(text)-n is included too, so we use 6 - 1
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This returns the last 6 characters, as desired.
For those coming from Microsoft Excel or Google Sheets, you will have seen functions like LEFT(), RIGHT(), and MID(). I have created a package called forstringr, and its development version is currently on GitHub.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
str_left(): counts from the left and then extracts n characters
str_right(): counts from the right and then extracts n characters
str_mid(): extracts characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"
I used the following code to get the last character of a string.
substr(stringOfInterest, nchar(stringOfInterest), nchar(stringOfInterest))
You can adjust the start position, nchar(stringOfInterest), to get the last few characters instead.
A small modification of @Andrie's solution also gives the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. The same idea extends to the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"
In case a range of characters needs to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"
I want to write a function which takes a character vector (including numbers) as input and left-pads the numbers in it with zeros. For example, this could be an input vector:
x<- c("abc124.kk", "77kk-tt", "r5mm")
x
[1] "abc124.kk" "77kk-tt" "r5mm"
Each string of the input vector contains only one number, but they are all in different positions (some are at the end, some in the middle...).
I want the output to look like this:
"abc124.kk" "077kk-tt" "r005mm"
That means adding as many leading zeros to the number in each string as needed, so that it has as many digits as the longest number.
But I want a function that does this for any input vector, not only my example (the x vector).
I have already started extracting the numbers and letters and formatted the numbers the way I want them, but how can I put them back together in the right position?
my_function<- function(x){
letters<- str_extract_all(x,"[a-z]+")
numbers<- str_extract_all(x, "[0-9]+")
digit_width<-max(nchar(numbers))
numbers_correct<- str_pad(numbers, width=digit_width, pad="0")
}
And what if I have a vector which includes some strings without numbers? How can I exclude them and get them back without any changes?
For example, if the input were
y<- c("12ab", "cd", "ef345")
the numbers variable looks like this:
[[1]]
[1] "12"
[[2]]
character(0)
In this case I would want the output at the end to look like this:
"012ab" "cd" "ef345"
An option would be to use gsubfn to capture the digits, convert them to numeric, and then pass them to sprintf for formatting:
library(gsubfn)
gsubfn("([0-9]+)", ~ sprintf("%03d", as.numeric(x)), x)
#[1] "abc124.kk" "077kk-tt" "r005mm"
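The call above hardcodes a width of 3. As a hedged base R sketch that derives the width from the longest number and leaves strings without digits untouched, one option is regexpr with regmatches. The function name pad_numbers is hypothetical, and the sketch assumes at most one run of digits per string, as in the question.

```r
# Hypothetical helper: left-pad the single number in each string with zeros
pad_numbers <- function(x) {
  m <- regexpr("[0-9]+", x)        # first run of digits; -1 where there is none
  digits <- regmatches(x, m)       # only the strings that actually matched
  if (length(digits) == 0) return(x)
  width <- max(nchar(digits))
  # Pad as text so that very long digit runs are not converted to numbers
  padded <- paste0(strrep("0", width - nchar(digits)), digits)
  regmatches(x, m) <- padded       # write the padded numbers back in place
  x
}

r1 <- pad_numbers(c("abc124.kk", "77kk-tt", "r5mm"))
r2 <- pad_numbers(c("12ab", "cd", "ef345"))
r1
r2
```

Assigning through regmatches() puts each padded number back at its original position, so the surrounding letters never need to be reassembled by hand.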
x <- c("12ab", "cd", "ef345")
s = gsub("\\D", "", x)
n = nchar(s)
max_n = max(n)
sapply(seq_along(x), function(i){
if (n[i] < max_n) {
zeroes = paste(rep(0, max_n - n[i]), collapse = "")
gsub("\\d+", paste0(zeroes, s[i]), x[i])
} else {
x[i]
}
})
#[1] "012ab" "cd" "ef345"
I wrote the following loop to convert user input, which can be one-, two- or three-digit numbers, into three-digit strings, such that an input vector [7, 8, 9, 10, 11] would be converted into an output vector [007, 008, 009, 010, 011]. This is my code:
zeroes <- function(id){
for(i in 1:length(id)){
if(id[i] <= 9){
id[i] <- paste("00", id[i], sep = "")
}
else if(id[i] >= 10 && id[i] <= 99){
id[i] <- paste("0", id[i], sep = "")
}
}
id
}
For an input vector
id <- 50:100
I get the following output:
[1] "050" "0051" "0052" "0053" "0054" "0055" "0056" "0057" "0058" "0059"
[11] "0060" "0061" "0062" "0063" "0064" "0065" "0066" "0067" "0068" "0069"
[21] "0070" "0071" "0072" "0073" "0074" "0075" "0076" "0077" "0078" "0079"
[31] "0080" "0081" "0082" "0083" "0084" "0085" "0086" "0087" "0088" "0089"
[41] "090" "091" "092" "093" "094" "095" "096" "097" "098" "099"
[51] "00100"
So, it looks like for id[1] the function works, then there is a bug for the following numbers, but for id[41:50], I get the correct output again. I haven't been able to figure out why this is the case, and what I am doing wrong. Any suggestions are warmly welcomed.
It's because when you do the first replacement on id in your function, the whole vector becomes character (because a vector can't store both numbers and characters).
So zeroes(51) works fine:
> zeroes(51)
[1] "051"
but if it's the second item, it fails:
> zeroes(c(50,51))
[1] "050" "0051"
because by the time your loop gets to the 51, it's actually "51" in quotes. And that fails:
> zeroes("51")
[1] "0051"
because "51" is less than 9:
> "51"<9
[1] TRUE
because R converts the 9 to a "9" and then does a character comparison, so only the "5" gets compared with the "9" and "5" is before "9" in the collating sequence alphabet.
Other languages might convert the character "51" to numeric and then compare with the numeric 9 and say "51"<9 is False, but R does it this way.
Lesson: don't overwrite your input vectors! (and use sprintf).
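A minimal sketch of the sprintf route mentioned in the lesson above, assuming the ids are whole numbers (the %d format rejects character input and non-integer values):

```r
id <- 50:100

# "%03d" means: integer, at least 3 characters wide, padded with zeros
padded <- sprintf("%03d", id)

first_three <- head(padded, 3)
last_one <- tail(padded, 1)
small <- sprintf("%03d", c(7L, 10L))
first_three
last_one
small
```

The input vector id stays numeric, and the padded strings live in a separate vector, which avoids the character comparison problem entirely.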
I have string data like below.
a <- c("53H", "H26","14M","M47")
##"53H" "H26" "14M" "M47"
I want to arrange the numbers and letters in a fixed order, such that
the numbers go first and the letters second, or the other way around.
How can I do it?
##"53H" "26H" "14M" "47M"
or
##"H53" "H26" "M14" "M47"
You can extract the numbers and letters separately with gsub, then use paste0
to put them in any order you like.
a <- c("53H", "H26","14M","M47")
( nums <- gsub("[^0-9]", "", a) ) ## extract numbers
# [1] "53" "26" "14" "47"
( lets <- gsub("[^A-Z]", "", a) ) ## extract letters
# [1] "H" "H" "M" "M"
Numbers first answer:
paste0(nums, lets)
# [1] "53H" "26H" "14M" "47M"
Letters first answer:
paste0(lets, nums)
# [1] "H53" "H26" "M14" "M47"
You can capture the relevant parts in groups using () and then backreference them using gsub:
a <- c("53H", "H26","14M","M47")
gsub("^([0-9]+)([A-Z]+)$", "\\2\\1", a)
# [1] "H53" "H26" "M14" "M47"
This is like saying "Find a group of numbers at the start of my string and capture them in a group (^([0-9]+)). Then find the group of letters that go on to the end of my string and capture them in a second group (([A-Z]+)$). That's my search pattern. Next, replace it such that the second group (referred to by \\2) is returned first and the first group (referred to by \\1) is returned second."
From Ananda Mahto's answer, you can order the number first and letter second using the following code:
gsub("^([A-Z]+)([0-9]+)$", "\\2\\1", a)
because you want to capture the group of letters at the start of the string (^([A-Z]+)), then capture the group of numbers that runs to the end (([0-9]+)$).
I am doing an exercise in my R class, and I hope you can help. The task is to create my own script that determines whether or not a number is a palindrome. My idea was to create a repetition structure that records each digit in a number of any size, compares those digits in order, and then makes a call as to whether the number is a palindrome or not.
So far, I thought I could use the "for" command to break the number down, like this:
# Initialize
Number <- 242
Number
N <- nchar(Number)
N
# Find numbers and digits
if (Number == 0) {
print ("Number must be greater than 0")
}
if (Number < 0) {
print ("Number must be greater than 0")
}
for (i in 1:N) {
print (Number)
Digit <- Number %/% 10^(N-1)
print (Digit)
Number <- Number %% 10^(N-1)
N <- N-1
}
The problem, though, is that since this structure overwrites the variables in each loop, I cannot print all the digits out separately once the loop is done. Can I command R to print out and record the digits produced in each loop, so that they can be compared to each other downstream and used to assess whether the original number was a palindrome or not? Thanks for your help.
There are better ways of checking for palindrome-ness in R, for which you should see the other answers. For your specific problem of keeping track of things during a for loop, one approach is to make a vector that's as long as the for loop and assign to the ith element of the vector in the ith iteration of the loop.
Number <- 12345
N <- nchar(Number)
backwardsDigits <- numeric(N) ## a vector of numerics of length N
for (i in N:1) {
backwardsDigits[i] <- Number %/% 10^(i-1)
Number <- Number %% 10^(i-1)
}
backwardsDigits
all(backwardsDigits == rev(backwardsDigits))
You could use forwardsDigits instead by writing to forwardsDigits[N - i + 1] in the loop. You don't really need to print anything during the loop, though it can be helpful for debugging.
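A sketch of the forwardsDigits variant described above, using the palindromic input 12321 as an illustrative value (not from the original answer):

```r
Number <- 12321
N <- nchar(Number)
forwardsDigits <- numeric(N)  # one slot per digit

for (i in N:1) {
  # The leading digit goes into the first free slot, counting from the left
  forwardsDigits[N - i + 1] <- Number %/% 10^(i - 1)
  Number <- Number %% 10^(i - 1)
}

forwardsDigits
isPalindrome <- all(forwardsDigits == rev(forwardsDigits))
isPalindrome
```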
As @thelatemail suggested, there is another (perhaps more intuitive) way to do this.
First, let's convert the number 117712 to a string and split it up.
charsplit <- strsplit(as.character(117712), "")
[[1]]
[1] "1" "1" "7" "7" "1" "2"
Then, we'll take it out of list form and reverse it
revchar <- rev(unlist(charsplit))
[1] "2" "1" "7" "7" "1" "1"
Finally, we'll paste these together and convert them into a number:
palinum <- as.numeric(paste(revchar, collapse=""))
[1] 217711
We can then check if they're identical:
117712 == palinum
[1] FALSE
We can even write a function to do it for us.
is.palindrome <- function(number){
charsplit <- strsplit(as.character(number), "")
revchar <- rev(unlist(charsplit))
palinum <- as.numeric(paste(revchar, collapse=""))
number==palinum
}
is.palindrome(117712)
[1] FALSE
is.palindrome(117711)
[1] TRUE
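As a variation on the function above, the digits can be compared as characters without converting back to a number. This is a sketch only: is_palindrome_chr is a hypothetical name, and format(..., scientific = FALSE) is used because as.character() can switch to scientific notation for some values (for example, 1e5 becomes "1e+05").

```r
# Hypothetical variant: compare the digit characters directly
is_palindrome_chr <- function(number) {
  digits <- format(number, scientific = FALSE)  # plain digits, no exponent
  chars <- strsplit(digits, "")[[1]]
  identical(chars, rev(chars))
}

a <- is_palindrome_chr(117711)
b <- is_palindrome_chr(117712)
a
b
```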