How to space-pad unicode string character in R? - r

I am trying to pad a character with a space using sprintf() (but any base R alternative would be fine).
It works as expected for letter "a" but for "β" it won't work:
sprintf('% 2s', 'a')
#> [1] " a"
sprintf('% 2s', 'β')
#> [1] "β"
sprintf('% 3s', 'β')
#> [1] " β"
I guess it has to do with the fact that it takes two bytes (i.e., two of sprintf's "characters") to represent the "β" string. How could I change my code so that it pads with spaces in a way that treats "β" as one character (i.e., one visible character)?
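The byte/character mismatch suspected above can be checked directly with nchar(). This is a minimal sketch; it assumes the string is UTF-8 encoded, which the "\u03b2" escape guarantees in R regardless of locale.

```r
# "\u03b2" is the Greek letter beta, stored as UTF-8.
s <- "\u03b2"

n_chars <- nchar(s, type = "chars")  # number of characters: 1
n_bytes <- nchar(s, type = "bytes")  # number of bytes in UTF-8: 2

print(c(chars = n_chars, bytes = n_bytes))
```

A padding routine that counts bytes would therefore treat this string as already two units wide.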

Convert the string to native first. This worked for me on Windows but not on https://rdrr.io/snippets/ which reports its .Platform$os.type as unix.
s <- 'β'; n <- 3 # inputs
sprintf("%*s", n, enc2native(s)) # or hard code the 3 and drop n
## [1] " ß"
Alternately use paste0 or substring<- with strrep or convert the string to X's, perform the sprintf and then convert back. These worked on Windows and on https://rdrr.io/snippets/ .
# 2
paste0(strrep(' ', n - nchar(s)), s)
## [1] " β"
# 3
`substring<-`(strrep(" ", n), n - nchar(s) + 1, n, s)
## [1] " ß"
# 4
sub("X+", s, sprintf("%*s", n, strrep("X", nchar(s))))
## [1] " β"
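Approach 2 could be wrapped in a small helper. This is only a sketch: pad_left is a hypothetical name, not a base R function, and it counts characters rather than display width, so it assumes each character occupies one column.

```r
# Hypothetical helper: left-pad each string with spaces to `width` characters.
# Strings that are already `width` characters or longer are returned unchanged.
pad_left <- function(s, width) {
  # pmax() guards against a negative repeat count, which strrep() rejects
  n_pad <- pmax(0L, width - nchar(s, type = "chars"))
  paste0(strrep(" ", n_pad), s)
}

padded <- pad_left(c("a", "\u03b2", "abcd"), 3)
print(padded)
```

Because nchar(), strrep() and paste0() are all vectorised, the helper works on whole character vectors.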

Related

Extract last digit [duplicate]

How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?
I'm not aware of anything in base R, but it's straight-forward to make a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as @mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"
If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"
Use stri_sub function from stringi package.
To get substring from the end, use negative numbers.
Look below for the examples:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
You can install this package from github: https://github.com/Rexamine/stringi
It is available on CRAN now, simply type
install.packages("stringi")
to install this package.
str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
[1] "example"
Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=T)
So, "get rid of everything followed by one character". To grab more characters off the end, add however many dots in the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=T)
where .{2} means .., or "any two characters", so the whole pattern means "get rid of everything followed by two characters".
sub('.*(?=.{3}$)', '', string, perl=T)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable value into the regular expression string:
n = 3
sub(paste0('.*(?=.{', n, '}$)'), '', string, perl=TRUE)
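The pattern-building step could also be wrapped in a function. This is a sketch under assumptions: last_n_regex is a made-up name, and n is assumed to be a single non-negative whole number.

```r
# Hypothetical helper: keep the last n characters of each string via a
# lookahead pattern built with sprintf().
last_n_regex <- function(x, n) {
  # ".*(?=.{n}$)" matches everything that is followed by exactly n characters
  # and then the end of the string; that prefix is replaced by "".
  pattern <- sprintf(".*(?=.{%d}$)", n)
  sub(pattern, "", x, perl = TRUE)
}

res <- last_n_regex(c("some text in a string", "ab"), 6)
print(res)
```

Strings shorter than n do not match the pattern and come back unchanged.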
UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.
A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This takes advantage of substring() being basically substr() underneath, but with a default end value of 1,000,000.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"
Try this:
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It should give:
[1] "string"
An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)
I use substr too, but in a different way. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each of the characters can be accessed by indexing this result, e.g. tail(splits[[1]], n=6)[x], where x is 1 to 6.
Someone above used a similar solution to mine, but I find it easier to think of it as below:
> text<-"some text in a string" # we want to have only the last word "string" with 6 letter
> n<-5 #as the last character will be counted with nchar(), here we discount 1
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This will bring the last characters as desired.
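Note that starting at nchar(text) - n returns n + 1 characters, which is why n is set to 5 above to get the 6-letter word. A small sketch comparing the two start positions (variable names are only for illustration):

```r
text <- "some text in a string"
n <- 6

# Start at nchar - n: the range covers n + 1 characters
with_extra <- substr(text, nchar(text) - n, nchar(text))

# Start at nchar - n + 1: the range covers exactly n characters
exact <- substr(text, nchar(text) - n + 1, nchar(text))

print(nchar(c(with_extra, exact)))
```

With the + 1 in the start position, n can be read directly as "the number of characters wanted".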
For those coming from Microsoft Excel or Google Sheets, you would have seen functions like LEFT(), RIGHT(), and MID(). I have created a package known as forstringr and its development version is currently on Github.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
str_left(): counts from the left and then extracts n characters
str_right(): counts from the right and then extracts n characters
str_mid(): extracts characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"
I used the following code to get the last character of a string.
substr(stringOfInterest, nchar(stringOfInterest), nchar(stringOfInterest))
You can play with the nchar(stringOfInterest) to figure out how to get last few characters.
A small modification of @Andrie's solution also gives the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. The same idea works for the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"
Just in case if a range of characters need to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
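To make the difference concrete, here is a short sketch that runs both patterns on the question's data. The variable names consumed and kept are only for illustration.

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# The digit is part of the match, so it is removed along with the space
consumed <- strsplit(sample.text, split = "\\d\\s{1}")

# The lookbehind only checks that a digit precedes the space;
# the match itself is just the space, so the digit survives
kept <- strsplit(sample.text, split = "(?<=\\d)\\s", perl = TRUE)

print(consumed[[1]])
print(kept[[1]])
```

The space inside "Soybean Farming" is not preceded by a digit, so the lookbehind version leaves it alone.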
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"

Avoid R function paste generating backslash for quotes

I am trying to get two strings that contain quotations ("") combined as a character/string vector or with R function paste so I can plug the result in the argument x of writeFormula in openxlsx package.
An example is like this
paste('HYPERLINK("file)',':///"&path!$C$1&TRIM(MID(CELL("filename",B',sep="")
and I hope that it should produce the result like this
HYPERLINK("file:///"&path!$C$1&TRIM(MID(CELL("filename",B
but it actually produces the result with a backslash in front of the ":
[1] "HYPERLINK(\"file):///\"&path!$C$1&TRIM(MID(CELL(\"filename\",B"
I have searched for many potential solutions, like replacing paste with cat or wrapping paste in noquote, but then the output is not a character vector. Functions like toString or as.character can convert these results to strings, but the backslash comes back as well.
I would really appreciate any help with this. Thanks.
There are no backslashes in p. The backslashes you see are just how R displays a quote (so that you know that the quote is part of the string and not the ending delimiter) but are not in the string itself.
p <- paste0('HYPERLINK("file)', ':///"&path!$C$1&TRIM(MID(CELL("filename",B')
p
## [1] "HYPERLINK(\"file):///\"&path!$C$1&TRIM(MID(CELL(\"filename\",B"
# no backslashes are found in p
grepl("\\", p, fixed = TRUE)
## [1] FALSE
noquote(p), cat(p, "\n") or writeLines(p) can be used to display the string without the backslash escapes:
noquote(p)
## [1] HYPERLINK("file):///"&path!$C$1&TRIM(MID(CELL("filename",B
cat(p, "\n")
## HYPERLINK("file):///"&path!$C$1&TRIM(MID(CELL("filename",B
writeLines(p)
## HYPERLINK("file):///"&path!$C$1&TRIM(MID(CELL("filename",B
One can see the individual characters separated by spaces like this. We see that there are no backslashes:
do.call(cat, c(strsplit(p, ""), "\n"))
## H Y P E R L I N K ( " f i l e ) : / / / " & p a t h ! $ C $ 1 & T R I M ( M I D ( C E L L ( " f i l e n a m e " , B
As another example here p2 contains one double quote and has a single character in it, not 2:
p2 <- '"'
p2
## [1] "\""
nchar(p2)
## [1] 1

How do you control the formatting of digits when converting lists of numbers to character?

I have a nested list containing numbers, for example
x <- list(1 + .Machine$double.eps, list(rnorm(2), list(rnorm(1))))
If I call as.character on this, all the numbers are given in fixed format, to 15 significant digits.
as.character(x)
## [1] "1"
## [2] "list(c(0.654345721043012, 0.611306113713901), list(-0.278722330674071))"
I'd like to be able to control how the numbers are formatted. At the very least, I'd like to be able to control how many significant figures are included. As a bonus, being able to specify scientific formatting rather than fixed formatting would be nice.
The ?as.character help page states:
as.character represents real and complex numbers to 15 significant
digits (technically the compiler's setting of the ISO C constant
DBL_DIG, which will be 15 on machines supporting IEC60559 arithmetic
according to the C99 standard). This ensures that all the digits in
the result will be reliable (and not the result of representation
error), but does mean that conversion to character and back to numeric
may change the number. If you want to convert numbers to character
with the maximum possible precision, use format.
So it doesn't appear to be possible to change the formatting using as.character directly.
Calling format destroys the list structure:
format(x, digits = 5)
## [1] "1" "0.65435, 0.61131, -0.27872"
formatC throws an error about not supporting list inputs.
deparse also doesn't allow users to change how numbers are formatted: as.character(x) is the same as vapply(x, deparse, character(1)).
This is almost correct, but there are extra double-quote characters around the numbers that I don't want:
as.character(rapply(x, format, digits = 5, how = "list"))
## [1] "1"
## [2] "list(c(\"0.65435\", \"0.61131\"), list(\"-0.27872\"))"
How do I control the formatting of the numbers?
A partial solution: for reducing the number of significant figures, I can adapt the previous example by converting to character using format, then back to numeric.
as.character(rapply(x, function(x) as.numeric(format(x, digits = 5)), how = "list"))
## [1] "1" "list(c(-1.0884, 1.6892), list(0.58783))"
This doesn't work if I want to increase the number of sig figs beyond 15 or use scientific formatting (since we run into the limitation of as.character).
as.character(rapply(x, function(x) as.numeric(format(x, digits = 22)), how = "list"))
## [1] "1"
## [2] "list(c(-1.08842504028146, 1.68923191896784), list(0.5878275490431))"
Play with the how argument to rapply():
> rapply(x, sprintf, fmt = "%0.5f", how = "replace")
[[1]]
[1] "1.00000"
[[2]]
[[2]][[1]]
[1] "0.18041" "-0.63925"
[[2]][[2]]
[[2]][[2]][[1]]
[1] "0.14309"
For more digits, change fmt:
> rapply(x, sprintf, fmt = "%0.22f", how = "replace")
[[1]]
[1] "1.0000000000000002220446"
[[2]]
[[2]][[1]]
[1] "1.2888001496908956244880" "1.0289289081633956612905"
[[2]][[2]]
[[2]][[2]][[1]]
[1] "0.4656598705611921240610"
You can gsub() out the quotes:
> gsub("\"", "", deparse(rapply(x, function(z) sprintf(fmt = "%0.22f", z), how = "replace")))
[1] "list(1.0000000000000002220446, list(c(1.2888001496908956244880, "
[2] "1.0289289081633956612905), list(0.4656598705611921240610)))"
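For the scientific-formatting part of the question, the same rapply() pattern might be combined with formatC() instead of sprintf(). This is a sketch on fixed example numbers rather than the rnorm() values above; the extra arguments digits and format are passed through rapply() to formatC().

```r
# Fixed values so the result is reproducible
x <- list(1234.5678, list(c(0.001234, 56.78), list(-9.87654)))

# format = "e" requests scientific notation; digits = 3 gives three
# digits after the decimal point in the mantissa
sci <- rapply(x, formatC, digits = 3, format = "e", how = "replace")

str(sci)
```

As with the sprintf() version, the list structure is preserved and each leaf becomes a character vector.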
Advice given to me by R Core member Martin Maechler was that you can format the numbers in a "%.17g" style (that is, R decides whether fixed or scientific formatting is best, and increases the number of significant digits to 17; see ?sprintf) by using deparse and the "digits17" control option. The latter is documented on the ?.deparseOpts help page.
vapply(x, deparse, character(1), control = "digits17")
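A quick check of what the option changes, as a sketch on a single number. It assumes an R version recent enough to support the "digits17" control option.

```r
v <- 1 + .Machine$double.eps

default_txt <- deparse(v)                        # 15 significant digits
precise_txt <- deparse(v, control = "digits17")  # 17 significant digits

print(c(default_txt, precise_txt))

# With 17 significant digits the text converts back to the same double
round_trip_ok <- as.numeric(precise_txt) == v
print(round_trip_ok)
```

The default representation collapses to "1", while the 17-digit form keeps enough digits to recover the original value.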

Extract/Remove portion of an Integer or string with random digits/characters in R

Say I have an integer
x <- as.integer(442009)
or a character string
y <- "a10ba3m1"
How do I eliminate the last two digits/character of integer/string of any length in general ?
substr returns substrings:
substr(x, 1, nchar(x)-2)
# [1] "4420"
substr(y, 1, nchar(y)-2)
# [1] "a10ba3"
If you know that the value is an integer, then you can just divide by 100 and convert back to integer (drop the decimal part). This is probably a little more efficient than converting it to a string then back.
> x <- as.integer(442009)
> floor(x/100)
[1] 4420
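Note that floor(x/100) returns a double rather than an integer. If the result should stay an integer, integer division is one option, sketched here:

```r
x <- as.integer(442009)

via_floor <- floor(x / 100)  # numerically 4420, but stored as double
via_intdiv <- x %/% 100L     # integer division keeps the integer type

print(c(is.integer(via_floor), is.integer(via_intdiv)))
```

Both give 4420 for this input; only the storage type differs.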
If you just want to remove the last 2 characters of a string then substr works.
Or, here is a regular expression that does it as well (less efficiently than substr):
> y <- "a10ba3m1"
> sub("..$", "", y)
[1] "a10ba3"
If you want to remove the last 2 digits (not any character) from a string and the last 2 digits are not guaranteed to be in the last 2 positions, then here is a regular expression that works:
> sub("[0-9]?([^0-9]*)[0-9]([^0-9]*)$", "\\1\\2", y)
[1] "a10bam"
If you want to remove up to 2 digits that appear at the very end (but not if any non digits come after them) then use this regular expression:
> sub("[0-9]{1,2}$", "", y)
[1] "a10ba3m"
