sprintf padding with non English symbols - r

I encountered strange sprintf() behaviour with non-English symbols. I tried padding strings but got unexpected results:
lapply(c("ZZZ", "ZZZZZZ", "ЯЯЯ", "ЯЯЯЯЯЯ"),
function(x) sprintf("%-20s: %s", x, "VALUE"))
#> [[1]]
#> [1] "ZZZ                 : VALUE"
#>
#> [[2]]
#> [1] "ZZZZZZ              : VALUE"
#>
#> [[3]]
#> [1] "ЯЯЯ              : VALUE"
#>
#> [[4]]
#> [1] "ЯЯЯЯЯЯ        : VALUE"
#>
Can anybody explain why this is happening and how to fix it?
Session info may be useful:
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
locale:
[1] LC_CTYPE=ru_RU.UTF-8 LC_NUMERIC=C LC_TIME=ru_RU.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=ru_RU.UTF-8 LC_MESSAGES=ru_RU.UTF-8 LC_PAPER=ru_RU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] shiny_0.12.2 R6_2.1.1 rsconnect_0.4.1.4 htmltools_0.2.6 tools_3.2.2 Rcpp_0.12.2 digest_0.6.8
[8] xtable_1.8-0 httpuv_1.3.3 mime_0.4

I can tell you why it's happening, but not how to fix it. From the docs for sprintf:
Field widths and precisions of %s conversions are interpreted as bytes, not characters, as described in the C standard.
In UTF-8 the character Я is two bytes (0xD0 0xAF), so "ЯЯЯ" is 6 bytes whereas "ZZZ" is 3 bytes, and sprintf renders them accordingly.
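A quick way to see those bytes for yourself is charToRaw(). This is only a small sketch: it assumes the string is held as UTF-8 (enc2utf8() makes that explicit) and writes Я as the escape \u042f so the result does not depend on how the script file itself is encoded.

```r
# "\u042f" is the Cyrillic letter Я; enc2utf8() makes sure we inspect UTF-8 bytes
ya <- enc2utf8("\u042f")

charToRaw(ya)               # the two bytes d0 af
charToRaw("Z")              # the single byte 5a

# sprintf() counts bytes for %s widths, so compare bytes with characters
nchar(ya, type = "bytes")   # 2
nchar(ya, type = "chars")   # 1
```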
Edit
One workaround is to use sprintf's asterisk feature, which lets you declare the width of a field (in bytes), along with the nchar function, which lets you calculate both the display width and the number of bytes in a string.
So, for example, nchar("ЯЯЯ", "width") and nchar("ЯЯЯ", "bytes") return 3 and 6, respectively. If we want to pad its width to 20 display characters, then we have to give sprintf a width of 23 bytes: 20 plus the number of bytes minus the display width.
sprintf("%-*s", 23, "ЯЯЯ")
#> [1] "ЯЯЯ                 "
Or:
str <- "ЯЯЯ"
pad.len <- 20 + nchar(str, "bytes") - nchar(str, "width")
sprintf("%-*s", pad.len, str)
#> [1] "ЯЯЯ                 "
This works for "ZZZ", too, because the bytes and display width are equal, so it comes out to 20:
pad <- function(str) {
pad.len <- 20 + nchar(str, "bytes") - nchar(str, "width")
return(sprintf("%-*s: %s", pad.len, str, "VALUE"))
}
print(lapply(c("ZZZ", "ZZZZZZ", "ЯЯЯ", "ЯЯЯЯЯЯ"), pad))
#> [[1]]
#> [1] "ZZZ                 : VALUE"
#>
#> [[2]]
#> [1] "ZZZZZZ              : VALUE"
#>
#> [[3]]
#> [1] "ЯЯЯ                 : VALUE"
#>
#> [[4]]
#> [1] "ЯЯЯЯЯЯ              : VALUE"
P.S. This is the first R code I've ever written so if you see any ways to improve it please feel free to comment.
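If you would rather avoid the byte arithmetic for sprintf() altogether, another base R sketch is to work out the padding from the display width and append the spaces yourself. Note the assumptions: pad_right() is a hypothetical helper name, not an existing function, and strrep() needs R 3.3.0 or later, which is newer than the R 3.2.2 shown in the question.

```r
# hypothetical helper: right-pad each string to `width` display columns
pad_right <- function(x, width = 20) {
  # number of spaces still needed, never negative
  n_pad <- pmax(width - nchar(x, type = "width"), 0)
  paste0(x, strrep(" ", n_pad))
}

x <- c("ZZZ", "ZZZZZZ", "\u042f\u042f\u042f", "\u042f\u042f\u042f\u042f\u042f\u042f")
padded <- paste0(pad_right(x), ": VALUE")
nchar(padded, type = "width")   # the same display width for every element
```

Strings that are already wider than `width` are returned unchanged rather than truncated.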

I found a solution using the stri_pad_right() function from the stringi package:
lapply(c("ZZZ", "ZZZZZZ", "ЯЯЯ", "ЯЯЯЯЯЯ"),
function(x) paste0(stringi::stri_pad_right(x, 20), ": VALUE"))
#> [[1]]
#> [1] "ZZZ                 : VALUE"
#>
#> [[2]]
#> [1] "ZZZZZZ              : VALUE"
#>
#> [[3]]
#> [1] "ЯЯЯ                 : VALUE"
#>
#> [[4]]
#> [1] "ЯЯЯЯЯЯ              : VALUE"
#>
Update
Another solution, based on @Jordan's answer, uses only base R functions:
str_pad <- function(str, width = floor(0.9 * getOption("width")),
side = c("left", "both", "right")) {
side <- match.arg(side)
asc <- iconv(str, "latin1", "ASCII")
ind <- is.na(asc) | asc != str
if (any(ind))
width <- width + nchar(str, "bytes") - nchar(str, "width")
switch(side, left = sprintf("%-*s", width, str),
right = sprintf("%*s", width, str),
both = sprintf("%-*s", width, sprintf("%*s", floor(width/2), str)))
}
lapply(c("ZZZ", "ZZZZZZ", "ЯЯЯ", "ЯЯЯЯЯЯ"),
function(x) paste0(str_pad(x, 20), ": VALUE"))
#> [[1]]
#> [1] "ZZZ                 : VALUE"
#>
#> [[2]]
#> [1] "ZZZZZZ              : VALUE"
#>
#> [[3]]
#> [1] "ЯЯЯ                 : VALUE"
#>
#> [[4]]
#> [1] "ЯЯЯЯЯЯ              : VALUE"
#>

Related

How to use regex in R?

I want to keep only the first four dash-separated fields, dropping the last character of the fourth field. My attempt:
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).{1}$", "\\1", ori.maf.barcode$Tumor_Sample_Barcode)
> ori.maf.barcode$Tumor_Sample_Barcode[1:5]
[1] "TCGA-2K-A9WE-01A-11D-A382-10" "TCGA-2Z-A9J1-01A-11D-A382-10"
[3] "TCGA-2Z-A9J2-01A-11D-A382-10" "TCGA-2Z-A9J3-01A-12D-A382-10"
[5] "TCGA-2Z-A9J5-01A-21D-A382-10"
Expected output:
[1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01"
[3] "TCGA-2Z-A9J2-01" "TCGA-2Z-A9J3-01"
[5] "TCGA-2Z-A9J5-01"
You could do
gsub('.-[^-]*-[^-]*-.[^-]*$', "", ori.maf.barcode$Tumor_Sample_Barcode)
#> [1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
#> [4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"
Or
substr(ori.maf.barcode$Tumor_Sample_Barcode, 1, 15)
#> [1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
#> [4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"
Using str_extract() from stringr:
library(stringr)
str_extract(ori.maf.barcode$Tumor_Sample_Barcode, "^([^-]+-){3}\\d+")
Output:
[1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01" "TCGA-2Z-A9J2-01"
[4] "TCGA-2Z-A9J3-01" "TCGA-2Z-A9J5-01"
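For reference, the pattern in the question fails because `.{1}$` allows only a single character between the fourth field and the end of the string, so nothing matches and gsub() returns the input unchanged. Here is a sketch of a capture-group version that stays close to the original attempt; the two barcodes are copied from the question and `short` is just an illustrative name.

```r
barcodes <- c("TCGA-2K-A9WE-01A-11D-A382-10", "TCGA-2Z-A9J1-01A-11D-A382-10")

# group 1: three "field-" pieces plus the fourth field minus its last character;
# "[^-]-.*$" then consumes that last character and everything after it
short <- sub("^(([^-]+-){3}[^-]+)[^-]-.*$", "\\1", barcodes)
short
#> [1] "TCGA-2K-A9WE-01" "TCGA-2Z-A9J1-01"
```

Unlike substr(x, 1, 15), this does not rely on every field having a fixed length.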

Loop including 0 R

I create the following range:
x <- seq(0,22)
Now I want to get the expected Poisson counts:
for (val in x) {
vec[val]<-(dpois(val,6.298387))*124
}
I also want the estimate for val = 0:
(dpois(0,6.298387))*124
However, the vector "vec" obtained above starts at val = 1.
How can I make the loop include val = 0 as well?
Since R is 1-indexed, there is no such thing as vec[0]. The first valid index of vec is vec[1], so you probably intended
x <- seq(0,22)
vec <- numeric()
for (val in x) {
vec[val + 1] <- dpois(val, 6.298387) * 124
}
vec
#> [1] 2.280694e-01 1.436469e+00 4.523719e+00 9.497378e+00 1.495454e+01
#> [6] 1.883790e+01 1.977473e+01 1.779270e+01 1.400816e+01 9.803203e+00
#> [11] 6.174437e+00 3.535363e+00 1.855590e+00 8.990174e-01 4.044542e-01
#> [16] 1.698273e-01 6.685238e-02 2.476836e-02 8.666707e-03 2.872962e-03
#> [21] 9.047512e-04 2.713559e-04 7.768656e-05
However, the loop is not necessary, since dpois is vectorized like many R functions. Therefore the above code simplifies to this one-liner:
dpois(0:22, 6.298387) * 124
#> [1] 2.280694e-01 1.436469e+00 4.523719e+00 9.497378e+00 1.495454e+01
#> [6] 1.883790e+01 1.977473e+01 1.779270e+01 1.400816e+01 9.803203e+00
#> [11] 6.174437e+00 3.535363e+00 1.855590e+00 8.990174e-01 4.044542e-01
#> [16] 1.698273e-01 6.685238e-02 2.476836e-02 8.666707e-03 2.872962e-03
#> [21] 9.047512e-04 2.713559e-04 7.768656e-05
Created on 2022-07-22 by the reprex package (v2.0.1)
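To see why the original loop dropped val = 0 without complaint: assigning to index 0 is a no-op in R. If the 0-based labels matter, one option is to keep them as names on the result. This is a sketch, and `expected` is just an illustrative variable name.

```r
v <- numeric()
v[0] <- 99         # no error, but nothing is stored
length(v)          # 0

x <- 0:22
expected <- setNames(dpois(x, 6.298387) * 124, x)
expected["0"]      # the estimate for val = 0, looked up by name
expected[["22"]]   # and for val = 22
```

Looking values up by name ("0", "1", ...) avoids having to remember the +1 offset.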

Capture `{cli}` output for reporting

I would like to capture the dynamic output produced by {cli} to use in error reporting/logging.
Here's an example:
my_function <- function(val, return_message = TRUE) {
if(val == 'a'){
if (return_message) cli::cli_alert("your value {val} = a!")
} else {
if (return_message) cli::cli_alert("your value {val} is not equal to a!")
}
if (return_message) cli::cli_h2("processing now complete")
return(val)
}
Using my_function, it returns val and also prints a message that is dynamic, based on the input value:
→ your value x is not equal to a!
── processing now complete ──
[1] "x"
Is there any way to capture the dynamic output from {cli} functions, ideally by appending to a list or a similar method?
Ideal output would be something like this:
my_data <- list(val = "x", message = c("your value x is not equal to a!", "processing now complete"))
I don't think cli_X() returns the text, but you could imagine a wrapper that might do what you want. Consider this as a start:
cli_wrapper <- function(str, type="alert", return_str = TRUE, return_message=TRUE, ...){
str <- with(list(...), glue::glue(str))
if(return_message){
cmd <- glue::glue('cli::cli_{type}("{str}")')
eval(parse(text=cmd))
}
if(return_str){
invisible(str)
}
}
In the function above, return_message indicates whether the cli_X() function should be run, and return_str indicates whether the string should be returned (invisibly). You could then rewrite your function with the cli_wrapper() function:
my_function <- function(val, return_message = TRUE) {
message <- NULL
if(val == 'a'){
message <- c(message, cli_wrapper("your value {val} = a!", type="alert", val = val, return_message = return_message))
}else{
message <- c(message, cli_wrapper("your value {val} is not equal to a!", type="alert", val = val, return_message = return_message))
}
message <- c(message, cli_wrapper("processing now complete", type="h2", return_message = return_message))
ret <- list(val=val, message = message)
invisible(ret)
}
Running the function a couple of different ways would give the following output:
my_data <- my_function(val="x", return_message=TRUE)
# → your value x is not equal to a!
#
# ── processing now complete ──
#
my_data
# $val
# [1] "x"
#
# $message
# [1] "your value x is not equal to a!" "processing now complete"
#
my_data <- my_function(val="x", return_message=FALSE)
my_data
# $val
# [1] "x"
#
# $message
# [1] "your value x is not equal to a!" "processing now complete"
{cli} provides a utility function cli_fmt() which gives a much easier interface for this behaviour.
# Function to test `cli_fmt()`
noisy_identity <- function(x) {
cli::cli_h1("Noisily returning {.val {x}}")
cli::cli_bullets(c(
"*" = "Here",
"*" = "are",
"*" = "some",
"*" = "bulllets!"
))
x
}
# This just returns `"hi"` and prints a bunch of text
noisy_identity("hi")
#>
#> ── Noisily returning "hi" ──────────────────────────────────────────────────────
#> • Here
#> • are
#> • some
#> • bulllets!
#> [1] "hi"
# Wrapping in `cli_fmt()` means only the message is returned
cli::cli_fmt(noisy_identity("hi"))
#> [1] ""
#> [2] "── Noisily returning \"hi\" ──────────────────────────────────────────────────────"
#> [3] "• Here"
#> [4] "• are"
#> [5] "• some"
#> [6] "• bulllets!"
# We can use this to define a function that modifies other functions so that
# they return both output value *and* messages produced like so:
capture_cli_messages <- function(fun) {
function(..., .quiet = TRUE) {
output <- list(result = NULL, messages = NULL)
output$messages <- cli::cli_fmt({
output$result <- fun(...)
})
if (!.quiet) cat(output$messages, sep = "\n")
output
}
}
# `capture_cli_messages()` modifies the behaviour of `noisy_identity()`
noisy_identity2 <- capture_cli_messages(noisy_identity)
noisy_identity2("hi")
#> $result
#> [1] "hi"
#>
#> $messages
#> [1] ""
#> [2] "── Noisily returning \"hi\" ──────────────────────────────────────────────────────"
#> [3] "• Here"
#> [4] "• are"
#> [5] "• some"
#> [6] "• bulllets!"
# The .quiet argument can be used if you still want to print the messages out
noisy_identity2("hi", .quiet = FALSE)
#>
#> ── Noisily returning "hi" ──────────────────────────────────────────────────────
#> • Here
#> • are
#> • some
#> • bulllets!
#> $result
#> [1] "hi"
#>
#> $messages
#> [1] ""
#> [2] "── Noisily returning \"hi\" ──────────────────────────────────────────────────────"
#> [3] "• Here"
#> [4] "• are"
#> [5] "• some"
#> [6] "• bulllets!"
Created on 2022-09-21 with reprex v2.0.2
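Applied to the function from the question, the same cli_fmt() idea might look like the sketch below. It assumes {cli} is installed; run_logged() is a hypothetical helper name, and the captured lines keep whatever bullet or rule symbols cli adds, so they will not be exactly the bare strings in the "ideal output".

```r
my_function <- function(val) {
  if (val == "a") {
    cli::cli_alert("your value {val} = a!")
  } else {
    cli::cli_alert("your value {val} is not equal to a!")
  }
  cli::cli_h2("processing now complete")
  val
}

# hypothetical helper: run my_function() and collect its cli output
run_logged <- function(val) {
  result <- NULL
  msgs <- cli::cli_fmt(result <- my_function(val))
  msgs <- cli::ansi_strip(msgs)                     # drop colour/style codes, if any
  list(val = result, message = msgs[nzchar(msgs)])  # drop empty lines
}

# only run the example when cli is available
if (requireNamespace("cli", quietly = TRUE)) {
  my_data <- run_logged("x")
  str(my_data)
}
```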

Get names at deepest level of a nested list in R

I've been working with nested lists and recursive functions in R, following these instructions. There is just one piece missing for my own function: getting, for each element, a vector of the names leading to it, ordered from the top level to the deepest level.
The input list is:
lst <- list(
title = "References and Plant Communities in 'SWEA-Dataveg'",
author = "Miguel Alvarez",
date = "Dezember 28, 2019",
"header-includes" = c(
"- \\usepackage[utf8]{inputenc}",
"- \\usepackage[T1]{fontenc}", "- \\usepackage{bibentry}",
"- \\nobibliography{sweareferences.bib}"),
output = list(pdf_document=list(citation_package="natbib")),
"biblio-style" = "unsrtnat",
bibliography = "sweareferences.bib",
papersize = "a4")
The output list should then look like this (as printed in the console). Note the vector at lst$output$pdf_document$citation_package:
$title
[1] "title"
$author
[1] "author"
$date
[1] "date"
$`header-includes`
[1] "header-includes"
$output
$output$pdf_document
$output$pdf_document$citation_package
[1] "output"
[2] "pdf_document"
[3] "citation_package"
$`biblio-style`
[1] "biblio-style"
$bibliography
[1] "bibliography"
$papersize
[1] "papersize"
Of course, the function has to be recursive so that it works for lists nested to any depth.
Here is one possible approach, using only base R. The following function f replaces each terminal node (or "leaf") of a recursive list x with the sequence of names leading up to it. It treats unnamed lists like named lists with all names equal to "", which is a useful generalization.
f <- function(x, s = NULL) {
if (!is.list(x)) {
return(s)
}
nms <- names(x)
if (is.null(nms)) {
nms <- character(length(x))
}
Map(f, x = x, s = Map(c, list(s), nms))
}
f(lst)
$title
[1] "title"
$author
[1] "author"
$date
[1] "date"
$`header-includes`
[1] "header-includes"
$output
$output$pdf_document
$output$pdf_document$citation_package
[1] "output" "pdf_document" "citation_package"
$`biblio-style`
[1] "biblio-style"
$bibliography
[1] "bibliography"
$papersize
[1] "papersize"
Using an external package, this can be done quite efficiently with rrapply() in the rrapply-package:
rrapply::rrapply(lst, f = function(x, .xparents) .xparents)
#> $title
#> [1] "title"
#>
#> $author
#> [1] "author"
#>
#> $date
#> [1] "date"
#>
#> $`header-includes`
#> [1] "header-includes"
#>
#> $output
#> $output$pdf_document
#> $output$pdf_document$citation_package
#> [1] "output" "pdf_document" "citation_package"
#>
#>
#>
#> $`biblio-style`
#> [1] "biblio-style"
#>
#> $bibliography
#> [1] "bibliography"
#>
#> $papersize
#> [1] "papersize"
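One reason these name vectors are handy: `[[` accepts a character vector and indexes a list recursively, so each path can be used to pull the original value back out. A small sketch, reusing the base R function f from the answer above on a cut-down version of the list (it assumes the names along the path are non-empty and unique):

```r
f <- function(x, s = NULL) {
  if (!is.list(x)) return(s)
  nms <- names(x)
  if (is.null(nms)) nms <- character(length(x))
  Map(f, x = x, s = Map(c, list(s), nms))
}

# shortened version of the list from the question
lst <- list(
  title  = "References",
  output = list(pdf_document = list(citation_package = "natbib"))
)

paths <- f(lst)
path  <- paths$output$pdf_document$citation_package
path          # "output" "pdf_document" "citation_package"
lst[[path]]   # "natbib": `[[` with a character vector indexes recursively
```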

How to resolve Devanagari script encoding error in R

I have a .csv file that I saved with UTF-8 encoding. The data in the file are in Devanagari script. I can see the words correctly when I open the CSV file in Excel:
में
लिए
किया
गया
हैं
नहीं
सिंह
पुलिस
दिया
करने
कहा
रहे
बाद
करें
साथ
रहा
But when I open the file in R, the words are not decoded correctly. The output of print() looks like this:
word
सारे_खतरों_को
जानते_हà¥\u0081à¤\u008f_भी
विवेक_ने
टीवी
How can I resolve this? I have tried Sys.setlocale() and read.delim("wordlist.csv", encoding = "UTF-8"), but neither worked.
Too long for comment (sorry, I'm a greenhorn in R):
print( sessionInfo())
library(stringi)
library(magrittr)
x <- read.delim("D:\\bat\\SO\\64497248_devangari.csv", encoding = "UTF-8")
print('=== print(x)')
print(x)
for (line in x){
y <- line %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
}
print('=== print(y)')
print(y)
print('=== for (i in y) {print(i)}')
for (i in y) {print(i)}
print('=== print(z)')
z <- x['word'] %>%
stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
print(z)
Output (in Rgui.exe console):
> source ( 'D:\\bat\\SO\\64497248.r' )
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250 LC_MONETARY=Czech_Czechia.1250
[4] LC_NUMERIC=C LC_TIME=Czech_Czechia.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.1
[1] "=== print(x)"
word
1 <U+092E><U+0947><U+0902>
2 <U+0932><U+093F><U+090F>
3 <U+0915><U+093F><U+092F><U+093E>
4 <U+0917><U+092F><U+093E>
5 <U+0939><U+0948><U+0902>
6 <U+0928><U+0939><U+0940><U+0902>
7 <U+0938><U+093F><U+0902><U+0939>
8 <U+092A><U+0941><U+0932><U+093F><U+0938>
9 <U+0926><U+093F><U+092F><U+093E>
10 <U+0915><U+0930><U+0928><U+0947>
11 <U+0915><U+0939><U+093E>
12 <U+0930><U+0939><U+0947>
13 <U+092C><U+093E><U+0926>
14 <U+0915><U+0930><U+0947><U+0902>
15 <U+0938><U+093E><U+0925>
16 <U+0930><U+0939><U+093E>
[1] "=== print(y)"
[1] "में" "लिए" "किया" "गया" "हैं" "नहीं" "सिंह" "पुलिस" "दिया" "करने" "कहा" "रहे" "बाद" "करें" "साथ" "रहा"
[1] "=== for (i in y) {print(i)}"
[1] "में"
[1] "लिए"
[1] "किया"
[1] "गया"
[1] "हैं"
[1] "नहीं"
[1] "सिंह"
[1] "पुलिस"
[1] "दिया"
[1] "करने"
[1] "कहा"
[1] "रहे"
[1] "बाद"
[1] "करें"
[1] "साथ"
[1] "रहा"
[1] "=== print(z)"
[1] "c(\"में\", \"लिए\", \"किया\", \"गया\", \"हैं\", \"नहीं\", \"सिंह\", \"पुलिस\", \"दिया\", \"करने\", \"कहा\", \"रहे\", \"बाद\", \"करें\", \"साथ\", \"रहा\"\n)"
Warning messages:
1: package ‘magrittr’ was built under R version 4.0.2
2: In stri_replace_all_regex(., "<U\\+([[:alnum:]]+)>", "\\\\u$1") :
argument is not an atomic vector; coercing
>
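In the output above, print(x) shows <U+....> escapes while the plain character vector prints fine, which suggests that the console display rather than the data may be the problem. One way to check what was actually read, independent of printing, is to look at the declared encoding and the code points. This is a sketch under assumptions: tempfile() stands in for the real CSV, and the two words are the first two from the question, written as \u escapes.

```r
# first two words from the question, held as UTF-8
words <- enc2utf8(c("\u092e\u0947\u0902", "\u0932\u093f\u090f"))

# write a small UTF-8 file byte-for-byte (stand-in for the real CSV)
tmp <- tempfile(fileext = ".csv")
writeLines(c("word", words), tmp, useBytes = TRUE)

x <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

Encoding(x$word)           # declared encoding of each string
utf8ToInt(x$word[1])       # 2350 2375 2306, i.e. U+092E U+0947 U+0902
identical(x$word, words)   # TRUE if the file was read correctly
```

If the code points are right, the data are intact and only the printed representation is affected.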

Resources