R: include commas and show max decimal places for numbers

Is there any way to include commas in large numbers, for example to show 1000000 as 1,000,000, and at the same time display the maximum number of decimals for each value? I've looked through some of the existing questions, but there doesn't seem to be an option that does both. I tried
format(1000000, big.mark = ",")
which tends to round off the numbers, and if I include the nsmall option, it changes the decimal places for all the values. The ideal output for a column of numbers shows all the decimals a value has, and none if it has none. So it looks something like this:
1000000 -> 1,000,000
10043.9658 -> 10,043.9658
5005.3 -> 5,005.3
As you can see above, a value shows no decimals if it has none, and shows all of its decimals if it has any to begin with.

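You can use sapply() with format(), making sure that the digits argument is set to a sufficiently large number of significant digits and that the scientific argument is set to FALSE. Because sapply() formats each value on its own, no value inherits the decimal places of another.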
sapply(c(1000000, 10043.9658, 5005.3), format, big.mark = ",", digits = 12, scientific = FALSE)
[1] "1,000,000" "10,043.9658" "5,005.3"
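As a minimal sketch of why the element-wise call matters: format() applied to a whole vector uses one common layout for all elements, whereas formatting each value separately keeps only the decimals that value needs. The helper name fmt_commas is hypothetical, and digits = 15 is an assumed upper bound chosen because a double carries roughly 15 reliable significant digits.

```r
# Hypothetical helper: format each number on its own so that no element
# inherits the decimal places of another one.
fmt_commas <- function(x, digits = 15) {
  vapply(
    x,
    function(v) format(v, big.mark = ",", digits = digits, scientific = FALSE),
    character(1)
  )
}

x <- c(1000000, 10043.9658, 5005.3)
formatted <- fmt_commas(x)
print(formatted)
# elements: "1,000,000", "10,043.9658", "5,005.3"
```

vapply() is used instead of sapply() only so that the result is guaranteed to be a character vector; the formatting itself is the same as in the answer above.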

Related

Convert superscripted numbers from string into scientific notation (from Unicode, UTF8)

I imported a vector of p-values from an Excel table. The numbers are given as superscripted Unicode strings. After hours of trying I still struggle to convert them into numbers.
See example below. Simple conversion with as.numeric() doesn't work. I also tried to use Regex to capture the superscripted numbers, but it turned out that each superscripted number has a distinct Unicode code, for which there is no translation.
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484", "0.000223")
as.numeric(test)
Does somebody know of an R-package which could do the translation painlessly, or do I have to translate the codes one by one into digits?
This kind of formatting is definitely not very portable... Here's one possible solution though, for the exercise...
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484",
"0.000223")
library(utf8)
library(stringr)
# normalize, ie everything to "normal text"
testnorm <- utf8_normalize(test, map_case = TRUE, map_compat = TRUE)
# replace exponent part
# \\N{Minus Sign} is the unicode name of the minus sign symbol
# (see [ICU regex](http://userguide.icu-project.org/strings/regexp))
# it is necessary because the "-" is not a plain text minus sign...
testnorm <- str_replace_all(testnorm, "x10\\N{Minus Sign}", "e-")
# convert the cleaned-up strings to numbers;
# as.numeric() understands "4.26e-14", so eval(parse()) is not needed
p_vals <- as.numeric(testnorm)
# everything got adjusted to the "e-48" element...
format(p_vals, digits = 2, scientific = FALSE)
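For comparison, here is a base-R sketch of the "translate the codes one by one" route mentioned in the question, without any extra package. The helper name unsuperscript and the lookup vectors are hypothetical, and the sketch assumes the only superscripts present are the digits 0-9 and the minus sign. The sample strings are written with \u escapes so that the code does not depend on the encoding of the source file.

```r
# Code points of the superscript digits 0-9 and of the superscript minus.
# Superscript 1, 2 and 3 sit in the Latin-1 block, apart from the others.
sup_codes <- c(0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074:0x2079, 0x207B)
plain     <- c(as.character(0:9), "-")

# Hypothetical helper: replace superscripts in ONE string by plain characters.
unsuperscript <- function(s) {
  cp    <- utf8ToInt(s)                 # string -> integer code points
  idx   <- match(cp, sup_codes)         # position in the lookup, NA if none
  chars <- ifelse(is.na(idx), intToUtf8(cp, multiple = TRUE), plain[idx])
  paste(chars, collapse = "")
}

# "4.26x10^-14" and "4.35x10^-9" written with superscript characters
test <- c("0.0126", "4.26x10\u207b\u00b9\u2074", "4.35x10\u207b\u2079")

cleaned <- vapply(test, unsuperscript, character(1), USE.NAMES = FALSE)
# turn "x10" into the "e" of scientific notation, then convert
p_vals <- as.numeric(sub("x10", "e", cleaned, fixed = TRUE))
print(p_vals)
```

The lookup can be extended with the superscript plus sign (0x207A) in the same way if the data contains positive exponents written explicitly.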

Function write() inconsistent with number notation

Consider the following script:
list_of_numbers <- as.numeric()
for(i in 1001999498:1002000501){
list_of_numbers <- c(list_of_numbers, i)
}
write(list_of_numbers, file = "./list_of_numbers", ncolumns = 1)
The file that is produced looks like this:
[user#pc ~]$ cat list_of_numbers
1001999498
1001999499
1.002e+09
...
1.002e+09
1.002e+09
1.002e+09
1002000501
I found a couple more ranges where R does not print the numbers in a consistent format.
Now I have the following questions:
Is this a bug or is there an actual reason for this behavior?
Why just in certain ranges, why not every number above x?
I know I can solve it like this:
options(scipen = 1000)
But are there more elegant ways than setting global options? Without converting it to a dataframe and changing the format.
It's not a bug: R chooses the shortest representation.
More precisely, in ?options one can read:
fixed notation will be preferred unless it is more than scipen
digits wider.
So when scipen is 0 (the default), the shortest notation is preferred.
Note that you can get the scientific notation of a number x with format(x, scientific = TRUE).
In your case:
1001999499 is 10 characters long whereas its scientific notation 1.001999e+09 is longer (12 characters), so the decimal notation is kept.
1001999500: scientific notation is 1.002e+09, which is shorter.
..................... (scientific notation stays equal to 1.002e+09, hence shorter)
1002000501: 1.002001e+09 is longer.
You may ask why 1001999500 is formatted as 1.002e+09 and not as 1.0019995e+09. It's simply because there is also an option that controls the number of significant digits. It is named digits and its default value is 7. Since 1.0019995 has 8 significant digits, it is rounded to 7 of them (1.002000), and the trailing zeros are dropped, which leaves 1.002.
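The width comparison described above can be checked directly. This is a small sketch that pins digits and scipen to their default values first, so the result does not depend on the session's settings; the variable names are only illustrative.

```r
# Pin the two options the explanation relies on (these are R's defaults).
old <- options(digits = 7, scipen = 0)

x <- c(1001999499, 1001999500, 1002000501)

# Format each value on its own, which is effectively what happens per line
# in the file written by write().
default_repr <- vapply(x, format, character(1))

# Width of the fixed and of the scientific notation of each value.
fixed_width <- nchar(vapply(x, format, character(1), scientific = FALSE))
sci_width   <- nchar(vapply(x, format, character(1), scientific = TRUE))

print(data.frame(default_repr, fixed_width, sci_width))

# Restore whatever the session had before.
options(old)
```

Only the middle value has a scientific notation (9 characters) that is narrower than its fixed notation (10 characters), so it is the only one printed as 1.002e+09.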
The simplest way to ensure that decimal notation is kept without changing global options is probably to use format:
write(format(list_of_numbers, scientific = FALSE, trim = TRUE),
file = "./list_of_numbers")
Side note: you didn't need a loop to generate your list_of_numbers (which by the way is not a list but a vector). Simply use:
list_of_numbers <- as.numeric(1001999498:1002000501)

Descriptive stats from stat.desc() is in scientific notation form

Let's say I have a data frame "example" of a bunch of random integers:
example <- data.frame(Column1 = floor(runif(7, min = 1000, max = 7000)),
column2 = floor(runif(7, min = 12000, max = 70000)))
I want to get a summary of the descriptive statistics of these columns, so I use stat.desc() from the pastecs package:
library(pastecs)
stat.desc(example)
But the output of the descriptive statistics is in scientific notation form. I know I can do:
format(stat.desc(example), scientific = FALSE)
And it will convert it to non-scientific notation, but why is scientific notation the default output mode?
That is how the output is printed by default. However, you can set these options:
options(scipen=100)
options(digits=3)
stat.desc(example)
This will produce the output you want without reformatting it afterward. I've included options(digits = 3) because the output would otherwise show 6-7 significant digits. Note that digits controls the number of significant digits, not the number of decimal places, so this gives roughly 3 significant digits and no scientific notation.
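To illustrate the difference between significant digits and decimal places, here is a short sketch on plain numbers standing in for stat.desc() output (the values are made up). It passes digits to format() directly, so no global option is changed, and it assumes the session otherwise uses the default digits = 7.

```r
x <- 1234.5678   # made-up stand-in for a statistic

# digits limits SIGNIFICANT digits, so the decimals disappear entirely here.
sig3_large <- format(x, digits = 3)                                  # "1235"

# A small value keeps 3 significant digits, however many decimals that takes.
sig3_small <- format(0.000123456, digits = 3, scientific = FALSE)    # "0.000123"

# For exactly 3 DECIMAL places, round first and pad with nsmall.
dec3 <- format(round(x, 3), nsmall = 3)                              # "1234.568"

print(c(sig3_large, sig3_small, dec3))
```

If a fixed number of decimal places is what you are after, the round() plus nsmall combination is the one to use; options(digits = 3) or format(..., digits = 3) only bound the significant digits.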

Reading very small numbers in R

So I'm trying to generate some plots for my dataset, but I've encountered a certain problem with some values.
Some values are very small, 1.62132528761108e-1916 to be exact, and when such a value is read into R, it turns into 0.00000000000e+00.
I'm reading my data like so:
df <- read.table("path/to/file", header = T, sep = ' ', numerals = "no.loss")
and even with the numerals flag set to no.loss, the number turns to 0.
How can I read the exact number?
The standard numeric data type in R (8-byte double precision) does not support such small numbers. The smallest positive normalized number is about 2.2e-308:
.Machine$double.xmin
# [1] 2.225074e-308
Can you convince whatever program generates your input data to save it in, say, logarithms?
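If the generating program cannot be changed, one possible workaround is to read the column as text (for example with colClasses = "character" in read.table()) and compute the base-10 logarithm from the string itself. The following is only a sketch: the helper name log10_from_string is hypothetical, and it assumes strictly positive values written either as plain decimals or in the form mantissa, "e", exponent.

```r
# Hypothetical helper: log10 of a positive number given as a string,
# computed without ever storing the (possibly underflowing) number itself.
log10_from_string <- function(s) {
  parts <- strsplit(tolower(s), "e", fixed = TRUE)
  vapply(parts, function(p) {
    mantissa <- as.numeric(p[1])
    exponent <- if (length(p) > 1) as.numeric(p[2]) else 0
    log10(mantissa) + exponent      # log10(m * 10^k) = log10(m) + k
  }, numeric(1))
}

tiny <- "1.62132528761108e-1916"
direct   <- as.numeric(tiny)          # underflows to 0
log_tiny <- log10_from_string(tiny)   # roughly -1915.79
print(c(direct = direct, log10 = log_tiny))
```

Working on the log scale keeps the ordering of the values, which is usually what matters for plotting.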

Convert characters with time units (ms, s, us) into numerics

One of the columns in my data frame is a character vector with time span values represented as number+suffix, like so:
c("16.14ms", "7.58ms", "8.38ms", "7.29ms", "6.40ms", "5.76ms",
"5.56ms", "5.27us", "5.12ms", "5.03us", "4.91ms", "4.76ms", "16.12ms",
"7.56ms", "8.59ms", "7.16ms", "6.59ms", "5.91s", "5.62ms", "5.44ms"
)
The units are limited to micro us, milli ms, and full seconds s.
Is there a simple way to make this into a numeric column with all values being either in milliseconds or seconds?
Here are some approaches. We suppose x is the input vector shown in the question.
1) Remove the s, replace m with e-3 and replace u with e-6. Then convert to numeric, which gives every value in seconds:
as.numeric(sub("u", "e-6", sub("m", "e-3", sub("s", "", x))))
2) This could also be done neatly using gsubfn. First we match the suffix and then use a replacement list as shown:
library(gsubfn)
as.numeric(gsubfn("\\D+$", list(ms = "e-3", us = "e-6", s = "e0"), x))
This would be particularly convenient if it were desired to extend the problem to many time units as it would just be a matter of extending the list.
Note that at the top of page 4 of the gsubfn vignette there is an example which is very close to this one.
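A third variant, sketched in base R under the assumption that every element has the form number followed by one of the three units: split each string into value and unit, then multiply by a named lookup vector. The names to_seconds and scale are hypothetical.

```r
# Hypothetical helper: convert strings such as "16.14ms" to seconds.
to_seconds <- function(x) {
  value <- as.numeric(sub("[a-z]+$", "", x))   # numeric part
  unit  <- sub("^[0-9.]+", "", x)              # trailing unit
  scale <- c(us = 1e-6, ms = 1e-3, s = 1)      # factor per unit, in seconds
  unname(value * scale[unit])                  # unknown units become NA
}

x <- c("16.14ms", "5.27us", "5.91s")
secs   <- to_seconds(x)
millis <- secs * 1000     # the same values in milliseconds
print(secs)
print(millis)
```

As with the gsubfn list, supporting another unit only means adding one entry to the lookup vector.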
