Round numbers and keep scientific notation in R

I have a data frame that I want to round to two decimal places; the values currently have 16. This is how I do it:
##rounding numbers
p32_us1rounding = read.csv("p32_us1_ff6.csv")
#print(p32_us1rounding)
dfp32_us1 = data.frame(p32_us1rounding[-1])
dfp32_us1 <- round(dfp32_us1, digits = 2)
write.csv(dfp32_us1, "32 Fama French for US Market on or before 01-08-2005.csv")
This works perfectly, but a few rows contain an extremely small number that I would like to keep in scientific notation. For example, the first column of the 8th row is 7.12206157653355e-67. This obviously rounds to zero, but I would like it to read 7.12E-67. Is there any way to do this while still rounding the other numbers?

You will have to convert the numbers to character strings to use different formatting. Then just use the quote=FALSE argument in write.csv to prevent them from being surrounded by quotation marks:
set.seed(42)
x <- runif(100)*100/39
x <- matrix(x, 10, 10)
y <- ifelse(x > .1, sprintf("%.2f", x), sprintf("%.2e", x))
write.csv(y, "Test.csv", quote=FALSE, row.names=FALSE)
cat(readLines("Test.csv", 4), sep="\n")
# V1,V2,V3,V4,V5,V6,V7,V8,V9,V10
# 2.35,1.17,2.32,1.89,0.97,0.85,1.73,0.11,1.49,1.71
# 2.40,1.84,0.36,2.08,1.12,0.89,2.52,0.36,0.40,6.13e-04
# 0.73,2.40,2.54,1.00,9.60e-02,1.02,1.95,0.55,0.92,0.53
The last value in the second data row (6.13e-04) and the fifth value in the third data row (9.60e-02) are in scientific notation.
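To apply the same idea to a whole data frame, as in the question, something like the following might work. This is only a sketch: fmt_small is a made-up helper name, and the 0.01 cutoff is an assumption you would adjust to your data. It uses "%.2E" so that the exponent is printed with an upper-case E, as the question asks.

```r
# Hypothetical helper (not from the original answer): format a numeric vector
# with fixed decimals, but keep non-zero values below `threshold` in
# scientific notation.
fmt_small <- function(x, digits = 2, threshold = 0.01) {
  tiny <- !is.na(x) & x != 0 & abs(x) < threshold
  out <- sprintf(paste0("%.", digits, "f"), x)              # e.g. "1.23"
  out[tiny] <- sprintf(paste0("%.", digits, "E"), x[tiny])  # e.g. "7.12E-67"
  out[is.na(x)] <- NA_character_                            # keep missing values missing
  out
}

# Small made-up data frame standing in for the real file
df <- data.frame(a = c(1.23456, 7.12206157653355e-67),
                 b = c(0.5, 12.3456))

# Apply the helper to every column; df[] keeps the data frame structure
df[] <- lapply(df, fmt_small)
df
```

The columns are character after this step, so write.csv still needs quote=FALSE to avoid quotation marks in the output file.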

Related

Using format() in R to convert numeric value to scientific notation after rounding

A small snippet of my overall code rounds a given vector to a specified number of decimal places. The rounded value is then converted to scientific notation, e.g. "1.2e-01".
The following does the rounding which works fine.
values <- c(0.1234, 0.5678)
dig <- 2
rounded_vals <- round(values, dig) %>% str_trim()
When I run the following code I expect to see the same output for both lines.
format(rounded_vals, scientific = TRUE)
format(c(0.12, 0.56), scientific = TRUE)
What I actually get is:
> format(rounded_vals, scientific = TRUE)
[1] "0.12" "0.57"
> format(c(0.12, 0.56), scientific = TRUE)
[1] "1.2e-01" "5.6e-01"
Why doesn't format(rounded_vals, scientific = TRUE) return the same output and how can I adjust the code to do so?
Would appreciate any input :)
Edit: I missed out a bit of code that was causing the problem. It seems that str_trim() converts the values to character.
I think rounded_vals might have been stored as character values. Try converting them to numbers with as.numeric() and then passing the result to format().
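A minimal sketch of that fix, using the values from the question. Here as.character() stands in for str_trim(), on the assumption that the only relevant effect of str_trim() is the conversion to character:

```r
values <- c(0.1234, 0.5678)
dig <- 2

# as.character() stands in for str_trim(): both return a character vector
rounded_chr <- as.character(round(values, dig))
is.character(rounded_chr)

# For character input, format() only pads the strings; scientific is ignored
format(rounded_chr, scientific = TRUE)

# Converting back to numeric lets format() apply scientific notation
rounded_num <- as.numeric(rounded_chr)
format(rounded_num, scientific = TRUE)
# [1] "1.2e-01" "5.7e-01"
```

Dropping str_trim() from the pipeline altogether would avoid the round trip through character.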

Convert superscripted numbers from string into scientific notation (from Unicode, UTF8)

I imported a vector of p-values from an Excel table. The numbers are given as superscripted Unicode strings. After hours of trying I still struggle to convert them into numbers.
See example below. Simple conversion with as.numeric() doesn't work. I also tried to use Regex to capture the superscripted numbers, but it turned out that each superscripted number has a distinct Unicode code, for which there is no translation.
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484", "0.000223")
as.numeric(test)
Does somebody know of an R-package which could do the translation painlessly, or do I have to translate the codes one by one into digits?
This kind of formatting is definitely not very portable... Here's one possible solution though, for the exercise...
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484",
"0.000223")
library(utf8)
library(stringr)
# normalize, ie everything to "normal text"
testnorm <- utf8_normalize(test, map_case = TRUE, map_compat = TRUE)
# replace exponent part
# \\N{Minus Sign} is the unicode name of the minus sign symbol
# (see [ICU regex](http://userguide.icu-project.org/strings/regexp))
# it is necessary because the "-" is not a plain text minus sign...
testnorm <- str_replace_all(testnorm, "x10\\N{Minus Sign}", "e-")
# evaluate these character strings
p_vals <- sapply(X = testnorm,
                 FUN = function(x) eval(parse(text = x)),
                 USE.NAMES = FALSE
)
# in fixed notation, every element is printed with enough decimals
# to show the smallest one (the "e-48" element)...
format(p_vals, digits = 2, scientific = FALSE)
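If you would rather not depend on extra packages or on eval(parse()), a base R variant along these lines might also work. It is a sketch that assumes a UTF-8 session and that the strings only ever contain superscript digits and the superscript minus sign; sup_digits is just an illustrative name.

```r
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸", "4.35x10⁻⁹")

# Superscript digits 0-9, written as escapes so the mapping is unambiguous
sup_digits <- "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"

# Map each superscript digit to its ASCII counterpart
ascii <- chartr(sup_digits, "0123456789", test)

# Replace the superscript minus sign (U+207B) with a plain hyphen-minus
ascii <- gsub("\u207b", "-", ascii, fixed = TRUE)

# Turn "x10" into the "e" of R's exponent syntax and convert to numeric
p_vals <- as.numeric(sub("x10", "e", ascii, fixed = TRUE))
p_vals
```

Strings without an exponent part, such as "0.0126", pass through the substitutions unchanged.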

Adding leading zero once imported into R

I have a data frame which includes a Reference column. This is a 10 digit number, which could start with zeros.
When importing into R, the leading zeros disappear, which I would like to add back in.
I have tried using sprintf and formatC, but I have different problems with each.
DF=data.frame(Reference=c(102030405,2567894562,235648759), Data=c(10,20,30))
The outputs I get are the following:
> sprintf('%010d', DF$Reference)
[1] "0102030405" " NA" "0235648759"
Warning message:
In sprintf("%010d", DF$Reference) : NAs introduced by coercion
> formatC(DF$Reference, width=10, flag="0")
[1] "001.02e+08" "02.568e+09" "02.356e+08"
The first output gives NA when the number already has 10 digits, and the second returns the result in scientific notation (standard form).
What I need is:
[1] 0102030405 2567894562 0235648759
library(stringi)
DF = data.frame(Reference = c(102030405,2567894562,235648759), Data = c(10,20,30))
DF$Reference = stri_pad_left(DF$Reference, 10, "0")
DF
# Reference Data
# 1 0102030405 10
# 2 2567894562 20
# 3 0235648759 30
Alternative solutions: Adding leading zeros using R.
"When importing into R, the leading zeros disappear, which I would like to add back in."
Reading the column(s) in as characters would avoid this problem outright. You could use readr::read_csv() with the col_types argument.
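For illustration, here is the same idea in base R with the colClasses argument of read.csv() rather than readr. The inline CSV text is made up for the example; with a real file you would pass the file name instead of text =.

```r
# Made-up CSV content standing in for the real file
csv_text <- "Reference,Data
0102030405,10
2567894562,20
0235648759,30"

# Read the Reference column as character so the leading zeros survive;
# columns not named in colClasses are still type-guessed as usual
DF <- read.csv(text = csv_text, colClasses = c(Reference = "character"))

DF$Reference
# [1] "0102030405" "2567894562" "0235648759"
```

This only helps if the zeros are still present in the file; if they were already lost upstream, the padding approaches are needed.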
formatC
You can use
formatC(DF$Reference, digits = 0, width = 10, format ="f", flag="0")
# [1] "0102030405" "2567894562" "0235648759"
sprintf
The use of d in sprintf means that your values are integers (or they have to be converted with as.integer()). help(integer) explains that:
"the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly."
That is why as.integer(2567894562) returns NA.
Another workaround would be to use the character format s in sprintf:
sprintf('%010s',DF$Reference)
# [1] " 102030405" "2567894562" " 235648759"
But this gives spaces instead of leading zeros. gsub() can add zeros back by replacing spaces with zeros:
gsub(" ","0",sprintf('%010s',DF$Reference))
# [1] "0102030405" "2567894562" "0235648759"

Convert characters with time units (ms, s, us) into numerics

One of the columns in my data frame is a character vector with time span values represented as number+suffix, as so:
c("16.14ms", "7.58ms", "8.38ms", "7.29ms", "6.40ms", "5.76ms",
"5.56ms", "5.27us", "5.12ms", "5.03us", "4.91ms", "4.76ms", "16.12ms",
"7.56ms", "8.59ms", "7.16ms", "6.59ms", "5.91s", "5.62ms", "5.44ms"
)
The units are limited to microseconds (us), milliseconds (ms), and seconds (s).
Is there a simple way to make this into a numeric column with all values being either in milliseconds or seconds?
Here are some approaches. We suppose x is the input vector shown in the question.
1) Remove the s, replace m with e-3 and replace u with e-6. Then convert to numeric, which gives all values in seconds:
as.numeric(sub("u", "e-6", sub("m", "e-3", sub("s", "", x))))
2) This could also be done neatly using gsubfn. First we match the suffix and then use a replacement list as shown:
library(gsubfn)
as.numeric(gsubfn("\\D+$", list(ms = "e-3", us = "e-6", s = "e0"), x))
This would be particularly convenient if it were desired to extend the problem to many time units as it would just be a matter of extending the list.
Note that at the top of page 4 of the gsubfn vignette there is an example which is very close to this one.
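If milliseconds are preferred as the target unit, a base R sketch with a lookup vector of multipliers could look like this. The name mult_ms and the shortened input vector are only for illustration, and the code assumes every element is a plain non-negative number followed by one of the three units.

```r
x <- c("16.14ms", "5.27us", "5.91s")

# Multiplier that converts each unit to milliseconds
mult_ms <- c(us = 1e-3, ms = 1, s = 1e3)

# Split each string into its numeric part and its unit suffix
value <- as.numeric(sub("[a-z]+$", "", x))
unit  <- sub("^[0-9.]+", "", x)

# Look up the multiplier by unit name and scale
ms <- value * unname(mult_ms[unit])
ms
```

Dividing the result by 1000 gives seconds, and supporting another unit only means adding an entry to mult_ms.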

Converting internationally formatted strings to numeric

I have a file with internationally formatted numbers (i.e. strings) including units of measurement. In this case the decimal mark is "," and the thousands separator is "." (i.e. German number format).
a <- c('2.200.222 €',
' 180.109,3 €')
or
b <- c('28,42 m²',
'47,70 m²')
I'd like to convert these strings efficiently to numeric. I've tried to filter out numbers by codes like
require(stringr)
str_extract(a, pattern='[0-9]+.[0-9]+.[0-9]+')
str_extract(b, pattern='[0-9]+,[0-9]+')
However, this seems too error-prone, and I guess there must be a more standard way. So here's my question: is there a function, package, or something else that can handle this kind of problem?
Thank you very much!
Here is a function that uses gsub to deal with the sample data you posted:
x <- c('2.200.222 €', ' 180.109,3 €', '28,42 m²', '47,70 m²')
strip <- function(x){
  z <- gsub("[^0-9,.]", "", x)
  z <- gsub("\\.", "", z)
  gsub(",", ".", z)
}
as.numeric(strip(x))
# [1] 2200222.00 180109.30 28.42 47.70
It works like this:
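First, strip out every character that is not a digit, a comma, or a period (this removes the units and any spaces).
Then strip out all periods, which are the thousands separators.
Finally, convert commas to periods so that as.numeric() can read the decimal mark.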
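The same steps can be wrapped in a function that takes the two marks as arguments, so other number formats can be handled too. This is a sketch rather than a tested general solution: parse_intl and its parameter names are made up here, and it assumes non-negative numbers with at most one decimal mark.

```r
# Hypothetical helper: parse numbers with configurable decimal and grouping marks
parse_intl <- function(x, decimal_mark = ",", grouping_mark = ".") {
  z <- gsub(grouping_mark, "", x, fixed = TRUE)  # drop thousands separators
  z <- sub(decimal_mark, ".", z, fixed = TRUE)   # use "." as the decimal mark
  z <- gsub("[^0-9.]", "", z)                    # drop units, spaces, symbols
  as.numeric(z)
}

x <- c('2.200.222 €', ' 180.109,3 €', '28,42 m²', '47,70 m²')
parse_intl(x)

# English-style input: swap the two marks
parse_intl("1,234.5 USD", decimal_mark = ".", grouping_mark = ",")
```

The order matters: the grouping mark has to be removed before the decimal mark is turned into a period, otherwise the two could no longer be told apart.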

Resources