Force zeros with percent formatting in R

I'd like to display a data frame column as percentages with 4 decimal places:
scales::percent(1:3/12345)
"0.0081%" "0.0162%" "0.0243%"
This shows each value as a percent to 4 decimal places.
But if I try e.g.
scales::percent(c(1:3/12345, 0.9), accuracy = 4)
[1] "0%" "0%" "0%" "88%"
I lose the values for the first 3. I'd like those to show as
"0.0081%", "0.0162%", "0.0243%".
How can I force the same number of digits while formatting as percent? I always want 4 digits to the right of the decimal, even if they are all zero.

You can do:
scales::percent(c(1:3/12345, 0.9), accuracy = 0.0001)
[1] "0.0081%" "0.0162%" "0.0243%" "90.0000%"
The accuracy argument is the number to round to, which can feel counterintuitive at first: accuracy = 1 rounds to whole percentages, and accuracy = 4 rounds to multiples of 4 percentage points (which is why 0.9 was displayed as "88%"). If you want more decimal places, you need to pass a number smaller than 1; every decimal place in accuracy then corresponds to a decimal place in the output.
To illustrate further: if you want an output with one decimal place:
scales::percent(c(1:3/12345, 0.9), accuracy = 0.1)
[1] "0.0%" "0.0%" "0.0%" "90.0%"
while if you want it with three decimal places:
scales::percent(c(1:3/12345, 0.9), accuracy = 0.001)
[1] "0.008%" "0.016%" "0.024%" "90.000%"

Related

Trim argument in mean() when number of observations is odd

I need some clarification about the trim argument in the function mean().
In ?mean we find that
trim is the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed
If trim is non-zero, a symmetrically trimmed mean is computed
I assume that it will trim the values symmetrically, taking as many observations from the lower range of values as from the upper.
My question is, if x has an odd number of observations, and if we set trim = 0.5, will it remove one less observation in order to cut the same ones from both sides? Or will it just take one extra out randomly either from the top or the bottom?
I don't exactly know the answer to your question, but I tested with this:
vec <- c(rep(0, 50), rep(1, 51))
mean(vec)
# 0.5049505
mean(vec, trim = .1)
# 0.5061728
So the trimming is symmetric: 0.5061728 is 41/81, meaning floor(101 * 0.1) = 10 observations were dropped from each end, leaving 40 zeros and 41 ones.
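You can settle the question for good by inspecting the source; in current versions of base R, mean.default contains essentially these lines:
mean.default   # excerpt of the trimming logic
#     if (trim >= 0.5)
#         return(stats::median(x, na.rm = FALSE))
#     lo <- floor(n * trim) + 1
#     hi <- n + 1 - lo
#     x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
So floor(n * trim) observations are always dropped from each end, and trim = 0.5 is special-cased to return the median; nothing is ever removed "randomly" from one side:
vec <- c(rep(0, 50), rep(1, 51))
mean(vec, trim = 0.5)
# 1
median(vec)
# 1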

What explains the 1 decimal place rounding of x.x5 in R?

I'm looking for an explanation of how 1 decimal place rounding works for a sequence like this in R:
seq(1.05, 2.95, by = .1)
At high school, I'd round this up, i.e. 2.05 becomes 2.1. But R rounds it to 2 for 1 decimal place rounding.
The following rounding function, from the StackOverflow answer "Round up from .5", consistently achieves the high-school rounding:
round2 <- function(x, n) {
  posneg <- sign(x)      # remember the sign so negatives round away from zero too
  z <- abs(x) * 10^n     # shift the digit to be rounded into the ones place
  z <- z + 0.5           # add a half, then truncate: classic "round half up"
  z <- trunc(z)
  z <- z / 10^n          # shift back
  z * posneg
}
This code compares R's rounding with the rounding function above:
data.frame(cbind(
  Number = seq(1.05, 2.95, by = .1),
  Popular.Round = round2(seq(1.05, 2.95, by = .1), 1),
  R.Round = round(seq(1.05, 2.95, by = .1), 1)))
With R rounding, 1.05 is rounded up to 1.1 whereas 2.05 is rounded down to 2. Then again 1.95 is rounded up to 2 and 2.95 is rounded up to 3 as well.
If it is "round to even", why does 2.95 round to 3, an odd number?
Is there a better response than "just deal with it" when asked about this behavior?
Too long to read? Skip to the short answer below.
This was an interesting question to study. According to the documentation:
Note that for rounding off a 5, the IEC 60559 standard (see also ‘IEEE
754’) is expected to be used, ‘go to the even digit’. Therefore
round(0.5) is 0 and round(-1.5) is -2. However, this is dependent on
OS services and on representation error (since e.g. 0.15 is not
represented exactly, the rounding rule applies to the represented
number and not to the printed number, and so round(0.15, 1) could be
either 0.1 or 0.2).
Rounding to a negative number of digits means rounding to a power of
ten, so for example round(x, digits = -2) rounds to the nearest
hundred.
For signif the recognized values of digits are 1...22, and non-missing
values are rounded to the nearest integer in that range. Complex
numbers are rounded to retain the specified number of digits in the
larger of the components. Each element of the vector is rounded
individually, unlike printing.
Firstly, you asked why 2.95 rounds to 3, an odd number, if the rule is "round to even". To be clear, the round-to-even rule applies only when rounding off an exact 5. If you run round(2.5) or round(3.5), then R returns 2 and 4, respectively.
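You can verify this with values that are exactly representable as doubles (multiples of 0.5 are):
round(c(0.5, 1.5, 2.5, 3.5, -1.5))
# [1]  0  2  2  4 -2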
If you go here, https://stat.ethz.ch/pipermail/r-help/2008-June/164927.html, then you see this response:
The logic behind the round to even rule is that we are trying to represent an underlying continuous value, and if x comes from a truly continuous distribution, then the probability that x == 2.5 is 0, and the 2.5 was probably already rounded once from some value between 2.45 and 2.54999999999999... If we use the round up on 0.5 rule that we learned in grade school, then the double rounding means that values between 2.45 and 2.50 will all round to 3 (having been rounded first to 2.5). This will tend to bias estimates upwards. To remove the bias we need to either go back to before the rounding to 2.5 (which is often impossible or impractical), or just round up half the time and round down half the time (or better would be to round proportional to how likely we are to see values below or above 2.5 rounded to 2.5, but that will be close to 50/50 for most underlying distributions). The stochastic approach would be to have the round function randomly choose which way to round, but deterministic types are not comfortable with that, so "round to even" was chosen (round to odd should work about the same) as a consistent rule that rounds up and down about 50/50.
If you are dealing with data where 2.5 is likely to represent an exact value (money, for example), then you may do better by multiplying all values by 10 or 100 and working in integers, then converting back only for the final printing. Note that 2.50000001 rounds to 3, so if you keep more digits of accuracy until the final printing, then rounding will go in the expected direction, or you can add 0.000000001 (or another small number) to your values just before rounding, but that can bias your estimates upwards.
Short answer: if you always round 5s upward, your data is biased upward; if you round to even, the rounded data is balanced on average.
Let's test this using your data and the round2() function defined in the question:
x <- data.frame(cbind(
Number = seq(1.05, 2.95, by = .1),
Popular.Round = round2(seq(1.05, 2.95, by = .1), 1),
R.Round = round(seq(1.05, 2.95, by = .1), 1)))
> mean(x$Popular.Round)
[1] 2.05
> mean(x$R.Round)
[1] 2.02
Using a bigger sample:
x <- data.frame(cbind(
Number = seq(1.05, 6000, by = .1),
Popular.Round = round2(seq(1.05, 6000, by = .1), 1),
R.Round = round(seq(1.05, 6000, by = .1), 1)))
> mean(x$Popular.Round)
[1] 3000.55
> mean(x$R.Round)
[1] 3000.537
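Finally, the asymmetry in the comparison table (1.05 rounding up to 1.1 but 2.05 rounding down to 2) is representation error, not the round-to-even rule itself: none of these values is stored exactly as a double. Printing more digits makes this visible (output from a typical IEEE-754 platform; the exact digits may vary):
sprintf("%.17f", c(1.05, 2.05, 2.95))
# [1] "1.05000000000000004" "2.04999999999999982" "2.95000000000000018"
1.05 and 2.95 are stored slightly above their decimal values, so they round up; 2.05 is stored slightly below, so it rounds down, exactly as the documentation's round(0.15, 1) caveat describes.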

Avoid floating point zero when summing scaled set of numbers in R

I am rescaling a set of numbers and want to avoid getting a floating point zero for the sum of the rescaled numbers:
x <- c(-5, 1, 8)
y <- scale(x)
sum(y)
# [1] 1.249001e-16
Is there a way around this to force the sum to zero? I do not care about precision beyond three decimal places.
I think you should not just "switch" to exact values at some point. The scaling was computed with floating point numbers and is therefore not exact to the last bit; forcing some values to 0 would suggest a precision that is not actually available and should thus be avoided.
If you need to compare floating point values, use isTRUE(all.equal(...)), as suggested by the R documentation: https://stat.ethz.ch/R-manual/R-devel/library/base/html/all.equal.html
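If the goal is just comparison or display, base R covers both without pretending to extra precision; a minimal sketch reusing the question's data:
x <- c(-5, 1, 8)
y <- scale(x)
isTRUE(all.equal(sum(y), 0))   # tolerance-based comparison, as the docs suggest
# TRUE
round(sum(y), 3)               # three decimal places are all that's needed here
# 0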

What does a single number mean when passed as parameter 'breaks' in an R histogram?

I am learning to plot histograms in R, but I have a problem with the "breaks" parameter when it is a single number. The help says:
breaks: a single number giving the number of cells for the histogram
I did the following experiment:
data("women")
hist(women$weight, breaks = 7)
I expected it to give me 7 bins, but it gives me 10 bins instead!
Do you know what breaks = 7 means? What does "number of cells" mean in the help?
Reading the help page for the breaks argument carefully to the end, it says:
breaks
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only; the breakpoints will be set to pretty values. If breaks is a function, the
x vector is supplied to it as the only argument.
So, as you can see, the number is treated only as a "suggestion": R tries to get close to that value, but the result depends on the input values and on whether they can be split into that many nicely-rounded cells (it uses the function pretty() to compute them).
Hence, the only way to force the number of breaks is to provide the vector of interval breakpoints between the cells.
e.g.
data("women")
n <- 7
minv <- min(women$weight)
maxv <- max(women$weight)
breaks <- c(minv, minv + cumsum(rep.int((maxv - minv) / n, n-1)), maxv)
hist(women$weight, breaks = breaks)
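The same equal-width breakpoints can be built more compactly with seq():
n <- 7
breaks <- seq(min(women$weight), max(women$weight), length.out = n + 1)
hist(women$weight, breaks = breaks)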

Boxplot main rectangles delimiter which percentage of data points?

I used the command:
boxplot(V15~Class,data=trainData, main="V15 value depending on Class", xlab="Class", ylab="V15")
I would like to understand what percentage of the data points falls inside the rectangle(s). That is: if I take all the samples inside the main rectangle, what percentage of the total sample count will that be?
I found the documentation, but cannot figure out the answer from it.
The help text for boxplot, which you refer to, suggests that you "See Also" boxplot.stats, "which does the computation". From the "Details" section there:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n.
Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4),
the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.
So yes, basically the middle 50% of the values fall inside the box, but the details of the calculation depend on the nature of the data.
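A quick empirical check (a sketch with simulated data, since trainData from the question isn't available):
set.seed(42)
v <- rnorm(10000)
h <- boxplot.stats(v)$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
mean(v >= h[2] & v <= h[4])   # fraction of points inside the box
# ~0.5, i.e. roughly 50% of the samples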
