Given two vectors of numbers, I wish to test whether each pair of numbers is equal at the precision of the less precise number in the pair.
This problem originates from validating the reproduction of presented numbers. I have been given a set of (rounded) numbers; an attempt to replicate them has produced more precise numbers. I need to report whether the less precise numbers are rounded versions of the more precise numbers.
For example, every pair from the following two vectors should return TRUE
input_a = c(0.01, 2.2, 3.33, 44.4, 560, 700) # less precise values provided
input_b = c(0.011, 2.22, 3.333, 44.4000004, 555, 660) # more precise replication
because when rounded to the lowest pair-wise precision the two vectors are equal:
pair_wise_precision = c(2, 1, 2, 1, -1, -2)
input_a_rounded = rep(NA, 6)
input_b_rounded = rep(NA, 6)
for(ii in 1:6){
input_a_rounded[ii] = round(input_a[ii], pair_wise_precision[ii])
input_b_rounded[ii] = round(input_b[ii], pair_wise_precision[ii])
}
all(input_a_rounded == input_b_rounded)
# TRUE
# ignoring machine precision
However, I need to do this without knowing the pair-wise precision.
Two approaches I have identified:
Test a range of rounding levels and accept the two values as equal if any level of rounding returns a match (a rough sketch of this idea follows below).
Pre-calculate the precision of each input.
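A rough sketch of the first idea (the matches_at_some_precision helper and the -5:5 range of digits are mine, not from the question):
matches_at_some_precision <- function(x, y, digits = -5:5) {
  # accept the pair if rounding both to any one of these precisions makes them equal
  any(sapply(digits, function(d) round(x, d) == round(y, d)))
}
mapply(matches_at_some_precision, input_a, input_b)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE
(Note that this accepts a match at any rounding level, however coarse.)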
However, both of these approaches feel cumbersome. I have seen in another language the option to round one number to match the precision of another number (sorry, I can't recall which language), but I cannot find this functionality in R.
(This is not a problem about floating point numbers or inaccuracy due to machine precision. I am comfortable handling these separately.)
Edit in response to comments:
We can assume trailing zeros are not significant figures. So, 1200 is considered rounded to the nearest 100, 530 to the nearest 10, and 0.076 to the nearest thousandth.
We stop at the precision of the least precise value. So, if comparing 12300 and 12340 the least precise value is rounded to the nearest 100, hence we compare round(12300, -2) and round(12340, -2). If comparing 530 and 570, then the least precise value is rounded to the nearest 10, hence we compare round(530, -1) and round(570, -1).
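For example, checking those two cases directly:
round(12300, -2) == round(12340, -2)
# [1] TRUE
round(530, -1) == round(570, -1)
# [1] FALSE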
You could divide each value by its power of 10 to get the mantissa, remove trailing zeroes, and take the pmin of the nchar of the results, subtracting 2 for the leading digit and the decimal point. This gives the precision vector p with which you round the mantissas of a and b; multiplying the powers of 10 back in, you check whether the results are identical.
f <- \(a, b) {
  # powers of 10 of each value
  ae <- 10^floor(log10(a))
  be <- 10^floor(log10(b))
  # mantissas, i.e. the values scaled into [1, 10)
  al <- a/ae
  bl <- b/be
  # pair-wise precision: digits after the leading digit, once trailing zeros are stripped
  p <- pmin(nchar(gsub('0+$', '', format(al))), nchar(gsub('0+$', '', format(bl)))) - 2L
  # round both mantissas to that precision, rescale and compare
  identical(mapply(round, al, p)*ae, mapply(round, bl, p)*be)
}
f(a, b)
# [1] TRUE
Data:
a <- c(0.01, 2.2, 3.33, 44.4, 555, 700)
b <- c(0.011, 2.22, 3.333, 44.4000004, 560, 660)
My initial thinking followed @jay.sf's approach of analysing the values as numerics. However, treating the values as character strings provides another way to determine the rounding:
was_rounded_to = function(x){
  x = as.character(x)
  # position of the decimal point (-1 when there is none)
  location_of_dot = as.numeric(regexpr("\\.", x))
  # reference point: the decimal point, or the last character for whole numbers
  ref_point = ifelse(location_of_dot < 0, nchar(x), location_of_dot)
  # position of the last non-zero digit
  last_non_zero = sapply(gregexpr("[1-9]", x), max)
  # digits of precision relative to the reference point (negative for trailing zeros)
  return(last_non_zero - ref_point)
}
# slight expansion in test cases
a <- c(0.01, 2.2, 3.33, 44.4, 555, 700, 530, 1110, 440, 3330)
b <- c(0.011, 2.22, 3.333, 44.4000004, 560, 660, 570, 1120, 4400, 3300)
rounding = pmin(was_rounded_to(a), was_rounded_to(b))
mapply(round, a, digits = rounding) == mapply(round, b, digits = rounding)
Special case: If the numbers only differ by rounding, then it is easier to determine the magnitude by examining the difference:
a <- c(0.01, 2.2, 3.33, 44.4, 555, 700)
b <- c(0.011, 2.22, 3.333, 44.4000004, 560, 660)
abs_diff = abs(a-b)
mag = -floor(log10(abs_diff) + 1e-15)
mapply(round, a, digits = mag - 1) == mapply(round, b, digits = mag - 1)
However, this fails when the numbers differ by more than rounding. For example, a = 530 and b = 540 will both be rounded to 500 and incorrectly reported as equal.
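A quick check of that failure mode (the a2 and b2 names are mine):
a2 <- 530; b2 <- 540
mag2 <- -floor(log10(abs(a2 - b2)) + 1e-15)  # the difference is 10, so mag2 is -1
round(a2, mag2 - 1) == round(b2, mag2 - 1)   # compares round(530, -2) with round(540, -2)
# [1] TRUE  (even though 540 is not a rounded version of 530)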
Related
I have a min and a max value for each football period (e.g. 45 and 94 for the second half).
I would like to create two (if possible) equal-sized intervals, such as (45, 69) and (70, 94), where the endpoints are rounded to the nearest integer if they are not whole numbers.
I've tried using cut() to no avail, and also seq(), but I can't figure out how to do it with either.
frame = c(45, 94)
p2.timeslots = cut(frame, 2)
p2.ts = seq(from = frame[1], to = frame[2], by = (frame[2]-frame[1])/2)
# Output
> p2.timeslots
[1] (45,69.5] (69.5,94]
Levels: (45,69.5] (69.5,94]
> p2.ts
[1] 45.0 69.5 94.0
Neither did the length.out argument for seq() solve it for me.
Any idea how I can do this easily in R?
The way cut works is that the bins are contiguous, where the left side of each bin is typically "open" (denoted by "(") and the right side "closed" (denoted by "]"). If you assume integers and want both ends to be closed, then you need to manually control both the breaks= and the labels=, perhaps:
p2.seq <- seq(frame[1], frame[2], length.out = 3)
p2.seq
# [1] 45.0 69.5 94.0
p2.labels <- sprintf("[%i,%i]", c(p2.seq[1], round(p2.seq[2] + 0.9)), c(round(p2.seq[2] - 0.1), p2.seq[3]))
p2.labels
# [1] "[45,69]" "[70,94]"
cut(frame, breaks = p2.seq + c(-0.1, 0, 0.1), labels = p2.labels)
# [1] [45,69] [70,94]
# Levels: [45,69] [70,94]
The effect of + c(-0.1, 0, 0.1) can also be achieved by using breaks = p2.seq, include.lowest = TRUE, whichever you prefer.
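For example, a quick check of that alternative (reusing p2.seq and p2.labels from above):
cut(frame, breaks = p2.seq, labels = p2.labels, include.lowest = TRUE)
# [1] [45,69] [70,94]
# Levels: [45,69] [70,94]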
I don't understand the following behavior of quantile. With type=2 it should average at discontinuities, but this doesn't always seem to happen. If I create a vector of 100 numbers and look at the percentiles, shouldn't I get the average at every percentile? This happens for some percentiles, but not for all (e.g. the 7th percentile).
quantile(seq(1, 100, 1), 0.05, type=2)
# 5%
# 5.5
quantile(seq(1, 100, 1), 0.06, type=2)
# 6%
# 6.5
quantile(seq(1, 100, 1), 0.07, type=2)
# 7%
# 8
quantile(seq(1, 100, 1), 0.08, type=2)
# 8%
# 8.5
Is this related to floating point issues?
100*0.06 == 6
#TRUE
100*0.07 == 7
#FALSE
sprintf("%.20f", 100*0.07)
#"7.00000000000000088818"
As far as I can tell, it is related to floating point: 0.07 is not exactly representable as a double.
p <- seq(0, 0.1, by = 0.001)
q <- quantile(seq(1, 100, 1), p, type=2)
plot(p, q, type = "b")
abline(v = 0.07, col = "grey")
If you think of the (type 2) quantile as a function of p, you never evaluate the function at exactly 0.07, hence your results. Try e.g. decreasing the by step in the code above. In that sense, the function returns exactly what is expected. In practice, with continuous data, I cannot imagine it would be of any consequence (but that is a poor argument, I know).
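To make the jump explicit, here is a minimal sketch of the type-2 rule from ?quantile (the q_type2 name and the simplification are mine; it assumes a sorted x and 1/n <= p < 1):
q_type2 <- function(x, p) {
  n <- length(x)
  j <- floor(n * p)  # index of the left order statistic
  g <- n * p - j     # fractional part; averaging happens only when g is exactly 0
  if (g == 0) (x[j] + x[j + 1]) / 2 else x[j + 1]
}
q_type2(1:100, 0.06)  # 6.5, because 100 * 0.06 is exactly 6
q_type2(1:100, 0.07)  # 8, because 100 * 0.07 is 7 plus about 8.9e-16, so g > 0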
I would like to decline (i.e. multiply) a value (first.value) against a vector of percentage decline values (decline.vector): the first value is declined by the first percentage, that output is then declined by the second percentage, and so on. I assume there is a more elegant way to do this in R than writing a for loop that reassigns the new value and uses cbind to build the vector, but I remain a novice.
The decline vectors are not sequences like the one below; this is just an example.
That said, is it possible to use seq() where by= is a vector? I did not find anything in ?seq suggesting it is possible.
Given:
first.value <- 100
decline.vector <- c(0.85, 0.9, 0.925, 0.95, 0.975)
Desired output:
[100] 85, 76.5, 70.763, 67.224, 65.544
You can do this with the Reduce function in base R
first.value <- 100
decline.vector <- c(0.85, 0.9, 0.925, 0.95, 0.975)
Reduce(`*`, decline.vector, first.value, accumulate = TRUE)
# [1] 100.00000 85.00000 76.50000 70.76250 67.22437 65.54377
You could also use cumprod
first.value * cumprod(c(1, decline.vector))
# [1] 100.00000 85.00000 76.50000 70.76250 67.22438 65.54377
If you don't want first.value to be the first element of the output, then do
first.value * cumprod(decline.vector)
# [1] 85.00000 76.50000 70.76250 67.22438 65.54377
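For reference, the explicit loop the question wanted to avoid would look roughly like this (my own sketch); it performs the same multiplications as the Reduce version:
out <- first.value
for (d in decline.vector) {
  out <- c(out, tail(out, 1) * d)
}
identical(out, Reduce(`*`, decline.vector, first.value, accumulate = TRUE))
# [1] TRUE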
I am confused by the return value of the function get.basis(). For example,
library(lpSolveAPI)
lprec <- make.lp(0, 4)
set.objfn(lprec, c(1, 3, 6.24, 0.1))
add.constraint(lprec, c(0, 78.26, 0, 2.9), ">=", 92.3)
add.constraint(lprec, c(0.24, 0, 11.31, 0), "<=", 14.8)
add.constraint(lprec, c(12.68, 0, 0.08, 0.9), ">=", 4)
set.bounds(lprec, lower = c(28.6, 18), columns = c(1, 4))
set.bounds(lprec, upper = 48.98, columns = 4)
RowNames <- c("THISROW", "THATROW", "LASTROW")
ColNames <- c("COLONE", "COLTWO", "COLTHREE", "COLFOUR")
dimnames(lprec) <- list(RowNames, ColNames)
solve(lprec)
Then the basic variables are
> get.basis(lprec)
[1] -7 -2 -3
However, the solution is
> get.variables(lprec)
[1] 28.60000 0.00000 0.00000 31.82759
From the solution, it seems variables 1 and 4 are basic. So where does the vector (-7, -2, -3) come from?
I am guessing it has to do with there being 3 constraints and 4 decision variables.
After reviewing the simplex method for bounded variables, I finally understood how this happens. These two links were helpful: Example; Video
Coming back to this problem: lpSolveAPI (the R interface for lp_solve) rewrites the constraints in an augmented form after adding appropriate slack variables, with the slack variables in the first three columns and the decision variables in the remaining four. Hence the return of get.basis(), -7, -2, -3, refers to columns 7, 2 and 3, which represent decision variable 4 and slack variables 2 and 3.
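A small sketch to translate those indices, assuming the layout just described (columns 1-3 are the slacks of the three constraints, columns 4-7 the decision variables; the basis_cols name is mine):
basis_cols <- abs(get.basis(lprec))  # 7 2 3
ifelse(basis_cols > 3,
       paste0("decision variable ", basis_cols - 3),
       paste0("slack variable ", basis_cols))
# [1] "decision variable 4" "slack variable 2"    "slack variable 3"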
For this kind of LP with bounded variables, a variable can be nonbasic at either its lower bound or its upper bound. The nonbasic variables reported by get.basis(lprec, nonbasic = TRUE) are -1, -4, -5, -6; the minus sign means these variables are at their lower bounds. That is, slack variable 1 = 0, and columns 4, 5 and 6 (decision variables 1, 2 and 3) sit at their lower bounds of 28.6, 0 and 0 respectively.
Thus, the optimal solution is 28.6 (nonbasic), 0 (nonbasic), 0 (nonbasic), 31.82759 (basic).
It's my understanding that when calculating quantiles in R, the entire dataset is scanned and the value for each quantile is determined.
If you ask for the 0.8 quantile, for example, it will give you the value that would occur at that quantile; even if no such value exists in the data, R will produce one through linear interpolation.
However, what if one wishes to calculate quantiles and then proceed to round up/down to the nearest actual value?
For example, if the quantile at .80 gives a value of 53, when the real dataset only has a 50 and a 54, then how could one get R to list either of these values?
Try this:
#dummy data
x <- c(1,1,1,1,10,20,30,30,40,50,55,70,80)
#get quantile at 0.8
q <- quantile(x, 0.8)
q
# 80%
# 53
#closest match - "round up"
min(x[ x >= q ])
#[1] 55
#closest match - "round down"
max(x[ x <= q ])
#[1] 50
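If instead you want whichever actual value is nearer, rather than always rounding up or down, one option (my addition, not from the answer above) is:
#closest match - nearest in either direction
x[which.min(abs(x - q))]
#[1] 55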
There are many estimation methods implemented in R's quantile function. You can choose which type to use with the type argument as documented in https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html.
x <- c(1, 1, 1, 1, 10, 20, 30, 30, 40, 50, 55, 70, 80)
quantile(x, c(.8)) # default, type = 7
# 80%
# 53
quantile(x, c(.8), FALSE, TRUE, 7) # equivalent to the previous invocation
# 80%
# 53
quantile(x, c(.8), FALSE, TRUE, 3) # type = 3, nearest sample
# 80%
# 50
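For completeness, type = 1 (the inverse of the empirical CDF, without averaging) also returns an actual data value; here it picks the sample just above the interpolated quantile (my addition):
quantile(x, c(.8), FALSE, TRUE, 1) # type = 1, inverse of the empirical CDF
# 80%
# 55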