I have found a lot of information on this online, but nothing that exactly answers my question. My issue is not with the presentation of the numbers, but with the calculations and storage underneath the presentation.
The issue is with floating points in R. I wish to truncate them; however, I want to make sure I am storing them correctly after they are truncated.
The problem: I have a dataset where I am comparing differences between numbers against a threshold of my choosing, exact to 2 decimal places (e.g. 0.00, 0.05, 1.00). When I test whether the difference is exactly zero, I want to be sure the comparison uses exactly the correct difference and that there is no storage issue going on underneath that I am unaware of.
So far, I have tried:
(1) round (and testing against 0, and very small values like 1e-10)
(2) multiplying by 100 and as.integer
These calculations come up with different answers when I calculate the percentage of observations that have a difference greater than my chosen threshold in my dataset.
In short, it would be great to know how to best store the number to get the most accurate result when calculating whether or not the difference is actually 0.
Note: This needs to work for large datasets.
Example:
library(data.table)

dt <- data.table(d = c(0.00, 988.36, 0.00, 2031.46, 0.00),
                 c = c(0.00, 30.00, 0.00, 2031.46, 0.00),
                 n = c("a", "b", "a", "a", "b"))
dt[, diff := d - c]
dt[, abs_diff := abs(diff)]
dt[, pct_diff := mean(abs_diff == 0, na.rm = TRUE), by = "n"]
The last step is the problem, as I continuously get different numbers for pct_diff based on the threshold. (For example, mean(abs_diff <= 1e-10) and mean(abs_diff <= 1e-15) give me different answers).
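For illustration (a small example, not my actual data), exact equality on doubles can fail even when the decimal arithmetic says the difference should be zero:
x <- 0.1 + 0.2
x == 0.3               # FALSE
x - 0.3                # about 5.6e-17
abs(x - 0.3) < 1e-10   # TRUE when compared against a small tolerance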
Rounded numbers are stored as numeric, i.e., floating point numbers:
class(round(1.1))
#[1] "numeric"
class(floor(1.1))
#[1] "numeric"
It seems like you are looking for packages that support arbitrary-precision numbers, such as the Rmpfr package.
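For example, a minimal sketch with Rmpfr (this assumes the package is installed; note that it is still binary floating point, just with a much smaller rounding error):
library(Rmpfr)
x <- mpfr("988.36", precBits = 120)  # parse from the decimal string
y <- mpfr("30.00",  precBits = 120)
x - y                                # 958.36... carried to roughly 36 significant digits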
I would like to build a simple probability exercise such that the solution is a one-decimal number strictly between zero and one. I would like to use the function num_to_schoice, but if I write:
num_to_schoice(0.3,digits=1,range=c(0.1,0.9))
I get the error message:
NULL
Warning message:
In num_to_schoice(0.3, digits = 1, range = c(0.1, 0.9)) :
specified 'range' is too small for 'delta'
Could someone please explain how the function num_to_schoice should be properly used?
Let me add a couple of points to the existing answer by @Edward (+1):
If you generate a solution from the sequence 0.1, 0.2, ..., 0.9 and want four of the remaining eight numbers as distractors, I would recommend not using num_to_schoice(). Only if moving to a correct solution in 0.10, 0.11, 0.12, ..., 0.9, say, would I use num_to_schoice().
Without num_to_schoice() for one digit
You can set up an answerlist with all nine numbers from the sequence, sort the correct solution into the first position, and then use the exshuffle meta-information tag to do the actual sampling.
For example, in the data-generation you need something like this:
sol <- 0.3
ans <- c(sol, setdiff(1:9/10, sol))
ans <- paste0("$", ans, "$")
In the question you can then include
answerlist(ans, markup = "markdown")
## Answerlist
## ----------
## * $0.3$
## * $0.1$
## * $0.2$
## * $0.4$
## * $0.5$
## * $0.6$
## * $0.7$
## * $0.8$
## * $0.9$
Finally, the meta-information needs:
exsolution: 100000000
exshuffle: 5
This will then use the correct solution and four of the eight false answers - all in shuffled order. (Note that the above uses .Rmd syntax, for .Rnw this needs to be adapted accordingly.)
With num_to_schoice() for two digits
For the one-digit scenario, num_to_schoice() tries to do too many things, but for more than one digit it can be useful. Specifically, num_to_schoice() assures that the rank of the correct solution is non-informative, i.e., the correct solution could be the smallest, second-smallest, ..., largest number in the displayed sequence with equal probability. This is important if the distribution of the correct solution is not uniform across the possible range, and it is the reason why the following code sometimes fails:
num_to_schoice(0.3, digits = 1, delta = 0.1, range = c(0.1, 0.9))
Internally, the function first decides how many of the four wrong answers should be to the left of the correct solution 0.3. Clearly, there is room for at most two wrong answers to the left, which may result in a warning and a NULL result if more are requested. Moving to two digits can resolve this, e.g.:
num_to_schoice(0.31, range = c(0.01, 0.99),
               digits = 2, delta = 0.03, method = "delta")
Remarks:
Personally, I would only do this if the correct solution can potentially also have two digits. Otherwise students might pick up this pattern.
You need to assure that there is at least 4 * delta of room to the left and to the right of the correct solution, so that there is enough space for the wrong answers (see the quick check after these remarks).
Using delta = 0.01 would certainly be possible, but if you want larger deltas then delta = 0.03 or delta = 0.07 are also often useful choices. Sampling from an equidistant grid with such a delta is typically not noticeable for most students, whereas deltas like 0.05, 0.1, or 0.02 are typically picked up quickly.
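As a quick check of the second remark for the call above (illustration only):
sol <- 0.31; rng <- c(0.01, 0.99); delta <- 0.03
c(left = sol - rng[1], right = rng[2] - sol) >= 4 * delta
# left right
# TRUE  TRUE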
Because your range is (0.1, 0.9), you have to specify a smaller delta than the default (1). The function generates four wrong answers (five answers in total), and each has to be within the range you give AND far enough away from the other answers, by an amount equal to delta. You should also use the "delta" method, since the package authors give the following advice:
Two methods can be used to generate the wrong solutions: Either simply
runif or otherwise a full equi-distant grid for the range with step
size delta is set up from which a discrete uniform sample is drawn.
The former is preferred if the range is large enough while the latter
performs better if the range is small (as compared to delta).
So you can try the following:
num_to_schoice(0.3, digits=1, range=c(0.1, 0.9), delta=0.05, method="delta")
#$solutions
#[1] FALSE FALSE FALSE TRUE FALSE
#$questions
#[1] "$0.6$" "$0.5$" "$0.3$" "$0.4$" "$0.8$"
Note that this function incorporates randomness, so you may need to try a few times before a valid solution appears. Just keep ignoring the errors.
Edit:
I did try this a few times and every now and then I got a warning about the specified range being too small, with a NULL result returned. Other times the function hung and I had to abort. The help page also has this tidbit:
Exercise templates using num_to_schoice should be thoroughly tested in
order to avoid problems with too small ranges or almost identical
correct and wrong answers! This can potentially cause problems,
infinite loops, etc.
Inspection of the num_to_schoice function revealed that there is a while loop near the end which may get stuck in the aforementioned "infinite loop". To cut a long story short, it looks like you need to increase the digits to at least 2, otherwise there's a chance that this loop will never end. I hope it's ok to have 2 digits in the answers.
num_to_schoice(0.3, digits=2, range=c(0.1, 0.9), delta=0.01, method="delta")
$solutions
[1] FALSE FALSE FALSE TRUE FALSE
$questions
[1] "$0.23$" "$0.42$" "$0.22$" "$0.30$" "$0.54$"
I tried this 10,000 times and it always returned a non-null result.
res <- NULL
for (i in 1:10000) {
  res[[i]] <- num_to_schoice(0.3, digits=2, range=c(0.1, 0.9), delta=0.01, method="delta")
}
sum(sapply(res, function(x) any(is.null(x))))
# [1] 0
Hope that works now.
I've already read this question with an approach to counting entries in R:
how to realize countifs function (excel) in R
I'm looking for a similar approach, except that I want to count data that is within a given range.
For example, let's say I have this dataset:
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
Following the approach on the linked question, we would develop something like this:
count <- data$values == 1.5
sum(count)
Problem is, I want to be able to include in the count anything that is within 0.2 of 1.5 - that is, all values from 1.3 to 1.7.
Is there a way to do so?
sum(data$values>=1.3 & data$values<=1.7)
As the explanation in the question you linked to points out, when you write out a boolean condition, it generates a vector of TRUEs and FALSEs the same length as your original column. TRUE equals 1 and FALSE equals 0, so summing across it gives you a count. It therefore simply becomes a matter of expressing your condition as a boolean phrase. For more than one condition, you connect them with & (and) or | (or), much the same way you would in Excel (except that in Excel you have to use AND() or OR()).
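For example, with the data above the comparison produces a logical vector, and its sum is the count:
data$values >= 1.3 & data$values <= 1.7
#[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE
sum(data$values >= 1.3 & data$values <= 1.7)
#[1] 3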
(For a more general solution, you can use dplyr::between - it's also supposed to be faster since it's implemented in C++. In this case, it would be sum(between(data$values, 1.3, 1.7)).)
Like @doviod writes, you can use a compound logical condition.
My approach is different: I wrote a function that takes the vector, a center value, and a distance delta defining the range.
Following a suggestion by @doviod in the comments, I have set a default of delta = 0, so that if only a value is passed, the function returns a count of the cases that are exactly equal to it.
countif <- function(x, value, delta = 0)
    sum(value - delta <= x & x <= value + delta)
data <- data.frame( values = c(1,1.2,1.5,1.7,1.7,2))
countif(data$values, 1.5, 0.2)
#[1] 3
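And with the default delta = 0 it reduces to an exact-equality count:
countif(data$values, 1.7)
#[1] 2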
Alternatively, which() identifies the positions of all values in your vector that satisfy your criterion, and length() then counts the 'hits':
length( which(data$values>=1.3 & data$values<=1.7) )
[1] 3
I am trying to identify local maxima in a data set and am using the following code that I found from LyzandeR:
library(data.table)
maximums <- function(x) which(x - shift(x, 1) > 0 & x - shift(x, 1, type='lead') > 0)
maximums(data$Y)
Doing so gives me the correct maxima for non-noisy data, but for noisy data it identifies small fluctuations as independent peaks (the code is working as written, just not for my purposes). I've been unable to determine how to extend the range for the maximum, i.e. for a point to be considered a maximum, it needs to be greater than, say, 5 adjacent values. Is that possible given this code?
Thank you!
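One way to widen the window, building on the same shift()-based idea, is sketched below; maximums_k and the k argument are illustrative names, and the exact neighbourhood rule may need tuning for your data:
library(data.table)

# Sketch: flag x[i] as a local maximum only if it exceeds all of its
# k neighbours on each side.
maximums_k <- function(x, k = 5) {
  lags  <- shift(x, 1:k, type = "lag")   # the k values before each point
  leads <- shift(x, 1:k, type = "lead")  # the k values after each point
  nb <- do.call(cbind, c(lags, leads))   # one row of neighbours per point
  # NAs near the ends are dropped, so endpoints are judged only on the
  # neighbours they actually have
  which(x > apply(nb, 1, max, na.rm = TRUE))
}

# maximums_k(data$Y, k = 5)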
When I create a dataframe from numeric vectors, R seems to truncate the value below the precision that I require in my analysis:
data.frame(x=0.99999996)
returns 1 (*but see update 1)
I am stuck when fitting spline(x,y) and two of the x values are set to 1 due to rounding while y changes. I could hack around this but I would prefer to use a standard solution if available.
Example
Here is an example data set
d <- data.frame(x = c(0.668732936336141, 0.95351462456867,
0.994620622127435, 0.999602102672081, 0.999987126195509, 0.999999955814133,
0.999999999999966), y = c(38.3026509783688, 11.5895099585560,
10.0443344234229, 9.86152339768516, 9.84461434575695, 9.81648333804257,
9.83306725758297))
The following solution works, but I would prefer something that is less subjective:
plot(d$x, d$y, ylim=c(0,50))
lines(spline(d$x, d$y),col='grey') #bad fit
lines(spline(d[-c(4:6),]$x, d[-c(4:6),]$y),col='red') #reasonable fit
Update 1
*Since posting this question, I realize that printing will still show 1 even though the data frame contains the original value, e.g.
> dput(data.frame(x=0.99999999996))
returns
structure(list(x = 0.99999999996), .Names = "x", row.names = c(NA,
-1L), class = "data.frame")
Update 2
After using dput to post this example data set, and some pointers from Dirk, I can see that the problem is not in the truncation of the x values but the limits of the numerical errors in the model that I have used to calculate y. This justifies dropping a few of the equivalent data points (as in the example red line).
If you really want to set up R to print its results with utterly unreasonable precision, then use options(digits=16).
Note that this does nothing for the accuracy of functions using these results. It merely changes how values appear when they are printed to the console. There is no rounding of the values as they are stored or accessed, unless you put in more significant digits than the abscissa can handle. The 'digits' option has no effect on the maximal precision of floating point numbers.
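A small illustration of that distinction, using the value from the question:
x <- 0.99999996
x                    # [1] 1   (default is 7 significant digits)
options(digits = 16)
x                    # [1] 0.99999996
x == 0.99999996      # [1] TRUE -- the stored value was never rounded
options(digits = 7)  # restore the default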
Please re-read R FAQ 7.31 and the reference cited therein -- a really famous paper on what everybody should know about floating-point representation on computers.
The closing quote from Kernighan and Plauger is also wonderful:
10.0 times 0.1 is hardly ever 1.0.
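You can see the spirit of that quote directly in R:
x <- 0
for (i in 1:10) x <- x + 0.1  # add 0.1 ten times
x == 1                        # FALSE: the accumulated value is not exactly 1
1 - x                         # a tiny residual on the order of 1e-16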
And besides the numerical precision issue, there is of course also how R prints with fewer decimals than it uses internally:
> for (d in 4:8) print(0.99999996, digits=d)
[1] 1
[1] 1
[1] 1
[1] 1
[1] 0.99999996
>
I have the following claim counts data (triangular) by limits:
claims=matrix(c(2019,690,712,NA,773,574,NA,NA,232),nrow=3, byrow=T)
What would be the most elegant way to do the following simple things resembling Excel's sumif():
put the matrix into as.data.frame() with column names: "100k", "250k", "500k"
sum all numbers except the first row (in this case summing 773, 574, and 232). I am looking for a neat reference so I can easily generalize the notation to larger claim triangles.
Sum all numbers, ignoring the NA's. sum(claims, na.rm = T) - Thanks for Gregor's suggestion.
I played around with the package ChainLadder a bit and enjoyed how it handles triangular data, especially in plotting and calculating link ratios. I wonder more generally if basic R suffices in doing some quick and dirty sumif() or pairwise link ratio kind of calculations? This would be a bonus for me if anyone out there could dispense some words of wisdom.
Thank you!
claims=matrix(c(2019,690,712,NA,773,574,NA,NA,232),nrow=3, byrow=T)
claims.df = as.data.frame(claims)
names(claims.df) <- c("100k", "250k", "500k")
# This isn't the best idea because standard column names don't start with numbers
# If you go non-standard, you'll have to always quote them, that is
claims.df$100k # doesn't work
claims.df$`100k` # works
# sum everything
sum(claims, na.rm = T)
# sum everything except for first row
sum(claims[-1, ], na.rm = T)
It's much easier to give specific advice for specific questions than general advice. As to "I wonder more generally if basic R suffices in doing some quick and dirty sumif() or pairwise link ratio kind of calculations?", at least for the sumif part, I'm reminded of fortunes::fortune(286):
...this is kind of like asking "will your Land Rover make it up my driveway?", but I'll assume the question was asked in all seriousness.
sum adds up whatever numbers you give it. Subsetting based on logicals is so simple that there is no need for a separate sumif function. Say you have x = rnorm(100), y = runif(100).
# sum x if x > 0
sum(x[x > 0])
# sum x if y < 0.5
sum(x[y < 0.5])
# sum x if x > 0 and y < 0.5
sum(x[x > 0 & y < 0.5])
# sum every other x
sum(x[c(T, F)])
# sum all but the first 10 and last 10 x
sum(x[-c(1:10, 91:100)])
I don't know what a pairwise link ratio is, but I'm willing to bet base R can handle it easily.
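For what it's worth, here is a hedged sketch of one possible "pairwise link ratio" calculation in base R, under the assumption that it means the ratio of sums of adjacent columns over the rows where both are observed; the function name link_ratios is mine, and the definition should be adapted to your triangle's actual layout:
# Sketch only: ratio of adjacent column sums, using rows where both
# columns are observed.
link_ratios <- function(m) {
  sapply(seq_len(ncol(m) - 1), function(j) {
    ok <- complete.cases(m[, c(j, j + 1)])
    sum(m[ok, j + 1]) / sum(m[ok, j])
  })
}
link_ratios(claims)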