How to use the function num_to_schoice()? - r

I would like to build a simple probability exercise whose solution is a single one-decimal number strictly between zero and one (i.e., different from zero and one). I would like to use the function num_to_schoice, but if I write:
num_to_schoice(0.3,digits=1,range=c(0.1,0.9))
I get the following output and warning message:
NULL
Warning message:
In num_to_schoice(0.3, digits = 1, range = c(0.1, 0.9)) :
specified 'range' is too small for 'delta'
Could someone please explain how the function num_to_schoice should be properly used?

Let me add a couple of points to the existing answer by @Edward (+1):
If you generate the solution from the sequence 0.1, 0.2, ..., 0.9 and want four of the remaining eight numbers as distractors, I would recommend not using num_to_schoice(). Only when moving to correct solutions from 0.10, 0.11, 0.12, ..., 0.90, say, would I use num_to_schoice().
Without num_to_schoice() for one digit
You can set up an answerlist with all nine numbers from the sequence, sorting the correct solution into the first position, and then using the exshuffle meta-information tag to do the actual sampling.
For example, in the data-generation you need something like this:
sol <- 0.3
ans <- c(sol, setdiff(1:9/10, sol))
ans <- paste0("$", ans, "$")
In the question you can then include
answerlist(ans, markup = "markdown")
## Answerlist
## ----------
## * $0.3$
## * $0.1$
## * $0.2$
## * $0.4$
## * $0.5$
## * $0.6$
## * $0.7$
## * $0.8$
## * $0.9$
Finally, the meta-information needs:
exsolution: 100000000
exshuffle: 5
This will then use the correct solution and four of the eight false answers - all in shuffled order. (Note that the above uses .Rmd syntax, for .Rnw this needs to be adapted accordingly.)
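For orientation, here is a minimal sketch of how these pieces could fit together in a single .Rmd exercise file for the exams package (the question wording and chunk names are placeholders, and the solution is drawn at random here instead of being fixed at 0.3):
```{r data generation, echo = FALSE, results = "hide"}
sol <- sample(1:9/10, 1)
ans <- c(sol, setdiff(1:9/10, sol))
ans <- paste0("$", ans, "$")
```

Question
========
What is the probability of the event described above?

```{r answerlist, echo = FALSE, results = "asis"}
answerlist(ans, markup = "markdown")
```

Meta-information
================
extype: schoice
exsolution: 100000000
exshuffle: 5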
With num_to_schoice() for two digits
For the one-digit scenario, num_to_schoice() tries to do too many things, but for more than one digit it can be useful. Specifically, num_to_schoice() assures that the rank of the correct solution is non-informative, i.e., the correct solution could be the smallest, second-smallest, ..., or largest number in the displayed sequence, each with equal probability. This matters in particular if the distribution of the correct solution is not uniform across the possible range, and it is the reason why the following code sometimes fails:
num_to_schoice(0.3, digits = 1, delta = 0.1, range = c(0.1, 0.9))
Internally, this first decides how many of the four wrong answers should be to the left of the correct solution 0.3. Clearly, there is room for at most two wrong answers to the left, so the call fails with a warning and a NULL result whenever more are requested on that side. Moving to two digits can resolve this, e.g.:
num_to_schoice(0.31, range = c(0.01, 0.99),
digits = 2, delta = 0.03, method = "delta")
Remarks:
Personally, I would only do this if the correct solution can potentially also have two digits; otherwise students might pick up on this pattern.
You need to ensure that there is at least 4 * delta of room both to the left and to the right of the correct solution, so that there is enough space for the wrong answers (see the small check sketched after these remarks).
Using delta = 0.01 would certainly be possible, but if you want larger deltas then delta = 0.03 or delta = 0.07 are often useful choices, because sampling from an equidistant grid with such a delta is typically not noticeable for most students. In contrast, deltas like 0.05, 0.1, or 0.02 are typically picked up quickly.
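As a small illustration of the second remark, one could guard the call with a check like the following (a hypothetical helper, not part of the exams package):
safe_num_to_schoice <- function(x, range, delta, ...) {
  # require at least 4 * delta of room on each side of the correct solution
  stopifnot(x - range[1] >= 4 * delta,
            range[2] - x >= 4 * delta)
  exams::num_to_schoice(x, range = range, delta = delta, ...)
}
safe_num_to_schoice(0.31, range = c(0.01, 0.99), digits = 2,
                    delta = 0.03, method = "delta")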

Because your range is (0, 1), you have to specify a smaller delta than the default (1). The function generates four wrong answers (five answers in total), so each has to lie within the range you give AND be at least delta away from the other answers. You should also use the "delta" method, since the package authors give the following advice:
Two methods can be used to generate the wrong solutions: Either simply
runif or otherwise a full equi-distant grid for the range with step
size delta is set up from which a discrete uniform sample is drawn.
The former is preferred if the range is large enough while the latter
performs better if the range is small (as compared to delta).
So you can try the following:
num_to_schoice(0.3, digits=1, range=c(0.1, 0.9), delta=0.05, method="delta")
#$solutions
#[1] FALSE FALSE FALSE TRUE FALSE
#$questions
#[1] "$0.6$" "$0.5$" "$0.3$" "$0.4$" "$0.8$"
Note that this function incorporates randomness, so you may need to try a few times before a valid set of answers appears; just ignore the warnings and call it again (or automate the retries, as sketched below).
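One way to automate the retries (a sketch, not taken from the help page) is to loop until a non-NULL result comes back, with a cap on the number of attempts:
sc <- NULL
tries <- 0
while (is.null(sc) && tries < 100) {
  sc <- num_to_schoice(0.3, digits = 1, range = c(0.1, 0.9),
                       delta = 0.05, method = "delta")
  tries <- tries + 1
}
sc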
Edit:
I did try this a few times and every now and then I got a warning about the specified range being too small, with a NULL result returned. Other times the function didn't do anything and I had to abort. The help page also has this tidbit:
Exercise templates using num_to_schoice should be thoroughly tested in
order to avoid problems with too small ranges or almost identical
correct and wrong answers! This can potentially cause problems,
infinite loops, etc.
Inspection of the num_to_schoice function revealed that there is a while loop near the end which may get stuck in the aforementioned "infinite loop". To cut a long story short, it looks like you need to increase the digits to at least 2, otherwise there's a chance that this loop will never end. I hope it's ok to have 2 digits in the answers.
num_to_schoice(0.3, digits=2, range=c(0.1, 0.9), delta=0.01, method="delta")
$solutions
[1] FALSE FALSE FALSE TRUE FALSE
$questions
[1] "$0.23$" "$0.42$" "$0.22$" "$0.30$" "$0.54$"
I tried this 10,000 times and it always returned a non-null result.
res <- vector("list", 10000)
for (i in 1:10000) {
  # single-bracket assignment keeps a NULL return as a NULL list element
  # instead of silently dropping it
  res[i] <- list(num_to_schoice(0.3, digits = 2, range = c(0.1, 0.9),
                                delta = 0.01, method = "delta"))
}
sum(sapply(res, is.null))
# [1] 0
Hope that works now.

Related

R Precision for Double - Why does the code return a negative value when a positive outcome is expected?

I am testing two ways of calculating prod(b-a), where a and b are vectors of length n: prod(b-a) = (b1-a1)(b2-a2)(b3-a3)*...*(bn-an), where b_i > a_i > 0 for all i = 1, 2, ..., n. For some special cases, another way (Method 2) of calculating this prod(b-a) is more efficient. It uses a formula that expands the product into a signed sum over all subsets of the indices; in plain text (the original formula image is not reproduced here): prod(b-a) = sum over all subsets S of {1,...,n} of (-1)^|S| * prod_{i in S} a_i * prod_{j not in S} b_j.
Here is my question: when a_i is very close to b_i, the true outcome can be very, very close to 0, something like 10^(-16). Method 1 (subtract and multiply) always returns a positive output. Method 2, using the expansion formula, sometimes returns a negative output (about 7~8% of the time in my experiment). Mathematically, these two methods should return exactly the same output, but in floating-point arithmetic they apparently produce different results.
Here are my codes to run the test. When I run the testing code 10,000 times, about 7~8% of my runs for Method 2 return a negative output. According to the official documentation, the R double has the precision of "2.225074e-308", as indicated by the R parameter .Machine$double.xmin. Why is it getting into negative values when the differences are between 10^(-16) and 10^(-18)? Any help that sheds light on this will be appreciated. I would also love some suggestions on how to practically increase the precision to a higher level, as indicated in the R documentation.
########## Testing code 1.
ftest1case <- function(a, b) {
  n <- length(a)
  if (length(b) != n) stop("--------- length a and b are not right.")
  if (any(b < a)) stop("---------- b has to be greater than a all the time.")
  out1 <- prod(b - a)
  out2 <- 0
  N <- 2^n
  for (i in 1:N) {
    tidx <- rev(as.integer(intToBits(x = i - 1))[1:n])
    tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)
    out2 <- out2 + tsign * prod(b[tidx == 0]) * prod(a[tidx == 1])
  }
  c(out1, out2)
}
########## Testing code 2.
ftestManyCases <- function(N, printFreq = 1000, smallNum = 10^(-20))
{
  tt <- matrix(0, nrow = N, ncol = 2)
  n <- 12
  for (i in 1:N) {
    a <- runif(n, 0, 1)
    b <- a + runif(n, 0, 1) * 0.1
    tt[i, ] <- ftest1case(a = a, b = b)
    if ((i %% printFreq) == 0) cat("----- i = ", i, "\n")
    if (tt[i, 2] < smallNum) cat("------ i = ", i, " ---- Negative summation found.\n")
  }
  tout <- apply(tt, 2, FUN = function(x) { round(sum(x < smallNum) / N, 6) })
  names(tout) <- c("PerLess0_Method1", "PerLess0_Method2")
  list(summary = tout, data = tt)
}
######## Step 1. Test for 1 case.
n<-12
a<-runif(n,0,1)
b<-a+runif(n,0,1)*0.1
ftest1case(a=a,b=b)
######## Step 2 Test Code 2 for multiple cases.
N<-300
tt<-ftestManyCases(N=N,printFreq = 100)
tt[[1]]
It's hard for me to imagine when an algorithm that consists of generating 2^n permutations and adding them up is going to be more efficient than a straightforward product of differences, but I'll take your word for it that there are some special cases where it is.
As suggested in comments, the root of your problem is the accumulation of floating-point errors when adding values of different magnitudes; see here for an R-specific question about floating point and here for the generic explanation.
First, a simplified example:
n <- 12
set.seed(1001)
a <- runif(n, 0, 1)
b <- a + 0.01
prod(a-b) ## 1e-24
out2 <- 0
N <- 2^n
out2v <- numeric(N)
for (i in 1:N) {
  tidx <- rev(as.integer(intToBits(x = i - 1))[1:n])
  tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)
  j <- as.logical(tidx)
  out2v[i] <- tsign * prod(b[!j]) * prod(a[j])
}
sum(out2v) ## -2.011703e-21
Using extended precision (with 1000 bits of precision) to check that the simple/brute force calculation is more reliable:
library(Rmpfr)
a_m <- mpfr(a, 1000)
b_m <- mpfr(b, 1000)
prod(a_m-b_m)
## 1.00000000000000857647286522936696473705868726043995807429578968484409120647055193862325070279593735821154440625984047036486664599510856317884962563644275433171621778761377125514191564456600405460403870124263023336542598111475858881830547350667868450934867675523340703947491662460873009229537576817962228e-24
This proves the point in this case, but in general doing extended-precision arithmetic will probably kill any performance gains you would get.
Redoing the permutation-based calculation with mpfr values (using out2 <- mpfr(0, 1000), and going back to the out2 <- out2 + ... running summation rather than accumulating the values in a vector and calling sum()) gives an accurate answer (at least to the first 20 or so digits, I didn't check farther), but takes 6.5 seconds on my machine (instead of 0.03 seconds when using regular floating-point).
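For completeness, a sketch of what that mpfr-based rerun might look like (reusing a_m, b_m, n, and N from the chunks above; timings will of course differ by machine):
out2 <- mpfr(0, 1000)
for (i in 1:N) {
  tidx <- rev(as.integer(intToBits(x = i - 1))[1:n])
  tsign <- if ((sum(tidx) %% 2) == 0) 1.0 else -1.0
  j <- as.logical(tidx)
  out2 <- out2 + tsign * prod(b_m[!j]) * prod(a_m[j])
}
out2  # agrees with prod(a_m - b_m) shown above (n is even, so the sign matches)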
Why is this calculation problematic? First, note the difference between .Machine$double.xmin (approx 2e-308), which is the smallest floating-point value that the system can store, and .Machine$double.eps (approx 2e-16), which is the smallest value x such that 1+x > 1, i.e. the smallest relative amount that can be added to 1 without being lost entirely (values only a little bit bigger than this magnitude will experience severe, but not catastrophic, cancellation).
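To see the two constants side by side (a quick added check in the console):
.Machine$double.xmin          # ~2.2e-308: smallest positive (normalized) double
.Machine$double.eps           # ~2.2e-16: smallest x with 1 + x != 1
1 + 1e-16 == 1                # TRUE, the added term is lost entirely
1 + 1e-15 == 1                # FALSE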
Now look at the distribution of the individual terms accumulated in out2v:
hist(out2v)
There are clusters of negative and positive numbers of similar magnitude. If the summation first adds a bunch of values that almost cancel (so that the running total is very close to 0) and then adds another value that is not nearly zero, we get bad cancellation.
It's entirely possible that there's a way to rearrange this calculation so that bad cancellation doesn't happen, but I couldn't think of one easily.

Store a numeric variable with more decimals [duplicate]

When I create a dataframe from numeric vectors, R seems to truncate the value below the precision that I require in my analysis:
data.frame(x=0.99999996)
returns 1 (*but see update 1)
I am stuck when fitting spline(x,y) and two of the x values are set to 1 due to rounding while y changes. I could hack around this but I would prefer to use a standard solution if available.
example
Here is an example data set
d <- data.frame(x = c(0.668732936336141, 0.95351462456867,
0.994620622127435, 0.999602102672081, 0.999987126195509, 0.999999955814133,
0.999999999999966), y = c(38.3026509783688, 11.5895099585560,
10.0443344234229, 9.86152339768516, 9.84461434575695, 9.81648333804257,
9.83306725758297))
The following solution works, but I would prefer something that is less subjective:
plot(d$x, d$y, ylim=c(0,50))
lines(spline(d$x, d$y),col='grey') #bad fit
lines(spline(d[-c(4:6),]$x, d[-c(4:6),]$y),col='red') #reasonable fit
Update 1
*Since posting this question, I have realized that the value only prints as 1, while the data frame still contains the original value, e.g.
> dput(data.frame(x=0.99999999996))
returns
structure(list(x = 0.99999999996), .Names = "x", row.names = c(NA,
-1L), class = "data.frame")
Update 2
After using dput to post this example data set, and some pointers from Dirk, I can see that the problem is not in the truncation of the x values but the limits of the numerical errors in the model that I have used to calculate y. This justifies dropping a few of the equivalent data points (as in the example red line).
If you really want to set up R to print its results with utterly unreasonable precision, then use options(digits=16).
Note that this does nothing for the accuracy of functions using these results. It merely changes how values appear when they are printed to the console. There is no rounding of the values as they are being stored or accessed, unless you put in more significant digits than the mantissa can hold. The 'digits' option has no effect on the maximal precision of floating point numbers.
Please re-read R FAQ 7.31 and the reference cited therein -- a really famous paper on what everybody should know about floating-point representation on computers.
The closing quote from Kernighan and Plauger is also wonderful:
10.0 times 0.1 is hardly ever 1.0.
And besides the numerical precision issue, there is of course also how R prints with fewer decimals than it uses internally:
> for (d in 4:8) print(0.99999996, digits=d)
[1] 1
[1] 1
[1] 1
[1] 1
[1] 0.99999996
>

How are rounded numbers stored in R?

I have found a lot of information on this online, but I haven't been able to find anything that exactly answers my question. My issue does not have to do with the presentation of the numbers, but instead the calculations and storage underneath the presentation.
The issue is with floating points in R. I wish to truncate them; however, I want to make sure I am storing them correctly after they are truncated.
The problem is: I have a dataset where I am trying to compare the difference between different numbers to any threshold I would like (exact to 2 decimal places - i.e. 0.00, 0.05, and 1.00.). I want to make sure when I test the difference to exactly zero that it is testing exactly the correct difference and there is not a storage problem going on behind that I am unaware of.
So far, I have tried:
(1) round (and testing against 0, and very small values like 1e-10)
(2) multiplying by 100 and as.integer
These calculations come up with different answers when I calculate the percentage of observations that have a difference greater than my chosen threshold in my dataset.
In short, it would be great to know how to best store the number to get the most accurate result when calculating whether or not the difference is actually 0.
Note: This needs to work for large datasets.
Example:
library(data.table)

dt <- data.table(d = c(0.00, 988.36, 0.00, 2031.46, 0.00),
                 c = c(0.00, 30.00, 0.00, 2031.46, 0.00),
                 n = c("a", "b", "a", "a", "b"))
dt[, diff := d - c]
dt[, abs_diff := abs(diff)]
dt[, pct_diff := mean(abs_diff == 0, na.rm = TRUE), by = "n"]
The last step is the problem, as I continuously get different numbers for pct_diff based on the threshold. (For example, mean(abs_diff <= 1e-10) and mean(abs_diff <= 1e-15) give me different answers).
Rounded numbers are stored as numeric, i.e., floating point numbers:
class(round(1.1))
#[1] "numeric"
class(floor(1.1))
#[1] "numeric"
It seems like you are looking for packages that support arbitrary-precision numbers, such as package Rmpfr.
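As an aside (not part of the answer above), if arbitrary precision is overkill, a common alternative is to compare the difference against a small tolerance rather than against exactly zero, in the spirit of all.equal(); a minimal sketch using the data from the question:
library(data.table)
dt <- data.table(d = c(0.00, 988.36, 0.00, 2031.46, 0.00),
                 c = c(0.00, 30.00, 0.00, 2031.46, 0.00),
                 n = c("a", "b", "a", "a", "b"))
tol <- sqrt(.Machine$double.eps)   # ~1.5e-8, the default tolerance of all.equal()
dt[, abs_diff := abs(d - c)]
dt[, pct_diff := mean(abs_diff <= tol, na.rm = TRUE), by = "n"]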

acos(1) returns NaN for some values, not others

I have a list of latitude and longitude values, and I'm trying to find the distance between them. Using a standard great circle method, I need to find:
acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1))
And multiply this by the radius of earth, in the units I am using. This is valid as long as the values we take the acos of are in the range [-1,1]. If they are even slightly outside of this range, it will return NaN, even if the difference is due to rounding.
The issue I have is that sometimes, when two lat/long values are identical, this gives me a NaN result. It does not happen every time, even for the same pair of numbers, but it always happens for the same entries in a list. For instance, I have a person stopped on a road in the desert:
Time |lat |long
1:00PM|35.08646|-117.5023
1:01PM|35.08646|-117.5023
1:02PM|35.08646|-117.5023
1:03PM|35.08646|-117.5023
1:04PM|35.08646|-117.5023
When I calculate the distance between the consecutive points, the third value, for instance, will always be NaN, even though the others are not. This seems to be a weird bug with R rounding.
Can't tell exactly without seeing your data (try dput), but this is most likely a consequence of FAQ 7.31.
(x1 <- 1)
## [1] 1
(x2 <- 1+1e-16)
## [1] 1
(x3 <- 1+1e-8)
## [1] 1
acos(x1)
## [1] 0
acos(x2)
## [1] 0
acos(x3)
## [1] NaN
That is, even if your values are so similar that their printed representations are the same, they may still differ: some will be within .Machine$double.eps and others won't ...
One way to make sure the input values are bounded by [-1,1] is to use pmax and pmin: acos(pmin(pmax(x,-1.0),1.0))
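A quick added illustration of the clamping:
x <- 1 + 1e-8                  # slightly above 1, e.g. from rounding error
acos(x)                        # NaN, with a warning
acos(pmin(pmax(x, -1), 1))     # 0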
A simple workaround is to use pmin(), like this:
acos(pmin(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1),1))
It now ensures that the precision loss leads to a value no higher than exactly 1.
This doesn't explain what is happening, however.
(Edit: Matthew Lundberg pointed out I need to use pmin to get it to work with vectorized inputs. This fixes the problem of getting it to work, but I'm still not sure why it is rounding incorrectly.)
I just encountered this. It is caused by input slightly larger than 1: due to computational error, my inner product between unit-norm vectors becomes a bit larger than 1 (like 1 + 0.00001), and acos() can only deal with [-1, 1]. So we can clamp the upper bound to exactly 1 to solve the problem.
For numpy: np.clip(your_input, -1, 1)
For Pytorch: torch.clamp(your_input, -1, 1)

Cut function in R - exclusive or am I double counting?

Based on a previous question I asked, which @Andrie answered, I have a question about the usage of the cut function and labels.
I'd like get summary statistics based on the range of number of times a user logs in.
Here is my data:
# Get random numbers
NumLogin <- round(runif(100,1,50))
# Set the login range
LoginRange <- cut(NumLogin,
                  c(0, 1, 3, 5, 10, 15, 20, Inf),
                  labels = c('1', '2', '3-5', '6-10', '11-15', '16-20', '20+'))
Now I have my LoginRange, but I'm unsure how the cut function actually works. I want to find users who have logged in 1 time, 2 times, 3-5 times, etc, while only including the user if they are in that range. Is the cut function including 3 twice (In the 2 bucket and the 3-5 bucket)? If I look in my example, I can see a user who logged in 3 times, but they are cut as '2'. I've looked at the documentation and every R book I own, but no luck. What am I doing wrong?
Also - As a usage question - should I attach the LoginRange to my data frame? If so, what's the best way to do so?
DF <- data.frame(NumLogin, LoginRange)
?
Thanks
The intervals defined by the cut() function are (by default) closed on the right. To see what that means, try this:
cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]
As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
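For instance (a small added check), both 0 and 5 fall outside the bins defined by breaks = c(0, 1, 2) and therefore come back as NA:
cut(c(0, 5), breaks = c(0, 1, 2))
# [1] <NA> <NA>
# Levels: (0,1] (1,2]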
When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up. In fact, doing this with your data (as shown below), reveals that the 2nd and 3rd bins were actually incorrectly named, which illustrates the point quite nicely!
LoginRange <- cut(NumLogin,
                  c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
                  # c(0, 1, 3, 5, 10, 15, 20, Inf) + 0.5,
                  labels = c('1', '2-3', '4-5', '6-10', '11-15', '16-20', '20+'))
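To address the attachment question at the end (an added sketch): tabulate the bins to confirm each login count lands in exactly one bin, then combine with data.frame() just as in the question:
set.seed(1)   # added for reproducibility
NumLogin <- round(runif(100, 1, 50))
LoginRange <- cut(NumLogin,
                  c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
                  labels = c('1', '2-3', '4-5', '6-10', '11-15', '16-20', '20+'))
table(LoginRange)                        # every observation counted exactly once
DF <- data.frame(NumLogin, LoginRange)   # attach the range alongside the counts
head(DF)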
