Understanding the application of a function: x > y - z & x < y + z

I am not particularly great with math and am trying to understand some R code. There is a function called "compareforequality" that looks like this:
compareforequality <- function(val1, val2, epsilon)
{
  val1 <- as.numeric(val1)
  val2 <- as.numeric(val2)
  equal <- val1 > (val2 - epsilon) & val1 < (val2 + epsilon)
  equal
}
where val1 and val2 are vectors of numbers that signify timepoints (usually integers between -10 and 1000 that identify days in a time series), and epsilon is set to 1e-10. I can see that it will return TRUE/FALSE according to whether the values are the same or different, but what is the application of a function like this instead of using something like identical()? What effect does the value of epsilon have on the comparison?
Thanks,

The point is not for them to be exactly equal; it's to compare for rough equality, as in "val1 is within epsilon of val2".
The classic example of the usefulness of something like this is floating point numbers, where (for instance) 0.1 + 0.2 != 0.3, but 0.1 + 0.2 is within epsilon of 0.3 for some small epsilon, which is quite often all you need.
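To see this in action with the function from the question:
0.1 + 0.2 == 0.3
# [1] FALSE (the sum is actually 0.30000000000000004)
compareforequality(0.1 + 0.2, 0.3, epsilon = 1e-10)
# [1] TRUE
identical() would likewise return FALSE here, because the two doubles differ in their last bits; the epsilon comparison is what lets timepoints that should be "the same day" match despite tiny arithmetic noise.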

Why am I getting NAs in this calculation in R?

While working on an Rcpp program, I used the sample() function, which gave me the following error: "NAs not allowed in probability." I traced this issue to the fact that the probability vector I used had NA values in it, but I have no idea how they got there. Below is some R code that reproduces the error:
n.0 = 20
n.1 = 20
n.reps = 1
beta0.vals = rep(seq(-.3, .1, length.out = n.0), n.reps)
beta1.vals = rep(seq(-7, 0, length.out = n.1), n.reps)
beta.grd = as.matrix(expand.grid(beta0.vals, beta1.vals))
n.rnd = 200
beta.rnd.grd = cbind(runif(n.rnd, min(beta0.vals), max(beta0.vals)),
                     runif(n.rnd, min(beta1.vals), max(beta1.vals)))
beta.grd = rbind(beta.grd, beta.rnd.grd)
N = 22670
count = 0
for (i in 1:dim(beta.grd)[1]) { # iterate through 600 possible beta values in beta grid
  beta.ind = 0 # indicator for current pair of beta values
  for (j in 1:N) { # iterate through all possible Nsums
    logit = beta.grd[i, 1] / N * (j - .1 * N)^2 + beta.grd[i, 2]
    phi01 = exp(logit) / (1 + exp(logit))
    if (is.na(phi01)) {
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
Here, $\beta_0 \in (-0.3, 0.1), \beta_1 \in (-7, 0), N = 22670, N_\text{sum} \in (1, N)$. Note that $N$ and $N_\text{sum}$ are integers, whereas the beta values may not be.
Since mathematically $\phi_{01} \in (0,1)$, I'm assuming that the NAs arise because R does not like extremely small values. I am also receiving an overwhelming number of NA values, more NAs than actual numbers. Why would I be getting NAs in this code?
Add print(logit) next to count = count + 1 and you will find lots of logit values greater than 1000. exp(1000) is Inf, so you divide Inf by Inf, which gives you NaN, and NaN is NA:
> exp(500)
[1] 1.403592e+217
> Inf/Inf
[1] NaN
> is.na(NaN)
[1] TRUE
So your problem is not numbers that are too small but numbers that are too large, produced by the evaluation of exp(x) once x exceeds roughly 709:
> exp(709)
[1] 8.218407e+307
> exp(710)
[1] Inf
Bernhard's answer correctly identifies the problem:
If logit is large, exp(logit) = Inf.
Here is a solution:
count = 0 # reset the counter before re-running the loop
for (i in 1:dim(beta.grd)[1]) { # iterate through 600 possible beta values in beta grid
  beta.ind = 0 # indicator for current pair of beta values
  for (j in 1:N) { # iterate through all possible Nsums
    logit = beta.grd[i, 1] / N * (j - .1 * N)^2 + beta.grd[i, 2]
    ## This one isn't great because exp(logit) can be very large:
    # phi01 = exp(logit)/(1 + exp(logit))
    ## So we say instead:
    ## phi01 = 1 / (1 + exp(-logit))
    phi01 = plogis(logit)
    if (is.na(phi01)) {
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
# Total number of invalid probabilities: 0
We can use the more stable form 1 / (1 + exp(-logit)); to convince yourself of this, multiply the original expression by exp(-logit) / exp(-logit).
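Writing that algebra out, with $x$ standing for the logit:
$$\phi_{01} = \frac{e^{x}}{1+e^{x}} = \frac{e^{x}}{1+e^{x}} \cdot \frac{e^{-x}}{e^{-x}} = \frac{1}{1+e^{-x}}$$
For large positive $x$, $e^{-x}$ underflows harmlessly to 0 and the quotient is simply 1, whereas $e^{x}$ would overflow to Inf and produce Inf/Inf = NaN.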
Luckily, either way, R has a built-in function, plogis(), that calculates these probabilities quickly and accurately. You can see from the help file (?plogis) that it evaluates the expression I gave, but you can also double-check to assure yourself:
x = rnorm(1000)
y = 1 / (1 + exp(-x))
z = plogis(x)
all.equal(y, z)
# [1] TRUE
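To see the stability difference at the values that caused the original NAs:
plogis(1000)
# [1] 1
exp(1000) / (1 + exp(1000))
# [1] NaN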

How does R (v3.6.1) determine rounding with odd floor values? [duplicate]

Yes, I know why we round to the nearest even number when we are exactly in the middle of two numbers (i.e. 2.5 becomes 2). But the people I evaluate data for don't want this behaviour. What is the simplest method to get this:
x <- seq(0.5,9.5,by=1)
round(x)
to be 1,2,3,...,10 and not 0,2,2,4,4,...,10.
Edit: To clarify: 1.4999 should be 1 after rounding. (I thought this would be obvious.)
This is not my own function, and unfortunately, I can't find where I got it at the moment (originally found as an anonymous comment at the Statistically Significant blog), but it should help with what you need.
round2 = function(x, digits) {
  posneg = sign(x)
  z = abs(x) * 10^digits
  z = z + 0.5 + sqrt(.Machine$double.eps)
  z = trunc(z)
  z = z / 10^digits
  z * posneg
}
x is the object you want to round, and digits is the number of digits you are rounding to.
An Example
x = c(1.85, 1.54, 1.65, 1.85, 1.84)
round(x, 1)
# [1] 1.8 1.5 1.6 1.8 1.8
round2(x, 1)
# [1] 1.9 1.5 1.7 1.9 1.8
(Thanks @Gregor for the addition of + sqrt(.Machine$double.eps).)
If you want something that behaves exactly like round except for those xxx.5 values, try this:
x <- seq(0, 1, 0.1)
x
# [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
floor(0.5 + x)
# [1] 0 0 0 0 0 1 1 1 1 1 1
As @CarlWitthoft said in the comments, this is the IEC 60559 standard as mentioned in ?round:
Note that for rounding off a 5, the IEC 60559 standard is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
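You can see that representation error directly by printing 0.15 with more digits than the default:
sprintf("%.17f", 0.15)
# [1] "0.14999999999999999"
The rounding rule is applied to that stored value, not to the "0.15" you typed.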
An additional explanation by Greg Snow:
The logic behind the round-to-even rule is that we are trying to represent an underlying continuous value, and if x comes from a truly continuous distribution then the probability that x == 2.5 is 0, and the 2.5 was probably already rounded once from any value between 2.45 and 2.54999999999999... If we use the round-up-on-0.5 rule that we learned in grade school, then the double rounding means that values between 2.45 and 2.50 will all round to 3 (having been rounded first to 2.5). This will tend to bias estimates upwards. To remove the bias we need to either go back to before the rounding to 2.5 (which is often impossible or impractical), or just round up half the time and round down half the time (or, better, round in proportion to how likely we are to see values below or above 2.5 rounded to 2.5, but that will be close to 50/50 for most underlying distributions). The stochastic approach would be to have the round function randomly choose which way to round, but deterministic types are not comfortable with that, so "round to even" was chosen (round to odd should work about the same) as a consistent rule that rounds up and down about 50/50.
If you are dealing with data where 2.5 is likely to represent an exact value (money, for example), then you may do better by multiplying all values by 10 or 100 and working in integers, then converting back only for the final printing. Note that 2.50000001 rounds to 3, so if you keep more digits of accuracy until the final printing, then rounding will go in the expected direction; or you can add 0.000000001 (or another small number) to your values just before rounding, but that can bias your estimates upwards.
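A minimal sketch of that multiply-and-work-in-integers idea, with made-up amounts in cents:
cents <- c(250L, 345L, 105L)        # $2.50, $3.45, $1.05, stored exactly as integers
dollars <- (cents + 50L) %/% 100L   # half-up rounding in exact integer arithmetic
dollars
# [1] 3 3 1
Because everything stays an integer, the half-up rule is exact and no representation error can creep in (for positive amounts; negatives would need sign handling).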
This appears to work:
rnd <- function(x) trunc(x+sign(x)*0.5)
Ananda Mahto's response seems to do this and more; I am not sure what the extra code in his response is accounting for, or, in other words, I can't figure out how to break the rnd() function defined above.
Example:
x <- seq(-2, 2, by=0.5)
x
# [1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
round(x)
# [1] -2 -2 -1  0  0  0  1  2  2
rnd(x)
# [1] -2 -2 -1 -1  0  1  1  2  2
Depending on how comfortable you are with jiggling your data, this works (with x being the seq(0.5, 9.5, by=1) from the question):
round(x+10*.Machine$double.eps)
# [1] 1 2 3 4 5 6 7 8 9 10
This method:
round2 = function(x, n) {
  posneg = sign(x)
  z = abs(x) * 10^n
  z = z + 0.5
  z = trunc(z)
  z = z / 10^n
  z * posneg
}
does not seem to work well when we have numbers with many digits. E.g. round2(2436.845, 2) gives 2436.84. The issue seems to occur at the trunc(z) step.
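In fact the culprit is representation error upstream of trunc(): the closest double to 2436.845, multiplied by 100, already falls slightly below 243684.5, so adding 0.5 and truncating lands on 243684 rather than 243685. You can confirm this directly:
2436.845 * 100 < 243684.5
# [1] TRUE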
Overall, I think it has something to do with the way R stores floating-point numbers, so trunc() doesn't always behave the way the printed value suggests. I was able to get around it in a not particularly elegant way:
round2 = function(x, n) {
  posneg = sign(x)
  z = abs(x) * 10^n
  z = z + 0.5
  z = trunc(as.numeric(as.character(z)))
  z = z / 10^n
  z * posneg
}
This mimics the rounding away from zero at .5:
round_2 <- function(x, digits = 0) {
  # nudge x away from zero by a relative epsilon so exact .5 cases round outward
  x = x + abs(x) * sign(x) * .Machine$double.eps
  round(x, digits = digits)
}
round_2(.5 + -2:4)
# [1] -2 -1  1  2  3  4  5

How to replace a percentage of data in matrix with a new value

I need to replace 5% of the values in my matrix with the number 0.2 (guess in my code) if they are below 0.2, and leave them alone if they are above 0.2.
Right now my code is changing all values less than 0.2 to 0.2.
This will eventually sit inside a larger loop running over multiple replications, but right now I am trying to get it to work for just one.
Explanation:
gen.probs.2PLM is a matrix containing probabilities. guess is the value I have chosen as the replacement. perc is the percentage of the matrix I would like to look at and change if the values are less than guess.
gen.probs.2PLM <- icc.2pl(gen.theta, a, b, N, TL)
perc <- 0.05 * N
guess <- 0.2
gen.probs.2PLM[gen.probs.2PLM < guess] <- guess
I expect only 5 percent of the values to be looked at and changed to 0.2 if they are below 0.2. gen.probs.2PLM is a 1000 x 45 matrix.
# dput(gen.probs.2PLM[1:20, 1:5])
structure(c(0.940298707380962, 0.848432615784556, 0.927423909103331,
0.850853479678874, 0.857217846940203, 0.437981231531586, 0.876146933879543,
0.735970164547576, 0.76296469377238, 0.640645338681073, 0.980212105400924,
0.45164925578322, 0.890102475061895, 0.593094353657132, 0.837401449711248,
0.867436194744775, 0.753637051722629, 0.64254277457268, 0.947783594375454,
0.956791049998361, 0.966059152820211, 0.896715435704569, 0.957247808046098,
0.898712615329071, 0.903924224222216, 0.474561641407715, 0.919080521405463,
0.795919510255144, 0.821437921281395, 0.700141602452725, 0.990657455188518,
0.490423165094245, 0.92990761183835, 0.649494291971471, 0.887513826127176,
0.912171225584296, 0.812707696992244, 0.702126169775785, 0.971012049724468,
0.976789027046465, 0.905046450670641, 0.81322870291296, 0.890539069545935,
0.81539882951241, 0.821148949083641, 0.494459368656066, 0.838675666691869,
0.719720365120414, 0.741166345529595, 0.646700411799437, 0.9578080044146,
0.504938867664858, 0.852068230044858, 0.611124165649146, 0.803451686558428,
0.830526582119632, 0.73370297276145, 0.648126933954648, 0.913887754151632,
0.925022099584059, 0.875712266966582, 0.762677615526032, 0.857390771477182,
0.765270669721981, 0.772159371696644, 0.418524844618452, 0.793318641931831,
0.65437308255825, 0.678633290218262, 0.574232080921638, 0.943851827968259,
0.428780249640693, 0.809653131485398, 0.536512513508941, 0.751041035436293,
0.783450103818893, 0.6701523432789, 0.575762279897951, 0.886965071394186,
0.901230746880145, 0.868181123535613, 0.688344765218149, 0.840795870494126,
0.69262216320168, 0.703982665712434, 0.215843106547112, 0.738775789107177,
0.513997187757334, 0.551803060188986, 0.397460216626274, 0.956693337996693,
0.225901690507801, 0.765409027208693, 0.347791079152411, 0.669156131912199,
0.72257632593578, 0.538474414984722, 0.399549159711904, 0.884405290470079,
0.904200878248468), .Dim = c(20L, 5L))
Here is a function that you can apply to a numeric matrix to replace 5% of the values below some threshold (e.g. .2 in your case) with the threshold:
replace_5pct <- function(d, threshold=.2){
  # get indices of cells below threshold, sample 5% of them
  cells_below <- which(d < threshold)
  cells_to_modify <- sample(cells_below, size=.05*length(cells_below))
  # then replace values for sampled indices with threshold + return
  d[cells_to_modify] <- threshold
  return(d)
}
Here's an example of how it can be used (where dat would correspond to your matrix):
dat <- matrix(round(runif(1000), 1), ncol=10)
dat_5pct_replaced <- replace_5pct(dat, threshold=.2)
You can look at the data to confirm the result, or look at stats like these:
mean(dat < .2) # somewhere between .1 and .2, probably
sum(dat != dat_5pct_replaced) # about 5% of sum(dat < .2)
P.S.: If you want to generalize the function, you could abstract over the 5% replacement too; then you could replace e.g. 10% of values below some threshold, etc. And if you want to get fancy, you could abstract over "less than" as well, passing a comparison function as a parameter to the main function:
replace_func <- function(d, func, threshold, prop){
  cells <- which(func(d, threshold))
  cells_to_modify <- sample(cells, size=prop*length(cells))
  d[cells_to_modify] <- threshold
  return(d)
}
And then e.g. replace 10% of values above .5 with .5:
# (need to backtick infix functions like <, >, etc.)
replace_func(dat, func=`>`, threshold=.5, prop=.1)
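A quick sanity check of that call, reusing dat from above:
dat_replaced <- replace_func(dat, func=`>`, threshold=.5, prop=.1)
sum(dat != dat_replaced) / sum(dat > .5)
# roughly 0.1, i.e. about 10% of the cells above .5 were replaced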

Can't break out of while loop in R

The purpose of my code is to find the number of people for which the probability that at least two of them have the same birthday is 50%.
source('colMatches.r')
all_npeople = 1:300
days = 1:365
ntrials = 1000
sizematch = 2
N = length(all_npeople)
counter = 1
pmean = rep(0,N)
while (pmean[counter] <= 0.5)
{
  npeople = all_npeople[counter]
  x = matrix(sample(days, npeople * ntrials, replace = TRUE),
             nrow = npeople, ncol = ntrials)
  w = colMatches(x, sizematch)
  pmean[counter] = mean(w)
  counter = counter + 1
}
s3 = toString(pmean[counter])
s2 = toString(counter)
s1 = "The smallest value of n for which the probability of a match is at least 0.5 is equal to "
s4 = " (the test p value is "
s5 = "). This means when you have "
s6 = " people in a room the probability that two of them have the same birthday is 50%."
paste(s1, s2, s4, s3, s5, s2, s6, sep="")
When I run that code I get: "The smallest value of n for which the probability of a match is at least 0.5 is equal to 301 (the test p value is NA). This means when you have 301 people in a room the probability that two of them have the same birthday is 50%." So the while statement isn't working properly for some reason: it cycles all the way through all_npeople even though it should stop once pmean[counter] is no longer less than or equal to 0.5.
I know that pmean is updating correctly, though, because when I test it afterwards pmean[50] = 0.971. So the values are indeed correct, but the while loop still won't end.
*colMatches is a function that determines whether a column has a certain number of matches, based on sizematch. So in this case it looks at the matrix defined in x and lists 1 for every column that has at least 2 equal values and 0 for every column with no matches.
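The reason the loop never exits is worth spelling out: after counter = counter + 1 at the bottom of the loop body, the while condition tests pmean[counter], an element that has not been computed yet and still holds its initial value of 0, so pmean[counter] <= 0.5 is always true (and once counter reaches 301, indexing past the end of pmean yields NA). A minimal sketch of a fix that keeps the original structure but tests the value just computed, using repeat/break so the paste() code above still works:
repeat
{
  npeople = all_npeople[counter]
  x = matrix(sample(days, npeople * ntrials, replace = TRUE),
             nrow = npeople, ncol = ntrials)
  w = colMatches(x, sizematch)
  pmean[counter] = mean(w)
  if (pmean[counter] > 0.5) break  # stop at the first n whose estimate exceeds 0.5
  counter = counter + 1
}
# counter now indexes the answer, so pmean[counter] is no longer NA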
I admire your attempt to program this question, but the beauty of R is most of this work is done for you:
qbirthday(prob = 0.5, classes = 365, coincident = 2)
#answer is 23 people.
You may also be interested in:
pbirthday(n, classes = 365, coincident = 2)
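For example, plugging the qbirthday() answer back in:
pbirthday(23, classes = 365, coincident = 2)
# [1] 0.5072972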
If the purpose of the code is only to find the number of people at which the probability that at least two of them share a birthday rises above 0.5, it is possible to write it in a much simpler way:
# note that `probability` below is the probability of NOT having a shared birthday
probability <- 1
people <- 1
days <- 365
while (probability >= 0.5) {
  people <- people + 1
  probability <- probability * (days + 1 - people) / days
}
print(people)
