Using replace more efficiently in R

I'm coding in R, and I want to replace all the values within a given vector which are over 0.5 with 1, and any value which is under 0.5 with -1. I know how to do this with two sequential calls to "replace" such as:
my_nums <- c(0.2,0.8,0.3,0.4)
my_nums_adj_temp <- replace(my_nums,my_nums>0.5,1)
my_nums_adj <- replace(my_nums_adj_temp,my_nums<0.5,-1)
And so "my_nums" goes from
> my_nums
[1] 0.2 0.8 0.3 0.4
to
> my_nums_adj
[1] -1 1 -1 -1
But is there a way to do this with just a single call to "replace"?

This returns the same values as the code in the question. Note that an input component equal to .5 is left at .5 by the question's code, and this expression does the same; if you would rather return 0 in that case, omit the second term, and if you want some other value, replace the second .5 with that value.
sign(my_nums - .5) + .5 * (my_nums == .5)
## [1] -1 1 -1 -1
If you wanted to return 1 when the input is greater than or equal to .5 and -1 when it is less than .5, you could use the following. For each component of the input, only one of the two terms can be non-zero, so the result is 1 or -1 for each component.
(my_nums >= .5) - (my_nums < .5)
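For the example vector from the question, this second expression gives the same result (a quick check, not part of the original answer); it returns integers because TRUE and FALSE are coerced to 1 and 0:
my_nums <- c(0.2, 0.8, 0.3, 0.4)
(my_nums >= .5) - (my_nums < .5)
## [1] -1  1 -1 -1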

Related

Entropy calculation gives NaN - is applying na.omit a valid tweak?

Entropy is defined as:
entropy <- function (p) sum(-p * log(p))
I'm performing LCA using the poLCA package and trying to calculate entropy, which for some of my models comes out as NaN.
error_prior <- entropy(lca2$P) # Class proportions model 2
error_post <- mean(apply(lca2$posterior, 1, entropy), na.rm = TRUE)
results[2,8] <- round(((error_prior - error_post) / error_prior), 3)
From the answer to this question: Entropy output is NaN for some class solutions and not others, I learnt that it is caused by zeros in p and it can be resolved by adding na.omit to the function as follows:
entropy <- function (p) sum(na.omit(-p * log(p)))
My question is - is this technical tweak mathematically valid without affecting the integrity of the calculation?
In my case, around 1/3 of the values in p are zeros. I'm really unsure if I should use na.omit or find another way to resolve this problem.
It is valid, but not transparent at first glance. The reason is that the mathematical limit of x * log(x) as x -> 0 is 0 (which can be proved with L'Hôpital's rule). In this regard, the most robust definition of the function would be
entropy.safe <- function (p) {
  if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
  log.p <- numeric(length(p))   ## initialised to 0, so p = 0 entries contribute 0
  safe <- p != 0
  log.p[safe] <- log(p[safe])
  sum(-p * log.p)
}
But simply dropping the p = 0 cases gives identical results, because the limiting value at p = 0 is 0, so those terms contribute nothing to the sum anyway.
entropy.brutal <- function (p) {
  if (any(p > 1 | p < 0)) stop("probability must be between 0 and 1")
  log.p <- log(p)
  ## same as sum(na.omit(-p * log.p))
  sum(-p * log.p, na.rm = TRUE)
}
## p has a single 0
( p <- seq(0, 1, by = 0.1) )
#[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 2.455935
entropy.safe(p)
#[1] 2.455935
## half of p are zeros
p[1:5] <- 0
p
#[1] 0.0 0.0 0.0 0.0 0.0 0.5 0.6 0.7 0.8 0.9 1.0
entropy.brutal(p)
#[1] 1.176081
entropy.safe(p)
#[1] 1.176081
In conclusion, we can use either entropy.brutal or entropy.safe.
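As a quick cross-check (not part of the original answer), the na.omit version from the question gives the same value on the vector above, since na.omit drops the NaN terms produced at p = 0:
entropy.naomit <- function (p) sum(na.omit(-p * log(p)))
entropy.naomit(p)
#[1] 1.176081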

How does R (v3.6.1) determine rounding with odd floor values? [duplicate]

Yes, I know why we always round to the nearest even number when we are exactly in the middle of two numbers (i.e. 2.5 becomes 2). But when I evaluate data for some people, they don't want this behaviour. What is the simplest method to get this:
x <- seq(0.5,9.5,by=1)
round(x)
to be 1,2,3,...,10 and not 0,2,2,4,4,...,10.
Edit: To clarify: 1.4999 should be 1 after rounding. (I thought this would be obvious.)
This is not my own function, and unfortunately I can't find where I got it at the moment (it was originally an anonymous comment on the Statistically Significant blog), but it should help with what you need.
round2 = function(x, digits) {
  posneg = sign(x)
  z = abs(x)*10^digits
  z = z + 0.5 + sqrt(.Machine$double.eps)
  z = trunc(z)
  z = z/10^digits
  z*posneg
}
x is the object you want to round, and digits is the number of digits you are rounding to.
An Example
x = c(1.85, 1.54, 1.65, 1.85, 1.84)
round(x, 1)
# [1] 1.8 1.5 1.6 1.8 1.8
round2(x, 1)
# [1] 1.9 1.5 1.7 1.9 1.8
(Thanks @Gregor for the addition of + sqrt(.Machine$double.eps).)
If you want something that behaves exactly like round except for those xxx.5 values, try this:
x <- seq(0, 1, 0.1)
x
# [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
floor(0.5 + x)
# [1] 0 0 0 0 0 1 1 1 1 1 1
As @CarlWitthoft said in the comments, this is the IEC 60559 standard as mentioned in ?round:
Note that for rounding off a 5, the IEC 60559 standard is expected to be used, ‘go to the even digit’. Therefore round(0.5) is 0 and round(-1.5) is -2. However, this is dependent on OS services and on representation error (since e.g. 0.15 is not represented exactly, the rounding rule applies to the represented number and not to the printed number, and so round(0.15, 1) could be either 0.1 or 0.2).
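To make the representation-error point concrete (this illustration is not part of the quoted help page), you can print 0.15 with more digits than the default:
sprintf("%.17f", 0.15)
## on a typical IEEE-754 double this shows 0.15 is stored slightly low,
## e.g. "0.14999999999999999", which is why round(0.15, 1) can give 0.1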
An additional explanation by Greg Snow:
The logic behind the round-to-even rule is that we are trying to represent an underlying continuous value, and if x comes from a truly continuous distribution, then the probability that x == 2.5 is 0, and the 2.5 was probably already rounded once from some value between 2.45 and 2.54999999999999... If we use the round-up-on-0.5 rule that we learned in grade school, then the double rounding means that values between 2.45 and 2.50 will all round to 3 (having been rounded first to 2.5). This will tend to bias estimates upwards. To remove the bias we need to either go back to before the rounding to 2.5 (which is often impossible or impractical), or just round up half the time and round down half the time (or, better, round in proportion to how likely we are to see values below or above 2.5 rounded to 2.5, but that will be close to 50/50 for most underlying distributions). The stochastic approach would be to have the round function randomly choose which way to round, but deterministic types are not comfortable with that, so "round to even" was chosen (round to odd should work about the same) as a consistent rule that rounds up and down about 50/50.
If you are dealing with data where 2.5 is likely to represent an exact value (money, for example), then you may do better by multiplying all values by 10 or 100 and working in integers, then converting back only for the final printing. Note that 2.50000001 rounds to 3, so if you keep more digits of accuracy until the final printing, then rounding will go in the expected direction, or you can add 0.000000001 (or another small number) to your values just before rounding, but that can bias your estimates upwards.
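A small sketch (not from Greg Snow's answer; the helper round_half_up is hypothetical) of the double-rounding bias he describes:
## grade-school "round half up", with the same sqrt(.Machine$double.eps)
## guard against representation error as round2 above
round_half_up <- function(x, digits = 0) {
  floor(x * 10^digits + 0.5 + sqrt(.Machine$double.eps)) / 10^digits
}
x <- c(2.45, 2.46, 2.47, 2.48, 2.49)
round_half_up(x)                    ## rounded once: all give 2
round_half_up(round_half_up(x, 1))  ## rounded to 2.5 first, then again: all give 3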
This appears to work:
rnd <- function(x) trunc(x+sign(x)*0.5)
Ananda Mahto's response seems to do this and more - I am not sure what the extra code in his response is accounting for; or, in other words, I can't figure out how to break the rnd() function defined above.
Example:
( x <- seq(-2, 2, by=0.5) )
# [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
round(x)
# [1] -2 -2 -1 0 0 0 1 2 2
rnd(x)
# [1] -2 -2 -1 -1 0 1 1 2 2
Depending on how comfortable you are with jiggling your data, this works (using the x <- seq(0.5, 9.5, by=1) from the question):
round(x+10*.Machine$double.eps)
# [1] 1 2 3 4 5 6 7 8 9 10
This method:
round2 = function(x, n) {
  posneg = sign(x)
  z = abs(x)*10^n
  z = z + 0.5
  z = trunc(z)
  z = z/10^n
  z*posneg
}
does not seem to work well for numbers with many digits. E.g. round2(2436.845, 2) gives 2436.84. The issue seems to occur at the trunc(z) step.
Overall, I think it has something to do with the way R stores floating-point numbers, so trunc() does not always behave as expected here. I was able to get around it in a not-very-elegant way:
round2 = function(x, n) {
  posneg = sign(x)
  z = abs(x)*10^n
  z = z + 0.5
  z = trunc(as.numeric(as.character(z)))
  z = z/10^n
  z*posneg
}
This mimics the rounding away from zero at .5:
round_2 <- function(x, digits = 0) {
  x = x + abs(x) * sign(x) * .Machine$double.eps
  round(x, digits = digits)
}
round_2(.5 + -2:4)
# [1] -2 -1  1  2  3  4  5

Wrong result from constrOptim function

I'm trying to use constrOptim to optimize the sum of squared errors from a linear multiple regression. The main equation should be D = Beta1*Xa + Beta2*Xb + Beta3*Xc + Beta4*Xd, with D, Xa, Xb, Xc, Xd taken from an imported .csv file, and the Betas are the coefficients I want to find by minimizing the squared errors.
So far I have imported the .csv file into R, named the columns Ds, Xa, Xb, Xc, Xd, and created the objective function (in pseudo-code)
sum(E^2) = (sum(D) - sum(Beta1*Xa + Beta2*Xb + Beta3*Xc + Beta4*Xd))^2
I also created the matrix 'C' and vector 'd' to configure the constraints that should restrict the Betas to <= 0. I don't know how to find the feasible region, although I've used initial values that made the function work.
Here is the code:
> Tabela= read.table("Simulacao.csv", header=T, sep= ";")
> Tabela
D A B C D.1
1 -1 1 -1 0 0
2 4 0 0 1 -1
3 4 1 0 -1 0
4 0 0 1 0 -1
5 -2 1 0 0 -1
> Ds= Tabela[,1]
> Xa= Tabela[,2]
> Xb= Tabela[,3]
> Xc= Tabela[,4]
> Xd= Tabela[,5]
> simulaf= function(x1,x2,x3,x4) {
+ Ds= Tabela[,1]
+ Xa= Tabela[,2]
+ Xb= Tabela[,3]
+ Xc= Tabela[,4]
+ Xd= Tabela[,5]
+ J=sum(Ds)
+ H=sum(x1*Xa+x2*Xb+x3*Xc+x4*Xd)
+ sx=(J-H)^2
+ return(sx)
+ }
> s= function(x) {simulaf(x[1],x[2],x[3],x[4])}
> d= c(0,0,0,0)
> C= matrix(c(-1,0,0,0,0,-1,0,0,0,0,-1,0,0,0,0,-1),nrow=4,ncol=4,byrow=T)
> constrOptim(c(-1,-1,-1,-1),s,NULL,C,d)
$par
[1] -0.2608199 -0.8981110 -1.1095961 -1.9274866
The result I expect should be:
$par
[1] -0.125 0 -0.5 -0.875
After researching this, my conclusion is that it could be because I'm using bad initial values, because of a parameterization problem (I don't understand why that would be needed), or simply that I have programmed it incorrectly.
What do I need to do to fix this?
The formula for the sum of squared errors is
sum((y - yhat)^2)
and not
(sum(y) - sum(yhat))^2
where yhat is the predicted value.
Also, if your only constraints are that the estimated betas should be negative (which is a bit weird, usually you want them to be positive but never mind), then you don't need constrOptim. Regular optim(method="L-BFGS-B") or nlminb will work with so-called box constraints.
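A minimal sketch of what this suggests (not from the original answer; it assumes the Ds, Xa, Xb, Xc, Xd vectors defined in the question, and the starting values are illustrative):
## objective: sum of squared residuals, not the square of summed residuals
sse <- function(beta) {
  yhat <- beta[1]*Xa + beta[2]*Xb + beta[3]*Xc + beta[4]*Xd
  sum((Ds - yhat)^2)
}
## box constraints via optim: each beta restricted to <= 0
fit <- optim(par = c(-1, -1, -1, -1), fn = sse,
             method = "L-BFGS-B", lower = -Inf, upper = 0)
fit$par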

within function warnings when using while in a function

I wrote a function to calculate the critical depth of water in a circular channel, given the flow (Q) and diameter (Dia):
D_Critic <- function (Q, Dia) {
  g = 9.81
  Diff = 1
  Phi = 0.01
  while (Diff >= 0.001) {
    A = 16*Q*sqrt((2/g)*sin(Phi/2))
    B = Dia^5/2*(Phi-sin(Phi))^3/2
    Diff = A-B
    Phi = Phi+0.001
    Yc = Dia/2*(1-cos(Phi/2))
  }
  return(Yc)
}
Now I want to use the within function to bind Yc to the data frame DQ, but it returns only the first calculated Yc, along with many repeated warnings:
Q<-c(2.5975,2.5900,2.4183,2.3077)
D<-c(1,1,1,1)
DQ<-data.frame(Q,D)
> D_Q<-within(DQ,Yc<-D_Critic( Q/2, D))
There were 50 or more warnings (use warnings() to see the first 50)
> D_Q
Q D Yc
1 2.5975 1 0.52609
2 2.5900 1 0.52609
3 2.4183 1 0.52609
4 2.3077 1 0.52609
> warnings()
Warning messages:
1: In while (Diff >= 0.001) { ... :
the condition has length > 1 and only the first element will be used
The while statement only takes a single boolean value, e.g. Diff >= 0.001 where Diff must be a single number. The first time through the loop this is the case, as Diff equals 1. However, in the second iteration, Diff becomes equal to A - B, where A and B are both vectors of length 4.
So when your code reaches the second iteration, while generates a warning, as it is not sure how to deal with a vector of booleans. The choice it makes is to simply use the first element of the boolean vector and discard the rest.
You need to consider what Diff actually is. It is probably a single number, so something like Diff = sum(A - B) or Diff = sum((A - B)^2). This would result in a single Diff value and get rid of your warnings. What Diff should be exactly depends on the theory you are working with; your textbook should say.
It was resolved with a trick:
Yc <- matrix(NA, length(DQ$Q), 1)
for (i in 1:length(DQ$Q)) {
  Yc[i,1] <- D_Critic(DQ$Q[i]/2, DQ$D[i])
}
> Yc
[,1]
[1,] 0.5260900
[2,] 0.5255907
[3,] 0.5163489
[4,] 0.5098512
DQ<-cbind(DQ,Yc)
> DQ
Q D Yc
1 2.5975 1 0.5260900
2 2.5900 1 0.5255907
3 2.4183 1 0.5163489
4 2.3077 1 0.5098512
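An equivalent, more compact alternative to the explicit loop (a sketch, not from the original thread) is to apply D_Critic element-wise with mapply and assign the result directly:
## call D_Critic once per row, so Diff stays a single number inside the function
DQ$Yc <- mapply(D_Critic, DQ$Q/2, DQ$D)
DQ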
