Odd behaviour of quotient in R [duplicate]

This question already has answers here: Why are these numbers not equal? (6 answers). Closed 4 years ago.
The third line of code in the example below gives 11, while all the other lines give 12. Is there a reason why this happens, and if so, is there a way to fix it?
> .03 %/% 0.0025
[1] 12
> .03 / 0.0025
[1] 12
> .3 %/% 0.025
[1] 11
> .3 / 0.025
[1] 12
> 3 %/% 0.25
[1] 12
> 3 / 0.25
[1] 12
This happens with multiple numbers, by the way; some more examples below:
> 0.35 %/% 0.025
[1] 13
> 0.35 / 0.025
[1] 14
> 0.85 %/% 0.025
[1] 33
> 0.85 / 0.025
[1] 34
> 0.425 %/% 0.025
[1] 16
> 0.425 / 0.025
[1] 17
> 0.975 %/% 0.025
[1] 38
> 0.975 / 0.025
[1] 39

I am not sure this is a full answer, but it was too long for a comment. From ?"%%":
x %% y and x %/% y can be used for non-integer y, e.g. 1 %/% 0.2, but the
results are subject to representation error and so may be
platform-dependent. Because the IEC 60559 representation of 0.2 is a
binary fraction slightly larger than 0.2, the answer to 1 %/% 0.2
should be 4 but most platforms give 5.
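One possible workaround (a sketch, not from the thread; int_div is just an illustrative name): do the ordinary division first, round away the representation error, and only then take the integer part.
int_div <- function(x, y, digits = 8) floor(round(x / y, digits))  # digits is an arbitrary tolerance
int_div(0.3, 0.025)
#> [1] 12
int_div(0.975, 0.025)
#> [1] 39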

How to round any decimal number less than one to show only one number at the end in R?

How can I round any decimal number less than one so that only one significant digit remains at the end? For example:
0.00045454 -> 0.0005
0.024 -> 0.02
0.2 -> 0.2
0.000000007020 -> 0.000000007
0.000000008 -> 0.000000008
Use log10 with ceiling.
fRoundEnd <- function(x) round(x, ceiling(-log10(x)))
fRoundEnd(0.00045454)
#> [1] 5e-04
fRoundEnd(0.024)
#> [1] 0.02
fRoundEnd(0.2)
#> [1] 0.2
fRoundEnd(0.000000007020)
#> [1] 7e-09
fRoundEnd(0.000000008)
#> [1] 8e-09
fRoundEnd(2022)
#> [1] 2000
fRoundEnd(c(0.00045454, 0.024, 0.2, 0.000000007020, 0.000000008, 2022))
#> [1] 5e-04 2e-02 2e-01 7e-09 8e-09 2e+03
You can use signif
signif(c(0.00045454, 0.024, 0.2, 0.000000007020, 0.000000008, 2022), 1)
# [1] 5e-04 2e-02 2e-01 7e-09 8e-09 2e+03

Why does my random trajectory show a systematic bias in angle?

I have a trajectory in 2D (a list of x, y positions).
I am trying to measure the angles of the motion between consecutive points.
So I calculate the scalar product of two consecutive vectors and divide by the product of the vector norms, which gives me the cosine of the angle I am looking for.
However, when I generate totally random trajectories (by generating random x and random y), I always get a high number of cosine values very close to -1 or 1, whereas I was expecting all values between -1 and 1 to be equally likely.
Here's my code to generate the trajectories (after correction from the comments below) and calculate the cosine:
cost = c()
t = seq(0, 500, 0.5)
x = 1 * runif(length(t), -1, 1)
y = 1 * runif(length(t), -1, 1)
x = cumsum(x)
y = cumsum(y)
step = 1
dstep = 2
for (j in 1:(length(t) - dstep))
{
  x1 = x[j + step] - x[j]
  y1 = y[j + step] - y[j]
  x2 = x[j + dstep] - x[j + step]
  y2 = y[j + dstep] - y[j + step]
  n1 = sqrt(x1*x1 + y1*y1)
  n2 = sqrt(x2*x2 + y2*y2)
  if ((n1*n2) > 0)
  {
    scal = x1*x2 + y1*y2
    cost = c(cost, scal/(n1*n2))
    #print( paste(n1, " ", n2, " ", n1*n2, " ", scal, " ", x1, " ", x2, " ", scal/(n1*n2), sep="") )
  }
}
When I look at the histogram of the cost values, I always get a high number of values very close to -1 and 1:
> hist(cost, plot=F)
$breaks
[1] -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45
[13] -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15
[25] 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75
[37] 0.80 0.85 0.90 0.95 1.00
$counts
[1] 108 43 32 20 22 21 19 20 19 17 16 19 8 19 23 17 15 10 18
[20] 22 15 19 14 15 18 16 21 11 18 20 16 35 23 24 24 20 23 33
[39] 37 107
Any idea where I'm going wrong, or why it behaves this way?
Thanks for the help.
In case somebody else meets this problem, here's a summary of the solution from the comments:
Actually this distribution of the cos is what you get when angles are uniformly distributed! Consider hist(cos(runif(1000, min = 0, max = 2*pi))). So it's working as expected. cos just moves quickly over 0 and slowly over 1 and -1. See plot(cos, from = 0, to = 2*pi).
Which is indeed explained there: https://math.stackexchange.com/questions/1153339/distribution-of-cosine-of-uniformly-random-variables
The conclusion is thus that it is normal to get more cosine values close to 1 and -1 when the underlying angles are totally random.
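A quick visual check (a sketch, using the cost vector from the code above): the theoretical density of cos(U) for U uniform on [0, 2*pi] is 1/(pi*sqrt(1 - x^2)), which blows up near -1 and 1, matching the shape of the histogram.
hist(cost, breaks = 40, freq = FALSE)                                            # empirical density of the cosines
curve(1 / (pi * sqrt(1 - x^2)), from = -0.999, to = 0.999, add = TRUE, col = "red")  # theoretical density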

Assigning a value to each range of consecutive numbers with same sign in R

I'm trying to create a data frame that has a column holding the length (in time) of each run of positive or negative numbers, like so:
Time V Length
0.5 -2 1.5
1.0 -1 1.5
1.5 0 0.0
2.0 2 1.0
2.5 0 0.0
3.0 1 1.75
3.5 2 1.75
4.0 1 1.75
4.5 -1 0.75
5.0 -3 0.75
The Length column sums the length of time that the value has been positive or negative. Zeros are given a Length of 0 since they are inflection points. If there is no zero separating a sign change, the crossing time is taken as the average of the Time values on either side of the inflection.
I am trying to approximate the amount of time that these values are spending either positive or negative. I've tried this with a for loop with varying degrees of success, but I would like to avoid looping because I am working with extremely large data sets.
I've spent some time looking at sign and diff as they are used in this question about sign changes. I've also looked at this question that uses transform and aggregate to sum consecutive duplicate values. I feel like I could use this in combination with sign and/or diff, but I'm not sure how to retroactively assign these sums to the ranges that created them or how to deal with spots where I'm taking the average across the inflection.
Any suggestions would be appreciated. Here is the sample dataset:
dat <- data.frame(Time = seq(0.5, 5, 0.5), V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
First find the indices of "Time" which need to be interpolated: consecutive "V" values which lack a zero between a positive and a negative value; they have an abs(diff(sign(V))) equal to two.
id <- which(abs(c(0, diff(sign(dat$V)))) == 2)
Add rows with the average "Time" between the relevant indices and corresponding "V" values of zero to the original data. Also add rows with "V" = 0 at "Time" = 0 and at the last time step (according to the assumptions mentioned by #Gregor). Order by "Time".
d2 <- rbind(dat,
            data.frame(Time = (dat$Time[id] + dat$Time[id - 1])/2, V = 0),
            data.frame(Time = c(0, max(dat$Time)), V = c(0, 0)))
d2 <- d2[order(d2$Time), ]
Calculate time differences between time steps which are zero and replicate them using "zero-group indices".
d2$Length <- diff(d2$Time[d2$V == 0])[cumsum(d2$V == 0)]
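As a tiny illustration of that indexing trick (a sketch with a made-up zero/non-zero vector z standing in for d2$V): cumsum(z == 0) labels each row with the index of the most recent zero, so using it to index the vector of gaps between zeros replicates each gap across its run.
z <- c(0, 1, 1, 0, 1, 0)   # made-up stand-in for d2$V
cumsum(z == 0)
# [1] 1 1 1 2 2 3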
Add values to original data:
merge(dat, d2)
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 1.00
# 4 2.0 2 1.00
# 5 2.5 0 1.75
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
Set "Length" to 0 where V == 0.
This works, at least for your test case, and it should be pretty efficient. It makes some assumptions; I'll try to point out the big ones.
First we extract the vectors and stick 0s on the beginning. We also set the last V to 0. The calculation will be based on time differences between 0s, so we need to start and end with 0s. Your example seems to tacitly assume V = 0 at Time = 0, hence the initial 0, and it stops abruptly at the maximum time, so we set V = 0 there as well:
Time = c(0, dat$Time)
V = c(0, dat$V)
V[length(V)] = 0
To fill in the skipped 0s, we use approx to do linear approximation on sign(V). It also assumes that your sampling frequency is regular, so we can get away with doubling the frequency to get all the missing 0s.
ap = approx(Time, sign(V), xout = seq(0, max(Time), by = 0.25))
The values we want to fill in are the durations between the 0s, both observed and approximated. In the correct order, these are:
dur = diff(ap$x[ap$y == 0])
Lastly, we need the indices of the original data to fill in the durations. This is the hackiest part of this answer, but it seems to work. Maybe someone will suggest a nice simplification.
# first use rleid to get the sign groupings
group = data.table::rleid(sign(dat$V))
# then we need to set the groups corresponding to 0 values to 0
# and reduce any group numbers following 0s correspondingly
# lastly we add 1 to everything so that we can stick 0 at the
# front of our durations and assign those to the 0 V values
ind = (group - cumsum(dat$V == 0)) * (dat$V != 0) + 1
# fill it in
dat$Length = c(0, dur)[ind]
dat
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 0.00
# 4 2.0 2 1.00
# 5 2.5 0 0.00
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
It took me longer than I care to admit, but here is my solution.
Because you said you want to use it on large datasets (so speed matters), I use Rcpp to write a loop that does all the checking. For the speed comparison I also create another sample dataset with 500,000 data points and time the calculation (I tried to compare against the other answers but couldn't translate them to data.table; without that it would be an unfair comparison...). If supplied, I will gladly update the speed comparisons!
Part 1: My solution
My solution looks like this:
(in length_time.cpp)
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector length_time(NumericVector time, NumericVector v) {
  double start = 0;
  double time_i, v_i;
  bool last_positive = v[0] > 0;
  bool last_negative = v[0] < 0;
  int length_i = time.length();
  NumericVector ret_vec(length_i);

  for (int i = 0; i < length_i; ++i) {
    time_i = time[i];
    v_i = v[i];
    if (v_i == 0) { // inflection point
      if (i > 0) { // if this is not the beginning, then a regime has ended!
        ret_vec[i - 1] = time_i - start;
        start = time_i;
      }
    } else if ((v_i > 0 && last_negative) || (v_i < 0 && last_positive)) {
      ret_vec[i - 1] = (time_i + time[i - 1]) / 2 - start;
      start = (time_i + time[i - 1]) / 2;
    }
    last_positive = v_i > 0;
    last_negative = v_i < 0;
  }
  ret_vec[length_i - 1] = time[length_i - 1] - start;

  // ret_vec now only has the values for the last observation of each regime;
  // do something like a reverse na_locf...
  double tmp_val = ret_vec[length_i - 1];
  for (int i = length_i - 1; i >= 0; --i) {
    if (v[i] == 0) {
      ret_vec[i] = 0;
    } else if (ret_vec[i] == 0) {
      ret_vec[i] = tmp_val;
    } else {
      tmp_val = ret_vec[i];
    }
  }
  return ret_vec;
}
and then in an R-file (i.e., length_time.R):
library(Rcpp)
# setwd("...") #to find the .cpp-file
sourceCpp("length_time.cpp")
dat$Length <- length_time(dat$Time, dat$V)
dat
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 0.00
# 4 2.0 2 1.00
# 5 2.5 0 0.00
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
Which seems to work on the sample dataset.
Part 2: Testing for Speed
library(data.table)
library(microbenchmark)
n <- 10000
set.seed(1235278)
dt <- data.table(time = seq(from = 0.5, by = 0.5, length.out = n),
                 v = cumsum(round(rnorm(n, sd = 1))))
dt[, chg := v >= 0 & shift(v, 1, fill = 0) <= 0]
plot(dt$time, dt$v, type = "l")
abline(h = 0)
for (i in dt[chg == T, time]) abline(v = i, lty = 2, col = "red")
Which results in a dataset with 985 crossings.
Testing the speed with microbenchmark results in
microbenchmark(dt[, length := length_time(time, v)])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt[, `:=`(length, length_time(time, v))] 2.625714 2.7184 3.054021 2.817353 3.077489 5.235689 100
Resulting in about 3 milliseconds for calculating with 500,000 observations.
Does that help you?
Here is my attempt done completely in base R.
Joseph <- function(df) {
  is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
  v <- df$V
  t <- df$Time
  sv <- sign(v)
  nR <- length(v)
  v0 <- which(v == 0)
  id <- which(abs(c(0, diff(sv))) > 1)  ## This line and (t[id] + t[id - 1L])/2 from #Henrik
  myZeros <- sort(c(v0*t[1L], (t[id] + t[id - 1L])/2))
  lenVals <- diff(c(0, myZeros, t[nR]))  ## Actual values that will populate the Length column

  ## remove values that result from repeating zeros from the df$V column
  lenVals <- lenVals[lenVals != t[1L] | c(!is.wholenumber(myZeros/t[1L]), F)]

  ## Below we need to determine how long to replicate
  ## each of the lenVals above, so we need to find
  ## the starting place and length of each run...
  ## rle is a great candidate for both of these
  m <- rle(sv)
  ml <- m$lengths
  cm <- cumsum(ml)
  zm <- m$values != 0     ## non-zero values i.e. we won't populate anything here
  rl <- m$lengths[zm]     ## non-zero run-lengths
  st <- cm[zm] - rl + 1L  ## starting index
  out <- vector(mode = 'numeric', length = nR)
  for (i in 1:length(st)) { out[st[i]:(st[i] + rl[i] - 1L)] <- lenVals[i] }
  df$Length <- out
  df
}
Here is the output of the given example:
Joseph(dat)
Time V Length
1 0.5 -2 1.50
2 1.0 -1 1.50
3 1.5 0 0.00
4 2.0 2 1.00
5 2.5 0 0.00
6 3.0 1 1.75
7 3.5 2 1.75
8 4.0 1 1.75
9 4.5 -1 0.75
10 5.0 -3 0.75
Here is a larger example:
set.seed(142)
datBig <- data.frame(Time=seq(0.5,50000,0.5), V=sample(-3:3, 10^5, replace=TRUE))
library(compiler)
library(data.table)
library(microbenchmark)
c.Joseph <- cmpfun(Joseph)
c.Henrik <- cmpfun(Henrik)
c.Gregor <- cmpfun(Gregor)
microbenchmark(c.Joseph(datBig), c.Gregor(datBig), c.Henrik(datBig), David(datBig), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
David(datBig) 2.20602 2.617742 4.35927 2.788686 3.13630 114.0674 10 a
c.Joseph(datBig) 61.91015 62.62090 95.44083 64.43548 93.20945 225.4576 10 b
c.Gregor(datBig) 59.25738 63.32861 126.29857 72.65927 214.35961 229.5022 10 b
c.Henrik(datBig) 1511.82449 1678.65330 1727.14751 1730.24842 1816.42601 1871.4476 10 c
As #Gregor pointed out, the goal is to find the x-distance between each occurrence of zero. This can be seen visually by plotting (again, as pointed out by #Gregor, many kudos btw). For example, if we plot the first 20 values of datBig, we can see that the x-distances over which the graph is strictly positive or negative (it sits at zero where there are repeated zeros) are approximately:
2.0, 1.25, 0.5, 0.75, 2.0, 1.0, 0.75, 0.5
t1 <- c.Joseph(datBig)
t2 <- c.Gregor(datBig)
t3 <- c.Henrik(datBig)
t4 <- David(datBig)
## Correct values according to the plot above (positions of incorrect values are noted below)
## 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
## all correct
t1$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
## mostly correct (values at positions 11-13, 17 and 20 are wrong)
t2$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 0.75 0.75 0.75 0.00 0.00 0.00 0.50 0.00 0.75 0.25
## least correct (values at positions 4-5, 7-8, 10-14 and 17-20 are wrong)
t3$Length[1:20]
[1] 2.00 2.00 2.00 0.50 1.00 1.25 0.75 1.25 0.00 1.75 1.75 0.00 1.50 1.50 0.00 0.00 1.25 1.25 1.25 1.25
## all correct
t4$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
# agreement with David's solution
all.equal(t4$Length, t1$Length)
[1] TRUE
Well, it seems the Rcpp solution provided by David is not only accurate but blazing fast.

How to turn a continuous time sample path into discrete time over the (0,1) range using R?

Suppose we observe X(1), X(2), ..., X(N) from a continuous-time process. How can I discretize the time of this data onto the grid {0, 1/N, ..., (N-1)/N, 1} using R?
I really appreciate any help. Thanks.
This would be the way to do it in continuous time:
x <- cumsum(abs(rnorm(20)))
n <- (x-min(x))/diff(range(x))
> n
[1] 0.00000000 0.01884929 0.02874295 0.07230612 0.11253305 0.19770821 0.26356939
[8] 0.33310811 0.36687944 0.47041629 0.53331128 0.61724640 0.72534086 0.74782335
[15] 0.79829820 0.83023417 0.85336221 0.85528100 0.90023497 1.00000000
To get a numeric vector analogous to what you might get from cut or Hmisc::cut2 you can use findInterval:
> findInterval(n, seq(0,1,length=length(n) )/length(n) )
[1] 1 8 11 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
And "normalizing to [0,1] is then simple, even trivial;
> findInterval(n, seq(0,1,length=length(n) ))/length(n)
[1] 0.05 0.05 0.05 0.10 0.15 0.20 0.30 0.35 0.35 0.45 0.55 0.60 0.70 0.75 0.80 0.80 0.85
[18] 0.85 0.90 1.00
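A more literal reading of the question, as a minimal sketch (assuming the i-th of N observations is simply placed at time i/N; x is hypothetical data):
x <- cumsum(abs(rnorm(20)))   # hypothetical sample path
N <- length(x)
times <- seq_len(N) / N       # 1/N, 2/N, ..., 1
head(data.frame(times, x))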

R round to nearest .5 or .1

I have a data set of stock prices that have already been rounded to 2 decimal places (1234.56). I am now trying to round to a specific value which is different for each stock. Here are some examples:
Current Stock Price   Minimum Tick Increment   Desired Output
123.45                .50                      123.50
155.03                .10                      155.00
138.24                .50                      138.00
129.94                .10                      129.90
...                   ...                      ...
I'm not really sure how to do this but am open to suggestions.
Probably,
round(a/b)*b
will do the job.
> a <- seq(.1,1,.13)
> b <- c(.1,.1,.1,.2,.3,.3,.7)
> data.frame(a, b, out = round(a/b)*b)
a b out
1 0.10 0.1 0.1
2 0.23 0.1 0.2
3 0.36 0.1 0.4
4 0.49 0.2 0.4
5 0.62 0.3 0.6
6 0.75 0.3 0.6
7 0.88 0.7 0.7
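Applied to the prices and tick sizes from the question (a sketch; price and tick are just illustrative names), this reproduces the desired output:
> price <- c(123.45, 155.03, 138.24, 129.94)
> tick <- c(0.50, 0.10, 0.50, 0.10)
> round(price / tick) * tick
[1] 123.5 155.0 138.0 129.9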
I'm not familiar with the R language, but my method should work in any language with a ceiling function. I assume you want to round UP to the nearest 0.5:
a = ceiling(a*2) / 2
if a = 0.4, a = ceiling(0.4*2)/2 = ceiling(0.8)/2 = 1/2 = 0.5
if a = 0.9, a = ceiling(0.9*2)/2 = ceiling(1.8)/2 = 2/2 = 1
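In R, that same round-up-to-the-nearest-0.5 idea is one line (a sketch):
> a <- c(0.4, 0.9)
> ceiling(a * 2) / 2
[1] 0.5 1.0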
As JoshO'Brien said in the comments, round_any in the plyr package works very well!
> library(plyr)
> stocks <- c(123.45, 155.03, 138.24, 129.94)
> round_any(stocks,0.1)
[1] 123.4 155.0 138.2 129.9
>
> round_any(stocks,0.5)
[1] 123.5 155.0 138.0 130.0
>
> round_any(stocks,0.1,f = ceiling)
[1] 123.5 155.1 138.3 130.0
>
> round_any(stocks,0.5,f = floor)
[1] 123.0 155.0 138.0 129.5
Read more here:
https://www.rdocumentation.org/packages/plyr/versions/1.8.4/topics/round_any
The taRifx package has just such a function:
> library(taRifx)
> roundnear( seq(.1,1,.13), c(.1,.1,.1,.2,.3,.3,.7) )
[1] 0.1 0.2 0.3 0.4 0.6 0.6 0.7
In your case, just feed it the stock price and the minimum tick increment as its first and second arguments, and it should work its magic.
N.B. This has now been deprecated. See comment.
