Why does my random trajectory show a systematic bias in angle? - r

I have a trajectory in 2D (a list of x, y positions).
I am trying to measure the angles of the motion between consecutive points.
So I compute the scalar product of two consecutive displacement vectors and divide by the product of their norms, which gives me the cosine of the angle I am looking for (cos(theta) = (v1 . v2) / (|v1| |v2|)).
However, when I generate totally random trajectories (random x and random y increments), I always get a large number of cosine values very close to -1 or 1, whereas I was expecting all values between -1 and 1 to be equally likely.
Here's the code to generate the trajectories (after corrections from the comments below) and to compute the cosines:
cost = c()
t = seq(0, 500, 0.5)
# random increments, accumulated into a random-walk trajectory
x = cumsum(runif(length(t), -1, 1))
y = cumsum(runif(length(t), -1, 1))
step = 1
dstep = 2
for (j in 1:(length(t) - dstep))
{
  # two consecutive displacement vectors
  x1 = x[j + step] - x[j]
  y1 = y[j + step] - y[j]
  x2 = x[j + dstep] - x[j + step]
  y2 = y[j + dstep] - y[j + step]
  n1 = sqrt(x1 * x1 + y1 * y1)
  n2 = sqrt(x2 * x2 + y2 * y2)
  if ((n1 * n2) > 0)
  {
    scal = x1 * x2 + y1 * y2
    cost = c(cost, scal / (n1 * n2))   # cosine of the turning angle
    # print(paste(n1, " ", n2, " ", n1*n2, " ", scal, " ", x1, " ", x2, " ", scal/(n1*n2), sep=""))
  }
}
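For reference, the same cosines can be computed without the explicit loop; a minimal vectorized sketch (variable names are mine):
dx <- diff(x)                                  # consecutive displacements in x
dy <- diff(y)                                  # consecutive displacements in y
norms <- sqrt(dx^2 + dy^2)                     # norm of each displacement
scal  <- dx[-length(dx)] * dx[-1] + dy[-length(dy)] * dy[-1]
denom <- norms[-length(norms)] * norms[-1]
cost_vec <- (scal / denom)[denom > 0]          # same values as cost above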
When I look at the histogram of the cost values, I always get a large number of values very close to -1 and 1:
> hist(cost, plot=F)
$breaks
[1] -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45
[13] -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15
[25] 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75
[37] 0.80 0.85 0.90 0.95 1.00
$counts
[1] 108 43 32 20 22 21 19 20 19 17 16 19 8 19 23 17 15 10 18
[20] 22 15 19 14 15 18 16 21 11 18 20 16 35 23 24 24 20 23 33
[39] 37 107
Any idea where I'm wrong, or why it behaves this way?
Thanks for the help.

In case somebody else runs into this problem, here is a summary of the solution from the comments:
Actually this distribution of the cosine is what you get when the angles are uniformly distributed! Consider hist(cos(runif(1000, min = 0, max = 2*pi))). So it is working as expected: cos moves quickly through 0 and slowly near 1 and -1. See plot(cos, from = 0, to = 2*pi).
Which is indeed explained there: https://math.stackexchange.com/questions/1153339/distribution-of-cosine-of-uniformly-random-variables
In short, it is expected to see more cosine values close to 1 and -1 when the underlying angles are uniformly random.
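A quick way to convince yourself (a sketch): draw uniformly distributed angles, look at the histogram of their cosines, and overlay the arcsine density 1/(pi*sqrt(1 - x^2)):
theta <- runif(1e5, 0, 2 * pi)                      # uniformly random angles
hist(cos(theta), breaks = 40, freq = FALSE)         # piles up near -1 and 1
curve(1 / (pi * sqrt(1 - x^2)), from = -0.999, to = 0.999, add = TRUE, col = "red")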

Related

compare qqplot of a sample with a reference probability distribution in R

I have a weather dataset. I fit a simple linear model for the two columns Temperature and Humidity, plotted a histogram of its residuals, and calculated their mean and standard deviation.
model <- lm(Temperature..C. ~ Humidity, data = inputData)
model.res = resid(model)
hist(model.res)
mean(model.res)
sd(model.res)
I need to plot a QQ-plot of the residuals versus a zero-mean normal distribution with the estimated standard deviation. I used a Kolmogorov-Smirnov test to compare the sample with a reference distribution, but I don't know how to plot them together:
ks<-ks.test(model.res, "pnorm", mean=0, sd=sd(model.res))
qqnorm(model.res, main="qqnorm")
qqline(model.res)
Data example:
Temperature..C. Humidity
1 9.472222 0.89
2 9.355556 0.86
3 9.377778 0.89
4 8.288889 0.83
5 8.755556 0.83
6 9.222222 0.85
7 7.733333 0.95
8 8.772222 0.89
9 10.822222 0.82
10 13.772222 0.72
11 16.016667 0.67
12 17.144444 0.54
13 17.800000 0.55
14 17.333333 0.51
15 18.877778 0.47
16 18.911111 0.46
17 15.388889 0.60
18 15.550000 0.63
19 14.255556 0.69
20 13.144444 0.70
Here is a solution using ggplot2:
ggplot(model, aes(sample = rstandard(model))) +
  geom_qq() +
  stat_qq_line(dparams = list(sd = sd(model.res)), color = "red") +
  stat_qq_line()
The red line represents the qqline with your sample sd, the black line one with a sd of 1.
You did not ask for that, but you could also add a smoothed qqplot:
data_model <- model
data_model$theo <- unlist(qqnorm(data_model$residuals)[1])
ggplot(data_model, aes(sample = rstandard(data_model))) +
  geom_qq() +
  stat_qq_line(dparams = list(sd = sd(model.res)), color = "red") +
  geom_smooth(aes(x = data_model$theo, y = data_model$residuals), method = "loess")
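For completeness, a base-R sketch of the same comparison against a zero-mean normal with the estimated sd (axis labels are mine):
n    <- length(model.res)
theo <- qnorm(ppoints(n), mean = 0, sd = sd(model.res))   # reference quantiles
qqplot(theo, model.res,
       xlab = "Theoretical N(0, sd) quantiles", ylab = "Residuals")
abline(0, 1, col = "red")   # points near this line suggest a good fit to N(0, sd)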

Simulating tournament results with R

In R, how does one run a tournament simulation?
I have the probability of each team beating each of the other teams, for example:
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
Which would mean something like this:
1 2 3 4 5 6 7 8
1 0 0.76 0.35 0.81 0.95 0.08 0.47 0.26
2 0 0.00 0.24 0.34 0.54 0.48 0.53 0.54
3 0 0.00 0.00 0.47 0.51 0.68 0.50 0.80
4 0 0.00 0.00 0.00 0.52 0.59 0.38 0.91
5 0 0.00 0.00 0.00 0.00 0.05 0.88 0.64
6 0 0.00 0.00 0.00 0.00 0.00 0.23 0.65
7 0 0.00 0.00 0.00 0.00 0.00 0.00 0.77
8 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The next step would be to run a set of simulations, say n = 100000
First the quarter-finals (best out of 3):
1 vs 8
2 vs 7
3 vs 6
4 vs 5
And then the winners of each pair face off in the semi-finals:
1-8 winner VS 4-5 winner
2-7 winner VS 3-6 winner
Winners move on to the final. All is best out of 3.
What approach/package could I use to run bracket simulations? I did find a package called mRchmadness but it's too specific to handle this simulation.
I have created some dummy code that can help you figure out how to do it. The code is not optimized at all, but it is written in a linear fashion so it is easy to follow.
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
prob_res
## Total number of combinations
posscombi<-t(combn(1:8, 2))
## This function plays repetitionMatches matches for every possible pair of teams and returns the winners.
## It effectively reproduces a whole league, assuming the winning probabilities are static.
League <- function(repetitionMatches, posscomb, prob_res)
{
  TotalVect <- integer(0)
  for (i in 1:nrow(posscomb)) {
    pair <- posscomb[i, ]
    Vect <- sample(pair,
                   size = repetitionMatches,
                   prob = c(prob_res[pair[1], pair[2]], 1 - prob_res[pair[1], pair[2]]),
                   replace = TRUE)
    TotalVect <- c(TotalVect, Vect)
  }
  return(table(TotalVect))
}
Result<-League(100,posscomb = posscombi, prob_res= prob_res)
Myorder<-order(Result)
### Quarters
pair1<- c(names(Result)[Myorder[c(1,8)]])
pair2<- c(names(Result)[Myorder[c(2,7)]])
pair3<- c(names(Result)[Myorder[c(3,6)]])
pair4<- c(names(Result)[Myorder[c(4,5)]])
## This function gives you the results of n matches (3 in the example)
PlayMatch <- function(pairs, numMatches) {
  Res <- sample(pairs, size = numMatches,
                prob = c(prob_res[pairs[1], pairs[2]], 1 - prob_res[pairs[1], pairs[2]]),
                replace = TRUE)
  return(table(Res))
}
# Results of the matches
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
winner3<-PlayMatch(pairs = pair3, 3)
winner4<-PlayMatch(pairs = pair4, 3)
## Semis
#Choosing the winning teams
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
pair2<- c(names(winner3)[which.max(winner3)],names(winner4)[which.max(winner4)])
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
## Final
# Same as before
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
winner1<-PlayMatch(pairs = pair1, 3)
paste0( "team ",names(winner1)[which.max(winner1)], " is the winner!")
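To get towards the n = 100000 simulations asked for in the question, one rough sketch is to wrap the bracket logic in a helper and call it with replicate(); run_bracket() and play() are names I made up, the pairings follow the fixed seeding from the question rather than the league-based seeding above, and I sort each pair so the probability lookup stays in the upper triangle of prob_res:
run_bracket <- function(prob_res) {
  play <- function(pairs) {
    pairs <- pairs[order(as.numeric(pairs))]       # keep the upper-triangular lookup valid
    res <- sample(pairs, size = 3, replace = TRUE,
                  prob = c(prob_res[pairs[1], pairs[2]],
                           1 - prob_res[pairs[1], pairs[2]]))
    names(which.max(table(res)))                   # best-of-3 winner
  }
  w1 <- play(c("1", "8")); w2 <- play(c("2", "7"))
  w3 <- play(c("3", "6")); w4 <- play(c("4", "5"))
  play(c(play(c(w1, w4)), play(c(w2, w3))))        # semi-finals, then the final
}
winners <- replicate(100000, run_bracket(prob_res))
prop.table(table(winners))                         # estimated probability of winning the tournament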

SVD imputation in R

I'm trying to use the SVD imputation from the bcv package but all the imputed values are the same (by column).
This is the dataset with missing data
http://pastebin.com/YS9qaUPs
#load data
dataMiss = read.csv('dataMiss.csv')
#impute data
SVDimputation = round(impute.svd(dataMiss)$x, 2)
#find index of missing values
bool = apply(X = dataMiss, 2, is.na)
#put in a new data frame only the imputed value
SVDImpNA = mapply(function(x,y) x[y], as.data.frame(SVDimputation), as.data.frame(bool))
View(SVDImpNA)
head(SVDImpNA)
V1 V2 V3
[1,] -0.01 0.01 0.01
[2,] -0.01 0.01 0.01
[3,] -0.01 0.01 0.01
[4,] -0.01 0.01 0.01
[5,] -0.01 0.01 0.01
[6,] -0.01 0.01 0.01
Where am I wrong?
The impute.svd algorithm works as follows:
Replace all missing values with the corresponding column means.
Compute a rank-k approximation to the imputed matrix.
Replace the values in the imputed positions with the corresponding values from the rank-k approximation computed in Step 2.
Repeat Steps 2 and 3 until convergence.
In your example code, you are using the default k = min(n, p). Then, in Step 2, the rank-k approximation is exactly equal to the imputed matrix, so the algorithm converges after 0 iterations. That is, the algorithm sets all imputed entries to the column means (or something extremely close, up to numerical error).
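For intuition, here is a minimal hand-written sketch of that iteration (my own illustration, not the internals of bcv::impute.svd):
svd_impute_sketch <- function(X, k = 1, maxiter = 100, tol = 1e-6) {
  X <- as.matrix(X)
  miss <- is.na(X)
  Xhat <- X
  Xhat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]    # Step 1: column means
  for (iter in seq_len(maxiter)) {
    s <- svd(Xhat, nu = k, nv = k)
    approx <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)      # Step 2: rank-k approximation
    if (max(abs(approx[miss] - Xhat[miss])) < tol) break   # converged
    Xhat[miss] <- approx[miss]                             # Step 3: update imputed cells only
  }
  Xhat
}
With k = min(n, p), the rank-k approximation already equals Xhat, so the loop stops at the first iteration with the column means still in place.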
If you want to do something other than impute the missing values with the column means, you need to use a smaller value for k. The following code demonstrates this with your sample data:
> library("bcv")
> dataMiss = read.csv('dataMiss.csv')
# k = 3
> SVDimputation = impute.svd(dataMiss, k = 3, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-0.01 0.01
531 1062
# k = 2
> SVDimputation = impute.svd(dataMiss, k = 2, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-11.31 -6.94 -2.59 -2.52 -2.19 -2.02 -1.67 -1.63
25 23 61 2 54 23 5 44
-1.61 -1.2 -0.83 -0.8 -0.78 -0.43 -0.31 -0.15
14 10 13 19 39 1 14 19
-0.14 -0.02 0 0.01 0.02 0.03 0.06 0.17
83 96 94 77 30 96 82 28
0.46 0.53 0.55 0.56 0.83 0.91 1.26 1.53
1 209 83 23 28 111 16 8
1.77 5.63 9.99 14.34
112 12 33 5
Note that for your data, the default maximum number of iterations (100) was too low (I got a warning message). To fix this, I set maxiter=10000.
The problem that you describe likely occurs because impute.svd initially sets all of the NA values to be equal to the column means, and then doesn't change these values upon convergence.
It depends on why you are using SVD imputation in the first place, but if you are flexible, a good solution might be to change the rank used in the SVD call, e.g. by setting k = 1. Currently k is set automatically to min(n, p), where n = nrow and p = ncol, which for your data means k = 3. If you set it to 1 instead (as in the example in the impute.svd documentation), the problem does not occur:
library(bcv)
dataMiss = read.csv("dataMiss.csv")
SVDimputation = round(impute.svd(dataMiss, k = 1)$x, 2)
head(SVDimputation)
[,1] [,2] [,3]
[1,] 0.96 -0.23 0.52
[2,] 0.02 -0.23 -1.92
[3,] -1.87 -0.23 0.52
[4,] -0.92 -0.23 0.52
[5,] 0.49 -0.46 0.52
[6,] -1.87 -0.23 0.52

R programming issue: intervals

I'm trying to figure out a formula for averaging the upper and lower bounds of each interval.
x <- sample(10:40,100,rep=TRUE)
factorx<- factor(cut(x, breaks=nclass.Sturges(x)))
xout<-as.data.frame(table(factorx))
xout<- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
Using the above code in the R editor program, I get the following:
xout
factorx Freq cumFreq relative
1 (9.97,13.8] 14 14 0.14
2 (13.8,17.5] 13 27 0.13
3 (17.5,21.2] 16 43 0.16
4 (21.2,25] 5 48 0.05
5 (25,28.8] 11 59 0.11
6 (28.8,32.5] 8 67 0.08
7 (32.5,36.2] 16 83 0.16
8 (36.2,40] 17 100 0.17
What I want to know is whether there is a way to calculate the midpoint of each interval. For the first interval, for example, it would be:
(13.8 + 9.97)/2
I believe this is called the class midpoint in statistics.
Here's a one-liner that is probably close to what you want:
> sapply(strsplit(levels(xout$factorx), ","), function(x) sum(as.numeric(gsub("[[:space:]]", "", chartr(old = "(]", new = " ", x))))/2)
[1] 11.885 15.650 19.350 23.100 26.900 30.650 34.350 38.100
# One possible solution is to split the labels on "(", "," and "]" (xout is your data frame)
x1<-strsplit(as.character(xout$factorx),",|\\(|]")
x2<-do.call(rbind,x1)
xout$lower=as.numeric(x2[,2])
xout$higher=as.numeric(x2[,3])
xout$ave<-rowMeans(xout[,c("lower","higher")])
> head(xout,3)
factorx Freq cumFreq relative higher lower ave
1 (9.97,13.7] 15 15 0.15 13.7 9.97 11.835
2 (13.7,17.5] 14 29 0.14 17.5 13.70 15.600
3 (17.5,21.2] 12 41 0.12 21.2 17.50 19.350
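Another option (a sketch, assuming you are happy to compute the breaks yourself instead of letting cut() pick them) is to keep the numeric break points around, so the midpoints follow directly:
nb      <- nclass.Sturges(x)
brks    <- seq(min(x), max(x), length.out = nb + 1)
factorx <- cut(x, breaks = brks, include.lowest = TRUE)
mids    <- head(brks, -1) + diff(brks) / 2   # class midpoints, one per interval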

How to turn a continuous time sample path into discrete time over the (0,1) range using R?

Suppose we observe X(1), X(2), ..., X(N) from a continuous-time process. How can I discretize the time of this data onto the grid {0, 1/N, ..., (N-1)/N, 1} using R?
I really appreciate any help. Thanks.
This would be the way to do it in continuous time:
x <- cumsum(abs(rnorm(20)))
n <- (x-min(x))/diff(range(x))
> n
[1] 0.00000000 0.01884929 0.02874295 0.07230612 0.11253305 0.19770821 0.26356939
[8] 0.33310811 0.36687944 0.47041629 0.53331128 0.61724640 0.72534086 0.74782335
[15] 0.79829820 0.83023417 0.85336221 0.85528100 0.90023497 1.00000000
To get a numeric vector analogous to what you might get from cut or Hmisc::cut2 you can use findInterval:
> findInterval(n, seq(0,1,length=length(n) )/length(n) )
[1] 1 8 11 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
And "normalizing to [0,1] is then simple, even trivial;
> findInterval(n, seq(0,1,length=length(n) ))/length(n)
[1] 0.05 0.05 0.05 0.10 0.15 0.20 0.30 0.35 0.35 0.45 0.55 0.60 0.70 0.75 0.80 0.80 0.85
[18] 0.85 0.90 1.00
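If the goal is simply to attach an evenly spaced time index on [0, 1] to the N observations (my reading of the grid in the question; note the spacing here is 1/(N-1) so both endpoints are hit), a short sketch:
x      <- cumsum(abs(rnorm(20)))                 # sample path, as above
t_grid <- (seq_along(x) - 1) / (length(x) - 1)   # 0, 1/(N-1), ..., 1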
