Could you please explain to me the difference between these two cases?
> cor(1:10, rep(10,10))
[1] NA
Warning message:
In cor(1:10, rep(10, 10)) : the standard deviation is zero
> cor(1:10, 1:10)
[1] 1
The first one is just a straight line, as is the second, so I would expect the correlation to be one in both cases. What am I not considering? Thanks
Plot the data and it should be clear. The data set
## y doesn't vary
plot(1:10, rep(10,10))
is just a horizontal line. The correlation coefficient is undefined for a horizontal line, since the estimate of the standard deviation of y is 0 (and this appears in the denominator of the correlation coefficient). In contrast,
plot(1:10, 1:10)
is the line:
y = x
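A quick check of that zero denominator:
# The sample standard deviation of a constant vector is exactly 0,
# which is what makes the correlation undefined here
sd(rep(10, 10))  # 0
sd(1:10)         # positive, so cor(1:10, 1:10) is well defined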
If you want to measure how much "in line" the points are,
you can use (one minus) the ratio of the eigenvalues of the variance matrix.
f <- function(x,y) {
e <- eigen(var(cbind(x,y)))$values
1 - e[2] / e[1]
}
# To have values closer to 0, you can square that quantity.
f <- function(x,y) {
e <- eigen(var(cbind(x,y)))$values
( 1 - e[2] / e[1] )^2
}
f( 1:10, 1:10 )
f( 1:10, rep(1,10) )
f( rnorm(100), rnorm(100) ) # Close to 0
f( rnorm(100), 2 * rnorm(100) ) # Closer to 1
f( 2 * rnorm(100), rnorm(100) ) # Similar
This quantity will be 1 if the points are aligned and 0 if the cloud they form is spherical; it is invariant under translations and rotations, non-negative, and symmetric in x and y.
If your situation is not symmetric, i.e., if x and y do not play the same role,
the regression-based approach suggested in Roland's comment makes more sense.
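Roland's comment is not reproduced here, but a minimal sketch of one regression-based measure (assuming R-squared from a simple linear regression of y on x is what is meant) could look like this:
# Hypothetical sketch: R^2 of y regressed on x (asymmetric in x and y)
r2 <- function(x, y) summary(lm(y ~ x))$r.squared
r2(1:10, 1:10)                      # 1: y is an exact linear function of x
r2(1:10, 1:10 + rnorm(10, sd = 2))  # below 1: noise around the line lowers R^2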
Related
I need to evaluate an integral in the following form:
\int_a^b f(x) \int_0^x g(t)(x-t)dtdx
Can you please suggest a way? I assume that this integral can't be done in the standard approach suggested in the following answer:
Standard approach
Update: The functions are given in the following image. f(x) basically represents the pdf of a uniform distribution, but g(t) is a bit more complicated. a and b can be any positive real numbers.
The domain of integration is a simplex (triangle) with vertices (a,a), (a,b) and (b,b). Use the SimplicialCubature package:
library(SimplicialCubature)
alpha <- 3
beta <- 4
g <- function(t){
((beta/t)^(1/2) + (beta/t)^(3/2)) * exp(-(t/beta + beta/t - 2)/(2*alpha^2)) /
(2*alpha*beta*sqrt(2*pi))
}
a <- 1
b <- 2
h <- function(tx){
t <- tx[1]
x <- tx[2]
g(t) * (x-t)
}
S <- cbind(c(a, a), c(a, b), c(b, b))
adaptIntegrateSimplex(h, S)
# $integral
# [1] 0.01962547
#
# $estAbsError
# [1] 3.523222e-08
Another way, less efficient and less reliable, is:
InnerFunc <- function(t, x) { g(t) * (x - t) }
InnerIntegral <- Vectorize(function(x) { integrate(InnerFunc, a, x, x = x)$value})
integrate(InnerIntegral, a, b)
# 0.01962547 with absolute error < 2.2e-16
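Both answers above treat f(x) as a constant. If f really is the pdf of a uniform distribution on [a, b], it equals 1/(b - a) and can simply be folded into the integrand; a sketch along the same lines:
# Sketch: include f(x) = dunif(x, a, b) = 1/(b - a) in the integrand
h_f <- function(tx){
t <- tx[1]
x <- tx[2]
dunif(x, a, b) * g(t) * (x - t)
}
adaptIntegrateSimplex(h_f, S)  # same call as above, with the extra factor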
I am using a simple polynomial to fit a curve.
poly <- function(a, b, c, x) a * x^2 + b * x + c
I'd like to find the value of x that results in the maximum value of the curve. Currently I create a grid with a range of x from 20000 to 50000, run the function for each row, then use max() on the result. It works, but I have a lot of groups and it creates a big dataframe every time I do it. It is very clunky and I feel like there must be a better way.
Some typical coefficients are:
a <- -0.000000179
b <- 0.011153167
c <- 9.896420781
If you rearrange your function so that the variable you want to optimize over comes first, and set the default values like so:
poly <- function(x, a, b, c) a * x^2 + b * x + c
formals(poly)$a <- -0.000000179
formals(poly)$b <- 0.011153167
formals(poly)$c <- 9.896420781
Then you can use the optimize function to maximize over your interval:
optimize(poly, c(20000, 50000), maximum = TRUE)
$`maximum`
[1] 31154.1
$objective
[1] 183.6298
Where $maximum is the x value at which the maximum occurs and $objective is the height.
If a is negative, the maximum of the parabola a * x^2 + b * x + c is reached at -b/(2*a):
a<0
#[1] TRUE
-b/(2*a)
#[1] 31154.1
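The corresponding maximum height is the polynomial evaluated at that point (using the question's original poly(a, b, c, x) signature):
x_max <- -b/(2*a)
poly(a, b, c, x = x_max)  # about 183.63, matching $objective from optimize() above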
You could use optim. I think the other solutions in this thread are more appealing, but I'll write this up for completeness:
a <- -0.000000179
b <- 0.011153167
c <- 9.896420781
o <- optim(
par=list(x=0),
fn=function(x){ -poly(a,b,c,x=x) },
method="Brent",
lower=-50e3, upper=50e3
)
Output:
> o
$par
[1] 31154.1
$value
[1] -183.6298
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
Say I have a vector of random numbers; I can order them lowest to highest:
set.seed(1)
x <- runif(20)
v <- x[order(x)]
Now, say I want to order them but with some degree of noise.
I can randomly move elements like this:
z <-sample(1:20,2)
replace(v, z, v[rev(z)])
but this doesn't necessarily move closely related values. I would be equally likely to switch the 1st and 20th values as the 5th and 6th. I would like to have some control over the switching, so that I switch more closely related values.
Ideally, I would be able to reorder the vector to have a specific Spearman correlation. Rather than the Spearman correlation of the rank order being 1 when the vector is perfectly ordered, is there a way to reorder that same vector of numbers to have, e.g., a Spearman correlation of 0.5?
What if you added some noise to their rankings? This makes sure values don't get moved too far away from their starting point. For example:
set.seed(1)
N <- 50
D <- 3 # controls how far things can move
x <- runif(N)
v <- x[vx <- order(rank(x) + runif(N, -D, D))]
z <- x[order(x)]
layout(matrix(c(1,3,2,3), nrow=2))
plot(v, main ="Ordered")
plot(z, main ="Mixed")
plot(v, z, xlab="ordered", ylab="mixed"); abline(0,1)
I don't think I have completely understood your question, but here's a start. I am simply swapping randomly chosen consecutive values of the sorted vector. You can control the amount of swapping with the n_swaps argument.
noisy_sort <- function(x, n_swaps) {
sorted_x <- sort(x)
indices <- sample(seq_along(x[-1]), n_swaps)
for(i in indices) {
sorted_x[c(i, i+1)] <- sorted_x[c(i+1, i)]
}
sorted_x
}
set.seed(1)
x <- runif(20)
result <- noisy_sort(x, 3)
order(result)
[1] 1 2 3 5 4 6 7 8 9 10 11 13 12 14 15 16 17 19 18 20
# The consecutive pairs 5/4, 13/12, and 19/18 have been swapped
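As a quick check, the Spearman correlation between the fully sorted vector and the lightly shuffled result stays close to 1 for a small n_swaps:
# Only a few adjacent swaps, so the rank correlation remains high
cor(sort(x), result, method = "spearman")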
Here is a very rudimentary algorithm.
Using the Spearman correlation formula for distinct ranks, you can back out the desired sum of squared differences (SSE) between ranks. Then, using a Markov Chain Monte Carlo (MCMC)-style approach, you sample a pair of indices to swap and move to the new vector with the swapped elements if it brings the SSE closer to the desired value.
I used the number of iterations as the stopping criterion. You can change the condition so that it meets a target tolerance level.
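For reference, with distinct ranks Spearman's correlation and the rank differences d_i are related by
\rho = 1 - \frac{6 \sum_i d_i^2}{N (N^2 - 1)}
which rearranges to the expression used in calc_exp_sse below.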
set.seed(1)
n <- 20
x <- runif(n)
v <- sort(x)
calc_exp_sse <- function(rho, N) {
(1 - rho) * N * (N^2 - 1) / 6
}
exp_sse <- calc_exp_sse(0.5, n)
ord <- 1:n
vec <- ord
for (i in 1:1000) {
swap <- vec
swid <- sample(n, 2L)
swap[swid] <- swap[c(swid[2L], swid[1L])]
if (abs(exp_sse - sum((ord-swap)^2)) < abs(exp_sse - sum((ord-vec)^2))) {
vec <- swap
}
}
vec
cor(vec, ord, method="spearman")
#[1] 0.5007519
cor(v, v[vec], method="spearman")
#[1] 0.5007519
Suppose I have the following system of equations:
a * b = 5
sqrt(a * b^2) = 10
How can I solve these equations for a and b in R ?
I guess this problem can be stated as an optimisation problem, with the following function... ?
fn <- function(a, b) {
rate <- a * b
shape <- sqrt(a * b^2)
return(c(rate, shape) )
}
In a comment the poster specifically asks about using solve and optim, so we show how to solve this (1) by hand, (2) using solve, (3) using optim, and (4) with a fixed-point iteration.
1) by hand First note that if we write a = 5/b based on the first equation and substitute that into the second equation we get sqrt(5/b * b^2) = sqrt(5 * b) = 10 so b = 20 and a = 0.25.
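A quick numeric check of the hand solution:
a <- 0.25; b <- 20
c(a * b, sqrt(a * b^2))
## [1]  5 10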
2) solve Regarding the use of solve these equations can be transformed into linear form by taking the log of both sides giving:
log(a) + log(b) = log(5)
0.5 * (log(a) + 2 * log(b)) = log(10)
which can be expressed as:
m <- matrix(c(1, .5, 1, 1), 2)
exp(solve(m, log(c(5, 10))))
## [1] 0.25 20.00
3) optim Using optim we can write this where fn is from the question. fn2 is formed by subtracting off the RHS of the equations and using crossprod to form the sum of squares.
fn2 <- function(x) crossprod( fn(x[1], x[2]) - c(5, 10))
optim(c(1, 1), fn2)
giving:
$par
[1] 0.2500805 19.9958117
$value
[1] 5.51508e-07
$counts
function gradient
97 NA
$convergence
[1] 0
$message
NULL
4) fixed point For this one, rewrite the equations in fixed-point form, i.e. in the form c(a, b) = f(c(a, b)), and then iterate. In general there are several ways to do this and not all of them will converge, but in this case the following works. We use starting values of 1 for both a and b, divide both sides of the first equation by b to put it in fixed-point form, and divide both sides of the second equation by sqrt(a) to do the same for the second:
a <- b <- 1 # starting values
for(i in 1:100) {
a = 5 / b
b = 10 / sqrt(a)
}
data.frame(a, b)
## a b
## 1 0.25 20
You can use the nleqslv package.
library("nleqslv")
You need to define the multivariate function you want to solve for.
fn <- function(x) {
rate <- x[1] * x[2] - 5
shape <- sqrt(x[1] * x[2]^2) - 10
return(c(rate, shape))
}
Then you're good to go.
nleqslv(c(1,5), fn)
Always look at the detailed results. Numerical calculations can be tricky. In this case I got this:
Warning message:
In sqrt(x[1] * x[2]^2) : NaNs produced
That just means the procedure searched a region that included x[1] < 0 and then presumably noped the heck back to the right hand side of the plane.
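Despite the warning, you can sanity-check the returned root by plugging it back into fn:
sol <- nleqslv(c(1,5), fn)
sol$x      # should be close to c(0.25, 20)
fn(sol$x)  # residuals should be essentially zero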
I have a 2396x34 double matrix named y wherein each row (2396) represents a separate situation consisting of 34 consecutive time segments.
I also have a numeric[34] named x that represents a single situation of 34 consecutive time segments.
Currently I am calculating the correlation between each row in y and x like this:
crs[,2] <- cor(t(y),x)
What I need now is to replace the cor function in the above statement with a weighted correlation. The weight vector xy.wt is 34 elements long so that a different weight can be assigned to each of the 34 consecutive time segments.
I found the Weighted Covariance Matrix function cov.wt and thought that if I first scale the data it should work just like the cor function. In fact you can specify for the function to return a correlation matrix as well. Unfortunately it does not seem like I can use it in the same manner because I cannot supply my two variables (x and y) separately.
Does anyone know of a way I can get a weighted correlation in the manner I described without sacrificing much speed?
Edit: Perhaps some mathematical function could be applied to y prior to the cor function in order to get the same results that I'm looking for. Maybe if I multiply each element by xy.wt/sum(xy.wt)?
Edit #2: I found another function, corr, in the boot package.
corr(d, w = rep(1, nrow(d))/nrow(d))
d: A matrix with two columns corresponding to the two variables whose correlation we wish to calculate.
w: A vector of weights to be applied to each pair of observations. The default is equal weights for each pair. Normalization takes place within the function so sum(w) need not equal 1.
This also is not what I need but it is closer.
Edit #3
Here is some code to generate the type of data I am working with:
x<-cumsum(rnorm(34))
y<- t(sapply(1:2396,function(u) cumsum(rnorm(34))))
xy.wt<-1/(34:1)
crs<-cor(t(y),x) #this works but I want to use xy.wt as weight
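For reference, the cov.wt idea mentioned above can be applied row by row by binding x with each row of y; a sketch (correct, but slow, since everything is recomputed for each of the 2396 rows):
# Weighted correlation of x with each row of y via cov.wt (cor = TRUE returns a 2x2 matrix)
crs_slow <- apply(y, 1, function(yi) cov.wt(cbind(x, yi), wt = xy.wt, cor = TRUE)$cor[1, 2])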
Unfortunately the accepted answer is wrong when y is a matrix of more than one row. The error is in the line
vy <- rowSums( w * y * y )
We want to multiply the columns of y by w, but this will multiply the rows by the elements of w, recycled as necessary. Thus
> f(x, y[1, , drop = FALSE], xy.wt)
[1] 0.103021
is correct, because in this case the multiplication is performed element-wise, which is equivalent to column-wise multiplication here, but
> f(x, y, xy.wt)[1]
[1] 0.05463575
gives a wrong answer due to the row-wise multiplication.
We can correct the function as follows
f2 <- function( x, y, w = rep(1,length(x))) {
stopifnot(length(x) == dim(y)[2] )
w <- w / sum(w)
# Center x and y, using the weighted means
x <- x - sum(x * w)
ty <- t(y - colSums(t(y) * w))
# Compute the variance
vx <- sum(w * x * x)
vy <- colSums(w * ty * ty)
# Compute the covariance
vxy <- colSums(ty * x * w)
# Compute the correlation
vxy / sqrt(vx * vy)
}
and check the results against those produced by corr from the boot package:
> res1 <- f2(x, y, xy.wt)
> res2 <- sapply(1:nrow(y),
+ function(i, x, y, w) corr(cbind(x, y[i,]), w = w),
+ x = x, y = y, w = xy.wt)
> all.equal(res1, res2)
[1] TRUE
which in itself gives another way that this problem could be solved.
You can go back to the definition of the correlation.
f <- function( x, y, w = rep(1,length(x))) {
stopifnot( length(x) == dim(y)[2] )
w <- w / sum(w)
# Center x and y, using the weighted means
x <- x - sum(x*w)
y <- y - apply( t(y) * w, 2, sum )
# Compute the variance
vx <- sum( w * x * x )
vy <- rowSums( w * y * y ) # Incorrect: see Heather's remark, in the other answer
# Compute the covariance
vxy <- colSums( t(y) * x * w )
# Compute the correlation
vxy / sqrt(vx * vy)
}
f(x,y)[1]
cor(x,y[1,]) # Identical
f(x, y, xy.wt)
Here is a generalization to compute the weighted Pearson correlation between two matrices (instead of a vector and a matrix, as in the original question):
matrix.corr <- function (a, b, w = rep(1, nrow(a))/nrow(a))
{
# normalize weights
w <- w / sum(w)
# center matrices
a <- sweep(a, 2, colSums(a * w))
b <- sweep(b, 2, colSums(b * w))
# compute weighted correlation
t(w*a) %*% b / sqrt( colSums(w * a**2) %*% t(colSums(w * b**2)) )
}
Using the above example and the correlation function from Heather, we can verify it:
> sum(matrix.corr(as.matrix(x, nrow=34),t(y),xy.wt) - f2(x,y,xy.wt))
[1] 1.537507e-15
In terms of calling syntax, this resembles the unweighted cor:
> a <- matrix( c(1,2,3,1,3,2), nrow=3)
> b <- matrix( c(2,3,1,1,7,3,5,2,8,1,10,12), nrow=3)
> matrix.corr(a,b)
[,1] [,2] [,3] [,4]
[1,] -0.5 0.3273268 0.5 0.9386522
[2,] 0.5 0.9819805 -0.5 0.7679882
> cor(a, b)
[,1] [,2] [,3] [,4]
[1,] -0.5 0.3273268 0.5 0.9386522
[2,] 0.5 0.9819805 -0.5 0.7679882