Starting from the centre of this hexagonal grid and working outwards, I would like to generate a list of 100000 pairs of coordinates (without any coordinates duplicated). I'm looking for suggestions on the best way to do this. If possible, I'd like them ordered from closest to the centre to furthest from the centre.
The following function will find the centroids of hexagons in a regular grid, centered at x = 0, y = 0, where each hexagon is 1 unit wide:
hex_rings_at_d <- function(d) {
  if (d == 0) return(data.frame(x = 0, y = 0))
  d2 <- sqrt(d^2 - (0.5 * d)^2)
  d3 <- 0.5 * d
  f <- function(a, b) seq(a, b, length.out = d + 1)[-1]
  data.frame(x = c(f( d,  d3), f( d3, -d3), f(-d3, -d),
                   f(-d, -d3), f(-d3,  d3), f( d3,  d)),
             y = c(f(0,  d2), f( d2,  d2), f( d2, 0),
                   f(0, -d2), f(-d2, -d2), f(-d2, 0)))
}
If we get all centroids out to a distance of 577 we will have just over 1,000,000 of them, which is more than enough to cover the 100,000 requested:
df <- do.call(rbind, lapply(0:577, hex_rings_at_d))
nrow(df)
#> [1] 1000519
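This matches the closed-form count: there is one centroid at the centre and each ring d > 0 contributes 6*d centroids, so out to ring 577 we get
1 + 6 * 577 * 578 / 2
#> [1] 1000519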
These will be ordered from inside out:
head(df)
#>      x          y
#> 1  0.0  0.0000000
#> 2  0.5  0.8660254
#> 3 -0.5  0.8660254
#> 4 -1.0  0.0000000
#> 5 -0.5 -0.8660254
#> 6  0.5 -0.8660254
And we can confirm they are arranged in a regular hexagonal grid. Here are the first 169 centroids:
plot(df[1:169,])
Created on 2022-06-24 by the reprex package (v2.0.1)
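If exactly 100,000 pairs ordered strictly from closest to furthest are wanted (within a ring the centroids are not all exactly equidistant from the origin), a minimal sketch is to sort by squared distance from the centre and keep the first 100,000 rows:
df100k <- head(df[order(df$x^2 + df$y^2), ], 100000)
nrow(df100k)
#> [1] 100000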
I would like to perform nonlinear least squares regression in R where I simultaneously minimize the squared residuals of three models (see below). Now, the three models share some of the parameters, in my example, parameters b and d.
Is there a way of doing this with nls(), or with either of the packages minpack.lm or nlsr?
So, ideally, I would like to generate the objective function (the sum of least squares of all models together) and regress all parameters at once: a1, a2, a3, b, c1, c2, c3 and d.
(I am trying to avoid running three independent regressions and then perform some averaging on b and d.)
my_model <- function(x, a, b, c, d) {
  a * b ^ (x - c) + d
}
# x values
x <- seq(0, 10, 0.2)

# Shared parameters
b <- 2
d <- 10

a1 <- 1
c1 <- 1
y1 <- my_model(x,
               a = a1,
               b = b,
               c = c1,
               d = d) + rnorm(length(x))

a2 <- 2
c2 <- 5
y2 <- my_model(x,
               a = a2,
               b = b,
               c = c2,
               d = d) + rnorm(length(x))

a3 <- -2
c3 <- 3
y3 <- my_model(x,
               a = a3,
               b = b,
               c = c3,
               d = d) + rnorm(length(x))
plot(
  y1 ~ x,
  xlim = range(x),
  ylim = d + c(-50, 50),
  type = 'b',
  col = 'red',
  ylab = 'y'
)
lines(y2 ~ x, type = 'b', col = 'green')
lines(y3 ~ x, type = 'b', col = 'blue')
Below we run nls (using a slightly modified model) and nlxb (from nlsr), although nlxb stops before convergence. Despite these problems, both nevertheless give results that visually fit the data well. The problems suggest that there is an issue with the model itself, so in the Other section, guided by the nlxb output, we show how to fix the model, giving a submodel of the original that fits the data easily with both nls and nlxb and also gives a good fit. At the end, in the Note section, we provide the data in reproducible form.
nls
Assuming the setup shown reproducibly in the Note at the end, reformulate the problem for the nls plinear algorithm by defining a right-hand-side matrix whose columns multiply each of the linear parameters, a1, a2, a3 and d, respectively. plinear does not require starting values for those parameters, which simplifies the setup. It will report them as .lin1, .lin2, .lin3 and .lin4, respectively.
To get starting values we used a simpler model with no grouping and a grid search over b from 1 to 10 and c also from 1 to 10, using nls2 from the package of the same name. We also found that nls still produced errors, but by using abs in the formula, as shown, it ran to completion.
The difficulties with the model suggest that there is a fundamental problem with it, and in the Other section we discuss how to fix it up.
xx <- c(x, x, x)
yy <- c(y1, y2, y3)

# starting values using nls2
library(nls2)
fo0 <- yy ~ cbind(b ^ abs(xx - c), 1)
st0 <- data.frame(b = c(1, 10), c = c(1, 10))
fm0 <- nls2(fo0, start = st0, alg = "plinear-brute")

# run nls using starting values from above
g <- rep(1:3, each = length(x))
fo <- yy ~ cbind((g == 1) * b ^ abs(xx - c[g]),
                 (g == 2) * b ^ abs(xx - c[g]),
                 (g == 3) * b ^ abs(xx - c[g]),
                 1)
st <- with(as.list(coef(fm0)), list(b = b, c = c(c, c, c)))
fm <- nls(fo, start = st, alg = "plinear")
plot(yy ~ xx, col = g)
for(i in unique(g)) lines(predict(fm) ~ xx, col = i, subset = g == i)
fm
giving:
Nonlinear regression model
model: yy ~ cbind((g == 1) * b^abs(xx - c[g]), (g == 2) * b^abs(xx - c[g]), (g == 3) * b^abs(xx - c[g]), 1)
data: parent.frame()
      b     c1     c2     c3  .lin1  .lin2  .lin3  .lin4
  1.997  0.424  1.622  1.074  0.680  0.196 -0.532  9.922
residual sum-of-squares: 133
Number of iterations to convergence: 5
Achieved convergence tolerance: 5.47e-06
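Since .lin1, .lin2, .lin3 and .lin4 correspond to a1, a2, a3 and d, here is a small sketch for extracting them from the fit under the original names:
co <- coef(fm)
setNames(co[c(".lin1", ".lin2", ".lin3", ".lin4")], c("a1", "a2", "a3", "d"))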
(continued after plot)
nlsr
With nlsr it would be done like this. No grid search for starting values was needed, and adding abs was not needed either. The b and d values seem similar to the nls solution but the other coefficients differ. Visually, both solutions seem to fit the data.
On the other hand, from the JSingval column we see that the Jacobian is rank deficient, which caused nlxb to stop without producing SE values, and the convergence is in doubt (although it may be sufficient, given that the plot, not shown, visually seems like a good fit). We discuss how to fix this up in the Other section.
g1 <- g == 1; g2 <- g == 2; g3 <- g == 3

fo2 <- yy ~ g1 * (a1 * b ^ (xx - c1) + d) +
            g2 * (a2 * b ^ (xx - c2) + d) +
            g3 * (a3 * b ^ (xx - c3) + d)

st2 <- list(a1 = 1, a2 = 1, a3 = 1, b = 1, c1 = 1, c2 = 1, c3 = 1, d = 1)
fm2 <- nlxb(fo2, start = st2)
fm2
giving:
vn: [1] "yy" "g1" "a1" "b" "xx" "c1" "d" "g2" "a2" "c2" "g3" "a3" "c3"
no weights
nlsr object: x
residual sumsquares = 133.45 on 153 observations
after 16 Jacobian and 22 function evaluations
name coeff SE tstat pval gradient JSingval
a1 3.19575 NA NA NA 9.68e-10 4097
a2 0.64157 NA NA NA 8.914e-11 662.5
a3 -1.03096 NA NA NA -1.002e-09 234.9
b 1.99713 NA NA NA -2.28e-08 72.57
c1 2.66146 NA NA NA -2.14e-09 10.25
c2 3.33564 NA NA NA -3.955e-11 1.585e-13
c3 2.0297 NA NA NA -7.144e-10 1.292e-13
d 9.92363 NA NA NA -2.603e-12 3.271e-14
We can calculate SEs using nls2 as a second stage, but this still does not address the underlying problem with the model that the singular values suggest.
summary(nls2(fo2, start = coef(fm2), algorithm = "brute-force"))
giving:
Formula: yy ~ g1 * (a1 * b^(xx - c1) + d) + g2 * (a2 * b^(xx - c2) + d) +
g3 * (a3 * b^(xx - c3) + d)
Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a1  3.20e+00   5.38e+05     0.0        1    
a2  6.42e-01   3.55e+05     0.0        1    
a3 -1.03e+00   3.16e+05     0.0        1    
b   2.00e+00   2.49e-03   803.4   <2e-16 ***
c1  2.66e+00   9.42e-02    28.2   <2e-16 ***
c2  3.34e+00   2.43e+05     0.0        1    
c3  2.03e+00   8.00e+05     0.0        1    
d   9.92e+00   4.42e+05     0.0        1    
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.959 on 145 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: NA
Other
When nls has trouble fitting a model, it often suggests that there is something wrong with the model itself. Playing around with it a bit, guided by the JSingval column in the nlsr output above (which suggests that the c parameters or d might be the problem), we find that if we fix all c parameter values to 0 then the model is easy to fit, given sufficiently good starting values, and it still gives a low residual sum of squares.
library(nls2)
fo3 <- yy ~ cbind((g==1) * b ^ xx, (g==2) * b ^ xx, (g==3) * b ^ xx, 1)
st3 <- coef(fm0)["b"]
fm3 <- nls(fo3, start = st3, alg = "plinear")
giving:
Nonlinear regression model
model: yy ~ cbind((g == 1) * b^xx, (g == 2) * b^xx, (g == 3) * b^xx, 1)
data: parent.frame()
b .lin1 .lin2 .lin3 .lin4
1.9971 0.5071 0.0639 -0.2532 9.9236
residual sum-of-squares: 133
Number of iterations to convergence: 4
Achieved convergence tolerance: 1.67e-09
which the following anova indicates is comparable to fm from above despite having 3 fewer parameters:
anova(fm3, fm)
giving:
Analysis of Variance Table
Model 1: yy ~ cbind((g == 1) * b^xx, (g == 2) * b^xx, (g == 3) * b^xx, 1)
Model 2: yy ~ cbind((g == 1) * b^abs(xx - c[g]), (g == 2) * b^abs(xx - c[g]), (g == 3) * b^abs(xx - c[g]), 1)
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1    148        134                         
2    145        133  3  0.385    0.14   0.94
We can redo fm3 using nlxb like this:
fo4 <- yy ~ g1 * (a1 * b ^ xx + d) +
            g2 * (a2 * b ^ xx + d) +
            g3 * (a3 * b ^ xx + d)
st4 <- list(a1 = 1, a2 = 1, a3 = 1, b = 1, d = 1)
fm4 <- nlxb(fo4, start = st4)
fm4
giving:
nlsr object: x
residual sumsquares = 133.45 on 153 observations
after 24 Jacobian and 33 function evaluations
name coeff SE tstat pval gradient JSingval
a1 0.507053 0.005515 91.94 1.83e-132 8.274e-08 5880
a2 0.0638554 0.0008735 73.11 4.774e-118 1.26e-08 2053
a3 -0.253225 0.002737 -92.54 7.154e-133 -4.181e-08 2053
b 1.99713 0.002294 870.6 2.073e-276 -2.55e-07 147.5
d 9.92363 0.09256 107.2 3.367e-142 -1.219e-11 10.26
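As a quick visual check (a sketch, analogous to the nls plot above), the fm4 fit can be overlaid on the data:
plot(yy ~ xx, col = g)
co4 <- coef(fm4)
for(i in 1:3) lines(x, co4[paste0("a", i)] * co4["b"]^x + co4["d"], col = i)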
Note
The assumed input below is the same as in the question, except we additionally set the seed to make it reproducible.
set.seed(123)
my_model <- function(x, a, b, c, d) a * b ^ (x - c) + d
x <- seq(0, 10, 0.2)
b <- 2; d <- 10 # shared
a1 <- 1; c1 <- 1
y1 <- my_model(x, a = a1, b = b, c = c1, d = d) + rnorm(length(x))
a2 <- 2; c2 <- 5
y2 <- my_model(x, a = a2, b = b, c = c2, d = d) + rnorm(length(x))
a3 <- -2; c3 <- 3
y3 <- my_model(x, a = a3, b = b, c = c3, d = d) + rnorm(length(x))
I'm not sure this is really the best way, but you could minimize the sum of the squared residuals using optim().
# start values
params <- c(a1 = 1, a2 = 1, a3 = 1, b = 1, c1 = 1, c2 = 1, c3 = 1, d = 1)

# minimize total sum of squares of residuals
fun <- function(p) {
  sum(
    (y1 - my_model(x, p["a1"], p["b"], p["c1"], p["d"]))^2 +
    (y2 - my_model(x, p["a2"], p["b"], p["c2"], p["d"]))^2 +
    (y3 - my_model(x, p["a3"], p["b"], p["c3"], p["d"]))^2
  )
}
out <- optim(params, fun, method="BFGS")
out$par
# a1 a2 a3 b c1 c2 c3
# 0.8807542 1.0241804 -2.8805848 1.9974615 0.7998103 4.0030597 3.5184600
# d
# 9.8764917
And we can add the fitted curves on top of the plot:
curve(my_model(x, out$par["a1"], out$par["b"], out$par["c1"], out$par["d"]), col = "red",   add = TRUE)
curve(my_model(x, out$par["a2"], out$par["b"], out$par["c2"], out$par["d"]), col = "green", add = TRUE)
curve(my_model(x, out$par["a3"], out$par["b"], out$par["c3"], out$par["d"]), col = "blue",  add = TRUE)
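If rough standard errors for the optim solution are wanted, one possible sketch (not part of the original answer) is to ask optim() for the numerical Hessian and apply the usual Gauss-Newton approximation. Note that, given the near rank deficiency discussed in the other answer, the Hessian may be nearly singular, in which case these SEs will be huge or solve() may fail.
# refit, asking for the Hessian at the optimum
out <- optim(params, fun, method = "BFGS", hessian = TRUE)
n <- 3 * length(x)             # 153 observations in total
p <- length(out$par)           # 8 parameters
sigma2 <- out$value / (n - p)  # residual variance estimate
# vcov is approximately 2 * sigma^2 * solve(H), since H is roughly 2 * J'J near the optimum
se <- sqrt(diag(2 * sigma2 * solve(out$hessian)))
round(se, 4)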
I have defined a function to calculate the relationship between height (h) and diameter (dbh) of trees based on equations extracted from 2 publications. My goal is to use the relationship established in paper 1 (Xiangtao) to predict the values of variables in an equation in paper 2 (Marechaux and Chave). I would like to test to see over what diameter range [x:y] the generated nls() curve of paper 2 fits paper 1. Currently, I keep getting an error (I believe in plot())
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
if I use anything except 1 as the lower bound of the range [x:y], i.e. dbh.min:dbh.max.
My function is as follows:
# Plant.Functional.Type constants...
Dsb1 <- 2.09
Dsb2 <- 0.54
Db1 <- 0.93
Db2 <- 0.84
BDb1 <- 2.66
BDb2 <- 0.48
Eb1 <- 1.41
Eb2 <- 0.65
# # # # # # # # # # # # # # # # # # # # # # # # # # #
Generate.curve <- function(b1, b2, dbh.min, dbh.max){
  # calculate Xiangtao's allometry...
  tmp_h <- c(dbh.min:dbh.max)
  for (dbh in dbh.min:dbh.max)
  {
    h = b1*dbh^(b2)
    tmp_h[dbh] = h
  }

  # plot to check curve
  plot(dbh.min:dbh.max, tmp_h)

  # define secondary function for Marechaux and Chave allometry
  h_fxn <- function(hlim, dbh, ah){
    h = hlim * (dbh / (dbh + ah))
    return(h)
  }

  # use nonlinear least squares model to solve for ah and hlim
  # set model inputs
  start.ah <- 1
  start.hlim <- 5
  tmp_v <- cbind(dbh.min:dbh.max, tmp_h)
  tmp.fit <- nls(tmp_h ~ h_fxn(hlim, dbh.min:dbh.max, ah),
                 start = list(hlim = start.hlim, ah = start.ah),
                 algorithm = "port", upper = list(hlim = 75, ah = 99))

  # seems to be no way of extracting ah and hlim from tmp.fit via subset
  # extract manually and then check fit with
  # lines(dbh.min:dbh.max, hlim * (dbh.min:dbh.max/(dbh.min:dbh.max + ah)))
  # for equation h = hlim * (dbh / (dbh + ah)) from Marechaux and Chave
  return(tmp.fit)
}
# # # # # # # # # # # # # # # # # # # # # # # # # # #
This works great for
Generate.curve(Dsb1,Dsb2,1,100)
lines(1:100, 36.75 * (1:100/(1:100 + 52.51)))
But I would like to be able to examine the curve fit in ranges such as [80:100] as well.
I have been trying to figure out why Generate.curve(Dsb1,Dsb2,80,100) returns an error for about 3 days now. Thanks for any help.
Your problem lies in this section:
tmp_h <- c(dbh.min:dbh.max)
for (dbh in dbh.min:dbh.max)
{
  h = b1*dbh^(b2)
  tmp_h[dbh] = h
}
Think about what happens when you set dbh.min to 80 and dbh.max to 100:
tmp_h <- 80:100
for (dbh in 80:100)
{
  h = b1*dbh^(b2)
  tmp_h[dbh] = h
}
What happens on the first cycle of the loop? Well, tmp_h is length 21, but on the first cycle dbh is 80, and you are assigning a number to tmp_h[dbh], which is tmp_h[80]. By the time the loop has finished, tmp_h will have the correct values stored, but they will be in the indices 80:100. So tmp_h will have the numbers 80:100 stored in the first 21 indices, then a bunch of NAs, then the correct numbers in the last 21 indices.
So change it to:
tmp_h <- c(dbh.min:dbh.max)
for (dbh in dbh.min:dbh.max)
{
  h = b1*dbh^(b2)
  tmp_h[dbh - dbh.min + 1] = h
}
and it will work.
However, you don't actually need a loop at all here, since R uses vectorized operations, so this whole section can be replaced with:
tmp_h <- b1 * (dbh.min:dbh.max)^(b2)
and then when you do
Generate.curve(Dsb1,Dsb2,80,100)
lines(80:100, 36.75 * (80:100/(80:100 + 52.51)))
you get the expected plot.
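For completeness, here is a minimal vectorised sketch of the whole function (same inputs and outputs as Generate.curve above; the upper bounds are passed as a named vector, which nls also accepts):
Generate.curve <- function(b1, b2, dbh.min, dbh.max) {
  dbh <- dbh.min:dbh.max
  tmp_h <- b1 * dbh^b2                               # Xiangtao's allometry
  plot(dbh, tmp_h)
  h_fxn <- function(hlim, dbh, ah) hlim * (dbh / (dbh + ah))
  nls(tmp_h ~ h_fxn(hlim, dbh, ah),
      start = list(hlim = 5, ah = 1),
      algorithm = "port", upper = c(hlim = 75, ah = 99))
}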
I tried to implement the AdaBoost algorithm of Freund and Schapire as close to the original as possible (see p. 2 here: http://rob.schapire.net/papers/explaining-adaboost.pdf):
library(rpart)
library(OneR)
maxdepth <- 1
T <- 100 # number of rounds
# Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1}
myocarde <- read.table("http://freakonometrics.free.fr/myocarde.csv", head = TRUE, sep = ";")
#myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
data <- data.frame(x, y)
# Initialize: D_1(i) = 1/m for i = 1,...,m
D <- rep(1/m, m)
H <- replicate(T, list())
a <- vector(mode = "numeric", T)
set.seed(123)
# For t = 1,...,T
for(t in 1:T) {
  # Train weak learner using distribution D_t
  # Get weak hypothesis h_t: X -> {-1, +1}
  data_D_t <- data[sample(m, 10*m, replace = TRUE, prob = D), ]
  H[[t]] <- rpart(y ~ ., data = data_D_t, maxdepth = maxdepth, method = "class")

  # Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i]
  h <- predict(H[[t]], x, type = "class")
  e <- sum(h != y) / m

  # Choose a_t = 0.5 * log((1-e) / e)
  a[t] <- 0.5 * log((1-e) / e)

  # Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t
  # where Z_t is a normalization factor (chosen so that D_t+1 will be a distribution)
  D <- D * exp(-a[t] * y * as.numeric(h))
  D <- D / sum(D)
}
# Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t=1 to T)
newdata <- x
H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class"))))
H_x <- t(a * t(H_x))
pred <- sign(rowSums(H_x))
#H
#a
eval_model(pred, y)
##
## Confusion matrix (absolute):
## Actual
## Prediction -1 1 Sum
## -1 0 1 1
## 1 29 41 70
## Sum 29 42 71
##
## Confusion matrix (relative):
## Actual
## Prediction -1 1 Sum
## -1 0.00 0.01 0.01
## 1 0.41 0.58 0.99
## Sum 0.41 0.59 1.00
##
## Accuracy:
## 0.5775 (41/71)
##
## Error rate:
## 0.4225 (30/71)
##
## Error rate reduction (vs. base rate):
## -0.0345 (p-value = 0.6436)
As can be seen, the accuracy of the model is horrible compared to other AdaBoost implementations, e.g.:
library(JOUSBoost)
## JOUSBoost 2.1.0
boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
##
## Confusion matrix (absolute):
## Actual
## Prediction -1 1 Sum
## -1 29 0 29
## 1 0 42 42
## Sum 29 42 71
##
## Confusion matrix (relative):
## Actual
## Prediction -1 1 Sum
## -1 0.41 0.00 0.41
## 1 0.00 0.59 0.59
## Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)
My question
Could you please give me a hint what went wrong in my implementation? Thank you
Edit
The final and corrected code can be found in my blog post: Understanding AdaBoost – or how to turn Weakness into Strength
There are quite a few contributing factors as to why your implementation is not working.
You were not using rpart correctly. The AdaBoost algorithm does not mention upsampling with the weights, but rpart itself can accept weights. My example below shows how rpart should be used for this purpose.
Calculation of the weighted error was wrong. You were calculating the error proportion (the number of samples predicted incorrectly divided by the number of samples). AdaBoost uses the sum of the weights of the incorrectly predicted samples (sum(D[y != yhat])).
The final predictions seemed to be incorrect too; I just ended up doing a simple loop.
Next time I recommend diving into the source code of the other implementations you are comparing against.
https://github.com/cran/JOUSBoost/blob/master/R/adaboost.R uses almost identical code to my example below, and probably would have helped guide you originally.
Additionally, using T as a variable could potentially interfere with the logical TRUE and its shorthand T, so I'd avoid it.
### packages ###
library(rpart)
library(OneR)
### parameters ###
maxdepth <- 1
rounds <- 100
set.seed(123)
### data ###
myocarde <- read.table("http://freakonometrics.free.fr/myocarde.csv", head = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
dataset <- data.frame(x, y)
### initialisation ###
D <- rep(1/m, m)
H <- list()
a <- vector(mode = "numeric", length = rounds)
for (i in seq.int(rounds)) {
  # train weak learner
  H[[i]] = rpart(y ~ ., data = dataset, weights = D, maxdepth = maxdepth, method = "class")
  # predictions
  yhat <- predict(H[[i]], x, type = "class")
  yhat <- as.numeric(as.character(yhat))
  # weighted error
  e <- sum(D[yhat != y])
  # alpha coefficient
  a[i] <- 0.5 * log((1 - e) / e)
  # updating weights (D)
  D <- D * exp(-a[i] * y * yhat)
  D <- D / sum(D)
}

# predict with each weak learner on dataset
y_hat_final <- vector(mode = "numeric", length = m)
for (i in seq(rounds)) {
  pred = predict(H[[i]], dataset, type = "class")
  pred = as.numeric(as.character(pred))
  y_hat_final = y_hat_final + (a[i] * pred)
}
pred <- sign(y_hat_final)

eval_model(pred, y)
> eval_model(pred, y)
Confusion matrix (absolute):
Actual
Prediction -1 1 Sum
-1 29 0 29
1 0 42 42
Sum 29 42 71
Confusion matrix (relative):
Actual
Prediction -1 1 Sum
-1 0.41 0.00 0.41
1 0.00 0.59 0.59
Sum 0.41 0.59 1.00
Accuracy:
1 (71/71)
Error rate:
0 (0/71)
Error rate reduction (vs. base rate):
1 (p-value < 2.2e-16)
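As a side note, the final-prediction loop above can also be written in vectorised form; a sketch that gives the same result:
# matrix of each weak learner's predictions (m rows, one column per round)
H_x <- sapply(H, function(h) as.numeric(as.character(predict(h, dataset, type = "class"))))
pred2 <- sign(H_x %*% a)
all(pred2 == pred)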
I have two variables: A is drawn from rnorm(30, -0.5, 1) and B is drawn from rnorm(30, .5, 2). I want to make A and B correlated with correlations of .9, .5 and .0.
You are describing a multivariate normal distribution, which can be generated with the mvrnorm function from MASS:
library(MASS)
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
set.seed(144)
vals <- mvrnorm(10000, c(meanA, meanB),
                matrix(c(sdA^2, correlation*sdA*sdB,
                         correlation*sdA*sdB, sdB^2), nrow = 2))
mean(vals[,1])
# [1] -0.4883265
mean(vals[,2])
# [1] 0.5201586
sd(vals[,1])
# [1] 0.9994628
sd(vals[,2])
# [1] 1.992816
cor(vals[,1], vals[,2])
# [1] 0.8999285
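If the sample correlation should match the target exactly at n = 30 (rather than only in expectation), mvrnorm also has an empirical argument; a short sketch:
vals30 <- mvrnorm(30, c(meanA, meanB),
                  matrix(c(sdA^2, correlation*sdA*sdB,
                           correlation*sdA*sdB, sdB^2), nrow = 2),
                  empirical = TRUE)
cor(vals30[, 1], vals30[, 2])
# [1] 0.9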
As an alternative, please consider the following. Let the random variables X ~ N(0,1) and Y ~ N(0,1) be independent. Then the random variables X and rho*X + sqrt(1 - rho^2)*Y are both distributed N(0,1), but are now correlated with correlation rho. So possible R code could be
# Define the parameters
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
n <- 10000 # You want 30
# Generate from independent standard normals
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1)
# Transform
x2 <- x # could be avoided
y2 <- correlation*x + sqrt(1 - correlation^2)*y
# Fix up means and standard deviations
x3 <- meanA + sdA*x2
y3 <- meanB + sdB*y2
# Check summary statistics
mean(x3)
# [1] -0.4981958
mean(y3)
# [1] 0.4999068
sd(x3)
# [1] 1.014299
sd(y3)
# [1] 2.022377
cor(x3, y3)
# [1] 0.9002529
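A sketch that wraps this transform in a small helper and produces the three requested correlations (0.9, 0.5, 0.0) for n = 30; with only 30 draws the sample correlations will only be approximately equal to the targets:
make_corr_pair <- function(n, meanA, sdA, meanB, sdB, rho) {
  x <- rnorm(n)
  y <- rnorm(n)
  cbind(A = meanA + sdA * x,
        B = meanB + sdB * (rho * x + sqrt(1 - rho^2) * y))
}
sapply(c(0.9, 0.5, 0.0), function(rho) cor(make_corr_pair(30, -0.5, 1, 0.5, 2, rho))[1, 2])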
I created the correlate package to be able to create a correlation between any type of variable (regardless of distribution) given a certain tolerance. It does so by permutation.
install.packages('correlate')
library('correlate')
A <- rnorm(30, -0.5, 1)
B <- rnorm(30, .5, 2)
C <- correlate(cbind(A,B), 0.9)
# 0.9012749
D <- correlate(cbind(A,B), 0.5)
# 0.5018054
E <- correlate(cbind(A,B), 0.0)
# -0.00407327
You can pretty much decide the whole matrix if you want (for multiple variables), by giving a matrix as second argument.
Ironically, you can also use it to create a multivariate normal.
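A hedged sketch of the matrix form described above, assuming (as stated, but not verified against the package documentation) that correlate() accepts a full target correlation matrix as its second argument:
C2 <- rnorm(30, 0, 1)                  # a third, hypothetical variable
target <- matrix(c(1.0, 0.9, 0.5,
                   0.9, 1.0, 0.0,
                   0.5, 0.0, 1.0), nrow = 3)
res <- correlate(cbind(A, B, C2), target)
cor(res)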