How to create a covariance matrix in R?

I'm trying to build a covariance matrix from scratch, i.e. to reimplement what the cov() function does. The task is to do it without using any package, so I wrote my own functions:
meanf <- function(x){
  sum(x) / length(x)
}
sampleCov <- function(x,y){
  stopifnot(identical(length(x), length(y)))
  sum((x - meanf(x)) * (y - meanf(y))) / (length(x) - 1)
}
> sampleCov(winequality_red$quality, winequality_red$alcohol)
[1] 0.409789
Unfortunately, I'm stuck here. None of the loops I have tried produced the full matrix. Of course it's possible to copy the sampleCov call for every possible combination of columns, but that's not the point.

If I understand you correctly, you want to recreate the covariance matrix returned by the cov function.
OP's function:
meanf <- function(x){
  sum(x) / length(x)
}
sampleCov <- function(x,y){
  stopifnot(identical(length(x), length(y)))
  sum((x - meanf(x)) * (y - meanf(y))) / (length(x) - 1)
}
You can try it this way; I have used the mtcars data here:
Covariance matrix:
vars <- names(mtcars)
egrid <- expand.grid(vars, vars)
egrid <- data.frame(sapply(egrid, as.character), stringsAsFactors = FALSE)
egrid <- egrid[order(egrid$Var1, egrid$Var2),]
mat <- vector("list", nrow(egrid))
for(i in 1:nrow(egrid)){
  mat[[i]] <- sampleCov(mtcars[,egrid[i,"Var1"]], mtcars[,egrid[i,"Var2"]])
}
finaldat <- cbind(egrid, cov = do.call('rbind', mat))
finaldat_list <- split(finaldat, finaldat$Var1)
mat_form <- do.call('cbind', finaldat_list)
cov_values <- mat_form[,grepl("\\.cov",names(mat_form))]
col_values <- mat_form[,paste0(egrid$Var1[1],".Var2")]
final_matrix_cov <- cbind(col_values, cov_values)
Sample Output:
> final_matrix_cov
col_values am.cov carb.cov cyl.cov disp.cov
9 mpg 1.80393145 -5.36310484 -9.1723790 -633.09721
20 cyl -0.46572581 1.52016129 3.1895161 199.66028
31 disp -36.56401210 79.06875000 199.6602823 15360.79983
42 hp -8.32056452 83.03629032 101.9314516 6721.15867
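A shorter route to the same kind of result, sketched here under the assumption that every column of the data frame is numeric, is to fill a square matrix with a double loop over column pairs. The helper name covMatrix is made up for this example; meanf and sampleCov are the functions from the question, and cov() is called only to check the result.

```r
# Functions from the question
meanf <- function(x) {
  sum(x) / length(x)
}
sampleCov <- function(x, y) {
  stopifnot(identical(length(x), length(y)))
  sum((x - meanf(x)) * (y - meanf(y))) / (length(x) - 1)
}

# covMatrix is a hypothetical helper name, not part of the original thread
covMatrix <- function(df) {
  vars <- names(df)
  p <- length(vars)
  out <- matrix(NA_real_, nrow = p, ncol = p, dimnames = list(vars, vars))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      out[i, j] <- sampleCov(df[[i]], df[[j]])
    }
  }
  out
}

cov_mtcars <- covMatrix(mtcars)
# Comparison with the built-in, used as a check only
isTRUE(all.equal(cov_mtcars, cov(mtcars)))
```

Because the covariance is symmetric, the inner loop could run over j >= i only and mirror the value, which halves the work for wide data.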

You need the matrix multiplication %*%.
sampleCov <- function(x,y){
  stopifnot(identical(length(x), length(y)))
  sum((x - mean(x)) %*% (y - mean(y))) / (length(x) - 1)
}
> sampleCov(rnorm(10000),rnorm(10000))
[1] 0.01808466
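Between two vectors, %*% gives a single inner product, so the function above still returns one covariance at a time. To get the whole matrix in one step, the columns can be centred first and the centred matrix multiplied by its own transpose. A minimal sketch, assuming a numeric data frame or matrix; covMatrixMat is a name invented for this example.

```r
# covMatrixMat is a hypothetical name; base R only
covMatrixMat <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  col_means <- colSums(X) / n
  Xc <- sweep(X, 2, col_means)   # subtract each column's mean
  (t(Xc) %*% Xc) / (n - 1)       # p x p matrix of sample covariances
}

m <- covMatrixMat(mtcars)
isTRUE(all.equal(m, cov(mtcars)))
```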

This is probably a little more than you need, but it should answer your question, and I think it is a nice illustration of the practical application of covariances, correlations, etc.
library(data.table)
library(ggplot2)
library(scales)  # for percent()
# load the data
link <- ""
df <- data.table(read.csv(link))
# calculate the necessary values:
# I) expected returns for the two assets
er_x <- mean(df$x)
er_y <- mean(df$y)
# II) risk (standard deviation) as a risk measure
sd_x <- sd(df$x)
sd_y <- sd(df$y)
# III) covariance
cov_xy <- cov(df$x, df$y)
# create 1000 portfolio weights (omegas)
x_weights <- seq(from = 0, to = 1, length.out = 1000)
# create a data.table that contains the weights for the two assets
two_assets <- data.table(wx = x_weights,
wy = 1 - x_weights)
# calculate the expected returns and standard deviations for the 1000 possible portfolios
two_assets[, ':=' (er_p = wx * er_x + wy * er_y,
sd_p = sqrt(wx^2 * sd_x^2 +
wy^2 * sd_y^2 +
2 * wx * (1 - wx) * cov_xy))]
# lastly plot the values
ggplot() +
geom_point(data = two_assets, aes(x = sd_p, y = er_p, color = wx)) +
geom_point(data = data.table(sd = c(sd_x, sd_y), mean = c(er_x, er_y)),
aes(x = sd, y = mean), color = "red", size = 3, shape = 18) +
# Miscellaneous Formatting
theme_bw() + ggtitle("Possible Portfolios with Two Risky Assets") +
xlab("Volatility") + ylab("Expected Returns") +
scale_y_continuous(label = percent, limits = c(0, max(two_assets$er_p) * 1.2)) +
scale_x_continuous(label = percent, limits = c(0, max(two_assets$sd_p) * 1.2)) +
scale_color_continuous(name = expression(omega[x]), labels = percent)
See the link below for all details.
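The two-asset volatility formula used for sd_p can be checked numerically against the standard deviation of the weighted return series. This is only a sketch on simulated returns; the data, the seed and the weight 0.3 are made up for illustration and have nothing to do with the linked data set.

```r
set.seed(123)
# Simulated returns for two correlated assets (illustrative values only)
x <- rnorm(250, mean = 0.010, sd = 0.05)
y <- 0.5 * x + rnorm(250, mean = 0.005, sd = 0.03)

wx <- 0.3
wy <- 1 - wx

# Closed-form portfolio volatility, same expression as in the answer
sd_formula <- sqrt(wx^2 * sd(x)^2 + wy^2 * sd(y)^2 + 2 * wx * wy * cov(x, y))
# Volatility computed directly from the portfolio return series
sd_direct <- sd(wx * x + wy * y)

isTRUE(all.equal(sd_formula, sd_direct))
```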


Can you reverse scaling and centering in the axes of a ggplot2 plot?

I want to scale the predictor variable of a regression model but I then want to plot the original values on the x-axis for intelligibility using ggplot2. I have attempted to do this using scale_x_continuous().
library(tidyverse)
library(scales)
x <- rnorm(100, 10, 1.5)
Zx <- scale(x)
Zy <- .8*Zx + rnorm(100, 0, sqrt(1 - (.8^2)))
df <- tibble(Zx = Zx, y = Zy)
SD_scale <- attr(df$Zx, "scaled:scale")
center <- attr(df$Zx, "scaled:center")
unscale_trans <- function(x){
  trans_new("unscale",
            function(x) center + x * SD_scale,
            function(x) scale(x))
}
df %>%
  ggplot(aes(x=Zx,y=y)) + geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(trans = unscale_trans())
This throws the following warnings:
1: In c(-1, 1) * (width * mul + add) :
Recycling array of length 1 in vector-array arithmetic is deprecated.
Use c() or as.vector() instead.
2: In c(-1, 1) * (width * mul + add) :
Recycling array of length 1 in vector-array arithmetic is deprecated.
Use c() or as.vector() instead.
And results in oddly scaled labels.
[Figure: resulting plot, x-axis labelled "Scaled x predictor"]
Is there a means of satisfactorily 'unscaling' the axis?
Thank you in advance for your help.
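One possible approach, offered as a sketch rather than a definitive answer, is to leave the data on the z scale and relabel the axis only: pick "nice" break values on the original scale, convert them to z units for positioning, and print the original values as labels. The helper names to_original and to_z are invented here, and the centre and SD are computed with mean() and sd(), which is what scale() uses by default. Working with a plain vector instead of the one-column matrix returned by scale() may also avoid the recycling warnings.

```r
set.seed(42)
x <- rnorm(100, 10, 1.5)
center <- mean(x)
SD_scale <- sd(x)
Zx <- (x - center) / SD_scale          # plain vector, same values as scale(x)
y <- 0.8 * Zx + rnorm(100, 0, sqrt(1 - 0.8^2))

# Hypothetical helpers for moving between the two scales
to_original <- function(z) center + z * SD_scale
to_z <- function(v) (v - center) / SD_scale

orig_breaks <- pretty(x)               # round values on the original scale
z_breaks <- to_z(orig_breaks)          # where those values sit on the z axis

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(Zx = Zx, y = y), aes(x = Zx, y = y)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    scale_x_continuous(
      breaks = z_breaks,
      labels = function(z) as.character(round(to_original(z), 1))
    )
}
```

Printing p draws the plot. The model is still fitted on the scaled predictor; only the tick positions and labels change.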

Add a Passing-Bablok regression line

I have to perform many comparisons between different measurement methods and I have to use the Passing-Bablok regression approach.
I would like to take advantage of ggplot2 and faceting, but I don't know how to add a geom_smooth layer based on the Passing-Bablok regression.
I was thinking about something like:
Furthermore, I would also need to show the regression line equation, with confidence interval for intercept and slope parameters, in each plot.
Edit with partial solution
I've found a partial solution combining the code provided in this post and in this answer.
library(ggplot2)
## Regression algorithm
passing_bablok.fit <- function(x, y) {
  x_name <- deparse(substitute(x))
  lx <- length(x)
  l <- lx*(lx - 1)/2
  k <- 0
  S <- rep(NA, l)  # one slot per pair of points
  for (i in 1:(lx - 1)) {
    for (j in (i + 1):lx) {
      k <- k + 1
      S[k] <- (y[i] - y[j])/(x[i] - x[j])
    }
  }
  S.sort <- sort(S)
  N <- length(S.sort)
  neg <- length(subset(S.sort,S.sort < 0))
  K <- floor(neg/2)
  if (N %% 2 == 1) {
    b <- S.sort[(N+1)/2+K]
  } else {
    b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])
  }
  a <- median(y - b * x)
  res <- as.vector(c(a,b))
  names(res) <- c("(Intercept)", x_name)
  class(res) <- "Passing_Bablok"
  res
}
## Computing confidence intervals
passing_bablok <- function(formula, data, R = 100, weights = NULL){
  ret <- boot::boot(
    data = model.frame(formula, data),
    statistic = function(data, ind) {
      data <- data[ind, ]
      args <- rlang::parse_exprs(colnames(data))
      names(args) <- c("y", "x")
      rlang::eval_tidy(rlang::expr(passing_bablok.fit(!!!args)), data, env = rlang::current_env())
    },
    R = R
  )
  class(ret) <- c("Passing_Bablok", class(ret))
  ret
}
## Plotting confidence bands
predictdf.Passing_Bablok <- function(model, xseq, se, level) {
  pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
  if(se) {
    preds <- tcrossprod(model$t, cbind(1, xseq))
    data.frame(
      x = xseq,
      y = pred,
      ymin = apply(preds, 2, function(x) quantile(x, probs = (1-level)/2)),
      ymax = apply(preds, 2, function(x) quantile(x, probs = 1-((1-level)/2)))
    )
  } else {
    return(data.frame(x = xseq, y = pred))
  }
}
An example of usage:
z <- data.frame(x = rnorm(100, mean = 100, sd = 5),
                y = rnorm(100, mean = 110, sd = 8))
ggplot(z, aes(x, y)) +
  geom_point() +
  geom_smooth(method = passing_bablok) +
  geom_abline(slope = 1, intercept = 0)
So far, I haven't been able to show the regression line equation, with confidence interval for intercept and slope parameters (as +- or in parentheses).
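One way to get intervals for the two parameters is to take percentile limits of bootstrap replicates of the fit and paste them into a plain-text label. The sketch below is self-contained and uses base R only. pb_fit is a hypothetical, compact restatement of the shifted-median rule used in this thread, with one deliberate difference: it also discards infinite slopes caused by tied x values. The 200 replicates and the 95% level are arbitrary choices.

```r
# Hypothetical compact fit: shifted median of pairwise slopes, as in this thread
pb_fit <- function(x, y) {
  S <- outer(y, y, "-") / outer(x, x, "-")   # all pairwise slopes
  S <- S[upper.tri(S)]                       # keep each pair once
  S <- sort(S[is.finite(S)])                 # drop 0/0 and +-Inf from ties
  N <- length(S)
  K <- floor(sum(S < 0) / 2)                 # offset used by the thread's code
  b <- if (N %% 2 == 1) {
    S[(N + 1) / 2 + K]
  } else {
    sqrt(S[N / 2 + K] * S[N / 2 + K + 1])
  }
  c(intercept = median(y - b * x), slope = b)
}

set.seed(1)
z <- data.frame(x = 101:200 + rnorm(100, sd = 10),
                y = 101:200 + rnorm(100, sd = 8))
est <- pb_fit(z$x, z$y)

# Nonparametric bootstrap: resample rows, refit, collect both parameters
B <- 200
reps <- t(replicate(B, {
  i <- sample(nrow(z), replace = TRUE)
  pb_fit(z$x[i], z$y[i])
}))
ci <- apply(reps, 2, quantile, probs = c(0.025, 0.975))

# Plain-text label: estimate followed by its 95% percentile interval
eq <- sprintf("y = %.2f (%.2f, %.2f) + %.2f (%.2f, %.2f) x",
              est["intercept"], ci[1, "intercept"], ci[2, "intercept"],
              est["slope"], ci[1, "slope"], ci[2, "slope"])
eq
```

The string can be passed to annotate("text", label = eq) without parse = TRUE. For faceted plots, the same computation could be run once per group and the labels supplied through geom_text with a small data frame.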
You've arguably done the difficult part with the PaBa regression.
Here's a basic solution using your function:
z <- data.frame(x = 101:200+rnorm(100,sd=10),
                y = 101:200+rnorm(100,sd=8))
mycoefs <- as.numeric(passing_bablok.fit(x = z$x, y=z$y))
paba_eqn <- function(thecoefs) {
  l <- list(m = format(thecoefs[2], digits = 2),
            b = format(abs(thecoefs[1]), digits = 2))
  if(thecoefs[1] >= 0){
    eq <- substitute(italic(y) == m %.% italic(x) + b,l)
  } else {
    eq <- substitute(italic(y) == m %.% italic(x) - b,l)
  }
  # return a string that annotate(..., parse = TRUE) can turn back into an expression
  as.character(as.expression(eq))
}
ggplot(z, aes(x, y)) +
  geom_point() +
  geom_smooth(method = passing_bablok) +
  geom_abline(slope = 1, intercept = 0) +
  annotate("text",x = 110, y = 220, label = paba_eqn(mycoefs), parse = TRUE)
Note that the equation will vary because of rnorm in the data creation.
The solution could definitely be made more slick and robust, but it works for both positive and negative intercepts.
Equation concept sourced from:

Performing residual bootstrap using kernel regression in R

Kernel regression is a non-parametric technique for estimating the conditional expectation of a random variable. It uses local averaging of the response value, Y, to capture a non-linear relationship between X and Y.
I have used the bootstrap for kernel density estimation and now want to use it for kernel regression as well. I have been told to use residual bootstrapping for kernel regression and have read a couple of papers on it, but I am unsure how to implement it. The programming has been done in R using the FKSUM package. Below is my attempt at standard resampling for kernel regression:
library(FKSUM)
n <- 5000
sample.size <- 500
B.replications <- 200
x <- rbeta(n, 2, 2) * 10
y <- 3 * sin(2 * x) + 10 * (x > 5) * (x - 5)
y <- y + rnorm(n) + (rgamma(n, 2, 2) - 1) * (abs(x - 5) + 3)
# taking x.y to be the population
x.y <- data.frame(x, y)
xs <- seq(min(x), max(x), length = 1000)
ftrue <- 3 * sin(2 * xs) + 10 * (xs > 5) * (xs - 5)
# Sample from the population
sample.ind <- sample(seq_len(n), size = sample.size, replace = FALSE)
sample.reg <- x.y[sample.ind, ]
x_s <- sample.reg$x
y_s <- sample.reg$y
fhat_loc_lin.pop <- fk_regression(x, y)
fhat_loc_lin.sample <- fk_regression(x = x_s, y = y_s)
plot(x, y, col = rgb(.7, .7, .7, .3), pch = 16, xlab = 'x',
     ylab = 'y', main = 'Local linear estimator with amise bandwidth')
lines(xs, ftrue, col = 2, lwd = 3)
# fhat_loc_lin was not defined in the post; the sample fit is assumed here
lines(fhat_loc_lin.sample, lty = 2, lwd = 2)
n.B.sample = sample.size # sample bootstrap size
boot.reg.mat.X <- matrix(0,ncol=B.replications, nrow=n.B.sample)
boot.reg.mat.Y <- matrix(0,ncol=B.replications, nrow=n.B.sample)
fhat_loc_lin.boot <- matrix(0,ncol = B.replications, nrow=100)
# one column of fitted values per bootstrap replication
Temp.reg.y <- matrix(0,ncol = B.replications,nrow = 1000)
for(i in 1:B.replications){
  sequence.x.boot <- seq(from=1,to=n.B.sample,by=1)
  sample.ind.boot <- sample(sequence.x.boot, size = sample.size, replace = TRUE)
  boot.reg.mat <- sample.reg[sample.ind.boot,]
  boot.reg.mat.X <- boot.reg.mat$x
  boot.reg.mat.Y <- boot.reg.mat$y
  fhat_loc_lin.boot <- fk_regression(x = boot.reg.mat.X ,
                                     y = boot.reg.mat.Y,
                                     h = fhat_loc_lin.sample$h)
  lines(y=fhat_loc_lin.boot$y,x= fhat_loc_lin.sample$x, col =c(i) )
  Temp.reg.y[,i] <- fhat_loc_lin.boot$y
}
quan.reg.l <- vector()
quan.reg.u <- vector()
for(i in 1:length(xs)){
  quan.reg.l[i] <- quantile(x = Temp.reg.y[i,],probs = 0.025)
  quan.reg.u[i] <- quantile(x = Temp.reg.y[i,],probs = 0.975)
}
# Lower Bound
Temp.reg.2 <- quan.reg.l
lines(y=Temp.reg.2,x=fhat_loc_lin.boot$x ,col="red",lwd=4,lty=1)
# Upper Bound
Temp.reg.3 <- quan.reg.u
lines(y=Temp.reg.3,x=fhat_loc_lin.boot$x ,col="navy",lwd=4,lty=1)
Asking the question on here now since I haven't received any response on CV. Any help would be greatly appreciated!
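For comparison, the residual bootstrap keeps the x values fixed and resamples the centred residuals of the fitted curve instead of resampling (x, y) pairs. The sketch below shows the steps with a small Gaussian-kernel Nadaraya-Watson smoother written in base R as a stand-in for fk_regression. The function name nw_smooth, the hand-picked bandwidth h = 0.3 and the simulated data are all assumptions made for this example. Plain residual resampling also assumes errors with constant variance; for heteroscedastic data like the simulation in the question, a wild bootstrap may be more appropriate.

```r
# Stand-in smoother (hypothetical helper): Nadaraya-Watson with a Gaussian kernel
nw_smooth <- function(x, y, xeval, h) {
  W <- dnorm(outer(xeval, x, "-") / h)   # one row of kernel weights per evaluation point
  as.vector(W %*% y) / rowSums(W)
}

set.seed(1)
n <- 300
x <- runif(n, 0, 10)
y <- 3 * sin(2 * x) + rnorm(n)           # homoscedastic noise, for illustration
h <- 0.3                                 # bandwidth fixed by hand
xs <- seq(min(x), max(x), length.out = 100)

# 1. Fit once, at the observed x and on the evaluation grid
fit_at_x <- nw_smooth(x, y, x, h)
fit_grid <- nw_smooth(x, y, xs, h)

# 2. Centred residuals
res <- y - fit_at_x
res <- res - mean(res)

# 3. Rebuild responses from fitted values plus resampled residuals, then refit
B <- 200
boot_fits <- replicate(B, {
  y_star <- fit_at_x + sample(res, n, replace = TRUE)
  nw_smooth(x, y_star, xs, h)
})

# 4. Pointwise 95% percentile band on the grid
band <- apply(boot_fits, 1, quantile, probs = c(0.025, 0.975))

plot(x, y, col = rgb(.7, .7, .7, .5), pch = 16)
lines(xs, fit_grid, lwd = 2)
lines(xs, band[1, ], col = "red", lty = 2)
lines(xs, band[2, ], col = "navy", lty = 2)
```

Every replicate is evaluated on the same grid xs and with the same bandwidth, so the rows of boot_fits line up and the pointwise quantiles are comparable.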

vector field visualisation R

I have a big text file with a lot of rows. Every row corresponds to one vector.
This is the example of each row:
x y dx dy
99.421875 52.078125 0.653356799108 0.782479314511
The first two columns are the coordinates of the beginning of the vector, and the last two columns are the coordinate increments (the end minus the start).
I need to draw a picture of this vector field (all the vectors in one picture).
How could I do this?
Thank you
If there is a lot of data (the question says "big file"),
plotting the individual vectors may not give a very readable plot.
Here is another approach: the vector field describes a way of deforming something drawn on the plane;
apply it to a white noise image.
vector_field <- function(
  f,  # Function describing the vector field
  xmin=0, xmax=1, ymin=0, ymax=1,
  width=600, height=600,
  iterations=50,  # default assumed; the argument list is truncated in the post
  trace=TRUE      # default assumed
) {
  z <- matrix(runif(width*height),nr=height)
  i_to_x <- function(i) xmin + i / width * (xmax - xmin)
  j_to_y <- function(j) ymin + j / height * (ymax - ymin)
  x_to_i <- function(x) pmin( width, pmax( 1, floor( (x-xmin)/(xmax-xmin) * width ) ) )
  y_to_j <- function(y) pmin( height, pmax( 1, floor( (y-ymin)/(ymax-ymin) * height ) ) )
  i <- col(z)
  j <- row(z)
  x <- i_to_x(i)
  y <- j_to_y(j)
  res <- z
  for(k in 1:iterations) {
    v <- matrix( f(x, y), nc=2 )
    x <- x+.01*v[,1]
    y <- y+.01*v[,2]
    i <- x_to_i(x)
    j <- y_to_j(y)
    res <- res + z[cbind(i,j)]
    if(trace) {
      cat(k, "/", iterations, "\n", sep="")
    }
  }
  if(trace) {
    image(res>quantile(res,.6), col=0:1)
  }
  res
}
# Sample data
van_der_Pol <- function(x,y, mu=1) c(y, mu * ( 1 - x^2 ) * y - x )
res <- vector_field(
  van_der_Pol,
  xmin=-3, xmax=3, ymin=-3, ymax=3,
  width=800, height=800
)
You may want to apply some image processing to the result to make it more readable.
image(res > quantile(res,.6), col=0:1)
In your case, the vector field is not described by a function:
you can use the value of the nearest neighbour or some 2-dimensional interpolation
(e.g., from the akima package).
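As a rough sketch of the nearest-neighbour option, the helper below turns a data frame with columns x, y, dx, dy into a function of (x, y) that returns the vector stored at the closest data point. The name make_nn_field and the four-point toy data are invented for this example. The return value is the dx values followed by the dy values, which matches what matrix(f(x, y), nc=2) expects above. The brute-force search is slow on a fine grid with many data points, so treat it as an illustration only.

```r
# Hypothetical helper: nearest-neighbour lookup in a table of vectors
make_nn_field <- function(vf) {
  function(x, y) {
    idx <- vapply(seq_along(x), function(k) {
      which.min((vf$x - x[k])^2 + (vf$y - y[k])^2)
    }, integer(1))
    c(vf$dx[idx], vf$dy[idx])   # all dx first, then all dy
  }
}

# Toy data: four vectors at the corners of the unit square
vf <- data.frame(x  = c(0, 1, 0, 1),
                 y  = c(0, 0, 1, 1),
                 dx = c(1, 0, -1, 0),
                 dy = c(0, 1, 0, -1))
f <- make_nn_field(vf)

# (0.1, 0.2) is closest to (0, 0); (0.9, 0.8) is closest to (1, 1)
f(c(0.1, 0.9), c(0.2, 0.8))
```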
With ggplot2, you can do something like this:
library(ggplot2)
df <- data.frame(x=runif(10),y=runif(10),dx=rnorm(10),dy=rnorm(10))
ggplot(data=df, aes(x=x, y=y)) + geom_segment(aes(xend=x+dx, yend=y+dy), arrow = arrow(length = unit(0.3,"cm")))
This is taken almost directly from the geom_segment help page.
OK, here's a base solution:
DF <- data.frame(x=rnorm(10),y=rnorm(10),dx=runif(10),dy=runif(10))
plot(NULL, type = "n", xlim=c(-3,3),ylim=c(-3,3))
arrows(DF[,1], DF[,2], DF[,1] + DF[,3], DF[,2] + DF[,4])
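Applied to the file layout from the question, the same idea might look like the sketch below. It assumes a whitespace-separated file with a header row x y dx dy. Only the first data row comes from the question; the other rows are made up, and the cap of 2000 arrows is an arbitrary choice to keep a large file readable.

```r
# In practice: vf <- read.table("path/to/file.txt", header = TRUE)
txt <- "x y dx dy
99.421875 52.078125 0.653356799108 0.782479314511
100.5 52.0 0.70 0.71
99.0 53.5 0.60 0.80
101.0 54.0 0.75 0.66"
vf <- read.table(text = txt, header = TRUE)

# Thin out very large fields by drawing a random subset of the rows
max_arrows <- 2000
if (nrow(vf) > max_arrows) {
  vf <- vf[sample(nrow(vf), max_arrows), ]
}

# Empty plot sized to hold both the start and the end of every arrow
plot(NULL, type = "n",
     xlim = range(vf$x, vf$x + vf$dx),
     ylim = range(vf$y, vf$y + vf$dy),
     xlab = "x", ylab = "y")
arrows(vf$x, vf$y, vf$x + vf$dx, vf$y + vf$dy, length = 0.05)
```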
Here is an example from the help pages of the pracma package.
library(pracma)
f <- function(x, y) x^2 - y^2
xx <- c(-1, 1); yy <- c(-1, 1)
vectorfield(f, xx, yy, scale = 0.1)
for (xs in seq(-1, 1, by = 0.25)) {
  sol <- rk4(f, -1, 1, xs, 100)
  lines(sol$x, sol$y, col="darkgreen")
}
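You can also use quiver (from the pracma package).
library(pracma)
xyRange <- seq(-1*pi,1*pi,0.2)
temp <- meshgrid(xyRange,xyRange)
u <- sin(temp$Y)
v <- cos(temp$X)
# the snippet in the post stops here; a minimal plotting call is assumed below
plot(range(xyRange), range(xyRange), type = "n", xlab = "x", ylab = "y")
quiver(temp$X, temp$Y, u, v)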

How to get something like Matplotlib's symlog scale in ggplot or lattice?

For very heavy-tailed data of both positive and negative sign, I sometimes like to see all the data on a plot without hiding structure in the unit interval.
When plotting with Matplotlib in Python, I can achieve this by selecting a symlog scale, which uses a logarithmic transform outside some interval, and linear plotting inside it.
Previously in R I have constructed similar behavior by transforming the data with an arcsinh on a one-off basis. However, tick labels and the like are very tricky to do right (see below).
Now, I am faced with a bunch of data where the subsetting in lattice or ggplot would be highly convenient. I don't want to use Matplotlib because of the subsetting, but I sure am missing symlog!
I see that ggplot uses a package called scales, which solves a lot of this problem (if it works). Automatically choosing tick mark and label placing still looks pretty hard to do nicely though. Some combination of log_breaks and cbreaks perhaps?
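For context on why arcsinh is a reasonable stand-in for symlog: asinh(x) is close to x near zero and close to sign(x) * log(2 * |x|) for large |x|, so it moves smoothly from linear to logarithmic behaviour and is odd-symmetric. A quick numerical check in base R, with arbitrarily chosen test values:

```r
x <- c(-1e6, -1e3, -1, -0.1, 0, 0.1, 1, 1e3, 1e6)
a <- asinh(x)

# Far from zero: compare with the signed-log approximation
big <- abs(x) >= 1e3
log_approx <- sign(x[big]) * log(2 * abs(x[big]))
max_err_big <- max(abs(a[big] - log_approx))

# Near zero: compare with the identity
small <- abs(x) <= 0.1
max_err_small <- max(abs(a[small] - x[small]))

c(max_err_big = max_err_big, max_err_small = max_err_small)
```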
Edit 2:
The following code is not too bad
library(scales)
sinh.scaled <- function(x,scale=1){ sinh(x)*scale }
asinh.scaled <- function(x,scale=1) { asinh(x/scale) }
asinh_breaks <- function (n = 5, scale = 1, base=10)
{
  function(x) {
    log_breaks.callable <- log_breaks(n=n,base=base)
    rng <- range(x, na.rm = TRUE)
    minx <- floor(rng[1])
    maxx <- ceiling(rng[2])
    if (maxx == minx)
      return(sinh.scaled(minx, scale=scale))
    big.vals <- 0
    if (minx < (-scale)) {
      big.vals = big.vals + 1
    }
    if (maxx>scale) {
      big.vals = big.vals + 1
    }
    brk <- c()
    if (minx < (-scale)) {
      rbrk <- log_breaks.callable( c(-min(maxx,-scale), -minx ) )
      rbrk <- -rev(rbrk)
      brk <- c(brk,rbrk)
    }
    if ( !(minx>scale | maxx<(-scale)) ) {
      rng <- c(max(minx,-scale), min(maxx,scale))
      minc <- floor(rng[1])
      maxc <- ceiling(rng[2])
      by <- floor((maxc - minc)/(n-big.vals)) + 1
      cb <- seq(minc, maxc, by = by)
      brk <- c(brk,cb)
    }
    if (maxx>scale) {
      brk <- c(brk,log_breaks.callable( c(max(minx,scale), maxx )))
    }
    brk
  }
}
asinh_trans <- function(scale = 1) {
  trans <- function(x) asinh.scaled(x, scale)
  inv <- function(x) sinh.scaled(x, scale)
  trans_new(paste0("asinh-", format(scale)), trans, inv,
            asinh_breaks(scale = scale),
            domain = c(-Inf, Inf))
}
A solution based on the package scales and inspired by Brian Diggs' post mentioned by @Dennis:
library(scales)
symlog_trans <- function(base = 10, thr = 1, scale = 1){
  trans <- function(x)
    ifelse(abs(x) < thr, x, sign(x) *
             (thr + scale * suppressWarnings(log(sign(x) * x / thr, base))))
  inv <- function(x)
    ifelse(abs(x) < thr, x, sign(x) *
             base^((sign(x) * x - thr) / scale) * thr)
  breaks <- function(x){
    sgn <- sign(x[which.max(abs(x))])
    if(all(abs(x) < thr))
      pretty_breaks()(x)
    else if(prod(x) >= 0){
      if(min(abs(x)) < thr)
        sgn * unique(c(pretty_breaks()(c(min(abs(x)), thr)),
                       log_breaks(base)(c(max(abs(x)), thr))))
      else
        sgn * log_breaks(base)(sgn * x)
    } else {
      if(min(abs(x)) < thr)
        unique(c(sgn * log_breaks()(c(max(abs(x)), thr)),
                 pretty_breaks()(c(sgn * thr, x[which.min(abs(x))]))))
      else
        unique(c(-log_breaks(base)(c(thr, -x[1])),
                 pretty_breaks()(c(-thr, thr)),
                 log_breaks(base)(c(thr, x[2]))))
    }
  }
  trans_new(paste("symlog", thr, base, scale, sep = "-"), trans, inv, breaks)
}
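To see what the forward and inverse maps do without any plotting, they can be restated as two standalone base R functions and checked for a round trip. The names symlog and symlog_inv are made up for this check, and abs(x) replaces sign(x) * x, which is the same quantity. With the defaults (base 10, threshold 1, scale 1), values inside (-1, 1) pass through unchanged and each further decade adds one unit.

```r
# Standalone restatement of the trans / inv pair (hypothetical names)
symlog <- function(x, base = 10, thr = 1, scale = 1) {
  ifelse(abs(x) < thr, x,
         sign(x) * (thr + scale * log(abs(x) / thr, base)))
}
symlog_inv <- function(z, base = 10, thr = 1, scale = 1) {
  ifelse(abs(z) < thr, z,
         sign(z) * thr * base^((abs(z) - thr) / scale))
}

x <- c(-1000, -10, -1, -0.5, 0, 0.5, 1, 10, 1000)
z <- symlog(x)

# Expected up to floating-point error: -4 -2 -1 -0.5 0 0.5 1 2 4
isTRUE(all.equal(z, c(-4, -2, -1, -0.5, 0, 0.5, 1, 2, 4)))
# The inverse recovers the input
isTRUE(all.equal(symlog_inv(z), x))
```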
I am not sure whether the impact of a parameter scale is the same as in Python, but here are a couple of comparisons (see Python version here):
library(ggplot2)
data <- data.frame(x = seq(-50, 50, 0.01), y = seq(0, 100, 0.01))
data$y2 <- sin(data$x / 3)
# symlogx
ggplot(data, aes(x, y)) + geom_line() + theme_bw() +
  scale_x_continuous(trans = symlog_trans())
# symlogy
ggplot(data, aes(y, x)) + geom_line() + theme_bw() +
  scale_y_continuous(trans = symlog_trans())
# symlog both, threshold = 0.015 for y
# not too pretty because of too many breaks in short interval
ggplot(data, aes(x, y2)) + geom_line() + theme_bw() +
  scale_y_continuous(trans=symlog_trans(thr = 0.015)) +
  scale_x_continuous(trans = "symlog")
# Again symlog both, threshold = 0.15 for y
ggplot(data, aes(x, y2)) + geom_line() + theme_bw() +
  scale_y_continuous(trans=symlog_trans(thr = 0.15)) +
  scale_x_continuous(trans = "symlog")
