I need to build a plot showing the difference in variance between the ordinary least squares estimator and the ridge estimator. For this I want to plot contour ellipses for the variance.
The data I'm working with is this:
set.seed(10)
sigma = 10
U = rnorm(50,0,sigma) #The errors
X = scale(matrix(rnorm(50*2),ncol=2))
Y = scale(U + X%*%c(5,-2))
I then find the ridge (r) and OLS (m) estimates and their covariance matrices:
(EDIT: I've realised the covariance matrix for the ridge estimator is wrong, however that doesn't really matter. It will just change the ellipses, not how to plot them.)
lbda = 100
r = solve(t(X)%*%X + lbda * diag(2))%*%t(X)%*%Y
m = solve(t(X)%*%X)%*%t(X)%*%Y
covmatr = sigma^2 * solve(t(X)%*%X + lbda * diag(2))
covmatm = sigma^2 * solve(t(X)%*%X)
I then want to build the plot. In base R I would do something like this, using the function ellipse from the car package:
library(car)
library(latex2exp) # provides TeX() for the axis labels
plot(r[1],r[2],xlim=c(-5,5),ylim=c(-5,5), xlab = TeX("$\\beta_1$"), col="red")
points(m[1],m[2], col="blue")
abline(h=0)
abline(v=0)
ellipse(center=c(r),shape=covmatr,center.cex=0,radius=sqrt(qchisq(0.5,2)), col = "red")
ellipse(center=c(r),shape=covmatr,center.cex=0,radius=sqrt(qchisq(0.9,2)), col = "red")
ellipse(center=c(m),shape=covmatm,center.cex=0,radius=sqrt(qchisq(0.5,2)))
ellipse(center=c(m),shape=covmatm,center.cex=0,radius=sqrt(qchisq(0.9,2)))
I realize I forgot to rename the y-label
My problem is that I can't figure out how to do the same in ggplot2. I have tried:
#Building the plot with dots
library(ggplot2)
rd = data.frame(r[1],r[2])
plot1 = ggplot(rd, aes(x=rd$r.1, y = rd$r.2, color = 'r')) +
geom_point(size = 3) +
geom_point(aes(x=m[1], y=m[2]), colour="blue", size = 3) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
xlim(-5,5) +
ylim(-5,5) +
ggtitle(TeX("Varians for ridge"))+
xlab(TeX("$\\beta_1$")) + ylab(TeX("$\\beta_2$"))+
theme_bw()
plot1
#Trying to add ellipses:
ellipse(center=c(r),shape=covmatr,center.cex=0,radius=sqrt(qchisq(0.5,2)), col = "red")
ellipse(center=c(r),shape=covmatr,center.cex=0,radius=sqrt(qchisq(0.9,2)), col = "red")
ellipse(center=c(m),shape=covmatm,center.cex=0,radius=sqrt(qchisq(0.5,2)))
ellipse(center=c(m),shape=covmatm,center.cex=0,radius=sqrt(qchisq(0.9,2)))
However, this gives me the following error:
Error in plot.xy(xy.coords(x, y), type = type, ...) :
plot.new has not been called yet
I have also tried to simply add the ellipses within the ggplot like this using the ellipse function from the car package:
plot1 = ... +
ellipse(center=c(m),shape=covmatm,center.cex=0,radius=sqrt(qchisq(0.9,2)))
where "..." indicates that lines have been omitted. I know both my attempts are quite naive, however I hoped one would work. I have found the command stat_ellipse, but it is not useful in my case because, as far as I can see, it estimates the covariance from the plotted data points and cannot be given the covariance matrix of the ridge estimator.
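One approach that avoids base-graphics drawing is to compute the ellipse outline yourself and hand it to geom_path(). Below is a minimal sketch, assuming the ellipse is the set of b with (b - center)' shape^-1 (b - center) = radius^2, which as far as I can tell is the parameterisation the car::ellipse calls above use. The helper ellipse_points is hypothetical, and the centre and covariance matrix are made-up stand-ins for r/m and covmatr/covmatm.

```r
# Hypothetical helper: outline of the ellipse
# { b : (b - center)' %*% solve(shape) %*% (b - center) == radius^2 }
ellipse_points <- function(center, shape, radius, n = 200) {
  angles <- seq(0, 2 * pi, length.out = n)
  unit_circle <- cbind(cos(angles), sin(angles))
  # chol() returns an upper-triangular Q with t(Q) %*% Q == shape
  pts <- radius * unit_circle %*% chol(shape)
  data.frame(x = pts[, 1] + center[1], y = pts[, 2] + center[2])
}

# Made-up centre and covariance matrix, for illustration only
center <- c(1.5, -0.5)
shape  <- matrix(c(2.0, 0.6, 0.6, 1.0), ncol = 2)

ell50 <- ellipse_points(center, shape, sqrt(qchisq(0.5, 2)))
ell90 <- ellipse_points(center, shape, sqrt(qchisq(0.9, 2)))

# The plotting part only runs if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot() +
    geom_point(aes(x = center[1], y = center[2]), colour = "red", size = 3) +
    geom_path(data = ell50, aes(x = x, y = y), colour = "red") +
    geom_path(data = ell90, aes(x = x, y = y), colour = "red") +
    geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) +
    coord_equal() +
    theme_bw()
  print(p)
}
```

Calling the helper once per estimator and per confidence level, and adding one geom_path() layer for each, should reproduce the four ellipses of the base R plot.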
I am using the following code in R to plot a linear regression with confidence interval bands (95%) around the regression line.
Average <- c(0.298,0.783429,0.2295,0.3725,0.598,0.892,2.4816,2.79975,
1.716368,0.4845,0.974133,0.824,0.936846,1.54905,0.8166,1.83535,
1.6902,1.292667,0.2325,0.801,0.516,2.06645,2.64965,2.04785,0.55075,
0.698615,1.285,2.224118,2.8576,2.42905,1.138143,1.94225,2.467357,0.6615,
0.75,0.547,0.4518,0.8002,0.5936,0.804,0.7,0.6415,0.702182,0.7662,0.847)
Area <-c(8.605,16.079,4.17,5.985,12.419,10.062,50.271,61.69,30.262,11.832,25.099,
8.594,17.786,36.995,7.473,33.531,30.97,30.894,4.894,8.572,5.716,45.5,69.431,
40.736,8.613,14.829,4.963,33.159,66.32,37.513,27.302,47.828,39.286,9.244,19.484,
11.877,9.73,11.542,12.603,9.988,7.737,9.298,14.918,17.632,15)
lm.out <- lm (Area ~ Average)
newx = seq(min(Average), by = 0.05)
conf_interval <- predict(lm.out, newdata = data.frame(Average = newx), interval ="confidence",
level = 0.95)
plot(Average, Area, xlab ="Average", ylab = "Area", main = "Regression")
abline(lm.out, col = "lightblue")
lines(newx, conf_interval[,2], col = "blue", lty ="dashed")
lines(newx, conf_interval[,3], col = "blue", lty ="dashed")
I am stuck because the graph I got shows the bands only for the first part of the line, leaving out the rest of it (you find the link to the image at the bottom of the message). What is going wrong? I would also like to shade the area of the confidence interval (not just the lines corresponding to the limits) but I can't understand how to do it.
Any help would be really appreciated, I am completely new to R.
This is very easy with the ggplot2 library. Here is the code:
library(ggplot2)
data = data.frame(Average, Area)
ggplot(data=data, aes(x=Average, y=Area))+ # same roles as in lm(Area ~ Average)
geom_smooth(method="lm", level=0.95)+
geom_point()
Code to install the library:
install.packages("ggplot2")
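For the base-graphics version, the likely cause is the seq() call: when `to` is omitted it defaults to 1, so `seq(min(Average), by = 0.05)` stops at 1 and the bands cover only that stretch of the line. Here is a sketch that passes both endpoints and shades the band with polygon(). It uses simulated data, so x and y below are made-up stand-ins for Average and Area.

```r
set.seed(1)
# Simulated stand-ins for Average and Area (not the data from the question)
x <- runif(45, 0.2, 2.9)
y <- 3 + 20 * x + rnorm(45, sd = 6)

fit <- lm(y ~ x)

# Give seq() both endpoints so the grid spans the whole x range
newx <- seq(min(x), max(x), length.out = 100)
ci <- predict(fit, newdata = data.frame(x = newx),
              interval = "confidence", level = 0.95)

plot(x, y, xlab = "x", ylab = "y", main = "Regression")
# Shade the band: trace the lower limit left to right,
# then the upper limit right to left
polygon(c(newx, rev(newx)), c(ci[, "lwr"], rev(ci[, "upr"])),
        col = adjustcolor("blue", alpha.f = 0.2), border = NA)
abline(fit, col = "lightblue")
lines(newx, ci[, "lwr"], col = "blue", lty = "dashed")
lines(newx, ci[, "upr"], col = "blue", lty = "dashed")
points(x, y) # redraw the points on top of the shading
```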
I'd like to use something like ggplot2 and ggmap to produce a heat map of arbitrary values such as property prices per metre squared over a geographic area at a street level (with a high resolution).
Unfortunately, the task appears to be rather difficult because while ggplot2 can produce a great density plot, it seems unable to visualise spatial data like this without prior interpolation.
For this, I've used libraries akima (gridded bivariate interpolation for irregular data) and mgcv (generalised additive models with integrated smoothness estimation), however my knowledge of interpolation methods is mediocre at best and the results I've been able to produce aren't satisfactory enough.
Consider the following example:
Data
library(ggplot2)
library(ggmap)
library(tibble) # for tibble()
## data simulation
set.seed(1945)
df <- tibble(x = rnorm(500, -0.7406, 0.03),
y = rnorm(500, 51.9976, 0.03),
z = abs(rnorm(500, 2000, 1000)))
Map, scatterplot, density plot
## ggmap
map <- get_map("Bletchley Park, Bletchley, Milton Keynes", zoom = 13, source = "stamen", maptype = "toner-background")
q <- ggmap(map, extent = "device", darken = .5)
## scatterplot over map
q + geom_point(aes(x, y, colour = z), data = df)
## classic density heat map
q +
stat_density2d(aes(x=x, y=y, fill=..level..), data=df, geom="polygon", alpha = .2) +
geom_density_2d(aes(x=x, y=y), data=df, colour = "white", alpha = .4) +
scale_fill_distiller(palette = "Spectral")
As you can see, the data are rather dense over the chosen area and the density heat map looks great with round edges and closed curves (except for some of the outermost layers).
Interpolation and plotting using akima
## akima interpolation
library(akima)
df_akima <-interp2xyz(interp(x=df$x, y=df$y, z=df$z, duplicate="mean", linear = T,
xo=seq(min(df$x), max(df$x), length=200),
yo=seq(min(df$y), max(df$y), length=200)), data.frame=TRUE)
## akima plot
q +
geom_tile(aes(x = x, y = y, fill = z), data = df_akima, alpha = .4) +
stat_contour(aes(x = x, y = y, z = z, fill = ..level..), data = df_akima, geom = 'polygon', alpha = .4) +
geom_contour(aes(x = x, y = y, z = z), data = df_akima, colour = 'white', alpha = .4) +
scale_fill_distiller(palette = "Spectral", na.value = NA)
This produces a dense grid of interpolated values (to ensure a sufficient resolution) and while the tile plot underneath is acceptable, the contour plots are too ragged and many of the curves aren't closed.
Non-linear interpolation using linear = F is smoother, but apparently sacrifices resolution and goes wild with the numbers (negative values of z).
Interpolation and plotting using mgcv
## mgcv interpolation
library(mgcv)
gam <- gam(z ~ s(x, y, bs = 'sos'), data = df)
df_mgcv <- data.frame(expand.grid(x = seq(min(df$x), max(df$x), length=200),
y = seq(min(df$y), max(df$y), length=200)))
resp <- predict(gam, df_mgcv, type = "response")
df_mgcv$z <- resp
## mgcv plot
q +
geom_tile(aes(x = x, y = y, fill = z), data = df_mgcv, alpha = .4) +
stat_contour(aes(x = x, y = y, z = z, fill = ..level..), data = df_mgcv, geom = 'polygon', alpha = .4) +
geom_contour(aes(x = x, y = y, z = z), data = df_mgcv, colour = 'white', alpha = .4) +
scale_fill_distiller(palette = "Spectral", na.value = NA)
The same process using mgcv results in a nice and smooth plot, but the resolution is much lower and practically none of the curves are closed.
Questions
Could you please suggest a better method or modify my attempt to obtain a plot similar to the first one (clean, connected, and smooth lines with high resolution)?
Is it possible to close the curves, e.g. in the last plot (the shaded area should be computed beyond the image boundaries)?
Thank you for your time!
The problem with your maps is not the interpolation method you're using, but the way ggplot displays density lines. Here's an answer to this: Remove gaps in a stat_density2d ggplot chart without modifying XY limits.
The density lines go beyond the map, so any polygon that extends outside the plot area is rendered incorrectly (ggplot closes the polygon using the next point of the corresponding level). This does not show up much on your first map because the interpolation resolution is low.
The trick proposed by Andrew is to first expand the plot area, so that the density lines are rendered correctly, then cut off the display area to hide the extra space. Since I tested his solution with your first example, here's the code:
q +
stat_density2d(
aes(x = x, y = y, fill = ..level..),
data = df,
geom = "polygon",
alpha = .2,
color = "white",
bins = 20
) +
scale_fill_distiller(
palette = "Spectral"
) +
xlim(
min(df$x) - 10^-5,
max(df$x) + 10^-5
) +
ylim(
min(df$y) - 10^-3,
max(df$y) + 10^-3
) +
coord_equal(
expand = FALSE,
xlim = c(-.778, -.688),
ylim = c(51.965, 52.03)
)
The only differences are that I used min() - / max() + offsets instead of fixed numbers, and coord_equal to ensure the map wasn't distorted. In addition, I manually specified a greater number of levels (using bins), since stat_density2d automatically chooses a lower resolution when the plot area is increased.
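The effect of the evaluation window can also be checked outside ggplot2. The rough sketch below assumes MASS::kde2d() as the density estimator (as far as I know, ggplot2's 2D density stat relies on the same function) and treats a contour as closed when its first and last points coincide. The helper count_closed, the two windows and the padding value are made up for illustration, and the data are simulated as in the question.

```r
library(MASS) # kde2d(); a recommended package shipped with R

set.seed(1945)
x <- rnorm(500, -0.7406, 0.03)
y <- rnorm(500, 51.9976, 0.03)

# Hypothetical helper: estimate the density on a grid spanning `lims`
# (c(xmin, xmax, ymin, ymax)) and count the contour lines that end
# where they started
count_closed <- function(x, y, lims) {
  d <- kde2d(x, y, n = 100, lims = lims)
  cl <- contourLines(d$x, d$y, d$z, nlevels = 10)
  closed <- vapply(cl, function(l) {
    n <- length(l$x)
    isTRUE(all.equal(c(l$x[1], l$y[1]), c(l$x[n], l$y[n])))
  }, logical(1))
  c(total = length(cl), closed = sum(closed))
}

# A window narrower than the data, similar to a map that crops the points
tight <- unname(c(quantile(x, c(0.1, 0.9)), quantile(y, c(0.1, 0.9))))
# A window padded well beyond the data range (the padding is an arbitrary choice)
pad <- 0.05
wide <- c(range(x) + c(-pad, pad), range(y) + c(-pad, pad))

res_tight <- count_closed(x, y, tight)
res_wide  <- count_closed(x, y, wide)
print(res_tight)
print(res_wide)
```

Comparing the two results shows how many contour lines are cut by the window in each case, which is the reason for expanding the limits first and cropping only at the coordinate level afterwards.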
As for the best interpolation method, this depends on your objective and the type of data you have. The question is not what is the best method for your map, but what is the best method for your data. This is a very broad issue, out of scope for this space. But here's a good guide: http://www.rspatial.org/analysis/rst/4-interpolation.html
For general ideas on how to make good maps in R using ggplot: http://spatial.ly/r/
Sorry, I can't run your example at the moment to provide details. But try autoKrige() from the automap package.
Kriging is a great method for interpolation. Just be sure that your data meets its requirements. Here's a good guide:
https://gisgeography.com/kriging-interpolation-prediction/
I am new to plotting time series. I downloaded some time series data and fitted a linear regression to it, and I would like to add the fitted line to the time series plot. I want to show the year in the plot, so I used index(stk) as the x-axis input.
code:
library(quantmod)
stk <- suppressWarnings(getSymbols("AAPL", auto.assign = FALSE,
src = "yahoo", periodicity = "daily"))
stk <- na.omit(stk)
stk.lm1 <- lm(log(Cl(stk)) ~ c(1:nrow(stk)), data = stk)
plot(index(stk), log(Cl(stk)), type = "l", lwd = 3, las = 1)
abline(coefficients(stk.lm1)[1], coefficients(stk.lm1)[2], col="blue")
I know the problem comes from plotting against index(stk). How can I keep the x-axis of the plot as dates, and can I use plot.xts or something else like ggplot2 to do the same thing? Please advise, thank you very much.
It isn't difficult to do the plot that you want in base R or ggplot2. Here is what you want:
plot(index(stk), log(Cl(stk)), type="l", lwd=3, las=1)
lines(x = index(stk.lm1$fitted.values), y = stk.lm1$fitted.values,col = "blue")
For the base R plot I added a line with the fitted values of the linear regression, extracted with the $ operator, plotted against their dates. Take into account that lm respects the structure of the data, so the results are xts.
library(ggplot2)
ggplot(stk, aes(x = index(stk), y = as.numeric(log(Cl(stk)))))+geom_line(lwd=1)+
geom_line(aes(x = index(stk.lm1$fitted.values), y = stk.lm1$fitted.values),col = "blue")+
labs(x = "Date", y = "Log Price")
For ggplot2 it is quite similar. First you initiate the plot with ggplot, where you define the data and aesthetics (aes), then you add a line with geom_line. For the extra line I used the same command and defined the new line in a new aes, the same way I did with the base R function.
Here's a ggplot solution. You shouldn't have to calculate the linear regression coefficients yourself:
# convert stk to data frame & specify your x-axis variable explicitly
stk.df <- as.data.frame(stk)
stk.df$Date <- as.Date(rownames(stk.df))
# plot
ggplot(stk.df,
aes(x = Date, y = log(AAPL.Close))) +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Year", y = "log(AAPL's closing value)") +
theme_bw()
The geom_smooth line takes care of the regression. Set se = TRUE if you want to include a confidence interval around the line.
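The same idea carries over to base graphics without any xts-specific handling: fit the regression on a numeric time index, then draw the fitted values against the dates rather than passing the coefficients to abline(). Here is a minimal sketch on a simulated daily series (dates and price are made-up stand-ins for index(stk) and Cl(stk)):

```r
set.seed(42)
# Simulated daily series, for illustration only
dates <- seq(as.Date("2015-01-01"), by = "day", length.out = 1000)
price <- exp(3 + 0.001 * seq_along(dates) + cumsum(rnorm(1000, sd = 0.01)))

# Regress log price on a plain 1..n time index
t_idx <- seq_along(dates)
fit <- lm(log(price) ~ t_idx)

# Plot against the dates so the x-axis stays in calendar time
plot(dates, log(price), type = "l", lwd = 3, las = 1,
     xlab = "Date", ylab = "log(price)")
# fitted(fit) has one value per observation, so it lines up with `dates`
lines(dates, fitted(fit), col = "blue")
```

abline() with the raw coefficients does not work here because the slope is per observation, while the x-axis is measured in days since 1970-01-01.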
I am creating a number of heatmaps in R, but I am having problems when it comes to keeping the colour scale consistent across graphs.
I find that the colours are scaled within each graph. Is there a way to make the colours consistent across graphs, i.e. so that the colour difference between a value of 0.4 and 0.5 is always the same?
Code Example:
set.seed(123)
d1 = matrix(rnorm(9, mean = 0.2, sd = 0.1), ncol = 3)
d2 = matrix(rnorm(9, mean = 0.8, sd = 0.1), ncol = 3)
mat = list(d1, d2)
for(m in mat)
heatmap(m, Rowv = NA ,Colv = NA)
You'll note in the example that cell (2,3) in the first graph looks similar to cell (1,3) in the second, despite the values being ~0.8 different.
Here's a way to do it with ggplot2, if you're open to not using base graphics:
library(reshape2)
library(ggplot2)
# Set common limits for color scale
limits = range(unlist(mat))
Here's the code for two separate graphs. The last line of code for each graph ensures that they use the same z limits for setting the colors:
ggplot(melt(mat[[1]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
ggplot(melt(mat[[2]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
Another option is to plot both heatmaps in a single graph using facetting, which automatically ensures both graphs are on the same color scale:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1)
I've used the default colors here, but for either approach you can set the color scale to be anything you wish. For example:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1) +
scale_fill_gradient(low="red", high="green")
You could use the image function directly (heatmap uses image), though it will require some extra formatting to match the output of heatmap. You can use zlim to set the color range. Quoting from the ?image page:
the minimum and maximum z values for which colors should be plotted,
defaulting to the range of the finite values of z. Each of the given
colors will be used to color an equispaced interval of this range. The
midpoints of the intervals cover the range, so that values just
outside the range will be plotted.
# define zlim min and max for all the plots
minz = Reduce(min, mat)
maxz = Reduce(max, mat)
for(m in mat) {
image( m, zlim = c(minz, maxz), col = heat.colors(20))
}
To get closer to the formatting produced by heatmap, you can just reuse some code from the heatmap function:
for(m in mat) {
labCol = dim(m)[2]
labRow = dim(m)[1]
image(seq_len(labCol), seq_len(labRow), m, zlim = c(minz, maxz),
col = heat.colors(20), axes = FALSE, xlab = "", ylab = "",
xlim = 0.5 + c(0, labCol), ylim = 0.5 + c(0, labRow))
axis(1, 1L:labCol, labels = seq_len(labCol), las = 2, line = -0.5, tick = 0)
axis(4, 1L:labRow, labels = seq_len(labRow), las = 2, line = -0.5, tick = 0)
}
Using the breaks argument to image is another option. It allows more flexibility than zlim in setting the breakpoints for colors. Quoting from the help page, breaks is
a set of finite numeric breakpoints for the colours: must have one
more breakpoint than colour and be in increasing order. Unsorted
vectors will be sorted, with a warning.
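As a sketch of the breaks approach, assuming equally spaced breakpoints over the pooled range of all matrices (the names n_col, cols and brks are mine), using the example data from the question:

```r
set.seed(123)
d1 <- matrix(rnorm(9, mean = 0.2, sd = 0.1), ncol = 3)
d2 <- matrix(rnorm(9, mean = 0.8, sd = 0.1), ncol = 3)
mat <- list(d1, d2)

n_col <- 20
cols <- heat.colors(n_col)

# One shared set of breakpoints: one more breakpoint than colours,
# spanning the range of all matrices together
rng <- range(unlist(mat))
brks <- seq(rng[1], rng[2], length.out = n_col + 1)

op <- par(mfrow = c(1, 2))
for (m in mat) {
  image(m, breaks = brks, col = cols, axes = FALSE)
}
par(op)
```

Unequally spaced breakpoints work the same way, which is where breaks offers more control than zlim.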
Say, I am using facet_grid() in ggplot2 to obtain 2 histograms. Now I want to superimpose these histograms with Poisson curves (having different means for the 2 histogram plots/grids) and a second curve of another distribution (for which I want to manually provide the probability function of values). How can this be done?
Constructing an example:
library(ggplot2)
value<-c(rpois(500,1.5))
group<-rep(c("A","B"),250)
data<-data.frame(value,group)
g1<-ggplot(data,aes(value))
g1+geom_histogram(aes(y=..count..),binwidth=1,position="identity")+facet_grid(.~group)
What next?
Alternatively, can it be done using the lattice package?
The easy way is to plot densities instead of counts and use stat_function()
library(ggplot2)
value<-c(rpois(500,1.5))
group<-rep(c("A","B"),250)
data<-data.frame(value,group)
ggplot(data,aes(value)) +
geom_histogram(aes(y=..density..), binwidth=1,position="identity") +
facet_grid(.~group) +
stat_function(geom = "line", fun = dpois, args = list(lambda = 1.5), colour = "red", n = 9)
If you want counts then you need to convert the densities of dpois to 'counts'
ggplot(data,aes(value)) +
geom_histogram(aes(y=..count..), binwidth=1,position="identity") +
facet_grid(.~group) +
stat_function(geom = "line", fun = function(..., total){dpois(...) * total}, args = list(lambda = 1.5, total = 250), colour = "red", n = 9)
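stat_function() draws the same curve in every panel, so for a different mean per facet one option is to precompute the curve in a data frame that carries the facetting variable. The sketch below assumes each group's sample mean is used as its Poisson mean. The simulated data and the name curves are mine. A manually specified probability function could fill the expected column in the same way.

```r
set.seed(1)
# Simulated data with a different Poisson mean in each group (made-up values)
data <- data.frame(
  value = c(rpois(250, 1.5), rpois(250, 4)),
  group = rep(c("A", "B"), each = 250)
)

# One row per (group, k): expected count = group size * P(X = k),
# with the group's sample mean used as lambda
ks <- 0:max(data$value)
curves <- do.call(rbind, lapply(split(data, data$group), function(d) {
  data.frame(group = d$group[1], value = ks,
             expected = nrow(d) * dpois(ks, lambda = mean(d$value)))
}))

# The plotting part only runs if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data, aes(value)) +
    geom_histogram(binwidth = 1) +
    facet_grid(. ~ group) +
    # `curves` has a `group` column, so each panel gets its own curve
    geom_line(data = curves, aes(x = value, y = expected), colour = "red") +
    geom_point(data = curves, aes(x = value, y = expected), colour = "red")
  print(p)
}
```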
When recently faced with a similar problem (comparing distributions), I wrote up some code for transparent overlapping histograms that might give you some ideas on where to start.