Simulated sphere shape data based on normal distribution - r

I'm stuck on the question below. Any help is appreciated.
"Simulate data with n=1000 observations and p=3 covariates -- all random variables from standard normal distribution. Create two category class variable assigning all observations within a sphere with radius of 1.5 centered at 3D zero to one class category and all others -- to the second".

Here's a 2D example to get you going...
library(ggplot2)
library(grid)
# Sample x & y coords from normal distribution (default mean = 0, sd = 1)
df <- data.frame(x = rnorm(100), y = rnorm(100))
# Calculate distance from centre (0,0)
df$r <- sqrt(df$x^2 + df$y^2)
# Assign to category
df$category <- ifelse(df$r < 1, "in", "out")
# Plot
ggplot(df, aes(x = x, y = y, color = category)) +
  geom_point() +
  coord_equal() +
  annotation_custom(grob = circleGrob(r = unit(1, "npc"), gp = gpar(fill = NA)),
                    xmin = -0.5, xmax = 0.5, ymin = -0.5, ymax = 0.5)
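The same idea carries over to the 3D case in the question. A minimal sketch, assuming the column names x1, x2, x3 and the labels "inside"/"outside" (none of these come from the question), with an arbitrary seed so the run is reproducible:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 1000
p <- 3
# n x p matrix of independent standard normal covariates
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, c("x1", "x2", "x3")))
# Euclidean distance of each observation from the origin (0, 0, 0)
r <- sqrt(rowSums(X^2))
# Observations within radius 1.5 go to one class, all others to the second
sim <- data.frame(X, class = factor(ifelse(r <= 1.5, "inside", "outside")))
table(sim$class)
```

Since the squared distance of three independent standard normals follows a chi-squared distribution with 3 degrees of freedom, pchisq(1.5^2, df = 3) gives the expected share of "inside" observations (roughly 0.48), which is a quick sanity check on the table.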

Related

ggplot2: Plot two different Densities in the same Plot of the same Variable before and after a Cutoff

My goal is to plot two different densities of the same variable in the same plot. This is a common way to show the robustness of the forcing variable (here z) in a Regression Discontinuity Design. The code below works, but I do not want the density to be plotted below the cutoff (here 0) if the key is "above", and vice versa. The curve should also not simply be hidden beyond the cutoff after smoothing: the density should be computed only up to (or starting from) the cutoff.
library(ggplot2)
library(tibble)  # provides tibble()
x <- rnorm(1000, mean = 0)
y <- rnorm(500, mean = 2)
z <- append(x, y)
d <- tibble(value = z, key = ifelse(z <= 0, "below", "above"))
ggplot(d) +
  geom_density(aes(value, group = key)) +  # map the column, not the global vector z
  geom_vline(aes(xintercept = 0))
Does anybody know how to implement this? For linear regressions I got it working, but geom_density() also plots and smooths the other side of the cutoff.
Thanks in advance for your help.
You can use trim = TRUE in geom_density to only calculate density over the range of values in the data:
library(ggplot2)
library(dplyr)
x <- rnorm(1000, mean = 0)
y <- rnorm(500, mean = 2)
z <- append(x, y)
d <- tibble(value = z, key = ifelse(z <= 0, "below", "above"))
ggplot(d) +
  # added fill for easier discrimination
  geom_density(aes(value, group = key, fill = key),
               alpha = 0.5, trim = TRUE) +
  geom_vline(aes(xintercept = 0), lty = 2, colour = 'red')
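If you want explicit control over where each estimate is evaluated, one option is base R's density() with its from and to arguments. A rough sketch, assuming the hypothetical names cutoff, below, above and dens and an arbitrary seed:

```r
library(ggplot2)

set.seed(1)  # arbitrary seed
z <- c(rnorm(1000, mean = 0), rnorm(500, mean = 2))
cutoff <- 0

# Estimate each side from its own observations and evaluate it only up to
# (or starting from) the cutoff
below <- density(z[z <= cutoff], to = cutoff)
above <- density(z[z > cutoff], from = cutoff)

dens <- rbind(
  data.frame(value = below$x, density = below$y, key = "below"),
  data.frame(value = above$x, density = above$y, key = "above")
)

ggplot(dens, aes(value, density, colour = key)) +
  geom_line() +
  geom_vline(xintercept = cutoff, linetype = 2)
```

Kernel estimates are biased near a hard boundary, so this is not a boundary-corrected estimator; it only restricts where each curve is drawn.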

Filling parts of a contour plot in R

I have made a contour plot in R with the following code:
library(mvtnorm)
library(ggplot2)
# Define the parameters for the multivariate normal distribution
mu = c(0,0)
sigma = matrix(c(1,0.2,0.2,3),nrow = 2)
# Make a grid in the x-y plane centered in mu, +/- 3 times the diagonal of sigma
# (the diagonal holds variances, so this is not exactly +/- 3 standard deviations)
xygrid = expand.grid(x = seq(from = mu[1]-3*sigma[1,1], to = mu[1]+3*sigma[1,1], length.out = 100),
                     y = seq(from = mu[2]-3*sigma[2,2], to = mu[2]+3*sigma[2,2], length.out = 100))
# Use the mvtnorm library to calculate the multivariate normal density for each point in the grid
distribution = as.matrix(dmvnorm(x = xygrid, mean = mu, sigma = sigma))
# Plot contours (geom_contour has no geom argument, so it is dropped here)
df = as.data.frame(cbind(xygrid, distribution))
myPlot = ggplot() + geom_contour(data = df, aes(x = x, y = y, z = distribution))
myPlot
I want to illustrate cumulative probability by shading/colouring certain parts of the plot, for instance everything in the region {x<0, y<0} (or any other self defined region).
Is there any way of achieving this in R with ggplot?
You can get the coordinates used to draw the contour lines in the plot with ggplot_build. You can then use these coordinates with geom_polygon to shade a particular region. My best try:
library(dplyr)
data <- ggplot_build(myPlot)$data[[1]]
xCoor <- 0
yCoor <- 0
# '-1-001' is the group id of one contour line in this plot;
# check unique(data$group) for the ids in yours
df <- data %>% filter(group == '-1-001', x <= xCoor, y <= yCoor) %>% select(x,y)
# Insert the [0,0] coordinate in the right place
index <- which.max(abs(diff(rank(df$y))))
df <- rbind( df[1:index,], data.frame(x=xCoor, y=yCoor), df[(index+1):nrow(df),] )
myPlot + geom_polygon(data = df, aes(x=x, y=y), fill = 'red', alpha = 0.5)
As you can see it's not perfect because the [x,0] and [0,y] coordinates are not included in the data, but it's a start.
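If the region is a simple rectangle such as {x<0, y<0}, a cruder alternative is to shade the whole quadrant behind the contours and compute its probability separately. The following is only a sketch under that assumption: the shading via annotate() and the name p_region are illustrative, and the grid limits are written out by hand to match the question's grid. pmvnorm() from mvtnorm integrates the density over a rectangular region.

```r
library(mvtnorm)
library(ggplot2)

mu <- c(0, 0)
sigma <- matrix(c(1, 0.2, 0.2, 3), nrow = 2)

# P(X < 0, Y < 0): integrate the density over the lower-left quadrant
p_region <- pmvnorm(lower = c(-Inf, -Inf), upper = c(0, 0),
                    mean = mu, sigma = sigma)

# Same kind of grid and contour plot as in the question
xygrid <- expand.grid(x = seq(-3, 3, length.out = 100),
                      y = seq(-9, 9, length.out = 100))
xygrid$distribution <- dmvnorm(as.matrix(xygrid), mean = mu, sigma = sigma)

ggplot(xygrid, aes(x, y)) +
  annotate("rect", xmin = -Inf, xmax = 0, ymin = -Inf, ymax = 0,
           fill = "red", alpha = 0.3) +
  geom_contour(aes(z = distribution)) +
  ggtitle(sprintf("P(x < 0, y < 0) = %.3f", as.numeric(p_region)))
```

This shades the full quadrant rather than clipping to a contour, so it illustrates the cumulative probability of the region itself, not the area inside a particular contour line.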

topoplot in ggplot2 – 2D visualisation of e.g. EEG data

Can ggplot2 be used to produce a so-called topoplot (often used in neuroscience)?
Sample data:
  label          x          y    signal
1    R3 0.64924459 0.91228430 2.0261520
2    R4 0.78789621 0.78234410 1.7880972
3    R5 0.93169511 0.72980685 0.9170998
4    R6 0.48406513 0.82383895 3.1933129
Full sample data.
Rows represent individual electrodes. Columns x and y represent the projection into 2D space and the column signal is essentially the z-axis representing voltage measured at a given electrode.
stat_contour doesn't work, apparently due to unequal grid.
geom_density_2d only provides a density estimation of x and y.
geom_raster is either not suited to this task or I am using it incorrectly, since it quickly runs out of memory.
Smoothing (like in the image on the right) and head contours (nose, ears) aren't necessary.
I want to avoid Matlab and transforming the data so that it fits this or that toolbox… Many thanks!
Update (26 January 2016)
The closest I've been able to get to my objective is via
library(colorRamps)
ggplot(channels, aes(x, y, z = signal)) + stat_summary_2d() + scale_fill_gradientn(colours=matlab.like(20))
which produces an image like this:
Update 2 (27 January 2016)
I've tried @alexforrence's approach with the full data and this is the result:
It's a great start, but there are a couple of issues:
The last call (ggplot()) takes about 40 seconds on an Intel i7 4790K while Matlab toolboxes manage to generate these almost instantly; my ‘emergency solution’ above takes about a second.
As you can see, the upper and lower border of the central part appear to be ‘sliced’ – I'm not sure what causes this but it could be the third issue.
I'm getting these warnings:
1: Removed 170235 rows containing non-finite values (stat_contour).
2: Removed 170235 rows containing non-finite values (stat_contour).
Update 3 (27 January 2016)
Comparison between two plots produced with different interp(xo, yo) and stat_contour(binwidth) values:
Ragged edges if one chooses low interp(xo, yo), in this case xo/yo = seq(0, 1, length = 100):
Here's a potential start:
First, we'll attach some packages. I'm using akima to do linear interpolation, though it looks like EEGLAB uses some sort of spherical interpolation here? (the data was a little sparse to try it).
library(ggplot2)
library(akima)
library(reshape2)
Next, reading in the data:
dat <- read.table(text = " label x y signal
1 R3 0.64924459 0.91228430 2.0261520
2 R4 0.78789621 0.78234410 1.7880972
3 R5 0.93169511 0.72980685 0.9170998
4 R6 0.48406513 0.82383895 3.1933129")
We'll interpolate the data, and stick that in a data frame.
datmat <- interp(dat$x, dat$y, dat$signal,
xo = seq(0, 1, length = 1000),
yo = seq(0, 1, length = 1000))
datmat2 <- melt(datmat$z)
names(datmat2) <- c('x', 'y', 'value')
datmat2[,1:2] <- datmat2[,1:2]/1000 # scale it back
I'm going to borrow from some previous answers. The circleFun below is from Draw a circle with ggplot2.
circleFun <- function(center = c(0,0), diameter = 1, npoints = 100){
  r = diameter / 2
  tt <- seq(0, 2*pi, length.out = npoints)
  xx <- center[1] + r * cos(tt)
  yy <- center[2] + r * sin(tt)
  return(data.frame(x = xx, y = yy))
}
circledat <- circleFun(c(.5, .5), 1, npoints = 100) # center on [.5, .5]
# ignore anything outside the circle
datmat2$incircle <- (datmat2$x - .5)^2 + (datmat2$y - .5)^2 < .5^2 # mark
datmat2 <- datmat2[datmat2$incircle,]
And I really liked the look of the contour plot in R plot filled.contour() output in ggplot2, so we'll borrow that one.
ggplot(datmat2, aes(x, y, z = value)) +
  geom_tile(aes(fill = value)) +
  stat_contour(aes(fill = ..level..), geom = 'polygon', binwidth = 0.01) +
  geom_contour(colour = 'white', alpha = 0.5) +
  scale_fill_distiller(palette = "Spectral", na.value = NA) +
  geom_path(data = circledat, aes(x, y, z = NULL)) +
  # draw the nose (haven't drawn ears yet)
  geom_line(data = data.frame(x = c(0.45, 0.5, .55), y = c(1, 1.05, 1)),
            aes(x, y, z = NULL)) +
  # add points for the electrodes
  geom_point(data = dat, aes(x, y, z = NULL, fill = NULL),
             shape = 21, colour = 'black', fill = 'white', size = 2) +
  theme_bw()
With improvements mentioned in the comments (setting extrap = TRUE and linear = FALSE in the interp call to fill in gaps and do a spline smoothing, respectively, and removing NAs before plotting), we get:
mgcv can do spherical splines. This replaces akima (the chunk containing interp() isn't necessary).
library(mgcv)
spl1 <- gam(signal ~ s(x, y, bs = 'sos'), data = dat)
# fine grid, coarser is faster
datmat2 <- data.frame(expand.grid(x = seq(0, 1, 0.001), y = seq(0, 1, 0.001)))
resp <- predict(spl1, datmat2, type = "response")
datmat2$value <- resp

R - add centroids to scatter plot

I have a dataset with two continuous variables and one factor variable (two classes). I want to create a scatterplot in R with two centroids (one for each class) that include error bars. The centroids should be positioned at the mean values of x and y for each class.
I can easily create the scatter plot using ggplot2, but I can't figure out how to add the centroids. Is it possible to do this using ggplot / qplot?
Here is some example code:
x <- c(1,2,3,4,5,2,3,5)
y <- c(10,11,14,5,7,9,8,5)
class <- c(1,1,1,0,0,1,0,0)
df <- data.frame(class, x, y)
qplot(x,y, data=df, color=as.factor(class))
Is this what you had in mind?
centroids <- aggregate(cbind(x,y)~class,df,mean)
ggplot(df,aes(x,y,color=factor(class))) +
geom_point(size=3)+ geom_point(data=centroids,size=5)
This creates a separate data frame, centroids, with columns x, y, and class, where x and y are the mean values by class. Then we add a second point geometry layer using centroids as the dataset.
This is a slightly more interesting version, useful in cluster analysis.
gg <- merge(df,aggregate(cbind(mean.x=x,mean.y=y)~class,df,mean),by="class")
ggplot(gg, aes(x,y,color=factor(class)))+geom_point(size=3)+
geom_point(aes(x=mean.x,y=mean.y),size=5)+
geom_segment(aes(x=mean.x, y=mean.y, xend=x, yend=y))
EDIT Response to OP's comment.
Vertical and horizontal error bars can be added using geom_errorbar(...) and geom_errorbarh(...).
centroids <- aggregate(cbind(x,y)~class,df,mean)
f <- function(z)sd(z)/sqrt(length(z)) # function to calculate std.err
se <- aggregate(cbind(se.x=x,se.y=y)~class,df,f)
centroids <- merge(centroids,se, by="class") # add std.err column to centroids
ggplot(gg, aes(x,y,color=factor(class)))+
  geom_point(size=3)+
  geom_point(data=centroids, size=5)+
  geom_errorbar(data=centroids,aes(ymin=y-se.y,ymax=y+se.y),width=0.1)+
  geom_errorbarh(data=centroids,aes(xmin=x-se.x,xmax=x+se.x),height=0.1)
If you want to calculate, say, 95% confidence instead of std. error, replace
f <- function(z)sd(z)/sqrt(length(z)) # function to calculate std.err
with
f <- function(z) qt(0.025,df=length(z)-1, lower.tail=F)* sd(z)/sqrt(length(z))
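To see what the two versions of f return, here is a small check on the x values of class 1 from the example data. The names x1, se_fun and ci_fun are only for illustration and are not part of the answer above.

```r
x1 <- c(1, 2, 3, 2)  # x values of class 1 in the example data

# Standard error of the mean
se_fun <- function(z) sd(z) / sqrt(length(z))
# Half-width of a 95% confidence interval based on the t distribution
ci_fun <- function(z) {
  qt(0.025, df = length(z) - 1, lower.tail = FALSE) * se_fun(z)
}

se_fun(x1)
ci_fun(x1)

# The interval itself is mean +/- half-width
mean(x1) + c(-1, 1) * ci_fun(x1)
```

With only four observations the t quantile is well above 1.96, so the confidence-interval bars come out noticeably wider than the standard-error bars.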
I could not get the exact code by @jlhoward to work for me (specifically the error bars), so I made minor changes to remove the errors and the warnings. You should be able to run the code below from start to finish, and if @jlhoward wants to incorporate this into the existing answer, that's great.
centroids <- aggregate(cbind(mean.x = x, mean.y = y) ~ class, df, mean)
gg <- merge(df, centroids, by = "class")
f <- function(z) sd(z) / sqrt(length(z)) # function to calculate std.err
se <- aggregate(cbind(se.x = x ,se.y = y) ~ class, df, f)
centroids <- merge(centroids, se, by = "class") # add std.err column to centroids
ggplot(gg, aes(x = x, y = y, color = factor(class))) +
  geom_point(size = 3) +
  geom_point(data = centroids, aes(x = mean.x, y = mean.y), size = 5) +
  geom_errorbar(data = centroids,
                aes(x = mean.x, y = mean.y, ymin = mean.y - se.y, ymax = mean.y + se.y),
                width = 0.1) +
  geom_errorbarh(data = centroids, inherit.aes=FALSE, # keeps ggplot from using first aes
                 aes(xmin = (mean.x - se.x), xmax = (mean.x + se.x), y = mean.y,
                     height = 0.1, color = factor(class))) +
  labs(x = "Label for x-axis", y = "Label for y-axis") +
  theme(legend.title = element_blank()) # remove legend title

Fill superimposed ellipses in ggplot2 scatterplots

This question is a follow-up of "How can a data ellipse be superimposed on a ggplot2 scatterplot?".
I want to create a 2D scatterplot using ggplot2 with filled superimposed confidence ellipses. Using the solution of Etienne Low-Décarie from the above mentioned post, I do get superimposed ellipses to work. The solution is based on stat_ellipse available from https://github.com/JoFrhwld/FAAV/blob/master/r/stat-ellipse.R
Q: How can I fill the inner area of the ellipse(s) with a certain color (more specifically I want to use the color of the ellipse border with some alpha)?
Here is the minimal working example modified from the above mentioned post:
# create data
set.seed(20130226)
n <- 200
x1 <- rnorm(n, mean = 2)
y1 <- 1.5 + 0.4 * x1 + rnorm(n)
x2 <- rnorm(n, mean = -1)
y2 <- 3.5 - 1.2 * x2 + rnorm(n)
class <- rep(c("A", "B"), each = n)
df <- data.frame(x = c(x1, x2), y = c(y1, y2), colour = class)
# get code for "stat_ellipse"
library(devtools)
library(ggplot2)
source_url("https://raw.github.com/JoFrhwld/FAAV/master/r/stat-ellipse.R")
# scatterplot with confidence ellipses (but inner ellipse areas are not filled)
qplot(data = df, x = x, y = y, colour = class) + stat_ellipse()
Output of working example:
As mentioned in the comments, polygon is needed here:
qplot(data = df, x = x, y = y, colour = class) +
stat_ellipse(geom = "polygon", alpha = 1/2, aes(fill = class))
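Recent ggplot2 releases ship their own stat_ellipse(), so sourcing the external file should no longer be needed. A minimal sketch of the same plot written with ggplot() instead of qplot(), assuming a reasonably current ggplot2. The data frame column is named class here for clarity (an assumption; in the question the column is called colour and class is a separate vector), and the alpha value is arbitrary.

```r
library(ggplot2)

# Same simulated data as in the question
set.seed(20130226)
n <- 200
x1 <- rnorm(n, mean = 2)
y1 <- 1.5 + 0.4 * x1 + rnorm(n)
x2 <- rnorm(n, mean = -1)
y2 <- 3.5 - 1.2 * x2 + rnorm(n)
df <- data.frame(x = c(x1, x2), y = c(y1, y2),
                 class = rep(c("A", "B"), each = n))

# geom = "polygon" lets the ellipse take a fill; mapping fill to class
# gives each ellipse the same colour as its border
p <- ggplot(df, aes(x, y, colour = class)) +
  geom_point() +
  stat_ellipse(geom = "polygon", aes(fill = class), alpha = 0.25)
p
```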
