Plot decision boundaries with ggplot2? - r

How do I plot the equivalent of contour (base R) with ggplot2? Below is an example with linear discriminant function analysis:
require(MASS)
iris.lda<-lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
datPred<-data.frame(Species=predict(iris.lda)$class,predict(iris.lda)$x) #create data.frame
#Base R plot
eqscplot(datPred[,2],datPred[,3],pch=as.double(datPred[,1]),col=as.double(datPred[,1])+1)
#Create decision boundaries
iris.lda2 <- lda(datPred[,2:3], datPred[,1])
x <- seq(min(datPred[,2]), max(datPred[,2]), length.out=30)
y <- seq(min(datPred[,3]), max(datPred[,3]), length.out=30)
Xcon <- matrix(c(rep(x,length(y)),
rep(y, rep(length(x), length(y)))),,2) #Set all possible pairs of x and y on a grid
iris.pr1 <- predict(iris.lda2, Xcon)$post[, c("setosa","versicolor")] %*% c(1,1) #posterior probabilities of a point belonging to each class
contour(x, y, matrix(iris.pr1, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple") #Plot contour lines in the base R plot
iris.pr2 <- predict(iris.lda2, Xcon)$post[, c("virginica","setosa")] %*% c(1,1)
contour(x, y, matrix(iris.pr2, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple")
#Equivalent plot with ggplot2 but without decision boundaries
library(ggplot2)
ggplot(datPred, aes(x=LD1, y=LD2, col=Species) ) +
geom_point(size = 3, aes(pch = Species))
It is not possible to use a matrix when plotting contour lines with ggplot. The matrix can be rearranged to a data-frame using melt. In the data-frame below the probability values from iris.pr1 are displayed in the first column along with the x and y coordinates in the following two columns. The x and y coordinates form a grid of 30 x 30 points.
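library(reshape) #melt() from reshape names the matrix index columns X1 and X2 (reshape2 would name them Var1 and Var2)
df <- transform(melt(matrix(iris.pr1, length(x), length(y))), x=x[X1], y=y[X2])[,-c(1,2)]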
I would like to plot the coordinates (preferably connected by a smoothed curve) where the posterior probabilities are 0.5 (i.e. the decision boundaries).

You can use geom_contour in ggplot to achieve a similar effect. As you correctly assumed, you do have to transform your data. I ended up just doing
pr<-data.frame(x=rep(x, length(y)), y=rep(y, each=length(x)),
z1=as.vector(iris.pr1), z2=as.vector(iris.pr2))
And then you can pass that data.frame to the geom_contour and specify you want the breaks at 0.5 with
ggplot(datPred, aes(x=LD1, y=LD2) ) +
geom_point(size = 3, aes(pch = Species, col=Species)) +
geom_contour(data=pr, aes(x=x, y=y, z=z1), breaks=c(0,.5)) +
geom_contour(data=pr, aes(x=x, y=y, z=z2), breaks=c(0,.5))
and that gives the scatterplot with both decision boundaries drawn on top of the points.
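A self-contained variant of the same idea is sketched below. It is only a sketch: the grid is built with expand.grid, and the column names p_versicolor and p_virginica are made up here for illustration. Because the three posteriors sum to 1, the 0.5 contour of a single class's posterior coincides with the 0.5 contour of the sum of the other two, which is what iris.pr1 and iris.pr2 hold.

```r
library(MASS)
library(ggplot2)

# Refit as in the question: LDA on iris, then a second LDA on the LD1/LD2 scores
iris.lda  <- lda(Species ~ ., data = iris)
datPred   <- data.frame(Species = predict(iris.lda)$class, predict(iris.lda)$x)
iris.lda2 <- lda(Species ~ LD1 + LD2, data = datPred)

# Regular grid over the LD plane (100 x 100 is an arbitrary choice)
grid <- expand.grid(
  LD1 = seq(min(datPred$LD1), max(datPred$LD1), length.out = 100),
  LD2 = seq(min(datPred$LD2), max(datPred$LD2), length.out = 100)
)

# Posterior probability of each class at every grid point
post <- predict(iris.lda2, newdata = grid)$posterior
grid$p_versicolor <- post[, "versicolor"]
grid$p_virginica  <- post[, "virginica"]

p <- ggplot(datPred, aes(LD1, LD2)) +
  geom_point(aes(colour = Species, shape = Species), size = 3) +
  geom_contour(data = grid, aes(z = p_versicolor), breaks = 0.5, colour = "black") +
  geom_contour(data = grid, aes(z = p_virginica),  breaks = 0.5, colour = "black")
print(p)
```

A finer grid gives a smoother-looking boundary at the cost of more predictions.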

The partimat function in the klaR package does what you want for the observed predictors. If you want the same for the LDA projections, build a data frame that augments the original data with the LD1...LDk projections, then call partimat with the formula Group~LD1+...+LDk and method='lda'. You then see the "LD-plane" that you intended to see, nicely partitioned for you. This seemed easier to me, at least for explaining to students newer to R, since it just reuses an existing function in a way it wasn't quite intended for.

Related

plotting principal components 3,4 using autoplot (ggfortify) in R

I have been trying to use autoplot (in the ggfortify R package) to plot data points in PCA coordinates. For data matrix D2,
autoplot(prcomp(D2),colour=color_codes)
works fine as far as generating a scatterplot of points in the space of principal components 1+2. However, PCA components 1+2 only explain about 30% of the variance, and I would like to do the same for PCA 1+3, 2+3, 3+4, etc. Is there a simple argument in autoplot that will let me do this, and if not, what's the simplest function I can use to do so?
Additionally, is there some way to calculate and add centroids using autoplot?
From ?autoplot.prcomp:
autoplot(object, data = NULL, scale = 1, x = 1, y = 2, ...)
where:
x = principal component number used in x axis and
y = principal component number used in y axis
Hence, if you need to plot PC2 vs PC3 and to add the centroid:
library(ggfortify)
set.seed(1)
D2 <- matrix(rnorm(1000),ncol=10)
prcmp <- prcomp(D2)
pc.x <- 2
pc.y <- 3
cnt.x <- mean(prcmp$x[,pc.x])
cnt.y <- mean(prcmp$x[,pc.y])
autoplot(prcmp, x=pc.x, y=pc.y) +
geom_point(x=cnt.x, y=cnt.y, colour="red", size=5)
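For centroids per group rather than one overall centroid, a minimal sketch with plain ggplot2 follows. The grouping factor grp is hypothetical (the question does not show one), and the scores are plotted directly instead of through autoplot, because autoplot may rescale the scores (its scale argument defaults to 1 in the signature above), in which case centroids computed from prcomp(...)$x would not necessarily line up with the plotted points.

```r
library(ggplot2)

set.seed(1)
D2  <- matrix(rnorm(1000), ncol = 10)
grp <- factor(rep(c("A", "B"), each = 50))  # hypothetical grouping, not from the question

scores <- as.data.frame(prcomp(D2)$x)       # columns PC1 ... PC10
scores$grp <- grp

# One centroid per group: the mean score on each plotted component
centroids <- aggregate(cbind(PC3, PC4) ~ grp, data = scores, FUN = mean)

p <- ggplot(scores, aes(PC3, PC4, colour = grp)) +
  geom_point() +
  geom_point(data = centroids, shape = 4, size = 5, stroke = 2)
print(p)
```

Swapping PC3 and PC4 for any other pair of column names plots a different pair of components.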

Combining two plots in R

I wish to compare the observed values to the fitted ones. To do so, I decided to use a plot in R. What I want to do is to plot X vs Y and X vs Y.fitted on the same plot. I have written some code, but it is incomplete. My plot needs to look like this one below. On the plot, circles and crosses represent the observed and fitted values respectively
set.seed(1)
x <- runif(8,0,1)
y <- runif(8,0,1)
y.fitted <- runif(8,0,1)
plot(x,y,pch=1)
plot(x,y.fitted,pch=5)
In your code, the second call to plot will not add points to the existing plot but create a new one. You can use the function points to add points to the existing plot.
plot(x, y, pch = 1)
points(x, y.fitted, pch = 4)
Running plot a second time creates a new plot. You could use points instead:
set.seed(1)
x <- runif(8,0,1)
y <- runif(8,0,1)
y.fitted <- runif(8,0,1)
plot(x,y,pch=1)
points(x,y.fitted,pch=4) # pch=4 draws crosses; pch=5 would draw diamonds
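One detail worth keeping in mind: plot sets the axis limits from the first series only, so fitted values outside the range of y would be drawn off the plot. A small sketch that sets the limits from both series and adds a legend (the legend labels are my own choice):

```r
set.seed(1)
x        <- runif(8, 0, 1)
y        <- runif(8, 0, 1)
y.fitted <- runif(8, 0, 1)

# Axis range that covers observed and fitted values
ylim <- range(y, y.fitted)

plot(x, y, pch = 1, ylim = ylim, ylab = "y")   # circles: observed
points(x, y.fitted, pch = 4)                   # crosses: fitted
legend("topright", legend = c("observed", "fitted"), pch = c(1, 4))
```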
A solution with ggplot2, which gives a neater-looking graph:
library(ggplot2)
library(reshape2) # provides melt()
df = data.frame(x=runif(8,0,1),y=runif(8,0,1),y.fitted=runif(8,0,1))
df = melt(df, id=c('x'))
ggplot() + geom_point(aes(x=x,y=value, shape=variable, colour=variable), df)

plotting multiple plots in ggplot2 on same graph that are unrelated

How would one use the smooth.spline() method in a ggplot2 scatterplot?
My data is in a data frame called data, with two columns, x and y.
The smooth.spline would be sm <- smooth.spline(data$x, data$y). I believe I should use geom_line(), with sm$x and sm$y as the xy coordinates. However, how would one plot a scatterplot and a line plot that are completely unrelated on the same graph? I suspect it has something to do with aes(), but I am getting a little confused.
You can use different data frames in different geoms and map the relevant variables with aes, or you can combine the relevant variables from the output of smooth.spline into one data frame:
# example data
set.seed(1)
dat <- data.frame(x = rnorm(20, 10,2))
dat$y <- dat$x^2 - 20*dat$x + rnorm(20,10,2)
# spline
s <- smooth.spline(dat)
# plot - combine the original x & y and the fitted values returned by
# smooth.spline into a data.frame
library(ggplot2)
ggplot(data.frame(x=s$data$x, y=s$data$y, xfit=s$x, yfit=s$y)) +
geom_point(aes(x,y)) + geom_line(aes(xfit, yfit))
# or you could use geom_smooth
ggplot(dat, aes(x , y)) + geom_point() + geom_smooth()
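The first option (a separate data frame per geom) can be sketched as follows. Here the spline is evaluated with predict on a denser grid so the line looks smooth; the grid size of 200 and the name fit are arbitrary choices of mine.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rnorm(20, 10, 2))
dat$y <- dat$x^2 - 20 * dat$x + rnorm(20, 10, 2)

s <- smooth.spline(dat$x, dat$y)

# predict() returns a list with components x and y; turn it into a data frame
fit <- as.data.frame(predict(s, x = seq(min(dat$x), max(dat$x), length.out = 200)))

p <- ggplot(dat, aes(x, y)) +
  geom_point() +                             # raw data from dat
  geom_line(data = fit, colour = "blue")     # spline curve from fit
print(p)
```

Because both data frames use the column names x and y, the line layer can inherit the aes mapping and only needs its own data argument.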

Nonparametric quantile regression curves to scatterplot

I created a scatterplot (multiple groups GRP) with IV=time, DV=concentration. I wanted to add the quantile regression curves (0.025,0.05,0.5,0.95,0.975) to my plot.
And by the way, this is what I did to create the scatter-plot:
attach(E) ## E is the name I gave to my data
## Change Group to factor so that may work with levels in the legend
Group<-as.character(Group)
Group<-as.factor(Group)
## Make the colored scatter-plot
mycolors = c('red','orange','green','cornflowerblue')
plot(Time,Concentration,main="Template",xlab="Time",ylab="Concentration",pch=18,col=mycolors[Group])
## This also works identically
## with(E,plot(Time,Concentration,col=mycolors[Group],main="Template",xlab="Time",ylab="Concentration",pch=18))
## Use identify to identify each point by group number (to check)
## identify(Time,Concentration,col=mycolors[Group],labels=Group)
## Press Esc or press Stop to stop identify function
## Create legend
## Use locator(n=1,type="o") to find the point to align top left of legend box
legend('topright',legend=levels(Group),col=mycolors,pch=18,title='Group')
Because the data that I created here is a small subset of my larger data, it may look like it can be approximated as a rectangular hyperbola. But I don't want to assume a mathematical relationship between my independent and dependent variables yet.
I think nlrq from the package quantreg may be the answer, but I don't understand how to use the function when I don't know the relationship between my variables.
I found this graph in a scientific article, and I want to produce precisely the same kind of graph:
Again, thanks for your help!
Update
Test.csv
It was pointed out to me that my sample data was not reproducible. Here is a sample of my data.
library(evd)
qcbvnonpar(p=c(0.025,0.05,0.5,0.95,0.975),cbind(TAD,DV),epmar=T,plot=F,add=T)
I also tried evd::qcbvnonpar, but the curve doesn't seem very smooth.
Maybe have a look at quantreg::rqss for smoothing splines and quantile regression.
Sorry for the not so nice example data:
set.seed(1234)
period <- 100
x <- 1:100
y <- sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
plot(x, y)
lines(x[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(x[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(x[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
I have in the past frequently struggled with rqss and my issues have almost always been related to the ordering of the points.
You have multiple measurements at various time points, which is why you're getting different lengths. This works for me:
dat <- read.csv("~/Downloads/Test.csv")
library(quantreg)
dat <- plyr::arrange(dat,Time)
fit<-rqss(Concentration~qss(Time,constraint="N"),tau=0.5,data = dat)
with(dat,plot(Time,Concentration))
lines(unique(dat$Time)[-1],fit$coef[1] + fit$coef[-1])
Sorting the data frame prior to fitting the model appears necessary.
In case you want ggplot2 graphic...
I based this example on that of @EDi. I increased the size of x and y so that the quantile lines would be less wiggly. Because of this increase, I need to use unique(x) in place of x in some of the calls.
Here's the modified set-up:
set.seed(1234)
period <- 100
x <- rep(1:100,each=100)
y <- 1*sin(2*pi*x/period) + runif(length(x),-1,1)
require(quantreg)
mod <- rqss(y ~ qss(x))
mod2 <- rqss(y ~ qss(x), tau=0.75)
mod3 <- rqss(y ~ qss(x), tau=0.25)
Here are the two plots:
# @EDi's base graphics example
plot(x, y)
lines(unique(x)[-1], mod$coef[1] + mod$coef[-1], col = 'red')
lines(unique(x)[-1], mod2$coef[1] + mod2$coef[-1], col = 'green')
lines(unique(x)[-1], mod3$coef[1] + mod3$coef[-1], col = 'green')
# @swihart's ggplot2 example:
library(data.table) # provides data.table()
library(ggplot2)
## get into dataset so that ggplot2 can have some fun:
qrdf <- data.table(x = unique(x)[-1],
median = mod$coef[1] + mod$coef[-1],
qupp = mod2$coef[1] + mod2$coef[-1],
qlow = mod3$coef[1] + mod3$coef[-1]
)
line_size = 2
ggplot() +
geom_point(aes(x=x, y=y),
color="black", alpha=0.5) +
## quantiles:
geom_line(data=qrdf,aes(x=x, y=median),
color="red", alpha=0.7, size=line_size) +
geom_line(data=qrdf,aes(x=x, y=qupp),
color="blue", alpha=0.7, size=line_size, lty=1) +
geom_line(data=qrdf,aes(x=x, y=qlow),
color="blue", alpha=0.7, size=line_size, lty=1)
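As an alternative to assembling each curve from the coefficients by hand, predict on the rqss fit may be simpler. This is only a sketch under assumptions: the simulated data, the set of quantiles in taus, and lambda = 1 are my own choices, and the prediction grid stays inside the observed range of x, since extrapolation is not supported for qss terms as far as I know.

```r
library(quantreg)

set.seed(1234)
dat <- data.frame(x = rep(1:100, each = 10))
dat$y <- sin(2 * pi * dat$x / 100) + runif(nrow(dat), -1, 1)

taus <- c(0.25, 0.5, 0.75)
newd <- data.frame(x = 2:99)   # stay inside the observed range of x

# One column of predicted quantiles per value of tau
qhat <- sapply(taus, function(tau) {
  fit <- rqss(y ~ qss(x, lambda = 1), tau = tau, data = dat)
  as.numeric(predict(fit, newdata = newd))
})

plot(dat$x, dat$y, col = "grey")
matlines(newd$x, qhat, lty = 1, col = c("blue", "red", "blue"))
```

A larger lambda gives smoother curves; the value that suits real data has to be chosen by inspection or some selection criterion.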

How do I plot the first derivative of the smoothing function?

I have the following script that emulates the type of data structure I have and analysis that I want to do on it,
library(ggplot2)
library(reshape2)
library(splines) # provides ns()
n <- 10
df <- data.frame(t=seq(n)*0.1, a =sort(rnorm(n)), b =sort(rnorm(n)),
a.1=sort(rnorm(n)), b.1=sort(rnorm(n)),
a.2=sort(rnorm(n)), b.2=sort(rnorm(n)))
head(df)
mdf <- melt(df, id=c('t'))
## head(mdf)
levels(mdf$variable) <- rep(c('a','b'),3)
g <- ggplot(mdf,aes(t,value,group=variable,colour=variable))
g +
stat_smooth(method='lm', formula = y ~ ns(x,3)) +
geom_point() +
facet_wrap(~variable)
What I would like to do in addition to this is plot the first derivative of the smoothing function against t and against the factors, c('a','b'), as well. Any suggestions how to go about this would be greatly appreciated.
You'll have to construct the derivative yourself, and there are two possible ways to do that. Let me illustrate using only one group:
require(splines) # thanks to @Chase for the notice
lmdf <- mdf[mdf$variable=="b",]
model <- lm(value~ns(t,3),data=lmdf)
You then simply define your derivative as diff(Y)/diff(X) based on your predicted values, as you would do for differentiation of a discrete function. It's a very good approximation if you take enough X points.
X <- data.frame(t=seq(0.1,1.0,length=100) ) # make an ordered sequence
Y <- predict(model,newdata=X) # calculate predictions for that sequence
plot(X$t,Y,type="l",main="Original fit") #check
dY <- diff(Y)/diff(X$t) # the derivative of your function
dX <- rowMeans(embed(X$t,2)) # centers the X values for plotting
plot(dX,dY,type="l",main="Derivative") #check
As you can see, this way you obtain the points for plotting the derivative. From here you can work out how to apply this to both levels and combine those points into the plot you like.
Here's one approach to plotting this with ggplot. There may be a more efficient way to do it, but this uses the manual calculations done by @Joris. We'll simply construct a long data.frame with all of the X and Y values while also supplying a variable to "facet" the plots:
require(ggplot2)
originalData <- data.frame(X = X$t, Y, type = "Original")
derivativeData <- data.frame(X = dX, Y = dY, type = "Derivative")
plotData <- rbind(originalData, derivativeData)
ggplot(plotData, aes(X,Y)) +
geom_line() +
facet_wrap(~type, scales = "free_y")
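Applying the same finite-difference calculation to both levels can be sketched like this. The data are simulated in the shape of mdf from the question, and the helper name deriv_by_group and the column name dvalue are hypothetical.

```r
library(splines)
library(ggplot2)

set.seed(1)
n <- 10
mdf <- data.frame(
  t        = rep(seq(n) * 0.1, times = 2),
  variable = rep(c("a", "b"), each = n),
  value    = c(sort(rnorm(n)), sort(rnorm(n)))
)

tgrid <- seq(0.1, 1.0, length.out = 100)   # ordered sequence for prediction

# Fit the spline model for one group and return its numerical derivative
deriv_by_group <- function(d) {
  model <- lm(value ~ ns(t, 3), data = d)
  Y <- predict(model, newdata = data.frame(t = tgrid))
  data.frame(
    variable = d$variable[1],
    t        = rowMeans(embed(tgrid, 2)),      # midpoints of the grid
    dvalue   = unname(diff(Y) / diff(tgrid))   # finite-difference slope
  )
}

derivs <- do.call(rbind, lapply(split(mdf, mdf$variable), deriv_by_group))

p <- ggplot(derivs, aes(t, dvalue, colour = variable)) +
  geom_line() +
  facet_wrap(~variable)
print(p)
```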
If the data is smoothed using smooth.spline, the derivative of the predicted data can be requested with the argument deriv in predict. Following on from @Joris's solution:
lmdf <- mdf[mdf$variable == "b",]
model <- smooth.spline(x = lmdf$t, y = lmdf$value)
Y <- predict(model, x = seq(0.1,1.0,length=100), deriv = 1) # first derivative
plot(Y$x, Y$y, type = 'l') # with a numeric vector as x, predict() returns plain vectors x and y
Any dissimilarity in the output is most likely due to differences in the smoothing.
