Using external variables with ggpairs - r

I'm writing functions for an R-package which will use a wrapper function for ggpairs from the package GGally to plot the output objects of the methods. I would like ggpairs to be able to use variables not part of the input object for defining aesthetics but this produces an error message with ggpairs, see below for a minimal example:
library(GGally)
library(ggplot2)
# The data object
object <- list(x = iris[, 1:2], label = "Iris data")
# The grouping
y <- iris[, 5]
# The plotting function
wrapper <- function(object, mapping = aes()){
ggpairs(object$x, mapping)
}
# This works
wrapper(object)
# This doesn't work
wrapper(object, aes(color = y))
The latter one produces the error message:
Error in .subset(col, i) : object of type 'symbol' is not subsettable
Any trick to get the second plotting command to work without modifying the input object would be greatly appreciated.
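One possible workaround is sketched below, under stated assumptions. The wrapper copies object$x into a temporary data frame, appends the external vectors as extra columns, and uses the columns argument of ggpairs so that only the original variables get panels. The extra argument and the plot_data name are hypothetical and not part of GGally. Each external vector must be passed under the same name that is used inside aes().

```r
library(GGally)
library(ggplot2)

# The data object and the external grouping from the question
object <- list(x = iris[, 1:2], label = "Iris data")
y <- iris[, 5]

# Hypothetical wrapper: `extra` is a named list of external vectors.
# The input object is not modified; only a local copy gains columns.
wrapper <- function(object, mapping = aes(), extra = list()) {
  plot_data <- object$x
  for (nm in names(extra)) {
    plot_data[[nm]] <- extra[[nm]]
  }
  # Only the original columns are drawn as panels; the appended
  # columns are available to the aesthetic mapping.
  ggpairs(plot_data, mapping, columns = seq_len(ncol(object$x)))
}

p <- wrapper(object, aes(colour = y), extra = list(y = y))
p
```

The design choice here is that the aesthetics are resolved against the data frame handed to ggpairs, so the external variable has to travel with the data rather than be looked up in the calling environment.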

Set Axis Limits of mixfitEM plot

I want to set limits to the x-axis of the plot of the output of the mixfit function of the mixR package. This output is of class mixfitEM.
Reproducible example: first I simulate a mixture of log-normals.
rm(list=ls())
library(mixR)
library("tidyverse")
set.seed(07062022)
N <- 1000
lbda=.5
mu1=1
Del=3
mu2=mu1+Del
components <- sample(1:2,prob=c((1-lbda), lbda),size=N,replace=TRUE)
mus <- c(mu1,mu2)
sds <- sqrt(c(1,.5))
Y <- rlnorm(n=N,meanlog=mus[components],sdlog=sds[components])
Then I try to fit two log-normal components to these data using mixfit() from mixR. In my first attempt I used the plot() function, but plot() ignores the xlim argument.
mod4 <- mixfit(Y, ncomp = 2, family = 'lnorm')
plot(mod4, title = 'Log-Normal Mixture (2 components)', xlim=c(0,200))
Then I tried to plot with ggplot, which, in theory should be possible according to the mixR manual. But ggplot does not understand the mixfitEM class.
ggplot(mod4)+
+ coord_cartesian(xlim = c(0, 200))
produces the following error:
> ggplot(data.frame(mod4))+
+ + coord_cartesian(xlim = c(0, 200))
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"mixfitEM"’ to a data.frame
> ggplot(mod4)+
+ + coord_cartesian(xlim = c(0, 200))
Error in `fortify()`:
! `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class mixfitEM.
Run `rlang::last_error()` to see where the error occurred.
When you call plot on the mixfitEM object, you are creating a ggplot. The reason for this is that plot is a generic function, so when package authors create a new class, they are free to use whatever method they want to draw the plot. In this case, if you examine the source code of mixR:::plot.mixfitEM you will see it actually uses ggplot to draw its output. This means you can use ggplot syntax to modify the output:
plot(mod4, title = 'Log-Normal Mixture (2 components)') + xlim(c(0, 200))
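As a hedged variation on the line above: in ggplot2, xlim() sets the scale limits, so values outside the range are dropped before anything is drawn, whereas coord_cartesian() only zooms the view. The sketch below assumes, as the answer describes, that plot() on a mixfitEM object returns a ggplot object. The simulation mirrors the one in the question with a different seed.

```r
library(mixR)
library(ggplot2)

# Simulate a two-component log-normal mixture, as in the question
set.seed(1)
N <- 1000
components <- sample(1:2, size = N, replace = TRUE, prob = c(0.5, 0.5))
mus <- c(1, 4)
sds <- sqrt(c(1, 0.5))
Y <- rlnorm(N, meanlog = mus[components], sdlog = sds[components])

mod4 <- mixfit(Y, ncomp = 2, family = "lnorm")

# Keep the plot object instead of only printing it
p <- plot(mod4, title = "Log-Normal Mixture (2 components)")

# Zoom without discarding observations outside the range
p + coord_cartesian(xlim = c(0, 200))
```

Which of the two is preferable depends on whether the out-of-range values should still contribute to what is drawn.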

Error in axis(side = side, at = at, labels = labels, ...) : invalid value specified for graphical parameter "pch"

I have applied the DBSCAN algorithm to the built-in iris dataset in R, but I get an error when I try to visualise the output using plot().
Following is my code.
library(fpc)
library(dbscan)
data("iris")
head(iris,2)
data1 <- iris[,1:4]
head(data1,2)
set.seed(220)
db <- dbscan(data1,eps = 0.45,minPts = 5)
table(db$cluster,iris$Species)
plot(db,data1,main = 'DBSCAN')
Error: Error in axis(side = side, at = at, labels = labels, ...) :
invalid value specified for graphical parameter "pch"
How can I rectify this error?
I have a suggestion below, but first I see two issues:
You're loading two packages, fpc and dbscan, both of which have different functions named dbscan(). This could create tricky bugs later (e.g. if you change the order in which you load the packages, different functions will be run).
It's not clear what you're trying to plot, either what the x- or y-axes should be or the type of plot. The function plot() generally takes a vector of values for the x-axis and another for the y-axis (although not always, consult ?plot), but here you're passing it a data.frame and a dbscan object, and it doesn't know how to handle it.
Here's one way of approaching it, using ggplot() to make a scatterplot, and dplyr for some convenience functions:
# load our packages
# note: only loading dbscan, not loading fpc since we're not using it
library(dbscan)
library(ggplot2)
library(dplyr)
# run dbscan::dbscan() on the first four columns of iris
db <- dbscan::dbscan(iris[,1:4],eps = 0.45,minPts = 5)
# create a new data frame by binding the derived clusters to the original data
# this keeps our input and output in the same dataframe for ease of reference
data2 <- bind_cols(iris, cluster = factor(db$cluster))
# make a table to confirm it gives the same results as the original code
table(data2$cluster, data2$Species)
# using ggplot, make a point plot with "jitter" so each point is visible
# x-axis is species, y-axis is cluster, also coloured according to cluster
ggplot(data2) +
geom_point(mapping = aes(x=Species, y = cluster, colour = cluster),
position = "jitter") +
labs(title = "DBSCAN")
[Image: jittered point plot of cluster against Species, coloured by cluster]
If you're looking for something else, please be more specific about what the final plot should look like.
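If the goal is instead to see the clusters in feature space, a scatterplot matrix coloured by cluster is one option. This is a minimal sketch using base graphics, assuming only the dbscan package is loaded so that there is no ambiguity with fpc.

```r
library(dbscan)

# Cluster on the four numeric columns, calling the function by namespace
x <- iris[, 1:4]
db <- dbscan::dbscan(x, eps = 0.45, minPts = 5)

# Cluster 0 marks noise points; adding 1 shifts the labels to valid
# colour indices, so noise is drawn in colour 1 (black)
pairs(x, col = db$cluster + 1L, pch = 19, main = "DBSCAN")
```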

Passing object to lines function for plot data

My issue is the following. I use the ROCR package to plot data. The performance function returns an object that I pass to plot like this:
example <- performance(prediction1,"tpr","fpr")
plot(example,col="red")
I want to add another performance object to this plot, but the lines function accepts x and y coordinates, not an object. If I call lines(example2, col="blue"), this error appears:
Error in as.double(y) :
cannot coerce type 'S4' to vector of type 'double'
You can add a new line by passing add = TRUE as an argument to plot:
library(ROCR)
data(ROCR.simple)
prediction1 <- prediction( ROCR.simple$predictions, ROCR.simple$labels)
example1 <- performance(prediction1,"tpr","fpr")
plot(example1, col="red")
example2 <- performance(prediction1, "sens", "spec")
plot(example2, col="blue", add = TRUE)
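If lines() is preferred, the coordinates can be pulled out of the S4 object by hand. A performance object stores its values in the x.values and y.values slots, each a list with one numeric vector per run. The sketch below reuses the ROCR.simple data. The names perf1 and perf2 are only illustrative.

```r
library(ROCR)
data(ROCR.simple)

pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf1 <- performance(pred, "tpr", "fpr")
perf2 <- performance(pred, "sens", "spec")

plot(perf1, col = "red")

# Slots are accessed with @; [[1]] selects the first (here the only) run
lines(perf2@x.values[[1]], perf2@y.values[[1]], col = "blue")
```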

Name aesthetics using ".resid" syntax in ggplot2

Someone pointed out to me that there's a different way to specify the data and the aesthetics in ggplot2 as below. I've never seen this -- in all the books, docs, data is always a data frame and inside aes are the names of the variables. What's this dot syntax?
y <- rnorm(100) ; x <- rnorm(100)
m <- lm(y ~ x)
library(ggplot2)
ggplot(data = m, aes(.resid, .fitted)) + geom_point()
ggplot is calling fortify on the lm object, which produces a dataframe that is then passed to ggplot.data.frame.
To see the code use
ggplot2:::ggplot.default
#function (data = NULL, mapping = aes(), ..., environment = parent.frame())
#{
# ggplot.data.frame(fortify(data, ...), mapping, environment = environment)
#}
#<environment: namespace:ggplot2>
As for fortify, it coerces various models and R objects to data frames. Have a look at methods(fortify).
You can directly see the results of fortify
ff <- fortify(m)
names(ff)
#[1] "y" "x" ".hat" ".sigma" ".cooksd" ".fitted" ".resid" ".stdresid"
So the dot isn't doing anything clever within aes, but is actually part of the column names that fortify produces.

Using foreach() in R to speed up loop for ggplot2

I would like to create a PDF file containing hundreds of plots in a certain order.
My strategy was using foreach() and storing each ggplot2 object into the output list, and then printing each ggplot2 object to the output file.
For example, I would like to plot a histogram of prices for every distinct value of carat in the diamonds dataset:
library(ggplot2)
library(plyr)
library(foreach) # for parallelization
library(doParallel) # for parallelization
#setup parallel backend to use 4 processors
cl<-makeCluster(4)
registerDoParallel(cl)
# use diamonds dataset
carats.summary <- ddply(diamonds, .(carat), summarise, count = length(carat))
m.list <- foreach(i = 1:length(carats.summary$carat),
.packages = "ggplot2") %dopar% {
jcarat = carats.summary$carat[i]
m <- ggplot(subset(diamonds, carat == jcarat), aes(x = price)) +
geom_histogram()
print(m)
}
With this code, I am hoping to create a list of ggplot2 objects which I can then save into a single pdf file (for example using pdf()) in an ordered manner (for example, in ascending carats).
However, running this results in an error message:
Error in serialize(data, node$con) : error writing to connection
I suspect this is due to the fact that if I tried to append the ggplot2 object to a list, I would get a warning message like this:
lst <- vector(mode = "list")
lst[1] <- m
Warning message:
In lst[1] <- m :
number of items to replace is not a multiple of replacement length
Although this is pure speculation and I could be wrong.
Does anybody have an idea how to use foreach() to save ggplot2 objects onto a list? Or some way to parallelize for loops involving ggplot2?
Thanks in advance.
You shouldn't be printing the object inside the loop, just create the ggplot object. Only print when you have the graphic device open that you want.
m.list <- foreach(i = seq_along(carats.summary$carat),
                  .packages = "ggplot2") %dopar% {
  jcarat <- carats.summary$carat[i]
  # the last expression is the value foreach collects; no print() here
  ggplot(subset(diamonds, carat == jcarat), aes(x = price)) +
    geom_histogram()
}
then you can get at them with
m.list[[1]]
etc...
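To finish the job, the stored plots can be printed once the PDF device is open. The sketch below is self-contained, so it builds a short list with lapply() in place of the foreach() loop and handles only the first three carat values. The names plots and out_file are illustrative.

```r
library(ggplot2)

# Build a small, ordered list of plot objects (ascending carat)
carats <- sort(unique(diamonds$carat))[1:3]
plots <- lapply(carats, function(jcarat) {
  ggplot(subset(diamonds, carat == jcarat), aes(x = price)) +
    geom_histogram(bins = 30) +
    labs(title = paste("carat =", jcarat))
})

# Open the device, print each plot onto its own page, then close it
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
for (p in plots) {
  print(p)
}
invisible(dev.off())
```

Because the list keeps the order in which it was built, the pages come out in ascending carat order.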
