I imputed my data using both the packages mice and miceRanger and I would like to compare the distributions of the imputed variables with the original data. In miceRanger this is very easy using the function plotDistributions() which displays density plots for the numeric variables and barplots for factors.
In mice, if the variables are numeric, it is easy to compare the distributions using stripplot() or bwplot(), but I cannot find a simple way to do it if the variables are factors. I wonder if I am missing something or I just have to give in and create a custom routine for that.
Does anyone have any suggestion? Thanks in advance!
As far as I am aware, there isn't a mice-equivalent function similar to miceRanger::plotDistributions() (which is disappointing because it's a very convenient function). However, you can use ggmice to use ggplot2 syntax on mids objects.
library(miceRanger)
library(mice)
library(ggmice)
data(nhanes)
nhanes$hyp <- factor(nhanes$hyp)
## miceRanger
imp1 <- miceRanger(data = nhanes)
plotDistributions(imp1)
## mice and ggmice
imp2 <- mice(data = nhanes)
ggmice(imp2, aes(x = hyp)) +
geom_bar()
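If a numeric comparison is enough, the level proportions can also be tabulated by hand. The following is only a sketch: compare_factor_props() is a hypothetical helper (not part of mice or miceRanger), and the toy vectors stand in for one factor column of the original data and for the matching data frame of imputations (one column per imputation).

```r
# Hypothetical helper: proportion of each factor level among the observed
# values and within each set of imputed values.
compare_factor_props <- function(observed, imputed) {
  lev <- levels(observed)
  prop <- function(v) {
    tab <- table(factor(v, levels = lev))
    setNames(as.numeric(tab) / sum(tab), lev)
  }
  cbind(observed = prop(observed[!is.na(observed)]),
        do.call(cbind, lapply(imputed, prop)))
}

# Toy stand-ins (made up for illustration): a factor with two missing
# values and two imputations for those two cases.
observed <- factor(c("1", "2", NA, "1", NA, "1"), levels = c("1", "2"))
imputed <- data.frame(
  `1` = factor(c("1", "2"), levels = c("1", "2")),
  `2` = factor(c("1", "1"), levels = c("1", "2")),
  check.names = FALSE
)

props <- compare_factor_props(observed, imputed)
print(round(props, 2))
```

With a real mids object the call would presumably be compare_factor_props(imp2$data$hyp, imp2$imp$hyp), and barplot(t(props), beside = TRUE) gives a quick base-graphics picture of the same table.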
I adapted plotDistributions() from miceRanger to work with mice.
Beware: this function has only been tested quickly on two random datasets, so I cannot guarantee it is free of bugs.
For more info, see here
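library(rlang)       # sym(); loaded first so data.table's := is not masked
library(data.table)  # as.data.table(), melt(), setnames()
library(ggplot2)
library(ggpubr)      # ggarrange()
PlotDist <- function (miceObj, vars = names(miceObj$imp), dotsize = 0.5,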
...)
{
pos <- which(miceObj$nmis!=0)
vars <- vars[pos]
newClasses <- sapply(miceObj$data[pos], class)
if (vars[[1]] == "allCategorical")
vars <- names(newClasses[newClasses == "factor"])
if (vars[[1]] == "allNumeric")
vars <- names(newClasses[newClasses != "factor"])
newClasses <- newClasses[vars]
facVars <- newClasses[newClasses == "factor"]
numVars <- newClasses[newClasses != "factor"]
if (length(facVars) > 0) {
facList <- lapply(names(facVars), function(var) {
dat <- as.data.table(miceObj$imp[[var]])
dat <- melt(dat, measure.vars = names(dat))
setnames(dat, "value", var)
agg <- dat[, .(Percentage = .N/sum(miceObj$where[, var])), by = c("variable", var)]
rawAgg <- na.omit(as.data.table(miceObj$data), cols = var)[, .(Percentage = .N/sum(!miceObj$where[, var])), by = var]
return(ggplot() + geom_dotplot(data = agg, aes(x = !!sym(var),
y = !!sym("Percentage")),
binaxis = "y", stackdir = "center",
dotsize = dotsize, stackratio = 0.75, binwidth = 1/50) +
geom_bar(data = rawAgg, aes(x = !!sym(var), y = !!sym("Percentage")),
stat = "identity", alpha = 0.5) +
scale_x_discrete(guide = guide_axis(n.dodge = 2)))
})
}
else facList <- NULL
if (length(numVars) > 0) {
numList <- lapply(names(numVars), function(var) {
dat <- as.data.table(miceObj$imp[[var]])
dat <- melt(dat, measure.vars = names(dat))
setnames(dat, "value", var)
# miceObj$data is a plain data.frame, so index it with [[ ]] rather than get()
dens <- density(miceObj$data[[var]], na.rm = TRUE)
return(ggplot() + geom_density(data = dat, aes(!!sym(var), group = !!sym("variable")), bw = dens$bw) +
geom_density(data = miceObj$data[!is.na(miceObj$data[[var]]), , drop = FALSE], aes(!!sym(var)),
linewidth = 1, color = "red", bw = dens$bw) +
ylab("Density"))
})
}
else numList <- NULL
pList <- c(numList, facList)
ggarrange(plotlist = pList, ...)
}
I have been attempting to solve this issue for a considerable amount of time with no success. I am creating multiple partial dependence plots (PDPs) and using a package called zenplots to lay them out. However, I cannot figure out a way to have a common legend for the multiple plots. I have tried placing them on a grid and plotting, and tried changing the positioning of the grobs, but I can't figure it out. For example:
In the above plot, all PDPs are on the same scale and I would like a single legend. Currently, when I produce the image, it plots a legend for each individual plot. Whereas, what I want is something like the image below (which I made in photoshop):
The code I'm providing to produce the plots is somewhat long (which I hope won't deter people), but essentially it's only the ggplot part of the code that I need to manipulate. That is, I'm creating the actual ggplot on lines 103-105 and more generally between lines 103-125, where I use ggplot_gtable to build the plots. For example, changing the color argument on line 115 to: guides(fill = FALSE, color = "colourbar") will create the legend for each plot, while setting color = FALSE will remove the legends.
Below is the code used to make the plots and its application to the air quality data:
library(randomForest)
library(ggplot2)
library(dplyr)
pdpLayout <- function(data,
fit,
response,
pal = rev(RColorBrewer::brewer.pal(11, "RdYlBu")),
gridSize = 10,
nmax = 500,
class = 1,
rug = TRUE,
...) {
data <- na.omit(data)
# if (is.numeric(nmax) && nmax < nrow(data)) {
# data <- data[sample(nrow(data), nmax), , drop = FALSE]
# }
gridSize <- min(gridSize, nmax)
predData <- predict(fit, data)
vars <- names(data)
vars <- vars[-match(response, vars)]
datap <- data[,vars]
zpath <- 1:length(vars)
zdata <- datap
zpairs <- t(sapply(1:(length(zpath)-1), function(i){
z <- zpath[i:(i+1)]
if (i %% 2 == 0) rev(z) else z
}))
zpairs <- cbind(vars[zpairs[, 1]], vars[zpairs[, 2]])
# loop through vars and create a list of pdps for each pair
pdplist <- vector("list", nrow(zpairs))
for (i in 1:nrow(zpairs)) {
ind <- zpairs[i, ]
if (!is.na(ind[1])) {
px <- pdp_data(data, ind, gridsize = gridSize)
px$.pid <- i
pdplist[[i]] <- px
} else {
pdplist[[i]] <- NULL
}
}
pdplist <- bind_rows(pdplist)
pdplist$fit <- predict(fit, pdplist)
pdplist <- split(pdplist, pdplist$.pid)
pdplist0 <- vector("list", nrow(zpairs))
j <- 1
for (i in 1:nrow(zpairs)) {
ind <- zpairs[i, ]
if (!is.na(ind[1])) {
pdplist0[[i]] <- pdplist[[j]] %>%
group_by(.data[[ind[1]]], .data[[ind[2]]]) %>%
summarise(fit = mean(fit))
j <- j + 1
} else {
pdplist0[[i]] <- NULL
}
}
pdplist <- pdplist0
pdplist0 <- NULL
names(pdplist) <- paste(zpairs[, 2], zpairs[, 1], sep = "pp")
message("Finished ice/pdp")
# Set limits for pairs
pdplist0 <- pdplist[!sapply(pdplist, is.null)]
r <- range(sapply(pdplist0, function(x) range(x$fit)))
limits <- range(labeling::rpretty(r[1], r[2]))
# Zenplot graphing function
data$pred <- predData
z2index <- 0
pdpnn <- function(zargs) {
z2index <<- z2index + 1
vars <- zpairs[z2index, ]
pdp <- pdplist[[z2index]]
if (!is.null(pdp)) {
if (is.factor(pdp[[vars[1]]]) + is.factor(pdp[[vars[2]]]) == 1) {
if (is.factor(pdp[[vars[1]]])) vars <- rev(vars)
p <- ggplot(data = pdp, aes(x = .data[[vars[1]]], y = fit, color = .data[[vars[2]]])) +
geom_line() +
geom_rug(data = data, sides = "b", aes(y = .data[["pred"]]))
} else {
if (is.factor(pdp[[vars[1]]])) posx <- "jitter" else posx <- "identity"
if (is.factor(pdp[[vars[2]]])) posy <- "jitter" else posy <- "identity"
p <- ggplot(data = pdp, aes(x = .data[[vars[1]]], y = .data[[vars[2]]])) +
geom_tile(aes(fill = fit)) +
scale_fill_gradientn(name = "y-hat", colors = pal, limits = limits, oob = scales::squish)
if (rug) {
p <- p +
geom_rug(data = data, sides = "b", position = posx, aes(color = .data[["pred"]])) +
geom_rug(data = data, sides = "l", position = posy, aes(color = .data[["pred"]])) +
scale_color_gradientn(name = "y-hat", colors = pal, limits = limits, oob = scales::squish)
}
}
p <- p +
guides(fill = FALSE, color = FALSE) +
theme_bw() +
theme(
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "gray", fill = NA, size = 1.5)
)
} else {
p <- ggplot() +
theme(panel.background = element_blank())
}
ggplot_gtable(ggplot_build(p))
}
suppressMessages({
zenplots::zenplot(zdata,
pkg = "grid", labs = list(group = NULL),
plot2d = pdpnn, ...
)
})
}
pdp_data <- function(d, var, gridsize = 30) {
if (length(var) == 1) {
pdpvar <- d[[var]]
if (is.factor(pdpvar)) {
gridvals <- levels(pdpvar)
} else {
gridvals <- seq(min(pdpvar, na.rm = T), max(pdpvar, na.rm = T), length.out = gridsize)
}
dnew <- do.call(rbind, lapply(gridvals, function(i) {
d1 <- d
d1[[var]] <- i
d1
}))
if (is.factor(pdpvar)) dnew[[var]] <- factor(dnew[[var]], levels = levels(pdpvar), ordered = is.ordered(pdpvar))
}
else {
pdpvar1 <- d[[var[1]]]
pdpvar2 <- d[[var[2]]]
if (is.factor(pdpvar1)) {
gridvals1 <- levels(pdpvar1)
} else {
gridvals1 <- seq(min(pdpvar1, na.rm = T), max(pdpvar1, na.rm = T), length.out = gridsize)
}
if (is.factor(pdpvar2)) {
gridvals2 <- levels(pdpvar2)
} else {
gridvals2 <- seq(min(pdpvar2, na.rm = T), max(pdpvar2, na.rm = T), length.out = gridsize)
}
gridvals <- expand.grid(gridvals1, gridvals2)
dnew <- do.call(rbind, lapply(1:nrow(gridvals), function(i) {
d1 <- d
d1[[var[1]]] <- gridvals[i, 1]
d1[[var[2]]] <- gridvals[i, 2]
d1
}))
if (is.factor(pdpvar1)) dnew[[var[1]]] <- factor(dnew[[var[1]]], levels = levels(pdpvar1), ordered = is.ordered(pdpvar1))
if (is.factor(pdpvar2)) dnew[[var[2]]] <- factor(dnew[[var[2]]], levels = levels(pdpvar2), ordered = is.ordered(pdpvar2))
}
dnew$.id <- 1:nrow(d)
rownames(dnew) <- NULL
dnew
}
Now use some data to create the plots:
aq <- na.omit(airquality)
rf <- randomForest(Ozone~., data = aq)
pdpLayout(aq, rf, "Ozone")
Any help or suggestions are greatly appreciated.
I am trying to add lm model coefs of two parallel modelling results onto the same ggplot plot. Here is my working example:
library(ggplot2)
set.seed(100)
dat <- data.frame(
x <- rnorm(100, 1),
y <- rnorm(100, 10),
lev <- gl(n = 2, k = 50, labels = letters[1:2])
)
mod1 <- lm(y~x, dat = dat[lev %in% "a", ])
r1 <- paste("R^2==", round(summary(mod1)[[9]], 3))
p1<- paste("p==", round(summary(mod1)[[4]][2, 4], 3), sep= "")
lab1 <- paste(r1, p1, sep =",")
mod2 <- lm(y~x, dat = dat[lev %in% "b", ])
r2 <- paste("R^2==", round(summary(mod2)[[9]], 3))
p2 <- paste("p==", round(summary(mod2)[[4]][2, 4], 3), sep= "")
lab2 <- paste(r2, p2, sep =",")
ggplot(dat, aes(x = x, y = y, col = lev)) + geom_jitter() + geom_smooth(method = "lm") + annotate("text", x = 2, y = 12, label = lab1, parse = T) + annotate("text", x = 10, y = 8, label = lab2, parse = T)
Here is what the prompt shows:
Error in parse(text = text[[i]]) : <text>:1:12: unexpected ','
1: R^2== 0.008,
Now the problem is that I can label either R2 or the p value separately, but not both of them together. How can I put the two results on one single line in the figure?
BTW, is there a more efficient way of doing the same thing as my code? I have nine subplots that I want to put into one full plot, and I don't want to add them one by one.
++++++++++++++++++++++++++ Some update ++++++++++++++++++++++++++++++++++
Following @G. Grothendieck's kind suggestion, I tried to wrap the most repetitive part of the code into a function, so I could finish all the plots in a few lines. Now the problem is that, whichever input variables I pass, the output plots are basically the same, except for the axis labels. Can anyone explain why? The following is the working code I used:
library(ggplot2)
library(ggpubr)
set.seed(100)
dat <- data.frame(
x = rnorm(100, 1),
y = rnorm(100, 10),
z = rnorm(100, 25),
lev = gl(n = 2, k = 50, labels = letters[1:2])
)
test <- function(dat, x, y){
fmt <- "%s: Adj ~ R^2 == %.3f * ',' ~ {p == %.3f}"
mod1 <- lm(y ~ x, dat, subset = lev == "a")
sum1 <- summary(mod1)
lab1 <- sprintf(fmt, "a", sum1$adj.r.squared, coef(sum1)[2, 4])
mod2 <- lm(y ~ x, dat, subset = lev == "b")
sum2 <- summary(mod2)
lab2 <- sprintf(fmt, "b", sum2$adj.r.squared, coef(sum2)[2, 4])
colors <- 1:2
p <- ggplot(dat, aes(x = x, y = y, col = lev)) +
geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 2, y = c(12, 8), label = c(lab1, lab2),
parse = TRUE, hjust = 0, color = colors) +
scale_color_manual(values = colors)
return(p)
}
ggarrange(test(dat, x, z), test(dat, y, z))
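A likely explanation, offered as a sketch rather than a verified diagnosis: inside test(), the formula y ~ x and aes(x = x, y = y) are resolved against the columns of dat, so they always refer to the columns literally named x and y, whatever is passed as arguments. One way around this is to pass column names as strings and build the formula from them. The helper name fit_by_name() below is hypothetical.

```r
set.seed(100)
dat <- data.frame(
  x = rnorm(100, 1),
  y = rnorm(100, 10),
  z = rnorm(100, 25),
  lev = gl(n = 2, k = 50, labels = letters[1:2])
)

# Hypothetical helper: fit yvar ~ xvar within one level of `lev`,
# with both variables given as column-name strings.
fit_by_name <- function(dat, xvar, yvar, level) {
  form <- reformulate(xvar, response = yvar)  # builds e.g. z ~ x
  lm(form, data = dat[dat$lev == level, ])
}

fit_xz <- fit_by_name(dat, "x", "z", "a")
fit_yz <- fit_by_name(dat, "y", "z", "a")

# The predictor in each model now follows the argument that was passed
print(names(coef(fit_xz)))
print(names(coef(fit_yz)))
```

In the plotting part, the matching change would be aes(x = .data[[xvar]], y = .data[[yvar]], col = lev), so that the aesthetics also follow the strings.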
There are several problems here:
x, y and lev are arguments to data.frame so they must be specified using = rather than <-
make use of the subset= argument in lm
use sprintf instead of paste to simplify the specification of labels
label the text strings a and b and make them the same color as the corresponding lines to identify which is which
the formula syntax needs to be corrected. See fmt below.
it would be clearer to use component names and accessor functions of the summary objects where available
use TRUE rather than T because the latter can be overridden if there is a variable called T but TRUE can never be overridden.
use hjust=0 and adjust the x= and y= in annotate to align the two text strings
combine the annotate statements
place the individual terms of the ggplot statement on separate lines for improved readability
This gives:
library(ggplot2)
set.seed(100)
dat <- data.frame(
x = rnorm(100, 1),
y = rnorm(100, 10),
lev = gl(n = 2, k = 50, labels = letters[1:2])
)
fmt <- "%s: Adj ~ R^2 == %.3f * ',' ~ {p == %.3f}"
mod1 <- lm(y ~ x, dat, subset = lev == "a")
sum1 <- summary(mod1)
lab1 <- sprintf(fmt, "a", sum1$adj.r.squared, coef(sum1)[2, 4])
mod2 <- lm(y ~ x, dat, subset = lev == "b")
sum2 <- summary(mod2)
lab2 <- sprintf(fmt, "b", sum2$adj.r.squared, coef(sum2)[2, 4])
colors <- 1:2
ggplot(dat, aes(x = x, y = y, col = lev)) +
geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 2, y = c(12, 8), label = c(lab1, lab2),
parse = TRUE, hjust = 0, color = colors) +
scale_color_manual(values = colors)
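The original error can be reproduced without any plotting, because parse = TRUE simply hands the label to R's parser. The sketch below uses a hypothetical helper, parses(), and made-up numbers to check which label strings are syntactically valid; the list(...) form is an alternative that plotmath renders as a comma-separated list.

```r
# Hypothetical helper: TRUE if `txt` is a syntactically valid R expression
parses <- function(txt) {
  !inherits(try(parse(text = txt), silent = TRUE), "try-error")
}

# A bare comma is not valid outside a function call
bad <- "R^2 == 0.008, p == 0.5"

# The comma as a quoted string, joined with * (juxtapose) and ~ (space)
good_fmt <- sprintf("Adj ~ R^2 == %.3f * ',' ~ {p == %.3f}", 0.008, 0.5)

# Alternative: wrap both terms in list()
good_list <- sprintf("list(R^2 == %.3f, p == %.3f)", 0.008, 0.5)

print(c(bad = parses(bad), fmt = parses(good_fmt), list = parses(good_list)))
```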
Unless I'm misunderstanding your question, the problem's with the parse = T arguments to your annotate calls. I don't think your strings need to be parsed. Try parse = F instead, or just drop the parameter, as the default value seems to be FALSE anyway
I need to create some gam plots in ggplot. I can do them with the general plot function, but am unsure how to do with ggplot. Here is my code and plots with the regular plot function. I'm using the College data set from the ISLR package.
train.2 <- sample(dim(College)[1],2*dim(College)[1]/3)
train.college <- College[train.2,]
test.college <- College[-train.2,]
gam.college <- gam(Outstate~Private+s(Room.Board)+s(Personal)+s(PhD)+s(perc.alumni)+s(Expend)+s(Grad.Rate), data=train.college)
par(mfrow=c(2,2))
plot(gam.college, se=TRUE,col="blue")
See update below old answer.
Old answer:
There is an implementation of GAM plotting using ggplot2 in the voxel package. Here is how you would go about it:
library(ISLR)
library(mgcv)
library(voxel)
library(tidyverse)
library(gridExtra)
data(College)
set.seed(1)
train.2 <- sample(dim(College)[1],2*dim(College)[1]/3)
train.college <- College[train.2,]
test.college <- College[-train.2,]
gam.college <- gam(Outstate~Private+s(Room.Board)+s(Personal)+s(PhD)+s(perc.alumni)+s(Expend)+s(Grad.Rate), data=train.college)
vars <- c("Room.Board", "Personal", "PhD", "perc.alumni","Expend", "Grad.Rate")
map(vars, function(x){
p <- plotGAM(gam.college, smooth.cov = x) #plot customization goes here
g <- ggplotGrob(p)
}) %>%
{grid.arrange(grobs = (.), ncol = 2, nrow = 3)}
This runs, after a bunch of warnings: In plotGAM(gam.college, smooth.cov = x) :
There are one or more factors in the model fit, please consider plotting by group since plot might be unprecise
To compare to the plot.gam:
par(mfrow=c(2,3))
plot(gam.college, se=TRUE,col="blue")
You might also want to plot the observed values:
map(vars, function(x){
p <- plotGAM(gam.college, smooth.cov = x) +
geom_point(data = train.college, aes_string(y = "Outstate", x = x ), alpha = 0.2) +
geom_rug(data = train.college, aes_string(y = "Outstate", x = x ), alpha = 0.2)
g <- ggplotGrob(p)
}) %>%
{grid.arrange(grobs = (.), ncol = 3, nrow = 2)}
Or per group (especially important if you used the by argument, i.e. an interaction, in gam):
map(vars, function(x){
p <- plotGAM(gam.college, smooth.cov = x, groupCovs = "Private") +
geom_point(data = train.college, aes_string(y = "Outstate", x = x, color= "Private"), alpha = 0.2) +
geom_rug(data = train.college, aes_string(y = "Outstate", x = x, color= "Private" ), alpha = 0.2) +
scale_color_manual("Private", values = c("#868686FF", "#0073C2FF")) +
theme(legend.position="none")
g <- ggplotGrob(p)
}) %>%
{grid.arrange(grobs = (.), ncol = 3, nrow = 2)}
Update, 08. Jan. 2020.
I currently think the package mgcViz offers superior functionality compared to the voxel::plotGAM function. An example using the above data set and models:
library(mgcViz)
viz <- getViz(gam.college)
print(plot(viz, allTerms = TRUE), pages = 1)
Plot customization is similar to ggplot2 syntax:
trt <- plot(viz, allTerms = TRUE) +
l_points() +
l_fitLine(linetype = 1) +
l_ciLine(linetype = 3) +
l_ciBar() +
l_rug() +
theme_grey()
print(trt, pages = 1)
This vignette shows many more examples.
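If an extra plotting package is not an option, the partial effects can also be assembled by hand. The sketch below assumes simulated data (toy, x1 and x2 are made up) rather than College, and a hypothetical helper, smooth_df(). It relies on predict() with type = "terms" and se.fit = TRUE from mgcv, and draws the band at plus or minus 2 standard errors.

```r
library(mgcv)

# Simulated stand-in data (an assumption for illustration only)
set.seed(1)
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- sin(2 * pi * toy$x1) + toy$x2^2 + rnorm(200, sd = 0.3)
fit <- gam(y ~ s(x1) + s(x2), data = toy)

# Hypothetical helper: partial effect and standard error of one smooth term
smooth_df <- function(model, data, var) {
  term <- paste0("s(", var, ")")
  pr <- predict(model, newdata = data, type = "terms", se.fit = TRUE)
  out <- data.frame(x = data[[var]],
                    fit = pr$fit[, term],
                    se = pr$se.fit[, term])
  out[order(out$x), ]
}

eff <- smooth_df(fit, toy, "x1")

# Build the ggplot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(eff, aes(x = x, y = fit)) +
    geom_ribbon(aes(ymin = fit - 2 * se, ymax = fit + 2 * se), alpha = 0.2) +
    geom_line(colour = "blue") +
    labs(x = "x1", y = "s(x1)")
}
```

Calling smooth_df() once per smooth and arranging the resulting plots with grid.arrange() would give a layout comparable to plot(gam.college).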
Here is some test code; I don't understand why it is not working as expected. It is a ggplot2 question, not an R one.
library(ggplot2)
K = 10
x <- 1:100/100
y <- sapply (x, FUN= function(x) 1+x)
xy <- data.frame(x,y)
set.seed(1234)
xy$yrand <- xy$y + runif(100,min = -0.35, max = 0.5)
folds <- cut(seq(1, nrow(xy)), breaks = K, labels = FALSE)
p1 <- ggplot(xy, aes(x = xy$x, y = xy$yrand))+geom_point() +ggtitle ("Simple
x vs y plot with added random noise") + xlab("X") + ylab("Y")
for(i in 1:K){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- xy[testIndexes, ]
trainData <- xy[-testIndexes, ]
lmTemp <- lm(yrand ~ x, data = trainData)
p1 <- p1 + geom_line(data = trainData, aes(x = trainData$x, y = predict(lmTemp, newdata = trainData)))
}
p1
Now what I would like to see is a plot with 10 lines (the regression lines), but I only see one. Can someone help me out? Is it the ggplot2 syntax that is wrong?
Thanks, Umberto
EDITED:
I marked the answer I got since it is a nice way of doing it. I just wanted to add a simple way of doing it by first preparing the dataset for the graph I wanted to create. I think this method is slightly easier to understand if you don't have much R experience.
fitLines <- NULL  # accumulator; rbind() below needs it to exist before the loop
for(i in 1:K){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- xy[testIndexes, ]
trainData <- xy[-testIndexes, ]
lmTemp <- lm(yrand ~ x, data = trainData)
# Let's build a data set for the lines
fitLines <- rbind(fitLines, data.frame(rep(paste("set",i),nrow(trainData)),trainData[,1], predict(lmTemp, newdata = trainData)))
}
names(fitLines) <- c("set", "x","y")
p1 + geom_line(data = fitLines, aes(x = x, y = y, col = set))
And this is what you get
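A likely reason the original loop showed a single line: aes() stores expressions and ggplot2 evaluates them only when the plot is drawn, so every layer sees the final trainData and lmTemp. Computing the predictions up front avoids that. The sketch below rebuilds the same fitLines data frame with lapply() instead of growing it inside a loop; it only restates the approach above in a more compact form.

```r
K <- 10
x <- 1:100 / 100
set.seed(1234)
xy <- data.frame(x = x, yrand = 1 + x + runif(100, min = -0.35, max = 0.5))
folds <- cut(seq_len(nrow(xy)), breaks = K, labels = FALSE)

# One data frame of fitted values per fold, stacked into a single table
fitLines <- do.call(rbind, lapply(seq_len(K), function(i) {
  trainData <- xy[folds != i, ]
  lmTemp <- lm(yrand ~ x, data = trainData)
  data.frame(set = paste("set", i),
             x = trainData$x,
             y = predict(lmTemp, newdata = trainData))
}))

str(fitLines)
```

The result can be passed to geom_line(data = fitLines, aes(x = x, y = y, col = set)) exactly as above.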
You could use the crossv_kfold() function from the modelr package, and put your complete code into a "pipe-workflow":
library(modelr)
library(tidyverse)
x <- 1:100/100
y <- sapply (x, FUN= function(x) 1+x)
xy <- data.frame(x,y)
set.seed(1234)
xy$yrand <- xy$y + runif(100,min = -0.35, max = 0.5)
xy %>%
crossv_kfold() %>%
mutate(
models = map(train, ~ lm(yrand ~ x, data = .x)),
predictions = map2(models, test, ~predict(.x, newdata = .y, type = "response"))
) %>%
select(-train, -test, -models) %>%
unnest() %>%
bind_cols(xy) %>%
ggplot(aes(x = x, y = predictions)) +
stat_smooth(aes(colour = .id), method = "lm", se = FALSE) +
geom_point(aes(y = yrand))
Putting the colour-aes inside the ggplot-call would also map the points to the groups:
xy %>%
crossv_kfold() %>%
mutate(
models = map(train, ~ lm(yrand ~ x, data = .x)),
predictions = map2(models, test, ~predict(.x, newdata = .y, type = "response"))
) %>%
select(-train, -test, -models) %>%
unnest() %>%
bind_cols(xy) %>%
ggplot(aes(x = x, y = predictions, colour = .id)) +
stat_smooth(method = "lm", se = FALSE) +
geom_point(aes(y = yrand))
Some test data:
ltd <- data.frame(r = c(rnorm(10)), f1 = c(rep("L", 5), rep("H", 5)),
f2 = rep(c("A", "B"), 5))
And a minimal function:
tf <- function(formula = NULL, data = NULL) {
res <- as.character(formula[[2]]) # clean & prep data
fac1 <- as.character(formula[[3]][2])
fac2 <- as.character(formula[[3]][3])
counts <- count(data, vars = c(fac2, fac1)) # get table data ready
colnames(counts) <- c(fac2, fac1, "count")
myt <- tableGrob(counts, show.box = TRUE,
show.rownames = FALSE, show.colnames = TRUE,
show.csep = TRUE, show.rsep = TRUE,
separator = "black")
p <- ggplot()
p <- p + geom_point(data = data,
aes_string(x = fac1, y = res, color = fac2, group = fac2))
p <- p + annotation_custom(myt) # comment out and it works
}
Run it:
require("plyr")
require("gridExtra")
require("ggplot2")
tmp <- tf(formula = r~f1*f2, data = ltd)
print(tmp)
Gives Error in if (nrow(layer_data) == 0) return() : argument is of length zero
If you print the tableGrob it does exist, so I'm not sure what's going on here. If you comment out the annotation_custom it works, and I think I'm following the docs. Thanks. (ggplot2_0.9.3)
Here is a solution to your problem: I relocated your data= and aes_string call to the main ggplot call. I don't know why it matters, but now the plot prints without error.
p <- ggplot(data=data, aes_string(x=fac1, y=res, color=fac2, group=fac2)) +
geom_point() +
annotation_custom(myt)
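The data preparation at the top of tf() can be checked on its own, independently of the plotting problem. The sketch below uses base R only and is an alternative, not what tf() does internally: all.vars() pulls the variable names out of the formula, and as.data.frame(table(...)) produces the same kind of count table as plyr::count().

```r
set.seed(1)
ltd <- data.frame(r = rnorm(10),
                  f1 = c(rep("L", 5), rep("H", 5)),
                  f2 = rep(c("A", "B"), 5))

f <- r ~ f1 * f2

# Variable names in order of appearance: response, then the two factors
v <- all.vars(f)
res <- v[1]
fac1 <- v[2]
fac2 <- v[3]

# Cell counts for every combination of the two factors
counts <- as.data.frame(table(ltd[c(fac2, fac1)]), responseName = "count")
print(counts)
```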