aggregate/sum with ggplot - r

Is there a way to sum data with ggplot2?
I want to do a bubble map with the size depending on the sum of z.
Currently I'm doing something like
dd <- ddply(d, .(x,y), transform, z=sum(z))
qplot(x,y, data=dd, size=z)
But I feel I'm writing the same thing twice. I would like to be able to write something like
qplot(x,y, data=dd, size=sum(z))
I had a look at stat_sum and stat_summary, but I'm not sure they are appropriate either.
Is it possible to do it with ggplot2? If not, what would be the best way to write those 2 lines?

It can be done using stat_sum within ggplot2. By default, the dot size represents proportions. To get dot size to represent counts, use size = ..n.. as an aesthetic. Counts (and proportions) by a third variable can be obtained by weighting by the third variable (weight = cost) as an aesthetic. Some examples, but first, some data.
library(ggplot2)
set.seed(321)
# Generate some data
df <- expand.grid(x = 1:5, y = 1:5, KEEP.OUT.ATTRS = FALSE)
df$Count = sample(1:25, 25, replace = F)
library(plyr)
new <- dlply(df, .(Count), function(data) matrix(rep(matrix(c(data$x, data$y), ncol = 2), data$Count), byrow = TRUE, ncol = 2))
df2 <- data.frame(do.call(rbind, new))
df2$cost <- 1:325
The data contains units categorised according to two factors: X1 and X2; and a third variable which is the cost of each unit.
Plot 1: Plots the proportion of elements at each X1 - X2 combination. group=1 tells ggplot to calculate proportions out of the total number of units in the data frame.
ggplot(df2, aes(factor(X1), factor(X2))) +
stat_sum(aes(group = 1))
Plot 2: Plots the number of elements at each X1 - X2 combination.
ggplot(df2, aes(factor(X1), factor(X2))) +
stat_sum(aes(size = ..n..))
Plot 3: Plots the cost of the elements at each X1 - X2 combination, that is, the counts weighted by the third variable.
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = 1, weight = cost, size = ..n..))
Plot 4: Plots the proportion of the total cost of all elements in the data frame at each X1 - X2 combination
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = 1, weight = cost))
Plot 5: Plots proportions, but instead of the proportion being out of the total cost across all elements in the data frame, the proportion is out of the cost for elements within each category of X1. That is, within each X1 category, where does the major cost for X2 units occur?
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = X1, weight = cost))
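Applied to the question's own variables (x, y, z), the weighting idea might look like the sketch below. It assumes ggplot2 3.3.0 or later, where after_stat(n) is the current spelling of ..n.., and it builds a toy d for illustration; the aggregate() call is only there as a cross-check of the sums.

```r
library(ggplot2)

set.seed(321)
# Toy data (an assumption): 10 distinct (x, y) pairs, 10 z values per pair
d <- data.frame(x = rep(1:10, 10), y = rep(1:10, 10), z = runif(100))

# stat_sum() adds up the weights at each (x, y) location and exposes the total as n
p <- ggplot(d, aes(x, y)) +
  stat_sum(aes(weight = z, size = after_stat(n)))

# Cross-check: the same per-location sums computed with base R
sums <- aggregate(z ~ x + y, data = d, FUN = sum)
```

Printing p draws the bubbles; the size legend then reports the summed z rather than a count.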

You could put the ddply call into the qplot:
d <- data.frame(x=1:10, y=1:10, z= runif(100))
qplot(x, y, data=ddply(d, .(x,y), transform, z=sum(z)), size=z)
Or use the data.table package.
library(data.table)
DT <- data.table(d, key='x,y')
qplot(x, y, data=DT[, sum(z), by='x,y'], size=V1)
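If an extra package is not wanted for the aggregation step, base R's aggregate() can do the same job. This is a minimal sketch on toy data, not the only way to write it; the object names dd and p are just illustrative.

```r
library(ggplot2)

set.seed(1)
# Toy data (an assumption): 10 distinct (x, y) pairs, 10 z values per pair
d <- data.frame(x = rep(1:10, 10), y = rep(1:10, 10), z = runif(100))

# One row per (x, y) pair, with z replaced by its sum over that pair
dd <- aggregate(z ~ x + y, data = d, FUN = sum)

p <- ggplot(dd, aes(x, y, size = z)) + geom_point()
```

Unlike the transform version, which keeps every original row, this leaves a single point per location, so nothing is overplotted.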

Related

How to delete outliers from a QQ-plot graph made with ggplot()?

I have a two dimensional dataset (say columns x and y). I use the following function to plot a QQ-plot of this data.
# Creating a toy data for presentation
df = cbind(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
# Plotting the QQ-plot
df_qq = as.data.frame(qqplot(df[,1], df[,2], plot.it=FALSE))
ggplot(df_qq) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
That is the resulting graph:
My question is: how can I avoid plotting the last point, i.e. (12, 32)? I would rather not delete it manually, because I have several of these data pairs and each has similar outliers. I would like code that identifies the points that are too far from the 45 degree line and removes them from df_qq (for instance, a point could be removed if it is 5 times further from the line than the average distance). My main objective is to make the graph easier to read. When outliers are kept, the more regular part of the QQ-plot occupies too small a part of the graph, which prevents me from visually evaluating the similarity of the two vectors apart from the outliers.
I would appreciate any help.
There is a CRAN package, referenceIntervals, that uses Cook's distance to detect outliers. Applying it to the values of df_qq$y gives an index of the rows to remove from df_qq.
library(referenceIntervals)
out <- cook.outliers(df_qq$y)$outliers
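i <- which(df_qq$y %in% out)
# Guard the no-outlier case: df_qq[-integer(0), ] would drop every row
df_plot <- if (length(i)) df_qq[-i, ] else df_qq
ggplot(df_plot) +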
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
Edit.
Following the OP's comment,
"But as far as I understand this function does not look at the relation between x & y,"
maybe the following function is what is needed: it removes a point only if it is an outlier in one of the vectors but not in both.
cookOut <- function(X){
out1 <- cook.outliers(X[[1]])$outliers
out2 <- cook.outliers(X[[2]])$outliers
i <- X[[1]] %in% out1
j <- X[[2]] %in% out2
w <- which((!i & j) | (i & !j))
if(length(w)) X[-w, ] else X
}
Test with the second data set, the one in the comment.
The extra column, id, is just there to make faceting easier.
df1 <- data.frame(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
df2 <- data.frame(x = c(1,5,8,2,9,6,1,7,32), y = c(1,4,10,1,6,5,2,1,32))
df_qq1 = as.data.frame(qqplot(df1[,1], df1[,2], plot.it=FALSE))
df_qq2 = as.data.frame(qqplot(df2[,1], df2[,2], plot.it=FALSE))
df_qq_out1 <- cookOut(df_qq1)
df_qq_out2 <- cookOut(df_qq2)
df_qq_out1$id <- "A"
df_qq_out2$id <- "B"
df_qq_out <- rbind(df_qq_out1, df_qq_out2)
ggplot(df_qq_out) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1) +
facet_wrap(~ id)
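The rule described in the question (drop points that are more than 5 times the average distance from the 45 degree line) can also be written directly in base R. The sketch below is one possible reading of that rule, using the toy data from the question; the names dist45, keep and df_qq_trim are illustrative.

```r
x <- c(1, 5, 8, 2, 9, 6, 1, 7, 12)
y <- c(1, 4, 10, 1, 6, 5, 2, 1, 32)
df_qq <- as.data.frame(qqplot(x, y, plot.it = FALSE))

# Perpendicular distance of each point from the line y = x
dist45 <- abs(df_qq$y - df_qq$x) / sqrt(2)

# Keep points no more than 5 times the average distance away
# (the factor 5 is the threshold suggested in the question)
keep <- dist45 <= 5 * mean(dist45)
df_qq_trim <- df_qq[keep, ]
```

Because the average includes the outliers themselves, a very extreme point raises the threshold; a median-based threshold would be a more robust variant.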

Converting data to percentage rank

I have data whose mean and variance changes as a function of the independent variable. How do I convert the dependent variable into (estimated) conditional percentage ranks?
For example, say the data looks like Z below:
library(dplyr)
library(ggplot2)
data.frame(x = runif(1000, 0, 5)) %>%
mutate(y = sin(x) + rnorm(n())*cos(x)/3) ->
Z
We can plot it with Z %>% ggplot(aes(x,y)) + geom_point(): it looks like a dispersed sine function, where the variance around that sine function varies with x. My goal is to convert each y value into a number between 0 and 1 which represents its percentage rank among values with similar x. So values very close to that sine function should be converted to about 0.5, while values below it should be converted to values closer to 0 (depending on the variance around that x).
One quick way to do this is to bucket the data and then simply compute the rank of each observation in each bucket.
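As a rough illustration of the bucketing idea, the sketch below uses base R only. The number of buckets (20) and the (rank - 0.5) / n rescaling are arbitrary choices made here, not something stated in the question.

```r
set.seed(1)
Z <- data.frame(x = runif(1000, 0, 5))
Z$y <- sin(Z$x) + rnorm(nrow(Z)) * cos(Z$x) / 3

# Assign each observation to one of 20 equal-width buckets of x
# (the bucket count is an arbitrary choice for illustration)
Z$bucket <- cut(Z$x, breaks = seq(0, 5, length.out = 21), include.lowest = TRUE)

# Rank y within its bucket and rescale to the open interval (0, 1)
Z$pct_rank <- ave(Z$y, Z$bucket, FUN = function(v) (rank(v) - 0.5) / length(v))
```

The result is sensitive to bucket width: wide buckets mix different conditional distributions, while narrow ones leave few observations to rank against.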
Another way (which I think is preferable) to do what I ask is to perform a quantile regression for a number of different quantiles (tau):
library(quantreg)
library(splines)
model.fit <- rq(y ~ bs(x, df = 5), tau = (1:9)/10, data = Z)
which can be plotted as follows:
library(tidyr)
data.frame(x = seq(0, 5, len = 100)) %>%
data.frame(., predict(model.fit, newdata = .), check.names = FALSE) %>%
gather(Tau, y, -x) %>%
ggplot(aes(x,y)) +
geom_point(data = Z, size = 0.1) +
geom_line(aes(color = Tau), size = 1)
Given model.fit I could now use the estimated quantiles for each x value to convert each y value into a percentage rank (with the help of approx(...)) but I suspect that package quantreg may do this more easily and better. Is there, in fact, some function in quantreg which automates this?
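The manual approx() route mentioned above might be sketched as follows. This is only the hand-rolled version, not a built-in quantreg facility; it regenerates Z with a fixed seed, sorts the fitted quantiles at each x to sidestep quantile crossing, and clamps ranks to the fitted range of tau (0.1 to 0.9). The names taus and q_hat are illustrative.

```r
library(quantreg)
library(splines)

set.seed(1)
Z <- data.frame(x = runif(1000, 0, 5))
Z$y <- sin(Z$x) + rnorm(nrow(Z)) * cos(Z$x) / 3

taus <- (1:9) / 10
model.fit <- rq(y ~ bs(x, df = 5), tau = taus, data = Z)

# One row per observation, one column per tau
q_hat <- predict(model.fit, newdata = Z)

# Interpolate each y between its fitted quantiles to get an estimated rank
Z$pct_rank <- vapply(seq_len(nrow(Z)), function(i) {
  q <- sort(q_hat[i, ])  # enforce monotone quantiles at this x
  approx(x = q, y = taus, xout = Z$y[i], rule = 2, ties = mean)$y
}, numeric(1))
```

Fitting a denser grid of tau values would narrow the clamped tails at the cost of a slower fit.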

Making surface plot of regression estimates from multiple continuous variables

I have a multi-level model with categorical and continuous variables and splines. Nice and complex. Anyhow I am trying to visualize model fit.
For example, here is some toy data:
library(lme4)
library(rms)
library(gridExtra)
## Make model using sleepstudy data
head(sleepstudy)
# Add some extra vars
sleepstudy$group <- factor( sample(c(1,2), nrow(sleepstudy), replace=TRUE) )
sleepstudy$x1 <- jitter(sleepstudy$Days, factor=5)^2 * jitter(sleepstudy$Reaction)
# Set up a mixed model with spline
fm1 <- lmer(Reaction ~ rcs(Days, 4) * group + (rcs(Days, 4) | Subject), sleepstudy)
# Now add continuous covar
fm2 <- lmer(Reaction ~ rcs(Days, 4) * group + x1 + (rcs(Days, 4) | Subject), sleepstudy)
# Plot fit
new.df <- sleepstudy
new.df$pred1 <- predict(fm1, new.df, allow.new.levels=TRUE, re.form=NA)
new.df$pred2 <- predict(fm2, new.df, allow.new.levels=TRUE, re.form=NA)
g1 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred1, col=group), size=2) +
ggtitle("Model 1")
g2 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred2, col=group), size=2) +
ggtitle("Model 2")
grid.arrange(g1, g2, nrow=1)
Plot 1 is smooth, but plot 2 is jagged due to the effect of x1. So I would like to make a surface plot with x = Days, y = x1 and z = pred2 and stratified by group. Not having experience of surface plots I've started out with the wireframe command:
wireframe(pred2 ~ Days * x1, data = new.df[new.df$group==1,],
xlab = "Days", ylab = "x1", zlab="Predicted fit"
)
However although this command does not give an error, my plot is blank:
Questions:
Where am I going wrong with my wireframe?
Is there a better way to visualize my model fit?
I figured out that the data format needed for a wireframe or plot_ly surface is a 2D matrix of x rows by y columns of corresponding z values (I got a hint towards this from the question "Plotly 3d surface graph has incorrect x and y axis values"). I also realised I could use expand.grid to make a grid covering the range of possible x and y values and use those to predict z as follows:
days <- 0:9
x1_range <- range(sleepstudy$x1)[2] * c(0.05, 0.1, 0.15, 0.2, 0.25, 0.3)
new.data2 <- expand.grid(Days = days, x1 = x1_range, group = unique(sleepstudy$group) )
new.data2$pred <- predict(fm2, new.data2, allow.new.levels=TRUE, re.form=NA)
I can then stuff those into two different matrices to represent the z-surface for each group in my model:
surf1 <- ( matrix(new.data2[new.data2$group == 1, ]$pred, nrow = length(days), ncol = length(x1_range)) )
surf2 <- ( matrix(new.data2[new.data2$group == 2, ]$pred, nrow = length(days), ncol = length(x1_range)) )
group <- c(rep(1, nrow(surf1)), rep(2, nrow(surf2) ))
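The reshaping step relies on expand.grid varying its first argument fastest, so that a column-wise matrix() fill lines up with it. A small self-contained check of that orientation, with a made-up formula standing in for predict() and hypothetical covariate values:

```r
days <- 0:9
x1_vals <- c(10, 20, 30)  # hypothetical covariate values

# expand.grid varies its first argument fastest
grid <- expand.grid(Days = days, x1 = x1_vals)
grid$pred <- 2 * grid$Days + grid$x1  # stand-in for predict()

# Column-wise fill puts Days along rows and x1 along columns
surf <- matrix(grid$pred, nrow = length(days), ncol = length(x1_vals))

# Row i, column j holds the value for days[i] and x1_vals[j]
stopifnot(surf[4, 2] == 2 * days[4] + x1_vals[2])
```

If the grid were built with x1 first, the same matrix() call would silently scramble the surface, so the argument order matters.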
Finally I can use plot_ly to plot each surface:
library(plotly)
# Rows of each z matrix follow days (y axis); columns follow x1_range (x axis)
plot_ly(z = surf1, x = x1_range, y = days, type = "surface") %>%
add_surface(z = surf2, surfacecolor = surf2,
color = c('red','yellow'))
The resulting plot:
So the resulting plot is what I wanted (albeit not very useful in this made-up example, but useful with real data). The only thing I can't figure out is how to show two different color scales. I can suppress the scale altogether, but if anyone knows how to show 2 scales for different surfaces, please let me know and I will edit the answer.

Plotting model comparison statistics in R

I combined several data-frames into a data-frame dfc with a fifth column called model specifying which model was used for imputation. I want to plot the distributions by grouping them by model.
dfc looks something like: (1000 rows, 5 columns)
X1 X2 X3 X4 model
1500000 400000 0.542 7.521 actual
250000 32000 2.623 11.423 missForest
...
I use the lines below to plot:
library(lattice)
densityplot(X1 + X2 + X3 + X4, group = dfc$model)
giving:
Note that X1 <- dfc$X1 (and likewise)
My questions are:
How can I add a legend to this plot? (this plot is useless if one can't tell which colour belongs to which model)
Is there, perhaps, a more visually appealing way to plot this? Using ggplot, perhaps?
Is there a better way to compare these models? For example, I could plot for each column separately.
A fast density plot using ggplot.
library(ggplot2)
library(reshape2)
a <- rnorm(50)
b <- runif(50, min = -5, max = 5)
c <- rchisq(50, 2)
data <- data.frame(rnorm = a, runif = b, rchisq = c)
data <- melt(data) #from reshape2 package
ggplot(data) + geom_density(aes(value, color = variable)) +
geom_jitter(aes(value, 0, color = variable), alpha = 0.5, height = 0.02 )
Remark: I added the reshape2 package because ggplot likes "long" data and I think yours are "wide".
Plotting each column separately would work like this:
ggplot(data) + geom_density(aes(value, color = variable)) +
geom_point(aes(value, 0, color = variable)) +
facet_grid(. ~ variable)
Here the color might be redundant but you can just remove the color argument.
All I had to do was set an argument:
densityplot(X1 + X2 + X3 + X4, group = dfc$model, auto.key = TRUE) gives the desired plot
The problem was that I couldn't figure out which densityplot() R was using.
The other parts of the question remain open.
Data copied from @alex
library(ggplot2)
library(reshape2)
a <- rnorm(50)
b <- runif(50, min = -5, max = 5)
c <- rchisq(50, 2)
dat <- data.frame(Hmisc = a, MICE = b, missForest = c)
dat <- melt(dat)
library(lattice) # using lattice package
densityplot(~value,dat,groups = variable,auto.key = T)
individual plots
densityplot(~value|variable,dat,groups = variable,auto.key = T,scales=list(relation="free"))
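For data shaped like dfc in the question (four numeric columns plus a model label), a ggplot2 version that compares models within each column might look like the sketch below. The dfc built here is a hypothetical stand-in with random values, and the reshaping is done in base R to avoid extra packages.

```r
library(ggplot2)

set.seed(1)
n <- 100
# Hypothetical stand-in for dfc: four numeric columns plus a model label
dfc <- data.frame(
  X1 = rlnorm(n, 12), X2 = rlnorm(n, 10), X3 = runif(n), X4 = rnorm(n, 8),
  model = rep(c("actual", "missForest"), each = n / 2)
)

# Reshape to long format with base R: one row per (row, column) value
cols <- c("X1", "X2", "X3", "X4")
long <- data.frame(
  model = rep(dfc$model, times = length(cols)),
  variable = rep(cols, each = nrow(dfc)),
  value = unlist(dfc[cols], use.names = FALSE)
)

# One panel per column, one density curve per model, free axes per panel
p <- ggplot(long, aes(value, colour = model)) +
  geom_density() +
  facet_wrap(~ variable, scales = "free")
```

Free scales matter here because the columns live on very different ranges; the colour legend identifies the model automatically.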

ggplot loess line from one dataset over scatterplot of another

The function below calculates binned averages, sizes the bin points on the graph relative to the number of observations in each bin, and plots a lowess line through the bin means. Instead of plotting the lowess line through the bin means, however, I would like to plot the line through the original dataset so that the error bands on the lowess line represent the uncertainty in the actual dataset, not the uncertainty in the binned averages. How do I modify geom_smooth() so that it will plot the line using df instead of dfplot?
library(fields)
library(ggplot2)
binplot <- function(df, yvar, xvar, sub = FALSE, N = 50, size = 40, xlabel = "X", ylabel = "Y"){
if(sub != FALSE){
df <- subset(df, eval(parse(text = sub)))
}
out <- stats.bin(df[,xvar], df[,yvar], N= N)
x <- out$centers
y <- out$stats[ c("mean"),]
n <- out$stats[ c("N"),]
dfplot <- as.data.frame(cbind(x,y,n))
if(size != FALSE){
sizes <- n * (size/max(n))
}else{
sizes = 3
}
ggplot(dfplot, aes(x,y)) +
xlab(xlabel) +
ylab(ylabel) +
geom_point(shape=1, size = sizes) +
geom_smooth()
}
Here is a reproducible example that demonstrates how the function currently works:
sampleSize <- 10000
x1 <- rnorm(n=sampleSize, mean = 0, sd = 4)
y1 <- x1 * 2 + x1^2 * .3 + rnorm(n=sampleSize, mean = 5, sd = 10)
binplot(data.frame(x1,y1), "y1", "x1", N = 25)
As you can see, the error band on the lowess line reflects the uncertainty that would apply if each bin had an equal number of observations, but they do not. The bins at the extremes have far fewer observations (as illustrated by the size of the points), and the lowess line's error band should reflect that.
You can explicitly set the data= parameter for each layer. You will also need to change the aesthetic mapping since the original data.frame had different column names. Just change your geom_smooth call to
geom_smooth(data=df, aes_string(xvar, yvar))
with the sample data, this returned
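A self-contained sketch of the same pattern, binned points from one data frame and the smoother fitted to another, is given below. It makes several assumptions of its own: the bins are computed with base cut() and tapply() rather than fields::stats.bin, the smoother is set to loess explicitly, and the .data pronoun (ggplot2 3.0.0 or later) is used in place of aes_string.

```r
library(ggplot2)

set.seed(42)
raw <- data.frame(x1 = rnorm(1000, mean = 0, sd = 4))
raw$y1 <- raw$x1 * 2 + raw$x1^2 * 0.3 + rnorm(1000, mean = 5, sd = 10)

# Bin x1 into 25 equal-width intervals and summarise each bin
bins <- cut(raw$x1, breaks = 25)
binned <- data.frame(
  x = as.numeric(tapply(raw$x1, bins, mean)),
  y = as.numeric(tapply(raw$y1, bins, mean)),
  n = as.numeric(table(bins))
)
binned <- binned[binned$n > 0, ]  # drop empty bins

xvar <- "x1"
yvar <- "y1"

# Points come from the binned summary; the smoother is fitted to the raw rows
p <- ggplot(binned, aes(x, y)) +
  geom_point(aes(size = n), shape = 1) +
  geom_smooth(data = raw, aes(x = .data[[xvar]], y = .data[[yvar]]),
              method = "loess")
```

The confidence band then widens at the extremes, where the raw data are sparse, instead of treating every bin mean as equally informative.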
