Plotting model comparison statistics in R

I combined several data-frames into a data-frame dfc with a fifth column called model specifying which model was used for imputation. I want to plot the distributions by grouping them by model.
dfc looks something like: (1000 rows, 5 columns)
X1 X2 X3 X4 model
1500000 400000 0.542 7.521 actual
250000 32000 2.623 11.423 missForest
...
I use the lines below to plot:
library(lattice)
densityplot(X1 + X2 + X3 + X4, group = dfc$model)
giving:
Note that X1 <- dfc$X1 (and likewise for X2, X3 and X4)
My questions are:
How can I add a legend to this plot? (this plot is useless if one can't tell which colour belongs to which model)
Is there, perhaps, a more visually appealing way to plot this? Using ggplot, perhaps?
Is there a better way to compare these models? For example, I could plot for each column separately.

A fast density plot using ggplot.
library(ggplot2)
library(reshape2)
a <- rnorm(50)
b <- runif(50, min = -5, max = 5)
c <- rchisq(50, 2)
data <- data.frame(rnorm = a, runif = b, rchisq = c)
data <- melt(data) #from reshape2 package
ggplot(data) + geom_density(aes(value, color = variable)) +
geom_jitter(aes(value, 0, color = variable), alpha = 0.5, height = 0.02 )
Remark: I added the reshape2 package because ggplot likes "long" data and I think yours are "wide".
Plotting each column separately would work like this:
ggplot(data) + geom_density(aes(value, color = variable)) +
geom_point(aes(value, 0, color = variable)) +
facet_grid(. ~ variable)
Here the color might be redundant but you can just remove the color argument.

All I had to do was set an argument:
densityplot(X1 + X2 + X3 + X4, group = dfc$model, auto.key = TRUE) gives the desired plot
The problem was that I couldn't figure out which densityplot() R was using.
The other parts of the question remain open.

Data copied from @alex
library(ggplot2)
library(reshape2)
a <- rnorm(50)
b <- runif(50, min = -5, max = 5)
c <- rchisq(50, 2)
dat <- data.frame(Hmisc = a, MICE = b, missForest = c)
dat <- melt(dat)
library(lattice) # using lattice package
densityplot(~value,dat,groups = variable,auto.key = T)
Individual plots:
densityplot(~value|variable,dat,groups = variable,auto.key = T,scales=list(relation="free"))
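Applied to the asker's dfc, which is wide and already has a model column, the same two ideas could look like the sketch below. The dfc here is simulated stand-in data (its values and the three model labels are assumptions), and stack() from base R is swapped in for melt() to avoid an extra package.

```r
library(lattice)

# Simulated stand-in for the asker's dfc (values are made up for illustration)
set.seed(1)
dfc <- data.frame(
  X1 = rnorm(90), X2 = rnorm(90), X3 = rnorm(90), X4 = rnorm(90),
  model = rep(c("actual", "missForest", "MICE"), each = 30)
)

# Wide -> long: stack() returns the columns `values` and `ind` (the source column)
long <- stack(dfc[c("X1", "X2", "X3", "X4")])
long$model <- rep(dfc$model, times = 4)  # stack() keeps row order within each column

# One panel per column, one curve per model, with a legend
p <- densityplot(~ values | ind, data = long, groups = model,
                 auto.key = TRUE, scales = list(relation = "free"))
print(p)
```

Conditioning on ind gives each of X1 to X4 its own axis range, which matters here because the columns in the question are on very different scales.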

Related

Plotting the theoretical distribution of exponential with a minimum on facets (ggh4x)

Users of a Shiny app can test data sets for Poisson, normality, and exponentiality. I am returning the results of the statistical test they chose. In addition, I thought it would be nice to plot the density from the data along with the theoretical distribution. They could be testing multiple sets of data at once, so I am faceting the plot.
From ggplot add Normal Distribution while using `facet_wrap` I found the really great ggh4x package. However, since this could be industry data, there may be a minimum that is not zero.
The problem is that stat_theodensity(distri = "exp") uses dexp, which doesn't account for a non-zero minimum, so the theoretical distribution plot doesn't match the data.
How can I tell the stat_theodensity that there is an xmin for each facet, which is the min of the data in the facet? I see that fitdistrplus can use different methods to fit an exponential curve, and that, for example, method="mse" would work. Is there a way to pass this through stat_theodensity?
library(ggh4x)
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/100)+100
data2 <- rexp(n = 500,rate = 1/250)+500
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
geom_density()+
stat_theodensity(distri = "exp")+
facet_wrap(facets = ~ID,scales = "free")
p
#what the first point of the graphs should be
dexp(x = 100-100,rate = 1/100)
#[1] 0.01
dexp(x = 500-500,rate = 1/250)
#[1] 0.004
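The mismatch can be seen without any plotting. Below is a minimal base R sketch, assuming the sample minimum is an acceptable estimate of the shift; dshiftexp is a hypothetical helper written for this illustration and is not part of ggh4x or fitdistrplus.

```r
# Hypothetical helper: density of an exponential distribution shifted to start at xmin
dshiftexp <- function(x, rate, xmin = 0) {
  dexp(x - xmin, rate = rate)
}

set.seed(42)
x <- rexp(500, rate = 1 / 100) + 100

xmin_hat <- min(x)                   # sample minimum as the shift estimate
rate_hat <- 1 / mean(x - xmin_hat)   # rate estimate once the shift is removed

shifted_peak   <- dshiftexp(xmin_hat, rate_hat, xmin_hat)  # equals rate_hat
unshifted_peak <- dexp(xmin_hat, rate = 1 / mean(x))       # what a plain dexp fit gives there

c(shifted = shifted_peak, unshifted = unshifted_peak)
```

The unshifted fit is much lower at the left edge of the data, which is the gap visible in the facets.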
EDIT
OK I am getting closer. The following code works, but only for the second pass through the loop. If I change the numbers around for data1 and data2, it is always only the second one that plots the theoretical distribution.
I did ggplot_build after the first loop and it gives an error in fitdist(), which is code 100. I don't know why it would always fail on the first one but not on the second one, even with the same data.
Any ideas?
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/250)+500
data2 <- rexp(n = 500,rate = 1/100)+250
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
geom_density(color="red")
#loop through sets and add facets
for (set in unique(plot_dat$ID)){
xmin <- min(plot_dat$data[ID == set])
p<-p+
stat_theodensity(
data = ~subset(.x, ID == set),
aes(x = stage(data - xmin, after_stat = x + xmin)),
distri = "exp"
)
}
#stat_theodensity(distri = "exp")+
p<-p+
facet_wrap(facets = ~ID,scales = "free")
p
I don't know about the statistics of your problem, but if the issue is subtracting a number before calculating the density and afterwards adding it, you might do that with stage(). I couldn't find a more elegant way than hardcoding these values for each set separately, but I'd be happy to hear about more creative solutions.
library(ggh4x)
#> Loading required package: ggplot2
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/100)+100
data2 <- rexp(n = 500,rate = 1/250)+500
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
ggplot(data = plot_dat, aes(x=data))+
geom_density() +
stat_theodensity(
data = ~ subset(.x, ID == "Set 1"),
aes(x = stage(data - 100, after_stat = x + 100)),
distri = "exp"
) +
stat_theodensity(
data = ~ subset(.x, ID == "Set 2"),
aes(x = stage(data - 500, after_stat = x + 500)),
distri = "exp"
) +
facet_wrap(facets = ~ID,scales = "free")
Created on 2022-09-26 by the reprex package (v2.0.1)
EDIT
I think OP's update had a problem with non-standard evaluation. It should work when you use lapply() instead of a for loop, because then xmin is local to each function call rather than a global variable that might mistakenly be looked up later.
library(ggh4x)
#> Loading required package: ggplot2
library(ggplot2)
#generate 2 exponential distributions with xmin > 0
data1 <- rexp(n = 500,rate = 1/250)+500
data2 <- rexp(n = 500,rate = 1/100)+250
data <- c(data1,data2)
#generate a code for facets
ID1 <- c(rep("Set 1",times=500))
ID2 <- c(rep("Set 2",times=500))
ID <- c(ID1,ID2)
#make the data for plotting
plot_dat <- data.frame(ID,data)
#make the graph
p <- ggplot(data = plot_dat, aes(x=data))+
geom_density(color="red") +
facet_wrap(facets = ~ ID, scales = "free")
#loop through sets and add facets
p + lapply(unique(plot_dat$ID), function(i) {
xmin <- min(plot_dat$data[plot_dat$ID == i])
stat_theodensity(
data = ~ subset(.x, ID == i),
aes(x = stage(data - xmin, after_stat = x + xmin)),
distri = "exp"
)
})
Created on 2022-09-27 by the reprex package (v2.0.1)
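The for-loop versus lapply() difference can be reproduced in base R without ggplot2. This is only an analogy, assuming the failure comes from xmin (and set) being looked up when the plot is built rather than when the layer is created; the closures below stand in for the lazily evaluated aes() and data formula.

```r
# for loop: every closure refers to the same global `xmin`,
# so all of them see the value from the last iteration
fs_for <- list()
for (i in 1:2) {
  xmin <- i * 100
  fs_for[[i]] <- function() xmin
}
v_for <- sapply(fs_for, function(f) f())

# lapply: each call has its own environment, so each closure keeps its own `xmin`
fs_lapply <- lapply(1:2, function(i) {
  xmin <- i * 100
  function() xmin
})
v_lapply <- sapply(fs_lapply, function(f) f())

v_for     # 200 200
v_lapply  # 100 200
```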

Plot two 3D graphics from own models in one plot in R

I have a model like this
lmer(response ~ poly(pred1, 2) * poly(pred2, 2) * grouping_variable ...)
Since my grouping variable has two levels I would like to plot two 3D Graphics in one plot like this:
This is done with scatter3d from the car package. Unfortunately, there is no option to plot my own model. There are some options to choose from (linear, quadratic, ...), but I would like to plot my model.
I was able to plot my own model with scatter3D from the plot3D package, but I could not find an option to plot both levels of the grouping variable.
Do you have an idea, how I could achieve this?
Here are some example data (I am not good at simulating data, but I think it should work):
library(car)
library(dplyr)
X <- seq(76, 135) + rnorm(sd = 2, n = 60)
Y <- seq(65, 365, length.out = 60) + rnorm(sd = 4, n = 60)
Test.grid <- expand.grid(X = X, Y = Y)
Test.grid$A <- 1
Test.grid$Z <- 2*X + 0.5*Y
df1 <- sample_n(Test.grid, 60)
df2 <- df1 %>% mutate(A = 2, Y = Y + 50)
Test <- rbind(df1, df2)
X <- Test$X
Y <- Test$Y
Z <- Test$Z
scatter3d(x=X, y=Y, z=Z, groups = as.factor(Test$A), grid = FALSE, fit = "linear", surface.col = c("red", "black"))
All plotting functions in the plot3D package accept an add argument. With it, plotting the second surface is easy: just pass add = TRUE to the second plot call.
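A minimal sketch of that idea, under stated assumptions: the data are simulated, lm() is swapped in for lmer() to keep the example small, and surface_for is a hypothetical helper. The plotting step only runs if plot3D is installed.

```r
set.seed(1)
d <- data.frame(
  pred1 = runif(120, 0, 10),
  pred2 = runif(120, 0, 10),
  g     = factor(rep(c("a", "b"), each = 60))
)
d$response <- with(d, pred1 + 0.5 * pred2^2 + ifelse(g == "b", 20, 0) + rnorm(120))

# Stand-in for the mixed model in the question (no random effects here)
fit <- lm(response ~ poly(pred1, 2) * poly(pred2, 2) * g, data = d)

xs <- seq(0, 10, length.out = 25)
ys <- seq(0, 10, length.out = 25)

# Hypothetical helper: predicted surface for one level of the grouping variable
surface_for <- function(level) {
  grid <- expand.grid(pred1 = xs, pred2 = ys)
  grid$g <- factor(level, levels = levels(d$g))
  # pred1 varies fastest, so rows follow xs and columns follow ys
  matrix(predict(fit, newdata = grid), nrow = length(xs), ncol = length(ys))
}
z_a <- surface_for("a")
z_b <- surface_for("b")

if (requireNamespace("plot3D", quietly = TRUE)) {
  plot3D::persp3D(x = xs, y = ys, z = z_a, col = "red", colkey = FALSE,
                  zlim = range(z_a, z_b))
  # add = TRUE draws the second surface into the existing plot
  plot3D::persp3D(x = xs, y = ys, z = z_b, col = "grey30", colkey = FALSE,
                  add = TRUE)
}
```

Setting zlim on the first call from both surfaces keeps the second one from being clipped.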

Making surface plot of regression estimates from multiple continuous variables

I have a multi-level model with categorical and continuous variables and splines. Nice and complex. Anyhow I am trying to visualize model fit.
For example, here is some toy data:
library(lme4)
library(rms)
library(gridExtra)
## Make model using sleepstudy data
head(sleepstudy)
# Add some extra vars
sleepstudy$group <- factor( sample(c(1,2), nrow(sleepstudy), replace=TRUE) )
sleepstudy$x1 <- jitter(sleepstudy$Days, factor=5)^2 * jitter(sleepstudy$Reaction)
# Set up a mixed model with spline
fm1 <- lmer(Reaction ~ rcs(Days, 4) * group + (rcs(Days, 4) | Subject), sleepstudy)
# Now add continuous covar
fm2 <- lmer(Reaction ~ rcs(Days, 4) * group + x1 + (rcs(Days, 4) | Subject), sleepstudy)
# Plot fit
new.df <- sleepstudy
new.df$pred1 <- predict(fm1, new.df, allow.new.levels=TRUE, re.form=NA)
new.df$pred2 <- predict(fm2, new.df, allow.new.levels=TRUE, re.form=NA)
g1 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred1, col=group), size=2) +
ggtitle("Model 1")
g2 <- ggplot(data=new.df, aes(x=Days)) +
geom_line(aes(y=pred2, col=group), size=2) +
ggtitle("Model 2")
grid.arrange(g1, g2, nrow=1)
Plot 1 is smooth, but plot 2 is jagged due to the effect of x1. So I would like to make a surface plot with x = Days, y = x1 and z = pred2 and stratified by group. Not having experience of surface plots I've started out with the wireframe command:
wireframe(pred2 ~ Days * x1, data = new.df[new.df$group==1,],
xlab = "Days", ylab = "x1", zlab="Predicted fit"
)
However although this command does not give an error, my plot is blank:
Questions:
Where am I going wrong with my wireframe?
Is there a better way to visualize my model fit?
I figured out that the data format needed for a `wireframe` or `plot_ly` surface is that of a 2D matrix of x rows by y columns of corresponding z values (I got a hint towards this from the question "Plotly 3d surface graph has incorrect x and y axis values"). I also realised I could use `expand.grid` to make a grid covering the range of possible x and y values and use those to predict z as follows:
days <- 0:9
x1_range <- range(sleepstudy$x1)[2] * c(0.05, 0.1, 0.15, 0.2, 0.25, 0.3)
new.data2 <- expand.grid(Days = days, x1 = x1_range, group = unique(sleepstudy$group) )
new.data2$pred <- predict(fm2, new.data2, allow.new.levels=TRUE, re.form=NA)
I can then stuff those into two different matrices to represent the z-surface for each group in my model:
surf1 <- ( matrix(new.data2[new.data2$group == 1, ]$pred, nrow = length(days), ncol = length(x1_range)) )
surf2 <- ( matrix(new.data2[new.data2$group == 2, ]$pred, nrow = length(days), ncol = length(x1_range)) )
group <- c(rep(1, nrow(surf1)), rep(2, nrow(surf2) ))
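Because it is easy to mix up which dimension of the matrix is which, a quick orientation check can help before plotting. The sketch below uses made-up x1 values and a simple formula as a stand-in for predict(); labelling the matrix with dimnames makes the mapping explicit.

```r
days     <- 0:9
x1_range <- c(10, 20, 30)            # made-up values for illustration

grid <- expand.grid(Days = days, x1 = x1_range)
grid$pred <- grid$Days + 100 * grid$x1   # stand-in for predict(fm2, ...)

# expand.grid varies Days fastest, and matrix() fills column by column,
# so rows correspond to Days and columns to x1
surf <- matrix(grid$pred, nrow = length(days), ncol = length(x1_range))
dimnames(surf) <- list(Days = days, x1 = x1_range)

surf["3", "20"]  # 2003: Days = 3, x1 = 20
```

As far as I know, a plotly surface trace reads the rows of z along y and the columns along x, so with this layout x would be x1_range and y would be days.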
Finally I can use plot_ly to plot each surface:
library(plotly)
# x must match the columns of the z matrix (x1_range), y its rows (days)
plot_ly(z = surf1, x = x1_range, y = days, type = "surface") %>%
add_surface(z = surf2, surfacecolor = surf2,
color = c('red', 'yellow'))
The resulting plot:
So the resulting plot is what I wanted (albeit not very useful in this made-up example, but useful with real data). The only thing I can't figure out is how to show two different colour scales. I can suppress the scale altogether, but if anyone knows how to show two scales for different surfaces, please let me know and I will edit the answer.

Specify regression line intercept (R & ggplot2)

BACKGROUND
My current plot looks like this:
PROBLEM
I want to force the regression line to start at 1 for station_1.
CODE
library(ggplot2)
#READ IN DATA
var_x = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011)
var_y = c(1.000000,1.041355,1.053106,1.085738,1.126375,1.149899,1.210831,1.249480,1.286305,1.367923,1.486978,1.000000,0.9849343,0.9826141,0.9676000,0.9382975,0.9037476,0.8757748,0.8607960,0.8573634,0.8536138,0.8258877)
var_z = c('Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2')
df_data = data.frame(var_x,var_y,var_z)
out = ggplot(df_data,aes(x=var_x,y=var_y,group=var_z))
out = out + geom_line(aes(linetype=var_z),size=1)
out = out + theme_classic()
#SELECT DATA FOR Station_1
PFI_data=subset(df_data,var_z=="Station_1")
#PLOT REGRESSION FOR Station_1
out = out+ stat_smooth(data = PFI_data,
method=lm,
formula = y~x,
se=T,size = 1.4,colour = "blue",linetype=1)
Any help would be appreciated - this has been driving me crazy for too long!
First of all, you should be careful when forcing a regression line to some fixed point. Here's a link to a discussion why.
Now, from a technical perspective, I'm relying heavily on these questions and answers: one, two. The outline of my solution is the following: precompute the desired intercept, run a regression without it, add the intercept to the resulting prediction.
I'm using an internal ggplot2:::predictdf.default function to save some typing. The rbind(df, df) part may look strange, but it's a simple hack to make geom_smooth work properly, since there are two factor levels in var_z.
# Previous code should remain intact, replace the rest with this:
# SELECT DATA FOR Station_1
PFI_data=subset(df_data,var_z=="Station_1")
names(PFI_data) <- c("x", "y", "z")
x0 <- df_data[df_data$var_z == "Station_1", "var_x"][1]
y0 <- df_data[df_data$var_z == "Station_1", "var_y"][1]
model <- lm(I(y-y0) ~ I(x-x0) + 0, data = PFI_data)
xrange <- range(PFI_data$x)
xseq <- seq(from=xrange[1], to=xrange[2])
df <- ggplot2:::predictdf.default(model, xseq, se=T, level=0.95)
df <- rbind(df, df)
df[c("y", "ymin", "ymax")] <- df[c("y", "ymin", "ymax")] + y0
out + geom_smooth(aes_auto(df), data=df, stat="identity")
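The three steps in the outline (precompute the anchor point, fit without an intercept, add the offset back) can also be checked in base R without ggplot2. A minimal sketch using the Station_1 values from the question:

```r
x <- 2001:2011
y <- c(1.000000, 1.041355, 1.053106, 1.085738, 1.126375, 1.149899,
       1.210831, 1.249480, 1.286305, 1.367923, 1.486978)

# Anchor point the line is forced through
x0 <- x[1]
y0 <- y[1]

# Regression through the origin on the shifted data: only a slope is estimated
fit   <- lm(I(y - y0) ~ I(x - x0) + 0)
slope <- unname(coef(fit))

# Shift the fitted line back to the original coordinates
line <- y0 + slope * (x - x0)
line[1]  # 1, i.e. the line starts at y0
```

The same caveat as above applies: forcing the line through a point changes the slope estimate, so it is worth comparing against the unconstrained fit.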

aggregate/sum with ggplot

Is there a way to sum data with ggplot2 ?
I want to do a bubble map with the size depending of the sum of z.
Currently I'm doing something like
dd <- ddply(d, .(x,y), transform, z=sum(z))
qplot(x,y, data=dd, size=z)
But I feel I'm writing the same thing twice. I would like to be able to write something like
qplot(x,y, data=dd, size=sum(z))
I had a look at stat_sum and stat_summary but I'm not sure they are appropriate either.
Is it possible to do it with ggplot2? If not, what would be the best way to write those 2 lines?
It can be done using stat_sum within ggplot2. By default, the dot size represents proportions. To get dot size to represent counts, use size = ..n.. as an aesthetic. Counts (and proportions) by a third variable can be obtained by weighting by the third variable (weight = cost) as an aesthetic. Some examples, but first, some data.
library(ggplot2)
set.seed(321)
# Generate some data
df <- expand.grid(x = seq(1:5), y = seq(1:5), KEEP.OUT.ATTRS = FALSE)
df$Count = sample(1:25, 25, replace = F)
library(plyr)
new <- dlply(df, .(Count), function(data) matrix(rep(matrix(c(data$x, data$y), ncol = 2), data$Count), byrow = TRUE, ncol = 2))
df2 <- data.frame(do.call(rbind, new))
df2$cost <- 1:325
The data contains units categorised according to two factors: X1 and X2; and a third variable which is the cost of each unit.
Plot 1: Plots the proportion of elements at each X1 - X2 combination. group=1 tells ggplot to calculate proportions out of the total number of units in the data frame.
ggplot(df2, aes(factor(X1), factor(X2))) +
stat_sum(aes(group = 1))
Plot 2: Plots the number of elements at each X1 - X2 combination.
ggplot(df2, aes(factor(X1), factor(X2))) +
stat_sum(aes(size = ..n..))
Plot 3: Plots the cost of the elements at each X1 - X2 combination, that is weight by the third variable.
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = 1, weight = cost, size = ..n..))
Plot 4: Plots the proportion of the total cost of all elements in the data frame at each X1 - X2 combination
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = 1, weight = cost))
Plot 5: Plots proportions, but instead of the proportion being out of the total cost across all elements in the data frame, the proportion is out of the cost for elements within each category of X1. That is, within each X1 category, where does the major cost for X2 units occur?
ggplot(df2, aes(x=factor(X1), y=factor(X2))) +
stat_sum(aes(group = X1, weight = cost))
You could put the ddply call into the qplot:
d <- data.frame(x=1:10, y=1:10, z= runif(100))
qplot(x, y, data=ddply(d, .(x,y), transform, z=sum(z)), size=z)
Or use the data.table package.
library(data.table)
DT <- data.table(d, key='x,y')
qplot(x, y, data=DT[, sum(z), by='x,y'], size=V1)
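If neither plyr nor data.table is wanted, the summing step can be done with base R's aggregate(). A minimal sketch with made-up data; the base symbols() call is only there to keep the example free of extra packages, and dd could equally be passed to qplot as in the question.

```r
set.seed(1)
d <- data.frame(x = rep(1:3, each = 4), y = rep(1:2, times = 6), z = runif(12))

# One row per (x, y) pair, with z summed within the pair
dd <- aggregate(z ~ x + y, data = d, FUN = sum)

# Bubble plot: circle area proportional to the summed z
symbols(dd$x, dd$y, circles = sqrt(dd$z), inches = 0.3,
        xlab = "x", ylab = "y")
```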
