ggplot2 vs sm package density plot output (and statistical analysis) - r

Consider the following data frame example
library('ggplot2')
library('sm')
original<-c(1:100,1)
a<-sample(original,100)
b<-rep(1:4,25)
lala<-data.frame(a,b)
My aim is to produce density plots for values in lala$a, according to each group (1,2,3,4) defined in lala$b.
For doing so in ggplot2 I could do the following
plotDensityggplot<-ggplot()+
geom_density(data = lala, aes(a, colour=factor(b)))+
theme_classic()
print(plotDensityggplot)
producing this:
However, when I plot the same data using the 'sm' package to make a formal comparison of the densities using the following code:
sm.density.compare(lala$a,as.numeric(lala$b),model = "equal")
The density curves extend beyond zero in the X-axis, despite there is no value below zero in lala$a
What's going on? - note that this affect the densities reported in the y-axis.
Is the p-value from the permutation test of equality obtained from sm.density.compare a reliable estimate? - thank you!

For what it's worth, you can (more or less) reproduce the sm output in ggplot by pre-computing densities with base R's density (I'm not familiar with sm but I imagine that sm.density calls base R's density at some point as well).
library(tidyverse)
lala %>%
group_by(b) %>%
summarise(tmp = list(map_dfc(c("x", "y"), ~density(a)[.x]))) %>%
unnest() %>%
ggplot(aes(x, y, colour = as.factor(b))) +
geom_line()
I'm not sure how geom_density (or stat_density) tune kernel density estimation parameters, but you seem to have less control over them than in base R's density.

Related

Plotting an nlme object in ggplot2

I calculated a linear-mixed model using the nlme package. I was evaluating a psychological treatment and used treatment condition and measurement point as predictors. I did post-hoc comparisons using the emmans package. So far so good, everything worked out well and I am looking forward to finish my thesis. There is only one problem left. I am really really bad in plotting. I want to plot the emmeans for the four measurement points for each group. The emmip function in emmeans does this, but I am not that happy with the result. I used the following code to generate the result:
emmip(HLM_IPANAT_pos, Gruppe~TP, CIs=TRUE) + theme_bw() + labs(x = "Zeit", y = "IPANAT-PA")
I don't like the way the confidence intervals are presented. I would prefer a line bar with "normal" confidence bars, like the one below, which is taken from Ireland et al. (2017). I tried to do it in excel, but did not find out how to integrate seperate confidence intervals for each line. So I was wondering if there was the possibility to do it using ggplot2. However, I do not know how to integrate the values I obtained using emmeans in ggplot. As I said, I really have no idea about plotting. Does someone know how to do it?
I think it is possible. Rather than using emmip to create the plot, you could use emmeans to get the values for ggplot2. With ggplot2 and the data, you might be able to better control the format of the plot. Since I do not have your data, I can only suggest a few steps.
First, after fitting the model HLM_IPANAT_pos, get values using emmeans. Second, broom::tidy this object. Third, ggplot the above broom::tidy object.
Using mtcars data as an example:
library(emmeans)
# mtcars data
mtcars$cyl = as.factor(mtcars$cyl)
# Model
mymodel <- lm(mpg ~ cyl * am, data = mtcars)
# using ggplot2
library(tidyverse)
broom::tidy(emmeans(mymodel, ~ am | cyl)) %>%
mutate(cyl_x = as.numeric(as.character(cyl)) + 0.1*am) %>%
ggplot(aes(x = cyl_x, y = estimate, color = as.factor(am))) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.1)
Created on 2019-12-29 by the reprex package (v0.3.0)

Creating multiple density plots using only summary statistics (no raw data) in R

I work with a massive 4D nifti file (x - y - z - subject; MRI data) and due to the size I can't convert to a csv file and open in R. I would like to get a series of overlaying density plots (classic example here) one for each subject with the idea to just visualise that there is not much variance in density distributions across the sample.
I could however, extract summary statistics for each subject (mean, median, SD, range etc. of the variable of interest) and use these to create the density plots (at least for the variables that are normally distributed). Something like this would be fantastic but I am not sure how to do it for density plots.
Your help will be much appreciated.
So these really aren't density plots per se - they are plots of densties of normal distributions with given means and standard deviations.
That can be done in ggplot2, but you need to expand your table of subjects and summaries into grids of points and normal densities at those points.
Here's an example. First, make up some data, consisting of subject IDs and some simulated sample averages and sample standard deviations.
library(tidyverse)
set.seed(1)
foo <- data_frame(Subject = LETTERS[1:10], avg=runif(10, 10,20), stdev=runif(10,1,2))
Now, for each subject we need to obtain a suitable grid of "x" values along with the normal density (for that subject's avg and stdev) evaluated at those "x" values. I've chosen plus/minus 4 standard deviations. This can be done using do. But that produces a funny data frame with a column consisting of data frames. I use unnest to explode out the data frame.
bar <- foo %>%
group_by(Subject) %>%
do(densities=data_frame(x=seq(.$avg-4*.$stdev, .$avg+4*.$stdev, length.out = 50),
density=dnorm(x, .$avg, .$stdev))) %>%
unnest()
Have a look at bar to see what happened. Now we can use ggplot2 to put all these normal densities on the same plot. I'm guessing with lots of subjects you wouldn't want a legend for the plot.
bar %>%
ggplot(aes(x=x, y=density, color=Subject)) +
geom_line(show.legend = FALSE)

R: Weighted Joyplot/Ridgeplot/Density Plot?

I am trying to create a joyplot using the ggridges package (based on ggplot2). The general idea is that a joyplot creates nicely scaled stacked density plots. However, I cannot seem to produce one of these using weighted density. Is there some way of incorporating sampling weights (for weighted density) in the calculation of the densities in the creation of a joyplot?
Here's a link to the documentation for the ggridges package: https://cran.r-project.org/web/packages/ggridges/ggridges.pdf I know a lot of packages based on ggplot can accept additional aesthetics, but I don't know how to add weights to this type of geom object.
Additionally, here is an example of an unweighted joyplot in ggplot. I am trying to convert this to a weighted plot with the density weighted according to pweight.
# Load package, set seed
library(ggplot)
set.seed(1)
# Create an example dataset
dat <- data.frame(group = c(rep("A",100), rep("B",100)),
pweight = runif(200),
val = runif(200))
# Create an example of an unweighted joyplot
ggplot(dat, aes(x = val, y = group)) + geom_density_ridges(scale= 0.95)
It looks like the way to do this is to use stat_density rather than the default stat_density_ridges. Per the docs you linked to:
Note that the default stat_density_ridges makes joint density
estimation across all datasets. This may not generate the desired
result when using faceted plots. As an alternative, you can set
stat = "density" to use stat_density. In this case, it is required
to add the aesthetic mapping height = ..density.. (see examples).
Fortunately, stat_density (unlike stat_density_ridges) understands the aesthetic weight and will pass it to the underlying density call. You end up with something like:
ggplot(dat, aes(x = val, y = group)) +
geom_density_ridges(aes(height=..density.., # Notice the additional
weight=pweight), # aes mappings
scale= 0.95,
stat="density") # and use of stat_density
The ..density.. variable is automatically generated by stat_density.
Note: It appears that when you use stat_density the x-axis range behaves a little differently: it will trim the density plot to the data range and drop the nice-looking tails. You can easily correct this by manually expanding your x-axis, but I thought it was worth mentioning.

Combine logistic regression with bar graph for maturity results

I am trying to present the results of a logistic regression analysis for the maturity schedule of a fish species. Below is my reproducible code.
#coded with R version R version 3.0.2 (2013-09-25)
#Frisbee Sailing
rm(list=ls())
library(ggplot2)
library(FSA)
#generate sample data 1 mature, 0 non mature
m<-rep(c(0,1),each=25)
tl<-seq(31,80, 1)
dat<-data.frame(m,tl)
# add some non mature individuals at random in the middle of df to
#prevent glm.fit: fitted probabilities numerically 0 or 1 occurred error
tl<-sample(50:65, 15)
m<-rep(c(0),each=15)
dat2<-data.frame(tl,m)
#final dataset
data3<-rbind(dat,dat2)
ggplot can produce a logistic regression graph showing each of the data points employed, with the following code:
#plot logistic model
ggplot(data3, aes(x=tl, y=m)) +
stat_smooth(method="glm", family="binomial", se=FALSE)+
geom_point()
I want to combine the probability of being mature at a given size, which is obtained, and plotted with the following code:
#plot proportion of mature
#clump data in 5 cm size classes
l50<-lencat(~tl,data=data3,startcat=30,w=5)
#table of frequency of mature individuals by size
mat<-with(l50, table(LCat, m))
#proportion of mature
mat_prop<-as.data.frame.matrix(prop.table(mat, margin=1))
colnames(mat_prop)<-c("nm", "m")
mat_prop$tl<-as.factor(seq(30,80, 5))
# Bar plot probability mature
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")
What I've been trying to do, with no success, is to make a graph that combines both, since the axis are the same it should be straightforward, but I cant seem to make t work. I have tried:
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="bin")+
stat_smooth(method="glm", family="binomial", se=FALSE)
but does not work. Any help would be greatly appreciated. I am new so not able to add the resulting graphs to this post.
I see three problems with your code:
Using stat="bin" in your geom_bar() is inconsisten with giving values for the y-axis (y=m). If you bin, then you count the number of x-values in an interval and use that count as y-value, so there is no need to map your data to the y-axis.
The data for the glm-plot is in data3, but your combined plot only uses mat_prop.
The x-axis of the two plots are acutally not quite the same. In the bar plot, you use a factor variable on the x-axis, making the axis discrete, while in the glm-plot, you use a numeric variable, which leads to a continuous x-axis.
The following code gave a graph combining your two plots:
mat_prop$tl<-seq(30,80, 5)
ggplot(mat_prop, aes(x=tl,y=m)) +
geom_bar(stat="identity") +
geom_point(data=data3) +
geom_smooth(data=data3,aes(x=tl,y=m),method="glm", family="binomial", se=FALSE)
I could run it after first sourcing your script to define all the variables. The three problems mentioned above are adressed as follows:
I used geom_bar(stat="identity") in order not to use binning in the bar plot.
I use the data-argument in geom_point and geom_smooth in order to use the correct data (data3) for these parts of the plot.
I redifine mat_prop$tl to make it numeric. It is then consistent with the column tl in data3, which is numeric as well.
(I also added the points. If you don't want them, just remove geom_point(data=data3).)
The plot looks as follows:

Using hex binning for downsampling QQ plots

My datasets are pretty large and rendering generated QQ plots is slow and sometimes even freezes my browser. I know that one option that I have is simply to downsample the data vector. However, I wanted to try hex binning technique instead of downsampling. Unfortunately, I couldn't make it work (two of my several attempts are shown below). If downsampling is possible to achieve using hex binning (which I suspect is, as it's similar to histograms), I'd appreciate, if someone could show me how to do it. I use ggplot2. Thanks!
g <- ggplot(df, aes(x=var)) + stat_qq(aes(x = var), geom = "hex")
g <- ggplot(df, aes(x = var, y = ..density..)) +
geom_hex(aes(sample = var), stat = "qq")
print (g)
The first call results in the following error message:
Error: stat_qq requires the following missing aesthetics: sample
The second call results in this message:
Error in eval(expr, envir, enclos) : object 'density' not found
UPDATE: I think that more correct variant is this, but I'm not sure what should be the arguments:
g <- ggplot(df, aes(??, ??)) + stat_binhex()
Not sure if this is what you are looking for exactly, but I offer a couple ways to do hexagonal binning. First with ggplot as you are trying to work with and the second with the package hexbin which seems to look better to me, but just my preference.
library(ggplot2)
x <- rgamma(1000,8,2)
y <- rnorm(1000,4,1.5)
binFrame <- data.frame(x,y)
qplot(x,y,data=binFrame, geom='bin2d') # with ggplot...rectangular binning actually
library(hexbin)
hexbinplot(y~x, data=binFrame) # with hexbin...actually hexagonal binning
Edit:
So I was thinking a bit about this at lunch and I think the fundamental issues is that hexbining is a multidimensional data reduction technique and it seems like you are trying to do uni-variate QQ plots on really large sample, but with hexbin in ggplot. At any-rate I can think of a way to do hex bin plots with ggplot, but the best I came up with is to start from scratch and manually construct both the theoretical quantiles (x) and sample quantiles (y). So here is what I came up with.
Basic QQ-Plot Manually
# setting up manual QQ plot used to plot with and with out hexbins
xSamp <- rgamma(1000,8,.5) # sample data
len <- 1000
i <- seq(1,len,by=1)
probSeq <- (i-.5)/len # probability grid
invCDF <- qnorm(probSeq,0,1) # theoretical quantiles for standard normal, but you could compare your sample to any distribution
orderGam <- xSamp[order(xSamp)] # ordered sampe
df <- data.frame(invCDF,orderGam)
plot(invCDF,orderGam,xlab="Standard Normal Theoretical Quantiles",ylab="Standardized Data Quantiles",main="QQ-Plot")
abline(lm(orderGam~invCDF),col="red",lwd=2)
QQ Plot With Hexbins in ggplot:
ggplot(df, aes(invCDF, orderGam)) + stat_binhex() + geom_smooth(method="lm")
![QQ Plot with ggplot][2]
So at the end of the day this might not scale up readily, but if you are looking to do true multidimensional tests of normality you might think about chi-square plots for multivariate normality. cheers

Resources