Box plot of two groups with a regression line added to each group - r

I want to make a graph that graphs box plots for two groups and adds a regression line for each group. I have seen a few examples available, but none achieving my goal.
My dataframe is like so:
df <- data.frame(cont.burnint = c(rep(2,10), rep(12,10), rep(25,10)),
                 variable = rep(c("divA","divC"), 30),
                 value = sample(x = seq(-1,4,0.5), size = 60, replace = TRUE))
I would like to produce a graph like:
However, I want to change the points to a box plot for each group. I have not found helpful examples in the following:
Add geom_smooth to boxplot
Adding a simple lm trend line to a ggplot boxplot
The code I have found so far changes my continuous variable cont.burnint to a factor and reorders the x-values from c(2,12,25) to c(12,2,25). Also, the regression lines in the ggplot examples (refer to the links) do not extend to the y-axis; I would like the regression line to extend to the y-axis. Thirdly, the box plots become offset from each other, and I would like an option that keeps the box plots for both groups at the same x value.
So basically, I want to change the points in the example graph above to box-and-whisker plots and keep everything else the same. I also wouldn't mind adding a legend below the plot and making the text and lines bolder.
Here is the code for the example above:
plot(as.numeric(as.character(manovadata$cont.burnint)), manovadata$divA, type = "p",
     col = "black", xlab = "Burn Interval (yr)", ylab = "Interaction Diversity",
     bty = "n", cex.lab = 1.5)
points(as.numeric(as.character(manovadata$cont.burnint)), manovadata$divC, col = "grey")
abline(lm(manovadata$divA ~ as.numeric(as.character(manovadata$cont.burnint)), manovadata),
       col = "black", lty = 1)
abline(lm(manovadata$divC ~ as.numeric(as.character(manovadata$cont.burnint)), manovadata),
       col = "grey", lty = 1)

I can't imagine why you want overlaying boxplots, but here you go I think:
library(ggplot2)
df$cont.burnint <- as.factor(df$cont.burnint)
ggplot(df, aes(x = cont.burnint, y = value, col = variable)) +
  geom_boxplot(position = position_dodge(width = 0), alpha = 0.5) +
  geom_smooth(aes(group = variable), method = "lm")
I added some transparency to the boxplots using alpha to make them visible on top of each other.
Update:
Here cont.burnint should stay numeric (not converted to a factor as above), so that the boxplots sit at the true x-values and fullrange = TRUE together with xlim() extends the regression lines to the y-axis:
ggplot(df, aes(x = cont.burnint, y = value, col = variable)) +
  geom_boxplot(aes(group = paste(variable, cont.burnint))) +
  geom_smooth(aes(group = variable), method = "lm", fullrange = TRUE, se = FALSE) +
  xlim(0, 30)
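If you also want the legend below the plot and bolder text and lines, as asked in the question, here is a minimal sketch building on the update; the theme, base_size, and size values are illustrative choices, not part of the original answer:
library(ggplot2)
ggplot(df, aes(x = cont.burnint, y = value, col = variable)) +
  geom_boxplot(aes(group = paste(variable, cont.burnint)), size = 0.8) +
  geom_smooth(aes(group = variable), method = "lm",
              fullrange = TRUE, se = FALSE, size = 1.2) +
  xlim(0, 30) +
  labs(x = "Burn Interval (yr)", y = "Interaction Diversity", col = NULL) +
  theme_classic(base_size = 16) +    # larger, bolder-looking text
  theme(legend.position = "bottom")  # legend below the plot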

Related

Why do some of my violin plots look "wavy" for discrete scales?

I have overlaid violin plots comparing group A and group B scores for a particular section of a survey, facet-wrapped by section. The scores are discrete 1-7 values. In some of these violin plots, the smoothing works as expected. In others, one group or the other looks very "wavy" between discrete scores (shown below).
I thought the problem may be a difference in the group sizes, but then surely the "waviness" would appear in all the section plots.
Also, this doesn't explain to me why the plots "dip in" despite being discrete 1-7 values.
When I add the adjust parameter it over-smooths the already smooth sections, so it's not quite ideal.
I use this code to create the plots
create_violin_across_groups_by_section <- function(data, test_group = "first") {
  g <- ggplot(data) +
    aes(x = factor(nrow(data)), y = score, fill = group) +
    geom_violin(alpha = 0.5, position = "identity") +
    facet_wrap("section") +
    labs(title = paste("Comparison across groups for ", test_group))
  return(g)
}
which results in something like this:
In this case, "openness" is oddly wavy while the others all appear to be smoothed as normal.
I've thought perhaps it has something to do with the x=factor(nrow(data)) but again, surely the waviness would appear in all the section plots.
I would expect either all of the plots to be wavy (though I still wouldn't understand why) or all of them to have the same smoothness.
How can I make all of the facet-wrapped plots have the same smoothness, and why are they different in the first place?
Thanks all
The shape of the violin plot is calculated with a kernel density estimation. Kernel density estimations are designed for continuous data and not for discrete data, like your scores. While you can feed discrete data to the kernel estimator, the result may not always be beautiful or even meaningful. You can try to use different kernel and bw argument values in the geom_violin or you might consider something designed for discrete data, such as geom_dotplot.
+ geom_dotplot(binaxis = "y", stackdir = "center", position = "dodge")
Check out the corresponding example in the geom_dotplot documentation https://ggplot2.tidyverse.org/reference/geom_dotplot.html for a preview of how it can look.
Check out the kernel and bw descriptions in the geom_violin documentation https://ggplot2.tidyverse.org/reference/geom_violin.html, which point to the density function https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/density for further information on how kernel density estimates are calculated.
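For example, here is a hedged sketch of both suggestions, reusing the plotting code from the question; the bw value is purely illustrative and would need tuning for the actual survey data:
library(ggplot2)
# Option 1: fix the kernel bandwidth so every facet is smoothed identically
# (bw is passed through to stats::density)
ggplot(data) +
  aes(x = factor(nrow(data)), y = score, fill = group) +
  geom_violin(alpha = 0.5, position = "identity", bw = 0.5) +
  facet_wrap("section")
# Option 2: treat the 1-7 scores as the discrete values they are
ggplot(data) +
  aes(x = group, y = score, fill = group) +
  geom_dotplot(binaxis = "y", stackdir = "center", position = "dodge") +
  facet_wrap("section")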

Contour plot via Scatter plot

Scatter plots are useless when the number of points is large. So, e.g., using a normal approximation, we can get a contour plot instead.
My question: is there any package to produce a contour plot from a scatter plot?
Thank you @G5W!! I can do it!!
You don't offer any data, so I will respond with some artificial data, constructed at the bottom of the post. You also don't say how much data you have, although you say it is a large number of points. I am illustrating with 20000 points.
You used the group number as the plotting character to indicate the group. I find that hard to read. But just plotting the points doesn't show the groups well. Coloring each group a different color is a start, but does not look very good.
plot(x,y, pch=20, col=rainbow(3)[group])
Two tricks that can make a lot of points more understandable are:
1. Make the points transparent: the dense places will appear darker.
2. Reduce the point size.
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
That looks somewhat better, but did not address your actual request.
Your sample picture seems to show confidence ellipses. You can get those using the function dataEllipse from the car package.
library(car)
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
dataEllipse(x, y, factor(group), levels = c(0.70, 0.85, 0.95),
            plot.points = FALSE, col = rainbow(3), group.labels = NA,
            center.pch = FALSE)
But if there are really a lot of points, the points can still overlap so much that they are just confusing. You can also use dataEllipse to create what is basically a 2D density plot without showing the points at all. Just plot several ellipses of different sizes over each other, filling them with transparent colors. The center of the distribution will appear darker. This can give an idea of the distribution for a very large number of points.
plot(x,y,pch=NA)
dataEllipse(x, y, factor(group), levels = c(seq(0.15, 0.95, 0.2), 0.995),
            plot.points = FALSE, col = rainbow(3), group.labels = NA,
            center.pch = FALSE, fill = TRUE, fill.alpha = 0.15, lty = 1, lwd = 1)
You can get a more continuous look by plotting more ellipses and leaving out the border lines.
plot(x,y,pch=NA)
dataEllipse(x, y, factor(group), levels = seq(0.11, 0.99, 0.02),
            plot.points = FALSE, col = rainbow(3), group.labels = NA,
            center.pch = FALSE, fill = TRUE, fill.alpha = 0.05, lty = 0)
Please try different combinations of these to get a nice picture of your data.
Additional response to comment: Adding labels
Perhaps the most natural place to add group labels is the centers of the ellipses. You can get those by simply computing the centroids of the points in each group. So, for example,
plot(x,y,pch=NA)
dataEllipse(x, y, factor(group), levels = c(seq(0.15, 0.95, 0.2), 0.995),
            plot.points = FALSE, col = rainbow(3), group.labels = NA,
            center.pch = FALSE, fill = TRUE, fill.alpha = 0.15, lty = 1, lwd = 1)
## Now add labels
for (i in unique(group)) {
  text(mean(x[group == i]), mean(y[group == i]), labels = i)
}
Note that I just used the number as the group label, but if you have a more elaborate name, you can change labels=i to something like labels=GroupNames[i].
Data
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
You can use hexbin::hexbin() to show very large datasets.
@G5W gave a nice dataset:
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
If you don't know the group information, then the ellipses are inappropriate; this is what I'd suggest:
library(hexbin)
plot(hexbin(x,y))
which produces
If you really want contours, you'll need a density estimate to plot. The MASS::kde2d() function can produce one; see the examples in its help page for plotting a contour based on the result. This is what it gives for this dataset:
library(MASS)
contour(kde2d(x,y))
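If you do know the groups and want contour lines rather than ellipses, here is a hedged sketch combining the two answers: kde2d() is run per group and the contours are drawn over the transparent scatter from above (the grid size n = 100 is an arbitrary choice):
library(MASS)
plot(x, y, pch = 20, col = rainbow(3, alpha = 0.1)[group], cex = 0.8)
for (i in unique(group)) {
  dens <- kde2d(x[group == i], y[group == i], n = 100)  # 2D density for one group
  contour(dens, add = TRUE, col = rainbow(3)[i], drawlabels = FALSE)
}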


Labels/points colored by category with PCA

I'm using prcomp to do a PCA in R. I want to plot PC1 vs PC2 with different-colored text labels for each of the two categories.
I do the plot with:
plot(pca$x, main = "PC1 Vs PC2", xlim=c(-120,+120), ylim = c(-70,50))
then to draw in all the text with the different colors I've tried:
text(pca$x[,1][1:18], pca$[,1][1:18], labels=rownames(cava), col="green",
adj=c(0.3,-0.5))
text(pca$x[,1][19:35], pca$[,1][19:35], labels=rownames(cava), col="red",
adj=c(0.3,-0.5))
But R seems to plot two numbers over each other instead of one. I know pca$x[,1][1:18] selects the correct points, because if I use it to plot the points it works and produces the same plot as plot(pca$x).
It would be great if anyone could help me plot the labels for the two categories, or even plot the points in different colors, to make it easy to differentiate between the groups.
You need to specify your x and y coordinates a bit differently:
text(pca$x[1:18,1], pca$x[1:18,2] ...)
This means take the first 18 rows and the first column (which is PC1) for the x coord, etc.
I'm surprised what you did doesn't throw an error.
If you want the points themselves colored, you can do it this way:
plot(pca$x, main = "PC1 Vs PC2", col = c(rep("green", 18), rep("red", 18)))
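Putting both suggestions together, here is a minimal sketch; it assumes, as in the question, that rows 1-18 are one category and rows 19-35 the other, and that rownames(cava) holds the labels:
# one colour per observation, by category
grp_col <- c(rep("green", 18), rep("red", 17))
plot(pca$x[, 1], pca$x[, 2], main = "PC1 vs PC2", col = grp_col, pch = 19,
     xlim = c(-120, 120), ylim = c(-70, 50))
text(pca$x[, 1], pca$x[, 2], labels = rownames(cava), col = grp_col,
     adj = c(0.3, -0.5))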

How can I plot a histogram of a long-tailed data using R?

I have data that is mostly centered in a small range (1-10), but there is a significant number of points (say, 10%) which are in (10-1000). I would like to plot a histogram for this data that will focus on (1-10) but will also show the (10-1000) data. Something like a log-scale for the histogram.
Yes, I know this means not all bins are of equal size.
A simple hist(x) gives
while hist(x, breaks=c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,3,4,5,7.5,10,15,20,50,100,200,500,1000,10000)) gives
none of which is what I want.
update
Following the answers here, I now produce something that is almost exactly what I want (I went with a continuous plot instead of a bar histogram):
breaks <- c(0, 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 4, 8)
ggplot(t, aes(x)) +
  geom_histogram(colour = "darkblue", size = 1, fill = "blue") +
  scale_x_log10('true size/predicted size', breaks = breaks, labels = breaks)
The only problem is that I'd like the scale to match the actual bars plotted. There are two options for doing that: one is to simply use the actual margins of the plotted bars (how?) and accept "ugly" x-axis labels like 1.1754, 1.2985, etc. The other, which I prefer, is to control the actual bin boundaries so that they match the breaks.
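One way to do that second option is sketched below; it assumes ggplot2 bins the data after the log10 transformation, so the bin boundaries must be given on the log10 scale, and the 0 break is dropped because log10(0) is -Inf:
library(ggplot2)
breaks <- c(1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 4, 8)
ggplot(t, aes(x)) +
  # bin boundaries in log10 units so they line up with the labelled breaks
  geom_histogram(breaks = log10(breaks), colour = "darkblue", fill = "blue") +
  scale_x_log10("true size/predicted size", breaks = breaks, labels = breaks)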
Log scale histograms are easier with ggplot than with base graphics. Try something like
library(ggplot2)
dfr <- data.frame(x = rlnorm(100, sdlog = 3))
ggplot(dfr, aes(x)) + geom_histogram() + scale_x_log10()
If you are desperate for base graphics, you need to plot a log-scale histogram without axes, then manually add the axes afterwards.
h <- hist(log10(dfr$x), axes = FALSE)
Axis(side = 2)
Axis(at = h$breaks, labels = 10^h$breaks, side = 1)
For completeness, the lattice solution would be
library(lattice)
histogram(~x, dfr, scales = list(x = list(log = TRUE)))
AN EXPLANATION OF WHY LOG VALUES ARE NEEDED IN THE BASE CASE:
If you plot the data with no log-transformation, then most of the data are clumped into bars at the left.
hist(dfr$x)
The hist function ignores the log argument (because it interferes with the calculation of breaks), so this doesn't work.
hist(dfr$x, log = "y")
Neither does this.
par(xlog = TRUE)
hist(dfr$x)
That means that we need to log transform the data before we draw the plot.
hist(log10(dfr$x))
Unfortunately, this messes up the axes, which brings us to the workaround above.
Using ggplot2 seems like the easiest option. If you want more control over your axes and your breaks, you can do something like the following:
EDIT : new code provided
x <- c(rexp(1000,0.5)+0.5,rexp(100,0.5)*100)
breaks<- c(0,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000,10000)
major <- c(0.1,1,10,100,1000,10000)
H <- hist(log10(x),plot=F)
plot(H$mids, H$counts, type = "n",
     xaxt = "n",
     xlab = "X", ylab = "Counts",
     main = "Histogram of X",
     bg = "lightgrey")
abline(v=log10(breaks),col="lightgrey",lty=2)
abline(v=log10(major),col="lightgrey")
abline(h=pretty(H$counts),col="lightgrey")
plot(H,add=T,freq=T,col="blue")
# Position of the ticks
at <- log10(breaks)
# Create the x axis
axis(1, at = at, labels = 10^at)
This is as close as I can get to the ggplot2 look. Making the background grey is not that straightforward, but it is doable if you define a rectangle the size of your plot region and fill it with grey (see the sketch below).
Check all the functions I used, and also ?par. It will allow you to build your own graphs. Hope this helps.
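Here is a hedged sketch of the rectangle trick mentioned above, reusing H, breaks, and major from the code block; the colours are illustrative:
# set up an empty plot, paint the plot region grey, then draw on top of it
plot(H$mids, H$counts, type = "n", xaxt = "n",
     xlab = "X", ylab = "Counts", main = "Histogram of X")
usr <- par("usr")  # plot-region limits: x1, x2, y1, y2
rect(usr[1], usr[3], usr[2], usr[4], col = "lightgrey", border = NA)
abline(v = log10(major), col = "white")  # grid lines over the grey background
abline(h = pretty(H$counts), col = "white")
plot(H, add = TRUE, freq = TRUE, col = "blue")  # redraw the histogram on top
at <- log10(breaks[breaks > 0])  # drop 0: log10(0) is -Inf
axis(1, at = at, labels = breaks[breaks > 0])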
A dynamic graph would also help here. Use the manipulate package from RStudio to make a histogram with a dynamically selected range:
library(manipulate)
data_dist <- table(data)
manipulate(barplot(data_dist[x:y]), x = slider(1,length(data_dist)), y = slider(10, length(data_dist)))
Then you will be able to use sliders to see the particular distribution in a dynamically selected range like this:
