How to add sample size used in plotting geom_jitter - r

I want to add how many samples were added to a graph, next to my stat_cor (ggpubr) text.
I'm using the following code to generate the graph:
dataset = mtcars
ggplot(dataset, aes(dataset$wt, dataset$disp)) +
geom_jitter() +
geom_smooth(level=0.95, method = "loess") +
stat_cor(method="spearman") +
theme_classic()
However, I want to plot multiple graphs in one figure from a real data set in which different variables have different missing values, so it would be nice to show the sample size that was actually used to draw the geom_jitter points.

It's a little hacky (and limited in its options), but you can use the label.sep argument to insert the sample size between the correlation coefficient and the p-value. (Note that somewhat older versions of ggpubr have a bug with label.sep; if this doesn't work for you, try updating the package.)
library(ggplot2)
library(ggpubr) # provides stat_cor()
ggplot(mtcars, aes(wt, disp)) +
geom_jitter() +
geom_smooth(level = 0.95, method = "loess") +
stat_cor(method = "spearman", label.sep = sprintf(", n = %s, ", nrow(mtcars))) +
theme_classic()
If your concern is missing values, you might need to use a different function than nrow, but I'll leave that to you. This also will not work with facets (you'll get the same number in each facet).
For a fully flexible solution, I think you could use a geom_text, or maybe a stat_summary with geom = "text" would be possible?
Or go hardcore like this answer, if nothing else works
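As a rough sketch of the geom_text idea for facets: count the complete cases per facet group yourself and pass that small table to geom_text. Faceting by cyl and the names n_per_facet and label are only assumptions for illustration, not part of the original question.

```r
library(ggplot2)

# Keep only rows where both plotted variables are present
complete <- mtcars[complete.cases(mtcars[c("wt", "disp")]), ]

# One row per facet group, with the number of complete rows in that group
# (cyl as the facet variable is an assumption for this example)
n_per_facet <- aggregate(wt ~ cyl, data = complete, FUN = length)
names(n_per_facet)[2] <- "n"
n_per_facet$label <- paste0("n = ", n_per_facet$n)

p <- ggplot(mtcars, aes(wt, disp)) +
  geom_jitter() +
  facet_wrap(~ cyl) +
  # x = -Inf and y = Inf pin the label to the top-left corner of each panel
  geom_text(data = n_per_facet, aes(x = -Inf, y = Inf, label = label),
            hjust = -0.1, vjust = 1.5, inherit.aes = FALSE)
```

Printing p draws one panel per cyl value, each with its own count, because geom_text is faceted by the cyl column in n_per_facet.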
Just for completeness on missing values:
ggplot(mtcars, aes(wt, disp)) +
geom_jitter() +
geom_smooth(level = 0.95, method = "loess") +
stat_cor(method = "spearman", label.sep =
sprintf(", n = %s, ",
sum(complete.cases(mtcars[c("wt","disp")]))
)) +
theme_classic()
This shows n as the number of rows that are complete cases for both wt and disp, as in the example.

Related

Correlation plot between two variables with line and Pearson r value in graph - seeking alternate example

Would just like some clarity here and a different example if someone has one.
Initially I wanted to use this example because it has the graph, the mean line, and the r value all presented in the graph: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
However, I'm using RStudio Server and creating a Shiny app.
Library ggpubr will simply not install.
I've tried several ways to get this library to install.
So, does anyone have an alternate example that might work?
Cheers ~!
How about this:
library(ggplot2)
data(mtcars)
r <- round(cor(mtcars$wt, mtcars$mpg), 2)
p <- cor.test(mtcars$wt, mtcars$mpg)$p.value
ggplot(mtcars, aes(y=wt, x=mpg)) +
geom_point() +
geom_smooth(method="lm", col="black") +
annotate("text", x=20, y=4.5, label=paste0("r = ", r), hjust=0) +
annotate("text", x=20, y=4.25, label=paste0("p = ", round(p, 3)), hjust=0) +
theme_classic()
You could use the geom_smooth function from ggplot2 and add the correlation coefficient as well as the p-value as follows:
library(ggplot2)
my_data <- mtcars
cor_coefs <- cor.test(my_data$mpg, my_data$wt)
ggplot(data = my_data, aes(x = mpg, y = wt)) +
geom_point() +
geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
annotate("text", x = 30, y = 4, label = paste0("R: ", round(cor_coefs$estimate, 2))) +
annotate("text", x = 30, y = 3.5, label = paste0("p-value: ", round(cor_coefs$p.value, 10)))
cor_coefs stores the result of the correlation test, and you can use it to get the desired values. With annotate from ggplot2, you need to specify the x and y position of your text. You could set those positions dynamically based on your needs (since you have not provided any data).
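One possible way to make the position independent of the data range is to anchor the text at a panel corner with Inf and nudge it inwards with hjust and vjust. This is only a sketch; cor_label is a name introduced here, and the hjust and vjust values are arbitrary.

```r
library(ggplot2)

my_data <- mtcars
cor_coefs <- cor.test(my_data$mpg, my_data$wt)

# Combine both values into a single two-line label
cor_label <- paste0("R: ", round(cor_coefs$estimate, 2),
                    "\np-value: ", signif(cor_coefs$p.value, 3))

p <- ggplot(my_data, aes(x = mpg, y = wt)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # Inf places the anchor at the top-right corner, whatever the axis limits are
  annotate("text", x = Inf, y = Inf, label = cor_label,
           hjust = 1.1, vjust = 1.5)
```

Because the anchor is a corner rather than a data coordinate, the same code should work unchanged when the data, and therefore the axis limits, change.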

Add a manually designed non-linear line in ggplot2?

I would like to add a non-linear model line to a graph in R, but instead of letting ggplot find the best fit, I just want to preset its parameters and thus be able to see how multiple manually designed models fit on top of the data. I tried the following:
ggplot(cars, aes(x = speed, y = dist)) +
geom_point() +
geom_smooth(method = "nls", method.args = list(formula = y ~ 0.76*exp(x*0.5), color = blue, data = data)
But got the error:
Computation failed in 'stat_smooth()':
formal argument "data" matched by multiple actual arguments
With slight adjustments, I also get the error 'what' must be a function or character string. Does anyone know if manually designating a line like this is possible? I could not find any other Stack Overflow post about this specific topic.
You might be looking for geom_function():
gg0 <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
gg0 + geom_function(fun = function(x) 0.76*exp(x*0.5), colour = "blue") +
coord_cartesian(ylim=c(0,100))
I added coord_cartesian because the specified function attains really large values for the upper end of the x-range of this graph ...
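To compare several hand-specified models at once, one option is to add one geom_function layer per model and map the colour to a label so that a legend is drawn. The second model and its coefficient below are made up purely for illustration.

```r
library(ggplot2)

# Candidate models with preset parameters (quad_model is a hypothetical example)
exp_model  <- function(x) 0.76 * exp(x * 0.5)
quad_model <- function(x) 0.15 * x^2

p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  # A constant inside aes() gives each curve its own legend entry
  geom_function(fun = exp_model,  aes(colour = "exponential")) +
  geom_function(fun = quad_model, aes(colour = "quadratic")) +
  coord_cartesian(ylim = c(0, 130)) +
  labs(colour = "model")
```

No fitting happens here: each curve is drawn exactly as written, so changing a parameter and re-running shows how that particular model sits on the data.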

Trouble using facet_wrap

I'm a newbie, so this might be super basic.
When I run this code
Dplot <- qplot(x = diamonds$carat, y = diamonds$price, color = diamonds$color) +
ggtitle("An A+ plot") +
xlab("carat") +
ylab("price") +
geom_point()
Dplot <- Dplot + facet_wrap(vars(diamonds$clarity))
Dplot
I get an error message that says:
Error in gList(list(x = 0.5, y = 0.5, width = 1, height = 1, just =
"centre", :
only 'grobs' allowed in "gList"
I've tried googling, but haven't been able to figure out what the issue is.
I would advise against using qplot except in the most basic cases. It teaches bad habits (like using $) that should be avoided in ggplot.
We can make the switch by passing the data frame diamonds to ggplot(), and putting the mappings inside aes() with just column names, not diamonds$. Then the facet_wrap works fine as long as we also omit the diamonds$:
Dplot = ggplot(diamonds, aes(x = carat, y = price, color = color)) +
ggtitle("An A+ plot") +
xlab("carat") +
ylab("price") +
geom_point()
Dplot + facet_wrap(vars(clarity))
Dplot + facet_wrap(~ clarity) # another option
Notice the code is actually shorter because we don't need to type diamonds$ all the time!
The vars(clarity) option works fine, but more traditionally you would see the formula interface, ~ clarity. The vars() option is new-ish and will play a little nicer if you are writing a function where the name of the column to facet by is stored in a variable.
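A minimal sketch of that last point, assuming the column name arrives as a string: inside vars() the .data pronoun can look the column up by name. The helper plot_by and its argument facet_col are hypothetical names used only for this example.

```r
library(ggplot2)

# Hypothetical helper: facet_col is a column name stored in a character string
plot_by <- function(data, facet_col) {
  ggplot(data, aes(x = carat, y = price, color = color)) +
    geom_point() +
    # .data[[facet_col]] resolves the string to the column of that name
    facet_wrap(vars(.data[[facet_col]]))
}

p <- plot_by(diamonds, "clarity")
```

Calling plot_by(diamonds, "cut") would facet by cut instead, without touching the plotting code.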

Automatic n plotting with ggplot and stat_summary

This is a question related to this one. I'm dealing with a boxplot of two groups and used the function n_fun proposed in that question with a small modification (I used y=10 to locate the "n = " because I find it disturbing above the median).
Here's the function:
n_fun <- function(x){
return(data.frame(y = 10, label = paste0("n = ",length(x))))
}
ggplot(mtcars, aes(x=factor(cyl), mpg, fill=factor(am))) +
geom_boxplot() + stat_summary(fun.data = n_fun, geom = "text")
The thing is that the function recognizes that there are two different "n = " labels to be plotted, but they get plotted together at a single 'y'. I've tried passing a vector for the y position in n_fun, and it is accepted. However, I still get two overplotted "n = " labels. I'm looking for something like "position = dodge" for stat_summary, or another way to tell ggplot that it must plot those texts in the same way that it plots the dodged boxplots.
Well, as the help ?position_dodge states: Dodging things with different widths can be tricky. You may need to explicitly specify the width for dodging. In your case:
ggplot(mtcars, aes(x=factor(cyl), mpg, fill=factor(am))) +
geom_boxplot() +
stat_summary(fun.data = n_fun, geom = "text",
position = position_dodge(.9))
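If a fixed y = 10 does not suit every group, a variation is to compute the label position from the data. The sketch below places each label just above its own group's maximum; n_fun_top and the offset of 1 are assumptions for this data set, not part of the original answer.

```r
library(ggplot2)

# Variant of n_fun: y is derived from the group instead of being fixed.
# The offset of 1 is an arbitrary choice that suits the mpg scale.
n_fun_top <- function(x) {
  data.frame(y = max(x) + 1, label = paste0("n = ", length(x)))
}

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(am))) +
  geom_boxplot() +
  # Same explicit dodge width as above so the labels follow the boxes
  stat_summary(fun.data = n_fun_top, geom = "text",
               position = position_dodge(width = 0.9))
```

If the labels look off-centre relative to the boxes, the dodge width is the value to tune.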

How to improve the aspect of ggplot histograms with log scales and discrete values

I am trying to improve the clarity and aspect of a histogram of discrete values which I need to represent with a log scale.
Please consider the following MWE
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) + geom_histogram()
which produces
and then
ggplot(data, aes(x=dist)) + geom_histogram() + scale_x_log10(breaks=c(1,2,3,4,5,10,100))
which is probably even worse,
since it now gives the impression that something is missing between "1" and "2", and it is also not clear which bar has value "1" (the bar is to the right of the tick) and which bar has value "2" (the bar is to the left of the tick).
I understand that technically ggplot provides the "right" visual answer for a log scale. Yet as an observer I have trouble understanding it.
Is it possible to improve something?
EDIT:
This is what happens when I apply Jaap's solution to my real data:
Where do the dips between x=0 and x=1 and between x=1 and x=2 come from? My values are discrete, so why does the plot also map x=1.5 and x=2.5?
The first thing that comes to mind is playing with the binwidth. But that doesn't give a great solution either:
ggplot(data, aes(x=dist)) +
geom_histogram(binwidth=10) +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(expand=c(0.015,0)) +
theme_bw()
gives:
In this case it is probably better to use a density plot. However, when you use scale_x_log10 you will get a warning message (Removed 524 rows containing non-finite values (stat_density)). This can be resolved by using a log plus one transformation.
The following code:
library(ggplot2)
library(scales)
ggplot(data, aes(x=dist)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300,1000), trans="log1p", expand=c(0,0)) +
scale_y_continuous(breaks=c(0,125,250,375,500,625,750), expand=c(0,0)) +
theme_bw()
will give this result:
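The reason the plain log10 scale drops rows while log1p does not can be checked directly on the values. This is only a small sketch on the same simulated data; n_zero is a name introduced here, and it should correspond to the rows reported as non-finite in the warning.

```r
set.seed(99)
dist <- as.integer(rlnorm(1000, sdlog = 2))

# Zeros are the values a log10 axis cannot place
n_zero <- sum(dist == 0)

log10(0)   # -Inf, so these rows are removed as non-finite
log1p(0)   # 0, because log1p(x) is log(1 + x)

# With log1p every value stays finite, so nothing is dropped
all_finite <- all(is.finite(log1p(dist)))
```

The price of log1p is that the axis is no longer a pure log scale near zero, which is why the breaks are set by hand in the code above.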
I wonder whether scaling the y-axis instead of the x-axis would work. It will result in a few warnings wherever the counts are 0, but it may serve your purpose.
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) + geom_histogram() + scale_y_log10()
Also you may want to display frequencies as data labels, since people might ignore the y-scale and it takes some time to realize that y scale is logarithmic.
ggplot(data, aes(x=dist)) + geom_histogram(fill = 'skyblue', color = 'grey30') + scale_y_log10() +
stat_bin(geom="text", size=3.5, aes(label=..count.., y=0.8*(..count..)))
A solution could be to convert your data to a factor:
library(ggplot2)
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
ggplot(data, aes(x=factor(dist))) +
geom_histogram(stat = "count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Resulting in:
I had the same issue and, inspired by @Jaap's answer, I fiddled with the histogram binwidth using the x-axis in log scale.
If you use binwidth = 0.201, the bars will be juxtaposed as expected. However, this means you can only have up to five bars between two x coordinates.
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) +
geom_histogram(binwidth = 0.201, color = 'red') +
scale_x_log10()
Result:

Resources