ggplot2 lm lines with categorical variables - r

I have some data from an R course. The professor added each line more or less manually using base graphics. I'd like to do it using ggplot2.
So far I've created a faceted plot in ggplot with scatter plots of hunger in different regions, and I've separately fitted a model to the data. The model has interaction terms between the x variable in the plot and the group/colour variable.
What I want to do now is plot the lines resulting from that model, one per panel. I could do this by using geom_abline and defining the slope and the intercept as the sum of two of the coefficients (the categorical group variables are coded 0/1, so in each panel only some coefficients are multiplied by 1), but this does not seem easy.
I tried passing the same formula I used in lm to stat_smooth, with no luck: I get an error.
Ideally, I'd think one could somehow put the formula into stat_smooth and have ggplot do all the work. How would one go about it?
download.file("https://sparkpublic.s3.amazonaws.com/dataanalysis/hunger.csv",
"hunger.csv", method = "curl")
hunger <- read.csv("hunger.csv")
hunger <- hunger[hunger$Sex!="Both sexes",]
hunger_small <- hunger[hunger$WHO.region!="WHO Non Members",c(5,6,8)]
q<- qplot(x = Year, y = Numeric, data = hunger_small,
color = WHO.region) + theme(legend.position = "bottom")
q <- q + facet_grid(.~WHO.region)+guides(col=guide_legend(nrow=2))
q
# I could add the standard lm line from stat_smooth, but I don't want that
# q <- q + geom_smooth(method="lm",se=F)
# I want to add the line(s) from the lm fit below; it is really one line per panel
lmRegion <- lm(hunger$Numeric ~ hunger$Year + hunger$WHO.region +
hunger$Year *hunger$WHO.region)
# I also used a loop to do it, as below, but all in one panel;
# I am not able to do that with facets.
# I used a function I found to get the colors
ggplotColours <- function(n=6, h=c(0, 360) +15) {
if ((diff(h)%%360) < 1) h[2] <- h[2] - 360/n
hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65)
}
n <- length(levels(hunger_small$WHO.region))
q <- qplot(x = Year, y = Numeric, data = hunger_small,
color = WHO.region) + theme(legend.position = "bottom")
q <- q + geom_abline(intercept = lmRegion$coefficients[1],
slope = lmRegion$coefficients[2], color = ggplotColours(n=n)[1])
for (i in 2:n) {
q <- q + geom_abline(intercept = lmRegion$coefficients[1] +
lmRegion$coefficients[1+i], slope = lmRegion$coefficients[2] +
lmRegion$coefficients[7+i], color = ggplotColours(n=n)[i])
}

If you have one categorical variable, geom_point() will not be very informative, whereas geom_boxplot() will work:
ggplot(hunger, aes(x = Sex, y = Numeric)) + geom_boxplot() + labs(x = "Sex") + geom_smooth(method = "lm", se = FALSE, col = "blue")
Susy
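One possible way to get the model's lines into each panel, sketched here under assumptions, is to compute fitted values with predict() and draw them with geom_line(). The data below are a synthetic stand-in for the hunger data: the column name region and all values are made up for illustration, and the model is written with the data argument instead of hunger$ prefixes.

```r
library(ggplot2)

# Synthetic stand-in for the hunger data (hypothetical values and names)
set.seed(1)
dat <- data.frame(
  Year   = rep(1990:2009, times = 3),
  region = factor(rep(c("A", "B", "C"), each = 20))
)
dat$Numeric <- 30 - 0.5 * (dat$Year - 1990) * as.integer(dat$region) +
  rnorm(nrow(dat))

# Model with a Year-by-region interaction: one intercept and slope per region
fit <- lm(Numeric ~ Year * region, data = dat)

# Fitted values from the model, one per observation
dat$fitted <- predict(fit, newdata = dat)

p <- ggplot(dat, aes(Year, Numeric, colour = region)) +
  geom_point() +
  geom_line(aes(y = fitted)) +          # model line, split by panel via facets
  facet_grid(. ~ region) +
  theme(legend.position = "bottom")
p
```

Because the fitted values live in the same data frame as the points, faceting splits the lines automatically and no manual sums of coefficients are needed. Note that with a full interaction like this one, each region's line has the same slope and intercept as a separate regression on that region alone, so the per-panel approach matters mainly for models that share some terms across groups.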

Related

ggplot2 geom_qq change theoretical data

I have a set of p-values, i.e. 0 <= pval <= 1.
I want to draw a Q-Q plot of them using ggplot2.
As in the documentation, the following code draws a Q-Q plot; however, since my data are p-values, I want the theoretical values to be probabilities as well, i.e. 0 <= theoretical value <= 1.
df <- data.frame(y = rt(200, df = 5))
p <- ggplot(df, aes(sample = y))
p + stat_qq() + stat_qq_line()
I am aware of qqplot.pvalues from the gaston package. It does the job, but the plot is not as customizable as the ggplot version.
In the gaston package the theoretical values are plotted as -log10((n:1)/(n + 1)), where n is the number of p-values. How can I pass these values to ggplot as the theoretical data?
Assuming you have some p-values, say from a normal distribution, you could create the plot manually:
library(ggplot2)
data <- data.frame(outcome = rnorm(150))
data$pval <- pnorm(data$outcome)
data <- data[order(data$pval),]
ggplot(data = data, aes(y = pval, x = pnorm(qnorm(ppoints(nrow(data)))))) +
geom_point() +
geom_abline(slope = 1) +
labs(x = 'theoretical p-val', y = 'observed p-val', title = 'qqplot (pval-scale)')
Although I am not sure this plot is sensible to use for conclusions.
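The same manual approach can be applied to the -log10 scale mentioned in the question. The sketch below uses the expected values -log10((n:1)/(n + 1)) quoted there; the p-values themselves are simulated, and the names pvals and qq are made up for illustration.

```r
library(ggplot2)

# Hypothetical p-values; replace with your own vector
set.seed(42)
pvals <- runif(200)
n <- length(pvals)

# Expected and observed values on the -log10 scale, both in increasing order
qq <- data.frame(
  expected = -log10((n:1) / (n + 1)),
  observed = sort(-log10(pvals))
)

p <- ggplot(qq, aes(x = expected, y = observed)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "expected -log10(p)", y = "observed -log10(p)")
p
```

Since the result is an ordinary ggplot built from a data frame, it can be themed, faceted, or coloured like any other plot.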

Why does R behave differently when parsing parameters of plotting?

I am attempting to plot multiple time series variables on a single line chart using ggplot. I am using a data.frame which contains n time series variables and a column of time periods. Essentially, I want to loop through the data.frame and add exactly n geom_line layers to a single chart.
Initially I tried using the following code, where:
df = data.frame containing n time series variables and 1 column of time periods
wid = n (number of time series variables)
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
for (i in 1:wid) {
p <- p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
}
ggplotly(p)
However, this only produces a plot of the final time series variable in the data.frame. I then investigated further and found that the following two sets of code produce completely different results:
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
i = 1
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
i = 2
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
i = 3
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
ggplotly(p)
Plot produced by code above
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
p = p + geom_line(aes(x=df$Time, y=df[,1], color=var.lab[1]))
p = p + geom_line(aes(x=df$Time, y=df[,2], color=var.lab[2]))
p = p + geom_line(aes(x=df$Time, y=df[,3], color=var.lab[3]))
ggplotly(p)
Plot produced by code above
In my mind, these two sets of code are identical, so could anyone explain why they produce such different results?
I know this could probably be done quite easily using autoplot, but I am more interested in the behavior of these two snippets of code.
What you're trying to do, adding one geom_line per variable in a loop, is a 'hack' and not ideal in ggplot terms. To do it successfully, I'd use aes_string. But it's still a hack.
df <- data.frame(Time = 1:20,
Var1 = rnorm(20),
Var2 = rnorm(20, mean = 0.5),
Var3 = rnorm(20, mean = 0.8))
vars <- paste0("Var", 1:3)
col_vec <- RColorBrewer::brewer.pal(3, "Accent")
library(ggplot2)
p <- ggplot(df, aes(Time))
for (i in 1:length(vars)) {
p <- p + geom_line(aes_string(y = vars[i]), color = col_vec[i], lwd = 1)
}
p + labs(y = "value")
How to do it properly
To make this plot properly, you need to pivot the data first, so that each aesthetic (aes) is mapped to a variable in your data frame. That means we need a single variable in the data frame to map to color. Hence, we pivot_longer and plot again:
library(tidyr)
df_melt <- pivot_longer(df, cols = Var1:Var3, names_to = "var")
ggplot(df_melt, aes(Time, value, color = var)) +
geom_line(lwd = 1) +
scale_color_manual(values = col_vec)
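As for why the original loop behaves differently from the unrolled version: aes() captures its arguments as unevaluated expressions, and they are only evaluated when the plot is built, by which time i holds its final value. If a loop is still wanted, one option is to force the current value into the expression with the !! operator that aes() supports. This is only a sketch on the same made-up data as above; it is also an alternative to aes_string(), which has been soft-deprecated since ggplot2 3.0.0.

```r
library(ggplot2)

# Hypothetical data, mirroring the example above
set.seed(1)
df <- data.frame(Time = 1:20,
                 Var1 = rnorm(20),
                 Var2 = rnorm(20, mean = 0.5),
                 Var3 = rnorm(20, mean = 0.8))
vars <- paste0("Var", 1:3)

p <- ggplot(df, aes(Time))
for (i in seq_along(vars)) {
  # !! inserts the current value of vars[i] into the captured expression,
  # so the layer no longer depends on i when the plot is built
  p <- p + geom_line(aes(y = .data[[!!vars[i]]], colour = !!vars[i]))
}
p + labs(y = "value", colour = "var")
```

Each layer now carries its own column name, so all three lines are drawn. The pivoted version above remains the more idiomatic solution.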

Cannot overlay multiple stat_function with ggplot2

I have a table with a binning variable VAR2_BY_NS_BIN and an x-y data pair (MP_BIN, CORRECT_PROP). I want to plot the data points by bin, and also draw a different line for each bin using stat_function, with a different reference slope each time, inside a for loop.
test_tab <- data.table(VAR2_BY_NS_BIN=c(0.0005478, 0.0005478, 0.002266, 0.002266, 0.006783, 0.006783, 0.020709, 0.020709, 0.142961, 0.142961),
MP_BIN=rep(c(0.505, 0.995), 5),
CORRECT_PROP=c(0.5082, 0.7496, 0.5024, 0.8627, 0.4878, 0.9368, 0.4979, 0.9826, 0.4811, 0.9989))
VAR2_BIN <- sort(unique(test_tab$VAR2_BY_NS_BIN)) #get unique bin values
LEN_VAR2_BIN <- length(VAR2_BIN) #get number of bins
col_base <- c("#FF0000", "#BB0033", "#880088", "#3300BB", "#0000FF") #mark bins with different colours
p <- ggplot(data = test_tab)
for (i in 1:LEN_VAR2_BIN) {
p <- p + geom_point(data = test_tab[test_tab$VAR2_BY_NS_BIN==VAR2_BIN[i],],
aes(x = MP_BIN, y = CORRECT_PROP),
col = col_base[i],
alpha = 0.5) +
stat_function(fun = function(t) {VAR2_BIN[i]*(t-0.5)+0.5}, col = col_base[i])
}
p <- p + xlab("MP") + ylab("Observed proportion")
print(p)
The above code (a reproducible example), however, always returns a plot with only the last stat_function line drawn (which is the 5th line in the above case).
The following code (without using the for loop) works, but I in fact have a large number of bins so it is not very feasible...
p <- p + stat_function(fun = function(t) {VAR2_BIN[1]*(t-0.5)+0.5}, col = col_base[1])
p <- p + stat_function(fun = function(t) {VAR2_BIN[2]*(t-0.5)+0.5}, col = col_base[2])
p <- p + stat_function(fun = function(t) {VAR2_BIN[3]*(t-0.5)+0.5}, col = col_base[3])
p <- p + stat_function(fun = function(t) {VAR2_BIN[4]*(t-0.5)+0.5}, col = col_base[4])
p <- p + stat_function(fun = function(t) {VAR2_BIN[5]*(t-0.5)+0.5}, col = col_base[5])
Thanks in advance!
You don't need a for loop or stat_function. To plot the points, just map MP_BIN and CORRECT_PROP to x and y and the points can be plotted with a single call to geom_point. For the lines, you can create the necessary values on the fly (as done in the code below) and plot those with geom_line.
library(tidyverse)
ggplot(test_tab %>% mutate(model=VAR2_BY_NS_BIN*(MP_BIN - 0.5) + 0.5),
aes(x=MP_BIN, colour=factor(VAR2_BY_NS_BIN))) +
geom_point(aes(y=CORRECT_PROP)) +
geom_line(aes(y=model)) +
labs(colour="VAR2_BY_NS_BIN") +
guides(colour=guide_legend(reverse=TRUE))
As for the problem you were having with the for loop: the anonymous function passed to stat_function is not called until the plot is printed, and only then does it look up the loop variable i. By that time the loop has finished and i is 5, so every layer draws the same (fifth) line and that is the only one you see. You can find several questions related to this issue on Stack Overflow.
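If the loop with stat_function is kept, one way around the late lookup of i is to hand the slope to the function through the args parameter of stat_function, which is evaluated when the layer is created rather than when the plot is drawn. The sketch below assumes a plain data.frame instead of a data.table, and the helper name ref_line is made up for illustration.

```r
library(ggplot2)

# Stand-in for test_tab as a plain data.frame (values copied from the question)
test_tab <- data.frame(
  VAR2_BY_NS_BIN = rep(c(0.0005478, 0.002266, 0.006783, 0.020709, 0.142961),
                       each = 2),
  MP_BIN = rep(c(0.505, 0.995), 5),
  CORRECT_PROP = c(0.5082, 0.7496, 0.5024, 0.8627, 0.4878,
                   0.9368, 0.4979, 0.9826, 0.4811, 0.9989)
)
VAR2_BIN <- sort(unique(test_tab$VAR2_BY_NS_BIN))
col_base <- c("#FF0000", "#BB0033", "#880088", "#3300BB", "#0000FF")

# Hypothetical helper: the reference line, with the slope as an explicit argument
ref_line <- function(t, slope) slope * (t - 0.5) + 0.5

p <- ggplot(test_tab, aes(x = MP_BIN, y = CORRECT_PROP)) +
  geom_point(alpha = 0.5)
for (i in seq_along(VAR2_BIN)) {
  # list(slope = VAR2_BIN[i]) is evaluated now, so each layer keeps its own slope
  p <- p + stat_function(fun = ref_line,
                         args = list(slope = VAR2_BIN[i]),
                         colour = col_base[i])
}
p + xlab("MP") + ylab("Observed proportion")
```

Each layer stores its own copy of the slope, so the result no longer depends on the value of i at print time.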

How to plot two distribution curves in a faceted way in R / ggplot2?

I have two probability distribution curves, a Gamma and a standardized Normal, that I need to compare:
library(ggplot2)
pgammaX <- function(x) pgamma(x, shape = 64.57849, scale = 0.08854802)
f <- ggplot(data.frame(x=c(-4, 9)), aes(x)) + stat_function(fun=pgammaX)
f + stat_function(fun = pnorm)
The output is like this
However I need to have the two curves separated by means of the faceting mechanism provided by ggplot2, sharing the Y axis, in a way like shown below:
I know how to do the faceting if the plotted curves come from data (i.e., from a data.frame), but I don't understand how to do it in a case like this, where the curves are generated on the fly by functions. Do you have any idea how to do this?
You can generate the data ahead of time, similar to what stat_function does, something like:
x <- seq(-4,9,0.1)
dat <- data.frame(p = c(pnorm(x), pgammaX(x)), g = rep(c(0,1), each = 131), x = rep(x, 2) )
ggplot(dat)+geom_line(aes(x,p, group = g)) + facet_grid(~g)
The issue with facet_wrap is that a given stat_function is applied to every panel, and here there is no faceting variable in the data to tell the panels apart.
I would instead plot them separately and use grid.arrange to combine them.
f1 <- ggplot(data.frame(x=c(-4, 9)), aes(x)) + stat_function(fun = pgammaX) + ggtitle("Gamma") + theme(plot.title = element_text(hjust = 0.5))
f2 <- ggplot(data.frame(x=c(-4, 9)), aes(x)) + stat_function(fun = pnorm) + ggtitle("Norm") + theme(plot.title = element_text(hjust = 0.5))
library(gridExtra)
grid.arrange(f1, f2, ncol=2)
Otherwise create the data frame with y values from both pgammaX and pnorm and categorize them under a faceting variable.
Finally I got the answer. First, I need to have two data sets and attach each function to each data set, as follows:
library(ggplot2)
pgammaX <- function(x) pgamma(x, shape = 64.57849, scale = 0.08854802)
a <- data.frame(x=c(3,9), category="Gamma")
b <- data.frame(x=c(-4,4), category="Normal")
f <- ggplot(a, aes(x)) + stat_function(fun=pgammaX) + stat_function(data = b, mapping = aes(x), fun = pnorm)
Then, using facet_wrap(), I split the plot into two panels according to the category assigned to each data set, and set a free_x scale.
f + facet_wrap("category", scales = "free_x")
The result is shown below:

Predict x values from a simple fit and annotate them in the plot

I have a very simple question, but so far I couldn't find an easy solution for it. Let's say I have some data that I want to fit, and I want to show the x-axis value at which y takes a particular value; in this case, what x is when y = 0. The model for the fit is very simple, y ~ x, but I don't know how to estimate the x value from it. Anyway,
sample data
library(ggplot2)
library(scales)
df = data.frame(x= sort(10^runif(8,-6,1),decreasing=TRUE), y = seq(-4,4,length.out = 8))
ggplot(df, aes(x = x, y = y)) +
geom_point() +
#geom_smooth(method = "lm", formula = y ~ x, size = 1,linetype="dashed", col="black",se=FALSE, fullrange = TRUE)+
geom_smooth(se=FALSE)+
labs(title = "Made-up data") +
scale_x_log10(breaks = c(1e-6,1e-4,1e-2,1),
labels = trans_format("log10", math_format(10^.x)),limits = c(1e-6,1))+
geom_hline(yintercept=0,linetype="dashed",colour="red",size=0.6)
I would like to convert a value like 1e-10 to the 10^-10 format and annotate it on the plot, as I indicated in the plot.
thanks in advance!
Because geom_smooth() uses ordinary R functions to calculate the smooth line, you can obtain the predicted values outside of ggplot(). One option is then to use approx() to get a linear approximation of the x value at which the predicted y value is 0.
# Fit a loess smoother (the variable is named fit to avoid masking stats::formula)
fit <- loess(y ~ x, df)
# Interpolate the x value at which the fitted y would be 0
xval <- approx(x = fit$fitted, y = as.numeric(fit$x), xout = 0)$y
# Add to the plot; ggplot(...) stands for the plot code from the question
ggplot(...) + annotate("text", x = xval, y = 0, label = signif(xval, 2))
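For the simple y ~ x model mentioned in the question, the crossing point can also be computed directly from the coefficients. The sketch below swaps the loess fit for a straight-line fit on the log10 scale, which is an assumption about what "simple" means here; it then builds the label in 10^k form with plotmath. The names x0 and lab are made up for illustration.

```r
library(ggplot2)

# Made-up data in the same shape as the question's example
set.seed(1)
df <- data.frame(x = sort(10^runif(8, -6, 1), decreasing = TRUE),
                 y = seq(-4, 4, length.out = 8))

# Straight-line fit on the log10 scale: y = a + b * log10(x)
fit <- lm(y ~ log10(x), data = df)
a <- coef(fit)[[1]]
b <- coef(fit)[[2]]

# Solve a + b * log10(x) = 0 for x
x0 <- 10^(-a / b)

# Plotmath label such as 10^-2.3, rendered as a power of ten when parsed
lab <- sprintf("10^%.1f", log10(x0))

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  # On a log10 x scale, formula = y ~ x is fitted to log10(x)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10() +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red") +
  annotate("text", x = x0, y = 0.5, label = lab, parse = TRUE)
p
```

With parse = TRUE the label string is interpreted as a plotmath expression, so the exponent is drawn as a superscript.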
