This is a follow-up question to "Combine ggflags with linear regression in ggplot2".
I have a plot like below with a log-linear model for x and y for certain countries that I have made in R with ggplot2 and ggflags:
The problem is when I want to print out the regression equation, the R2 and the p-value with the help of stat_regline_equation and stat_cor, I get values for a linear model and not the log-linear model I want to use.
How can I solve this?
library(ggplot2)
library(ggflags)
library(ggpubr)
library(SciViews)
set.seed(123)
Data <- data.frame(
country = c("at", "be", "dk", "fr", "it"),
x = runif(5),
y = runif(5)
)
ggplot(Data, aes(x = x, y = y, country = country, size = 11)) +
geom_flag() +
scale_country() +
scale_size(range = c(10, 10)) +
geom_smooth(aes(group = 1), method = "lm", formula = y ~ log(x), se = FALSE, size = 1) +
stat_regline_equation(label.y = 0.695,
aes(group = 1, label = ..eq.label..), size = 5.5) +
stat_cor(aes(group = 1,
label =paste(..rr.label.., ..p.label.., sep = "~`,`~")),
label.y = 0.685, size = 5.5, digits= 1)
edit: I have also tried to use ln(x) instead of log(x) but I do not get any results when printing out the coefficient from that either.
There are four things you need to do:
Provide your regression formula to the formula argument of stat_regline_equation
Use sub to change "x" to "log(x)" in eq.label
Change the x aesthetic of stat_cor to log(x)
Fix the x limits inside coord_cartesian to compensate
ggplot(Data, aes(x = x, y = y, country = country, size = 11)) +
geom_flag() +
scale_country() +
scale_size(range = c(10, 10)) +
geom_smooth(aes(group = 1), method = "lm", formula = y ~ log(x),
se = FALSE, size = 1) +
stat_regline_equation(label.y = 0.695, label.x = 0.25,
aes(group = 1, label = sub("x", "log(x)", ..eq.label..)),
size = 5.5,
formula = y ~ log(x),
check_overlap = TRUE, output.type = "latex") +
stat_cor(aes(group = 1, x = log(x),
label =paste(..rr.label.., ..p.label.., sep = "~`,`~")),
label.x = 0.25,
label.y = 0.65, size = 5.5, digits= 1, check_overlap = TRUE) +
coord_cartesian(xlim = c(0.2, 1))
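To double-check the labels printed on the plot, the same log-linear model can be fitted directly with lm(). This is a minimal sketch in base R, assuming the Data frame from the question; the names fit, r2, p_value and eq are illustrative and not part of ggpubr.

```r
# Recreate the example data from the question
set.seed(123)
Data <- data.frame(
  country = c("at", "be", "dk", "fr", "it"),
  x = runif(5),
  y = runif(5)
)

# Fit the same log-linear model that geom_smooth() draws
fit <- lm(y ~ log(x), data = Data)
fit_summary <- summary(fit)

intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["log(x)"]]
r2        <- fit_summary$r.squared
p_value   <- coef(fit_summary)["log(x)", "Pr(>|t|)"]

# Assemble an equation string comparable to the plot label
eq <- sprintf("y = %.2g %+.2g log(x)", intercept, slope)
print(eq)
print(c(R2 = r2, p = p_value))
```

For a model with a single predictor, r2 equals the squared correlation between log(x) and y, which is why stat_cor needs x = log(x) to report values for the log-linear fit.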
I am creating very basic plots to visualize trends in a dataset. I'm using the gam smoother with geom_smooth, but this is over-fitting the data at several sites. For example, in the very first facet, the smoother over-fits the red data points.
Is there a way to adjust my code to set a maximum number of inflection points, so the smoother focuses on the overarching trends in the dataset (e.g., max. 2 or 3 inflection points)?
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs"),
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
EDIT (2022-12-19): following some of the comments raised by @GavinSimpson, I've tried to make some visuals of how changing k changes the shape of the geom_smooth curve.
For these plots, I've tried to create an example dataset that has a rough quadratic section followed by a roughly linear section. Ideally, I want to find a k value that would capture the inflection point at the 'peak' of the quadratic section and the transition from quadratic to linear (x ~= 5 & 10).
library(ggpubr)
library(ggplot2)
df <- data.frame(
x1 = c(1:19),
y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1, #quadratic
0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10)) #linear
pk <- ggplot(data = df,
aes(y = y1, x = x1)) +
geom_point(col = "red") +
theme_bw()
# vary k from 2 to not specified
ggarrange(nrow = 3, ncol = 2,
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 2)) + ggtitle("k=2"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 3)) + ggtitle("k=3"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 4)) + ggtitle("k=4"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 5)) + ggtitle("k=5"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 6)) + ggtitle("k=6"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) + ggtitle("k unconstrained"))
The first plot throws a warning:
Warning message:
In smooth.construct.cr.smooth.spec(object, data, knots) :
basis dimension, k, increased to minimum possible
This indicates that R is increasing k from 2 to 3, the minimum possible.
Only when k = 5 do we start to see the smoother begin to capture the two main inflection points in the data (x ~= 5 & 10). But what does k = 5 mean? It also seems that it's added another slight inflection point around x ~= 16.
At k = 6, the SE is reduced and the line comes even closer to each data point, and it looks nearly identical to the figure when k is not specified at all, which might be ok in this situation, but for my dataset, when k isn't specified there is a lot of over-fitting when I really just want to see the most important inflection points.
So still no clear answer on how to specify inflection points.
Thanks to the suggestion from @tpetzoldt, I found that setting k in the gam formula limits how wiggly the smoother can be, which keeps it to the overarching trend.
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs", k = 2), # k = basis dimension of the smooth (caps its wiggliness); k = 2 is raised to the minimum of 3
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
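As a hedged aside on what k controls: in mgcv, k is the basis dimension of the smooth, an upper bound on its flexibility rather than a literal count of inflection points. The sketch below assumes the example data from the edit above (stored here under the illustrative name df_k) and fits with mgcv::gam directly, which is what geom_smooth(method = "gam") uses. It compares the total effective degrees of freedom for several values of k.

```r
library(mgcv)  # the package geom_smooth(method = "gam") relies on

# Example data from the edit above: a quadratic section, then a linear one
df_k <- data.frame(
  x1 = 1:19,
  y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1,
         0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10)
)

ks <- 3:6  # 3 is the smallest basis dimension accepted for this basis

# Total effective degrees of freedom (intercept included) for each k
total_edf <- sapply(ks, function(k) {
  fit <- gam(y1 ~ s(x1, bs = "cs", k = k), data = df_k)
  sum(fit$edf)
})

print(data.frame(k = ks, total_edf = round(total_edf, 2)))
```

The total never exceeds k, so a small k forces a stiff curve, while a larger k only permits more wiggliness; the penalty still decides how much of it is used. The fitting options here are mgcv's defaults and may differ from those geom_smooth passes, so the numbers are indicative only.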
I want to create a graph where I can change the line size for each line c(1,2,3) and the alpha values for each line c(0.5,0.6,0.7). I tried to use scale_size_manual but it didn't make any difference. Any ideas on how to proceed?
var <- rep(c("T", "M", "A"), each = 4) # 12 labels, to match the 12 values below
val <- rnorm(12,4,5)
x <- c(1:12)
df <- data.frame(var,val,x)
ggplot(aes(x= x , y = val, color = var, group = var), data = df) +
scale_color_manual(values = c("grey","blue","black")) +
geom_smooth(aes(x = x, y = val), formula = "y ~ x", method = "loess",se = FALSE, size = 1) +
scale_x_continuous(breaks=seq(1, 12, 1), limits=c(1, 12)) +
scale_size_manual(values = c(1,2,3))
To set the size and alpha values for your lines you have to map var to the size and alpha aesthetics inside aes(). Otherwise there is nothing for scale_size_manual to act on and it will have no effect:
library(ggplot2)
ggplot(aes(x = x, y = val, color = var, group = var), data = df) +
scale_color_manual(values = c("grey", "blue", "black")) +
geom_smooth(aes(x = x, y = val, size = var, alpha = var), formula = "y ~ x", method = "loess", se = FALSE) +
scale_x_continuous(breaks = seq(1, 12, 1), limits = c(1, 12)) +
scale_size_manual(values = c(1, 2, 3)) +
scale_alpha_manual(values = c(.5, .6, .7))
With the following ggplot2 code in R:
require(ggplot2)
df <- data.frame(x = rep(c(0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 6.5, 7.0),4), y = c(0.15,0.17,0.07,0.17,0.01,0.15,0.18,0.04,-0.06,-0.08,0,0.03,-0.27,-0.93,0.04,0.12,0.08,0.15,0.04,0.15,0.03,0.09,0.11,0.13,-0.11,-0.32,-0.7,-0.78,0.07,0.04,0.06,0.12,-0.15,0.05,-0.08,0.14,-0.02,-0.14,-0.24,-0.32,-0.78,-0.81,-0.04,-0.25,-0.09,0.02,-0.13,-0.2,-0.04,0,0.02,-0.05,-0.19,-0.37,-0.57,-0.81))
ggplot(df, aes(x = x, y = y)) + geom_point() +
stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE, aes(color = "black")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2), size = 1, se = FALSE, aes(color = "green")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 3), size = 1, se = FALSE, aes(color = "orange")) +
stat_smooth(method = "gam", formula = y ~ s(x), size = 1, se = FALSE, aes(color = "blue")) +
theme(legend.justification=c(1,1),legend.position=c(0.45,0.45),legend.title=element_blank()) +
scale_color_manual(values=c("black","green","orange","blue"), labels=c("linear","quadratic","cubic","smooth"))
the legend is fine; but for some reason, three of the four curves are not colored as intended: the orange curve should be green, the blue curve should be orange, and the green curve should be blue. What am I missing?
The strings you use in the aesthetic call are arbitrary (even though you have called them after colours). They will be converted into a factor column internally in ggplot, and the levels of the factor are determined alphabetically. The factor levels are mapped across to the vector of values in the scale_color_manual call in the order they are put.
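The shuffle can be reproduced without drawing anything. This is a small base R sketch of the matching described above; keys, values, lvls and drawn are illustrative names, and factor() stands in for the ordering the scale applies internally.

```r
# Strings used in aes(color = ...) in the question, in layer order
keys <- c("black", "green", "orange", "blue")

# Colours passed to scale_color_manual(values = ...), unnamed
values <- c("black", "green", "orange", "blue")

# The levels are ordered alphabetically, not by layer order
lvls <- levels(factor(keys))
print(lvls)

# Unnamed values are paired with the levels by position
drawn <- setNames(values, lvls)
print(drawn)
```

Here the layer tagged "green" is paired with orange, the one tagged "orange" with blue, and the one tagged "blue" with green, which is the mix-up reported in the question.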
So you can just use "a" through "d" as arbitrary strings for the color aesthetic to keep track of them and control their ordering:
ggplot(df, aes(x = x, y = y)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ x,
size = 1, se = FALSE, aes(color = "a")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2),
size = 1, se = FALSE, aes(color = "b")) +
stat_smooth(method = "lm", formula = y ~ poly(x, 3),
size = 1, se = FALSE, aes(color = "c")) +
stat_smooth(method = "gam", formula = y ~ s(x),
size = 1, se = FALSE, aes(color = "d")) +
theme(legend.justification = c(1, 1),
legend.position = c(0.45, 0.45),
legend.title = element_blank()) +
scale_color_manual(values = c("black", "green", "orange", "blue"),
labels = c("linear", "quadratic", "cubic", "smooth"))
From the online help (my emphasis):
values
a set of aesthetic values to map data values to. The values will be matched in order (usually alphabetical) with the limits of the scale, or with breaks if provided. If this is a named vector, then the values will be matched based on the names instead. Data values that don't match will be given na.value.
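So try a named vector whose names are the strings you pass to aes(color = ...). For example, with aes(color = "linear"), aes(color = "quadratic") and so on:
values=c("linear"="black","quadratic"="green","cubic"="orange","smooth"="blue")
or something similar. I can't check my code as you haven't provided your input data.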
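A minimal sketch of the named-vector approach, assuming made-up toy data and only two of the four fits to keep it short. The names of the values vector have to match the strings used in aes(color = ...), so the legend labels are used as those strings here.

```r
library(ggplot2)

# Toy data, made up for illustration only
toy <- data.frame(
  x = 1:10,
  y = c(1.2, 3.8, 9.5, 15.1, 26.0, 35.2, 50.3, 63.0, 82.1, 99.0)
)

# Use the legend labels themselves as the strings in aes(color = ...)
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE,
              aes(color = "linear")) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE,
              aes(color = "quadratic")) +
  # Names match the aes() strings, so the vector order no longer matters
  scale_color_manual(
    values = c("quadratic" = "green", "linear" = "black"),
    breaks = c("linear", "quadratic")  # legend order
  )

# Inspect the colour each smoothing layer actually received
linear_colour    <- unique(layer_data(p, 2)$colour)
quadratic_colour <- unique(layer_data(p, 3)$colour)
print(c(linear = linear_colour, quadratic = quadratic_colour))
```

With named values the separate labels argument is no longer needed, because the aes() strings double as legend labels.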
While completing a project on the central limit theorem for the exponential distribution, I ran into an annoying error message when plotting simulated vs theoretical distributions. When I run the code below, I get an error: 'mapping' is not used by stat_function().
By mapping I assume the error is referring to the aes parameter, which I later map to the color red using scale_color_manual in order to show it in a legend.
My question is two-fold: why is this error happening, and is there a more efficient way to create a legend without using scale_color_manual?
Thank you!
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
It's not an error, it's a warning:
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
stat_function(fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
#> Warning: `mapping` is not used by stat_function()
Created on 2020-05-01 by the reprex package (v0.3.0)
You can suppress the warning by calling geom_line(stat = "function") rather than stat_function():
library(ggplot2)
lambda <- 0.2
n_sims <- 1000
set.seed(100100)
total_exp <- rexp(40 * n_sims, rate = lambda)
exp_data <- data.frame(
Mean = apply(matrix(total_exp, n_sims), 1, mean),
Vars = apply(matrix(total_exp, n_sims), 1, var)
)
g <- ggplot(data = exp_data, aes(x = Mean))
g +
geom_histogram(binwidth = .3, color = 'black', aes(y=..density..), fill = 'steelblue') +
geom_density(size=.5, aes(color = 'Simulation'))+
geom_line(stat = "function", fun = dnorm, mapping = aes(color='Theoretical'), args = list(mean = 1/lambda, sd = 1/lambda/sqrt(40)), size=.5, inherit.aes = F, show.legend = T)+
geom_text(x = 5.6, y = 0.1, label = "Theoretical and Sample Mean", size = 2, color = 'red') +
scale_color_manual("Legend", values = c('Theoretical' = 'red', 'Simulation' = 'blue')) +
geom_vline(aes(xintercept = 1/lambda), lwd = 1.5, color = 'grey') +
labs(x = 'Exponential Distribution Simulations Average Values') +
ggtitle('Sample Mean vs Theoretical Mean of the Averages of the Exponential Distribution')+
theme_classic(base_size = 10)
In my opinion, the warning is erroneous, and an issue has been filed about this problem: https://github.com/tidyverse/ggplot2/issues/3611
However, it's not that easy to solve, and therefore as of now the warning is there.
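As a side check on the parameters passed to dnorm in the plots above: by the central limit theorem, means of 40 exponential draws with rate lambda are approximately normal with mean 1/lambda and standard deviation (1/lambda)/sqrt(40). The base R sketch below compares those values with the simulation, using the question's seed and settings; sims, sim_means and n are illustrative names.

```r
lambda <- 0.2
n      <- 40      # observations averaged in each simulation
n_sims <- 1000

set.seed(100100)
sims <- matrix(rexp(n * n_sims, rate = lambda), nrow = n_sims)
sim_means <- rowMeans(sims)  # one mean per simulated sample of size n

# Central limit theorem: mean 1/lambda, sd (1/lambda)/sqrt(n)
theoretical_mean <- 1 / lambda
theoretical_sd   <- (1 / lambda) / sqrt(n)

print(c(theoretical = theoretical_mean, simulated = mean(sim_means)))
print(c(theoretical = theoretical_sd,   simulated = sd(sim_means)))
```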
I'm unable to recreate your issue -- when I run your code a plot is generated (below), which suggests the issue is likely to do with your environment. A general 'solution' is to clear your workspace using the menu dropdown or similar: Session -> Clear workspace..., then re-run your code.
For refactoring the color issue, you can simplify scale_color_manual to
scale_color_manual("Legend", values = c('blue','red')), but the named version you have now is a bit clearer in my view. Anything beyond that has more to do with changing the data structure and mapping.
Apologies, I don't have the rep to make a comment.
When I tried to plot a decision boundary in R, I ran into a problem: it returned the error "Continuous value supplied to discrete scale". I think the problem is in scale_colour_manual, but I don't know how to fix it. The code is below.
library(caTools)
set.seed(123)
split = sample.split(df$Purchased,SplitRatio = 0.75)
training_set = subset(df,split==TRUE)
test_set = subset(df,split==FALSE)
# Feature Scaling
training_set[,1:2] = scale(training_set[,1:2])
test_set[,1:2] = scale(test_set[,1:2])
# Fitting logistic regression to the training set
lr = glm(formula = Purchased ~ .,
family = binomial,
data = training_set)
#Predicting the test set results
prob_pred = predict(lr,type = 'response',newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
#Making the Confusion Matrix
cm = table(test_set[,3],y_pred)
cm
#Visualizing the training set results
library(ggplot2)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(lr, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1,0)
ggplot(grid_set) +
geom_tile(aes(x = Age, y = EstimatedSalary, fill = factor(y_grid)),
show.legend = F) +
geom_point(data = set, aes(x = Age, y = EstimatedSalary, color = Purchased),
show.legend = F) +
scale_fill_manual(values = c("orange", "springgreen3")) +
scale_colour_manual(values = c("red3", "green4")) +
scale_x_continuous(breaks = seq(floor(min(X1)), ceiling(max(X1)), by = 1)) +
labs(title = "Logistic Regression (Training set)",
x = "Age", y = "Estimated Salary")
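Is your Purchased variable a factor? If not, it has to be, because scale_colour_manual is a discrete scale. Purchased lives in set (the data for geom_point), not in grid_set, so convert it there. Try this:
library(dplyr) # provides %>% and mutate()
set <- set %>%
mutate(Purchased=factor(Purchased))
ggplot(grid_set) +
geom_tile(aes(x = Age, y = EstimatedSalary, fill = factor(y_grid)),
show.legend = F) + ... # add the rest of your commands.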
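The difference between the numeric and the factor version can be seen on a tiny example. This is a sketch with made-up values standing in for the scaled training set; toy, base and the result names are illustrative only.

```r
library(ggplot2)

# Toy stand-in for the scaled training set (values made up for illustration)
toy <- data.frame(
  Age             = c(-1.2, -0.4, 0.3, 1.1),
  EstimatedSalary = c(0.5, -0.8, 1.4, -0.1),
  Purchased       = c(0, 1, 0, 1)  # numeric, as in the question
)

base <- ggplot(toy, aes(x = Age, y = EstimatedSalary)) +
  scale_colour_manual(values = c("red3", "green4"))

# Numeric Purchased is continuous; building this layer is expected to fail,
# so the error message is captured instead of stopping the script
numeric_result <- tryCatch(
  layer_data(base + geom_point(aes(colour = Purchased)), 1),
  error = function(e) conditionMessage(e)
)
print(numeric_result)

# factor(Purchased) is discrete, so the manual scale applies
p_factor <- base + geom_point(aes(colour = factor(Purchased)))
point_colours <- unique(layer_data(p_factor, 1)$colour)
print(point_colours)
```

Wrapping the variable in factor() inside aes() is an alternative to converting the column beforehand; either way the colour scale receives discrete values.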