GAM does not fit mean data well

I am working with experimental data in which I applied different stressors to sea stars over time (day of experiment). I measured their weight and their righting time (activity) at several time points and used a GAM to model both responses over time. When I plot the model together with the mean values for each time point, the weight model underestimates the measured mean weight. For righting time, the model shows a sinusoidal pattern that the data do not support. Could someone explain why this happens and whether the model can be made to fit my data better? Thank you so much for your help.
My data look like this for weight (column name, then column type):
Day.in.experiment  Treatment  Individual  Weight.sea.star..g.
numeric            factor     factor      numeric
And like this for the righting time:
Day.in.experiment  Treatment  Individual  Time..sec.
numeric            factor     factor      numeric
This is my code for the weight model:
model_weight <- bam (Weight.sea.star..g. ~ Treatment +
s (Day.in.experiment, k = 3) +
s (Day.in.experiment, by = Treatment, k = 3) +
s (Individual, bs = "re"),
family = gaussian, data = data_subset)
r1 <- start_value_rho (model_weight, plot = TRUE, lag = 2)
model_weight_autocorrelation <- bam (Weight.sea.star..g. ~ Treatment +
s (Day.in.experiment, k = 3) +
s (Day.in.experiment, by = Treatment, k = 3) +
s (Individual, bs = "re"),
family = gaussian, data = data_subset,
rho = r1, AR.start = data_subset$start.event)
And this is the code for the activity model:
model_righting <- bam (Time..sec. ~ Treatment +
s (Day.in.experiment, k = 7) +
s (Day.in.experiment, by = Treatment, k = 7) +
s (Individual, bs = "re") +
s (Individual, Day.in.experiment, bs = "re"),
family = gaussian, data = data_subset)
r1 <- start_value_rho (model_righting, plot = TRUE, lag = 2)
model_righting_autocorrelation <- bam (Time..sec. ~ Treatment +
s (Day.in.experiment, k = 7) +
s (Day.in.experiment, by = Treatment, k = 7) +
s (Individual, bs = "re")+
s (Individual, Day.in.experiment, bs = "re"),
family = gaussian, data = data_subset,
rho = r1, AR.start = data_subset$start.event)
In order to plot the model, I extracted the fitted data first and then plotted it with ggplot (). I provide the code for weight here, but I proceeded the very same way for righting time. To plot the mean values of the actual data I used another file called data_weight_mean.
data_model_weight_autocorrelation <- visreg (model_weight_autocorrelation, "Day.in.experiment", by = "Treatment",
breaks = c ("NoHW", "THW", "TEHW"),
gg = TRUE, overlay = TRUE, jitter = TRUE, lwd = 0.5,
rug = FALSE, partial = FALSE, plot = FALSE)
ggplot () +
scale_fill_manual (breaks = c ("NoHW", "THW", "TEHW"),
values = c ("cadetblue3", "aquamarine3","goldenrod3"),
labels = c ("No Heatwave", "Ambient Heatwave", "Extended Heatwave")) +
scale_color_manual (breaks = c ("NoHW", "THW", "TEHW"),
values = c ("cadetblue3", "aquamarine3","goldenrod3"),
labels = c ("No Heatwave", "Ambient Heatwave", "Extended Heatwave")) +
geom_ribbon (data = data_model_weight_autocorrelation$fit, aes (x = Day.in.experiment,
ymin = visregLwr, ymax = visregUpr,
fill = Treatment), alpha = 0.4, colour = "grey", size = 0.2) +
geom_line (data = data_model_weight_autocorrelation$fit, aes (x = Day.in.experiment, y = visregFit,
colour = Treatment),
size = 1.1) +
geom_point (data = data_weight_mean, aes (x = Day.in.experiment, y = Weight.sea.star..g., colour = Treatment),
size = 1) +
scale_y_continuous (limits = c (-2, 12),
breaks = c (0, 4, 8, 12),
labels = c (0, 4, 8, 12)) +
labs (x = "\nDay of Experiment", y = expression (atop ("Weight [g]")),
colour = "Treatment", fill = "Treatment") +
theme_bw () +
theme (panel.grid.major.x = element_blank (),
panel.grid.minor.x = element_blank (),
panel.grid.major.y = element_blank (),
panel.grid.minor.y = element_blank (),
legend.position = c (0.15, 0.85),
legend.direction = "vertical",
legend.title = element_text (colour = "black", size = 18),
legend.text = element_text (colour = "black", size = 16),
axis.line = element_line (),
axis.title = element_text (colour = "black", size = 24),
axis.text = element_text (colour = "black", size = 18))
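
One thing that may be worth checking, offered as a hedged sketch rather than a confirmed diagnosis: visreg holds predictors that are not being plotted at fixed values, and for factors the default is the most common level. With s(Individual, bs = "re") in the model, the plotted curve may therefore describe one reference individual rather than the population mean. The example below uses hypothetical simulated data and a simplified model (the names toy, fit, newd and fit_pop are made up for illustration) and predicts with the random effect excluded via the exclude argument of predict() in mgcv, so the fitted curve can be compared with the observed means.

```r
library(mgcv)

set.seed(1)

# Hypothetical toy data: 18 individuals (6 per treatment) weighed on 8 days
toy <- expand.grid(Day = seq(0, 35, by = 5), Individual = factor(1:18))
treatments <- c("NoHW", "THW", "TEHW")
toy$Treatment <- factor(treatments[(as.integer(toy$Individual) - 1) %% 3 + 1],
                        levels = treatments)
ind_effect <- rnorm(18, sd = 2)  # individual-specific offsets
toy$Weight <- 8 + ind_effect[as.integer(toy$Individual)] -
  0.05 * toy$Day + rnorm(nrow(toy), sd = 0.3)

# Simplified model: treatment-specific smooths plus a random intercept
fit <- gam(Weight ~ Treatment + s(Day, by = Treatment, k = 3) +
             s(Individual, bs = "re"),
           data = toy, method = "REML")

# Prediction grid; Individual must be present, but its effect is excluded below
newd <- expand.grid(Day = seq(0, 35, by = 5),
                    Treatment = factor(treatments, levels = treatments))
newd$Individual <- toy$Individual[1]
newd$fit_pop <- as.numeric(predict(fit, newdata = newd,
                                   exclude = "s(Individual)"))

# Observed means per day and treatment, side by side with the population-level fit
obs_means <- aggregate(Weight ~ Day + Treatment, data = toy, FUN = mean)
comparison <- merge(obs_means, newd[, c("Day", "Treatment", "fit_pop")])
```

If the population-level curve sits closer to the means than the visreg curve does, the gap in the original plot is probably due to conditioning on a single individual and not to a poor fit.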

Related

set max number of inflection points using geom_smooth

I am creating very basic plots to visualize trends in a dataset. I'm using the gam smoother with geom_smooth, but this is over-fitting the data at several sites. For example, in the very first facet, the smoother over-fits the red data points.
Is there a way to adjust my code to set a maximum number of inflection points, so that the smoother focuses on the overarching trends in the dataset (e.g., at most 2 or 3 inflection points)?
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs"),
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
EDIT (2022-12-19): following some of the comments raised by @GavinSimpson, I've tried to visualize how changing k changes the shape of the geom_smooth curve.
For these plots, I've tried to create an example dataset that has a rough quadratic section followed by a roughly linear section. Ideally, I want to find a k value that would capture the inflection point at the 'peak' of the quadratic section and the transition from quadratic to linear (x ~= 5 & 10).
library(ggpubr)
library(ggplot2)
df <- data.frame(
x1 = c(1:19),
y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1, #quadratic
0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10)) #linear
pk <- ggplot(data = df,
aes(y = y1, x = x1)) +
geom_point(col = "red") +
theme_bw()
# vary k from 2 to not specified
ggarrange(nrow = 3, ncol = 2,
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 2)) + ggtitle("k=2"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 3)) + ggtitle("k=3"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 4)) + ggtitle("k=4"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 5)) + ggtitle("k=5"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 6)) + ggtitle("k=6"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) + ggtitle("k unconstrained"))
The first plot produces a warning (not an error):
Warning message:
In smooth.construct.cr.smooth.spec(object, data, knots) :
basis dimension, k, increased to minimum possible
Indicating that R is increasing the k from 2 to 3.
Only when k = 5 do we start to see the smoother begin to capture the two main inflection points in the data (x ~= 5 & 10). But what does k = 5 mean? It also seems that it's added another slight inflection point around x ~= 16.
At k = 6, the SE is reduced and the line comes even closer to each data point, and it looks nearly identical to the figure when k is not specified at all, which might be ok in this situation, but for my dataset, when k isn't specified there is a lot of over-fitting when I really just want to see the most important inflection points.
So still no clear answer on how to specify inflection points.

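Thanks to the suggestion from @tpetzoldt, I was able to limit the wiggliness of the smoother by setting k in the gam formula. Note that k is the basis dimension of the smooth, which puts an upper bound on its flexibility; it is not an exact count of inflection points.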
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs", k = 2), # k = basis dimension (caps wiggliness); k = 2 is raised to the minimum of 3
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
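
To make the role of k more concrete, here is a minimal sketch that fits the example data from the question directly with mgcv::gam (an assumption on my part; geom_smooth calls it behind the scenes, and method = "REML" is chosen here for illustration). It reports the effective degrees of freedom (edf) of the smooth for several values of k. The edf can never exceed k - 1, which is why k acts as a ceiling on flexibility and not as a count of inflection points.

```r
library(mgcv)

# Example data from the question: a quadratic section followed by a linear one
df <- data.frame(
  x1 = 1:19,
  y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1,
         0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10))

ks <- 3:6
smooth_edf <- sapply(ks, function(k) {
  fit <- gam(y1 ~ s(x1, bs = "cs", k = k), data = df, method = "REML")
  sum(fit$edf[-1])  # drop the intercept; what remains belongs to the smooth
})

# Compare the ceiling (k - 1) with the flexibility the fit actually used
print(data.frame(k = ks, max_edf = ks - 1, smooth_edf = round(smooth_edf, 2)))
```

If the edf is close to k - 1, the smooth is using all the flexibility it was given, and the penalty is not what is limiting the wiggliness.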

SHAP Summary Plot for XGBoost model in R without displaying Mean Absolute SHAP value on the plot

I don't want to display the mean absolute values on my SHAP summary plot in R. I want an output similar to the one produced in Python. What line of code will remove the mean absolute values from the summary plot in R?
I'm currently using this line of code:
shap.plot.summary.wrap1(xgb_model, X = x, top_n = 10)

You can do this by slightly modifying the source code of shap.plot.summary() as below (the function relies on data.table and ggplot2 being loaded and on ggforce being installed):
shap.plot.summary.edited <- function(data_long,
x_bound = NULL,
dilute = FALSE,
scientific = FALSE,
my_format = NULL){
if (scientific){label_format = "%.1e"} else {label_format = "%.3f"}
if (!is.null(my_format)) label_format <- my_format
# check number of observations
N_features <- setDT(data_long)[,uniqueN(variable)]
if (is.null(dilute)) dilute = FALSE
nrow_X <- nrow(data_long)/N_features # n per feature
if (dilute!=0){
# if nrow_X <= 10, no dilute happens
dilute <- ceiling(min(nrow_X/10, abs(as.numeric(dilute)))) # not allowed to dilute to fewer than 10 obs/feature
set.seed(1234)
data_long <- data_long[sample(nrow(data_long),
min(nrow(data_long)/dilute, nrow(data_long)/2))] # dilute
}
x_bound <- if (is.null(x_bound)) max(abs(data_long$value))*1.1 else as.numeric(abs(x_bound))
plot1 <- ggplot(data = data_long) +
coord_flip(ylim = c(-x_bound, x_bound)) +
geom_hline(yintercept = 0) + # the y-axis beneath
# sina plot:
ggforce::geom_sina(aes(x = variable, y = value, color = stdfvalue),
method = "counts", maxwidth = 0.7, alpha = 0.7) +
# print the mean absolute value:
#geom_text(data = unique(data_long[, c("variable", "mean_value")]),
# aes(x = variable, y=-Inf, label = sprintf(label_format, mean_value)),
# size = 3, alpha = 0.7,
# hjust = -0.2,
# fontface = "bold") + # bold
# # add a "SHAP" bar notation
# annotate("text", x = -Inf, y = -Inf, vjust = -0.2, hjust = 0, size = 3,
# label = expression(group("|", bar(SHAP), "|"))) +
scale_color_gradient(low="#FFCC33", high="#6600CC",
breaks=c(0,1), labels=c(" Low","High "),
guide = guide_colorbar(barwidth = 12, barheight = 0.3)) +
theme_bw() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(), # remove axis line
legend.position="bottom",
legend.title=element_text(size=10),
legend.text=element_text(size=8),
axis.title.x= element_text(size = 10)) +
# reverse the order of features, from high to low
# also relabel the feature using `label.feature`
scale_x_discrete(limits = rev(levels(data_long$variable))#,
#labels = label.feature(rev(levels(data_long$variable)))
)+
labs(y = "SHAP value (impact on model output)", x = "", color = "Feature value ")
return(plot1)
}

I am trying to create an exponent instead of R^2

I am using ggplot2 to create a scatter plot of 2 variables. I want to have these printed out on the caption portion of ggplot:
linear regression equation
r2 value
p-value
I am using brackets, new lines and stored values to concatenate everything together. I have attempted using expression(), parse() and bquote() functions but it only prints out the variable name and not the stored values.
This is the graph I have now. Everything looks great other than the R^2 part. Brackets seem to cause a lot of problems, but I want to keep them (they look better in my opinion). This is my ggplot script. I am only concerned about the caption section at the end.

Difficult to work with the code you have provided as an example (see comment re: reproducible example), but I had my students complete a similar exercise for their homework recently, and can provide an example which you can likely generalize from. My approach is to use the TeX() function from the latex2exp package.
A psychologist is interested in whether she can predict GPA in graduate school from students' earlier scores on the Graduate Record Exam (GRE).
Set Up the Toy Data and Regression Model
GPA <- c(3.70,3.18,2.90,2.93,3.02,2.65,3.70,3.77,3.41,2.38,
3.54,3.12,3.21,3.35,2.60,3.25,3.48,2.74,2.90,3.28)
GRE <- c(637,562,520,624,500,500,700,680,655,525,
593,656,592,689,550,536,629,541,588,619)
gpa.gre <- data.frame(GPA, GRE)
mod <- lm(GPA ~ GRE, data = gpa.gre)
mod.sum <- summary(mod)
print(cofs <- round(mod$coefficients, digits = 4))
aY <- cofs[[1]]
bY <- cofs[[2]]
print(Rsqr <- round(cor(GPA,GRE)^2, digits = 2))
Generate the Plot
require(ggplot2)
require(latex2exp)
p <- ggplot(data = gpa.gre, aes(x = GRE, y = GPA)) +
geom_smooth(formula = 'y ~ x', color ="grey40", method = "lm",
linetype = 1, lwd = 0.80, se = TRUE, alpha = 0.20) +
geom_point(color = "grey10", size = 1) +
labs(y = "Grade Point Average", x = "GRE Score") +
coord_cartesian(ylim = c(2.28, 3.82), xlim = c(498, 702), clip = "off") +
scale_y_continuous(breaks = seq(2.30, 3.80, 0.25)) +
scale_x_continuous(breaks = seq(500, 700, 50)) +
theme_classic() +
theme(axis.title.x = element_text(margin = unit(c(3.5,0,0,0), "mm"), size = 11.5),
axis.title.y = element_text(margin = unit(c(0,3.5,0,0), "mm"), size = 11.5),
axis.text = element_text(size = 10),
plot.margin = unit(c(0.25,4,1,0.25), "cm"))
# Use TeX function to use LaTeX
str_note <- TeX("\\textit{Note. ***p} < .001")
str_eq <- TeX("$\\hat{\\textit{y}} = 0.4682 + 0.0045 \\textit{x}$")
str_rsq <- TeX("$\\textit{R}^2 = .54***$")
# Create annotations
p + annotate("text", x = 728, y = 3.70, label = str_eq, size = 3.5,
hjust = 0, na.rm = TRUE) +
annotate("text", x = 728, y = 3.57, label = str_rsq, size = 3.5,
hjust = 0, na.rm = TRUE) +
annotate("text", x = 490, y = 1.80, label = str_note, size = 3.5,
hjust = 0, na.rm = TRUE)
Get Result
ggsave(filename = '~/Documents/gregpa.png', # your favourite file path here
width = 5, # width of plot
height = 4, # height of plot
units = "in", # width and height are given in inches
dpi = 400) # resolution in dots per inch
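
Since the question was about printing stored values and not hard-coded numbers, here is a small alternative sketch using base R's bquote(), where .( ) substitutes the value of a variable into a plotmath expression. The data are the built-in mtcars set, used purely as a stand-in, and the names r2 and r2_label are made up for illustration.

```r
library(ggplot2)

# Stand-in data: built-in mtcars, not the asker's data
mod <- lm(mpg ~ wt, data = mtcars)
r2 <- round(summary(mod)$r.squared, 2)

# .(r2) is replaced by the stored number when bquote() is evaluated
r2_label <- bquote(italic(R)^2 == .(r2))

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(caption = r2_label)
```

Printing p shows the caption with a proper superscript and the stored value, with no need to retype the number in the label.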

Continuous value supplied to discrete scale ggplot2

When I tried to plot a decision boundary in R, I ran into a problem: the plot returned the error "Continuous value supplied to discrete scale". I think the problem is in scale_colour_manual, but I don't know how to fix it. The code is below.
library(caTools)
set.seed(123)
split = sample.split(df$Purchased,SplitRatio = 0.75)
training_set = subset(df,split==TRUE)
test_set = subset(df,split==FALSE)
# Feature Scaling
training_set[,1:2] = scale(training_set[,1:2])
test_set[,1:2] = scale(test_set[,1:2])
# Fitting logistic regression to the training set
lr = glm(formula = Purchased ~ .,
family = binomial,
data = training_set)
#Predicting the test set results
prob_pred = predict(lr,type = 'response',newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
#Making the Confusion Matrix
cm = table(test_set[,3],y_pred)
cm
#Visualizing the training set results
library(ggplot2)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(lr, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1,0)
ggplot(grid_set) +
geom_tile(aes(x = Age, y = EstimatedSalary, fill = factor(y_grid)),
show.legend = F) +
geom_point(data = set, aes(x = Age, y = EstimatedSalary, color = Purchased),
show.legend = F) +
scale_fill_manual(values = c("orange", "springgreen3")) +
scale_colour_manual(values = c("red3", "green4")) +
scale_x_continuous(breaks = seq(floor(min(X1)), ceiling(max(X2)), by = 1)) +
labs(title = "Logistic Regression (Training set)",
y = "Estimated Salary", x = "Age")

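Is your Purchased variable a factor? If not, it has to be, because scale_colour_manual() is a discrete scale. Purchased is a column of set (the training set), not of grid_set, so convert it there before plotting:
library(dplyr)
set <- set %>%
mutate(Purchased = factor(Purchased))
ggplot(grid_set) +
geom_tile(aes(x = Age, y = EstimatedSalary, fill = factor(y_grid)),
show.legend = F) + ... # add the rest of your commands.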
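
As a self-contained illustration of the same idea, the sketch below uses a hypothetical four-row data frame (toy is made up for this example) with a numeric 0/1 outcome and wraps it in factor() inside aes(). Naming the colours by factor level also makes the mapping explicit.

```r
library(ggplot2)

# Hypothetical toy data with a numeric 0/1 outcome, as in the question
toy <- data.frame(Age = c(-1.2, -0.4, 0.3, 1.1),
                  EstimatedSalary = c(-0.8, 0.5, -0.2, 1.4),
                  Purchased = c(0, 0, 1, 1))

p <- ggplot(toy, aes(x = Age, y = EstimatedSalary)) +
  geom_point(aes(colour = factor(Purchased))) +  # factor() makes the mapping discrete
  scale_colour_manual(values = c("0" = "red3", "1" = "green4"))

# Building the layer data would raise the error if colour were still numeric
cols <- layer_data(p, 1)$colour
```

Dropping the factor() call reproduces the "Continuous value supplied to discrete scale" error as soon as the plot is built.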

How to blend two overlapping graphs with ggplot2

Using ggplot2, how can I blend two graphs? If I graph two sets of data, the second set covers up the first. Is there a way to blend both graphs? I already set the alpha value as low as I can; any lower and I can't see individual points.
demanalyze <- function(infocode, n = 1){
infoname <- filter(infolookup, column_name == infocode)$description
infocolumn <- as.vector(as.matrix(mydata[infocode]))
ggplot(mydata) +
aes(x = infocolumn) +
ggtitle(infoname) +
xlab(infoname) +
ylab("Fraction of votes each candidate received") +
geom_point(aes(y = sanders_vote_fraction, colour = "Bernie Sanders"), size=I(2)) +#, color = alpha("blue",0.02), size=I(1)) +
stat_smooth(aes(y = sanders_vote_fraction), method = "lm", formula = y ~ poly(x, n), size = 1, color = "darkblue", se = F) +
geom_point(aes(y = clinton_vote_fraction, colour = "Hillary Clinton"), size=I(2)) +#, color = alpha("red",0.02), size=I(1)) +
stat_smooth(aes(y = clinton_vote_fraction), method = "lm", formula = y ~ poly(x, n), size = 1, color = "darkred", se = F) +
scale_colour_manual("",
values = c("Bernie Sanders" = alpha("blue",0.005), "Hillary Clinton" = alpha("red",0.005))
) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
}
By blend, I mean that if there is a red point and a blue point in the same spot, it should show up as purple.

Looking at the plot, my guess is that the issue is a ton of red stacking on top of each other, blocking the blue below. I think you may need to randomize the layering on the graph, which will require generating a single data.frame. Alternatively, if Hillary+Bernie always equals 1, you may be able to just plot that. If they don't, and you don't want to lose too much information, you could plot just one metric of (Hillary)/(Bernie+Hillary).
Example:
geom_point(aes(y = clinton_vote_fraction / ( clinton_vote_fraction + sanders_vote_fraction)
, colour = "Clinton Share"), size=I(2))
And here is an example with the melting approach:
library(dplyr)
library(reshape2)
library(ggplot2)
df <-
data.frame(
metric = rnorm(1000)
, Clinton = rnorm(1000, 48, 10)
) %>%
mutate(Sanders = 100 - Clinton - rnorm(n())) # one noise draw per row; a length-4 vector cannot be recycled to 1000 rows
meltDF <-
melt(df, "metric"
, variable.name = "Candidate"
, value.name = "Vote Share")
ggplot(meltDF %>%
arrange(sample(1:nrow(.)))
, aes(x = metric
, y = `Vote Share`
, col = Candidate)) +
geom_point(size = 2, alpha = 0.2) +
geom_smooth(se = FALSE, alpha = 1, show.legend = FALSE) +
scale_colour_manual("",
values = c("Clinton" = "darkblue"
, "Sanders" = "red3")
) +
theme_minimal()
