I have run a beta regression in R and would like to assess the residual diagnostics. I used the plot function and obtained the diagnostic plots; however, the potential outliers are not labelled. How can I add the corresponding labels to the outliers?
breg.full <- betareg(Percentage ~ Total_testscore + Campus + Programme +
Gender + SE_track + Hours_Math_SE, data = starters, # [-c(53, 24, 35), ]
link = "logit") # , , link.phi = NULL, type = "ML"
summary(breg.full)
par(mfrow = c(2,3))
plot(breg.full, which = 1:6)
EDIT:
I want to have something like this (without the actual pink box, but with the ID number.)
The author provides a link to this code (http://www.de.ufpe.br/~cribari/betareg_example.zip); however, it is no longer working.
Explanation
I couldn't see your data anywhere here, but I will use the iris dataset to demonstrate how this can be achieved. I'll stick to only two examples because this takes some time to code, but once you see two examples I think it will become fairly quick to recognize what is going on. I will supply a reference at the end that will be helpful too.
Fitting Model Data
First we can fit a regression using the iris data, then turn the data into a tibble with model data using both fortify and as_tibble. I have added an index column for one of the plots later.
#### Load Library ####
library(tidyverse)
#### Fit Model ####
fit <- lm(Petal.Width ~ Petal.Length,
data = iris)
#### Turn Model into Data Frame ####
fit.data <- fortify(fit) %>%
as_tibble() %>%
mutate(.index = row_number()) # index column, independent of the number of rows
fit.data
Which gives you this:
# A tibble: 150 × 9
Petal…¹ Petal…² .hat .sigma .cooksd .fitted .resid .stdr…³ .index
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 1
2 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 2
3 0.2 1.3 0.0197 0.207 1.23e-4 0.177 0.0226 0.111 3
4 0.2 1.5 0.0176 0.207 7.86e-4 0.261 -0.0606 -0.296 4
5 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 5
6 0.4 1.7 0.0158 0.207 6.06e-4 0.344 0.0563 0.275 6
7 0.3 1.4 0.0186 0.207 1.49e-3 0.219 0.0810 0.396 7
8 0.2 1.5 0.0176 0.207 7.86e-4 0.261 -0.0606 -0.296 8
9 0.2 1.4 0.0186 0.207 8.18e-5 0.219 -0.0190 -0.0928 9
10 0.1 1.5 0.0176 0.207 5.53e-3 0.261 -0.161 -0.785 10
# … with 140 more rows, and abbreviated variable names ¹Petal.Width,
# ²Petal.Length, ³.stdresid
# ℹ Use `print(n = ...)` to see more rows
You can see that this gives you a lot of valuable information: fitted values, residuals, standardized residuals, leverage (.hat), Cook's distance, etc. This makes it easy to plot them in ggplot2.
Plotting
The first example is a Cook's distance plot. It plots each observation's Cook's distance as a column against its index using the geom_col function. The key ingredient here is the geom_text portion. Simply subset the data to the points you want to label and nudge the labels a little so they don't overlap the columns, and you can label essentially whatever you want:
#### Cooks Distance ####
fit.data %>%
ggplot(aes(x=.index,
y=.cooksd,
label=.index))+
geom_col()+
labs(x="Index",
y="Cook's Distance",
title = "Cook's Distance")+
geom_text(data=subset(fit.data,
.cooksd > .05),
nudge_y = .003)
Giving you this plot:
Another example using a similar method plots fitted values against their respective residuals, again with labels placed at an arbitrary cutoff:
#### Fitted vs Residuals ####
ggplot(fit.data,
aes(.fitted,
round(.resid,2),
label=round(.resid,2))) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)+
labs(x="Fitted",
y="Residual",
title = "Fitted vs Residuals")+
geom_text(data=subset(fit.data,
.resid > .5 | .resid < -.5),
nudge_x = .09)
A slew of other examples of how to do this can be seen at this link. The customization will be up to you, but it should give you a fair idea of how to hand tailor some of these base R plots you are getting.
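If you would rather stay with base R graphics, the same labelling idea can be sketched with plot() and text(). This is only a minimal illustration on the iris model from above, not the original poster's beta regression; n_label is a hypothetical cutoff, and the approach assumes cooks.distance() is available for your model class (it should carry over to other fits that provide that method).

```r
# Fit the same illustrative model as above (iris is only a stand-in dataset)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Cook's distance, one value per observation, named by row
cd <- cooks.distance(fit)

# Hypothetical choice: label the 3 observations with the largest distance
n_label <- 3
top <- order(cd, decreasing = TRUE)[seq_len(n_label)]

# Draw the index plot, then write the row names above the selected spikes
plot(cd, type = "h", xlab = "Index", ylab = "Cook's distance",
     main = "Cook's distance")
text(x = top, y = cd[top], labels = names(cd)[top],
     pos = 3, cex = 0.8, xpd = TRUE)
```

The same pattern works for a residual plot: compute the residuals, pick the indices you consider outliers, and call text() at those coordinates.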
Related
I want to achieve a GAM plot that looks like this
Image from https://stats.stackexchange.com/questions/179947/statistical-differences-between-two-hourly-patterns/446048#446048
How can I accomplish this?
Model is
model = gam(y ~ s(t) + g, data = d)
The general way to do this is to compute model estimates (fitted values) over the range of the covariate(s) of interest for each group. The reproducible example below illustrates one way to do this, using {mgcv} to fit the GAM and my {gratia} package for some helper functions to facilitate the process.
library("gratia")
library("mgcv")
library("ggplot2")
eg_data <- data_sim("eg4", n = 400, dist = "normal", scale = 2, seed = 1)
m <- gam(y ~ s(x2) + fac, data = eg_data, method = "REML")
ds <- data_slice(m, x2 = evenly(x2, n = 100), fac = evenly(fac))
fv <- fitted_values(m, data = ds)
The last line gets you fitted values from the model at the covariate combinations specified in the data slice:
> fv
# A tibble: 300 × 6
x2 fac fitted se lower upper
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 0.00131 1 -1.05 0.559 -2.15 0.0412
2 0.00131 2 -3.35 0.563 -4.45 -2.25
3 0.00131 3 1.13 0.557 0.0395 2.22
4 0.0114 1 -0.849 0.515 -1.86 0.160
5 0.0114 2 -3.14 0.519 -4.16 -2.13
6 0.0114 3 1.34 0.513 0.332 2.34
7 0.0215 1 -0.642 0.474 -1.57 0.287
8 0.0215 2 -2.94 0.480 -3.88 -2.00
9 0.0215 3 1.54 0.473 0.616 2.47
10 0.0316 1 -0.437 0.439 -1.30 0.424
# … with 290 more rows
# ℹ Use `print(n = ...)` to see more rows
This object is in a form suitable for plotting with ggplot():
fv |>
ggplot(aes(x = x2, y = fitted, colour = fac)) +
geom_point(data = eg_data, mapping = aes(y = y), size = 0.5) +
geom_ribbon(aes(x = x2, ymin = lower, ymax = upper, fill = fac,
colour = NULL),
alpha = 0.2) +
geom_line()
which produces
You can enhance and/or modify this using your ggplot skills.
The basic point with this model is that you have a common smooth effect of a covariate (here x2) plus group means (for the factor fac). Hence the curves are "parallel".
Note that there's a lot of variation around the estimated curves in this model because the simulated data are from a richer model with group-specific smooths and smooth effects of other covariates.
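For readers without {gratia}, roughly the same table of fitted values can be built with predict() from {mgcv} alone. The following is a sketch under stated assumptions: the data are simulated here purely for illustration (the names d, t, g and the sine-shaped effect are made up), and the interval is a plain fitted ± 1.96 × standard error band on the response scale of a Gaussian model.

```r
library(mgcv)

# Simulated stand-in data: one smooth covariate t and a 3-level factor g
set.seed(1)
n <- 300
d <- data.frame(
  t = runif(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
shift <- unname(c(a = 0, b = 1, c = -1)[as.character(d$g)])
d$y <- sin(2 * pi * d$t) + shift + rnorm(n, sd = 0.3)

# Common smooth of t plus group means, as in the question's model
model <- gam(y ~ s(t) + g, data = d, method = "REML")

# Every combination of 100 evenly spaced t values and the factor levels
grid <- expand.grid(
  t = seq(min(d$t), max(d$t), length.out = 100),
  g = levels(d$g)
)

# Fitted values and approximate 95% pointwise intervals
pr <- predict(model, newdata = grid, se.fit = TRUE)
grid$fitted <- as.numeric(pr$fit)
grid$lower  <- grid$fitted - 1.96 * as.numeric(pr$se.fit)
grid$upper  <- grid$fitted + 1.96 * as.numeric(pr$se.fit)
```

The resulting grid has the same role as fv above and can be passed to ggplot() with geom_ribbon() and geom_line() in the same way.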
gg.bs30 <- ggplot(data,aes(x=Predictor,y=Output,col=class))+geom_point()+
geom_smooth(method='gam',formula=y ~ splines::bs(x, 30)) + facet_grid(class ~.)
print(gg.bs30)
Code from -> https://github.com/mariocastro73/ML2020-2021/blob/master/scripts/gams-with-ggplot-classes.R
I have a dataframe (gdata) with x (as "r") and y (as "km") coordinates of a function.
When I plot it like this:
plot(x = gdata$r, y = gdata$km, type = "l")
I get the graph of the function:
Now I want to calculate the area under the curve from x = 0 to x = 0.6. When I look for appropriate packages, I only find things like calculating the AUC of a ROC curve. Is there a way to calculate the AUC of an ordinary function?
The area under the curve (AUC) of a given set of data points can be obtained using numerical integration:
Let data be your data frame containing x and y values. You can get the area under the curve from the lower bound x0=0 to the upper bound x1=0.6 by integrating a function that linearly interpolates your data.
This is a numerical approximation and not exact, because we do not have an infinite number of data points: for y=sqrt(x) with the 20 rows below we get 0.3033 instead of the true value of 0.3098. With 200 rows in data the result improves to auc=0.3096.
library(tidyverse)
data <-
tibble(
x = seq(0, 2, length.out = 20)
) %>%
mutate(y = sqrt(x))
data
#> # A tibble: 20 × 2
#> x y
#> <dbl> <dbl>
#> 1 0 0
#> 2 0.105 0.324
#> 3 0.211 0.459
#> 4 0.316 0.562
#> 5 0.421 0.649
#> 6 0.526 0.725
#> 7 0.632 0.795
#> 8 0.737 0.858
#> 9 0.842 0.918
#> 10 0.947 0.973
#> 11 1.05 1.03
#> 12 1.16 1.08
#> 13 1.26 1.12
#> 14 1.37 1.17
#> 15 1.47 1.21
#> 16 1.58 1.26
#> 17 1.68 1.30
#> 18 1.79 1.34
#> 19 1.89 1.38
#> 20 2 1.41
ggplot(data, aes(x, y)) + geom_point() # qplot() is deprecated in recent ggplot2 versions
integrate(approxfun(data$x, data$y), 0, 0.6)
#> 0.3033307 with absolute error < 8.8e-05
Created on 2021-10-03 by the reprex package (v2.0.1)
The absolute error returned by integrate is correct only if the true function between every two adjacent data points is exactly linear, as we assumed.
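To see what integrate(approxfun(...)) is approximating, the same area can be computed directly with the trapezoid rule. This is a minimal sketch rather than part of the answer above: trapz_between is a hypothetical helper name, and it assumes x is sorted in increasing order and that from and to lie inside the range of x.

```r
# Trapezoid rule on the linear interpolant of (x, y), restricted to [from, to]
trapz_between <- function(x, y, from, to) {
  f <- approxfun(x, y)                 # linear interpolation of the data
  inner <- x[x > from & x < to]        # data points strictly inside the bounds
  xs <- c(from, inner, to)             # add the (interpolated) end points
  ys <- f(xs)
  sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
}

# Same toy data as above: 20 points of y = sqrt(x) on [0, 2]
x <- seq(0, 2, length.out = 20)
y <- sqrt(x)
auc_linear <- trapz_between(x, y, from = 0, to = 0.6)
auc_linear
```

Because the interpolant is piecewise linear, this sum is the exact area under it, so it should agree with the integrate() result up to the reported numerical error.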
I used the package MESS to solve the problem:
# Toy example
library(MESS)
x <- seq(0,3, by=0.1)
y <- x^2
auc(x,y, from = 0.1, to = 2, type = "spline")
The analytical result is:
7999/3000
Which is approximately 2.666333333333333
The R script offered gives: 2.66632 using the spline approximation and 2.6695 using the linear approximation.
My dataset looks like this
df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
nps_score=c(floor(runif(455, min=0, max=10))),
service_score=c(floor(runif(455, min=0, max=10))),
food_score=c(floor(runif(455, min=0, max=10))),
clean_score=c(floor(runif(455, min=0, max=10))))
I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.
I am trying to group_by the site variable and then run the model for each site, but it doesn't seem to be working. I've looked at some existing solutions on Stack Overflow but cannot seem to adapt the code to my problem.
# Trying to run this by group, and output the resulting coefficients per site in a separate df with their significance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated
library(tidyverse)
library(broom)
library(MASS)
# We first create a formula object
# dplyr::select is called explicitly because MASS, loaded last, masks select()
my_formula <- as.formula(paste("nps_score ~ ", paste(df %>% dplyr::select(-site, -nps_score) %>% names(), collapse = " + ")))
# Now we can group by site and use the formula object within the pipe.
results <- df %>%
  group_by(site) %>%
  do(tidy(rlm(my_formula, data = .)))
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: I'm not familiar with the rlm function or whether it provides p-values in the first place, but at least the tidy function doesn't return p-values for rlm. If a simple linear regression would suit your needs, you could replace the rlm function with lm, in which case a sixth column with p-values would be added.
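As an illustration of the lm variant mentioned in the note, here is a base R sketch that fits one ordinary linear model per site and collects the coefficients together with their p-values. It swaps rlm for lm and avoids the tidyverse, so it is an alternative route rather than the code above; the seed and the object names fits and results_lm are arbitrary choices for this example.

```r
# Recreate the example data from the question (seed added for reproducibility)
set.seed(42)
df <- data.frame(
  site          = rep(c("A", "B", "C"), times = c(95, 110, 250)),
  nps_score     = floor(runif(455, min = 0, max = 10)),
  service_score = floor(runif(455, min = 0, max = 10)),
  food_score    = floor(runif(455, min = 0, max = 10)),
  clean_score   = floor(runif(455, min = 0, max = 10))
)

# One lm fit per site
fits <- lapply(split(df, df$site), function(d) {
  lm(nps_score ~ service_score + food_score + clean_score, data = d)
})

# Stack the coefficient tables, including the p-values, into one data frame
results_lm <- do.call(rbind, lapply(names(fits), function(s) {
  cf <- coef(summary(fits[[s]]))
  data.frame(
    site      = s,
    term      = rownames(cf),
    estimate  = cf[, "Estimate"],
    std.error = cf[, "Std. Error"],
    statistic = cf[, "t value"],
    p.value   = cf[, "Pr(>|t|)"],
    row.names = NULL
  )
}))
results_lm
```

Each site contributes four rows (intercept plus three predictors), mirroring the layout of the tidy output shown above with the additional p.value column.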
I'm trying to calculate the effect size among different factor levels. To compare the two means within each factor level, the code below works fine:
cohens_d_list <- by(mydata, mydata$factor, function(sub)
cohens_d(sub$score1, sub$score2)
)
cohens_d_list
However, I couldn't figure out how to compare the factor levels with each other for a single mean (e.g. for score1, I want to compare each factor level with every other: factor level 1 vs. factor level 2, factor level 1 vs. factor level 3, factor level 1 vs. factor level 4, ...). I used the psych, effectsize, and effsize packages, but they don't seem to account for more than 2 levels in a single factor variable. Any suggestions for code or a package?
After trying dozens of packages, the esvis package did the trick.
df%>%
ungroup(Group)%>% # Include this line if you get grouping error
coh_d(score1~ Group)
You get a nice table with all possible comparisons.
You can fit a model and use the eff_size() function from emmeans (which will have the benefit of using the pooled SD from all groups, not just the 2 being compared):
m <- lm(mpg ~ factor(cyl), data = mtcars)
library(emmeans)
(em <- emmeans(m, ~ cyl))
#> cyl emmean SE df lower.CL upper.CL
#> 4 26.7 0.972 29 24.7 28.7
#> 6 19.7 1.218 29 17.3 22.2
#> 8 15.1 0.861 29 13.3 16.9
#>
#> Confidence level used: 0.95
eff_size(em, sigma = sigma(m), edf = df.residual(m))
#> contrast effect.size SE df lower.CL upper.CL
#> 4 - 6 2.15 0.56 29 1.003 3.29
#> 4 - 8 3.59 0.62 29 2.320 4.86
#> 6 - 8 1.44 0.50 29 0.418 2.46
#>
#> sigma used for effect sizes: 3.223
#> Confidence level used: 0.95
Created on 2021-06-07 by the reprex package (v2.0.0)
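For comparison, the classical pairwise Cohen's d, which pools the SD of only the two groups being compared, can be sketched in base R. Note that this is a different estimator from the eff_size() call above, which uses the residual SD from the full model, so the numbers will generally differ. pairwise_d is a hypothetical helper written for this example.

```r
# Cohen's d for every pair of factor levels, pooling the SD of the two groups
pairwise_d <- function(score, group) {
  group <- factor(group)
  pairs <- combn(levels(group), 2)          # one column per pair of levels
  d <- apply(pairs, 2, function(p) {
    a <- score[group == p[1]]
    b <- score[group == p[2]]
    pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                        (length(a) + length(b) - 2))
    (mean(a) - mean(b)) / pooled_sd
  })
  data.frame(contrast = paste(pairs[1, ], pairs[2, ], sep = " - "),
             cohens_d = d)
}

# Same data as the emmeans example: mpg compared across cylinder counts
res <- pairwise_d(mtcars$mpg, mtcars$cyl)
res
```

The result has one row per pair of levels, in the same order as the contrasts in the emmeans output.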
I ran a regression that looks as follows:
fit <- lmer(support ~ income + (1 | country), data = df)
When using summary(df), it shows me that for income, the minimum is -2.4 and the maximum is 2.6.
I would like to plot the predicted values. I tried by using the following code:
library(ggeffects)
library(ggplot2)
p1 <- ggpredict(fit, terms = "income")
ggplot(p1, aes(x, predicted)) + geom_line() + geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1)
However, the plot goes from -3 to 3. How can I set the minimum and maximum values for the plot? I tried with min and max, but it did not work.
By default, for continuous variables, a "pretty" range is chosen for the x-axis. This may include values that don't appear in the data. Using [all] might work instead; see this example, where in the 2nd case the predicted values range from 0.1 to 2.5 instead of 0 to 2.6.
library(ggeffects)
data(iris)
m <- lm(Sepal.Length ~ Petal.Width, data = iris)
ggpredict(m, "Petal.Width")
#>
#> # Predicted values of Sepal.Length
#> # x = Petal.Width
#>
#> x predicted std.error conf.low conf.high
#> 0.0 4.778 0.073 4.635 4.921
#> 0.4 5.133 0.057 5.022 5.244
#> 0.6 5.311 0.050 5.213 5.408
#> 1.0 5.666 0.040 5.587 5.745
#> 1.4 6.022 0.040 5.943 6.101
#> 1.6 6.199 0.044 6.113 6.286
#> 2.0 6.555 0.057 6.444 6.666
#> 2.6 7.088 0.082 6.927 7.248
ggpredict(m, "Petal.Width [all]")
#>
#> # Predicted values of Sepal.Length
#> # x = Petal.Width
#>
#> x predicted std.error conf.low conf.high
#> 0.1 4.866 0.069 4.732 5.001
#> 0.4 5.133 0.057 5.022 5.244
#> 0.6 5.311 0.050 5.213 5.408
#> 1.2 5.844 0.039 5.767 5.920
#> 1.5 6.110 0.042 6.028 6.193
#> 1.7 6.288 0.047 6.197 6.380
#> 2.0 6.555 0.057 6.444 6.666
#> 2.5 6.999 0.077 6.847 7.151
Created on 2019-03-29 by the reprex package (v0.2.1)
This vignette could be helpful, too.
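If you want full control over the plotted range, one option is to build the prediction grid yourself so that it spans exactly the observed minimum and maximum. The sketch below does this with base R predict() on the same iris lm as above; it is an illustration of the idea only and does not cover the mixed model from the question, where the prediction call and intervals would need to be adapted.

```r
# Same illustrative model as in the ggpredict example
m <- lm(Sepal.Length ~ Petal.Width, data = iris)

# Grid limited to the observed range of the predictor
x_rng <- range(iris$Petal.Width)
newdat <- data.frame(Petal.Width = seq(x_rng[1], x_rng[2], length.out = 50))

# Predictions with 95% confidence limits; columns are fit, lwr, upr
ci <- predict(m, newdata = newdat, interval = "confidence")
pred <- cbind(newdat, as.data.frame(ci))
head(pred)
```

The pred data frame can then be plotted with geom_line() and geom_ribbon(aes(ymin = lwr, ymax = upr)), and the x-axis will not extend beyond the data.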