Identification of influential observations in regression with library car - r

Can somebody explain why library(car) finds influential observations here?:
library(car)
x = seq(1, 5, len = 100)
set.seed(99)
y = 2*x + 1 + rnorm(length(x), 0, 0.00005)
plot(x,y) # no influential observations!!
infl = influencePlot(lm(y ~ x))
infl # 4 influential observations??

If you read the help page for the function:
The default ‘method="noteworthy"’ is used only in this function and
indicates setting labels for points with large Studentized residuals,
hat-values or Cook's distances.
And the default settings:
‘id=TRUE’ is equivalent to ‘id=list(method="noteworthy", n=2, cex=1, col=carPalette()[1], location="lr")’
Using your example, the plot basically labels the 2 most extreme values for Studentized residuals (y-axis) and the 2 most extreme values for hat-values (x-axis).
If you want the 3 most extreme, you can do:
influencePlot(lm(y ~ x), id = list(method = "noteworthy", n = 3))
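You can also convince yourself that nothing is truly unusual here by computing the underlying measures for every observation; the data frame returned by influencePlot() simply reports them for the points it flags. A minimal sketch, reusing the model from the question:
library(car)
x <- seq(1, 5, len = 100)
set.seed(99)
y <- 2*x + 1 + rnorm(length(x), 0, 0.00005)
fit <- lm(y ~ x)
# The quantities influencePlot() looks at, computed for all observations
summary(rstudent(fit))        # Studentized residuals
summary(hatvalues(fit))       # hat-values (leverage)
summary(cooks.distance(fit))  # Cook's distances
# The returned object only contains these measures for the labelled points
infl <- influencePlot(fit)
infl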

Related

How to simulate a strong correlation of data with R

Sometimes I try to simulate data by using the rnorm function, which I have done below:
mom.iq <- rnorm(n = 1000, mean = 120, sd = 15)
kid.score <- rnorm(n = 1000, mean = 45, sd = 20)
df <- data.frame(mom.iq, kid.score)
But when I plot something like this, it usually ends up with data that's highly uncorrelated:
library(ggpubr)
ggscatter(df, x = "mom.iq", y = "kid.score") +
  geom_smooth(method = "lm")
However, I would like to simulate something with a stronger correlation if possible. Is there an easy way to do this within R? I'm aware that I could just produce my own values manually, but that's not very practical for recreating large samples.
What you are doing is generating two independent variables, so it is normal for them to be uncorrelated. What you can do instead is this:
# In order to make the values reproducible
set.seed(12345)
# Generate independent variable
x <- rnorm(n=1000, mean=120, sd=15)
# Generate the dependent variable
y <- 3*x + 6 + rnorm(n=1000, mean = 0, sd = 5)
I used 3 and 6, but you can define them as you want (a and b) in order to get a linear dependence defined as y = a*x + b.
Adding rnorm(n=1000, mean = 0, sd = 5) introduces some variability and avoids a perfect correlation between x and y. To get more strongly correlated data, reduce the standard deviation (sd); to get a weaker correlation, increase it, as the sketch below illustrates.
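A small sketch of that trade-off, comparing cor(x, y) for a few noise levels (the specific sd values 5, 25 and 100 are arbitrary choices for illustration):
set.seed(12345)
x <- rnorm(n = 1000, mean = 120, sd = 15)
# correlation between x and y = 3*x + 6 + noise, for increasing noise sd
sapply(c(5, 25, 100), function(s) {
  y <- 3*x + 6 + rnorm(n = 1000, mean = 0, sd = s)
  cor(x, y)
})
# the correlation shrinks as the noise sd grows (exact values depend on the seed)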
You can create your second variable by taking the first variable into account, and adding some error with rnorm in order to avoid making the relationship completely deterministic:
library(ggplot2)
dat <- data.frame(father_age = rnorm(1000, 35, 5)) |>
  dplyr::mutate(child_score = -father_age * 0.5 + rnorm(1000, 0, 4))
dat |>
  ggplot(aes(father_age, child_score)) +
  geom_point() +
  geom_smooth(method = "lm")
#> `geom_smooth()` using formula 'y ~ x'
Created on 2022-07-07 by the reprex package (v2.0.1)
It seems to me that you don't just want to simulate arbitrary x and y with a linear relationship (which the other two answers show). You give your variables meaningful names mom.iq and kid.score, so it appears that you want them to have a certain mean and variance. In this case, you can use MASS::mvrnorm to simulate samples from a multivariate normal distribution, where you can specify the correlation. This lets you preserve the marginal mean and marginal variance you specified.
## your current specification of marginal mean and marginal standard deviation
mean_mom.iq <- 120
mean_kid.score <- 45
sd_mom.iq <- 15
sd_kid.score <- 20
## introduce correlation coefficient between two variables
## coefficient must be between -1 and 1
corcoef <- 0.8
## the resulting covariance between the two variables
covariance <- corcoef * sd_mom.iq * sd_kid.score
## the variance-covariance matrix
Sigma <- matrix(c(sd_mom.iq^2, covariance, covariance, sd_kid.score^2), nrow = 2)
# [,1] [,2]
#[1,] 225 240
#[2,] 240 400
Now you can use MASS::mvrnorm.
xy <- MASS::mvrnorm(n = 500, mu = c(mean_mom.iq, mean_kid.score), Sigma = Sigma)
colnames(xy) <- c("mom.iq", "kid.score")
xydf <- data.frame(xy)
head(xydf)
# mom.iq kid.score
#1 111.6211 33.26241
#2 114.4765 42.49280
#3 115.8160 47.57242
#4 121.8656 53.16578
#5 152.1459 89.60617
#6 107.4360 39.00345
plot(xydf)
You can verify marginal mean and marginal variance of the simulated samples.
sapply(xydf, mean) ## mean, you specified 120 and 45
# mom.iq kid.score
# 119.9499 44.4193
sapply(xydf, sd) ## standard deviation, you specified 15 and 20
# mom.iq kid.score
# 15.35214 20.16483
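You can likewise check that the simulated correlation is close to the corcoef you specified, for example:
cor(xydf)                         # off-diagonal entries should be near 0.8
cor(xydf$mom.iq, xydf$kid.score)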

Other than burn-in increase and priors, how can I help my multiple change point (mcp package in R) models converge?

I would like to identify changepoints in my data and an associated error term for their estimate. The mcp package seems to do a good job (visually) of identifying changepoints in my data, but the model parameters generally have rhat values > 1.1. From my understanding, I cannot trust any Bayesian parameter estimates unless all rhat values in the model are <= 1.1. Aside from increasing my burn-in period with the adapt argument or using priors (see note below), how else can I improve these models?
Alternatively, can I force mcp to fit a 'best' two- and three-segment model and return those parameter estimates with error? Ideally I would be able to provide changepoint estimates with an error term associated with each estimate, but packages like segmented and strucchange generally fail to identify changepoints in my data.
The code looks like this:
set.seed(42)
x <- c(227,227,228,228,228,228,228,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,232,233,233,233,233,233,233,233,233,236,236,236,236,237,237,237,238,238,238,238,238,238,238,238,239,239,239,239,239,239,243,244,244,244,244,244,244,244,244,244,245,245,245,246,246,246,246,247,250,250,250,250,251,251,251,251,251,251,251,251,253,253,253,257,257,260,260,260,260,260,260,260,264,264,264,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,267,267,267,267,267,267,267,267,267,267,267,267,267,271,271,271,271,271,271,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,274,274,274,274,274,274,274)
y <- c(8.43,6.9,8.93,7.33,7.28,7.24,6.62,8.36,8.17,8.07,8.07,7.63,7.6,7.54,7.37,7.31,7.21,7.03,6.93,6.88,6.82,6.78,6.7,6.5,8.35,10.97,7.48,7.46,7.28,7.17,6.72,6.68,6.08,7.42,7.14,6.92,6.68,7.49,7.28,6.67,9.4,7.54,7.04,6.89,6.88,6.52,6.45,6.39,8.48,8.04,7.52,7.35,6.9,6.57,6.86,7.46,7.39,7.16,7.08,6.83,6.83,6.7,6.54,6.47,9.75,7.38,5.96,10.49,8.32,7.22,7.05,8.55,10.34,8.23,7.9,7.31,8.18,7.8,7.31,7.18,7.17,7.13,7.02,6.84,10.62,10.09,9.26,10.8,10.37,10.9,10.52,10.23,9.28,9.18,8.85,8.81,11.03,8.84,6.29,11.36,10.91,10.87,10.4,10.17,9.61,9.5,9.36,9.17,9.13,8.88,8.73,8.55,8.37,8.33,8.25,7.82,6.9,9.77,9.53,9.39,9.1,8.93,8.68,8.64,8.47,8.41,8.38,8.28,8.18,7.74,10.67,10.64,10.54,10.36,10.35,7.03,9.51,9.37,9.24,9.22,9.18,8.96,8.95,8.94,8.89,8.82,8.79,8.72,8.35,8.22,8.13,8.07,7.91,7.85,7.79,8.82,8.59,8.44,8.42,8.37,8.06,7.34)
df <- data.frame(x, y)
#Writing the formula for a three-segment line
three_segment_model = list(
  y ~ x,          # intercept + slope
  y ~ 1 ~ 0 + x,  # segment 2, specifying a changepoint and joined slope
  y ~ 1 ~ 0 + x   # segment 3, specifying a changepoint and joined slope
)
#Writing the formula for a two-segment line
two_segment_model = list(
  y ~ x,          # intercept + slope
  y ~ 1 ~ 0 + x   # segment 2, specifying a changepoint and joined slope
)
#Disjointing the slopes of two segments
test_segment_model = list(
  y ~ x,          # intercept + slope
  y ~ 1 ~ 1 + x   # segment 2, specifying a changepoint and disjoined slope
)
#Disjointing the slopes of the three expected segments
test_three_segment_model = list(
  y ~ x,          # intercept + slope
  y ~ 1 ~ 1 + x,  # segment 2, specifying a changepoint and disjoined slope
  y ~ 1 ~ 1 + x   # segment 3, specifying a changepoint and disjoined slope
)
These are the models I've tried, but none achieve consistently tolerable rhat values. The code I use to test them is as follows:
#An example run, although each of these models fails to provide suitable rhat values
fit1 <- mcp(three_segment_model, df, chains = 4, iter = 10000, cores = 3)
plot(fit1)
fit1_summary <- data.frame(summary(fit1))
Given the literature, I have also tried priors (though admittedly I cannot make heads or tails of the truncate and distribution syntax in mcp), but they do not seem to help with the patterns that are poorly identified by this modelling approach.
If your issue is primarily a practical one, upping the number of iterations and chains reveals that the posterior is quite reproducible across chains for the "problematic" parameters:
fit1 <- mcp(three_segment_model, df, chains = 6, iter = 50000, cores = 6)
plot_pars(fit1, c("cp_1", "cp_2", "x_1"))
You also have good rhat values for the non-three-segment models; all of this points to the three-segment model being unidentifiable with this data. Without knowing the underlying process, I did try some fairly informative priors:
prior = list(
  x_1 = 0,                    # fixed horizontal!
  x_2 = "dnorm(0, 1) T(0, )"  # positive slope
)
but it did not improve rhat values for the change point parameters.
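For completeness, a prior list like the one above is passed to mcp() through its prior argument; a sketch (check the argument names against your installed version of the package):
# Sketch: refit the three-segment model with the informative priors above
fit1_prior <- mcp(three_segment_model, data = df, prior = prior,
                  chains = 6, iter = 50000, cores = 6)
summary(fit1_prior)  # the summary table includes an Rhat column per parameter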

Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x, y, z) measured in 3 replicates. x and z are the independent variables; y is the dependent variable; t is the time variable. All three variables follow diel variation: they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0,4*pi,,100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for(i in 1:3){
  y <- 3*sin(b*t) + rnorm(n)*2
  x <- 2*sin(b*t + 2.5) + rnorm(n)*2
  z <- 4*sin(b*t - 2.5) + rnorm(n)*2
  data <- as_tibble(cbind(y, x, z)) %>% mutate(t = 1:100) %>% mutate(replicate = i)
  datalist[[i]] <- data
}
df <- do.call(rbind, datalist)
ggplot(df) +
  geom_line(aes(t, x), color = 'red') +
  geom_line(aes(t, y), color = 'blue') +
  geom_line(aes(t, z), color = 'green') +
  facet_wrap(~replicate, nrow = 1) +
  theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with the ccf() function in R. For example:
ccf(x,y)
ccf(z,y)
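The lag at which the cross-correlation peaks can be read straight off the object ccf() returns (a small sketch using the simulated x and y above):
# lag (in time steps) where the absolute cross-correlation is largest
cc <- ccf(x, y, plot = FALSE)
cc$lag[which.max(abs(cc$acf))]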
But I would like to do it with a multivariate regression approach. For example, the nlme package and its lme function indicate that x and z are negatively affecting y:
lme_fit <- lme(data = df, y ~ x + z, random = ~1|replicate, correlation = corCAR1(form = ~ t | replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time-lead/lag and also I would like to get the standardized coefficient (t-value to compare the effect size), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also give me regression coefficient?
We might be considering the "statistical significance of a Cramer-Rao estimation of a lower bound". In order to find X*beta - X_infinity, taking the expectation of X*beta and an assumed mean mu will yield a variable, mu^2, which can replace X_infinity. Using the F-test likelihood ratio, the degrees of freedom are p2 - p1 = n - p2.
Put it this way: the estimates are n = (-2*mu^2/mu^2 + n), phi_t = y/(X*beta) and X*beta = (y - beta_0)/a.
The point estimate, X*beta, is derived from y = a*X*beta + b. The time lead/lag is phi_t and the standardized coefficient is n. The regression generates the lower bound X*beta, where t = beta.
Spectral analysis of the linear distribution indicates a point estimate beta_0 = 0.27, which is a significant peak of variability. Scaling X*beta by beta_0 would be an appropriate idea.

Exponential curve fitting in R

time = 1:100
head(y)
# [1] 0.07841589 0.07686316 0.07534116 0.07384931 0.07238699 0.07095363
plot(time, y)
This is an exponential curve.
How can I fit a line to this curve without knowing the formula? I can't use 'nls' as the formula is unknown (only data points are given).
How can I get the equation for this curve and determine the constants in the equation?
I tried loess but it doesn't give the intercepts.
You need a model to fit to the data.
Without knowing the full details of your model, let's say that this is an exponential growth model, which one could write as: y = a * e^(r*t), where y is your measured variable, t is the time at which it was measured, a is the value of y when t = 0, and r is the growth constant.
We want to estimate a and r.
This is a non-linear problem because we want to estimate the exponent, r.
However, in this case we can use some algebra and transform it into a linear equation by taking the log on both sides and solving (remember
logarithmic rules), resulting in:
log(y) = log(a) + r * t
We can visualise this with an example, by generating a curve from our model, assuming some values for a and r:
t <- 1:100 # these are your time points
a <- 10 # assume the size at t = 0 is 10
r <- 0.1 # assume a growth constant
y <- a*exp(r*t) # generate some y observations from our exponential model
# visualise
par(mfrow = c(1, 2))
plot(t, y) # on the original scale
plot(t, log(y)) # taking the log(y)
So, for this case, we could explore two possibilities:
Fit our non-linear model to the original data (for example using nls() function)
Fit our "linearised" model to the log-transformed data (for example using the lm() function)
Which option to choose (and there are more options) depends on what we think (or assume) the data-generating process behind our data is.
Let's illustrate with some simulations that include added noise (sampled from
a normal distribution), to mimic real data. Please look at this
StackExchange post
for the reasoning behind this simulation (pointed out by Alejo Bernardin's comment).
set.seed(12) # for reproducible results
# errors constant across time - additive
y_add <- a*exp(r*t) + rnorm(length(t), sd = 5000) # or: rnorm(length(t), mean = a*exp(r*t), sd = 5000)
# errors grow as y grows - multiplicative (constant on the log-scale)
y_mult <- a*exp(r*t + rnorm(length(t), sd = 1)) # or: rlnorm(length(t), mean = log(a) + r*t, sd = 1)
# visualise
par(mfrow = c(1, 2))
plot(t, y_add, main = "additive error")
lines(t, a*exp(t*r), col = "red")
plot(t, y_mult, main = "multiplicative error")
lines(t, a*exp(t*r), col = "red")
For the additive model, we could use nls(), because the error is constant across
t. When using nls() we need to specify some starting values for the optimization algorithm (try to "guesstimate" what these are, because nls() often struggles to converge on a solution).
add_nls <- nls(y_add ~ a*exp(r*t),
               start = list(a = 0.5, r = 0.2))
coef(add_nls)
# a r
# 11.30876845 0.09867135
Using the coef() function we can get the estimates for the two parameters.
This gives us OK estimates, close to what we simulated (a = 10 and r = 0.1).
You could see that the error variance is reasonably constant across the range of the data, by plotting the residuals of the model:
plot(t, resid(add_nls))
abline(h = 0, lty = 2)
For the multiplicative error case (our y_mult simulated values), we should use lm() on log-transformed data, because
the error is constant on that scale instead.
mult_lm <- lm(log(y_mult) ~ t)
coef(mult_lm)
# (Intercept) t
# 2.39448488 0.09837215
To interpret this output, remember again that our linearised model is log(y) = log(a) + r*t, which is equivalent to a linear model of the form Y = β0 + β1 * X, where β0 is our intercept and β1 our slope.
Therefore, in this output (Intercept) is equivalent to log(a) of our model and t is the coefficient for the time variable, so equivalent to our r.
To meaningfully interpret the (Intercept) we can take its exponential (exp(2.39448488)), giving us ~10.96, which is quite close to our simulated value.
It's worth noting what would happen if we'd fit data where the error is multiplicative
using the nls function instead:
mult_nls <- nls(y_mult ~ a*exp(r*t), start = list(a = 0.5, r = 0.2))
coef(mult_nls)
# a r
# 281.06913343 0.06955642
Now we over-estimate a and under-estimate r
(Mario Reutter
highlighted this in his comment). We can visualise the consequence of using the wrong approach to fit our model:
# get the model's coefficients
lm_coef <- coef(mult_lm)
nls_coef <- coef(mult_nls)
# make the plot
plot(t, y_mult)
lines(t, a*exp(r*t), col = "brown", lwd = 5)
lines(t, exp(lm_coef[1])*exp(lm_coef[2]*t), col = "dodgerblue", lwd = 2)
lines(t, nls_coef[1]*exp(nls_coef[2]*t), col = "orange2", lwd = 2)
legend("topleft", col = c("brown", "dodgerblue", "orange2"),
legend = c("known model", "nls fit", "lm fit"), lwd = 3)
We can see how the lm() fit to log-transformed data was substantially better than the nls() fit on the original data.
You can again plot the residuals of this model, to see that the variance is not constant across the range of the data (we can also see this in the graphs above, where the spread of the data increases for higher values of t):
plot(t, resid(mult_nls))
abline(h = 0, lty = 2)
Unfortunately, taking the logarithm and fitting a linear model is not optimal. The reason is that the errors for large y-values carry much more weight than those for small y-values when applying the exponential function to go back to the original model.
Here is one example:
f <- function(x){exp(0.3*x+5)}                                # the true curve
squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}  # squared error against the true curve, on the original scale
x <- 0:12
y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )  # add some multiplicative noise
x
y
#--------------------------------------------------------------------
M <- lm(log(y)~x)      # fit on the log scale
a <- unlist(M[1])[2]   # slope (estimate of 0.3)
b <- unlist(M[1])[1]   # intercept (estimate of 5)
print(c(a,b))
squaredError(a,b,x,y)
# approximate the partial derivative of the squared error with respect to a
approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
# small steps against the sign of that derivative keep reducing the squared error
for ( i in 0:10 )
{
  eps <- -i*sign(approxPartAbl_a)*1e-5
  print(c(eps,squaredError(a+eps,b,x,y)))
}
Result:
> f <- function(x){exp(0.3*x+5)}
> squaredError <- function(a,b,x,y) {sum((exp(a*x+b)-f(x))^2)}
> x <- 0:12
> y <- f(x) * ( 1 + sample(-300:300,length(x),replace=TRUE)/10000 )
> x
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12
> y
[1] 151.2182 203.4020 278.3769 366.8992 503.5895 682.4353 880.1597 1186.5158 1630.9129 2238.1607 3035.8076 4094.6925 5559.3036
> #--------------------------------------------------------------------
>
> M <- lm(log(y)~x)
> a <- unlist(M[1])[2]
> b <- unlist(M[1])[1]
> print(c(a,b))
coefficients.x coefficients.(Intercept)
0.2995808 5.0135529
> squaredError(a,b,x,y)
[1] 5409.752
> approxPartAbl_a <- (squaredError(a+1e-8,b,x,y) - squaredError(a,b,x,y))/1e-8
> for ( i in 0:10 )
+ {
+ eps <- -i*sign(approxPartAbl_a)*1e-5
+ print(c(eps,squaredError(a+eps,b,x,y)))
+ }
[1] 0.000 5409.752
[1] -0.00001 5282.91927
[1] -0.00002 5157.68422
[1] -0.00003 5034.04589
[1] -0.00004 4912.00375
[1] -0.00005 4791.55728
[1] -0.00006 4672.70592
[1] -0.00007 4555.44917
[1] -0.00008 4439.78647
[1] -0.00009 4325.71730
[1] -0.0001 4213.2411
>
Perhaps one can try some numeric method, e.g. a gradient search, to find the minimum of the squared error function; a sketch with optim() is below.
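For instance, starting from the log-linear coefficients a and b obtained above, R's general-purpose optimiser can reduce the squared error further (a sketch reusing squaredError(), x and y from the example):
# minimise the squared error on the original scale, starting from the lm() estimates
opt <- optim(par = c(unname(a), unname(b)),
             fn = function(p) squaredError(p[1], p[2], x, y))
opt$par    # refined estimates of the slope and intercept
opt$value  # should be smaller than squaredError(a, b, x, y)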
If it really is exponential, you can try taking the logarithm of your variable and fitting a linear model to that.
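With the variables from the question, that could look like this (a sketch, assuming the full y vector shown only partially above is available):
fit <- lm(log(y) ~ time)
exp(coef(fit)[1])  # estimate of the multiplicative constant
coef(fit)[2]       # estimate of the rate (likely negative here, since y is decreasing)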

Adding arbitrary curve with AUC 0.8 to ROC plot

I have a simple ROC plot that I am creating using pROC package:
plot.roc(response, predictor)
It is working fine, as expected, but I would like to add an "ideally" shaped reference curve with AUC 0.8 for comparison (the AUC of my ROC plot is 0.66).
Any thoughts?
Just to clarify, I am not trying to smooth my ROC plot, but to add a reference curve that would represent AUC 0.8 (similar to the reference diagonal line representing AUC 0.5).
The reference diagonal line has a meaning (a model that guesses randomly), so you would similarly have to define the model associated with your reference curve of AUC 0.8. Different models would be associated with different reference curves.
For instance, one might define a model for which predicted probabilities are distributed evenly between 0 and 1, and a point with predicted probability p has a true outcome whenever a uniform random draw raised to the power k falls below p (i.e. with probability p^(1/k)). It turns out that for this model, k=2 yields a plot with AUC 0.8.
library(pROC)
set.seed(144)
probs <- seq(0, 1, length.out=10000)
truth <- runif(10000)^2 < probs
plot.roc(truth, probs)
# Call:
# plot.roc.default(x = truth, predictor = probs)
#
# Data: probs in 3326 controls (truth FALSE) < 6674 cases (truth TRUE).
# Area under the curve: 0.7977
Some algebra shows that this particular family of models has AUC (2+3k)/(2+4k), meaning it can generate curves with AUC between 0.75 and 1 depending on the value of k.
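You can check that formula by simulation for other values of k; for example, k = 3 should give AUC (2+9)/(2+12) = 11/14, about 0.786 (a quick sketch):
set.seed(144)
k <- 3
probs <- seq(0, 1, length.out = 1e5)
truth <- runif(1e5)^k < probs
pROC::auc(truth, probs)  # should be close to (2 + 3*k)/(2 + 4*k)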
Another approach you could use is linked to logistic regression. If p is the value of the linear predictor (i.e. the predicted probability is 1/(1+exp(-p))), then you could label the true outcome as true if p plus some normally distributed noise exceeds 0, and as false otherwise. If the noise has variance 0 your model will have AUC 1; if the noise has variance approaching infinity your model will have AUC 0.5.
If I assume the original predictions are drawn from the standard normal distribution, it looks like normally distributed noise with standard deviation 1.2 gives AUC 0.8 (I couldn't figure out a nice closed form for the AUC, though):
set.seed(144)
pred.fxn <- rnorm(10000)
truth <- (pred.fxn + rnorm(10000, 0, 1.2)) >= 0
plot.roc(truth, pred.fxn)
# Call:
# plot.roc.default(x = truth, predictor = pred.fxn)
#
# Data: pred.fxn in 5025 controls (truth FALSE) < 4975 cases (truth TRUE).
# Area under the curve: 0.7987
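If you want to hit a specific target AUC with this construction, you can search numerically for the noise standard deviation; a sketch using uniroot() with a fixed noise vector scaled by s (suppressMessages() just silences pROC's direction messages):
set.seed(144)
pred.fxn <- rnorm(1e5)
noise    <- rnorm(1e5)  # fixed standard-normal noise, scaled by s below
auc_gap <- function(s) {
  truth <- (pred.fxn + s * noise) >= 0
  as.numeric(suppressMessages(pROC::auc(truth, pred.fxn))) - 0.8
}
uniroot(auc_gap, interval = c(0.5, 3))$root  # comes out near 1.2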
A quick/rough way is to add a circle of radius 1 onto your plot; the quarter of it that falls inside the plotting region has AUC pi/4 = 0.7853982:
library(pROC)
library(car)
n <- 100L
x1 <- rnorm(n, 2.0, 0.5)
x2 <- rnorm(n, -1.0, 2)
y <- rbinom(n, 1L, plogis(-0.4 + 0.5 * x1 + 0.1 * x2))
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
plot(roc(y, probs))
ellipse(c(0, 0), matrix(c(1,0,0,1), 2, 2), radius = 1, center.pch = FALSE, col = "blue")

Resources