How can I generate random data with different varience (heteroskedasticity). in R - r

The regression model should be like the one on the left, and its value should be distributed normally like the equation on the right.

The rnorm() function support vector entry. Hence, you can generate u_i given x_i using the following line
x <- c(1:10)
u <- rnorm(rep(1,10), mean = rep(0,10), sd = sqrt(0.5*sqrt(x)))
Then
y <- 10 + x + u
To validation:
set.seed(2021)
validation.u <- NULL
for (i in 1:1000){
u_i <- rnorm(rep(1,10), mean = rep(0,10), sd = 0.5*sqrt(x))
validation.u <- rbind(validation.u, u_i)
}
colMeans(validation.u)
> [1] 0.04053143 0.02626776 -0.01915211 0.02319424 0.05187192 -0.06950472 0.01335739 0.03787621 0.01348861 0.03323902
diag(cov(validation.u))
> [1] 0.5277132 0.6585050 0.8586679 1.1081190 1.1200815 1.2266432 1.2553346 1.4172857 1.5748265 1.6788966

Related

Predict segmented lm outside of package

I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
o<-## S3 method for class 'lm':
segmented(out.lm,seg.Z=~x,psi=list(x=c(30,60)),
control=seg.control(display=FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using the predict.segment version to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base r predict() function (not the segmented.predict() function) with the coefficients and break points saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). Through some reverse engineering of the segmented.predict() function, I produced the design matrix and use that to predict values instead of directly using the predict() function. I do not consider this a full answer of the original question yet because predict() can also produce confidence intervals for the prediction, and I have not yet implemented that--question still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# obj: the segmented lm object
# nameU: names (class character) of 3rd and 4th coef, which are "U1.x" "U2.x" for lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
# nameV: names (class character) of 5th and 6th coef, which are "psi1.x" "psi2.x" for lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for lm with two breaks. Example: c(o$coef[3], o$coef[4])
# est.psi: the estimated break points (class numeric); these are the estimated breakpoints from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer

Simulating a mixed linear model and evaluating it with lmerTest in R

I am trying to understand how to use mixed linear models to analyse my data by simulating a model, but I can't reproduce the input parameters. What am I missing?
I want to start simulating a model with a random intercept for each subject. Here is the formula of what I want to simulate and reproduce:
If beta1 (<11) is small I find gamma00 as the intercept in fixed section, but I am completedly unaable to retrieve the slope (beta1). Also, the linear effect is not significant. Where is my conceptual mistake?
library(lmerTest)
# Generating data set
# General values and variables
numObj <- 20
numSub <- 100
e <- rnorm(numObj * numSub, mean = 0, sd = 0.1)
x <- scale(runif(numObj * numSub, min = -100, max = 100))
y <- c()
index <- 1
# Coefficients
gamma00 <- 18
gamma01 <- 0.5
beta1 <- -100
w <- runif(numSub, min = -3, max = 3)
uo <- rnorm(numSub, mean = 0, sd = 0.1)
meanBeta0 <- mean(gamma00 + gamma01*w + uo) # I should be able to retrieve that parameter.
for(j in 1:numSub){
for(i in 1:numObj){
y[index] <- gamma00 + gamma01*w[j]+ uo[j] + beta1*x[i] + e[index]
index <- index + 1
}
}
dataFrame2 <- data.frame(y = y, x = x, subNo = factor(rep(1:numSub, each = numObj)), objNum = factor(rep(1:numObj, numSub)))
model2 <- lmer(y ~ x +
(1 | subNo), data = dataFrame2)
summary(model2)
anova(model2)
No conceptual mistake here, just a mixed up index value: you should be using index rather than i to index x in your data generation loop.
Basically due to the mix-up you were using the first subject's x values for generating data for all the subjects, but using the individual x values in the model.

How to simulate data in R, such that p-value of regressor is exactly 0.05?

I have written a small function that simulates data from a normal distribution, how it is usual in linear models. My question is how to get a model with a pvalue of sim[, 1] == 0.05. I want to show that if I add a random variable even it is normal distributed around zero with small variance N(0,0.0023) , that pvalue of sim[,1] changes. The code below shows the true model.
set.seed(37) # seed for reproducability
simulation <- function(b_0, b_1,n,min_x_1 ,max_x_1,sd_e){
mat <- NA
x_1 <- runif(n = n, min = min_x_1, max =max_x_1)
error <- rnorm(mean = 0,sd = sd_e, n = n )
y <- b_0 + b_1*x_1 + error
mat <- matrix(cbind(x_1,y), ncol = 2)
return(mat)
#plot(mat[,1],mat[,2])
}
sim <- simulation(10,-2,10000,-10,70,0.003)
summary(lm(sim[,2] ~ sim[,1] ))

Estimating the Standard Deviation of a ratio using Taylor expansion

I am interested to build a R function that I can use to test the limits of the Taylor series approximation. I am aware that there is limits to what I am doing, but it's exactly those limits I wish to investigate.
I have two normally distributed random variables x and y. x has a mean of 7 and a standard deviation (sd) of 1. y has a mean of 5 and a sd of 4.
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
I know how to estimate the mean ratio of y/x, like this
# E(y/x) = E(y)/E(x) - Cov(y,x)/E(x)^2 + Var(x)*E(y)/E(x)^3
me.y/me.x - 0/me.x^2 + sd.x*me.y/me.x^3
[1] 1.328125
I am however stuck on how to estimate the Standard Deviation of the ratio? I realize I have to use a Taylor expansion, but not how to use it.
Doing a simple simulation I get
x <- rnorm(10^4, mean = 4, sd = 1); y <- rnorm(10^4, mean = 5, sd = 4)
sd(y/x)
[1] 2.027593
mean(y/x)[1]
1.362142
There is an analytical expression for the PDF of the ratio of two gaussians, done
by David Hinkley (e.g. see Wikipedia). So we could compute all momentums, means etc. I typed it and apparently it clearly doesn't have finite second momentum, thus it doesn't have finite standard deviation. Note, I've denoted your Y gaussian as my X, and your X as my Y (formulas assume X/Y). I've got mean value of ratio pretty close to the what you've got from simulation, but last integral is infinite, sorry. You could sample more and more values, but from sampling std.dev is growing as well, as noted by #G.Grothendieck
library(ggplot2)
m.x <- 5; s.x <- 4
m.y <- 4; s.y <- 1
a <- function(x) {
sqrt( (x/s.x)^2 + (1.0/s.y)^2 )
}
b <- function(x) {
(m.x*x)/s.x^2 + m.y/s.y^2
}
c <- (m.x/s.x)^2 + (m.y/s.y)^2
d <- function(x) {
u <- b(x)^2 - c*a(x)^2
l <- 2.0*a(x)^2
exp( u / l )
}
# PDF for the ratio of the two different gaussians
PDF <- function(x) {
r <- b(x)/a(x)
q <- pnorm(r) - pnorm(-r)
(r*d(x)/a(x)^2) * (1.0/(sqrt(2.0*pi)*s.x*s.y)) * q + exp(-0.5*c)/(pi*s.x*s.y*a(x)^2)
}
# normalization
nn <- integrate(PDF, -Inf, Inf)
nn <- nn[["value"]]
# plot PDF
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = function(x) PDF(x)/nn) + xlim(-2.0, 6.0)
print(p)
# first momentum
m1 <- integrate(function(x) x*PDF(x), -Inf, Inf)
m1 <- m1[["value"]]
# mean
print(m1/nn)
# some sampling
set.seed(32345)
n <- 10^7L
x <- rnorm(n, mean = m.x, sd = s.x); y <- rnorm(n, mean = m.y, sd = s.y)
print(mean(x/y))
print(sd(x/y))
# second momentum - Infinite!
m2 <- integrate(function(x) x*x*PDF(x), -Inf, Inf)
Thus, it is impossible to test any Taylor expansion for std.dev.
With the cautions suggested by #G.Grothendieck in mind: a useful mnemonic for products and quotients of independent X and Y variables is
CV^2(X/Y) = CV^2(X*Y) = CV^2(X) + CV^2(Y)
where CV is the coefficient of variation (sd(X)/mean(X)), so CV^2 is Var/mean^2. In other words
Var(Y/X)/(m(Y/X))^2 = Var(X)/m(X)^2 + Var(Y)/m(Y)^2
or rearranging
sd(Y/X) = sqrt[ Var(X)*m(Y/X)^2/m(X)^2 + Var(Y)*m(Y/X)^2/m(Y)^2 ]
For random variables with the mean well away from zero, this is a reasonable approximation.
set.seed(101)
y <- rnorm(1000,mean=5)
x <- rnorm(1000,mean=10)
myx <- mean(y/x)
sqrt(var(x)*myx^2/mean(x)^2 + var(y)*myx^2/mean(y)^2) ## 0.110412
sd(y/x) ## 0.1122373
Using your example is considerably worse because the CV of Y is close to 1 -- I initially thought it looked OK, but now I see that it's biased as well as not capturing the variability very well (I'm also plugging in the expected values of the mean and SD rather than their simulated values, but for such a large sample that should be a minor part of the error.)
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
myx <- me.y/me.x - 0/me.x^2 + sd.x*me.y/me.x^3
x <- rnorm(1e4,me.x,sd.x); y <- rnorm(1e4,me.y,sd.y)
c(myx,mean(y/x))
sdyx <- sqrt(sd.x^2*myx^2/me.x^2 + sd.y^2*myx^2/me.y^2)
c(sdyx,sd(y/x))
## 1.113172 1.197855
rvals <- replicate(1000,
sd(rnorm(1e4,me.y,sd.y)/rnorm(1e4,me.x,sd.x)))
hist(log(rvals),col="gray",breaks=100)
abline(v=log(sdyx),col="red",lwd=2)
min(rvals) ## 1.182698
All the canned delta-method approaches to computing the variance of Y/X use the point estimate for Y/X (i.e. m(Y/X) = mY/mX), rather than the second-order approximation you used above. Constructing higher-order forms for both the mean and the variance should be straightforward if possibly tedious (a computer algebra system might help ...)
mvec <- c(x = me.x, y = me.y)
V <- diag(c(sd.x, sd.y)^2)
car::deltaMethod(mvec, "y/x", V)
## Estimate SE
## y/x 1.25 1.047691
library(emdbook)
sqrt(deltavar(y/x,meanval=mvec,Sigma=V)) ## 1.047691
sqrt(sd.x^2*(me.y/me.x)^2/me.x^2 + sd.y^2*(me.y/me.x)^2/me.y^2) ## 1.047691
For what it's worth, I took the code in #SeverinPappadeux's answer and made it into a function gratio(mx,my,sx,sy). For the Cauchy case (gratio(0,0,1,1)) it gets confused and reports a mean of 0 (which should be NA/divergent) but correctly reports the variance/std dev as divergent. For the parameters specified by the OP (gratio(5,4,4,1)) it gives mean=1.352176, sd=NA as above. For the first parameters I tried above (gratio(10,5,1,1)) it gives mean=0.5051581, sd=0.1141726.
These numerical experiments strongly suggest to me that the ratio of Gaussians sometimes has a well-defined variance, but I don't know when (time for another question on Math StackOverflow or CrossValidated?)
Such approximations are unlikely to be useful since the distribution may not have a finite standard deviation. Look at how unstable it is:
set.seed(123)
n <- 10^6
X <- rnorm(n, me.x, sd.x)
Y <- rnorm(n, me.y, sd.y)
sd(head(Y/X, 10^3))
## [1] 1.151261
sd(head(Y/X, 10^4))
## [1] 1.298028
sd(head(Y/X, 10^5))
## [1] 1.527188
sd(Y/X)
## [1] 1.863168
Contrast that with what happens when we try the same thing with a normal random variable:
sd(head(Y, 10^3))
## [1] 3.928038
sd(head(Y, 10^4))
## [1] 3.986802
sd(head(Y, 10^5))
## [1] 3.984113
sd(Y)
## [1] 3.999024
Note: If you were in a different situation, e.g. the denominator has compact support, then you could do this:
library(car)
m <- c(x = me.x, y = me.y)
v <- diag(c(sd.x, sd.y)^2)
deltaMethod(m, "y/x", v)

R - Fitting a constrained AutoRegression time series

I have a time-series which I need to fit onto an AR (auto-regression) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two contraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the below code:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model and get coefficients
arfit <- ar(z,p=1, order.max=ceil(mean(step[,1])), aic=FALSE)
#check if sum of coefficients are 1
sum(arfit$ar)
[1] 0.5784978
My question is, how to add the constraint: sum of all coefficients = 1?
I looked at this question, but I do not realize how to use it.
**UPDATE**
I think I manage to solve my question as follow.
library(quadprog)
coeff <- arfit$ar
y <- 0
for (i in 1:length(coeff)) {
y <- y + coeff[i]*c(z[(i+1):length(z)],rep(0,i))
ifelse (i==1, X <- c(z[2:length(z)],0), X <- cbind(X,c(z[(i+1):length(z)],rep(0,i))))
}
Dmat <- t(X) %*% X
s <- solve.QP(Dmat , t(y) %*% X, matrix(1, nr=15, nc=1), 1, meq=1 )
s$solution
# The coefficients should sum up to 1
sum(s$solution)

Resources