I would like to estimate the coefficients of a nonlinear model with a binary dependent variable. The nonlinearity arises because two regressors, A and B, depend on a subset of the dataset and on the two parameters lambda1 and lambda2 respectively:
y = alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X + epsilon
where, for each observation i, A(lambda1) is a weighted sum of the R columns:
A_i(lambda1) = sum_{k=1}^{a_i - 1} w_k * R_{i,k},   with   w_k = (a_i - k)^lambda1 / sum_{j=1}^{a_i - 1} (a_i - j)^lambda1
where a and the R columns are variables in the data.frame (see the function A below). The regressor B(lambda2) is defined in a similar way, using the C columns.
Moreover, I need to include what in Stata are known as pweights, i.e. survey weights or sampling weights. For this reason, I'm working with the R package survey by Thomas Lumley.
First, I create a function for A (and B), i.e.:
A <- function(l1){
  R <- as.matrix(data[, 1:80])
  a <- data[, 169]
  N <- length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai  <- rep(a[i], a[i] - 1)   # vector of a(i)
    k   <- 1:(a[i] - 1)          # numbers from 1 to a(i)-1
    num <- (ai - k)^l1
    den <- sum((ai - k)^l1)
    w   <- num/den
    w   <- c(w, rep(0, dim(R)[2] - length(w)))
    var[i] <- R[i, ] %*% w
  }
  return(var)
}
B <- function(l2){
  C <- as.matrix(data[, 82:160])
  a <- data[, 169]
  N <- length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai  <- rep(a[i], a[i] - 1)   # vector of a(i)
    k   <- 1:(a[i] - 1)          # numbers from 1 to a(i)-1
    num <- (ai - k)^l2
    den <- sum((ai - k)^l2)
    w   <- num/den
    w   <- c(w, rep(0, dim(C)[2] - length(w)))
    var[i] <- C[i, ] %*% w
  }
  return(var)
}
But the problem is that I don't know how to include the nonlinear regressors in the model (or in the survey design, using the function svydesign):
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
Indeed, when I try to estimate the model:
# loglikelihood function:
LLsvy <- function(y, model, lambda1, lambda2){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1 - y) * log(1 - pnorm(model))
  LL <- aux1 + aux2
  return(LL)
}
fit <- svymle(loglike = LLsvy,
              formulas = list(~y, model = ~ A(lambda1) + B(lambda2) + X,
                              lambda1 = ~1, lambda2 = ~1),
              design = d_test,
              start = list(c(0, 0, 0, 0), c(lambda1 = 11), c(lambda2 = 8)),
              na.action = "na.exclude")
I get the error message:
Error in eval(expr, envir, enclos) : object 'lambda1' not found
I think the problem lies in how the nonlinear part is included, because everything works fine if I fix A and B at some lambda1 and lambda2 (so that the model becomes linear):
lambda1=11
lambda2=8
data$A <- A(lambda1)
data$B <- B(lambda2)
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
LLsvylin <- function(y, model){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1 - y) * log(1 - pnorm(model))
  LL <- aux1 + aux2
  return(LL)
}
fitlin <- svymle(loglike = LLsvylin,
                 formulas = list(~y, model = ~ A + B + X),
                 design = d_test,
                 start = list(0, 0, 0, 0),
                 na.action = "na.exclude")
On the contrary, if I don't use the sampling weights, I can easily estimate my nonlinear model using the function mle from package stats4 or the function mle2 from package bbmle.
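For reference, a rough sketch of what the unweighted fit looks like with mle2 (data$y and data$X stand in here for my actual outcome and control columns):

library(bbmle)

# negative log-likelihood of the probit model; A() and B() are recomputed
# at every evaluation of lambda1 and lambda2
nll <- function(alpha, beta1, beta2, delta, lambda1, lambda2) {
  eta <- alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * data$X
  -sum(data$y * pnorm(eta, log.p = TRUE) +
       (1 - data$y) * pnorm(eta, log.p = TRUE, lower.tail = FALSE))
}

fit_unweighted <- mle2(nll, start = list(alpha = 0, beta1 = 0, beta2 = 0,
                                         delta = 0, lambda1 = 11, lambda2 = 8))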
To sum up: how can I use sampling weights (via svymle) while estimating a nonlinear model (which I can do without weights using mle or mle2)?
=========================================================================
A problem with the nonlinear part of the model also arises when using the function svyglm (with lambda1 and lambda2 fixed, in order to get good starting values for svymle):
lambda1 <- 11
lambda2 <- 8
model0 <- y ~ A(lambda1) + B(lambda2) + X
probit1 <- svyglm(formula = model0,
                  data = data,
                  family = binomial(link = probit),
                  design = d_test)
Because I get the error message:
Error in svyglm.survey.design(formula = model0, data = data, family = binomial(link = probit), :
all variables must be in design= argument
This isn't what svymle does -- it's for generalised linear models, which have linear predictors and a potentially complicated likelihood or loss function. You want non-linear weighted least squares, with a simple loss function but complicated predictors.
There isn't an implementation of design-weighted nonlinear least squares in the survey package, probably because no-one has previously asked for one. You could try emailing the package author.
The upcoming version 4 of the survey package will have a function svynls, so if you know how to fit your model without sampling weights using nls you will be able to fit it with sampling weights.
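For future reference, a hedged sketch of what such a call might look like, assuming svynls() takes an nls-style formula plus design= and start= arguments (check ?svynls once that version is installed); X again stands in for the other regressors:

## sketch only: svynls() assumed to accept an nls-style formula, design= and start=
fit_nls <- svynls(y ~ pnorm(alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X),
                  design = d_test,
                  start = list(alpha = 0, beta1 = 0, beta2 = 0, delta = 0,
                               lambda1 = 11, lambda2 = 8))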
Related
I used the "systemfit" function in R to estimate a 2SLS model as it allows to specify the first and second stage separately which is important for my estimation. I need robust standard errors but I am having problems replicating the robust standard errors obtained when estimating the model using "iv_robust" or "ivreg" in combination with "coeftest".
I tried to calculate the robust standard errors manually as follows:
# first- and second-stage formulas
firststagehfsystem <- H ~ HL1
secondstageivsystem <- Index ~ H
system <- list(firststagehfsystem, secondstageivsystem)
inst <- ~ HL1

# 2SLS via systemfit
ivhfsystem <- systemfit(system, method = "2SLS", inst = inst, data = DataControls)

# second-stage coefficients and (non-robust) standard errors
coef <- ivhfsystem$eq[[2]]$coefficients
vcov <- ivhfsystem$eq[[2]]$coefCov
se <- sqrt(diag(vcov))

# manual HC1 sandwich: (X'X)^-1 X' diag(n/(n-k) * u^2) X (X'X)^-1
residuals <- ivhfsystem$eq[[2]]$residuals
n <- nrow(DataControls)
k <- length(coef)
X <- model.matrix(ivhfsystem$eq[[2]])
hc1 <- (n/(n - k)) * residuals^2
vce_hc1 <- solve(t(X) %*% X) %*% (t(X) %*% (hc1 * diag(n)) %*% X) %*% solve(t(X) %*% X)
sqrt(diag(vce_hc1))
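For comparison, the benchmark I am trying to match looks roughly like this (a sketch assuming the AER, lmtest and sandwich packages are installed):

library(AER)       # ivreg()
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

# 2SLS with HC1 heteroskedasticity-robust standard errors
ivfit <- ivreg(Index ~ H | HL1, data = DataControls)
coeftest(ivfit, vcov = vcovHC(ivfit, type = "HC1"))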
Can anyone point out my mistake? I have googled for two days and could not find a solution. Thank you!
I'm fitting a power model to a dataset by applying a simple linear model with the R function lm after a log-log transformation, as in the example below (instead of fitting the power model directly, for example with the nls function).
I could use the function predict.lm to apply the model on new data and calculate prediction intervals.
data(stackloss)
dat <- stackloss[c(2, 4)]
colnames(dat) <- c("x", "y")

dat.lm <- lm(log(y) ~ log(x), data = dat)

new <- data.frame(x = seq(0, 30, 1))
pred <- predict.lm(dat.lm, new, interval = "prediction", level = 0.95)

matplot(new$x, exp(pred), type = "l", col = 1, lty = c(1, 2, 2))
points(dat$x, dat$y)
Now, I need to sum n predicted values (which is straightforward, after applying the 'exp' function) and also to calculate the aggregated variance and prediction intervals.
The latter has been described for a simple linear model in the following Q&A: linear model with `lm`: how to get prediction variance of sum of predicted values.
In that interesting answer, the following functions were introduced for the simple linear model: lm_predict (which computes the complete variance-covariance matrix of the predicted values) and agg_pred.
lm_predict <- function (lmObject, newdata, diag = TRUE) {
  ## input checking
  if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
  ## extract "terms" object from the fitted model, but delete response variable
  tm <- delete.response(terms(lmObject))
  ## linear predictor matrix
  Xp <- model.matrix(tm, newdata)
  ## predicted values by direct matrix-vector multiplication
  pred <- c(Xp %*% coef(lmObject))
  ## efficiently form the complete variance-covariance matrix
  QR <- lmObject$qr   ## qr object of fitted model
  piv <- QR$pivot     ## pivoting index
  r <- QR$rank        ## model rank / numeric rank
  if (is.unsorted(piv)) {
    ## pivoting has been done
    B <- forwardsolve(t(QR$qr), t(Xp[, piv]), r)
  } else {
    ## no pivoting is done
    B <- forwardsolve(t(QR$qr), t(Xp), r)
  }
  ## residual variance
  sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
  if (diag) {
    ## return point-wise prediction variance
    VCOV <- colSums(B ^ 2) * sig2
  } else {
    ## return full variance-covariance matrix of predicted values
    VCOV <- crossprod(B) * sig2
  }
  list(fit = pred, var.fit = VCOV, df = lmObject$df.residual, residual.var = sig2)
}
agg_pred <- function (w, predObject, alpha = 0.95) {
  ## input checking
  if (length(w) != length(predObject$fit)) stop("'w' has wrong length!")
  if (!is.matrix(predObject$var.fit)) stop("'predObject' has no variance-covariance matrix!")
  ## mean of the aggregation
  agg_mean <- c(crossprod(predObject$fit, w))
  ## variance of the aggregation
  agg_variance <- c(crossprod(w, predObject$var.fit %*% w))
  ## adjusted variance-covariance matrix
  VCOV_adj <- with(predObject, var.fit + diag(residual.var, nrow(var.fit)))
  ## adjusted variance of the aggregation
  agg_variance_adj <- c(crossprod(w, VCOV_adj %*% w))
  ## t-distribution quantiles
  Qt <- c(-1, 1) * qt((1 - alpha) / 2, predObject$df, lower.tail = FALSE)
  ## names of CI and PI
  NAME <- c("lower", "upper")
  ## CI
  CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
  ## PI
  PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
  ## return
  list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
}
However, these cannot be applied directly to properly aggregate the variance in the case of the log-log regression. Maybe I should transform the variance in the output of lm_predict, but I couldn't figure out how to proceed.
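For instance, applying them on the log scale works, but exponentiating the aggregated log-scale result gives the product, not the sum, of the back-transformed predictions (a sketch with a hypothetical newdata new2 that avoids x = 0):

## aggregation on the log scale only; new2 is a hypothetical newdata without x = 0
new2 <- data.frame(x = c(10, 20, 30))
oo <- lm_predict(dat.lm, new2, diag = FALSE)
agg_pred(rep(1, nrow(new2)), oo)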
Thank you in advance for any help.
I am fitting a linear mixed effects model using lme4:
library(lme4)
data(Orthodont, package = "nlme")
dent <- Orthodont
d.test <- lmer(distance ~ age + (1|Subject), data = dent)
If we say generically Y = X * B + Z * d + e is the form of a linear mixed effects model, then I am trying to get Var(Y) = Z * Var(d) * Z^t + Var(e) from the results of the model.
Is the following formulation the right way to do this?
k <- table(dent$Subject)[1]
vars <- VarCorr(d.test)
v <- as.data.frame(vars)
sigma <- attr(vars, "sc")
s.tech <- diag(v$vcov[1], nrow=k)
icc <- v$vcov[1]/sum(v$vcov)
s.tech[upper.tri(s.tech)] <- icc
s.tech[lower.tri(s.tech)] <- icc
sI <- diag(sigma^2, nrow=length(dent$age))
var.b <- kronecker(diag(1, nrow=length(dent$age)/k), s.tech)
var.y <- sI + var.b
I think this is a simple question, but I can't find code for doing this anywhere, so I'm asking whether I'm doing it right.
You can do this a bit more easily if you know about getME(), which is a general-purpose extract-bits-of-a-lmer-fit function. In particular, you can extract the transposed Z matrix (getME(., "Zt")) and the transposed Lambda matrix; the Lambda matrix is the Cholesky factor of the scaled variance-covariance matrix of the conditional modes (BLUPs). In your notation, Var(d) is the residual variance times the cross-product of Lambda.
The answer cited here is pretty good but the answer below is slightly more general (it should work for any lmer fit).
Fit model:
library(lme4)
data(Orthodont,package="nlme")
d.test <- lmer(distance ~ age + (1|Subject), data=Orthodont)
Extract components:
var.d <- crossprod(getME(d.test,"Lambdat"))
Zt <- getME(d.test,"Zt")
vr <- sigma(d.test)^2
Combine them:
var.b <- vr*(t(Zt) %*% var.d %*% Zt)
sI <- vr * Diagonal(nrow(Orthodont))
var.y <- var.b + sI
A picture:
image(var.y)
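As a quick sanity check for the random-intercept case, every diagonal element of var.y should equal the subject-intercept variance plus the residual variance:

## diagonal of Var(Y) = intercept variance + residual variance for every observation
vc <- as.data.frame(VarCorr(d.test))
all.equal(unname(diag(as.matrix(var.y))),
          rep(sum(vc$vcov), nrow(Orthodont)))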
I am attempting to fit this model to multivariate time series data using the package KFAS in R:
y_t = Zx_t + a + v_t, v_t ~ MVN(0,R)
x_t = x_(t-1) + w_t, w_t ~ MVN(0,Q)
This is a dynamic factor model. I also need to estimate some parameters, namely the matrix of factor loadings Z and the variance-covariance matrix of the observation disturbances, R. I am well aware that this type of model can be run using the MARSS package, but I still need a more flexible package, since I will modify the state equations later on (to include a seasonal decomposition).
This is the code I used (with simulated data in place of the actual data I intend to use):
library(KFAS)
library(mAr)
set.seed(100)
w=c(0.25,0.1)
C=rbind(c(1,0.5),c(0.5,1.5))
A=rbind(c(0.1,0,0,0),c(0.3,0,0,0))
data=as.matrix(mAr.sim(w,A,C,N=300))
N.ts = dim(data)[2]
N.ls = 1
#ASSUMING 1 FACTOR
Z.vals = matrix(NA,N.ts,N.ls)
Zt = matrix(Z.vals, nrow=N.ts, ncol=N.ls, byrow=TRUE) #MATRIX OF LOADINGS, N X P
Ht <- diag(NA,N.ts) #VAR-COV MATRIX OF OBS ERROR, N x N
Tt <- diag(N.ls) #SLOPE OF LATENT STATE AT T-1, P X P
Rt <- diag(N.ls) #SLOPE OF THE LATENT STATE DISTURBANCES, P X P
Qt <- diag(N.ls) #VAR-COV MATRIX OF THE LATENT STATE DISTURBANCES, P X P
ss_model <- SSModel(data ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt, Q = Qt),
                    H = Ht)

objf <- function(pars, model, estimate = TRUE) {
  model$Z[1] <- pars[1]
  model$H[1] <- pars[2]
  if (estimate) {
    -logLik(model)
  } else {
    model
  }
}

opt <- optim(par = rep(1, 50), fn = objf, method = "L-BFGS-B",
             model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)

updatefn <- function(pars, model) {
  model$Z[1] <- pars[1]
  model$H[1] <- pars[2]
  model
}

fit <- fitSSM(ss_model, rep(1, 50), updatefn, method = "L-BFGS-B")
If I look at the model specification, it seems correct to me:
Call:
SSModel(formula = data ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
State space model object of class SSModel
Dimensions:
[1] Number of time points: 300
[1] Number of time series: 2
[1] Number of disturbances: 1
[1] Number of states: 1
Names of the states:
[1] custom1
Distributions of the time series:
[1] gaussian
Object is a valid object of class SSModel.
However, it returns this error message:
Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)), :
System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07
I hope someone can guide me in doing this. Thanks a lot!
You could look up how a variance-covariance matrix is computed on Google. For the seasonal component, a'a is applied after finding the deviations of the scores matrix, a = A - 1 1' A (1/n), where 1 denotes a vector of N ones; here p = 1 and n is the number of rows (two rows of data), which gives the variance. The variances are the diagonal elements of that matrix, and NA stands for a variance whose value is unknown: since we don't know the missing elements of the matrix, for a vector of scores we fill the latent ones with T - 1 = N*p ones, where p = 1. Also try searching for the error message together with SSModel.
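More concretely, the error message says that the system matrices other than Z still contain NA, so the update function has to fill in every unknown entry rather than just the first element of Z and H. A hedged sketch (the exp() transform on the H diagonal is an assumption, used to keep the variances positive):

## fill both loadings in Z and both diagonal elements of H (4 parameters in total)
updatefn2 <- function(pars, model) {
  model$Z[1:2] <- pars[1:2]
  diag(model$H[, , 1]) <- exp(pars[3:4])
  model
}

fit2 <- fitSSM(ss_model, inits = rep(0, 4), updatefn = updatefn2,
               method = "L-BFGS-B")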
I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){                       # We will recalculate 1,000 times
  for (j in 1:nrow(data)){                  # Each time we go through each example.
    if (h[j] * data[j, 4] < 0){             # If the hypothesis disagrees with the sign of y,
      theta = theta + (sign(data[j, 4]) * data[j, 1:3])  # we + or - the example from theta.
    } else {
      theta = theta                         # Else we let it be.
    }
  }
  h = theta %*% t(data[, 1:3])              # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with the 89% calculated with the glm() logistic regression function via mean(sign(predict(fit)) == data[,4]). Logically, there is no way to classify all of the points linearly, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just 1 iteration:
Since this is in line with the training classification performance of the logistic regression, the code is likely not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're using a binomial distribution and logit-link based on a sigmoid cost function. The coefficients are only meaningful in this context. You've also got an intercept term in the logit.
None of this is true for the perceptron model. The interpretation of the coefficients is thus totally different.
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.
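For example, a rough sketch of a holdout comparison, reusing dat from the question (the 70/30 split and the 0.5 cutoff are arbitrary choices here):

set.seed(1)
idx   <- sample(nrow(dat), round(0.7 * nrow(dat)))  # 70/30 train/test split
train <- dat[idx, ]
test  <- dat[-idx, ]

# logistic regression trained on the training set, accuracy on the holdout
fit_tr  <- glm(y ~ test1 + test2, family = "binomial", data = train)
acc_glm <- mean((predict(fit_tr, newdata = test, type = "response") > 0.5) == test$y)

# the perceptron would be trained on 'train' in the same way and its
# holdout accuracy compared with acc_glm
acc_glm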