Weighted censored regression in R?

I am very new to R (mostly program in SQL) but was faced with a problem that SQL couldn't help me with. I'll try to simplify the problem below.
Assume I have a set of data with 100 rows where each row has a different weight associated with it. Out of those 100 rows, 5 have an X value that is top-coded at 1000. Also assume that X can be represented by the linear equation X ~ Y + Z + U + 0 (I want a positive value, so no intercept).
Now, without taking the weights of each row of data into consideration, the formula I used in R was:
fit = censReg(X ~ Y + Z + U + 0, left = -Inf, right = 1000, data = dataset)
If I computed summary(fit) I would get 0 left-censored values, 95 uncensored values, and 5 right-censored values, which is exactly what I want, except that the weights haven't been taken into account. I checked the reference manual for the censReg function and it doesn't seem to accept a weights argument.
Is there something I'm missing about the censReg function or is there another function that would be of better use to me? My end goal is to estimate X in the cases where it is censored (i.e. the 5 cases where it is 1000).

You should use Tobit regression for this situation, it is designed specifically to linearly model latent variables such as the one you describe.
The regression accounts for your weights and the censored observations, which can be seen in the derivation of the log-likelihood function for the Type I Tobit (upper and lower bounded).
Tobit regression is available in the VGAM package via the vglm function with the tobit family function; vglm also accepts a weights argument for per-observation weights. An excellent example can be found here:
http://www.ats.ucla.edu/stat/r/dae/tobit.htm
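A minimal sketch of what that could look like for the question above, assuming the weights live in a column w of dataset (check ?VGAM::tobit for the exact censoring arguments):
library(VGAM)
# no left-censoring, right-censoring at 1000, no intercept, per-row prior weights
fit <- vglm(X ~ Y + Z + U + 0,
            family = tobit(Lower = -Inf, Upper = 1000),
            weights = w, data = dataset)
summary(fit)
fitted(fit) should then give model-based estimates of the latent (uncensored) X, including for the 5 top-coded rows.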

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individual. Based on observations, many balances are zero, but the lm predicts nonzero values for them.
To overcome this I created a new variable that results in zero if X and Y are true.
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This resulted in getting 90% of the zero results right. The problem is:
How can I place this variable in the lm so it correctly results in zeros when the proposition is true and the calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
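Since no reproducible example was given, here is a hypothetical one (all data made up) showing that the two specifications fit the same model:
set.seed(101)
d <- data.frame(Balzero = rbinom(40, 1, 0.5),
                Balance = runif(40, 0, 100))
d$Y <- 2.5 * d$Balzero * d$Balance + rnorm(40)   # true beta1 = 2.5
m1 <- lm(Y ~ -1 + Balzero:Balance, data = d)
m2 <- lm(Y ~ -1 + I(Balzero * Balance), data = d)
coef(m1)   # ~2.5
coef(m2)   # same estimate, different coefficient name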
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

Min timepoints to model longitudinal data with natural quadratic splines?

I'm new to applying splines to longitudinal data, so here comes my question:
I have some longitudinal data on growing mice at 3 timepoints: x, y and z months. It's known from the existing literature that growth trajectories in this type of data are usually better modeled in non-linear terms.
However, since I have only 3 timepoints, I wonder whether that is enough to apply a natural quadratic spline to the age variable in my lmer model.
Edit: I mean, is
lmer <- mincLmer(File ~ ns(Age, 2) * Genotype + Sex + (1 | Subj_ID), data, mask = mask)
a legitimate way to go?
I'm sorry if this is a stupid question - I'm just a lonely PhD student without supervision, and I would be super-grateful for any advice!!!
Marina
With the nls() function you can fit your data to whatever non-linear function you want. From the biological point of view, your data are probably described by a Gompertz-like (sigmoidal) function, but as you have only three time points, you can probably simplify that kind of function into an exponential one. Try the following:
fit_formula <- dependent_variable ~ a * exp(b * independent_variable)
result <- nls(formula = fit_formula, data = your_Dataset)
It will probably give you an error the first time, something like 'singular gradient matrix at initial parameter estimates'; if this happens, add the start argument, providing starting values for a and b closer to their true values. Remember that in your dataset the column names must match the variable names in the formula.
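For instance, a sketch with made-up numbers (three timepoints, rough eyeballed starting values):
age  <- c(1, 2, 3)                    # months
size <- c(10, 14, 20)                 # hypothetical growth measurements
d    <- data.frame(age, size)
fit  <- nls(size ~ a * exp(b * age), data = d,
            start = list(a = 8, b = 0.3))
summary(fit)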

MLE regression that accounts for two constraints

So I want to create a logistic regression that simultaneously satisfies two constraints.
The link here outlines how to use the Excel solver to maximize the log-likelihood value of a logistic regression, but I want to implement a similar function in R.
What I am trying to create in the end is an injury risk function. These take an S-shaped form, and the risk curves are calculated from the logistic equation
P(x) = 1 / (1 + exp(-(b0 + b1*x)))
Let's take some dummy data to begin with:
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)
df1 <- data.frame(A,B,C)
Let's assume A indicates whether a bone was broken, B is the bone density and C is the force applied.
So we can form a logistic regression model in which B and C explain the outcome variable A:
P(A = 1) = 1 / (1 + exp(-(b0 + b1*B + b2*C)))
or, in R:
glm(A ~ B + C, data = df1, family = binomial())
Now we want to make the first assumption: at zero force, we should have zero risk. This is further explained as assumption A1 on pg. 124 here.
Here we set A1 = 0.05 and solve the equation
A1 = 1 - (1 - P(0))^n
where P(0) is the probability of injury when the injury-related parameter is zero and n is the sample size. With n = 153, solving for P(0) gives
P(0) = 1 - (1 - A1)^(1/n) = 1 - 0.95^(1/153) ≈ 3.4E-4
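A quick check of that arithmetic in R:
1 - (1 - 0.05)^(1/153)   # 0.000335, i.e. 3.4E-4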
The second assumption is that we should maximize the log-likelihood function of the regression. That is, we want to maximize
LL = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
where p_i is estimated from the logistic equation above and y_i is the observed break/non-break outcome for each observation.
From what I understand, I have to use one of two functions in R to define a function for maximizing the LL: mle from the stats4 package or mle2 from the bbmle package.
I guess I need to write a function along these lines
log.likelihood.sum <- function(sequence, p) {
  # sum of Bernoulli log-likelihood contributions over the observed 0/1 outcomes
  sum(log(p) * (sequence == 1)) + sum(log(1 - p) * (sequence == 0))
}
But I am not sure where I should account for the first assumption. I.e., am I best to build it into the above code, and if so, how? Or would it be more efficient to write a secondary function to combine the two assumptions? Any advice would be great, as I have very limited experience in writing and understanding functions.
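One possible way to fold the first assumption in (a sketch, not from the original post: it pins the intercept at qlogis(P0) so the fitted risk is exactly P0 when B = C = 0, then maximizes the log-likelihood over the remaining coefficients with bbmle::mle2):
library(bbmle)
P0 <- 1 - (1 - 0.05)^(1/153)   # approx 3.4E-4, from assumption A1
negLL <- function(b1, b2) {
  # intercept fixed at qlogis(P0), so p = P0 when B = C = 0
  p <- plogis(qlogis(P0) + b1 * df1$B + b2 * df1$C)
  -sum(df1$A * log(p) + (1 - df1$A) * log(1 - p))
}
fit <- mle2(negLL, start = list(b1 = 0, b2 = 0))
summary(fit)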

Using glm in R to solve simple equation

I have some data from a poisson distribution and have a simple equation I want to solve using glm.
The mathematical equation is observed = y * expected.
I have the observed and expected data and want to use glm to find the optimal value of y which I need to multiply expected by to get observed. I also want to get confidence intervals for y.
Should I be doing something like this
glm(observed ~ expected + offset(log(expected)) + 0, family = 'poisson', data = dataDF)
Then taking the exponential of the coefficient? I tried this, but the value given is pretty different from what I get when I divide the sum of the observed by the sum of the expected, and I thought these should be similar.
Am I doing something wrong?
Thanks
Try this:
logFac <- coef( glm(observed ~ offset(log(expected)), family = 'poisson', data = dataDF) )
Fac <- exp( logFac[1] )   # that's the intercept term
That model is really observed ~ 1 + offset(log(expected)), and since it is estimated on the log scale, the exponentiated intercept becomes the conversion factor between 'expected' and 'observed'. (Note the offset must be log(expected), not expected, for this interpretation to hold with a log link.) The negative comments are evidence that you should have posted on CrossValidated.com, where general statistics methods questions are more welcome.
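A quick simulated check (made-up data). For an intercept-plus-offset Poisson model the score equation forces sum(observed) = exp(b0) * sum(expected), so the exponentiated intercept equals the ratio of sums exactly:
set.seed(42)
expected <- runif(200, 1, 10)
observed <- rpois(200, lambda = 1.7 * expected)   # true factor y = 1.7
dataDF   <- data.frame(observed, expected)
fit <- glm(observed ~ offset(log(expected)), family = 'poisson', data = dataDF)
exp(coef(fit)[1])                 # estimate of y
sum(observed) / sum(expected)     # identical value, in closed form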

Understanding lm Multiple R when fit is a horizontal line

I have been using R and the lm function to do linear regression and report R2.
y = c(1,2,3,4)
x = c(1,2,3,4)
f = lm(y~x)
r2 = summary(f)$r.squared
However, someone gave me this case-
y = c(1,1,1,1,1)
x = c(75,33,50,33,50)
Excel reports an intercept of 1, a coefficient of 0, and a Multiple R and r2 of 1.
R reports an intercept of 1, a coefficient on the order of 1e-17, and a Multiple R-squared of 0.3392.
Not being a statistician, I'm not understanding where lm() gets that number for Multiple R-squared from. Could someone help me out with an explanation?
If I change the data to
y = c(1,1,1,1,1)
x = c(1,1,1,1,1)
Excel still gives y = 1 + 0 * x r2 = 1
whereas lm() reports the slope as NA and does not report a Multiple R-squared.
While this seems like a unique case, I'm still being told my program that calls lm() does not work because it fails these tests and Excel gives the 'expected' answer.
Thanks
I thought I would summarize the very helpful, but long series of comments related to my original question which I will restate as: what is the appropriate value of r2 when y does not vary, i.e., the y data can be exactly fit to the equation y = c?
a. Excel reports an r2 of 1, which is what my users want, since the data is fit exactly.
b. The r2 value should reflect the fraction of variation accounted for by the model as compared with that accounted for by the null hypothesis, i.e., the mean. The equation is
R2 = 1 - SSR/SST
where SSR is the sum of the squared distances between the actual and model (predicted) values and SST is the sum of the squared distances between the actual and mean values.
When the data exactly fits a horizontal line, there is no deviation from the mean. So asking what proportion of the deviation is accounted for by the model is actually meaningless. From the equation, one is dividing 0 by 0.
The values reported by R are thus likely to be nothing more than round-off error in values that are effectively zero.
I should therefore check for this condition and report no R2 at all, rather than either reporting the number coming from R's lm (round-off noise) or the value Excel gives (1).
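A small sketch of that check (safe_r2 is a hypothetical helper name):
safe_r2 <- function(y, x) {
  if (var(y) == 0) return(NA_real_)   # constant y: SST = 0, so R2 is 0/0
  summary(lm(y ~ x))$r.squared
}
safe_r2(c(1, 2, 3, 4), c(1, 2, 3, 4))            # 1
safe_r2(c(1, 1, 1, 1, 1), c(75, 33, 50, 33, 50)) # NA instead of round-off noise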
