Error in linear model when values are 0 (R)

I have a data set that has names, value 1, and value 2. I need to run a regression and obtain the t-statistic for each of the names. I got help on StackOverflow in constructing the linear model. I noticed that sometimes the data for a name is all zeroes. That is fine and I want the model to keep running and not bomb. However, when the zeroes are in there, the linear model bombs.
v1<-rnorm(1:50)
v2<-rnorm(1:50)
data<-data.frame(v1,v2)
data[1:50,"nm"]<-"A"
data[50:100,"nm"]<-"B"
data[50:100,"v1"]<-0
data[50:100,"v2"]<-0
data<-data[c("nm","v1","v2")]
## run regression and generate universe
library(plyr)
plyrFunc <- function(x){
  mod <- lm(v1 ~ v2, data = x)
  return(summary(mod)$coefficients[2, 3])
}
lm <- ddply(data, .(nm), plyrFunc)
As you can see, for name B, since everything is 0, the model bombs. I cannot just remove all the 0's, because often the values really are 0. I don't know how to edit the above code so that it keeps going.
Can anyone let me know? Thank you!

The model itself actually works fine; it is the subsetting of summary(mod)$coefficients that throws the error, because that table contains only one row in the all-zeroes case:
> summary(lm(v1~v2,data[data$nm=="A",]))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1462766 0.1591779 -0.9189503 0.3628138
v2 -0.1315238 0.1465024 -0.8977590 0.3738900
> summary(lm(v1~v2,data[data$nm=="B",]))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0 0 NaN NaN
Thus, you need to modify your function to take this case into account:
plyrFunc <- function(x){
  mod <- lm(v1 ~ v2, data = x)
  res <- summary(mod)$coefficients
  if (nrow(res) > 1) res[2, 3] else NA
}
library(plyr)
result <- ddply(data, .(nm), plyrFunc)
Output for your sample data set:
nm V1
1 A -0.1825896
2 B NA
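If you prefer not to rely on positional indexing at all, a minimal variant (a sketch assuming the same data and plyr setup as above) extracts the t-statistic by row and column name and falls back to NA whenever the v2 row is missing from the coefficient table:

plyrFunc <- function(x){
  mod <- lm(v1 ~ v2, data = x)
  res <- summary(mod)$coefficients
  # "v2" only appears as a row when the slope could actually be estimated
  if ("v2" %in% rownames(res)) res["v2", "t value"] else NA
}
result <- ddply(data, .(nm), plyrFunc)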

Related

Fit data with multiple functions

I am trying to fit my data with multiple fitting functions, each involving several parameters. An example with two fitting functions is below. There are cases where, for a certain type, the fit is poor and I get a singular gradient error. I would like to ignore those cases and proceed anyway, and for the remaining types take the better of the two fitting functions by comparing their deviance. For instance, in this example the sum of squared residuals is smaller with the first function for both type1 and type2, i.e. sum(resid(myfitfun1)^2) < sum(resid(myfitfun2)^2), so the first function should be chosen for both types.
myfun1<-function(x,a,b){1/(1+exp(-(x/a)+b))}
myfun2<-function(x,a,b){1+b*exp(-(x)/a)}
mydata <- data.frame(v=c("type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1",
"type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1",
"type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1","type1",
"type1","type1","type1","type1","type1","type1","type1","type1","type2","type2","type2","type2",
"type2","type2"),
m=c(1.116975672,1.38641493,1.423833959,1.482008121,1.513588607,1.527179033,
1.543512539,1.555874185,1.607579807,1.721182154,1.729059048,1.748226681,
1.774814055,1.815147988,1.835638421,1.854582642,1.861972,1.887704144,
1.915360975,1.948689331,1.97516491,1.985962227,2.011310496,2.043716548,
2.068918438,2.091184665,2.120366813,2.126865141,2.148241318,2.15871161,
2.193529738,2.256197915,2.302364722,2.316381935,2.31909683,2.325213451,
2.336299128,2.410419652,2.473160411,2.478302702,2.5238233,2.651124474,
2.70470831,2.927536062,-0.1736072,0.1235610,0.5848941,0.9016486,0.9744832,
1.2767238),
n=c(0.022727273,0.045454545,0.068181818,0.090909091,0.113636364,0.136363636,
0.159090909,0.181818182,0.204545455,0.227272727,0.25,0.272727273,0.295454545,
0.318181818,0.340909091,0.363636364,0.386363636,0.409090909,0.431818182,
0.454545455,0.477272727,0.5,0.522727273,0.545454545,0.568181818,0.590909091,
0.613636364,0.636363636,0.659090909,0.681818182,0.704545455,0.727272727,0.75,
0.772727273,0.795454545,0.818181818,0.840909091,0.863636364,0.886363636,
0.909090909,0.931818182,0.954545455,0.977272727,1,0.1666667,0.3333333,0.5000000,
0.6666667,0.8333333,1))
myfitfun1 <- nls(n~myfun1(m,a,b),mydata,start=list(a=1,b=1))
myfitfun2 <- nls(n~myfun2(m,a,b),mydata,start=list(a=1,b=1))
I would like to program it so that it automatically chooses the better fit between the two functions for each type, ignoring a function whenever it throws an error. Any help is appreciated.
You could put both fits inside a function and work with tryCatch. In the first pair of tryCatch calls, just return NA to get past a fitting error. In the second tryCatch, set the residual sum of squares to Inf when an error occurs, so that the non-failing function always wins as the "better" fit; in normal cases the minimum is chosen. With `attr<-` we attach that residual sum of squares (stored here under the attribute name "MSE") to the returned fit.
fun <- function(data) {
  myfitfun1 <- tryCatch(
    nls(n ~ myfun1(m, a, b), data, start = list(a = 1, b = 1)),
    error = function(e) NA)
  myfitfun2 <- tryCatch(
    nls(n ~ myfun2(m, a, b), data, start = list(a = 1, b = 1)),
    error = function(e) NA)
  L <- list(myfitfun1, myfitfun2)
  res <- sapply(L, function(x) {
    tryCatch(sum(resid(x)^2), error = function(e) Inf)
  })
  `attr<-`(L[[which.min(res)]], "MSE", min(res))
}
fun(mydata)
# Nonlinear regression model
# model: n ~ myfun1(m, a, b)
# data: data
# a b
# 0.3465 5.6461
# residual sum-of-squares: 2.323
#
# Number of iterations to convergence: 26
# Achieved convergence tolerance: 7.675e-06
To get the MSE attribute, use:
attr(fun(mydata), "MSE")
# [1] 2.322945
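The question data contains two types in the v column, so to let the error handling work per group you could (as a sketch building on the fun and mydata defined above) split the data frame on v and apply fun to each piece; if neither function converges for a type, that list entry is simply NA carrying an Inf attribute:

fits <- lapply(split(mydata, mydata$v), fun)   # one chosen fit per type
sapply(fits, attr, "MSE")                      # residual sum of squares of each chosen fit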

How to find the minimum floating-point value accepted by betareg package?

I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to replace them with the closest possible values inside the interval, something like 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives me 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I get a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 does it run without errors. Is there any way I can push these replacement values closer to 0 and 1 with betareg? Or should I just be happy with that?
Try it with eps (the displacement from 0 and 1) first equal to 1e-4 (as you have here) and then to 1e-3. If the results of the two models don't differ in any way you care about, that's great. If they do differ, you need to be very careful, because it suggests your answers will be very sensitive to this assumption.
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameters don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which zeros and ones arise, and model that process appropriately, e.g.
a censored-data model: zeros/ones arise through a minimum/maximum detection threshold, so model the zero/one values as actually lying somewhere in the tails; or
a hurdle/zero-one-inflated model: zeros and ones arise through a separate process from the rest of the data, so use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a beta regression on the (0,1) component.
Questions about these steps are probably more appropriate for CrossValidated than for SO.
sample data
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob=0.5, d=1) {
  rbeta(n, shape1=prob*d, shape2=(1-prob)*d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
trial fitting function
ss <- function(eps) {
  dd <- transform(dd,
                  y = pmin(1-eps, pmax(eps, y)))
  m <- try(betareg(y~x, data=dd))
  if (inherits(m, "try-error")) return(rep(NA, 3))
  return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
try it for a range of values
evec <- seq(-4,-1,length=51)
res <- t(sapply(evec, function(e) ss(10^e)) )
library(ggplot2)
ggplot(data.frame(e=10^evec, reshape2::melt(res)),
       aes(e, value, colour=Var2)) +
  geom_line() + scale_x_log10()
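A common alternative to clamping only the endpoint observations is the squeeze transformation of Smithson and Verkuilen (2006), often recommended alongside beta regression: compress the whole response slightly toward 0.5 so that every value lands strictly inside (0,1). A minimal sketch on the simulated dd from above (y_sq is an assumed column name, not part of the original example):

n <- nrow(dd)
dd$y_sq <- (dd$y * (n - 1) + 0.5) / n   # maps [0,1] into (0,1)
betareg(y_sq ~ x, data = dd)

As with the eps approach, it is worth checking how sensitive the fitted coefficients are to this choice.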

Extracting ARIMA coefficients for use in custom function

I want to run a customized function based on an ARIMA model. The function takes the ma3 coefficient from an ARIMA(2, 0, 3) model fitted to one year of daily data and subtracts it from 2, for every firm. I have 5 years of daily data for five firms, so each firm should end up with 5 year-wise values. My code:
> Stressy = function(x) 2 - summary(arima(x, order=c(2,0,3)))$coefficients[1, "ma3"]
> Funny = aggregate(cbind(QQ) ~ Year + Firm, df, FUN = Stressy)
Running my code gives the following error:
Error in summary(arima(x, order = c(2, 0, 3)))$coefficients : $ operator is invalid for atomic vectors
I know the result could be worked out manually, but my data set is large enough that doing so would be confusing and error-prone. Please suggest an edit to fix this.
There are two ways you could get the ma3 coefficient:
Stressy <- function(x) 2-coef(arima(x, order=c(2,0,3)))["ma3"]
or
Stressy <- function(x) 2-arima(x, order=c(2,0,3))$coef["ma3"]
Your original custom function didn't work because summary(arima_object) gives you a table, to which you cannot apply the $ operator:
x <- arima(df, c(2,0,3))
class(summary(x))
[1] "summaryDefault" "table"

Confidence interval for regression error in R

Can someone please explain what I am doing wrong here. I want to find a confidence interval for the average response of my variable list1. R has an example online using the 'faithful' dataset and it works fine. However, whenever I try to find a confidence/prediction interval, I ALWAYS get this error message. I have been at this for 5 hours and tried a million different things; nothing works.
> list1 <- c(1,2,3,4,5) #first data set
> list2 <- c(2,4,5,6,7) # second data set
> frame <- data.frame(list1,list2) # made a data.frame object
> reg <- lm(list1~list2,data=frame) # regression
> newD = data.frame(list1 = 2.3) #new data input for confidence/prediction interval estimation
> predict(reg,newdata=newD,interval="confidence")
fit lwr upr
1 0.7297297 -0.08625234 1.545712
2 2.3513514 1.88024388 2.822459
3 3.1621622 2.73210185 3.592222
4 3.9729730 3.45214407 4.493802
5 4.7837838 4.09033237 5.477235
Warning message:
'newdata' had 1 row but variables found have 5 rows #Why does this keep happening??
The problem is that the new data frame you pass to predict() names the dependent variable (list1) rather than the predictor (list2), so predict() finds no new value for list2 and falls back to the 5 rows of the original data. The formula syntax in the regression is y ~ x; when you use the predict() function, the new data must supply values for the independent (x) variables. See the Details section of ?predict for more details.
This however seems to work:
newD2 = data.frame(list2 = 2.3) # note the name is list2 and not list1
predict(reg, newdata = newD2, interval = "confidence")
        fit       lwr    upr
1  0.972973 0.2194464 1.7265
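If you want an interval for a single new observation rather than for the mean response, the same call works with interval = "prediction"; the resulting interval is wider because it also accounts for the residual error:

predict(reg, newdata = newD2, interval = "prediction")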

ML estimation of Rician distribution parameters in R

I have data samples arranged in a 1000 x 56 array, and I would like to extract the parameters of a Rician distribution that best fits the data in each column. I am using the VGAM package, which seems like a perfect fit, and given the example in the documentation for riceff
vee = exp(2); sigma = exp(1);
y = rrice(n <- 1000, vee, sigma)
fit = vglm(y ~ 1, riceff, trace=TRUE, crit="c")
I figured the following code would work without a problem
nu <- rep(-1, ncol(data))
sigma <- rep(-1, ncol(data))
for (coln in seq(ncol(data))) {
  fdata <- c(data[, coln])
  fit <- vglm(fdata ~ 1, riceff, trace = TRUE, crit = "c")
  sigma[coln] <- matrix(Coef(fit)[1])[1, 1]
  nu[coln] <- matrix(Coef(fit)[2])[1, 1]
}
but instead I get the error
VGLM linear loop 1 : coefficients = -723936.834084, 598.301767
Error in if ((temp <- sum(wz[, 1:M, drop = FALSE] < wzepsilon))) warning(paste(temp, :
argument is not interpretable as logical
as for my data, I ran some basic checks
> is.matrix(data)
[1] TRUE
> dim(data)
[1] 1000 56
> summary(data)
V1
Min. :1.402e-05
1st Qu.:9.533e-04
Median :1.548e-03
Mean :1.640e-03
3rd Qu.:2.175e-03
Max. :4.657e-03
... (omitted for brevity)
V56
Min. :5.252e-05
1st Qu.:1.125e-03
Median :1.692e-03
Mean :1.776e-03
3rd Qu.:2.293e-03
Max. :5.903e-03
None of the information in the summary indicates that there is a NaN hidden somewhere, so I am at a loss as to why vglm is failing.
Does anyone have an idea as to what may be the problem? Any insight is greatly appreciated.
As suggested by Ben Bolker, here is the "solution" to my own problem (for future reference):
The vglm function in the VGAM package does not necessarily behave well for all data inputs. Since a lot of my data is close to being Rayleigh distributed (the limit of the Rice distribution as vee goes to zero), the command just exits with that bizarre error (the Koay inversion also fails, for similar reasons I assume). If I fit my data against a generalized Rayleigh distribution via genrayleigh instead, everything works well enough.
One way to try both, as Ben suggested, is to use try or tryCatch and to emit NA values when the fitting function breaks down:
tryCatch({
  fit <- vglm(fdata ~ 1, riceff, trace = TRUE, crit = "c")
  # extract fit parameters here
  # ...
}, error = function(ex) {
  # insert NA values into your results here
  # ...
})
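Putting the two pieces together, here is a sketch of the original column loop with the tryCatch guard in place; columns where vglm() breaks down simply keep NA in sigma and nu (the ordering of the entries of Coef(fit) follows the question's original code):

library(VGAM)
nu    <- rep(NA_real_, ncol(data))
sigma <- rep(NA_real_, ncol(data))
for (coln in seq(ncol(data))) {
  fdata <- data[, coln]
  fit <- tryCatch(vglm(fdata ~ 1, riceff, trace = TRUE, crit = "c"),
                  error = function(ex) NULL)
  if (!is.null(fit)) {
    sigma[coln] <- Coef(fit)[1]
    nu[coln]    <- Coef(fit)[2]
  }
}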
