I have a data set which I need to fit to two quadratic equations:
f1(x) = a*x + b*x^2
f2(x) = b*x^2
Is there a way to estimate the error that takes into account both the standard error in the measurement and the error in the curve fitting?
I guess you mean that "error due to measurement" is the distribution of the measured values around the "true" predicted values by some physical law, and "error in curve fitting" is caused by fitting the data to a model that does not fully capture the physical law.
There is no way to know which kind of error you are seeing unless you already know the physical law. For example:
Suppose you have a perfect amplifier whose transfer function is Vo = Vi^2. You input a range of voltages Vi and measure the output Vo for each.
If you fit a quadratic to the data, you know that any error is caused by measurement.
If you fit a line to the data, your error is caused by both measurement and your choice of curve fitting. But you'd have to know that the behavior is actually quadratic in order to measure the error source. And you'd do it by... fitting a quadratic.
In the real world, nothing ever behaves perfectly, so you're always stuck with your best approximation to the physical reality.
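If it helps, both of the models in the question are linear in the parameters a and b, so in R a sketch along these lines would fit them and report a standard error for each coefficient (a data frame d with columns x and y is assumed):
f1 <- lm(y ~ x + I(x^2) - 1, data = d)   # f1(x) = a*x + b*x^2, no intercept
f2 <- lm(y ~ I(x^2) - 1, data = d)       # f2(x) = b*x^2
summary(f1)    # estimates of a and b with their standard errors
summary(f2)    # estimate of b with its standard error
anova(f2, f1)  # does adding the linear term improve the fit?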
If you have errors in your predictor measurements as well as in your response variable, you might try fitting your models using Orthogonal Regression. There's a demo illustrating exactly this process that ships as part of MATLAB's Statistics Toolbox.
I'm trying to run GAMs to analyze some temperature data. I have remote cameras and external temperature loggers, and I'm trying to model the difference in the temperatures recorded by them (camera temperature - logger temperature). Most of the time the cameras record higher temperatures, but sometimes the logger returns the higher temperature, in which case the difference ends up being a negative value. The direction of the difference is something I care about, so I need to allow negative values in the response. My explanatory variables are percent canopy cover (quantitative), direct and diffuse radiation (quantitative), and camera direction (ordered factor) as fixed effects, as well as the camera/logger pair (factor) as a random effect.
I had mostly been using the gam() function in mgcv to run my models. I'm using a scat distribution since my data is heavy-tailed. My model code is as follows:
gam(f1, family = scat(link = "identity"), data = d)
I wanted to try using bam() since I have 60,000 data points (one temperature observation per hour of the day for several months). The gam() models run fine, though they take a while to run. But the exact same model formulas run in bam() end up returning negative deviance explained values. I also get 50+ warning messages that all say:
In y - mu : longer object length is not a multiple of shorter object length
Running gam.check() on the fitted models returns identical residuals plots. The parametric coefficients, smooth terms, and R-squared values are also almost identical. The only things that have really noticeably changed are the deviance explained values, and they've changed to something completely nonsensical (the deviance explained values for the bam() models range from -61% to -101% deviance explained).
I'll admit that I'm brand new to using GAMs. I know just enough to know that the residuals plots are more important than the deviance explained values, and the residuals plots look good (way better than they did with a Gaussian distribution, at least). More than anything, I'm curious about what's going on within the bam() function specifically that's causing it to throw that warning and return a negative deviance explained value. Is there some extra argument I can set in bam(), or some further manipulation I can do to my data, to prevent this from happening? Or can I ignore it and move forward, since my residuals plots look good and the outputs are mostly the same?
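For concreteness, the comparison I'm describing looks roughly like this (the variable names here are placeholders, not my actual column names):
library(mgcv)
f1 <- temp_diff ~ s(canopy) + s(direct_rad) + s(diffuse_rad) +
  direction + s(pair, bs = "re")
m_gam <- gam(f1, family = scat(link = "identity"), data = d)  # runs fine, just slow
m_bam <- bam(f1, family = scat(link = "identity"), data = d)  # throws the warning above
summary(m_gam)$dev.expl  # sensible value
summary(m_bam)$dev.expl  # negative, the -61% to -101% mentioned above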
Thanks in advance for any help.
I am testing different models to find the best fit and most robust statistics for my data. My dataset contains over 50,000 observations; approximately 99.3% of the data are zeroes, and the remaining 0.7% are actual events.
See the plot of the data here: https://imgur.com/a/CUuTlSK
I am looking for the best fit among the following models: Logistic, Poisson, NB, ZIP, ZINB, PLH, NBLH (NB: negative binomial, ZI: zero-inflated, P: Poisson, LH: logit hurdle).
The first way I tried doing this was by estimating the binary response with logistic regression.
My questions: Can I use Poisson regression on the binary variable, or should I instead replace the binary indicator with some integer value, for instance the associated loss: if y=1 then y_val = y*loss? In my case the variance of y_val becomes approx. 2.5E9. I kept the binary variable because, for this purpose, it does not matter what amount the company defaulted with; a default is a default no matter the amount.
With both logistic regression and Poisson, I got some terrible statistics: a very high deviance value (and a p-value of 0), terrible estimates (many of the estimated parameters are 0, so the odds ratio = 1), very low confidence intervals; everything seems to be 'wrong'. If I transform the response variable to log(y_val) for y>1 in the Poisson model, the statistics seem to get better - however, this is against the assumption of an integer count response in Poisson regression.
I have briefly tested the ZINB; it does not change the statistics significantly (i.e. it does not help at all in this case).
Does there exist any proper way of dealing with such a dataset? I am interested in achieving the best fit for my data (about startup businesses and their default status).
The data are cleaned and ready to be fitted. Is there anything I should be aware of that I haven't mentioned?
I use the genmod procedure in SAS with dist=Poisson, zinb, zip etc.
Thanks in advance.
Sorry, my rep is too low to comment, so it has to be an answer.
You should consider an undersampling technique before using any regression/model, because your target rate is below 5%, which makes it extremely difficult to predict.
Undersampling is a method of cutting out non-target events in order to increase the target ratio. I really recommend considering it; I got to use it once in my practice, and it seemed pretty helpful.
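For example, a minimal undersampling sketch in R (the data frame dat and the 0/1 column default are hypothetical; in SAS something similar can be done with PROC SURVEYSELECT):
set.seed(1)
events     <- dat[dat$default == 1, ]
non_events <- dat[dat$default == 0, ]
# keep all events and, say, 9 non-events per event (about a 10% event rate)
dat_under  <- rbind(events,
                    non_events[sample(nrow(non_events), 9 * nrow(events)), ])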
I had a dataset for which I needed to provide a linear regression model that represents diameter as a function of length. The data, with length in the first column and diameter in the second, looked like:
0.455,0.365
0.44,0.365
I carried out the required operations on the given dataset in R, and plotted the regression line for the data.
I am just confused about what to conclude from the parameters (slope = 0.8154, y-intercept = -0.019413, correlation coefficient = 0.98). Can I conclude anything other than that the line is a good fit? I am new to statistics. Any help would be appreciated.
The slope of 0.8154 tells you that each unit increase in length is associated with an increase of 0.8154 units in diameter. The intercept of -0.019413 is probably statistically insignificant in this case; to verify that, look at its t-statistic, for example.
On this page you can find a nice course with visualizations about simple linear regression and other statistical methods that answers your questions.
From the slope and intercept alone, you cannot conclude whether the line is a good fit. The correlation coefficient says that the two variables are strongly related and that a straight line could fit your data well. However, from the p-values for the slope and intercept you can conclude whether your fit is good: if they are small (say below 0.05), you can conclude that the fit is pretty good.
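As a minimal sketch, assuming the two columns are read into a data frame with columns length and diameter, summary() reports all of these quantities at once (the file name is hypothetical):
d <- read.csv("data.csv", header = FALSE, col.names = c("length", "diameter"))
fit <- lm(diameter ~ length, data = d)
summary(fit)               # slope, intercept, their p-values, and R-squared
plot(d$length, d$diameter)
abline(fit)                # regression line over the scatter plot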
I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality and the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1<-glmmPQL(Number~Year*Treatment,~1|Year/Treatment,
family=gaussian(link = "log"),
data=data,start=coef(lm(Log~Year*Treatment)),
na.action = na.pass,verbose=FALSE)
summary(model1)
plot(model1)
Now do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does the link="log" imply that the data is already log transformed or does it imply that it will transform it?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above ...). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects - so this might be correct.
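That is, with the + included, the call would look something like:
library(lme4)
model <- glmer(Number ~ Year * Treatment + (1 | Year/Treatment),
               data = data, family = poisson)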
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
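For example, dropping the interaction and comparing against the full glmer model shown above with a likelihood ratio test (object names are just illustrative):
model_reduced <- glmer(Number ~ Year + Treatment + (1 | Year/Treatment),
                       data = data, family = poisson)
anova(model_reduced, model)  # likelihood ratio test for the interaction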
I am looking into time series data compression at the moment.
The idea is to fit a curve on a time series of n points so that the maximum deviation on any of the points is not greater than a given threshold. In other words, none of the values that the curve takes at the points where the time series is defined, should be "further away" than a certain threshold from the actual values.
So far I have found out how to do nonlinear regression using the least-squares estimation method in R (the nls function) and other languages, but I haven't found any packages that implement nonlinear regression with the L-infinity norm.
I have found literature on the subject:
http://www.jstor.org/discover/10.2307/2006101?uid=3737864&uid=2&uid=4&sid=21100693651721
or
http://www.dtic.mil/dtic/tr/fulltext/u2/a080454.pdf
I could try to implement this in R, for instance, but I am first looking to see whether this has already been done, so that I could maybe reuse it.
I have found a solution that I don't believe to be "very scientific": I use nonlinear least-squares regression to find starting values for the parameters, which I then use as starting points in R's optim function to minimize the maximum deviation of the curve from the actual points.
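A rough sketch of that two-step approach (the exponential model and the data here are made up purely for illustration):
set.seed(42)
t <- 1:50
y <- 3 * exp(-0.05 * t) + rnorm(50, sd = 0.05)
# Step 1: least-squares fit to get starting values
fit_ls <- nls(y ~ a * exp(b * t), start = list(a = 1, b = -0.1))
# Step 2: minimize the maximum absolute deviation (L-infinity norm)
linf_obj <- function(p) max(abs(y - p[1] * exp(p[2] * t)))
fit_linf <- optim(coef(fit_ls), linf_obj)
fit_linf$par    # parameters of the minimax fit
fit_linf$value  # largest deviation; compare against the compression threshold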
Any help would be appreciated. The idea is to be able to find out if this type of curve-fitting is possible on a given time series sequence and to determine the parameters that allow it.
I hope there are other people that have already encountered this problem out there and that could help me ;)
Thank you.