R Nonlinear Least Squares (nls) Model Fitting - r

I'm trying to fit the output of the G function for my data to the following mathematical model: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)). The shape of this graph can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reversed the vector Gvalues, which holds the dependent variable. Thus, the information is in the correct orientation to be fitted to the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe was occurring because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!

As noted, your problem is most likely the starting values. There are two strategies you could use:
Use brute force to find starting values. See package nls2 for a function to do this.
Try to get a sensible guess for starting values.
Depending on your values it could be possible to linearize the model.
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
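The linearisation above can be sketched in code. This is only an illustration on synthetic data, not the spatstat output: B_true, C_true, the noise level, and the cut-off R > 2 for the "large B^2*R^2" region are all assumptions made for the example. In the large-R region the slope of ln(G) against ln(R) is -(C+1) and the intercept is -(C+1)*ln(B), which gives rough starting values for nls.

```r
# Synthetic data from the model with A = 1 (B_true and C_true are made-up values).
set.seed(1)
B_true <- 2
C_true <- 2
R <- seq(0.05, 5, length.out = 200)
G <- (1 / (1 + B_true^2 * R^2)^((C_true + 1) / 2)) *
  exp(rnorm(length(R), sd = 0.005))   # small multiplicative noise keeps G > 0
dat <- data.frame(R = R, G = G)

# Linearised form for large B^2 * R^2:
#   ln(G) = -(C + 1) * ln(B) - (C + 1) * ln(R)
# so slope = -(C + 1) and intercept / slope = ln(B).
tail_dat <- subset(dat, R > 2)        # assumed "large R" region
lin <- lm(log(G) ~ log(R), data = tail_dat)
intercept <- unname(coef(lin)[1])
slope <- unname(coef(lin)[2])
start_vals <- list(B = exp(intercept / slope), C = -slope - 1)

# Use the rough estimates as starting values for the nonlinear fit.
fit <- nls(G ~ 1 / (1 + B^2 * R^2)^((C + 1) / 2), data = dat, start = start_vals)
coef(fit)
```

Whether this works on real data depends on having enough points where B^2*R^2 is clearly larger than 1 and G is still positive, so the log can be taken.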
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.

Related

Beta regression model in R

Please again accept my apologies for my limited knowledge of R. I'm trying to get better! I'm a biologist and my statistical knowledge is sadly low.
I have the following data set:
Perc_Reacting,Pulses,IndMutant,Proportion
93,1,1,0.93
81,2,1,0.81
73,3,1,0.73
64,4,1,0.64
73,5,1,0.73
68,6,1,0.68
64,7,1,0.64
65,8,1,0.65
50,9,1,0.5
68,10,1,0.68
57,11,1,0.57
50,12,1,0.5
62,13,1,0.62
44,14,1,0.44
54,15,1,0.54
56,16,1,0.56
50,17,1,0.5
42,18,1,0.42
42,19,1,0.42
29,20,1,0.29
96,1,0,0.96
100,2,0,1
92,3,0,0.92
96,4,0,0.96
92,5,0,0.92
92,6,0,0.92
84,7,0,0.84
96,8,0,0.96
91,9,0,0.91
82,10,0,0.82
86,11,0,0.86
82,12,0,0.82
91,13,0,0.91
85,14,0,0.85
83,15,0,0.83
70,16,0,0.7
74,17,0,0.74
64,18,0,0.64
68,19,0,0.68
78,20,0,0.78
The first and last columns carry the same information, one expressed in % and the other as a proportion between 0 and 1.
I need to run a Beta regression model, but when I try to create the model an error jumps:
model.beta<-betareg(C_elegans$Proportion~C_elegans$Pulses)
Error in betareg(C_elegans$Proportion ~ C_elegans$Pulses) :
invalid dependent variable, all observations must be in (0, 1)
Could you help me create a beta regression model for this data, and show me how to make relevant plots that demonstrate it fits well?
Also I need to propose a linear regression model for this data, can anyone let me know how you think it could be done better?
Here are the results of fitting the last three columns to a flat surface plane equation "Proportion = a + (b * Pulses) + (c * IndMutant)" with parameters a = 1.0468289473684214E+00, b = -1.8650375939849695E-02, and c = -2.5850000000000006E-01 yielding R-squared = 0.876 and RMSE = 0.064.
(here "absolute error" means "not relative error")
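For reference, the plane fit quoted above can be reproduced with lm, and the data also show why betareg complains: one observation (Pulses = 2, IndMutant = 0) has Proportion exactly 1, which lies outside the open interval (0, 1). The sketch below is one possible workaround, not the only one: it assumes the commonly used squeeze transformation (y * (n - 1) + 0.5) / n, and the column name Proportion_sq is made up for the example. The betareg call is left commented out so the block runs without that package.

```r
# Data from the question (Perc_Reacting omitted; it is just 100 * Proportion).
C_elegans <- data.frame(
  Pulses = rep(1:20, times = 2),
  IndMutant = rep(c(1, 0), each = 20),
  Proportion = c(
    0.93, 0.81, 0.73, 0.64, 0.73, 0.68, 0.64, 0.65, 0.50, 0.68,
    0.57, 0.50, 0.62, 0.44, 0.54, 0.56, 0.50, 0.42, 0.42, 0.29,
    0.96, 1.00, 0.92, 0.96, 0.92, 0.92, 0.84, 0.96, 0.91, 0.82,
    0.86, 0.82, 0.91, 0.85, 0.83, 0.70, 0.74, 0.64, 0.68, 0.78
  )
)

# Rows on or outside the boundary of (0, 1): these are what betareg() rejects.
C_elegans[C_elegans$Proportion <= 0 | C_elegans$Proportion >= 1, ]

# Ordinary linear model: Proportion = a + b * Pulses + c * IndMutant.
plane <- lm(Proportion ~ Pulses + IndMutant, data = C_elegans)
coef(plane)

# Assumption: squeeze the response slightly so that every value is inside (0, 1).
n <- nrow(C_elegans)
C_elegans$Proportion_sq <- (C_elegans$Proportion * (n - 1) + 0.5) / n
range(C_elegans$Proportion_sq)

# If the betareg package is installed, the model could then be fitted as:
# library(betareg)
# model.beta <- betareg(Proportion_sq ~ Pulses + IndMutant, data = C_elegans)
```

Passing data = C_elegans and bare column names, rather than C_elegans$Proportion inside the formula, also makes predict() and plotting of fitted values easier later on.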

How to remove two data points from a data set that have a large influence on the regression model

I have found two outlier data points in my data set but I don't know how to remove them. All of the guides that I have found online seem to emphasize plotting the data but my question does not require plotting, it only takes regression model fitting. I am having great difficulty finding out how to remove the two data points from my data set and then fitting the new data set with a new model.
Here is the code that I have written and the outliers that I found:
library(alr4)
library(MASS)
data(lathe1)
head(lathe1)
y=lathe1$Life
x1=lathe1$Speed
x2=lathe1$Feed
x1_square=(x1)^2
x2_square=(x2)^2
#part A (Box-Cox method show log transformation)
y.regression=lm(y~x1+x2+x1_square+x2_square+(x1*x2)) # (x1)^2 inside a formula is not a squared term, so use the precomputed squares
mod=boxcox(y.regression, data=lathe1, lambda = seq(-1, 1, length=10))
best.lam=mod$x[which(mod$y==max(mod$y))]
best.lam
#part B (null-hypothesis F-test)
y.regression1_Reduced=lm(log(y)~1)
y.regression1=lm(log(y)~x1+x2+x1_square+x2_square+(x1*x2))
anova(y.regression1_Reduced, y.regression1)
#part D (F-test of log(Y) without beta1)
y.regression2=lm(log(y)~x2+x2_square)
anova(y.regression1_Reduced, y.regression2)
#part E (Cook's distance and refit)
cooks.distance(y.regression1)
Outliers:
9 10
0.7611370235 0.7088115474
I think you may be able (if execution time / corpus size allows it) to loop over your data and copy or remove elements by your criteria to obtain the desired result, e.g. (Python-style):
corpus_list_without_outliers = []
for elem in corpus_list:
    if elem.speed <= 10000:  # elem.<any_param_name> <= arbitrary outlier threshold
        # keep it because it is OK :)
        corpus_list_without_outliers.append(elem)
print(corpus_list_without_outliers)
# regression algorithm after
This is how I'd see the situation, but you can replace the if above with a filter that overwrites the original list, to avoid keeping a second list, e.g.
# drop every elem whose speed marks it as an outlier :(
corpus_list = [elem for elem in corpus_list if elem.speed <= 10000]
print(corpus_list)
# regression algorithm after
Hope it helped you!
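Since the question is in R, the same idea can be sketched with negative row indexing and a refit. The built-in mtcars data and the formula mpg ~ wt + hp are stand-ins chosen so the example runs without the alr4 package; for the question's data the analogous step would be to drop rows 9 and 10 (for example lathe1[-c(9, 10), ]) and refit on the reduced data frame.

```r
# mtcars is a stand-in for lathe1 so that the sketch runs without extra packages.
fit_full <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation; treat the two largest as influential.
cd <- cooks.distance(fit_full)
drop_idx <- order(cd, decreasing = TRUE)[1:2]
rownames(mtcars)[drop_idx]

# Remove those rows by negative indexing and refit the same formula.
mtcars_reduced <- mtcars[-drop_idx, ]
fit_reduced <- update(fit_full, data = mtcars_reduced)

# The refit uses two observations fewer than the original fit.
c(full = nobs(fit_full), reduced = nobs(fit_reduced))
```

This relies on the model being fitted with a data argument; if the variables were extracted into separate vectors first, each vector would need the same rows removed before refitting.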

How to calculate the area under each end of a sine curve

Given this data set:
y<-c(-13,16,35,40,28,36,43,33,40,33,22,-5,-27,-31,-29,-25,-26,-31,-26,-24,-25,-29,-23,4)
t<-1:24
My goal is to calculate two areas. The first area would integrate only data from the first part of the curve found above the Zero line. The second area would integrate data from the second part of the curve found below the zero line.
First I would like to fit a sine wave to this data. Using this excellent answer:
https://stats.stackexchange.com/questions/60994/fit-a-sinusoidal-term-to-data
I was able to fit a sine wave (I will be using the periodic with second harmonic which looks to have a better fit)
ssp <- spectrum(y)
per <- 1/ssp$freq[ssp$spec==max(ssp$spec)]
reslm <- lm(y ~ sin(2*pi/per*t)+cos(2*pi/per*t))
summary(reslm)
rg <- diff(range(y))
plot(y~t,ylim=c(min(y)-0.1*rg,max(y)+0.1*rg))
lines(fitted(reslm)~t,col=4,lty=2) # dashed blue line is sin fit
# including 2nd harmonic really improves the fit
reslm2 <- lm(y ~ sin(2*pi/per*t)+cos(2*pi/per*t)+sin(4*pi/per*t)+cos(4*pi/per*t))
summary(reslm2)
lines(fitted(reslm2)~t,col=3) # solid green line is periodic with second harmonic
abline(h=0,lty=2)
Next I would like to calculate the area under the part of the curve that is positive, as well as the area under the part that is negative. I've had luck looking at similar answers using the AUC functions in the Bolstad2 and MESS packages. But my data points do not fall neatly on the zero line, and I do not know how to break up the sine function into areas only above the zero line and only below the zero line.
First things first. To get an exact calculation, you will need to work with the exact function of the second-harmonic Fourier fit. Secondly, the beauty of harmonic functions is that they are periodic. So if you want to find where your function reaches 0, you merely need to expand your interval so that you can be sure to find more than 2 roots.
First we get the exact function from the regression model
fourierfnct <- function(t){
  fnct <- reslm2$coeff[1]+
    reslm2$coeff[2]*sin(2*pi/per*t)+
    reslm2$coeff[3]*cos(2*pi/per*t)+
    reslm2$coeff[4]*sin(4*pi/per*t)+
    reslm2$coeff[5]*cos(4*pi/per*t)
  return(fnct)
}
Secondly, you can write a function that finds the roots (where the function is 0). R provides a uniroot function, which you can call in a loop over sub-intervals to find multiple roots.
manyroots <- function(f,inter,period){
  # one slot per sub-interval; `period` is accepted but not used
  roots <- rep(NA_real_, length(inter)-1)
  for(i in 1:(length(inter)-1)){
    roots[i] <- tryCatch({
      uniroot(f,c(inter[i],inter[i+1]))$root
    }, error = function(err) {
      NA_real_  # uniroot fails when there is no sign change in this sub-interval
    })
  }
  # keep only the sub-intervals where a root was found
  retroots <- roots[!is.na(roots)]
  return(retroots)
}
Then you simply calculate the roots and use them to integrate the function across those boundaries.
roots <- manyroots(fourierfnct,seq(0,25),per)
integrate(fourierfnct, roots[1],roots[2])
#300.6378 with absolute error < 3.3e-12
integrate(fourierfnct, roots[2],roots[3])
#-284.6378 with absolute error < 3.2e-12
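To check the root-then-integrate approach on a case where the answer is known, here is a small self-contained sketch. It uses sin(t) as a stand-in for the fitted curve (an assumption for illustration only), so the positive lobe should have area close to 2 and the negative lobe close to -2. The grid offset is also an arbitrary choice, made so that no root falls exactly on a grid point.

```r
# Stand-in for the fitted curve: one full sine period, so the areas are known.
f <- function(t) sin(t)

# Grid of sub-intervals, offset so that no root sits exactly on a grid point.
grid <- seq(-0.25, 6.75, by = 0.5)

# Look for a sign change in each sub-interval and refine it with uniroot().
roots <- c()
for (i in seq_len(length(grid) - 1)) {
  a <- grid[i]
  b <- grid[i + 1]
  if (f(a) * f(b) < 0) {
    roots <- c(roots, uniroot(f, c(a, b))$root)
  }
}
roots

# Integrate between consecutive roots: one lobe above zero, one below.
area_pos <- integrate(f, roots[1], roots[2])$value
area_neg <- integrate(f, roots[2], roots[3])$value
c(positive = area_pos, negative = area_neg)
```

Testing for a sign change before calling uniroot avoids relying on error handling, at the cost of missing roots where the curve only touches zero without crossing it.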
This may not be the solution you are looking for, but you could try this:
# Create a new t vector but with more subdivisions
t2 = seq(1,24,length.out = 10000)
# Evaluate your model on this t2
y2 = predict(reslm2, newdata = data.frame(t = t2))
lines(t2[y2>=0],y2[y2>=0],col="red")
# Estimate the area where the curve is greater than 0
sum(diff(t2)[1]*y2[y2>0])
# Estimate the area where the curve is less than 0
sum(diff(t2)[1]*y2[y2<0])

linear regression using lm() - surprised by the result

I used a linear regression on data I have, using the lm function. Everything works (no error message), but I'm somehow surprised by the result: I am under the impression R "misses" a group of points, i.e. the intercept and slope are not the best fit. For instance, I am referring to the group of points at coordinates x=15-25,y=0-20.
My questions:
is there a function to compare fit with "expected" coefficients and "lm-calculated" coefficients?
have I made a silly mistake when coding, leading lm to do that?
Following some answers: additional information on x and y
x and y are both visual estimates of disease symptoms. There is the same uncertainty on both of them.
The data and code are here:
x1=c(24.0,23.9,23.6,21.6,21.0,20.8,22.4,22.6,
21.6,21.2,19.0,19.4,21.1,21.5,21.5,20.1,20.1,
20.1,17.2,18.6,21.5,18.2,23.2,20.4,19.2,22.4,
18.8,17.9,19.1,17.9,19.6,18.1,17.6,17.4,17.5,
17.5,25.2,24.4,25.6,24.3,24.6,24.3,29.4,29.4,
29.1,28.5,27.2,27.9,31.5,31.5,31.5,27.8,31.2,
27.4,28.8,27.9,27.6,26.9,28.0,28.0,33.0,32.0,
34.2,34.0,32.6,30.8)
y1=c(100.0,95.5,93.5,100.0,98.5,99.5,34.8,
45.8,47.5,17.4,42.6,63.0,6.9,12.1,30.5,
10.5,14.3,41.1, 2.2,20.0,9.8,3.5,0.5,3.5,5.7,
3.1,19.2,6.4, 1.2, 4.5, 5.7, 3.1,19.2, 6.4,
1.2,4.5,81.5,70.5,91.5,75.0,59.5,73.3,66.5,
47.0,60.5,47.5,33.0,62.5,87.0,86.0,77.0,
86.0,83.0,78.5,83.0,83.5,73.0,69.5,82.5,78.5,
84.0,93.5,83.5,96.5,96.0,97.5)
## x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))
# linear regression
reg_lin=lm(y1 ~ x1)
abline(reg_lin,lty="solid", col="royalblue")
text(12.5,25,labels="R result",col="royalblue", cex=0.85)
text(12.5,20,labels=bquote(y== .(5.26)*x - .(76)),col="royalblue", cex=0.85)
# result I would have imagined
abline(a=-150,b=8,lty="dashed", col="red")
text(27.5,25,labels="What I think is better",col="red", cex=0.85)
text(27.5,20,labels=bquote(y== .(8)*x - .(150)),col="red", cex=0.85)
Try this:
reg_lin_int <- reg_lin$coefficients[1]
reg_lin_slp <- reg_lin$coefficients[2]
sum((y1 - (reg_lin_int + reg_lin_slp*x1)) ^ 2)
# [1] 39486.33
sum((y1 - (-150 + 8 * x1)) ^ 2)
# [1] 55583.18
The sum of squared residuals is lower under the lm fit line. This is to be expected, as reg_lin_int and reg_lin_slp are guaranteed to produce the minimal total squared error.
Intuitively, we know estimators under squared loss functions are sensitive to outliers. It's "missing" the group at the bottom because it gets closer to the group at the top left that's much further away--and squared distance gives these points more weight.
In fact, if we use Least Absolute Deviations regression (i.e., specify an absolute loss function instead of a square), the result is much closer to your guess:
library(quantreg)
lad_reg <- rq(y1 ~ x1)
(Pro tip: use lwd to make your graphs much more readable)
What gets even closer to what you had in mind is Total Least Squares, as mentioned by @nongkrong and @MikeWilliamson. Here is the result of TLS on your sample:
v <- prcomp(cbind(x1, y1))$rotation
bbeta <- v[2, 1] / v[1, 1]  # slope (dy/dx) of the first principal axis
inter <- mean(y1) - bbeta * mean(x1)
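To see how the two fits differ when both variables are noisy, here is a minimal sketch on synthetic data. The true slope of 2, intercept of 1, and equal noise standard deviations are assumptions chosen for the illustration. The total least squares line is taken as the first principal axis through the centroid, with slope v[2, 1] / v[1, 1], which does not depend on the sign prcomp happens to pick for the component.

```r
set.seed(42)
x_true <- seq(0, 10, length.out = 200)
x <- x_true + rnorm(200, sd = 1)            # x observed with error
y <- 2 * x_true + 1 + rnorm(200, sd = 1)    # y observed with error; true slope 2

# Ordinary least squares: minimises vertical distances only.
ols <- lm(y ~ x)
ols_slope <- unname(coef(ols)[2])

# Total least squares: direction of the first principal component.
v <- prcomp(cbind(x, y))$rotation
tls_slope <- unname(v[2, 1] / v[1, 1])
tls_inter <- mean(y) - tls_slope * mean(x)

c(ols = ols_slope, tls = tls_slope)
```

With error in x, the OLS slope is pulled towards zero, while the orthogonal fit is steeper. That is the same direction of difference as between the lm line and the line that looked better by eye.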
You got a nice answer already, but maybe this is also helpful:
As you know, OLS minimizes the sum of squared errors in the y-direction. This assumes that the uncertainty of your x-values is negligible, which is often the case. But possibly it's not the case for your data. If we assume that the uncertainties in x and y are equal and do Deming regression, we get a fit more similar to what you expected.
library(MethComp)
dem_reg <- Deming(x1, y1)
abline(dem_reg[1:2], col = "green")
You don't provide detailed information about your data. Thus, this might be useful or not.

R: Robust fitting of data points to a Gaussian function

I need to do some robust data-fitting operation.
I have a bunch of (x,y) data that I want to fit to a Gaussian (aka normal) function.
The point is, I want to remove the outliers. As one can see on the sample plot below, there is another distribution of data that's polluting my data on the right, and I don't want to take it into account when doing the fitting (i.e. finding \sigma, \mu and the overall scale parameter).
R seems to be the right tool for the job. I found some packages (robust, robustbase, MASS for example) that are related to robust fitting.
However, they assume the user already has a strong knowledge of R, which is not my case, and the documentation is only provided as a sort of reference manual, with no tutorial or equivalent. My statistical background is rather weak. I attempted to read reference material on fitting with R, but it didn't really help (and I'm not even sure that's the right way to go).
But I have the feeling that this is actually a quite simple operation.
I have checked this related question (and the linked ones), however they take as input a single vector of values, and I have a vector of pairs, so I don't see how to transpose.
Any help on how to do this would be appreciated.
To fit a Gaussian curve to the data, the principle is to minimise the sum of squared differences between the fitted curve and the data. So we define our objective function f and run optim on it:
fitG = function(x,y,mu,sig,scale){
  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2])
    sum((d-y)^2)
  }
  optim(c(mu,sig,scale),f)
}
Now, extend this to two Gaussians:
fit2G <- function(x,y,mu1,sig1,scale1,mu2,sig2,scale2,...){
  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2]) + p[6]*dnorm(x,mean=p[4],sd=p[5])
    sum((d-y)^2)
  }
  optim(c(mu1,sig1,scale1,mu2,sig2,scale2),f,...)
}
Fit with initial params from the first fit, and an eyeballed guess of the second peak. Need to increase the max iterations:
> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000))
Warning messages:
1: In dnorm(x, mean = p[1], sd = p[2]) : NaNs produced
2: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
3: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
> fit2P
$par
[1] 6.035610393 0.653149616 0.023744876 8.317215066 0.107767881 0.002055287
What does this all look like?
> plot(data$V3,data$V6)
> p = fit2P$par
> lines(data$V3,p[3]*dnorm(data$V3,p[1],p[2]))
> lines(data$V3,p[6]*dnorm(data$V3,p[4],p[5]),col=2)
However I would be wary about statistical inference about your function parameters...
The warning messages produced are probably due to the sd parameter going negative. You can fix this and also get a quicker convergence by using L-BFGS-B and setting a lower bound:
> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000),method="L-BFGS-B",lower=c(0,0,0,0,0,0))
> fit2P
$par
[1] 6.03564202 0.65302676 0.02374196 8.31424025 0.11117534 0.00208724
As pointed out, sensitivity to initial values is always a problem with curve fitting things like this.
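Since the original data file is not shown, here is a self-contained sketch of the same two-Gaussian least-squares idea on synthetic data. The peak positions, widths, scales, noise level, and starting guesses are all made up for the illustration. It assumes a small positive lower bound on the two sd parameters rather than 0, so that dnorm never receives a zero standard deviation.

```r
# Synthetic (x, y) data: a main Gaussian peak plus a smaller contaminating peak.
set.seed(123)
x <- seq(3, 10, length.out = 300)
y <- 1.0 * dnorm(x, mean = 6, sd = 0.6) +
  0.3 * dnorm(x, mean = 8.3, sd = 0.2) +
  rnorm(length(x), sd = 0.005)

# Sum of squared differences between the two-Gaussian curve and the data.
# p = (mu1, sig1, scale1, mu2, sig2, scale2)
sse <- function(p) {
  d <- p[3] * dnorm(x, mean = p[1], sd = p[2]) +
    p[6] * dnorm(x, mean = p[4], sd = p[5])
  sum((d - y)^2)
}

# Eyeballed starting values and box constraints (sd > 0, scale >= 0).
start <- c(5.8, 0.7, 0.9, 8.2, 0.25, 0.25)
lower <- c(-Inf, 1e-3, 0, -Inf, 1e-3, 0)
fit <- optim(start, sse, method = "L-BFGS-B", lower = lower,
             control = list(maxit = 10000))

# Parameters of the main peak only; the second peak absorbs the contamination.
main_peak <- fit$par[1:3]
main_peak
```

Modelling the contaminating peak explicitly is a different strategy from down-weighting outliers with a robust loss. It tends to work when the contamination itself looks roughly Gaussian, as it does in the plot described in the question.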
Fitting a Gaussian:
# your data
set.seed(0)
data <- c(rnorm(100,0,1), 10, 11)
# find & remove outliers
outliers <- boxplot(data)$out
data <- setdiff(data, outliers)
# fitting a Gaussian
mu <- mean(data)
sigma <- sd(data)
# testing the fit, check the p-value
reference.data <- rnorm(length(data), mu, sigma)
ks.test(reference.data, data)
