I'm new to R and am using it to solve a statistical problem. Currently I'm working on estimating the parameters of a distribution using 200 random numbers (RN) that I generate with R. I generate 200 RN 100 times, so there will be 100 sets of 200 RN, and I will run the estimation on each set. That means there will be 100 sets of estimation results.
So here is the code I use to generate the RN:
#Generate random numbers U~(0, 1)
rep <-100 #total replication
unif <-matrix(0, 200, rep)
for (k in 1: rep)
{
unif[,k] <- runif(200, min = 0, max = 1)
}
# Based on the 100 sets of generated random numbers that follow U(0, 1), I then generate 100 sets of random numbers that follow the target distribution:
# Define parameters
a <- 49.05 #1st parameter
b <- 3.148 #2nd parameter
c <- 0.145 #3rd parameter
d <- 0.00007181 #4th parameter
X <-matrix(0, 200, rep)
for (k in 1: rep)
{
X[,k] <- a*(log(1-((log(1-((unif[,k])^(1/c))))/(a*d))))^(1/b)
}
# Sorting the generated RN from the smallest to the largest
X_sort <-matrix(0, 200, rep)
for (k in 1: rep)
{
X_sort[,k] <- sort(X[,k])
}
Up to this point I've managed to generate the 100 sets of RN to be estimated. However, the problem I face now is how to estimate all 100 sets; I can only estimate one at a time. Here is the code I use to estimate the parameters with the maxLik package, using the BHHH estimation method:
xi = X_sort[,1]
log_likelihood<-function(theta,xi){
p1 <- theta[1] #1st parameter
p2 <- theta[2] #2nd parameter
p3 <- theta[3] #3rd parameter
p4 <- theta[4] #4th parameter
logL=log((p4*p2*p3*((xi/p1)^(p2-1))*(exp(((xi/p1)^(p2))+(p4*p1*(1-(exp((xi/p1)^(p2)))))))*((1-(exp((p4*p1*(1-(exp((xi/p1)^(p2))))))))^(p3-1))))
return(logL)
}
library(maxLik);
# Initial parameters
a <- 49.05 #1st parameter
b <- 3.148 #2nd parameter
c <- 0.145 #3rd parameter
d <- 0.00007181 #4th parameter
m <- maxLik(log_likelihood, start=c(a,b,c,d), xi = xi, method="bhhh");
summary(m)
Here is the result:
--------------------------------------------
Maximum Likelihood estimation
BHHH maximisation, 5 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -874.0024
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
[1,] 4.790e+01 1.846e+00 25.953 < 2e-16 ***
[2,] 3.015e+00 1.252e-01 24.091 < 2e-16 ***
[3,] 1.717e-01 2.964e-02 5.793 6.91e-09 ***
[4,] 7.751e-05 6.909e-05 1.122 0.262
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
To estimate the other 99 sets, I have to manually change xi = X_sort[,k] for k = 1, 2, ..., 100: for the second set it should become X_sort[,2], and so on up to the hundredth set. This is not efficient because it takes a long time to replace them one by one. Is there a way to modify this code so that estimating the other sets does not take so long?
First, I'd suggest rewriting your code in a more compact way.
1. Generating random numbers. There is no need to generate 100 vectors of length 200 each when we can generate a single vector of length 100*200 and then fill a matrix with it column-wise. This can be done in the following way:
rep <-100
n <- 200
unif <- matrix(runif(rep*n, min = 0, max = 1), n, rep)
2. Calculating a function of a matrix. In R, vectorised functions can be applied directly to matrices, so in your case it becomes:
X <- a*(log(1-((log(1-((unif)^(1/c))))/(a*d))))^(1/b)
3. Column-wise matrix sorting. We can easily sort each column of the matrix using the apply function. The second argument 2 means we do it column-wise (1 stands for rows).
X_sort <- apply(X, 2, sort)
4. Performing estimations. Again, we can use apply here.
estimations <- apply(X_sort, 2, function(x) maxLik(log_likelihood, start=c(a,b,c,d),
xi = x, method="bhhh"))
Then to print all the summaries you can do the following:
lapply(estimations, summary)
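If you only need the coefficient estimates rather than the full summaries, something along these lines should also work, assuming the apply() call above returns a list of maxLik fits (it does, because each fit is a list-like object):
# Collect the 4 estimated parameters from every replication into a 4 x 100 matrix
estimates <- sapply(estimations, coef)
# Average estimate of each parameter across the 100 replications
rowMeans(estimates)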
I have been tasked with estimating the degrees of freedom k of the probability distribution function of several chi-squared distributions using maximum likelihood estimation.
So I started by drawing 20 random values from the chi-squared distribution and proceeded to use maximum likelihood to estimate the degrees of freedom k in R.
The log-likelihood function of the chi-squared distribution is assumed to be:
l(k) = (k/2 - 1)*sum(log(x_i)) - sum(x_i)/2 - n*log(gamma(k/2)) - (n*k/2)*log(2)
library('maxLik')
library('lmtest')
n <- 20
df <- 3
chi20 <- rchisq(n, df)
X <- chi20
loglikfun <- function(param) {
if (param[1] <= 0) return(NA)
# return(sum(dchisq(chi20, df = param[1], log = TRUE)))
return((param[1]/2-1)*(sum(log(X)))-(1/2*sum(X)-n*log(gamma(param[1]/2)))-(n*param[1]/2*log(2)))
}
mle <- maxLik(loglikfun, start = c(df = 1))
summary(mle)
However, rather than getting an estimate of approximately 3, I get 7496171, along with a bunch of NaN warnings.
Any suggestions?
I think you have just transcribed the formula incorrectly. The section that you have written as
-(1/2*sum(X)-n*log(gamma(param[1]/2)))
should not have the outer parentheses, since these have the effect of negating the sign of the -n*log(gamma(param[1]/2)) term. It therefore gets added to the result, when it should be subtracted according to the formula.
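In other words, keeping your original variable names, the corrected return line would look something like this:
return((param[1]/2-1)*sum(log(X)) - 1/2*sum(X) - n*log(gamma(param[1]/2)) - n*param[1]/2*log(2))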
A more general observation here is that it's easier to keep track when implementing a mathematical formula if you match the names consistently and use spaces around the operators to make your code more readable. This might seem like a minor point, but in reality it is a frequent source of easily-avoided bugs.
Another point is that the function can be made friendlier to R's native vectorisation, so that it can evaluate several candidate parameter values at once. Effectively you get this "for free" by replacing any zero or negative parameter values with NA inside the function, rather than returning NA for the whole call.
One way to correct the implementation would be:
loglikfun <- function(k)
{
k[k <= 0] <- NA   # the log-likelihood is undefined for non-positive degrees of freedom
(k / 2 - 1) * sum(log(X)) - sum(X) / 2 - n * log(gamma(k / 2)) - n * k * log(2) / 2
}
We can check that this gives sensible results by plotting its output:
set.seed(100)
df <- 3
n <- 20
X <- rchisq(n, df)
parameter_estimate <- seq(0, 10, 0.1)
log_likelihood <- loglikfun(parameter_estimate)
plot(parameter_estimate, log_likelihood, type = "l")
abline(v = parameter_estimate[which.max(log_likelihood)], lty = 2, col = 2)
abline(h = max(log_likelihood, na.rm = TRUE), lty = 2, col = 2)
Here we can see that the maximum log-likelihood occurs at around 3, which is what we expect given the input.
If we try your maxLik code now, we can see we get the correct answer:
mle <- maxLik(loglikfun, start = c(k = 1))
summary(mle)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 5 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -35.58475
#> 1 free parameters
#> Estimates:
#> Estimate Std. error t value Pr(> t)
#> k 2.8960 0.4523 6.403 1.52e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> --------------------------------------------
and
df <- 5
n <- 100
X <- rchisq(n, df)
mle <- maxLik(loglikfun, start = c(k = 1))
summary(mle)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 6 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -243.5511
#> 1 free parameters
#> Estimates:
#> Estimate Std. error t value Pr(> t)
#> k 4.9735 0.2851 17.45 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> --------------------------------------------
Here we see that the parameter estimate is again close to the parameter used to create the random sample.
Created on 2021-11-01 by the reprex package (v2.0.0)
The log-likelihood can also be written as
F(k) = -n*log(2*gamma(k/2)) + (k/2 - 1)*sum_{i=1..n} log(x_i/2) - sum_{i=1..n} x_i/2
You can precompute the data-dependent sums once:
QS = sum_{i=1..n} log(x_i/2)
HS = sum_{i=1..n} x_i/2
F(k, QS, HS) = -n*log(2*gamma(k/2)) + (k/2 - 1)*QS - HS
Floating-point precision on the machine is not considered here.
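For illustration, here is a minimal R sketch of this precomputed form, assuming X and n are defined as in the question:
QS <- sum(log(X / 2))   # data-dependent sum, computed once
HS <- sum(X) / 2        # data-dependent sum, computed once
loglikfun_pre <- function(k) {
  k[k <= 0] <- NA       # undefined for non-positive degrees of freedom
  -n * log(2 * gamma(k / 2)) + (k / 2 - 1) * QS - HS
}
mle_pre <- maxLik(loglikfun_pre, start = c(k = 1))  # should match the estimate from loglikfun above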
I need to write the following formulas in R. The STAT formula reproduces the F statistic of the oneway.test function (equal variances, equal sample sizes):
STAT = n * s_means^2 / (sum(sample_vars) / m)
where the variance of the sample means is
s_means^2 = sum((sample_means - overall_mean)^2) / (m - 1)
and
overall_mean = mean(sample_means).
The variables are: m - number of samples, n - sample size, vector sample_means - mean of each sample, and vector sample_vars - sample variance of each sample.
I'm trying to work with the following code, but it doesn't give the correct results when I compare it to aov:
my_anova <- function(m, n, sample_means, sample_vars) {
overall_mean <- mean(sample_means)
sample_vars <- sum((sample_means - overall_mean)^2)/(m-1)
STAT <- (n*sample_vars)/(sum(sample_vars/m))
PVAL <- pf(STAT, m - 1, m*(n - 1), lower.tail = FALSE)
}
I'm not very sure where you obtained the formulas above, but from what I can gather, you want the F statistic and p value for a one-way ANOVA. Note that n here should be a degrees of freedom, not the sample size. Use the standard one-way ANOVA table as a reference: the factor (between-group) row has DoF = number of groups - 1 and sum of squares SSF, the error (within-group) row has DoF = number of observations - number of groups and sum of squares SSE; each mean square is its sum of squares divided by its DoF, and F = MSF/MSE.
So the bottom line is that SSF is the sum of squared differences between your predicted (group) means and the overall mean, whereas SSE is the sum of squared differences between your predicted means and the actual values. Then you divide each by the corresponding degrees of freedom. It should look like below:
my_aov <- function(sample_values, sample_means,n){
overall_mean = mean(sample_values)
SSF = sum((sample_means - overall_mean)^2)
SSE = sum((sample_values - sample_means)^2)
DoF = c(n,length(sample_values)-1-n)
Mean_Square = c(SSF/DoF[1] , SSE/DoF[2])
FSTAT = c(Mean_Square[1]/Mean_Square[2],NA)
PVAL <- pf(FSTAT, DoF[1], DoF[2], lower.tail = FALSE)
cbind(Sum_of_Squares= c(SSF,SSE),DoF,Mean_Square,FSTAT,PVAL)
}
Using an example:
values = iris$Sepal.Length
Species_values = tapply(iris$Sepal.Length,iris$Species,mean)
predicted_values = Species_values[as.character(iris$Species)]
# since there are 3 groups, degree of freedom is 3-1
n = length(unique(iris$Species)) - 1
my_aov(values,predicted_values,n)
Sum_of_Squares DoF Mean_Square FSTAT PVAL
[1,] 63.21213 2 31.6060667 119.2645 1.669669e-31
[2,] 38.95620 147 0.2650082 NA NA
Compare with:
summary(aov(Sepal.Length ~ Species,data=iris))
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
I am using the nls() function in R to perform a nonlinear fit. I have specified my independent variable as follows:
t <- seq(1,7)
and my dependent variable as P <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
I then have tried:
m <- nls(P ~ 1 / (c + q*exp(-b*t))^(1/v)),
but every time I get:
"Error in c + q * exp(-b * t) : non-numeric argument to binary
operator"
Every one of my variables is numeric. Any ideas?
Thanks!
You have more than one problem in your script. The main issue is that you should never use names that are already used by R: t is the matrix transpose, c is the function commonly used to create vectors, and q is the quit instruction. nls() will not try to fit them, as they are already defined. I recommend using more meaningful and less dangerous names such as Coef1, Coef2, …
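You can see the masking directly in a fresh session; this tiny check reproduces your exact error message:
class(c)     # "function" - this is base R's combine function, not your coefficient
try(c + 1)   # Error in c + 1 : non-numeric argument to binary operator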
The second problem is that you are trying to fit a model with 4 parameters to a dataset with only 7 points. This may yield singularities and other problems.
For the sake of the argument, I have reduced your model to three parameters and changed some names:
Time <- seq(1,7)
Prob <- c(0.0246, 0.2735, 0.5697, 0.6715, 0.8655, 0.9614, 1)
plot(Time, Prob)
And now we perform the nls() fit:
Fit <- nls(Prob ~ 1 / (Coef1 + Coef2 * exp(-Coef3 * Time)))  # no start values given: nls() initialises all coefficients to 1, with a warning
X <- data.frame(Time = seq(0, 7, length.out = 100))
Y <- predict(object = Fit, newdata = X)
lines(X$Time, Y)
And a summary of the results:
summary(Fit)
# Formula: Prob ~ 1/(Coef1 + Coef2 * exp(-Coef3 * Time))
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# Coef1 1.00778 0.06113 16.487 7.92e-05 ***
# Coef2 23.43349 14.42378 1.625 0.1796
# Coef3 1.04899 0.21892 4.792 0.0087 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.06644 on 4 degrees of freedom
#
# Number of iterations to convergence: 12
# Achieved convergence tolerance: 3.04e-06
I know it is not exactly what you wanted, but I hope it helps.
I need to generate data with a given value of the coefficient of multiple determination.
For example, if I specify R^2 = 0.77, I want to generate data which produce a regression model with R^2 = 0.77,
but the data must lie within a certain range. For example, the sample size is 100 and I need 4 variables (x1 is the dependent variable), with values in the range 5-15. How do I do that?
I use optim
optim(0.77, fn, gr = NULL,
method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN",
"Brent"),
lower = 5, upper = 15,
control = list(), hessian = FALSE)
but I don't know how to create the function fn for my purpose. Please help me write this function.
First here's a solution:
library(mvtnorm)
get.r <- function(x) c((x+sqrt(x**2+3*x))/(3),(x-sqrt(x**2+3*x))/(3))
set.seed(123)
cv <- get.r(0.77)[1]
out <- rmvnorm(100,sigma=matrix(c(1,cv,cv,cv,cv,1,cv,cv,cv,cv,1,cv,cv,cv,cv,1),ncol=4))
out1 <- as.data.frame(10*(out-min(out))/diff(range(out))+5)
range(out1)
# [1] 5 15
lm1 <- lm(V1~V2+V3+V4,data=out1)
summary(lm1)
# Call:
# lm(formula = V1 ~ V2 + V3 + V4, data = out1)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.75179 -0.64323 -0.03397 0.64770 2.23142
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.36180 0.50940 0.710 0.479265
# V2 0.29557 0.09311 3.175 0.002017 **
# V3 0.31433 0.08814 3.566 0.000567 ***
# V4 0.35438 0.07581 4.674 9.62e-06 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.927 on 96 degrees of freedom
# Multiple R-squared: 0.7695, Adjusted R-squared: 0.7623
# F-statistic: 106.8 on 3 and 96 DF, p-value: < 2.2e-16
Now let me explain how I got there. We can construct this statistically. First we need to understand a little about correlation and covariance. One formula for correlation is
Corr(X, Y) = Cov(X,Y)/sqrt(Var(X)Var(Y))
And one formula for covariance is:
Cov(X,Y) = E(XY) - E(X)E(Y)
In your question you want to get the multiple correlation of the regression model:
Y = X1 + X2 + X3
Let's make this as simple as possible and force the variance of all variables to be 1 and let's make the pairwise correlation between any two variables to be equal and call it r.
Now we're looking for the square of the correlation between Y and X1 + X2 + X3, which is:
R^2 = [Cov(Y,X1 + X2 + X3)]^2/[Var(Y)Var(X1 + X2 + X3)]
Note that
Cov(Y,X1 + X2 + X3) = Cov(Y,X1) + Cov(Y,X2) + Cov(Y,X3)
Further note that the variance of each variable is 1 and the pairwise correlation is r, so the above result is equivalent to 3r.
Also note that
Var(X1 + X2 + X3) = Var(X1) + Var(X2) + Var(X3) + 2[Cov(X1,X2) + Cov(X1,X3) + Cov(X2,X3)].
Since the variance of each is 1, this is equivalent to 3 + 6r, so
R^2 = 9r^2/(3 + 6r) = 3r^2/(1 + 2r)
We can use the quadratic equation to solve for r and get
r = (R^2 +/- sqrt((R^2)^2+3R^2))/3
If we substitute R^2 = 0.77, then r = -0.3112633 or 0.8245966. We can use either value to get what you need with rmvnorm() from the mvtnorm package. And since R^2 is invariant to linear transformations, we can transform the resulting variables so that they fall between 5 and 15.
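As a quick sanity check of the algebra, you can plug both roots back into the R^2 formula:
r <- get.r(0.77)        # 0.8245966 and -0.3112633
3 * r^2 / (1 + 2 * r)   # both roots give back 0.77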
Update:
If we want to simulate with n predictors, we can use the following (note that I am not transforming the range of each predictor, but that can be done after the fact without altering the multiple R^2):
get.r <- function(x,n) c(((n-1)*x+sqrt(((n-1)*x)**2+4*n*x))/(2*n),
((n-1)*x-sqrt(((n-1)*x)**2+4*n*x))/(2*n))
sim.data <- function(R2, n) {
sig.mat <- matrix(get.r(R2,n)[1],n+1,n+1)  # n predictors plus 1 response; get.r() takes the number of predictors
diag(sig.mat) <- 1
out <- as.data.frame(rmvnorm(100,sigma=sig.mat))
return(out)
}
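A quick way to check the generalised version (a sketch; rmvnorm's default column names V1, V2, ... are used here and V1 is treated as the response):
set.seed(123)
dat <- sim.data(0.77, 3)                               # 3 predictors plus the response
summary(lm(V1 ~ V2 + V3 + V4, data = dat))$r.squared   # close to 0.77, up to sampling noise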
This isn't an answer, but I wanted to share what I did. I don't believe optim can be used the way you want it to. I attempted a "brute force" method to find a dataset that could work, but the highest r-squared my random draws produced was 0.23:
# Initializing our boolean and counter.
rm(list = ls())
Done <- FALSE
count <- 1
maxr2 <- .000001
# I set y ahead of time.
y <- sample(5:15, 100, replace = TRUE)
# Running until an appropriate r-squared is found.
while(!Done) {
# Generating a sample data set to optimize y on.
a <- sample(5:15, 100, replace = TRUE)
b <- sample(5:15, 100, replace = TRUE)
c <- sample(5:15, 100, replace = TRUE)
data <- data.frame(y = y, a = a, b = b, c = c)
# Making our equation and making a linear model.
EQ <- "y ~ a + b + c" # Creating the equation.
model <- lm(EQ, data) # Running the model.
if (count != 1) { if (summary(model)$r.squared > maxr2) { maxr2 <- summary(model)$r.squared } }
r2 <- summary(model)$r.squared # Grabbing the r-squared.
print(r2) # Printing r-squared out to see what is popping out.
if (r2 <= 0.78 & r2 >= 0.76) { Done <- TRUE } # If the r-squared is satisfactory, stop.
count <- count + 1 # Incrementing our counter.
if (count >= 1000000) { Done <- TRUE ; print("A satisfactory r-squared was not found.") } # Setting this to run at most 1,000,000 times.
}
# If one was found, data holds the dataset whose model has an r-squared of about 0.77.
The issue with optim is that it optimizes individual parameters, i.e. single values. The first argument of optim is par, which is meant to be a vector of the values you want to optimize. It could be used to optimize an r-squared that depends on a handful of such values (these would be your par values). However, in this case you are asking it to optimize entire data columns towards maximizing an r-squared, which (as far as I know) is not something optim is meant to do.
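For comparison, here is a minimal sketch of the kind of problem optim is designed for: a scalar objective of a short parameter vector (the data and names are made up purely for illustration):
set.seed(42)
x <- runif(100, 5, 15)
y <- 2 + 0.5 * x + rnorm(100)
rss <- function(par) sum((y - (par[1] + par[2] * x))^2)  # scalar objective of the par vector
optim(par = c(0, 0), fn = rss)$par                       # roughly c(2, 0.5)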
I'm trying to find a solution in R that performs similarly to MATLAB's trust-region-reflective algorithm. This question has been asked before, but the author was asked to provide a reproducible example and I couldn't comment there, so the only option was to post a new question. Here's my example:
x <- c(5000,5000,5000,5000,2500,2500,2500,2500,1250,1250,1250,1250,625,625, 625,625,312,312,312,312,156,156,156,156)
y <- c(0.209065186,0.208338898,0.211886104,0.209638321,0.112064803,0.110535275,0.111748670,0.111208841,0.060416469,0.059098975,0.059274827,0.060859512,0.032178730,0.033190833,0.031621743,0.032345817,0.017983939,0.016632180,0.018468540,0.019513489,0.011490089,0.011076365,0.009282322,0.012309134)
Since initial parameter values are the central issue, I tried the 'nls2' package, which uses a brute-force algorithm to find good starting parameters. Even with that, neither nls nor nlsLM reaches convergence. Here's some basic code for this:
library('nls2'); library('minpack.lm')
fo <- y ~ I(A * (x ^ B) + C)
sA <- seq(-2,1,len=10) # range of parameter values
sB <- seq(-1,1,len=10)
sC <- seq(-1,1,len=10)
st1 <- expand.grid(A=sA,B=sB,C=sC)
mod1 <- nls2(fo,start=st1,algorithm="brute-force")
fit_ <- nls(fo,start=coef(mod1)) # basic nls
# or nls.lm
fit_ <- nlsLM(fo, start=coef(mod1),algorithm = "LM")
MATLAB produced:
a = 7.593e-05 (6.451e-05, 8.736e-05)
b = 0.9289 (0.9116, 0.9462)
c = 0.002553 (0.001333, 0.003772)
Goodness of fit:
SSE: 2.173e-05
R-square: 0.9998
Adjusted R-square: 0.9998
RMSE: 0.001017
and yes, using these parameter values as starting values, R also reproduces the solution.
Question: how can I obtain this in R without using MATLAB?
After looking at the plotted data, I have no problem guessing suitable starting values:
plot(y ~ x)
The data lie almost on a straight line through 0, so good starting values for B and C are 1 and 0, respectively. Then you only need to guesstimate the slope of the straight line. Of course, you could also use lm(y ~ x) to find starting values for A and C, as sketched just below.
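For instance, a data-driven way to get those starting values (a sketch; the straight-line fit's slope seeds A and its intercept seeds C):
LinFit <- lm(y ~ x)
StartVals <- list(A = unname(coef(LinFit)[2]),   # slope of the straight line
                  B = 1,                          # a straight line corresponds to an exponent of 1
                  C = unname(coef(LinFit)[1]))    # intercept
# nls(y ~ A * x^B + C, start = StartVals) should then converge to essentially the same fit
With hand-picked starting values instead, the fit looks like this: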
fo <- y ~ A * (x ^ B) + C
DF <- data.frame(x, y)
fit <- nls(fo, start = list(A = 0.001, B = 1, C = 0), data = DF)
summary(fit)
#Formula: y ~ A * (x^B) + C
#
#Parameters:
# Estimate Std. Error t value Pr(>|t|)
#A 7.593e-05 5.495e-06 13.820 5.17e-12 ***
#B 9.289e-01 8.317e-03 111.692 < 2e-16 ***
#C 2.552e-03 5.866e-04 4.351 0.000281 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.001017 on 21 degrees of freedom
#
#Number of iterations to convergence: 5
#Achieved convergence tolerance: 9.084e-07
lines(seq(min(x), max(x), length.out = 100),
predict(fit, newdata = data.frame(x = seq(min(x), max(x), length.out = 100))),
col = "blue")