So, I am using nls() to do nonlinear regression in R.
I now have some code which does it for me and I get the correct output (phew!).
I can easily store the coefficients in a data frame using coef(), but I also need to store some of the other data from the summary too.
Here's what I get when I run summary(Power.model)
Formula: Power.mean ~ a + (b * (Power.rep^-c))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1240.197 4.075 304.358 <2e-16 ***
b 10.400 14.550 0.715 0.490
c 6.829 230.336 0.030 0.977
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.97 on 11 degrees of freedom
Number of iterations to convergence: 17
Achieved convergence tolerance: 4.011e-06
I can get the estimates and calculate the residual sum of squares, but I would really like to also store the std. error, t value, residual standard error, number of iterations and (most important of all) the achieved convergence tolerance in the table too.
I understand that I can use capture.output(summary(Power.model)) to capture these, but I just end up with a bunch of strings. What I really want is to capture only the numbers (ideally as numbers) without (a) all of the extras (e.g., the string "Achieved convergence tolerance: ") and (b) having to convert the strings into regular (single/double) numbers (e.g., 4.011e-06 into 0.000004011).
I can't seem to find a list of all of the functions I can run on my nls output. The only ones I have found so far are coef() and resid(). A list would be ideal, but otherwise any other advice on accessing the data in the summary without resorting to capture.output() and the string editing/conversion that would inevitably follow would be very much appreciated.
coef(summary(Power.model)) will give a matrix containing some of these items and Power.model$convInfo will give a list whose components contains other of these items. The residual sum of squares can be obtained using deviance(Power.model).
methods(class = "nls") will give a list of functions that act on "nls" objects, and str(Power.model) and str(summary(Power.model)) will show the internal components of "nls" and "summary.nls" objects.
For example, using the builtin BOD data frame:
> fm <- nls(demand ~ a + b * Time, BOD, start = list(a = 1, b = 1))
> coef(summary(fm))
Estimate Std. Error t value Pr(>|t|)
a 8.521429 2.6589490 3.204811 0.03275033
b 1.721429 0.6386589 2.695380 0.05435392
> fm$convInfo
$isConv
[1] TRUE
$finIter
[1] 1
$finTol
[1] 3.966571e-09
$stopCode
[1] 0
$stopMessage
[1] "converged"
> deviance(fm)
[1] 38.06929
> sum(resid(fm)^2) # same
[1] 38.06929
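Putting those pieces together, the values the question asks for can be collected into a single data frame. This is only a minimal sketch using the fm fit above (for the question's model, substitute Power.model):
est <- coef(summary(fm))
data.frame(
  term      = rownames(est),
  estimate  = est[, "Estimate"],
  std.error = est[, "Std. Error"],
  t.value   = est[, "t value"],
  sigma     = summary(fm)$sigma,      # residual standard error
  rss       = deviance(fm),           # residual sum of squares
  iter      = fm$convInfo$finIter,    # number of iterations
  tol       = fm$convInfo$finTol      # achieved convergence tolerance
)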
You might also be interested in the broom package which will provide data frame representations of nls output like this:
> library(broom)
> tidy(fm)
term estimate std.error statistic p.value
1 a 8.521429 2.6589490 3.204811 0.03275033
2 b 1.721429 0.6386589 2.695380 0.05435392
> glance(fm)
sigma isConv finTol logLik AIC BIC deviance df.residual
1 3.085016 TRUE 3.966571e-09 -14.05658 34.11315 33.48843 38.06929 4
Use names(Power.model); it will return the names of the object's components, and you can also use names(Power.model$...), with ... being one of those component names.
For example, Power.model$convInfo$finTol returns the achieved convergence tolerance.
If you are using RStudio, you can click on the arrow next to Power.model in the Environment pane; it will display all the names of Power.model along with their values, which lets you choose the correct one.
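For instance, with the BOD fit fm from the earlier example:
names(fm)            # top-level components of the "nls" object
names(fm$convInfo)   # isConv, finIter, finTol, stopCode, stopMessage
fm$convInfo$finTol   # achieved convergence tolerance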
I am doing some exercises using the r-exams package, in which I print a summary from an lm object and ask students things like, “what is the estimated value of the intercept?”. The idea is that the student copies the value from the summary output and uses it as the answer. The issue here is that I use the values from the coef() function as the correct answers, but this is not a good idea since the precision of these values is quite different from the precision of the values shown in the summary output. Here is an example:
set.seed(123)
library(tidyverse)
## DATA GENERATION
xbreaks<-c(runif(1,4,4.8),runif(1,6,6.9),runif(1,7.8,8.5),runif(1,9,10))
ybreaks<-c(runif(1,500,1000),runif(1,1800,4000),runif(1,200,800))
b11<-(ybreaks[2]-ybreaks[1])/(xbreaks[2]-xbreaks[1])
b10<-ybreaks[1]-b11*xbreaks[1]
b31<-(ybreaks[3]-ybreaks[2])/(xbreaks[4]-xbreaks[3])
b30<-ybreaks[2]-b31*xbreaks[3]
points_df<-data.frame(x=xbreaks,y=ybreaks[c(1,2,2,3)])
n<-rpois(3,120)
x1<-runif(n[1],xbreaks[1],xbreaks[2])
x2<-runif(n[2],xbreaks[2],xbreaks[3])
x3<-runif(n[3],xbreaks[3],xbreaks[4])
y<-c(b10+b11*x1+rnorm(n[1],0,200),
ybreaks[2]+rnorm(n[2],0,200),
b30+b31*x3+rnorm(n[3],0,200))
z0_aw<-data.frame(ph=c(x1,x2,x3),UFC=y,case=factor(c(rep(1,n[1]),rep(2,n[2]),rep(3,n[3]))))
mean_x<-z0_aw$ph%>% mean %>% round(2)
caserng<-sample(1:4,1)
modrng<-sample(1:2,1)
if(caserng!=4){
z0_aw<-z0_aw[z0_aw$case == caserng,]
}
if(modrng==1){
m0<-lm(UFC~ph,data=z0_aw)
}else{
cl <- call("lm", formula = UFC ~ I(ph - mean_x), data = as.name("z0_aw"))
cl$formula[[3]][[2]][[3]] <- mean_x
m0<-eval(cl)
}
summary(m0)
#>
#> Call:
#> lm(formula = UFC ~ I(ph - 7.2), data = z0_aw)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -555.53 -121.98 5.46 115.38 457.08
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2726.86 57.33 47.57 <2e-16 ***
#> I(ph - 7.2) -840.05 31.46 -26.70 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 182.7 on 116 degrees of freedom
#> Multiple R-squared: 0.8601, Adjusted R-squared: 0.8589
#> F-statistic: 713.1 on 1 and 116 DF, p-value: < 2.2e-16
coef(m0)
#> (Intercept) I(ph - 7.2)
#> 2726.8605 -840.0515
Created on 2021-05-14 by the reprex package (v2.0.0)
Suppose that extol: 0.0001 is set in r-exams and the student is asked to give the estimated value of the intercept. The student will get a wrong answer, since they will answer 2726.86 while the correct answer taken from coef() is 2726.8605.
As can be seen, the summary output uses 2 decimals, whereas the coef() values have considerably more precision. I want to know how many decimals summary() is using, in order to apply the same format to the values produced by coef(). This will ensure that the answer provided by the student matches the summary output.
I just want to do this:
answers<-coef(m0) %>% format(digits=dsum) %>% as.numeric()
where dsum is the number of digits used also by the summary output.
Note: retaining a precision of 4 decimals is needed, since I also ask students about the R-squared value provided in the same summary output, so it is not a good idea to set, for example, extol: 0.01. Also, the problems are generated at random and the magnitude of the estimated coefficients changes, which I have noticed is directly related to the precision used in the summary output.
Some useful information for such questions in R/exams:
The extol can also be a vector so that you can set different tolerances for coefficients and R-squared etc.
When asking about the R-squared, though, I typically ask for it "in percent". Then the same tolerance may be suitable as for the coefficients.
I would recommend controlling the size of the coefficients suitably so that digits and extol can be set accordingly.
Personally, I typically store the exsolution at a higher precision than I request from the students. For example, exsolution could be 12.345678 while I only set extol to 0.01. This makes sure that when the correct answer is rounded to two decimal places it is inside the correct interval determined by exsolution and extol.
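In R code, the gist of that last point could look like this (just a sketch based on the m0 fit from the reprex above; the variable names are only for illustration):
exsolution <- unname(coef(m0))[1]  # full-precision solution, e.g. 2726.8605
extol <- 0.01                      # grading tolerance
abs(2726.86 - exsolution) < extol  # TRUE: the value copied from summary() is accepted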
Details on formatting of the coefficients in the summary:
It is not obvious where exactly the formatting happens: The summary() method for lm objects returns an object of class summary.lm which has its own print() method which in turn calls printCoefmat(). The latter is the function that does the actual formatting.
The digits argument in these functions controls the number of significant digits, not the number of decimal places. This is particularly important when the coefficients become relatively large (say, in the thousands or more).
The coefficients are not formatted individually but jointly with the corresponding standard errors. The details depend on the digits, the size of both coefficients and standard errors, and whether any coefficients are aliased or exactly zero etc.
Without aliased/zero coefficients the formatting from summary(m0) can be replicated using format_coef(m0) as defined below. That's essentially the boiled-down code from printCoefmat().
format_coef <- function(object, digits = max(3L, getOption("digits") - 2L)) {
coef_se <- summary(object)$coefficients[, 1L:2L]
digmin <- 1L + floor(log10(range(abs(coef_se))))
format(round(coef_se, max(1L, digits - digmin)), digits = digits)[, 1L]
}
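With that helper defined, the pipeline from the question could be approximated like this (a sketch, assuming m0 is the model from the reprex above):
answers <- as.numeric(format_coef(m0))  # summary-formatted values, parsed back to numbers
answers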
I want to estimate parameters with the optim() function in R.
I then compare my result with a GLM model. The code is:
d <- read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
str(d)
oreduced <- glm(disease~age+sector, family=binomial(link=logit), data=d)
summary(oreduced)
y<-as.numeric(as.character(d$disease))
x1<-as.numeric(as.character(d$age))
x2<-as.numeric(as.character(d$sector))
nlldbin=function(param){
eta<-param[1]+param[2]*x1+param[3]*x2
p<-1/(1+exp(-eta))
-sum(y*log(p)+(1-y)*log(1-p),na.rm=TRUE)
}
MLE_estimates<-optim(c(Intercept=0.1,age=0.1,sector2=0.1),nlldbin,hessian=TRUE)
MLE_estimates
The result with GLM model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.15966 0.34388 -6.280 3.38e-10 ***
age 0.02681 0.00865 3.100 0.001936 **
sector2 1.18169 0.33696 3.507 0.000453 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And with optim()
$par
Intercept age sector2
-3.34005918 0.02680405 1.18101449
Can someone please tell me why it's different and how to fix this? Thank you.
You've given R two different problems. In your GLM, all of the variables in the formula are factors. This means that you've told R that they can only take particular values (e.g. d$disease can only take the values 0 and 1). In your MLE approach, you've converted them to numeric variables, meaning that they can take any value and that your data just happen to use a small set of values.
The "fix" is to give R only one problem to solve. For example, if you instead fit glm(y~x1+x2, family=binomial(link=logit)), which uses no factor variables, you get pretty much the same parameter estimates from the MLE approach as from the fitted model.
I am making a bioinformatics shiny app that reads user-supplied group names from an Excel file. As these names may not be syntactically valid, I would like to represent them internally as valid names.
As an example, I can have this input:
(grps <- as.factor(c("T=0","T=0","T=4-","T=4+","T=4+")))
[1] T=0 T=0 T=4- T=4+ T=4+
Levels: T=0 T=4- T=4+
Ideally, I would like R to make valid names, but keep the groups/levels the same, for instance the following would be fine:
"T.0" "T.0" "T.4minus" "T.4plus" "T.4plus"
When using make.names(), however, all invalid characters are converted to the same character:
(grps2 <- as.factor(make.names(grps)))
[1] T.0 T.0 T.4. T.4. T.4.
Levels: T.0 T.4.
So both T=4- and T=4+ are given the same name and a level is lost (which causes problems in subsequent analyses). Also, setting unique=TRUE does not solve the problem, because
(grps3 <- as.factor(make.names(grps,unique=TRUE)))
[1] T.0 T.0.1 T.4. T.4..1 T.4..2
Levels: T.0 T.0.1 T.4. T.4..1 T.4..2
and group T=4+ is split into 2 different groups and levels are gained.
Does anybody know how, in general, to convert a factor into valid names while keeping the same number of levels?
Please keep in mind that user input can widely vary, so manually replacing "-" with "minus" does not work here.
Thanks in advance for your help!
With the mapvalues function from plyr you can do:
require("plyr")
mapvalues(grps, levels(grps), make.names(levels(grps), unique=TRUE))
Since this works directly on the levels instead of the factor, the number of the values stays the same.
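Applied to the example from the question, the three levels stay distinct. The same idea also works in base R by assigning to levels() directly; a small sketch:
grps <- as.factor(c("T=0", "T=0", "T=4-", "T=4+", "T=4+"))
levels(grps) <- make.names(levels(grps), unique = TRUE)  # relabels the levels in place
grps
# [1] T.0    T.0    T.4.   T.4..1 T.4..1
# Levels: T.0 T.4. T.4..1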
The labels associated with the levels of a factor are not required to satisfy the same rules as object names. Consider the following example, where I take the mpg and gear columns of the mtcars data set, rename them, make gear a factor, and give it the same levels as in your example.
library(magrittr)
library(dplyr)
library(broom)
D <- mtcars[c("mpg", "gear")] %>%
setNames(c("y", "grps")) %>%
mutate(grps = factor(grps, 3:5, c("T=0", "T=4-", "T=4+")))
Notice that I am able to fit a linear model, get a summary, force it to a data frame, all while the level names have the =, -, and + symbols in them.
fit <- lm(y ~ grps, data = D)
fit
Call:
lm(formula = y ~ grps, data = D)
Coefficients:
(Intercept) grpsT=4- grpsT=4+
16.107 8.427 5.273
summary(fit)
Call:
lm(formula = y ~ grps, data = D)
Residuals:
Min 1Q Median 3Q Max
-6.7333 -3.2333 -0.9067 2.8483 9.3667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.107 1.216 13.250 7.87e-14 ***
grpsT=4- 8.427 1.823 4.621 7.26e-05 ***
grpsT=4+ 5.273 2.431 2.169 0.0384 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.708 on 29 degrees of freedom
Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
tidy(fit)
term estimate std.error statistic p.value
1 (Intercept) 16.106667 1.215611 13.249852 7.867272e-14
2 grpsT=4- 8.426667 1.823417 4.621361 7.257382e-05
3 grpsT=4+ 5.273333 2.431222 2.169005 3.842222e-02
So I'm left thinking that either
You're making things harder on yourself than you need to, or
It isn't clear why you need to make the levels syntactically valid object names.
I am interested in reproducing, in R, results calculated by WordMat (the GNU plugin to MS Word), but I can't get the two to arrive at similar results (I am not looking for identical results, just similar ones).
I have some y and x values and a power function, y = bx^a
Using the following data,
x <- c(15,31,37,44,51,59)
y <- c(126,71,61,53,47,42)
I get a = -0.8051 and b = 1117.7472 in WordMat, but a = -0.8026 and b = 1108.2533 in R, slightly different values.
Am I using the nls function in some wrong way or is there a better (more transparent) way to calculate it in R?
Data and R code,
# x <- c(15,31,37,44,51,59)
# y <- c(126,71,61,53,47,42)
df <- data.frame(x,y)
moD <- nls(y~a*x^b, df, start = list(a = 1,b=1))
summary(moD)
Formula: y ~ a * x^b
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.108e+03 1.298e+01 85.35 1.13e-07 ***
b -8.026e-01 3.626e-03 -221.36 2.50e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3296 on 4 degrees of freedom
Number of iterations to convergence: 19
Achieved convergence tolerance: 5.813e-06
It looks like WordMat is estimating the parameters of y=b*x^a by doing the log-log regression rather than by solving the nonlinear least-squares problem:
> x <- c(15,31,37,44,51,59)
> y <- c(126,71,61,53,47,42)
>
> (m1 <- lm(log(y)~log(x)))
Call:
lm(formula = log(y) ~ log(x))
Coefficients:
(Intercept) log(x)
7.0191 -0.8051
> exp(coef(m1)[1])
(Intercept)
1117.747
To explain what's going on here a little bit more: if y=b*x^a, taking the log on both sides gives log(y)=log(b)+a*log(x), which has the form of a linear regression (lm() in R). However, log-transforming also affects the variance of the errors (which are implicitly included on the right-hand side of the equation), meaning that you're actually solving a different problem. Which is correct depends on exactly how you state the problem. This question on CrossValidated gives more details.
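If the goal is still the least-squares fit on the original scale, the log-log coefficients also make convenient starting values for nls(); a small sketch reusing m1, x and y from above:
# start the nonlinear fit from the log-log estimates instead of a = 1, b = 1
m2 <- nls(y ~ b * x^a, start = list(b = exp(coef(m1)[[1]]), a = coef(m1)[[2]]))
coef(m2)  # essentially the same fit as moD in the question (with a and b swapped)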
I am trying to understand how to work with ANOVAs and post-hoc tests in R.
So far, I have used aov() and TukeyHSD() to analyse my data. Example:
uni2.anova <- aov(Sum_Uni ~ Micro, data= uni2)
uni2.anova
Call:
aov(formula = Sum_Uni ~ Micro, data = uni2)
Terms:
Micro Residuals
Sum of Squares 0.04917262 0.00602925
Deg. of Freedom 15 48
Residual standard error: 0.01120756
Estimated effects may be unbalanced
My problem is, now I have a huge list of pairwise comparisons but cannot do anything with it:
TukeyHSD(uni2.anova)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sum_Uni ~ Micro, data = uni2)
$Micro
diff lwr upr p adj
Act_Glu2-Act_Ala2 -0.0180017863 -0.046632157 0.0106285840 0.6448524
Ana_Ala2-Act_Ala2 -0.0250134285 -0.053643799 0.0036169417 0.1493629
NegI_Ala2-Act_Ala2 0.0702274527 0.041597082 0.0988578230 0.0000000
This dataset has 40 rows...
Ideally, I would like to get a dataset that looks something like this:
Act_Glu2 : a
Act_Ala2 : a
NegI_Ala2: b...
I hope you get the point. So far, I have found nothing comparable online... I also tried to select only the significant pairs in the output from TukeyHSD, but the object does not "acknowledge" that it is made up of rows & columns, making selection impossible...
Maybe there is something fundamentally wrong with my approach?
I think the OP wants the letters to get a view of the comparisons.
library(multcompView)
multcompLetters(extract_p(TukeyHSD(uni2.anova)))
That will get you the letters.
You can also use the multcomp package
library(multcomp)
cld(glht(uni2.anova, linfct = mcp(Micro = "Tukey")))
I hope this is what you need.
The result from TukeyHSD is a list. Use str() to look at the structure. In your case you'll see that it's a list of one item, and that item is basically a matrix. So, to extract the first column you'll want to save the TukeyHSD result:
hsd <- TukeyHSD(uni2.anova)
If you look at str(hsd) you can see that you can then get at the bits...
hsd$Micro[,1]
That will give you the column of your differences. You should be able to extract what you want now.
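For example, to keep only the significant pairs the question mentions (a sketch, reusing hsd from above):
# hsd$Micro is a matrix with columns diff, lwr, upr and "p adj";
# filter its rows on the adjusted p-value
sig <- hsd$Micro[hsd$Micro[, "p adj"] < 0.05, , drop = FALSE]
sig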
Hard to tell without example data, but assuming Micro is just a factor with 4 levels and uni2 looks something like
n = 40
Micro = c('Act_Glu2', 'Act_Ala2', 'Ana_Ala2', 'NegI_Ala2')[sample(4, 40, rep=T)]
Sum_Uni = rnorm(n, 5, 0.5)
Sum_Uni[Micro=='Act_Glu2'] = Sum_Uni[Micro=='Act_Glu2'] + 0.5
uni2 = data.frame(Sum_Uni, Micro)
> uni2
Sum_Uni Micro
1 4.964061 Ana_Ala2
2 4.807680 Ana_Ala2
3 4.643279 NegI_Ala2
4 4.793383 Act_Ala2
5 5.307951 NegI_Ala2
6 5.171687 Act_Glu2
...
then I think what you're actually trying to get at is the basic multiple regression output:
fit = lm(Sum_Uni ~ Micro, data = uni2)
summary(fit)
anova(fit)
> summary(fit)
Call:
lm(formula = Sum_Uni ~ Micro, data = uni2)
Residuals:
Min 1Q Median 3Q Max
-1.26301 -0.35337 -0.04991 0.29544 1.07887
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8364 0.1659 29.157 < 2e-16 ***
MicroAct_Glu2 0.9542 0.2623 3.638 0.000854 ***
MicroAna_Ala2 0.1844 0.2194 0.841 0.406143
MicroNegI_Ala2 0.1937 0.2158 0.898 0.375239
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4976 on 36 degrees of freedom
Multiple R-squared: 0.2891, Adjusted R-squared: 0.2299
F-statistic: 4.88 on 3 and 36 DF, p-value: 0.005996
> anova(fit)
Analysis of Variance Table
Response: Sum_Uni
Df Sum Sq Mean Sq F value Pr(>F)
Micro 3 3.6254 1.20847 4.8801 0.005996 **
Residuals 36 8.9148 0.24763
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You can access the numbers in any of these tables like, for example,
> summary(fit)$coef[2,4]
[1] 0.0008536287
To see the list of what is stored in each object, use names():
> names(summary(fit))
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
In addition to the TukeyHSD() function you found, there are many other options for looking at the pairwise tests further, and correcting the p-values if desired. These include pairwise.table(), estimable() in gmodels, the resampling and boot packages, and others...
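As a small illustration of one such option (just a sketch on the simulated uni2 data above), pairwise t-tests with Holm-adjusted p-values:
# pairwise comparisons of Sum_Uni between Micro groups, Holm correction
with(uni2, pairwise.t.test(Sum_Uni, Micro, p.adjust.method = "holm"))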