I'm trying to understand how anova() calculates the F value for a Gamma GLM.
I have some strangely skewed data:
y <- c(0, 0.88, 0.94, 0, 0.95, 0.77, 3.22, 3.52, 1.22, 1.52, 1.23,
0.92, 1.11, 1.18, 1.47, 1.53, 0, 0, 1.09, 0.83, 0.8, 1.56, 6,
0.74, 1.18, 1.01, 0.82, 3.83, 1.75, 1.27, 1.54, 1.05, 1.08, 0.9,
0.77, 1.44, 4.55, 0, 1.44, 2.91, 0.71, 12.93, 0.77, 0, 1.14,
1.06, 3.96, 1.57, 1.63)
x <- c(6.9469287, 6.290469147, 6.1918829, 6.104770097, 5.939523496,
5.942857082, 6.163662277, 6.399218779, 5.783065061, 5.638420345,
5.552741687, 5.683432022, 5.857426116, 6.162680044, 5.957396843,
6.571818964, 5.446848271, 5.712962062, 5.653265224, 6.349141363,
5.46503105, 6.049651518, 7.380125424, 5.722479551, 5.950585693,
5.808206582, 6.096318404, 5.913429847, 5.997807119, 6.206943676,
6.550982371, 6.543636484, 6.822385253, 6.507588297, 5.940914702,
6.439753879, 6.899586949, 6.156580921, 7.116019293, 6.355315455,
6.538796291, 6.498027706, 6.196593891, 6.339028678, 6.23909998,
6.551869452, 6.688031206, 6.492259138, 5.997315277)
y <- y + 0.001
I added 0.001 to y to avoid zeros. For a simple regression I could reproduce the F test run by anova:
lm0 <- lm(y ~ 1)
lm1 <- lm(y ~ x)
#
y.p <- lm1$fitted.values # predicted/fitted values
SSE <- sum((y - y.p)^2)
SSR <- sum((y.p - mean(y))^2)
SST <- sum((y - mean(y))^2)
round(SST - (SSE + SSR), 4) #check
# [1] 0
#
SS1 <- sum(residuals(lm0, "deviance")^2) #=SST
SS2 <- sum(residuals(lm1, "deviance")^2) #=SSE
df1 <- lm0$df.residual
df2 <- lm1$df.residual
MSE <- SS2/df2
MSR <- ((SS1 - SS2)/(df1 - df2))
MSR/MSE # F-value
# [1] 5.927608
anova(lm0, lm1, test="F")$F[2]
# [1] 5.927608
However, I could not reproduce F for a Gamma-version of the regression:
lm0 <- glm(y ~ 1, family=Gamma(link="log"))
lm1 <- glm(y ~ x, family=Gamma(link="log"))
#
oo <- Gamma(link="log") # family info
eta <- predict(lm1) # linear predictor (link scale)
y.p <- oo$linkinv(eta) # fitted values on the original scale (= lm1$fitted.values)
# ... then the same SS/MS calculations as in the lm example above
MSR/MSE # F-value
# [1] 3.862559
anova(lm0, lm1, test="F")$F[2]
# [1] 7.356901
However, if I take the MSR (which the anova output apparently calls "Deviance") and divide it by the dispersion parameter from the model summary (which, strangely to me, can be reproduced from the working residuals), I get the correct F:
# correct F for the gamma regression:
disp <- summary(lm1)$dispersion
mdisp <- sum(residuals(lm1, "working")^2)/df2 # MSE-variant with working residuals
disp - mdisp # check
# [1] 0
Dev <- anova(lm1)$Deviance[2]
MSR-Dev # check
# [1] 0
MSR/mdisp # correct F as in anova
# [1] 7.356901
To me (without a deeper mathematical education), these last manipulations, which I found through trial and error, look like magic. Could somebody help me understand how the link function of the Gamma GLM enters the MSR/MSE calculation? I need this understanding to calculate F for a Gamma regression fitted with the R fastglm package, which is not compatible with anova.
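For what it's worth, the pieces above can be assembled into a compact recipe. The sketch below (object names f0/f1 are placeholders, fitted here with plain glm()) reproduces the deviance-based F with a Pearson estimate of the dispersion, which is the combination that matched anova above; the same ingredients (deviances, fitted values, residual degrees of freedom) can be taken from fastglm output instead.
f0 <- glm(y ~ 1, family = Gamma(link = "log"))
f1 <- glm(y ~ x, family = Gamma(link = "log"))
dev0 <- deviance(f0); rdf0 <- df.residual(f0)
dev1 <- deviance(f1); rdf1 <- df.residual(f1)
# Dispersion of the larger model from Pearson residuals; for a log-link Gamma
# the Pearson residual (y - mu)/mu coincides with the working residual, which
# is why summary(lm1)$dispersion matched the working-residual calculation.
mu  <- fitted(f1)
phi <- sum(((y - mu) / mu)^2) / rdf1
Fstat <- ((dev0 - dev1) / (rdf0 - rdf1)) / phi
Fstat # should agree with anova(f0, f1, test = "F")$F[2]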
I have a few .csv files and want to perform an ANOVA in R, using the last column from each .csv file as one sample for the ANOVA.
To make an example, here's the first file:
E, Run1, Run2, ..., Runn, Mean
0, 1.48, 1.22, ..., 1.01, 1.24
1, 1.42, 1.12, ..., 1.00, 1.18
2, 1.11, 1.11, ..., 0.98, 1.07
3, 1.00, 0.98, ..., 0.90, 0.96
..., ..., ..., ..., ..., ...
And the second:
E, Run1, Run2, ..., Runn, Mean
0, 2.11, 2.00, ..., 1.98, 2.03
1, 2.00, 1.99, ..., 1.64, 1.88
2, 1.98, 1.90, ..., 1.60, 1.83
3, 1.90, 1.85, ..., 1.22, 1.66
..., ..., ..., ..., ..., ...
In other words, I want the final column of each file, ignoring the header row. So I might have sample1 = [1.24, 1.18, 1.07, 0.96, ...] and sample2 = [2.03, 1.88, 1.83, 1.66, ...]. Then I can do an ANOVA using the two samples.
My example uses two files, but an approach that handles an arbitrary number of .csv files and performs the ANOVA would be preferable.
I'm not terribly familiar with R so any help is appreciated. I've tagged as bash and awk because I don't mind using those.
This would be one way, with the data at hand:
library(purrr)
data1 <- read.table(text=
"Epoch, Run 1, Run 2, Run 3, Mean
0, 1.48, 1.22, 1.01, 1.24
1, 1.42, 1.12, 1.00, 1.18
2, 1.11, 1.11, 0.98, 1.07
3, 1.00, 0.98, 0.90, 0.96
",header=TRUE, sep=",")
data2 <- read.table(text=
"Epoch, Run 1, Run 2, Run 3, Mean
0, 2.11, 2.00, 1.98, 2.03
1, 2.00, 1.99, 1.64, 1.88
2, 1.98, 1.90, 1.60, 1.83
3, 1.90, 1.85, 1.22, 1.66
",header=TRUE, sep=",")
list(data1, data2) %>%
imap_dfr( ~ data.frame( sample=paste0("sample",.y), value=.x[,ncol(.x)] ) ) %>%
lm( data=., value~sample ) %>% anova
Output:
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
sample 1 1.08781 1.08781 56.424 0.0002881 ***
Residuals 6 0.11567 0.01928
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To generalize across multiple files:
library(purrr)
library(dplyr) # for select() and last_col()
dir( "some/where/", pattern="\\.csv$", full.names=TRUE ) %>%
map_df( ~ read.csv( .x ) %>% select(last_col()), .id="filename" ) %>%
setNames( c("filenumber", "value" ) ) %>% # base::setNames
lm( data=., value~filenumber ) %>% anova
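If purrr and dplyr are not available, a base-R sketch along the same lines (same assumed directory "some/where/" and last-column convention) would be:
files <- dir("some/where/", pattern = "\\.csv$", full.names = TRUE)
long <- do.call(rbind, lapply(seq_along(files), function(i) {
  d <- read.csv(files[i])
  data.frame(sample = paste0("sample", i), value = d[, ncol(d)]) # keep only the last column
}))
anova(lm(value ~ sample, data = long))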
I'm trying to figure out how to calculate absolute EC50 values using the LL.3 and LL.4 (3- and 4-parameter) dose-response models in the drc package, but I keep getting the warning "In log(exp(-tempVal/parmVec[5]) - 1) : NaNs produced" and the EC50 value comes back as NA.
Here is an example of the code I'm trying to run:
### use the ryegrass dataset in drc
gr.LL.3 <- drm(ryegrass$rootl ~ ryegrass$conc, fct = LL.3()) # 3 parameter log-logistic model
gr.LL.4 <- drm(ryegrass$rootl ~ ryegrass$conc, fct = LL.4()) # 4 parameter log-logistic model
plot(gr.LL.3) #graph looks fine
plot(gr.LL.4) #graph looks fine
ED (gr.LL.3, respLev = c(50), type = "relative") # this works fine
ED (gr.LL.4, respLev = c(50), type = "relative") # this works fine
ED (gr.LL.3, respLev = c(50), type = "absolute") # this gives me "NA" for EC50 along with warning message
ED (gr.LL.4, respLev = c(50), type = "absolute") # this gives me "NA" for EC50 along with warning message
It's not due to 0 values for concentrations
# ryegrass dataset with 0 value concentrations and corresponding rootl removed
rootlength <- c(8.3555556, 6.9142857, 7.75, 6.8714286, 6.45, 5.9222222, 1.925, 2.8857143, 4.2333333, 1.1875, 0.8571429, 1.0571429, 0.6875, 0.525, 0.825, 0.25, 0.22, 0.44)
conc.wo.0 <- c(0.94, 0.94, 0.94, 1.88, 1.88, 1.88, 3.75, 3.75, 3.75, 7.5, 7.5, 7.5, 15, 15, 15, 30, 30, 30)
gro.LL.3 <- drm(rootlength ~ conc.wo.0, fct = LL.3())
plot(gro.LL.3) #graph looks fine
ED (gro.LL.3, respLev = c(50), type = "relative") # this works fine
ED (gro.LL.3, respLev = c(50), type = "absolute") # once again, this gives me "NA" for EC50 along with warning message
It's also not due to the response being in absolute vs relative terms
# ryegrass dataset with response relative to average response with 0 concentration (sorry, I did the absolute to relative conversion in excel, I'm still learning r)
rel.rootl <- c(0.98, 1.03, 1.07, 0.94, 0.95, 1.03, 1.08, 0.89, 1.00, 0.89, 0.83, 0.76, 0.25, 0.37, 0.55, 0.15, 0.11, 0.14, 0.09, 0.07, 0.11, 0.03, 0.03, 0.06)
concentration <- c(0, 0, 0, 0, 0, 0, 0.94, 0.94, 0.94, 1.88, 1.88, 1.88, 3.75, 3.75, 3.75, 7.5, 7.5, 7.5, 15, 15, 15, 30, 30, 30)
rel.gro.LL.3 <- drm(rel.rootl ~ concentration, fct = LL.3())
plot(rel.gro.LL.3) #graph looks fine
ED (rel.gro.LL.3, respLev = c(50), type = "relative") # this works fine
ED (rel.gro.LL.3, respLev = c(50), type = "absolute") # once again, this gives me "NA" for EC50 along with warning message
I'm new to this, so any help is appreciated.
rel.rootl <- c(0.98, 1.03, 1.07, 0.94, 0.95, 1.03, 1.08, 0.89, 1.00, 0.89, 0.83, 0.76, 0.25, 0.37, 0.55, 0.15, 0.11, 0.14, 0.09, 0.07, 0.11, 0.03, 0.03, 0.06)
concentration <- c(0, 0, 0, 0, 0, 0, 0.94, 0.94, 0.94, 1.88, 1.88, 1.88, 3.75, 3.75, 3.75, 7.5, 7.5, 7.5, 15, 15, 15, 30, 30, 30)
rel.gro.LL.3 <- drm(rel.rootl ~ concentration, fct = LL.3())
plot(rel.gro.LL.3) #graph looks fine
ED (rel.gro.LL.3, respLev = c(50), type = "relative") # this works fine
ED (rel.gro.LL.3, respLev = c(50), type = "absolute") # once again, this gives me "NA" for EC50 along with warning message
The problem is that when you estimate the absolute EC50, the ED function solves for the point on the curve at the response level you request (the respLev argument). Your y-axis is on a proportion scale, so an absolute response level of 50 is never reached and the function fails.
To fix this, either multiply your normalized response by 100 to turn it into a percent relative response:
rel.gro.LL.3.percent <- drm(rel.rootl*100 ~ concentration, fct = LL.3())
ED (rel.gro.LL.3.percent, respLev = c(50), type = "relative") # same result as above
Estimated effective doses
Estimate Std. Error
e:1:50 3.26520 0.19915
ED (rel.gro.LL.3.percent, respLev = c(50), type = "absolute") # very similar to relative EC50
Estimated effective doses
Estimate Std. Error
e:1:50 3.30154 0.20104
Alternatively, you could change the respLev to 0.5 in your original model.
ED (rel.gro.LL.3, respLev = c(50), type = "relative") # this still works fine
Estimated effective doses
Estimate Std. Error
e:1:50 3.26520 0.19915
ED (rel.gro.LL.3, respLev = c(0.5), type = "absolute") # Now this works and is the same as we got before with response multiplied by 100
Estimated effective doses
Estimate Std. Error
e:1:0.5 3.30154 0.20104
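One extra check worth knowing about (a suggestion, using nothing beyond drc's coef() method): an absolute respLev has to lie between the model's fitted asymptotes. LL.3() fixes the lower limit at 0 and estimates the upper limit d, so for this normalized response d should sit near 1, which is why 50 is unreachable but 0.5 is fine.
coef(rel.gro.LL.3) # the d:(Intercept) estimate is the upper limit; an absolute respLev must lie between 0 and it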
I am new to using R, and I am trying to use survival analysis in order to find correlation in censored data.
The x data is the envelope mass of protostars. The y data is the intensity of an observed molecular line, and some values are upper limits. The data is:
x <- c(17.299, 4.309, 7.368, 29.382, 1.407, 3.404, 0.450, 0.815, 1.027, 0.549, 0.018)
y <- c(2.37, 0.91, 1.70, 1.97, 0.60, 1.45, 0.25, 0.16, 0.36, 0.88, 0.42)
censor <- c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1)
I am using the function survreg from the R survival package:
modeldata<-survreg(formula=Surv(y,censor)~x, dist="exponential", control = list(maxiter=90))
Which gives the following result:
summary(modeldata)
Call:
survreg(formula = Surv(y, censor) ~ x, dist = "exponential",
control = list(maxiter = 90))
Value Std. Error z p
(Intercept) -0.114 0.568 -0.20 0.841
x 0.153 0.110 1.39 0.163
Scale fixed at 1
Exponential distribution
Loglik(model)= -6.9 Loglik(intercept only)= -9
Chisq= 4.21 on 1 degrees of freedom, p= 0.04
Number of Newton-Raphson Iterations: 5
n= 11
However, when I plot the data and the model using the following method:
plot(x,y,pch=(censor+1))
xnew<-seq(0,30)
model<-predict(modeldata,list(x=xnew))
lines(xnew,model,col="red")
I get this plot of x and y data; triangles are censored data
I am not sure where I am going wrong. I have tried different distributions, but all produce similar results. The same is true when I use other data, for example:
x <- c(1.14, 1.14, 1.19, 0.78, 0.43, 0.24, 0.19, 0.16, 0.17, 0.66, 0.40)
I am also not sure if I am interpreting the results correctly.
I have tried other examples using the same method (e.g. https://stats.idre.ucla.edu/r/examples/asa/r-applied-survival-analysis-ch-1/), and it works well, as far as I can tell.
So my questions are:
Am I using the correct function for fitting the data? If not, which would be more suitable?
If it is the correct function, why is the model not fitting the data even closely? Does it have to do with the plotting?
Thank you for your help.
The "shape" of the relationship looks concave downward, so I would have guessed a ~ log(x) fit might be be more appropriate:
dfrm <- data.frame( x = c(17.299, 4.309, 7.368, 29.382, 1.407, 3.404, 0.450, 0.815, 1.027, 0.549, 0.018),
y = c(2.37, 0.91, 1.70, 1.97, 0.60, 1.45, 0.25, 0.16, 0.36, 0.88, 0.42),
censor= c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1))
modeldata<-survreg(formula=Surv(y,censor)~log(x), data=dfrm, dist="loggaussian", control = list(maxiter=90))
Your code seemed appropriate:
png(); plot(y~x,pch=(censor+1),data=dfrm)
xnew<-seq(0,30)
model<-predict(modeldata,list(x=xnew))
lines(xnew,model,col="red"); dev.off()
modeldata
Call:
survreg(formula = Surv(y, censor) ~ log(x), data = dfrm, dist = "loggaussian",
control = list(maxiter = 90))
Coefficients:
(Intercept) log(x)
0.02092589 0.32536509
Scale= 0.7861798
Loglik(model)= -6.6 Loglik(intercept only)= -8.8
Chisq= 4.31 on 1 degrees of freedom, p= 0.038
n= 11
I have real data and predicted data, and I want to calculate the overall MAPE and MSE. The data are time series, with each column representing a different week. I predict a value for each of the 52 weeks for each item, as shown below. What would be the best way to calculate the overall error in R?
real = matrix(
c("item1", "item2", "item3", "item4", .5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
nrow=4,
ncol=4)
colnames(real) <- c("item", "week1", "week2", "week3")
predicted = matrix(
c("item1", "item2", "item3", "item4", .55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
nrow=4,
ncol=4)
colnames(predicted) <- c("item", "week1", "week2", "week3")
How do you get the predicted values in the first place? The model you use to get them is probably fitted by minimising some function of the prediction errors (usually the MSE). Therefore, if you calculated the predicted values yourself, the residuals and some MSE/MAPE-type metrics have most likely already been computed somewhere along the line while fitting the model, and you can probably retrieve them directly.
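For instance (a hypothetical lm() fit on the built-in cars data, purely to illustrate the idea), the residuals are already stored in the fitted object and the metrics follow directly:
fit <- lm(dist ~ speed, data = cars) # stand-in model
mse <- mean(residuals(fit)^2) # MSE from the stored residuals
mape <- mean(abs(residuals(fit) / cars$dist)) # MAPE relative to the observed values
c(MSE = mse, MAPE = mape)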
If the predicted values happened to be thrown into your lap and you have nothing to do with fitting the model, then you calculate MSE and MAPE as per below:
You have only one record per week for every item. So for every item, you can only calculate one prediction error per week. Depending on your application, you can choose to calculate the MSE and MAPE per item or per week.
This is what your data looks like:
real <- matrix(
c(.5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
nrow = 4, ncol = 3)
colnames(real) <- c("week1", "week2", "week3")
predicted <- matrix(
c(.55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
nrow = 4, ncol = 3)
colnames(predicted) <- c("week1", "week2", "week3")
Calculate the (percentage/squared) errors for every entry:
pred_error <- real - predicted
pct_error <- pred_error/real
squared_error <- pred_error^2
Calculate MSE, MAPE:
# For per-item prediction errors
apply(squared_error, MARGIN = 1, mean) # MSE
apply(abs(pct_error), MARGIN = 1, mean) # MAPE
# For per-week prediction errors
apply(squared_error, MARGIN = 2, mean) # MSE
apply(abs(pct_error), MARGIN = 2, mean) # MAPE
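If a single overall figure is wanted rather than per-item or per-week values, one option (reusing the same objects as above) is simply to average over every entry:
mean(squared_error) # overall MSE
mean(abs(pct_error)) # overall MAPE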