I have some time series data that looks like this:
x <- c(0.5833, 0.95041, 1.722, 3.1928, 3.941, 5.1202, 6.2125, 5.8828,
4.3406, 5.1353, 3.8468, 4.233, 5.8468, 6.1872, 6.1245, 7.6262,
8.6887, 7.7549, 6.9805, 4.3217, 3.0347, 2.4026, 1.9317, 1.7305,
1.665, 1.5655, 1.3758, 1.5472, 1.7839, 1.951, 1.864, 1.6638,
1.5624, 1.4922, 0.9406, 0.84512, 0.48423, 0.3919, 0.30773, 0.29264,
0.19015, 0.13312, 0.25226, 0.29403, 0.23901, 0.000213074755156413,
5.96565965097398e-05, 0.086874, 0.000926808687858284, 0.000904641782399267,
0.000513042259030044, 0.40736, 4.53928073402494e-05, 0.000765719624469057,
0.000717419263673946)
I would like to fit a curve to this data, using mixtures of one to five Gaussians. In Matlab, I could do the following:
fits{1} = fit(1:length(x),x,fittype('gauss1'));
fits{2} = fit(1:length(x),x,fittype('gauss2'));
fits{3} = fit(1:length(x),x,fittype('gauss3'));
... and so on.
In R, I am having difficulty identifying a similar method.
library(mclust)
dat <- data.frame(time = 1:length(x), x = x)
fits <- list()
fits[[1]] <- Mclust(dat, G = 1)
fits[[2]] <- Mclust(dat, G = 2)
fits[[3]] <- Mclust(dat, G = 3)
... but this does not really seem to be doing quite the same thing. For example, I am not sure how to calculate the R^2 between the fit curve and the original data using the Mclust solution.
Is there a simpler alternative in base R to fitting a curve using a mixture of Gaussians?
Function
With the code given below, and with a bit of luck in finding good initial parameters, you should be able to curve-fit Gaussians to your data.
In the function fit_gauss, the aim is to fit y as a sum of Gaussians in x; the number of Gaussians to use is determined by the length of the initial values for the parameters a, b and d, all of which must have equal length.
I have demonstrated curve-fitting of the OP's data with up to three Gaussians.
Specifying Initial Values
This is pretty much the most work I have done with nls (thanks to the OP for that), so I am not quite sure what the best method for selecting initial values is. Naturally, they depend on the heights of the peaks (a) and on the location and spread of x around them (b and d).
One option, for a given number of Gaussians, is to try a number of starting values and keep the one with the best fit based on the residual standard error, summary(fit)$sigma (see the sketch after the fit_gauss definition below).
I fiddled a bit to find initial parameters, but I dare say the parameters and the plot of the three-Gaussian model look solid.
Fitting one, two and three Gaussians to the example data
ind <- 1 : length(x)
# plot original data
plot(ind, x, pch = 21, bg = "blue")
# Gaussian fit
fit_gauss <- function(y, x, a, b, d) {
  # sum of length(a) Gaussian components: heights a, centres b, widths d
  p_model <- function(x, a, b, d) {
    rowSums(sapply(1:length(a),
                   function(i) a[i] * exp(-((x - b[i]) / d[i])^2)))
  }
  fit <- nls(y ~ p_model(x, a, b, d),
             start = list(a = a, b = b, d = d),
             trace = FALSE,
             control = list(warnOnly = TRUE, minFactor = 1/2048))
  fit
}
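As a side note (not part of the original answer), the random-restart idea mentioned under "Specifying Initial Values" could be sketched roughly as follows for a two-Gaussian fit; the number of restarts and the ranges used to draw starting values are arbitrary choices here:
set.seed(1)
best_fit <- NULL
for (i in 1:20) {
  start_a <- runif(2, 0, max(x))                  # candidate peak heights
  start_b <- sort(sample(ind, 2))                 # candidate peak locations
  start_d <- runif(2, 1, diff(range(ind)) / 4)    # candidate peak widths
  f <- tryCatch(fit_gauss(y = x, x = ind, a = start_a, b = start_b, d = start_d),
                error = function(e) NULL)
  if (!is.null(f) &&
      (is.null(best_fit) || summary(f)$sigma < summary(best_fit)$sigma)) {
    best_fit <- f
  }
}
summary(best_fit)$sigma   # residual standard error of the best random start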
Single Gaussian
g1 <- fit_gauss(y = x, x = ind, a=1, b = mean(ind), d = sd(ind))
lines(ind, predict(g1), lwd = 2, col = "green")
Two Gaussians
g2 <- fit_gauss(y = x, x = ind, a = c(coef(g1)[1], 1),
                b = c(coef(g1)[2], 30),
                d = c(coef(g1)[3], 2))  # coef(g1)[3] is the fitted width of the single Gaussian
lines(ind, predict(g2), lwd = 2, col = "red")
Three Gaussians
g3 <- fit_gauss(y = x, x = ind, a=c(5, 4, 4),
b = c(12, 17, 11), d = c(13, 2, 2))
lines(ind, predict(g3), lwd = 2, col = "black")
Summary of the fit with three Gaussians
summary(g3)
# Formula: x ~ p_model(ind, a, b, d)
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# a1 5.9307 0.5588 10.613 5.93e-14 ***
# a2 3.5689 0.7098 5.028 8.00e-06 ***
# a3 -2.2066 0.8901 -2.479 0.016894 *
# b1 12.9545 0.5289 24.495 < 2e-16 ***
# b2 17.4709 0.2708 64.516 < 2e-16 ***
# b3 11.3839 0.3116 36.538 < 2e-16 ***
# d1 11.4351 0.8568 13.347 < 2e-16 ***
# d2 1.8893 0.4897 3.858 0.000355 ***
# d3 1.0848 0.6309 1.719 0.092285 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.7476 on 46 degrees of freedom
#
# Number of iterations to convergence: 34
# Achieved convergence tolerance: 8.116e-06
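Since the question asked about R^2: the original answer does not compute it, but a pseudo-R^2 for an nls fit can be obtained from the residuals, and the one-, two- and three-Gaussian models can be compared with information criteria. A minimal sketch:
rss <- sum(residuals(g3)^2)
tss <- sum((x - mean(x))^2)
1 - rss / tss        # proportion of variance explained by the three-Gaussian fit
AIC(g1, g2, g3)      # compare the one-, two- and three-Gaussian models
BIC(g1, g2, g3)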
I am new to R and only have a basic understanding of statistics. I am learning how to use factorial experiments and how to fit models to their results, working from Design and Analysis of Experiments (Montgomery 2013, ISBN: 9781118097939).
I used example 5.3 from the textbook. The data can be seen in the code below.
bottel_vul_data <- data.frame(A = rep(c(10, 12, 14), each = 2),
                              B = rep(c(25, 30), each = 12),
                              C = rep(c(200, 250), each = 6),
                              vul = c(-3, -1, 0, 1, 5, 4,
                                      -1, 0, 2, 1, 7, 6,
                                      -1, 0, 2, 3, 7, 9,
                                      1, 1, 6, 5, 10, 11))
# bottel_vul_data
and completed the ANOVA analysis
bottel_anova <- aov(vul ~ factor(A) * B * C, data = bottel_vul_data)
summary(bottel_anova)
which yielded the same results as the textbook
Df Sum Sq Mean Sq F value Pr(>F)
factor(A) 2 252.75 126.38 178.412 1.19e-09 ***
B 1 45.37 45.37 64.059 3.74e-06 ***
C 1 22.04 22.04 31.118 0.00012 ***
factor(A):B 2 5.25 2.63 3.706 0.05581 .
factor(A):C 2 0.58 0.29 0.412 0.67149
B:C 1 1.04 1.04 1.471 0.24859
factor(A):B:C 2 1.08 0.54 0.765 0.48687
Residuals 12 8.50 0.71
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I then plotted the data to visualise it and compared four linear regression models:
only considering factors A, B and C;
considering factors A, B and C as well as the interactions A:B, A:C and B:C;
considering all factors and interactions;
and finally considering factors A, B, C and the interaction A:B.
The fourth model is supposed to be adequate because it includes all factors and interactions that show a statistically significant effect. Finally, I ran an ANOVA comparing the four models.
data <- bottel_vul_data[c('vul', 'A', 'B', 'C')]
plot(data)
par(mfrow = c(2,2))
boxplot(vul~A, data = data, main = '% karbonering')   # % carbonation
boxplot(vul~B, data = data, main = 'bedryfsdruk')     # operating pressure
boxplot(vul~C, data = data, main = 'lynsnelheid')     # line speed
boxplot(vul~A*B*C, data = data, main = 'A B C')
par(mfrow = c(2,2))
interaction.plot(data$A,data$B, data$vul)
interaction.plot(data$A,data$C, data$vul)
interaction.plot(data$B,data$C, data$vul)
par(mfrow = c(1,1))
model1 = lm(vul~., data = data)
summary(model1)
model2 = lm(vul~.^2, data = data)
summary(model2)
model3 = lm(vul~.^3, data = data)
summary(model3)
model4 = lm(vul ~ A + B + C + A*B, data = data)
summary(model4)
anova(model1, model2, model3, model4)
The output of the ANOVA is shown below
  Res.Df      RSS Df Sum of Sq          F     Pr(>F)
1     20 21.14583 NA        NA         NA         NA
2     17 14.47917  3  6.666667 2.46628131 0.09958762
3     16 14.41667  1  0.062500 0.06936416 0.79562649
4     19 16.08333 -3 -1.666667 0.61657033 0.61423917
This suggests that there is no statistically significant difference between models 3 and 4, so model 4 should be the simplest model that describes the data adequately.
I would like to know if there is a way to visualise the fitted models by plotting them along with the real data on a graph in R? I would also like to know if there is a way to check that the model is actually representative of the data? Finally, I want to fit second order models as well so that I can get surface responses, but I have no idea how to do this in R. Is there a function like lm() that I can use?
The problem is really one of demonstrating multiple dimensions of data in a single plot. If you show one of your variables (such as A) on the x axis, with vul on the y axis, then you have already used up your two dimensions. It is possible to represent extra dimensions using color, or point size, or alpha, or facets, though it can be hard to show regression results this way.
With this example, we can plot vul against A, and since there are only two values of the C variable, we can facet by C. To show the effect of B, we can draw several different regression lines in each panel at different values of B:
library(ggplot2)

new_df <- expand.grid(A = seq(min(data$A), max(data$A), length.out = 10),
                      C = c(min(data$C), max(data$C)),
                      B = seq(min(data$B), max(data$B), length.out = 20))
new_df$vul <- predict(model4, new_df)

ggplot(new_df, aes(A, vul, color = B, group = B)) +
  geom_line(alpha = 0.9) +
  geom_point(data = data, shape = 21, size = 3, aes(fill = B), color = 'black') +
  facet_wrap(paste('C =', C) ~ .) +
  scale_color_viridis_c() +
  scale_fill_viridis_c() +
  theme_bw(base_size = 16)
Note that we have our original data on the plot, with each variable being identifiable by its panel, position and color.
We can see that with our model, vul increases with A, since all the lines are up-sloping. We can see vul increases with C because the lines in the second panel are just shifted upwards with higher values of C, and we can see that there is a positive interaction between A and B because the lines in each panel get steeper as B gets higher (i.e. they 'fan out').
Note it would be possible to have B on the facets and C on the color scale to get an equally valid plot, but it just depends on what you wish to illustrate with your plot. I think the above example demonstrates nicely that both A and C are positively associated with vul, and that there is an interaction between A and B. However, it doesn't show us the direction of B's association with vul, or any of the intercept values. We could create plots to illustrate these features of the model if they are the things we wanted to demonstrate.
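The question also asked about checking model adequacy and about fitting second-order (response-surface) models; the plot above does not cover these, so here is a rough sketch. The diagnostic plots are base R; the quadratic term is added with I(), and only A gets one because B and C are run at just two levels in this design, so their quadratic terms cannot be estimated. (A dedicated package such as rsm exists for response surfaces, but plain lm() is enough to illustrate the idea.)
# quick adequacy check of model4: residuals vs fitted, QQ plot, etc.
par(mfrow = c(2, 2))
plot(model4)
par(mfrow = c(1, 1))

# second-order sketch: add a quadratic term for A (three levels available)
model_so <- lm(vul ~ A + B + C + A:B + I(A^2), data = data)
summary(model_so)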
I want to fit a model to a three factor factorial experiment. In an attempt to do this with R, I am reproducing examples from a textbook (Montgomery, DC (2013) Design and Analysis of Experiments, 8th ed. John Wiley & Sons ISBN: 9781118097939). The specific example I am attempting is Example 5.5, and although only a two factor example, I am hoping to learn the basics from it.
I can easily reproduce the ANOVA table in R, and I can retrieve the coefficients of the model (I think). Considering the model equation given in the image above, I assume that the four coefficients returned by R are β0, β1, β2 and β12. I have no idea how to plot the surface described by the model, which is my first problem. Secondly, the textbook discusses how a better model fit can be attained if the interaction parameters, i.e. β112, β122 and β1122, are included. Is it possible to do this in R as well? The surface fitted to the model including the interaction parameters is attached here.
I am relatively comfortable in python, although I have never plotted surfaces using matplotlib. I am very new in R, and have never plotted anything in R. From surfing the web I could not find anything useful for what I am trying to do. My code is attached below.
lewensduur_data <- data.frame(A = rep(c(15, 20, 25), each = 2),
                              B = rep(c(125, 150, 175), each = 6),
                              lewe = c(-2, -1, 0, 2, -1, 0,
                                       -3, 0, 1, 3, 5, 6,
                                       2, 3, 4, 6, 0, -1))
lewensduur_anova <- aov(lewe ~ A * B, data = lewensduur_data)
lewensduur_anova
which yields the ANOVA table
Call:
aov(formula = lewe ~ A * B, data = lewensduur_data)
Terms:
A B A:B Residuals
Sum of Squares 8.33333 21.33333 8.00000 86.33333
Deg. of Freedom 1 1 1 14
Residual standard error: 2.483277
Estimated effects may be unbalanced
I retrieved the coefficients as follows
coefficients(lewensduur_anova)
yielding
(Intercept)            A            B          A:B 
 -34.000000     1.366667     0.213333    -0.008000 
As an afterthought, I noticed that aov() reports that the estimated effects may be unbalanced. From what I understand, aov() is best suited to factors having the same number of levels and replicates. Is there a better ANOVA function to use for cases like my example?
There are many ways you can make this happen. I'm going to start with what you've done so far, though.
Your aov() reflects that you have 1 degree of freedom for A and 1 for B. That's a sign that something is wrong. The degrees of freedom for a categorical field is going to be the number of unique values minus one. The degrees of freedom for both A and B should be two.
Let's go over what went wrong: your entries for A and B are numbers, so the aov function did not interpret them as factors. The easiest way to fix this is to convert these two columns to factor type.
library(dplyr)   # for mutate()

av <- aov(lewe ~ A * B, data = mutate(lewensduur_data, A = as.factor(A), B = as.factor(B)))
summary(av)
# Df Sum Sq Mean Sq F value Pr(>F)
# A 2 24.33 12.167 8.423 0.00868 **
# B 2 25.33 12.667 8.769 0.00770 **
# A:B 4 61.33 15.333 10.615 0.00184 **
# Residuals 9 13.00 1.444
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These numbers are quite a bit different from your original numbers. You still need to ask yourself whether this information is accurate. Well, ANOVA requires that the data meet the assumptions of both normality and homogeneity of variance. Normality and homogeneity with only two observations in each category? Your results are not going to be meaningful.
This is not enough data to evaluate with ANOVA.
However, for the sake of your original questions, let's look at this data using the ANOVA() function from the jmv package. Note that I didn't make A or B factors for this call. Additionally, I used the type 2 sum of squares method (this is actually irrelevant here, though).
library(jmv)

an <- ANOVA(formula = lewe ~ A * B,
            data = lewensduur_data, ss = "2", effectSize = "partEta",
            homo = T, norm = T, postHocCorr = 'tukey', postHoc = ~A * B)
an # fails for homogeneity; passes for normality
#
# ANOVA
#
# ANOVA - lewe
# ───────────────────────────────────────────────────────────────────────────────────────────
# Sum of Squares df Mean Square F p η²p
# ───────────────────────────────────────────────────────────────────────────────────────────
# A 24.33333 2 12.166667 8.423077 0.0086758 0.6517857
# B 25.33333 2 12.666667 8.769231 0.0077028 0.6608696
# A:B 61.33333 4 15.333333 10.615385 0.0018438 0.8251121
# Residuals 13.00000 9 1.444444
# ───────────────────────────────────────────────────────────────────────────────────────────
#
#
# ASSUMPTION CHECKS
#
# Homogeneity of Variances Test (Levene's)
# ────────────────────────────────────────────
# F df1 df2 p
# ────────────────────────────────────────────
# 1.163368e+31 8 9 < .0000001
# ────────────────────────────────────────────
#
#
# Normality Test (Shapiro-Wilk)
# ─────────────────────────────
# Statistic p
# ─────────────────────────────
# 0.9209304 0.1342718
# ─────────────────────────────
If we ignored the insufficient data sample size, we still can't trust this information. ANOVA is very robust against deviations from normal distribution, but it's very sensitive to homogeneity. (Your homogeneity was never going to pass due to the sample size when compared to the number of groups.)
To show you how you can plot this, I used the plotly package. To use this package for a surface plot, you need a matrix that essentially looks like the table in your book, where the speeds are the column names and the angles are the row names. You can only have one entry per row/column combination, so I chose to use the average of the values in each group. To collect the averages I used lapply.
# get the average by speed and angle
new_lewe <- lapply(seq(1, (nrow(lewensduur_data) - 1), by = 2),
                   function(i) {
                     with(lewensduur_data[i:(i + 1), ], mean(lewe))
                   }) %>% unlist()
I placed these averages into a matrix and gave it row and column names.
ld <- matrix(data = new_lewe,
nrow = 3, ncol = 3,
dimnames = with(lewensduur_data, list(unique(A), unique(B))))
You can see a basic surface plot now.
library(plotly)

plot_ly(x = colnames(ld), y = rownames(ld),
        z = ld) %>% add_surface()
You can dress this surface plot up to look more like your book's surface plot, as well.
plot_ly(x = colnames(ld), y = rownames(ld),
        z = ld) %>%
  add_surface(contours = list(z = list(usecolormap = T, show = T,
                                       project = list(z = T)))) %>%
  layout(scene = list(aspectmode = "manual",
                      aspectratio = list(x = 2, y = 1.5, z = 1),
                      xaxis = list(title = "Cutting Speed"),
                      yaxis = list(title = "Tool Angle"),
                      zaxis = list(title = "Life"),
                      camera = list(
                        center = list(x = .5, y = .5, z = .4),
                        eye = list(x = 1.5, y = 2, z = 2))))
You can project this data to a basic contour plot, as well.
plot_ly(x = colnames(ld), y = rownames(ld),
z = ld, type = "contour", line = list(width = 0))
In that basic contour plot (above) the x & y are opposite of those in your book.
In the next contour plot, I've flipped the axes, so they match. You can dress it up, as well.
plot_ly(y = colnames(ld), x = rownames(ld),
z = t(ld), type = "contour", ncontours = 8,
line = list(width = 0, smoothing = 1.3),
contours = list(showlabels = T)) %>%
layout(xaxis = list(title = "Angle"),
yaxis = list(title = "Speed"))
If you have any questions, let me know!
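As an addendum (not from the answer above): the extended textbook model with the β112 (A²B), β122 (AB²) and β1122 (A²B²) terms that the question mentions can be fitted directly with lm(), using I() to build the polynomial terms. Roughly:
lewensduur_ext <- lm(lewe ~ A * B + I(A^2):B + A:I(B^2) + I(A^2):I(B^2),
                     data = lewensduur_data)
summary(lewensduur_ext)
# predict() over a grid of A and B values then gives the fitted surface, which
# can be reshaped into a matrix and passed to plot_ly() as shown above.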
I have seen pairwise or general paired simple linear regression many times on Stack Overflow. Here is a toy dataset for this kind of problem.
set.seed(0)
X <- matrix(runif(100), 100, 5, dimnames = list(1:100, LETTERS[1:5]))
b <- c(1, 0.7, 1.3, 2.9, -2)
dat <- X * b[col(X)] + matrix(rnorm(100 * 5, 0, 0.1), 100, 5)
dat <- as.data.frame(dat)
pairs(dat)
So basically we want to compute 5 * 4 = 20 regression lines:
----- A ~ B A ~ C A ~ D A ~ E
B ~ A ----- B ~ C B ~ D B ~ E
C ~ A C ~ B ----- C ~ D C ~ E
D ~ A D ~ B D ~ C ----- D ~ E
E ~ A E ~ B E ~ C E ~ D -----
Here is a poor man's strategy:
poor <- function (dat) {
  n <- nrow(dat)
  p <- ncol(dat)
  ## all formulae
  LHS <- rep(colnames(dat), p)
  RHS <- rep(colnames(dat), each = p)
  ## function to fit and summarize a single model
  fitmodel <- function (LHS, RHS) {
    if (RHS == LHS) {
      z <- data.frame("LHS" = LHS, "RHS" = RHS,
                      "alpha" = 0,
                      "beta" = 1,
                      "beta.se" = 0,
                      "beta.tv" = Inf,
                      "beta.pv" = 0,
                      "sig" = 0,
                      "R2" = 1,
                      "F.fv" = Inf,
                      "F.pv" = 0,
                      stringsAsFactors = FALSE)
    } else {
      result <- summary(lm(reformulate(RHS, LHS), data = dat))
      z <- data.frame("LHS" = LHS, "RHS" = RHS,
                      "alpha" = result$coefficients[1, 1],
                      "beta" = result$coefficients[2, 1],
                      "beta.se" = result$coefficients[2, 2],
                      "beta.tv" = result$coefficients[2, 3],
                      "beta.pv" = result$coefficients[2, 4],
                      "sig" = result$sigma,
                      "R2" = result$r.squared,
                      "F.fv" = result$fstatistic[[1]],
                      "F.pv" = pf(result$fstatistic[[1]], 1, n - 2, lower.tail = FALSE),
                      stringsAsFactors = FALSE)
    }
    z
  }
  ## loop through all models
  do.call("rbind.data.frame", c(Map(fitmodel, LHS, RHS),
                                list(make.row.names = FALSE,
                                     stringsAsFactors = FALSE)))
}
The logic is clear: get all pairs, construct the model formula (reformulate), fit a regression (lm), call summary, collect the statistics and rbind them into a data frame.
OK, fine, but what if there are p variables? We then need to do p * (p - 1) regressions!
An immediate improvement I could think of is fitting a linear model with multiple LHS. For example, the first column of that formula matrix is merged to
cbind(B, C, D, E) ~ A
This reduces the number of regressions from p * (p - 1) to p.
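As a quick illustration (my own addition, not essential to the rest of this post), lm() accepts a matrix response, so one call fits all models sharing the same RHS:
mlm <- lm(cbind(B, C, D, E) ~ A, data = dat)
coef(mlm)     # one (intercept, slope) column per LHS variable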
But we can definitely do even better without using lm and summary. Here is my previous attempt: Is there a fast estimation of simple regression (a regression line with only intercept and slope)?. It is fast because it uses covariance between variables for estimation, like solving the normal equation. But the simpleLM function there is pretty limited:
it needs to compute residual vectors to estimate residual standard error, which is a performance bottleneck;
it doesn't support multiple LHS, so it needs to be called p * (p - 1) times in pairwise regression settings.
Can we generalize it for fast pairwise regression, by writing a function pairwise_simpleLM?
General paired simple linear regression
A more useful variation of the above pairwise regression is the general paired regression between a set of LHS variables and a set of RHS variables.
Example 1
Fit paired regression between LHS variables A, B, C and RHS variables D, E, that is, fit 6 simple linear regression lines:
A ~ D A ~ E
B ~ D B ~ E
C ~ D C ~ E
Example 2
Fit a simple linear regression with multiple LHS variables to a particular RHS variable, say: cbind(A, B, C, D) ~ E.
Example 3
Fit a simple linear regression with a particular LHS variable, and a set of RHS variables one at a time, for example:
A ~ B A ~ C A ~ D A ~ E
Can we also have a fast function general_paired_simpleLM for this?
Caution
All variables must be numeric; factors are not allowed, as pairwise regression would make no sense for them.
Weighted regression is not discussed, as the variance-covariance method is not justified in that case.
Some statistical result / background
(Link in the picture: Function to calculate R2 (R-squared) in R; the Markdown source for the picture is given in the Appendix at the end of this answer.)
Computational details
The computation involved here is basically the computation of the variance-covariance matrix. Once we have it, the results for all pairwise regressions are just element-wise matrix arithmetic.
The variance-covariance matrix can be obtained by the R function cov, but the functions below compute it manually using crossprod. The advantage is that it can obviously benefit from an optimized BLAS library if you have one. Be aware that a significant amount of simplification is made in this way. The R function cov has the argument use which allows handling NA, but crossprod does not. I am assuming that your dat has no missing values at all! If you do have missing values, remove them yourself with na.omit(dat).
The initial as.matrix that converts a data frame to a matrix might be an overhead. In principle, if I coded everything up in C / C++, I could eliminate this coercion. And in fact, much of the element-wise matrix arithmetic could be merged into a single loop nest. However, I can't really be bothered to do this at the moment (as I have no time).
Some people may argue that the format of the final return is inconvenient. There could be other format:
a list of data frames, each giving the result of the regression for a particular LHS variable;
a list of data frames, each giving the result of the regression for a particular RHS variable.
This is really opinion-based. Anyway, you can always split the data frame I return by its "LHS" or "RHS" column yourself with split.data.frame.
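For example (my own addition; rr here refers to the data frame produced by pairwise_simpleLM on the toy data later in this answer):
by_lhs <- split.data.frame(rr, rr$LHS)
by_lhs[["A"]]   # all regressions with A on the left-hand side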
R function pairwise_simpleLM
pairwise_simpleLM <- function (dat) {
## matrix and its dimension (n: number of data points; p: number of variables)
dat <- as.matrix(dat)
n <- nrow(dat)
p <- ncol(dat)
## variable summary: mean, (unscaled) covariance and (unscaled) variance
m <- colMeans(dat)
V <- crossprod(dat) - tcrossprod(m * sqrt(n))
d <- diag(V)
## R-squared (explained variance) and its complement
R2 <- (V ^ 2) * tcrossprod(1 / d)
R2_complement <- 1 - R2
R2_complement[seq.int(from = 1, by = p + 1, length = p)] <- 0
## slope and intercept
beta <- V * rep(1 / d, each = p)
alpha <- m - beta * rep(m, each = p)
## residual sum of squares and standard error
RSS <- R2_complement * d
sig <- sqrt(RSS * (1 / (n - 2)))
## statistics for slope
beta.se <- sig * rep(1 / sqrt(d), each = p)
beta.tv <- beta / beta.se
beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
## F-statistic and p-value
F.fv <- (n - 2) * R2 / R2_complement
F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
## export
data.frame(LHS = rep(colnames(dat), times = p),
RHS = rep(colnames(dat), each = p),
alpha = c(alpha),
beta = c(beta),
beta.se = c(beta.se),
beta.tv = c(beta.tv),
beta.pv = c(beta.pv),
sig = c(sig),
R2 = c(R2),
F.fv = c(F.fv),
F.pv = c(F.pv),
stringsAsFactors = FALSE)
}
Let's compare the result on the toy dataset in the question.
oo <- poor(dat)
rr <- pairwise_simpleLM(dat)
all.equal(oo, rr)
#[1] TRUE
Let's see its output:
rr[1:3, ]
# LHS RHS alpha beta beta.se beta.tv beta.pv sig
#1 A A 0.00000000 1.0000000 0.00000000 Inf 0.000000e+00 0.0000000
#2 B A 0.05550367 0.6206434 0.04456744 13.92594 5.796437e-25 0.1252402
#3 C A 0.05809455 1.2215173 0.04790027 25.50126 4.731618e-45 0.1346059
# R2 F.fv F.pv
#1 1.0000000 Inf 0.000000e+00
#2 0.6643051 193.9317 5.796437e-25
#3 0.8690390 650.3142 4.731618e-45
When the LHS and RHS are the same variable, the regression is meaningless, hence the intercept is 0, the slope is 1, and so on.
What about speed? Still using this toy example:
library(microbenchmark)
microbenchmark("poor_man's" = poor(dat), "fast" = pairwise_simpleLM(dat))
#Unit: milliseconds
# expr min lq mean median uq max
# poor_man's 127.270928 129.060515 137.813875 133.390722 139.029912 216.24995
# fast 2.732184 3.025217 3.381613 3.134832 3.313079 10.48108
The gap is going be increasingly wider as we have more variables. For example, with 10 variables we have:
set.seed(0)
X <- matrix(runif(100), 100, 10, dimnames = list(1:100, LETTERS[1:10]))
b <- runif(10)
DAT <- X * b[col(X)] + matrix(rnorm(100 * 10, 0, 0.1), 100, 10)
DAT <- as.data.frame(DAT)
microbenchmark("poor_man's" = poor(DAT), "fast" = pairwise_simpleLM(DAT))
#Unit: milliseconds
# expr min lq mean median uq max
# poor_man's 548.949161 551.746631 573.009665 556.307448 564.28355 801.645501
# fast 3.365772 3.578448 3.721131 3.621229 3.77749 6.791786
R function general_paired_simpleLM
general_paired_simpleLM <- function (dat_LHS, dat_RHS) {
## matrix and its dimension (n: number of data points; p: number of variables)
dat_LHS <- as.matrix(dat_LHS)
dat_RHS <- as.matrix(dat_RHS)
if (nrow(dat_LHS) != nrow(dat_RHS)) stop("'dat_LHS' and 'dat_RHS' don't have same number of rows!")
n <- nrow(dat_LHS)
pl <- ncol(dat_LHS)
pr <- ncol(dat_RHS)
## variable summary: mean, (unscaled) covariance and (unscaled) variance
ml <- colMeans(dat_LHS)
mr <- colMeans(dat_RHS)
vl <- colSums(dat_LHS ^ 2) - ml * ml * n
vr <- colSums(dat_RHS ^ 2) - mr * mr * n
##V <- crossprod(dat - rep(m, each = n)) ## cov(u, v) = E[(u - E[u])(v - E[v])]
V <- crossprod(dat_LHS, dat_RHS) - tcrossprod(ml * sqrt(n), mr * sqrt(n)) ## cov(u, v) = E[uv] - E{u]E[v]
## R-squared (explained variance) and its complement
R2 <- (V ^ 2) * tcrossprod(1 / vl, 1 / vr)
R2_complement <- 1 - R2
## slope and intercept
beta <- V * rep(1 / vr, each = pl)
alpha <- ml - beta * rep(mr, each = pl)
## residual sum of squares and standard error
RSS <- R2_complement * vl
sig <- sqrt(RSS * (1 / (n - 2)))
## statistics for slope
beta.se <- sig * rep(1 / sqrt(vr), each = pl)
beta.tv <- beta / beta.se
beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
## F-statistic and p-value
F.fv <- (n - 2) * R2 / R2_complement
F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
## export
data.frame(LHS = rep(colnames(dat_LHS), times = pr),
RHS = rep(colnames(dat_RHS), each = pl),
alpha = c(alpha),
beta = c(beta),
beta.se = c(beta.se),
beta.tv = c(beta.tv),
beta.pv = c(beta.pv),
sig = c(sig),
R2 = c(R2),
F.fv = c(F.fv),
F.pv = c(F.pv),
stringsAsFactors = FALSE)
}
Apply this to Example 1 in the question.
general_paired_simpleLM(dat[1:3], dat[4:5])
# LHS RHS alpha beta beta.se beta.tv beta.pv sig
#1 A D -0.009212582 0.3450939 0.01171768 29.45071 1.772671e-50 0.09044509
#2 B D 0.012474593 0.2389177 0.01420516 16.81908 1.201421e-30 0.10964516
#3 C D -0.005958236 0.4565443 0.01397619 32.66585 1.749650e-54 0.10787785
#4 A E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
#5 B E 0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.10581331
#6 C E 0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.11849472
# R2 F.fv F.pv
#1 0.8984818 867.3441 1.772671e-50
#2 0.7427021 282.8815 1.201421e-30
#3 0.9158840 1067.0579 1.749650e-54
#4 0.8590604 597.3333 1.738263e-43
#5 0.7603718 310.9670 3.636655e-32
#6 0.8985126 867.6375 1.746439e-50
Apply this to Example 2 in the question.
general_paired_simpleLM(dat[1:4], dat[5])
# LHS RHS alpha beta beta.se beta.tv beta.pv sig
#1 A E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.1065687
#2 B E 0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.1058133
#3 C E 0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.1184947
#4 D E 0.066190196 -1.3767586 0.03597657 -38.26820 9.828853e-61 0.1952718
# R2 F.fv F.pv
#1 0.8590604 597.3333 1.738263e-43
#2 0.7603718 310.9670 3.636655e-32
#3 0.8985126 867.6375 1.746439e-50
#4 0.9372782 1464.4551 9.828853e-61
Apply this to Example 3 in the question.
general_paired_simpleLM(dat[1], dat[2:5])
# LHS RHS alpha beta beta.se beta.tv beta.pv sig
#1 A B 0.112229318 1.0703491 0.07686011 13.92594 5.796437e-25 0.16446951
#2 A C 0.025628210 0.7114422 0.02789832 25.50126 4.731618e-45 0.10272687
#3 A D -0.009212582 0.3450939 0.01171768 29.45071 1.772671e-50 0.09044509
#4 A E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
# R2 F.fv F.pv
#1 0.6643051 193.9317 5.796437e-25
#2 0.8690390 650.3142 4.731618e-45
#3 0.8984818 867.3441 1.772671e-50
#4 0.8590604 597.3333 1.738263e-43
We can even just do a simple linear regression between two variables:
general_paired_simpleLM(dat[1], dat[2])
# LHS RHS alpha beta beta.se beta.tv beta.pv sig
#1 A B 0.1122293 1.070349 0.07686011 13.92594 5.796437e-25 0.1644695
# R2 F.fv F.pv
#1 0.6643051 193.9317 5.796437e-25
This means that the simpleLM function mentioned earlier is now obsolete.
Appendix: Markdown (needs MathJax support) for the picture
Denote our variables by $x_1$, $x_2$, etc. A pairwise simple linear regression takes the form $$x_i = \alpha_{ij} + \beta_{ij}x_j$$ where $\alpha_{ij}$ and $\beta_{ij}$ are the intercept and the slope of $x_i \sim x_j$, respectively. We also denote $m_i$ and $v_i$ as the sample mean and **unscaled** sample variance of $x_i$. Here, the unscaled variance is just the sum of squares without dividing by sample size, that is $v_i = \sum_{k = 1}^n(x_{ik} - m_i)^2 = (\sum_{k = 1}^nx_{ik}^2) - n m_i^2$. We also denote $V_{ij}$ as the **unscaled** covariance between $x_i$ and $x_j$: $V_{ij} = \sum_{k = 1}^n(x_{ik} - m_i)(x_{jk} - m_j) = (\sum_{k = 1}^nx_{ik}x_{jk}) - nm_im_j$.
Using the results for a simple linear regression given in [Function to calculate R2 (R-squared) in R](https://stackoverflow.com/a/40901487/4891738), we have $$\beta_{ij} = V_{ij} \ / \ v_j,\quad \alpha_{ij} = m_i - \beta_{ij}m_j,\quad r_{ij}^2 = V_{ij}^2 \ / \ (v_iv_j),$$ where $r_{ij}^2$ is the R-squared. Knowing $r_{ij}^2 = 1 - RSS_{ij} \ / \ TSS_{ij}$, where $RSS_{ij}$ and $TSS_{ij} = v_i$ are the residual sum of squares and the total sum of squares of $x_i \sim x_j$, we can derive $RSS_{ij}$ and the residual standard error $\sigma_{ij}$ **without actually computing residuals**: $$RSS_{ij} = (1 - r_{ij}^2)v_i,\quad \sigma_{ij} = \sqrt{RSS_{ij} \ / \ (n - 2)}.$$
F-statistic $F_{ij}$ and associated p-value $p_{ij}^F$ can also be obtained from sum of squares: $$F_{ij} = \tfrac{(TSS_{ij} - RSS_{ij}) \ / \ 1}{RSS_{ij} \ / \ (n - 2)} = (n - 2) r_{ij}^2 \ / \ (1 - r_{ij}^2),\quad p_{ij}^F = 1 - \texttt{CDF_F}(F_{ij};\ 1,\ n - 2),$$ where $\texttt{CDF_F}$ denotes the CDF of F-distribution.
The only thing left is the standard error $e_{ij}$, t-statistic $t_{ij}$ and associated p-value $p_{ij}^t$ for $\beta_{ij}$, which are $$e_{ij} = \sigma_{ij} \ / \ \sqrt{v_j},\quad t_{ij} = \beta_{ij} \ / \ e_{ij},\quad p_{ij}^t = 2 * \texttt{CDF_t}(-|t_{ij}|; \ n - 2),$$ where $\texttt{CDF_t}$ denotes the CDF of the t-distribution.
I am interested in using nls to aid in fitting the Langmuir Equation Y =(Qmax*k*X)/(1+(k*X)) similar to what was done in this post Fitting Non-linear Langmuir Isotherm in R. The parameter of the equation I am interested in is Qmax which corresponds to the horizontal asymptote (green line) of the plotted sorption data below. Is there a more robust approach other than nls or a way to improve my use of nls that I could employ to get a Qmax value as close as possible to the visual asymptote (green line) around Qmax=3200?
Lang <- nls(formula = Y ~ (Qmax*k*X)/(1+(k*X)), data = data,
            start = list(Qmax = 3600, k = 0.015), algorithm = "port")
Using the following data:
X Y
1 3.08 84.735
2 5.13 182.832
3 6.67 251.579
4 9.75 460.077
5 16.30 779.350
6 25.10 996.540
7 40.80 1314.739
8 68.90 1929.422
9 111.00 2407.668
10 171.00 3105.850
11 245.00 3129.240
12 300.00 3235.000
I'm getting a Qmax = 4253.63 (red line) - approximately 1000 units away. Using upper and lower limits only results in a Qmax of what I set the upper limit to and changing initial values doesn't appear to change the outcome. Is this a challenge that can be solved with a different approach to non-linear regression than I've taken in base R or is this a statistical/mathematical problem first and foremost?
Plot of Non-linear Langmuir Isotherm
summary(Lang)
Formula: Y ~ (Qmax * k * X)/(1 + (k * X))
Parameters:
Estimate Std. Error t value Pr(>|t|)
Qmax 4.254e+03 1.554e+02 27.37 9.80e-11 ***
k 1.209e-02 1.148e-03 10.53 9.87e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 99.14 on 10 degrees of freedom
Algorithm "port", convergence message: relative convergence (4)
My attempt at the linearization of the model was less successful:
z <- 1/data                 # reciprocal transform: 1/Y = 1/Qmax + (1/(Qmax*k)) * (1/X)
plot(Y ~ X, z)
abline(lm(Y ~ X, z))
M <- lm(Y ~ X, z)
Qmax <- 1/coef(M)[1]        # intercept = 1/Qmax
# 4319.22
k <- coef(M)[1]/coef(M)[2]  # k = intercept/slope
# 0.00695
Disclaimer: This is my first post so please bear with me, and I'm relatively new to R. With that being said any technical advice that might help me improve my technique above would be greatly appreciated.
Not sure why you expect Qmax to be that low.
I rewrote your dependency in a simpler form, removing the multiplication by k and replacing it with an additive constant (a = 1/k) by dividing both the numerator and the denominator by k. The result looks fine to my eye.
library(ggplot2)
library(data.table)
dt <- fread("R/Langmuir.dat", sep = " ")
Lang <- nls(formula = Y ~ (Qmax*X)/(a+X), data = dt, start = list(Qmax = 3600, a = 100.0), algorithm = "port")
q <- summary(Lang)
Qmax <- q$coefficients[1,1]
a <- q$coefficients[2,1]
f <- function(x, Qmax, a) {
(Qmax*x)/(a+x)
}
p <- ggplot(data = dt, aes(x = X, y = Y))
p <- p + geom_point()
p <- p + xlab("T") + ylab("Q") + ggtitle("Langmuir Fit")
p <- p + stat_function(fun = function(x) f(x, Qmax=Qmax, a=a))
print(p)
print(Qmax)
print(a)
Output
4253.631
82.68501
Graph
UPDATE
Basically, there are too many points at low X, which makes it hard to get the curve to bend towards a lower Qmax. A designed way to make the curve bend is to add weights. For example, if I add a weights column after reading the data table:
dt[, W := (as.numeric(N)/12.0)^3]
and run nls with weights
Lang <- nls(formula = Y ~ (Qmax*X)/(a+X), data = dt, start = list(Qmax = 3600, a = 100.0), weights = dt$W, algorithm = "port")
I'll get Qmax and a
[1] 4121.114
[1] 74.89386
with the following graph
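A possible alternative not used above (my own aside): the Langmuir form Qmax*X/(a + X) is identical to the Michaelis-Menten model, so base R's self-starting SSmicmen() model can fit it without hand-picked starting values (assuming, as above, that dt has columns X and Y):
Lang_ss <- nls(Y ~ SSmicmen(X, Qmax, a), data = dt)
summary(Lang_ss)   # should essentially reproduce the unweighted fit (Qmax ~ 4254, a ~ 82.7)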
I am now trying to estimate the sample size needed for A/B testing of a website conversion rate. pwr.chisq.test always gives me an error message when I have a small value of the conversion rate:
library(pwr)

# conversion rate for two groups
p1 = 0.001
p2 = 0.0011
# degrees of freedom
df = 1
# effect size
w = ES.w1(p1, p2)
pwr.chisq.test(w,
               df = 1,
               power = 0.8,
               sig.level = 0.05)
Error in uniroot(function(N) eval(p.body) - power, c(1 + 1e-10, 1e+05)) :
  f() values at end points not of opposite sign
However, if I use larger values for p1 and p2, this code works fine.
# conversion rate for two groups
p1 = 0.01
p2 = 0.011
# degree of freedom
df = 1
# effect size
w = ES.w1(p1,p2)
pwr.chisq.test(w,
df = 1,
power=0.8,
sig.level=0.05)
     Chi squared power calculation 

              w = 0.01
              N = 78488.61
             df = 1
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations
I think there is a "numerical" explanation for that. If you take a look at the function's code, you can see that the number of observations is computed by uniroot and is assumed to lie in an interval whose boundaries are set to 1 + 1e-10 and 1e+05. The error message states that this interval does not contain the solution: in your case, the upper limit is too small.
Knowing that, we can simply take a wider interval:
w <- 0.00316227766016838    # the effect size returned by ES.w1(0.001, 0.0011)
k <- qchisq(0.05, df = 1, lower = FALSE)
p.body <- quote(pchisq(k, df = 1, ncp = N * w^2, lower = FALSE))
N <- uniroot(function(N) eval(p.body) - 0.8, c(1 + 1e-10, 1e+7))$root
The "solution" is N=784886.1... that's a huge number of observations.