LMS (Lambda-Mu-Sigma) method in R

I want to create percentile curves for my data using the LMS (Lambda-Mu-Sigma) method. I have the following example data. How can the 10th, 50th and 90th percentile curves of yvar (on the y-axis) vs age (on the x-axis) be drawn using LMS?
age = sample(5:75, 500, replace=T)
yvar = rnorm(500, age, 20)
mydata = data.frame(age, yvar)
head(mydata)
age yvar
1 61 87.16011
2 58 49.73289
3 65 15.60212
4 71 83.32699
5 33 40.89592
6 18 25.04376
plot(age, yvar)
I came across VGAM package http://www.inside-r.org/packages/cran/VGAM/docs/lms.bcn . Is that the best method to do it? I could not really understand its example code to create simple percentile curve from above data. Thanks for your help.

Simulate data (reproducibly):
set.seed(1001)
mydata <- data.frame(
age = sample(5:75, 500, replace=TRUE))
mydata <- transform(mydata,
yvar = rnorm(500, age, 20))
Since the LMS method typically appears to be based on variants of the Box-Cox transformation, which requires positive values, a simpler way to do this would be to use quantile regression.
library("quantreg")
library("ggplot2"); theme_set(theme_bw())
g0 <- ggplot(mydata,aes(x=age,y=yvar))+geom_point()
g0 + geom_quantile(quantiles=0.1,formula=y~x,linetype=2)+
geom_quantile(quantiles=0.5,formula=y~x)+
geom_quantile(quantiles=0.9,formula=y~x,linetype=2)
rq() by itself has the capability to fit all three percentiles at the same time, but you need to use the strategy suggested in this blog post to draw them more conveniently:
model.rq <- rq(yvar ~ age, mydata, tau=c(0.1, 0.5, 0.9))
quantile.regressions <- data.frame(t(coef(model.rq)))
colnames(quantile.regressions) <- c("intercept", "slope")
quantile.regressions$quantile <- rownames(quantile.regressions)
g0 + geom_abline(aes(intercept=intercept, slope=slope,
colour=quantile), show.legend=TRUE, data=quantile.regressions)
Alternatively it is possible to do this within VGAM, but I'm not sure whether it's what you want/whether the results make sense or not. The Yeo-Johnson transformation, via lms.yjn, allows you to do this even when some data values are negative, but you might look at ?lms.bcg, ?lms.bcn for alternatives that work for non-negative data.
library("VGAM")
fit <- vgam(yvar ~ s(age, df = 4), lms.yjn, data=mydata,
control=vgam.control(maxit=100),
trace=FALSE)
We get a warning message:
## Warning message:
## In vgam.fit(x = x, y = y, w = w, mf = mf, Xm2 = Xm2, Ym2 = Ym2, :
## convergence not obtained in 100 iterations
This might be because we're overfitting the data by using a spline smooth with 4 degrees of freedom?
Quantile plot (following example("lms.yjn"))
par(bty = "l", mar = c(5, 4, 4, 3) + 0.1, xpd = TRUE)
qtplot(fit, percentiles = c(10, 50, 90),
las = 1, ylab = "yvar", lwd = 2, lcol = 4)
This is a terrible hack, but if you want access to the raw values so you can plot the curves yourself:
pcurves <- qtplot.lmscreg(fit,show.plot=FALSE,
percentiles=c(10,50,90))
vals <- data.frame(age=mydata$age,pcurves$fitted.values)
vals <- vals[order(vals$age),]
matplot(vals$age,vals[,-1],type="l",lty=c(2,1,2),col=1,
xlab="age",ylab="")
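For reference, once an LMS-type model supplies L (the Box-Cox power), M (the median) and S (the coefficient of variation) at a given age, a centile follows from the standard LMS formula M * (1 + L*S*z)^(1/L), or M * exp(S*z) when L is 0, where z is the standard normal quantile. A minimal base-R sketch; the helper name lms_centile and the L, M, S values are made up for illustration and are not fitted to the data above:

```r
# Hypothetical helper: centile from LMS parameters at a single age.
lms_centile <- function(p, L, M, S) {
  z <- qnorm(p)                    # standard normal quantile for probability p
  if (abs(L) < 1e-8) {
    M * exp(S * z)                 # limiting case L = 0 (log-normal)
  } else {
    M * (1 + L * S * z)^(1 / L)    # general Box-Cox case
  }
}

# Made-up parameter values, for illustration only
L <- 0.5; M <- 40; S <- 0.25
centiles <- sapply(c(0.10, 0.50, 0.90), lms_centile, L = L, M = M, S = S)
centiles
```

The middle value equals M, since z is 0 at the 50th percentile.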

Related

How to plot Two Sample t.test() means, medians, and CI in R?

With the data I have, this R code x <- t.test(Age ~ Completers, var.equal = TRUE, data = data) renders the following result:
Two Sample t-test
data: Age by Completers
t = 0.93312, df = 1060, p-value = 0.351
alternative hypothesis: true difference in means between group Completers and group Non Completers is not equal to 0
95 percent confidence interval:
-0.5844018 1.6442118
sample estimates:
mean in group Completers mean in group Non Completers
37.16052 36.63062
What I would like is to plot each mean (found in x$estimate[1] and x$estimate[2]) with its own point on the x axis at its proper height on the y axis (on the same graph) and each point complemented with the same confidence interval (CI) (found in x$conf.int[1] and x$conf.int[2]). Like this[*]:
Unfortunately, if I'm not mistaken, plot() (from the Generic X-Y Plotting) does not seem to handle this. So I tried with plotCI (from gplots) as follows:
library(gplots)
plotCI(x = x$estimate[1], y = x$estimate[2],
li = x$conf.int[1], ui = x$conf.int[2])
But it renders as shown below:
My questions:
Is there a way to obtain a plot such as in the first graph with Base R code?
If not, what would be the solution (short of using the jmv:: code (see [*]))?
EDIT
As requested in the comments, please find hereunder some code that helps reproduce the data (the t-test results won't be exactly the same as above, but the idea is the same):
# Generate random numbers with specific mean and standard deviation
completers <- data.frame(Completers = 1,
Age = rnorm(100, mean = 37.16052, sd = 8.34224))
nonCompleters <- data.frame(Completers = 0,
Age = rnorm(100, mean = 36.63062, sd = 11.12173))
# Convert decimaled number to integers
completers[] <- lapply(completers, as.integer)
nonCompleters[] <- lapply(nonCompleters, as.integer)
# Stack data from 2 different data frames
df <- rbind(completers, nonCompleters)
# Remove useless data frames
rm(completers, nonCompleters)
# Age ~ Completers (T-Test)
(tTest <- t.test(df$Age ~ df$Completers, var.equal = TRUE))
Sources:
Generate random numbers with specific mean and standard deviation (Scroll down until "From Normal Distribution")
Convert decimaled number to integers
Stack data from 2 different data frames
[*] Graph obtained with Jamovi Version 2.3.15.0 which uses the following code (but I would like to avoid using jmv::):
jmv::ttestIS(
formula = Age ~ Completers,
data = data,
plots = TRUE
)
System used:
R 4.2.1
RStudio 2022.07.1 Build 554
macOS Monterey Version 12.5.1 (Intel)
There appears to be a mismatch between what you want and what t.test() is giving you. t.test() lets you know whether there is a difference in means, and reports the CI of the difference in sample means (not the CIs of the individual means).
Since you stated you want the CIs of the individual means using base R, you can accomplish this by:
Sample data
nn <- 100
df <- data.frame(Completers = rep(c(1,0), each = nn),
Age = c(as.integer(rnorm(nn, mean = 37.16052, sd = 8.34224)),
as.integer(rnorm(nn, mean = 36.63062, sd = 11.12173))))
With the raw data, calculate the summary statistics and confidence interval:
# Base R - find summary statistics and restructure into data frame
df_summary <- aggregate(Age ~ Completers, df, function(x) c(mean = mean(x),
sd = sd(x),
median = median(x),
n = length(x)))
df_summary <- data.frame(Completers = df_summary[, 1], df_summary$Age) #reformat nested matrix
# Calculate 95% CI
alpha <- 0.05/2
# Critical t value with n - 1 degrees of freedom per group
t_crit <- qt(1 - alpha, df = df_summary$n - 1)
# Lower CI
df_summary$ci_low <-
df_summary$mean - t_crit * df_summary$sd /
sqrt(df_summary$n)
# Upper CI
df_summary$ci_hi <-
df_summary$mean + t_crit * df_summary$sd /
sqrt(df_summary$n)
# Output
# Completers mean sd median n ci_low ci_hi
#1 0 34.94 10.730698 34 100 32.81080 37.06920
#2 1 37.43 7.645234 37 100 35.91302 38.94698
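As a cross-check, the per-group intervals can also be obtained from a one-sample t.test() on each group, which reports the CI of that group's mean. A sketch on simulated stand-in data (not the asker's data; the name ci_by_group is illustrative):

```r
# Simulated stand-in data
set.seed(1)
nn <- 100
df <- data.frame(Completers = rep(c(1, 0), each = nn),
                 Age = c(rnorm(nn, mean = 37.16, sd = 8.34),
                         rnorm(nn, mean = 36.63, sd = 11.12)))

# One-sample t.test() per group: 95% CI of each group's own mean
ci_by_group <- t(sapply(split(df$Age, df$Completers),
                        function(a) t.test(a)$conf.int))
colnames(ci_by_group) <- c("ci_low", "ci_hi")
ci_by_group

# The same interval by hand for the Completers == 1 group
a <- df$Age[df$Completers == 1]
manual <- mean(a) + c(-1, 1) * qt(0.975, df = length(a) - 1) * sd(a) / sqrt(length(a))
```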
Now you can plot the mean and CI for each group (your example also mentioned you wanted the median in there):
# Set Y limits (change to whatever)
ylimits <- c(min(df_summary$ci_low) - 1,
max(df_summary$ci_hi) + 1)
# Plot
plot(NA, xlim = c(0,3), ylim = ylimits, # blank plot
axes = FALSE, xlab = "", ylab = "")
segments(x0 = c(1,2), y0 = df_summary$ci_low, y1 = df_summary$ci_hi) # add segments
points(df_summary$mean, pch = 19) # add means
points(df_summary$median, pch = 0)
axis(1, at = 0:3, labels = c(NA, "Noncompleters", "Completers", NA)) # add x axis (row 1 of df_summary is Completers = 0)
axis(2) #add y axis
mtext(side = 1, "Completers", padj = 4) # add x label
mtext(side = 2, "Age", padj = -4) # add y label
legend("topleft", c("Mean", "Median", "95% CI"),
pch = c(19, 0, NA), lty = c(NA, NA, 1), bty = "n")
Output:

create sequence of predictor values to generate posterior predictions of simultaneous change in predictors

I am trying to create a data frame (using either tidyr::expand_grid or tibble::tibble) in order to then generate posterior predictions using the tidybayes::epred_draws function (akin to posterior_predict). I have three continuous predictors that I would like to vary simultaneously at three set values: 1 standard deviation below the mean of each predictor, the mean of each predictor, and 1 standard deviation above the mean of each predictor. The issue I am running into is that I cannot figure out a way to generate values in between the set standard deviations while keeping the structure of the dataset intact.
I created a reproducible example below; as you can see, the final posterior prediction doesn't look great. Is there any way to generate additional incremental values in between the set standard deviation and mean?
My go-to method would be either seq() or even
modelr::seq_range(data_var_1, pretty=TRUE, n=100), but I'm not sure how to incorporate that in the new dataset in a way that allows me to see what happens when the predictors shift simultaneously.
Let me know if I can explain anything else.
library(brms)
library(tidybayes)
library(ggplot2)
library(ggthemes)
library(dplyr) # re-exports tibble() and %>%, both used below
## create a dataset
data <- tibble(
outcome = rnorm(100, 2, 2),
var_1 = rnorm(100, 5, 2),
var_2 = rnorm(100, 8, 2),
var_3 = rnorm(100, 10, 2)
)
## model the data
m1 <- brms::brm(outcome ~ var_1 + var_2 + var_3, data) # run model (takes a few sec.)
## prepare for predictions with set values
new_data = tibble(
var_1 = c(mean(data$var_1) - sd(data$var_1), mean(data$var_1), mean(data$var_1) + sd(data$var_1)),
var_2 = c(mean(data$var_2) - sd(data$var_2), mean(data$var_2), mean(data$var_2) + sd(data$var_2)),
var_3 = c(mean(data$var_3) - sd(data$var_3), mean(data$var_3), mean(data$var_3) + sd(data$var_3)))
# generate grand mean posterior predictions (for more on this,
# see: https://www.andrewheiss.com/blog/2021/11/10/ame-bayes-re-guide/)
pred_1 <- m1 %>%
tidybayes::epred_draws(new_data)
plot_1 <- ggplot(pred_1, aes(x = var_1, y = .epred)) +
stat_lineribbon() +
scale_fill_brewer(palette = "Reds") +
labs(x = "Shifts in Var 1, 2, and 3", y = "Outcome",
fill = "Credible interval") +
ggthemes::theme_pander() +
theme(legend.position = "bottom") +
scale_x_continuous(limits = c(new_data$var_1[1], new_data$var_1[3]),
breaks=c(new_data$var_1[1],
new_data$var_1[2],
new_data$var_1[3]),
labels = c("-1 SD", "Mean", "+1 SD"))
# visualize posterior predictions (example isn't so pretty, sorry)
plot_1
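One way to sketch the simultaneous shift asked about above is to build a single sequence of z-scores from -1 to +1 and map it onto each predictor's own scale, so that every row moves all three predictors by the same number of standard deviations. The helper name shift_grid and the extra column z are hypothetical; the resulting data frame could then be supplied as the new data for prediction. Base R only, on simulated data:

```r
set.seed(42)
data <- data.frame(
  outcome = rnorm(100, 2, 2),
  var_1 = rnorm(100, 5, 2),
  var_2 = rnorm(100, 8, 2),
  var_3 = rnorm(100, 10, 2)
)

# Hypothetical helper: shift every predictor by the same z (in SD units)
shift_grid <- function(d, vars, z = seq(-1, 1, length.out = 101)) {
  cols <- lapply(vars, function(v) mean(d[[v]]) + z * sd(d[[v]]))
  names(cols) <- vars
  data.frame(z = z, cols)
}

new_data <- shift_grid(data, c("var_1", "var_2", "var_3"))
head(new_data)
```

With 101 points the grid includes -1 SD, the mean, and +1 SD exactly, so the three original axis breaks still fall on grid values.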

Poisson regression model - how to add regression line with specific value of coefficient?

So, I have my poisson regression model: (mvdiff = market value diff, participations = participations in world cup)
mod <- glm(goals~participations+MVdiff, family = poisson)
plot(MVdiff, jitter(goals , 0.2), pch=17)
Now I would like to include a regression function line into my plot: as a function of MVdiff, with the value of 9.2 for participations (as in 9.2 is the mean of the participations in the world cup).
Here my try:
curve(exp(mod$coefficients[1]+mod$coefficients[2]*x+9.2+mod$coefficients[3]),
lwd=3,col="red",add=TRUE)
But this doesn't quite work out. Is there a way to properly add the value of 9.2 into my coefficient variable participation?
Plot:
To get the model output at a fixed value of an independent variable, use predict with the value of that variable fixed:
pred_df <- data.frame(MVdiff = seq(-1500, 1500), participations = 9.2)
predictions <- predict(mod, newdata = pred_df , type = "response")
Now plot this as a line over your data:
plot(df$MVdiff, jitter(df$goals , 0.2), pch = 16)
lines(pred_df$MVdiff, predictions, col = "red")
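The curve() attempt in the question can also be repaired directly: the fixed value 9.2 has to multiply the participations coefficient inside the linear predictor, and x has to multiply the MVdiff coefficient. A sketch on simulated data (the simulation settings are assumptions, not the asker's data, and mean_goals is an illustrative name), checking that the hand-written formula agrees with predict():

```r
set.seed(123)
participations <- rpois(500, 9.2)
MVdiff <- rnorm(500, 0, 600)
goals <- rpois(500, exp(0.2 + 0.0004 * MVdiff))
mod <- glm(goals ~ participations + MVdiff, family = poisson)

b <- unname(coef(mod))
# Linear predictor: intercept + b_participations * 9.2 + b_MVdiff * x
mean_goals <- function(x) exp(b[1] + b[2] * 9.2 + b[3] * x)

plot(MVdiff, jitter(goals, 0.2), pch = 17)
curve(mean_goals(x), lwd = 3, col = "red", add = TRUE)

# Compare against predict() at a few MVdiff values
x0 <- c(-500, 0, 500)
pred <- predict(mod, newdata = data.frame(MVdiff = x0, participations = 9.2),
                type = "response")
```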
Data used
Obviously, we don't have your data, so I had to create my own for the example with the following code:
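set.seed(123)
participations <- rpois(500, 9.2)
MVdiff <- rnorm(500, 0, 600)
goals <- rpois(500, 1 + MVdiff/3000)
df <- data.frame(goals, participations, MVdiff) # the plot above refers to df$MVdiff and df$goals
mod <- glm(goals ~ participations + MVdiff, family = poisson, data = df)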

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
geom_histogram(aes(fill = group), bins = 30, position = "identity", alpha = 0.5)
I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups," and "r threshold minimize misclassification," but my google-fu seems to be off today.
EDIT:
Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So Bo = 1.1857435 and B1 = -0.0911762
From Wikipedia, I see that F(x) = 1/(1+e^-(Bo + B1x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - Bo)/B1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
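One caveat: accuracy is a step function of the threshold, while optimize() was designed for continuous functions, so it may settle on a plateau that is not the best one. A hedged alternative sketch is to evaluate every midpoint between consecutive sorted values, which covers every distinct split of the observed data (the names candidates, acc and best are illustrative):

```r
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep("A", 100), rep("B", 100))

# Accuracy of the rule "B if value > th, otherwise A"
accuracy <- function(th) mean(ifelse(value > th, "B", "A") == group)

# Candidate thresholds: midpoints between consecutive sorted values
v <- sort(unique(value))
candidates <- (head(v, -1) + tail(v, -1)) / 2
acc <- sapply(candidates, accuracy)

best <- candidates[which.max(acc)]
c(threshold = best, accuracy = max(acc))
```

Several neighbouring midpoints can tie for the best accuracy; which.max() simply returns the first of them.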
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
I've gotten the answer I need thanks to help from @Thomas and @BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial
The dose.p() function in MASS will do the work for me given a glm fit
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
geom_histogram(aes(fill = factor(group)), bins = 30, position = "identity", alpha = 0.5)
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
prob <- 1 / (1 + exp(-1*(b0 + b1*x)))
return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
x <- (log(p / (1 - p)) - b0)/b1
return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
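At p = 0.5 the log-odds term log(p / (1 - p)) is zero, so the threshold reduces to -b0 / b1. A small base-R sketch on the same kind of simulated data, confirming that the fitted probability at that point is 0.5 (the names fit, threshold and p_at_threshold are illustrative):

```r
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))

fit <- glm(group ~ value, family = binomial)
b0 <- coef(fit)[[1]]
b1 <- coef(fit)[[2]]

# Threshold where the fitted probability of group 1 equals 0.5
threshold <- -b0 / b1

# Predicted probability at the threshold
p_at_threshold <- predict(fit, newdata = data.frame(value = threshold),
                          type = "response")
```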
Thanks, everyone, for your help!

R Language - Sorting data into ranges; averaging; ignore outliers

I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.
The data consists of 2 columns, WindSpeed and Power. So far I have imported the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is to sort the data into ranges (for example, all data where WindSpeed is between x and y), then find the average power generated for each range and graph the curve formed.
From this average I want to recalculate the average based on data which falls within one or two standard deviations of the average (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph, but like I said, the sheer quantity of data requires something heavier than Excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
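If outliers are a concern, loess() also accepts family = "symmetric", which down-weights points with large residuals through robustness iterations instead of discarding them by hand. A sketch with a handful of artificial outliers added (the contamination step is an assumption made purely for illustration):

```r
set.seed(1)
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(w_sp - 40) / 5)) + rnorm(1000, sd = 0.1)

# Artificial outliers: 30 readings knocked down to near zero
bad <- sample(1000, 30)
power[bad] <- runif(30, 0, 0.05)

x_grid <- seq(min(w_sp), max(w_sp), length.out = 100)
fit_default <- predict(loess(power ~ w_sp), data.frame(w_sp = x_grid))
fit_robust <- predict(loess(power ~ w_sp, family = "symmetric"),
                      data.frame(w_sp = x_grid))

plot(w_sp, power, col = "grey")
lines(x_grid, fit_default, col = "red", lwd = 2)   # least-squares loess
lines(x_grid, fit_robust, col = "blue", lwd = 2)   # robust loess
```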
Throw this version, similar in motivation to @hadley's, into the mix using an additive model with an adaptive smoother using package mgcv:
Dummy data first, as used by @hadley
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything and the Loess fit for comparison
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the power values between [0,5), [5,10), etc. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
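The same binning can be sketched with cut() and tapply(), which label each observation with its interval and then average within labels. The variable names breaks, bin, y_mean_cut and mids are illustrative:

```r
set.seed(1)
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(rnorm(1000, mean = w_sp, sd = 5) - 40) / 5))

bin_incr <- 5
breaks <- seq(0, 100, bin_incr)
# right = FALSE gives [0,5), [5,10), ...; include.lowest makes the last bin [95,100]
bin <- cut(w_sp, breaks = breaks, right = FALSE, include.lowest = TRUE)
y_mean_cut <- tapply(power, bin, mean)

# Plot bin means at the bin midpoints
mids <- head(breaks, -1) + bin_incr / 2
plot(w_sp, power)
points(mids, y_mean_cut, col = 3, pch = 16)
```

Swapping mean for median in the tapply() call gives bin medians instead.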
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
noOutliers = function(x, power, w_sp, bin_incr) {
d = power[w_sp >= x & w_sp < (x + bin_incr)]
m_d = mean(d)
d_trim = d[d > (m_d - 2*sd(d)) & (d < m_d + 2*sd(d))]
return(mean(d_trim))
}
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
Here are some examples of fitted curves (Weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .
# If you haven't already installed ggplot2:
install.packages("ggplot2", dependencies = TRUE)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 16)
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
p3 + scale_x_continuous(name = "mph") +
scale_y_continuous(name = "power") +
ggtitle("Wind speed and power")
