I have list data like the structure below. I want to perform nonlinear regression (Gaussian curve fitting) between mids and counts for each element of the list and report the mean and standard deviation.
mylist<- structure(list(A = structure(list(breaks = c(-10, -9,
-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4), counts = c(1L,
0L, 1L, 5L, 9L, 38L, 56L, 105L, 529L, 2858L, 17L, 2L, 0L, 2L),
density = c(0.000276014352746343, 0, 0.000276014352746343,
0.00138007176373171, 0.00248412917471709, 0.010488545404361,
0.0154568037537952, 0.028981507038366, 0.146011592602815,
0.788849020149048, 0.00469224399668783, 0.000552028705492686,
0, 0.000552028705492686), mids = c(-9.5, -8.5, -7.5, -6.5,
-5.5, -4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5),
xname = "x", equidist = TRUE), .Names = c("breaks", "counts",
"density", "mids", "xname", "equidist"), class = "histogram"),
B = structure(list(breaks = c(-7, -6, -5,
-4, -3, -2, -1, 0), counts = c(2L, 0L, 6L, 2L, 2L, 1L, 3L
), density = c(0.125, 0, 0.375, 0.125, 0.125, 0.0625, 0.1875
), mids = c(-6.5, -5.5, -4.5, -3.5, -2.5, -1.5, -0.5), xname = "x",
equidist = TRUE), .Names = c("breaks", "counts", "density",
"mids", "xname", "equidist"), class = "histogram"), C = structure(list(
breaks = c(-7, -6, -5, -4, -3, -2, -1, 0, 1), counts = c(2L,
2L, 4L, 5L, 14L, 22L, 110L, 3L), density = c(0.0123456790123457,
0.0123456790123457, 0.0246913580246914, 0.0308641975308642,
0.0864197530864197, 0.135802469135802, 0.679012345679012,
0.0185185185185185), mids = c(-6.5, -5.5, -4.5, -3.5,
-2.5, -1.5, -0.5, 0.5), xname = "x", equidist = TRUE), .Names = c("breaks",
"counts", "density", "mids", "xname", "equidist"), class = "histogram")), .Names = c("A",
"B", "C"))
I have read this:
Fitting a density curve to a histogram in R
but that covers fitting a curve to a histogram. What I want are the best-fit values:
"Best-fit values"
"    Mean"
"    SD"
If I use PRISM to do it, I should get the following results
for A
Mids Counts
-9.5 1
-8.5 0
-7.5 1
-6.5 5
-5.5 9
-4.5 38
-3.5 56
-2.5 105
-1.5 529
-0.5 2858
0.5 17
1.5 2
2.5 0
3.5 2
Performing nonlinear regression (Gaussian curve fitting), I get
"Best-fit values"
" Amplitude" 3537
" Mean" -0.751
" SD" 0.3842
for the second set
B
Mids Counts
-6.5 2
-5.5 0
-4.5 6
-3.5 2
-2.5 2
-1.5 1
-0.5 3
"Best-fit values"
" Amplitude" 7.672
" Mean" -4.2
" SD" 0.4275
and for the third one
Mids Counts
-6.5 2
-5.5 2
-4.5 4
-3.5 5
-2.5 14
-1.5 22
-0.5 110
0.5 3
I get this
"Best-fit values"
" Amplitude" 120.7
" Mean" -0.6893
" SD" 0.4397
To convert the histogram back into an estimate of the mean and standard deviation, first expand each bin's count into that many copies of the bin value. This gives an approximation of the original data.
Based on your example above:
#extract the mid points and create list of simulated data
simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
#if the original data were integers then this may give a better estimate
#simdata<-lapply(mylist, function(x){rep(x$breaks[-1], x$counts)})
#find the mean and sd of simulated data
means<-lapply(simdata, mean)
sds<-lapply(simdata, sd)
#or use sapply in the above 2 lines depending on future process needs
If your original data were integers, then using the breaks as the bins will provide a better estimate. Depending on how the histogram was constructed (i.e. right=TRUE/FALSE), the results may be shifted by one bin.
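As a quick illustration of the difference (a hedged sketch using element A from the list above; the mids sit half a bin below the right breaks, so the two reconstructions differ by exactly half a bin width):
#compare mid-point vs. right-break reconstruction for element A
simdata_mid   <- rep(mylist$A$mids,       mylist$A$counts)
simdata_break <- rep(mylist$A$breaks[-1], mylist$A$counts)
c(mean(simdata_mid), mean(simdata_break))  #means differ by 0.5 (half a bin)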
Edit
I thought this was going to be an easy one. I reviewed the video; the sample data shown was:
mids<-seq(-7, 7)
counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
simdata<-rep(mids, counts)
The video results were mean = -0.7359 and sd = 0.4571. The solution I found that came closest used the "fitdistrplus" package:
library(fitdistrplus)
fitdist(simdata, "norm", method = "mge")
Using the "maximizing goodness-of-fit estimation" resulted in mean = -0.7597280 and sd= 0.8320465.
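The same fit can also be applied to every histogram in the list at once (a small convenience sketch, assuming the simdata list built from mylist in the first code block):
lapply(simdata, fitdist, distr = "norm", method = "mge")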
At this point, the method above provides a close estimate but does not match exactly. I do not know what technique was used to calculate the fit in the video.
Edit #2
The solutions above involved recreating the original data and fitting it using either the mean/sd or the fitdistrplus package. This attempt performs a least-squares fit using the Gaussian distribution.
simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
means<-sapply(simdata, mean)
sds<-sapply(simdata, sd)
#Data from video
#mids<-seq(-7, 7)
#counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
#make list of the bins and distribution in each bin
mids<-lapply(mylist, function(x){x$mids})
dis<-lapply(mylist, function(x) {x$counts/sum(x$counts)}) #per-bin probabilities (comparable to dnorm here because the bin width is 1)
#function to perform the least square fit
nnorm<-function(values, mids, dis) {
means<-values[1]
sds<-values[2]
#print(paste(means, sds))
#calculate out the Gaussian distribution for each bin
modeld<-dnorm(mids, means, sds)
#sum of the squares
diff<-sum( (modeld-dis)^2)
diff
}
#use optim function with the mean and sd as initial guesses
#find the mininium with the mean and SD as fit parameters
lapply(1:3, function(i) {optim(c(means[[i]], sds[[i]]), nnorm, mids=mids[[i]], dis=dis[[i]])})
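Each optim() result stores the fitted parameters in $par, so the fitted means and SDs can be collected into a small table (a convenience sketch, not part of the original comparison):
fits <- lapply(1:3, function(i) {
  optim(c(means[[i]], sds[[i]]), nnorm, mids = mids[[i]], dis = dis[[i]])
})
t(sapply(fits, function(f) setNames(f$par, c("mean", "sd"))))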
This solution provides a closer answer to the PRISM results, but still not the same. Here is a comparison of all four solutions.
From the table, the least-squares fit (the one just above) provides the closest approximation. Tweaking the mid points passed to dnorm might help further. Note that the Case B data are the farthest from normally distributed, yet the PRISM software still reports a small standard deviation, while the other methods agree with one another. It is possible that PRISM performs some type of data filtering to remove outliers before the fit.
I'm having trouble creating an interaction effect plot. There is probably something fairly simple I don't yet know how to do. I'm pretty new to R and ggplot. My reprex is below. Your insight is greatly appreciated!
The data is from UCLA and I'm also adapting their example for my purposes here.
library(here)
library(emmeans)
library(tidyverse)
dat <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2019/03/exercise.csv")
Convert prog into a factor variable:
dat$prog <- factor(dat$prog, labels = c("jog","swim","read"))
The model
contcat <- lm(loss ~ hours * prog, data=dat)
summary(contcat)
I create mylist with certain points on hours and the two categories in prog that I want to contrast.
(mylist <- list(hours = seq(0, 4, .5), prog=c("jog","read")))
I then pass the object contcat to emmeans, requesting predicted values for every combination of hours and prog specified in at=mylist, and store the output in an object called emcontcat.
emcontcat <- emmeans(contcat, ~ hours * prog, at=mylist)
I use emmip to output a set of values using plotit=FALSE.
contcatdat <- emmip(contcat, prog ~ hours, at = mylist, CIs=TRUE, plotit=FALSE)
The output object is fed to ggplot. The interaction effect is plotted along with CI bands.
ggplot(data=contcatdat, aes(x=hours, y=yvar, color=prog)) +
geom_line() +
geom_ribbon(aes(ymax=UCL, ymin=LCL, fill=prog), alpha=0.4)
The plot looks like this:
But overlapping CIs do not always correspond to the portions of the lines where there is no significant difference in predicted values. I want to add dashed segments for the portions of the lines where there is no significant difference in predicted values. The figure below
shows the kind of figure I'm trying to create. (The figure is from a paper by Trenton Mize (2019) found here at Fig. 14.)
To get the simple effect (i.e., difference of two predicted values), I pass emcontcat into a function called contrast where we can request "pairwise" differences (or simple effects). P-values are given for jog - read at each level of hours that was specified in mylist.
contrast(emcontcat, "pairwise", by="hours")
The output is a table of jog - read contrasts with a p-value at each level of hours.
Where I am having trouble is in how to incorporate the simple effect (i.e., the parts of hours where jog - read is or is not significantly different) into ggplot as dashed or solid portions of the lines, like the Mize 2019 figure.
We want to know whether the intervals overlap, and where they do, we want dashed lines. That is easy with a small interval-intersection function, itvl_is_l(). However, on the left-hand side of the plot there is just one point in the non-overlapping region, and to draw a line we need at least two, so we first interpolate with approx() (which the plot functions also do internally). Since everything has to be done for both progs, we use by().
Preprocessing
## merge interpolations by prog
aux <- by(contcatdat, contcatdat$prog, \(x) {
x <- merge(x, data.frame(hours=with(x, seq.int(min(hours), max(hours),
length.out=1e3))), all=TRUE)
x$prog <- unique(na.omit(x$prog))
u <- c('yvar', 'LCL', 'UCL')
x[u] <- lapply(x[u], \(x) approx(x, xout=seq_along(x))$y)
x
})
## returns TRUE where the two intervals do NOT overlap
itvl_is_l <- \(a, b) {unname(as.vector(b[, 1] > a[, 2] | a[, 1] > b[, 2]))}
## flag where the CIs do not intersect
its <- itvl_is_l(aux$jog[c('LCL', 'UCL')], aux$read[c('LCL', 'UCL')])
aux <- lapply(aux, `[<-`, 'its', val=its) ## add as variable
aux <- lapply(aux, \(x) transform(x, itsn=cumsum(c(0, diff(x$its)) != 0) + 1)) ## making a sequence out of it
contcatdat <- do.call(rbind, aux) ## combine back as contcatdat
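The itsn step simply numbers consecutive runs of identical its values, which is what lets the plotting code below address the left, middle, and right segments separately. A toy illustration (hypothetical values, not part of the data):
r <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
cumsum(c(0, diff(r)) != 0) + 1  ## 1 1 2 2 3: three runs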
Plot
clr <- c('#FF0000', '#0000FF', '#0000001A') ## some colors
png('foo.png', 600, 400) ## open .png device
plot(yvar ~ hours, contcatdat, type='n')
grid()
## lines left
lines(yvar ~ hours, contcatdat, subset=prog == 'jog' & itsn > 2, lwd=2, col=clr[1])
lines(yvar ~ hours, contcatdat, subset=prog == 'read' & itsn > 2, lwd=2, col=clr[2])
## lines middle, dashed
lines(yvar ~ hours, contcatdat, subset=prog == 'jog' & itsn == 2, lwd=2, col=clr[1], lty=2)
lines(yvar ~ hours, contcatdat, subset=prog == 'read' & itsn == 2, lwd=2, col=clr[2], lty=2)
## lines right
lines(yvar ~ hours, contcatdat, subset=prog == 'jog' & itsn < 2, lwd=2, col=clr[1])
lines(yvar ~ hours, contcatdat, subset=prog == 'read' & itsn < 2, lwd=2, col=clr[2])
## CIs
with(subset(contcatdat, prog == 'jog'),
polygon(c(hours, rev(hours)), c(UCL, rev(LCL)), border=NA, col=clr[3]))
with(subset(contcatdat, prog == 'read'),
polygon(c(hours, rev(hours)), c(UCL, rev(LCL)), border=NA, col=clr[3]))
## legend
legend('topleft', legend=unique(contcatdat$prog), title='Group', col=clr[1:2], lty=1, lwd=2)
dev.off() ## close .png device
You could also try plotting the polygons first, opaque and with a border, if that looks better.
Data:
contcatdat <- structure(list(prog = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("jog",
"read"), class = "factor"), hours = c(0, 0, 0.5, 0.5, 1, 1, 1.5,
1.5, 2, 2, 2.5, 2.5, 3, 3, 3.5, 3.5, 4, 4), yvar = c(-6.78065983345649,
2.21637209230689, -3.05428518360714, 0.738291278604121, 0.672089466242214,
-0.739789535098646, 4.39846411609157, -2.21787034880141, 8.12483876594092,
-3.69595116250418, 11.8512134157903, -5.17403197620695, 15.5775880656396,
-6.65211278990971, 19.303962715489, -8.13019360361248, 23.0303373653383,
-9.60827441731525), SE = c(1.64384530410457, 1.48612021916972,
1.25520349531108, 1.14711211184156, 0.87926401607137, 0.820840725755632,
0.543079708493216, 0.531312719216624, 0.375535476484592, 0.376041650300328,
0.558013604603198, 0.501120592808483, 0.89777081499028, 0.781944232621328,
1.27470257475094, 1.1056003463909, 1.66373129934114, 1.44356083265185
), df = c(894, 894, 894, 894, 894, 894, 894, 894, 894, 894, 894,
894, 894, 894, 894, 894, 894, 894), LCL = c(-10.0069052579393,
-0.700318757711651, -5.51777400669205, -1.51305511813823, -1.05357261502514,
-2.35078883599747, 3.33260443922245, -3.26063588462286, 7.38780492844162,
-4.43397842739773, 10.7560441598055, -6.15754180868669, 13.815604150934,
-8.18677301395645, 16.8022045883112, -10.3000681349591, 19.7650632676689,
-12.4414373187615), UCL = c(-3.55441440897366, 5.13306294232543,
-0.590796360522233, 2.98963767534648, 2.39775154750957, 0.871209765800175,
5.46432379296068, -1.17510481297997, 8.86187260344022, -2.95792389761063,
12.946382671775, -4.19052214372721, 17.3395719803452, -5.11745256586298,
21.8057208426668, -5.96031907226584, 26.2956114630078, -6.77511151586902
), tvar = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), levels = c("jog", "read"), class = "factor"),
xvar = c(0, 0, 0.5, 0.5, 1, 1, 1.5, 1.5, 2, 2, 2.5, 2.5,
3, 3, 3.5, 3.5, 4, 4)), estName = "yvar", clNames = c("lower.CL",
"upper.CL"), pri.vars = c("prog", "hours"), adjust = "none", side = 0, delta = 0, type = "link", mesg = "Confidence level used: 0.95", row.names = c(NA,
18L), class = c("summary_emm", "data.frame"), labs = list(xlab = "hours",
ylab = "Linear prediction", tlab = "prog"), vars = list(byvars = character(0),
tvars = "prog"))
I'm trying to create a combination Boxplot/Scatterplot. I'm doing alright with it so far but there's one issue that's really bothering me that I've been unable to figure out. I'm in R and I've installed the ggplot2 package. Here's the code I'm using:
#(xx= stand in for my data set, which I imported from excel with the
# column labels as the X-axis values)
> boxplot(xx, lwd = 1.5, ylab = 'Minutes', xlab = "Epoch")
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue')
This gives me a plot that is pretty close to what I want but the problem is that the outliers are placed on the chart twice. If possible, I'd like to have the stripchart exclude them (highest groups of blue dots) and only use the ones from the boxplot (black outlined circles) so they stand out as different and don't look so sloppy.
I've tried to alter the points in question by putting a lot of different outlier arguments into the stripchart command, unfortunately with no luck. I've tried setting y-limits below their values, using outline=FALSE (which completely removes the stripchart), and changing the outlier color, outpch, etc. None of these attempts has worked. Here's an example with ylim:
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue', ylim = true,
ylim (0,20))
Error in ylim(0, 20) : could not find function "ylim"
And here's an example with outlier color:
> stripchart(xx, vertical = TRUE,
+ method = "jitter", add = TRUE, pch = 20, col = 'blue', outcol = "black")
Warning messages:
1: In plot.xy(xy.coords(x, y), type = type, ...) : "outcol" is not a
graphical parameter
.......# warning messages continue as such.
Are stripcharts capable of outlier exclusion? Or do I simply not know enough about them yet (and R as a whole, for that matter) to effectively write the code?
If this can be done, how should I proceed? I'm totally fine with solutions that don't directly address the outlier issue in terms of the data as long as the visual effect on the plot is the same.
Thank you for your time and any help you can give!
Edit: Here's some of the data to play around with. The top row is the column labels and the data are beneath; sorry if the formatting is bad. The 29s and 30s in the 9th row of data (10th overall) are examples of the points plotted as outliers in my graphs that I would like to keep in the boxplot but not in the scatterplot/stripchart.
1 5 10 15 30 60
7.233333333 8.166666667 9.666666667 7.75 9 7
7.133333333 9.25 9.333333333 9.75 10 11
0.733333333 0.5 0.833333333 1 1 0
1.766666667 1.166666667 1 0.75 1 0
1.75 2.25 2.333333333 2.25 1 1
6.75 7 7.166666667 7.75 6.5 7
1.516666667 1.75 1.333333333 2 2 2
1.533333333 1.5 2 1.25 1.5 2
27.3 28.33333333 29.33333333 30.25 28.5 29
6.35 6 6.333333333 7 6 6
7.083333333 8.333333333 8.833333333 8.75 8 8
8.533333333 10.08333333 10.5 12 10.5 11
7.65 8.416666667 9 10.75 9 12
6.85 7.333333333 8 7.25 6 8
4.433333333 5 5.5 5 6.5 6
8.616666667 10 11.66666667 12.25 13 12
3.633333333 3.75 3.5 3.25 3 2
0.8 0.75 0.833333333 1 1 0
7.283333333 8.583333333 9.666666667 9.75 12 8
7.483333333 8.75 8.333333333 7.75 6.5 7
3.466666667 2.916666667 3.166666667 2.5 2 0
5.483333333 6.416666667 6.833333333 6.75 7 8
There are a few things going on here. If you wanted to stick with the base plotting functions (boxplot() and stripchart()), you could simply tell stripchart to plot only the points that fall within some criterion. A common standard for outliers is any point 3 or more standard deviations from the mean. Instead of passing your unmodified data set to stripchart, we drop the outlying points from each column before plotting (note the [ ] subsetting inside lapply()).
boxplot(xx)
stripchart(lapply(xx, function(v) v[v <= mean(v) + sd(v) * 3]),
           vertical = TRUE, method = 'jitter', add = TRUE, pch = 20, col = 'blue')
Of course, if you really did want to use ggplot2 (and I recommend installing not only that package, but the entire tidyverse with install.packages('tidyverse')), you could produce an arguably nicer plot:
The data formatting and commands needed to produce the ggplot version are quite different from the base graphics version, and beyond the scope of this answer. Reproducible code follows.
library(tidyverse)
df <- structure(list(X1 = c(7.233333333, 7.133333333, 0.733333333, 1.766666667, 1.75, 6.75, 1.516666667, 1.533333333, 27.3, 6.35, 7.083333333, 8.533333333, 7.65, 6.85, 4.433333333, 8.616666667, 3.633333333, 0.8, 7.283333333, 7.483333333, 3.466666667, 5.483333333 ), X5 = c(8.166666667, 9.25, 0.5, 1.166666667, 2.25, 7, 1.75, 1.5, 28.33333333, 6, 8.333333333, 10.08333333, 8.416666667, 7.333333333, 5, 10, 3.75, 0.75, 8.583333333, 8.75, 2.916666667, 6.416666667 ), X10 = c(9.666666667, 9.333333333, 0.833333333, 1, 2.333333333, 7.166666667, 1.333333333, 2, 29.33333333, 6.333333333, 8.833333333, 10.5, 9, 8, 5.5, 11.66666667, 3.5, 0.833333333, 9.666666667, 8.333333333, 3.166666667, 6.833333333), X15 = c(7.75, 9.75, 1, 0.75, 2.25, 7.75, 2, 1.25, 30.25, 7, 8.75, 12, 10.75, 7.25, 5, 12.25, 3.25, 1, 9.75, 7.75, 2.5, 6.75), X30 = c(9, 10, 1, 1, 1, 6.5, 2, 1.5, 28.5, 6, 8, 10.5, 9, 6, 6.5, 13, 3, 1, 12, 6.5, 2, 7), X60 = c(7L, 11L, 0L, 0L, 1L, 7L, 2L, 2L, 29L, 6L, 8L, 11L, 12L, 8L, 6L, 12L, 2L, 0L, 8L, 7L, 0L, 8L)), .Names = c("X1", "X5", "X10", "X15", "X30", "X60"), class = "data.frame", row.names = c(NA, -22L))
df.long <- gather(df, x, value) %>%
mutate(x = as.factor(as.numeric(gsub('X', '', x)))) %>%
group_by(x) %>%
mutate(is.outlier = value > mean(value) + sd(value) * 3)
plot.df <- ggplot(data = df.long, aes(x = x, y = value, group = x)) +
geom_boxplot() +
geom_point(data = filter(df.long, !is.outlier), color = '#0000ff88', position = position_jitter(width = 0.1))
print(plot.df)
I use the toy dataset (class membership variable & 2 features) below to apply a Gaussian Naive Bayes model and plot the contours of the class-specific bivariate normal distributions.
How to add a line for the decision boundary to the plot below?
Like here:
(Image source: https://alliance.seas.upenn.edu/~cis520/dynamic/2016/wiki/uploads/Lectures/2class_gauss_NB.jpg)
# Packages
library(klaR)
library(MASS)
# Data
d <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"), x1 = c(2, 2.8, 1.5, 2.1, 5.5, 8, 6.9, 8.5, 2.5, 7.7), x2 = c(1.5, 1.2, 1, 1, 4, 4.8, 4.5, 5.5, 2, 3.5)), .Names = c("y", "x1", "x2"), row.names = c(NA, -10L), class = "data.frame")
# Naive Bayes Model
mN <- NaiveBayes(y ~ x1+x2, data = d)
# Data
# Class 1
m1 <- mean(d[which(d$y==1),]$x1)
m2 <- mean(d[which(d$y==1),]$x2)
mu1_2 <- c(m1,m2) # Mean
sd1 <- sd(d[which(d$y==1),]$x1)
sd2 <- sd(d[which(d$y==1),]$x2)
Sigma1_2 <- matrix(c(sd1^2, 0, 0, sd2^2), 2) # Covariance matrix (variances on the diagonal)
bivn1_2 <- mvrnorm(5000, mu = mu1_2, Sigma = Sigma1_2 ) # from MASS package: simulate draws from the bivariate normal
bivn1_2.kde <- kde2d(bivn1_2[,1], bivn1_2[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Class 0
m3 <- mean(d[which(d$y==0),]$x1)
m4 <- mean(d[which(d$y==0),]$x2)
mu3_4 <- c(m3,m4) # Mean
sd3 <- sd(d[which(d$y==0),]$x1)
sd4 <- sd(d[which(d$y==0),]$x2)
Sigma3_4 <- matrix(c(sd3^2, 0, 0, sd4^2), 2) # Covariance matrix (variances on the diagonal)
bivn3_4 <- mvrnorm(5000, mu = mu3_4, Sigma = Sigma3_4 ) # from MASS package: simulate draws from the bivariate normal
bivn3_4.kde <- kde2d(bivn3_4[,1], bivn3_4[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Plot
plot(x= d$x1, y=d$x2, xlim=c(-1,10), ylim=c(-1,10), col=d$y, pch=19, cex=2, ylab="x2", xlab="x1")
contour(bivn1_2.kde, add = TRUE, col="darkgrey") # from base graphics package
contour(bivn3_4.kde, add = TRUE, col="darkgrey") # from base graphics package
text(labels = "Class 1",x = 8, y=7, col="grey")
text(labels = "Class 0",x = 0, y=4, col="grey")
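One way to get such a boundary (a hedged sketch rather than a definitive answer: it evaluates the fitted klaR model on a grid and draws the contour where the posterior probability of class "1" equals 0.5; the grid size and limits are arbitrary choices matching the plot above):
# evaluate the posterior probability of class "1" on a grid
gx <- seq(-1, 10, length.out = 200)
grd <- expand.grid(x1 = gx, x2 = gx)
post <- predict(mN, newdata = grd)$posterior[, "1"]
# decision boundary: the 0.5 posterior contour
contour(x = gx, y = gx, z = matrix(post, nrow = length(gx)),
        levels = 0.5, add = TRUE, lwd = 2, drawlabels = FALSE)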
I would like to plot both a linear model (LM) and non-linear (GLM) model of the same data.
The range between the 16% and 84% response should line up between the LM and the GLM (citation: section 3.5).
I have included a fairly complete chunk of the code because I am not sure at which point I should try to cut the linear model, or at which point I have messed up (I think in the linear model).
The code below results in the following image:
My Objective (taken from previous citation-link).
Here is my data:
mydata3 <- structure(list(
dose = c(0, 0, 0, 3, 3, 3, 7.5, 7.5, 7.5, 10, 10, 10, 25, 25, 25, 50, 50, 50),
total = c(25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L),
affected = c(1, 0, 1.2, 2.8, 4.8, 9, 2.8, 12.8, 8.6, 4.8, 4.4, 10.2, 6, 20, 14, 12.8, 23.4, 21.6),
probability = c(0.04, 0, 0.048, 0.112, 0.192, 0.36, 0.112, 0.512, 0.344, 0.192, 0.176, 0.408, 0.24, 0.8, 0.56, 0.512, 0.936, 0.864)),
.Names = c("dose", "total", "affected", "probability"),
row.names = c(NA, -18L),
class = "data.frame")
My script:
#load libraries
library(ggplot2)
library(drc) # glm model
library(plyr) # rename function
library(scales) #log plot scale
#Creating linear model
mod_linear <- lm(probability ~ (dose), weights = total, data = mydata3)
#Creating data.frame: note values 3 and 120 refer to 16% and 84% response in sigmoidal plot
line_df <-expand.grid(dose=exp(seq(log(3),log(120),length=200)))
#Extracting values from linear model
p_line_df <- as.data.frame(cbind(dose = line_df,
predict(mod_linear, newdata=data.frame(dose = line_df),
interval="confidence",level=0.95)))
#Renaming linear df columns
p_line_df <-rename(p_line_df, c("fit"="probability"))
p_line_df <-rename(p_line_df, c("lwr"="Lower"))
p_line_df <-rename(p_line_df, c("upr"="Upper"))
p_line_df$model <-"Linear"
#Create sigmoidal dose-response curve using drc package
mod3 <- drm(probability ~ (dose), weights = total, data = mydata3, type ="binomial", fct=LL.2(names=c("Slope:b","ED50:e")))
#data frame for ggplot2
base_DF_3 <-expand.grid(dose=exp(seq(log(1.0000001),log(10000),length=200)))
#extract data from model
p_df3 <- as.data.frame(cbind(dose = base_DF_3,
predict(mod3, newdata=data.frame(dose = base_DF_3),
interval="confidence", level=.95)))
#renaming columns
p_df3 <-rename(p_df3, c("Prediction"="probability"))
p_df3$model <-"Sigmoidal"
#combining Both DataFames
p_df_all <- rbind(p_df3, p_line_df)
#plotting
ggplot(p_df_all, aes(x=dose,y=probability, group=model))+
geom_line(aes(x=dose,y=probability,group=model,linetype=model),show.legend = TRUE)+
scale_x_log10(breaks = c(0.000001, 10^(0:10)),labels = c(0, math_format()(0:10)))
Looking at the reference you provided, what the authors describe is the use of a linear model to approximate the central portion of a (sigmoidal) logistic function. The linear model that achieves this is a straight line that passes through the inflection point of the logistic curve, and has the same slope as the logistic function at that inflection point. We can use some basic algebra and calculus to solve this problem.
From ?LL.2, we see that the form of the logistic function being fitted by drm is
f(x) = 1 / {1 + exp(b(log(x) - log(e)))}
We can get the values of the coefficients in this equation with
b = mod3$coefficients[1]
e = mod3$coefficients[2]
Now, by differentiation (taking the derivative with respect to log(x), since the line will be drawn in log-dose coordinates), the slope of the logistic function is given by
dy/dlog(x) = -(b * exp((log(x)-log(e))*b)) / (1+exp((log(x)-log(e))*b))^2
At the inflection point, the dose (x) is equal to the coefficient e, thus the slope at the inflection point simplifies (greatly) to
sl50 = -b/4
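As a quick numerical sanity check of that simplification (a hedged sketch using a central difference in log(x) around the inflection point; it assumes b and e were extracted as above):
f <- function(x) 1 / (1 + exp(b * (log(x) - log(e))))
h <- 1e-4
(f(e * exp(h)) - f(e * exp(-h))) / (2 * h)  # should be approximately -b/4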
Since we also know that the inflection point occurs at the point where probability = 0.5 and dose = e, we can construct the straight line (in log-transformed coordinates) like this:
linear_probability = sl50 * (log(p_df3$dose) - log(e)) + 0.5
Now, to plot the logistic and linear functions together:
p_df3_lin = p_df3
p_df3_lin$model = 'linear'
p_df3_lin$probability = linear_probability
p_df_all <- rbind(p_df3, p_df3_lin)
ggplot(p_df_all, aes(x=dose,y=probability, group=model))+
geom_line(aes(x=dose,y=probability,group=model,linetype=model),show.legend = TRUE)+
scale_x_log10(breaks = c(0.000001, 10^(0:10)),labels = c(0, math_format()(0:10))) +
scale_y_continuous(limits = c(0,1))
I have a data set from a literature survey, where we looked at effects of pH on certain parameters (Metrics) in a group of animals. Because the experiments were done on different time scales, I divided the response ratio by time.
This leads to very small differences around 1 (less than 1 means a negative effect, greater than 1 a positive effect), which are still interesting and important (because the real values are divided by time). The problem is that some of the values are either very low or very high, so the differences close to 1 are not visible.
Since the values are close to 1, a log transformation of the y-axis scale does not help. How can I transform the y-axis scale in ggplot2 so that differences close to 1 are visible and yet intuitive (so that the reader can detect differences without thinking too much)? I could standardize the values to the minimum value, multiply by 10000, and take a log10 scale, but that would not lead to understandable differences.
df <- structure(list(Study = c(1, 1, 2, 2, 3), pH_control = c(8.06,
8.06, 8.01, 8.01, 7.99), pH_treatment = c(7.86, 7.75, 7.8, 7.8,
7.45), time = c(120, 120, 60, 150, 140), Metrics = structure(c(3L,
1L, 2L, 3L, 1L), .Label = c("Growth", "Metabolism", "Survival"
), class = "factor"), RR_per_time_unit = c(0.9998, 1.001, 1.002,
0.98, 0.9), CI.max = c(1, 1.003, 1.00003, 0.9999, 0.92), CI.min = c(0.9996,
0.9999, 1.004, 0.9789, 0.89), pH_diff = c(0.2, 0.31, 0.21, 0.21,
0.54)), .Names = c("Study", "pH_control", "pH_treatment", "time",
"Metrics", "RR_per_time_unit", "CI.max", "CI.min", "pH_diff"), row.names = c(NA,
-5L), class = "data.frame")
df$pH_diff <- df$pH_control - df$pH_treatment
library(ggplot2)
ggplot(df, aes(y = RR_per_time_unit, x = pH_diff, ymin = CI.min, ymax = CI.max)) +
geom_pointrange(aes(color = Metrics)) + geom_hline(aes(yintercept = 1)) + coord_trans(y = "log10")
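Nothing in the question pins down one right transformation, but here is one possible approach (a hedged sketch: a custom scales transformation that magnifies distance from 1; the name dist1 and the constant k are arbitrary choices for illustration):
library(scales)
k <- 1000  # controls how strongly the region around 1 is magnified
dist1_trans <- trans_new(
  name      = "dist1",
  transform = function(y) sign(y - 1) * log10(1 + k * abs(y - 1)),
  inverse   = function(z) 1 + sign(z) * (10^abs(z) - 1) / k
)
ggplot(df, aes(y = RR_per_time_unit, x = pH_diff, ymin = CI.min, ymax = CI.max)) +
  geom_pointrange(aes(color = Metrics)) +
  geom_hline(yintercept = 1) +
  scale_y_continuous(trans = dist1_trans)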