Three-way graph (variable, mean, sd) with ggplot2 - r

I think I have an error in my logic while reproducing a graph I found in a PDF.
It should be fairly easy to do, but I am having trouble plotting a variable together with its mean and standard deviation, each in its own panel, as in the example graph below. Did they do it with facet_grid() or facet_wrap()?
How can I plot an arbitrary variable in that way? In particular, I do not know how to plot the mean and sd over distance (or time).
Example graph:

Here's my approach to the solution outlined by @DavidArenburg (though I simplified the data a little, using simple cumulative statistics and a plain index):
library(tidyr)
library(dplyr)
library(TTR)
library(ggplot2) # needed for ggplot(), geom_path() and facet_grid()
v <- rnorm(1000)
df <- data.frame(index = 1:1000,
variable = v,
mean = runMean(v, n=1, cumulative=TRUE),
sd = runSD(v, n=1, cumulative=TRUE))
dd <- gather(df, facet, value, -index)
ggplot(dd, aes(x = index, y = value)) +
geom_path() +
facet_grid(facet ~ .)
Bonus: the plot illustrates that the running sample mean and sd converge towards the true values (0 and 1, respectively).
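If TTR is not available, the same cumulative statistics can be sketched in base R. This is only an illustration under assumptions: the names cum_mean, cum_var, cum_sd and dd_base are made up here, and the running variance uses the textbook sum-of-squares identity, which is adequate for well-scaled data like rnorm() but not numerically robust in general.

```r
set.seed(1)
v <- rnorm(1000)
k <- seq_along(v)

# Running mean: cumulative sum divided by the number of points seen so far
cum_mean <- cumsum(v) / k

# Running sample variance via sum of squares; the first entry is 0/0 = NaN
cum_var <- (cumsum(v^2) - k * cum_mean^2) / (k - 1)
cum_sd  <- sqrt(cum_var)

# Long format: one row per (index, facet), the shape facet_grid(facet ~ .) expects
dd_base <- data.frame(
  index = rep(k, times = 3),
  facet = rep(c("variable", "mean", "sd"), each = length(v)),
  value = c(v, cum_mean, cum_sd)
)
```

dd_base can be passed to the same ggplot() call as dd above; the first sd value is missing because a single point has no standard deviation.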

Related

Extract critical points of a polynomial model object in R?

I am trying to solve for the critical points of a cubic polynomial function which has been fitted to data, i.e. the values of x where the first derivative is zero.
I also need a way to find the values of y at the critical points of x.
It is easy enough to fit the model using lm() and to view the model quality with summary(). And I can plot the function easily enough by adding predictions and using geom_line().
There must be a package or a base R function dedicated to this problem. Can anyone suggest a method?
Below is a reprex to depict the problem. Needless to say, the arrows are drawn only to illustrate the question; they are not mapped to the true critical points, or I would not be asking this question...
library(tidyverse)
library(modelr)
set.seed(0)
#generate random data and plot the values
df <- tibble(x= sample(x= c(-100:200), size= 50),
y= -0.5*(x^3) + 50*(x^2) + 7*(x) + rnorm(n=50, mean=10000, sd=50000) )
df %>% ggplot(aes(x, y)) +
geom_point()
# fit a model to the data
cubic_poly_model <- lm(data= df, formula = y~poly(x, 3))
# plot the fitted model
df %>%
add_predictions(model = cubic_poly_model) %>%
ggplot(aes(x, y))+
geom_point(alpha=1/3)+
geom_line(aes(x, y=pred))+
annotate('text', label= 'critical point A', x=-50, y=-250000)+
geom_segment(x=-50, xend=-10, y=-200000, yend=-5000, arrow = arrow(length=unit(3, 'mm'), type = 'closed'))+
annotate('text', label= 'critical point B', x=140, y=400000)+
geom_segment(x=110, xend=90, y=300000, yend=100000, arrow = arrow(length=unit(3, 'mm'), type = 'closed'))
# But how can I get the critical values of x and the y values they produce?
Created on 2020-09-03 by the reprex package (v0.3.0)
I devised a solution using the mosaic package. The makeFun() function converts a model object into a function. You can then use base R's optimize() to find the maximum or minimum of that function over a specified interval (in this case, the range of x values). Set the maximum argument of optimize() to TRUE for the local maximum or FALSE for the local minimum.
See code below:
library(magrittr)
set.seed(0)
#generate random data and plot the values
df <- tibble::tibble(x= sample(x= c(-100:200), size= 50),
y= -0.5*(x^3) + 50*(x^2) + 7*(x) + rnorm(n=50, mean=10000, sd=50000) )
cubic_poly_model <- lm(data= df, formula = y~poly(x, 3))
crit_values <- cubic_poly_model %>%
mosaic::makeFun() %>%
optimize(interval = c(min(df$x), max(df$x)), maximum = TRUE)
funct_crit_x <- crit_values[['maximum']][[1]]
funct_max <- crit_values[['objective']]
funct_crit_x
funct_max
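As an alternative sketch that needs no extra package, the critical points of a fitted cubic can also be obtained analytically: fit with raw polynomial terms, differentiate the coefficients by hand, and solve the resulting quadratic with polyroot(). This is not the method from the answer above; the names fit, b, crit_x, crit_y and is_max are illustrative, and the approach assumes the model is a plain polynomial in x.

```r
set.seed(0)
x <- sample(-100:200, size = 50)
y <- -0.5 * x^3 + 50 * x^2 + 7 * x + rnorm(50, mean = 10000, sd = 50000)

# raw = TRUE gives coefficients of 1, x, x^2, x^3 directly
fit <- lm(y ~ poly(x, 3, raw = TRUE))
b <- unname(coef(fit))  # b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3

# First derivative: b[2] + 2*b[3]*x + 3*b[4]*x^2; polyroot() wants increasing powers
roots <- polyroot(c(b[2], 2 * b[3], 3 * b[4]))

# Keep the (numerically) real roots only
crit_x <- sort(Re(roots[abs(Im(roots)) < 1e-6]))

# Fitted y values at the critical points
crit_y <- predict(fit, newdata = data.frame(x = crit_x))

# Second derivative sign tells maximum (negative) from minimum (positive)
is_max <- (2 * b[3] + 6 * b[4] * crit_x) < 0
```

Unlike optimize(), this returns both critical points at once, but it only applies when the derivative can be written down in closed form.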

ANOVA significance visualisation of replicate experiment data (ggplot)

I am struggling to get significance values for my replicate experiment data. The experiment was done in duplicate for each species, and I want to compare how significant the differences between species are at each time point. I am trying to do a two-way ANOVA...
library(ggplot2)
library(reshape)
library(dplyr)
library(tidyr) # provides separate(), used below
abs2.melt<-melt(abs2,
id.vars='Time',
measure.vars=c('WT','WT.1','DsigB','DsigB.1','DrsbR','DrsbR.1'))
print(abs2.melt)
abs2.melt.mod<-abs2.melt %>%
separate(col=variable,into=c('Species'),sep='\\.')
print(abs2.melt.mod)
ggplot(abs2.melt.mod,aes(x=Time,y=value,group=Species))+
stat_summary(
fun =mean,
geom="line",
aes(color=Species))+
stat_summary(
fun=mean,
geom="point")+
stat_summary(
fun.data=mean_cl_boot,
geom='errorbar',
width=2)+
theme_bw()+
xlab("Time")+
ylab("OD600")+
labs(title="Growth Curve of Mutant Strains")
summary(abs2.melt.mod)
print(abs2.melt.mod)
###SD and mean values
as.data.frame<-abs2.melt.mod %>% group_by(Species,Time) %>%
summarize(mean.val=mean(value), sd.val=sd(value))
anova1<-aov(value~Species,data=abs2.melt.mod)
##statistical significance?
print(as.data.frame)
anova1<-aov(Time~Species+value,data=abs2.melt.mod)
summary(anova1)
Simulate something that looks like your data:
set.seed(111)
df = expand.grid(rep=1:3,Time=1:5,Species=letters[1:3])
df$value = 0.5*df$Time + rnorm(nrow(df))
df$Time = factor(df$Time)
Then we plot, allowing comparison at each time point (note that mean_sdl requires the Hmisc package to be installed):
library(ggplot2)
ggplot(df,aes(x=Time,y=value,col=Species)) +
stat_summary(fun.data="mean_sdl",position=position_dodge(width=0.5))
Or with error bars, which I think looks bad:
ggplot(df,aes(x=Time,y=value,col=Species))+
stat_summary(fun.data="mean_sdl",position=position_dodge(width=0.5),
geom="errorbar",width=0.4)
Since you have only a few data points, there is no point in doing a boxplot, so you can try something like the above.
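The plots above do not produce significance values by themselves. A minimal sketch of a two-way ANOVA on the same simulated data might look like the following; the model formula (Species, Time and their interaction) and the Tukey follow-up are assumptions about what the question is after, not something stated in the original answer.

```r
set.seed(111)
df <- expand.grid(rep = 1:3, Time = 1:5, Species = letters[1:3])
df$value <- 0.5 * df$Time + rnorm(nrow(df))
df$Time <- factor(df$Time)

# Two-way ANOVA: value is the response, Species and Time are the factors
fit <- aov(value ~ Species * Time, data = df)
summary(fit)

# Pairwise species comparisons, averaged over time points
tuk <- TukeyHSD(fit, which = "Species")
tuk$Species
```

Note that the response goes on the left of the formula, so a model like aov(Time ~ Species + value) tests something different from what the question describes.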

show only 0-90% or 0-95% percentile

Here are my code and the resulting plot. Due to some outliers, the x-axis is very long. Is there a simple way to filter df$foo to only the 0-90% or 0-95% percentile range in R, so that I can plot only the normal values? Thanks.
df <- read.csv('~/Downloads/foo.tsv', sep='\t', header=F, stringsAsFactors=FALSE)
names(df) <- c('a', 'foo', 'goo')
df$foo <- as.numeric(df$foo)
goodValue <- df$foo
summary(goodValue)
hist(goodValue,main="Distribution",xlab="foo",breaks=20)
Maybe this is what you're looking for?
a <- c(rnorm(99), 50) # create some data with one outlier
quant <- as.numeric(quantile(a, c(0, 0.9))) # get the 0 and 0.9 quantiles
hist(a[a >= quant[1] & a <= quant[2]]) # histogram of only the data within these bounds (inclusive, so the minimum is kept)
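Applied to a data frame column like df$foo in the question, the same idea might be sketched as follows. The data here is simulated (the original file is not available), and the names cutoff and trimmed are illustrative.

```r
set.seed(1)
# Stand-in for the question's data: 99 ordinary values plus one outlier
df <- data.frame(foo = c(rnorm(99), 50))

# Upper bound of the 0-95% range; na.rm guards against NAs from as.numeric()
cutoff <- quantile(df$foo, probs = 0.95, na.rm = TRUE)

# Keep only rows at or below the cutoff
trimmed <- df[!is.na(df$foo) & df$foo <= cutoff, , drop = FALSE]

hist(trimmed$foo, breaks = 20, main = "Distribution (0-95th percentile)", xlab = "foo")
```

Changing probs to 0.9 gives the 0-90% range instead.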
Suppose you wanted to examine the diamonds. (I don't have your data)
library(ggplot2)
library(dplyr)
diamonds %>% ggplot() + geom_histogram(aes(x = price))
You might decide to examine the deciles of your data, and since the tail is not of interest to you, you might throw away the uppermost decile. You could do that as follows, with a free scale so that you can see what is happening within each decile.
diamonds %>% mutate(ntile = ntile(price, 10)) %>%
filter(ntile < 10) %>%
ggplot() + geom_histogram(aes(x = price)) +
facet_wrap(~ntile, scales = "free_x")
But be cautious: although seeing your data at a much finer granularity has its benefits, notice that you can now barely tell that your data is roughly exponentially distributed (with a heavy tail, as commodity price data often is).

Plotting replicates in ggplot: problems with annotation per facet

I have a large dataframe.
Here is fake data of a similar structure;
dat = data.frame(id=seq(1:12),variable=rep(c("p1","p2","p3"),times=2),value=c(runif(6),runif(6)+1),locus=c(rep("A",6),rep("B",6)),replicate=rep(c(1,2),6), TimesLocus=rep(2,times=12))
I would like to plot the correlation between replicate 1, and replicate 2.
I have achieved this using.
Corr<-cor(dat[dat$replicate==1,]$value,dat[dat$replicate==2,]$value)
ggplot(dat,aes(x=dat[dat$replicate==1,]$value,y=dat[dat$replicate==2,]$value))+
geom_point()+xlab("replicate1")+ylab("replicate2")+
geom_smooth(method = "lm") +
annotate("text", x = 0.9*max(dat[dat$replicate==1,]$value),
y = 0.9*max(dat[dat$replicate==2,]$value),
label = paste("r^2=",round(Corr,digits=2),sep=" "),color="blue")
However, now I want to see if the correlations are different PER VARIABLE.
I can do this using.
ggplot(dat,aes(x=dat[dat$replicate==1,]$value,y=dat[dat$replicate==2,]$value))+
geom_point()+xlab("replicate1")+ylab("replicate2")+
geom_smooth(method = "lm") + facet_wrap(~variable)
If I want to have the correlation per variable I know that I should make a separate dataframe, but I am having problems with this.
r_df <- ddply(dat, .(variable), summarise,
rsq=round(summary(lm(dat[dat$replicate==2,]$value~
dat[dat$replicate==1,]$value))$r.squared, 2))
It gives the same r-squared for each variable.
What am I doing wrong? Can I do this without reshaping my data again?
Okay, I am now trying to use the info from @shadow, and have the following.
r_df_val <- ddply(df_mlt_loc_Dup, .(variable), summarise, rsq=round(summary(lm(value[replicate==2]~value[replicate==1]))$r.squared, 2))
Somehow the calculation isn't correct. All of the rsq values are around 0.06, when they should be near 0.8; you can see the correlation in the plot below. Is it somehow re-ordering the dataframe upon subsetting by variable?
In your ddply call, you used dat again. That refers to the original, unsplit data. You should instead use value and replicate directly, so that they are evaluated within each group.
library(plyr) # provides ddply() and .()
r_df <- ddply(dat, .(variable), summarise,
rsq = round(summary(lm(value[replicate==2]~value[replicate==1]))$r.squared, 2))
This does not give meaningful results for the data you provided because the dataset is too small: each variable has only two points per replicate, and a line fits two points exactly. For your original data it should work. Here is a larger dataset (essentially the data you provided with some additional rows) for which it should work as desired.
dat = data.frame(id=seq(1:24),variable=rep(c("p1","p2","p3"),times=4),value=c(runif(12),runif(12)+1),locus=c(rep("A",12),rep("B",12)),replicate=rep(c(1,2),12), TimesLocus=rep(2,times=24))
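For comparison, the per-variable calculation can also be sketched in base R without plyr. This assumes, as the ddply version does, that replicate 1 and replicate 2 rows pair up by position within each variable; the names rsq_by_var and r_df are illustrative.

```r
set.seed(42)
dat <- data.frame(id = 1:24,
                  variable = rep(c("p1", "p2", "p3"), times = 4),
                  value = c(runif(12), runif(12) + 1),
                  replicate = rep(c(1, 2), 12))

# Split by variable, then correlate replicate 1 with replicate 2 inside each piece
rsq_by_var <- sapply(split(dat, dat$variable), function(d) {
  r1 <- d$value[d$replicate == 1]
  r2 <- d$value[d$replicate == 2]
  # For a simple linear regression, R^2 equals the squared correlation
  round(cor(r1, r2)^2, 2)
})

# One row per variable
r_df <- data.frame(variable = names(rsq_by_var), rsq = unname(rsq_by_var))
r_df
```

Because r_df has one row per facet variable, it has the shape a per-facet label layer such as geom_text(data = r_df, ...) would need.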

How do I plot the first derivative of the smoothing function?

I have the following script that emulates the type of data structure I have and the analysis that I want to do on it:
library(ggplot2)
library(reshape2)
library(splines) # provides ns(), used in the smoothing formula below
n <- 10
df <- data.frame(t=seq(n)*0.1, a =sort(rnorm(n)), b =sort(rnorm(n)),
a.1=sort(rnorm(n)), b.1=sort(rnorm(n)),
a.2=sort(rnorm(n)), b.2=sort(rnorm(n)))
head(df)
mdf <- melt(df, id=c('t'))
## head(mdf)
levels(mdf$variable) <- rep(c('a','b'),3)
g <- ggplot(mdf,aes(t,value,group=variable,colour=variable))
g +
stat_smooth(method='lm', formula = y ~ ns(x,3)) +
geom_point() +
facet_wrap(~variable)
What I would like to do in addition to this is plot the first derivative of the smoothing function against t and against the factors, c('a','b'), as well. Any suggestions how to go about this would be greatly appreciated.
You'll have to construct the derivative yourself, and there are two possible ways to do that. Let me illustrate using only one group:
require(splines) # thanks to @Chase for the notice
lmdf <- mdf[mdf$variable=="b",]
model <- lm(value~ns(t,3),data=lmdf)
You then simply define your derivative as diff(Y)/diff(X) based on your predicted values, as you would do for differentiation of a discrete function. It's a very good approximation if you take enough X points.
X <- data.frame(t=seq(0.1,1.0,length=100) ) # make an ordered sequence
Y <- predict(model,newdata=X) # calculate predictions for that sequence
plot(X$t,Y,type="l",main="Original fit") #check
dY <- diff(Y)/diff(X$t) # the derivative of your function
dX <- rowMeans(embed(X$t,2)) # centers the X values for plotting
plot(dX,dY,type="l",main="Derivative") #check
As you can see, this way you obtain the points for plotting the derivative. You'll figure out from here how to apply this to both levels and combine those points into the plot you like. Below are the plots from this sample code:
Here's one approach to plotting this with ggplot. There may be a more efficient way to do it, but this uses the manual calculations done by @Joris. We'll simply construct a long data.frame with all of the X and Y values while also supplying a variable to "facet" the plots:
require(ggplot2)
originalData <- data.frame(X = X$t, Y, type = "Original")
derivativeData <- data.frame(X = dX, Y = dY, type = "Derivative")
plotData <- rbind(originalData, derivativeData)
ggplot(plotData, aes(X,Y)) +
geom_line() +
facet_wrap(~type, scales = "free_y")
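To extend the finite-difference approach to both levels at once, one possible sketch is to split the data by level, fit and differentiate each piece, and bind the results. The data below is rebuilt directly in long form to mimic the question's mdf; grid and deriv_df are illustrative names.

```r
library(splines)

set.seed(1)
n <- 10
# Long-format stand-in for mdf: three sorted series per level, shared t values
mdf <- data.frame(
  t = rep(seq(n) * 0.1, times = 6),
  variable = rep(rep(c("a", "b"), each = n), times = 3),
  value = unlist(replicate(6, sort(rnorm(n)), simplify = FALSE))
)

# Fine, ordered grid on which the fitted curves are evaluated
grid <- data.frame(t = seq(0.1, 1.0, length.out = 100))

deriv_df <- do.call(rbind, lapply(split(mdf, mdf$variable), function(d) {
  fit <- lm(value ~ ns(t, 3), data = d)
  y <- predict(fit, newdata = grid)
  data.frame(
    variable = d$variable[1],
    t = (head(grid$t, -1) + tail(grid$t, -1)) / 2,  # interval midpoints
    dvalue = diff(y) / diff(grid$t)                 # finite-difference slope
  )
}))
```

deriv_df can then be plotted with something like ggplot(deriv_df, aes(t, dvalue)) + geom_line() + facet_wrap(~variable).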
If the data is smoothed using smooth.spline, the derivative of the fitted curve can be requested with the deriv argument of predict. Following on from @Joris's solution:
lmdf <- mdf[mdf$variable == "b",]
model <- smooth.spline(x = lmdf$t, y = lmdf$value)
Y <- predict(model, x = seq(0.1, 1.0, length.out = 100), deriv = 1) # first derivative
plot(Y$x, Y$y, type = 'l') # predict() returns a list with vector components x and y
Any dissimilarity in the output is most likely due to differences in the smoothing.

Resources