I have simulated environmental damage to two economic sectors of a small region using basic input-output analysis.
I have plotted the average damage and its confidence interval by sector for each year (I have a 10-year period), as shown in the code below, using matplot.
I would like to get a similar graph using ggplot.
For now I have omitted the code that sets up the data and runs the simulations, to keep the question as concise as possible; please let me know if I should include it.
Thank you in advance for your help.
# Average drop in each year (mean over iterations), by sector
medias=t(apply(vX,2,function(x) apply(x,2,mean)))
# std deviations
devtip=t(apply(vX,2,function(x) apply(x,2,sd)))
devtip
# and their confidence intervals
inter95=t(apply(vX,2,function(x) apply(x,2,quantile,p=c(0.025,0.975))))
# where the first two columns are the interval for the first sector
# and the last two columns are the interval for the second sector
inter95[,1:2] # ci for the first sector
inter95[,3:4] # ci for the second sector
# Plots the drop in demand for each sector in each year, with its CI
matplot(medias, type="l", lty=1, lwd=1, ylab="Mean change in production", xlab="Time (iteration)", ylim=range(inter95))
for(sec in 1:length(sec.int)){
inter=apply(vX[,,sec],2,quantile,p=c(0.025,0.975))
segments(x0=1:N, y0=inter[1,], y1=inter[2,], col=sec)
}
legend("right",paste("Sec.",sec.int),col=1:length(sec.int),bty="n",lty=1)
Firstly, please provide some reproducible data. Also, I think your question has already been answered here.
Assuming that in your example medias is a matrix with ncol = 2 holding the two trend means, and inter95 is another matrix with ncol = 4 holding the confidence intervals, I would do:
df <- cbind.data.frame(medias, inter95)
names(df) <- c("mean1", "mean2", "lwr1", "upr1", "lwr2", "upr2")
df$time <- 1:n
library(ggplot2)
ggplot(df, aes(time, mean1)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr1, ymax = upr1), alpha = 0.3) +
  geom_line(aes(time, mean2), col = "red") +
  geom_ribbon(aes(ymin = lwr2, ymax = upr2), alpha = 0.3, fill = "red")
Using this data
set.seed(1)
n <- 10
b <- .5
medias <- matrix(rnorm(n*2), ncol= 2)
inter95 <- matrix(c(medias[ , 1]-b, medias[, 1]+b, medias[ , 2]-b, medias[ , 2]+b), ncol= 4)
gives you the following plot
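If you have more than two sectors, reshaping to long format scales better than hard-coding one layer per sector. A sketch of that variant, assuming medias (n rows x 2 sectors) and inter95 (n x 4, column pairs lwr/upr per sector) as above:
library(ggplot2)
# one row per (time, sector) combination
long <- data.frame(
  time   = rep(seq_len(nrow(medias)), times = 2),
  sector = factor(rep(c("Sector 1", "Sector 2"), each = nrow(medias))),
  mean   = c(medias[, 1], medias[, 2]),
  lwr    = c(inter95[, 1], inter95[, 3]),
  upr    = c(inter95[, 2], inter95[, 4])
)
ggplot(long, aes(time, mean, colour = sector, fill = sector)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3, colour = NA) +
  geom_line()
With a sector column, adding more sectors only adds rows, not new geom calls.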
I'm trying to make a normal Q-Q plot using plotly.js from values obtained in R.
I can get the y-axis values:
m <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data=iris)
plot(m, which=2) #this plot is what I want to make using plotly
std.resi <- rstandard(m) # y-axis values
But there's a problem: I don't know how to get the x-axis values.
Please advise me on this matter. Thank you.
The x-axis contains the quantiles of a Gaussian distribution.
So, imagining that you have N points, you can obtain the values of your x-axis via:
a <- (1:(N+1))/(N+1)  # N+1 equally spaced values in (0, 1]; note (1:N+1) would parse as (1:N) + 1
a <- a[-(N+1)]        # remove the value at 1
quant <- qnorm(a)     # obtain the Gaussian quantiles
Hope it helps!
Thank you so much. I got the x-axis values.
std.resi                         # the standardized residuals computed above
N <- length(std.resi)
a_2 <- 1:(N+1) / (N+1)           # N+1 equally spaced probabilities
a_2 <- a_2[-(N+1)]               # drop the value at 1
std.resi.sort <- sort(std.resi)  # sorted sample values
quant <- qnorm(a_2)              # theoretical Gaussian quantiles
plot(quant, std.resi.sort)
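As an aside (my addition, not from the original thread): base R can produce these theoretical quantiles directly, and plot(m, which=2) essentially does the same internally via qqnorm. A sketch:
# base R helpers; close to the manual grid above
q <- qqnorm(std.resi, plot.it = FALSE)     # list with x (theoretical) and y (sample) quantiles
plot(q$x, q$y)                             # the same normal Q-Q plot
quant <- qnorm(ppoints(length(std.resi)))  # or build the probability grid yourself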
Hi, I'm desperately trying to plot several time series with a 12-month moving average.
Here is an example with two time series of flower and seed densities. (I have many more time series to work on...)
#datasets
library(lubridate)  # for ymd()
taxon <- c(rep("Flower",36), rep("Seeds",36))
density <- c(seq(20, 228, length=36), seq(33, 259, length=36))
year <- rep(c(rep("2000",12), rep("2001",12), rep("2002",12)), 2)
ymd <- c(rep(seq(ymd('2000-01-01'), ymd('2002-12-01'), by = 'months'), 2))
#dataframe
df <- data.frame(taxon, density, year, ymd)
library(forecast)
#create function that does a Symmetric Weighted Moving Average (2x12) of the monthly log density of flowers and seeds
ma_12 <- function(x) {
ts_x <- ts(x, freq = 12, start = c(2000, 1), end = c(2002, 12)) # transform to time-series object as it is necessary to run the ma function
return(ma(log(ts_x + 1), order = 12, centre = T))
}
#trial of the function
ma_12(df[df$taxon=="Flower",]$density) #works well
library(ggplot2)
#Trying to plot flower and seeds log density as two time series
ggplot(df,aes(x=year,y=density,colour=factor(taxon),group=factor(taxon))) +
stat_summary(fun.y = ma_12, geom = "line") #or geom = "smooth"
#Warning message:
#Computation failed in `stat_summary()`:
#invalid time series parameters specified
The function ma_12 works correctly. The problem comes when I try to plot both time series (Flower and Seeds) using ggplot: I cannot define both taxa as separate time series and apply the moving average to each. It seems to have to do with stat_summary...
Any help would be more than welcome! Thanks in advance.
Note: the following link is quite useful, but it can't directly help me because I want to apply a specific function and plot it according to the levels of a grouping variable. For now I can't find a solution there; anyway, thank you for suggesting it.
Multiple time series in one plot
Is this what you need?
f <- ma_12(df[df$taxon=="Flower", ]$density)
s <- ma_12(df[df$taxon=="Seeds", ]$density)
f <- cbind(f, time(f))
s <- cbind(s, time(s))
serie <- data.frame(rbind(f, s),
                    taxon=c(rep("Flower", nrow(f)), rep("Seeds", nrow(s))))
names(serie)[1:2] <- c("log_density", "time")  # give the stacked columns explicit names
serie$density <- exp(serie$log_density) - 1    # undo the log(x + 1) transform
library(lubridate)
serie$time <- ymd(format(date_decimal(serie$time), "%Y-%m-%d"))
library(ggplot2)
ggplot() +
  geom_point(data=df, aes(x=ymd, y=density, color=taxon)) +
  geom_line(data=serie, aes(x=time, y=density, color=taxon))
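An alternative sketch (my addition), assuming the same df and ma_12 from the question: compute the smoother per group first with dplyr, then plot, which sidesteps stat_summary entirely:
library(dplyr)
library(ggplot2)
# add the 2x12 moving average as a column, computed within each taxon
smoothed <- df %>%
  group_by(taxon) %>%
  mutate(ma = as.numeric(ma_12(density))) %>%
  ungroup()
ggplot(smoothed, aes(ymd, colour = taxon)) +
  geom_point(aes(y = density)) +
  geom_line(aes(y = exp(ma) - 1), na.rm = TRUE)  # back-transform the log(x + 1) scale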
Can anyone tell me how to create a plot featuring three different matrices of data? I have three matrices, each of dimension 1×1001, and I wish to plot all three on the same graph.
I have managed to plot one matrix at a time, and to assemble the code that creates the other two matrices, but not to plot them. B[i,] is randomly generated data. What I would like to know is the code to get all three plots together on one graph.
Code for one matrix:
ntime<-1000
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
}
matplot(t, t(average.price.at.each.timestep), type="l", lty=1, main="MC Price of a Zero Coupon Bond", ylab="Price", xlab = "Option Exercise Date")
Code for 3:
average.price.at.each.timestep<-matrix(0,nrow=1,ncol=ntime+1)
s.e.at.each.time <-matrix(0,nrow=1,ncol=ntime+1)
upper.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
lower.c.l.at <- matrix(0,nrow=1,ncol=ntime+1)
std <- function(x) sd(x)/sqrt(length(x))
for(i in 1:(ntime+1)){
average.price.at.each.timestep[i]<-mean(B[i,])
s.e.at.each.time[i] <- std(B[i,])
upper.c.l.at[i] <- average.price.at.each.timestep[i]+1.96*s.e.at.each.time[i]
lower.c.l.at[i] <- average.price.at.each.timestep[i]-1.96*s.e.at.each.time[i]
}
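(An aside, not from the original post: the loops above can be replaced by vectorised calls; a sketch assuming the matrix B defined below:)
# vectorised equivalents of the loops; these give plain vectors, not 1-row matrices
average.price.at.each.timestep <- rowMeans(B)
s.e.at.each.time <- apply(B, 1, sd) / sqrt(ncol(B))
upper.c.l.at <- average.price.at.each.timestep + 1.96 * s.e.at.each.time
lower.c.l.at <- average.price.at.each.timestep - 1.96 * s.e.at.each.time
# being vectors, they plot with lines(t, upper.c.l.at) rather than matlines(t, t(...))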
I'm still struggling with this, as I cannot get the given solutions to match my data sets. I have now included below the code that generates the matrix B, as a working example, so you can see the data I am dealing with. As you can see, it produces a plot of the different price paths; what I would like is a plot of the average price with confidence intervals around that average.
# Define Bond Price Parameters
#
P<-1 #par value
# Define Vasicek Model Parameters
#
rev.rate<-0.3 #speed of reversion
long.term.mean<-0.1 #long term level of the mean
sigma<-0.05 #volatility
r0<-0.03 #spot interest rate
Strike<-0.05
# Define Simulation Parameters
#
T<-50 #time to expiry
ntime<-1000 #number of timesteps
yearstep<-ntime/T #yearstep
npaths<-1000 #number of paths
dt<-T/ntime #timestep
R <- matrix(0,nrow=ntime+1,ncol=npaths) #matrix of Vasicek interest rate values
B <- matrix(0,nrow=ntime+1,ncol=npaths) # matrix of Bond Prices
R[1,]<-r0 #specifies that all paths start at specified spot rate
B[1,]<-P
# do loop which generates values to fill matrix R with multiple paths of Interest Rates as they evolve over time.
# stochastic process based on standard normal distribution
for (j in 1:npaths) {
for (i in 1:ntime) {
dZ <-rnorm(1,mean=0,sd=1)*sqrt(dt)
Rij<-R[i,j]
Bij<-B[i,j]
dr <-rev.rate*(long.term.mean-Rij)*dt+sigma*dZ
R[i+1,j]<-Rij+dr
B[i+1,j]<-Bij*exp(-R[i+1,j]*dt)
}
}
t<-seq(0,T,dt)
par(mfcol = c(3,3))
matplot(t, B[,1:pmin(20,npaths)], type="l", lty=1, main="Price of a Zero Coupon Bond", ylab="Price", xlab = "Time to Expiry")
Your example isn't reproducible, so I created some fake data that I hope is structured similarly to yours. If this isn't what you were looking for, let me know and I'll update as needed.
# Fake data
ntime <- 100
mat1 <- matrix(rnorm(ntime+1, 10, 2), nrow=1, ncol=ntime+1)
mat2 <- matrix(rnorm(ntime+1, 20, 2), nrow=1, ncol=ntime+1)
mat3 <- matrix(rnorm(ntime+1, 30, 2), nrow=1, ncol=ntime+1)
matplot(1:(ntime+1), t(mat1), type="l", lty=1, ylim=c(0, max(c(mat1,mat2,mat3))),
main="MC Price of a Zero Coupon Bond",
ylab="Price", xlab = "Option Exercise Date")
# Add lines for mat2 and mat3
lines(1:(ntime+1), mat2, col="red")
lines(1:(ntime+1), mat3, col="blue")
UPDATE: Is this what you're trying to do?
matplot(t, t(average.price.at.each.timestep), type="l", lty=1,
main="MC Price of a Zero Coupon Bond", ylab="Price",
xlab = "Option Exercise Date")
matlines(t, t(upper.c.l.at), lty=2, col="red")
matlines(t, t(lower.c.l.at), lty=2, col="green")
See the plot below. If you have multiple columns to plot (as in your updated example, where you plot 20 separate paths) and you want lower and upper CIs for all of them (though this would make the plot unreadable), just build matrices of upper and lower CI values with one column per path in average.price.at.each.timestep, and use matlines to add them to your existing plot of the multiple paths.
This is doable using ggplot2 and reshape2. The structures you have are a little awkward; you could improve them by using a data frame instead of a matrix.
#Dummy data
average.price.at.each.timestep <- rnorm(1000, sd=0.01)
s.e.at.each.time <- rnorm(1000, sd=0.0005, mean=1)
#CIs (note you can vectorise this):
upper.c.l.at <- average.price.at.each.timestep+1.96*s.e.at.each.time
lower.c.l.at <- average.price.at.each.timestep-1.96*s.e.at.each.time
#create a data frame:
prices <- data.frame(time = 1:length(average.price.at.each.timestep),
                     price = average.price.at.each.timestep,
                     upperCI = upper.c.l.at,
                     lowerCI = lower.c.l.at)
library(reshape2)
#turn the data frame into time, variable, value triplets
prices.t <- melt(prices, id.vars=c("time"))
#plot
library(ggplot2)
ggplot(prices.t, aes(time, value, colour=variable)) + geom_line()
This produces the following plot:
This can be improved somewhat by using geom_ribbon instead:
ggplot(prices, aes(time, price)) +
  geom_ribbon(aes(ymin=lowerCI, ymax=upperCI), alpha=0.1) +
  geom_line()
Which produces this plot:
Here's another, slightly different ggplot solution that does not require you to calculate the confidence limits first - ggplot does it for you.
# create sample dataset
set.seed(1) # for reproducible example
B <- matrix(rnorm(1000,mean=rep(10+1:10/2,each=10)),nc=10)
library(ggplot2)
library(reshape2) # for melt(...)
gg <- melt(data.frame(date=1:nrow(B),B), id="date")
ggplot(gg, aes(x=date, y=value)) +
  stat_summary(fun.y = mean, geom="line") +
  stat_summary(fun.y = function(y) mean(y) - 1.96*sd(y)/sqrt(length(y)),
               geom="line", linetype="dotted", color="blue") +
  stat_summary(fun.y = function(y) mean(y) + 1.96*sd(y)/sqrt(length(y)),
               geom="line", linetype="dotted", color="blue") +
  theme_bw()
stat_summary(...) summarizes the y-values for a given value of x (the date). So the first call calculates the mean, the second the lower CL, and the third the upper CL.
You could also create a CL(...) function, and call that:
CL <- function(x, level=0.95, type=c("lower","upper")) {
  type <- match.arg(type)  # defaults to "lower" if type is not supplied
  fact <- c(lower=-1, upper=1)
  mean(x) - fact[type]*qnorm((1-level)/2)*sd(x)/sqrt(length(x))
}
ggplot(gg, aes(x=date,y=value)) +
stat_summary(fun.y = mean, geom="line")+
stat_summary(fun.y = CL, type="lower", geom="line",linetype="dotted", color="blue")+
stat_summary(fun.y = CL, type="upper", geom="line",linetype="dotted", color="blue")+
theme_bw()
This produces a plot identical to the one above.
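Note that in ggplot2 3.3.0 and later fun.y is deprecated in favour of fun, and extra arguments to the summary function are passed through fun.args; a sketch of the same plot under the current interface:
# the same plot with the current stat_summary() API (ggplot2 >= 3.3.0)
ggplot(gg, aes(x=date, y=value)) +
  stat_summary(fun = mean, geom="line") +
  stat_summary(fun = CL, fun.args = list(type="lower"),
               geom="line", linetype="dotted", color="blue") +
  stat_summary(fun = CL, fun.args = list(type="upper"),
               geom="line", linetype="dotted", color="blue") +
  theme_bw()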
The specific example: imagine x is a continuous variable between 0 and 10, the red line is the distribution of "goods", and the blue is "bads". I'd like to see whether there is value in incorporating this variable into checking for "goodness", but first I'd like to quantify the amount of stuff in the areas where blue > red.
Because this is a density chart, the two curves look comparable in scale, but in reality there are 98 times more goods than bads in my sample, which complicates things: it's not just measuring the area under a curve, but measuring the bad sample over the range where its distribution exceeds the red one.
I've been working to learn R, but am not even sure how to approach this one; any help appreciated.
EDIT
sample data:
http://pastebin.com/7L3Xc2KU <- a few million rows of that, essentially.
the graph is created with
graph <- qplot(sample_x, data=sample_data, geom="density", color=bad_is_1)
The only way I can think of to do this is to calculate the area between the curves using simple trapezoids. First we manually compute the densities:
d0 <- density(sample$sample_x[sample$bad_is_1==0])
d1 <- density(sample$sample_x[sample$bad_is_1==1])
Now we create functions that will interpolate between our observed density points
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
Next we find the x range of the overlap of the densities
ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))
and divide that into 500 sections
i <- seq(min(ovrng), max(ovrng), length.out=500)
Now we calculate the difference between the density curves:
h <- f0(i) - f1(i)  # positive where d0 > d1
and, using the formula for the area of a trapezoid, we add up the area over the regions where d0 > d1 (swap f0 and f1 in the definition of h to get the region where d1 > d0 instead):
area <- sum((h[-1] + h[-length(h)])/2 * diff(i) * (h[-1] >= 0))
# [1] 0.1957627
We can plot the region using
plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
jj<-which(h>0 & seq_along(h) %% 5==0); j<-i[jj];
segments(j, f1(j), j, f1(j)+h[jj])
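An alternative (my addition): since f0 and f1 are functions, stats::integrate can compute the same area without building the trapezoids by hand. A sketch, assuming f0, f1 and ovrng from above:
diff_pos <- function(x) pmax(f0(x) - f1(x), 0)  # positive part: nonzero only where d0 > d1
integrate(diff_pos, lower=ovrng[1], upper=ovrng[2], subdivisions=1000)$value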
Here's a way to shade the area between two density plots and calculate the magnitude of that area.
# Create some fake data
set.seed(10)
dat = data.frame(x=c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)),
group=c(rep("Bad", 1000), rep("Good", 2000)))
# Plot densities
# Use y=..count.. to get counts on the vertical axis
p1 = ggplot(dat) +
geom_density(aes(x=x, y=..count.., colour=group), lwd=1)
Some extra calculations to shade the area between the two density plots
(adapted from this SO question):
pp1 = ggplot_build(p1)
# Create a new data frame with densities for the two groups ("Bad" and "Good")
dat2 = data.frame(x = pp1$data[[1]]$x[pp1$data[[1]]$group==1],
ymin=pp1$data[[1]]$y[pp1$data[[1]]$group==1],
ymax=pp1$data[[1]]$y[pp1$data[[1]]$group==2])
# We want ymax and ymin to differ only when the density of "Good"
# is greater than the density of "Bad"
dat2$ymax[dat2$ymax < dat2$ymin] = dat2$ymin[dat2$ymax < dat2$ymin]
# Shade the area between "Good" and "Bad"
p1a = p1 +
geom_ribbon(data=dat2, aes(x=x, ymin=ymin, ymax=ymax), fill='yellow', alpha=0.5)
Here are the two plots:
To get the area (number of values) in specific ranges of Good and Bad, use the density function on each group. (You could also continue to work with the data pulled from ggplot as above, but this way you get more direct control over how the density distribution is generated.)
## Calculate densities for Bad and Good.
# Use same number of points and same x-range for each group, so that the density
# values will line up. Use a higher value for n to get a finer x-grid for the density
# values. Use a power of 2 for n, because the density function rounds up to the nearest
# power of 2 anyway.
bad = density(dat$x[dat$group=="Bad"],
n=1024, from=min(dat$x), to=max(dat$x))
good = density(dat$x[dat$group=="Good"],
n=1024, from=min(dat$x), to=max(dat$x))
## Normalize so that densities sum to number of rows in each group
# Number of rows in each group
counts = tapply(dat$x, dat$group, length)
bad$y = counts[1]/sum(bad$y) * bad$y
good$y = counts[2]/sum(good$y) * good$y
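(A quick sanity check, my addition: the rescaling forces the grid values to sum exactly to the group sizes, so partial sums over a region estimate counts in that region.)
sum(bad$y)   # 1000, by construction
sum(good$y)  # 2000, by construction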
## Results
# Number of "Good" in region where "Good" exceeds "Bad"
sum(good$y[good$y > bad$y])
[1] 1931.495 # Out of 2000 total in the data frame
# Number of "Bad" in region where "Good" exceeds "Bad"
sum(bad$y[good$y > bad$y])
[1] 317.7315 # Out of 1000 total in the data frame
This question is about the statistical program R.
Data
I have a data frame, study_data, that has 100 rows, each representing a different person, and three columns, gender, height_category, and freckles. The variable gender is a factor and takes the value of either "male" or "female". The variable height_category is also a factor and takes the value of "tall" or "short". The variable freckles is a continuous, numeric variable that states how many freckles that individual has.
Here are some example data (thanks to Roland for this):
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
Question 1
I would like to create a nested table that divides these patients into "male" versus "female", further subdivides them into "tall" versus "short", and then calculates the number of patients in each sub-grouping along with the median number of freckles with the lower and upper 95% confidence interval.
Example
The table should look something like what is shown below, where the # signs are replaced with the appropriate calculated results.
gender  height_category    n  median_freckles  LCI  UCI
male    tall               #  #                #    #
        short              #  #                #    #
female  tall               #  #                #    #
        short              #  #                #    #
Question 2
Once these results have been calculated, I would then like to create a bar graph. The y axis will be the median number of freckles. The x axis will be divided into male versus female. However, these sections will be subdivided by height category (so there will be a total of four bars in groups of two). I'd like to overlay the 95% confidence bands on top of the bars.
What I've tried
I know that I can make a nested table using the xtabs and ftable commands (both in base R's stats package):
ftable(xtabs(formula = ~ gender + height_category, data = study_data))
However, I'm not sure how to incorporate calculating the median number of freckles into this command and get it to show up in the summary table. I'm also aware that ggplot2 can be used to make bar graphs, but I'm not sure how to do that, given that I can't calculate the data I need in the first place.
You should really provide a reproducible example. Anyway, you may find library(plyr) helpful. Be careful with these confidence intervals: they rely on a normal approximation for the mean, which can be poor when the group sizes are small.
library(plyr)
ddply(DF, .(gender, height_category), summarize,
      n=length(freckles), median_freckles=median(freckles),
      # t-based limits for the mean; note the sqrt(n) in the standard error
      LCI=mean(freckles) + qt(.025, df=length(freckles)-1)*sd(freckles)/sqrt(length(freckles)),
      UCI=mean(freckles) + qt(.975, df=length(freckles)-1)*sd(freckles)/sqrt(length(freckles)))
EDIT: I forgot to add the bit on the plot. Assuming we save the previous result as tab:
library(ggplot2)
dodge <- position_dodge(width=0.9)
# plot directly from tab so the LCI/UCI columns stay available to geom_errorbar
ggplot(tab, aes(x=gender, y=median_freckles, fill=height_category)) +
  geom_bar(stat="identity", position=dodge) +
  geom_errorbar(aes(ymax=UCI, ymin=LCI), position=dodge, width=0.25)
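(An aside, my addition: the same summary can be written with dplyr, using the DF sample data from the question; a sketch with the same t-based limits for the mean:)
library(dplyr)
DF %>%
  group_by(gender, height_category) %>%
  summarise(n = n(),
            median_freckles = median(freckles),
            LCI = mean(freckles) + qt(.025, n() - 1) * sd(freckles) / sqrt(n()),
            UCI = mean(freckles) + qt(.975, n() - 1) * sd(freckles) / sqrt(n()),
            .groups = "drop")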
set.seed(42)
DF <- data.frame(gender=sample(c("m","f"),100,T),
height_category=sample(c("tall","short"),100,T),
freckles=runif(100,0,100))
library(plyr)
res <- ddply(DF,.(gender,height_category),summarise,
n=length(na.omit(freckles)),
median_freckles=quantile(freckles,0.5,na.rm=TRUE),
LCI=quantile(freckles,0.025,na.rm=TRUE),
UCI=quantile(freckles,0.975,na.rm=TRUE))
library(ggplot2)
p1 <- ggplot(res,aes(x=gender,y=median_freckles,ymin=LCI,ymax=UCI,
group=height_category,fill=height_category)) +
geom_bar(stat="identity",position="dodge") +
geom_errorbar(position="dodge")
print(p1)
# a better plot that doesn't require precalculating the stats
library(Hmisc)  # provides smedian.hilow, used by median_hilow below
p2 <- ggplot(DF,aes(x=gender,y=freckles,colour=height_category)) +
stat_summary(fun.data="median_hilow",geom="pointrange",position = position_dodge(width = 0.4))
print(p2)
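(For reference, my addition: median_hilow is ggplot2's wrapper around Hmisc::smedian.hilow; by default it returns the median together with the 0.025 and 0.975 quantiles as y/ymin/ymax, which is exactly what stat_summary hands to geom_pointrange. You can call it directly to inspect the numbers:)
# what stat_summary computes under the hood; returns a data frame with y, ymin, ymax
median_hilow(DF$freckles, conf.int = 0.95)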