R barplot with standard deviation

I am a new R user and am having trouble graphing some data in a bar plot. Sorry in advance if this is really easy to do, and I just can’t figure it out.
I have six sets of data: three for car #1 at 1, 5, and 10 years, and three for car #2 at 1, 5, and 10 years. The measurements for each car at each age are 1) the total number of dents on the car's exterior and 2) the number of dents that remove paint. I want to make a bar plot with six bars, one for each car at each age, where the bar height is the total number of dents that remove paint, with standard deviation bars.
Here’s what I’ve been trying so far (only 2 data sets included):
car1yr1 = c(rep(0, 101), rep(1, 9)) #car has 9 dents that remove paint
car1yr5 = c(rep(0, 131), rep(1, 19)) #car has 19 dents that remove paint
sd1 = sd(car1yr1)
sd2 = sd(car1yr5)
stdv = c(sd1, sd2)
car1yr1 = car1yr1[1:150]
dentsCar1 = data.frame("Car1Yr1" = car1yr1, "Car1Yr5" = car1yr5)
barplot(as.matrix(dentsCar1, ylim = c(0, 50), beside = TRUE))
I've found an example of error bars, arrows(bar, x, bar, x + -(stdv), length = 0.15, angle = 90), but I can't get it to work with my numbers. Also, in that example the y-axis stops at 15, but the bar for Car1Yr5 goes up to 19. How can I draw a y-axis up to 20 or 30?
Again, I’m new at R and any help would be greatly appreciated. I’ve been trying to solve this on my own off and on for about 2 weeks. Thanks.
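For reference, here is a minimal sketch of the usual barplot()/arrows() error-bar pattern, with made-up means and standard deviations rather than the real dent data (the ylim argument is what controls how far the y axis extends):
# Made-up means and standard deviations, purely to illustrate the pattern
means <- c(9, 19)
stdv <- c(2, 3)
# barplot() invisibly returns the x positions of the bar centres
bp <- barplot(means, ylim = c(0, 30), names.arg = c("Car1Yr1", "Car1Yr5"),
              ylab = "Dents that remove paint", las = 1)
# Draw vertical error bars from mean - sd to mean + sd (code = 3 caps both ends)
arrows(bp, means - stdv, bp, means + stdv, length = 0.15, angle = 90, code = 3)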

I am a little confused by your data... I am assuming from your example that car 1 at year 1 has 101 dents that did not remove paint and 9 that did, and that at year 5 it has 131 that did not and 19 that did.
Calculating the standard deviation on the number of dents does not make much sense to me: you are plotting count data, so there is no standard deviation to show unless you have, say, many cars of the same model and you want to see the variability between cars.
The best thing to do would be to calculate the % of dents that removed paint by doing:
car1yr1 = c(rep(0, 101), rep(1, 9)) #car has 9 dents that remove paint
car1yr5 = c(rep(0, 131), rep(1, 19)) #car has 19 dents that remove paint
# The total number of observations is the total number of dents
total.dents.1 <- length(car1yr1)
total.dents.5 <- length(car1yr5)
# The dents that remove paint are marked as 1, the others with 0,
# so we can just sum all of the data to get the number of paint-removing dents
dents.paint.1 <- sum(car1yr1)
dents.paint.5 <- sum(car1yr5)
# Alternatively you can use
# dents.paint.1 <- length(which(car1yr1==1))
# Calculate the %
dents.paint.perc.1 <- dents.paint.1/total.dents.1
dents.paint.perc.5 <- dents.paint.5/total.dents.5
df <- data.frame(dents.paint.perc.1, dents.paint.perc.5)
# Plot the data.
# ylim specifies the limits of the y axis
# ylab defines the axis title.
# las=1 puts the labels on the y axis horizontally
# names defines the labels on the x axis
barplot(as.matrix(df)*100, ylim=c(0,20),
        ylab="% dents removing paint", las=1,
        names.arg=c("Car 1 year 1", "Car 1 year 5"))
In general it would be much better to put all your data in a single list, so that you can use the *apply family of functions to perform repetitive operations on your whole dataset. This will give you cleaner and more manageable code, and if you add more data it will automatically be included in the plot.
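As a rough illustration of that list-based approach (the third dataset below is invented purely to show how new data gets picked up automatically):
# Each element of the list is one car/year vector of 0/1 dents
dents <- list("Car 1 year 1"  = c(rep(0, 101), rep(1, 9)),
              "Car 1 year 5"  = c(rep(0, 131), rep(1, 19)),
              "Car 1 year 10" = c(rep(0, 150), rep(1, 30)))  # hypothetical extra dataset
# mean() of a 0/1 vector is the proportion of paint-removing dents,
# so sapply() gives the percentage for every dataset at once
perc <- sapply(dents, mean) * 100
barplot(perc, ylim = c(0, 20), las = 1, ylab = "% dents removing paint")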

Related

Highlight Region in Zoo Plot Based on Condition

I have searched very hard for a solution to this question, but I haven't been successful. I have a zoo plot with a standard time-series x axis and simply want to highlight regions of the chart where the value of the series is less than a certain threshold. Specifically, I want to highlight where the p-value of the intercept is significant (and the intercept is plotted). This will occur at various intervals across the time series, rather than simply over some range x <= y.
I have tried help(xblocks) and example(xblocks), but I haven't been able to get the highlighted region to show for the dates where I know it should apply.
Does this solve your problem?
library(zoo)
# Three semi-transparent colours for the highlighted blocks
cols <- hcl(c(0, 0, 260), c = c(100, 0, 100), l = c(50, 90, 50), alpha = 0.3)
set.seed(1234)
x.Date <- as.Date("2015-02-01") + c(1, 3, 6, 7, 9, 10, 12, 14, 18, 20) - 1
y <- zoo(rnorm(length(x.Date)), x.Date)
pval <- zoo(runif(length(x.Date), 0, .2), x.Date)
plot(y, col = 4)
# Shade the regions where the p-value series is significant
xblocks(pval <= 0.05, col = cols[1])
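The same pattern should also work when the condition is on the plotted series itself rather than a separate p-value series, e.g. xblocks(y <= 0, col = cols[3]) to shade wherever y drops below an (arbitrary) threshold of 0.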

How to make categorical scatterplot in R with median marking

My data are set up so that one column contains a continuous testosterone concentration and the second column contains one of four "Kit type" values: "EIA," "RIA," "Other," or "All." I want the kit types as categories along the x axis, with testosterone concentration along the y axis. I can't figure out how to make a sort of cross between a boxplot and a scatterplot: only the individual data points, with a median mark for each category on the graph.
This seemed to get the data points into categories all right, but the summarySE function does not compute a median: Categorical scatter plot with mean segments using ggplot2 in R
Without data, I'm only guessing here, but ...
## create some data
set.seed(42)
n <- 100
dat <- data.frame(Testo = rbeta(n, 2, 5),
                  # wrap in factor() so levels() works in R >= 4.0 (strings are no longer auto-converted)
                  Kit = factor(sample(c('EIA', 'RIA', 'Other', 'All'),
                                      size = n, replace = TRUE)))
## show unequal distribution of points, no problem
table(dat$Kit)
##   All   EIA Other   RIA
##    23    30    14    33
## break into individual levels
dat2 <- lapply(levels(dat$Kit), function(lvl) dat$Testo[dat$Kit == lvl])
names(dat2) <- levels(dat$Kit)
## parent plot
boxplot(dat2, main = 'Testosterone Levels per Kit')
## adding individual points
for (lvl in seq_along(dat2)) {
  points(jitter(rep(lvl, length(dat2[[lvl]]))), dat2[[lvl]],
         pch = 16, col = '#88888888')
}
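If you want only the raw points with a median mark (no boxes), one possible base-R sketch building on the dat2 list above (the axis setup and the 0.3 offsets are arbitrary choices):
## empty plot with one slot per kit type
plot(NA, xlim = c(0.5, length(dat2) + 0.5), ylim = range(unlist(dat2)),
     xaxt = 'n', xlab = 'Kit', ylab = 'Testosterone',
     main = 'Testosterone Levels per Kit')
axis(1, at = seq_along(dat2), labels = names(dat2))
## jittered points plus a horizontal segment at each category's median
for (lvl in seq_along(dat2)) {
  points(jitter(rep(lvl, length(dat2[[lvl]]))), dat2[[lvl]],
         pch = 16, col = '#88888888')
  segments(lvl - 0.3, median(dat2[[lvl]]), lvl + 0.3, median(dat2[[lvl]]),
           lwd = 2, col = 'red')
}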

Asymmetric density plot of outcomes of 2 dice rolled

I posted the following question on https://stats.stackexchange.com/questions/117578/density-distribution-of-outcomes-of-2-dice-rolled but did not get any response.
This question is related to: A histogram with a bar for each frequency value
Two dice are rolled and the sum is plotted. The histogram is as expected, but the density graph shows different densities for 2 and 12 and the plot is asymmetric. Why is this so?
Among the outcomes of two dice rolled, the chances of a 2 are equal to those of a 12. Why does the density graph show unequal values?
num.dices <- 2L
num.rolls <- 100000L
outcomes <- matrix(sample(1:6, num.dices * num.rolls, replace = TRUE),
                   nrow = num.rolls, ncol = num.dices)
sums <- rowSums(outcomes)
Histogram:
ggplot(data.frame(sums), aes(x=factor(sums)))+geom_histogram()
Density plot:
ggplot(data.frame(sums), aes(x=factor(sums), fill=factor(sums)))+geom_density()
I also tried:
ggplot(data.frame(sums), aes(x=factor(sums), fill=factor(sums)))+geom_density(aes(y = ..count..))
It seems to me that you are doing 11 different kernel estimations with 11 different bandwidths; instead do:
ggplot(data.frame(sums), aes(x=sums, fill=2))+geom_density()
or you could add group=1 if you insist on doing it with the extra arguments:
ggplot(data.frame(sums), aes(x=sums, fill=factor(sums)))+geom_density(aes(group=1))
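Since the sums are discrete, a bar chart of relative frequencies is arguably a more natural display than a kernel density anyway; a minimal sketch reusing sums from above (after_stat() assumes ggplot2 >= 3.3):
library(ggplot2)
# Proportion of rolls for each sum; this comes out (approximately) symmetric
ggplot(data.frame(sums), aes(x = factor(sums))) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(x = "Sum of two dice", y = "Relative frequency")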

How to measure area between 2 distribution curves in R / ggplot2

The specific example is this: imagine x is some continuous variable between 0 and 10, the red line is the distribution of "goods" and the blue line is the distribution of "bads". I'd like to see if there is value in incorporating this variable into checking for 'goodness', but first I'd like to quantify the amount of stuff in the regions where blue > red.
Because this is a distribution chart the scales look the same, but in reality there is 98 times more good than bad in my sample, which complicates things: it's not just a matter of measuring the area under a curve, but of measuring the bad sample in the regions where its distribution is greater than the red one.
I've been working to learn R, but am not even sure how to approach this one; any help is appreciated.
EDIT
sample data:
http://pastebin.com/7L3Xc2KU <- a few million rows of that, essentially.
the graph is created with
graph <- qplot(sample_x, bad_is_1, data=sample_data, geom="density", color=bid_is_1)
The only way I can think of to do this is to calculate the area between the curves using simple trapezoids. First we manually compute the densities
d0 <- density(sample$sample_x[sample$bad_is_1==0])
d1 <- density(sample$sample_x[sample$bad_is_1==1])
Now we create functions that will interpolate between our observed density points
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
Next we find the x range of the overlap of the densities
ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))
and divide that into 500 sections
i <- seq(min(ovrng), max(ovrng), length.out=500)
Now we calculate the distance between the density curves
h <- f0(i)-f1(i)
and, using the formula for the area of a trapezoid (each slice contributes (h[i] + h[i+1])/2 * width), we add up the area over the regions where the difference h is positive; the final logical term zeroes out the remaining slices
area <- sum((h[-1] + h[-length(h)]) / 2 * diff(i) * (h[-1] >= 0))
# [1] 0.1957627
We can plot the region using
plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
jj <- which(h > 0 & seq_along(h) %% 5 == 0); j <- i[jj]
segments(j, f1(j), j, f1(j) + h[jj])
Here's a way to shade the area between two density plots and calculate the magnitude of that area.
library(ggplot2)
# Create some fake data
set.seed(10)
dat = data.frame(x = c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)),
                 group = c(rep("Bad", 1000), rep("Good", 2000)))
# Plot densities
# Use y = after_stat(count) to get counts on the vertical axis
p1 = ggplot(dat) +
  geom_density(aes(x = x, y = after_stat(count), colour = group), lwd = 1)
Some extra calculations to shade the area between the two density plots
(adapted from this SO question):
pp1 = ggplot_build(p1)
# Create a new data frame with densities for the two groups ("Bad" and "Good")
dat2 = data.frame(x = pp1$data[[1]]$x[pp1$data[[1]]$group == 1],
                  ymin = pp1$data[[1]]$y[pp1$data[[1]]$group == 1],
                  ymax = pp1$data[[1]]$y[pp1$data[[1]]$group == 2])
# We want ymax and ymin to differ only when the density of "Good"
# is greater than the density of "Bad"
dat2$ymax[dat2$ymax < dat2$ymin] = dat2$ymin[dat2$ymax < dat2$ymin]
# Shade the area between "Good" and "Bad"
p1a = p1 +
  geom_ribbon(data = dat2, aes(x = x, ymin = ymin, ymax = ymax), fill = 'yellow', alpha = 0.5)
Here are the two plots:
To get the area (number of values) in specific ranges of Good and Bad, use the density function on each group (or you can continue to work with the data pulled from ggplot as above, but this way you get more direct control over how the density distribution is generated):
## Calculate densities for Bad and Good.
# Use same number of points and same x-range for each group, so that the density
# values will line up. Use a higher value for n to get a finer x-grid for the density
# values. Use a power of 2 for n, because the density function rounds up to the nearest
# power of 2 anyway.
bad = density(dat$x[dat$group == "Bad"],
              n = 1024, from = min(dat$x), to = max(dat$x))
good = density(dat$x[dat$group == "Good"],
               n = 1024, from = min(dat$x), to = max(dat$x))
## Normalize so that densities sum to number of rows in each group
# Number of rows in each group
counts = tapply(dat$x, dat$group, length)
bad$y = counts[1]/sum(bad$y) * bad$y
good$y = counts[2]/sum(good$y) * good$y
## Results
# Number of "Good" in region where "Good" exceeds "Bad"
sum(good$y[good$y > bad$y])
[1] 1931.495 # Out of 2000 total in the data frame
# Number of "Bad" in region where "Good" exceeds "Bad"
sum(bad$y[good$y > bad$y])
[1] 317.7315 # Out of 1000 total in the data frame

Plot probability heatmap/hexbin with different sized bins

This is related to another question: Plot weighted frequency matrix.
I have this graphic (produced by the code below in R):
#Set the number of bets and number of trials and % lines
numbet <- 36
numtri <- 1000
#Fill a matrix where the rows are the cumulative bets and the columns are the trials
xcum <- matrix(NA, nrow=numbet, ncol=numtri)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(5/6,1/6), replace = TRUE)
  xcum[,i] <- cumsum(x)/(1:numbet)
}
#Plot the trials as transparent lines so you can see the build up
matplot(xcum, type="l", xlab="Number of Trials", ylab="Relative Frequency", main="", col=rgb(0.01, 0.01, 0.01, 0.02), las=1)
I very much like the way that this plot is built up and shows the more frequent paths as darker than the rarer paths (but it is not clear enough for a print presentation). What I would like to do is to produce some kind of hexbin or heatmap for the numbers. On thinking about it, it seems that the plot will have to incorporate different sized bins (see my back of the envelope sketch):
My question then: If I simulate a million runs using the code above, how can I present it as a heatmap or hexbin, with the different sized bins as shown in the sketch?
To clarify: I do not want to rely on transparency to show the rarity of a trial passing through a part of the plot. Instead I would like to denote rarity with heat, showing a common pathway as hot (red) and a rare pathway as cold (blue). Also, I do not think the bins should all be the same size, because the first trial has only two places where the path can be, but the last has many more; hence my choice of a changing bin scale. Essentially I am counting the number of times a path passes through each cell (2 in column 1, 3 in column 2, etc.) and then colouring the cell based on how many times it has been passed through.
UPDATE: I already had a plot similar to @Andrie's, but I am not sure it is much clearer than the top plot. It is the discontinuous nature of this graph that I do not like (and why I want some kind of heatmap). Because the first column has only two possible values, I think there should not be a huge visual gap between them, which is why I envisaged the different sized bins. I still feel that the binned version would show a large number of samples better.
Update: This website outlines a procedure to plot a heatmap:
To create a density (heatmap) plot version of this we have to effectively enumerate the occurrence of these points at each discrete location in the image. This is done by setting up a grid and counting the number of times a point coordinate "falls" into each of the individual pixel "bins" at every location in that grid.
Perhaps some of the information on that website can be combined with what we have already?
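As a rough sketch of that grid-counting idea applied to the simulation at the top (note this uses a fixed bin width of 0.05 and a simple blue-to-red palette, not the varying bin sizes from the hand-drawn sketch):
# Re-run the simulation, then count how many paths fall into each (trial, value) cell
numbet <- 36
numtri <- 1000
xcum <- matrix(NA, nrow = numbet, ncol = numtri)
for (i in 1:numtri) {
  x <- sample(c(0, 1), numbet, prob = c(5/6, 1/6), replace = TRUE)
  xcum[, i] <- cumsum(x) / (1:numbet)
}
trial  <- rep(1:numbet, times = numtri)   # trial index for every value in xcum
value  <- as.vector(xcum)
breaks <- seq(0, 1, by = 0.05)            # arbitrary fixed bin width
counts <- table(trial, cut(value, breaks, include.lowest = TRUE))
# Hot (red) cells are crossed by many paths, cold (blue) cells by few
image(x = 1:numbet, y = breaks[-1] - 0.025, z = matrix(counts, nrow = numbet),
      col = colorRampPalette(c("blue", "red"))(20),
      xlab = "Number of Trials", ylab = "Relative Frequency")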
Update: I combined some of what Andrie wrote with some of this question to arrive at the following, which is quite close to what I was conceiving:
numbet <- 20
numtri <- 100
prob=1/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
                 idvar="trial", v.names="outcome", direction="long", timevar="bet")
#from the other question
require(MASS)
dens <- kde2d(mxcum$bet, mxcum$outcome)
filled.contour(dens)
I don't quite understand what's going on, but this seems to be more like what I wanted to produce (obviously without the different sized bins).
Update: This is similar to the other plots here. It is not quite right:
library(hexbin)
plot(hexbin(x=mxcum$bet, y=mxcum$outcome))
Last try. As above:
image(mxcum$bet, mxcum$outcome)
This is pretty good. I would just like it to look like my hand-drawn sketch.
Edit
I think the following solution does what you ask for.
(Note that this is slow, especially the reshape step)
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
  x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
  xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
                 idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
                ymin=c(0, head(seq_along(V1)/length(V1), -1)),
                ymax=seq_along(V1)/length(V1),
                fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
# Note: older ggplot2 used formatter="percent"; current versions take labels=scales::percent
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", labels=scales::percent, low="red", high="blue") +
  scale_y_continuous(labels=scales::percent) +
  xlab("Bet")
print(p)
FYI: This is more of an extended comment than an answer.
To me, this new plot looks like a stacked bar where each bar's height is equal to the intersection points of the upper and lower line at the next trial.
The way that I would approach this is to treat "Trials" as a categorical variable. Then we can search each row of xcum for elements that are equal. If they are, we can consider this a point of intersection, whose minimum also represents the multiple defining the height of our bars.
x <- t(xcum)
x <- x[duplicated(x),]
x[x==0] <- NA
Now that we have the multiples of the actual points, we need to figure out how to take this to the next step and find a way of binning the information. That means we need to make a decision about how many points will represent each grouping. Let's write some points out for posterity.
Trial 1 (2) = 1, 0.5 # multiple = 0.5
Trial 2 (3) = 1, 0.66, 0.33 # multiple = 0.33
Trial 3 (4) = 1, 0.75, 0.5, 0.25 # multiple = 0.25
Trial 4 (5) = 1, 0.8, 0.6, 0.4, 0.2 # multiple = 0.2
Trial 5 (6) = 1, 0.8333335, 0.6666668, 0.5000001, 0.3333334, 0.1666667
...
Trial 36 (35) = 1, 0.9722223, ..., 0.02777778 # multiple = 0.05555556 / 2
In other words, for each Trial there are n-1 points to plot. In your drawing you have 7 bins. So we need to figure out the multiples for each bin.
Let's cheat and divide the last two columns by two; we know from visual inspection that the minimum is lower than 0.05
x[,35:36] <- x[,35:36] / 2
Then find the minimum of each column:
x <- apply(x, 2, function(x) min(x, na.rm=T))[-1] # Drop the 1
x <- x[c(1,2,3,4,8,17,35)] # I'm just guessing here by the "look" of your drawing.
The clearest way to do this is to create each bin separately (obviously, this could be done automatically later), remembering that each bin covers a different number of points:
bin1 <- data.frame(bin = rep("bin1",2), Frequency = rep(x[1],2))
bin2 <- data.frame(bin = rep("bin2",3), Frequency = rep(x[2],3))
bin3 <- data.frame(bin = rep("bin3",4), Frequency = rep(x[3],4))
bin4 <- data.frame(bin = rep("bin4",5), Frequency = rep(x[4],5))
bin5 <- data.frame(bin = rep("bin5",9), Frequency = rep(x[5],9))
bin6 <- data.frame(bin = rep("bin6",18), Frequency = rep(x[6],18))
bin7 <- data.frame(bin = rep("bin7",36), Frequency = rep(x[7],36))
df <- rbind(bin1,bin2,bin3,bin4,bin5,bin6,bin7)
ggplot(df, aes(bin, Frequency, color=Frequency)) + geom_bar(stat="identity", position="stack")
