TL;DR: Trying to create multiple plots in one graph, using a loop. Currently I manually write the code for each plot, then use the par() function to plot them together. It works, but I'm looking for a less repetitive way.
I was wondering if it's possible to create multiple plots in a loop; specifically, to create "plot means". The exact output I'm after is shown (second example, on plot means) in: How to create multiple ggboxplots on the same graph using the loop function?
My data looks something like this:
# A tibble: 62 x 4
offer payoff partner_transfer round_type
<dbl> <dbl> <dbl> <chr>
1 40 126 66 actual
2 100 273 273 actual
3 0 100 0 actual
4 100 6 6 actual
5 25 99 24 actual
6 80 29 9 practice
7 100 45 45 practice
8 0 100 0 practice
9 25 99 24 practice
10 100 183 183 practice
# ... with 52 more rows
I'm trying to get it to look like this:
[image: sample plot means]
Currently, my code to get this output is:
par(mfrow = c(2,2))
plot_offer <- plotmeans( offer ~ round_type, data = tg_proposer_split,
xlab = "Round Type", ylab = "Offer (by A)",
main="Mean Plot with 95% CI")
plot_partner_transfer <- plotmeans( partner_transfer ~ round_type, data = tg_proposer_split,
xlab = "Round Type", ylab = "Amount Transferred by Partner (Bot)",
main="Mean Plot with 95% CI")
plot_payoff <- plotmeans( payoff ~ round_type, data = tg_proposer_split,
xlab = "Round Type", ylab = "Payoff (for A)",
main="Mean Plot with 95% CI")
Is there a way I can shorten this code?
Apologies: for some reason I'm unable to attach images because I haven't accumulated enough reputation points, so I have no choice but to try it this way. I hope it is still clear.
Many thanks!
Here is a way to simplify the code with Map.
Define a general purpose function to take care of the plot, fun_plot;
Get the column names of the y axis variables;
Create a vector of y axis labels;
Plot in a Map loop.
The code becomes
library(gplots)  # for plotmeans
fun_plot <- function(ycol, ylab){
  fmla <- paste(ycol, "round_type", sep = "~")
  fmla <- as.formula(fmla)
  plotmeans(fmla, data = tg_proposer_split,
            xlab = "Round Type", ylab = ylab,
            main = "Mean Plot with 95% CI")
}
y_cols <- setdiff(names(tg_proposer_split), "round_type")
y_lab <- c("Offer (by A)", "Payoff (for A)", "Amount Transferred by Partner (Bot)")
old_par <- par(mfrow = c(2,2))
Map(fun_plot, y_cols, y_lab)
par(old_par)
Edit.
Following the error reported in the comments, here is a more general function, allowing xcol and the data set to take any values, not just "round_type" and tg_proposer_split respectively. This solution now uses mapply rather than Map, so that those two arguments can be passed in a MoreArgs list (the x-axis label is also exposed as an argument, with the old value as its default).
fun_plot2 <- function(ycol, ylab, xcol, data, xlab = "Round Type"){
  fmla <- paste(ycol, xcol, sep = "~")
  fmla <- as.formula(fmla)
  plotmeans(fmla, data = data,
            xlab = xlab, ylab = ylab,
            main = "Mean Plot with 95% CI")
}
old_par <- par(mfrow = c(2,2))
mapply(fun_plot2, y_cols, y_lab,
MoreArgs = list(
xcol = "round_type",
data = tg_proposer_split
)
)
par(old_par)
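With the extra arguments, the same helper can now be reused on other data. For example (a hypothetical call; other_data and its grouping column treatment are made-up names for illustration):
mapply(fun_plot2, y_cols, y_lab,
       MoreArgs = list(
         xcol = "treatment",
         data = other_data,
         xlab = "Treatment"
       )
)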
Data
tg_proposer_split <- read.table(text = "
offer payoff partner_transfer round_type
1 40 126 66 actual
2 100 273 273 actual
3 0 100 0 actual
4 100 6 6 actual
5 25 99 24 actual
6 80 29 9 practice
7 100 45 45 practice
8 0 100 0 practice
9 25 99 24 practice
10 100 183 183 practice
", header = TRUE)
I have created a program that simulates the throwing of dice 100 times. I need help with adding up the results of the individual dice and with plotting the probability distribution of outcomes.
This is the code I have:
sample(1:6, size=100, replace = TRUE)
So far, what you've done is sample the dice throws (note I've added a line setting the seed for reproducibility):
set.seed(123)
x <- sample(1:6, size=100, replace = TRUE)
The simple command to "add up the results of the individual dice" is table():
table(x)
# x
# 1 2 3 4 5 6
# 17 16 20 14 18 15
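(If by "adding up" you instead meant the grand total of all 100 throws, that is simply:
sum(x)  # grand total of the 100 throws
)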
Then, to "plot the probability distribution of outcomes," we must first get that distribution; luckily R provides the handy prop.table() function, which works for this sort of discrete distribution:
prop.table(table(x))
# x
# 1 2 3 4 5 6
# 0.17 0.16 0.20 0.14 0.18 0.15
Then we can easily plot it; for plotting PMFs, my preferred plot type is "h":
y <- prop.table(table(x))
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
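Since a fair die has probability 1/6 for each face, a small optional addition is to draw that theoretical value as a reference line on the same plot:
abline(h = 1/6, lty = 2)  # theoretical probability of each face on a fair die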
Update: Weighted die
sample() can easily be used to simulate a weighted die via its prob argument. From help("sample"):
Usage
sample(x, size, replace = FALSE, prob = NULL)
Arguments
[some content omitted]
prob a vector of probability weights for obtaining the elements of the vector being sampled.
So, we just add your preferred weights to the prob argument and proceed as usual (note I've also upped your sample size from 100 to 10000):
set.seed(123)
die_weights <- c(4/37, rep(6/37, 4), 9/37)
x <- sample(1:6, size = 10000, replace = TRUE, prob = die_weights)
(y <- prop.table(table(x)))
# x
# 1 2 3 4 5 6
# 0.1021 0.1641 0.1619 0.1691 0.1616 0.2412
plot(y, type = "h", xlab = "Dice Result", ylab = "Probability")
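As a quick sanity check (a sketch), the observed proportions should track the theoretical weights closely at this sample size:
round(rbind(observed = as.numeric(y), theoretical = die_weights), 4)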
I generated this image using the hclust function. Now I want the IDs of those elements highlighted by squares.
Is there any way to get the IDs and related values from the clustered data sets? Thanks
EDIT
I used this R script
library(gplots)
library(geneplotter)
# read the data in
bots <- read.table("expression.txt")
# get just the alpha data
abot <- bots[,c(1:9)]
rownames(abot) <- bots[,1]
abot[1:7,]
# get rid of NAs
abot[is.na(abot)] <- 0
# we need to find a way of reducing the data. Can't do ANOVA as there are no
# replicates. Sort on max difference and take first 1000
min <-apply(abot, 1, min)
max <- apply(abot, 1, max)
sabot <- abot[order(max - min, decreasing=TRUE),][1:1000,]
# cluster on correlation
cdist <- as.dist(1 - cor(t(sabot)))
hc <- hclust(cdist, "average")
# draw a heatmap
x11()
heatmap.2(as.matrix(sabot),
Rowv=as.dendrogram(hc),
Colv=FALSE,
cexRow=1,
cexCol=1,
dendrogram="row",
scale="row",
trace="none",
density.info="none",
key=FALSE,
col=greenred.colors(80))
and my data look like this
YF MF SF YL ML SL Stem Root SULE
1 31.64075611 32.2728151 38.81790359 252.8901009 269.7599455 138.5011042 16.58308894 10.47935935 3.364295997
2 6.484902171 9.141084197 5.748798541 3.637332586 4.762966989 4.149302282 7.194971046 9.932508868 1.600027931
3 14.15218386 8.784155316 9.740794214 6.566584262 6.130503033 7.747728536 12.57014531 15.75181203 9.22907038
4 15.72881736 19.95755802 10.13050089 10.31313758 9.838844457 14.24864327 13.00442008 23.85404067 12.17251862
5 30.45475953 15.57131432 17.15277867 8.884751572 8.78786964 12.4745649 11.90176123 35.9844343 6.904763942
6 15.87149807 19.05523246 13.12846166 12.99750491 15.3775883 19.0044086 21.66051467 20.38501538 39.58478032
7 16.58935728 18.63990933 17.20955634 13.04423927 29.98424087 18.02165996 22.22403582 32.38377369 10.90832984
8 29.91118855 19.65844846 23.45958109 62.56338088 55.3926187 39.85296152 31.4832543 14.8484163 1.326553777
9 4.09192129 15.52499475 12.14321788 1.680854758 3.448485979 5.245481483 15.14443161 28.85873063 1.073855381
10 7.02768911 4.267210165 3.383501945 3.53716686 3.105614581 3.493791292 3.806360251 6.713067543 3.338740245
11 17.61821596 18.03607855 12.939663 8.951935241 15.45268577 15.53817186 20.5098186 23.42760284 27.97680418
12 66.35291651 40.41837702 37.7239447 32.42998176 30.09696289 27.81089554 33.27197681 46.5393928 4.141505618
13 15.45804403 15.98469202 17.21176468 9.105208867 11.76140929 13.9751105 14.72159466 25.68388472 7.493988128
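(A possible starting point, sketched here as a guess rather than a definitive answer: cluster membership, and hence the row IDs, can be read off the same hclust object with cutree(); the choice k = 4 is an assumption you would tune to match the number of highlighted squares.)
groups <- cutree(hc, k = 4)        # assign each of the 1000 rows to a cluster
ids <- names(groups)[groups == 2]  # IDs of the rows in, e.g., cluster 2
sabot[ids, ]                       # and their values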
I have a fitted binomial logit model and want to calculate the cumulative probability of experiencing an event <= some value of a covariate.
For example, if I have a fitted model that predicts an outcome based on a continuous distance range (0-8.5 km), I might want to find out the cumulative probability for distance <= 4.5 km.
I have vectors of estimated probabilities and the associated distances, as below:
dat <- structure(list(km = c(0, 0.447368421052632, 0.894736842105263,
1.34210526315789, 1.78947368421053, 2.23684210526316, 2.68421052631579,
3.13157894736842, 3.57894736842105, 4.02631578947368, 4.47368421052632,
4.92105263157895, 5.36842105263158, 5.81578947368421, 6.26315789473684,
6.71052631578947, 7.15789473684211, 7.60526315789474, 8.05263157894737,
8.5), prob = c(0.99010519543441, 0.985413663823809, 0.97854588563623,
0.968547716962174, 0.954108659036907, 0.933496091194704, 0.904551377544634,
0.864833064332603, 0.81202174997839, 0.744668375529677, 0.663191827576796,
0.570704402277059, 0.47300143764816, 0.377323442817887, 0.290336664745317,
0.216433162546689, 0.157174982015906, 0.111825887625402, 0.0783449309507567,
0.054275681518511)), .Names = c("km", "prob"), row.names = c(NA,
-20L), class = "data.frame")
What I ultimately want to say is "x% of observations within distance x are predicted to experience an event". Is this the right way to go about that?
Also, is there an easy way to calculate which distance (from 0 upward) encompasses 50% of the cumulative probability?
Thanks, Tim
There is probably some way to extract this from your model, but if you were doing it from scratch I would try to fit your data to a distribution, then extract your relevant data points.
First define an error function:
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x)) # or some other error fxn
Now let's say your data sort of looks like a gamma distribution, so try:
gdf <- function(x, d=dat$km) pgamma(d,shape=x[1], scale=x[2])
So your function to optimize will be the error function of your data and the fit distribution:
error_fxn <- function(x) rmse(rev(dat$prob),gdf(x)) # rev data to make ascending
Now optimize this function to get your parameters for the distribution of interest:
rr <- optim(c(1,1),error_fxn)
And let's see how good the fit is (just ok...):
rr
# $par
# [1] 3.108392 1.112584
# $value
# [1] 0.0333369
# $counts
# function gradient
#      119       NA
# $convergence
# [1] 0
# $message
# NULL
Or graphically:
with(dat,plot(km,prob,xlim=c(10,0)))
with(dat,lines(rev(km),pgamma(km,shape=rr$par[1], scale=rr$par[2]),col='red'))
Take a look at the values for the CDF:
kms <- seq(0,8.5,0.5)
data.frame(dist = kms, cdf = pgamma(kms,shape=rr$par[1], scale=rr$par[2]))
# dist cdf
# 1 0.0 0.000000000
# 2 0.5 0.008634055
# 3 1.0 0.053615340
# 4 1.5 0.137291689
# 5 2.0 0.245961242
# 6 2.5 0.363956061
# 7 3.0 0.479070721
# 8 3.5 0.583659363
# 9 4.0 0.673982194
# 10 4.5 0.749075757
# 11 5.0 0.809691054
# 12 5.5 0.857478086
# 13 6.0 0.894431622
# 14 6.5 0.922551998
# 15 7.0 0.943661710
# 16 7.5 0.959325076
# 17 8.0 0.970830577
# 18 8.5 0.979207658
And to answer your final question, get the distance at 50% of the CDF:
qgamma(0.5,shape=rr$par[1], scale=rr$par[2])
# [1] 3.095395
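If you'd rather not call qgamma() directly, a numerical inversion of the same fitted CDF (a sketch using linear interpolation with approx()) gives essentially the same answer:
cdf_vals <- pgamma(dat$km, shape = rr$par[1], scale = rr$par[2])
approx(x = cdf_vals, y = dat$km, xout = 0.5)$y
# should be close to the qgamma() value above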
I asked this question a year ago and got code for this "probability heatmap":
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
ymin=c(0, head(seq_along(V1)/length(V1), -1)),
ymax=seq_along(V1)/length(V1),
fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", formatter="percent", low="red", high="blue") +
scale_y_continuous(formatter="percent") +
xlab("Bet")
print(p)
(May need to change this code slightly because of this)
This is almost exactly what I want, except each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), whereas 7 should have 8 (N+1).
If I'm right, the reason the code does this is that it is based on the observed data, and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.
How can I adapt this code to give the correct number of bins?
I have used R's dbinom to generate the frequencies of heads for n = 1:32 trials and plotted the graph below. It should be what you expect. I have read some of your earlier posts here on SO and on math.stackexchange, but I still don't understand why you'd want to simulate the experiment rather than generate from a binomial random variable. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.
set.seed(42)
numbet <- 32
numtri <- 1e5
prob=5/6
require(plyr)
out <- ldply(1:numbet, function(idx) {
outcome <- dbinom(idx:0, size=idx, prob=prob)
bet <- rep(idx, length(outcome))
N <- round(outcome * numtri)
ymin <- c(0, head(seq_along(N)/length(N), -1))
ymax <- seq_along(N)/length(N)
data.frame(bet, fill=outcome, ymin, ymax)
})
require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
geom_rect(aes(fill=fill), colour="grey80") +
scale_fill_gradient("Outcome", low="red", high="blue") +
xlab("Bet")
The plot:
Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.
Basically, what Andrie did (or rather, one way to look at it) is to use the fact that if you have two binomial distributions X ~ B(n, p) and Y ~ B(m, p), where n, m are the sizes and p is the probability of success, then their sum X + Y ~ B(n + m, p) (1). So, the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become very obvious, and it can be constructed in no time (without any need for a for-loop or constructing a cumsum every time).
If you have followed me so far, our idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability 5/6 and 1/6 respectively. That is, if you have numtri = 1000, then you'll have ~834 0's and ~166 1's for each of the numbet columns (= 32 here). Let's construct this and test it first.
numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))
# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166
163 164 176 162 160 177 157 163 166 146 170
# So, the count of 1's are "approximately" what we expect (around 166).
Now, each of these columns is a sample of size numtri from a binomial distribution with n = 1 (i.e. a Bernoulli variable with success probability 1/6). If we were to add the first two columns and replace the second column with this sum, then, from (1), since the probabilities are equal, we'd end up with a binomial distribution with n = 2. Similarly, if you instead added the first three columns and replaced the 3rd column with this sum, you would obtain a binomial distribution with n = 3, and so on...
The concept is that if you cumulatively add each column, then you end up with numbet number of binomial distributions (1 to 32 here). So, let's do that.
xcum <- t(apply(xcum, 1, cumsum))
# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
0 1 2
694 285 21
> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278 28
# more or less identical, good!
If you divide the xcum we have generated thus far by cumsum(1:numbet) across each row, in this manner:
xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)
this will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require. I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
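You can check this equivalence directly (a quick sketch, regenerating the loop version with the same seed as above and dropping its trial column):
set.seed(45)
xcum_loop <- matrix(NA, nrow = numtri, ncol = numbet + 1)
for (i in 1:numtri) {
  x <- sample(c(0, 1), numbet, prob = c(5/6, 1/6), replace = TRUE)
  xcum_loop[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
all.equal(xcum, xcum_loop[, -1], check.attributes = FALSE)
# should be TRUE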
Now on to why you have difficulties obtaining the graph I had attached (with n+1 bins):
For a binomial distribution with n = 1:32 trials, 5/6 as the probability of tails (failure) and 1/6 as the probability of heads (success), the probability of k heads is given by:
nCk * (1/6)^k * (5/6)^(n-k) # where nCk is n choose k
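As a quick sanity check (a sketch), this closed form agrees with dbinom():
n <- 7; k <- 0:7
all.equal(choose(n, k) * (1/6)^k * (5/6)^(n - k), dbinom(k, n, prob = 1/6))
# should be TRUE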
For the test data we've generated, the observed proportions of k heads for n = 7 and n = 8 (trials) are:
# n=7
0 1 2 3 4 5
.278 .394 .233 .077 .016 .002
# n=8
0 1 2 3 4 5
.229 .375 .254 .111 .025 .006
Why do both have 6 bins and not 8 and 9 bins? Of course, this has to do with the value of numtri = 1000. Let's see what the probabilities of each of these 8 and 9 bins are by generating them directly from the binomial distribution using dbinom, to understand why this happens.
# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000
# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
You see that the probabilities corresponding to k = 6, 7 (for n = 7) and k = 6, 7, 8 (for n = 8) are ~0; they are very low. The minimum value here is actually about 5.95 * 1e-7 (n = 8, k = 8, i.e. (1/6)^8). This means you'd expect to get 1 such value only if you simulated roughly 1.7 million times. If you check the same for n = 32 and k = 32, the value is about 1.26 * 1e-25, so you'd have to simulate on that order of values to get at least 1 result where all 32 outcomes are heads for n = 32.
This is why your results had no values in certain bins: the probability of landing in them is very low for the given numtri. For the same reason, generating the probabilities directly from the binomial distribution overcomes this problem/limitation.
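To put numbers on this (a quick sketch), here is the expected number of simulations needed to see these rare outcomes even once:
1 / dbinom(0, 8, prob = 5/6)   # all 8 heads:  ~1.7 million trials
1 / dbinom(0, 32, prob = 5/6)  # all 32 heads: ~8e24 trials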
I hope I've managed to write with enough clarity for you to follow. Let me know if you have trouble going through it.
Edit 2:
When I ran the code I've edited above with numtri = 1e6 and counted the number of heads for k = 0:7 and k = 0:8, I got this for n = 7 and n = 8:
# n = 7
0 1 2 3 4 5 6 7
279347 391386 233771 77698 15763 1915 117 3
# n = 8
0 1 2 3 4 5 6 7 8
232835 372466 259856 104116 26041 4271 392 22 1
Note that there are now counts for k = 6 and k = 7 for both n = 7 and n = 8. Also, for n = 8, you have a value of 1 for k = 8. With increasing numtri you'll obtain more of the other missing bins, but it will require a huge amount of time/memory (if it's feasible at all).