I have a dataframe with a numerical variable and a factor variable, like this:
set.seed(123)
df <- data.frame(
numbers = c(rnorm(50, 3), runif(50)),
levels = sample(LETTERS[1:5], 100, replace = T)
)
What I'd like to do is create a stripchart that plots df$numbers against df$levels and inserts vertical segments marking the mean for each level.
stripchart(df$numbers ~ df$levels, method = "jitter")
Obviously, I could insert the mean line for each level separately, e.g.:
segments(x0 = mean(df$numbers[df$levels=="A"]), y0 = 1-0.3, y1 = 1+0.3, col = "red" )
And so on for all the other levels, which is tedious when there are many levels. So I've tried this for loop:
for(i in seq(unique(df$levels))){
segments(x0 = mean(df$numbers[df$levels==i]),
y0 = i - 0.3,
y1 = i + 0.3,
col = "red", lty = 3, lwd = 2)
}
But that doesn't draw anything (and doesn't throw an error either). What's the cleanest and simplest code to insert the mean segments?
In your loop, i takes the values 1, 2, ..., so df$levels == i compares the factor's labels ("A".."E") against "1".."5" and never matches; mean() of an empty vector is NaN, so segments() silently draws nothing. As the 'levels' column is a factor, use levels() to get its levels ('un1'), then loop over that sequence; for each level, take the mean of 'numbers' where the 'levels' column equals that level and draw the segment:
un1 <- levels(df$levels)
for(i in seq_along(un1)){
segments(x0 = mean(df$numbers[df$levels==un1[i]]),
y0 = i - 0.3,
y1 = i + 0.3,
col = "red", lty = 3, lwd = 2)
}
Checking the means:
with(df, tapply(numbers, levels, FUN = mean))
# A B C D E
#1.390202 1.541655 2.086605 2.377122 1.663159
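For what it's worth, because segments() recycles its arguments, the loop can also be collapsed into a single call. This is just a sketch of that idea, relying on the fact that the stripchart places the groups at y = 1, 2, ... in the order of levels(df$levels):
# loop-free sketch: one vectorized segments() call draws all the mean markers
grp_means <- with(df, tapply(numbers, levels, mean))
segments(x0 = grp_means,
         y0 = seq_along(grp_means) - 0.3,
         y1 = seq_along(grp_means) + 0.3,
         col = "red", lty = 3, lwd = 2)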
I stumbled upon this question and wanted to do something similar on my plot (for the exponential and Weibull functions).
I want to extract the x value for a specific y value; in my case, the y value would be 0.5.
Here is my reproducible example:
library(survival)
attach(stanford2)
km = survfit(Surv(time, status)~ 1)
wykładniczy <- survreg(Surv(time, status) ~ 1, dist = 'exponential')
lambda_wyk <- exp(-wykładniczy$coef)
weibulla <- survreg(Surv(time, status) ~ 1, dist = 'weibull')
lambda_wei <- exp(-weibulla$icoef[1])
kappa_wei <- 1/weibulla$icoef[2]
plot(km, col = 'black', lwd = 2, conf.int = FALSE)
x = seq(0, max(time), length = 10000)
lines(x, exp(-lambda_wyk*x), col = 'red', lwd = 2)
lines(x, exp(-(lambda_wei*x)^kappa_wei), col = 'blue', lwd = 2)
grid()
legend('topright', c('Kaplan-Meier', 'Wykladniczy',
'Weibull'), col = c('black',
'red', 'blue'), lwd = 2)
I'm using the stanford2 data from the survival package.
I tried it like this:
x[which(exp(-(lambda_wei*x)^kappa_wei) == 0.5)]
But I'm missing something. An explanation of what I should do would be great.
I'm not sure if this is exactly what you are looking for, but maybe it helps you find the final solution:
y = exp(-lambda_wyk*x)
df = data.frame(x, y)
# Convert the columns to character to avoid the issue explained in the link below
df$x = as.character(df$x)
df$y = as.character(df$y)
# Find the row where x = 0.0188018801880188
df[df$x == '0.0188018801880188',]
Output:
x y
2 0.0188018801880188 0.999815684829423
(The issue referred to in the comment above: Why are these numbers not equal?)
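If the goal is specifically the x at which a fitted curve crosses y = 0.5, exact equality will almost never hit a grid point. A minimal sketch using linear interpolation with approx() (my suggestion, not part of the answer above, reusing the objects from the question's code) would be:
# interpolate the time at which the fitted Weibull survival curve crosses 0.5
y_wei <- exp(-(lambda_wei * x)^kappa_wei)
approx(y_wei, x, xout = 0.5)$y   # approximate median survival time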
Suppose you have 3 sets, numbered from 1 to 3.
Each set contains unique IDs associated with categorical active/inactive labels for variables A, B, C, D.
You want to make plots that show, for each variable, the proportions of active/inactive labels side by side in the 3 sets, to show whether they are homogeneous.
The only way I could come up with to do this was the following:
# Simulate data: 3 different sets, each with 4 different variables, each with different proportions of labels
sets = c("1", "2", "3")
variables = c("A", "B", "C", "D")
labs = c("active", "inactive")
N = 10000
set.seed(1325)
d = data.frame("set" = sample(sets, N, replace = TRUE, prob = c(0.1, 0.2, 0.7)),
"variable" = sample(variables, N, replace = TRUE, prob = c(0.15, 0.25, 0.2, 0.4)))
d["label"] = "x"
for (v in variables) {
vw = which(d[["variable"]] == v)
vp = runif(1, 0.1, 0.6)
d[vw, "label"] = sample(labs, length(vw), replace = TRUE, prob = c(vp, 1 - vp))
}
d["ID"] <- 1:N
s = aggregate(ID ~ set + variable + label, d, length)
s.l = aggregate(ID ~ set + variable, d, length)
colnames(s.l)[3] <- "ID.l"
s = merge(s, s.l)
s["frac"] = with(s, ID / ID.l)
op = par(no.readonly = TRUE)
par(mfrow = c(2,2))
for (v in variables) {
barplot(frac ~ label + set, s, subset = variable == v, col = c("blue", "orange"), main = v)
}
par(op)
Given how the labels are assigned in the code, their proportions are different for the different variables, but homogeneous across the sets.
To show what happens when the proportions are not homogeneous:
# change the proportion of labels for one set
sw = which(d[["set"]] == 1)
d.u = d
d.u[sw, "label"] = sample(labs, length(sw), replace = TRUE, prob = c(0.05, 1 - 0.05))
s.u = aggregate(ID ~ set + variable + label, d.u, length)
s.u.l = aggregate(ID ~ set + variable, d.u, length)
colnames(s.u.l)[3] <- "ID.l"
s.u = merge(s.u, s.u.l)
s.u["frac"] = with(s.u, ID / ID.l)
op = par(no.readonly = TRUE)
par(mfrow = c(2,2))
for (v in variables) {
barplot(frac ~ label + set, s.u, subset = variable == v, col = c("blue", "orange"), main = v)
}
par(op)
Question: do you think this can be done or represented better / more efficiently?
In particular, I would have thought that the aggregate-and-divide step might already be built into some type of plot.
And I am wondering if using mfrow and plotting an array of separate plots is any good, or if there is some way to make a more cohesive lattice or grid of plots by using variable as one of the parameters.
Any ideas?
Instead of the double aggregate calls, consider by to split the data frame by variable and run xtabs + proportions on each subset:
barplot formula style
op <- par(mfrow = c(2,2))
tbls <- by(d, d$variable, FUN=function(sub) {
tbl <- xtabs(~ label + set, sub)
props <- data.frame(proportions(tbl, 2))
barplot(Freq ~ label + set, props,
col = c("blue", "orange"),
main = sub$variable[1])
})
par(op)
barplot matrix style
op <- par(mfrow = c(2,2))
tbls <- by(d, d$variable, FUN=function(sub) {
tbl <- xtabs(~ label + set, sub)
props <- proportions(tbl, 2)
barplot(props, xlab = "set", ylab = "frac",
col = c("blue", "orange"),
main = sub$variable[1])
})
par(op)
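Not part of the answer above, but regarding whether the aggregate-and-divide step is already built into some plot type and whether a more cohesive grid is possible: assuming ggplot2 is an option, geom_bar(position = "fill") computes the within-set proportions itself and facet_wrap() lays out one panel per variable:
library(ggplot2)
# proportions are computed by position = "fill"; one panel per variable
ggplot(d, aes(x = set, fill = label)) +
  geom_bar(position = "fill") +
  facet_wrap(~ variable) +
  labs(y = "frac") +
  scale_fill_manual(values = c(active = "blue", inactive = "orange"))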
Dear stackoverflow Community,
I have a vector of correlation values, which I want to link to corresponding color codes (say -1 = "Dark Red", 0 = "Light Gray", 1 = "Dark Green"). So, for example, if the maximum value in my correlation vector were 0.75, the corresponding color should be a lighter green. Is there a way to achieve this in R?
Thank you!
What you're looking for is ggplot2::scale_colour_gradient2(). Since you didn't provide any example data (which I highly recommend in the future; it encourages answers and helps answerers tailor their responses to your actual data structure), I concocted the following simple example:
library(ggplot2)
set.seed(123)
n <- 1000
corrs <- seq(-0.9, 0.9, length.out = 10)
vals <- matrix(0, nrow = 0, ncol = 2)
for ( corr in corrs ) {
tmp <- mvtnorm::rmvnorm(n/10, sigma = matrix(c(1, corr, corr, 1), nrow = 2))
# print(cor(tmp)) # If you want to do QA
vals <- rbind(vals, tmp)
}
df <- data.frame(var1 = vals[ , 1], var2 = vals[ , 2],
corr = rep(corrs, each = n/10))
ggplot(df, aes(x = var1, y = var2, colour = corr)) +
geom_point(shape = 1) +
scale_colour_gradient2(low = "darkred", mid = "gray", high = "darkgreen")
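Outside ggplot2, if what is needed is a plain vector of colour codes for the correlation values themselves, a minimal base-R sketch (my assumption about the desired endpoints, using colorRamp()) could look like this:
# map correlations in [-1, 1] onto a dark red -> light gray -> dark green ramp
pal <- colorRamp(c("darkred", "lightgray", "darkgreen"))
corr_to_col <- function(r) {
  m <- pal((r + 1) / 2)   # rescale [-1, 1] to [0, 1]; rows are RGB values in 0..255
  rgb(m[, 1], m[, 2], m[, 3], maxColorValue = 255)
}
corr_to_col(c(-1, 0, 0.75, 1))   # 0.75 gives a lighter green than 1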
I'd like to create a Sankey-like plot in ggplot2, with curved lines between my start and end locations. Currently, I have data that looks like this:
df <- data.frame(Line = rep(letters[1:4], 2),
Location = rep(c("Start", "End"), each=4),
X = rep(c(1, 10), each = 4),
Y = c(c(1,3, 5, 15), c(9,12, 14, 6)),
stringsAsFactors = F)
ex:
Line Location X Y
1 a Start 1 1
2 a End 10 9
which, when plotted with the code below, gives something like this:
library(ggplot2)
ggplot(df) +
geom_path(aes(x= X, y= Y, group = Line))
I would like to see the data come out like this:
This is another option for setting up the data:
df2 <- data.frame(Line = letters[1:4],
Start.X= rep(1, 4),
Start.Y = c(1,3,5,15),
End.X = rep(10, 4),
End.Y = c(9,12,14,6))
ex:
Line Start.X Start.Y End.X End.Y
1 a 1 1 10 9
I can find examples of how to add a curve in base R graphics, but those examples don't demonstrate how to get a data frame of the in-between points needed to draw that curve. I would prefer to use dplyr for data manipulation. I imagine this will require a for loop to build a table of the interpolated points.
These examples are similar but do not produce an s-shaped curve:
Plotting lines on map - gcIntermediate
http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
Thank you in advance!
The code below creates curved lines via a logistic function. You could use whatever function you like instead; this is just the main idea. I should note that, for anything other than graphical purposes, creating a curved line out of two points is a bad idea: it suggests the data follow a particular kind of relationship when there is no evidence for that.
df <- data.frame(Line = rep(letters[1:4], 2),
Location = rep(c("Start", "End"), each=4),
X = rep(c(1, 10), each = 4),
Y = c(c(1,3, 5, 15), c(9,12, 14, 6)),
stringsAsFactors = F)
# logistic function for curved lines
logistic = function(x, y, midpoint = mean(x)) {
  ry = range(y)
  # rising curve if the end point lies above the start, falling otherwise
  if (y[1] < y[2]) {
    sign = 2
  } else {
    sign = -2
  }
  # steepness of the transition, scaled by the x and y ranges
  steepness = sign * diff(range(x)) / diff(ry)
  # logistic curve rescaled to run between ry[1] and ry[2], centred at midpoint
  out = (ry[2] - ry[1]) / (1 + exp(-steepness * (x - midpoint))) + ry[1]
  return(out)
}
# an example
x = c(1, 10)
y = c(1, 9)
xnew = seq(1, 10, .5)
ynew = logistic(xnew, y)
plot(x, y, type = 'b', bty = 'n', las = 1)
lines(xnew, ynew, col = 2, type = 'b')
# applying the function to your example
xnew = seq(min(df$X), max(df$X), .1) # new x grid
m = matrix(NA, length(xnew), 4) # matrix to store results
uniq = unique(df$Line) # loop over all unique values in df$Line
for (i in seq_along(uniq)) {
m[, i] = logistic(xnew, df$Y[df$Line == uniq[i]])
}
# base R plot
matplot(xnew, m, type = 'b', las = 1, bty = 'n', pch = 1)
# put stuff in a dataframe for ggplot
df2 = data.frame(x = rep(xnew, ncol(m)),
y = c(m),
group = factor(rep(1:ncol(m), each = nrow(m))))
library(ggplot2)
ggplot(df) +
geom_path(aes(x= X, y= Y, group = Line, color = Line)) +
geom_line(data = df2, aes(x = x, y = y, group = group, color = group))
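Since the question expresses a preference for dplyr, here is a minimal sketch of the same idea as a dplyr pipeline (my addition, not part of the answer above; it assumes dplyr >= 1.1 for reframe() and reuses the logistic() function and xnew grid defined above):
library(dplyr)
# build the interpolated points per Line in one pipeline
df_curves <- df %>%
  group_by(Line) %>%
  reframe(x = xnew, y = logistic(xnew, Y))
ggplot(df_curves, aes(x = x, y = y, colour = Line)) +
  geom_path()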
I would like to plot individual subject means for two different conditions in a lattice stripplot with two panels. I would also like to add within-subject confidence intervals that I have calculated and stored in a separate data frame. I am trying to overlay these confidence intervals with latticeExtra's layer function. When I add the layer, either both sets of intervals display on both panels (as illustrated in the code and first image below), or both sets of intervals display on only the first panel if I add [subscripts] to the x's and y's in the layer command (illustrated in the second code clip and image below). How do I get the appropriate intervals to display on the appropriate panel?
library(latticeExtra)
raw_data <- data.frame(subject = rep(1:6, 4),
                       cond1 = as.factor(rep(1:2, each = 12)),
                       cond2 = rep(rep(c("A", "B"), each = 6), 2),
                       response = c(2:7, 6:11, 3:8, 7:12))
summary_data <- data.frame(cond1 = as.factor(rep(1:2, each = 2)),
                           cond2 = rep(c("A", "B"), times = 2),
                           mean = aggregate(response ~ cond2 * cond1, raw_data, mean)$response,
                           within_ci = c(0.57, 0.54, 0.6, 0.63))
summary_data$lci <- summary_data$mean - summary_data$within_ci
summary_data$uci <- summary_data$mean + summary_data$within_ci
subject_stripplot <- stripplot(response ~ cond1 | cond2, groups = subject, data = raw_data,
panel = function(x, y, ...) {
panel.stripplot(x, y, type = "b", lty = 2, ...)
panel.average(x, y, fun = mean, lwd = 2, col = "black", ...) # plot line connecting means
}
)
addWithinCI <- layer(panel.segments(x0 = cond1, y0 = lci, x1 = cond1, y1 = uci, subscripts = TRUE), data = summary_data, under = FALSE)
plot(subject_stripplot + addWithinCI)
Stripplot with both sets of intervals on both panels:
addWithinCI2 <- layer(panel.segments(x0 = cond1[subscripts], y0 = lci[subscripts], x1 = cond1[subscripts], y1 = uci[subscripts], subscripts = TRUE), data = summary_data, under = FALSE)
plot(subject_stripplot + addWithinCI2)
Stripplot with both sets of intervals on only the first panel:
One possible solution would be to print the stripplot (e.g., inside a png or any other graphics device) and subsequently modify each sub-panel using trellis.focus.
## display stripplot
print(subject_stripplot)
## loop over groups
for (i in c("A", "B")) {
# subset of current group
dat <- subset(summary_data, cond2 == i)
# add intervals to current panel
trellis.focus(name = "panel", column = ifelse(i == "A", 1, 2), row = 1)
panel.segments(x0 = dat$cond1, y0 = dat$lci,
x1 = dat$cond1, y1 = dat$uci, subscripts = TRUE)
trellis.unfocus()
}
Another (possibly more convenient) solution would be to create a separate xyplot and set the lower and upper y values (y0, y1) passed to panel.segments manually, depending on the current panel.number(). In contrast to the initial approach using trellis.focus, the resulting plot can be stored in a variable and is hence available for further processing inside R.
p_seg <- xyplot(lci ~ cond1 | cond2, data = summary_data, ylim = c(1, 13),
panel = function(...) {
# lower and upper y values
y0 <- list(summary_data$lci[c(1, 3)], summary_data$lci[c(2, 4)])
y1 <- list(summary_data$uci[c(1, 3)], summary_data$uci[c(2, 4)])
# insert vertical lines depending on current panel
panel.segments(x0 = 1:2, x1 = 1:2,
y0 = y0[[panel.number()]],
y1 = y1[[panel.number()]])
})
p_comb <- subject_stripplot +
as.layer(p_seg)
# print(p_comb)
Another solution that does not require latticeExtra (from Duncan Mackay):
summary_data$cond3 <- sapply(summary_data$cond2, pmatch, LETTERS)
mypanel <- function(x, y, ..., lci, uci, scond1, scond3, groups, type, lty){
pnl = panel.number()
panel.xyplot(x, y, ..., groups = groups, type = type, lty = lty)
panel.average(x, y, horizontal = FALSE, col = "black", lwd = 3)
panel.segments(x0 = scond1[scond3 == pnl],
y0 = lci[scond3 == pnl],
x1 = scond1[scond3 == pnl],
y1 = uci[scond3 == pnl])
}
with(summary_data,
stripplot(response ~ cond1 | cond2, data = raw_data,
groups = subject,
lci = lci,
uci = uci,
scond1 = summary_data$cond1,
scond3 = cond3,
type = "b",
lty = 2,
panel = mypanel)
)