How to plotmeans of a group conditional on a variable - r

Right now I have a graph that displays the mean of one of my variables by year. I want two mean lines for that variable on the same plot: one for when another variable in my data frame equals 0 and one for when it equals 1.
Right now this is what I have:
library(gplots)
plotmeans(x ~ year, data = df, frame = FALSE,
mean.labels = TRUE)
This works, but it only gives me the mean of x by year with no conditioning. I want two lines on the same plot, both with year on the x-axis: one for the mean of x when, for example, y = 0, and one for the mean of x when y = 1.

gplots::plotmeans accepts only one variable on the right-hand side of its formula argument. The trick is to pass the interaction of the two variables of interest.
First, make up a data set.
set.seed(1234) # Reproducible results
df <- data.frame(x = rnorm(210),
y = rbinom(210, 1, 0.5),
year = rep(2017:2019, 70))
Now the graphs.
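library(gplots)
# Years grouped within each value of y: one connected line per value of y
plotmeans(x ~ interaction(year, y, sep = " y = "), data = df,
          mean.labels = TRUE, digits = 2,
          connect = list(1:3, 4:6))
# Values of y grouped within each year: the two y means are connected per year
plotmeans(x ~ interaction(y, year, sep = " "), data = df,
          mean.labels = TRUE, digits = 2,
          connect = list(1:2, 3:4, 5:6))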
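The connect indices refer to the positions of the interaction levels, so it can help to inspect that order before plotting. A minimal sketch using the same simulated data; grp and group_means are names introduced here for illustration only:

```r
set.seed(1234) # Same simulated data as in the answer
df <- data.frame(x = rnorm(210),
                 y = rbinom(210, 1, 0.5),
                 year = rep(2017:2019, 70))

# interaction() varies its first factor fastest, so the three years for
# y = 0 occupy positions 1:3 and the three years for y = 1 occupy 4:6
grp <- interaction(df$year, df$y, sep = " y = ")
levels(grp)

# The six group means, in the same order as the levels
group_means <- tapply(df$x, grp, mean)
group_means
```

This is why connect = list(1:3, 4:6) joins the years within each value of y when year is the first argument to interaction().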

Related

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
I have seen solutions that pass a "col" vector to qqnorm; however, this does not seem to work with qqplot: half of the points are simply drawn in each color, regardless of group. Is there a way to fix this?
I simulated some data to show how to add color in ggplot2.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
For those who are wondering how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
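The reason col = data$group misbehaves in the original qqplot call is that qqplot sorts both samples before plotting, so a color vector in the original row order no longer lines up with the points. Reordering the groups with order(x) restores the alignment. A minimal base R sketch, assuming simulated data and fixed shape parameters (2 and 5) in place of the beta.mle estimates; x, g, and qq are made up for illustration:

```r
set.seed(42)
n <- 200
x <- rbeta(n, shape1 = 2, shape2 = 5)   # hypothetical data
g <- sample(c(1, 2), n, replace = TRUE) # hypothetical numeric group codes

ord <- order(x) # permutation that sorts the data
qq <- data.frame(
  my_data = x[ord],
  quantile_data = qbeta(ppoints(n), 2, 5),
  # Reorder the groups with the same permutation and make them a factor
  # so that the colors are discrete
  group = factor(g[ord], labels = c("Group1", "Group2"))
)

# Each point now carries the group of the observation it came from
plot(qq$quantile_data, qq$my_data, col = qq$group,
     xlab = "Quantiles of Beta Distribution", ylab = "Quantiles of data")
```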

Splitting a plotly violin plot by more than two groups

My question is about expanding R plotly's grouped violin plot to a case with more than two groups.
Taking the data that are used in the grouped violin plot example code and adding a third level to df$sex:
library(dplyr)
set.seed(1)
df <- read.csv("https://raw.githubusercontent.com/plotly/datasets/master/violin_data.csv")
df <- df %>%
  rbind(df[sample(nrow(df), 100, replace = FALSE), ] %>%
          dplyr::mutate(sex = "undefined", day = sample(df$day, 100, replace = FALSE)))
df$sex <- factor(df$sex)
Trying to plot this with:
plotly::plot_ly(x = df$day, y = df$total_bill, type = 'violin', split = df$sex, color = df$sex)
I get the violins of each of the sexes centered rather than split:
And this remains the case if I switch split = df$sex to name = df$sex.
But if I change type = 'violin' to type = 'bar' I do get df$sex split:
Any idea how to get this to work for the type = 'violin' case?
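One thing that may be worth trying is plotly's violinmode layout attribute, which controls whether violins at the same x position are overlaid or drawn side by side. A minimal sketch, assuming simulated data in place of the CSV above (toy is a made-up data frame, not the original dataset):

```r
library(plotly)

set.seed(1)
# Hypothetical stand-in for the violin_data.csv example, with three levels of sex
toy <- data.frame(
  day = sample(c("Thur", "Fri", "Sat", "Sun"), 300, replace = TRUE),
  total_bill = rgamma(300, shape = 4, scale = 5),
  sex = factor(sample(c("Female", "Male", "undefined"), 300, replace = TRUE))
)

p <- plot_ly(toy, x = ~day, y = ~total_bill,
             split = ~sex, color = ~sex, type = "violin") %>%
  layout(violinmode = "group") # place the violins side by side within each day
p
```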

Reordering a factor based on a summary statistic of a subset of the data

I'm trying to reorder a factor from a subset of my data frame, defined by another factor using forcats::fct_reorder().
Consider the following data frame df:
set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", 'C'), each = 200)),
fct2 = as.factor(rep(c("j", "k"), each = 100)),
val = c(rnorm(100, 2), # A - j
rnorm(100, 1), # A - k
rnorm(100, 1), # B - j
rnorm(100, 6), # B - k
rnorm(100, 8), # C - j
rnorm(100, 4)))# C - k
I want to plot facetted group densities using the ggridges package. For example:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)
I would now like to order fct1 by the median (default in fct_reorder()) of the values of the upper density in each facet, i.e. where fct2 == "k". The goal in this example would therefore be that the facets appear in the order B - C - A.
This seems very similar to this question here, with the difference that I do not want to summarize the data first because I need the raw data to plot the densities.
I've tried to adapt the code in the answer of the linked question:
df <- df %>% mutate(fct1 = forcats::fct_reorder(fct1, filter(., fct2 == 'k') %>% pull(val)))
But it returns the following error:
Error in forcats::fct_reorder(fct1, filter(., fct2 == "k") %>% pull(val)) :
length(f) == length(.x) is not TRUE
It's obvious that they are not the same length, but I don't quite get why this error is necessary. My guess is that it's generally not guaranteed that all levels of fct1 are present in the subset, which would certainly be problematic. Yet, this isn't the case in my example. Is there a way to work around this error or am I doing something wrong more generally?
I'm aware that I can work around this with a couple of lines of extra code, e.g. create a helper variable from the subsetted data, reorder that, and then apply the resulting level order to my factor in the original data set. I would still like a prettier solution, because I regularly face that very same task.
You can do this with a little helper function:
f <- function(i) -median(df$val[df$fct2 == "k" & df$fct1 == df$fct1[i]])
Which allows you to reorder like this:
df$fct1 <- forcats::fct_reorder(df$fct1, sapply(seq(nrow(df)), f))
Which gives you this plot:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)
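If calling the helper once per row feels heavy, the same ordering can be sketched in base R by computing the medians of the fct2 == "k" subset once and using them to set the level order. This is an alternative to the fct_reorder() call, not part of the original answer; k_medians is a name introduced here:

```r
set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", "C"), each = 200)),
                 fct2 = as.factor(rep(rep(c("j", "k"), each = 100), times = 3)),
                 val = c(rnorm(100, 2), rnorm(100, 1),  # A - j, A - k
                         rnorm(100, 1), rnorm(100, 6),  # B - j, B - k
                         rnorm(100, 8), rnorm(100, 4))) # C - j, C - k

# Median of val per level of fct1, using only the rows where fct2 == "k"
k_medians <- with(df[df$fct2 == "k", ], tapply(val, fct1, median))

# Largest median first. Note that sort() drops NA, so a level with no "k"
# rows would be lost here and would need separate handling.
df$fct1 <- factor(df$fct1, levels = names(sort(k_medians, decreasing = TRUE)))
levels(df$fct1)
```

The raw data stay untouched, so the reordered df can be passed straight to the ggridges plot.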

Add multiple ggplot2 geom_segment() based on mean() and sd() data

I have a data frame mydataAll with columns DESWC, journal, and highlight. To calculate the average and standard deviation of DESWC for each journal, I do
avg <- aggregate(DESWC ~ journal, data = mydataAll, mean)
stddev <- aggregate(DESWC ~ journal, data = mydataAll, sd)
Now I plot a horizontal stripchart with the values of DESWC along the x-axis and each journal along the y-axis. But for each journal, I want to indicate the standard deviation and average with a simple line. Here is my current code and the results.
stripchart2 <-
ggplot(data=mydataAll, aes(x=mydataAll$DESWC, y=mydataAll$journal, color=highlight)) +
geom_segment(aes(x=avg[1,2] - stddev[1,2],
y = avg[1,1],
xend=avg[1,2] + stddev[1,2],
yend = avg[1,1]), color="gray78") +
geom_segment(aes(x=avg[2,2] - stddev[2,2],
y = avg[2,1],
xend=avg[2,2] + stddev[2,2],
yend = avg[2,1]), color="gray78") +
geom_segment(aes(x=avg[3,2] - stddev[3,2],
y = avg[3,1],
xend=avg[3,2] + stddev[3,2],
yend = avg[3,1]), color="gray78") +
geom_point(size=3, aes(alpha=highlight)) +
scale_x_continuous(limit=x_axis_range) +
scale_y_discrete(limits=mydataAll$journal) +
scale_alpha_discrete(range = c(1.0, 0.5), guide='none')
show(stripchart2)
See the three horizontal geom_segments at the bottom of the image indicating the spread? I want to do that for all journals, but without handcrafting each one. I tried using the solution from this question, but when I put everything in a loop and remove the aes(), it gives me an error that says:
Error in x - from[1] : non-numeric argument to binary operator
Can anyone help me condense the geom_segment() statements?
I generated some dummy data to demonstrate. First, we use aggregate like you have done, then we combine those results to create a data.frame in which we create upper and lower columns. Then, we pass these to the geom_segment specifying our new dataset. Also, I specify x as the character variable and y as the numeric variable, and then use coord_flip():
library(ggplot2)
set.seed(123)
df <- data.frame(lets = sample(letters[1:8], 100, replace = T),
vals = rnorm(100),
stringsAsFactors = F)
means <- aggregate(vals~lets, data = df, FUN = mean)
sds <- aggregate(vals~lets, data = df, FUN = sd)
df2 <- data.frame(means, sds)
df2$upper = df2$vals + df2$vals.1
df2$lower = df2$vals - df2$vals.1
ggplot(df, aes(x = lets, y = vals))+geom_point()+
geom_segment(data = df2, aes(x = lets, xend = lets, y = lower, yend = upper))+
coord_flip()+theme_bw()
Here, the lets column would resemble your character variable.
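The same idea can be sketched with the column names from the question and without coord_flip(), by mapping the segments horizontally. The data below are simulated stand-ins for mydataAll (the journal names and values are made up), and spread is a name introduced here:

```r
library(ggplot2)

set.seed(123)
# Hypothetical stand-in for the asker's data
mydataAll <- data.frame(
  journal = sample(paste("Journal", LETTERS[1:5]), 100, replace = TRUE),
  DESWC = rnorm(100, mean = 5000, sd = 1000)
)

avg <- aggregate(DESWC ~ journal, data = mydataAll, FUN = mean)
stddev <- aggregate(DESWC ~ journal, data = mydataAll, FUN = sd)

# One row per journal with mean, sd, and the segment end points
spread <- merge(avg, stddev, by = "journal", suffixes = c("_mean", "_sd"))
spread$lower <- spread$DESWC_mean - spread$DESWC_sd
spread$upper <- spread$DESWC_mean + spread$DESWC_sd

# A single geom_segment() layer draws one segment per row of spread
p <- ggplot(mydataAll, aes(x = DESWC, y = journal)) +
  geom_segment(data = spread,
               aes(x = lower, xend = upper, y = journal, yend = journal),
               color = "gray78", inherit.aes = FALSE) +
  geom_point(size = 3)
p
```

Because the summary lives in its own data frame, no loop over journals is needed.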

Add legend inside each panel with Lattice in R

I want to make a scatter plot with connecting lines for different groups and different individuals. I make panels conditioned on my group variable and groups conditioned on my individual variable. Now I would like to add a legend inside each panel (see the code below). In the plots, I would like the legend of individuals for GRP==1 in the first panel, for GRP==2 in the second panel, and so on. Each legend should be located in the upper left corner of the panel it belongs to. How should I code this?
library(lattice)
mydata <- data.frame(ID = rep(1: 20, each = 10),
GRP = rep(1: 4, each = 50),
x = rep(0: 9, 20))
mydata$y <- 1.2 * mydata$GRP * mydata$x +
rnorm(nrow(mydata), sd = mydata$GRP)
xyplot(y~ x | factor(GRP), data = mydata,
groups = ID,
type = "b",
as.table = T,
layout = c(2, 2),
panel = panel.superpose,
panel.groups = function (x, y, ...) {
panel.xyplot(x, y, ...)
}
)
Try something like this. Note that subset is applied inside the data argument of xyplot. This is on purpose: if you pass subset as an xyplot argument instead, every plot shows all 20 labels.
library(lattice)
mydata <- data.frame(ID = rep(1:20, each = 10), GRP = rep(1:4, each = 50), x = rep(0:9, 20))
mydata$y <- 1.2 * mydata$GRP * mydata$x + rnorm(nrow(mydata), sd = mydata$GRP)
i <- 1; j <- 1
for (grp in 1:4) {
  a <- xyplot(y ~ x | factor(GRP), data = subset(mydata, GRP == grp),
              groups = factor(ID),
              type = "b",
              auto.key = list(columns = 4, space = "inside"))
  print(a, split = c(i, j, 2, 2), more = TRUE)
  i <- i + 1; if (i > 2) { i <- 1; j <- j + 1 } # basically, tell the plots which quadrant to go in
}
