How do I randomly classify my coordinate data in R

I have written code that generates x and y data, and I am able to plot it.
# Number of observations
n <- 250
# x randomly drawn from a continuous uniform distribution with bounds [0, 1]
x <- runif(n = n, min = 0, max = 1)
# Error term from Normal distribution
error <- rnorm(n = n, mean = 0, sd = 2)
beta_0 <- 1
beta_1 <- -1
y <- beta_0*x + (beta_1*x - error)
library(tibble)
df <- tibble(x = x, y = y)
df
library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) + geom_point() +
  labs(title = "y = f(x)")
I get a graph like this:
I also get a data table of the coordinates, like this:
x      y
0.139  -2.87
0.981  1.48
I would now like to randomly classify my data, such that my table looks like:
x      y      Group1  Group2
0.139  -2.87  -1      1
0.981  1.48   1       -1
Here 1 means the point is a member of the group and -1 means it is not. On the graph, this would appear as blue dots for Group1 members and red dots for Group2 members.
Any help with this would be greatly appreciated.
Thank you.

To do it the way you suggested (with one column for group 1 and one column for group 2), you could do:
library(dplyr)
library(ggplot2)
df %>%
  mutate(group1 = sample(c(-1, 1), n(), replace = TRUE),
         group2 = -group1) %>%
  ggplot(aes(x = x, y = y, color = factor(group1))) +
  geom_point() +
  # factor levels sort as "-1", "1"; group1 == -1 means the point belongs to Group 2
  scale_color_brewer('group', palette = 'Set1',
                     labels = c('Group 2', 'Group 1')) +
  labs(title = "y = f(x)")
However, it seems a bit redundant to me having two mutually exclusive binary columns. You could just have a single column called group which is either group 1 or group 2:
df %>%
  mutate(group = sample(c('Group 1', 'Group 2'), n(), replace = TRUE)) %>%
  ggplot(aes(x = x, y = y, color = group)) +
  geom_point() +
  scale_color_brewer(palette = 'Set1') +
  labs(title = "y = f(x)")
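Since sample() draws a fresh assignment on every run, it can help to fix the seed and check the result before plotting. The following base-R sketch assumes the -1/1 coding from the question; `n_points` and the seed value are hypothetical and chosen only for illustration:

```r
set.seed(123)   # arbitrary seed, only for reproducibility
n_points <- 10  # hypothetical sample size for illustration

# random membership in Group 1; Group 2 is the exact complement
group1 <- sample(c(-1, 1), n_points, replace = TRUE)
group2 <- -group1

# a single label column that carries the same information
group <- ifelse(group1 == 1, "Group 1", "Group 2")

assignment <- data.frame(group1, group2, group)
assignment
```

Every row has exactly one 1 and one -1 across group1 and group2, which is why the single `group` column loses nothing.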

Related

Match boxplot and labels colors according Tukey's significance letters in ggplot

I am trying to match boxplot and label colors according to Tukey's significance letters in ggplot2 or ggboxplot.
I don't know how to do it automatically or in a more elegant way, using terrain.colors for example.
I have done it manually only to show my desired plot, with boxplots and labels in the same colors as the Tukey's significance letters:
What I mean is that the "a", "b", and so on boxplots should share a color, for both the boxes and the letters. Something like this, but using ggplot: https://r-graph-gallery.com/84-tukey-test_files/figure-html/unnamed-chunk-3-1.png
Your help will be greatly appreciated.
Here is the script based on the accepted answer of this post: Is there a function to add AOV post-hoc testing results to ggplot2 boxplot?
library(plyr)
library(ggplot2)
library(multcompView)
set.seed(0)
lev <- gl(3, 10)
y <- c(rnorm(10), rnorm(10) + 0.1, rnorm(10) + 3)
d <- data.frame(lev=lev, y=y)
a <- aov(y~lev, data=d)
tHSD <- TukeyHSD(a, ordered = FALSE, conf.level = 0.95)
generate_label_df <- function(HSD, flev){
# Extract labels and factor levels from Tukey post-hoc
Tukey.levels <- HSD[[flev]][,4]
Tukey.labels <- multcompLetters(Tukey.levels)['Letters']
plot.labels <- names(Tukey.labels[['Letters']])
# Get highest quantile for Tukey's 5 number summary and add a bit of space to buffer between
# upper quantile and label placement
boxplot.df <- ddply(d, flev, function (x) max(fivenum(x$y)) + 0.2)
# Create a data frame out of the factor levels and Tukey's homogenous group letters
plot.levels <- data.frame(plot.labels, labels = Tukey.labels[['Letters']],
stringsAsFactors = FALSE)
# Merge it with the labels
labels.df <- merge(plot.levels, boxplot.df, by.x = 'plot.labels', by.y = flev, sort = FALSE)
return(labels.df)
}
#Generate ggplot
ggplot(d, aes(x=lev, y=y)) + geom_boxplot(fill = c("green", "green", "orange")) +
geom_text(data = generate_label_df(tHSD, 'lev'), colour = c("green","orange", "green"), aes(x = plot.labels, y = V1, label = labels )) +
scale_colour_manual(values=c("green", "green", "orange"))
Does this work for you? Find my comments below.
library(plyr)
library(ggplot2)
library(multcompView)
set.seed(0)
lev <- gl(3, 10)
y <- c(rnorm(10), rnorm(10) + 0.1, rnorm(10) + 3)
d <- data.frame(lev=lev, y=y)
a <- aov(y~lev, data=d)
tHSD <- TukeyHSD(a, ordered = FALSE, conf.level = 0.95)
generate_label_df <- function(HSD, flev){
# Extract labels and factor levels from Tukey post-hoc
Tukey.levels <- HSD[[flev]][,4]
Tukey.labels <- multcompLetters(Tukey.levels)['Letters']
plot.labels <- names(Tukey.labels[['Letters']])
# Get highest quantile for Tukey's 5 number summary and add a bit of space to buffer between
# upper quantile and label placement
boxplot.df <- ddply(d, flev, function (x) max(fivenum(x$y)) + 0.2)
# Create a data frame out of the factor levels and Tukey's homogenous group letters
plot.levels <- data.frame(plot.labels, labels = Tukey.labels[['Letters']],
stringsAsFactors = FALSE)
# Merge it with the labels
labels.df <- merge(plot.levels, boxplot.df, by.x = 'plot.labels', by.y = flev, sort = FALSE)
return(labels.df)
}
#############################
### new stuff starts here ###
#############################
label_df <- generate_label_df(tHSD, 'lev')
label_df$lev <- label_df$plot.labels
#Generate ggplot
lev_cols <- c("1" = "green", "2" = "green", "3" = "orange")
ggplot(d, aes(x = lev, y = y)) +
geom_boxplot(aes(fill = lev)) +
geom_text(
data = label_df,
aes(
x = plot.labels,
y = V1,
label = labels,
color = lev
)
) +
scale_color_manual(values = lev_cols) +
scale_fill_manual(values = lev_cols)
Created on 2022-10-14 with reprex v2.0.2
As you can see, you can tell different geom_*() layers inside their aes() (!) that they should be colored according to, e.g., the lev column. After doing that, you define which level of lev gets which color via a named vector c("Levelname1" = "Colorname1", ...), as we do here with lev_cols, and pass it to scale_color_manual().
In this specific example, it was a bit more complex, because for geom_boxplot() we actually want different fill, while for geom_text() we want different color and thus we need both scale_color_manual() and scale_fill_manual(). Furthermore, the data you supply to the geom_text() does not have a column named lev, but I actually just made sure it does to keep it simple.
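If the colors should follow the Tukey letters automatically instead of being typed by hand, the named vector can be derived from the letters. This is a minimal sketch, not part of the original answer: `letters_by_level` is a hypothetical stand-in for the level-to-letter mapping held in label_df, and terrain.colors is used because the question mentions it.

```r
# hypothetical stand-in for label_df: factor level -> Tukey letter
letters_by_level <- c("1" = "a", "2" = "a", "3" = "b")

# one color per distinct letter
distinct_letters <- unique(letters_by_level)
letter_cols <- setNames(terrain.colors(length(distinct_letters)), distinct_letters)

# look up each level's color via its letter, but name the result by level
lev_cols <- setNames(letter_cols[letters_by_level], names(letters_by_level))
lev_cols
```

Levels that share a letter end up with the same color, so `lev_cols` could be passed to both scale_fill_manual() and scale_color_manual(). Note that a combined letter such as "ab" would be treated as its own group and get its own color.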
Bonus
FYI, you may also find the following alternative approach to get the compact letters display, as well as the alternative way to plot the results interesting. There's more on this here.
# extra -------------------------------------------------------------------
library(tidyverse)
library(emmeans)
library(multcomp)
library(multcompView)
set.seed(0)
lev <- gl(3, 10)
y <- c(rnorm(10), rnorm(10) + 0.1, rnorm(10) + 3)
d <- data.frame(lev = lev, y = y)
# This also gets you the letters ------------------------------------------
# fit model
model <- lm(y ~ lev, data = d)
# get (adjusted) y means per group
model_means <- emmeans(object = model,
specs = "lev")
# add letters to each mean
model_means_cld <- cld(object = model_means,
adjust = "Tukey",
Letters = letters,
alpha = 0.05)
#> Note: adjust = "tukey" was changed to "sidak"
#> because "tukey" is only appropriate for one set of pairwise comparisons
# show output
model_means_cld
#> lev emmean SE df lower.CL upper.CL .group
#> 2 -0.262 0.283 27 -0.982 0.457 a
#> 1 0.359 0.283 27 -0.361 1.079 a
#> 3 3.069 0.283 27 2.350 3.789 b
#>
#> Confidence level used: 0.95
#> Conf-level adjustment: sidak method for 3 estimates
#> P value adjustment: tukey method for comparing a family of 3 estimates
#> significance level used: alpha = 0.05
#> NOTE: If two or more means share the same grouping letter,
#> then we cannot show them to be different.
#> But we also did not show them to be the same.
# You may also like this plot ---------------------------------------------
ggplot() +
# general layout
theme_classic() +
theme(plot.caption = ggtext::element_textbox_simple()) +
# black data points
geom_point(
data = d,
aes(y = y, x = lev),
shape = 16,
alpha = 0.5,
position = position_nudge(x = -0.2)
) +
# black boxplot
geom_boxplot(
data = d,
aes(y = y, x = lev),
width = 0.05,
outlier.shape = NA,
position = position_nudge(x = -0.1)
) +
# red mean value
geom_point(
data = model_means_cld,
aes(y = emmean, x = lev),
size = 2,
color = "red"
) +
# red mean errorbar
geom_errorbar(
data = model_means_cld,
aes(ymin = lower.CL, ymax = upper.CL, x = lev),
width = 0.05,
color = "red"
) +
# red letters
geom_text(
data = model_means_cld,
aes(
y = emmean,
x = lev,
label = str_trim(.group)
),
position = position_nudge(x = 0.1),
hjust = 0,
color = "red"
) +
# caption
labs(
caption = "Black dots represent raw data. Red dots and error bars represent (estimated marginal) means ± 95% confidence interval per group. Means not sharing any letter are significantly different by the Tukey-test at the 5% level of significance."
)
Created on 2022-10-14 with reprex v2.0.2

How to put plotmath labels in ggplot facets

We often want individual regression equations in ggplot facets. The best way to do this is build the labels in a dataframe and then add them manually. But what if the labels contain plotmath, e.g., superscripts?
Here is a way to do it. The plotmath expression is converted to a string and then parsed by ggplot. The test_eqn function is taken from another Stack Overflow post; I'll link it when I find it again. Sorry about that.
library(ggplot2)
library(dplyr)
test_eqn <- function(y, x){
m <- lm(log(y) ~ log(x)) # fit y = a * x ^ b in log space
p <- exp(predict(m)) # model prediction of y
eq <- substitute(expression(Y==a~X^~b),
list(
a = format(unname(exp(coef(m)[1])), digits = 3),
b = format(unname(coef(m)[2]), digits = 3)
))
list(eq = as.character(eq)[2], pred = p)
}
set.seed(123)
x <- runif(20)
y <- runif(20)
test_eqn(x,y)$eq
#> [1] "Y == \"0.57\" ~ X^~\"0.413\""
data <- data.frame(x = x,
y = y,
f = sample(c("A","B"), 20, replace = TRUE)) %>%
group_by(f) %>%
mutate(
label = test_eqn(y,x)$eq, # add label
labelx = mean(x),
labely = mean(y),
pred = test_eqn(y,x)$pred # add prediction
)
# plot fits (use slice(1) to avoid multiple copies of labels)
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_line(aes(x = x, y = pred), colour = "red") +
geom_text(data = slice(data, 1), aes(x = labelx, y = labely, label = label), parse = TRUE) +
facet_wrap("f")
Created on 2021-10-20 by the reprex package (v2.0.1)
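With parse = TRUE, geom_text() treats each label string as R code and turns it into a plotmath expression. Checking a label with base R's parse() before plotting catches syntax errors early. This check is a sketch added for illustration and is not something the original post does:

```r
# label string as printed by test_eqn() above
label <- "Y == \"0.57\" ~ X^~\"0.413\""

# character -> expression; an invalid label would raise an error here
parsed <- parse(text = label)

is.expression(parsed)
length(parsed)
```

A label that fails here would also fail inside geom_text(), so the check localises the problem to the string rather than the plot.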

Incorrect colour gradient when using cowplot to patch together plots

Say I have a data set with x and y values that are grouped according to two variables: grp is a, b, or c, while subgrp is D, E, or F.
a has y values in [0, 1]
b has y values in [10, 11]
c has y values in [100, 101].
I'd like to plot y against x with the colour of the point defined by y for all grp and subgrp combinations. Since each grp has very different y values, I can't just use facet_grid alone, as the colour scales would be useless. So, I plot each grp with its own scale then patch them together with plot_grid from cowplot. I also want to use a three-point gradient specified by scale_colour_gradient2. My code looks like this:
# Set RNG seed
set.seed(42)
# Toy data frame
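# stringsAsFactors = TRUE keeps grp a factor, so levels(df$grp) also works in R >= 4.0
df <- data.frame(x = runif(270), y = runif(270) + rep(c(0, 10, 100), each = 90),
                 grp = rep(letters[1:3], each = 90), subgrp = rep(LETTERS[4:6], 90),
                 stringsAsFactors = TRUE)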
head(df)
#> x y grp subgrp
#> 1 0.9148060 0.1362958 a D
#> 2 0.9370754 0.7853494 a E
#> 3 0.2861395 0.4533034 a F
#> 4 0.8304476 0.1357424 a D
#> 5 0.6417455 0.8852210 a E
#> 6 0.5190959 0.3367135 a F
# Load libraries
library(cowplot)
library(ggplot2)
library(dplyr)
# Plotting list
g_list <- list()
# Loop through groups 'grp'
for(i in levels(df$grp)){
# Subset the data
df_subset <- df %>% filter(grp == i)
# Calculate the midpoint
mp <- mean(df_subset$y)
# Print midpoint
message("Midpoint: ", mp)
g <- ggplot(df_subset) + geom_point(aes(x = x, y = y, colour = y))
g <- g + facet_grid(. ~ subgrp) + ggtitle(i)
g <- g + scale_colour_gradient2(low = "blue", high = "red", mid = "yellow", midpoint = mp)
g_list[[i]] <- g
}
#> Midpoint: 0.460748857570191
#> Midpoint: 10.4696476330981
#> Midpoint: 100.471083269571
plot_grid(plotlist = g_list, ncol = 1)
Created on 2019-04-17 by the reprex package (v0.2.1)
In this code, I specify the midpoint of the colour gradient as the mean of y for each grp. I print this and verify that it is correct. It is.
My question: why are my colour scales incorrect for the first two plots?
It appears the same range is applied to each grp despite subsetting the data. If I replace for(i in levels(df$grp)){ with for(i in levels(df$grp)[1]){, the colour scale is correct for the single plot that is produced.
Update
Okay, this is weird. Inserting ggplot_build(g)$data[[1]]$colour immediately before g_list[[i]] <- g solves the problem. But, why?
Long story short, you're creating unevaluated promises and then evaluating them at a time when the original data is gone. This problem is generally avoided if you use a functional programming style rather than procedural code, i.e., define a function that does the work and then use an apply function for the loop.
set.seed(42)
# Toy data frame
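# stringsAsFactors = TRUE keeps grp a factor, so levels(df$grp) also works in R >= 4.0
df <- data.frame(x = runif(270), y = runif(270) + rep(c(0, 10, 100), each = 90),
                 grp = rep(letters[1:3], each = 90), subgrp = rep(LETTERS[4:6], 90),
                 stringsAsFactors = TRUE)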
library(cowplot)
library(ggplot2)
library(dplyr)
# Loop through groups 'grp'
g_list <- lapply(
levels(df$grp),
function(i) {
# Subset the data
df_subset <- df %>% filter(grp == i)
# Calculate the midpoint
mp <- mean(df_subset$y)
# Print midpoint
message("Midpoint: ", mp)
g <- ggplot(df_subset) + geom_point(aes(x = x, y = y, colour = y))
g <- g + facet_grid(. ~ subgrp) + ggtitle(i)
g <- g + scale_colour_gradient2(low = "blue", high = "red", mid = "yellow", midpoint = mp)
g
}
)
#> Midpoint: 0.460748857570191
#> Midpoint: 10.4696476330981
#> Midpoint: 100.471083269571
plot_grid(plotlist = g_list, ncol = 1)
Created on 2019-04-17 by the reprex package (v0.2.1)
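The promise behaviour can be reproduced without ggplot2. The sketch below uses a hypothetical helper, `make_centre`, which plays the role of a scale that stores its midpoint argument without evaluating it straight away; it only illustrates the mechanism and is not how ggplot2 is implemented internally.

```r
# hypothetical helper: returns a function that centres x on `midpoint`
make_centre <- function(midpoint) {
  function(x) x - midpoint  # `midpoint` stays an unevaluated promise until first use
}

# for loop: every promise refers to the same global variable `m`
loop_fns <- list()
for (m in c(1, 10, 100)) {
  loop_fns[[length(loop_fns) + 1]] <- make_centre(m)
}
loop_result <- sapply(loop_fns, function(f) f(0))
loop_result   # -100 -100 -100: all three see the final value of m

# lapply: each function call gets its own argument value
apply_fns <- lapply(c(1, 10, 100), make_centre)
apply_result <- sapply(apply_fns, function(f) f(0))
apply_result  # -1 -10 -100
```

Calling force(midpoint) as the first line of `make_centre` would also fix the loop version, which is consistent with the observation in the update that building the plot inside the loop makes the problem disappear.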

Plot one data frame column against all other columns using ggplots and showing densities in R

I have a data frame with 20 columns, and I want to plot one specific column (called BB) against each single column in the data frame. The plots I need are probability density plots, and I’m using the following code to generate one plot (plotting columns BB vs. AA as an example):
library(data.table) # for fread
library(ggplot2)
library(viridis)    # for scale_color_viridis
mydata = as.data.frame(fread("filename.txt")) #read my data as data frame
#function to calculate density
get_density <- function(x, y, n = 100) {
dens <- MASS::kde2d(x = x, y = y, n = n)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
#define the x and y of the plot; x = column called AA; y = column called BB
xy1 <- data.frame(
x = mydata$AA,
y = mydata$BB
)
#call function get_density to calculate density for the defined x and y
xy1$density <- get_density(xy1$x, xy1$y)
#Plot
ggplot(xy1) + geom_point(aes(x, y, color = density), size = 3, pch = 20) + scale_color_viridis() +
labs(title = "BB vs. AA") +
scale_x_continuous(name="AA") +
scale_y_continuous(name="BB")
Would appreciate it if someone can suggest a method to produce multiple plot of BB against every other column, using the above density function and ggplot command. I tried adding a loop, but found it too complicated especially when defining the x and y to be plotted or calling the density function.
Since you don't provide sample data, I'll demo on mtcars. We convert the data to long format, calculate the densities, and make a faceted plot. We plot the mpg column against all others.
library(dplyr)
library(tidyr)
mtlong = gather(mtcars, key = "var", value = "value", -mpg) %>%
group_by(var) %>%
mutate(density = get_density(value, mpg))
ggplot(mtlong, aes(x = value, y = mpg, color = density)) +
geom_point(pch = 20, size = 3) +
labs(x = "") +
facet_wrap(~ var, scales = "free")
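gather() still works, but the tidyr documentation now marks it as superseded by pivot_longer(). A sketch of the same reshaping step with pivot_longer(), assuming tidyr 1.0.0 or later is installed:

```r
library(tidyr)

# every column except mpg becomes a (var, value) pair; mpg is repeated alongside
mtlong <- pivot_longer(mtcars, cols = -mpg, names_to = "var", values_to = "value")

dim(mtlong)  # 32 cars x 10 remaining columns = 320 rows, 3 columns
```

The grouped mutate() and the faceted plot above should work unchanged on this result, since the column names match.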

Plot Grouped bar graph with calculated standard deviation in ggplot

I feel like this should be really easy to do, but I'm having a really hard time figuring this out.
I have a data frame
type <- c("a","b","c","d","e")
x <- rnorm(5)
y <- rnorm(5)
z <- rnorm(5)
xsd <- sd(x)
ysd <- sd(y)
zsd <- sd(z)
df <- data.frame(type, x,y,z,xsd,ysd,zsd)
df
type x y z xsd ysd zsd
1 a -1.16788106 0.2260430 -1.16788106 0.8182508 0.7321015 0.9016335
2 b -0.09955193 -0.6647980 -0.09955193 0.8182508 0.7321015 0.9016335
3 c -0.87901053 -0.4269936 -0.87901053 0.8182508 0.7321015 0.9016335
4 d -0.87861339 -1.3669793 -0.87861339 0.8182508 0.7321015 0.9016335
5 e 0.84350228 0.4702580 0.84350228 0.8182508 0.7321015 0.9016335
and I need a grouped bar graph of the mean of x, y, and z by type with error bars showing the standard deviation for each variable. The standard deviations are in separate columns: xsd, ysd, and zsd.
I need to plot the mean in the y axis, type grouping the x, y, z variables in the x axis.
I tried using gather(), to rearrange the data, but I'm not having any success...
Let ggplot2 do the calculations for you:
install.packages("Hmisc") # for mean_sdl
library(tidyverse)
type <- c("a","b","c","d","e")
x <- rnorm(5, 10, 5)
y <- rnorm(5, 8, 3)
z <- rnorm(5, 2, 4)
df <- data.frame(type,x,y,z)
df_long <- df %>%
gather(variable, value, x:z)
ggplot(df_long, aes(x = variable, y = value, fill = variable)) +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = "mean", geom = "col") +
  stat_summary(fun.data = mean_sdl, geom = "errorbar", width = .5, fun.args = list(mult = 1))
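mean_sdl with mult = 1 summarises each group as the mean plus and minus one standard deviation. If installing Hmisc is not an option, stat_summary() also accepts any function that returns a data frame with columns y, ymin, and ymax. The helper below, `mean_sd`, is a hypothetical name and a sketch of that idea:

```r
# hypothetical helper: mean plus/minus one standard deviation
mean_sd <- function(v) {
  m <- mean(v)
  s <- sd(v)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

res <- mean_sd(c(1, 2, 3))
res  # y = 2, ymin = 1, ymax = 3
```

It could stand in for mean_sdl in the call above as stat_summary(fun.data = mean_sd, geom = "errorbar", width = .5).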
This example should help:
type <- c("a","b","c","d","e")
x <- rnorm(50,20, 5)
y <- rnorm(50, 25,1)
z <- rnorm(50, 40, 1)
df <- data.frame(type, x,y,z)
df
library(tidyverse)
df %>%
gather(x,value,-type) %>%
group_by(type, x) %>%
summarise(MEAN = mean(value),
SD = sd(value)) %>%
ggplot(aes(x, MEAN, fill=type))+
geom_bar(stat="identity", position = "dodge")+
geom_errorbar(aes(ymin=MEAN-SD, ymax=MEAN+SD), position = "dodge")
