Adding the median value to every box (not ggplot) - r

I have a boxplot from the code below and I want to add the median values to it.
boxplot(ndvi_pct_sep~edge_direction, data= data_sample, subset = edge_direction %in% c(64,4, 1,16),ylab="NDVI2028-2016", xlab="Forest edge direction",names=c("north", "south", "east", "west"))
I want to add the median value to each box. Any idea how to do it?

This will likely involve using legends. Since I don't have your data I can't make it perfect, but the code below, which uses the ToothGrowth data shipped with R, should get you started. I show both a base R and a ggplot example (I know you said no ggplot, but others may use it).
# Load libraries
library(dplyr); library(ggplot2)
# get median data
mediandata <- ToothGrowth %>% group_by(dose) %>% summarise(median = median(len, na.rm = TRUE))
l <- unname(unlist(mediandata))
tg <- ToothGrowth # for convenience
tg$dose <- as.factor(tg$dose)
### Base R approach
boxplot(len ~ dose, data = tg,
main = "Guinea Pigs' Tooth Growth",
xlab = "Vitamin C dose mg",
ylab = "tooth length", col = "red")
for (i in 1:3){
legend(i-0.65,l[i+3]+5, legend = paste0("Median: ",l[i+3]), bty = "n")
}
### ggplot approach
ggplot(data = tg, aes(dose, len)) +
theme_classic() + theme(legend.position = "none") +
geom_boxplot()+
annotate("text",
x = c(1,2,3),
y = l[4:6]+1, # shift so you can read it
label = l[4:6])
(Figures omitted: base R output and ggplot output.)

Here's a straightforward solution with text() and without a for loop:
Toy data:
set.seed(12)
df <- data.frame(
var1 = sample(LETTERS[1:4], 100, replace = TRUE),
var2 = rnorm(100)
)
Calculate the medians:
library(dplyr)
med <- df %>%
group_by(var1) %>%
summarise(medians = median(var2)) %>%
pull(medians)
Alternatively, in base R:
bx <- boxplot(df$var2 ~ df$var1)
med <- bx$stats[3,1:4]
Boxplot:
boxplot(df$var2 ~ df$var1)
Annotate boxplots:
text(1:4, med, round(med,3), pos = 3, cex = 0.6)

You can do
b <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray", boxwex=.2)
s <- b$stats
text(1:ncol(s)+.4, s[3,], round(s[3,],1), col="red")
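For a formula-and-data call like the one in the question, the same idea can be wrapped in a small helper. This is only a sketch: boxplot_with_medians is a made-up name, and it uses the built-in InsectSprays data because the original data_sample is not available.

```r
# Hypothetical helper: draw a boxplot and label each box with its median.
boxplot_with_medians <- function(formula, data, digits = 1, ...) {
  b <- boxplot(formula, data = data, ...)
  med <- b$stats[3, ]                # row 3 of $stats holds the medians
  text(seq_along(med), med, labels = round(med, digits),
       pos = 3, cex = 0.7)           # pos = 3 places the label just above the median line
  invisible(setNames(med, b$names))  # return the medians, named by group
}

med <- boxplot_with_medians(count ~ spray, data = InsectSprays,
                            col = "lightgray")
```

Extra arguments such as subset, names, xlab and ylab are passed through to boxplot() via the dots, so the call from the question should fit the same pattern.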

Related

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half of the points are colored in either color, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those wondering how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
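The half-and-half coloring in the question arises because qqplot() sorts both vectors before plotting, while col is applied in the original row order. A base R sketch that reorders the colors to match is shown here; the data are simulated, the names values and group are stand-ins for data$values and data$group, and the shape parameters are simply assumed rather than estimated:

```r
set.seed(1)
values <- rbeta(200, 2, 5)                        # stand-in for data$values
group  <- factor(sample(c("Group1", "Group2"), 200, replace = TRUE))

# Theoretical quantiles; shape parameters are assumed known for this sketch
theo <- qbeta(ppoints(length(values)), 2, 5)
ord  <- order(values)                             # the order in which the data get sorted

pal <- c("steelblue", "tomato")
plot(theo, values[ord],
     col = pal[group[ord]],                       # colors follow the sorted points
     xlab = "Quantiles of Beta Distribution",
     ylab = "Quantiles of data")
legend("topleft", legend = levels(group), col = pal, pch = 1)
```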

R: Automatically Producing Histograms

I am using the R programming language. I created the following data set for this example:
var_1 <- rnorm(1000,10,10)
var_2 <- rnorm(1000, 5, 5)
var_3 <- rnorm(1000, 6,18)
favorite_food <- c("pizza","ice cream", "sushi", "carrots", "onions", "broccoli", "spinach", "artichoke", "lima beans", "asparagus", "eggplant", "lettuce", "cucumbers")
favorite_food <- sample(favorite_food, 1000, replace=TRUE, prob=c(0.5, 0.45, 0.04, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001))
response <- c("a","b")
response <- sample(response, 1000, replace=TRUE, prob=c(0.3, 0.7))
data = data.frame( var_1, var_2, var_3, favorite_food, response)
data$favorite_food = as.factor(data$favorite_food)
data$response = as.factor(data$response)
From here, I want to make histograms for the two categorical variables in this data set and put them on the same page:
# make histograms and put them on the same page
# (note: I don't know why the "par(mfrow = c(1,2))" statement is not working)
library(lattice)  # assumption: histogram() here is lattice::histogram()
par(mfrow = c(1,2))
histogram(data$response, main = "response")
histogram(data$favorite_food, main = "favorite food")
My question: Is it possible to automatically produce histograms for all categorical variables in a given data set (without manually writing the "histogram()" statement for each variable) and print them on the same page? Is it better to use the "ggplot2" library for this problem?
I can manually write the "histogram()" statement for each individual categorical variable in the data set, but I was looking for a quicker way to do this. Is it possible to do this with a "for loop"?
Thanks
A ggplot2/tidyverse solution is to pivot the factor columns into long format and then use faceting to plot them all on the same page:
(with edit to plot only factor variables)
factor_vars <- sapply(data, is.factor)
varnames <- names(data)
deselect_not_factors <- varnames[!factor_vars]
library(tidyr)
library(ggplot2)
data_long <- data %>%
pivot_longer(
cols = -all_of(deselect_not_factors),  # all_of() avoids the external-vector deprecation warning
names_to = "category",
values_to = "value"
)
ggplot(data_long) +
geom_bar(
aes(x = value)
) +
facet_wrap(~category, scales = "free")
Here's a base R alternative using barplot in a for loop:
cols <- names(data)[sapply(data, is.factor)]
#This would need some manual adjustment if number of columns increase
par(mfrow = c(1,length(cols)))
for(i in cols) {
barplot(table(data[[i]]), main = i)
}
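The manual adjustment mentioned in the comment can be avoided by letting R choose the layout. The sketch below uses n2mfrow() from grDevices for that; the toy data frame and its column names (num, f1, f2, f3) are invented for illustration.

```r
set.seed(42)
# Toy data: one numeric column and three factor columns (hypothetical names)
data <- data.frame(
  num = rnorm(100),
  f1  = factor(sample(c("a", "b"), 100, replace = TRUE)),
  f2  = factor(sample(c("x", "y", "z"), 100, replace = TRUE)),
  f3  = factor(sample(c("lo", "hi"), 100, replace = TRUE))
)

cols <- names(data)[sapply(data, is.factor)]

# n2mfrow() returns a rows-by-columns layout large enough for n plots
old_par <- par(mfrow = n2mfrow(length(cols)))
for (i in cols) {
  barplot(table(data[[i]]), main = i)
}
par(old_par)  # restore the previous layout
```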
As an alternative, you can use the DataExplorer package.
Note that histograms are for continuous variables, so for your categorical variables you want bar plots instead. This can be done as follows:
if(require(DataExplorer)==FALSE) install.packages("DataExplorer"); library(DataExplorer)
DataExplorer::plot_histogram(data) # plots histograms for continuous variables
DataExplorer::plot_bar(data) # bar plots for categorical variables
Please refer to the package manual for more details.
Here is a try using cowplot & ggplot2
library(ggplot2)
library(dplyr)
library(foreach)
library(cowplot)
list_variables <- c("response", "favorite_food")
all_plot <- foreach(current_var = c(list_variables)) %do% {
# need to do this to avoid ggplot reference to same summary data afterward.
data_summary_name <- paste0(current_var, "_summary")
eval(substitute(
{
graph_data <- data %>%
group_by(!!sym(current_var)) %>%
summarize(count = n(), .groups = "drop") %>%
mutate(share = count / sum(count))
plot <- ggplot(graph_data) +
geom_bar(mapping = aes(x = !!sym(current_var), y = share), width = 1,
fill = "#00FFFF", color = "#000000", stat = "identity") +
scale_y_continuous(labels = scales::percent) +
ggtitle(current_var) + ylab("Percent of Total") +
theme_bw()
}, list(graph_data = as.name(data_summary_name))
))
return(plot)
}
plot_grid(plotlist = all_plot, ncol = 2)
Note: For the reason I use eval & substitute, see the question "ggplot2 generate same plot for different variables in a for loop".
A similar approach to QuishSwash's, using facet_wrap, but with the data expressed as shares instead of counts:
list_variables <- c("response", "favorite_food")
# Calculate share for chosen variables defined in list_variables
# You can adjust by having some variables selection based on some condition
summary_df <- bind_rows(foreach(current_var = c(list_variables)) %do% {
data %>%
group_by(variable = !!sym(current_var)) %>%
summarize(count = n(), .groups = "drop") %>%
mutate(share = count / sum(count),
variable_name = current_var)
})
ggplot(summary_df) +
geom_bar(
aes(x = variable, y = share),
fill = "#00FFFF", color = "#000000", stat = "identity") +
facet_wrap(~variable_name, scales = "free") +
scale_y_continuous(labels = scales::percent) +
theme_bw()
Created on 2021-04-29 by the reprex package (v2.0.0)
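As for why par(mfrow = c(1,2)) had no effect in the question: assuming histogram() there is lattice's, lattice draws with grid graphics, which ignore par(mfrow). Lattice plots can instead be placed side by side through the split argument of print(). A minimal sketch with made-up stand-in data:

```r
library(lattice)

set.seed(1)
# Stand-in data (hypothetical), mimicking the two factor columns in the question
response <- factor(sample(c("a", "b"), 100, replace = TRUE))
food     <- factor(sample(c("pizza", "sushi", "carrots"), 100, replace = TRUE))

p1 <- histogram(~ response, main = "response")
p2 <- histogram(~ food, main = "favorite food")

# split = c(column, row, number of columns, number of rows)
print(p1, split = c(1, 1, 2, 1), more = TRUE)  # more = TRUE keeps the page open
print(p2, split = c(2, 1, 2, 1))
```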

Plotting exponential function returns excess lines

I am trying to fit a non-linear regression to a set of data. However, when plotted, R returns many different lines where there should only be one.
This problem is only reproducible in one set of data and I can't see any obvious difference between this data and others.
This is the code for my plot:
plot(df$logFC, df$log_pval,
xlim=c(0,11.1), ylim=c(0,11),
xlab = "logFC", ylab = "p_val")
c <- df$logFC
d <- df$log_pval
model = nls(d ~ a*exp(b*c), start = list(a = 2,b = 0.1))
lines(c, predict(model), col = "dodgerblue", lty = 2, lwd = 2)
And here is a sample of my data (df):
logFC log_pval
4.315 2.788
6.724 9.836
2.925 4.136
5.451 10.836
2.345 1.486
4.219 7.618
I have narrowed the problem down to the model, but I'm not sure where to go from there. Any help is greatly appreciated!
1) ggplot method
I tried graphing the data using ggplot2 and I think the output is more what you were expecting...
library(tibble)
library(ggplot2)
library(dplyr)
# Create dataset
df <- tibble::tribble(~logFC, ~log_pval,
4.315, 2.788,
6.724, 9.836,
2.925, 4.136,
5.451, 10.836,
2.345, 1.486,
4.219, 7.618)
# Extract some vectors
c <- df$logFC
d <- df$log_pval
# Your model
model <- nls(d ~ a*exp(b*c), start = list(a = 2,b = 0.1))
# Create second dataset for new plotting
df2 <- tibble(logFC = c, log_pval =predict(model))
# Plot output
ggplot() +
geom_line(data = df2, aes(x = logFC, y = log_pval)) +
geom_point(data = df, aes(x =logFC, y =log_pval)) +
theme_classic()
2) base method
If you want to stick to base try ordering the x variables in the data frame before plotting the lines:
plot(df$logFC, df$log_pval,
xlab = "logFC", ylab = "p_val")
df3 <- tibble(x = df$logFC, y = predict(model)) %>% dplyr::arrange(x)
lines(df3$x, df3$y, col = "dodgerblue", lty = 1, lwd = 1)
It can be achieved with ggplot. More customization can be added to the plot if needed.
library(ggplot2)
ggplot(df) + aes(x = logFC, y = log_pval) + geom_point() +
geom_line(aes(x = c, y = predict(model)))
data
df <- structure(list(logFC = c(4.315, 6.724, 2.925, 5.451, 2.345, 4.219
), log_pval = c(2.788, 9.836, 4.136, 10.836, 1.486, 7.618)), class =
"data.frame", row.names = c(NA, -6L))
c <- df$logFC
d <- df$log_pval
model = nls(d ~ a*exp(b*c), start = list(a = 2,b = 0.1))
Thanks for your help Klink and Ronak,
It turns out the issue was that the data were not ordered by x, so 'lines' connected the predicted values in the unordered row sequence, which produced a zigzag between the predicted points.
Because geom_line connects observations in order of the x variable, the issue does not appear in the ggplot versions.
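Another way to sidestep the ordering problem in base R is to predict on an evenly spaced, already sorted grid of x values, which also gives a smoother curve. This is a sketch using the six sample rows from the question; the name grid is just an illustrative choice, and the model is refit with data = df so that predict() can accept newdata.

```r
df <- data.frame(
  logFC    = c(4.315, 6.724, 2.925, 5.451, 2.345, 4.219),
  log_pval = c(2.788, 9.836, 4.136, 10.836, 1.486, 7.618)
)

# Fit with column names and data = df so predict() can be given new x values
model <- nls(log_pval ~ a * exp(b * logFC), data = df,
             start = list(a = 2, b = 0.1))

# Sorted grid of x values spanning the observed range
grid <- data.frame(logFC = seq(min(df$logFC), max(df$logFC), length.out = 100))
grid$fit <- predict(model, newdata = grid)

plot(df$logFC, df$log_pval, xlab = "logFC", ylab = "p_val")
lines(grid$logFC, grid$fit, col = "dodgerblue", lty = 2, lwd = 2)
```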

Datapoints on top of complex barplots in R-software

I have this fake data frame. I am looking for a quicker, vectorized way to add data points over a barplot of means. My solution would be hard to apply when many columns are present. My problem is that the "points" function only accepts a vector, not a matrix. Do you have a smart solution?
df <- data.frame(Test = 1:5,
Factor = factor(c("A","A","B","B","A")),  # explicit factor, so levels() works in R >= 4.0
V1=c(3.2,5.4,6.0,6.5,2),
V2=c(5,5,8.6,7,1))
str(df, list.len=ncol(df))
colnames(df)
dim(df)
df.agg <- aggregate(df[c(3,4)], by = list(Factor = df$Factor), mean)
df.agg <- df.agg[order(df.agg$Factor),]
df.agg
mat.agg <- as.matrix(df.agg[c(2,3)])
barx <- barplot(mat.agg,
beside = T,
ylim = c(0, 1.3*max(mat.agg)),
col = colors()[c(5,16)][df.agg$Factor],
legend.text = as.character(df.agg$Factor))
barx
barx <- as.vector(barx)
barx
points(
rep(barx[1], length(df[df$Factor == levels(df$Factor)[1], "V1"])),
df[df$Factor == levels(df$Factor)[1], "V1"])
points(
rep(barx[2], length(df[df$Factor == levels(df$Factor)[2], "V1"])),
df[df$Factor == levels(df$Factor)[2], "V1"])
points(
rep(barx[3], length(df[df$Factor == levels(df$Factor)[1], "V2"])),
df[df$Factor == levels(df$Factor)[1], "V2"])
points(
rep(barx[4], length(df[df$Factor == levels(df$Factor)[2], "V2"])),
df[df$Factor == levels(df$Factor)[2], "V2"])
You can try to use the tidyverse:
library(tidyverse)
df %>%
gather(key, value, -Test, -Factor) %>%
ggplot(aes(x = key, y = value, fill=Factor)) +
geom_bar(stat = "summary", fun = "mean", position = "dodge") +  # fun replaces the deprecated fun.y
geom_point(position=position_dodge(0.9))
In base R I would do:
library(reshape2)
df_wide <- melt(df[,-1]) # reshape to long format (despite the variable name)
df_wide <- df_wide[ order(df_wide$variable,df_wide$Factor),] # order to match the bar order
# add the x-positions using interaction
df_wide$X <- barx[as.numeric(interaction(df_wide$Factor, df_wide$variable))]
# add the points to the bars
points(df_wide$X, df_wide$value)
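The same x-position lookup can also be done without reshape2 by indexing the matrix of bar midpoints directly. This is a sketch under the assumption that Factor is a factor; vars, mat and idx are illustrative names.

```r
df <- data.frame(Test = 1:5,
                 Factor = factor(c("A", "A", "B", "B", "A")),
                 V1 = c(3.2, 5.4, 6.0, 6.5, 2),
                 V2 = c(5, 5, 8.6, 7, 1))
vars <- c("V1", "V2")

# Group means: one row per Factor level, one column per variable
mat <- sapply(df[vars], function(v) tapply(v, df$Factor, mean))

# With beside = TRUE, barplot() returns a levels-by-variables matrix of midpoints
barx <- barplot(mat, beside = TRUE,
                ylim = c(0, 1.3 * max(df[vars])),
                legend.text = rownames(mat))

# One (factor level, variable) index pair per data point
idx <- cbind(rep(as.integer(df$Factor), times = length(vars)),
             rep(seq_along(vars), each = nrow(df)))
points(barx[idx], unlist(df[vars]))
```

Adding more value columns only requires extending vars.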

How to plot additional statistics in boxplot for each group?

I would like to see boxplots of combination of factors and I was told to use lattice for that. I tried it and it looks like this:
But now I would also like to add an ANOVA statistic to each of the groups, ideally the p-value displayed in each panel (in the white space below the strip label, e.g. below "Australia"). How can I do this in lattice? Note that I don't insist on lattice at all...
Example code:
library(lattice)  # provides bwplot()
set.seed(123)
n <- 300
country <- sample(c("Europe", "Africa", "Asia", "Australia"), n, replace = TRUE)
type <- sample(c("city", "river", "village"), n, replace = TRUE)
month <- sample(c("may", "june", "july"), n, replace = TRUE)
x <- rnorm(n)
df <- data.frame(x, country, type, month)
bwplot(x ~ type|country+month, data = df, panel=function(...) {
panel.abline(h=0, col="green")
panel.bwplot(...)
})
The code to perform ANOVA for one of the groups and to extract p-value is this:
model <- aov(x ~ type, data = df[df$country == 'Africa' & df$month == 'may',])
p_value <- summary(model)[[1]][["Pr(>F)"]][1]  # first entry is the p-value for type; the second (residuals) is NA
Here's one way using ggplot2. First we compute the p-values separately for every month/country combination (I use data.table, but you can use whichever way you're comfortable with). Then we add geom_text, specify pvalue as the label, and give the x and y coordinates where the text should appear within each facet.
library(data.table)
library(ggplot2)
dt <- data.table(df)
pval <- dt[, list(pvalue = paste0("pval = ", sprintf("%.3f",
summary(aov(x ~ type))[[1]][["Pr(>F)"]][1]))),
by=list(country, month)]
ggplot(data = df, aes(x=type, y=x)) + geom_boxplot() +
geom_text(data = pval, aes(label=pvalue, x="river", y=2.5)) +
facet_grid(country ~ month) + theme_bw() +
theme(panel.spacing=grid::unit(0,"lines"), # panel.spacing replaces the deprecated panel.margin; thanks to @DieterMenne
strip.background = element_rect(fill = NA),
panel.grid.major = element_line(colour=NA),
panel.grid.minor = element_line(colour=NA))
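For completeness, the p-value can also be drawn from within a lattice panel function, which is what the question originally asked about. This is a sketch rather than a tested recipe for every layout: anova_p is a hypothetical helper, and the label is simply placed at the top centre of each panel.

```r
library(lattice)

set.seed(123)
n <- 300
df <- data.frame(
  x       = rnorm(n),
  country = factor(sample(c("Europe", "Africa", "Asia", "Australia"), n, replace = TRUE)),
  type    = factor(sample(c("city", "river", "village"), n, replace = TRUE)),
  month   = factor(sample(c("may", "june", "july"), n, replace = TRUE))
)

# Hypothetical helper: one-way ANOVA p-value of y across the groups in g
anova_p <- function(y, g) {
  summary(aov(y ~ factor(g)))[[1]][["Pr(>F)"]][1]
}

p <- bwplot(x ~ type | country + month, data = df,
            panel = function(x, y, ...) {
              panel.abline(h = 0, col = "green")
              panel.bwplot(x, y, ...)
              # factor() inside anova_p guards against x arriving as numeric codes
              lims <- current.panel.limits()
              panel.text(mean(lims$xlim), lims$ylim[2],
                         sprintf("p = %.3f", anova_p(y, x)),
                         pos = 1, cex = 0.7)
            })
print(p)
```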

Resources