I created these two violin plots in R, using:
install.packages("vioplot")
par(mfrow = c(1, 2))
vioplot::vioplot(HEL$Y,las=2,main="HEL$Y",col="deepskyblue",notch=TRUE)
vioplot::vioplot(ITA$Y,las=2,main="ITA$Y",col="aquamarine",notch=TRUE)
As a result I get the following plot. However, I don't know why the x axis shows both 1 and 2. How can I get rid of the 2?
Thanks for your help.
This mysterious behavior is due to the use of the argument "notch = TRUE". Example:
set.seed(456)
vioplot(rnorm(10), notch = TRUE)
My interpretation is that notch is not an argument of vioplot, so the function interprets it as data to add to the graph (see the little smudge at y = 1: that's where it wants to put the new data, since TRUE equals 1 when it is converted into a numeric).
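A minimal base R sketch of that mechanism, assuming that vioplot treats everything passed through ... as another data series. The function count_series below is a hypothetical stand-in for illustration, not vioplot's actual code:

```r
# Hypothetical stand-in for a plotting function that collects every
# extra argument in `...` as an additional data series.
count_series <- function(x, ...) {
  series <- list(x, ...)
  length(series)
}

set.seed(456)
n_plain <- count_series(rnorm(10))                # only the data itself
n_notch <- count_series(rnorm(10), notch = TRUE)  # the data plus the stray TRUE

# TRUE becomes 1 when coerced to numeric, which is where the smudge is drawn
true_as_number <- as.numeric(TRUE)

print(c(n_plain, n_notch, true_as_number))
```

If this is what happens, dropping notch = TRUE from both calls in the question should leave a single violin per panel.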
To confirm that an unknown argument is interpreted as data to be plotted, here is a little experiment:
vioplot(rnorm(10), unknown_argument = rnorm(10))
And the result:
This is a ggplot2 solution in case you're interested.
library(ggplot2)
library(dplyr)
# Recreate similar data
HEL <- data.frame(Y = rnorm(50, 8, 3))
ITA <- data.frame(Y = rnorm(50, 9, 2))
# Join in a single dataframe and reshape to longer format
dat <- bind_rows(rename(HEL, hel_y = Y),
rename(ITA, ita_y = Y)) |>
tidyr::pivot_longer(everything())
# Make the plots
dat |>
ggplot(aes(name, value)) +
geom_violin(aes(fill = name)) +
geom_boxplot(width = 0.1) +
scale_fill_manual(values = c("deepskyblue", "aquamarine")) +
theme(legend.position = "none")
Created on 2022-04-28 by the reprex package (v2.0.1)
I am trying to construct a facet plot using ggplot2 with an annotation that varies from one facet to the next. The annotation is to be located using the plot area coordinates between 0 and 1, rather than the usual (x,y) coordinates, and is to be in the same location for every facet. The annotation is to be constructed using the y aesthetic and the paste0() function.
My reprex shows one case that works, but this case does not include the part that comes from the y aesthetic and the annotation does not vary among the facets. The reprex also shows a second case where the percentage change in the last (most recent) value for the y aesthetic is added to the annotation, and this does not work. It is this second case that I want to solve.
The reprex uses the ggpp package, but I have also tried using annotation_custom instead of ggpp. However I have not been able to get that to work either. Any help much appreciated.
Here is my reprex:
# Reprex for facets with placed annotation
library(ggplot2)
library(ggpp)
PC <- function(x) {y <- round(100*(x/lag(x)-1),1)}
df <- data.frame(tm=1:25,A=sample(1:100,25,replace=T),
B=sample(1:100,25,replace=T),C=sample(1:100,25,replace=T),
D=sample(1:100,25,replace=T))
df <- tidyr::pivot_longer(df,cols=2:5,names_to="City",values_to="Value")
# This works:
ggplot(df,aes(x=tm,y=Value))+
geom_line()+
scale_y_continuous(lim=c(-10,100))+
ggpp::geom_text_npc(aes(npcx = x, npcy = y, label=label),
data = data.frame(x = 0.05, y = 0.05,
label='% change in various cities'))+
facet_wrap(~City,scale="free_y")
# But this does not work:
ggplot(df,aes(x=tm,y=Value))+
geom_line()+
scale_y_continuous(lim=c(-10,100))+
ggpp::geom_text_npc(aes(npcx = x, npcy = y, label=label),
data = data.frame(x = 0.05, y = 0.05,
label=paste0("Last change in this city ",PC(y)[25],'%')))+
facet_wrap(~City,scale="free_y")
The solution below is a bit complicated, and there are probably simpler ones, but it works.
1. Function PC()
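Without the dplyr package loaded, your function PC calls stats::lag, not dplyr::lag. For a plain numeric vector, stats::lag only shifts the time-series attributes and leaves the values where they are, so x/lag(x) is always 1. The function also ends in an assignment to y, so its value is returned only invisibly. A cleaner version is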
PC <- function(x) {round(100*(x/dplyr::lag(x) - 1), 1)}
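To see the difference between the two lag functions without loading dplyr, here is a small base R sketch. The helper shift_one is a hypothetical stand-in that mimics what dplyr::lag(x) does with its default arguments:

```r
x <- c(10, 20, 30)

# stats::lag() keeps the values as they are and only changes the
# time-series attributes
lagged_stats <- as.numeric(stats::lag(x))

# Hypothetical stand-in for dplyr::lag(x): shift the values by one
# position and pad the front with NA
shift_one <- function(v) c(NA, head(v, -1))

pc_wrong <- round(100 * (x / lagged_stats - 1), 1)  # always 0
pc_right <- round(100 * (x / shift_one(x) - 1), 1)  # NA, then the real changes

print(pc_wrong)
print(pc_right)
```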
2. The data
The plot is created with data = df, but the label layer is given its own data set, so the y used there no longer comes from df.
The ggpp::geom_text_npc layer cannot compute PC(y) correctly because the y inside that data.frame() call is not the Value column of df. It is looked up in the calling environment instead.
One way to fix this is to note that there are four labels to plot, one per city, and to compute the last change for each city beforehand. This is very simple:
Value <- with(df, tapply(Value, City, \(y) PC(y)[length(y)]))
Value
# A B C D
# -24.1 -16.7 -91.3 46.9
The labels data then becomes
df_labels <- data.frame(
x = rep(0.05, length(Value)), y = rep(0.05, length(Value)),
City = names(Value),
label = paste0("Last change in this city ", Value, "%")
)
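The same two steps on a tiny hand-made data set, where the numbers are easy to check. This is only a sketch: toy, pct_change and toy_labels are made-up names, and pct_change uses a base R shift in place of dplyr::lag.

```r
# Percentage change from the previous value (base R shift instead of dplyr::lag)
pct_change <- function(x) round(100 * (x / c(NA, head(x, -1)) - 1), 1)

# Two cities with three observations each
toy <- data.frame(
  City  = rep(c("A", "B"), each = 3),
  Value = c(10, 20, 30, 40, 20, 25)
)

# One number per city: the change between the last two observations
last_change <- with(toy, tapply(Value, City, function(v) pct_change(v)[length(v)]))

# One label row per city. Keeping the City column lets facet_wrap(~ City)
# send each label to its own panel.
toy_labels <- data.frame(
  City  = names(last_change),
  label = paste0("Last change in this city ", last_change, "%")
)
print(toy_labels)
```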
3. The plot
Full reproducible example, from top to bottom.
# Reprex for facets with placed annotation
suppressPackageStartupMessages({
library(ggplot2)
library(ggpp)
})
set.seed(2022)
PC <- function(x) {round(100*(x/dplyr::lag(x) - 1), 1)}
df <- data.frame(tm=1:25,A=sample(1:100,25,replace=T),
B=sample(1:100,25,replace=T),
C=sample(1:100,25,replace=T),
D=sample(1:100,25,replace=T))
df <- tidyr::pivot_longer(df,cols=2:5,names_to="City",values_to="Value")
Value <- with(df, tapply(Value, City, \(y) PC(y)[length(y)]))
df_labels <- data.frame(
x = rep(0.05, length(Value)), y = rep(0.05, length(Value)),
City = names(Value),
label = paste0("Last change in this city ", Value, "%")
)
ggplot(df, aes(x = tm, y = Value)) +
geom_line() +
scale_y_continuous(lim = c(-10, 100)) +
ggpp::geom_text_npc(
data = df_labels,
mapping = aes(
npcx = x, npcy = y,
label = label
)
) +
facet_wrap(~ City, scale = "free_y")
Created on 2022-08-08 by the reprex package (v2.0.1)
I am attempting to complete a principal component analysis on a set of data containing columns of numeric data.
Assuming a dataset like this (in reality I have a preconfigured data frame; this one is for reproducibility):
v1 <- c(1,2,3,4,5,6,7)
v2 <- c(3,6,2,5,2,4,9)
v3 <- c(6,1,4,2,3,7,5)
dataset <-data.frame(v1,v2,v3)
row.names(dataset) <-c('New York', 'Seattle', 'Washington DC', 'Dallas', 'Chicago','Los Angeles','Minneapolis')
I have run my principal component analysis, and successfully plotted it:
pca=prcomp(dataset,scale=TRUE)
plot(pca$x[,1], pca$x[,2],
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],cex=0.7,pos=3,col="darkgrey")
What I want to do however is colour code my data points based on the city, which is the row names of my dataset. I also want to use these cities (i.e. rownames) as labels.
I've tried the following, but neither attempt has worked:
## attempt 1 - I get row labels, but no chart
plot(pca$x[,1], pca$x[,2],col=rownames(dataset),pch=rownames(dataset),
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],labels=rownames(dataset),cex=0.7,pos=3,col="darkgrey")
## attempt 2
datasetwithcity = rownames_to_column(dataset, var = "city")
head(datasetwithcity)
OnlyCities=datasetwithcity[,1]
OnlyCities
# this didn't work:
City_Labels=as.numeric(OnlyCities)
head(City_Labels)
# gets city labels, but loses points and no colour
plot(pca$x[,1], pca$x[,2],col=City_Labels,pch=City_Labels,
xlab="First PC",ylab="Second PC")
text(pca$x[,1], pca$x[,2],labels=rownames(dataset),
cex=0.7,pos=3,col="darkgrey")
There are many different ways to do this.
In base R, you could do:
plot(pca$x[,1], pca$x[,2],
xlab="First PC",ylab="Second PC", col = seq(nrow(pca$x)),
xlim = c(-2.5, 2.5), ylim = c(-2, 2))
text(pca$x[,1], pca$x[,2],cex=0.7,pos=3,col="darkgrey")
text(x = pca$x[,1], y = pca$x[,2], labels = rownames(pca$x), pos = 1)
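As a side note on why attempt 2 in the question loses its points: as.numeric() on city names gives NA for every element, so col and pch receive no usable values. A small sketch of one way to turn the names into integer codes instead (city_codes is a made-up name; the codes follow the alphabetical order of the factor levels):

```r
cities <- c("New York", "Seattle", "Washington DC", "Dallas",
            "Chicago", "Los Angeles", "Minneapolis")

# Character strings that are not numbers become NA (with a warning)
bad_codes <- suppressWarnings(as.numeric(cities))

# Going through a factor gives one integer code per distinct city
city_codes <- as.integer(factor(cities))

print(bad_codes)
print(city_codes)
```

These codes could then be passed as col = city_codes in plot(), in the same way seq(nrow(pca$x)) is used above.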
Personally, I think the resulting aesthetics are nicer (and easier to change to suit your needs) with ggplot. The code is also a bit easier to read once you get used to the syntax.
library(ggplot2)
df <- as.data.frame(pca$x)
df$city <- rownames(df)
ggplot(df, aes(PC1, PC2, color = city)) +
geom_point(size = 3) +
geom_text(aes(label = city) , vjust = 2) +
lims(x = c(-2.5, 2.5), y = c(-2, 2)) +
theme_bw() +
theme(legend.position = "none")
Created on 2021-10-28 by the reprex package (v2.0.0)
Transforming ggplot2 axes to log10 using scales::trans_breaks() can sometimes (if the range is small enough) produce un-pretty breaks, at non-integer powers of ten.
Is there a general purpose way of setting these breaks to occur only at 10^x, where x are all integers, and, ideally, consecutive (e.g. 10^1, 10^2, 10^3)?
Here's an example of what I mean.
library(ggplot2)
# dummy data
df <- data.frame(fct = rep(c("A", "B", "C"), each = 3),
x = rep(1:3, 3),
y = 10^seq(from = -4, to = 1, length.out = 9))
p <- ggplot(df, aes(x, y)) +
geom_point() +
facet_wrap(~ fct, scales = "free_y") # faceted to try and emphasise that it's general purpose, rather than specific to a particular axis range
The unwanted result -- y-axis breaks are at non-integer powers of ten (e.g. 10^2.8)
p + scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))
)
I can achieve the desired result for this particular example by adjusting the n argument to scales::trans_breaks(), as below. But this is not a general purpose solution, of the kind that could be applied without needing to adjust anything on a case-by-case basis.
p + scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x, n = 1),
labels = scales::trans_format("log10", scales::math_format(10^.x))
)
Should add that I'm not wed to using scales::trans_breaks(), it's just that I've found it's the function that gets me closest to what I'm after.
Any help would be much appreciated, thank you!
Here is an approach that at the core has the following function.
breaks = function(x) {
brks <- extended_breaks(Q = c(1, 5))(log10(x))
10^(brks[brks %% 1 == 0])
}
It gives extended_breaks() a narrow set of 'nice numbers' and then filters out non-integers.
This gives us the following for your example case:
library(ggplot2)
library(scales)
# dummy data
df <- data.frame(fct = rep(c("A", "B", "C"), each = 3),
x = rep(1:3, 3),
y = 10^seq(from = -4, to = 1, length.out = 9))
ggplot(df, aes(x, y)) +
geom_point() +
facet_wrap(~ fct, scales = "free_y") +
scale_y_continuous(
trans = "log10",
breaks = function(x) {
brks <- extended_breaks(Q = c(1, 5))(log10(x))
10^(brks[brks %% 1 == 0])
},
labels = math_format(format = log10)
)
Created on 2021-01-19 by the reprex package (v0.3.0)
I haven't tested this on many other ranges that might be difficult, but it should generalise better than setting the number of desired breaks to 1. Difficult ranges might be those just in between -but not including- powers of 10. For example 11-99 or 101-999.
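To illustrate those difficult ranges, here is a base R sketch of a different, simpler technique than the extended_breaks() filter above: take every integer power of ten between the limits. The name log10_integer_breaks is hypothetical, and I am assuming the function receives the limits in data space, as the breaks function above does.

```r
# Hypothetical helper: integer powers of ten that fall inside the range of x
log10_integer_breaks <- function(x) {
  rng <- range(log10(x[x > 0]), na.rm = TRUE)
  lo <- ceiling(rng[1])
  hi <- floor(rng[2])
  # A range such as 11 to 99 contains no integer power of ten at all
  if (lo > hi) return(numeric(0))
  10^seq(lo, hi)
}

wide_range   <- log10_integer_breaks(c(0.00005, 20))
narrow_range <- log10_integer_breaks(c(11, 99))

print(wide_range)
print(narrow_range)
```

The empty result for 11 to 99 shows why such ranges are awkward: no filter on integer exponents can produce a break there, so some fallback would be needed for those axes.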
I wanted to plot multiple lines in one graph, but I couldn't figure out which code to use. Also, is there a way I could assign colors to each of the lines? I'm new to RStudio and was assigned to pick up someone's work, so I've been doing a lot of trial and error, but I haven't had any luck for the past few days. I hope someone can help me with this! Thank you so much.
ecdf.shift <- function(OUR_threshold, des_cap = 40, nint = 10000){
#create some empty vectors for later use in the loop
ecdf_med = c()
ecdf_obs = c()
for (i in 1:length(OUR_threshold)){
# filter out the OUR threshold data, then select only the capture column and create a ecdf function
ecdf_fun <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
ecdf()
# extract the ecdf data and put in tibble dataframe, then create a linear interpolation of the curve.
ecdf_data <- tibble(TSS_con = environment(ecdf_fun)$x, prob = environment(ecdf_fun)$y)
ecdf_interpol <- approx(x = ecdf_data$TSS_con, y = ecdf_data$prob, n = nint)
# find the indices in x that correspond to the desired capture, then look up the matching probabilities in y. Take the median in case of multiple hits and store it at position i, as dictated by the loop counter.
ecdf_med[i] <- median(ecdf_interpol$y[(round(ecdf_interpol$x,1) == des_cap)])
# calculate the number of observations when the filtering takes place.
ecdf_obs[i] <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
length()
# Flush the ecdf data. The ecdf is encoded as a function with global parameters, so reset them every time the loop finishes to avoid pesky bugs.
rm(ecdf_data)
}
#create a tibble dataframe with all the loop data.
ecdf_out <- tibble(OUR_ratio_cutoff = OUR_threshold, prob = (ecdf_med)*100, nobs = ecdf_obs)
return(ecdf_out)
}
ratio_threshold <- seq(0,115, by = 5)
t = ecdf_MLSS_target <- 400 %>%
ecdf.shift(ratio_threshold, .) %>%
filter(nobs > 2) %>%
ggplot(aes( x = OUR_ratio_cutoff, y = prob)) +
geom_line() +
geom_point() +
theme_bw(base_size = 12) +
theme(panel.grid = element_blank()) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,300, by = 5),
expand = c(0,0)) +
scale_x_continuous(limits = c(0,120),
breaks = seq(0,110, by = 10),
expand = c(0,0)) +
labs(x = "ESS mg TSS/L",
y = "Probability of contactor MLSS > 400 mg TSS/L ")
plot(t)
The easiest way would be to loop over your different t values first, combine the resulting data frames into one big data frame, and use that for your plot. Your code is not fully reproducible (it requires data that we do not have, i.e. HRP_rESS_no), so I have stripped the function down to its core: creating a data frame that produces a different "line" depending on your t value. I simply used t as the slope.
I hope the idea is clear.
library(tidyverse)
ecdf.shift <- function(OUR_threshold, t) {
data.frame(x = OUR_threshold, y = t * OUR_threshold)
}
ratio_threshold <- seq(0, 115, by = 5)
t_df <-
map(1:5, function(t) ecdf.shift(ratio_threshold, t)) %>%
bind_rows(.id = "t")
ggplot(t_df, aes(x, y, color = t)) +
geom_line() +
geom_point()
Created on 2020-05-07 by the reprex package (v0.3.0)
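The same idea without the tidyverse, in case it helps to see what map() and bind_rows() are doing. This is only a sketch: make_line is a made-up stand-in for the stripped-down ecdf.shift(), and line_colours is a hypothetical choice of colours.

```r
# Hypothetical simplified stand-in for ecdf.shift(): one line per slope value
make_line <- function(threshold, slope) {
  data.frame(x = threshold, y = slope * threshold)
}

ratio_threshold <- seq(0, 115, by = 5)

# One data frame per slope, each tagged with an id column `t`
pieces <- lapply(1:5, function(s) {
  data.frame(t = as.character(s), make_line(ratio_threshold, s))
})

# Stack the pieces into one long data frame
all_lines <- do.call(rbind, pieces)

# One colour per line, named by the id it belongs to
line_colours <- c("1" = "black", "2" = "red", "3" = "blue",
                  "4" = "darkgreen", "5" = "orange")

print(table(all_lines$t))
```

To assign the colours in the ggplot above, a named vector like line_colours can be passed to scale_colour_manual(values = line_colours).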
I have a data frame, which after applying the melt function looks similar to:
var val
1 a 0.6133426
2 a 0.9736237
3 b 0.6201497
4 b 0.3482745
5 c 0.3693730
6 c 0.3564962
..................
The initial dataframe had 3 columns with the column names, a,b,c and their associated values.
I need to plot on the same graph, using ggplot the associated ecdf for each of these columns (ecdf(a),ecdf(b),ecdf(c)) but I am failing in doing this. I tried:
p<-ggplot(melt_exp,aes(melt_exp$val,ecdf,colour=melt_exp$var))
pg<-p+geom_step()
But I am getting an error: arguments imply differing number of rows: 34415, 0.
Does anyone have an idea on how this can be done? The graph should look similar to the one returned by plot(ecdf(x)), not a step-like one.
Thank you!
My first thought was to try to use stat_function, but since ecdf returns a function, I couldn't get that working quickly. Instead, here's a solution that requires that you attach the computed values to the data frame first (using Ramnath's example data):
library(plyr) # function ddply()
mydf_m <- ddply(mydf_m, .(variable), transform, ecd = ecdf(value)(value))
ggplot(mydf_m,aes(x = value, y = ecd)) +
geom_line(aes(group = variable, colour = variable))
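The ecdf(value)(value) part can look odd at first: ecdf() returns a function, and that function is then evaluated at the original values. A small base R sketch of the same per-group computation, using ave() in place of ddply() (toy_df is made-up data):

```r
# ecdf() returns a function; calling it gives the cumulative proportion
vals <- c(0.2, 0.5, 0.9, 0.5)
F_hat <- ecdf(vals)
ecd_single <- F_hat(vals)

# Per-group version: ave() applies the function within each group and
# returns the results in the original row order
toy_df <- data.frame(
  variable = c("a", "a", "a", "b", "b"),
  value    = c(1, 2, 3, 5, 4)
)
toy_df$ecd <- ave(toy_df$value, toy_df$variable,
                  FUN = function(v) ecdf(v)(v))

print(ecd_single)
print(toy_df)
```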
If you want a smooth estimate of the ECDF you could also use geom_smooth together with the function ns() from the splines package:
library(splines) # function ns()
ggplot(mydf_m, aes(x = value, y = ecd, group = variable, colour = variable)) +
geom_smooth(se = FALSE, formula = y ~ ns(x, 3), method = "lm")
As noted in a comment above, as of version 0.9.2.1, ggplot2 has a specific stat for this purpose: stat_ecdf. Using that, we'd just do something like this:
ggplot(mydf_m,aes(x = value)) + stat_ecdf(aes(colour = variable))
Based on Ramnath's approach, you can get the ECDF from ggplot2 by doing the following:
require(ggplot2)
library(reshape2) # provides melt()
mydf = data.frame(
a = rnorm(100, 0, 1),
b = rnorm(100, 2, 1),
c = rnorm(100, -2, 0.5)
)
mydf_m = melt(mydf)
p0 = ggplot(mydf_m, aes(x = value)) +
stat_ecdf(aes(group = variable, colour = variable))
print(p0)
Here is one approach
require(ggplot2)
library(reshape2) # provides melt()
mydf = data.frame(
a = rnorm(100, 0, 1),
b = rnorm(100, 2, 1),
c = rnorm(100, -2, 0.5)
)
mydf_m = melt(mydf)
p0 = ggplot(mydf_m, aes(x = value)) +
geom_density(aes(group = variable, colour = variable)) +
theme(legend.position = c(0.85, 0.85)) # opts() has been removed from ggplot2; theme() replaces it