I have the following code:
figg4 <- lala4 %>% gather(key, value, -Species_Name) %>%
mutate (Species_Name = factor(Species_Name,
levels=c('Dasyprocta punctata',
'Cuniculus paca','Large Rats',
'Heteromys unknown', 'Sciurus variegatoides',
'Sciurus granatensis','Dasypus novemcinctus',
'Didelphis marsupialis', 'Philander opossum',
'Metachirus nudicaudatus', 'Nasua narica',
'Procyon lotor', 'Eira barbara',
'Galictis vittata', 'Leopardus pardalis'))) %>%
ggplot(aes(x=Species_Name, y=value,
fill=key)) + coord_flip() + geom_col (position = "stack") +
theme(panel.background = element_blank()) + bbc_style() +
labs(title = "Species occupancy by Site Type")+
scale_fill_manual(values = c("#333333","#1380A1", "#FAAB18"))
I get a bar graph that lists the names in reverse order. I want them to appear in the order in which I wrote the levels. How do I do that?
I tried using fct_reorder from forcats by adding the following code:
mutate(name = fct_reorder(Species_Name, desc(value)))
But that did not change the order.
I am quite new to R and not sure how to do this. I would be grateful for any help.
Here is the dput output for the source data:
dput(lala4)
structure(list(Species_Name = structure(c(9L, 12L, 13L, 14L,
19L, 22L, 27L, 46L, 41L, 42L, 10L, 15L, 32L, 33L, 24L), .Label = c("Buteo platypterus",
"Canis latrans", "Cathartes aura", "Catharus unknown", "Catharus ustulatus",
"Cebus capucinus", "Chordeiles unknown", "Conepatus semistriatus",
"Crax rubra", "Crypturellus cinnamomeus", "Cuniculus paca", "Dasyprocta punctata",
"Dasypus novemcinctus", "Didelphis marsupialis", "Eira barbara",
"Galictis vittata", "Geotrygon montana", "Geotrygon violacea",
"Heteromys unknown", "Holcosus quadrilineatus", "Large Rats",
"Leopardus pardalis", "Leopardus wiedii", "Leptotila unknown",
"Melozone unknown", "Metachirus nudicaudatus", "Nasua narica",
"Odocoileus virginianus", "Panthera onca", "Parkesia noveboracensis",
"Pecari tajacu", "Penelopina nigra", "Philander opossum", "Piaya cayana",
"Procyon lotor", "Puma concolor", "Puma yagouaroundi", "Sciurus granatensis",
"Sciurus variegatoides", "Setophaga unknown", "Sylvilagus sp ",
"Tamandua mexicana", "Tapirus bairdii", "Tayassu pecari", "Tigrisoma fasciatum",
"Tinamus major"), class = "factor"), `Forest Area (<5ha)` = c(0.067307692,
0.134615385, 0.173076923, 0.144230769, 0.019230769, 0.086538462,
0.192307692, 0.009615385, 0.163461538, 0.038461538, 0, 0.019230769,
0, 0.163461538, 0.153846154), `Forest Area (5-27ha)` = c(0.067307692,
0.317307692, 0.269230769, 0.096153846, 0.038461538, 0.105769231,
0.192307692, 0.115384615, 0.134615385, 0.057692308, 0, 0.096153846,
0, 0.076923077, 0.173076923), `Forest Area (>350ha)` = c(0.163461538,
0.384615385, 0.278846154, 0.201923077, 0.105769231, 0.067307692,
0.144230769, 0.298076923, 0.028846154, 0.048076923, 0.086538462,
0.038461538, 0.019230769, 0.028846154, 0.125)), row.names = c(NA,
15L), class = "data.frame")
You need to redefine the factor as an ordered factor first.
Try just fixing the code where you define the factor by adding
ordered = TRUE
This should probably work:
figg4 <- lala4 %>% gather(key, value, -Species_Name) %>%
mutate (Species_Name = factor(Species_Name,
levels=c('Dasyprocta punctata',
'Cuniculus paca','Large Rats',
'Heteromys unknown', 'Sciurus variegatoides',
'Sciurus granatensis','Dasypus novemcinctus',
'Didelphis marsupialis', 'Philander opossum',
'Metachirus nudicaudatus', 'Nasua narica',
'Procyon lotor', 'Eira barbara',
'Galictis vittata', 'Leopardus pardalis'), ordered = TRUE)) %>%
ggplot(aes(x=Species_Name, y=value,
fill=key)) + coord_flip() + geom_col (position = "stack") +
theme(panel.background = element_blank()) + bbc_style() +
labs(title = "Species occupancy by Site Type")+
scale_fill_manual(values = c("#333333","#1380A1", "#FAAB18"))
I can't run it because I don't have the lala4 data to test though.
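One more thing worth checking, offered as a hedged sketch rather than a tested fix for lala4: after coord_flip() the first factor level is drawn at the bottom of the axis, so levels written in the desired top-to-bottom order show up reversed. As far as I can tell, ordered = TRUE marks the factor as ordinal but does not by itself change where ggplot2 draws each level. Reversing the levels (with rev(), or forcats::fct_rev()) should put the first species at the top. The toy data frame and its values below are made up for illustration; only the species names come from the question.

```r
# Toy data: species names from the question, values invented for illustration.
wanted <- c("Dasyprocta punctata", "Cuniculus paca", "Large Rats")
toy <- data.frame(
  Species_Name = c("Large Rats", "Dasyprocta punctata", "Cuniculus paca"),
  value = c(0.2, 0.5, 0.3)
)

# Reverse the desired order: coord_flip() draws the first level at the bottom,
# so the last level ends up at the top.
toy$Species_Name <- factor(toy$Species_Name, levels = rev(wanted))

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Species_Name, y = value)) +
    geom_col() +
    coord_flip()
}
```

In the original pipeline the equivalent change would be wrapping the levels vector in rev() inside factor().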
Related
I am trying to extract the features (compounds) that lie in different percentile ranges (25th, 50th, 75th) of the density plot and save them in a new data.frame. I will then map these features back to the original data.frame; identifying them would help with further analysis and in-depth exploration. I have provided example data and a density plot/boxplot (screenshot below).
dput(Delta)
structure(list(`PC1-PC2` = c(0.0161933528045602, 0.766612235998576,
-0.237724873642335, -0.0733015604900428, 0.400545815637124, 0.414481719044214,
0.208303811501068, 0.392408339922047, 0.336514581021898, -0.320322998122561,
0.36615463065484, -0.263557666645363, 0.180272570114807, 0.255255831254277,
0.0138502697450574, 0.23798933387042, -0.296936870921566, 0.206190306805568,
0.141038353337885, 0.167942308239497, 0.147174778368622, -0.0111611567646942,
-0.141468109519736, 0.11179112137823, 0.114216799808335, 0.0185917572079534,
0.0147028493400293), Gene_Symbols = structure(c(15L, 13L, 21L,
9L, 2L, 7L, 1L, 19L, 14L, 5L, 17L, 24L, 18L, 8L, 27L, 20L, 12L,
26L, 4L, 23L, 3L, 6L, 16L, 22L, 11L, 25L, 10L), .Label = c("Feature_1_Compound_2",
"Feature_1_Compound_3", "Feature_10_Compound_1", "Feature_10_Compound_2",
"Feature_10_Compound_3", "Feature_2_Compound_2", "Feature_2_Compound_3",
"Feature_3_Compound_1", "Feature_3_Compound_2", "Feature_4_Compound_1",
"Feature_4_Compound_2", "Feature_4_Compound_3", "Feature_5_Compound_1",
"Feature_5_Compound_2", "Feature_5_Compound_3", "Feature_6_Compound_1",
"Feature_6_Compound_2", "Feature_6_Compound_3", "Feature_7_Compound_1",
"Feature_7_Compound_2", "Feature_7_Compound_3", "Feature_8_Compound_1",
"Feature_8_Compound_2", "Feature_8_Compound_3", "Feature_9_Compound_1",
"Feature_9_Compound_2", "Feature_9_Compound_3"), class = "factor")), row.names = c("Feature_5_Compound_3",
"Feature_5_Compound_1", "Feature_7_Compound_3", "Feature_3_Compound_2",
"Feature_1_Compound_3", "Feature_2_Compound_3", "Feature_1_Compound_2",
"Feature_7_Compound_1", "Feature_5_Compound_2", "Feature_10_Compound_3",
"Feature_6_Compound_2", "Feature_8_Compound_3", "Feature_6_Compound_3",
"Feature_3_Compound_1", "Feature_9_Compound_3", "Feature_7_Compound_2",
"Feature_4_Compound_3", "Feature_9_Compound_2", "Feature_10_Compound_2",
"Feature_8_Compound_2", "Feature_10_Compound_1", "Feature_2_Compound_2",
"Feature_6_Compound_1", "Feature_8_Compound_1", "Feature_4_Compound_2",
"Feature_9_Compound_1", "Feature_4_Compound_1"), class = "data.frame")
#> PC1-PC2 Gene_Symbols
#> Feature_5_Compound_3 0.01619335 Feature_5_Compound_3
#> Feature_5_Compound_1 0.76661224 Feature_5_Compound_1
#> Feature_7_Compound_3 -0.23772487 Feature_7_Compound_3
#> Feature_3_Compound_2 -0.07330156 Feature_3_Compound_2
#> Feature_1_Compound_3 0.40054582 Feature_1_Compound_3
#> Feature_2_Compound_3 0.41448172 Feature_2_Compound_3
#> Feature_1_Compound_2 0.20830381 Feature_1_Compound_2
#> Feature_7_Compound_1 0.39240834 Feature_7_Compound_1
#> Feature_5_Compound_2 0.33651458 Feature_5_Compound_2
#> Feature_10_Compound_3 -0.32032300 Feature_10_Compound_3
#> Feature_6_Compound_2 0.36615463 Feature_6_Compound_2
#> Feature_8_Compound_3 -0.26355767 Feature_8_Compound_3
#> Feature_6_Compound_3 0.18027257 Feature_6_Compound_3
#> Feature_3_Compound_1 0.25525583 Feature_3_Compound_1
#> Feature_9_Compound_3 0.01385027 Feature_9_Compound_3
#> Feature_7_Compound_2 0.23798933 Feature_7_Compound_2
#> Feature_4_Compound_3 -0.29693687 Feature_4_Compound_3
#> Feature_9_Compound_2 0.20619031 Feature_9_Compound_2
#> Feature_10_Compound_2 0.14103835 Feature_10_Compound_2
#> Feature_8_Compound_2 0.16794231 Feature_8_Compound_2
#> Feature_10_Compound_1 0.14717478 Feature_10_Compound_1
#> Feature_2_Compound_2 -0.01116116 Feature_2_Compound_2
#> Feature_6_Compound_1 -0.14146811 Feature_6_Compound_1
#> Feature_8_Compound_1 0.11179112 Feature_8_Compound_1
#> Feature_4_Compound_2 0.11421680 Feature_4_Compound_2
#> Feature_9_Compound_1 0.01859176 Feature_9_Compound_1
#> Feature_4_Compound_1 0.01470285 Feature_4_Compound_1
# Density distribution
library(ggpubr) # ggdensity() comes from the ggpubr package
plt2 <- ggdensity(Delta, x = "PC1-PC2", y = "..count..",
xlab = "Delta (PC1-PC2)",
ylab = "Number of genes",
fill = "lightgray", color = "black",
label = "Gene_Symbols", repel = TRUE,
font.label = list(color= "PC1-PC2"),
xticks.by = 20, # Break x ticks by 20
gradient.cols = c("blue", "red"),
legend = c(0.7, 0.6),
legend.title = "" # Hide legend title
)
#
library(dplyr)
library(ggplot2)
plt1 <- Delta %>% select(`PC1-PC2`) %>%
ggplot(aes(x="", y = `PC1-PC2`)) +
geom_boxplot(fill = "lightblue", color = "black") +
coord_flip() +
theme_classic() +
xlab("") +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank())
# install.packages("egg", dependencies = TRUE)
egg::ggarrange(plt2, plt1, heights = 2:1)
Thank You,
Toufiq
Extract the features between the 25th and 75th percentiles of PC1-PC2:
Delta %>% filter(`PC1-PC2` >= quantile(Delta$`PC1-PC2`, .25) &
`PC1-PC2` <= quantile(Delta$`PC1-PC2`, .75) )
PC1-PC2 Gene_Symbols
Feature_5_Compound_3 0.01619335 Feature_5_Compound_3
Feature_1_Compound_2 0.20830381 Feature_1_Compound_2
Feature_6_Compound_3 0.18027257 Feature_6_Compound_3
Feature_9_Compound_3 0.01385027 Feature_9_Compound_3
Feature_7_Compound_2 0.23798933 Feature_7_Compound_2
Feature_9_Compound_2 0.20619031 Feature_9_Compound_2
Feature_10_Compound_2 0.14103835 Feature_10_Compound_2
Feature_8_Compound_2 0.16794231 Feature_8_Compound_2
Feature_10_Compound_1 0.14717478 Feature_10_Compound_1
Feature_8_Compound_1 0.11179112 Feature_8_Compound_1
Feature_4_Compound_2 0.11421680 Feature_4_Compound_2
Feature_9_Compound_1 0.01859176 Feature_9_Compound_1
Feature_4_Compound_1 0.01470285 Feature_4_Compound_1
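If a separate set of features is wanted for each percentile range rather than only the middle 50%, one possible approach is to cut the values at the quartile break points and split on the resulting label. This is a minimal sketch on made-up data; the data frame toy, its columns score and feature, and the labels Q1 to Q4 are all assumptions for illustration, not part of the original Delta object.

```r
# Toy stand-in for Delta; values and names are invented for illustration.
toy <- data.frame(
  score = c(-0.32, -0.07, 0.01, 0.11, 0.17, 0.21, 0.34, 0.77),
  feature = paste0("Feature_", 1:8),
  stringsAsFactors = FALSE
)

# Break points: minimum, 25th, 50th, 75th percentile, maximum.
breaks <- quantile(toy$score, probs = c(0, 0.25, 0.5, 0.75, 1))

# Label each row with the quartile bin its score falls into.
toy$quartile <- cut(toy$score, breaks = breaks,
                    labels = c("Q1", "Q2", "Q3", "Q4"),
                    include.lowest = TRUE)

# One data frame of features per bin, e.g. by_quartile$Q4.
by_quartile <- split(toy, toy$quartile)
```

The quartile column can then be joined back to the original data by feature name.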
I have data like this:
z <- structure(list(Description = c("Neurotransmitter receptors and postsynaptic signal transmission",
"Muscle contraction", "Class A/1 (Rhodopsin-like receptors)",
"Signaling by Rho GTPases", "Metabolism of carbohydrates", "Extracellular matrix organization",
"Transmission across Chemical Synapses", "G alpha (i) signalling events",
"GPCR ligand binding", "Neuronal System"), p.adjust = c(0.563732077253957,
0.563732077253957, 0.774251160588198, 0.797669099976286, 0.655931854998983,
0.655931854998983, 0.563732077253957, 0.774251160588198, 0.774251160588198,
0.655931854998983), Count = c(9L, 9L, 9L, 9L, 10L, 10L, 11L,
11L, 12L, 13L)), row.names = c("R-HSA-112314", "R-HSA-397014",
"R-HSA-373076", "R-HSA-194315", "R-HSA-71387", "R-HSA-1474244",
"R-HSA-112315", "R-HSA-418594", "R-HSA-500792", "R-HSA-112316"
), class = "data.frame")
I would like to order the y-axis labels by their x-axis values, from smallest to largest. Currently they are plotted in alphabetical order. How can I do this?
ggplot(z, aes(Count, Description, size=Count, color=p.adjust))+
geom_point()
Something like this:
With forcats::fct_reorder(Description, Count) you can change the order of y values.
library(ggplot2)
library(forcats)
ggplot(z, aes(Count, fct_reorder(Description, Count), size=Count, color=p.adjust))+
geom_point()
Created on 2022-02-01 by the reprex package (v2.0.1)
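For anyone who prefers not to load forcats, base R's reorder() does a similar job. A small sketch on a cut-down, made-up version of z (the object z_toy is an assumption for illustration): reorder() sorts the factor levels by a summary of the second argument, using the mean by default, whereas fct_reorder() uses the median by default. With one Count per Description the two agree.

```r
# Cut-down toy version of z; three rows picked for illustration.
z_toy <- data.frame(
  Description = c("Neuronal System", "Muscle contraction", "GPCR ligand binding"),
  Count = c(13L, 9L, 12L)
)

# Levels are sorted by Count (ascending), so the smallest count comes first.
z_toy$Description <- reorder(z_toy$Description, z_toy$Count)
ordered_levels <- levels(z_toy$Description)
```

The reordered column can then be passed straight to aes(Count, Description, ...).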
I've got a dataframe with the following structure:
D1A1 D1A2 D1A3 D1B1 D1B2 D1B3 D2A1 D2A2 D2A3 D2B1 D2B2 D2B3
10 12 15 40 39 27 11 13 14 33 31 32
The actual dataframe is larger (40 observations/columns). I would like to create any kind of plot that shows all the numerical information, with the data grouped by column classification (D1A, D1B, D2A, D2B), as follows:
D1A1+D1A2+D1A3 || D1B1+D1B2+D1B3 || D2A1+D2A2+D2A3 || D2B1+D2B2+D2B3
Since I feel extremely lost, any suggestion would be appreciated.
We can split the dataset by the substring of the column names, loop over the resulting list to get the rowSums, and use barplot:
out <- sapply(split.default(df1, sub("\\d+$", "", names(df1))),
rowSums, na.rm = TRUE)
barplot(out)
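To make the grouping step easier to follow, here is the same idea unpacked on a smaller, made-up data frame (df_toy and the intermediate names groups, pieces and totals are assumptions for illustration). The sub() call strips the trailing digits, which leaves the group key for each column.

```r
# Toy one-row data frame with two column groups.
df_toy <- data.frame(D1A1 = 10L, D1A2 = 12L, D1A3 = 15L,
                     D1B1 = 40L, D1B2 = 39L, D1B3 = 27L)

# Strip trailing digits: "D1A1" -> "D1A", "D1B3" -> "D1B".
groups <- sub("\\d+$", "", names(df_toy))

# split.default() treats the data frame as a list of columns and
# returns one smaller data frame per group.
pieces <- split.default(df_toy, groups)

# Row-wise total within each group.
totals <- sapply(pieces, rowSums, na.rm = TRUE)
```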
If there are more rows to plot, we can use the tidyverse: reshape into 'long' format with pivot_longer, using the pattern in the column names, i.e. capturing the substring of each column name without the trailing digits. This creates 4 columns. Then we use summarise with across to get the sum of each column and draw a bar plot with geom_col:
library(dplyr)
library(tidyr)
library(ggplot2)
df2 %>%
pivot_longer(cols = everything(), names_to = ".value",
names_pattern = "(.*)\\d+$") %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>% # lambda form; extra args via ... are deprecated in newer dplyr
pivot_longer(cols = everything()) %>%
ggplot(aes(x = name, y = value, fill = name)) +
geom_col()
-output
If we are interested in the spread of the data, a boxplot can help. Here, we don't summarise, and instead of geom_col use geom_boxplot
df2 %>%
pivot_longer(cols = everything(), names_to = ".value",
names_pattern = "(.*)\\d+$") %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = name, y = value, fill = name)) +
geom_boxplot()
data
df1 <- structure(list(D1A1 = 10L, D1A2 = 12L, D1A3 = 15L, D1B1 = 40L,
D1B2 = 39L, D1B3 = 27L, D2A1 = 11L, D2A2 = 13L, D2A3 = 14L,
D2B1 = 33L, D2B2 = 31L, D2B3 = 32L), class = "data.frame", row.names = c(NA,
-1L))
df2 <- structure(list(D1A1 = c(10L, 15L), D1A2 = c(12L, 23L), D1A3 = 15:14,
D1B1 = c(40L, 23L), D1B2 = c(39L, 14L), D1B3 = c(27L, 22L
), D2A1 = 11:10, D2A2 = c(13L, 15L), D2A3 = c(14L, 17L),
D2B1 = c(33L, 35L), D2B2 = c(31L, 35L), D2B3 = c(32L, 32L
)), class = "data.frame", row.names = c(NA, -2L))
Aloha all,
I've struggled to build a legend for a plot that mixes several time series.
My understanding is that I need to clean my data and put it all in the same data frame, but the time series don't line up well: some are at 15-minute intervals, others at one hour. Is there any way to force a legend for these datasets? I'm not sure what else to post here, since the 5 datasets are quite large.
Plot I'm working on:
q<- ggplot(subset(cr200_Auwai1, timedate>startd & timedate<endd), aes(timedate, Turb_SS)) +
geom_point(color="coral4")+
geom_point(data=subset(dsloi_wl, timedate>startd & timedate<endd), aes(timedate, level), color="blue")+
#geom_point(data=subset(flow_data, mdate>startd & mdate<endd), aes(as.POSIXct(mdate), flow_cfs*1000), color="red")+
geom_point(data=subset(cr300_Wai1, timedate>startd & timedate<endd), aes(timedate, Lvl_m*1000), color="forestgreen", size=1)+ #aquamarine3
geom_point(data=subset(cr300_Wai1, timedate>startd & timedate<endd), aes(timedate, Turb_SS), color="orange")+
#geom_point(data=subset(hihimanu_wl, timedate>startd & timedate<endd), aes(timedate, level), color="azure4", size=0.1)+
#geom_point(data=subset(rain_data, timedate>startd & timedate<endd), aes(timedate, rainmm), color="red",size=5)+
geom_point(data=subset(haptuk_ysi, datetime>startd & datetime<endd), aes(datetime, Turb), color="pink")+
#scale_x_date(breaks=date_breaks("month"), labels = date_format("%b-%y"))+
xlab("Date")+
ylab("Turbidity (NTU) and Water Level (mm)")+
coord_cartesian(ylim=c(0, 1500))+
theme_bw()+
theme(axis.text=element_text(size=14),
axis.title=element_text(size=16,face="bold"),
legend.justification = c(1, 1),
legend.position = c(1, 1),
legend.title=element_text(size=14),
legend.text=element_text(size=12))
Here is a sample of two of the datasets. Note that the times don't line up at all, since I'm mixing sources.
dsloi_wl:
structure(list(ReceptionTime = c(1533895414.1134, 1533895414.1733,
1533895414.19397, 1533895414.20708, 1533895414.22283, 1533895414.23634,
1533895414.25135, 1533895414.26387, 1533895414.27653, 1533895414.29126,
1533896013.68755, 1533896013.7638, 1533896013.79232, 1533896013.80917,
1533896013.82312, 1533896013.83648, 1533896013.84988, 1533896013.8648,
1533896013.87724, 1533896013.8894), d2w = c(776.7, 789.7, 790.2,
777.1, 777.2, 777.7, 778.4, 793.4, 779.6, 794.1, 819.9, 780.7,
794.1, 806.9, 781.9, 781.9, 782.7, 782.8, 783.1, 783.4), timedate = structure(c(1533895414.1134,
1533895414.1733, 1533895414.19397, 1533895414.20708, 1533895414.22283,
1533895414.23634, 1533895414.25135, 1533895414.26387, 1533895414.27653,
1533895414.29126, 1533896013.68755, 1533896013.7638, 1533896013.79232,
1533896013.80917, 1533896013.82312, 1533896013.83648, 1533896013.84988,
1533896013.8648, 1533896013.87724, 1533896013.8894), class = c("POSIXct",
"POSIXt"), tzone = ""), level = c(723.3, 710.3, 709.8, 722.9,
722.8, 722.3, 721.6, 706.6, 720.4, 705.9, 680.1, 719.3, 705.9,
693.1, 718.1, 718.1, 717.3, 717.2, 716.9, 716.6)), .Names = c("ReceptionTime",
"d2w", "timedate", "level"), row.names = c(NA, 20L), class = "data.frame")
cr300_Wai1:
structure(list(RECORD = 73027:73046, Temp_C = c(24.62861, 24.62332,
24.61533, 24.60857, 24.60189, 24.59733, 24.59068, 24.58404, 24.57869,
24.57327, 24.56781, 24.5606, 24.55551, 24.55218, 24.54648, 24.5416,
24.5358, 24.5319, 24.52781, 24.52294), Turb_BS = c(94.50522,
88.65939, 109.354, 57.71527, 134.1903, 46.37191, 78.17719, 52.22319,
58.07111, 96.95719, 51.47488, 44.65616, 70.43825, 99.58217, 93.68374,
87.4787, 175.5395, 167.6757, 110.8119, 132.5971), Turb_SS = c(36.63349,
34.31228, 37.02223, 32.97258, 36.68553, 33.82083, 37.43391, 33.43639,
31.17306, 33.6327, 34.69954, 30.99891, 34.69988, 33.64369, 32.54948,
32.1177, 32.86558, 48.97706, 30.65004, 33.71646), Temp_C_2 = c(24.9014,
24.89474, 24.88837, 24.88279, 24.87574, 24.86852, 24.86357, 24.85751,
24.85236, 24.84759, 24.84091, 24.83577, 24.83192, 24.82713, 24.8229,
24.81832, 24.81237, 24.80821, 24.8051, 24.80015), WD_OBS = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), Lvl_m = c(0.6907353, 0.6905226, 0.6896195, 0.6890779,
0.6881586, 0.6878724, 0.6862501, 0.6848835, 0.6844589, 0.6837503,
0.6836612, 0.6831629, 0.6821692, 0.6812283, 0.6799452, 0.6791196,
0.6782504, 0.6772775, 0.6763596, 0.6755115), timedate = structure(c(1533895500,
1533895800, 1533896100, 1533896400, 1533896700, 1533897000, 1533897300,
1533897600, 1533897900, 1533898200, 1533898500, 1533898800, 1533899100,
1533899400, 1533899700, 1533900000, 1533900300, 1533900600, 1533900900,
1533901200), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("RECORD",
"Temp_C", "Turb_BS", "Turb_SS", "Temp_C_2", "WD_OBS", "Lvl_m",
"timedate"), row.names = c(NA, 20L), class = "data.frame")
Here is a solution using mock data (next time, please provide a sample of your data):
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
# mock data
time_15m <- seq(as.POSIXct("2018-08-30 00:00:00"), as.POSIXct("2018-08-31 00:00:00"), by = "15 min")
time_30m <- seq(as.POSIXct("2018-08-30 00:00:00"), as.POSIXct("2018-08-31 00:00:00"), by = "30 min")
time_60m <- seq(as.POSIXct("2018-08-30 00:00:00"), as.POSIXct("2018-08-31 00:00:00"), by = "60 min")
data_1 <- data.frame(time = time_15m,
var_1 = cos(hour(time_15m) + minute(time_15m)))
data_2 <- data.frame(time = time_30m,
var_2 = sin(hour(time_30m) + minute(time_30m)))
data_3 <- data.frame(time = time_60m,
var_3 = cos(1 - hour(time_60m) + minute(time_60m)))
# the kind of plot you have (prefer the 2nd version)
ggplot(data_1, aes(x = time, y = var_1)) +
geom_point(color = "red") +
geom_point(data = data_2, aes(time, var_2), color = "green") +
geom_point(data = data_3, aes(time, var_3), color = "blue") +
theme_bw()
# a version with long format data and use of gather function
data_1 %>%
left_join(data_2) %>% # join data from data_2 (timestep = 30m), missing data is NA
left_join(data_3) %>% # join data from data_3 (timestep = 60m), missing data is NA
gather(variable_name, variable_value, var_1, var_2, var_3) %>% # gather var_1, var_2 and var_3 in a single column
ggplot(., aes(x = time, y = variable_value, color = variable_name)) +
theme_bw() +
geom_point(size = 2)
#> Joining, by = "time"
#> Joining, by = "time"
#> Warning: Removed 120 rows containing missing values (geom_point).
Created on 2018-08-22 by the reprex package (v0.2.0).
EDIT 1 (include provided datasets)
library(tidyverse)
dsloi_wl %>%
full_join(cr300_Wai1) %>%
mutate(Lvl_m = 1000 * Lvl_m) %>% # metres to millimetres, matching Lvl_m*1000 in the question
gather(variable_name, variable_value, level, Lvl_m, Turb_SS) %>%
ggplot(., aes(x = timedate, y = variable_value, color = variable_name)) +
geom_point() +
scale_color_manual("Legend title",
values = c("level" = "blue",
"Lvl_m" = "forestgreen",
"Turb_SS" = "orange"))
#> Joining, by = "timedate"
#> Warning: Removed 60 rows containing missing values (geom_point).
Created on 2018-08-23 by the reprex package (v0.2.0).
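An alternative that avoids the join altogether, sketched under assumptions: give each series the same three columns (time, value, series) and stack them with rbind(). Because nothing has to line up on the time axis, no NA rows are created, and mapping colour to the series column produces the legend. The object names (s1, s2, long), the series labels and the mock values are all invented for illustration.

```r
# Two mock series on different time steps (UTC to keep the sequence regular).
t15 <- seq(as.POSIXct("2018-08-30 00:00:00", tz = "UTC"),
           as.POSIXct("2018-08-31 00:00:00", tz = "UTC"), by = "15 min")
t60 <- seq(as.POSIXct("2018-08-30 00:00:00", tz = "UTC"),
           as.POSIXct("2018-08-31 00:00:00", tz = "UTC"), by = "60 min")

# Same column layout for every series, plus a label column.
s1 <- data.frame(time = t15, value = cos(seq_along(t15)), series = "sensor_15min")
s2 <- data.frame(time = t60, value = sin(seq_along(t60)), series = "sensor_60min")

# Stack rather than join: one row per observation, no padding with NA.
long <- rbind(s1, s2)

# Plot only if ggplot2 is installed; colour is mapped, so a legend appears.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = time, y = value, colour = series)) +
    geom_point() +
    scale_colour_manual("Legend title",
                        values = c(sensor_15min = "coral4", sensor_60min = "blue"))
}
```

For the real data, each source would first be reduced to its timestamp and the one variable of interest before stacking.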
I am building a quantile-quantile plot of a variable called x from a data frame called df in the working example provided below. I would like to label the points with the name variable of my df dataset.
Is it possible to do this in ggplot2 without resorting to the painful solution (coding the theoretical distribution by hand and then plotting it against the empirical one)?
Edit: it happens that yes, thanks to a user who posted and then deleted his answer. See the comments after Arun's answer below. Thanks to Didzis for his otherwise clever solution with ggbuild.
# MWE
df <- structure(list(name = structure(c(1L, 2L, 3L, 4L, 5L, 7L, 9L,
10L, 6L, 12L, 13L, 14L, 15L, 16L, 17L, 19L, 18L, 20L, 21L, 22L,
8L, 23L, 11L, 24L), .Label = c("AUS", "AUT", "BEL", "CAN", "CYP",
"DEU", "DNK", "ESP", "FIN", "FRA", "GBR", "GRC", "IRL", "ITA",
"JPN", "MLT", "NLD", "NOR", "NZL", "PRT", "SVK", "SVN", "SWE",
"USA"), class = "factor"), x = c(-0.739390016757746, 0.358177826874146,
1.10474523846099, -0.250589535389937, -0.423112615445571, -0.862144579740376,
0.823039669834058, 0.079521521937704, 1.08173649722493, -2.03962942823921,
1.05571087029737, 0.187147291278723, -0.144770773941437, 0.957990771847331,
-0.0546549555439176, -2.70142550075757, -0.391588386498849, -0.23855544527369,
-0.242781575907386, -0.176765072121165, 0.105155860923456, 2.69031085872414,
-0.158320176671995, -0.564560815972446)), .Names = c("name",
"x"), row.names = c(NA, -24L), class = "data.frame")
library(ggplot2)
qplot(sample = x, data = df) + geom_abline(linetype = "dotted") + theme_bw()
# ... using names instead of points would allow to spot the outliers
I am working on an adaptation of this gist, and will consider sending other questions to CrossValidated if I have questions about the regression diagnostics, which might be of interest to CV users.
You can save your original QQ plot as an object (using the functions ggplot() and stat_qq() instead of qplot()):
g<-ggplot(df, aes(sample = x)) + stat_qq()
Then, with the function ggplot_build(), you can extract the data used for plotting. They are stored in the element data[[1]]. Save those data as a new data frame.
df.new<-ggplot_build(g)$data[[1]]
head(df.new)
x y sample theoretical PANEL group
1 -2.0368341 -2.7014255 -2.7014255 -2.0368341 1 1
2 -1.5341205 -2.0396294 -2.0396294 -1.5341205 1 1
3 -1.2581616 -0.8621446 -0.8621446 -1.2581616 1 1
4 -1.0544725 -0.7393900 -0.7393900 -1.0544725 1 1
5 -0.8871466 -0.5645608 -0.5645608 -0.8871466 1 1
6 -0.7415940 -0.4231126 -0.4231126 -0.7415940 1 1
Now you can add the names of the observations to the new data frame. It is important to use order(), because the data in the new data frame are sorted.
df.new$name<-df$name[order(df$x)]
Now plot the new data frame as usual, but provide geom_text() instead of geom_point().
ggplot(df.new,aes(theoretical,sample,label=name))+geom_text()+
geom_abline(linetype = "dotted") + theme_bw()
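The theoretical quantiles can also be computed directly, without going through ggplot_build(). A minimal sketch, assuming the default normal reference distribution: qnorm(ppoints(n)) gives the theoretical values (this appears to be what stat_qq() computes by default, and it reproduces the -2.0368 shown above for n = 24), and indexing by rank() attaches each one to its own row so the names never get out of step. The data frame toy is made up for illustration.

```r
# Toy data: a few country codes with invented values.
toy <- data.frame(
  name = c("AUS", "AUT", "BEL", "CAN", "CYP"),
  x = c(-0.74, 0.36, 1.10, -0.25, -0.42),
  stringsAsFactors = FALSE
)
n <- nrow(toy)

# Sorted theoretical normal quantiles, assigned to rows by the rank of x,
# so the smallest x gets the smallest theoretical quantile.
toy$theoretical <- qnorm(ppoints(n))[rank(toy$x, ties.method = "first")]

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = theoretical, y = x, label = name)) +
    geom_text() +
    theme_bw()
}
```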
The points are too close together. I would do something like this:
df <- df[with(df, order(x)), ]
df$t <- quantile(rnorm(1000), seq(0, 100, length.out = nrow(df))/100)
p <- ggplot(data = df, aes(x=t, y=x)) + geom_point(aes(colour=name))
This gives:
If you insist on having labels inside the plot, then, you could try something like:
df <- df[with(df, order(x)), ]
df$t <- quantile(rnorm(1000), seq(0, 100, length.out = nrow(df))/100)
p <- ggplot(data = df, aes(x=t, y=x)) + geom_point(aes(colour=name))
p <- p + geom_text(aes(x=t-0.05, y=x-0.15, label=name, size=1, colour=name))
p
You can play around with the x and y coordinates and if you want you can always remove the colour aesthetics.
@Arun has a good solution in the comment above, but this works with R 4.0.3 (geom_text_repel() is from the ggrepel package):
ggplot(data = df, aes(sample = x)) + geom_qq() + geom_text_repel(label=df$name[order(df$x)], stat="qq") + stat_qq_line()
Basically the same thing, with the addition of stat_qq_line() and [order(df$x)] as part of the label. If you don't include the order function, your labels will be out of order and very misleading.
Here's hoping this saves someone else some hours of their life.