Order dataframe/bar chart using a 2 coordinate data point - ggplot2 - r

My data is summarized by a two-coordinate data point (e.g. [0,2]). However, my data frame, and therefore my bar chart, is ordered alphabetically even though the coordinate is a factor.
The data frame/ggplot default behavior: [0,1], [0,13], [0,2]
What I want to happen: [0,1], [0,2], [0,13]
This coordinate variable was created by pasting numbers from 2 columns:
mutate(swimlane_coord = factor(paste0("[", sl_subsection_index, ",", sl_element_index, "]")))
where sl_subsection_index is an integer and sl_element_index is an integer.
There can be any combination of coordinates, so I would like to avoid having to manually force the factor definitions.
Here is an example of the data:
structure(list(application_type1 = c("SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV", "SamsungTV",
"SamsungTV", "SamsungTV"), variant_uuid = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Control",
"BackNav"), class = "factor"), allStreamSec = c("curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog", "curatedCatalog", "curatedCatalog", "curatedCatalog",
"curatedCatalog"), swimlane_coord = structure(c(1L, 2L, 8L, 9L,
10L, 21L, 1L, 2L, 8L, 9L, 10L, 11L, 25L, 29L), .Label = c("[0,0]",
"[0,1]", "[0,10]", "[0,11]", "[0,12]", "[0,13]", "[0,14]", "[0,2]",
"[0,3]", "[0,4]", "[0,5]", "[0,6]", "[0,7]", "[0,8]", "[0,9]",
"[1,0]", "[1,1]", "[1,3]", "[1,4]", "[1,5]", "[1,7]", "[2,0]",
"[2,11]", "[3,1]", "[3,11]", "[3,2]", "[3,5]", "[3,6]", "[3,7]",
"[3,8]"), class = "factor"), ESPerVisitBySL = c(1.775, 1.83333333333333,
0.976190476190476, 0.966666666666667, 1.08333333333333, 1, 1.33333333333333,
1.45161290322581, 1.68965517241379, 1.44827586206897, 1.5, 1,
1, 1), UESPerVisitBySL = c(13, 16.4, 8.80952380952381, 8.4, 9.33333333333333,
1, 11.5555555555556, 17.741935483871, 16.3448275862069, 8.10344827586207,
15.3571428571429, 6, 7, 2)), row.names = c(NA, -14L), groups = structure(list(
application_type1 = c("SamsungTV", "SamsungTV"), variant_uuid = structure(1:2, .Label = c("Control",
"BackNav"), class = "factor"), allStreamSec = c("curatedCatalog",
"curatedCatalog"), .rows = structure(list(1:6, 7:14), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Notice that [3,11] comes before [3,2].
The only packages I have loaded are tidyverse and data.table.
Thank you
Harry

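To get the order you want, first arrange your data frame by sl_subsection_index and sl_element_index, then set the level order of swimlane_coord with forcats::fct_inorder, which orders the levels by first appearance. Because the example data only contains the pasted coordinate, the code first recovers the two indices from it with a regular expression: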
library(ggplot2)
library(dplyr)
library(forcats)
d %>%
  ungroup() %>%
  mutate(
    sl_subsection_index = gsub("^\\[(\\d+),\\d+\\]$", "\\1", swimlane_coord),
    sl_element_index = gsub("^\\[\\d+,(\\d+)\\]$", "\\1", swimlane_coord)
  ) %>%
  arrange(as.integer(sl_subsection_index), as.integer(sl_element_index)) %>%
  mutate(swimlane_coord = forcats::fct_inorder(factor(swimlane_coord))) %>%
  ggplot(aes(swimlane_coord)) +
  geom_bar()
Created on 2021-06-04 by the reprex package (v2.0.0)
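If the two integer columns are still available when the coordinate is built, the regular-expression step may not be needed. The following is a minimal base R sketch under that assumption; the toy data frame d_toy and its values are made up for illustration and are not the asker's data.

```r
# Hypothetical toy data with the two integer index columns still present
d_toy <- data.frame(
  sl_subsection_index = c(0L, 0L, 0L, 3L, 3L),
  sl_element_index    = c(13L, 2L, 1L, 11L, 2L)
)

# Build the label as plain text first
d_toy$swimlane_coord <- paste0(
  "[", d_toy$sl_subsection_index, ",", d_toy$sl_element_index, "]"
)

# Numeric ordering of the rows, then use that ordering to define the levels
ord <- order(d_toy$sl_subsection_index, d_toy$sl_element_index)
d_toy$swimlane_coord <- factor(
  d_toy$swimlane_coord,
  levels = unique(d_toy$swimlane_coord[ord])
)

levels(d_toy$swimlane_coord)
```

Because the levels are derived from the data, no coordinate has to be listed by hand, and ggplot2 follows the level order on a discrete axis.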

Related

Data does not appear in stacked area plot ggplot2

I am trying to make a stacked area plot, but my ggplot only displays the axes and not the plotted data, as shown in the picture. I am not sure why it does not display any of the data; the code seemed pretty straightforward. I am not getting any error messages either.
stack20 <- ggplot(DF1, aes(x= date, y= number_reports, fill=nuisancelevel)) +
geom_area(position="stack") + theme_classic() + scale_fill_brewer(palette = "Reds") +
theme(axis.text.x = element_text(angle = 90))
stack20
See below the dput of my data; I have included the first 10 rows.
structure(list(date = c("2020-07-01", "2020-07-01", "2020-07-02",
"2020-07-03", "2020-07-05", "2020-07-05", "2020-07-05", "2020-07-06",
"2020-07-06", "2020-07-06"), nuisancelevel = structure(c(2L,
4L, 4L, 2L, 1L, 2L, 3L, 1L, 3L, 4L), levels = c("Geen overlast",
"Een beetje overlast", "Veel overlast", "Heel veel overlast"), class = "factor"),
number_reports = c(2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(date = c("2020-07-01",
"2020-07-01", "2020-07-02", "2020-07-03", "2020-07-05", "2020-07-05",
"2020-07-05", "2020-07-06", "2020-07-06", "2020-07-06"), nuisancelevel = structure(c(2L,
4L, 4L, 2L, 1L, 2L, 3L, 1L, 3L, 4L), levels = c("Geen overlast",
"Een beetje overlast", "Veel overlast", "Heel veel overlast"), class = "factor"),
.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))
![enter image description here][1]
[1]: https://i.stack.imgur.com/05hvb.jpg
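Your date column is stored as character, so ggplot2 most likely treats it as a discrete variable and geom_area has no continuous x-axis to draw along. It is a bit tricky, but you could convert your dates to Date format and use rank to plot them like this: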
library(ggplot2)
library(lubridate)
DF1$date <- as.Date(DF1$date, format = "%Y-%m-%d")
stack20 <- ggplot(DF1, aes(x = rank(date), y = number_reports, fill = nuisancelevel)) +
  geom_area(position = "stack") +
  scale_x_continuous(breaks = rank(DF1$date), labels = format(DF1$date, "%Y-%m-%d")) +
  theme_classic() +
  scale_fill_brewer(palette = "Reds") +
  theme(axis.text.x = element_text(angle = 90))
stack20
Created on 2022-08-17 by the reprex package (v2.0.1)
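A possible alternative, sketched here on made-up data rather than the asker's DF1, is to put the Date column on the x-axis directly and fill in missing date/level combinations with zero so that every level has a value at every date. The use of tidyr::complete is an assumption about what fits this data. Since DF1 is a grouped data frame, it would probably need ungroup() first.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Hypothetical toy data shaped like DF1: character dates, factor levels, counts
DF1_toy <- data.frame(
  date = c("2020-07-01", "2020-07-01", "2020-07-02", "2020-07-03"),
  nuisancelevel = factor(c("low", "high", "high", "low"), levels = c("low", "high")),
  number_reports = c(2L, 1L, 2L, 1L)
)

DF1_filled <- DF1_toy %>%
  mutate(date = as.Date(date)) %>%                                  # character -> Date
  complete(date, nuisancelevel, fill = list(number_reports = 0L))   # add zero rows

# Build the stacked area plot on a continuous date axis
p_area <- ggplot(DF1_filled, aes(x = date, y = number_reports, fill = nuisancelevel)) +
  geom_area(position = "stack")
```

Printing p_area draws the plot; the axis labels can then be formatted with scale_x_date().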

Conditionally adding characters to new column based on separate dataset

Hello all and thank you in advance.
I would like to add a new column to my pre-existing data frame, where the values are sourced from a second data frame based on certain conditions. The dataset I wish to add the new column to ("data_melt") has many different sample IDs (sample.#) under the variable column. Using a second dataset ("metadata"), I want to add the pond names to the new column in "data_melt" based on the sample IDs. The sample IDs are the same in both datasets.
My gut tells me there's an obvious solution, but my head is pretty fried. Here is a toy example of my data_melt df (since it has 25,000 observations):
> dput(toy)
structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
And here is a toy example of my metadata df:
> dput(toy)
structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
Thank you again!
We can use match from base R to create a numeric index for looking up the values (here toy is the data_melt sample and out is the metadata sample):
toy$pond <- with(toy, out$pond[match(variable, out$sample)])
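To make the lookup easier to follow, here is a self-contained sketch of the same match idea. The object names data_melt_toy and metadata_toy are hypothetical stand-ins for the two data frames, reduced to the columns that matter.

```r
# Hypothetical reduced versions of the two data frames
data_melt_toy <- data.frame(
  gene = c("serA", "mdh", "fdhB", "fdhA"),
  variable = factor(c("sample.10", "sample.19", "sample.72", "sample.72"))
)
metadata_toy <- data.frame(
  sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
  pond = factor(c("upper", "upper", "lower", "lower"))
)

# For each row of data_melt_toy, find the position of its sample ID in the metadata
idx <- match(as.character(data_melt_toy$variable), metadata_toy$sample)

# Use that position to pull the pond; unmatched IDs would give NA
data_melt_toy$pond <- metadata_toy$pond[idx]
data_melt_toy
```

This keeps the row order and row count of the first data frame unchanged.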
I believe merge will work here.
sss <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"), process = structure(c(1L,
1L, 1L, 1L), .Label = "energy", class = "factor"), category = structure(c(1L,
1L, 1L, 1L), .Label = "metabolism", class = "factor"), ko = structure(1:4, .Label = c("K00058",
"K00093", "K00125", "K00148"), class = "factor"), variable = structure(c(1L,
2L, 3L, 3L), .Label = c("sample.10", "sample.19", "sample.72"
), class = "factor"), value = c(0.00116, 2.77e-05, 1.84e-05,
0.0125)), row.names = c(NA, -4L), class = "data.frame")
ss <- structure(list(sample = c("sample.10", "sample.19", "sample.72",
"sample.13"), pond = structure(c(2L, 2L, 1L, 1L), .Label = c("lower",
"upper"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")
ssss <- merge(sss, ss, by.x = "variable", by.y = "sample")
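One thing to watch with merge is that, by default, it keeps only rows whose key appears in both data frames and sorts the result by the key. The sketch below, on made-up toy data, shows how all.x = TRUE keeps unmatched rows from the first data frame.

```r
# Hypothetical toy data: "sample.99" has no entry in the metadata
left_toy <- data.frame(variable = c("sample.10", "sample.99"), value = c(1, 2))
right_toy <- data.frame(sample = c("sample.10", "sample.19"), pond = c("upper", "upper"))

# Default: inner join, the unmatched row is dropped
inner <- merge(left_toy, right_toy, by.x = "variable", by.y = "sample")

# all.x = TRUE: keep every row of left_toy, with NA where no pond is found
kept <- merge(left_toy, right_toy, by.x = "variable", by.y = "sample", all.x = TRUE)
kept
```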
You can use left_join() from the dplyr package after renaming sample to variable in the metadata data frame.
library(tidyverse)
data_melt <- structure(list(gene = c("serA", "mdh", "fdhB", "fdhA"),
process = structure(c(1L, 1L, 1L, 1L),
.Label = "energy",
class = "factor"),
category = structure(c(1L, 1L, 1L, 1L),
.Label = "metabolism",
class = "factor"),
ko = structure(1:4,
.Label = c("K00058", "K00093", "K00125", "K00148"),
class = "factor"),
variable = structure(c(1L, 2L, 3L, 3L),
.Label = c("sample.10", "sample.19", "sample.72"),
class = "factor"),
value = c(0.00116, 2.77e-05, 1.84e-05, 0.0125)),
row.names = c(NA, -4L),
class = "data.frame")
metadata <- structure(list(sample = c("sample.10", "sample.19", "sample.72", "sample.13"),
pond = structure(c(2L, 2L, 1L, 1L),
.Label = c("lower", "upper"),
class = "factor")),
row.names = c(NA, -4L),
class = "data.frame") %>%
# Renaming the column, so we can join the two data sets together
rename(variable = sample)
data_melt <- data_melt %>%
left_join(metadata, by = "variable")
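If renaming the column is not desirable, left_join also accepts a named vector in by that pairs differently named key columns. A minimal sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical reduced versions of the two data frames
data_melt_toy <- data.frame(
  variable = c("sample.10", "sample.72"),
  value = c(0.00116, 0.0125)
)
metadata_toy <- data.frame(
  sample = c("sample.10", "sample.72"),
  pond = c("upper", "lower")
)

# "variable" in the left table is matched against "sample" in the right table
joined <- left_join(data_melt_toy, metadata_toy, by = c("variable" = "sample"))
joined
```

The result keeps the key under the left-hand name, variable.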

connect points within position_dodged factor x-axis in ggplot2

I'm trying to add significance annotations to an errorbar plot with a factor x-axis and dodged groups within each level of the x-axis. It is a similar but NOT identical use case to this
My base errorbar plot is:
library(ggplot2)
library(dplyr)
library(tidyr) # needed for gather() and spread() used further down
pres_prob_pd = structure(list(x = structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3), labels = c(`1` = 1,
`2` = 2, `3` = 3)), predicted = c(0.571584427222816, 0.712630712634987,
0.156061969566517, 0.0162388386564817, 0.0371877245103279, 0.0165022541901018,
0.131528946944238, 0.35927812866896, 0.0708662221985375), std.error = c(0.355802875027348,
0.471253661425626, 0.457109887762665, 0.352871728451576, 0.442646879181155,
0.425913568532558, 0.376552208691762, 0.48178172708116, 0.451758041335245
), conf.low = c(0.399141779923204, 0.496138837620712, 0.0701919316506831,
0.00819832576725402, 0.0159620304815404, 0.00722904089045731,
0.0675129352870401, 0.17905347369819, 0.030504893442457), conf.high = c(0.728233665534388,
0.861980236164486, 0.311759350126477, 0.031911364587827, 0.0842227723261319,
0.0372248587668487, 0.240584344249407, 0.590437963881823, 0.156035177669385
), group = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("certain",
"neutral", "uncertain"), class = "factor"), group_col = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("certain", "neutral",
"uncertain"), class = "factor"), language = structure(c(2L, 2L,
2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("english", "dutch", "german"
), class = "factor"), top = c(0.861980236164486, 0.861980236164486,
0.861980236164486, 0.0842227723261319, 0.0842227723261319, 0.0842227723261319,
0.590437963881823, 0.590437963881823, 0.590437963881823)), row.names = c(NA,
-9L), groups = structure(list(language = structure(1:3, .Label = c("english",
"dutch", "german"), class = "factor"), .rows = structure(list(
4:6, 1:3, 7:9), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
#dodge
pd = position_dodge(.75)
#plot
p = ggplot(pres_prob_pd,aes(x=language,y=predicted,color=group,shape=group)) +
geom_point(position=pd,size=2) +
geom_errorbar(aes(ymax=conf.high,ymin=conf.low),width=.125,position=pd)
p
What I want to do is annotate the plot such that the contrasts between group within each level of language are annotated for significance. I've plotted points representing the relevant contrasts and (toy) sig. annotations as follows:
#bump function
f = function(x){
v = c()
bump=0.025
constant = 0
for(i in x){
v = c(v,i+constant+bump)
bump = bump + 0.075
}
v
}
#create contrasts
combs = data.frame(gtools::combinations(3, 2, v=c("certain", "neutral", "uncertain"), set=F, repeats.allowed=F)) %>%
mutate(contrast=c("cont_1","cont_2","cont_3"))
combs = rbind(combs %>% mutate(language = 'english'),
combs %>% mutate(language='dutch'),
combs %>% mutate(language = "german")) %>%
left_join(select(pres_prob_pd,language:top)%>%distinct(),by='language') %>%
group_by(language)
#long transform and calc y_pos
combs_long = mutate(combs,y_pos=f(top)) %>% gather(long, probability, X1:X2, factor_key=TRUE) %>% mutate(language=factor(language,levels=c("english","dutch","german"))) %>%
arrange(language,contrast)
#back to wide
combs_wide =combs_long %>% spread(long,probability)
combs_wide$p = rep(c('***',"*","ns"),3)
#plot
p +
geom_point(data=combs_long,
aes(x = language,
color=probability,
shape=probability,
y=y_pos),
inherit.aes = T,
position=pd,
size=2) +
geom_text(data=combs_wide,
aes(x=language,
label=p,
y=y_pos+.025,
group=X1),
color='black',
position=position_dodge(.75),
inherit.aes = F)
What I am failing to achieve is plotting a line connecting each of the contrasts of group within each level of language, as is standard when annotating significant group-wise differences. Any help much appreciated!
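One way this might be approached, offered only as a sketch, is to compute the dodged x positions by hand and draw the connectors with geom_segment. The helper dodge_offset below is hypothetical. It assumes that position_dodge places the centre of group i of n at width * ((i - 0.5) / n - 0.5) around the integer position of the factor level, which gives -0.25, 0 and 0.25 for three groups and a width of 0.75.

```r
# Hypothetical helper: centre offset of group i when n groups are dodged
dodge_offset <- function(i, n, width) width * ((i - 0.5) / n - 0.5)

groups <- c("certain", "neutral", "uncertain")
offsets <- setNames(dodge_offset(seq_along(groups), length(groups), 0.75), groups)

# Toy contrasts for the first language level (integer x position 1);
# the y values are made up for illustration
segs <- data.frame(
  x_centre = 1,
  from = c("certain", "certain", "neutral"),
  to   = c("neutral", "uncertain", "uncertain"),
  y    = c(0.900, 0.975, 1.050)
)

# Start and end of each connector on the continuous x scale
segs$x    <- segs$x_centre + unname(offsets[segs$from])
segs$xend <- segs$x_centre + unname(offsets[segs$to])
segs
```

A table like segs, built for every language level, could then be added to the plot with something like p + geom_segment(data = segs, aes(x = x, xend = xend, y = y, yend = y), inherit.aes = FALSE).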

Binding rows not taking into consideration variable names

I have several data frames that share the same structure but have different column names. I want to merge them all into one data frame, but if I use bind_rows() it creates new columns for the mismatched names.
I tried smartbind(), union(), union_all() and functions from other libraries; however, none of them simply stacks the data frames.
Here goes some sample data:
df1 <- structure(list(Codigo_Cliente = c(292640L, 48296L, 28368L, 27631L,
21715L, 401076L), Segmento = structure(c(3L, 3L, 3L, 3L, 3L,
5L), .Label = c("Clasico", "Emergente", "Mi_Negocio", "Preferencial",
"Prestige"), class = "factor"), Sal_Cons_CA_2018 = c(115966976.4748,
41404074.5338, 21576406.4326, NA, 5217387.0461, NA), Sal_Cons_CA_2019 = c(233057582.7658,
146012775.8314, 121273292.4548, 72383484.8781, 76605696.1462,
64418761.5503), Tipo_Cliente = structure(c(2L, 2L, 2L, 2L, 2L,
1L), .Label = c("Nuevo", "Viejo"), class = "factor"), diferencia_anual = c(117090606.291,
104608701.2976, 99696886.0222, 72383484.8781, 71388309.1001,
64418761.5503), peso_cambio = c(11.7925653553277, 10.5354732191076,
10.040788765049, 7.28996973463426, 7.18974243396645, 6.48781725327502
), cum = c(117090606.291, 221699307.5886, 321396193.6108, 393779678.4889,
465167987.589, 529586749.1393), cum_cambio = c(11.7925653553277,
22.3280385744352, 32.3688273394842, 39.6587970741185, 46.8485395080849,
53.33635676136), ones = c(1, 1, 1, 1, 1, 1), clientes = c(1,
2, 3, 4, 5, 6), porcentaje_acumulado_clientes = c(0.040650406504065,
0.0813008130081301, 0.121951219512195, 0.16260162601626, 0.203252032520325,
0.24390243902439), Tipo_Aportante = c("Viejo Aportante", "Viejo Aportante",
"Viejo Aportante", "Nuevo Aportante", "Viejo Aportante", "Nuevo Aportante"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L), groups = structure(list(Codigo_Cliente = c(21715L, 27631L,
28368L, 48296L, 292640L, 401076L), Segmento = structure(c(3L,
3L, 3L, 3L, 3L, 5L), .Label = c("Clasico", "Emergente", "Mi_Negocio",
"Preferencial", "Prestige"), class = "factor"), .rows = list(
5L, 4L, 3L, 2L, 1L, 6L)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
df2 <- structure(list(Codigo_Cliente = c(29460L, 208833L, 494610L, 292653L,
371679L, 54042L), Segmento = structure(c(3L, 3L, 3L, 3L, 3L,
3L), .Label = c("Clasico", "Emergente", "Mi_Negocio", "Preferencial",
"Prestige"), class = "factor"), Sal_Cons_CC_2018 = c(249412694.49,
226519.47, NA, 232072.25, 893861.14, 2305969.41), Sal_Cons_CC_2019 = c(492333714.52,
217220231.86, 140551673.22, 73744015.83, 57995686.81, 54669407.01
), Tipo_Cliente = structure(c(2L, 2L, 1L, 2L, 2L, 2L), .Label = c("Nuevo",
"Viejo"), class = "factor"), diferencia_anual = c(242921020.03,
216993712.39, 140551673.22, 73511943.58, 57101825.67, 52363437.6
), peso_cambio = c(30.7889911838579, 27.5028381525124, 17.8142024395939,
9.31726115143663, 7.23736301995891, 6.63679667747068), cum = c(242921020.03,
459914732.42, 600466405.64, 673978349.22, 731080174.89, 783443612.49
), cum_cambio = c(30.7889911838579, 58.2918293363703, 76.1060317759641,
85.4232929274008, 92.6606559473597, 99.2974526248303), ones = c(1,
1, 1, 1, 1, 1), clientes = c(1, 2, 3, 4, 5, 6), porcentaje_acumulado_clientes = c(0.0369822485207101,
0.0739644970414201, 0.11094674556213, 0.14792899408284, 0.18491124260355,
0.22189349112426), Tipo_Aportante = c("Viejo Aportante", "Viejo Aportante",
"Nuevo Aportante", "Viejo Aportante", "Viejo Aportante", "Viejo Aportante"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L), groups = structure(list(Codigo_Cliente = c(29460L, 54042L,
208833L, 292653L, 371679L, 494610L), Segmento = structure(c(3L,
3L, 3L, 3L, 3L, 3L), .Label = c("Clasico", "Emergente", "Mi_Negocio",
"Preferencial", "Prestige"), class = "factor"), .rows = list(
1L, 6L, 2L, 4L, 5L, 3L)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
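You can use the data.table package, which has the rbindlist function. Set use.names = FALSE so that columns are bound by position rather than matched by name:
library(data.table)
df <- rbindlist(list(df1, df2), use.names = FALSE)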
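The same positional binding can also be sketched in base R by giving the second data frame the column names of the first before calling rbind. The toy data frames below are hypothetical. The real df1 and df2 are grouped tibbles, so they would probably need ungroup() or as.data.frame() first.

```r
# Hypothetical toy data: same structure, different name for the second column
a <- data.frame(id = 1:2, sal_ca = c(10, 20))
b <- data.frame(id = 3:4, sal_cc = c(30, 40))

# Rename b's columns to match a (by position), then stack the rows
combined <- rbind(a, setNames(b, names(a)))
combined
```

This only makes sense when the columns line up by position and hold the same kind of values, as the question states.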

ggplot add aggregated summaries to a bar plot

I have the following data frame:
structure(list(StepsGroup = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("(-Inf,3e+03]", "(3e+03,1.2e+04]", "(1.2e+04, Inf]"
), class = "factor"), GlucoseGroup = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L), .Label = c("<100", "100-180", ">180"
), class = "factor"), n = c(396L, 1600L, 229L, 787L, 4182L, 375L,
110L, 534L, 55L), freq = c(0.177977528089888, 0.719101123595506,
0.102921348314607, 0.147267964071856, 0.782559880239521, 0.0701721556886228,
0.157367668097282, 0.763948497854077, 0.0786838340486409)), class =
c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), vars = "StepsGroup",
labels = structure(list(
StepsGroup = structure(1:3, .Label = c("(-Inf,3e+03]", "(3e+03,1.2e+04]",
"(1.2e+04, Inf]"), class = "factor")), class = "data.frame", row.names =
c(NA, -3L), vars = "StepsGroup", drop = TRUE), indices = list(0:2,
3:5, 6:8), drop = TRUE, group_sizes = c(3L, 3L, 3L), biggest_group_size =
3L)
I would like to create a stacked bar plot, and add a summary of each StepsGroup on top of each bar. So the first group will have 2225, the second 5344 and the third 699.
I am using the following script:
ggplot(d_stepsFastingSummary , aes(y = freq, x = StepsGroup, fill =
GlucoseGroup)) + geom_bar(stat = "identity") +
geom_text(aes(label = sum(n()), vjust = 0))
Everything up to the geom_text call works, but that last part gives the following error:
Error: This function should not be called directly
Any idea how to add the aggregated quantity?
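We could create a new data frame, stacked_df, that holds the sum of n for each StepsGroup (df here is the d_stepsFastingSummary data from the question). The label is placed at y = 1.1, just above the bars, because the frequencies in each bar sum to 1:
library(dplyr)
library(ggplot2)
stacked_df <- df %>% group_by(StepsGroup) %>% summarise(nsum = sum(n))
ggplot(df) +
  geom_bar(aes(y = freq, x = StepsGroup, fill = GlucoseGroup), stat = "identity") +
  geom_text(data = stacked_df, aes(label = nsum, x = StepsGroup, y = 1.1))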
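As a quick check, the per-group totals quoted in the question (2225, 5344 and 699) can be reproduced in base R. The sketch below copies the n values from the question's data but uses short, hypothetical group labels in place of the interval labels.

```r
# n values from the question; group labels shortened for illustration
steps_toy <- data.frame(
  StepsGroup = rep(c("low", "mid", "high"), each = 3),
  n = c(396, 1600, 229, 787, 4182, 375, 110, 534, 55)
)

# Sum n within each StepsGroup
totals <- aggregate(n ~ StepsGroup, data = steps_toy, FUN = sum)
totals
```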
