Distribute a variable by deciles - r

I have a data set with many observations and variables, and I'm trying to compute deciles of the ingressosmensualsllar variable (which represents monthly household income). The output I'm looking for is a new variable in my data set that gives each observation its corresponding decile.
My goal is to have a geom_bar with the income deciles as the x variable, despesamonetaria as the y variable, and filled by grup_CNAE, as well as a histogram showing the frequency of each income decile.
These are the main columns from the despesa_llar dataset:
structure(list(grup_CNAE_Red = structure(c("Habitatge", "Habitatge",
"Habitatge", "Habitatge", "Comunicacions", "Restaurants i hotels",
"Altres béns i serveis", "Aliments i begudes no alcohòliques",
"Aliments i begudes no alcohòliques", "Aliments i begudes no alcohòliques",
"Aliments i begudes no alcohòliques"), label = "grup_CNAE_Red", format.stata = "%35s"),
despesatotal = structure(c(57629.21, 186827.47, 210879.71,
105439.85, 91381.21, 344980.45, 117155.39, 44334.78, 426350.53,
199874.51, 41750.52), label = "despesatotal", format.stata = "%9.0g"),
despesamonetaria = structure(c(57629.21, 186827.47, 210879.71,
105439.85, 91381.21, 344980.45, 117155.39, 44334.78, 426350.53,
199874.51, 41750.52), label = "despesamonetaria", format.stata = "%9.0g"),
ingressosmensualsllar = structure(c(782, 782, 782, 782, 782,
782, 782, 1283, 1283, 1283, 1283), label = "ingressosmensualsllar", format.stata = "%9.0g")), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
So far, I have tried this:
renda_decils <- despesa_llar %>%
# group_by(ingressosmensualsllar) %>%
mutate(decile=ntile(ingressosmensualsllar, 10)) %>%
ungroup()
ggplot(renda_decils, aes(x=decile))+
geom_histogram()
ggplot(despesa_llar, aes(as.factor(decile), despesamonetaria, fill=reorder(despesamonetaria, grup_CNAE)))+
geom_col(position="dodge")

Are you looking for something like this?
ggplot(renda_decils,
       aes(as.factor(decile), despesamonetaria,
           fill = reorder(grup_CNAE_Red, despesamonetaria))) +
  geom_col(position = "dodge", color = "gray50") +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_brewer(palette = "Pastel1", name = "Grup CNAE") +
  labs(x = "Decile") +
  theme_minimal(base_size = 16)
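One thing to watch with ntile(): it buckets rows by rank, so rows with the same income can end up in different deciles (the sample above has seven rows with income 782). Below is a minimal base-R sketch of an alternative that assigns deciles by value, so equal incomes always share a decile. The income vector is made up for illustration, and using quantile() with findInterval() is a swapped-in technique, not part of the answer above.

```r
# Toy incomes with repeated values (made up for illustration; not the real data)
income <- c(782, 782, 782, 1283, 1283, 1500, 1500, 2100, 2100, 2100,
            2600, 3000, 3000, 3400, 3900, 4200, 4200, 5000, 5600, 6100)

# Decile cut points: the 0%, 10%, ..., 100% quantiles of income
breaks <- quantile(income, probs = seq(0, 1, by = 0.1))

# findInterval() looks up the interval each value falls in, so identical
# incomes always get the same decile. unique() guards against repeated cut
# points (in that case fewer than 10 groups come out), and
# rightmost.closed = TRUE keeps the maximum income in the top group.
decile <- findInterval(income, unique(breaks), rightmost.closed = TRUE)

table(decile)
```

In a dplyr pipeline the same expression could go inside mutate() in place of ntile(ingressosmensualsllar, 10).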

Related

Using segment labels in ggplot with ggrepel with smooth segments

This is my dataframe:
df<-structure(list(year = c(1984, 1984), team = c("Australia", "Brazil"
), continent = c("Oceania", "Americas"), medal = structure(c(3L,
3L), .Label = c("Bronze", "Silver", "Gold"), class = "factor"),
n = c(84L, 12L)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
And this is my ggplot (my question is about the annotation for the Brazil label):
ggplot(data = df)+
geom_point(aes(x = year, y = n)) +
geom_text_repel(aes(x = year, y = n, label = team),
size = 3, color = 'black',
seed = 10,
nudge_x = -.029,
nudge_y = 35,
segment.size = .65,
segment.curvature = -1,
segment.angle = 178.975,
segment.ncp = 1)+
coord_flip()
So, I have a segment divided into two parts, and both parts have small breaks in them. How can I avoid them?
I already tried using segment.ncp and changing nudge_x or nudge_y, but it's not working.
Any help?
I'm not really sure what is going on here. This is the best I could generate by experimenting with different values for the segment.* arguments.
There is some guidance at https://ggrepel.slowkow.com/articles/examples.html, which has an example with shorter leader lines; maybe that's an approach you could use.
df<-structure(list(year = c(1984, 1984), team = c("Australia", "Brazil"
), continent = c("Oceania", "Americas"), medal = structure(c(3L,
3L), .Label = c("Bronze", "Silver", "Gold"), class = "factor"),
n = c(84L, 12L)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
library(ggplot2)
library(ggrepel)
ggplot(data = df)+
geom_point(aes(x = year, y = n)) +
geom_text_repel(aes(x = year, y = n, label = team),
size = 3, color = 'black',
seed = 1,
nudge_x = -0.029,
nudge_y = 35,
segment.size = 0.5,
segment.curvature = -0.0000002,
segment.angle = 1,
segment.ncp = 1000)+
coord_flip()
Created on 2021-08-26 by the reprex package (v2.0.0)

Bubble plot for three observation

This is my data. I'm trying to put three columns in my bubble plot.
They are Altered, Unaltered, and the associated survival q-value.
My data frame:
dput(df)
structure(list(Class = c("cell fate commitment", "chromatin remodeling",
"chromatin_covalent", "demethylation", "histone methylation",
"intracellular receptor signaling pathway", "negative regulation of cell differentiation",
"Nuclear Receptor transcription pathway", "PID HDAC CLASSI PATHWAY",
"PID SMAD2 3NUCLEAR PATHWAY", "regulation of chromatin organization",
"Transcriptional misregulation in cancer"), Altered = c(182,
312, 433, 117, 354, 294, 258, 268, 244, 185, 197, 282), Unaltered = c(489,
361, 235, 559, 315, 370, 411, 409, 426, 491, 483, 387), `q-Value` = c(0.0009732,
1.1e-07, 2.832e-05, 0.137, 0.003188, 0.971, 0.139, 0.0008647,
0.002938, 2.843e-06, 3.102e-06, 0.032)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L), spec = structure(list(
cols = list(Class = structure(list(), class = c("collector_character",
"collector")), Altered = structure(list(), class = c("collector_double",
"collector")), Unaltered = structure(list(), class = c("collector_double",
"collector")), `q-Value` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Code for the plot
xm <- reshape2::melt(df, id.vars = "Class", variable.name = "Samples", value.name = "Size")
# Calculate bubble size
bubble_size <- function(val){
ifelse(val > 3, (1/15) * val + (1/3), val)
}
# Calculate bubble colour
bubble_colour <- function(val){
ifelse(val > 3, "A", "B")
}
# Calculate bubble size and colour
xm %<>%
mutate(bub_size = bubble_size(Size),
bub_col = bubble_colour(Size))
# Plot data
ggplot(xm, aes(x = Samples, y = fct_rev(Class))) +
geom_point(aes(size = bub_size, fill = bub_col), shape = 21, colour = "black") +
# geom_text()
geom_label_repel(aes(label=Size), size=3)+
theme(panel.grid.major = element_line(colour = alpha("gray", 0.5), linetype = "dashed"),
text = element_text(family = "serif"),
legend.position = "none") +
scale_size(range = c(1, 25)) +
scale_fill_manual(values = c("blue","red")) +
ylab("Class")
I get something like this:
How do I give the two patient groups different colors, and label each data point with the patient number for both groups and the q-value in the plot?
Update
I can now put labels on the plot, but I am not able to map two different colors to the altered and unaltered patient groups.
Updated fig
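One possible direction, as a rough sketch rather than a definitive answer: keep the q-value out of the reshaped Size column, map fill to the patient group, and print the q-value once per Class in an extra x position. The example uses three rows from the data above, renames `q-Value` to q_value for convenience, and replaces melt(), fct_rev() and geom_label_repel() with base R and geom_text() to stay short; all of those are simplifications of the original code.

```r
library(ggplot2)

# Three rows copied from the question's data; `q-Value` renamed to q_value
df <- data.frame(
  Class = c("chromatin remodeling", "demethylation", "histone methylation"),
  Altered = c(312, 117, 354),
  Unaltered = c(361, 559, 315),
  q_value = c(1.1e-07, 0.137, 0.003188)
)

# Long format: one row per Class x patient group; q_value is not mixed into Size
long <- rbind(
  data.frame(Class = df$Class, Samples = "Altered", Size = df$Altered),
  data.frame(Class = df$Class, Samples = "Unaltered", Size = df$Unaltered)
)

p <- ggplot(long, aes(x = Samples, y = Class)) +
  # fill follows the patient group, so the two groups get different colours
  geom_point(aes(size = Size, fill = Samples), shape = 21, colour = "black") +
  geom_text(aes(label = Size), size = 3) +
  # q-value printed once per Class in a third x position
  geom_text(data = df,
            aes(x = "q-value", y = Class, label = signif(q_value, 3)),
            size = 3) +
  scale_x_discrete(limits = c("Altered", "Unaltered", "q-value")) +
  scale_size(range = c(5, 20)) +
  scale_fill_manual(values = c(Altered = "red", Unaltered = "blue")) +
  theme(legend.position = "none")

p
```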

Annotate ggplot based on a second data frame

I have a faceted plot made with ggplot that is already working, it shows data about river altitude against years. I'm trying to add arrows based on a second dataframe which details when floods occurred.
Here's the current plot:
I would like to draw arrows in the top part of each graph based on date information in my second dataframe where each row corresponds to a flood and contains a date.
The link between the two dataframes is the Station_code column, each river has one or more stations which is indicated by this data (in this case only the Var river has two stations).
Here is the dput of the data frame used to create the original plot:
structure(list(River = c("Durance", "Durance", "Durance", "Durance",
"Roya", "Var"), Reach = c("La Brillanne", "Les Mées", "La Brillanne",
"Les Mées", "Basse vallée", "Basse vallée"), Area_km = c(465,
465, 465, 465, 465, 465), Type = c("restored", "target", "restored",
"target", "witness", "restored"), Year = c(2017, 2017, 2012,
2012, 2018, 2011), Restoration_year = c(2013, 2013, 2013, 2013,
NA, 2009), Station_code = c("X1130010", "X1130010", "X1130010",
"X1130010", "Y6624010", "Y6442015"), BRI_adi_moy_sstransect = c(0.00375820736746399,
0.00244752138003355, 0.00446807607783864, 0.0028792618981479,
0.00989200896930529, 0.00357247516596474), SD_sstransect = c(0.00165574247612667,
0.0010044634990875, 0.00220534492332107, 0.00102694633805149,
0.00788573233793128, 0.00308489160008849), min_BRI_sstransect = c(0.00108123849595469,
0.00111493913953216, 0.000555500340370182, 0.00100279590198288,
0, 0), max_BRI_sstransect = c(0.0127781240385231, 0.00700537285706352,
0.0210216858227621, 0.00815151653110584, 0.127734814926934, 0.0223738711013954
), Nb_sstr_unique_m = c(0.00623321576795815, 0.00259754717331206,
0.00117035034437559, 0.00209845092352825, 0.0458628969163946,
3.60620609570031), BRI_adi_moy_transect = c(0.00280232169999531,
0.00173868254527501, 0.00333818552810438, 0.00181398859573415,
0.00903651639185542, 0.00447856455432537), SD_transect = c(0.00128472161839638,
0.000477209421076879, 0.00204050725984513, 0.000472466654940182,
0.00780731734792112, 0.00310039904793707), min_BRI_transect = c(0.00108123849595469,
0.00106445386542223, 0.000901992689363725, 0.000855135344651009,
0.000944414463851629, 0.000162012161197014), max_BRI_transect = c(0.00709151795418251,
0.00434366293208643, 0.011717024999411, 0.0031991369873946, 0.127734814926934,
0.0187952134332499), Nb_tr_unique_m = c(0, 0, 0, 0, 0, 0), Error_reso = c(0.0011,
8e-04, 0.0018, 0.0011, 0.0028, 0.0031), W_BA = c(296.553323029366,
411.056574923547, 263.944186046512, 363.32874617737, 88.6420798065296,
158.66866970576), W_BA_sd = c(84.1498544481585, 65.3909073242282,
100.067554749308, 55.5534084807705, 35.2337070278364, 64.6978349498119
), W_BA_min = c(131, 206, 33, 223, 6, 45), W_BA_max = c(472,
564, 657, 513, 188, 381), W_norm = c(5.73271228619998, 7.9461900926133,
5.10234066090722, 7.02355699765464, 5.09378494746752, 4.81262001531126
), W_norm_sd = c(1.62671218635823, 1.2640804493236, 1.93441939783807,
1.07391043231191, 2.02469218788178, 1.96236658443141), W_norm_min = c(2.53237866910643,
3.98221378500706, 0.637927450996277, 4.31084307794454, 0.344787822572658,
1.36490651299098), W_norm_max = c(9.12429566273463, 10.9027600715727,
12.7005556152895, 9.91687219276031, 10.8033517739433, 11.5562084766569
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
And here is the dput of the data frame containing the flood dates:
structure(list(Station_code = c("Y6042010", "Y6042010", "Y6042010",
"Y6042010", "Y6042010", "Y6042010"), Date = structure(c(12006,
12007, 12016, 12017, 13416, 13488), class = "Date"), Qm3s = c(156,
177, 104, 124, 125, 90.4), Qual = c(5, 5, 5, 5, 5, 5), Year = c(2002,
2002, 2002, 2002, 2006, 2006), Month = c(11, 11, 11, 11, 9, 12
), Station_river = c("Var#Entrevaux", "Var#Entrevaux", "Var#Entrevaux",
"Var#Entrevaux", "Var#Entrevaux", "Var#Entrevaux"), River = c("Var",
"Var", "Var", "Var", "Var", "Var"), Mod_inter = c(13.32, 13.32,
13.32, 13.32, 13.32, 13.32), Qm3s_norm = c(11.7117117117117,
13.2882882882883, 7.80780780780781, 9.30930930930931, 9.38438438438438,
6.78678678678679), File_name = c("Var#Entrevaux.dat", "Var#Entrevaux.dat",
"Var#Entrevaux.dat", "Var#Entrevaux.dat", "Var#Entrevaux.dat",
"Var#Entrevaux.dat"), Station_name = c("#Entrevaux", "#Entrevaux",
"#Entrevaux", "#Entrevaux", "#Entrevaux", "#Entrevaux"), Reach = c("Daluis",
"Daluis", "Daluis", "Daluis", "Daluis", "Daluis"), Restauration_year = c(2009,
2009, 2009, 2009, 2009, 2009), `Area_km[BH]` = c(676, 676, 676,
676, 676, 676), Starting_year = c(1920, 1920, 1920, 1920, 1920,
1920), Ending_year = c("NA", "NA", "NA", "NA", "NA", "NA"), Accuracy = c("good",
"good", "good", "good", "good", "good"), Q2 = c(86, 86, 86, 86,
86, 86), Q5 = c(120, 120, 120, 120, 120, 120), Q10 = c(150, 150,
150, 150, 150, 150), Q20 = c(170, 170, 170, 170, 170, 170), Q50 = c(200,
200, 200, 200, 200, 200), Data_producer = c("DREAL_PACA", "DREAL_PACA",
"DREAL_PACA", "DREAL_PACA", "DREAL_PACA", "DREAL_PACA"), Coord_X_L2e_Z32 = c(959313,
959313, 959313, 959313, 959313, 959313), Coord_Y_L2e_Z32 = c(1893321,
1893321, 1893321, 1893321, 1893321, 1893321), Coord_X_L93 = c(1005748.88,
1005748.88, 1005748.88, 1005748.88, 1005748.88, 1005748.88),
Coord_Y_L93 = c(6324083.97, 6324083.97, 6324083.97, 6324083.97,
6324083.97, 6324083.97), New_FN = c("Var#Entrevaux.csv",
"Var#Entrevaux.csv", "Var#Entrevaux.csv", "Var#Entrevaux.csv",
"Var#Entrevaux.csv", "Var#Entrevaux.csv"), NA_perc = c(14.92,
14.92, 14.92, 14.92, 14.92, 14.92), Q2_norm = c(6.45645645645646,
6.45645645645646, 6.45645645645646, 6.45645645645646, 6.45645645645646,
6.45645645645646), Q5_norm = c(9.00900900900901, 9.00900900900901,
9.00900900900901, 9.00900900900901, 9.00900900900901, 9.00900900900901
), Q10_norm = c(11.2612612612613, 11.2612612612613, 11.2612612612613,
11.2612612612613, 11.2612612612613, 11.2612612612613), Q20_norm = c(12.7627627627628,
12.7627627627628, 12.7627627627628, 12.7627627627628, 12.7627627627628,
12.7627627627628), Q50_norm = c(15.015015015015, 15.015015015015,
15.015015015015, 15.015015015015, 15.015015015015, 15.015015015015
)), row.names = c(NA, -6L), groups = structure(list(Station_code = "Y6042010",
.rows = structure(list(1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1L, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
EDIT
Here is an example of what I would like to do on the plot:
This is the code I use currently to do the plot:
ggplot(data = tst_formule[tst_formule$River != "Roya",], aes(x = Year, y = BRI_adi_moy_transect, shape = River, col = Type)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = BRI_adi_moy_transect - SD_transect, ymax = BRI_adi_moy_transect + SD_transect), size = 0.7, width = 0.3) +
geom_errorbar(aes(ymin = BRI_adi_moy_transect - Error_reso, ymax = BRI_adi_moy_transect + Error_reso, linetype = "Error due to resolution"), size = 0.3, width = 0.3, colour = "black") +
scale_linetype_manual(name = NULL, values = 2) +
scale_shape_manual(values = c(15, 18, 17, 16)) +
scale_colour_manual(values = c("chocolate1", "darkcyan")) +
new_scale("linetype") +
geom_vline(aes(xintercept = Restoration_year, linetype = "Restoration"), colour = "chocolate1") +
scale_linetype_manual(name = NULL, values = 5) +
new_scale("linetype") +
geom_hline(aes(yintercept = 0.004, linetype = "Threshold"), colour= 'black') +
scale_linetype_manual(name = NULL, values = 4) +
scale_y_continuous("BRI*", limits = c(min(tst_formule$BRI_adi_moy_transect - tst_formule$SD_transect, tst_formule$BRI_adi_moy_transect - tst_formule$Error_reso ), max(tst_formule$BRI_adi_moy_transect + tst_formule$SD_transect, tst_formule$BRI_adi_moy_transect + tst_formule$Error_reso))) +
scale_x_continuous(limits = c(min(tst_formule$Year - 1),max(tst_formule$Year + 1)), breaks = scales::breaks_pretty(n = 6)) +
theme_bw() +
facet_wrap(vars(River)) +
theme(legend.spacing.y = unit(-0.01, "cm")) +
guides(shape = guide_legend(order = 1),
colour = guide_legend(order = 2),
line = guide_legend(order = 3))
After more tests and research, I managed to do it by passing the second data frame to geom_segment():
new_scale("linetype") +
geom_segment(data = Flood_plot, aes(x = Date, xend = Date, y = 0.025, yend = 0.020, linetype = "Morphogenic flood"), arrow = arrow(length = unit(0.2, "cm")), inherit.aes = F, guide = guide_legend(order = 6)) +
scale_linetype_manual(name = NULL, values = 1) +
new_scale("linetype") starts a new linetype scale after the ones I created before; geom_segment() draws the arrows I wanted (passing a second data frame through data = works the same way with geom_text()); and scale_linetype_manual(name = NULL, ...) adds the arrow to the legend without a "linetype" title above it. The second data frame has the same River column as the first one, so facet_wrap() places each arrow in the right panel.
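The core of this, stripped of the legend handling, can be sketched as follows. The two data frames are toy stand-ins with made-up values, the column name BRI is shortened for readability, and floods are keyed by Year here (an assumption, chosen because the x axis of the plot is Year).

```r
library(ggplot2)

# Toy stand-ins for the two data frames (values made up for illustration)
measurements <- data.frame(
  River = c("Durance", "Durance", "Var", "Var"),
  Year = c(2012, 2017, 2011, 2016),
  BRI = c(0.0033, 0.0028, 0.0045, 0.0030)
)
floods <- data.frame(
  River = c("Durance", "Var", "Var"),
  Year = c(2014, 2012, 2015)
)

p <- ggplot(measurements, aes(x = Year, y = BRI)) +
  geom_point(size = 3) +
  # This layer reads from `floods`; inherit.aes = FALSE stops it from looking
  # for the BRI column, which `floods` does not have
  geom_segment(
    data = floods,
    aes(x = Year, xend = Year, y = 0.006, yend = 0.005),
    arrow = arrow(length = unit(0.2, "cm")),
    inherit.aes = FALSE
  ) +
  # `floods` also has a River column, so each arrow lands in its own panel
  facet_wrap(vars(River))

p
```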

How to specify a certain csv in the errorbar line

I am trying to make a plot with three different csvs. In 2 of them, the columns are the same i.e. Year, GMSL and GMSLerror.
In the Frederikse file the columns are Year, GMSL, GMSLerrorlow and GMSLerrorup. How can I tell R to plot the Frederikse error using the columns GMSLerrorlow and GMSLerrorup? I tried the following but it did not work. Thanks.
p1<-files <- c("Frederikse.csv", "ChurchandWhite.csv","Hay.csv")
map_dfr(files, ~ read_csv(.x) %>%
mutate(Author = .x)) %>%
ggplot(aes(x = Time, y = GMSL, color = Author,fill=Author)) +
geom_line(size=0.6)+
theme_bw(12)+
theme(panel.grid.major = element_blank())+
theme(panel.grid.minor = element_blank())+
labs(x = "Year", y = "GMSL (mm)",color="Author")+
geom_errorbar(aes(ymin=GMSL-GMSLerror, ymax =GMSL+GMSLerror,alpha=Author))+
geom_errorbar("Frederikse.csv",(aes(ymin=GMSL-GMSLerrorlow, ymax =GMSL+GMSLerrorup,alpha=Author)))
scale_alpha_manual(values = c(0.3, 0.3, 0.8))+
scale_colour_manual(values=c("#BAB3F0","#1D3E72","#201641"))
p1
structure(list(Year = 1900:1905, GMSLerrorlow = c(-203.5572666,
-201.0185091, -212.0740442, -202.6975639, -200.1670151, -192.1312551
), GMSL = c(-173.2614421, -168.8016753, -180.389967, -170.2678322,
-168.7200709, -160.9814287), GMSLerrorup = c(-141.002807, -135.8976091,
-148.213824, -138.9305182, -137.4501224, -130.3514508)), row.names = c(NA,
6L), class = "data.frame")
structure(list(Time = 1900:1905, GMSL = c(-131.15, -130.5, -129.77,
-128.85, -128.1, -127.56), GMSLerror = c(25.32, 25.17, 25.01,
24.86, 24.7, 24.55)), row.names = c(NA, 6L), class = "data.frame")
structure(list(Time = c(1880.0417, 1880.125, 1880.2083, 1880.2917,
1880.375, 1880.4583), GMSL = c(-183, -171.1, -164.3, -158.2,
-158.7, -159.6), GMSLerror = c(24.2, 24.2, 24.2, 24.2, 24.2,
24.2)), row.names = c(NA, 6L), class = "data.frame")
You can do this with mutate(), filling GMSLerrorlow and GMSLerrorup for every dataset so that a single geom_errorbar() call covers all three files:
files <- c("Frederikse.csv", "ChurchandWhite.csv","Hay.csv")
p1 <- set_names(files) %>% # give names - can use str_remove to drop `.csv` from names
map_dfr( ~ read_csv(.x), .id = "Author") %>% #use .id argument
mutate(
GMSLerrorlow = if_else(Author != "Frederikse.csv", GMSLerror, GMSLerrorlow),
GMSLerrorup = if_else(Author != "Frederikse.csv", GMSLerror, GMSLerrorup)
) %>%
ggplot(aes(x = Time, y = GMSL, color = Author,fill=Author)) +
geom_line(size=0.6)+
theme_bw(12)+
theme(panel.grid.major = element_blank())+
theme(panel.grid.minor = element_blank())+
labs(x = "Year", y = "GMSL (mm)",color="Author")+
geom_errorbar(aes(ymin=GMSL-GMSLerrorlow, ymax =GMSL+GMSLerrorup,alpha=Author))+
scale_alpha_manual(values = c(0.3, 0.3, 0.8))+
scale_colour_manual(values=c("#BAB3F0","#1D3E72","#201641"))
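One caveat worth checking against the real files: in the sample rows, GMSLerrorlow and GMSLerrorup sit on either side of GMSL (for example -203.6 < -173.3 < -141.0), which suggests they are absolute bounds rather than offsets, and the Frederikse file has a Year column where the others have Time. Under those assumptions, a sketch that gives every source the same Time, ymin and ymax columns before plotting could look like this. The data are the first rows from the question typed in by hand, and the label "Other" is a placeholder because the question does not say which file the second dput comes from.

```r
# First rows of two of the question's data frames, typed in by hand
frederikse <- data.frame(
  Year = 1900:1902,
  GMSLerrorlow = c(-203.5572666, -201.0185091, -212.0740442),
  GMSL = c(-173.2614421, -168.8016753, -180.389967),
  GMSLerrorup = c(-141.002807, -135.8976091, -148.213824)
)
other <- data.frame(
  Time = 1900:1902,
  GMSL = c(-131.15, -130.5, -129.77),
  GMSLerror = c(25.32, 25.17, 25.01)
)

# ASSUMPTION: GMSLerrorlow / GMSLerrorup are absolute bounds, so they are
# used directly as ymin / ymax instead of being subtracted from / added to GMSL
fred_std <- data.frame(
  Time = frederikse$Year, GMSL = frederikse$GMSL,
  ymin = frederikse$GMSLerrorlow, ymax = frederikse$GMSLerrorup,
  Author = "Frederikse"
)

# The symmetric error of the other source is turned into bounds as well
other_std <- data.frame(
  Time = other$Time, GMSL = other$GMSL,
  ymin = other$GMSL - other$GMSLerror,
  ymax = other$GMSL + other$GMSLerror,
  Author = "Other"
)

gmsl <- rbind(fred_std, other_std)
gmsl
```

With the data in this shape, one layer such as geom_errorbar(aes(ymin = ymin, ymax = ymax)) would cover every source.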

Create a grouped barplot with percentages and respective value labels

Following is the dataframe for which I want to create a grouped barplot
df <- structure(list(Race = c("Caucasian/White", "African American", "Asian", "Other"), 'Hospital 1' = c(374, 820, 31, 108), 'Hospital 2' = c(291, 311, 5, 15), 'Hospital 3' = c(330, 206, 6, 5), 'Hospital 4' = c(950, 341, 6, 13)), class = "data.frame", row.names = c(NA, -4L))
To be precise, I want to group each hospital's bars by 'Race'. The bars for each hospital should be shown as percentages, with their corresponding value labels.
I'm not really a programmer, but I'm trying to learn.
You probably want something like this:
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
pivot_longer(contains("Hospital"), names_to = "hospital", values_to = "count") %>%
group_by(hospital) %>%
mutate(percent = count/sum(count)) %>%
ggplot() +
aes(x = hospital, y = percent, fill = Race) +
geom_col(position = "stack")
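For bars side by side with percentage labels, a variation along these lines may be closer to what was asked. This is a sketch under assumptions: it uses only the first two hospitals from the data above, reshapes in base R instead of pivot_longer(), and the label format and dodge width are arbitrary choices.

```r
library(ggplot2)

# First two hospitals from the question's data
df <- data.frame(
  Race = c("Caucasian/White", "African American", "Asian", "Other"),
  `Hospital 1` = c(374, 820, 31, 108),
  `Hospital 2` = c(291, 311, 5, 15),
  check.names = FALSE
)

# Long format in base R: one row per Race x hospital
long <- data.frame(
  Race = rep(df$Race, times = 2),
  hospital = rep(c("Hospital 1", "Hospital 2"), each = nrow(df)),
  count = c(df$`Hospital 1`, df$`Hospital 2`)
)

# Share of each race within its hospital
long$percent <- long$count / ave(long$count, long$hospital, FUN = sum)

p <- ggplot(long, aes(x = hospital, y = percent, fill = Race)) +
  # dodge instead of stack puts the races next to each other
  geom_col(position = position_dodge(width = 0.9)) +
  # same dodge width so each label sits above its own bar
  geom_text(aes(label = sprintf("%.1f%%", 100 * percent)),
            position = position_dodge(width = 0.9), vjust = -0.3, size = 3) +
  scale_y_continuous(labels = scales::label_percent())

p
```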
