I am plotting water metal-level data, for which I report the 90th percentile to the MI-DEQ. I have boxplots of each of the metals and I want to label the hinge and whisker values. I've done something similar in base R for discrete data sets. Here is my starting code:
ggplot(data = Agm, aes(x = Street, y = Level)) +
  ggtitle("Lead Levels", subtitle = subtext) +
  xlab("Streets") + ylab("ppb") +
  geom_boxplot(fill = "red", width = 0.8, na.rm = TRUE) +  # na.rm belongs to the geom, not to ggplot()
  theme_bw()
Agm is a subsetted data frame, with Street being chr and Level being numeric. How would I label each group's quantiles? Also, how would I have geom_boxplot extend the whiskers to the max value, i.e. include the outliers? And if I create a data frame with the Street and quantile(0.9) for each street, how would I have geom_point plot and label those values over the above boxplot using the same grouping?
The data looks like this:
Agm<-read.table(header = TRUE, text = "
Street Year Month Person Level Metal
Crawford 2019 June RCR 0.13 Ag
Crawford 2019 June RCR 160 Cu
Crawford 2019 June RAR 0.92 Ag
Crawford 2019 June RAR 140 Cu
Gratiot 2019 June RL NA Ag
Gratiot 2019 June RL 24 Cu
Seneca 2019 June DS 0.33 Ag
Seneca 2019 June DS 75 Cu")
Sorry for the delay; I browse on my iPad, but my R work is on my Linux box, which was not readily available. The real data set is more extensive and will keep growing.
This brings up another issue: the reporting metric is the 90th percentile. Is there a way to plot that point from the geom_boxplot call? That way each group would have the hinge and whisker values labeled as well as the 90th percentile.
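A minimal sketch of one way to do this, assuming the summaries are precomputed and drawn as extra layers on top of the boxplot. The toy values, the helper data frames `labs_df` and `p90`, and the nudge amounts are illustrative assumptions, not part of the original data. Setting `coef = Inf` makes the whiskers run to the minimum and maximum, so no points are drawn as outliers.

```r
library(ggplot2)

# Toy data in the same shape as Agm (values invented for illustration)
Agm <- data.frame(
  Street = rep(c("Crawford", "Gratiot", "Seneca"), each = 5),
  Level  = c(0.13, 0.92, 0.40, 0.75, 3.10,
             0.20, 0.35, NA,   0.50, 0.90,
             0.33, 0.10, 0.60, 0.45, 1.80)
)

# Min, hinges (quartiles), median and max per street; with coef = Inf
# these are exactly the values the boxplot draws
probs <- c(0, 0.25, 0.5, 0.75, 1)
labs_df <- do.call(rbind, lapply(split(Agm, Agm$Street), function(d) {
  data.frame(Street = d$Street[1],
             Level  = unname(quantile(d$Level, probs, na.rm = TRUE)))
}))

# 90th percentile per street (the formula interface drops NA rows)
p90 <- aggregate(Level ~ Street, data = Agm,
                 FUN = function(x) unname(quantile(x, 0.9)))

p <- ggplot(Agm, aes(x = Street, y = Level)) +
  geom_boxplot(fill = "red", width = 0.8, coef = Inf, na.rm = TRUE) +
  # quantile labels to the right of each box
  geom_text(data = labs_df, aes(label = round(Level, 2)),
            nudge_x = 0.45, hjust = 0, size = 3) +
  # 90th percentile as a point, labeled to the left of each box
  geom_point(data = p90, shape = 18, size = 3, colour = "blue") +
  geom_text(data = p90, aes(label = round(Level, 2)),
            nudge_x = -0.45, hjust = 1, size = 3, colour = "blue") +
  labs(title = "Lead Levels", x = "Streets", y = "ppb") +
  theme_bw()
```

Because both helper data frames carry the same `Street` column, the extra layers inherit the x/y mapping and line up with the boxes without any additional grouping.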
Related
I have a melted dataframe that generates the plot below. The data is downloaded from the Federal Reserve, and the first few lines of the melted dataframe are as follows:
> head(df_melt)
Date Variable value
1 Jun 1967 Chauvet-Piger Recession Probability 0.183
2 Jul 1967 Chauvet-Piger Recession Probability 0.108
3 Aug 1967 Chauvet-Piger Recession Probability 0.039
4 Sep 1967 Chauvet-Piger Recession Probability 0.096
5 Oct 1967 Chauvet-Piger Recession Probability 0.048
6 Nov 1967 Chauvet-Piger Recession Probability 0.036
I plot it using the following code:
ggplot(df_melt, aes(x = Date, y= value)) +
geom_line(aes(color = Variable)) +
labs(x = "Date", y = "Unemployment Rate") +
#Some more stuff related to axes, legend etc.
I would like to:
1. Choose the colors
2. Shade the area under the UREC recession indicator with a light gray
I tried setting colors by changing aes(color = Variable) to color = line_colors
where line_colors is a vector of colors I have defined, but get an error message:
Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (1992): colour
Run `rlang::last_error()` to see where the error occurred.
I have also tried scale_color_manual without success. What am I doing wrong, and how can I fix these two problems?
Sincerely and with many thanks in advance
Thomas Philips
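A hedged sketch of both requests, assuming `Date` is a Date column and that the indicator series is labeled "UREC" in `Variable`. The toy values and the second series name are made up for illustration. The error arises because `color = line_colors` outside `aes()` is treated as a per-row constant, so it must have length 1 or one entry per row; keeping `aes(color = Variable)` and passing a named vector to `scale_color_manual()` avoids that.

```r
library(ggplot2)

# Toy stand-in for df_melt (values invented for illustration)
dates <- seq(as.Date("2000-01-01"), by = "month", length.out = 24)
df_melt <- rbind(
  data.frame(Date = dates, Variable = "Recession Probability",
             value = round(abs(sin(seq_along(dates) / 4)), 3)),
  data.frame(Date = dates, Variable = "UREC",
             value = as.numeric(seq_along(dates) %in% 9:14))
)

# One colour per level of Variable, matched by name
line_colors <- c("Recession Probability" = "steelblue", "UREC" = "grey40")

p <- ggplot(df_melt, aes(x = Date, y = value)) +
  # light gray fill under the 0/1 indicator, drawn first so lines sit on top
  geom_area(data = subset(df_melt, Variable == "UREC"), fill = "grey85") +
  geom_line(aes(color = Variable)) +
  scale_color_manual(values = line_colors) +
  labs(x = "Date", y = "Value")
```

Naming the colour vector by the levels of `Variable` keeps the assignment stable even if the order of the series changes.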
My dataset has 3 columns: high school name, year, and percent enrolled in college, and it includes 104 high schools across 8 years.
school  chrt_grad  enrolled
Alba    2012       0.486
Alba    2013       0.593
Alba    2014       0.588
Alba    2015       0.588
Alba    2016       0.547
Alba    2017       0.613
Alba    2018       0.622
Alba    2019       0.599
Alba    2020       0.614
City    2012       0.588
City    2013       0.613
and so on...
I'm trying to produce 104 individual line plots--one for each school. I started by creating a single line plot showing every school:
ggplot(nsc_enroll,
mapping = aes(x = chrt_grad, y = enrolled, group = school)) +
geom_line() +
geom_point()
How can I create an individual plot for each of the 104 schools without having to filter for each school name over and over again?
You could use facet_wrap with ggplot,
ggplot(mtcars, aes(x = hp, y = mpg))+
geom_point() +
facet_wrap(~cyl)
In your case you would use facet_wrap(~school), but it will produce a very large number of panels.
First of all, I assume chrt_grad is the same as year, or do you have another variable?
Anyway, that's beside the point.
As you may know, faceting can draw multiple panels, but as far as I know it does not produce separate individual plots.
I have a similar situation, and what works for me is to:
1. Spread the school variable into columns (if you know gather/spread).
2. Use a for loop to plot each column.
I am not at home now; if you need it, I can post the code.
Recently I saw some new dplyr/tidyverse code using nest_by. It looks very interesting, although I haven't tried it yet.
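As an alternative to reshaping, here is a minimal sketch that splits the data frame by school and builds one plot per piece. This swaps the spread-and-loop idea for `split()` plus `lapply()`, and the toy data only contains the rows shown in the question.

```r
library(ggplot2)

# First rows of the data shown in the question
nsc_enroll <- data.frame(
  school    = c("Alba", "Alba", "Alba", "City", "City"),
  chrt_grad = c(2012, 2013, 2014, 2012, 2013),
  enrolled  = c(0.486, 0.593, 0.588, 0.588, 0.613)
)

# One data frame per school, then one ggplot object per data frame
plots <- lapply(split(nsc_enroll, nsc_enroll$school), function(d) {
  ggplot(d, aes(x = chrt_grad, y = enrolled)) +
    geom_line() +
    geom_point() +
    ggtitle(d$school[1])
})

# plots[["Alba"]] prints a single school; to write every plot to disk:
# for (nm in names(plots)) ggsave(paste0(nm, ".png"), plots[[nm]])
```

The result is a named list, so the same code covers 2 schools or 104 without any per-school filtering.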
I have a dataset from the world bank with some continuous and categorical variables.
> head(nationsCombImputed)
iso3c iso2c country year.x life_expect population birth_rate neonat_mortal_rate region
1 ABW AW Aruba 2014 75.45 103441 10.1 2.4 Latin America & Caribbean
2 AFG AF Afghanistan 2014 60.37 31627506 34.2 36.1 South Asia
3 AGO AO Angola 2014 52.27 24227524 45.5 49.6 Sub-Saharan Africa
4 ALB AL Albania 2014 77.83 2893654 13.4 6.5 Europe & Central Asia
5 AND AD Andorra 2014 70.07 72786 20.9 1.5 Europe & Central Asia
6 ARE AE United Arab Emirates 2014 77.37 9086139 10.8 3.6 Middle East & North Africa
income gdp_percap.x log_pop
1 High income 47008.83 5.014693
2 Low income 1942.48 7.500065
3 Lower middle income 7327.38 7.384309
4 Upper middle income 11307.55 6.461447
5 High income 30482.64 4.862048
6 High income 67239.00 6.958379
I wish to use ggpairs to plot some of the continuous variables (life_expect, birth_rate, neonat_mortal_rate, gdp_percap.x) in a scatter plot but I would like to colour them using the region categorical variable from the data. I have tried a number of different ways but I cannot colour the continuous variables without including the categorical variable.
ggpairs(nationsCombImputed[,c(2,5,7,8,9,11)],
title="Scatterplot of Variables",
mapping = ggplot2::aes(color = region),
labeller = "iso2c")
But I get this error
Error in stop_if_high_cardinality(data, columns,
cardinality_threshold) : Column 'iso2c' has more levels (211) than
the threshold (15) allowed. Please remove the column or increase the
'cardinality_threshold' parameter. Increasing the
cardinality_threshold may produce long processing times
Ultimately I would just like a 4x4 scatter plot of the continuous variables coloured by region with the data points labels using the iso2c code in column 2.
Is this possible in ggpairs?
Well, yes, it is possible! Following @Robin Gertenbach's suggestion, I added the columns argument to my code and this worked great; please see below.
ggpairs(nationsCombImputed,
title="Scatterplot of Variables",
columns = c(5,7,8,11),
mapping=ggplot2::aes(colour = region))
I still wish to add data point labels to the scatter plot using the iso2c column but I am struggling with this, any pointers would be greatly appreciated.
As mentioned in the comment you can get ggpairs to color but not plot a dimension by specifying the numeric indices of the columns you do want to plot with columns = c(5,7,8,11).
To have a text scatter plot you will need to define a function e.g. textscatter that you will supply via lower = list(continuous = textscatter) in the ggpairs function call and specify the labels in the aesthetics.
# Custom panel function: draws text labels instead of points
textscatter <- function(data, mapping, ...) {
  ggplot(data, mapping, ...) + geom_text()
}
ggpairs(
  nationsCombImputed,
  title = "Scatterplot of Variables",
  columns = c(5, 7, 8, 11),
  mapping = ggplot2::aes(colour = region, label = iso2c),
  lower = list(continuous = textscatter)
)
Of course, you can also put the label aesthetic definition into textscatter.
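A sketch of that variant, assuming the label column is always called `iso2c`. The three toy rows are taken from the head() output in the question. The panel function is exercised directly here so the example only needs ggplot2.

```r
library(ggplot2)

# Panel function with the label aesthetic defined inside it
textscatter <- function(data, mapping, ...) {
  ggplot(data, mapping) +
    geom_text(aes(label = iso2c), size = 3, ...)
}

# Three rows from the question's data
toy <- data.frame(
  iso2c       = c("AW", "AF", "AO"),
  life_expect = c(75.45, 60.37, 52.27),
  birth_rate  = c(10.1, 34.2, 45.5),
  region      = c("Latin America & Caribbean", "South Asia",
                  "Sub-Saharan Africa")
)

# Calling the panel function the way ggpairs would for one cell
p <- textscatter(toy, aes(x = life_expect, y = birth_rate, colour = region))

# With GGally loaded, the ggpairs call then no longer needs label = iso2c:
# ggpairs(nationsCombImputed, columns = c(5, 7, 8, 11),
#         mapping = aes(colour = region),
#         lower = list(continuous = textscatter))
```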
I created a bar graph in ggplot to show how the counts in the column scheme changed over time (i.e. from 2001 to 2016).
The x-axis is the year and the y-axis shows the frequencies (I used fill = scheme to split the counts by scheme).
The data set consists of two columns (year and scheme) filled with character values:
year scheme
2016 yes
2016 yes
2016 yes
2016 yes
2015 yes
2015 yes
2014 yes
2013 yes
....
2006 no
2006 no
2006 no
2006 no
2005 no
2005 no
2004 no
2003 no
2002 no
2002 no
2001 no
2001 no
My code:
a <- ggplot(s) +
stat_bin(aes(x=year, fill=scheme, group=scheme), geom="bar", position = "dodge",bins=30)
b <- a + scale_x_continuous(breaks = c(2001:2016), labels = factor(2001:2016))
c <- b + theme(axis.text.x=element_text(size = 10, colour = "black"))
The graph:
The problem I have is that the bars are shifted in the graph for no apparent reason. You can see it by comparing the bars with the year labels on the x-axis: some bars are moved too far to the left (e.g. 2007) and others too far to the right (e.g. 2002).
I have no clue why this happens or how to fix it. Any suggestions are very welcome.
Use binwidth = 1 instead of bins = 30. When you ask for 30 bins, the 15-year range from 2001 to 2016 is cut into bins roughly half a year wide, so the bin edges do not line up with whole years.
All the weird gaps are from the bins which didn't include a whole number.
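A small sketch of the fix, assuming `year` has been converted to numeric (the question stores it as character). The toy data contains only the rows shown in the question. With `binwidth = 1` and no explicit `center` or `boundary`, the bins should come out centered on whole years.

```r
library(ggplot2)

# Rows shown in the question, with year as a number
s <- data.frame(
  year = c(2016, 2016, 2016, 2016, 2015, 2015, 2014, 2013,
           2006, 2006, 2006, 2006, 2005, 2005, 2004, 2003,
           2002, 2002, 2001, 2001),
  scheme = c(rep("yes", 8), rep("no", 12))
)

p <- ggplot(s, aes(x = year, fill = scheme)) +
  # one bin per year, so each bar sits on its year label
  geom_histogram(binwidth = 1, position = "dodge") +
  scale_x_continuous(breaks = 2001:2016) +
  theme(axis.text.x = element_text(size = 10, colour = "black"))
```

If the years are only ever used as categories, mapping `factor(year)` to x with `geom_bar()` is another option that avoids binning altogether.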
I would like to use R to simplify and subset large datasets (over 100 000 values) and then plot them. Below is a simplified version of my dataset (Figure 1) where I broke it down into three years and two crop types. I have a Year (2011-2013), two crop types (Corn and Soybean) and their total Area.
I want to reduce the data to the total Area of Corn and Soybean by year in a new table (example in Figure 2) with the year, type and total area, and then plot the total area by year for each crop (example plot in Figure 3).
Figure 1 Small example dataset
Figure 2 New total table
Figure 3 example of graph that I want to produce
I thought I could subset the data by year and crop with
corn2011 <- subset(CropTable, Year==2011 & Lulc=="Corn")
corn2012 <- subset(CropTable, Year==2012 & Lulc=="Corn")
and then I can summarize the data using the sum function
sum(corn2011[,3]),
but I'm not sure how to plot them yearly or against each other to have it look like Figure 3.
For your plot, you could try this:
data.df <- read.table(text="
Year Type Area
1 2011 corn 30
2 2012 corn 15
3 2013 corn 50
4 2011 Soy 45
5 2012 Soy 30
6 2013 Soy 60",
header = TRUE)
library(ggplot2)

# One line per crop type, with years treated as discrete categories
ggplot(data = data.df, aes(x = as.factor(Year), y = Area, group = Type, color = Type)) +
  geom_line() +
  xlab("Year") + ylab("Area (ha)") +
  theme_bw() +
  scale_color_manual(values = c("red", "blue"))
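The summary table itself can be built in one step rather than by subsetting each year and crop by hand. Below is a minimal sketch using base R's `aggregate()`, assuming `CropTable` has the columns `Year`, `Lulc` and `Area` (the third column summed in the question). The row values are invented so that the totals match the table above.

```r
library(ggplot2)

# Toy stand-in for CropTable (values invented for illustration)
CropTable <- data.frame(
  Year = c(2011, 2011, 2011, 2012, 2012, 2013, 2013, 2013),
  Lulc = c("Corn", "Corn", "Soybean", "Corn", "Soybean",
           "Corn", "Soybean", "Soybean"),
  Area = c(10, 20, 45, 15, 30, 50, 25, 35)
)

# Total area for every Year x Lulc combination
totals <- aggregate(Area ~ Year + Lulc, data = CropTable, FUN = sum)

# One line per crop, total area by year
p <- ggplot(totals, aes(x = factor(Year), y = Area,
                        group = Lulc, colour = Lulc)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Area (ha)") +
  theme_bw()
```

Since the grouping is done by formula, adding more years or crop types to the data needs no change to the code.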