ggplot bar graph by percentages - r

I am trying to make a bar graph showing ages of first alcohol use by county by percent. I am not quite sure where the mistake is and would appreciate another set of eyes.
Data is publicly available here: https://www.datafiles.samhsa.gov/dataset/national-survey-drug-use-and-health-2020-nsduh-2020-ds0001 although I have cleaned it on my computer.
The percentages are definitely not out of 100 and the numbers are not adjusting for population. They are the same as my chart showing raw numbers.
palc.age.ct<-data1.cleaned%>%
mutate(ALCTRY= na_if(x=ALCTRY, y="Never Used"))%>%
drop_na(ALCTRY)%>%
ggplot(aes(x=ALCTRY, fill=COUTYP4))+
geom_bar (position = "dodge") +
geom_bar(aes(y = (..count..)/sum(..count..)))+
scale_y_continuous(labels=scales::percent)+
theme_minimal()+
labs(title = "First Alcohol Use by Age and Locality",
x="Age Initiated", y="Number Initiated")+
scale_color_viridis(option = "D")

I'm not recreating everything you did, such as labelling the bins, but based on the data you can do something like the code below. Note that position = "dodge" must go in the same geom_bar() call that calculates the percentage. Otherwise the percentage is calculated in a different layer from the one that draws the grouped bars, which is the cause of your issue.
library(dplyr)
library(ggplot2)
NSDUH_2020 %>%
select(alctry, COUTYP4) %>%
mutate(alctry = if_else(alctry > 66, NA_integer_, alctry),
COUTYP4 = forcats::as_factor(COUTYP4)) %>%
filter(!is.na(alctry)) %>%
ggplot(aes(x = alctry, fill = COUTYP4)) +
geom_bar(aes(y = (..count..)/sum(..count..)), position = "dodge") +
scale_y_continuous(labels = scales::label_percent(accuracy = .1)) +
scale_x_binned()
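As far as I can tell, sum(..count..) in the code above divides by the total across all bars, so each bar is a share of the whole sample rather than a share of its own county type. If the goal is to adjust for the size of each county group, one option is to compute the shares before plotting. The following is only a sketch on invented toy data; the column names age_group and county and all values are assumptions, not the NSDUH variables.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the cleaned survey data; all values are invented.
toy <- data.frame(
  age_group = c(rep(c("<15", "15-17", "18+"), times = c(2, 3, 1)),
                rep(c("<15", "15-17", "18+"), times = c(1, 1, 2))),
  county    = rep(c("Large metro", "Nonmetro"), times = c(6, 4))
)

# Count each county/age combination, then divide by the county total
# so the shares sum to 100% within each county type.
pct <- toy %>%
  count(county, age_group) %>%
  group_by(county) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

p <- ggplot(pct, aes(x = age_group, y = share, fill = county)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::label_percent())
```

With the shares precomputed, geom_col() just draws the values it is given, so there is no second layer to keep in sync.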

Related

what are these gray lines inside the bars of my ggplot bargraph?

I wanted to create a graph to show which continent has the highest co2 emissions total per capita over the years,
so I coded this:
barchart <- ggplot(data = worldContinents,
aes(x = Continent,
y = `Per.Capita`,
colour = Continent)) +
geom_bar(stat = "identity")
barchart
This is my dataframe:
Geometry is just for some geographical mapping later.
I checked which(is.na(worldContinents$Per.Capita)) to see whether there were NA values but it returned nothing
What's causing these gray lines?
How do I get rid of them?
These are the gray lines inside the bar graph
Thank you
You have a couple of issues here. First of all, I'm guessing you want the fill to be mapped to the continent, not color, which only controls the color of the bars' outlines.
Secondly, there are multiple values for each continent in your data, so they are simply stacking on top of each other. This is the reason for the lines in your bars, and is probably not what you want. If you want the average value per capita in each continent, you either need to summarise your data beforehand or use stat_summary like this:
barchart <- ggplot(data = worldContinents,
aes(x = Continent,
y = `Per.Capita`,
fill = Continent)) +
stat_summary(geom = "col", fun = mean, width = 0.7,
col = "gray50") +
theme_light(base_size = 20) +
scale_fill_brewer(palette = "Spectral")
barchart
Data used
Obviously, we don't have your data, so I used a modified version of the gapminder data set to match your own data structure
worldContinents <- gapminder::gapminder
worldContinents$Per.Capita <- worldContinents$gdpPercap
worldContinents$Continent <- worldContinents$continent
worldContinents <- worldContinents[worldContinents$year == 2007,]
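The other route mentioned above, summarising the data beforehand, could look roughly like this. It is a sketch on a tiny invented data frame (the values are made up and only the column names follow the question), not a drop-in replacement for your data.

```r
library(dplyr)
library(ggplot2)

# Invented example values; several rows per continent, as in the question.
toy <- data.frame(
  Continent  = c("Africa", "Africa", "Asia", "Asia", "Europe"),
  Per.Capita = c(1, 3, 2, 6, 5)
)

# Collapse to one row per continent so each bar is a single rectangle.
continent_means <- toy %>%
  group_by(Continent) %>%
  summarise(Per.Capita = mean(Per.Capita), .groups = "drop")

p <- ggplot(continent_means, aes(x = Continent, y = Per.Capita, fill = Continent)) +
  geom_col()
```

Because there is only one value per continent after summarising, nothing is stacked and the internal lines disappear.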

R code of scatter plot for three variables

Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
geom_point(colour = "Black", size = 2) + geom_line(data = Table_1, aes(YOI,
ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.
I am not sure what your desired output is. If I understood your question correctly, I think you want something like this.
library(dplyr)
library(ggplot2)
library(forcats)  # for as_factor()
g_Antisocial <- Antisocial %>%
group_by(Race) %>%
summarise(ASB = mean(ASB),
YOI = mean(YOI))
Antisocial %>%
ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
geom_point(alpha = .4) +
geom_point(data = g_Antisocial, size = 4) +
theme_bw() +
guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:
@Maninder: there are a few things you need to look at.
First of all: the grammar of graphics behind ggplot() works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
The reason your code is not working is that you mix up the layer calls and do not really specify which data should go into the scatter layer and which into the line layer.
(I) Use ggplot() + geom_point() for a scatter plot
The very first layer is ggplot(). Think of this as your drawing canvas.
You then talk about adding a scatter plot layer, but your code never actually adds one for the raw data.
For example:
# plotting the Antisocial data set
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antisocial data set using the scatter layer, i.e. geom_point().
Note that I converted Race to a factor to get a categorical colour scheme; otherwise you might end up with a continuous palette.
(II) line plot
In analogy to above, you would get for the line plot the following:
# plotting Table_1
ggplot() +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I will skip showing the plot of the line on its own.
(III) combining different layers
# putting both together
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
## this is to set the legend title and have a nice(r) name in your colour legend
labs(colour = "Race")
This yields:
That should explain how ggplot layering works. Keep an eye on the datasets and geoms that you want to use. Until you are comfortable with inheritance in aes(), I recommend keeping the data = and aes() calls inside each geom_xxxx(). This avoids confusion.
You may want to try geom_jitter() instead of geom_point() to get a better presentation of your dataset. The "few" points plotted are the result of many data points sitting in the same position and overplotting each other.
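To illustrate the jitter suggestion, here is a minimal sketch on simulated data. The data frame is invented (only the column names follow the question), and the width and height values are arbitrary choices you would tune by eye.

```r
library(ggplot2)

set.seed(1)  # make the simulated values reproducible

# Simulated stand-in for the Antisocial data; values are invented.
toy <- data.frame(
  YOI  = rep(c(90, 92, 94), each = 20),
  ASB  = sample(0:6, 60, replace = TRUE),
  Race = rep(c(0, 1), times = 30)
)

# geom_jitter() adds a little random noise to each point so that
# observations sharing the same coordinates no longer hide each other.
p_jitter <- ggplot() +
  geom_jitter(data = toy,
              aes(x = YOI, y = ASB, colour = as.factor(Race)),
              width = 0.3, height = 0.2, alpha = 0.6) +
  labs(colour = "Race")
```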
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully understand what you mean by that.
I take it that the mean ASB you calculated over the whole population (your Table_1) is your reference, and that you would like to see how the Race groups compare against this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and linetype. You can play around with the colours, etc. to give you the aesthetics you aim for.
Please note that for the grouped boxplot you also have to treat your integer variable YOI as categorical, so I coerced it into a factor. Boxplots use fill for the body (colour sets only the outline). In this setup, you also need to supply a group value to geom_line() (I just set it to 1, but that is arbitrary; in other contexts you can assign another variable here).
ggplot() +
geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1)
, size = 2, linetype = "dashed") +
labs(x = "YOI", fill = "Race")
Hope this gets you going!
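If "mean ASB vs YOI grouped by Race" is meant literally, another reading is one mean per YOI and Race, drawn as one line per Race. This is only a sketch under that assumption, using a small invented data frame with the same column names as the question.

```r
library(dplyr)
library(ggplot2)

# Invented values; only the column names follow the question.
toy <- data.frame(
  YOI  = rep(c(90, 92, 94), each = 4),
  Race = rep(c(0, 1), times = 6),
  ASB  = c(1, 2, 3, 2, 1, 3, 1, 3, 2, 2, 2, 4)
)

# One mean per combination of YOI and Race.
means_by_race <- toy %>%
  group_by(YOI, Race) %>%
  summarise(ASB_mean = mean(ASB), .groups = "drop")

# colour distinguishes the groups; group tells geom_line() which
# points belong to the same line.
p_means <- ggplot(means_by_race,
                  aes(x = YOI, y = ASB_mean,
                      colour = as.factor(Race), group = Race)) +
  geom_line() +
  geom_point(size = 2) +
  labs(colour = "Race")
```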

How to use stat="count" to label a bar chart with counts or percentages in ggplot2?

I'm trying to produce a stacked column chart with data labels.
I'm able to produce the chart, but was unable to find a way to input data labels. I have tried geom_text() but it keeps asking me to input a y label (which if you see the ggplot code is not there). I have also tried adding geom_text(stat = "count") but that also gives me an error saying
"Error: geom_text requires the following missing aesthetics: y and label".
PS: I'm aware I need to rename the y axis as percentage. I'm also trying to figure out how to get more contrasting colours.
ggplot(property,
aes(x=Bedrooms.New, fill=Property.Type.)) +
geom_bar(position = "fill") +
scale_x_discrete(name = "Number of Bedrooms",
limits = sort(factor(unique(property$Bedrooms.New))))
I have added an image below to see what my output is right now!
As the error message is telling you, geom_text requires the label aes. In your case you want to label the bars with a variable which is not part of your dataset but instead computed by stat="count", i.e. stat_count.
The computed variable can be accessed via ..NAME_OF_COMPUTED_VARIABLE.., e.g. to get the counts use ..count.. as the variable name. BTW: a list of the computed variables can be found on the help page of the stat or geom, e.g. ?stat_count
Using mtcars as an example dataset you can label a geom_bar like so:
library(ggplot2)
ggplot(mtcars, aes(cyl, fill = factor(gear)))+
geom_bar(position = "fill") +
geom_text(aes(label = ..count..), stat = "count", position = "fill")
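A side note on notation: in ggplot2 3.3.0 and later, computed variables can also be written with after_stat(), and newer releases warn about the ..count.. spelling. Assuming a reasonably recent ggplot2, the same plot can be sketched as:

```r
library(ggplot2)

# Same plot as above, with after_stat(count) in place of ..count..
p_counts <- ggplot(mtcars, aes(cyl, fill = factor(gear))) +
  geom_bar(position = "fill") +
  geom_text(aes(label = after_stat(count)),
            stat = "count", position = "fill")
```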
Two more notes:
To get the position of the labels right you have to set the position argument to match the one used in geom_bar, e.g. position="fill" in your case.
While counts are pretty easy, labelling with percentages is a different issue. By default stat_count computes percentages by group, e.g. by the groups set via the fill aes. These can be accessed via ..prop... If you want the percentages to be computed differently, you have to do it manually.
As an example if you want the percentages to sum to 100% per bar this could be achieved like so:
library(ggplot2)
ggplot(mtcars, aes(cyl, fill = factor(gear)))+
geom_bar(position = "fill") +
geom_text(aes(label = ..count.. / tapply(..count.., ..x.., sum)[as.character(..x..)]), stat = "count", position = "fill")
Adding a follow-up to the answer above, since this answer usually gets me 90% of the way, but I can never remember how to also:
center the labels (position_fill(vjust = 0.5))
make the percent labels pretty (scales::percent())
library(ggplot2)
ggplot(mtcars, aes(cyl, fill = factor(gear)))+
geom_bar(position = "fill") +
geom_text(aes(label = scales::percent(..count.. / tapply(..count.., ..x.., sum)[as.character(..x..)])), stat = "count", position = position_fill(vjust = 0.5))
And here's an alternative with pre-calculations in advance:
library(dplyr)
library(ggplot2)
mtcars %>%
count(gear, cyl) %>%
group_by(cyl) %>%
mutate(perc = n / sum(n)) %>%
ggplot(aes(cyl, perc, fill = factor(gear)))+
geom_col(position = "fill") +
geom_text(aes(label = scales::percent(perc)), position = position_fill(vjust = 0.5))
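For completeness, the per-bar percentage can also be sketched with after_stat() (ggplot2 3.3.0 or later). Using ave() in place of the tapply() lookup is my own substitution, not something from the answers above; it returns the bar total for every row, so the division is done row by row.

```r
library(ggplot2)

# ave(count, x, FUN = sum) gives each row the total count of its bar,
# so count / ave(...) is the share of that segment within the bar.
p_pct <- ggplot(mtcars, aes(cyl, fill = factor(gear))) +
  geom_bar(position = "fill") +
  geom_text(
    aes(label = after_stat(
      scales::percent(count / ave(count, x, FUN = sum), accuracy = 0.1)
    )),
    stat = "count",
    position = position_fill(vjust = 0.5)
  )
```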

Grouped bar plot column width uneven due to no data

I am trying to display a grouped bar plot for my dataset. However, because some months have no data (no income) for some states, the column widths come out unequal. I would like every column to have the same width regardless of whether some states have no income. Notice how the bars are grouped for January; I would like the same grouping across all months, with the bars spaced out where a state has no income. Any help will be much appreciated, thanks.
library(ggplot2)
plot = ggplot(Checkouts, aes(fill=Checkouts$State, x=Checkouts$Month, y=Checkouts$Income)) +
geom_bar(colour = "black", stat = "identity")
My Bar Plot
Checkouts table/data
There are two ways that this can be done.
If you are using a recent version of ggplot2 (from 2.2.1, I believe), position_dodge() has a parameter called preserve, which preserves the vertical position and adjusts only the horizontal position. Here is the code for it.
Code:
library(ggplot2)
plot = ggplot(Checkouts, aes(fill=Checkouts$State, x=Checkouts$Month, y=Checkouts$Income)) +
geom_bar(colour = "black", stat = "identity", position = position_dodge(preserve = 'single'))
Another way is to precompute the data and add dummy rows for each missing combination. Using table() is the best solution for that.
You are looking for position_dodge2(preserve = "single") (https://ggplot2.tidyverse.org/reference/position_dodge.html).
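As a rough sketch of the dummy-row idea, here is one way to do it with tidyr::complete() instead of table(); that swap is my own choice. The data frame is invented (only the column names follow the question, and the state names and incomes are hypothetical).

```r
library(tidyr)
library(ggplot2)

# Invented data: VIC has no row for February.
checkouts_toy <- data.frame(
  Month  = factor(c("Jan", "Jan", "Feb"), levels = c("Jan", "Feb")),
  State  = c("NSW", "VIC", "NSW"),
  Income = c(10, 20, 15)
)

# complete() adds a row for every Month/State combination that is
# missing and fills its Income with 0, so each month has the same
# number of bars.
filled <- complete(checkouts_toy, Month, State, fill = list(Income = 0))

p_filled <- ggplot(filled, aes(x = Month, y = Income, fill = State)) +
  geom_col(position = "dodge", colour = "black")
```

The zero-height bars still reserve their slot in the dodge, which keeps the remaining bars at a constant width.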
library(ggplot2)
plot = ggplot(Checkouts, aes(fill = State, x = Month, y= Income)) +
geom_bar(colour = "black", stat = "identity",
position = position_dodge2(preserve = "single"))
Also, you don't need to specify the columns to the data frame with $ in ggplot(). For example, Checkouts$State can be replaced with State.

How to label stacked histogram in ggplot

I am trying to add corresponding labels to the color in the bar in a histogram. Here is a reproducible code.
ggplot(aes(displ),data =mpg) + geom_histogram(aes(fill=class),binwidth = 1,col="black")
This code gives a histogram and give different colors for the car "class" for the histogram bars. But is there any way I can add the labels of the "class" inside corresponding colors in the graph?
The inbuilt functions geom_histogram and stat_bin are perfect for quickly building plots in ggplot. However, if you are looking to do more advanced styling it is often required to create the data before you build the plot. In your case you have overlapping labels which are visually messy.
The following code builds a binned frequency table for the data frame:
library(plyr)      # ddply()
library(reshape2)  # melt()
library(ggplot2)   # mpg data set
# Subset data
mpg_df <- data.frame(displ = mpg$displ, class = mpg$class)
melt(table(mpg_df[, c("displ", "class")]))
# Bin Data
breaks <- 1
cuts <- seq(0.5, 8, breaks)
mpg_df$bin <- .bincode(mpg_df$displ, cuts)
# Count the data
mpg_df <- ddply(mpg_df, .(mpg_df$class, mpg_df$bin), nrow)
names(mpg_df) <- c("class", "bin", "Freq")
You can use this new table to set a conditional label, so boxes are only labelled if there are more than a certain number of observations:
ggplot(mpg_df, aes(x = bin, y = Freq, fill = class)) +
geom_bar(stat = "identity", colour = "black", width = 1) +
geom_text(aes(label=ifelse(Freq >= 4, as.character(class), "")),
position=position_stack(vjust=0.5), colour="black")
I don't think it makes a lot of sense duplicating the labels, but it may be more useful showing the frequency of each group:
ggplot(mpg_df, aes(x = bin, y = Freq, fill = class)) +
geom_bar(stat = "identity", colour = "black", width = 1) +
geom_text(aes(label=ifelse(Freq >= 4, Freq, "")),
position=position_stack(vjust=0.5), colour="black")
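Since plyr has been retired, the binned frequency table built above can also be sketched with dplyr. This is an alternative I am suggesting, not the code used for the plots above; mpg_binned is a name I made up, while the bins reuse the same cut points.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Same cut points as above: bins of width 1 starting at 0.5.
cuts <- seq(0.5, 8, by = 1)

# Assign each car to a bin, then count observations per class and bin.
mpg_binned <- mpg %>%
  mutate(bin = .bincode(displ, cuts)) %>%
  count(class, bin, name = "Freq")
```

The result has the same class, bin and Freq columns, so it should be usable in the ggplot() calls above in place of mpg_df.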
Update
I realised you can actually filter labels selectively using the computed variable ..count.. that ggplot generates internally. No need to preformat the data!
ggplot(mpg, aes(x = displ, fill = class, label = class)) +
geom_histogram(binwidth = 1,col="black") +
stat_bin(binwidth=1, geom="text", position=position_stack(vjust=0.5), aes(label=ifelse(..count..>4, ..count.., "")))
This post is useful for explaining special variables within ggplot: Special variables in ggplot (..count.., ..density.., etc.)
This second approach will only work if you want to label the dataset with the counts. If you want to label the dataset by the class or another parameter, you will have to prebuild the data frame using the first method.
Looking at the examples from the other stackoverflow links you shared, all you need to do is change the vjust parameter.
ggplot(mpg, aes(x = displ, fill = class, label = class)) +
geom_histogram(binwidth = 1,col="black") +
stat_bin(binwidth=1, geom="text", vjust=1.5)
That said, it looks like you have other issues. Namely, the labels stack on top of each other because there aren't many observations at each point. Instead I'd just let people use the legend to read the graph.
