Stacked bar chart in ggplot has separated my variables - r

I'm trying to make a stacked bar graph but I can't seem to get the Proteobacteria to group together. This is the code I used:
ggplot(data = Bacteria, aes(x = bacteria$Location, y = bacteria$reads, fill = bacteria$Phylum.Division)) +
geom_bar(stat="identity")
Is there something I can add to my code? I've attached a picture of my graph currently.

There are probably duplicate entries of Proteobacteria in your data frame, but I cannot reproduce this in a simple example.
I noticed that your code uses both Bacteria and bacteria. R is case sensitive, so you may be pulling from two different data frames in the same plot. You can also drop the bacteria$ prefix inside aes(), because ggplot looks the column names up in the data argument:
ggplot(data = bacteria, aes(x = Location, y = reads, fill = Phylum.Division)) + geom_bar(stat="identity")
If you want better help, please give a reproducible example of your problem.
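One way to check the duplicate-label guess is sketched below. The data frame is made up: it only borrows the column names from the question (Location, reads, Phylum.Division), and the idea that a stray trailing space is what splits the phylum is an assumption for illustration, not something the question confirms.

```r
library(ggplot2)

# Hypothetical data: "Proteobacteria " (trailing space) is a different
# fill level from "Proteobacteria", so it gets its own colour and segment.
bacteria <- data.frame(
  Location = c("A", "A", "A", "B", "B"),
  Phylum.Division = c("Proteobacteria", "Proteobacteria ", "Firmicutes",
                      "Proteobacteria", "Firmicutes"),
  reads = c(10, 5, 7, 8, 3),
  stringsAsFactors = FALSE
)

# Inspect the distinct labels; near-duplicates show up here.
unique(bacteria$Phylum.Division)

# Normalise the labels, then sum reads per Location and phylum.
bacteria$Phylum.Division <- trimws(bacteria$Phylum.Division)
bacteria_sum <- aggregate(reads ~ Location + Phylum.Division,
                          data = bacteria, FUN = sum)

# geom_col() is equivalent to geom_bar(stat = "identity").
# Type p_stacked at the console to draw the chart.
p_stacked <- ggplot(bacteria_sum,
                    aes(x = Location, y = reads, fill = Phylum.Division)) +
  geom_col()
```

If unique() shows only the labels you expect, the cause is something else, and a reproducible example is still the quickest route to an answer.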

Related

Making a geom_bar from a dataframe in R

Background
I have a dataframe, df, of athlete injuries:
df <- data.frame(number_of_injuries = c(1,2,3,4,5,6),
number_of_people = c(73,52,43,12,7,2),
stringsAsFactors=FALSE)
The Problem
I'd like to use ggplot2 to make a bar chart or histogram of this simple data using geom_bar or geom_histogram. Important point: I'm pretty novice with ggplot2.
I'd like something where the x-axis shows bins of the number of injuries (number_of_injuries), and the y-axis shows the counts in number_of_people. Like this (from Excel):
What I've tried
I know this is the most trivial dang ggplot issue, but I keep getting errors or weird results, like so:
ggplot(df, aes(number_of_injuries)) +
geom_bar(stat = "count")
Which yields:
I've been on the tidyverse reference website for an hour trying to work this out and I can't crack it.
This can cause confusion from time to time. If you already have the counts, do not count the data again with geom_bar(stat = "count"); otherwise you simply get 1 in every category. Plot the values as they are with geom_col:
ggplot(df, aes(x = number_of_injuries, y = number_of_people)) + geom_col()
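To see the difference without looking at the picture, here is a small sketch that reuses the df from the question and inspects what each layer computes. layer_data() is a ggplot2 helper that returns the data a layer actually draws; the object names p_count and p_col are just for this example.

```r
library(ggplot2)

df <- data.frame(number_of_injuries = c(1, 2, 3, 4, 5, 6),
                 number_of_people   = c(73, 52, 43, 12, 7, 2))

# geom_bar() defaults to stat = "count": it counts rows per x value,
# and each number_of_injuries value appears in exactly one row.
p_count <- ggplot(df, aes(number_of_injuries)) + geom_bar()
layer_data(p_count)$count   # every bar has a count of 1

# geom_col() uses the y values as given.
p_col <- ggplot(df, aes(number_of_injuries, number_of_people)) + geom_col()
layer_data(p_col)$y         # the bar heights are number_of_people
```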

What is the purpose of using facet_grid(variable ~ .) instead of just using facet_wrap?

So I'm teaching myself R right now using this online resource: "https://r4ds.had.co.nz/data-visualisation.html#facets"
This particular section is going over the use of facet_wrap and facet_grid. It's clear to me that facet_grid is primarily used when wanting to visualize a plot along two additional dimensions, rather than just one. What I don't understand is why you can use facet_grid(.~variable) or facet_grid(variable~.) to basically achieve the same result as facet_wrap. Putting a "." in place of a variable results in just not faceting along the row or column dimension, or in other words showing 1 additional variable just as facet_wrap would do.
If anyone can shed some light on this, thank you!
If you use facet_grid, the facets will always be in one row/column. They will never wrap to make a rectangle. But really if you just have one variable with few levels, it doesn't much matter.
You can also see that facet_grid(.~variable) and facet_grid(variable~.) will put the facet labels in different places (row headings vs column headings)
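library(ggplot2)
mg <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()
# vs on the rows: facet labels appear as row headings
mg + facet_grid(vs ~ .) + labs(title = "facet_grid(vs ~ .)")
# vs on the columns: facet labels appear as column headings
mg + facet_grid(. ~ vs) + labs(title = "facet_grid(. ~ vs)")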
So in the most simple of cases, there's nothing that different between them. The main reason to use facet_grid is to have a single, common axis for all facets so you can easily scan across all panels to make a direct comparison of data.
Actually, the same result is not produced all the time...
With facet_grid, the number of facets laid out across the graphics pane is fixed (always the number of unique values in the variable), whereas facet_wrap, as its name suggests, wraps the facets around the graphics pane. So the two functions only produce the same graph when the number of facets is small.
facet_grid takes its argument in the form rows ~ columns, and nowadays we don't need the dot with facet_grid. facet_wrap accepts a formula as well, but it treats all the variables as a single sequence of panels and wraps them.
To compare the two, let's add a new variable with 8 unique values to the mtcars data set:
library(tidyverse)
mtcars$example <- rep(1:8, length.out = 32)
ggplot()+
geom_point(data = mtcars, aes(x = mpg, y = wt))+
facet_grid(~example, labeller = label_both)
Which results in a cluttered plot:
Compared to:
ggplot()+
geom_point(data = mtcars, aes(x = mpg, y = wt))+
facet_wrap(~example, labeller = label_both)
Which results in:
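As a further sketch of the same comparison, facet_wrap also lets you choose the shape of the wrapped layout with ncol (or nrow), which facet_grid cannot do. The copy named cars and the object names are only for this example, so that mtcars itself is left unmodified.

```r
library(ggplot2)

cars <- mtcars
cars$example <- rep(1:8, length.out = nrow(cars))

base <- ggplot(cars, aes(x = mpg, y = wt)) + geom_point()

# facet_grid: one column per value of example, always a single row.
p_grid <- base + facet_grid(cols = vars(example), labeller = label_both)

# facet_wrap: panels wrap; ncol = 4 asks for four columns, so the
# eight panels fill two rows.
p_wrap <- base + facet_wrap(vars(example), ncol = 4, labeller = label_both)
```

Typing p_grid or p_wrap at the console draws the respective plot.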

geom_line not outputting connected points

I have the attached data frame. I want to create a line graph with ggplot that plots Total against Year, with a separate line for each offence category. I have used the following code, but I suspect it is wrong: the output has no connected lines and looks more like a set of vertical lines. Any help is much appreciated :)
Dataframe
The code I have tried is:
ggplot(data = annual, aes(x = as.numeric(Year), y = Total, group = `Offence Category`)) +
geom_line()
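The data frame is not shown, so the following is only a sketch on made-up data with the same column names (Year, Total, Offence Category). It covers two things that commonly go wrong in this setup. A column name containing a space must be wrapped in backticks. And if Year is stored as a factor, as.numeric(Year) returns the level codes rather than the years, so it is converted via character first. Whether either applies to the original data is an assumption.

```r
library(ggplot2)

# Hypothetical stand-in for the data frame in the question.
annual <- data.frame(
  Year = factor(rep(2015:2018, times = 2)),
  `Offence Category` = rep(c("Burglary", "Theft"), each = 4),
  Total = c(120, 110, 95, 90, 300, 310, 280, 260),
  check.names = FALSE  # keep the space in the column name
)

# as.numeric() on a factor returns level codes, so convert via character.
annual$Year <- as.numeric(as.character(annual$Year))

# Backticks are needed because the column name contains a space.
# group gives one line per category; colour tells the lines apart.
p_lines <- ggplot(annual,
                  aes(x = Year, y = Total,
                      group = `Offence Category`,
                      colour = `Offence Category`)) +
  geom_line()
```

It is also worth checking with str(annual) that Total is numeric; a character or factor Total would be placed on a discrete axis.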

Plot two boxplots in one figure

I am trying to make one figure with two categories of data, which looks like:
A comparison between two groups (indicated by pink and black) concerning various different species
It seems the author of this figure put two boxplots into one figure. I constructed a similar boxplot in R with the code below:
{library(reshape2)
species_melt <- melt(species, "Species")
library(ggplot2)
p<-ggplot(species_melt, aes(Species, value),color="Red") + geom_boxplot()
windowsFonts(myFont1=windowsFont("Arial"),myFont2=windowsFont("Times New Roman"))
p+scale_y_log10()}
Which generates a boxplot like the one below (partly):
Thus I wonder how I could add another layer of boxplot on it, yet it seems difficult with R.
It's hard to test without having your data, but something like this should work:
library(ggplot2)
ggplot() +
geom_boxplot(data = species_melt_1,
aes(Species, value),
fill = "#ff84b3", color = "#994f6b") +
geom_boxplot(data = species_melt_2,
aes(Species, value),
alpha = 0, color = "black")
I'm using two geom_boxplot layers with different data sets (species_melt_1 and species_melt_2). The first is filled pink; the second has a transparent fill (alpha = 0) and a black outline, so it is drawn on top without hiding the first.
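If the two groups can be combined into one long data frame, an alternative is to map the group to fill and let ggplot place the boxes side by side. This is a different technique from the overlay above, shown only as a sketch: the data frame species_long, its group column, and the values are all hypothetical.

```r
library(ggplot2)

set.seed(1)
# Hypothetical long-format data: one row per measurement.
species_long <- data.frame(
  Species = rep(c("sp_a", "sp_b", "sp_c"), each = 20),
  group   = rep(c("case", "control"), times = 30),
  value   = rlnorm(60, meanlog = 2)
)

# fill = group produces one box per Species and group combination;
# position_dodge() puts the two boxes for a species next to each other.
p_box <- ggplot(species_long, aes(Species, value, fill = group)) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  scale_fill_manual(values = c(case = "#ff84b3", control = "grey30")) +
  scale_y_log10()
```

The advantage of a single data frame is that the legend and the dodging come for free; the two-layer approach is the better fit when the boxes should sit on top of each other.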

R incorrect y-axis in ggplots geom_bar()

I have a dataframe with Wikipedia edits, with information about the number of edit for the user (1st edit, 2nd edit and so on), the timestamp when the edit was made, and how many words were added.
In the actual dataset, I have up to 20,000 edits per user, and some edits add up to 30,000 words.
However, here is a downloadable small example dataset to exemplify my problem. The header looks like this:
I am trying to plot the distribution of added words across the edit progression and across time. If I use the regular R barplot, it works just as expected:
barplot(UserFrame3$NoOfAdds,UserFrame3$EditNo)
But I want to do it in ggplot for nicer graphics and more customizing options.
If I plot this as a scatterplot, I get the same result:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) + geom_point(size = 0.1)
Same for a linegraph:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) +geom_line(size = 0.1)
But when I try to plot it as a bargraph in ggplot, I get this result:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) + geom_bar(stat = "identity", position = "dodge")
There appear to be a lot more holes on the X-axis and the maximum is nowhere close to where it should be (y = 317).
I suspect that ggplot somehow groups the bars and uses means instead of the actual values, despite the "dodge" parameter. How can I avoid this? And how would I plot the time progression as a bar graph as well, without ggplot averaging over multiple edits?
You should expect more x-axis "holes" using bars as compared with lines. Lines connect the zero values together, bars do not.
I used geom_col with your data download, it looks as expected:
library(tidyverse)  # provides ggplot2 and the %>% pipe
UserFrame3 %>%
ggplot(aes(EditNo, NoOfAdds)) + geom_col()

Resources