Mark a portion of a bar chart ggplot - r

I have data for number of cars sold each year for different brands like this:
But I also have data for how many of the cars sold were cars with a diesel engine for each one of the brands and years.
I want to be able to stack the charts in a bar chart and also add a second dimension to each class, showing how many of the cars that have a diesel engine of the specific brand (e.g. BMW). I want to do it either by colour, or by lines like below:
Is it possible to do that with ggplot in R?
Edit:
My data:
The data looks like this in Excel:
Total cars sold:
Year  BMW  Volvo  Audi
2010  50   400    50
2011  75   450    35
2012  45   350    55
Proportion of diesel cars:
Year  BMW          Volvo  Audi
2010  0.2          0.2    0.5
2011  0.293333333  0.5    0.571428571
2012  0.488888889  0.5    0.272727273

You will need to do a bit of data preparation to make it easier to plot, but once you do this type of thing a few times, it becomes quite straightforward. I highly recommend reading about Tidy Data Principles, which I'll apply here.
Data
In the future, please post your dataframes via the output of dput(data.frame), but your tables are small, so import isn't that difficult:
df1 <- data.frame(year=c(2010:2012), BMW=c(50,75,45), Volvo=c(400,450,350), Audi=c(50,35,55))
df2 <- data.frame(year=c(2010:2012), BMW=c(0.2, 0.29333333, 0.4888888), Volvo=c(0.2,0.5,0.5), Audi=c(0.5,0.571428571,0.2727272727272))
Your data should be converted into Tidy Data, in which the key principle is that each row is an observation, each variable is one column, and each cell holds the value of that variable for that observation. Consider your first table, where only 3 pieces of information (variables) are changing: Year, Model, and number of cars sold. As such, we need to combine the three columns for BMW, Volvo, and Audi into two: one for Model and one for number sold. You can do that by using gather() from tidyr (or a few other ways). Similarly, we need to combine the columns in the second dataset.
Then, you can merge the two datasets together. Finally, I multiply the total sold by the proportion that are diesel to get the number of diesel cars, and subtract that from the total to get the number that are not diesel. In this way, we create the final dataframe used for plotting:
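library(tidyr)  # provides gather() and re-exports %>%
df1.1 <- df1 %>% gather(key='Model', value='Total_Sold', -year)
df2.1 <- df2 %>% gather(key='Model', value='prop_diesel', -year)
df <- merge(df1.1, df2.1)  # joins on the shared columns year and Model
df$diesel <- df$Total_Sold * df$prop_diesel
df$non_diesel <- df$Total_Sold - df$diesel
# stack diesel and non_diesel into type/sold, keeping the first four columns
df <- df %>% gather(key='type', value='sold', -(1:4))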
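In current versions of tidyr, gather() is superseded by pivot_longer(). As a rough sketch (not part of the original answer), the same preparation could be written as follows. The name df_long and the use of dplyr's inner_join() and mutate() in place of merge() are choices made for this illustration only.

```r
library(tidyr)
library(dplyr)

# Same tables as above: totals sold and proportion of diesel cars
df1 <- data.frame(year = 2010:2012, BMW = c(50, 75, 45),
                  Volvo = c(400, 450, 350), Audi = c(50, 35, 55))
df2 <- data.frame(year = 2010:2012, BMW = c(0.2, 0.29333333, 0.4888888),
                  Volvo = c(0.2, 0.5, 0.5),
                  Audi = c(0.5, 0.571428571, 0.2727272727272))

# One row per year/Model combination in each table
sold <- pivot_longer(df1, -year, names_to = "Model", values_to = "Total_Sold")
prop <- pivot_longer(df2, -year, names_to = "Model", values_to = "prop_diesel")

# Join, split each total into diesel / non-diesel, then go long once more
df_long <- inner_join(sold, prop, by = c("year", "Model")) %>%
  mutate(diesel = Total_Sold * prop_diesel,
         non_diesel = Total_Sold - diesel) %>%
  pivot_longer(c(diesel, non_diesel), names_to = "type", values_to = "sold")
```

The result has one row per year, Model, and type, which is the shape the plotting code expects.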
Plot
To create the plot, the best way to show this is probably a column plot, stacking "non-diesel" and "diesel" on top of one another so you can compare the total amount across each make per year while also estimating the proportion of diesel/non-diesel. Ideally we would use dodging (separating columns out by make where they share the same x axis value) as well as stacking (stacking info on diesel vs. non-diesel). You can't do both at the same time in a column plot, but faceting gives the same effect. Here you assign Model to the x axis, use stacking for the amount sold, and then facet to create the subsets per year. Here's the code and result:
library(ggplot2)
ggplot(df, aes(x=Model, y=sold)) +
  geom_col(aes(fill=type), position='stack') +
  facet_wrap(~year)

Related

Collapse and Sum Data along multiple groupings in R

I have the following data table in R, which I need to collapse for streamlined data processing. I can do this manually, but I am looking for the most efficient way possible. The data frame looks like this:
and so on. Each age group has 4 observations, 2 male and 2 female (1 of each type). And region consists of city1, city2, city3, etc. which are all ordered the same as the example above. After all age groups are exhausted, the next cityX begins.
I need to combine gender into the total, summing males and females (within type). I also need to combine all age groups to give a population total (sum all age groups). I need to keep type separate, and then later combine them as an additional column. I want the final rows output to be the region. I need the population totals for each year column. So the final output would be like this:
I know this could be done manually by splitting the data frame repeatedly, but what would be the most efficient way to do this?

How would you create categorical "bins" for a boxplot over time in R?

Been working on this and haven't been able to find a decent answer.
Basically, I've got a dataset of NBA Player height vs draft year, and I am trying to create a boxplot to show how player height has changed over time (this is for a hw assignment, so a boxplot is necessary). My dataset (nba_data) looks like the table below, but I have 10k rows ranging from players drafted in the 60s all the way to the 2000s.
player_name  draft_year  height_in
player_a     1998        76
player_b     1972        81
player_c     2012        79
So far the closest I've gotten is
ggplot(data = nba_data, aes(x = draft_year,
                            y = height_in,
                            group = cut(x = draft_year, breaks = 5))) +
  geom_boxplot()
And this is the result I get. As far as I understand, breaks being set to 5 should separate my years into 5 year buckets, right?
I created the same graph in Excel to get an idea of what it should look like:
I also attempted to create categories with cut, but was unable to apply it to my boxgraph. I mostly code in Python, but have to learn R for a class at school - any help is greatly appreciated.
Thanks!
Edit: Another question I guess would be how the "Undrafted" players would fit into this, since R seems to want to coerce the draft_year column as numerical to fit into a box plot.
From the ?cut help page, the breaks argument is:
breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You gave it a single number, so that's interpreted as the number of intervals.
Instead, you should give it a vector of exact breakpoints, something like breaks = seq(1960, 2020, by = 5).
I'm surprised you think your axis is being numericized. It's definitely a continuous axis, but I've never heard of ggplot doing that to a string or factor input. Check your data frame to make sure the "Undrafted" rows are really there; they might have been dropped or converted to NA at some point. A numeric column is actually what you want for cut, because cut only works on numerics. I'd suggest cutting the column as numeric to create a bin column, and then replacing the NAs in the bin column with "Undrafted".
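As a minimal base R sketch of that suggestion: the four-row nba_data below is a hypothetical stand-in for the real data, and the column name draft_bin, the break range, and the right = FALSE choice are assumptions made for illustration.

```r
# Hypothetical sample standing in for nba_data
nba_data <- data.frame(
  player_name = c("player_a", "player_b", "player_c", "player_d"),
  draft_year  = c("1998", "1972", "2012", "Undrafted"),
  height_in   = c(76, 81, 79, 78),
  stringsAsFactors = FALSE
)

# "Undrafted" cannot be parsed as a number, so it becomes NA
year_num <- suppressWarnings(as.integer(nba_data$draft_year))

# Explicit 5-year breakpoints; right = FALSE gives bins like [1995,2000)
bins <- cut(year_num, breaks = seq(1960, 2020, by = 5),
            right = FALSE, dig.lab = 4)

# Turn the factor into text and label the NA rows
bins <- as.character(bins)
bins[is.na(bins)] <- "Undrafted"
nba_data$draft_bin <- bins
```

With a bin column like this, the boxplot can map x = draft_bin directly, so the "Undrafted" players get their own box alongside the 5-year bins.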
If you don't mind using a package, you can get the effect you want with:
library(santoku)
ggplot(..., aes(..., group = chop_width(draft_year, 5)))

Creating a subset of a dataset by taking the mean values for each date in R

Using Rstudio with tidyverse plugin, using ggplot2 to plot:
Say we have a dataset called SoccerTeam, this data set consists of variables: Location, Goals, YearPlayed, etc... and each data entry is assigned to a game, so the game was played at Location X, they scored Y Goals, It was played in year 19XX.
In the YearPlayed we have all the years the team has been active for, say years 1950 to 2020 and there is a whole season of data for each year.
Let's say that 2002 has 30 games, so there would be 30 data entries that have YearPlayed = 2002.
Our goal is to plot over time how many goals the team has scored. If we take into account every single game from each year and plot it over the 70 years of play, our graph would be very messy and hard to interpret. To tackle this issue, I would like to take the average goals for each year and plot that over time. How would I do this?
If you need a general introduction to data wrangling in R, I recommend R for Data Science. That said, you need to group by the column YearPlayed, and then compute the mean for each year. Then, pipe it into the plot commands. The %>% operator sends the left side's output into the right side, so you can chain them together like this:
library(dplyr)
library(ggplot2)
SoccerTeam %>%
  group_by(YearPlayed) %>%
  summarize(Goals = mean(Goals)) %>%  # one row per year with the mean goals
  ggplot(aes(x=YearPlayed, y=Goals)) +
  geom_line()
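To see what the grouping step produces before it reaches the plot, here is a small sketch on made-up data. The five-row SoccerTeam below and the name yearly are hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical stand-in for SoccerTeam: one row per game
SoccerTeam <- data.frame(
  YearPlayed = c(2001, 2001, 2002, 2002, 2002),
  Goals      = c(1, 3, 0, 3, 6)
)

# Collapse the games to one row per year holding the mean goals
yearly <- SoccerTeam %>%
  group_by(YearPlayed) %>%
  summarize(Goals = mean(Goals))

yearly
```

Here yearly has two rows, one for 2001 and one for 2002, and that summarized table is what geom_line() draws.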

ggplot holes in stacked area chart

Here is a link to my data.
I use the following code:
library(dplyr)
library(ggplot2)
#read in data
data = read.csv("ggplot_data.csv")
#order by group then year
data = arrange(data, group, year)
#generate ggplot stacked area chart
plot = ggplot(data, aes(x=year,y=value, fill=group)) +
  geom_area()
plot
That produces the following chart:
As you can see, there are odd holes in three different parts of this chart.
I previously had this issue and asked about it, and the answer provided then was that I needed to sort my data by group and then year. At the time, that answer fixed my holes. However, it doesn't seem to eliminate all the holes this time. Any help?
The reason for the gaps is that some time series start later than others. When the first non-vanishing value appears, the new area starts with a discontinuous jump. The area just above it, however, is connected to the next point by linear interpolation. This results in the gap.
For example, look at the left-most gap. The olive region starts just after the gap with a vertical jump in 1982. The green area, however, increases linearly from the value in 1981 (where the olive area is zero) to the value in 1982 (where the olive area suddenly contributes).
What you could do is, for instance, add a value of zero at the beginning of each time series that starts after 1975. I use dplyr functionality to create a data frame of these additional first years:
first_years <- group_by(data, group, group_id) %>%
  summarise(year = min(year) - 1) %>%
  filter(year > 1974) %>%
  mutate(value = 0, value_pct = 0)
first_years
## Source: local data frame [3 x 5]
## Groups: group [3]
##
## group group_id year value value_pct
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 c 10006 1981 0 0
## 2 e 10022 2010 0 0
## 3 i 24060 2002 0 0
As you can see, these three new values fit exactly the three gaps in your plot. Now, you can combine these new data frames with your data and sort in the same way as before:
data_complete <- bind_rows(data, first_years) %>%
  arrange(group, year)
And the plot then has no gaps:
ggplot(data_complete, aes(x=year,y=value, fill=group)) +
  geom_area()
@Stibu's answer is probably best, but for those of us who are not very R-savvy and don't know how to go through a dataset with R to find missing rows and fill them with zeros, I solved this issue with a slightly different approach.
In my case, I created a dummy dataset with zeroes for all years and all groups, then appended it to my original dataset. This added rows for the years that previously had no rows of data at all. After aggregating by year and group, my aggregated dataset contained rows with zero instead of missing rows. This removed all those weird gaps for me.
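A similar zero-filling effect can be sketched with complete() from tidyr, which adds the missing group/year combinations directly instead of appending a dummy dataset and aggregating. This is a swapped-in technique rather than the exact steps described above, and the toy data and the name data_filled are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: group "b" has no row for the year 2000
data <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  year  = c(2000, 2001, 2002, 2001, 2002),
  value = c(5, 6, 7, 2, 3)
)

# Add every missing group/year combination, filling value with 0
data_filled <- complete(data, group, year, fill = list(value = 0))
```

After this step every group has a row for every year, so the stacked areas all start and end at the same x positions.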
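A simpler option is to add position = "identity" to geom_area(), e.g. applied to your code above:
ggplot(data, aes(x=year, y=value, fill=group)) +
  geom_area(position = "identity")
Note that this draws each area from zero, so the areas overlap instead of stacking.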
I found it simpler to save my table to CSV and use matplotlib's stackplot function in Python, which does not seem to have issues with negative numbers.

Using FOR loop for finding out the sum of variables?

I have a data frame that has 6,497,651 observations of 6 variables that I got from the National Emissions Inventory website and it has the following variables:
fips SCC Pollutant Emissions type year
09001 10100401 PM25 15.14 POINT 1999
09001 10100402 PM25 234.75 POINT 1999
Where fips is the county code, SCC is name of the source string, Pollutant is the type of pollutant (PM2.5 emission in this case), Emissions indicates the amount of the pollutant emitted in tons, type is the type of source where pollutant was emitted (road, non-road, point, etc) and year notes down years from 1999 to 2008.
Basically, I have to plot a simple line plot to showcase the change in the level of emissions according to each year. Now, the year 1999 alone has over a thousand observations; same goes for the rest of the years till 2008. The problem is not at all difficult since I can easily form a new data frame for each year with the sum of all the emissions recorded and then row bind all those subsetted data frames. But a more efficient and tidier way to accomplish this might be to use the FOR loop where I can calculate the sum of all the values under 'Emissions' according to each year and store all that information into a new data frame, but I am stuck on where to start. How do I enter the exact syntax that will calculate the sum of values according to each year? I should be having a data frame that looks something like this:
Year Emissions
Where Emissions notes down the sum of values of all emissions in that specific year.
The data.table package is probably the most efficient package for handling things like this. The syntax to calculate the sum of emissions for every year looks like this (assuming your data is stored in dt):
library(data.table)
dt=data.table(dt)
dt[,.(Emissions=sum(Emissions)),by=year]
A dplyr/ggplot option. We group by 'year', get the sum of 'Emissions' using summarise and plot with ggplot.
library(dplyr)
library(ggplot2)
df1 %>%
  group_by(year) %>%
  summarise(Emissions=sum(Emissions)) %>%
  ggplot(., aes(x=year, y=Emissions))+
  geom_line()
Or this can be done directly within ggplot
ggplot(df1, aes(x=year, y=Emissions)) +
  stat_summary(fun='sum', geom='line')  # 'fun' replaces the deprecated 'fun.y' (ggplot2 >= 3.3.0)
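Since the question asked specifically about a FOR loop, here is a base R sketch of that approach next to the aggregate() one-liner that gives the same sums. The five-row nei data frame and the names totals and totals_agg are hypothetical and only serve the illustration.

```r
# Hypothetical miniature of the emissions data
nei <- data.frame(
  year      = c(1999, 1999, 2002, 2002, 2005),
  Emissions = c(15.14, 234.75, 10, 20, 5)
)

# Explicit loop: one sum per distinct year
years  <- sort(unique(nei$year))
totals <- data.frame(Year = years, Emissions = NA_real_)
for (i in seq_along(years)) {
  totals$Emissions[i] <- sum(nei$Emissions[nei$year == years[i]])
}

# Base R one-liner that computes the same yearly sums
totals_agg <- aggregate(Emissions ~ year, data = nei, FUN = sum)
```

The loop works, but on millions of rows the grouped approaches shown in the answers above are usually the more convenient choice.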
