ggplot holes in stacked area chart - r

Here is a link to my data.
I use the following code:
library(dplyr)
library(ggplot2)
#read in data
data = read.csv("ggplot_data.csv")
#order by group then year
data = arrange(data, group, year)
#generate ggplot stacked area chart
plot = ggplot(data, aes(x=year,y=value, fill=group)) +
  geom_area()
plot
That produces the following chart:
As you can see, there are odd holes in three different parts of this chart.
I previously had this issue and asked about it, and the answer provided then was that I needed to sort my data by group and then year. At the time, that answer fixed my holes. However, it doesn't seem to eliminate all the holes this time. Any help?

The reason for the gaps is that some time series start later than others. When the first non-zero value appears, the new area starts with a discontinuous jump. The area just above it, however, is connected to the next point by linear interpolation. This results in the gap.
For example, look at the left-most gap. The olive region starts just after the gap with a vertical jump in 1982. The green area, however, increases linearly from the value in 1981 (where the olive area is zero) to the value in 1982 (where the olive area suddenly contributes).
What you could do is, for instance, add a value of zero at the beginning of each time series that starts after 1975. I use dplyr functionality to create a data frame of these additional first years:
first_years <- group_by(data, group, group_id) %>%
  summarise(year = min(year) - 1) %>%
  filter(year > 1974) %>%
  mutate(value = 0, value_pct = 0)
first_years
## Source: local data frame [3 x 5]
## Groups: group [3]
##
##    group group_id  year value value_pct
##   (fctr)    (int) (dbl) (dbl)     (dbl)
## 1      c    10006  1981     0         0
## 2      e    10022  2010     0         0
## 3      i    24060  2002     0         0
As you can see, these three new values fit exactly the three gaps in your plot. Now you can combine this new data frame with your data and sort the result:
data_complete <- bind_rows(data, first_years) %>%
  arrange(year, group)
And the plot then has no gaps:
ggplot(data_complete, aes(x=year,y=value, fill=group)) +
  geom_area()

@Stibu's answer is probably best, but for those of us who are not very R-savvy and don't know how to go through a dataset with R to find missing rows and fill them with zeros, I solved this issue with a slightly different approach.
In my case, I created a dummy dataset with zeroes for all years and all groups, then appended it to my original dataset. This way I added rows for years where before there were simply no rows of data. After aggregating by year and group, my aggregated dataset contained rows with zero, as opposed to no rows existing at all. This removed all those weird gaps for me.
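For what it's worth, the zero-filling idea above can also be sketched with tidyr's complete(), which adds every missing year/group combination in one step. This is only a sketch under assumptions: the toy data frame and its values are made up for illustration, and unlike Stibu's approach it adds a zero for every missing year, not just the year before each series starts.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values): group "b" has no rows before 2002.
toy <- tibble(
  year  = c(2000, 2001, 2002, 2002),
  group = c("a", "a", "a", "b"),
  value = c(1, 2, 3, 4)
)

# complete() adds a row for every year/group combination that is missing;
# fill = list(value = 0) sets value to 0 in those added rows.
toy_complete <- toy %>%
  complete(year, group, fill = list(value = 0)) %>%
  arrange(group, year)

toy_complete
```

The completed data frame can then be passed to ggplot() with geom_area() exactly as in the question.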

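Best is to simply add position = "identity" to geom_area(), e.g. from your code above:
ggplot(data, aes(x=year,y=value, fill=group)) +
  geom_area(position = "identity")
Note that with position = "identity" the areas overlap instead of being stacked.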

I found it simpler to save my table to CSV and use the stackplot function from Python's matplotlib (demo), which does not seem to have issues with negative numbers.

Related

How would you create categorical "bins" for a boxplot over time in R?

Been working on this and haven't been able to find a decent answer.
Basically, I've got a dataset of NBA player height vs draft year, and I am trying to create a boxplot to show how player height has changed over time (this is for a hw assignment, so a boxplot is necessary). My dataset (nba_data) looks like the table below, but I have 10k rows ranging from players drafted in the 60s all the way to the 2000s.
player_name  draft_year  height_in
player_a     1998        76
player_b     1972        81
player_c     2012        79
So far the closest I've gotten is
ggplot(data = nba_data, aes(x = draft_year,
                            y = height_in,
                            group = cut(x = draft_year, breaks = 5))) +
  geom_boxplot()
And this is the result I get. As far as I understand, breaks being set to 5 should separate my years into 5 year buckets, right?
I created the same graph in Excel to get an idea of what it should look like:
I also attempted to create categories with cut, but was unable to apply it to my box plot. I mostly code in Python, but have to learn R for a class at school, so any help is greatly appreciated.
Thanks!
Edit: Another question I guess would be how the "Undrafted" players would fit into this, since R seems to want to coerce the draft_year column as numerical to fit into a box plot.
From the ?cut help page, the breaks argument is:
breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You gave it a single number, so that's interpreted as the number of intervals.
Instead, you should give it a vector of exact breakpoints, something like breaks = seq(1960, 2020, by = 5).
I'm surprised you think your axis is being coerced to numeric. It's definitely a continuous axis, but I've never heard of ggplot doing that to a string or factor input. Check your data frame to make sure the "Undrafted" rows are really there; they might have gotten dropped or converted to NA at some point. But that's a good thing for cut, because cut will only work on numerics. I'd suggest cutting the column as numeric to create a bin column, and then replacing the NAs in the bin column with "Undrafted".
If you don't mind using a package, you can get the effect you want with:
library(santoku)
ggplot(..., aes(..., group = chop_width(draft_year, 5)))
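The base-R route described above (explicit break points, then relabelling the NAs) could look roughly like the sketch below. The four-row nba_data is a made-up stand-in for the real data, and it assumes draft_year is stored as text because of the "Undrafted" entries.

```r
# Hypothetical sample; draft_year is character because of "Undrafted".
nba_data <- data.frame(
  player_name = c("player_a", "player_b", "player_c", "player_d"),
  draft_year  = c("1998", "1972", "2012", "Undrafted"),
  height_in   = c(76, 81, 79, 78)
)

# Non-numeric entries such as "Undrafted" become NA (the warning is silenced).
year_num <- suppressWarnings(as.numeric(nba_data$draft_year))

# Explicit break points give 5-year bins; right = FALSE makes each bin
# include its lower bound, and dig.lab = 4 keeps four-digit years in the labels.
bins <- cut(year_num, breaks = seq(1960, 2020, by = 5),
            right = FALSE, dig.lab = 4)

# Convert the factor to character so the NA bin can be relabelled.
nba_data$bin <- as.character(bins)
nba_data$bin[is.na(nba_data$bin)] <- "Undrafted"

nba_data
```

The bin column can then go on the x axis, e.g. aes(x = bin, y = height_in), so each 5-year bucket and the "Undrafted" group get their own box.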

ggplot geom_boxplot for gene expression data

I am trying to get boxplots for 4 different genes with the expression data for each gene across multiple patients.
I've tried multiple ways and just keep hitting errors. I can do it using the base boxplot() function, but can't figure it out in ggplot and I can't see anywhere to help - spent hours reading other answers and questions yesterday! Mostly all other data seems to be as 2 columns so can specify x = column a and y = column b. However, I want to plot all 4 columns of my entire df and I couldn't find any help with that. I can do one at a time in ggplot but not all 4 together.
The data I have, BCON_sig_genes, is 4 genes each with values between 3-6 for 152 samples. The df is 152 obs of 4 variables, where the 4 columns are headed by the gene names and all the cells are values, as shown below.
         CD3E      LAT    ZAP70      LCK
1002 4.214679 5.652482 4.788204 5.393783
1022 4.424925 5.776641 4.864269 5.593587
8035 4.327270 5.725364 4.509920 4.961659
8037 4.415715 5.494048 4.435241 5.081846
9004 4.290078 5.265329 4.799106 5.275424
9005 4.233490 5.338098 4.666506 5.069394
The following code gets me one gene at a time, by substituting in the name of the gene.
BCON_sig_genes %>% ggplot(aes(y = CD3E, x = "CD3E"))+ geom_boxplot()
(Image: ggplot boxplot, 1 gene only)
I have tried gene <- colnames(BCON_sig_genes) and then inputting x = gene but it doesn't work and comes up with the following error message:
Error: Aesthetics must be either length 1 or the same as the data (152): x
I think I need to sort out what y is. I tried leaving blank so it would take all the data and sort for each column but no luck.
I tried using a gather() function and making key and value but I couldn't quite figure it out without getting errors... but this felt like I was on the right track!
With the base function all I have to do is boxplot(BCON_sig_genes) and it just plots all 4 genes on a graph with the correct values. (Image: base function boxplot, all genes)
I think I need to wrangle the data better for ggplot so I can tell it that y is just all the expression values for each column but I'm not sure how.
Any help would be much appreciated!!
Thanks, Vicky
For ggplot to work, you need to get the data into long format, which basically means you get the gene names in column 1 and their expression values in column 2. You had the right idea with gather, but gather is being replaced with pivot_longer.
library(tidyverse)
BCON_sig_genes %>%
  pivot_longer(cols = CD3E:LCK,
               names_to = "gene",
               values_to = "expression") %>%
  ggplot(aes(x = gene,
             y = expression)) +
  geom_boxplot()

Mark a portion of a bar chart ggplot

I have data for number of cars sold each year for different brands like this:
But I also have data for how many of the cars sold were cars with a diesel engine for each one of the brands and years.
I want to stack the brands in a bar chart and also add a second dimension to each class, showing how many of the cars sold by a specific brand (e.g. BMW) have a diesel engine. I want to do it either by colour or by lines, like below:
Is it possible to do that with ggplot in R?
Edit:
My data:
The data looks like this in Excel:
Cars sold:
      BMW  Volvo  Audi
2010   50    400    50
2011   75    450    35
2012   45    350    55
Share of diesel cars:
      BMW          Volvo  Audi
2010  0.2          0.2    0.5
2011  0.293333333  0.5    0.571428571
2012  0.488888889  0.5    0.272727273
You will need to do a bit of data preparation to make it easier to plot, but once you do this type of thing a few times, it becomes quite straightforward. I highly recommend reading about Tidy Data Principles, which I'll apply here.
Data
In the future, please post your dataframes via the output of dput(data.frame), but your tables are small, so import isn't that difficult:
df1 <- data.frame(year=c(2010:2012), BMW=c(50,75,45), Volvo=c(400,450,350), Audi=c(50,35,55))
df2 <- data.frame(year=c(2010:2012), BMW=c(0.2, 0.29333333, 0.4888888), Volvo=c(0.2,0.5,0.5), Audi=c(0.5,0.571428571,0.2727272727272))
Your data should be converted into Tidy Data, in which the key principle is that each row is an observation, each variable is one column, and each value represents the value for that column for that observation. Consider your first table, where you have only 3 pieces of information (variables) that are changing: Year, Model, and number of cars sold. As such, we need to combine those three columns for BMW, Volvo, and Audi into two: one for Model and one for number sold. You can do that by using gather() from tidyr (or a few other ways). Similarly, we need to combine columns in the second dataset.
Then, you can merge the two datasets together. Then finally, I use the information from total sold * proportion which are diesel to identify the number of diesel vs. number that are not diesel. In this way, we create the final dataframe used for plotting:
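library(dplyr)
library(tidyr)
# reshape both tables to long format: one row per year and model
df1.1 <- df1 %>% gather(key='Model', value='Total_Sold',-year)
df2.1 <- df2 %>% gather(key='Model', value='prop_diesel',-year)
# merge on the shared columns (year, Model)
df <- merge(df1.1, df2.1)
df$diesel <- df$Total_Sold * df$prop_diesel
df$non_diesel <- df$Total_Sold - df$diesel
# gather the diesel / non_diesel columns into type and sold
df <- df %>% gather(key='type', value='sold', -(1:4))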
Plot
To create the plot, it seems like the best way to show this would be a column plot, stacking "non-diesel" and "diesel" on top of one another so you can compare the total amount across each make per year while also estimating the proportion of diesel/non-diesel. We want to use dodging (separating columns out by make where they share the same x axis value) as well as stacking (stacking info on diesel vs. non-diesel). You can't do both at the same time for a column plot, but faceting gives the same effect. Here you assign Model as the x axis, use stacking for the amount sold, and then use faceting to create the subsets per year. Here's the code and result:
ggplot(df, aes(x=Model, y=sold)) +
  geom_col(aes(fill=type), position='stack') +
  facet_wrap(~year)

Adding rows to data frame with zero values

I have a dataset with multiple records, each of which is assigned a country, and I want to create a worldmap using rworldmap, coloured according to the frequency with which each country occurs in the dataset. Not all countries appear in the dataset - either because they have no corresponding records or because they are not eligible (e.g. middle/low income countries).
To build the map, I have created a dataframe (dfmap) based on a table of countries, where one column is the country code and the second column is the frequency with which it appears in the dataset.
In order to identify on the map countries which are eligible, but have no records, I have tried to use add_row to add these to my dataframe e.g. for Andorra:
add_row(dfmap, Var1="AND", Freq=0)
When I run add_row for each country, it appears to work (no error message and that new row appears in the table below the command) - but previously added rows where the Freq=0 do not appear.
When I then look at the dataframe using "dfmap" or "summary(dfmap)", none of the rows where Freq=0 appear, and when I build the map, they are coloured as for missing countries.
I'm not sure where I'm going wrong and would welcome any suggestions.
Many thanks
Using the method suggested in the comment above, one can use a join and then the replace_na function to create a tibble with the complete list of countries and give the missing ones a count value of zero.
As there was no sample data in the question, I created two data frames below based on what I thought was implied by the question.
library(dplyr)
library(tidyr)
dfrm_counts = tibble(Country = c('England','Germany'),
                     Count = c(1,4))
dfrm_all = tibble(Country = c('England', 'Germany', 'France'))
dfrm_final = dfrm_counts %>%
  right_join(dfrm_all, by = "Country") %>%
  replace_na(list(Count = 0))
dfrm_final
# A tibble: 3 x 2
  Country Count
  <chr>   <dbl>
1 England     1
2 Germany     4
3 France      0
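As a side note on the add_row attempt in the question: add_row() returns a new data frame rather than modifying dfmap in place, so one likely cause (assuming the result was never assigned back) is that each call printed the extended table and then discarded it. A minimal sketch with made-up country codes and counts:

```r
library(tibble)

# Hypothetical stand-in for dfmap.
dfmap <- tibble(Var1 = c("GBR", "DEU"), Freq = c(3, 5))

# Without assignment the extended table is only printed; dfmap is unchanged.
add_row(dfmap, Var1 = "AND", Freq = 0)
rows_before <- nrow(dfmap)

# Assigning the result back keeps the new row.
dfmap <- add_row(dfmap, Var1 = "AND", Freq = 0)
rows_after <- nrow(dfmap)

dfmap
```

For many countries, the join approach above is less repetitive than one add_row call per country.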

How do I plot grouped data sorted by the number of entries in each group, in R?

I have data that look like this sample here: http://pastebin.com/5MPCFGWK
I need to plot each id as a timeline, thus I do something like this.
ggplot(df, aes(x=relative_timestamp, y=id, color=action))
which kind of works, except that it's not the most helpful chart. I figured I'd try to sort the groups by how many events they have, but I can't figure out how. I tried my hand at dplyr but I got confused with the docs, and barely managed to group the dataframe by id. Ideas?
EDIT: I added a sample CSV. My goal is to plot those timelines sorted by how many entries they have, so in this case 0 is the one with the fewest entries and 1 is the one with the most. Even better would be to also plot them (in a separate plot, not the same as above) sorted by the time the last CLOSE action occurs (there should be exactly one in each group anyway).
You will need to convert id from numeric to a factor, and then order those factors by whatever metric you are interested in. Here, I used dplyr to create a data.frame called forSort that holds the id's and a set of things you might want to sort on:
forSort <-
  testDF %>%
  group_by(id) %>%
  summarise(n = n()
            , max = max(relative_timestamp))
forSort
#   id  n    max
# 1  0 12 244753
# 2  1 85 447680
# 3  2 22 156005
By number of actions:
ggplot(testDF %>%
         mutate(id = factor(id, levels = forSort$id[order(forSort$n)]))
       , aes(x=relative_timestamp
             , y= id
             , color=action)) +
  geom_point()
By time of last action:
ggplot(testDF %>%
         mutate(id = factor(id, levels = forSort$id[order(forSort$max)]) )
       , aes(x=relative_timestamp
             , y= id
             , color=action)) +
  geom_point()
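If a separate summary table feels like overkill, the same level ordering can probably be had with base R's reorder(), which sorts factor levels by a per-group summary. This is just a sketch; the small testDF below is a made-up stand-in for the real data.

```r
# Hypothetical stand-in for testDF: id 1 has the most rows, id 0 the fewest,
# and id 2 has the latest timestamp.
testDF <- data.frame(
  id = c(0, 1, 1, 1, 2, 2),
  relative_timestamp = c(10, 5, 20, 30, 7, 40)
)

# reorder() sorts the factor levels by FUN applied to the second argument
# within each level of the first.
by_count <- reorder(factor(testDF$id), testDF$id, FUN = length)
by_last  <- reorder(factor(testDF$id), testDF$relative_timestamp, FUN = max)

levels(by_count)
levels(by_last)
```

Either factor can be used as the y aesthetic in place of id to get the timelines in that order.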

Resources