Subsetting data for ggplot2 - r

I have data saved in multiple datasets, each consisting of four variables. Imagine something like a data.table dt consisting of the variables Country, Male/Female, Birthyear, Weighted Average Income. I would like to create a graph where you see only one country's weighted average income by birthyear and split by male/female. I've used the facet_grid() function to get a grid of graphs for all countries as below.
ggplot() +
geom_line(data = dt,
aes(x = Birthyear,
y = Weighted Average Income,
colour = 'Weighted Average Income'))+
facet_grid(Country ~ Male/Female)
However, I've tried isolating the graphs for just one country, but the below code doesn't seem to work. How can I subset the data correctly?
ggplot() +
geom_line(data = dt[Country == 'Germany'],
aes(x = Birthyear,
y = Weighted Average Income,
colour = 'Weighted Average Income'))+
facet_grid(Country ~ Male/Female)

In your specific case, the problem is that the column names Male/Female and Weighted Average Income are not quoted. Names containing spaces or slashes are not syntactic in R, so they have to be wrapped in backticks. Also, your data and basic aesthetics should probably go in ggplot() rather than geom_line(). Putting them in geom_line() confines them to that single layer, so you would have to repeat them in every other layer you add, for example geom_smooth().
So to fix your problem you could do
library(tidyverse)
library(data.table) # dt[Country == 'Germany'] is data.table syntax
plot <- ggplot(data = dt[Country == 'Germany'],
aes(x = Birthyear,
y = `Weighted Average Income`,
col = `Weighted Average Income`)) + # could use !!sym("x") instead of `x`
geom_line() +
facet_grid(Country ~ `Male/Female`) # backticks are needed in the formula as well
plot
ggplot2 also has a lesser-known built-in way to swap the data of an existing plot, so if you wanted to compare this to the plot with all of your countries included, you could do:
plot %+% dt # `%+%` is used to change the data used by one or more layers. See help("+.gg")
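To make the backtick quoting and the subsetting step concrete, here is a minimal, self-contained sketch. The data frame below is a made-up stand-in for the asker's dt (all values are hypothetical), and it uses a plain data.frame with base subsetting instead of data.table so that only ggplot2 is needed.

```r
library(ggplot2)

# Hypothetical stand-in for dt; the numbers are invented for illustration.
# check.names = FALSE keeps the non-syntactic column names as they are.
dt <- data.frame(
  Country = rep(c("Germany", "France"), each = 6),
  `Male/Female` = rep(rep(c("Male", "Female"), each = 3), times = 2),
  Birthyear = rep(c(1950, 1960, 1970), times = 4),
  `Weighted Average Income` = c(30, 34, 39, 25, 29, 35, 28, 33, 37, 24, 28, 33),
  check.names = FALSE
)

# Keep only the rows for one country before handing the data to ggplot()
germany <- dt[dt$Country == "Germany", ]

# Data and shared aesthetics live in ggplot(), so every layer inherits them
p <- ggplot(germany, aes(x = Birthyear, y = `Weighted Average Income`)) +
  geom_line() +
  geom_point() +
  facet_grid(Country ~ `Male/Female`)
```

Printing p should give one row of panels (Germany) with one column per value of Male/Female.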

Related

Zig Zag when using geom_line with ggplot in R

I would really appreciate some insight on the zagging when using the following code in R:
tbi_military %>%
ggplot(aes(x = year, y = diagnosed, color = service)) +
geom_line() +
facet_wrap(vars(severity))
The dataset is comprised of 5 variables (3 character, 2 numerical). Any insight would be so appreciated.
This is just an illustration with a standard dataset. Let's say we're interested in plotting the weight of chicks over time depending on a diet. We would attempt to plot this like so:
library(ggplot2)
ggplot(ChickWeight, aes(Time, weight, colour = factor(Diet))) +
geom_line()
The zigzag pattern appears because there are multiple observations per diet at each time point. geom_line() connects the observations in order of their x value, so several points that share the same x within one group are joined by a vertical segment spanning the range of the data at that time for that diet.
The data has an additional variable called 'Chick' that separates out individual chicks. Including that in the grouping resolves the zigzag pattern and every line is the weight over time per individual chick.
ggplot(ChickWeight, aes(Time, weight, colour = factor(Diet))) +
geom_line(aes(group = interaction(Chick, Diet)))
If you don't have an extra variable that separates out individual trends, you could instead choose to summarise the data per timepoint by, for example, taking the mean at every timepoint.
ggplot(ChickWeight, aes(Time, weight, colour = factor(Diet))) +
geom_line(stat = "summary", fun = mean)
Created on 2021-08-30 by the reprex package (v1.0.0)
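A quick way to check whether duplicate x values are behind a zigzag is to count the rows per colour group and x value. The sketch below does this in base R for ChickWeight; the names counts and means are just illustrative. The second step computes the per-group means by hand, which is roughly what the stat = "summary" layer plots.

```r
# Number of observations for each Diet/Time combination
counts <- aggregate(weight ~ Diet + Time, data = ChickWeight, FUN = length)
names(counts)[3] <- "n"

# TRUE means at least one group has several y values at the same x,
# which is what produces the vertical zigzag segments
any(counts$n > 1)

# One mean weight per Diet/Time combination, i.e. a single y per x per group
means <- aggregate(weight ~ Diet + Time, data = ChickWeight, FUN = mean)
```

For your own data, the same check would use the variables mapped to colour and x (service and year in the question), plus the facetting variable.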

How to add ggplots into one if data come from different data sets? And how to add geom ribbon to discrete data?

I have a ggplot of the mean IMDb movie rating per year, and I want to add a ribbon-like layer that shows the standard error for each point but is drawn as a continuous band (if that's even possible).
ggplot(data = avg_imdb_movie_year, aes( x = startYear, y = avg_rating)) +
geom_point() +
geom_ribbon(aes(x = start_Year, y = standard_error, xmin = min(xx), xmax = max(xx)))
The xx is a sequence corresponding to the years of the movies. The standard_error is simply calculated as sd(average_rating) [that is, the difference from the mean for each data point].
I think I'm doing something completely wrong. If my data is discrete, is there a way to draw a ribbon-like standard error band around the mean points?
In addition, I have a question about adding layers that use a different data frame. Here is an example: I want to add another geom_point() layer to this ggplot, where the data would be the average rating of awarded movies per year. But I run into an error:
ggplot(data = avg_imdb_movie_year, aes( x = startYear, y = avg_rating)) +
geom_point() +
geom_point(aes(x = avg_awarded_moves_year$year_film,
y = avg_awarded_moves_year$average_per_year))
Error message: Error: Aesthetics must be either length 1 or the same as the data (138): x and y
I realise that it's because there are fewer years (rows) in the awarded_movies table, but I don't know how to add a layer from a different dataset to an existing ggplot. Does anyone have any ideas?
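One common way to handle both parts is sketched below, under some assumptions: the two data frames are made-up stand-ins that reuse the question's names (all values are hypothetical), and standard_error is assumed to hold one value per year. geom_ribbon() takes ymin and ymax rather than y, and a layer can be given its own data argument so that its aesthetics are looked up in that data frame instead of the one passed to ggplot().

```r
library(ggplot2)

# Hypothetical stand-ins for the asker's two tables; all numbers are invented
avg_imdb_movie_year <- data.frame(
  startYear = 2000:2005,
  avg_rating = c(6.1, 6.3, 6.0, 6.4, 6.2, 6.5),
  standard_error = c(0.20, 0.15, 0.25, 0.10, 0.20, 0.15)
)
avg_awarded_moves_year <- data.frame(
  year_film = c(2001, 2003, 2005),
  average_per_year = c(7.4, 7.8, 7.6)
)

p <- ggplot(avg_imdb_movie_year, aes(x = startYear, y = avg_rating)) +
  # The ribbon spans mean - SE to mean + SE at each year
  geom_ribbon(aes(ymin = avg_rating - standard_error,
                  ymax = avg_rating + standard_error),
              alpha = 0.3) +
  geom_point() +
  # This layer uses its own data frame, so the row counts need not match
  geom_point(data = avg_awarded_moves_year,
             aes(x = year_film, y = average_per_year),
             colour = "red")
```

Because the second geom_point() gets its own data, the "Aesthetics must be either length 1 or the same as the data" error from using $ inside aes() should not arise.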

plotting two categorical vectors in ggridges

I have a dataset with a few organisms, which I would like to plot on the y-axis against date on the x-axis. However, I want the fluctuation of the curve to represent the abundance of the organisms. I.e., I would like to plot a time series of relative abundance, separated by organism, to show similar patterns over time.
However, of course, plotting just date against an organism does not yield any information on the abundance. So, my question is, is there a way to make the curve represent abundance using ggridges?
Here is my code for an example dataset:
set.seed(1)
Data <- data.frame(
Abundance = sample(1:100),
Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)
Date = rep(seq(from = as.Date("2016-01-01"), to = as.Date("2016-10-01"), by =
'month'),times=10)
Data <- cbind(Date, Data)
ggplot(Data, aes(x = Abundance, y = Organism)) +
geom_density_ridges(scale=1.15, alpha=0.6, color="grey90")
This produces a plot with the two organisms, but I want the date on the x-axis instead of abundance, and that doesn't work. I have read that you need to specify group = Date or convert the date to Julian day, but neither lets me incorporate abundance into the plot.
Does anyone have an example of a plot with date vs. a categorical variable (i.e. organism) plotted against a continuous variable in ggridges?
I really like the output from ggridges and would like to be able to use it for these visualizations. Thank you in advance for your help!
Cheers,
Anni
geom_density_ridges() estimates a density from individual observations, so it helps to reshape the data so that each observation is its own row instead of being summarised as an Abundance count.
library(ggplot2); library(ggridges); library(dplyr)
# uncount() repeats each row as many times as its Abundance value
Data_sum <- Data %>%
tidyr::uncount(Abundance)
ggplot(Data_sum, aes(x = Date, y = Organism)) +
ggridges::geom_density_ridges(scale=1, alpha=0.6, color="grey90")
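To see what the uncount step does, here is a small base R sketch on a made-up two-row table (toy and expanded are illustrative names, not part of the answer). It repeats each row index by its Abundance value and drops the count column, which mirrors the default behaviour of tidyr::uncount().

```r
# Toy table: one row per organism with a count
toy <- data.frame(
  Organism = c("organism1", "organism2"),
  Abundance = c(3, 1)
)

# Repeat row 1 three times and row 2 once, keeping only the Organism column
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$Abundance),
                "Organism", drop = FALSE]

# The counts are now encoded as the number of rows per organism
table(expanded$Organism)
```

After this expansion, a density over Date is higher wherever many rows (that is, high abundance) fall, which is why the ridges end up reflecting abundance.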

How to shade under part of a line from a dataset

I have a simple plot of some data from an experiment.
plot(x=sample95$PositionA, y=sample95$AbsA, xlab=expression(position (mm)), ylab=expression(A[260]), type='l')
I would like to shade a particular area under the line, let's say from 35-45mm. From what I've searched so far, I think I need to use the polygon function, but I'm unsure how to assign vertices from a big dataset like this. Every example I've seen so far uses a normal curve.
Any help is appreciated, I am very new to R/RStudio!
Here is a solution using tidyverse tools including ggplot2. I use the built in airquality dataset as an example.
This first part just puts the data in a format we can plot, by combining the month and the day into a single date. In your data, PositionA would take the place of date.
library(tidyverse)
df <- airquality %>%
as_tibble() %>%
magrittr::set_colnames(str_to_lower(colnames(.))) %>%
mutate(date = as.Date(str_c("1973-", month, "-", day)))
This is the plot code. In ggplot2, we start with the function ggplot() and add geom functions to it with + to create the plot in layers.
The first function, geom_line, joins up all observations in the order that they appear based on the x variable, so it makes the line that we see. Each geom needs a particular mapping to an aesthetic, so here we want date on the x axis and temp on the y axis, so we write aes(x = date, y = temp).
The second function, geom_ribbon, is designed to plot bands at particular x values between a ymax and a ymin. This lets us shade the area underneath the line by choosing a constant ymin = 55 (a value lower than the minimum temperature) and setting ymax = temp.
We shade a specific part of the chart through the data argument. Normally a geom uses the dataset inherited from ggplot(), but you can override that by passing data to an individual geom. Here we use filter() so that geom_ribbon only gets the rows where the date is in June.
ggplot(df) +
geom_line(aes(x = date, y = temp)) +
geom_ribbon(
data = filter(df, date < as.Date("1973-07-01") & date >= as.Date("1973-06-01")),
mapping = aes(x = date, ymax = temp, ymin = 55)
)
This gives the chart below:
Created on 2018-02-20 by the reprex package (v0.2.0).
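Since the question asked about polygon() in base graphics, here is a rough sketch of that route as well. The sample95 data is not available, so x and y below are made-up stand-ins for PositionA and AbsA, and the baseline of 0 is an assumption. The idea is to keep only the points inside the range and close the shape by dropping down to the baseline at both ends.

```r
# Made-up stand-ins for sample95$PositionA and sample95$AbsA
x <- seq(0, 80, by = 1)
y <- exp(-((x - 40) / 10)^2)

# Points that fall inside the region to shade (35 to 45 mm)
keep <- x >= 35 & x <= 45
xs <- x[keep]
ys <- y[keep]

plot(x, y, type = "l", xlab = "position (mm)", ylab = expression(A[260]))

# Vertices: start on the baseline, follow the curve, return to the baseline
polygon(x = c(xs[1], xs, xs[length(xs)]),
        y = c(0, ys, 0),
        col = "grey", border = NA)

# Redraw the line so the shading does not cover it
lines(x, y)
```

With real data the same pattern should apply with sample95$PositionA and sample95$AbsA in place of x and y, as long as the rows are sorted by position.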

Plotting error while using ggplot faceting function in R

I am trying to compare my observed and modeled data sets for two stations. One station is called "red" and the other "blue". I was able to create the facets, but when I tried to add two series to each facet, only one facet was updated while the other wasn't.
This means for blue only one series is plotted and for red two series are plotted.
The code I used is as follows:
# install.packages("RCurl", dependencies = TRUE)
require(RCurl)
out <- postForm("https://dl.dropbox.com/s/ainioj2nn47sis4/watersurf1.csv?dl=1", format="csv")
watersurf <- read.csv(textConnection(out))
watersurf[1:100,]
watersurf$coupleid <- factor(rep(unlist(by(watersurf$id,watersurf$group1,
function(x) {ave(as.numeric(unique(x)),FUN=seq_along)}
)),each=6239))
p <- ggplot(data=watersurf,aes(x=time,y=data,group=id))+geom_line(aes(linetype=group1),size=1)+facet_wrap(~coupleid)
p
Is it also possible to add a third series to the graph, but of unequal length (i.e., not at the same interval)?
I followed the example on this page to create the graphs.
http://www.ats.ucla.edu/stat/r/faq/growth.htm
Is this what you are looking for?
ggplot(data = watersurf, aes(x = time, y = data)) +
geom_line(aes(linetype = group1, colour = group1), size = 0.2) +
facet_wrap(~ id)
