Creating density plots from two different data-frames using ggplot2 - r

My goal is to compare the distribution of various socioeconomic factors, such as income, across multiple years to see how the population of a particular region has evolved over, say, 5 years. The primary data for this comes from the Public Use Microdata Sample. I am using R + ggplot2 as my preferred tool.
When comparing two years' worth of data (2005 and 2010), I have two data frames, hh2005 and hh2010, with the household data for the two years. The income data are stored in the variable hincp in both data frames. Using ggplot2, I am creating the density plot for individual years as follows (example for 2010):
p1 <- ggplot(data = hh2010, aes(x=hincp))+
geom_density()+
labs(title = "Distribution of income for 2010")+
labs(y="Density")+
labs(x="Household Income")
p1
How do I overlay the 2005 density over this plot? I am unable to figure it out: having read the data in as hh2010, I am not sure how to proceed. Should I be processing the data in a fundamentally different way from the very beginning?

You can pass a data argument to individual geoms, so you should be able to add the second density as a new geom like this:
p1 <- ggplot(data = hh2010, aes(x=hincp))+
  geom_density() +
  # Change the fill colour to differentiate it; alpha keeps the 2010 curve visible underneath
  geom_density(data=hh2005, fill="purple", alpha=0.5) +
  labs(title = "Distribution of income for 2005 and 2010")+
  labs(y="Density")+
  labs(x="Household Income")
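A minimal, self-contained sketch of the same layer-level data idea, using simulated incomes in place of the PUMS data (the simulated data frames and the colour choices are assumptions, not part of the original answer). Mapping fill to a constant label inside aes() gives each year a legend entry, and alpha keeps both curves visible:

```r
library(ggplot2)

# Simulated stand-ins for the real hh2005 / hh2010 data frames (assumption)
set.seed(1)
hh2005 <- data.frame(hincp = rlnorm(500, meanlog = 10.8, sdlog = 0.7))
hh2010 <- data.frame(hincp = rlnorm(500, meanlog = 10.9, sdlog = 0.7))

# Each geom receives its own data frame; the constant fill label
# inside aes() is what produces one legend entry per year
p <- ggplot(mapping = aes(x = hincp)) +
  geom_density(data = hh2005, aes(fill = "2005"), alpha = 0.5) +
  geom_density(data = hh2010, aes(fill = "2010"), alpha = 0.5) +
  scale_fill_manual(name = "Year",
                    values = c("2005" = "purple", "2010" = "orange")) +
  labs(title = "Distribution of income for 2005 and 2010",
       x = "Household Income", y = "Density")
p
```

This keeps the two data frames separate, which is convenient for a quick comparison but scales less well than a single combined data frame once more years are added.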

This is how I would approach the problem:
Tag each data frame with the variable of interest (in this case, the year)
Merge the two data sets
Update the 'fill' aesthetic in the ggplot function
For example:
# tag each data frame with the year^
hh2005$year <- as.factor(2005)
hh2010$year <- as.factor(2010)
# merge the two data sets
d <- rbind(hh2005, hh2010)
d$year <- as.factor(d$year)
# update the aesthetic
p1 <- ggplot(data = d, aes(x=hincp, fill=year)) +
geom_density(alpha=.5) +
labs(title = "Distribution of income for 2005 and 2010") +
labs(y="Density") +
labs(x="Household Income")
p1
^ Note, the 'fill' parameter seems to work best when you use a factor, thus I defined the years as such. I also set the transparency of the overlapping density plots with the 'alpha' parameter.
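Since the question mentions comparing more than two years, the tag-and-merge steps can be generalised to any number of data frames. The following is only a sketch: the named list hh_by_year and its simulated contents are hypothetical, and only the hincp column is assumed to exist.

```r
library(ggplot2)

# Hypothetical named list of yearly data frames (simulated values)
set.seed(2)
hh_by_year <- list(
  "2005" = data.frame(hincp = rlnorm(100, 10.80, 0.7)),
  "2008" = data.frame(hincp = rlnorm(100, 10.85, 0.7)),
  "2010" = data.frame(hincp = rlnorm(100, 10.90, 0.7))
)

# Tag each data frame with its year (taken from the list names)
tagged <- Map(function(df, yr) {
  df$year <- yr
  df
}, hh_by_year, names(hh_by_year))

# Stack all years and make year a factor so 'fill' treats it as discrete
d <- do.call(rbind, tagged)
d$year <- factor(d$year, levels = names(hh_by_year))

p <- ggplot(d, aes(x = hincp, fill = year)) +
  geom_density(alpha = 0.5) +
  labs(x = "Household Income", y = "Density")
p
```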

Related

How do I create a grouped boxplot in R?

I have a data frame whose variables are 5 probes: cg02823866, cg13474877, cg14305799, cg15837913 and cg19724470. I want to create a boxplot that will group cg02823866 and cg14305799 into a group called 'GeneBody' and then cg13474877, cg14305799 and cg19724470 into a group called 'Promoter'. I then want to colour-code the boxplots to represent the probe names. I can't figure out how to assign those variables to groups to plot the graph.
I created an ungrouped boxplot of the five probes and it looked like this.
I want the titles 'Promoter' and 'GeneBody' on the x axis. Above the 'GeneBody' title there should be the 2 boxplots for the cg02823866 and cg14305799 probes, and above the 'Promoter' label the boxplots for cg13474877, cg14305799 and cg19724470. I then want each boxplot colour-coded to represent its probe.
My data frame that I imported into RStudio looks like this: https://i.stack.imgur.com/r4gEC.png
Assuming you have some data with variable names Beta (your y axis), Probe (your current x axis), and group (either "GeneBody" or "Promoter"), you can do something like the following:
library(ggplot2)
ggplot(data, aes(x = group, y = Beta, fill = Probe)) +
geom_boxplot()
If you provide a reproducible set of data, I can probably do better.
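Because the question's data has one column per probe, it would first need reshaping into the long Beta / Probe / group layout assumed above. A rough sketch in base R follows; the probe names (probe_a to probe_d), the simulated values and the probe-to-group lookup are all made up for illustration and would need replacing with the real assignment.

```r
library(ggplot2)

# Hypothetical wide data: one column of beta values per probe
set.seed(3)
wide <- data.frame(probe_a = runif(20), probe_b = runif(20),
                   probe_c = runif(20), probe_d = runif(20))

# stack() reshapes to long format: 'values' plus 'ind' (the source column name)
long <- stack(wide)
names(long) <- c("Beta", "Probe")

# Hypothetical lookup from probe name to group
probe_group <- c(probe_a = "GeneBody", probe_b = "GeneBody",
                 probe_c = "Promoter", probe_d = "Promoter")
long$group <- factor(unname(probe_group[as.character(long$Probe)]))

# Groups on the x axis, one coloured box per probe within each group
p <- ggplot(long, aes(x = group, y = Beta, fill = Probe)) +
  geom_boxplot()
p
```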
Adding to Ben's answer the traditional iris data.frame example, which you can easily load with data(iris):
ggplot(iris) +
aes(x = "", y = Sepal.Length, group = Species) +
geom_boxplot(shape = "circle", fill = "#112446") +
theme_minimal()
So you just need a column which indicates the group membership.
It of course gets more difficult with uncleaned data, where you might need to transpose (reshape) the data first, etc. But those are follow-up questions, I guess.
Also, if you want to make your life easier, use the esquisse RStudio add-in.

Ordering Facets in a plot based on a column in the dataset

So, I have a dataset which looks like this.
I'm tasked with creating a smooth faceted visualization which shows each coral's bleaching rate at each site, which I've done like this:
(I FULLY realize that this code might be bad and have some mistakes in it and I'd really appreciate it if people could tell me ways to improve it or correct some grave errors in it).
coral_data <- read.csv("file.csv")
#options(warn=-1)
library(ggplot2)
ggplot(coral_data, aes(x=year, y=value, colour=coralType, group=coralType)) +
  geom_smooth(method="lm", se=FALSE) +
  scale_x_continuous(name="Year", breaks=c(2010, 2013, 2016)) +
  # assuming value is numeric, the y scale must be continuous, not discrete
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  facet_grid(coralType ~ location, scales="free")+
  expand_limits(y=0) +
  # x is the year and y is the bleaching rate
  labs(x="\nYear", y="Bleaching Rate", title="Coral Bleaching for different corals at different sites over the years\n")
But I also have to order the facets by latitude (currently, it's like site01, site02, etc., but I want the faceted sites to be ordered by their latitude values, be it ascending or descending), and sadly I have no idea how to do that.
Thus, could someone please tell me how to go about doing this?
Consider ordering your data frame by latitude, then re-assigning the location factor variable, defining its levels in the new order with unique:
# EITHER: ORDER DATA FRAME BY ASCENDING LATITUDE
coral_data <- with(coral_data, coral_data[order(latitude),])
# OR: ORDER DATA FRAME BY DESCENDING LATITUDE
coral_data <- with(coral_data, coral_data[order(latitude, decreasing = TRUE),])
# ASSIGN location AS FACTOR WITH DEFINED LEVELS
coral_data$location <- with(coral_data, factor(as.character(location), levels = unique(location)))
ggplot(coral_data, ...)
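As an alternative sketch, base R's reorder() can set the factor levels directly from latitude without sorting the data frame first. The toy data frame below is hypothetical; only the location and latitude column names are taken from the answer above.

```r
# Toy stand-in for coral_data (hypothetical values)
coral_toy <- data.frame(
  location = c("site01", "site02", "site03", "site01", "site02", "site03"),
  latitude = c(-10, -20, -15, -10, -20, -15)
)

# reorder() sorts the factor levels by the mean latitude of each site,
# so facets built on 'location' follow ascending latitude
coral_toy$location <- reorder(factor(coral_toy$location), coral_toy$latitude)
levels(coral_toy$location)

# For descending latitude, negate the ordering variable
descending <- reorder(coral_toy$location, -coral_toy$latitude)
levels(descending)
```

Because facet_grid and facet_wrap lay out panels in factor-level order, either version should carry through to the plot unchanged.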

R: ggplot - YX relationship with 2 groups - how to get only one (whole sample) slope? [duplicate]

I have multiple sources of data over three decades.
The data is discontiguous and overlaps in multiple places. I would like to plot the points for each data source in a different color but then add a single trendline that uses all of the data sources.
The included code has some sample data and two plot examples. The first call to ggplot plots a single trendline for all of the data. The second ggplot call plots each source in a different color, each with its own trendline.
library(ggplot2)
the.data <- read.table( header=TRUE, sep=",",
text="source,year,value
S1,1976,56.98
S1,1977,55.26
S1,1978,68.83
S1,1979,59.70
S1,1980,57.58
S1,1981,61.54
S1,1982,48.65
S1,1983,53.45
S1,1984,45.95
S1,1985,51.95
S1,1986,51.85
S1,1987,54.55
S1,1988,51.61
S1,1989,52.24
S1,1990,49.28
S1,1991,57.33
S1,1992,51.28
S1,1993,55.07
S1,1994,50.88
S2,1993,54.90
S2,1994,51.20
S2,1995,52.10
S2,1996,51.40
S3,2002,57.95
S3,2003,47.95
S3,2004,48.15
S3,2005,37.80
S3,2006,56.96
S3,2007,48.91
S3,2008,44.00
S3,2009,45.35
S3,2010,49.40
S3,2011,51.19")
ggplot( the.data, aes( the.data$year, the.data$value ) ) + geom_point() + geom_smooth()
#ggplot( the.data, aes( the.data$year, the.data$value, color=the.data$source ) ) + geom_point() + geom_smooth()
The second call displays the colored data points and I would like to add a single contiguous trendline representing all of the years.
Like this:
ggplot(the.data, aes( x = year, y = value ) ) +
geom_point(aes(colour = source)) +
geom_smooth(aes(group = 1))
A few notes:
Don't map aesthetics to an isolated vector like the.data$year. (Until you really know what you're doing, and know when to break that rule.) Just use the column names.
Map the aesthetics that you want in separate layers in their respective geom calls. In this case, I want the points colored differently, but for the smooth line, I want the data grouped all together (group = 1).
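One way to check that the layers behave as described is to inspect the data ggplot2 computes for each layer. The sketch below uses a small subset of the question's data; method = "lm" is swapped in here only to keep the fit simple and is not part of the original answer.

```r
library(ggplot2)

# Subset of the question's data: four rows per source
toy <- data.frame(
  source = rep(c("S1", "S2", "S3"), each = 4),
  year   = c(1976:1979, 1993:1996, 2002:2005),
  value  = c(56.98, 55.26, 68.83, 59.70,
             54.90, 51.20, 52.10, 51.40,
             57.95, 47.95, 48.15, 37.80)
)

p <- ggplot(toy, aes(x = year, y = value)) +
  geom_point(aes(colour = source)) +                       # colour only on points
  geom_smooth(aes(group = 1), method = "lm", se = FALSE)   # one group for the fit

# Layer 1 (points) should carry one colour per source;
# layer 2 (smooth) should contain a single group
n_point_colours <- length(unique(layer_data(p, 1)$colour))
smooth_groups   <- unique(layer_data(p, 2)$group)
```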

varying axis values in facet_wrap

I am working with a Danish dataset on immigrants by country of origin and age group. I transformed the data so I can see the top countries of origin for each age group.
I am plotting it using facet_wrap. What I would like to do is, since different age groups come from quite different areas, to show a different set of values for one axis in each facet. For example, those that are between 0 and 10 years old come from countries x,y and z, while those 10-20 years of age come from countries q, r, z and so on.
In my current version, it shows the entire set of values, including countries that are not in the top 10. I would like to show just the top ten countries of origin for each facet, in effect having different axis labels for each. (And, if it is possible, sorting by high to low for each facet).
Here is what I have so far:
library(ggplot2)
library(reshape)
###load and inspect data
load(url('http://dl.dropbox.com/u/7446674/dk_census.rda'))
head(dk_census)
###reshape for plotting--keep just a few age groups
dk_census.m <- melt(dk_census[dk_census$Age %in% c('0-9 år', '10-19 år','20-29 år','30-39 år'),c(1,2,4)])
###get top 10 observations for each age group, store in data frame
top10 <- by(dk_census.m[order(dk_census.m$Age,-dk_census.m$value),], dk_census.m$Age, head, n=10)
top10.df<-do.call("rbind", as.list(top10))
top10.df
###plot
ggplot(data=top10.df, aes(x=as.factor(Country), y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
One option (that I actually strongly suspect you won't be happy with) is this:
library(plyr)       # provides dlply
library(gridExtra)  # provides grid.arrange
p <- ggplot(data=top10.df, aes(x=Country, y=value)) +
  geom_bar(stat="identity")+
  coord_flip() +
  facet_wrap(~Age)+
  labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
# one plot per age group, with countries sorted by value within that group
pp <- dlply(.data=top10.df,.(Age),function(x) {x$Country <- reorder(x$Country,x$value); p %+% x})
do.call(grid.arrange,pp)
(Edited to sort each graph.)
Keep in mind that the only reason faceting exists is to plot multiple panels that share a common scale. So when you start asking to facet on some variable, but have the scales be different (oh, and also sort them separately on each panel as well) what you're doing is really no longer faceting. It's just making four different plots and arranging them together.
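If a single faceted plot is still preferred, a possible sketch is to let the category axis vary per panel with scales = "free_y" and to sort within each panel through a composite key. This is not from the answers here: the toy data frame is hypothetical, the "___" separator is an arbitrary choice, and the horizontal geom_col call assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Hypothetical stand-in for top10.df
toy <- data.frame(
  Age     = rep(c("0-9", "10-19"), each = 3),
  Country = c("X", "Y", "Z", "Q", "R", "Z"),
  value   = c(30, 10, 20, 5, 25, 15)
)

# One key per Age/Country pair, ordered by value, so that a country
# appearing in several panels can sit at a different rank in each
toy$key <- reorder(paste(toy$Age, toy$Country, sep = "___"), toy$value)

p <- ggplot(toy, aes(x = value, y = key)) +
  geom_col() +
  # free_y shows only the keys present in each panel
  facet_wrap(~Age, scales = "free_y") +
  # strip the Age prefix again so the axis shows plain country names
  scale_y_discrete(labels = function(k) sub("^.*___", "", k)) +
  labs(title = "Immigrants By Country by Age",
       x = "Population", y = "Country of Origin")
p
```

The panels no longer share a category axis, so the caveat above still applies: this is closer to several plots arranged together than to faceting on a common scale.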
Using lattice (here I use latticeExtra for the ggplot2 theme), you can set relation='free' between panels. Here I am using abbreviate = TRUE to shorten long labels.
library(latticeExtra)
barchart(value~ Country|Age,data=top10.df,layout=c(2,2),
horizontal=T,
par.strip.text =list(cex=2),
scales=list(y=list(relation='free',cex=1.5,abbreviate=T,
labels=levels(factor(top10.df$Country)))),
# ,cex=1.5,abbreviate=F),
par.settings = ggplot2like(),axis=axis.grid,
main="Immigrants By Country by Age",
ylab="Country of Origin",
xlab="Population")

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location), #$
f2=factor(data$Station),data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2) #$
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) + #$
geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
position = position_dodge(height = 0, width = 0.75), size = 3)
plot1+xlab("MPS Station") + ylab("Depth(m)") +
theme(legend.title=element_blank()) + scale_y_reverse() +
coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2+scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to combine those two plots by adding "plot2" (station depth scatterplot) to the "plot1" boxplots (detection depths). They are both looking at the same variables (depth and station), and must share the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the 'col=f1' argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different plots by referencing the data from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot()+geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) + geom_point(data=df2, aes(x=f2, y=depth), colour="blue") + scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.
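A self-contained sketch along those lines, with made-up station names, tagging locations and depths (all values below are random or invented, and only the f1, f2 and depth column names come from the steps above):

```r
library(ggplot2)

# Purely random dummy detections: 3 stations, 2 tagging locations
set.seed(4)
df <- data.frame(
  f1    = factor(rep(c("SiteA", "SiteB"), times = 30)),
  f2    = factor(rep(c("St1", "St2", "St3"), each = 20)),
  depth = runif(60, min = 0, max = 120)
)

# One invented receiver depth per station
df2 <- data.frame(f2 = factor(c("St1", "St2", "St3")),
                  depth = c(100, 130, 145))

# Each geom gets its own data frame; both share the station axis
# and the reversed depth axis
p <- ggplot() +
  geom_boxplot(data = df, aes(x = f2, y = depth, col = f1)) +
  geom_point(data = df2, aes(x = f2, y = depth), colour = "blue", size = 3) +
  scale_y_reverse() +
  labs(x = "MPS Station", y = "Depth (m)", colour = "Tagging location")
p
```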
