Create faceted graph in R, keeping other points greyed out

I have some data (AllPCA) that is divided by site. I have used qplot (PC1, PC2, data=AllPCA, colour=Population, facets=~Population) + scale_colour_manual (values=cbbPalette) to facet a scatterplot of two variables by site.
Example AllPCA:
ID PC1 PC2 Population
Syd1 0.0185 0.0426 Sydney
Was1 0.0167 0.0415 Washington
Rea1 0.0182 0.0431 Reading
Aar1 0.0183 0.0427 Aarhus
This works fine, but only gives the data from each site in each of the windows.
I would like to create the same plot, but keeping the rest of the data in each facetted plot, just greyed out. Can you help?

One way is to use two geom_point() calls. In the first, use data=AllPCA[,-4] (your data without the Population column) and set color="grey". Because this layer's data has no facetting variable, all points are plotted in grey in every facet. Then add a second geom_point() with the full data and color=Population. This draws each population's points in its own colour, and only in the facet for that population (when facet_wrap() is used).
ggplot()+
geom_point(data=AllPCA[,-4],aes(PC1,PC2),color="grey")+
geom_point(data=AllPCA,aes(PC1,PC2,color=Population))+
facet_wrap(~Population)
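To try this without the original data, here is a minimal sketch with a made-up AllPCA (the IDs and values are invented for illustration). It drops the facet column by name rather than by position, which is an assumption about what is more robust if the column order changes.

```r
library(ggplot2)

# Hypothetical stand-in for AllPCA; the values are invented for illustration
set.seed(1)
AllPCA <- data.frame(
  ID = paste0("id", 1:40),
  PC1 = rnorm(40, mean = 0.018, sd = 0.001),
  PC2 = rnorm(40, mean = 0.042, sd = 0.001),
  Population = rep(c("Sydney", "Washington", "Reading", "Aarhus"), each = 10)
)

# Background layer: the same points without the facetting column,
# so ggplot2 repeats them in every facet
background <- AllPCA[, setdiff(names(AllPCA), "Population")]

p <- ggplot(AllPCA, aes(PC1, PC2)) +
  geom_point(data = background, colour = "grey80") +
  geom_point(aes(colour = Population)) +
  facet_wrap(~Population)
p
```

layer_data(p, 1) returns the data actually drawn by the grey layer, which is a quick way to confirm that it appears in all four panels.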

Duplicate your data several times:
n <- length(unique(AllPCA[["Population"]]))
dat <- do.call(rbind, rep(list(AllPCA), n))
Create new columns for (a) facetting and (b) colour:
# each stacked copy of AllPCA gets a single facet label
dat[["Population2"]] <- rep(unique(AllPCA[["Population"]]), each = nrow(AllPCA))
dat[["PopulationMatch"]] <- with(dat, Population == Population2)
Plot:
library(ggplot2)
qplot(PC1, PC2, data = dat, colour = PopulationMatch, facets = ~ Population2) +
scale_colour_manual(values = c("grey", "black"))
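Another way to build the duplicated data is a cross join with merge(..., by = NULL), which pairs every row with every facet label regardless of row order or group sizes. A minimal sketch, with made-up values standing in for AllPCA and a hypothetical helper data frame named facets:

```r
library(ggplot2)

# Hypothetical stand-in for AllPCA, with unequal group sizes in mixed order
AllPCA <- data.frame(
  PC1 = c(0.0185, 0.0167, 0.0182, 0.0183, 0.0171),
  PC2 = c(0.0426, 0.0415, 0.0431, 0.0427, 0.0419),
  Population = c("Sydney", "Washington", "Reading", "Aarhus", "Washington")
)

# Cross join: every row is paired with every facet label
facets <- data.frame(Population2 = unique(AllPCA$Population))
dat <- merge(AllPCA, facets, by = NULL)
dat$PopulationMatch <- dat$Population == dat$Population2

# Put non-matching (grey) rows first so highlighted points are drawn on top
dat <- dat[order(dat$PopulationMatch), ]

p <- ggplot(dat, aes(PC1, PC2, colour = PopulationMatch)) +
  geom_point() +
  facet_wrap(~Population2) +
  scale_colour_manual(values = c("FALSE" = "grey", "TRUE" = "black"))
p
```

Naming the colour values by "FALSE" and "TRUE" avoids relying on the order in which the two levels appear.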

Related

Issues using position_dodge()

I'm having trouble with position_dodge, when using colours and shapes.
I want to graph results from an experiment in which two treatments are replicated at many sites and I would like to emphasize certain data points graphically.
As the x-axis is a factor, I'd used position_dodge, to separate the treatments. So far so good, see graph 1 below.
However, when I emphasize a particular data point by changing its shape (graph 2), the data points are split into three columns instead of two.
Any suggestions on how to make a graph like the one pictured in the third panel below?
library(ggplot2)
site <- rep(c("site1"),times=6)
# three observations per treatment, matching the six responses below
treatment <- rep(c("one","two"),each=3)
set.seed(21)
response <- c(rnorm(3,mean=4),
rnorm(3,mean=5))
special <- as.factor(c(0,1,0,0,0,0))
mydata <- data.frame(site,treatment,response,special)
#graph 1
ggplot()+
geom_point(data=mydata,
aes(x = site,
y = response,
colour=treatment),
size=4,
position=position_dodge(1))
#graph 2
ggplot()+
geom_point(data=mydata,
aes(x = site,
y = response,
colour=treatment,
shape=special),
size=4,
position=position_dodge(0.5))
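One possible approach, offered as a sketch rather than a confirmed answer from this thread: position_dodge() dodges by group, and when no group is given ggplot2 uses the interaction of all discrete variables in the layer. Here that is treatment and special, which gives three groups. Setting group = treatment explicitly should bring the points back to two columns while keeping the special shape.

```r
library(ggplot2)

# Same structure as the question's data: one site, two treatments, one special point
set.seed(21)
mydata <- data.frame(
  site = rep("site1", 6),
  treatment = rep(c("one", "two"), each = 3),
  response = c(rnorm(3, mean = 4), rnorm(3, mean = 5)),
  special = factor(c(0, 1, 0, 0, 0, 0))
)

# group = treatment makes position_dodge() ignore `special` when dodging
p <- ggplot(mydata, aes(x = site, y = response, colour = treatment,
                        shape = special, group = treatment)) +
  geom_point(size = 4, position = position_dodge(0.5))
p
```

The dodged x positions can be inspected with layer_data(p)$x if the layout needs checking.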

Make scatter plots for multiple subsets of data

Let me first introduce my dataset and my preliminary result, to make the question easier to understand. My dataset looks like:
Place Species Size Conc.
A BT 24 0.2
A ST 76 1.4
...
B BT 45 1.2
B ST 21 0.7
...
I want to make scatterplot of Size against Conc. for each Species at each Place. What I have done uses ggplot2 to make a graph as below:
scatterplot <- ggplot(mydata, aes(x = Size, y = Conc., color = Species)) +
geom_point(shape = 1)
Though this graph plots the species groups in different colors, it pools all the data in the dataset and does not plot the places separately.
I think the code below
scatterplot <- ggplot(mydata[mydata$Place == "A", ], aes(x = Size, y = Conc., color = Species)) + geom_point(shape = 1)
works for plotting just place A and I can do this for different places one by one. However, in my real dataset, the place variable has tons of different places, and I can't type them all out one by one manually. Thus my question actually is how to let R make those plots for different places automatically at one time?
Try:
ggplot(ddf)+geom_point(aes(Size, Conc.))+facet_grid(Place~Species)
If there are too many places:
ggplot(ddf)+geom_point(aes(Size, Conc., color=Place))+facet_grid(.~Species)
Or, in one graph:
ggplot(ddf)+geom_point(aes(Size, Conc., color=Place,shape=Species), size=5)
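If the goal is one plot per place without typing the places out, two options are facet_wrap(), which wraps the panels onto several rows, and a loop over split() that returns one plot object per place. A minimal sketch, using invented data with the question's column names (ddf is the name used in the answer above, p_wrap and plots are hypothetical names):

```r
library(ggplot2)

# Hypothetical data with the question's column names (values invented)
set.seed(42)
ddf <- data.frame(
  Place = rep(c("A", "B", "C"), each = 20),
  Species = rep(c("BT", "ST"), times = 30),
  Size = runif(60, 20, 80),
  Conc. = runif(60, 0.2, 1.5)
)

# Option 1: one panel per place, wrapped onto several rows
p_wrap <- ggplot(ddf, aes(Size, Conc., colour = Species)) +
  geom_point(shape = 1) +
  facet_wrap(~Place)

# Option 2: one separate plot object per place, built automatically
plots <- lapply(split(ddf, ddf$Place), function(d) {
  ggplot(d, aes(Size, Conc., colour = Species)) +
    geom_point(shape = 1) +
    ggtitle(paste("Place", d$Place[1]))
})
```

The list is named by place, so plots[["A"]] gives the plot for place A, and each element can be printed or passed to ggsave() in a loop.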

Creating density plots from two different data-frames using ggplot2

My goal is to compare the distribution of various socioeconomic factors, such as income, across multiple years to see how the population of a particular region has evolved over, say, 5 years. The primary data for this comes from the Public Use Microdata Sample. I am using R + ggplot2 as my preferred tool.
When comparing two years' worth of data (2005 and 2010), I have two data frames, hh2005 and hh2010, with the household data for the two years. The income data for the two years are stored in the variable hincp in both data frames. Using ggplot2, I create the density plot for an individual year as follows (example for 2010):
p1 <- ggplot(data = hh2010, aes(x=hincp))+
geom_density()+
labs(title = "Distribution of income for 2010")+
labs(y="Density")+
labs(x="Household Income")
p1
How do I overlay the 2005 density over this plot? I am unable to figure it out as having read data in as hh2010 I am not sure how to proceed. Should I be processing the data in a fundamentally different way from the very beginning?
You can pass data arguments to individual geoms, so you should be able to add the second density as a new geom like this:
p1 <- ggplot(data = hh2010, aes(x=hincp))+
geom_density() +
# Change the fill colour to differentiate it
geom_density(data=hh2005, fill="purple") +
labs(title = "Distribution of income for 2010")+
labs(y="Density")+
labs(x="Household Income")
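A variation on the per-layer data idea, sketched with simulated incomes because the PUMS data is not available here: mapping fill to a constant label inside aes() gives each layer its own legend entry, and alpha keeps both curves visible. The data frames below are hypothetical stand-ins for hh2005 and hh2010.

```r
library(ggplot2)

# Hypothetical stand-ins for hh2005 and hh2010; incomes are simulated
set.seed(2005)
hh2005 <- data.frame(hincp = rlnorm(500, meanlog = 10.8, sdlog = 0.6))
hh2010 <- data.frame(hincp = rlnorm(500, meanlog = 11.0, sdlog = 0.6))

# Each layer brings its own data; the constant fill labels create a legend
p <- ggplot(mapping = aes(x = hincp)) +
  geom_density(data = hh2005, aes(fill = "2005"), alpha = 0.5) +
  geom_density(data = hh2010, aes(fill = "2010"), alpha = 0.5) +
  labs(title = "Distribution of income for 2005 and 2010",
       x = "Household Income", y = "Density", fill = "Year")
p
```

This keeps the two data frames separate, which may be convenient when they do not share the same columns.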
This is how I would approach the problem:
1. Tag each data frame with the variable of interest (in this case, the year).
2. Merge the two data sets.
3. Update the 'fill' aesthetic in the ggplot function.
For example:
# tag each data frame with the year^
hh2005$year <- as.factor(2005)
hh2010$year <- as.factor(2010)
# merge the two data sets
d <- rbind(hh2005, hh2010)
d$year <- as.factor(d$year)
# update the aesthetic
p1 <- ggplot(data = d, aes(x=hincp, fill=year)) +
geom_density(alpha=.5) +
labs(title = "Distribution of income for 2005 and 2010") +
labs(y="Density") +
labs(x="Household Income")
p1
^ Note, the 'fill' parameter seems to work best when you use a factor, thus I defined the years as such. I also set the transparency of the overlapping density plots with the 'alpha' parameter.

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location),
f2=factor(data$Station),data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2)
# give.n is a helper function that is not shown in this post
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) +
geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
position = position_dodge(width = 0.75), size = 3)
plot1+xlab("MPS Station") + ylab("Depth(m)") +
theme(legend.title=element_blank()) + scale_y_reverse() +
coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2+scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to combine those two plots by adding "plot2" (the station depth scatterplot) to the "plot1" boxplots (detection depths). Both look at the same variables (depth and station) and must share the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the col=f1 argument. We'll plot the detection data as boxplots, and then plot the depths of the stations as colored points (assuming each station only has one depth). We specify the two different layers by passing the data to each 'geom' function instead of to the main 'ggplot' function. It should look something like this:
ggplot()+geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) + geom_point(data=df2, aes(x=f2, y=depth), colour="blue") + scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.
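Since the original image is not available here, the following is a minimal sketch of what such dummy data could look like. The station names, tagging locations, and depths are all invented, and the detection depths are drawn at random between the surface and each station's depth.

```r
library(ggplot2)

# Dummy data (purely random): 3 stations, 2 tagging locations
set.seed(7)
stations <- c("S1", "S2", "S3")
station_depth <- c(S1 = 40, S2 = 90, S3 = 130)  # invented receiver depths (m)

df <- data.frame(
  f1 = factor(rep(c("North", "South"), each = 30)),
  f2 = factor(rep(stations, times = 20))
)
# Detection depths between the surface and the depth of the station
df$depth <- runif(nrow(df), min = 0, max = station_depth[as.character(df$f2)])

df2 <- data.frame(f2 = factor(stations), depth = unname(station_depth))

# Boxplots for detections, blue points for the receiver stations
p <- ggplot() +
  geom_boxplot(data = df, aes(x = f2, y = depth, col = f1)) +
  geom_point(data = df2, aes(x = f2, y = depth), colour = "blue", size = 3) +
  scale_y_reverse() +
  labs(x = "MPS Station", y = "Depth (m)")
p
```

Because both layers map depth to y, they share one reversed y-axis without any extra work.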

How can I overlay by-group plot elements to ggplot2 facets?

My question has to do with facetting. In my example code below, I look at some facetted scatterplots, then try to overlay information (in this case, mean lines) on a per-facet basis.
The tl;dr version is that my attempts fail. Either my added mean lines compute across all data (disrespecting the facet variable), or I try to write a formula and R throws an error, followed by incisive and particularly disparaging comments about my mother.
library(ggplot2)
# Let's pretend we're exploring the relationship between a car's weight and its
# horsepower, using some sample data
p <- ggplot()
p <- p + geom_point(aes(x = wt, y = hp), data = mtcars)
print(p)
# Hmm. A quick check of the data reveals that car weights can differ wildly, by almost
# a thousand pounds.
head(mtcars)
# Does the difference matter? It might, especially if most 8-cylinder cars are heavy,
# and most 4-cylinder cars are light. ColorBrewer to the rescue!
p <- p + aes(color = factor(cyl))
p <- p + scale_color_brewer(palette = "Set1")
print(p)
# At this point, what would be great is if we could more strongly visually separate
# the cars out by their engine blocks.
p <- p + facet_grid(~ cyl)
print(p)
# Ah! Now we can see (given the fixed scales) that the 4-cylinder cars flock to the
# left on weight measures, while the 8-cylinder cars flock right. But you know what
# would be REALLY awesome? If we could visually compare the means of the car groups.
p.with.means <- p + geom_hline(
aes(yintercept = mean(hp)),
data = mtcars
)
print(p.with.means)
# Wait, that's not right. That's not right at all. The green (8-cylinder) cars are all above the
# average for their group. Are they somehow made in an auto plant in Lake Wobegon, MN? Obviously,
# I meant to draw mean lines factored by GROUP. Except also obviously, since the code below will
# print an error, I don't know how.
p.with.non.lake.wobegon.means <- p + geom_hline(
aes(yintercept = mean(hp) ~ cyl),
data = mtcars
)
print(p.with.non.lake.wobegon.means)
There must be some simple solution I'm missing.
You mean something like this:
library(plyr)  # provides ddply()
rs <- ddply(mtcars,.(cyl),summarise,mn = mean(hp))
p + geom_hline(data=rs,aes(yintercept=mn))
It might be possible to do this within the ggplot call using stat_*, but I'd have to go back and tinker a bit. But generally if I'm adding summaries to a faceted plot I calculate the summaries separately and then add them with their own geom.
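The same summary-then-layer pattern can be sketched with base R's aggregate() in place of plyr. This is an alternative, not the method used above. The key point is that the summary data frame keeps the facetting column cyl, so each mean line lands in its own facet.

```r
library(ggplot2)

# Per-group means with base R; the result keeps the facetting column `cyl`
rs <- aggregate(hp ~ cyl, data = mtcars, FUN = mean)
names(rs)[names(rs) == "hp"] <- "mn"

# Shared aesthetics go in ggplot(); the mean lines bring their own data
p <- ggplot(mtcars, aes(x = wt, y = hp, colour = factor(cyl))) +
  geom_point() +
  facet_grid(~cyl) +
  geom_hline(data = rs, aes(yintercept = mn))
p
```

Here rs has one row per cylinder count, so exactly one line is drawn per facet.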
EDIT
Just a few expanded notes on your original attempt. Generally it's a good idea to put aes calls in ggplot that will persist throughout the plot, and then specify different data sets or aesthetics in those geom's that differ from the 'base' plot. Then you don't need to keep specifying data = ... in each geom.
Finally, I came up with a kind of clever use of geom_smooth to do something similar to what you're asking:
p <- ggplot(data = mtcars,aes(x = wt, y = hp, colour = factor(cyl))) +
facet_grid(~cyl) +
geom_point() +
geom_smooth(se=FALSE,method="lm",formula=y~1,colour="black")
The horizontal line (i.e. constant regression eqn) will only extend to the limits of the data in each facet, but it skips the separate data summary step.
