Scatterplot using ggplot - r

I need to create a scatterplot of count vs. depth of 12 species using ggplot.
This is what I have so far:
library(ggplot2)
ggplot(data = ReefFish, mapping = aes(count, depth))
However, how do I use geom_point(), geom_smooth(), and facet_wrap() to include a smoother as well as include just the 12 species I want from the data (ReefFish)? Since I believe what I have right now includes all species from the data.
Here is an example of part of my data:

Since I don't have access to the ReefFish data set, here's an example using the built-in mpg data set about cars. To make it work with your data set, just edit this code to replace manufacturers with species.
Filter the data
First we filter the data so that it only includes the species/manufacturers we're interested in.
# load our packages
library(ggplot2)
library(magrittr)
library(dplyr)
# set up a character vector of the manufacturers we're interested in
manufacturers <- c("audi", "nissan", "toyota")
# filter our data set to only include the manufacturers we care about
mpg_filtered <- mpg %>%
  filter(manufacturer %in% manufacturers)
Plot the data
Now we plot. Your code was just about there! You just needed to add the plot elements you wanted, like so:
mpg_filtered %>%
  ggplot(mapping = aes(x = cty,
                       y = hwy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~manufacturer)
Hope that helps, and let me know if you have any issues.
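Mapped back onto the original question, the same pattern might look like the sketch below. The real ReefFish data is not shown, so this uses simulated data: the column names species, count and depth, the species labels, and the wanted_species vector are all assumptions to be replaced with the real ones. It also assumes "count vs. depth" means count on the y axis and depth on the x axis.

```r
library(ggplot2)
library(dplyr)

# Simulated stand-in for ReefFish; column names and values are assumed
set.seed(1)
reef_fish <- data.frame(
  species = rep(paste0("sp", 1:15), each = 20),
  depth   = runif(300, min = 1, max = 30)
)
reef_fish$count <- rpois(300, lambda = 5 + reef_fish$depth / 3)

# Hypothetical vector naming the 12 species to keep
wanted_species <- paste0("sp", 1:12)

# Keep only the rows for the species of interest
reef_filtered <- reef_fish %>%
  filter(species %in% wanted_species)

# One panel per species, each with points and its own smoother
reef_plot <- ggplot(reef_filtered, aes(x = depth, y = count)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +
  facet_wrap(~species)

print(reef_plot)
```

Because facet_wrap() splits the data before the smoother is fitted, each panel gets a smoother computed from that species alone.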

Related

Box plots not appearing properly in RStudio

I am creating box plots within R, however, they are appearing incorrectly. My data is based off of German Credit Dataset on Kaggle.
My code with two different attributes trying to be tested:
data %>%
  ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
  geom_boxplot() +
  ggtitle("Creditability vs Purpose")
data %>%
  ggplot(aes(x = Creditability, y = Account.Balance, fill = Creditability)) +
  geom_boxplot() +
  ggtitle("Creditability vs Account Balance")
I've tried a few different attributes, but they all result in the same error.
Edited info: Is it because the attributes have too much information? I have split the sample into test (300) vs train (700) and I am currently using train. Would it simply be because there's too much info?
Edit picture: [image "Factors" not reproduced]
Edit for graph error: [image "Error" not reproduced]
As others have explained in the comments, you cannot show boxplots where the y axis is set to be a factor. Factors are by their nature discrete variables, even if the levels are named as numbers. In order to utilize the stat function for the boxplot geom, you need the y axis to be continuous and the x axis to be discrete (or able to be separated into discrete values via the group= aesthetic).
Let me demonstrate with the mtcars dataset built into R:
library(ggplot2)
ggplot(mtcars, aes(x=factor(carb), y=mpg)) + geom_boxplot()
Here we can draw boxplots because the x aesthetic is forced to be discrete (via factor(carb)), while the y axis uses mpg, which is a numeric column in the mtcars dataset.
If you set both carb and mpg to be factors, you get something that should look pretty similar to what you're seeing:
ggplot(mtcars, aes(x=factor(carb), y=factor(mpg))) + geom_boxplot()
In your case, all the columns in your dataset are factors. If they are factors whose levels can be coerced to numbers, you can turn them into continuous vectors with as.numeric(levels(column_name)[column_name]). Alternatively, you can use as.numeric(as.character(column_name)). Here's what it looks like to first convert the mtcars$mpg column to a factor, and then back to numeric via this method.
df <- mtcars
# convert to a factor
df$mpg <- factor(df$mpg)
# back to numeric!
df$mpg <- as.numeric(levels(df$mpg)[df$mpg])
# this plot looks like it did before when we did the same with mtcars
ggplot(df, aes(x=factor(carb), y=mpg)) + geom_boxplot()
So, for your case, do this two step process:
data$Purpose <- as.numeric(levels(data$Purpose)[data$Purpose])
data %>%
  ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
  geom_boxplot() +
  ggtitle("Creditability vs Purpose")
That should work. You can follow in a similar fashion for your other variables.
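One reason to go through the levels (or through as.character()) rather than calling as.numeric() on the factor directly: as.numeric() on a factor returns the internal level codes, not the numbers shown in the labels. A small sketch with a made-up factor f:

```r
# A made-up factor whose labels look like numbers
f <- factor(c("20", "5", "10"))

# Levels are sorted as text, so "5" comes last
lv <- levels(f)

# Direct conversion gives the level codes, not the values
codes <- as.numeric(f)

# Both idioms from the answer recover the real values
via_character <- as.numeric(as.character(f))
via_levels    <- as.numeric(levels(f)[f])

print(lv)             # "10" "20" "5"
print(codes)          # 2 3 1
print(via_character)  # 20 5 10
print(via_levels)     # 20 5 10
```

If a column contains labels that are not numbers at all, both idioms return NA for those entries with a warning, which is a useful signal that the column is genuinely categorical.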

How to add legends in this context?

songs %>%
  group_by(year) %>%
  summarise(count = nth(pop, 1)) %>%
  ggplot(aes(x = factor(year), y = count, fill = year)) +
  geom_bar(stat = "identity") +
  theme_classic()
1. How can I adjust my legend to show the years (2010:2019) rather than what it is showing right now?
2. scale_size_manual is not working.
You need to set year as a factor each time (or externally), not just once. I don't have your data, so I'll use mtcars.
library(ggplot2)
library(dplyr)
# first plot
mtcars %>%
  ggplot(aes(factor(carb), disp, fill=carb)) +
  geom_bar(stat="identity")
# second plot
mutate(mtcars, carb = factor(carb)) %>%
  ggplot(aes(carb, disp, fill=carb)) +
  geom_bar(stat="identity")
# alternate code for second plot, not shown
mtcars %>%
  ggplot(aes(factor(carb), disp, fill=factor(carb))) +
  # both     ^^^^^^        and        ^^^^^^
  geom_bar(stat="identity")
(There are numerous ways to convert to a factor. I'm using dplyr here, but it can easily be done in base or data.table.)
I included the "alternate" code above that shows the manual factor being applied to each use of carb; this is not the preferred method in my mind, since if you're doing it multiple times, just do it once before the plotting and use it multiple times. If you need both the ordinal year and the numeric version, you can add a new field, such as ordinal_year=factor(year).
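As a rough illustration of the ordinal_year idea, here is a sketch on made-up data. The songs data is not available, so songs_summary and its count values are invented stand-ins for the summarised table; only the year range 2010:2019 comes from the question.

```r
library(ggplot2)
library(dplyr)

# Made-up stand-in for the summarised songs data (one row per year)
songs_summary <- data.frame(
  year  = 2010:2019,
  count = c(83, 81, 79, 77, 85, 82, 78, 80, 76, 84)
)

# Keep the numeric year and add a factor version for discrete scales
songs_summary <- songs_summary %>%
  mutate(ordinal_year = factor(year))

# geom_col() is shorthand for geom_bar(stat = "identity")
year_plot <- ggplot(songs_summary,
                    aes(x = ordinal_year, y = count, fill = ordinal_year)) +
  geom_col() +
  labs(x = "year", fill = "year") +
  theme_classic()

print(year_plot)
```

With fill mapped to the factor column, the legend lists each year as its own entry instead of a continuous colour bar.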

Iterate over a dataframe and create one plot for each column

I would like to iterate over a data frame and plot each column against a particular column such as price.
What I have done so far is:
for(i in ncol(dat.train)) {
ggplot(dat.train, aes(dat.train[[,i]],price)) + geom_point()
}
What I want is to have the first introduction to my data (Approximately 300 columns) by plotting against the decision variable (i.e., price)
I know that there is a similar question, though I cannot really understand why the above is not really working.
You can do this. I have used the mtcars data to plot the other continuous variables against mpg. You have to reshape the data into long form (using gather) and then use ggplot to plot these continuous variables (disp, drat, qsec, etc.) against mpg. In your case, you would use price instead of mpg and gather all the other continuous variables (like disp, drat, qsec, etc. here); the remaining variables can be mapped to shape, colour, and so on (optional).
library(tidyverse)
mtcars %>%
  gather(-mpg, -hp, -cyl, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = mpg, color = hp, shape = factor(cyl))) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
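In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A sketch of the same reshaping and plot with pivot_longer() follows; the object names long_mtcars and long_plot are just illustrative.

```r
library(ggplot2)
library(tidyr)

# Reshape every column except mpg, hp and cyl into name/value pairs
long_mtcars <- pivot_longer(
  mtcars,
  cols      = -c(mpg, hp, cyl),
  names_to  = "var",
  values_to = "value"
)

# One panel per original column, each plotted against mpg
long_plot <- ggplot(long_mtcars,
                    aes(x = value, y = mpg, color = hp, shape = factor(cyl))) +
  geom_point() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

print(long_plot)
```

With around 300 columns, a single faceted figure would be too crowded, so it may be worth reshaping and plotting a subset of columns at a time.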
EDIT:
This is another solution in case we need separate graphs for each of the variables.
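Create a list of variables like this: lyst <- list("disp", "hp"). You can use the colnames function to get all the variable names. Use lapply to loop through all the elements of lyst on your data frame.
setwd("path")  # set the working directory here; this is where the PDF is saved
lyst <- list("disp", "hp")  # variables to plot against mpg
pdf(file = "one.pdf")
# print() each plot explicitly so it is drawn to the PDF even when the script is sourced;
# aes_string() is deprecated in recent ggplot2, where aes(x = .data[[i]], y = mpg) does the same
invisible(lapply(lyst, function(i) print(ggplot(mtcars, aes_string(x = i, y = "mpg")) + geom_point())))
dev.off()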
A PDF file containing all the graphs will be generated in the working directory you have set.
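As for why the loop in the question does not work, three things seem to be going on: for(i in ncol(dat.train)) runs only once, with i equal to the number of columns (seq_len(ncol(dat.train)) or the column names would visit every column); dat.train[[,i]] is not the column extractor (dat.train[[i]] is); and a ggplot created inside a for loop is not drawn unless it is printed explicitly. A sketch of a working loop, using mtcars with mpg standing in for price (target, predictors and n_drawn are illustrative names):

```r
library(ggplot2)

target     <- "mpg"                          # stand-in for price
predictors <- setdiff(names(mtcars), target) # every other column

n_drawn <- 0
for (col in predictors) {
  # .data[[col]] looks the column up by its name stored in col
  p <- ggplot(mtcars, aes(x = .data[[col]], y = .data[[target]])) +
    geom_point() +
    labs(x = col, y = target)
  # Inside a loop the plot must be printed explicitly to be drawn
  print(p)
  n_drawn <- n_drawn + 1
}
```

Each plot is printed inside the iteration that creates it. If the plots are stored and printed later instead, building them inside a function (for example with lapply) avoids every stored plot picking up the final value of col.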
Output from the first solution: [image not reproduced]

How subset a data frame by a factor and repeat a plot for each subset?

I am new to R. Forgive me if this question has an obvious answer, but I've not been able to find a solution. I have experience with SAS and may just be thinking of this problem in the wrong way.
I have a dataset with repeated measures from hundreds of subjects with each subject having multiple measurements across different ages. Each subject is identified by an ID variable. I'd like to plot each measurement (let's say body WEIGHT) by AGE for each individual subject (ID).
I've used ggplot2 to do something like this:
ggplot(data = dataset, aes(x = AGE, y = WEIGHT )) + geom_line() + facet_wrap(~ID)
This works well for a small number of subjects but won't work for the entire dataset.
I've also tried something like this:
ggplot(data=data, aes(x = AGE,y = BW, group = ID, colour = ID)) + geom_line()
This also works for a small number of subjects but is unreadable with hundreds of subjects.
I've tried to subset using code like this:
temp <- split(dataset,dataset$ID)
but I'm not sure how to work with the resulting dataset. Or perhaps there is a way to simply adjust the facet_wrap so that individual plots are created?
Thanks!
Because you want to split up the dataset and make a plot for each level of a factor, I would approach this with one of the split-apply-return tools from the plyr package.
Here is a toy example using the mtcars dataset. I first create the plot and name it p, then use dlply to split the dataset by a factor and return a plot for each level. I'm taking advantage of %+% from ggplot2 to replace the data.frame in a plot.
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_line()
require(plyr)
dlply(mtcars, .(cyl), function(x) p %+% x)
This returns all the plots, one after another. If you name the resulting list object you can also call one plot at a time.
plots = dlply(mtcars, .(cyl), function(x) p %+% x)
plots[1]
Edit
I started thinking about putting a title on each plot based on the factor, which seems like it would be useful.
dlply(mtcars, .(cyl), function(x) p %+% x + facet_wrap(~cyl))
Edit 2
Here is one way to save these in a single document, one plot per page, working with the list of plots named plots. I didn't change any of the defaults in pdf, but you can certainly explore the changes you can make.
pdf()
plots
dev.off()
Updated to use package dplyr instead of plyr. This is done in do, and the output will have a named column that contains all the plots as a list.
library(dplyr)
plots = mtcars %>%
  group_by(cyl) %>%
  do(plots = p %+% . + facet_wrap(~cyl))
Source: local data frame [3 x 2]
Groups: <by row>
  cyl           plots
1   4 <S3:gg, ggplot>
2   6 <S3:gg, ggplot>
3   8 <S3:gg, ggplot>
To see the plots in R, just ask for the column that contains the plots.
plots$plots
And to save as a pdf
pdf()
plots$plots
dev.off()
A few years ago, I wanted to do something similar - plot individual trajectories for ~2500 participants with 1-7 measurements each. I did it like this, using plyr and ggplot2:
library(plyr)
library(ggplot2)
d_ply(dat, .variables = "participant_id", .fun = function(x) {
  # Generate the desired plot
  plt <- ggplot(x, aes(x = phase, y = result)) +
    geom_point() +
    geom_line()
  # Save it to a file named after the participant
  # Putting it in a subdirectory is prudent (the "plots" directory must already exist)
  # x$participant_id repeats the same ID on every row, so take the first value
  ggsave(file.path("plots", paste0(x$participant_id[1], ".png")), plot = plt)
})
A little slow, but it worked. If you want to get a sense of all participants' trajectories in one plot (like your second example - aka the spaghetti plot), you can tweak the transparency of the lines (forget coloring them, though):
ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) +
geom_line(alpha = 0.3)
lapply(temp, function(X) ggplot(X, ...))
where X is your subsetted data.
Keep in mind you may have to explicitly print the ggplot object (print(ggplot(X, ...))).
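Filled in with concrete arguments, the split-and-lapply idea might look like the sketch below. It uses mtcars split by cyl as a stand-in for the subject data split by ID; the names temp and subset_plots and the choice of aesthetics are illustrative.

```r
library(ggplot2)

# One data frame per level of the splitting variable
temp <- split(mtcars, mtcars$cyl)

# Build one plot per subset, titled with the level it came from
subset_plots <- lapply(names(temp), function(id) {
  ggplot(temp[[id]], aes(x = wt, y = mpg)) +
    geom_line() +
    ggtitle(paste("cyl =", id))
})
names(subset_plots) <- names(temp)

# Print explicitly; invisible() stops the list itself from being echoed
invisible(lapply(subset_plots, print))
```

Wrapping the last line between pdf() and dev.off() would send each plot to its own page of a single file instead of the screen.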

How can I overlay by-group plot elements to ggplot2 facets?

My question has to do with facetting. In my example code below, I look at some facetted scatterplots, then try to overlay information (in this case, mean lines) on a per-facet basis.
The tl;dr version is that my attempts fail. Either my added mean lines compute across all data (disrespecting the facet variable), or I try to write a formula and R throws an error, followed by incisive and particularly disparaging comments about my mother.
library(ggplot2)
# Let's pretend we're exploring the relationship between a car's weight and its
# horsepower, using some sample data
p <- ggplot()
p <- p + geom_point(aes(x = wt, y = hp), data = mtcars)
print(p)
# Hmm. A quick check of the data reveals that car weights can differ wildly, by almost
# a thousand pounds.
head(mtcars)
# Does the difference matter? It might, especially if most 8-cylinder cars are heavy,
# and most 4-cylinder cars are light. ColorBrewer to the rescue!
p <- p + aes(color = factor(cyl))
p <- p + scale_color_brewer(palette = "Set1")
print(p)
# At this point, what would be great is if we could more strongly visually separate
# the cars out by their engine blocks.
p <- p + facet_grid(~ cyl)
print(p)
# Ah! Now we can see (given the fixed scales) that the 4-cylinder cars flock to the
# left on weight measures, while the 8-cylinder cars flock right. But you know what
# would be REALLY awesome? If we could visually compare the means of the car groups.
p.with.means <- p + geom_hline(
aes(yintercept = mean(hp)),
data = mtcars
)
print(p.with.means)
# Wait, that's not right. That's not right at all. The green (8-cylinder) cars are all above the
# average for their group. Are they somehow made in an auto plant in Lake Wobegon, MN? Obviously,
# I meant to draw mean lines factored by GROUP. Except also obviously, since the code below will
# print an error, I don't know how.
p.with.non.lake.wobegon.means <- p + geom_hline(
aes(yintercept = mean(hp) ~ cyl),
data = mtcars
)
print(p.with.non.lake.wobegon.means)
There must be some simple solution I'm missing.
You mean something like this:
library(plyr)  # provides ddply() and .()
rs <- ddply(mtcars,.(cyl),summarise,mn = mean(hp))
p + geom_hline(data=rs,aes(yintercept=mn))
It might be possible to do this within the ggplot call using stat_*, but I'd have to go back and tinker a bit. But generally if I'm adding summaries to a faceted plot I calculate the summaries separately and then add them with their own geom.
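The "calculate the summaries separately" step does not depend on plyr. A sketch of the same idea using only base R's aggregate() for the group means follows; group_means and means_plot are illustrative names, and the plot is rebuilt from scratch so the example runs on its own.

```r
library(ggplot2)

# Mean horsepower per cylinder group, one row per group
group_means <- aggregate(hp ~ cyl, data = mtcars, FUN = mean)
names(group_means)[2] <- "mn"

# The summary data carries the cyl column, so each line lands in its own facet
means_plot <- ggplot(mtcars, aes(x = wt, y = hp, colour = factor(cyl))) +
  geom_point() +
  facet_grid(~cyl) +
  geom_hline(data = group_means, aes(yintercept = mn))

print(means_plot)
```

The key point is that the summary data frame keeps the faceting variable, which is what lets ggplot2 route each mean line to the matching panel.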
EDIT
Just a few expanded notes on your original attempt. Generally it's a good idea to put aes calls in ggplot that will persist throughout the plot, and then specify different data sets or aesthetics in those geom's that differ from the 'base' plot. Then you don't need to keep specifying data = ... in each geom.
Finally, I came up with a kind of clever use of geom_smooth to do something similar to what you're asking:
p <- ggplot(data = mtcars,aes(x = wt, y = hp, colour = factor(cyl))) +
  facet_grid(~cyl) +
  geom_point() +
  geom_smooth(se=FALSE,method="lm",formula=y~1,colour="black")
The horizontal line (i.e. constant regression eqn) will only extend to the limits of the data in each facet, but it skips the separate data summary step.
