What am I missing to build this plot? - r

I am trying to make a Rare Earth Elements spider diagram that places concentration in log10 on the y-axis, and each respective element from the Rare Earth Elements on the x-axis. I then am trying to compare several units of rock with each other. An example of what I am looking for and what I am getting is added to the google doc link below.
So, with the code I have added I have two problems:
1. The elements are being listed on the x-axis in alphabetical order, not in the order that I have in my CSV
2. I don't know what I am missing in my code to connect the points within each sample into a line. I also don't know whether that is an issue with my code or with the way my data is arranged in the CSV.
I have seen someone else tackle this issue by treating the respective elements as dates. I have played with lubridate a bit, but I feel like it wasn't as successful as the code that I've added below... which is saying something.
ggplot(data=dataMGSREE) +
geom_point(mapping = aes(x = Concentration, y = Element, color=Group), show.legend = FALSE) +
coord_flip() +
scale_x_log10()
Analysis    Name           Element  Concentration
HM030218-2  Haycock Upper  La        65.00
HM030218-2  Haycock Upper  Ce       127.00
HM030218-2  Haycock Upper  Pr        13.46
HM030218-2  Haycock Upper  Nd        44.00
HM030218-2  Haycock Upper  Sm         6.70
HM030218-2  Haycock Upper  Eu         0.75
HM030218-2  Haycock Upper  Gd         4.48
HM030218-2  Haycock Upper  Tb         0.64
HM030218-2  Haycock Upper  Dy         3.40
HM030218-2  Haycock Upper  Ho         0.73
1-10 of 14 rows
Something similar to the expected result is listed above, while the actual result is here: https://docs.google.com/document/d/1p7QY8Ie_bmav1XApTSy1TCECvteUcxckZXpsy9Ib7Ew/edit?usp=sharing
Please forgive me also for not knowing how to upload the screenshots on here.

A few things going on
(a) If you want lines, you need to add geom_line() to your plot. You'll also need to add a group aesthetic to indicate which points to connect, presumably group = Analysis inside aes(). This is necessary whenever you plot a line with a discrete variable on an axis.
(b) See this FAQ for getting a custom order of your elements.
(c) If you want points and lines, put aes() inside the original ggplot() call, it will be passed on to both geom_point() and geom_line() so you don't have to re-specify it in subsequent layers
(d) I don't see a reason to use coord_flip here, I'd just map what you want to go on x and y from the start
(e) You don't show a column called Group in your data, so I'm surprised your color = Group works at all...
Something like this:
# change factor levels to the order in which they occur in the data
# you could also custom-specify an order, with, e.g., `levels = c("La", "Ce", "Pr", ...)`
dataMGSREE$Element = factor(dataMGSREE$Element, levels = unique(dataMGSREE$Element))
# plot with changes explained above
ggplot(data = dataMGSREE,
mapping = aes(x = Element, y = Concentration, color = Analysis, group = Analysis)) +
geom_point(show.legend = FALSE) +
geom_line() +
scale_y_log10()
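To see why the factor step matters, here is a minimal base R sketch. It assumes a hypothetical vector holding the ten elements from the question, repeated for two samples, and compares the default level order with the order set through `unique()`:

```r
# Elements in the order they appear in the question's CSV,
# repeated twice to mimic two samples (the repetition is an assumption)
elements <- c("La", "Ce", "Pr", "Nd", "Sm", "Eu", "Gd", "Tb", "Dy", "Ho")
x <- rep(elements, times = 2)

# Default: levels are sorted alphabetically, which is what ggplot2 plots
alphabetical <- levels(factor(x))

# Explicit levels: keep the order of first appearance in the data
in_data_order <- levels(factor(x, levels = unique(x)))

print(alphabetical)
print(in_data_order)
```

ggplot2 lays out a discrete axis in level order, so the second form should give La, Ce, Pr, ... on the x-axis.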

The axis ordering for discrete data like Element is determined by how the factor levels are set. It looks like here the factor levels should be in the same order they already are in the data, so you can do:
# unique() is needed because factor levels may not contain duplicates
dataMGSREE$Element = factor(dataMGSREE$Element, levels = unique(dataMGSREE$Element))
ggplot(data=dataMGSREE) +
# I set color = Analysis here because the example data didn't
# contain a Group column, replace as appropriate
geom_point(mapping = aes(x = Concentration, y = Element, color=Analysis),
show.legend = FALSE) +
coord_flip() +
scale_x_log10()

Related

Why is the variable considered continuous in legend?

I have used the following code to generate a plot with ggplot:
I want the legend to show the runs 1-8 and only the volumes 12.5 and 25. Why doesn't it show that?
Is it also possible to show all the points in the plot even though they overlap? Right now the plot only shows 4 of 8 points because of the overlap.
OP. You've already been given a part of your answer. Here's a solution given your additional comment and some explanation.
For reference, you were looking to:
Change a continuous variable to a discrete/discontinuous one and have that reflected in the legend.
Show runs 1-8 labeled in the legend
Disconnect lines based on some criteria in your dataset.
First, I'm representing your data here again in a way that is reproducible (and takes away the extra characters so you can follow along directly with all the code):
library(ggplot2)
mydata <- data.frame(
  Run = 1:8,
  Time = c(834, 834, 584, 584, 1184, 1184, 938, 938),
  Area = c(55.308, 55.308, 79.847, 79.847, 81.236, 81.236, 96.842, 96.842),
  Volume = c(12.5, 12.5, 12.5, 12.5, 25.0, 25.0, 25.0, 25.0)
)
Changing to a Discrete Variable
If you check the variable type for each column (type str(mydata)), you'll see that mydata$Run is an int and the rest of the columns are num. Each column is understood to be a number, which is treated as if it were a continuous variable. When it comes time to plot the data, ggplot2 understands this to mean that since it is reasonable that values can exist between these (they are continuous), any representation in the form of a legend should be able to show that. For this reason, you get a continuous color scale instead of a discrete one.
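The column types described above can be checked directly. This is a small base R sketch using the same data; the `volume_f` name is only illustrative:

```r
mydata <- data.frame(
  Run = 1:8,
  Time = c(834, 834, 584, 584, 1184, 1184, 938, 938),
  Area = c(55.308, 55.308, 79.847, 79.847, 81.236, 81.236, 96.842, 96.842),
  Volume = c(12.5, 12.5, 12.5, 12.5, 25.0, 25.0, 25.0, 25.0)
)

# Class of every column: Run is integer, the others are numeric
col_classes <- sapply(mydata, class)
print(col_classes)

# Converting to a factor leaves only the distinct values as levels
volume_f <- factor(mydata$Volume)
print(levels(volume_f))
```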
To force ggplot2 to give you a discrete scale, you must make your data discrete and indicate it is a factor. You can either set your variable as a factor before plotting (ex: mydata$Run <- as.factor(mydata$Run)), or use code inline, referring to aes(size = factor(Run),... instead of just aes(size = Run,....
Wrapping a variable in factor() or as.factor() inline in your ggplot calls has the effect of changing the legend title to that expression (e.g., "as.factor(Volume)"), so you will also have to set the title in the labs() call. In the end, the plot code looks like this:
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line() +
labs(
x = "Area", y = "Time",
# This has to be changed now
color='Volume'
) +
theme_bw()
Note in the above code I am also not referring to mydata$Run, but just Run. It is greatly preferable that you refer to just the name of the column when using ggplot2. Both forms run here, but the bare column name is more reliable in practice.
Disconnect Lines
The reason your lines are connected throughout the data is because there's no information given to the geom_line() object other than the aesthetics of x= and y=. If you want to have separate lines, much like having separate colors or shapes of points, you need to supply an aesthetic to use as a basis for that. Since the two lines are different based on the variable Volume in your dataset, you want to use that... but keep the same color for both. For this, we use the group= aesthetic. It tells ggplot2 we want to draw a line for each piece of data that is grouped by that aesthetic.
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", color='Volume'
) +
theme_bw()
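One way to confirm what the group= aesthetic does is to inspect the data ggplot2 computes for the line layer. The sketch below assumes ggplot2 is installed and uses its `layer_data()` helper; the `n_lines` name is only illustrative:

```r
library(ggplot2)

mydata <- data.frame(
  Run = 1:8,
  Time = c(834, 834, 584, 584, 1184, 1184, 938, 938),
  Area = c(55.308, 55.308, 79.847, 79.847, 81.236, 81.236, 96.842, 96.842),
  Volume = c(12.5, 12.5, 12.5, 12.5, 25.0, 25.0, 25.0, 25.0)
)

p <- ggplot(data = mydata, aes(x = Area, y = Time)) +
  geom_point(aes(color = as.factor(Volume), size = Run)) +
  geom_line(aes(group = as.factor(Volume)))

# Layer 2 is geom_line(); each distinct group value is drawn as its own line
line_data <- layer_data(p, 2)
n_lines <- length(unique(line_data$group))
print(n_lines)
```

With the two Volume values in this data, the line layer should contain two groups, one per volume.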
Show Runs 1-8 Labeled in Legend
Here I'm reading a bit into what you exactly wanted to do in terms of "showing runs 1-8" in the legend. This could mean one of two things, and I'll assume you want both and show you how to do both.
Listing and showing sizes 1-8 in the legend.
To set the values you see in the scale (legend) for size, you can refer to the various scale_ functions for all types of aesthetics. In this case, recall that since mydata$Run is an int, it is treated as a continuous scale. ggplot2 doesn't know how to draw a continuous scale for size, so the legend itself shows discrete sizes of points. This means we don't need to change Run to a factor, but what we do need is to indicate specifically we want to show in the legend all breaks in the sequence from 1 to 8. You can do this using scale_size_continuous(breaks=...).
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", color='Volume'
) +
scale_size_continuous(breaks=c(1:8)) +
theme_bw()
Showing all of your runs as points.
The note about showing all runs might also mean you want to literally see each run represented as a discrete point in your plot. For this... well, they already are! ggplot2 is plotting each of your points from your data into the chart. Since some points share the same values of x= and y=, you are getting overplotting - the points are drawn over top of one another.
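The overplotting can be verified without drawing anything by counting distinct (Area, Time) positions. A minimal base R sketch on the same data:

```r
mydata <- data.frame(
  Run = 1:8,
  Time = c(834, 834, 584, 584, 1184, 1184, 938, 938),
  Area = c(55.308, 55.308, 79.847, 79.847, 81.236, 81.236, 96.842, 96.842),
  Volume = c(12.5, 12.5, 12.5, 12.5, 25.0, 25.0, 25.0, 25.0)
)

# Distinct plotting positions: fewer rows than mydata means overplotting
positions <- unique(mydata[, c("Area", "Time")])
print(nrow(positions))

# How many runs share each position
runs_per_position <- aggregate(Run ~ Area + Time, data = mydata, FUN = length)
print(runs_per_position)
```

Eight runs fall on four positions, two runs each, which matches the four visible points.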
If you want to visually see each point represented here, one option could be to use geom_jitter() instead of geom_point(). It's not really great here, because it will look like your data has different x and y values, but it is an option if this is what you want to do. Note in the code below I'm also changing the shape of the point to be a hollow circle for better clarity, where the color= is the line around each point (here it's black), and the fill= aesthetic is instead used for Volume. You should get the idea though.
set.seed(1234) # using the same randomization seed ensures you have the same jitter
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_jitter(aes(fill =as.factor(Volume), size = Run), shape=21, color='black') +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", fill='Volume'
) +
scale_size_continuous(breaks=c(1:8)) +
theme_bw()

ggplot scale_fill_discrete(breaks = user_countries) creates a second, undesired legend

I am trying to change the factor level ordering of a data frame column to control the legend ordering and ggplot coloring of factor levels specified by country name. Here is my dataframe country_hours:
countries hours
1 Brazil 17
2 Mexico 13
3 Poland 20
4 Indonesia 2
5 Norway 20
6 Poland 20
Here is how I try to plot subsets of the data frame depending on a list of selected countries, user_countries:
make_country_plot<-function(user_countries, country_hours_pre)
{
country_hours = country_hours_pre[which(country_hours_pre$countries %in% user_countries) ,]
country_hours$countries = factor(country_hours$countries, levels = c(user_countries))
p = ggplot(data=country_hours, aes(x=hours, color=countries))
for(name in user_countries){
p = p + geom_bar( data=subset(country_hours, countries==name), aes(y = (..count..)/sum(..count..), fill=countries), binwidth = 1, alpha = .3)
}
p = p + scale_y_continuous(labels = percent) + geom_density(size = 1, aes(color=countries), adjust=1) +
ggtitle("Baltic countries") + theme(plot.title = element_text(lineheight=.8, face="bold")) + scale_fill_discrete(breaks = user_countries)
}
This works great in that the coloring goes according to my desired order as does the top legend, but a second legend appears and shows a different order. Without scale_fill_discrete(breaks = user_countries) I do not get my desired order, but I also do not get two legends. In the plot shown below, the desired order, given by user_countries was
user_countries = c("Lithuania", "Latvia", "Estonia")
I'd like to get rid of this second legend. How can I do it?
I also have another problem, which is that the plotting/coloring is inconsistent between different plots. I'd like the "first" country to always be blue, but it's not always blue. Also the 'real' legend (darker/solid colors) is not always in the same position - sometimes it's below the incorrect/black legend. Why does this happen and how can I make this consistent across plots?
Also, different plots have different numbers of factor groups, sometimes more than 9, so I'd rather stick with standard ggplot coloring as most of the solutions for defining your own colors seem limited in the number of colors you can do (How to assign colors to categorical variables in ggplot2 that have stable mapping?)
You are mapping to two different aesthetics (color and fill) but you changed the scale specifications for only one of them. Doing this will always split a previously combined legend. There is a nice example of this on this page
To keep your legends combined, you'll want to add scale_color_discrete(breaks = user_countries) in addition to scale_fill_discrete(breaks = user_countries).
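A minimal sketch of that fix, assuming ggplot2 is installed. The data here is made up for illustration and the plot is simplified to a single density layer; only the two scale calls are the point:

```r
library(ggplot2)

# Hypothetical data: random hours for three countries, not from the question
set.seed(1)
country_hours <- data.frame(
  countries = rep(c("Estonia", "Latvia", "Lithuania"), each = 20),
  hours = c(rpois(20, 8), rpois(20, 12), rpois(20, 16))
)

user_countries <- c("Lithuania", "Latvia", "Estonia")
country_hours$countries <- factor(country_hours$countries, levels = user_countries)

# Both aesthetics get the same breaks, so their legends can stay merged
p <- ggplot(country_hours, aes(x = hours, color = countries, fill = countries)) +
  geom_density(alpha = 0.3) +
  scale_color_discrete(breaks = user_countries) +
  scale_fill_discrete(breaks = user_countries)

print(p)
```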
I don't have enough reputation to comment, but this previous question has a comprehensive answer.
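Short answer is to change geom_density so that it doesn't map countries to color. That means taking color out of aes() and setting it outside instead. Outside aes() a bare column name is not looked up in the data, so the value has to be a constant (the "black" here is only a placeholder):
geom_density(size = 1, color = "black", adjust = 1)
(This should work. Don't have an example to confirm).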

Overlay ggplot2 stat_density2d plots with alpha channels constant across groups

I would like to plot multiple groups in a stat_density2d plot with alpha values related to the counts of observations in each group. However, the levels formed by stat_density2d seem to be normalized to the number of observations in each group. For example,
temp <- rbind(movies[1:2,],movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
ggplot(temp, aes(x=rating,y=length)) +
stat_density2d(geom="tile", aes(fill = mpaa, alpha=..density..), contour=FALSE) +
theme_minimal()
Creates a plot like this:
Because I only included 2 points without ratings, they result in densities that look much tighter/stronger than the other two, and so wash out the other two densities. I've tried looking at Overlay two ggplot2 stat_density2d plots with alpha channels and Specifying the scale for the density in ggplot2's stat_density2d but they don't really address this specific issue.
Ultimately, what I'm trying to accomplish with my real data, is I have "power" samples from discrete 2d locations for multiple conditions, and I am trying to plot what their relative powers/spatial distributions are. I am duplicating points in locations relative to their powers, but this has resulted in low power conditions with just a few locations looking the strongest when using stat_density2d. Please let me know if there is a better way of going about doing this!
Thanks!
stat_binhex, which understands ..count.. in addition to ..density.., may work for you:
ggplot(temp, aes(x=rating,y=length)) +
stat_binhex(geom="hex", aes(fill = mpaa, alpha=..count..)) +
theme_minimal()
Although you may want to adjust the bin width.
Not the most elegant R code, but this seems to work. I normalize my real data a bit differently than this, but it gets the idea of the solution I found across. I use a for loop where I find the average power for the condition and add a new stat_density2d layer with the alpha scaled by that average power.
temp <- rbind(movies[1:2,],movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
mpaa = unique(temp$mpaa)
p <- ggplot() + theme_minimal()
for (ii in seq(1,3)) {
ratio = length(which(temp$mpaa == mpaa[ii]))
p <- p + stat_density2d(data=temp[temp$mpaa == mpaa[ii],],
aes(x=rating,y=length,fill = mpaa, alpha=..level..),
geom="polygon",
contour=TRUE,
alpha = ratio/20,
lineType = "none")
}
print(p)

How to create a complex bubble chart in R

I have a data set like this (simplified for illustration purposes):
zz <- textConnection("Company Market.Cap Institutions.Own Price.Earnings Industry
ExxonMobil 405.69 50% 9.3 Energy
Citigroup 156.23 67% 18.45 Banking
Pfizer 212.51 73% 20.91 Pharma
JPMorgan 193.1 75% 9.12 Banking
")
Companies <- read.table(zz, header= TRUE)
close(zz)
I would like to create a bubble chart (well, something like a bubble chart) with the following properties:
each bubble is a company, with the size of the bubble tied to market cap,
the color of the bubble tied to industry,
with the x-axis having two categories, Institutions.Own and Price.Earnings,
and the y-axis being a 1-10 scale, each company's values being normalized to that scale. (I could of course do the normalization outside R but I believe R makes that possible.)
To be clear, each company will appear in each column of the result, for example ExxonMobil will be near the bottom of both the Institutions.Own column and the Price.Earnings column; ideally, the name of the company would appear in or next to both of its bubbles.
I think this touches on all of your points. Note - your Institutions.Own is read in as a factor because of the %...I simply deleted that for ease of time...you'll need to address that somewhere. Once that is done, I would use ggplot and map your different aesthetics accordingly. You can fiddle with the axes titles et al if you need.
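The percent sign mentioned above can also be handled in R rather than by editing the data. This is a base R sketch that reads the column as text and strips the `%` before converting:

```r
# Same data as in the question, read with strings kept as character
Companies <- read.table(text = "Company Market.Cap Institutions.Own Price.Earnings Industry
ExxonMobil 405.69 50% 9.3 Energy
Citigroup 156.23 67% 18.45 Banking
Pfizer 212.51 73% 20.91 Pharma
JPMorgan 193.1 75% 9.12 Banking", header = TRUE, stringsAsFactors = FALSE)

# Remove the literal "%" and convert what is left to a number
Companies$Institutions.Own <- as.numeric(
  sub("%", "", Companies$Institutions.Own, fixed = TRUE)
)

print(Companies$Institutions.Own)
```

After this step the column is numeric and can be rescaled like any other variable.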
#Requisite packages
library(ggplot2)
library(reshape2)
#Define function, adjust this as necessary
rescaler <- function(x) 10 * (x-min(x)) / (max(x)-min(x))
#Rescale your two variables
Companies$Inst.Scales <- with(Companies, rescaler(Institutions.Own))
Companies$Price.Scales <- with(Companies, rescaler(Price.Earnings))
#Melt into long format
Companies.m <- melt(Companies, measure.vars = c("Inst.Scales", "Price.Scales"))
#Plotting code
ggplot(Companies.m, aes(x = variable, y = value, label = Company)) +
geom_point(aes(size = Market.Cap, colour = Industry)) +
geom_text(hjust = 1, size = 3) +
scale_size(range = c(4,8)) +
theme_bw()
Results in:

adding a key for geom_line to legend from geom_area

I have a data frame, where I am talking about different flows of water at a dam (water units are kcfs—1000 cubic feet per second—if anyone is interested)
Call it df4plot
date kcfs Flowtype
10/1/2010 50 Power
10/1/2010 10 Spill_Overgen
10/1/2010 8 Spill_Force
10/2/2010 52 Power
10/2/2010 7 Spill_Overgen
10/2/2010 10 Spill_Force
(there are 3x365 rows in the data frame)
So what I want to do is make an aggregated area graph that shows each of these flows
p <- ggplot(data = df4plot, aes(date,kcfs)) +
geom_area(aes(colour = Flowtype, fill=Flowtype), position = "stack")
I want to control the colors used, so I added
plot_colors_aggregate <- c("forestgreen","lightsalmon","dodgerblue")
p <- p +
scale_color_manual(values = plot_colors_aggregate) +
scale_fill_manual(values = plot_colors_aggregate)
Now I want to add a dashed line, showing the maximum turbine capacity—the flow limits for power generation—that vary by month. I have a separate dataframe for this (365 rows long), df4FGline
Date FGlimit
10/1/2010 52
10/2/2010 52
…
11/1/2010 60
11/2/2010 60
...
Etc
So now I have
p <- p +
geom_line(data = df4FGline, aes(x=date,y=FGlimit), colour = "darkblue", linetype = "dashed")
p
The legend is currently just the three blocks for the three types of Flowtype. I’d like to add the dashed line for the flow gate limits to the bottom, but I can’t get it to show up there.
It is probably related to my incomplete understanding of aes (help(aes) is AMAZINGLY unhelpful).
I’ve tried something similar to this and something similar to this, but since I’m only trying to add 1 line to a pre-existing legend, maybe?, this is not working for me.
I tried adding “legend = TRUE” inside the parentheses for the geom_line, but it put a dashed line inside each color box in the legend, AND created a 4th entry for the legend, but offset from the rest of the legend (below and to the right)... ARG!
I swear I have the book on order... any help you can share so that I understand this aesthetic thing and how it relates to the legend a little better, I'd be extremely grateful.
This should help:
df <- data.frame(x = 1:10,y = 1:10)
ggplot(df,aes(x = x,y = y)) +
geom_line(aes(linetype = "dashed")) +
scale_linetype_manual(name = "Linetype",values = "dashed")
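Applied to the dam plot from the question, the same idea might look like the sketch below. It assumes ggplot2 is installed, uses only the six rows shown in the question, assumes both data frames name their date column `date`, and the "Flow gate limit" label is a made-up legend entry:

```r
library(ggplot2)

# The six rows shown in the question, with dates parsed as Date
df4plot <- data.frame(
  date = rep(as.Date(c("2010-10-01", "2010-10-02")), each = 3),
  kcfs = c(50, 10, 8, 52, 7, 10),
  Flowtype = rep(c("Power", "Spill_Overgen", "Spill_Force"), times = 2)
)
df4FGline <- data.frame(
  date = as.Date(c("2010-10-01", "2010-10-02")),
  FGlimit = c(52, 52)
)
plot_colors_aggregate <- c("forestgreen", "lightsalmon", "dodgerblue")

# Mapping linetype inside aes() is what creates the legend key;
# colour stays outside aes() because it is a fixed setting
p <- ggplot(df4plot, aes(date, kcfs)) +
  geom_area(aes(fill = Flowtype), position = "stack") +
  geom_line(data = df4FGline,
            aes(y = FGlimit, linetype = "Flow gate limit"),
            colour = "darkblue") +
  scale_fill_manual(values = plot_colors_aggregate) +
  scale_linetype_manual(name = NULL, values = c("Flow gate limit" = "dashed"))

print(p)
```

The constant string inside aes() acts as a one-level variable, so the line gets its own legend entry separate from the fill keys.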
