In R, I am using the command plot(Strength, Weight, col= Area) to plot a scatterplot, with Weight as the explanatory numerical variable, and Area as the categorical explanatory variable, and Strength as the response.
There are, say, 6 areas, 1-6, but how can I tell which colour is associated with which area?
The scatterplot is coming out fine, but I can't tell which area the 6 colours on the scatterplot belong to.
You need to add a legend to your plot, see for instance https://www.geeksforgeeks.org/add-legend-to-plot-in-r/
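If you want to stay with base graphics, a minimal sketch could look like the one below. The data are made up, since the question does not include any, and it assumes Area is coded as the numbers 1 to 6, so that col = Area picks colours 1 to 6 from the current palette() and the legend can list those same numbers.

```r
# Made-up data standing in for the asker's variables (an assumption)
set.seed(1)
Area     <- rep(1:6, each = 5)
Weight   <- runif(30, min = 50, max = 100)
Strength <- 2 * Weight + rnorm(30, sd = 10)

# Explanatory variable on x, response on y; numeric col= values
# index into the current palette()
plot(Weight, Strength, col = Area, pch = 19)

# One legend entry per area, using the same colour indices as the plot
legend("topleft",
       legend = paste("Area", 1:6),
       col    = 1:6,
       pch    = 19)
```

If Area were a factor instead, levels(Area) and seq_along(levels(Area)) would play the roles of the labels and the colour indices.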
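But it will be easier to use the package ggplot2, which makes the legend for you automatically. Something like this, assuming your variables are in a data frame called yourdata:
library(ggplot2)
# Weight (explanatory) on x, Strength (response) on y;
# factor() gives Area a discrete colour scale with one legend key per area
ggplot(yourdata, aes(x = Weight, y = Strength, color = factor(Area))) +
  geom_point()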
Learning ggplot2 (gg is "grammar of graphics") will save you time in the long run!
I am using the below code to plot a data frame on the same plot:
ggplot(df) + geom_line(aes(x = date, y = values, colour = X > 5))
The plot works and looks great, except for one thing: because I am using geom_line, the points that are above the threshold get connected to each other. I do not want the lines connecting the blue data.
How do I stop this from happening?
Here's an example using the economics dataset included in ggplot2. You see the same thing if we highlight the line based on values above 8000:
ggplot(economics, aes(date, unemploy)) +
geom_line(aes(color=unemploy > 8000))
When you map a discrete variable to an aesthetic such as color, ggplot2 by default also groups your data by that variable. This makes total sense if you have data in long form and want to draw a separate line for each value in a column. In cases like this one, though, you want ggplot2 to change the color of the line based on the data without splitting the line into groups by color. This is why you need to override the group= aesthetic.
To override the grouping that happens when you map color in your line geom, you can just say group=1, or really group= any constant value. This puts every observation in the same group, so the line connects all your points in order but is colored differently along its length:
ggplot(economics, aes(date, unemploy)) +
geom_line(aes(color=unemploy > 8000, group=1))
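Applied to the data frame from the question, the same fix might look like the sketch below. The data are invented here (only the column names date, values and X come from the question), so treat it as an illustration rather than the asker's actual setup. As far as I know, each line segment takes the colour of the point it starts from.

```r
library(ggplot2)

# Invented data with the column names used in the question (an assumption)
df <- data.frame(
  date   = as.Date("2020-01-01") + 0:9,
  values = c(1, 3, 6, 7, 4, 2, 8, 9, 3, 1)
)
df$X <- df$values

# group = 1 keeps all observations in one group, so the line is drawn
# through the points in date order and only its colour changes
p <- ggplot(df, aes(x = date, y = values)) +
  geom_line(aes(colour = X > 5, group = 1))
p
```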
I honestly don't know why this is being so hard.
I'm creating a simple scatter plot. The x axis is a continuous variable, and at every tick in x I need to plot four points with error bars. I'm using position dodge and everything works fine.
Each point has a different color, size and shape as governed by three further variables: color and shape are governed by factors, size by a continuous variable.
By default, the four points reflect the order of the levels in the color variable (red always left, then green, then blue) but I would like them to reflect the order of the size variable (the continuous one), smallest left and largest right. How do I specify that size should be prioritised when ordering points in position dodge? I tried using reverse ordering but then the points are ordered first according to the shape legend.
I could change the mapping between variable and aesthetics (all variables are fundamentally continuous and could be used with size) but I think it'd be useful to know how to specify the order in which multiple variables should be considered when dodging points.
The question is somewhat unclear unfortunately. You don't show "a simple scatter plot". You are showing some statistics (mean with error band??) for specific x values - although this is seemingly continuous, this looks as if you have categorised it beforehand - resulting in some summary statistics which you are plotting.
Also, it is hard (if not impossible) to help you fully without seeing the code and data you have used so far.
I have tried to reproduce a similar looking plot with mtcars.
Dodging is only possible by one group (but one group can contain more than one variable). To specify how to group, add group = ... to your aesthetics.
Like so:
library(tidyverse)
ggplot(filter(mtcars, carb %in% 1:4)) +
  geom_point(aes(carb, mpg, size = gear, group = gear,
                 shape = as.character(vs), color = as.factor(cyl)),
             position = position_dodge(width = .5))
This is now dodged by gear, which is also used as size aesthetic.
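If the group has to combine more than one variable, one option is interaction(). The sketch below is an assumption about how you might prioritise one variable over another, not something taken from the question: with lex.order = TRUE the levels are ordered by the first variable (gear), and within that by the second (cyl), and the dodge positions should follow that group order.

```r
library(ggplot2)

cars <- subset(mtcars, carb %in% 1:4)

# Group by gear first, then cyl; lex.order = TRUE makes the first
# variable the slowest-varying one in the level order
p <- ggplot(cars) +
  geom_point(
    aes(carb, mpg,
        size   = gear,
        shape  = factor(vs),
        colour = factor(cyl),
        group  = interaction(gear, cyl, lex.order = TRUE)),
    position = position_dodge(width = 0.5)
  )
p
```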
I want to create a bar graph that also shows the mean value for the bars in each group, AND shows the mean line in the legend.
I have been able to get this graph (a bar chart with means) using the code below, which is fine, but I would like to be able to see the mean lines in the legend.
## The data to be graphed is the proportion of persons receiving a treatment
## (num = numerator) in each population (denom = denominator). The population is
## grouped into two age groups (Age) and further divided by a categorical
## variable V1
###SET UP DATAFRAME###
require(ggplot2)
df <- data.frame(V1 = c(rep(c("S1","S2","S3","S4","S5"),2)),
Age= c(rep(70,5),rep(80,5)),
num=c(5280,6570,5307,4894,4119,3377,4244,2999,2971,2322),
denom=c(9984,12600,9425,8206,7227,7290,8808,6386,6206,5227))
df$prop<-df$num/df$denom*100
PopMean<-sum(df$num)/sum(df$denom)*100
df70<-df[df$Age==70,]
group70mean<-sum(df70$num)/sum(df70$denom)*100
df80<-df[df$Age==80,]
group80mean<-sum(df80$num)/sum(df80$denom)*100
df$PopMean<-c(rep(PopMean,10))
df$groupmeans<-c(rep(group70mean,5),rep(group80mean,5))
I want the plot to look like this, but want the lines in the legend too, to be labelled as 'mean of group' or similar.
#basic plot
P<-ggplot(df, aes(x=factor(Age), y=prop, fill=factor(V1))) +
geom_bar(position=position_dodge(), colour='black',stat="identity")
P
####add mean lines
P+geom_errorbar(aes(y=df$groupmeans, ymax=df$groupmeans,
ymin=df$groupmeans), col="red", lwd=2)
Adding show.legend=TRUE overlays the error bars onto the factor legend, rather than separately. If there is a way of showing geom_errorbar separately in the legend this is probably the simplest solution.
I have also tried various things with geom_line.
The syntax below produces a line for the population mean value, but the line runs between the centres of the x positions rather than covering the full width of the bars.
It does produce a legend, but one showing a bar of colour rather than a line.
P+geom_line(aes(y=df$PopMean, group=df$PopMean, color=df$PopMean),lwd=1)
If I try to draw lines for the group means, the lines are not visible (because each one consists of a single x position).
P+geom_line(aes(y=df$groupmeans, group=df$groupmeans, color=df$groupmeans))
I also tried to get round this with facet plot, although this requires me to pretend my categorical variable is numeric to get it to work.
###set up new df
df2<-df
df2$V1<-c(rep(c(1,2,3,4,5),2))
P<-ggplot(df2, aes(x=factor(V1), y=prop, fill=factor(V1))) +
geom_bar(position=position_dodge(),
colour='black',stat="identity",width=1)
P+facet_grid(.~factor(df2$Age))
P+facet_grid(.~factor(df2$Age))+geom_line(aes(y=df$groupmeans,
group=df$groupmeans, color=df$groupmeans))
This allows me to show the mean lines using geom_line, so a legend does appear (although it doesn't look right, showing a colour gradient rather than coloured lines!). However, the lines still do not go the full width of the bars. Also, my x-axis now needs relabelling to show S1, S2 etc. rather than the numbers 1, 2, 3.
To sum up - is there a way of showing error bar lines separately in the legend?
If not, then, if I use facetting, how do I correct the legend appearance and relabel the axes with my categorical variables, and is it possible to get the line to go the full width of the plot?
Or is there an alternate solution that I am missing!?
Thanks
To get a legend for geom_errorbar you need to map the colour inside aes().
As you want only one category (here red), I've created a dummy variable first:
df$mean <- "Mean"
ggplot(df, aes(x=factor(Age), y=prop, fill=factor(V1))) +
geom_bar(position=position_dodge(), colour='black',stat="identity") +
geom_errorbar(aes (ymax=groupmeans,
ymin=groupmeans, colour=mean), lwd=2) +
scale_colour_manual(name="",values = "#ff0000")
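A variation on the same idea, as a rough sketch: keep the group means in their own small data frame and label the legend entry 'Mean of group'. The names means and label are made up for this example, and linewidth= assumes ggplot2 3.4.0 or later (older versions used size= or lwd= for this).

```r
library(ggplot2)

df <- data.frame(V1    = rep(c("S1", "S2", "S3", "S4", "S5"), 2),
                 Age   = c(rep(70, 5), rep(80, 5)),
                 num   = c(5280, 6570, 5307, 4894, 4119, 3377, 4244, 2999, 2971, 2322),
                 denom = c(9984, 12600, 9425, 8206, 7227, 7290, 8808, 6386, 6206, 5227))
df$prop <- df$num / df$denom * 100

# One pooled mean per age group: sum of numerators over sum of denominators
gm <- tapply(df$num, df$Age, sum) / tapply(df$denom, df$Age, sum) * 100
means <- data.frame(Age       = names(gm),
                    groupmean = as.numeric(gm),
                    label     = "Mean of group")  # text shown in the legend

p <- ggplot(df, aes(x = factor(Age), y = prop, fill = factor(V1))) +
  geom_col(position = position_dodge(), colour = "black") +
  # ymin == ymax collapses each error bar into a horizontal line;
  # inherit.aes = FALSE keeps the bar fill out of this layer
  geom_errorbar(data = means,
                aes(x = factor(Age), ymin = groupmean, ymax = groupmean,
                    colour = label),
                inherit.aes = FALSE, width = 0.9, linewidth = 2) +
  scale_colour_manual(name = NULL, values = c("Mean of group" = "red"))
p
```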
I am plotting multiple dataframes, where the color of the line is dependent on a variable in the dataframe. The problem is that for each plot, R makes the color spectrum relative to the range of each plot.
I would like the range (and the corresponding colors) to be kept constant for all of the dataframes I'm using. I won't know the range of numbers in advance, though they'll all be set before plotting. In addition, there will be hundreds of values, so a manual mapping is not feasible.
As of right now, I have:
library(ggplot2)
df1 <- as.data.frame(list('x'=1:5,'y'=1:5,'colors'=6:10))
df2 <- as.data.frame(list('x'=1:5,'y'=1:5,'colors'=8:12))
qplot(data=df1,x,y,geom='line', colour=colors)
qplot(data=df2,x,y,geom='line', colour=colors)
The first plot has a color range that goes from 6-10.
The second plot has a color range that goes from 8-12.
I would like a constant range for both that goes from 6-12.
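One possible approach, sketched here under the assumption that all data frames are available before plotting: compute the overall range once and pass it as limits= to the colour scale of every plot. The names shared_limits and plot_one are made up for this example, and it uses ggplot() rather than qplot().

```r
library(ggplot2)

df1 <- data.frame(x = 1:5, y = 1:5, colors = 6:10)
df2 <- data.frame(x = 1:5, y = 1:5, colors = 8:12)

# Overall range across every data frame that will be plotted
shared_limits <- range(df1$colors, df2$colors)

# Hypothetical helper: same plot, same colour limits for any data frame
plot_one <- function(d) {
  ggplot(d, aes(x, y, colour = colors)) +
    geom_line() +
    scale_colour_gradient(limits = shared_limits)
}

p1 <- plot_one(df1)
p2 <- plot_one(df2)
p1
p2
```

With many data frames in a list, something like range(unlist(lapply(dfs, function(d) d$colors))) would give the same limits without listing each one by hand.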