geom_boxplot not displaying correctly - r

In the assignment I am doing, it wants me to use geom_boxplot. However, I have been unable to get the graph to display the boxplots correctly.
# Convert To Factor
census_data$CIT <- as.factor(census_data$CIT)
class(census_data$CIT)
ggplot(census_data, aes(census_data[["VALP"]], (census_data[["CIT"]])) +
geom_boxplot(color = "blue", fill = "orange") +
ggtitle("Property value by citizenship status") +
xlab("“Citizenship status") + ylab("Property value")
I am slightly concerned that the CIT may not have been converted correctly to a factor.

I think you have your x and y aesthetics the wrong way around. you have VALP first which is then assumed to be x and CIT second which is asssumed to be y. Given your labels I think you want them in the other order.
I always find it helps to label them explicitly, ie aes(x=.., y=...) so you don't get confused!
You also don't need to use census_data[["VALP"]] in the aes function call, since you have supplied the census_data in the data argument just saying aes(x=CIT, y=VALP) should be enough.

Related

Why is the variable considered continous in legend?

I have used the following code to generate a plot with ggplot:
I want the legend to show the runs 1-8 and only the volumes 12.5 and 25 why doesn't it show it?
And is it possible to show all the points in the plot even though there is an overlap? Because right now the plot only shows 4 of 8 points due to overlap.
OP. You've already been given a part of your answer. Here's a solution given your additional comment and some explanation.
For reference, you were looking to:
Change a continuous variable to a discrete/discontinuous one and have that reflected in the legend.
Show runs 1-8 labeled in the legend
Disconnect lines based on some criteria in your dataset.
First, I'm representing your data here again in a way that is reproducible (and takes away the extra characters so you can follow along directly with all the code):
library(ggplot2)
mydata <- data.frame(
`Run`=c(1:8),
"Time"=c(834, 834, 584, 584, 1184, 1184, 938, 938),
`Area`=c(55.308, 55.308, 79.847, 79.847, 81.236, 81.236, 96.842, 96.842),
`Volume`=c(12.5, 12.5, 12.5, 12.5, 25.0, 25.0, 25.0, 25.0)
)
Changing to a Discrete Variable
If you check the variable type for each column (type str(mydata)), you'll see that mydata$Run is an int and the rest of the columns are num. Each column is understood to be a number, which is treated as if it were a continuous variable. When it comes time to plot the data, ggplot2 understands this to mean that since it is reasonable that values can exist between these (they are continuous), any representation in the form of a legend should be able to show that. For this reason, you get a continuous color scale instead of a discrete one.
To force ggplot2 to give you a discrete scale, you must make your data discrete and indicate it is a factor. You can either set your variable as a factor before plotting (ex: mydata$Run <- as.factor(mydata$Run), or use code inline, referring to aes(size = factor(Run),... instead of just aes(size = Run,....
Using reference to factor(Run) inline in your ggplot calls has the effect of changing the name of the variable to be "factor(Run)" in your legend, so you will have to also add that to the labs() object call. In the end, the plot code looks like this:
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line() +
labs(
x = "Area", y = "Time",
# This has to be changed now
color='Volume'
) +
theme_bw()
Note in the above code I am also not referring to mydata$Run, but just Run. It is greatly preferable that you refer to just the name of the column when using ggplot2. It works either way, but much better in practice.
Disconnect Lines
The reason your lines are connected throughout the data is because there's no information given to the geom_line() object other than the aesthetics of x= and y=. If you want to have separate lines, much like having separate colors or shapes of points, you need to supply an aesthetic to use as a basis for that. Since the two lines are different based on the variable Volume in your dataset, you want to use that... but keep the same color for both. For this, we use the group= aesthetic. It tells ggplot2 we want to draw a line for each piece of data that is grouped by that aesthetic.
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", color='Volume'
) +
theme_bw()
Show Runs 1-8 Labeled in Legend
Here I'm reading a bit into what you exactly wanted to do in terms of "showing runs 1-8" in the legend. This could mean one of two things, and I'll assume you want both and show you how to do both.
Listing and showing sizes 1-8 in the legend.
To set the values you see in the scale (legend) for size, you can refer to the various scale_ functions for all types of aesthetics. In this case, recall that since mydata$Run is an int, it is treated as a continuous scale. ggplot2 doesn't know how to draw a continuous scale for size, so the legend itself shows discrete sizes of points. This means we don't need to change Run to a factor, but what we do need is to indicate specifically we want to show in the legend all breaks in the sequence from 1 to 8. You can do this using scale_size_continuous(breaks=...).
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_point(aes(color =as.factor(Volume), size = Run)) +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", color='Volume'
) +
scale_size_continuous(breaks=c(1:8)) +
theme_bw()
Showing all of your runs as points.
The note about showing all runs might also mean you want to literally see each run represented as a discrete point in your plot. For this... well, they already are! ggplot2 is plotting each of your points from your data into the chart. Since some points share the same values of x= and y=, you are getting overplotting - the points are drawn over top of one another.
If you want to visually see each point represented here, one option could be to use geom_jitter() instead of geom_point(). It's not really great here, because it will look like your data has different x and y values, but it is an option if this is what you want to do. Note in the code below I'm also changing the shape of the point to be a hollow circle for better clarity, where the color= is the line around each point (here it's black), and the fill= aesthetic is instead used for Volume. You should get the idea though.
set.seed(1234) # using the same randomization seed ensures you have the same jitter
ggplot(data = mydata, aes(x=Area, y=Time)) +
geom_jitter(aes(fill =as.factor(Volume), size = Run), shape=21, color='black') +
geom_line(aes(group=as.factor(Volume))) +
labs(
x = "Area", y = "Time", fill='Volume'
) +
scale_size_continuous(breaks=c(1:8)) +
theme_bw()

Change manual color on my geom_plot

I have an issues and have searching and searching. I'm really really new to this R Code. I have 2 different values from my gender array 1=Male and 2=Female.
I can't figure out how to change the scale from gradient to 1 and 2 color, and how I can make different color to the female and male. I close, but can't do the last job :(
ggplot(smokingdata, aes(x=ages, y=consume, col=gender)) +
geom_point() + ylim(0, 80)
Hop there is someone how can help me with that.
The real issue here is that your data is incorrectly specified. gender is a discrete variable that is stored as a numeric value, so ggplot2 is treating it as a continuous variable. Simply convert this variable to a factor and you will get a discrete color scale.
ggplot(smokingdata, aes(x=ages, y=consume, col=as.factor(gender))) +
geom_point() + ylim(0, 80)
However, this won't generate a very useful legend. Modifying your data frame to represent the variable accurately would yield the best result:
smokingdata$gender <- factor(smokingdata$gender, levels = c(1,2), labels = c("Male", "Female"))
ggplot(smokingdata, aes(x=ages, y=consume, col=gender)) +
geom_point() + ylim(0, 80)
You can give a vector, either named or unnamed, to scale_color_manual. So if your column for gender is encoded as 1 & 2, you could use scale_color_manual(values = c("1" = "purple", "2" = "orange")). Note that if you're using numbers for gender, those need to be turned into characters to work as names in the named vector.
You could also change the gender column to a character vector instead of a numeric one; that way, ggplot will treat it as a discrete variable, rather than a continuous one.

addressing `data` in `geom_line` of ggplot

I am building a barplot with a line connecting two bars in order to show that asterisk refers to the difference between them:
Most of the plot is built correctly with the following code:
mytbl <- data.frame(
"var" =c("test", "control"),
"mean1" =c(0.019, 0.022),
"sderr"= c(0.001, 0.002)
);
mytbl$var <- relevel(mytbl$var, "test"); # without this will be sorted alphabetically (i.e. 'control', then 'test')
p <-
ggplot(mytbl, aes(x=var, y=mean1)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=mean1-sderr, ymax=mean1+sderr), width=.2)+
scale_y_continuous(labels=percent, expand=c(0,0), limits=c(NA, 1.3*max(mytbl$mean1+mytbl$sderr))) +
geom_text(mapping=aes(x=1.5, y= max(mean1+sderr)+0.005), label='*', size=10)
p
The only thing missing is the line itself. In my very old code, it was supposedly working with the following:
p +
geom_line(
mapping=aes(x=c(1,1,2,2),
y=c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)
)
)
But when I run this code now, I get an error: Error: Aesthetics must be either length 1 or the same as the data (2): x, y. By trying different things, I came to an awkward workaround: I add data=rbind(mytbl,mytbl), before mapping but I don't understand what really happens here.
P.S. additional little question (I know, I should ask in a separate SO post, sorry for that) - why in scale_y_continuous(..., limits()) I can't address data by columns and have to call mytbl$ explicitly?
Just put all that in a separate data frame:
line_data <- data.frame(x=c(1,1,2,2),
y=with(mytbl,c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)))
p + geom_line(data = line_data,aes(x = x,y = y))
In general, you should avoid using things like [ and $ when you map aesthetics inside of aes(). The intended way to use ggplot2 is usually to adjust your data into a format such that each column is exactly what you want plotted already.
You can't reference variables in mytbl in the scale_* functions because that data environment isn't passed along like it is with layers. The scales are treated separately than the data layers, and so the information about them is generally assumed to live somewhere separate from the data you are plotting.

Aesthetics must either be length one or the same length

I am trying to plot values and errorbars, a seemingly simple task. As the script is fairly long, I am trying to limit the code in give here to the necessary amount.
I can plot the graph without error bars. However, when trying to add the errorbars I get the message
Error: Aesthetics must either be length one, or the same length as the dataProblems:Tempdata
This is the code I am using. All vectors in the Tempdata data frame are of length 390.
Tempdata <- data.frame (TempDiff, Measurement.points, Room.ext.resc, MelatoninData, Proximal.vs.Distal.SD.ext, ymax, ymin)
p <- ggplot(data=Tempdata,
aes(x = Measurement.points,
y = Tempdata, colour = "Temperature Differences"))
p + geom_line(aes(x=Measurement.points, y = Tempdata$TempDiff, colour = "Gradient Proximal vs. Distal"))+
geom_errorbar(aes(ymax=Tempdata$ymax, ymin=Tempdata$ymin))
The problem is that you have the colour-variables between quotation marks. You should put the variable name at that spot. So, replacing "Temperature Differences" with TempDiff and "Gradient Proximal vs. Distal" with Proximal.vs.Distal.SD.ext will probably solve your problem.
Furthermore: you can can't call for two different colour-variables.
The improved ggplot code should probably be something like this:
ggplot(data=Tempdata, aes(x=Measurement.points, y=TempDiff, colour=Proximal.vs.Distal.SD.ext)) +
geom_line() +
geom_errorbar(aes(ymax=ymax, ymin=ymin))
I also fixed some more problems with your original code:
the $ issue reported by Roland
the fact that you have conflicting calls in your aes
the fact you are calling your dataframe inside the first aes

How can I make a legend in ggplot2 with one point entry and one line entry?

I am making a graph in ggplot2 consisting of a set of datapoints plotted as points, with the lines predicted by a fitted model overlaid. The general idea of the graph looks something like this:
names <- c(1,1,1,2,2,2,3,3,3)
xvals <- c(1:9)
yvals <- c(1,2,3,10,11,12,15,16,17)
pvals <- c(1.1,2.1,3.1,11,12,13,14,15,16)
ex_data <- data.frame(names,xvals,yvals,pvals)
ex_data$names <- factor(ex_data$names)
graph <- ggplot(data=ex_data, aes(x=xvals, y=yvals, color=names))
print(graph + geom_point() + geom_line(aes(x=xvals, y=pvals)))
As you can see, both the lines and the points are colored by a categorical variable ('names' in this case). I would like the legend to contain 2 entries: a dot labeled 'Data', and a line labeled 'Fitted' (to denote that the dots are real data and the lines are fits). However, I cannot seem to get this to work. The (awesome) guide here is great for formatting, but doesn't deal with the actual entries, while I have tried the technique here to no avail, i.e.
print(graph + scale_colour_manual("", values=c("green", "blue", "red"))
+ scale_shape_manual("", values=c(19,NA,NA))
+ scale_linetype_manual("",values=c(0,1,1)))
The main trouble is that, in my actual data, there are >200 different categories for 'names,' while I only want the 2 entries I mentioned above in the legend. Doing this with my actual data just produces a meaningless legend that runs off the page, because the legend is trying to be a key for the colors (of which I have way too many).
I'd appreciate any help!
I think this is close to what you want:
ggplot(ex_data, aes(x=xvals, group=names)) +
geom_point(aes(y=yvals, shape='data', linetype='data')) +
geom_line(aes(y=pvals, shape='fitted', linetype='fitted')) +
scale_shape_manual('', values=c(19, NA)) +
scale_linetype_manual('', values=c(0, 1))
The idea is that you specify two aesthetics (linetype and shape) for both lines and points, even though it makes no sense, say, for a point to have a linetype aesthetic. Then you manually map these "nonsense" aesthetics to "null" values (NA and 0 in this case), using a manual scale.
This has been answered already, but based on feedback I got to another question (How can I fix this strange behavior of legend in ggplot2?) this tweak may be helpful to others and may save you headaches (sorry couldn't put as a comment to the previous answer):
ggplot(ex_data, aes(x=xvals, group=names)) +
geom_point(aes(y=yvals, shape='data', linetype='data')) +
geom_line(aes(y=pvals, shape='fitted', linetype='fitted')) +
scale_shape_manual('', values=c('data'=19, 'fitted'=NA)) +
scale_linetype_manual('', values=c('data'=0, 'fitted'=1))

Resources