R: ggplot2: A referential transparency problem

The function below is a character by character exact copy of a function appearing on page 26 of the second edition of Hadley Wickham's ggplot2 book, with two exceptions:
The output is assigned to p, which is then plotted.
The expression following "colour = " in the original was "year(date)".
If you take year(economics$date), you get a numeric vector running from 1967 to 2015, inclusive, by ones. I have replaced that expression with 1967:2015. The result is an error:
Error: Aesthetics must be either length 1 or the same as the data (574): colour
I have two questions.
year <- function(x) as.POSIXlt(x)$year + 1900
p <- ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path(colour = "grey50") +
geom_point(aes(colour = 1967:2015))
plot(p)
Why on earth does this trivial change break the function?
Why did the function work in the first place? (It does.) If you look at the documentation for the colour aesthetic, here: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html, it does not say that the colour aesthetic takes either a numerical vector or a function.
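A quick length check, sketched in base R, points at the cause of the error. It assumes economics$date holds monthly dates from 1967-07-01 to 2015-04-01 (574 rows, matching the number in the error message); the dates vector below is a stand-in for that column so the sketch runs without ggplot2.

```r
# Same helper as in the question.
year <- function(x) as.POSIXlt(x)$year + 1900

# Assumption: a stand-in for economics$date (monthly, 1967-07-01 to 2015-04-01).
dates <- seq(as.Date("1967-07-01"), as.Date("2015-04-01"), by = "month")

length(dates)          # 574, one entry per row of the data
length(year(dates))    # 574, one year per row, with many repeats
length(1967:2015)      # 49, one entry per distinct year

# The distinct years are 1967..2015, but the vector itself is much longer.
identical(unique(year(dates)), as.numeric(1967:2015))   # TRUE
```

Inside aes(), year(date) is evaluated against the data, so it yields one value per row. A literal 1967:2015 has 49 values, which is neither length 1 nor length 574, hence the error.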

Related

geom_boxplot not displaying correctly

In the assignment I am doing, it wants me to use geom_boxplot. However, I have been unable to get the graph to display the boxplots correctly.
# Convert To Factor
census_data$CIT <- as.factor(census_data$CIT)
class(census_data$CIT)
ggplot(census_data, aes(census_data[["VALP"]], (census_data[["CIT"]])) +
geom_boxplot(color = "blue", fill = "orange") +
ggtitle("Property value by citizenship status") +
xlab("“Citizenship status") + ylab("Property value")
I am slightly concerned that the CIT may not have been converted correctly to a factor.
I think you have your x and y aesthetics the wrong way around. You have VALP first, which is then assumed to be x, and CIT second, which is assumed to be y. Given your labels, I think you want them in the other order.
I always find it helps to label them explicitly, i.e. aes(x = ..., y = ...), so you don't get confused!
You also don't need to use census_data[["VALP"]] in the aes function call: since you have supplied census_data in the data argument, just saying aes(x=CIT, y=VALP) should be enough.
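Putting those suggestions together might look like the sketch below. The census_data frame here is a made-up stand-in (the real data are not shown in the question), so only the column names CIT and VALP are taken from the original.

```r
library(ggplot2)

# Hypothetical stand-in for census_data; the values are invented for illustration.
census_data <- data.frame(
  CIT  = rep(c(1, 2, 3), each = 4),
  VALP = c(100, 150, 200, 250, 120, 180, 220, 300, 90, 110, 160, 210) * 1000
)

# Convert to factor and confirm the conversion worked.
census_data$CIT <- factor(census_data$CIT)
is.factor(census_data$CIT)

# Name the aesthetics explicitly and refer to columns by bare name.
p <- ggplot(census_data, aes(x = CIT, y = VALP)) +
  geom_boxplot(color = "blue", fill = "orange") +
  ggtitle("Property value by citizenship status") +
  xlab("Citizenship status") +
  ylab("Property value")
print(p)
```

Checking is.factor() directly addresses the concern about whether the conversion succeeded; class() would report "factor" as well.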

Histogram starting from 0 after setting the set argument

I am trying to create a histogram for my integer variable, which has very inconsistent values. Here is the output of the summary function applied to the variable:
Min:347 1st Qu:8786 Median:20886 Mean:69522 3rd Qu:50400 Max:4069360
So as you can see it ranges from 300 to 4,000,000
Here is the code I am using to create the histogram:
ggplot(data=mydata, aes(mydata$variable)) +
geom_histogram(aes(y =..density..),
breaks=seq(300, 2000000, by = 20000),
col="#00AFBB",
fill="#00AFBB",
alpha=.2) +
geom_density(col=2) +
Although I set the seq argument and tried different values, the histogram keeps starting from 0 and ending at 4000000.
What can I do to adjust the histogram so it seems more balanced and plot the values correctly?
You can either place a restriction on the values mapped to the x-axis, effectively filtering them out:
+ scale_x_continuous(limits=c(0, 1000000))
or zoom in on the relevant part of your plot:
+ coord_cartesian(xlim=c(0, 1000000))
Do note that your first line can be reduced to:
ggplot(mydata, aes(variable)) +
as data is the first argument to ggplot, and the variables referenced in aes are always searched for in the data.frame (given to the data argument).
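A minimal sketch of the two options side by side, assuming a made-up right-skewed mydata (the real data are not shown). It uses after_stat(density), the newer spelling of ..density.., which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
# Hypothetical right-skewed stand-in for mydata$variable.
mydata <- data.frame(variable = round(rlnorm(500, meanlog = 10, sdlog = 1.5)))

base <- ggplot(mydata, aes(variable)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 20000,
                 col = "#00AFBB", fill = "#00AFBB", alpha = 0.2)

# Option 1: scale limits discard out-of-range values before the bins
# are computed, so the densities are recalculated on the remaining data.
p_filtered <- base + scale_x_continuous(limits = c(0, 1000000))

# Option 2: coord_cartesian only zooms the view; all values still
# contribute to the bin heights.
p_zoomed <- base + coord_cartesian(xlim = c(0, 1000000))
```

Printing p_filtered and p_zoomed shows the difference: the first usually emits a warning about removed rows, the second does not.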

R - ggplot2 - difference between ggplot(data, aes(x=variable...)) and ggplot(data, aes(x=data$variable...)) [duplicate]

I have currently encountered a phenomenon in ggplot2, and I would be grateful if someone could provide me with an explanation.
I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.
library(ggplot2)
species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)
data <- data.frame(species=species, numb=numb, groups=groups)
Let the following code represent the categorisation of a continuous variable.
data$factnumb <- as.factor(data$numb)
If I would like to plot this dataset, the following two pieces of code are completely interchangeable:
Note the difference after the fill= statement.
p <- ggplot(data, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(p):
q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(q):
However, when working with real-life continuous variables, not all categories will contain observations, and I still need to represent the empty categories on the x-axis in order to approximate the sample distribution. To demonstrate this, I used the following code:
data_miss <- data[which(data$numb!= 3),]
This results in a disparity between the levels of the categorical variable and the observations in the dataset:
> unique(data_miss$factnumb)
[1] 1 2 7 8 10
Levels: 1 2 3 7 8 10
I then plotted the data_miss dataset, still including all of the levels of the factnumb variable.
pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_fill_discrete(drop=FALSE) +
scale_x_discrete(drop=FALSE)+
scale_y_continuous(labels = scales::percent)
plot(pm):
qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_x_discrete(drop=FALSE)+
scale_fill_discrete(drop=FALSE) +
scale_y_continuous(labels = scales::percent)
plot(qm):
In this case, when using fill=data_miss$species the filling of the plot changes (and for the worse).
I would be really happy if someone could clear this one up for me.
Is it just "luck" that the filling is identical in plots 1 and 2, or have I stumbled upon some delicate mistake in the fine machinery of ggplot2?
Thanks in advance!
Kind regards,
Bernadette
Using aes(data$variable) is never good, never recommended, and should never be used. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).
More explanation:
ggplot uses nonstandard evaluation. A standard-evaluating R function only sees objects that exist as variables in the environment it is called from (here, the global environment), not columns inside a data frame. If I have data named mydata with a column named col1, and I do mean(col1), I get an error:
mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found
This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.
ggplot(mydata, aes(x = col1)) + geom_bar()
# no error
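The "look in the data first, then in the environment" idea can be imitated in base R with eval(). This is only an analogy for what aes() arranges behind the scenes, not ggplot2's actual implementation.

```r
mydata <- data.frame(col1 = 1:3)

# Capture the expression without evaluating it.
expr <- quote(mean(col1))

# Evaluated against the global environment alone, col1 is not found.
res_global <- tryCatch(eval(expr, globalenv()),
                       error = function(e) "object not found")

# Evaluated with the data frame as the first place to look
# (and the global environment as the fallback), col1 resolves to the column.
res_data <- eval(expr, mydata, globalenv())

res_global  # "object not found"
res_data    # 2
```

Base R's with(mydata, mean(col1)) does the same kind of lookup in one step.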
You don't have to use just a column inside aes though. To give flexibility, you can do a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):
# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11))) + geom_bar()
So what's the difference between col1 and mydata$col1? Well, col1 is the name of a column, and mydata$col1 is the actual values. ggplot will look for a column named col1 in your data and use that. mydata$col1 is just a vector: the full column, detached from the data frame. The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot splits your data into pieces and works on each piece. To do this correctly, it needs to be able to identify the data and the column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values (whatever happens to be in that column), and things don't work.
So, just use unquoted column names in aes() without data$ and everything will work as expected.
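The splitting problem described above can be imitated in base R. The sketch below is an analogy only (ggplot2's internals work differently); it reuses the species and groups columns from the question and splits by groups the way a facet would.

```r
data <- data.frame(
  species = rep(c("cat", "dog"), 30),
  groups  = rep(c("A", "A", "B", "B"), 15)
)

# Split the data into one piece per group, as a facet would.
pieces <- split(data, data$groups)

# A bare column name is looked up inside each piece ...
by_name <- lapply(pieces, function(piece) eval(quote(species), piece))

# ... whereas data$species ignores the piece and returns the full column.
by_dollar <- lapply(pieces, function(piece) eval(quote(data$species), piece))

lengths(by_name)    # A: 30, B: 30 (matches the rows in each piece)
lengths(by_dollar)  # A: 60, B: 60 (the whole column, out of step with the piece)
```

Once the vector no longer lines up with the rows being drawn, values can end up attached to the wrong observations, which is consistent with the changed fill in the qm plot.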

addressing `data` in `geom_line` of ggplot

I am building a barplot with a line connecting two bars in order to show that the asterisk refers to the difference between them.
Most of the plot is built correctly with the following code:
mytbl <- data.frame(
"var" =c("test", "control"),
"mean1" =c(0.019, 0.022),
"sderr"= c(0.001, 0.002)
);
mytbl$var <- relevel(mytbl$var, "test"); # without this will be sorted alphabetically (i.e. 'control', then 'test')
p <-
ggplot(mytbl, aes(x=var, y=mean1)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=mean1-sderr, ymax=mean1+sderr), width=.2)+
scale_y_continuous(labels=percent, expand=c(0,0), limits=c(NA, 1.3*max(mytbl$mean1+mytbl$sderr))) +
geom_text(mapping=aes(x=1.5, y= max(mean1+sderr)+0.005), label='*', size=10)
p
The only thing missing is the line itself. In my very old code, it was supposedly working with the following:
p +
geom_line(
mapping=aes(x=c(1,1,2,2),
y=c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)
)
)
But when I run this code now, I get an error: Error: Aesthetics must be either length 1 or the same as the data (2): x, y. By trying different things, I came to an awkward workaround: I add data=rbind(mytbl,mytbl), before mapping but I don't understand what really happens here.
P.S. additional little question (I know, I should ask in a separate SO post, sorry for that) - why in scale_y_continuous(..., limits()) I can't address data by columns and have to call mytbl$ explicitly?
Just put all that in a separate data frame:
line_data <- data.frame(x=c(1,1,2,2),
y=with(mytbl,c(mean1[1]+sderr[1]+0.001,
max(mean1+sderr) +0.004,
max(mean1+sderr) +0.004,
mean1[2]+sderr[2]+0.001)))
p + geom_line(data = line_data,aes(x = x,y = y))
In general, you should avoid using things like [ and $ when you map aesthetics inside of aes(). The intended way to use ggplot2 is usually to adjust your data into a format such that each column is exactly what you want plotted already.
You can't reference variables in mytbl in the scale_* functions because that data environment isn't passed along like it is with layers. The scales are treated separately from the data layers, so the information about them is generally assumed to live somewhere separate from the data you are plotting.
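One way to avoid repeating mytbl$ inside scale_y_continuous() is to compute the limit beforehand with with(). The sketch below is an assumption-laden rewrite of the question's plot: var is built as a factor with explicit levels (instead of relevel()), geom_col() replaces geom_bar(stat = "identity"), and scales::percent is called with its namespace.

```r
library(ggplot2)

mytbl <- data.frame(
  var   = factor(c("test", "control"), levels = c("test", "control")),
  mean1 = c(0.019, 0.022),
  sderr = c(0.001, 0.002)
)

# Scales do not see the layer data, so compute the upper limit up front.
y_upper <- with(mytbl, 1.3 * max(mean1 + sderr))

p <- ggplot(mytbl, aes(x = var, y = mean1)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean1 - sderr, ymax = mean1 + sderr), width = 0.2) +
  scale_y_continuous(labels = scales::percent, expand = c(0, 0),
                     limits = c(NA, y_upper))
print(p)
```

The same pattern (precompute, then pass a plain value) works for any scale argument that depends on the data.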

Aesthetics must either be length one or the same length

I am trying to plot values and errorbars, a seemingly simple task. As the script is fairly long, I am limiting the code I give here to the necessary amount.
I can plot the graph without error bars. However, when trying to add the errorbars I get the message
Error: Aesthetics must either be length one, or the same length as the dataProblems:Tempdata
This is the code I am using. All vectors in the Tempdata data frame are of length 390.
Tempdata <- data.frame (TempDiff, Measurement.points, Room.ext.resc, MelatoninData, Proximal.vs.Distal.SD.ext, ymax, ymin)
p <- ggplot(data=Tempdata,
aes(x = Measurement.points,
y = Tempdata, colour = "Temperature Differences"))
p + geom_line(aes(x=Measurement.points, y = Tempdata$TempDiff, colour = "Gradient Proximal vs. Distal"))+
geom_errorbar(aes(ymax=Tempdata$ymax, ymin=Tempdata$ymin))
The problem is that you have the colour-variables between quotation marks. You should put the variable name at that spot. So, replacing "Temperature Differences" with TempDiff and "Gradient Proximal vs. Distal" with Proximal.vs.Distal.SD.ext will probably solve your problem.
Furthermore: you can't call for two different colour-variables.
The improved ggplot code should probably be something like this:
ggplot(data=Tempdata, aes(x=Measurement.points, y=TempDiff, colour=Proximal.vs.Distal.SD.ext)) +
geom_line() +
geom_errorbar(aes(ymax=ymax, ymin=ymin))
I also fixed some more problems with your original code:
the $ issue reported by Roland
the fact that you have conflicting calls in your aes
the fact you are calling your dataframe inside the first aes
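For a version that can be run as-is, here is a minimal sketch with an invented Tempdata of 390 rows. Only the column names Measurement.points, TempDiff, ymin and ymax come from the question; the values are simulated, and the colour mapping is left out because the grouping column is not shown.

```r
library(ggplot2)

set.seed(42)
n <- 390

# Hypothetical stand-in for Tempdata; every column has length 390.
Tempdata <- data.frame(
  Measurement.points = seq_len(n),
  TempDiff           = cumsum(rnorm(n, sd = 0.1))
)
Tempdata$ymin <- Tempdata$TempDiff - 0.2
Tempdata$ymax <- Tempdata$TempDiff + 0.2

# Every aesthetic is a bare column name, so each one has exactly
# one value per row of the data.
p <- ggplot(Tempdata, aes(x = Measurement.points, y = TempDiff)) +
  geom_line() +
  geom_errorbar(aes(ymin = ymin, ymax = ymax))
print(p)
```

Mapping y to the data frame itself (y = Tempdata), as in the original first aes() call, is the kind of mapping that cannot have one value per row.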

Resources