Change manual color on my geom_plot - r

I have an issue and have been searching and searching. I'm really new to R. I have two different values in my gender column: 1 = Male and 2 = Female.
I can't figure out how to change the scale from a gradient to two discrete colors, or how to assign a different color to female and male. I'm close, but can't get the last step done :(
ggplot(smokingdata, aes(x=ages, y=consume, col=gender)) +
geom_point() + ylim(0, 80)
I hope there is someone who can help me with that.

The real issue here is that your data is incorrectly specified. gender is a discrete variable that is stored as a numeric value, so ggplot2 is treating it as a continuous variable. Simply convert this variable to a factor and you will get a discrete color scale.
ggplot(smokingdata, aes(x=ages, y=consume, col=as.factor(gender))) +
geom_point() + ylim(0, 80)
However, this won't generate a very useful legend. Modifying your data frame to represent the variable accurately would yield the best result:
smokingdata$gender <- factor(smokingdata$gender, levels = c(1,2), labels = c("Male", "Female"))
ggplot(smokingdata, aes(x=ages, y=consume, col=gender)) +
geom_point() + ylim(0, 80)

You can give a vector, either named or unnamed, to scale_color_manual. So if your column for gender is encoded as 1 & 2, you could use scale_color_manual(values = c("1" = "purple", "2" = "orange")). Note that if you're using numbers for gender, those need to be turned into characters to work as names in the named vector.
You could also change the gender column to a character vector instead of a numeric one; that way, ggplot will treat it as a discrete variable, rather than a continuous one.
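Putting the two answers together, a minimal sketch might look like the following. The data frame demo and its values are invented for illustration (the original smokingdata is not shown), and the colors are arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for smokingdata; all values are invented
demo <- data.frame(
  ages    = c(21, 35, 48, 52, 63, 70),
  consume = c(10, 25, 40, 15, 30, 5),
  gender  = c(1, 2, 1, 2, 1, 2)
)

# Store gender as a labelled factor so ggplot2 picks a discrete color scale
demo$gender <- factor(demo$gender, levels = c(1, 2), labels = c("Male", "Female"))

# Named vector: the names must match the factor labels
p <- ggplot(demo, aes(x = ages, y = consume, col = gender)) +
  geom_point() +
  ylim(0, 80) +
  scale_color_manual(values = c("Male" = "purple", "Female" = "orange"))
p
```

Because the vector is named, the colors stay attached to the right group even if the order of the factor levels changes later.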

Related

How to rename the categories of a plot in R?

I would like to rename the categories of "income" from 1,2,3,4,5 to the real values of the income in the plot. I tried this code but it does not work. Can somebody please explain to me why?
ggplot(data=subset(trips_renamed,income!="99")) +
geom_bar(mapping = aes(x = income,fill="income"))+
scale_x_discrete(labels=c("<=4000","4001-8000","8001-12000","12001-16000",">16000",position="bottom"))+
labs(y= "Total number of trips", x="Income Classes")+
theme(legend.position = "none")
It would be much easier to find and test an answer if you provided a minimal reproducible example. However, here is how to change the scale for a plot similar to the one in your question.
Since the values for x are numeric, we need to use the (somewhat counterintuitive) scale_x_continuous to change the labels on the fly:
library(ggplot2)
ggplot(data=mtcars) +
geom_bar(aes(x = gear))+
scale_x_continuous(breaks = 3:5, labels=c("<4", "4-4.9",">4"))
It seems your issue has to do with trips_renamed$income being a class "integer" or "numeric". As such, scale_x_discrete() should be replaced with scale_x_continuous(). You can either use scale_x_continuous() or convert to a discrete value (factor), then use scale_x_discrete(). Here are two examples using the following dummy dataset.
set.seed(8675309)
df <- data.frame(income=sample(1:5, 1000, replace=TRUE))
Option 1 : Relabel your continuous axis
If class(trips_renamed$income) is "numeric" or "integer", then you will need to use scale_x_continuous(). Relabeling requires you to specify both breaks= and labels= arguments, and they have to be the same length. This should work:
ggplot(df, aes(x=income)) + geom_bar() +
scale_x_continuous(breaks=1:5, labels=c("<=4000","4001-8000","8001-12000","12001-16000",">16000"),position="bottom")
Option 2 : Convert to Factor and use Discrete Scale
The other option is to convert to a factor first, then use scale_x_discrete(). Here, you don't need the breaks= argument (the levels of the factor are used):
df$income <- factor(df$income)
ggplot(df, aes(x=income)) + geom_bar() +
scale_x_discrete(labels=c("<=4000","4001-8000","8001-12000","12001-16000",">16000"),position="bottom")
You get the same plot as above.
Option 2a: Factor and define labels together
If you want to get really crafty, you can define the labels at the same time as the factor, and they will be used for the axis labels instead of the names of the levels:
df2 <- df
df2$income <- factor(df2$income, labels=c("<=4000","4001-8000","8001-12000","12001-16000",">16000"))
ggplot(df2, aes(x=income)) + geom_bar()
This together should give you a good idea of how ggplot2 works when choosing how to label the axes.

Can someone explain why my first ggplot2 box plot was just one big box and how the solution worked?

So my first ggplot2 box plot was just one big stretched out box plot, the second one was correct but I don't understand what changed and why the second one worked. I'm new to R and ggplot2, let me know if you can, thanks.
#----------------------------------------------------------
# This is the original ggplot that didn't work:
#----------------------------------------------------------
zSepalFrame <- data.frame(zSepalLength, zSepalWdth)
zPetalFrame <- data.frame(zPetalLength, zPetalWdth)
p1 <- ggplot(data = zSepalFrame, mapping = aes(x=zSepalWdth, y=zSepalLength, group = 4)) + #fill = zSepalLength
geom_boxplot(notch=TRUE) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
theme_classic() +
labs(title = "Iris Data Box Plot") +
labs(subtitle ="Z Values of Sepals From Iris.R")
p1
#----------------------------------------------------------
# This is the new ggplot box plot line that worked:
#----------------------------------------------------------
bp = ggplot(zSepalFrame, aes(x=factor(zSepalWdth), y=zSepalLength, color = zSepalWdth)) + geom_boxplot() + theme(legend.position = "none")
bp
This is what the ggplot box plot looked like
I don't have your precise dataset, OP, but it seems to stem from assigning a continuous variable to your x axis, when boxplots require a discrete variable.
A continuous variable is something like a numeric column in a dataframe. So something like this:
x <- c(4,4,4,8,8,8,8)
Even though the variable x only contains 4's and 8's, R stores it as a numeric variable, which is continuous. This means that if you plot it on the x axis, ggplot has no issue with a value falling anywhere between 4 and 8, and will position it accordingly.
The other type of variable is called discrete, which would be something like this:
y <- c("Green", "Green", "Flags", "Flags", "Cars")
The variable y contains only characters. It must be discrete, since there is no such thing as something between "Green" and "Cars". If plotted on an x axis, ggplot will group things as either being "Green", "Flags", or "Cars".
The cool thing is that you can change a continuous variable into a discrete one. One way to do that is to factorize or force R to consider a variable as a factor. If you typed factor(x), you get this:
[1] 4 4 4 8 8 8 8
Levels: 4 8
The values in x are the same, but now there is no such thing as a number between 4 and 8 when x is a factor - it would just add another level.
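A quick way to check which kind of variable you have, using only base R and the same x as above (f is just a name chosen for this sketch):

```r
x <- c(4, 4, 4, 8, 8, 8, 8)

class(x)        # "numeric": ggplot2 treats this as continuous
f <- factor(x)
class(f)        # "factor": ggplot2 treats this as discrete
levels(f)       # "4" "8": the groups a boxplot would be drawn for
```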
That is in short why your box plot changes. Let's demonstrate with the iris dataset. First, an example like yours. Notice that I'm assigning x=Sepal.Length. In the iris dataset, Sepal.Length is numeric, so continuous.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_boxplot()
This is similar to yours. The reason is that the boxplot is drawn by grouping according to x and then calculating statistics on those groups. If a variable is continuous, there are no "groups", even if values are repeated (as in x above). One way to make groups is to force the data to be discrete, as in factor(Sepal.Length). Here's what it looks like when you do that:
ggplot(iris, aes(x=factor(Sepal.Length), y=Sepal.Width)) +
geom_boxplot()
The other way to have this same effect would be to use the group= aesthetic, which does what you might think: it groups according to that column in the dataset.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, group=Sepal.Length)) +
geom_boxplot()
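If the continuous variable has many distinct values, one box per value can get crowded. A possible variation, not part of the original answer, is to bin the variable with ggplot2's cut_width() helper and group by the bins. The bin width of 0.5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Bin Sepal.Length into intervals of width 0.5 and draw one box per bin.
# x stays continuous, so the boxes are placed at their position on the axis.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_boxplot(aes(group = cut_width(Sepal.Length, 0.5)))
p
```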

geom_boxplot not displaying correctly

In the assignment I am doing, it wants me to use geom_boxplot. However, I have been unable to get the graph to display the boxplots correctly.
# Convert To Factor
census_data$CIT <- as.factor(census_data$CIT)
class(census_data$CIT)
ggplot(census_data, aes(census_data[["VALP"]], census_data[["CIT"]])) +
geom_boxplot(color = "blue", fill = "orange") +
ggtitle("Property value by citizenship status") +
xlab("Citizenship status") + ylab("Property value")
I am slightly concerned that the CIT may not have been converted correctly to a factor.
I think you have your x and y aesthetics the wrong way around. You have VALP first, which is then assumed to be x, and CIT second, which is assumed to be y. Given your labels, I think you want them in the other order.
I always find it helps to label them explicitly, i.e. aes(x=.., y=...), so you don't get confused!
You also don't need to use census_data[["VALP"]] in the aes function call. Since you have supplied census_data in the data argument, just saying aes(x=CIT, y=VALP) should be enough.
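As a rough sketch of the corrected call: census_demo below is a made-up stand-in for census_data (the real data is not shown in the question), and the is.factor() check addresses the worry about the conversion.

```r
library(ggplot2)

# Hypothetical stand-in for census_data; all values are invented
census_demo <- data.frame(
  CIT  = c(1, 1, 1, 4, 4, 4, 5, 5, 5),
  VALP = c(100000, 150000, 220000, 90000, 130000, 300000, 80000, 120000, 175000)
)

# Convert to factor and confirm the conversion worked
census_demo$CIT <- as.factor(census_demo$CIT)
is.factor(census_demo$CIT)

# Name the aesthetics explicitly: discrete CIT on x, numeric VALP on y
p <- ggplot(census_demo, aes(x = CIT, y = VALP)) +
  geom_boxplot(color = "blue", fill = "orange") +
  ggtitle("Property value by citizenship status") +
  xlab("Citizenship status") + ylab("Property value")
p
```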

R - ggplot2 - difference between ggplot(data, aes(x=variable...)) and ggplot(data, aes(x=data$variable...)) [duplicate]

I have recently encountered a phenomenon in ggplot2, and I would be grateful if someone could provide me with an explanation.
I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.
library(ggplot2)
species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)
data <- data.frame(species=species, numb=numb, groups=groups)
Let the following code represent the categorisation of a continuous variable.
data$factnumb <- as.factor(data$numb)
If I would like to plot this dataset, the following two code snippets are completely interchangeable:
Note the difference after the fill= statement.
p <- ggplot(data, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(p):
q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent)
plot(q):
However, when working with real-life continuous variables, not all categories will contain observations, and I still need to represent the empty categories on the x-axis in order to get an approximation of the sample distribution. To demonstrate this, I used the following code:
data_miss <- data[which(data$numb!= 3),]
This results in a disparity between the levels of the categorical variable and the observations in the dataset:
> unique(data_miss$factnumb)
[1] 1 2 7 8 10
Levels: 1 2 3 7 8 10
I then plotted the data_miss dataset, still including all of the levels of the factnumb variable.
pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_fill_discrete(drop=FALSE) +
scale_x_discrete(drop=FALSE)+
scale_y_continuous(labels = scales::percent)
plot(pm):
qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
facet_grid(groups ~ .) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_x_discrete(drop=FALSE)+
scale_fill_discrete(drop=FALSE) +
scale_y_continuous(labels = scales::percent)
plot(qm):
In this case, when using fill=data_miss$species the filling of the plot changes (and for the worse).
I would be really happy if someone could clear this one up for me.
Is it just "luck" that in the case of plots 1 and 2 the filling is identical, or have I stumbled upon some delicate mistake in the fine machinery of ggplot2?
Thanks in advance!
Kind regards,
Bernadette
Using data$variable inside aes() is never good, never recommended, and should never be done. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).
More explanation:
ggplot uses nonstandard evaluation. A standard-evaluating R function only finds objects through R's usual scoping rules, for example objects in the global environment; it does not look inside data frames. If I have data named mydata with a column named col1, and I do mean(col1), I get an error:
mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found
This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.
ggplot(mydata, aes(x = col1)) + geom_bar()
# no error
You don't have to use just a column inside aes though. To give flexibility, you can do a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):
# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11))) + geom_bar()
So what's the difference between col1 and mydata$col1? Well, col1 is the name of a column, and mydata$col1 is the actual values. ggplot will look for a column in your data named col1 and use that. mydata$col1 is just a vector: the full column. The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot splits your data up into pieces and works on each piece. To do this effectively, it needs to be able to identify the data and column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values (whatever happens to be in that column), and things don't work.
So, just use unquoted column names in aes() without data$ and everything will work as expected.
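One situation where people reach for data$column is when the column name is stored in a string. A hedged sketch of an alternative, assuming ggplot2 3.0.0 or later where the .data pronoun is available inside aes(); col_name is a hypothetical variable introduced only for this example:

```r
library(ggplot2)

mydata <- data.frame(col1 = c(1, 1, 2, 3, 3, 3))
col_name <- "col1"   # hypothetical: the column name held as a string

# .data[[col_name]] is looked up in the layer's data, like a bare column name,
# so ggplot can still split the data for facets and statistics
p <- ggplot(mydata, aes(x = .data[[col_name]])) + geom_bar()
p
```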

r ggplot change legend order to match final order of data

I have a dataframe which has a set of manufacturers and collected data for those manufacturers. The list of manufacturers and/or the attribute data can change, depending on the run.
I display this as a line chart in ggplot, but what I want is to have the legend order match the 'up/down' order of the final year of data. So for this chart:
Default Legend Order
I'd like to see the legend order (and color) be Yoyodyne (purple), Widget (green), Wonka (blue) and Acme (red).
I can't (or don't think I can) use scale_color_manual, because from one model run to the next the final order (in 2032) may differ and/or the list of manufacturers may differ.
The code for the chart is below (the last part, pz, just simplifies the x axis display):
px <- ggplot(bym, aes(x=Model.Year, y=AverageCost, colour=Manufacturer))
py <- px + ggtitle("MyChart") + labs(x="Year", y="Foo") + geom_line(size=0.5) + geom_point()
pz <- py + scale_x_continuous(breaks=c(min(bym$Model.Year),max(bym$Model.Year)))
pz
You can set the order of the legend objects by using the dplyr::mutate function in conjunction with the factor function. To set the colors in the order you want, you can just create a vector with your desired colors in the order you want them and pass it to scale_color_manual. I have done this in the example below. Mine looks a little different than yours because I removed the intermediate assignments.
library(dplyr)
library(ggplot2)
bym <- data.frame(
Model.Year = rep(seq(2016, 2030, 1), 4),
AverageCost = rnorm(60),
Manufacturer = rep(c("Yoyodyne", "Widget", "Wonka", "Acme"), each = 15)
)
my_colors <- c("purple", "green", "blue", "red")
bym %>%
mutate(Manufacturer = factor(Manufacturer,
levels = c("Yoyodyne", "Widget", "Wonka", "Acme"))) %>%
ggplot(aes(x=Model.Year, y=AverageCost, colour=Manufacturer)) +
ggtitle("MyChart") +
labs(x="Year", y="Foo") +
geom_line(size=0.5) +
geom_point()+
scale_x_continuous(breaks=c(min(bym$Model.Year),max(bym$Model.Year))) +
scale_color_manual(values = my_colors)
Have you tried setting the levels for Manufacturer according to the last year? For example, you can add a column with levels set this way:
# order Manufacturer by AverageCost in the last year
last_year <- bym[bym$Model.Year == 2032, ]
colours <- last_year$Manufacturer[order(-last_year$AverageCost)]
# add factor with levels ordered by colours
bym$Colour = factor(bym$Manufacturer, levels=as.character(colours))
Then use Colour for your colour aesthetic.
EDIT: That is, if you want to stick to base R. The answer with dplyr::mutate is much easier to use.
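The two answers can be combined so that the level order is computed from the data rather than typed by hand. The following is only a sketch: bym_demo and its costs are invented so that the final-year order is easy to verify, and final_order is a name chosen here.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data; costs are invented so the final-year order is known
bym_demo <- data.frame(
  Model.Year   = rep(c(2031, 2032), times = 3),
  AverageCost  = c(5, 1, 2, 9, 4, 6),
  Manufacturer = rep(c("Acme", "Widget", "Wonka"), each = 2),
  stringsAsFactors = FALSE
)

# Manufacturers sorted by cost in the last available year, highest first
final_order <- bym_demo %>%
  filter(Model.Year == max(Model.Year)) %>%
  arrange(desc(AverageCost)) %>%
  pull(Manufacturer)

# Use that order for the factor levels, which drives the legend order
bym_demo <- bym_demo %>%
  mutate(Manufacturer = factor(Manufacturer, levels = final_order))

p <- ggplot(bym_demo, aes(x = Model.Year, y = AverageCost, colour = Manufacturer)) +
  geom_line() +
  geom_point()
p
```

Using max(Model.Year) instead of a hard-coded year means the ordering follows whatever the last year happens to be in a given run.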

Resources