I have a data set of several features of several organisms. I'm displaying each feature by several different categories, individually and in combination (e.g. species, location, population), as raw counts, as a percentage of the total sample size, and as a percentage within a given group.
My problem comes when I try to display a stacked bar chart using ggplot for the percentage of individuals within a group. Since the groups do not have the same number of individuals in them, I'd like to display the raw count of individuals with that feature on their respective bars for context. I've managed to display the stacked percentage bar chart properly and to get the number of individuals from the most populous groups to display. I'm having trouble displaying the counts for the rest of the groups.
ggplot(data=All.k6,aes(x=Second.Dorsal))+
geom_bar(aes(fill=Species),position="fill")+
scale_y_continuous(labels=scales::percent)+
labs(x="Number of Second Dorsal Spines",y="Percentage of Individuals within Species",title="Second Dorsal Spines")+
geom_text(aes(label=..count..),stat='count',position=position_fill(vjust=0.5))
You need to include a group= aesthetic so that position_fill knows how to position things. In geom_bar, you set the fill= aesthetic, so ggplot assumes you also want to group by that aesthetic. In geom_text, no such aesthetic is mapped, so it groups by your x= aesthetic only and you get one count per bar. In your case, just add group=Species after your label= aesthetic. Here's an example:
# sample dataset
set.seed(1234)
types <- data.frame(
x=c('A','A','A','B','B','B','C','C','C'),
x1=rep(c('aa','bb','cc'),3)
)
df <- rbind(types[sample(1:9,50,replace=TRUE),])
Plot without grouping:
ggplot(df, aes(x=x)) +
geom_bar(aes(fill=x1),position='fill') +
scale_y_continuous(label=scales::percent) +
geom_text(aes(label=..count..),stat='count',
position=position_fill(vjust=0.5))
Plot with group= aesthetic:
ggplot(df, aes(x=x)) +
geom_bar(aes(fill=x1),position='fill') +
scale_y_continuous(label=scales::percent) +
geom_text(aes(label=..count..,group=x1),stat='count',
position=position_fill(vjust=0.5))
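As a side note, in ggplot2 3.3.0 and later the ..count.. notation can be written as after_stat(count) instead. The following is only a sketch of the grouped plot in that style, rebuilding the same sample data; layer_data() is used to look at what the text layer actually computed, and the object names p and labels_df are just illustrative.

```r
library(ggplot2)

# same sample data as in the example above
set.seed(1234)
types <- data.frame(
  x  = c('A','A','A','B','B','B','C','C','C'),
  x1 = rep(c('aa','bb','cc'), 3)
)
df <- types[sample(1:9, 50, replace = TRUE), ]

p <- ggplot(df, aes(x = x)) +
  geom_bar(aes(fill = x1), position = 'fill') +
  scale_y_continuous(labels = scales::percent) +
  # group = x1 makes the count stat produce one label per stack segment
  geom_text(aes(label = after_stat(count), group = x1), stat = 'count',
            position = position_fill(vjust = 0.5))

# data computed for the text layer (layer 2): one row per x / x1 combination
labels_df <- layer_data(p, 2)
labels_df[, c("x", "group", "count", "y")]
```

The count column should add up to the number of rows in df, and the y column holds the label positions on the 0 to 1 scale used by position_fill.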
Related
I'm trying to order the bars of my percent stacked barchart in R based on descending stack segment height.
R automatically sorts my categorical data in alphabetical order (in both the barchart and its legend), but I'd like the data to be ordered so that the biggest bars (the ones with the greatest stack segment height) are at the top of the barchart and the smallest at the bottom, in descending order.
I don't know how to do this because I cannot manually set a specific order with a vector prior to using ggplot2: my dataset is quite big and I need it to be ordered based on total field area (a quantitative variable that changes for every single city I'm considering).
Does anyone know how to help me?
You need to set your categorical variable as a factor whose levels are in the order you want. For example, using the iris data, the default is an alphabetical x-axis:
library(tidyverse)
iris%>%
ggplot(aes(Species,Petal.Length))+
geom_col()
Using fct_reorder (from forcats, included in the tidyverse), you can change a character variable to a factor and give it an order in one step. Here I change the order of the x-axis so that it is ordered by the mean sepal width of each species.
iris%>%
mutate(Species=fct_reorder(Species,Sepal.Width,mean))%>%
ggplot(aes(Species,Petal.Length))+
geom_col()
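To see what fct_reorder does to the factor itself, it can help to look at the levels before and after. This is a small sketch using only forcats and the built-in iris data; the names reordered and reordered_desc are just illustrative.

```r
library(forcats)

# mean sepal width per species, for reference
tapply(iris$Sepal.Width, iris$Species, mean)

# default level order (alphabetical)
levels(iris$Species)

# levels reordered by mean sepal width, smallest mean first
reordered <- fct_reorder(iris$Species, iris$Sepal.Width, .fun = mean)
levels(reordered)

# .desc = TRUE is an argument of fct_reorder itself and puts the largest mean first
reordered_desc <- fct_reorder(iris$Species, iris$Sepal.Width, .fun = mean, .desc = TRUE)
levels(reordered_desc)
```

Note that a factor has a single level order for the whole data frame, so this on its own gives the same order in every facet of a plot.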
st_des_as %>%
mutate(COLTURA=fct_reorder(COLTURA,tot_area,.desc=FALSE)) %>%
ggplot(aes(x=" ", y=tot_area, fill=COLTURA)) +
geom_bar(position= "fill", stat="identity") +
facet_grid(~ZONA) +
labs(x=NULL, y="landcover (%)") +
scale_y_continuous(labels=function(x) paste0(x*100)) +
scale_fill_manual(name="CROP TYPE",values=colours_as) +
theme_classic() +
theme(legend.key.size = unit (10, "pt")) +
theme(legend.title = element_text(face="bold"))
Here are some of my data; as you can see, they are numerical values divided by region (ZONA) and crop type (COLTURA).
And here are the first graphs: the first one from the left is correctly sorted, while the other three are not sorted by the height of their own bars but follow the same order as the first graph, regardless of the size of their own bars.
How, in R, should I make a histogram with a categorical variable on the x-axis and the frequency of a continuous variable on the y-axis?
Is this correct?
There are a couple of ways one could interpret "one graph" in the title of the question. That said, using the ggplot2 package, there are at least a couple of ways to render histograms by group on a single page of results.
First, we'll create a data frame that contains a normally distributed random variable with a mean of 100 and a standard deviation of 20. We also include a group variable that has one of four values: A, B, C, or D.
set.seed(950141237) # for reproducibility of results
df <- data.frame(group = rep(c("A","B","C","D"),200),
y_value = rnorm(800,mean=100,sd = 20))
The resulting data frame has 800 rows of randomly generated values from a normal distribution, assigned into 4 groups of 200 observations.
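A quick way to confirm that structure is to tabulate the groups and look at the group means. This sketch simply repeats the data creation from above so that it runs on its own.

```r
set.seed(950141237) # same seed as above
df <- data.frame(group = rep(c("A","B","C","D"), 200),
                 y_value = rnorm(800, mean = 100, sd = 20))

# number of observations per group (200 each, by construction)
table(df$group)

# mean of y_value per group; each should be close to 100
aggregate(y_value ~ group, data = df, FUN = mean)
```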
Next, we will render this in ggplot2::ggplot() as a histogram, where the color of the bars is based on the value of group.
ggplot(data = df,aes(x = y_value, fill = group)) + geom_histogram()
...and the resulting chart looks like this:
In this style of histogram the values from each group are stacked atop each other (i.e. the frequency of group A is added to B, etc. before rendering the chart), which might not be what the original poster intended.
We can verify the "stacking" behavior by removing the fill = group argument from aes().
# verify the stacking behavior
ggplot(data = df,aes(x = y_value)) + geom_histogram()
...and the output, which looks just like the first chart, but drawn in a single color.
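The same check can be done numerically rather than by eye. The sketch below fixes bins = 30 in both charts (an assumption made here so both use identical bins), pulls the computed bar data with layer_data(), and compares the summed per-group counts in each bin with the counts of the single-color histogram.

```r
library(ggplot2)

set.seed(950141237)
df <- data.frame(group = rep(c("A","B","C","D"), 200),
                 y_value = rnorm(800, mean = 100, sd = 20))

p_fill  <- ggplot(df, aes(x = y_value, fill = group)) + geom_histogram(bins = 30)
p_plain <- ggplot(df, aes(x = y_value)) + geom_histogram(bins = 30)

d_fill  <- layer_data(p_fill)   # one row per bin per group
d_plain <- layer_data(p_plain)  # one row per bin

# total height of each stacked bar: sum of the group counts within a bin
stacked_heights <- as.numeric(tapply(d_fill$count, d_fill$xmin, sum))

# compare with the bar heights of the single-color chart
all.equal(stacked_heights, d_plain$count)
```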
Another way to render the data is to use group with facet_wrap(), where each distribution appears in a different facet on one chart.
ggplot(data = df,aes(x = y_value)) + geom_histogram() + facet_wrap(~group)
The resulting chart looks like this:
The facet approach makes it easier to see differences in frequency of y values between the groups.
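If the groups should stay in a single panel without being stacked, another option is to overlay them. This is a sketch, not part of the original answer: position = "identity" draws every group's bars from zero, and the alpha value of 0.4 and bins = 30 are arbitrary choices made here so the overlapping bars remain visible.

```r
library(ggplot2)

set.seed(950141237)
df <- data.frame(group = rep(c("A","B","C","D"), 200),
                 y_value = rnorm(800, mean = 100, sd = 20))

# overlaid (not stacked) histograms: each group's bars start at y = 0
p_overlay <- ggplot(df, aes(x = y_value, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30)

p_overlay
```

With four groups the overlap can get hard to read, which is one reason the facet version above is often the clearer choice.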
I am using the below code to plot a data frame on the same plot:
ggplot(df) + geom_line(aes(x = date, y = values, colour = X > 5))
The plot works and looks great, except that when the values are bigger than 5, geom_line connects the points that are above the threshold to each other, as shown below. I do not want the lines connecting the blue data.
How do I stop this from happening?
Here's an example using the economics dataset included in ggplot2. You see the same thing if we highlight the line based on values above 8000:
ggplot(economics, aes(date, unemploy)) +
geom_line(aes(color=unemploy > 8000))
When you map a discrete variable to an aesthetic such as color, ggplot2 by default also groups your data by it. This makes total sense if you have data in long form and want to draw a separate line for each value in a column. In cases like this one, you want ggplot2 to change the color of the line based on the data, but you do not want it to group based on color. This is why you need to override the group= aesthetic.
To override the group= aesthetic change that happens when you map your line geom, you can just say group=1 or really group= any constant value. This effectively sets every observation mapped to the same group, and the line will connect all your points, but be colored differently:
ggplot(economics, aes(date, unemploy)) +
geom_line(aes(color=unemploy > 8000, group=1))
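You can confirm the grouping behavior by inspecting the data ggplot2 builds for each version of the plot. This is a small sketch using layer_data(); the variable names are just illustrative.

```r
library(ggplot2)

p_default <- ggplot(economics, aes(date, unemploy)) +
  geom_line(aes(color = unemploy > 8000))

p_grouped <- ggplot(economics, aes(date, unemploy)) +
  geom_line(aes(color = unemploy > 8000, group = 1))

# number of distinct groups the line geom sees in each plot
n_groups_default <- length(unique(layer_data(p_default)$group))
n_groups_grouped <- length(unique(layer_data(p_grouped)$group))

n_groups_default  # 2: one line for TRUE and one for FALSE
n_groups_grouped  # 1: a single line whose color changes along the way
```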
I have a factor of time that has two levels, admission and discharge. I'm using facet_grid to create four panels in which my continuous Y will be looked at by time. I want to be able to add a mean line to each of the two time levels in each panel. My problem is that the mean line spans the entire width of the panel and I'd like to shorten it to just remain within the area of the dots.
Here is the code:
plot <- ggplot(data.in, aes(x=Time, y=Y)) + geom_point()
plot <- plot + facet_grid(.~FacetGroup)
data_hline <- aggregate(Y ~ Time + FacetGroup, data=data.in, FUN=mean)
plot + geom_hline(data=data_hline, aes(yintercept=Y))
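One possible way to get shorter mean lines, offered only as a sketch, is to draw them with geom_errorbar using ymin = ymax = the mean, so that the width argument controls how far each line extends around its x position. The data.in below is a hypothetical stand-in invented for this example (the real data are not shown in the question), and the width of 0.4 is an arbitrary choice.

```r
library(ggplot2)

# hypothetical stand-in for data.in: 4 facet groups x 2 time levels x 10 points
set.seed(1)
data.in <- data.frame(
  Time       = factor(rep(c("admission", "discharge"), each = 10, times = 4)),
  FacetGroup = rep(paste("Group", 1:4), each = 20),
  Y          = rnorm(80, mean = 50, sd = 10)
)

# one mean per Time level within each facet
data_hline <- aggregate(Y ~ Time + FacetGroup, data = data.in, FUN = mean)

p <- ggplot(data.in, aes(x = Time, y = Y)) +
  geom_point() +
  facet_grid(. ~ FacetGroup) +
  # an error bar with ymin == ymax is just a horizontal line of the given width
  geom_errorbar(data = data_hline,
                aes(x = Time, ymin = Y, ymax = Y),
                width = 0.4, inherit.aes = FALSE)

p
```

Because data_hline keeps the Time and FacetGroup columns, each line lands over the right time level in the right panel.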
I am working with a Danish dataset on immigrants by country of origin and age group. I transformed the data so I can see the top countries of origin for each age group.
I am plotting it using facet_wrap. What I would like to do is, since different age groups come from quite different areas, to show a different set of values for one axis in each facet. For example, those that are between 0 and 10 years old come from countries x,y and z, while those 10-20 years of age come from countries q, r, z and so on.
In my current version, it shows the entire set of values, including countries that are not in the top 10. I would like to show just the top ten countries of origin for each facet, in effect having different axis labels for each. (And, if it is possible, sorting by high to low for each facet).
Here is what I have so far:
library(ggplot2)
library(reshape)
###load and inspect data
load(url('http://dl.dropbox.com/u/7446674/dk_census.rda'))
head(dk_census)
###reshape for plotting--keep just a few age groups
dk_census.m <- melt(dk_census[dk_census$Age %in% c('0-9 år', '10-19 år','20-29 år','30-39 år'),c(1,2,4)])
###get top 10 observations for each age group, store in data frame
top10 <- by(dk_census.m[order(dk_census.m$Age,-dk_census.m$value),], dk_census.m$Age, head, n=10)
top10.df<-do.call("rbind", as.list(top10))
top10.df
###plot
ggplot(data=top10.df, aes(x=as.factor(Country), y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
One option (that I actually strongly suspect you won't be happy with) is this:
p <- ggplot(data=top10.df, aes(x=Country, y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
library(plyr) # provides dlply()
pp <- dlply(.data=top10.df,.(Age),function(x) {x$Country <- reorder(x$Country,x$value); p %+% x})
library(gridExtra)
do.call(grid.arrange,pp)
(Edited to sort each graph.)
Keep in mind that the only reason faceting exists is to plot multiple panels that share a common scale. So when you start asking to facet on some variable, but have the scales be different (oh, and also sort them separately on each panel as well) what you're doing is really no longer faceting. It's just making four different plots and arranging them together.
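That said, a workaround that stays inside a single ggplot call is sometimes used: build one key per country/age pair, order that key by value, let the discrete axis vary per panel with scales = "free_y", and strip the age suffix from the labels. The sketch below uses a small made-up data frame (toy, with invented countries and values) because the census file may not be available, and it assumes ggplot2 3.3.0 or later, where geom_col can draw horizontal bars without coord_flip.

```r
library(ggplot2)

# hypothetical toy data standing in for top10.df
toy <- data.frame(
  Age     = rep(c("0-9", "10-19"), each = 3),
  Country = c("X", "Y", "Z", "Q", "R", "Z"),
  value   = c(30, 10, 20, 5, 25, 15)
)

# one key per Country/Age pair, ordered by value, so each panel sorts independently
toy$key <- reorder(paste(toy$Country, toy$Age, sep = "___"), toy$value)

p <- ggplot(toy, aes(x = value, y = key)) +
  geom_col() +
  facet_wrap(~Age, scales = "free_y") +
  # show only the country part of each key on the axis
  scale_y_discrete(labels = function(k) sub("___.*$", "", k)) +
  labs(title = "Immigrants By Country by Age",
       x = "Population", y = "Country of Origin")

p
```

The "___" separator is an arbitrary choice; it only needs to be a string that does not occur in any country name.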
Using lattice (here I use latticeExtra for the ggplot2 theme), you can set relation='free' between panels. Here I am using abbreviate = TRUE to shorten long labels.
library(latticeExtra)
barchart(value~ Country|Age,data=top10.df,layout=c(2,2),
horizontal=T,
par.strip.text =list(cex=2),
scales=list(y=list(relation='free',cex=1.5,abbreviate=T,
labels=levels(factor(top10.df$Country)))),
# ,cex=1.5,abbreviate=F),
par.settings = ggplot2like(),axis=axis.grid,
main="Immigrants By Country by Age",
ylab="Country of Origin",
xlab="Population")