Problem with naming x-axis with ggplot2 in RStudio - r

I'm trying to create a variation of a Pareto chart.
While working on the code I ran into a problem that I have not been able to solve on my own for several hours. It concerns (1) the order in which ggplot2 plots the data and (2) relabeling the x-axis accordingly.
(1) Since I want an ordered bar plot with a saturation curve, I created a dummy variable (the id column), so that my bars are sorted from high to low, as you can see in the output (1).
By working around this problem I created a second problem that I can't fix.
(2) I have a column in my df containing all the species I want to see on the x-axis. However, ggplot won't print them there. In fact, since I added the command I don't get any labels on the x-axis at all, and I don't get an error either.
So my question is:
Is there a way to use my species list as the x-axis? (Remember that my data has to be sorted from high to low.)
Or does someone see an easy way to solve the labeling problem?
Cheers
dfb
Beech id proc kommu Order
1 Va fla 1 8.749851 8.749851 Psocopt
2 Er 2 7.793812 16.543663 Acari
3 Faga dou 3 7.659406 24.203069 Dipt
4 Tro 4 6.675941 30.879010 Acari
5 Hal ann 5 6.289307 37.168317 Dipt
6 Stigm 6 3.724406 40.892723 Acari
7 Di fag 7 3.642574 44.535297 Lepidopt
8 Phyfa 8 3.390545 47.925842 Neoptera
9 Phylma 9 2.766040 50.691881 Lepidopt
data example:
structure(list(Beech = c("Va fla", "Er", "Faga dou", "Tro", "Hal ann",
"Stigm", "Di fag", "Phyfa", "Phylma"), id = c(1, 2, 3, 4, 5,
6, 7, 8, 9), proc = c(8.749851, 7.793812, 7.659406, 6.675941,
6.289307, 3.724406, 3.642574, 3.390545, 2.76604), kommu = c(8.749851,
16.543663, 24.203069, 30.87901, 37.168317, 40.892723, 44.535297,
47.925842, 50.691881), Order = c("Psocopt", "Acari", "Dipt",
"Acari", "Dipt", "Acari", "Lepidopt", "Neoptera", "Lepidopt")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
library(openxlsx)
library(ggplot2)
dfb <- data.xlsx ###(df containing different % values per species)
labelb <- dfb$Beech ###(list of 22 items; same number as x-values)
p <-ggplot(dfb, aes(x=id))
p <- p + geom_bar(aes(y = proc), stat = "identity", fill = "lightgreen")
p <- p + geom_line(aes(y = kommu/10), color = "orange", size = 2) + geom_point(aes(y = kommu/10),size = 2)
p <- p + scale_y_continuous(sec.axis = sec_axis(~.*10, name ="Total biocoenosis[%]"))
p <- p + labs(y = "Species [%]",
x = "Species")
p <- p + scale_x_discrete(labels = labelb)
p <- p + theme(legend.position = c(0.8, 0.9))
--> Answer to other comments:
So basically my problem is the bars are not labeled with a species name.
I know that this is a result due to my dummyvar, which is basically 1 to 22.
So I try to force ggplot to name the x-axis with my wanted values.
But this input doesn't work
p <- p + scale_x_discrete(labels = labelb)
But back to your suggestions:
Yeah, I tried the tidyverse right after creating this post and couldn't get it to work well enough. But your idea doesn't change anything for me; the result is the same as with my plain ggplot command.
arrange(Beech) %>%
mutate(Beech = factor(Beech, levels = unique(.$Beech))) %>%
ggplot(aes(Beech, proc)) +
geom_col()

I can't quite tell from the picture what's going wrong, but one way to make sure your bar plots are in ascending/descending order is to arrange the column and then convert it to a factor using the existing order of the categories:
So, without ordering:
library(tidyverse)
diamonds %>%
group_by(cut) %>%
summarize(price = mean(price)) %>%
ggplot(aes(cut, price)) +
geom_bar(stat = "identity")
And with ordering:
diamonds %>%
group_by(cut) %>%
summarize(price = mean(price)) %>%
arrange(price) %>%
mutate(cut = factor(cut, levels = unique(.$cut))) %>%
ggplot(aes(cut, price)) +
geom_bar(stat = "identity")
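As a shorter alternative, reorder() can sort the factor levels directly inside aes(), which avoids the separate arrange() and mutate() steps. A minimal sketch on the same diamonds summary; the names avg_price and p_sorted are only illustrative:

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

# Mean price per cut, as in the example above
avg_price <- diamonds %>%
  group_by(cut) %>%
  summarize(price = mean(price))

# reorder() sorts the levels of cut by price (ascending);
# reorder(cut, -price) would give descending order instead
p_sorted <- ggplot(avg_price, aes(reorder(cut, price), price)) +
  geom_col() +
  labs(x = "cut")
```

Printing p_sorted draws the bars from the lowest to the highest mean price.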

I edited your code with the database sample you provided and I think I was able to do what you wanted.
Basically, I sorted Beech by descending proc and then converted it to a factor. Here is the modified code and the result:
p <-
dfb %>%
arrange(desc(proc)) %>%
mutate(Beech = factor(Beech, levels = unique(.$Beech))) %>%
ggplot(aes(Beech)) +
geom_bar(aes(y = proc), stat = "identity", fill = "lightgreen") +
geom_line(aes(y = kommu/10, x=as.integer(Beech)), color = "orange", size = 2) +
geom_point(aes(y = kommu/10),size = 2) +
labs(y = "Species [%]", x = "Species") +
scale_x_discrete("Species") +
scale_y_continuous(sec.axis = sec_axis(~.*10, name ="Total biocoenosis[%]")) +
theme(legend.position = c(0.8, 0.9))
p
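Note: I had to tweak geom_line slightly by adding x = as.integer(Beech): with a factor on the x-axis, ggplot2 puts each level in its own group, so no connecting line is drawn unless x is numeric (setting group = 1 is another way around this).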
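If you would rather keep the numeric id on the x-axis, as in the question, the species names can be attached with scale_x_continuous() instead. A discrete scale does not match a numeric variable, which is consistent with the labels disappearing without an error. This is a sketch using only the nine sample rows from the question, rebuilt as a plain data frame; p_num is an illustrative name:

```r
library(ggplot2)

# Sample rows copied from the question (already sorted by descending proc)
dfb <- data.frame(
  Beech = c("Va fla", "Er", "Faga dou", "Tro", "Hal ann",
            "Stigm", "Di fag", "Phyfa", "Phylma"),
  id    = 1:9,
  proc  = c(8.749851, 7.793812, 7.659406, 6.675941, 6.289307,
            3.724406, 3.642574, 3.390545, 2.76604),
  kommu = c(8.749851, 16.543663, 24.203069, 30.87901, 37.168317,
            40.892723, 44.535297, 47.925842, 50.691881)
)

p_num <- ggplot(dfb, aes(x = id)) +
  geom_col(aes(y = proc), fill = "lightgreen") +
  geom_line(aes(y = kommu / 10), color = "orange") +
  geom_point(aes(y = kommu / 10)) +
  # id is numeric: put one break at every bar and label it with the species
  scale_x_continuous(breaks = dfb$id, labels = dfb$Beech) +
  scale_y_continuous(sec.axis = sec_axis(~ . * 10, name = "Total biocoenosis [%]")) +
  labs(x = "Species", y = "Species [%]")
```

Because x stays numeric here, geom_line() connects the points without any extra grouping.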

Related

Removing NA category from grouped bar charts

I am currently working with survey data with 250 columns. A sample of my data looks like this:
q1 <- factor(c("yes",NA,"no","yes",NA,"yes","no","yes"))
q2 <- factor(c("Albania","USA","Albania","Albania","UK",NA,"UK","Albania"))
q3 <- factor(c(0,1,NA,0,1,1,NA,0))
q4 <- factor(c(0,NA,NA,NA,1,NA,0,0))
q5 <- factor(c("Dont know","Prefer not to answer","Agree","Disagree",NA,"Agree","Agree",NA))
q6 <- factor(c(1,NA,3,5,800,NA,900,2))
sector <- factor(c("Energy","Water","Energy","Other","Other","Water","Transportation","Energy"))
data <- data.frame(q1,q2,q3,q4,q5,q6,sector)
I have created a function to loop through all 250 columns and create grouped bar charts where x axis shows sectors, y axis shows percentage distribution of answers and fill is the underlying column from data. Below you can see the code for the function:
by_sector <- lapply(names(data), function(variable) {
ggplot(
data = data,
mapping = aes(x=sector,fill = data[[variable]])
) +
geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge") +
labs(x = variable, y = "% of total", fill = "Response", caption = paste("Total =", sum(!is.na(data[[variable]])))) +
geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..],accuracy = 0.1) ),
stat="count", position=position_dodge(1), vjust=0.5)+
#scale_fill_brewer(palette = "Accent")+
scale_fill_discrete(na.translate = FALSE) +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
})
As you can see from image below, since I use data columns as fill, there is transparent NA category showing up. I want to remove that category from grouped bars.
I tried a couple of things:
scale_fill_discrete(na.translate = FALSE): this only removed NA from the legend, not from the grouped bars.
fill = subset(data,!is.na(data[[variable]])): this didn't work.
ggplot(data=na.omit(data[[variable]])): this didn't work either.
Is there a way to modify my bar plot code so that the NA category doesn't show up as a bar in the graph? Thank you very much in advance!
One option would be to aggregate your data outside of ggplot() which makes it easier to debug, removes the duplicated computations inside the code and makes it easy to drop the NA categories if desired.
Additionally, I moved the plotting code to a separate function which also allows for easier debugging by e.g. running the code for just one example.
Finally, note that I switched to the .data pronoun, which is the recommended way to use column names passed as strings.
Showing only the plots for two of the problematic columns:
EDIT: Fixed a small bug by removing the NA values before aggregating instead of afterwards.
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
plot_fun <- function(variable) {
total <- sum(!is.na(data[[variable]]))
data <- data |>
filter(!is.na(.data[[variable]])) |>
group_by(across(all_of(c("sector", variable)))) |>
summarise(n = n(), .groups = "drop_last") |>
mutate(pct = n / sum(n)) |>
ungroup()
ggplot(
data = data,
mapping = aes(x = sector, y = pct, fill = .data[[variable]])
) +
geom_col(position = "dodge") +
labs(
x = variable, y = "% of total", fill = "Response",
caption = paste("Total =", total)
) +
geom_text(
aes(
label = scales::percent(pct, accuracy = 0.1)
),
position = position_dodge(.9), vjust = 0.5
) +
scale_fill_brewer(palette = "Accent") +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
coord_flip()
}
by_sector <- lapply(names(data), plot_fun)
by_sector[c(3, 6)]
#> [[1]]
#>
#> [[2]]
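To see what the aggregation step produces, it can be run on its own for a single column. A small sketch for q3, rebuilding only the two columns needed from the question's sample data; agg_q3 is an illustrative name and count() is used as a shorthand for group_by() plus summarise(n = n()):

```r
library(dplyr, warn.conflicts = FALSE)

q3 <- factor(c(0, 1, NA, 0, 1, 1, NA, 0))
sector <- factor(c("Energy", "Water", "Energy", "Other", "Other",
                   "Water", "Transportation", "Energy"))
data <- data.frame(q3, sector)

agg_q3 <- data |>
  filter(!is.na(q3)) |>          # drop NA answers before counting
  count(sector, q3) |>           # answers per sector and response
  group_by(sector) |>
  mutate(pct = n / sum(n)) |>    # share within each sector
  ungroup()

agg_q3
```

In this sample, Transportation disappears from the table because its only q3 answer is NA, so no bar is drawn for it at all.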

How to avoid zig-zag plot when using geom_line with color and linetype

I have a relatively large dataset that I can share here.
I am trying to plot all the lines (not just one: e.g. a mean or a median) corresponding to the values of y over x = G, with the data grouped by I and P; so that the levels of the variable I appear with a different colour and the levels of the variable P appear with a different line type.
The problem I have is that the graph I get is a zig-zag line graph along the x-axis. The aim, obviously, is to have a line for each combination of data, avoiding the zig-zag. I have read that this problem could be related to the way the data is grouped. I have tried several combinations of data grouping using group but I can't solve the problem.
The code I use is as follows:
library(dplyr)
library(ggplot2)
library(RColorBrewer)
#Selecting colours
colours<-brewer.pal(n = 11, name = "Spectral")[c(9,11,1)]
#Creating plot
data %>%
ggplot(aes(x = G, y = y, color = I, linetype=P)) +
geom_line(aes(linetype=P,color=I),size=0.2)+
scale_linetype_manual(values=c("solid", "dashed")) +
scale_color_manual(values=colours) +
scale_x_continuous(breaks = seq(0,100, by=25), limits=c(0,100)) +
scale_y_continuous(breaks = seq(0,1, by=0.25), limits=c(0,1)) +
labs(x = "Time", y = "Value") +
theme_classic()
I also tried, unsuccessfully, adding group=interaction(I, P) inside ggplot(aes()), as suggested in other forums.
Following @JonSpring's point:
dd2 <- (filter(dd,G %in% c(16,17))
%>% group_by(P,I,G)
%>% summarise(n=length(unique(y)))
)
shows that you have many different values of y for each combination of G/I/P:
# A tibble: 12 x 4
# Groups: P, I [6]
P I G n
<chr> <chr> <dbl> <int>
1 heterogeneity I005 16 34
2 heterogeneity I005 17 37
3 heterogeneity I010 16 34
... [etc.]
One way around this, if you so choose, is to use stat_summary() to have R collapse the y values in each group to their mean:
(dd %>%
ggplot(aes(x = G, y = y, color = I, linetype=P)) +
stat_summary(fun=mean, geom="line",
aes(linetype=P,color=I,group=interaction(I,P)),size=0.2) +
scale_linetype_manual(values=c("solid", "dashed")) +
scale_color_manual(values=colours) +
labs(x = "Time", y = "Value") +
theme_classic()
)
You could also do this yourself with group_by() + summarise() before calling ggplot.
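A sketch of that pre-aggregation route. The data frame dd_demo below is made up for illustration, since the original data set is not reproduced here; only the column names G, I, P and y follow the question, and the level "homogeneity" is an assumed placeholder:

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

set.seed(1)
# Hypothetical stand-in: five y values for every G/I/P combination
dd_demo <- expand.grid(G = 0:10,
                       I = c("I005", "I010"),
                       P = c("heterogeneity", "homogeneity"),
                       rep = 1:5,
                       stringsAsFactors = FALSE)
dd_demo$y <- runif(nrow(dd_demo))

# Collapse to a single mean per G/I/P combination
dd_mean <- dd_demo %>%
  group_by(P, I, G) %>%
  summarise(y = mean(y), .groups = "drop")

# With one y per x in every group, geom_line() no longer zig-zags
p_mean <- ggplot(dd_mean, aes(x = G, y = y, color = I, linetype = P)) +
  geom_line(aes(group = interaction(I, P))) +
  labs(x = "Time", y = "Value") +
  theme_classic()
```

Aggregating first also makes it easy to inspect dd_mean and confirm that each line has exactly one point per value of G.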
There's not enough information in the data set as presented to identify individual lines. If we are willing to assume that the order of the values within a given I/G/P group is an appropriate indexing variable, then we can do this:
## add index variable
dd3 <- dd %>% group_by(P,I,G) %>% mutate(index=seq(n()))
(dd3 %>%
ggplot(aes(x = G, y = y, color = I, linetype=P)) +
geom_line(aes(group=interaction(index,I,P)), size=0.2) +
scale_linetype_manual(values=c("solid", "dashed")) +
scale_color_manual(values=colours) +
labs(x = "Time", y = "Value") +
theme_classic()
)
If this isn't what you had in mind, then you need to provide more information ...

Adding a single label per group in ggplot with stat_summary and text geoms

I would like to add counts to a ggplot that uses stat_summary().
I am having an issue with the requirement that the text vector be the same length as the data.
With the examples below, you can see that what is being plotted is the same label multiple times.
The workaround to set the location on the y axis has the effect that multiple labels are stacked up. The visual effect is a bit strange (particularly when you have thousands of observations) and not sufficiently professional for my purposes. You will have to trust me on this one - the attached picture doesn't fully convey the weirdness of it.
I was wondering if someone else has worked out another way. It is for a plot in shiny that has dynamic input, so text cannot be overlaid in a hardcoded fashion.
I'm pretty sure ggplot wasn't designed for the kind of behaviour with stat_summary that I am looking for, and I may have to abandon stat_summary and create a new summary dataframe, but thought I would first check if someone else has some wizardry to offer up.
This is the plot without setting the y location:
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_x <- df_x %>%
group_by(Group) %>%
mutate(w_count = n())
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(label = w_count)) +
coord_flip() +
theme_classic()
and this is with my hack
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(y = 1, label = w_count)) +
coord_flip() +
theme_classic()
Create a df_text that has the grouped info for your labels. Then use annotate:
library(dplyr)
library(ggplot2)
set.seed(123)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_text <- df_x %>%
group_by(Group) %>%
summarise(avg = mean(Value),
n = n()) %>%
ungroup()
yoff <- 0.0
xoff <- -0.1
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
annotate("text",
x = 1:2 + xoff,
y = df_text$avg + yoff,
label = df_text$n) +
coord_flip() +
theme_classic()
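A variation on the same idea, in case the hard-coded positions 1:2 are too brittle for a dynamic plot: pass the summary data frame to geom_text() through its data argument, so each label is placed by Group itself. This is only a sketch; it swaps mean_cl_boot for a plain fun = mean point to avoid the Hmisc dependency, and p_lab is an illustrative name:

```r
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

set.seed(123)
df_x <- data.frame(Group = c(rep("A", 1000), rep("B", 2)),
                   Value = rnorm(1002))

# One row per group: label position (mean) and count
df_text <- df_x %>%
  group_by(Group) %>%
  summarise(avg = mean(Value), n = n())

p_lab <- ggplot(df_x, aes(x = Group, y = Value)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  # The text layer gets its own data, so exactly one label is drawn per group
  geom_text(data = df_text, aes(y = avg, label = n), vjust = -1) +
  coord_flip() +
  theme_classic()
```

Since the text layer inherits x = Group from the main mapping, the labels should follow the groups if the plot is reordered or filtered.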
I found another way which is a little more robust for when the plot is dynamic in its ordering and filtering, and works well for faceting. More robust, because it uses stat_summary for the text.
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
counts_df <- function(y) {
return( data.frame( y = 1, label = paste0('n=', length(y)) ) )
}
p <- ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
coord_flip() +
theme_classic()
p + stat_summary(geom="text", fun.data=counts_df)

How to make a dual axis in ggplot R

I have made a time series plot of total count data for 4 different species. As you can see, the sharksucker counts are much higher than those of the other 3 species. To see the trends of the other 3 species, they need to be plotted separately (or on a smaller y-axis). However, I have a figure limit in my master's paper, so I was trying to create a dual-axis plot or split the y-axis in two. Does anyone know of a way I could do this?
library(tidyverse)
library(reshape2)
library(readxl)
dat <- read_xlsx("ReefPA.xlsx")
dat1 <- dat
dat1$Date <- format(dat1$Date, "%Y/%m")
plot_dat <- dat1 %>%
group_by(Date) %>%
summarise(Sharksucker_Remora = sum(Sharksucker_Remora)) %>%
melt("Date") %>%
filter(Date > '2018-01-01') %>%
arrange(Date)
names(plot_dat) <- c("Date", "Species", "Count")
ggplot(data = plot_dat) +
geom_line(mapping = aes(x = Date, y = Count, group = Species, colour = Species)) +
stat_smooth(method=lm, aes(x = Date, y = Count, group = Species, colour = Species)) +
scale_colour_manual(values=c(Golden_Trevally="goldenrod2", Red_Snapper="firebrick2", Sharksucker_Remora="darkolivegreen3", Juvenile_Remora="aquamarine2")) +
xlab("Date") +
ylab("Total Presence Per Month") +
theme(legend.title = element_blank()) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
The thing is, the problem you're trying to solve doesn't seem like a 2nd Y axis issue. The problem here is of relative scale of the species. You might want to think of something like standardizing the initial species presence to 100 and showing growth or decline from there.
Another option would be faceting by species.
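A rough sketch of the faceting option, which still counts as a single figure. The counts below are simulated purely for illustration (the spreadsheet from the question is not available); only the species names are taken from the question:

```r
library(ggplot2)

set.seed(42)
# Hypothetical monthly counts; one species is far more abundant than the rest
dates   <- seq(as.Date("2018-01-01"), by = "month", length.out = 12)
species <- c("Sharksucker_Remora", "Golden_Trevally",
             "Red_Snapper", "Juvenile_Remora")
demo <- data.frame(Date    = rep(dates, times = length(species)),
                   Species = rep(species, each = length(dates)))
demo$Count <- rpois(nrow(demo),
                    lambda = ifelse(demo$Species == "Sharksucker_Remora", 200, 10))

p_facet <- ggplot(demo, aes(x = Date, y = Count, colour = Species)) +
  geom_line() +
  # scales = "free_y" gives every panel its own y-axis range
  facet_wrap(~ Species, ncol = 1, scales = "free_y") +
  ylab("Total Presence Per Month") +
  theme(legend.position = "none")
```

With free y scales the rarer species stay readable, at the cost that heights can no longer be compared directly between panels.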

ggplot faceted cumulative histogram

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but I get this chart
that, obviously, is not correct.
How can I get the correct one, like this:
Thanks!
I would pre-compute the cumsum values per bin per group, and then use geom_histogram to plot.
library(tidyverse)
mydata %>%
mutate(x = cut(x, breaks = 1:10, labels = FALSE)) %>% # Bin x
count(gender, x) %>% # Counts per bin per gender
mutate(x = factor(x, levels = 1:10)) %>% # x as factor
complete(x, gender, fill = list(n = 0)) %>% # Fill missing bins with 0
group_by(gender) %>% # Group by gender ...
mutate(y = cumsum(n)) %>% # ... and calculate cumsum
ggplot(aes(x, y, fill = gender)) + # The rest is (gg)plotting
geom_histogram(stat = "identity", colour = "white") +
facet_grid(gender ~ .)
Like @Edo, I also came here looking for exactly this. @Edo's solution was the key for me. It's great. But I post here a few additions that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale the y so that the cumulative plot always ends at 100%. Otherwise, if the groups are not the same size (like they are in the original example data) then the cumulative plots have different final heights. This obscures their relative distribution.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
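The rescaling works because density times bin width is the fraction of observations in each bin, so the running sum reaches 1 whatever the group size. A quick check with base R's hist(), used here only for illustration (the plot itself relies on ggplot2's own binning); cum_share is an illustrative name:

```r
set.seed(123)
x <- rnorm(100, 4, 1)

# Bin the data without plotting; the breaks must cover the full range of x
h <- hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 1), plot = FALSE)

# density * bin width = share of observations in each bin
cum_share <- cumsum(h$density * diff(h$breaks))
cum_share
```

The last element of cum_share equals 1 up to floating-point error, which is the 100% end point of the cumulative curve.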
Answering a million years later....
I was looking for a solution for the same problem and I got here..
Eventually I figured it out by myself, so I'll drop it here in case other people will ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
geom_histogram(aes(x = x, y = ave(..count.., group, FUN = cumsum),
fill = gender, group = gender),
colour = "gray70", breaks = 1:10) +
facet_grid(rows = "gender")
