Excluding levels/groups within categorical variable (ggplot graph)


I am relatively new to ggplot, and I am interested in visualizing a categorical variable with 11 groups/levels. I ran the code below to produce a bar graph showing the frequency of each group. However, given that some groups within the categorical variable "active" only occur once or zero times, they clutter the graph. Therefore, is it possible to directly exclude groups in ggplot within the categorical variable with < 2 observations?
I am also open to recommendations on how to visualize a categorical variable with multiple groups/levels if a bar graph isn't suitable here.
Data type
sapply(df,class)
username active
"character" "character"
ggplot(data = df, aes(x = active)) +
geom_bar()

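You can count() the categories first and then filter() before passing the result to ggplot. Because the counts are already computed at that point, you use geom_col() instead of geom_bar():
library(dplyr)
library(ggplot2)
# Keep groups with at least 2 observations, i.e. drop those with < 2
df %>% count(active) %>% filter(n >= 2) %>%
  ggplot(aes(x = active, y = n)) +
  geom_col()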
Alternatively, you could group_by() / filter() directly within your ggplot() call. This keeps the raw rows, so geom_bar() still does the counting:
ggplot(df %>% group_by(active) %>% filter(n() >= 2), aes(x = active)) +
  geom_bar()
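If dropping the rare groups loses too much information, one alternative is to lump them into a single "Other" bar. The sketch below is not from the answer above: it assumes the forcats package is available and uses made-up data in place of the asker's df.

```r
library(forcats)
library(ggplot2)

# Hypothetical toy data standing in for the asker's df
df <- data.frame(
  active = c(rep("daily", 5), rep("weekly", 3), "monthly", "never"),
  stringsAsFactors = FALSE
)

# Levels with fewer than 2 observations are collapsed into "Other"
df$active_lumped <- fct_lump_min(df$active, min = 2, other_level = "Other")

# geom_bar() counts the rows per (lumped) level
p <- ggplot(df, aes(x = active_lumped)) +
  geom_bar() +
  labs(x = "active")
```

Printing p draws three bars (daily, weekly, Other) instead of four, so the graph stays readable while the total number of observations is unchanged.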

Related

Facet wrap by categorical variable while also using grouping by another variable in R

Need help with a problem.
I am making a ggplot using dplyr and I need to group by 1 categorical variable while also facet wrapping by another.
My thought process was this:
d %>%
group_by(Grade) %>%
summarise(TotalPay = sum(PaymentsReceived)) %>%
ggplot(aes(y = Grade, x= TotalPay)) +
geom_col(fill = c(2:16), color = 'Black') +
facet_wrap(~ Status)
In this case I want to group the horizontal bars by the 'Grade' variable and also facet wrap by the 'Status' variable. However, when I do this I can't facet_wrap, because summarising after group_by(Grade) drops the Status variable from the data set.
Any direction would help.
Thanks.
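One possible direction, sketched under assumptions: summarise() keeps only the grouping columns, so grouping by both Grade and Status should leave Status available for facet_wrap(). The data frame below is invented, with only the three column names taken from the question, and the fill is mapped to Grade rather than set to a fixed vector of colours.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data using the column names from the question
d <- data.frame(
  Grade = c("A", "A", "B", "B", "A", "B"),
  Status = c("Current", "Late", "Current", "Late", "Current", "Current"),
  PaymentsReceived = c(100, 50, 200, 25, 10, 5)
)

# Group by both variables so Status survives the summarise() step
totals <- d %>%
  group_by(Grade, Status) %>%
  summarise(TotalPay = sum(PaymentsReceived), .groups = "drop")

# Horizontal bars per Grade, one panel per Status
p <- ggplot(totals, aes(x = TotalPay, y = Grade, fill = Grade)) +
  geom_col(colour = "black") +
  facet_wrap(~ Status)
```

Mapping fill inside aes() avoids having to supply exactly one colour per row, which a fixed vector such as c(2:16) would require.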

Reorder factored count data in ggplot2 geom_bar

I find countless examples of reordering X by the corresponding size of Y if the Dataframe for ggplot2 (geom_bar) is read using stat="identity".
I have yet to find an example of stat="count". The reorder function fails as I have no corresponding y.
I have a factored DF of one column, "count" (see below for a poor example), where there are multiple instances of the data as you would expect. However, I expected factored data to be displayed:
ggplot(df, aes(x=df$count)) + geom_bar()
by the order defined from the quantity of each factor, as it is different for unfactored (character) data i.e., will display alphabetically.
Any idea how to reorder?
This is my current awful effort, sadly I figured this out last night, then lost my R command history:

If you load the tidyverse at the start of your project, I suggest using fct_infreq() from the forcats package (part of the tidyverse), which orders the factor levels by how often each one occurs:
ggplot(df, aes(x = fct_infreq(count))) + geom_bar()
As your categories are words, consider adding coord_flip() so that your bars run horizontally.
ggplot(df, aes(x = fct_infreq(count))) + geom_bar() + coord_flip()
This is what it looks like with some fish species counts: a horizontal bar chart with species on the y-axis (really the flipped x-axis) and counts on the horizontal axis (really the flipped y-axis). The counts are sorted from least to greatest.
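To see what fct_infreq() does to the level order, here is a minimal sketch with made-up fish names. The data and the extra fct_rev() step are assumptions for illustration; fct_rev() simply reverses the level order in case you want the most frequent category at the top of the flipped chart.

```r
library(forcats)
library(ggplot2)

# Hypothetical one-column data frame, as in the question
df <- data.frame(count = factor(c("cod", "cod", "cod", "eel", "pike", "pike")))

# Levels reordered by decreasing frequency
by_freq <- fct_infreq(df$count)
levels(by_freq)  # "cod" "pike" "eel"

# Reverse the levels so the most frequent level is drawn last,
# which places it at the top after coord_flip()
p <- ggplot(df, aes(x = fct_rev(fct_infreq(count)))) +
  geom_bar() +
  coord_flip() +
  labs(x = "count")
```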

Converting the counts to a factor and then modifying that factor might help accomplish what you need. In the example below I reverse the order of the counts using fct_rev() from the forcats package (part of the tidyverse):
library(tidyverse)
iris %>%
count(Sepal.Length) %>%
mutate(n=n %>% as.factor %>% fct_rev) %>%
ggplot(aes(n)) + geom_bar()
Alternatively, if you'd like the bars to be arranged large to small, you can use fct_infreq.
iris %>%
count(Sepal.Length) %>%
mutate(n=n %>% as.factor %>% fct_infreq) %>%
ggplot(aes(n)) + geom_bar()

Multiple columns on x-axis in R

I'm quite new to R, and there has been a question similar to mine asked before, however it doesn't quite get to what I need.
I have a table as follows:
I wish to plot the Value, and Threshold alongside each other on the X-axis for each metric, so effectively, I will have three pairs of plots on the X-axis. I have attempted to use reshape2 and ggplot2 for this as follows:
library(reshape2)
df <- melt(msi, id.vars="Average Metric Value (Abbr)")
# I get an error message, but the output seems ok.
library(ggplot2)
ggplot(df, aes(x="Average Metric Value (Abbr)", y=value, fill=variable)) + geom_bar(stat='identity', position='dodge')
The output graph is as follows:
I'm sure I can work out how to separate each of the three pairs later, but as you can see, I don't have the metric names for each of the three pairs along the x-axis, and I am missing the first "Value" bar, presumably because it equals the same as the second and I am only getting unique values plotted.
How do I get around that and have the names of each metric beneath each pairs of values?

You can do this either by passing the column name as a string to aes_string() (deprecated in recent ggplot2 releases) or by wrapping it in backticks inside aes(). Backticks are needed for column names that contain spaces:
library(dplyr)
library(tidyr)
gather(msi, variable, value, Value:Threshold) %>%
ggplot(., aes(x= `Average Metric Value (Abbr)`,
y=value,
fill=variable)) +
geom_bar(stat='identity', position='dodge')
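The original table is not shown, so the following is only a sketch with an invented msi data frame that reuses the column names from the question. It uses pivot_longer(), the newer tidyr replacement for gather(), and shows that the backticked column name gives one pair of dodged bars per metric.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the asker's msi table;
# check.names = FALSE keeps the column name with spaces intact
msi <- data.frame(
  `Average Metric Value (Abbr)` = c("M1", "M2", "M3"),
  Value = c(10, 20, 30),
  Threshold = c(10, 25, 28),
  check.names = FALSE
)

# One row per metric and variable (Value or Threshold)
long <- pivot_longer(msi, cols = c(Value, Threshold),
                     names_to = "variable", values_to = "value")

# Backticks make aes() treat the name as a column, not a string constant
p <- ggplot(long, aes(x = `Average Metric Value (Abbr)`,
                      y = value, fill = variable)) +
  geom_col(position = "dodge")
```

With the name in double quotes, as in the question, every bar is mapped to the same constant x value, which is why the metric names disappear and bars overlap.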

How to plot parallel coordinates with multiple categorical variables in R

I am having difficulty plotting a parallel coordinates plot with ggparcoord from the GGally package. As there are two categorical variables, what I want to show in the visualisation is like the image below. I've found that in ggparcoord, groupColumn accepts only a single variable to group (colour) by. I can use showPoints to mark the values on the axes, but I also need to vary the shape of these markers according to the categorical variables. Is there another package that can help me realise my idea?
Any response will be appreciated! Thanks!

It's not that difficult to roll your own parallel coordinates plot in ggplot2, which will give you the flexibility to customize the aesthetics. Below is an illustration using the built-in diamonds data frame.
To get parallel coordinates, you need to add an ID column so you can identify each row of the data frame, which we'll use as a group aesthetic in ggplot. You also need to scale the numeric values so that they'll all be on the same vertical scale when we plot them. Then you need to take all the columns that you want on the x-axis and reshape them to "long" format. We do all that on the fly below with the tidyverse/dplyr pipe operator.
Even after limiting the number of category combinations, the lines are probably too intertwined for this plot to be easily interpretable, so consider this merely a "proof of concept". Hopefully, you can create something more useful with your data. I've used colour (for the lines) and fill (for the points) aesthetics below. You can use shape or linetype instead, depending on your needs.
library(tidyverse)
theme_set(theme_classic())
# Get 20 random rows from the diamonds data frame after limiting
# to two levels each of cut and color
set.seed(2)
ds = diamonds %>%
filter(color %in% c("D","J"), cut %in% c("Good", "Premium")) %>%
sample_n(20)
ggplot(ds %>%
mutate(ID = 1:n()) %>% # Add ID for each row
mutate_if(is.numeric, scale) %>% # Scale numeric columns
gather(key, value, c(1,5:10)), # Reshape to "long" format
aes(key, value, group=ID, colour=color, fill=cut)) +
geom_line() +
geom_point(size=2, shape=21, colour="grey50") +
scale_fill_manual(values=c("black","white"))
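As a variation on the code above, the second categorical variable can be mapped to point shape instead of fill. This is a sketch under a few assumptions: it uses pivot_longer() and across() (newer tidyverse equivalents of gather() and mutate_if()), selects the numeric columns by name rather than by position, and converts cut to character so ggplot2 does not warn about shapes for an ordered factor.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

num_cols <- c("carat", "depth", "table", "price", "x", "y", "z")

# Same sampling as above: 20 rows, two levels each of cut and color
set.seed(2)
ds <- diamonds %>%
  filter(color %in% c("D", "J"), cut %in% c("Good", "Premium")) %>%
  sample_n(20)

long <- ds %>%
  mutate(ID = row_number(),                                       # row identifier
         cut = as.character(cut),
         color = as.character(color)) %>%
  mutate(across(all_of(num_cols), ~ as.numeric(scale(.x)))) %>%   # common scale
  pivot_longer(all_of(num_cols), names_to = "key", values_to = "value")

# Line colour encodes color, marker shape encodes cut
p <- ggplot(long, aes(key, value, group = ID, colour = color)) +
  geom_line() +
  geom_point(aes(shape = cut), size = 2)
```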
I haven't used ggparcoord before, but the only option that seemed straightforward (at least on my first try with the function) was to paste together two columns of data into a single grouping column. Below is an example. Even with just four category combinations, the plot is confusing, but maybe it will be interpretable if there are strong patterns in your data:
library(GGally)
ds$group = with(ds, paste(cut, color, sep="-"))
ggparcoord(ds, columns=c(1, 5:10), groupColumn=11) +
theme(panel.grid.major.x=element_line(colour="grey70"))

Ordering bar plots with ggplot2 according to their size, i.e. numerical value

This question asks about ordering a bar graph according to an unsummarized table. I have a slightly different situation. Here's part of my original data:
experiment,pvs_id,src,hrc,mqs,mcs,dmqs,imcs
dna-wm,0,7,9,4.454545454545454,1.4545454545454546,1.4545454545454541,4.3939393939393945
dna-wm,1,7,4,2.909090909090909,1.8181818181818181,0.09090909090909083,3.9090909090909087
dna-wm,2,7,1,4.818181818181818,1.4545454545454546,1.8181818181818183,4.3939393939393945
dna-wm,3,7,8,3.4545454545454546,1.5454545454545454,0.4545454545454546,4.272727272727273
dna-wm,4,7,10,3.8181818181818183,1.9090909090909092,0.8181818181818183,3.7878787878787876
dna-wm,5,7,7,3.909090909090909,1.9090909090909092,0.9090909090909092,3.7878787878787876
dna-wm,6,7,0,4.909090909090909,1.3636363636363635,1.9090909090909092,4.515151515151516
dna-wm,7,7,3,3.909090909090909,1.7272727272727273,0.9090909090909092,4.030303030303029
dna-wm,8,7,11,3.6363636363636362,1.5454545454545454,0.6363636363636362,4.272727272727273
I only need a few variables from this, namely mqs and imcs, grouped by their pvs_id, so I create a new table:
m = melt(t, id.var="pvs_id", measure.var=c("mqs","imcs"))
I can plot this as a bar graph where one can see the correlation between MQS and IMCS.
ggplot(m, aes(x=pvs_id, y=value)) +
geom_bar(aes(fill=variable), position="dodge", stat="identity")
However, I'd like the resulting bars to be ordered by the MQS value, from left to right, in decreasing order. The IMCS values should be ordered with those, of course.
How can I accomplish that? Generally, given any molten dataframe — which seems useful for graphing in ggplot2 and today's the first time I've stumbled over it — how do I specify the order for one variable?

It's all in making pvs_id a factor and supplying the appropriate levels to it:
dat$pvs_id <- factor(dat$pvs_id, levels = dat[order(-dat$mqs), 2])
m = melt(dat, id.var="pvs_id", measure.var=c("mqs","imcs"))
ggplot(m, aes(x=pvs_id, y=value))+
geom_bar(aes(fill=variable), position="dodge", stat="identity")
This produces the following plot:
EDIT:
Since pvs_id was numeric, it was plotted in numeric order, whereas a factor is plotted in the order of its levels. So even though its labels are numbers, pvs_id is really a nominal factor. As for dat[order(-dat$mqs), 2]: order() with a negative sign sorts the data frame from largest to smallest mqs. You want that ordering for the pvs_id variable, so you index that column, which is the second column. If you take it apart, you'll see it gives you:
> dat[order(-dat$mqs), 2]
[1] 6 2 0 5 7 4 8 3 1
Now you supply that to the levels argument of factor and this orders the factor as you want it.
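The indexing step can be checked in isolation. The sketch below rebuilds only the pvs_id and mqs columns from the sample data in the question (mqs rounded to four decimals) and selects the column by name instead of by position, which is an assumption made for readability.

```r
# pvs_id and mqs from the question's sample rows (mqs rounded)
dat <- data.frame(
  pvs_id = 0:8,
  mqs = c(4.4545, 2.9091, 4.8182, 3.4545, 3.8182,
          3.9091, 4.9091, 3.9091, 3.6364)
)

# pvs_id values sorted by decreasing mqs; ties keep their original order
ord <- dat$pvs_id[order(-dat$mqs)]
ord  # 6 2 0 5 7 4 8 3 1

# Use that ordering as the factor levels, which fixes the x-axis order
dat$pvs_id <- factor(dat$pvs_id, levels = ord)
```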

With newer tidyverse functions, this becomes much more straightforward (or at least, easy to read for me):
library(tidyverse)
d %>%
mutate_at("pvs_id", as.factor) %>%
mutate(pvs_id = fct_reorder(pvs_id, mqs)) %>%
gather(variable, value, c(mqs, imcs)) %>%
ggplot(aes(x = pvs_id, y = value)) +
geom_col(aes(fill = variable), position = position_dodge())
What it does is:
create a factor if not already
reorder it according to mqs (you may use desc(mqs) for reverse-sorting)
gather into individual rows (same as melt)
plot as geom_col (same as geom_bar with stat="identity")
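Since the question asks for bars in decreasing order of mqs, note that fct_reorder() sorts in increasing order by default. Besides desc(mqs), forcats also offers a .desc argument. The sketch below uses four of the question's rows with rounded values, chosen only for illustration.

```r
library(forcats)

# Four pvs_id values with their (rounded) mqs from the question
pvs_id <- factor(c(0, 1, 2, 6))
mqs <- c(4.45, 2.91, 4.82, 4.91)

# Default: levels ordered by increasing mqs
asc <- levels(fct_reorder(pvs_id, mqs))
asc   # "1" "0" "2" "6"

# .desc = TRUE: levels ordered by decreasing mqs
dsc <- levels(fct_reorder(pvs_id, mqs, .desc = TRUE))
dsc   # "6" "2" "0" "1"
```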
