How to add "N = " labels to bar plot in R? - r

I'm looking to add "n = #" under each of the variables on the x-axis but I'm not sure how. The counts don't necessarily have to be under the names, just as long as the counts are there. I'm also working with two categorical variables, so that may be the issue too. Let me know if you have any suggestions, I'm new to R.
Here's some information on the dataset and the variables I'm comparing. The overall data set (scorpions) consists of scorpion species and what vegetation they're found in. Those are the two things I'm comparing. "species" is the vector for the species and "veg" is the vector for the vegetation type. These are both character vectors. I really just want to know how to add more labels onto my graph to give more clarification. This is what my graph currently looks like:
graph
I just want to be able to add number labels anywhere. If you want to recreate it, you can really use any dataset that consists of two character vectors. The other posts don't help because they consist of numerical vectors as well. If it's not possible to do this, then just let me know.
Thank you everyone for the help so far!
ggplot(data=scorpions, aes(x=species,y=veg,fill=veg)) +
geom_bar(stat="identity",color="black",position=position_dodge()) +
theme_stata() +
scale_fill_economist() +
theme(
axis.text.y = element_text(angle = 0),
axis.title = element_text(face="bold"),
axis.text.x = element_text(face = "italic")
) +
labs(title="Relationship Between Species and Vegetation Type")
I've tried changing the names in the Excel spreadsheet, but it looks really messy. I've also tried googling answers but nothing works since it's two categorical variables.

This question is in contrast to the most common dupe-links for grouped bar plots in ggplot2 in that other links (How to put labels over geom_bar for each bar in R with ggplot2 and How to put labels over geom_bar in R with ggplot2) tend to talk about one categorical variable only; this question asks about two categorical variables.
But it's not that hard: we just need to come up with a number for all combinations of each of the two categoricals. I'll use xtabs for that.
Using the ggplot2::diamonds dataset, plotting against cut and color (both are ordered factors, so they behave like the categorical variables in the question):
library(ggplot2)
head(diamonds)
# # A tibble: 6 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Starting with a simple (non-themed) bar plot:
gg <- ggplot(data=diamonds, aes(x=cut,y=color,fill=color)) +
geom_bar(stat="identity",color="black",position=position_dodge())
gg
Calculate the frequency table:
xtabs(~ cut + color, data = diamonds)
# color
# cut D E F G H I J
# Fair 163 224 312 314 303 175 119
# Good 662 933 909 871 702 522 307
# Very Good 1513 2400 2164 2299 1824 1204 678
# Premium 1603 2337 2331 2924 2360 1428 808
# Ideal 2834 3903 3826 4884 3115 2093 896
### convert to a frame
tab <- data.frame(xtabs(~ cut + color, data = diamonds))
head(tab)
# cut color Freq
# 1 Fair D 163
# 2 Good D 662
# 3 Very Good D 1513
# 4 Premium D 1603
# 5 Ideal D 2834
# 6 Fair E 224
New plot, adding geom_text:
gg +
geom_text(data = tab, aes(label = Freq),
position = position_dodge(width = 0.9), vjust = -0.25)
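Since the question asks for labels of the form "n = #", here is a minimal sketch that pastes the prefix onto the counts before plotting. It rebuilds the same frequency table; the label column name and the size value are assumptions chosen for illustration.

```r
library(ggplot2)

# Count every cut/color combination, then build an "n = <count>" label column
tab <- as.data.frame(xtabs(~ cut + color, data = diamonds))
tab$label <- paste0("n = ", tab$Freq)

# Same bar plot as before, with the text layer reading the new label column
gg_n <- ggplot(data = diamonds, aes(x = cut, y = color, fill = color)) +
  geom_bar(stat = "identity", color = "black", position = position_dodge()) +
  geom_text(data = tab, aes(label = label),
            position = position_dodge(width = 0.9), vjust = -0.25, size = 2.5)
gg_n
```

The smaller text size is only there because the longer labels can overlap between neighbouring bars; adjust it to taste.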

Related

Loop to plot boxplot with ggplot

I am using the diamonds data frame and would like to plot a boxplot for each numerical column by category. In this case, the category is defined by the "cut" column.
I am using a for-loop to accomplish this task. Here's the code I am using:
##################################################################################
# Data #
# #
##################################################################################
data("diamonds")
basePlot <- diamonds[ names(diamonds)[!names(diamonds) %in% c("color", "clarity")] ]
##################################################################################
## set Plot view to 4 boxplots ##
par(mfrow = c(2,2))
## for-loop to boxplot all numerical columns ##
for (i in 1:(ncol(basePlot)-1)){
print(ggplot(basePlot, aes(as.factor(cut),
basePlot[c(i)],color=as.factor(cut)))
+ geom_boxplot(outlier.colour="black",outlier.shape=16,outlier.size=1,notch=FALSE)
+ xlab("Diamond Cut")
+ ylab(colnames(basePlot)[i])
)
}
Console output:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error in is.finite(x) : default method not implemented for type 'list'
Is there any other way to accomplish this task?
Instead of multiple plots, I suggest facets. To do this, though, we need to convert the data from "wide" format to "longer" format, and the canonical way in the tidyverse is with tidyr::pivot_longer.
> basePlot
# A tibble: 53,940 x 8
carat cut depth table price x y z
<dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium 59.8 61 326 3.89 3.84 2.31
3 0.23 Good 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium 62.4 58 334 4.2 4.23 2.63
5 0.31 Good 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good 59.4 61 338 4 4.05 2.39
# ... with 53,930 more rows
> pivot_longer(basePlot, -cut, names_to="var", values_to="val")
# A tibble: 377,580 x 3
cut var val
<ord> <chr> <dbl>
1 Ideal carat 0.23
2 Ideal depth 61.5
3 Ideal table 55
4 Ideal price 326
5 Ideal x 3.95
6 Ideal y 3.98
7 Ideal z 2.43
8 Premium carat 0.21
9 Premium depth 59.8
10 Premium table 61
# ... with 377,570 more rows
With this, we only have to tell ggplot2 to use val for the values and var for the facets; cut stays on the x-axis.
library(ggplot2)
library(tidyr) # pivot_longer
ggplot(pivot_longer(basePlot, -cut, names_to="var", values_to="val"),
aes(cut, val, color=cut)) +
geom_boxplot(outlier.colour="black", outlier.shape=16, outlier.size=1, notch=FALSE) +
xlab("Diamond Cut") +
facet_wrap(~var, nrow=2, scales="free") +
scale_x_discrete(guide=guide_axis(n.dodge=2))
The reason you have cut both on the x-axis and in the legend is that color= adds a legend. Since it's redundant, we could either remove the color aesthetic (which would also remove the legend) or just suppress the legend (by adding + scale_color_discrete(guide="none"); older ggplot2 versions also accepted guide=FALSE, which is now deprecated).
There are two ways of faceting: facet_wrap and facet_grid. The latter is well tuned for multiple variables (one facet variable on the x, one on the y) and many other configurations. Granted, you can use facet_grid with just one variable (which is similar to facet_wrap(nrow=1) or ncol=1), but there are some styling distinctions between them.
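To make one of those distinctions concrete, here is a small sketch comparing the two with a single facet variable. The two-column subset (carat and price) is an assumption chosen to keep the example short; it is not part of the original answer.

```r
library(ggplot2)
library(tidyr)  # pivot_longer

# Small long-format example (the column subset is an assumption for illustration)
long <- pivot_longer(diamonds[c("cut", "carat", "price")], -cut,
                     names_to = "var", values_to = "val")

base <- ggplot(long, aes(cut, val)) + geom_boxplot()

# facet_wrap: every panel gets its own y-axis when scales = "free"
p_wrap <- base + facet_wrap(~var, nrow = 1, scales = "free")

# facet_grid: panels in the same row share one y-axis, even with scales = "free"
p_grid <- base + facet_grid(cols = vars(var), scales = "free")
```

Printing p_wrap and p_grid side by side shows why facet_wrap is usually the better fit when the faceted variables are on very different scales.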

Subsetting, Matrices

I am super new to R and currently playing with the "diamond" dataset.
I am trying to return the row corresponding to the lowest, mean and largest prices and put everything in a 10 by 4 matrix. Please explain an easier way of doing this if possible.
library(ggplot2)
data(diamonds)
min(diamonds$price)
mean(diamonds$price)
max(diamonds$price) # this one gives me the wrong val!
M<-matrix(1:cols, nrow = 1, ncol = cols)
colnames(M)<-c("carat","cut" , "color" , "clarity", "depth" , "table" , "price" , "x" , "y" ,"z")
# Here I need to add the rows corresponding to the min,mean,max to this matrix.
Thanks
If all you want to do is to select the rows in the diamonds data frame corresponding to the mean, minimum, and maximum of price, this is easily accomplished with a combination of the $ and [ forms of the extract operator in Base R.
Note that this will return a data frame with 3 rows, not 4, because there are two rows at the minimum price, no rows at the mean price, and one row at the maximum price.
library(ggplot2)
data(diamonds)
diamonds[diamonds$price %in% c(min(diamonds$price),mean(diamonds$price),max(diamonds$price)),]
...and the output:
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
A solution with dplyr uses filter() as follows.
# dplyr solution
library(dplyr)
diamonds %>% filter(price %in% c(min(price),mean(price),max(price)))
...and the output:
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
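Because no row has a price exactly equal to the mean, neither approach returns a "mean" row. If the row closest to the mean is an acceptable stand-in (an assumption about what the question wants), a minimal base R sketch could look like this; the names avg_price, closest and picked are made up for illustration.

```r
library(ggplot2)  # provides the diamonds data set

avg_price <- mean(diamonds$price)

# which.min() returns the index of the first smallest absolute difference
closest <- diamonds[which.min(abs(diamonds$price - avg_price)), ]

# Stack the minimum-price rows, the closest-to-mean row and the maximum-price row
picked <- rbind(
  diamonds[diamonds$price == min(diamonds$price), ],
  closest,
  diamonds[diamonds$price == max(diamonds$price), ]
)
picked
```

Note that which.min() keeps only the first match, so ties at the same distance from the mean are dropped.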
Matrices and data frames are different in R. diamonds is a data frame, and it is better and easier to process if you keep it as a data frame.
summary(diamonds) gives you some nice summary stats for each column.
If you want to apply specific functions to columns using dplyr, you can do:
library(dplyr)
diamonds %>%
summarise(across(where(is.numeric),list(min = min, max = max, mean = mean))) %>%
tidyr::pivot_longer(cols = everything(),
names_to = c('col', '.value'),
names_sep = '_')
# A tibble: 7 x 4
# col min max mean
# <chr> <dbl> <dbl> <dbl>
#1 carat 0.2 5.01 0.798
#2 depth 43 79 61.7
#3 table 43 95 57.5
#4 price 326 18823 3933.
#5 x 0 10.7 5.73
#6 y 0 58.9 5.73
#7 z 0 31.8 3.54
Note that I applied these functions only to numeric columns since cut, color, clarity are factor columns.

Line plot with error bars in which each line is a different group and multiple variables are in the x axis

I'm trying to create a line plot with error bars in R/RStudio, in which each line is a different group (coded by one variable) and several continuous variables make up the x axis.
Taking the diamonds dataset as an example, it would be a multiple-line graph in which each line is one category of the variable "color", and x, y, z are the variables whose values are shown on the y axis while the variables themselves are positioned along the x axis.
The head of diamonds in R (as shown in RStudio) is:
> head(diamonds)
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
An example of a similar graph is the one attached in the picture, but I need one with error bars. That one was made in Stata, which can't add error bars with this command: profileplot varx vary varz, by(groups)
A profile plot without error bars, as an example, is here:
Before we start: we will plot the x, y, z columns from diamonds. Because x and y are very close, I subtract 1 from y so we can see it, and I also introduce some error for the error bars.
library(tidyr)
library(ggplot2)
library(dplyr)
mydata <- diamonds %>% select(color,x,y,z) %>% pivot_longer(-color)
# A tibble: 6 x 3
color name value
<ord> <chr> <dbl>
1 E x 1.80
2 E y 3.98
3 E z 2.43
4 E x 2.92
5 E y 3.84
6 E z 2.31
Then:
ggplot(mydata,aes(x=name,y=value,color=color)) +
stat_summary(fun=mean,geom="point") + # fun= replaces the deprecated fun.y= (ggplot2 >= 3.3.0)
stat_summary(fun=mean,aes(group=color),geom="line") +
stat_summary(fun.data=mean_se,geom="errorbar",width=0.1)
In this case the errorbars etc don't make sense because the x, y and z values are pretty much similar.

Visualize multiple box plot selecting differents rows of a dataframe

I am developing an EDA (Estimation of Distribution Algorithm). I'm collecting all the measures of the Pareto front's solutions under distinct configurations.
I have a structure with all values:
> metrics20
# A tibble: 320 x 6
File Hypervolume `Modified Hypervolume` Spread Spacing Time
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001-unif-0.csv 25771 26294. 391. 30.1 16.8
2 002-unif-0.csv 27481 28416. 534. 41.1 16.5
3 003-unif-0.csv 26394 26842. 356. 29.6 16.5
4 004-unif-0.csv 30828 31696 418. 38.0 16.5
5 005-unif-0.csv 28146 28727 444. 34.2 16.6
6 006-unif-0.csv 30176 31006 451. 50.1 16.6
7 007-unif-0.csv 29374 30216 537. 35.8 16.5
8 008-unif-0.csv 27434 28156. 439. 31.4 16.5
9 009-unif-0.csv 28944 29426 471. 33.7 16.4
10 010-unif-0.csv 28339 29302. 576. 44.3 16.4
I want to visualize the values in the following way. Taking the Hypervolume column as an example, I split the data by the value in the File column into the -unif-, -sat-, -eff- and -prod- distributions, and within each distribution show the values for -0.csv, -0.25.csv, -0.5.csv and -0.75.csv along the x axis.
Reproducible example:
library(readr)
metrics20 <- read_csv("./metrics20.csv")
Data: Link
Hopefully this is a step towards what you're looking for:
library(readr)
library(dplyr)
library(ggplot2)
metrics20 <- read_csv("metrics20.csv")
metrics20 %>%
mutate(tag = factor(gsub("(^\\d+-)(\\w+)(-.*$)", "\\2", .$File), levels = c("unif", "sat", "eff", "prod")),
level = gsub("(^\\d+-\\w+-)(.*)(\\.csv$)", "\\2", .$File)) %>%
ggplot(aes(x = level, y = Hypervolume)) +
geom_boxplot() +
facet_wrap(~tag, nrow = 1)+
theme_minimal() +
theme(panel.border = element_rect(colour = "black", fill = NA),
panel.grid = element_blank())
From here there may be other things you want to tweak if you need to adjust it to be more like the example plot. You should be able to find all next steps in the help for the functions used.
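To see what the two gsub() calls in the answer extract, here is a small sketch applied to a few file names. Only the first name appears in the question; the other two are hypothetical names that follow the -sat-/-0.25.csv pattern the question describes.

```r
# The first name is from the question; the other two are hypothetical examples
files <- c("001-unif-0.csv", "017-sat-0.25.csv", "042-prod-0.75.csv")

# Keep the second capture group: the distribution name between the first two dashes
tag <- gsub("(^\\d+-)(\\w+)(-.*$)", "\\2", files)

# Keep the second capture group: everything between the second dash and ".csv"
level <- gsub("(^\\d+-\\w+-)(.*)(\\.csv$)", "\\2", files)

data.frame(File = files, tag = tag, level = level)
```

Here tag comes out as "unif", "sat", "prod" and level as "0", "0.25", "0.75", which are the facet and x-axis values used in the plot.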

ggplot() color each point manually

How do I create a scatter plot in ggplot() with each point coloured manually? The necessary colours are given in my dataframe.
> head(df)
x y col
1 0.72 2757 #2AAE89
2 0.72 2757 #2DFE83
3 0.72 2757 #40FE89
4 0.70 2757 #28FE97
5 0.86 2757 #007C7D
6 0.75 2757 #24FEA1
The colour of each point must be exactly as given in the dataframe.
Luckily there is a relatively easy solution by using scale_colour_identity(), see the following example:
library(ggplot2)
z <- " x y z col
1 0.72 2757 86 #2AAE89
2 0.72 2757 86 #2DFE83
3 0.72 2757 86 #40FE89
4 0.70 2757 82 #28FE97
5 0.86 2757 26 #007C7D
6 0.75 2757 79 #24FEA1"
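# comment.char = "" is needed: by default read.table treats "#" as the start of a comment,
# which would silently drop the hex colour column
df <- read.table(text = z, header = TRUE, comment.char = "")
ggplot(df, aes(x, y, colour = col)) +
geom_point() +
scale_colour_identity()
EDIT: I originally made a mistake in loading the data (read.table's default comment character swallowed the colour column), but the plotting syntax is still valid.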
