Loop to plot boxplot with ggplot - r

I am using the diamonds data frame.
I would like to plot a boxplot for each numeric column, grouped by category.
In this case the category is defined by the "cut" column.
I am using a for loop to accomplish this task.
Here's the code I am using:
##################################################################################
# Data #
# #
##################################################################################
data("diamonds")
basePlot <- diamonds[ names(diamonds)[!names(diamonds) %in% c("color", "clarity")] ]
##################################################################################
## set Plot view to 4 boxplots ##
par(mfrow = c(2,2))
## for-loop to boxplot all numerical columns ##
for (i in 1:(ncol(basePlot)-1)){
print(ggplot(basePlot, aes(as.factor(cut),
basePlot[c(i)],color=as.factor(cut)))
+ geom_boxplot(outlier.colour="black",outlier.shape=16,outlier.size=1,notch=FALSE)
+ xlab("Diamond Cut")
+ ylab(colnames(basePlot)[i])
)
}
Console output:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error in is.finite(x) : default method not implemented for type 'list'
Is there any other way to accomplish this task?

Instead of multiple plots, I suggest facets. To do this, though, we need to convert the data from "wide" format to "longer" format, and the canonical way in the tidyverse is with tidyr::pivot_longer.
> basePlot
# A tibble: 53,940 x 8
carat cut depth table price x y z
<dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium 59.8 61 326 3.89 3.84 2.31
3 0.23 Good 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium 62.4 58 334 4.2 4.23 2.63
5 0.31 Good 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good 59.4 61 338 4 4.05 2.39
# ... with 53,930 more rows
> pivot_longer(basePlot, -cut, names_to="var", values_to="val")
# A tibble: 377,580 x 3
cut var val
<ord> <chr> <dbl>
1 Ideal carat 0.23
2 Ideal depth 61.5
3 Ideal table 55
4 Ideal price 326
5 Ideal x 3.95
6 Ideal y 3.98
7 Ideal z 2.43
8 Premium carat 0.21
9 Premium depth 59.8
10 Premium table 61
# ... with 377,570 more rows
With this, we only have to tell ggplot2 to use val for the values, cut for the x-axis, and var for the facets.
library(ggplot2)
library(tidyr) # pivot_longer
ggplot(pivot_longer(basePlot, -cut, names_to="var", values_to="val"),
aes(cut, val, color=cut)) +
geom_boxplot(outlier.colour="black", outlier.shape=16, outlier.size=1, notch=FALSE) +
xlab("Diamond Cut") +
facet_wrap(~var, nrow=2, scales="free") +
scale_x_discrete(guide=guide_axis(n.dodge=2))
You have cut both on the x-axis and in the legend because color= adds a legend. Since it's redundant, we could either remove the color aesthetic (which would also remove the legend) or just suppress the legend (by adding + scale_color_discrete(guide="none"); older ggplot2 versions accepted guide=FALSE, which is now deprecated).
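As a small illustration of suppressing the legend, the sketch below keeps the color aesthetic but turns the legend off on the boxplot layer with show.legend = FALSE. The two-column subset (carat and price) and the object names long and p are only assumptions made to keep the example short.

```r
library(ggplot2)
library(tidyr)

# Illustrative subset: the category plus two numeric columns
long <- pivot_longer(diamonds[, c("cut", "carat", "price")], -cut,
                     names_to = "var", values_to = "val")

# Color still varies by cut, but the redundant legend is not drawn
p <- ggplot(long, aes(cut, val, color = cut)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~var, scales = "free")
p
```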
There are two ways of faceting: facet_wrap and facet_grid. The latter is well tuned for multiple variables (one facet variable on the x, one on the y) and many other configurations. Granted, you can use facet_grid with just one variable (which is similar to facet_wrap(nrow=1) or ncol=1), but there are some styling distinctions between them.
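If separate plots are still preferred over facets, the original loop can be repaired. The error comes from passing basePlot[c(i)], which is a one-column data frame, into aes(); mapping the column by name with the .data pronoun avoids that. Note also that par(mfrow = ...) only affects base graphics, not ggplot2. This is a sketch rather than the answerer's method, and num_cols and plots are names introduced here for illustration.

```r
library(ggplot2)

# Same subset as in the question: drop the other categorical columns
basePlot <- diamonds[, !names(diamonds) %in% c("color", "clarity")]

# Every remaining column except the grouping column is numeric
num_cols <- setdiff(names(basePlot), "cut")

# Build one boxplot per numeric column; .data[[nm]] looks the column up by name
plots <- lapply(num_cols, function(nm) {
  ggplot(basePlot, aes(x = cut, y = .data[[nm]], color = cut)) +
    geom_boxplot(outlier.colour = "black", outlier.shape = 16, outlier.size = 1) +
    labs(x = "Diamond Cut", y = nm)
})
names(plots) <- num_cols

# Draw the plots one after another
for (p in plots) print(p)
```

To place several of these plots on one page, a layout package would be needed; that part is left out here.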

Related

How to add "N = " labels to bar plot in R?

I'm looking to add "n = #" under each of the variables on the x-axis but I'm not sure how. The counts don't necessarily have to be under the names, just as long as the counts are there. I'm also working with two categorical variables, so that may be the issue too. Let me know if you have any suggestions, I'm new to R.
Here's some information on the dataset and the variables I'm comparing. The overall data set (scorpions) consists of scorpion species and what vegetation they're found in. Those are the two things I'm comparing. "species" is the vector for the species and "veg" is the vector for the vegetation type. These are both character vectors. I really just want to know how to add more labels onto my graph to give more clarification. This is what my graph currently looks like:
graph
I just want to be able to add number labels anywhere. If you want to recreate it, you can really use any dataset that consists of two character vectors. The other posts don't help because they consist of numerical vectors as well. If it's not possible to do this, then just let me know.
Thank you everyone for the help so far!
ggplot(data=scorpions, aes(x=species,y=veg,fill=veg)) +
geom_bar(stat="identity",color="black",position=position_dodge()) +
theme_stata() +
scale_fill_economist() +
theme(
axis.text.y = element_text(angle = 0),
axis.title = element_text(face="bold"),
axis.text.x = element_text(face = "italic")
) +
labs(title="Relationship Between Species and Vegetation Type")
I've tried changing the names in the Excel spreadsheet, but it looks really messy. I've also tried googling answers but nothing works since it's two categorical variables.
This question is in contrast to the most common dupe-links for grouped bar plots in ggplot2 in that other links (How to put labels over geom_bar for each bar in R with ggplot2 and How to put labels over geom_bar in R with ggplot2) tend to talk about one categorical variable only; this question asks about two categorical variables.
But it's not that hard: we just need to come up with a number for all combinations of each of the two categoricals. I'll use xtabs for that.
Using the ggplot2::diamonds dataset, plotting against cut and color (both categorical; here they are ordered factors rather than character vectors, but the approach is the same):
library(ggplot2)
head(diamonds)
# # A tibble: 6 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Starting with a simple (non-themed) bar plot:
gg <- ggplot(data=diamonds, aes(x=cut,y=color,fill=color)) +
geom_bar(stat="identity",color="black",position=position_dodge())
gg
Calculate the frequency table:
xtabs(~ cut + color, data = diamonds)
# color
# cut D E F G H I J
# Fair 163 224 312 314 303 175 119
# Good 662 933 909 871 702 522 307
# Very Good 1513 2400 2164 2299 1824 1204 678
# Premium 1603 2337 2331 2924 2360 1428 808
# Ideal 2834 3903 3826 4884 3115 2093 896
### convert to a frame
tab <- data.frame(xtabs(~ cut + color, data = diamonds))
head(tab)
# cut color Freq
# 1 Fair D 163
# 2 Good D 662
# 3 Very Good D 1513
# 4 Premium D 1603
# 5 Ideal D 2834
# 6 Fair E 224
New plot, adding geom_text:
gg +
geom_text(data = tab, aes(label = Freq),
position = position_dodge(width = 0.9), vjust = -0.25)
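An alternative sketch, not the approach used above: if the bars themselves are meant to show counts, geom_bar can do the counting and geom_text can reuse the same statistic through stat = "count" and after_stat(count), so no separate frequency table is needed. This assumes ggplot2 3.3.0 or later (for after_stat); the text size is an arbitrary choice.

```r
library(ggplot2)

# Bars: one per cut/color combination, height = number of rows
# Labels: the same counts, dodged with the same width so they sit over their bars
p <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(color = "black", position = position_dodge(width = 0.9)) +
  geom_text(
    stat = "count",
    aes(label = after_stat(count)),
    position = position_dodge(width = 0.9),
    vjust = -0.25, size = 2.5
  )
p
```

Wrapping the label as paste0("n = ", after_stat(count)) should give the "n = #" form asked for in the question.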

Grouping Frame Values

I have a dataset of ingredients for cookies. I'm trying to answer which group (A, B, C, etc) of cookies has the most sugar in them. The dataset is structured as follows:
group id mois prot fat chocolate sugar carb cal
1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
4 B 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
5 B 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
7 C 14082 31.21 20.97 41.34 4.71 1.58 1.77 4.63
8 C 14097 28.76 21.41 41.60 5.28 1.75 2.95 4.72
etc....
How can I plot the mean of each grouping to show that one of them has a higher average of sugar than the others? Or at the least, how can I print off the results of the grouped averages of sugar to defend my argument that one has more sugar than the other?
After saving your text to CSV and loading this file into R, it's pretty easy to obtain the mean sugar quantity per group, which I'm assuming is what you need.
You first group your data by variable group and then summarize the data using the "mean" function.
library(dplyr)
(cookies = df %>%
group_by(group) %>%
summarize(meanSugar = mean(sugar)))
group meanSugar
<chr> <dbl>
1 A 1.71
2 B 1.62
3 C 1.66
As you can see, group A has sugar content a bit higher than the others based on your data.
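For a self-contained check, the eight rows shown in the question can be typed in directly instead of going through a CSV file. The sketch below keeps only the group and sugar columns and sorts the result so the group with the most sugar comes first; the sorting step is an addition, not part of the code above.

```r
library(dplyr)

# The eight rows printed in the question (group and sugar columns only)
df <- data.frame(
  group = c("A", "A", "A", "B", "B", "A", "C", "C"),
  sugar = c(1.77, 1.79, 1.63, 1.61, 1.64, 1.65, 1.58, 1.75)
)

# Mean sugar per group, highest first
cookies <- df %>%
  group_by(group) %>%
  summarise(meanSugar = mean(sugar)) %>%
  arrange(desc(meanSugar))

cookies
```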
If you wanna go a step further and really plot this data, you can do that:
library(ggplot2)
cookies %>%
ggplot(aes(x=meanSugar,y=reorder(group,meanSugar),fill=group,label=meanSugar)) +
geom_col()+
labs(y="Cookie groups",x="Mean Sugar")+
geom_label(stat="identity",hjust=+1.2,color="white")+
theme(legend.position = "none")
If you have any questions on some of these steps, let me know!
Note: please try to provide reproducible data next time, so it's easier to reproduce what you need and give you a quick answer :)

Translating filter_all(any_vars()) to filter(across())

On updating my own answer to another thread, I wasn't able to come up with a good solution to replace the last example (see below). The idea is to get all rows where any column contains a certain string, in my example "V".
library(tidyverse)
#get all rows where any column contains 'V'
diamonds %>%
filter_all(any_vars(grepl('V',.))) %>%
head
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
# this does naturally not give the desired output!
diamonds %>%
filter(across(everything(), ~ grepl('V', .))) %>%
head
#> # A tibble: 0 x 10
I found a thread where the poster ponders over similar stuff, but applying a similar logic on grepl does not work.
### don't run, this is ugly and does not work
diamonds %>%
rowwise %>%
filter(any(grepl("V", across(everything())))) %>%
head
This is tricky, because the example shows that you want to keep a row when any of its columns meets the condition (i.e. you want a union). That's what filter_all() with any_vars() does.
By contrast, filter(across(everything(), ...)) keeps a row only when all of its columns meet the condition (i.e. this is an intersection, quite the opposite of the previous).
To convert it from an intersection to a union (i.e. to again get the rows where any of the columns meets the condition), you can check the row sum:
diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
It will sum all the TRUEs that appear in the row, i.e. if there is at least one value meeting the condition, that row sum will be > 0 and will be shown.
I'm sorry that across() is not the outermost call inside filter(), but it's at least one idea of how to do that. :-)
Evaluation:
Using @TimTeaFan's method to check that:
identical(
{diamonds %>%
filter_all(any_vars(grepl('V',.)))
},
{diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
}
)
#> [1] TRUE
Benchmark:
As per our discussion under TimTeaFan's answer, here is a comparison. Surprisingly, all solutions take a similar amount of time:
library(tidyverse)
microbenchmark::microbenchmark(
filter_all = {diamonds %>%
filter_all(any_vars(grepl('V',.)))},
purrr_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))},
base_reduce = {diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% Reduce(`|`, .))},
rowsums = {diamonds %>%
filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)},
times = 100L,
check = "identical"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> filter_all 295.7235 302.1311 309.6455 305.0491 310.0335 449.3619 100
#> purrr_reduce 297.8220 302.4411 310.2829 306.2929 312.2278 461.0194 100
#> base_reduce 298.5033 303.6170 309.4147 306.1839 312.3518 409.5273 100
#> rowsums 295.3863 301.0281 307.8517 305.3142 309.4793 372.8867 100
Created on 2020-07-14 by the reprex package (v0.3.0)
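For readers on a newer dplyr: version 1.0.4 added if_any() (and if_all()), which express the union directly inside filter(). The sketch below assumes dplyr 1.0.4 or later and compares the result with the rowSums() approach; res_if_any and res_rowsums are names made up for this example.

```r
library(dplyr)
library(ggplot2)  # for the diamonds data

# Keep rows where at least one column contains "V"
res_if_any <- diamonds %>%
  filter(if_any(everything(), ~ grepl("V", .x)))

# The rowSums() formulation from above, for comparison
res_rowsums <- diamonds %>%
  filter(rowSums(across(everything(), ~ grepl("V", .x))) > 0)

identical(res_if_any, res_rowsums)
```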
This is the equivalent of the filter_all call you posted. However, @akrun is totally correct to point out that the columns should be converted to character first. That said, the same holds true for your filter_all statement.
The idea is to use across(everything(), ~ grepl('V', .)) to transform the whole data.frame into columns of TRUE and FALSE according to grepl('V', .). However, filter needs a vector or a one-column data.frame, so we collapse it with reduce(`|`). It combines the first two columns with |, then the result of that call with the third column, and so on, until the original data.frame is reduced to a single vector of TRUE and FALSE, which can then be used to filter the rows.
library(ggplot2)
library(dplyr)
diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`)) %>%
head
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
identical({diamonds %>%
filter_all(any_vars(grepl('V',.)))},
{diamonds %>%
filter(across(everything(), ~ grepl('V', .)) %>% purrr::reduce(`|`))
})
#> [1] TRUE
Created on 2020-07-14 by the reprex package (v0.3.0)
Some of the columns are ordered factors, which interferes with c_across. If we instead convert them to character class first and then apply grepl, it should work:
library(dplyr)
library(ggplot2)
diamonds %>%
head %>%
mutate(across(where(is.factor), as.character)) %>%
rowwise %>%
filter(any(grepl("V", c_across(where(is.character)))))
# A tibble: 3 x 10
# Rowwise:
# carat cut color clarity depth table price x y z
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#2 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

Subsetting, Matrices

I am super new to R and currently playing with the "diamonds" dataset.
I am trying to return the row corresponding to the lowest, mean and largest prices and put everything in a 10 by 4 matrix. Please explain an easier way of doing this if possible.
library(ggplot2)
data(diamonds)
min(diamonds$price)
mean(diamonds$price)
max(diamonds$price) # this one gives me the wrong val!
M<-matrix(1:cols, nrow = 1, ncol = cols)
colnames(M)<-c("carat","cut" , "color" , "clarity", "depth" , "table" , "price" , "x" , "y" ,"z")
# Here I need to add the rows corresponding to the min,mean,max to this matrix.
Thanks
If all you want to do is to select the rows in the diamonds data frame corresponding to the mean, minimum, and maximum of price, this is easily accomplished with a combination of the $ and [ forms of the extract operator in Base R.
Note that this will return a data frame with 3 rows, not 4, because there are two rows at the minimum price, no rows at the mean price, and one row at the maximum price.
library(ggplot2)
data(diamonds)
diamonds[diamonds$price %in% c(min(diamonds$price),mean(diamonds$price),max(diamonds$price)),]
...and the output:
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
A solution with dplyr uses filter() as follows.
# dplyr solution
library(dplyr)
diamonds %>% filter(price %in% c(min(price),mean(price),max(price)))
...and the output:
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
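Since no row has a price exactly equal to the mean, one possible variant is to pick the row whose price is closest to the mean. This is a sketch of that idea, not part of the answer above; it returns a single row each for the minimum and maximum (the first match), and idx is a name introduced here.

```r
library(ggplot2)  # for the diamonds data

p <- diamonds$price

# Row positions: first minimum, row nearest the mean, first maximum
idx <- c(
  min          = which.min(p),
  nearest_mean = which.min(abs(p - mean(p))),
  max          = which.max(p)
)

diamonds[idx, ]
```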
Matrices and data frames are different in R. diamonds is a data frame, and it is better and easier to process if you keep it as a data frame.
summary(diamonds) gives you some nice summary stats for each column.
If you want to apply specific functions to columns using dplyr, you can do :
library(dplyr)
diamonds %>%
summarise(across(where(is.numeric),list(min = min, max = max, mean = mean))) %>%
tidyr::pivot_longer(cols = everything(),
names_to = c('col', '.value'),
names_sep = '_')
# A tibble: 7 x 4
# col min max mean
# <chr> <dbl> <dbl> <dbl>
#1 carat 0.2 5.01 0.798
#2 depth 43 79 61.7
#3 table 43 95 57.5
#4 price 326 18823 3933.
#5 x 0 10.7 5.73
#6 y 0 58.9 5.73
#7 z 0 31.8 3.54
Note that I applied these functions only to numeric columns since cut, color, clarity are factor columns.
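If a matrix really is required, as the question asks, a base R sketch along the following lines should produce one directly. It assumes only the numeric columns are wanted, which gives a 3 by 7 matrix rather than the 10 columns mentioned in the question; num and M are illustrative names.

```r
library(ggplot2)  # only for the diamonds data

# Keep the numeric columns
num <- diamonds[sapply(diamonds, is.numeric)]

# One column per variable, one row per statistic
M <- sapply(num, function(v) c(min = min(v), mean = mean(v), max = max(v)))
M
```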

Line plot with error bars in which each line is a different group and multiple variables are in the x axis

I'm trying to create a line plot with error bars in R/Rstudio, in which each line is a different group (coded by one variable) and different continuous variables compose the x axis.
Taking the diamonds dataset as an example, it would be a multiple-line graph in which each line is one category of the variable "color", and x, y and z are the variables positioned along the x axis, with their values shown on the y axis.
The head of diamonds in R (as printed in RStudio) is:
>head(diamonds)
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
An example of a similar graph is the one in the attached picture, but I need one with error bars. That one was made in Stata, which can't add error bars with this command: profileplot varx vary varz, by(groups)
(Image: profile plot without error bars, as an example.)
Before we start: we will plot the x, y and z columns from diamonds. Because x and y are very close, I subtract 1 from y so we can see it, and I also introduce some error for the error bars.
library(tidyr)
library(ggplot2)
library(dplyr)
mydata <- diamonds %>% select(color,x,y,z) %>% pivot_longer(-color)
# A tibble: 6 x 3
color name value
<ord> <chr> <dbl>
1 E x 1.80
2 E y 3.98
3 E z 2.43
4 E x 2.92
5 E y 3.84
6 E z 2.31
Then:
ggplot(mydata,aes(x=name,y=value,color=color)) +
stat_summary(fun=mean,geom="point") +
stat_summary(fun=mean,aes(group=color),geom="line") +
stat_summary(fun.data=mean_se,geom="errorbar",width=0.1)
In this case the errorbars etc don't make sense because the x, y and z values are pretty much similar.
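An alternative sketch is to compute the means and standard errors explicitly and then draw them with geom_line and geom_errorbar, which makes it easier to inspect or change what the error bars represent. It uses the unmodified x, y and z columns; summ and se are names introduced here, and the standard error is taken as sd / sqrt(n).

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Mean and standard error of each variable within each color
summ <- diamonds %>%
  select(color, x, y, z) %>%
  pivot_longer(-color) %>%
  group_by(color, name) %>%
  summarise(mean = mean(value),
            se   = sd(value) / sqrt(n()),
            .groups = "drop")

# One line per color, variables along the x axis, error bars of +/- 1 SE
ggplot(summ, aes(x = name, y = mean, color = color, group = color)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.1)
```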

Resources