Visualize multiple box plots by selecting different rows of a data frame


I am developing an EDA (Estimation of Distribution Algorithm). I'm collecting all the metrics of the Pareto front's solutions under distinct configurations.
I have a structure with all values:
> metrics20
# A tibble: 320 x 6
File Hypervolume `Modified Hypervolume` Spread Spacing Time
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001-unif-0.csv 25771 26294. 391. 30.1 16.8
2 002-unif-0.csv 27481 28416. 534. 41.1 16.5
3 003-unif-0.csv 26394 26842. 356. 29.6 16.5
4 004-unif-0.csv 30828 31696 418. 38.0 16.5
5 005-unif-0.csv 28146 28727 444. 34.2 16.6
6 006-unif-0.csv 30176 31006 451. 50.1 16.6
7 007-unif-0.csv 29374 30216 537. 35.8 16.5
8 008-unif-0.csv 27434 28156. 439. 31.4 16.5
9 009-unif-0.csv 28944 29426 471. 33.7 16.4
10 010-unif-0.csv 28339 29302. 576. 44.3 16.4
I want to visualize the values in the following way. Taking the Hypervolume column as an example, I want to split the data by the distribution named in the File column (-unif-, -sat-, -eff- and -prod-) and, for each distribution, show the values for the files ending in -0.csv, -0.25.csv, -0.5.csv and -0.75.csv along the x axis.
Reproducible example:
library(readr)
metrics20 <- read_csv("./metrics20.csv")
Data: Link

Hopefully this is a step towards what you're looking for:
library(readr)
library(dplyr)
library(ggplot2)
metrics20 <- read_csv("metrics20.csv")
metrics20 %>%
  # tag: the distribution name between the first two hyphens (unif, sat, eff, prod)
  mutate(tag = factor(gsub("(^\\d+-)(\\w+)(-.*$)", "\\2", File), levels = c("unif", "sat", "eff", "prod")),
         # level: the value between the distribution name and ".csv" (0, 0.25, 0.5, 0.75)
         level = gsub("(^\\d+-\\w+-)(.*)(\\.csv$)", "\\2", File)) %>%
  ggplot(aes(x = level, y = Hypervolume)) +
  geom_boxplot() +
  facet_wrap(~tag, nrow = 1) +
  theme_minimal() +
  theme(panel.border = element_rect(colour = "black", fill = NA),
        panel.grid = element_blank())
From here there may be other things you want to tweak if you need to adjust it to be more like the example plot. You should be able to find all next steps in the help for the functions used.
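To see what the two gsub() calls in the answer extract, here is a minimal sketch in base R. The file names are hypothetical: they only follow the naming pattern described in the question and are not taken from the real data.

```r
# Hypothetical file names following the pattern described in the question
files <- c("001-unif-0.csv", "045-sat-0.25.csv", "120-eff-0.5.csv", "300-prod-0.75.csv")

# Keep only capture group 2: the distribution name between the first two hyphens
tag <- gsub("(^\\d+-)(\\w+)(-.*$)", "\\2", files)

# Keep only capture group 2: whatever sits between the distribution name and ".csv"
level <- gsub("(^\\d+-\\w+-)(.*)(\\.csv$)", "\\2", files)

data.frame(File = files, tag = tag, level = level)
```

The tag column is what facet_wrap() splits on, and the level column is what ends up on the x axis of each panel.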

Related

ggplot boxplot with mean and confidence interval by group

I'd like to make a boxplot that shows the mean instead of the median. Moreover, I would like the whiskers to stop at the 5% (lower) and 95% (upper) quantiles. Here is the code:
ggplot(data, aes(x=Cement, y=Mean_Gap, fill=Material)) +
geom_boxplot(fatten = NULL,aes(fill=Material), position=position_dodge(.9)) +
xlab("Cement") + ylab("Mean cement layer thickness") +
stat_summary(fun=mean, geom="point", aes(group=Material), position=position_dodge(.9),color="black")
I'd like to change geom to errorbar, but this doesn't work. I tried middle = mean(Mean_Gap), but this doesn't work either. I tried ymin = quantile(y,0.05), but nothing was changing. Can anyone help me?
The standard boxplot using ggplot. fill is Material:

Here is how you can create the boxplot using custom parameters for the box and whiskers. It's the solution shown by @lukeA in stackoverflow.com/a/34529614/6288065, but this one also shows how to make several boxes by group.
The R built-in data set called "ToothGrowth" is similar to your data structure so I will use that as an example. We will plot the length of tooth growth (len) for each vitamin C supplement group (supp), separated/filled by dosage level (dose).
# "ToothGrowth" at a glance
head(ToothGrowth)
# len supp dose
#1 4.2 VC 0.5
#2 11.5 VC 0.5
#3 7.3 VC 0.5
#4 5.8 VC 0.5
#5 6.4 VC 0.5
#6 10.0 VC 0.5
library(dplyr)
library(ggplot2)
# recreate the data structure with specific "len" coordinates to plot for each group
df <- ToothGrowth %>%
group_by(supp, dose) %>%
summarise(
y0 = quantile(len, 0.05),
y25 = quantile(len, 0.25),
y50 = mean(len),
y75 = quantile(len, 0.75),
y100 = quantile(len, 0.95))
df
## A tibble: 6 x 7
## Groups: supp [2]
# supp dose y0 y25 y50 y75 y100
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 OJ 0.5 8.74 9.7 13.2 16.2 19.7
#2 OJ 1 16.8 20.3 22.7 25.6 26.9
#3 OJ 2 22.7 24.6 26.1 27.1 30.2
#4 VC 0.5 4.65 5.95 7.98 10.9 11.4
#5 VC 1 14.0 15.3 16.8 17.3 20.8
#6 VC 2 19.8 23.4 26.1 28.8 33.3
# boxplot using the mean for the middle line and the 5%/95% quantiles for the whiskers
ggplot(df, aes(supp, fill = as.factor(dose))) +
geom_boxplot(
aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100),
stat = "identity"
) +
labs(y = "len", title = "Boxplot with Mean Middle Line") +
theme(plot.title = element_text(hjust = 0.5))
In the figure above, the boxplot on the left is the standard boxplot with regular median line and regular min/max whiskers. The boxplot on the right uses the mean middle line and 5%/95% quantile whiskers.
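An alternative that skips the pre-summarised data frame is to hand stat_summary() a function that returns the five box statistics. This is only a sketch: mean_box is a hypothetical helper name, and it assumes a ggplot2 version in which stat_summary() accepts the fun.data argument together with geom = "boxplot".

```r
library(ggplot2)

# Hypothetical helper: mean as the middle line, 5%/95% quantiles as whisker ends
mean_box <- function(x) {
  q <- unname(quantile(x, c(0.05, 0.25, 0.75, 0.95)))
  data.frame(ymin = q[1], lower = q[2], middle = mean(x), upper = q[3], ymax = q[4])
}

# One box per supp/dose combination, dodged side by side
p <- ggplot(ToothGrowth, aes(x = supp, y = len, fill = factor(dose))) +
  stat_summary(fun.data = mean_box, geom = "boxplot", position = position_dodge(0.9)) +
  labs(fill = "dose")
p
```

The box edges are still the 25% and 75% quantiles, so only the middle line and the whiskers differ from a default geom_boxplot().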

Sum total distance by groups

I have a df tracking the movement of points each hour. I want to find the total distance traveled within each group/trial by adding up the distances between the hourly coordinates, but I'm confusing myself with the apply functions.
I want to say "in each group/trial, sum [distance(hour1-hour2), distance(hour2-hour3), distance(hour3-hour4), ...] up to the current hour", so that on each line I have a cumulative distance traveled value.
I've created a fake df below.
paths <- data.frame(matrix(nrow=80,ncol=5))
colnames(paths) <- c("trt","trial","hour","X","Y")
paths$trt <- rep(c("A","B","C","D"),each=20)
paths$trial <- rep(c(rep(1,times=10),rep(2,times=10)),times=4)
paths$hour <- rep(1:10,times=8)
paths[,4:5] <- runif(160,0,50)
#this shows the paths that I want to measure.
ggplot(data=paths,aes(x=X,y=Y,group=interaction(trt,trial),color=trt))+
geom_path()
I probably want to add a column paths$dist.traveled to keep track each hour.
I think I could use apply or maybe even aggregate but I've been using PointDistance to find the distances, so I'm a bit confused. I also would rather not do a loop inside a loop, because the real dataset is large.

Here's an answer that uses {dplyr}:
library(dplyr)
paths %>%
arrange(trt, trial, hour) %>%
group_by(trt, trial) %>%
mutate(dist_travelled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2)) %>%
mutate(total_dist = sum(dist_travelled, na.rm = TRUE)) %>%
ungroup()
If you wanted the total distance but grouped only by trt and not trial you would just remove that from the call to group_by().
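The question asks for a running total on each row, which the code above does not produce (total_dist repeats the group total on every row). A minimal sketch of the cumulative version, assuming the same column names as the fake data in the question; the seed and the step_dist name are additions made here for illustration:

```r
library(dplyr)

set.seed(42)  # assumption: a seed, only so that the toy data is reproducible
paths <- data.frame(
  trt   = rep(c("A", "B", "C", "D"), each = 20),
  trial = rep(rep(1:2, each = 10), times = 4),
  hour  = rep(1:10, times = 8),
  X     = runif(80, 0, 50),
  Y     = runif(80, 0, 50)
)

paths_cum <- paths %>%
  arrange(trt, trial, hour) %>%
  group_by(trt, trial) %>%
  mutate(
    # distance from the previous hour's position within the same trt/trial
    step_dist = sqrt((X - lag(X))^2 + (Y - lag(Y))^2),
    # the first hour of each group has no previous point, so count it as 0
    step_dist = coalesce(step_dist, 0),
    # running total of the distance travelled so far
    dist_travelled = cumsum(step_dist)
  ) %>%
  ungroup()

head(paths_cum)
```

Because lag() and cumsum() run inside group_by(), the running total restarts at zero for every trt/trial combination.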

Is this what you are trying to achieve?:
paths <- paths %>%
  mutate(dist.traveled = sqrt((X-lag(X))^2 + (Y-lag(Y))^2))
as_tibble(paths)
trt trial hour X Y dist.traveled
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 1 1 11.2 26.9 NA
2 A 1 2 20.1 1.48 27.0
3 A 1 3 30.4 0.601 10.4
4 A 1 4 31.1 26.6 26.0
5 A 1 5 38.1 30.4 7.88
6 A 1 6 27.9 47.9 20.2
7 A 1 7 16.5 35.3 16.9
8 A 1 8 0.328 13.0 27.6
9 A 1 9 14.0 41.7 31.8
10 A 1 10 29.7 7.27 37.8
# ... with 70 more rows
# lag() is not grouped here, so hour 1 of each trial would be measured from the
# last point of the previous trial; blank those steps out
paths$dist.traveled[which(paths$hour==1)] <- NA
paths %>%
group_by(trt)%>%
summarise(total_distance = sum(dist.traveled, na.rm = TRUE))
trt total_distance
<chr> <dbl>
1 A 492.
2 B 508.
3 C 479.
4 D 462.
I am adding a new column with the distance covered in each step, and then summing those distances for each group.

how to use map function in r to find the range and quantile

I first simulated 500 samples of size 55 in the normal distribution.
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
1) For each sample, I want the mean, median, range, and third quartile. Then I need to store these together in a data frame.
This is what I have. I am not sure about the range or the quantile. I tried sapply and lapply but not sure how they work.
stats <- data.frame(
means = map_dbl(samples,mean),
medians = map_dbl(samples,median),
sd= map_dbl(samples,sd),
range= map_int(samples, max-min),
third_quantile=sapply(samples,quantile,type=3)
)
2) Then plot the sampling distribution (histogram) of the means.
I try to plot but I don't get how to get the mean
stats <- gather(stats, key = "Trials", value = "Mean")
ggplot(stats,aes(x=Trials))+geom_histogram()
3) Then I want to plot the other three statistics in (three separate graphs) of a single plotting window.
I know I need to use something like gather and facet_wrap, but I am not sure how to do it.

You were almost there. All that is needed is to define anonymous functions wherever there are errors.
library(tidyverse)
set.seed(1234) # Make the results reproducible
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
str(samples)
stats <- data.frame(
means = map_dbl(samples, mean),
medians = map_dbl(samples, median),
sd = map_dbl(samples, sd),
range = map_dbl(samples, function(x) diff(range(x))),
third_quantile = map_dbl(samples, function(x) quantile(x, probs = 3/4, type = 3))
)
str(stats)
#'data.frame': 500 obs. of 5 variables:
# $ means : num 49.8 51.5 52.2 50.2 51.6 ...
# $ medians : num 51.5 51.7 51 51.1 50.5 ...
# $ sd : num 9.55 7.81 11.43 8.97 10.75 ...
# $ range : num 38.5 37.2 54 36.7 60.2 ...
# $ third_quantile: num 57.7 56.2 58.8 55.6 57 ...

The map_dbl functions you're using are definitely nice, but if you're trying to get a data frame in the end anyway, you might have an easier time converting the list into a data frame at the beginning, then taking advantage of some dplyr functions.
I'm first mapping over the list, creating tibbles, and binding it together with an added ID. The conversion creates a column value of the sample values. summarise_at lets you take a list of functions—supplying names in the list sets the names in the resultant data frame. You can use purrr's ~. notation to define these functions inline where needed. Cuts down on the number of times you have to map_dbl and so on.
library(tidyverse)
stats <- samples %>%
map_dfr(as_tibble, .id = "sample") %>%
group_by(sample) %>%
summarise_at(vars(value),
.funs = list(mean = mean, median = median, sd = sd,
range = ~(max(.) - min(.)),
third_quartile = ~quantile(., probs = 0.75)))
head(stats)
#> # A tibble: 6 x 6
#> sample mean median sd range third_quartile
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 45.0 44.4 8.71 47.6 48.6
#> 2 10 51.0 52.0 9.55 49.3 56.2
#> 3 100 51.6 52.2 10.4 60.7 58.1
#> 4 101 51.6 51.1 9.92 37.6 57.2
#> 5 102 49.1 48.2 9.65 39.8 57.0
#> 6 103 52.2 51.3 10.1 47.4 58.5
Next, in your code you gathered the data—which is often the solution folks need on SO—but if you're only trying to show the mean column, you can work with it as is.
ggplot(stats, aes(x = mean)) +
geom_histogram()
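For the third part of the question (the other three statistics in separate panels of one plotting window), one possible sketch reshapes the statistics to long format and facets on the statistic name. It uses pivot_longer() rather than gather(). The stats_long name and the bins = 30 setting are choices made here, not part of the original answers.

```r
library(purrr)
library(tidyr)
library(ggplot2)

set.seed(1234)
samples <- replicate(500, rnorm(55, mean = 50, sd = 10), simplify = FALSE)

# One row per sample, one column per statistic
stats <- data.frame(
  median         = map_dbl(samples, median),
  range          = map_dbl(samples, function(x) diff(range(x))),
  third_quartile = map_dbl(samples, function(x) unname(quantile(x, probs = 0.75)))
)

# Long format: one row per sample and statistic
stats_long <- pivot_longer(stats, cols = everything(),
                           names_to = "statistic", values_to = "value")

# One histogram panel per statistic, each with its own x scale
p <- ggplot(stats_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ statistic, scales = "free_x")
p
```

The free x scales matter because the range sits on a very different scale from the median and the third quartile.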

How can I plot using 2 y-axes using a single data frame with 7 variables having a wide range of values?

I have 7 variables (density of plankton functional groups) in a time series which I want to place in a single plot to compare their trends over time. I used ggplot, geom_point and geom_line. Since the variables differ widely in range, those with smaller values end up as almost flat lines when plotted against those with larger values. Since I am only after the trends, not the density, I would prefer to see all lines in one plot. I considered using the sec.axis argument, but could not figure out how to assign the variables to either of the y-axes.
Below is my sample data:
seq=1:6
fgrp:
Cop<-c(4.166667,4.722222,3.055556,4.444444,2.777778,2.222222)
Cyan<-c(7.222222,3.888889,1.388889,0.555556,6.944444,3.611111)
Dia<-c(96.66667,43.88889,34.44444,111.8056,163.0556,94.16667)
Dino<-c(126.9444,71.11111,50,55.97222,65,38.33333)
Naup<-c(271.9444,225.5556,207.7778,229.8611,139.7222,92.5)
OT<-c(22.5,19.16667,10.27778,18.61111,18.88889,8.055556)
Prot<-c(141.9444,108.8889,99.16667,113.8889,84.44444,71.94444)
And the ggplot script without the sec.axis since I could not make it work yet:
ggplot(data=df,aes(x=seq,y=mean,shape=fgrp,linetype=fgrp))+geom_point(size=2.5)+geom_line(size=0.5)+scale_shape_manual(values=c(16,17,15,18,8,1,0),
guide=guide_legend(title="Functional\nGroups"))+scale_linetype_manual(values=c("solid","longdash","dotted","dotdash","dashed","twodash","12345678"),guide=F)+scale_y_continuous(sec.axis = sec_axis(~./3)) +geom_errorbar(mapping=aes(ymax=mean+se,ymin=mean-se), width=0.04,linetype="longdash",color="gray30")+theme_minimal()+labs(list(title="Control",x="time",y="density"),size=12)+theme(plot.title = element_text(size = 12,hjust = 0.5 ))

The lines do not look terrible, as is, but here's an example that leverages facet_wrap with scales = "free_y" that should get you going in the right direction:
library(tidyverse)
seq <- 1:6
Cop <- c(4.166667,4.722222,3.055556,4.444444,2.777778,2.222222)
Cyan <- c(7.222222,3.888889,1.388889,0.555556,6.944444,3.611111)
Dia <- c(96.66667,43.88889,34.44444,111.8056,163.0556,94.16667)
Dino <- c(126.9444,71.11111,50,55.97222,65,38.33333)
Naup <- c(271.9444,225.5556,207.7778,229.8611,139.7222,92.5)
OT <- c(22.5,19.16667,10.27778,18.61111,18.88889,8.055556)
Prot <- c(141.9444,108.8889,99.16667,113.8889,84.44444,71.94444)
df <- tibble(
seq = seq,
cop = Cop,
cyan = Cyan,
dia = Dia,
dino = Dino,
naup = Naup,
ot = OT,
prot = Prot
)
df
#> # A tibble: 6 x 8
#> seq cop cyan dia dino naup ot prot
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4.17 7.22 96.7 127. 272. 22.5 142.
#> 2 2 4.72 3.89 43.9 71.1 226. 19.2 109.
#> 3 3 3.06 1.39 34.4 50 208. 10.3 99.2
#> 4 4 4.44 0.556 112. 56.0 230. 18.6 114.
#> 5 5 2.78 6.94 163. 65 140. 18.9 84.4
#> 6 6 2.22 3.61 94.2 38.3 92.5 8.06 71.9
df_tidy <- df %>%
gather(grp, value, -seq)
df_tidy
#> # A tibble: 42 x 3
#> seq grp value
#> <int> <chr> <dbl>
#> 1 1 cop 4.17
#> 2 2 cop 4.72
#> 3 3 cop 3.06
#> 4 4 cop 4.44
#> 5 5 cop 2.78
#> 6 6 cop 2.22
#> 7 1 cyan 7.22
#> 8 2 cyan 3.89
#> 9 3 cyan 1.39
#> 10 4 cyan 0.556
#> # ... with 32 more rows
ggplot(df_tidy, aes(x = seq, y = value, color = grp)) +
geom_line()
ggplot(df_tidy, aes(x = seq, y = value, color = grp)) +
geom_line() +
facet_wrap(~ grp, scales = "free_y")
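If a secondary axis is still wanted, ggplot2 does not assign variables to axes. Instead, the series meant for the right-hand axis is rescaled onto the primary scale, and sec_axis() undoes that rescaling for the axis labels. A minimal sketch with only two of the groups; the scaling factor k (the ratio of the two maxima) is an assumption chosen here, and any other factor would work the same way.

```r
library(ggplot2)

df <- data.frame(
  seq  = 1:6,
  Cop  = c(4.166667, 4.722222, 3.055556, 4.444444, 2.777778, 2.222222),
  Naup = c(271.9444, 225.5556, 207.7778, 229.8611, 139.7222, 92.5)
)

# Assumed scaling factor: stretches Cop so that its maximum matches Naup's
k <- max(df$Naup) / max(df$Cop)

p <- ggplot(df, aes(x = seq)) +
  geom_line(aes(y = Naup, linetype = "Naup")) +
  geom_line(aes(y = Cop * k, linetype = "Cop")) +        # Cop drawn on the Naup scale
  scale_y_continuous(
    name = "Naup density",
    sec.axis = sec_axis(~ . / k, name = "Cop density")   # labels converted back to Cop units
  ) +
  labs(x = "time", linetype = "Functional\ngroup") +
  theme_minimal()
p
```

With seven groups spanning several orders of magnitude, two axes can only serve two scales, which is one reason the faceted version above may be easier to read.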

How to reorder discrete y axis on a plot with facets, flipped coordinates, and continuous x axis

Objective:
I am trying to draw a plot of data on cities in several countries, grouped by region, then arranged by my own order from 1 at the top in ascending order going down the plot along the y axis.
The plot is currently grouping the regions in reverse alphabetical order; I'd like it to be in alphabetical order as well. Also, it is currently arranging the nations alphabetically. My order is not being used.
What I've tried so far:
You'll see two lines of code that have been commented out below:
#aspect.ratio=1/5, #I tried this and it did not work for me
and
#coord_fixed(ratio=1/500) + #new one for fixing y axis spacing #this did not solve it either
and I've tried to solve it using the solution proposed here: https://www.r-bloggers.com/ordering-categories-within-ggplot2-facets-2/ with this code:
group_by(rdatacities$region, rdatacities$country) %>%
arrange(desc(contribution)) %>%
ungroup() %>%
mutate(country = factor(paste(country, region, sep = "__"), levels = rev(paste(country, region, sep = "__")))) %>%
# --ggplot here--
scale_x_discrete(labels = function(x) gsub("__.+$", "", x))
but group_by(rdatacities$region, rdatacities$country) %>% throws this error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "character"
I am not an R expert, I believe this to be a minimal, complete, and verifiable example, but if it's a wall of terrible code, I apologize in advance.
library(readxl)
library(grid)
library(scales)
library(ggplot2)
library(graphics)
library(grDevices)
library(datasets)
p<-ggplot(rdatacities, aes(x=country, y=target, size=citiesorig)) +
geom_point(shape=21, alpha=.6, stroke=.75) + #municipal targets
facet_grid(region ~ ., scales = "free_y", space = "free_y") + #regional clusters
geom_point(data=rdatanation, shape=0, size=2.5, stroke=1, aes(colour='red'))
#national targets
plot <- p + theme(
plot.title = element_text(hjust=0.5), legend.key=element_rect(fill='white'),
panel.background = element_blank(), plot.margin = margin(2),
axis.ticks.y = element_blank(), panel.grid.major.y= element_line(size=.2, linetype='solid', colour='grey'),
axis.text.y = element_text(size=6),
panel.grid.major.x= element_blank(), axis.line.x = element_blank(),
axis.ticks.x=element_line(color='black'),
#aspect.ratio=1/5, #I tried this and it did not work for me
strip.background=element_rect(fill = NA, size = 0, color = "white", linetype = "blank")
) +
#coord_fixed(ratio=1/500) + #new one for fixing y axis spacing #this did not solve it either
scale_y_continuous(name="Target by 2030 as Percent of Base Year", labels = percent) +
scale_x_discrete(name="Country", expand=waiver()) +
coord_flip(ylim = c(0, 3)) +
ggtitle("National vs. Municipal GHG Emissions Targets") +
guides(colour = guide_legend(order = 1, "Targets"),
size = guide_legend(order = 2,"Municipal")) +
scale_color_manual(name="Targets", labels = c("National"), values = c('#843C0C'))
plot
Here is a sample of the data from rdatacities:
region country target cities order citiesorig
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 EAP JPN 0.660 1 71 1
2 EAP JPN 0.75 2 71 2
3 EAP JPN 0.8 1 71 1
4 EAP JPN 0.85 1 71 1
5 EAP JPN 0.88 1 71 1
6 EAP JPN 0.96 1 71 1
7 EAP KOR 0.6 1 72 1
8 ECA ALB 0.22 1 68 1
9 ECA AZE 0.2 1 65 1
10 ECA BLR 0.2 2 62 6
# ... with 391 more rows
Here is a sample of the data from rdatanation:
region countryname country target EmBase EmCBase EmC2014 UrbPop2014 `%UrbPop2014` pop2014
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EAP Australia AUS 1 263705. 15.5 15.4 20947819 0.893 23460694
2 EAP Brunei Darussalam BRN 2.09 6194. 23.9 22.1 316547 0.769 411704
3 EAP French Polynesia PYF NA 436. 2.20 2.92 154205 0.560 275484
4 EAP Japan JPN 0.82 1096180. 8.87 9.54 118393408 0.930 127276000
5 EAP Korea, Rep. KOR 0.74 246943. 5.76 11.6 41794948 0.824 50746659
6 EAP Malaysia MYS 5.93 56593. 3.14 8.03 22371755 0.740 30228017
7 EAP Mongolia MNG 1.89 9989. 4.57 7.13 2082457 0.712 2923896
8 EAP New Caledonia NCL NA 1584. 9.27 16.0 186697 0.697 268000
9 EAP New Zealand NZL 1.1 23546. 7.07 7.69 3889661 0.863 4509700
10 EAP Palau PLW 1.73 235. 15.6 12.3 18236 0.865 21094
# ... with 70 more rows
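The group_by() error quoted in the question arises because the pipeline starts with group_by(rdatacities$region, ...), so the first argument is a character vector rather than a data frame. Starting from the data frame, as in rdatacities %>% group_by(region, country), avoids it. For the ordering itself, a minimal sketch on toy rows copied from the rdatacities sample: it turns country into a factor ordered by the order column. As a simplification made here, country is mapped to y directly instead of using coord_flip().

```r
library(dplyr)
library(ggplot2)

# Toy rows modelled on the rdatacities sample above
cities <- data.frame(
  region  = c("EAP", "EAP", "ECA", "ECA", "ECA"),
  country = c("JPN", "KOR", "ALB", "AZE", "BLR"),
  target  = c(0.66, 0.6, 0.22, 0.2, 0.2),
  order   = c(71, 72, 68, 65, 62)
)

cities <- cities %>%
  mutate(
    # A discrete y axis draws the first factor level at the bottom, so sort by
    # descending `order` to put the smallest `order` at the top of each panel
    country = reorder(country, -order),
    # Explicit alphabetical levels for the facet rows
    region = factor(region, levels = sort(unique(region)))
  )

p <- ggplot(cities, aes(x = target, y = country)) +
  geom_point() +
  facet_grid(region ~ ., scales = "free_y", space = "free_y")
p
```

Whether the real plot needs the levels reversed depends on how coord_flip() and the facets interact in the installed ggplot2 version, so the direction of the sort may need flipping.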
