Specialised Boxplot: Plotting Lines to the Error Bars to Highlight the Data Range in R

Overview
I have a data frame called ANOVA.Dataframe.1 (see below) containing the dependent variable 'Canopy_Index' and the independent variable 'Urbanisation_index'.
My aim is to produce a boxplot (exactly like the desired result below) of Canopy Cover (%) for each category of the Urbanisation Index, with lines drawn to both the bottom and top of the error bars to highlight the data range.
I have searched extensively for code to produce this boxplot (please see the desired result) without success, and I am also unsure whether these boxplots have a specialised name.
Perhaps this can be achieved in either ggplot or base R.
If anyone can help, I would be deeply appreciative.
Desired Result (Reference)
I can produce an ordinary boxplot with the R-code below, but I cannot figure out how to implement the lines pointing towards the ends of the error bars.
R-code
Boxplot.obs1.Canopy.Urban <- boxplot(ANOVA.Dataframe.1$Canopy_Index ~ ANOVA.Dataframe.1$Urbanisation_index,
    main = "Mean Canopy Index (%) for Categories of the Urbanisation Index",
    xlab = "Urbanisation Index",
    ylab = "Canopy Index (%)")
Boxplot produced from R-code
Data frame 1
structure(list(Urbanisation_index = c(2, 2, 4, 4, 3, 3, 4, 4,
4, 2, 4, 3, 4, 4, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2,
2, 2, 2, 4, 4, 3, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 4, 4, 4,
4, 4, 4, 4), Canopy_Index = c(65, 75, 55, 85, 85, 85, 95, 85,
85, 45, 65, 75, 75, 65, 35, 75, 65, 85, 65, 95, 75, 75, 75, 65,
75, 65, 75, 95, 95, 85, 85, 85, 75, 75, 65, 85, 75, 65, 55, 95,
95, 95, 95, 45, 55, 35, 55, 65, 95, 95, 45, 65, 45, 55)), row.names = c(NA,
-54L), class = "data.frame")
Dataframe 2
structure(list(Urbanisation_index = c(2, 2, 4, 4, 3, 3, 4, 4,
4, 3, 4, 4, 4, 4, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2,
2, 2, 2, 4, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 4, 4, 4, 4, 4, 4, 4
), Canopy_Index = c(5, 45, 5, 5, 5, 5, 45, 45, 55, 15, 35, 45,
5, 5, 5, 5, 5, 5, 35, 15, 15, 25, 25, 5, 5, 5, 5, 5, 5, 15, 25,
15, 35, 25, 45, 5, 25, 5, 5, 5, 5, 55, 55, 15, 5, 25, 15, 15,
15, 15)), row.names = c(NA, -50L), class = "data.frame")

Alice, is this what you are looking for?
You can do everything with ggplot2, but for non-standard things you have to play with it for a while. My code:
library(tidyverse)
library(wrapr)

# df is assumed to hold the question's data (e.g. ANOVA.Dataframe.1)
df %.>%
  ggplot(data = ., aes(
    x = Urbanisation_index,
    y = Canopy_Index,
    group = Urbanisation_index
  )) +
  # caps at the ends of the whiskers
  stat_boxplot(
    geom = 'errorbar',
    width = .25
  ) +
  geom_boxplot() +
  # lines from a common point to the min and max of each group
  geom_line(
    data = group_by(., Urbanisation_index) %>%
      summarise(
        bot = min(Canopy_Index),
        top = max(Canopy_Index)
      ) %>%
      gather(pos, val, bot:top) %>%
      select(
        x = Urbanisation_index,
        y = val
      ) %>%
      mutate(gr = row_number()) %>%
      bind_rows(
        tibble(
          x = 0,
          y = max(.$y) * 1.15,
          gr = 1:8  # 4 groups x 2 ends = 8 lines
        )
      ),
    aes(
      x = x,
      y = y,
      group = gr
    )) +
  theme_light() +
  theme(panel.grid = element_blank()) +
  coord_cartesian(
    xlim = c(min(.$Urbanisation_index) - .5, max(.$Urbanisation_index) + .5),
    ylim = c(min(.$Canopy_Index) * .95, max(.$Canopy_Index) * 1.05)
  ) +
  ylab('Canopy Index (%)') +
  xlab('Urbanisation Index')
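Since the question also mentions base R, here is a minimal sketch of the same idea without ggplot2. It relies on the value returned by boxplot(), whose stats component holds the whisker ends in rows 1 and 5. The toy data, the axis limits and the origin point are all made up for illustration, and the layout of the desired figure is an assumption.

```r
# Hypothetical toy data standing in for the question's data frame
toy <- data.frame(
  group = rep(1:3, each = 5),
  value = c(35, 55, 65, 75, 85,
            45, 65, 75, 85, 95,
            55, 65, 75, 85, 95)
)

# Leave room on the left and top for the point the lines start from
bp <- boxplot(value ~ group, data = toy,
              xlim = c(0, 3.5), ylim = c(30, 110),
              xlab = "Group", ylab = "Value")

# Rows 1 and 5 of bp$stats hold the lower and upper whisker ends
lower <- bp$stats[1, ]
upper <- bp$stats[5, ]
x_pos <- seq_along(lower)

# Arbitrary common starting point for the lines (an assumption)
origin <- c(0.1, 108)
segments(origin[1], origin[2], x_pos, lower, lty = 2)
segments(origin[1], origin[2], x_pos, upper, lty = 2)
```

Note that the whisker ends differ from min() and max() whenever a group contains outliers, so which of the two to point at depends on what "data range" should mean in the figure.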

Related

Visualizing Longitudinal Categorical Data

I am trying to visualize longitudinal changes in a categorical variable. A small chunk of my data is shown below. The data tells us how long each participant is on a particular insurance type, with the maximum amount of follow-up being 45 weeks. For example, participant 1 was on insurance type 1 for 40 weeks. Participant 2 was on insurance type 2 until week 24, then insurance type 1 until week 25, then back to insurance type 2 until week 35.
df <- data_frame(id = c(1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13), weeks = c(40, 24, 25, 35, 41, 9, 40, 41, 14, 18, 39, 37, 0, 40, 39, 39, 40, 41, 0, 41), insurance = as.factor(c(1, 2, 1, 2, 1, 1, 3, 3, 1, 2, 1, 1, 4, 1, 1, 1, 1, 3, 3, 1)))
I am trying to figure out a way to visualize this data. I have looked at other posts (1, 2, 3) and am still having trouble.
This is what I've been able to do so far, using geom_tile in the ggplot2 package:
I'm wondering if someone can help me figure out how to fill the color to the left? For example, ID 1 should have an entire row of pink, ID 2 should have green until week 24, then pink until week 25, then green again.
If you have any other suggestions as well, that would be great! My actual data has over 71,000 participants, so if you have any other options for how to visualize this data well, that would be greatly appreciated.
Thank you for the help!!
Initial code for graph:
ggplot(df, aes(x=weeks, y=id, fill=insurance)) + geom_tile(color="grey20")
Update (after clarification; see comments):
library(tidyverse)
df %>%
  # use factor(id) for both ends of the segment so the x scale stays discrete
  ggplot(aes(x = factor(id), y = weeks)) +
  geom_segment(aes(xend = factor(id), y = 0, yend = weeks)) +
  geom_point(aes(color = insurance), size = 4, alpha = 0.6) +
  scale_x_discrete() +
  theme_light() +
  coord_flip() +
  xlab("ID") +
  scale_color_brewer(palette = "Dark2")
First answer:
Maybe with a kind of lollipop chart:
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  mutate(id_new = paste0(id, "-", row)) %>%  # one label per id/period
  ungroup() %>%
  mutate(order = row_number()) %>%
  mutate(id_new = factor(id_new, levels = id_new)) %>%
  ggplot(aes(x = fct_reorder(id_new, order), y = weeks)) +
  geom_segment(aes(x = id_new, xend = id_new, y = 0, yend = weeks, color = insurance)) +
  geom_point(aes(color = insurance), size = 4, alpha = 0.6) +
  theme_light() +
  coord_flip() +
  xlab("ID") +
  scale_color_brewer(palette = "Dark2")
Maybe you want something like this:
Your data:
df <- data_frame(id = c(1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13), weeks = c(40, 24, 25, 35, 41, 9, 40, 41, 14, 18, 39, 37, 0, 40, 39, 39, 40, 41, 0, 41), insurance = as.factor(c(1, 2, 1, 2, 1, 1, 3, 3, 1, 2, 1, 1, 4, 1, 1, 1, 1, 3, 3, 1)))
You can use this code:
library(ggplot2)
ggplot(df, aes(x=weeks, y=id, color=factor(insurance))) +
geom_point(size=2) +
scale_color_discrete("Insurance",labels=c("1","2", "3", "4")) +
facet_grid(.~insurance) +
theme_bw()
Output plot:
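For the "fill the color to the left" part of the question, one possible approach is to turn each row into an interval that starts where the participant's previous row ended. The sketch below does only that bookkeeping in base R. It assumes the rows are already sorted by week within each id and that every participant starts at week 0; the column name start is made up here.

```r
# Same values as the question's data, as a plain data.frame
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13),
  weeks = c(40, 24, 25, 35, 41, 9, 40, 41, 14, 18, 39, 37, 0, 40, 39, 39, 40, 41, 0, 41),
  insurance = factor(c(1, 2, 1, 2, 1, 1, 3, 3, 1, 2, 1, 1, 4, 1, 1, 1, 1, 3, 3, 1))
)

# Start of each interval: 0 for the first row of an id,
# otherwise the end week of that id's previous row
df$start <- ave(df$weeks, df$id, FUN = function(w) c(0, head(w, -1)))

# Participant 2 now has the intervals 0-24, 24-25 and 25-35
df[df$id == 2, c("id", "start", "weeks", "insurance")]
```

With start and weeks as interval ends, each row could then be drawn as a horizontal bar, for example with geom_segment(aes(x = start, xend = weeks, y = id, yend = id, color = insurance)). With 71,000 participants, individual rows will probably not be legible, so some aggregation may still be needed.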

Zelen Exact Test - Trying to use a k 2x2 in the function zelen.test()

I am trying to use the zelen.test function from the NSM3 package, but I am having difficulty passing my data to the function.
You can recreate my data using
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(data, ncol = 2)
The documentation on CRAN gives the usage zelen.test(z, example = F, r = 3), where z is an array of k 2x2 matrices, example is set to FALSE because it returns a p-value for an example I cannot access, and r is the number of decimal places the user wants in the returned p-value.
I've tried:
zelen.test(events, r = 4)
I thought it may want the study number and the trial data, so I tried this:
studies <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7)
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(cbind(studies, events), ncol = 3)
zelen.test(events, r = 4)
but it continues to return an error stating
"Error in z[1, 1, ] : incorrect number of dimensions" for both cases I tried above.
Any help would be greatly appreciated!
If we check the source code by typing zelen.test at the console, we see that when example = TRUE it constructs a 3D array:
...
if (example)
z <- array(c(2, 1, 2, 5, 1, 5, 4, 1), dim = c(2, 2, 2))
...
The dimensions of the input z are also specified in the documentation, ?zelen.test:
z - data as an array of k 2x2 matrices. Small data sets only!
So, we may need to construct a 3-dimensional array:
library(NSM3)
z1 <- array(c(4, 2, 3, 3, 8, 3, 4, 7), c(2, 2, 2))
zelen.test(z1, r = 4)
# Zelen's test:
# P = 1
Or with a 3rd dimension of length 3:
z1 <- array( c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1), c(2, 2, 3))
zelen.test(z1, r = 4)
# Zelen's test:
#P = 0.1238
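The arrays above are typed in by hand. As a rough sketch, the 14 x 2 events matrix from the question could be reshaped into a 2 x 2 x 7 array instead. This assumes that each consecutive pair of rows in events forms one study's 2x2 table, which the question suggests but does not state outright.

```r
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
          74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(data, ncol = 2)

# Assumption: rows 1-2 are study 1, rows 3-4 are study 2, and so on
k <- nrow(events) / 2
z <- array(NA_real_, dim = c(2, 2, k))
for (i in seq_len(k)) {
  z[, , i] <- events[(2 * i - 1):(2 * i), ]
}

dim(z)    # 2 2 7
z[, , 1]  # the first study's 2x2 table
```

The resulting z has the shape zelen.test expects and could be passed as zelen.test(z, r = 4). Given the documentation's "Small data sets only!" warning, the full seven-table array may be slow or impractical to test.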

Why are my error bars on my graph out of place?

I have a graph that I'm trying to make with ggplot and gridExtra, but my error bars are out of place. I want the error bars to be at the top of each bar, not where they are now. What can I do to correct them?
Also, what ggsave parameters will generate a graph with the same pixel dimensions that I get from the base R png function? ggsave seems to work more consistently than that function, so I need to use it.
Data:
###Open packages###
library(readxl)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
#Dataframes
set1 <- data.frame(
  type = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2, 2, 2, 2, 2,
           3, 3, 3, 3, 3, 3, 3, 3, 3),
  flowRate = c(24, 24, 24, 45, 45, 45, 58, 58, 58,
               24, 24, 24, 45, 45, 45, 58, 58, 58,
               24, 24, 24, 45, 45, 45, 58, 58, 58),
  speed = c(0.563120137230256, 0.301721535875508, 0.170683367727845,
            0.698874950490133, 0.158488731250147, 0.162788814307903,
            0.105943103772245, 0.682354871986346, 0.17945825301837,
            0.806637519498752, 0.599304186634932, 0.268788206619179,
            0.518615600601962, 0.907628477211427, 0.144209408332705,
            0.161586044320138, 0.946354993801663, 0.488881557759483,
            0.497120443885793, 0.666120238846602, 0.264813203831783,
            0.717007333314455, 0.95119232422312, 0.833669574933742,
            0.450082932184122, 0.309570971522678, 0.732874401666482))
set2 <- data.frame(
  type = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2, 2, 2, 2, 2,
           3, 3, 3, 3, 3, 3, 3, 3, 3),
  flowRate = c(24, 24, 24, 45, 45, 45, 58, 58, 58,
               24, 24, 24, 45, 45, 45, 58, 58, 58,
               24, 24, 24, 45, 45, 45, 58, 58, 58),
  speed = c(0.489966876244169, 0.535542121502899, 0.265940150225231,
            0.399521957817437, 0.0831661276630631, 0.302201301891001,
            0.78194419406759, 0.202331797255324, 0.192182716686147,
            0.163038660094618, 0.658020173938572, 0.735633308902771,
            0.480982144690572, 0.749452781972296, 0.491759702396918,
            0.459610541236644, 0.397660083986082, 0.939983924945833,
            0.128956722185581, 0.998492083119223, 0.440514184126494,
            0.242917958355044, 0.350643319960552, 0.02613674288471,
            0.71625407018877, 0.589325978787179, 0.649116781211748))
Code:
#Standard error of the mean function
sem <- function(x) sd(x)/sqrt(length(x))
#Aggregate dataframes, mean and Standard Error
mean_set1 <- aggregate(set1, by=list(set1$flowRate, set1$speed), mean)
mean_set1 <- select(mean_set1, -Group.1, -Group.2)
mean_set1 <- arrange(mean_set1, type, flowRate)
sem_set1 <- aggregate(set1, by=list(set1$flowRate, set1$speed), sem)
sem_set1 <- as.data.frame(sem_set1)
sem_set1 <- cbind(mean_set1$type, mean_set1$flowRate, sem_set1$Group.2)
sem_set1 <- as.data.frame(sem_set1)
mean_set2 <- aggregate(set2, by=list(set2$flowRate, set2$speed), mean)
mean_set2 <- select(mean_set2, -Group.1, -Group.2)
mean_set2 <- arrange(mean_set2, type, flowRate)
sem_set2 <- aggregate(set2, by=list(set2$flowRate, set2$speed), sem)
sem_set2 <- as.data.frame(sem_set2)
sem_set2 <- cbind(mean_set2$type, mean_set2$flowRate, sem_set2$Group.2)
sem_set2 <- as.data.frame(sem_set2)
#Graph sets
set1_graph <- ggplot(mean_set1, aes(x=type, y=speed, fill=factor(flowRate)))+
geom_bar(stat="identity",width=0.6, position="dodge", col="black")+
scale_fill_discrete(name="Flow Rate")+
xlab("type")+ylab("Speed")+
geom_errorbar(aes(ymin= mean_set1$speed,ymax=mean_set1$speed+sem_set1$V3), width=0.2, position = position_dodge(0.6))
set2_graph <- ggplot(mean_set2, aes(x=type, y=speed, fill=factor(flowRate)))+
geom_bar(stat="identity",width=0.6, position="dodge", col="black")+
scale_fill_discrete(name="Flow Rate")+
xlab("type")+ylab("Speed")+
geom_errorbar(aes(ymin= mean_set2$speed,ymax=mean_set2$speed+sem_set2$V3), width=0.2, position = position_dodge(0.6))
#Grid.arrange and save image
png("image.png", width = 1000, height = 700)
grid.arrange(set1_graph, set2_graph,nrow=1, ncol=2)
dev.off()
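No answer is recorded here, but one thing worth checking is the grouping. The code above aggregates by flowRate and speed, and since every speed value is distinct, each row ends up as its own group and no mean or standard error is actually computed per bar. The sketch below groups by type and flowRate instead and computes both statistics in one pass. The toy values are made up, and the column names speed.mean and speed.sem come from how do.call(data.frame, ...) flattens the result.

```r
# Standard error of the mean, as defined in the question
sem <- function(x) sd(x) / sqrt(length(x))

# Hypothetical toy data with two replicates per type/flowRate combination
toy <- data.frame(
  type = c(1, 1, 1, 1, 2, 2, 2, 2),
  flowRate = c(24, 24, 45, 45, 24, 24, 45, 45),
  speed = c(0.2, 0.4, 0.5, 0.7, 0.1, 0.3, 0.6, 1.0)
)

# One row per type/flowRate combination, with mean and SEM of speed
summary_df <- aggregate(
  speed ~ type + flowRate, data = toy,
  FUN = function(x) c(mean = mean(x), sem = sem(x))
)

# aggregate() returns a matrix column; flatten it into ordinary columns
summary_df <- do.call(data.frame, summary_df)
summary_df
```

With a table like this, the bars could be drawn from speed.mean and the error bars from speed.mean - speed.sem and speed.mean + speed.sem, referring to the columns by name inside aes() rather than through $. Keeping the dodge width of the error bars equal to the bar width, as the original code does with 0.6, should keep them centred on their bars.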

Logarithmic scaling with ggplot2 in R

I am trying to create a diagram using ggplot2. There are several very small values to be displayed and a few larger ones. I'd like to display all of them in an appropriate way using logarithmic scaling. This is what I do:
plotPointsPre <- ggplot(data = solverEntries, aes(x = val, y = instance,
color = solver, group = solver))
...
finalPlot <- plotPointsPre + coord_trans(x = 'log10') + geom_point() +
xlab("costs") + ylab("instance")
This is the result:
It is just the same as without coord_trans(x = 'log10').
However, if I use it with the y-axis:
How do I achieve the logarithmic scaling on the x-axis? Besides, it is not about the x-axis, if I switch the values of x and y, then it works on the x-axis and no longer on the y-axis. So there seems to be some problem with the displayed values. Does anybody have an idea how to fix this?
Edit - Here's the used data contained in solverEntries:
solverEntries <- data.frame(instance = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20),
solver = c(4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1),
time = c(1, 24, 13, 6, 1, 41, 15, 5, 1, 26, 16, 5, 1, 39, 7, 4, 1, 28, 11, 3, 1, 31, 12, 3, 1, 38, 20, 3, 1, 37, 10, 4, 1, 25, 11, 3, 1, 32, 18, 4, 1, 27, 21, 3, 1, 23, 22, 3, 1, 30, 17, 2, 1, 36, 8, 3, 1, 37, 19, 4, 1, 40, 21, 3, 1, 29, 11, 4, 1, 33, 10, 3, 1, 34, 9, 3, 1, 35, 14, 3),
val = c(6553.48, 6565.6, 6565.6, 6577.72, 6568.04, 7117.14, 6578.98, 6609.28, 6559.54, 6561.98, 6561.98, 6592.28, 6547.42, 7537.64, 6549.86, 6555.92, 6546.24, 6557.18, 6557.18, 6589.92, 6586.22, 6588.66, 6588.66, 6631.08, 6547.42, 7172.86, 6569.3, 6582.6, 6547.42, 6583.78, 6547.42, 6575.28, 6555.92, 6565.68, 6565.68, 6575.36, 6551.04, 6551.04, 6551.04, 6563.16, 6549.86, 6549.86, 6549.86, 6555.92, 6544.98, 6549.86, 6549.86, 6561.98, 6558.36, 6563.24, 6563.24, 6578.98, 6566.86, 7080.78, 6570.48, 6572.92, 6565.6, 7073.46, 6580.16, 6612.9, 6557.18, 7351.04, 6562.06, 6593.54, 6547.42, 6552.3, 6552.3, 6558.36, 6553.48, 6576.54, 6576.54, 6612.9, 6555.92, 6560.8, 6560.8, 6570.48, 6566.86, 6617.78, 6572.92, 6578.98))
Your data in its current form is not log-distributed: most val values are around 6500 and some are about 10% higher. If you want to stretch the data, you could use a custom transformation via scales::trans_new(), or here's a simpler version that just subtracts a baseline value to make a log transform useful. After subtracting 6500, the small values will be mapped to around 50 and the large values to around 1000, which is a more appropriate range for a log scale. Then we apply the same transformation to the breaks so that the labels appear in the right spots (i.e. the label 6550 is placed at the position 6550 - 6500 = 50).
This method helps if you want to make the underlying values more distinguishable, but at the cost of distorting the underlying proportions between values. You might be able to mitigate this by picking useful breaks and labeling them with scaling stats, e.g. a two-line label such as "7000" over "+7% over min".
my_breaks <- c(6550, 6600, 6750, 7000, 7500)
baseline = 6500
library(ggplot2)
ggplot(data = solverEntries,
aes(x = val - baseline, y = instance,
color = solver, group = solver)) +
geom_point() +
scale_x_log10(breaks = my_breaks - baseline,
labels = my_breaks, name = "val")
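To see why subtracting the baseline matters, the arithmetic can be checked in base R. This small sketch compares how far apart two of the break values end up on a plain log10 scale and on the shifted one. The helper names shift_log and shift_log_inv are made up for illustration.

```r
baseline <- 6500
vals <- c(6550, 7500)  # two of the break values used above

# Shifted log transform and its inverse (hypothetical helper names)
shift_log <- function(x) log10(x - baseline)
shift_log_inv <- function(y) 10^y + baseline

# Distance between the two values on each scale
raw_gap <- diff(log10(vals))          # about 0.06: nearly indistinguishable
shifted_gap <- diff(shift_log(vals))  # log10(1000 / 50), about 1.3

c(raw = raw_gap, shifted = shifted_gap)
```

A transform/inverse pair like this is also the kind of input scales::trans_new() takes, if you would rather keep the original values in the data and let the scale do the shifting.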
Is this what you're looking for?
x_data <- seq(from=1,to=50)
y_data <- 2*x_data+rnorm(n=50,mean=0,sd=5)
#non log y
ggplot()+
aes(x=x_data,y=y_data)+
geom_point()
#log y scale
ggplot()+
aes(x=x_data,y=y_data)+
geom_point()+
scale_y_log10()
#log x scale
ggplot()+
aes(x=x_data,y=y_data)+
geom_point()+
scale_x_log10()

Interpolating three columns

I have a set of data in ranges like:
x|y|z
-4|1|45
-4|2|68
-4|3|96
-2|1|56
-2|2|65
-2|3|89
0|1|45
0|2|56
0|3|75
2|1|23
2|2|56
2|3|75
4|1|42
4|2|65
4|3|78
Here I need to interpolate between x and y using the z value.
I tried interpolating separately for x and y using z value by using the below code:
interpol<-approx(x,z,method="linear")
interpol_1<-approx(y,z,method="linear")
Now I'm trying to use all three columns, but the values come out wrong.
In your script you forgot to reference your data.frame. Note the use of $ in the approx function.
interpol <- approx(df$x,df$z,method="linear")
interpol_1 <- approx(df$y,df$z,method="linear")
Data:
df <- data.frame(
x = c(-4, -4, -4, -2, -2, -2, 0, 0, 0, 2, 2, 2, 4, 4, 4),
y = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
z = c(45, 68, 96, 56, 65, 89, 45, 56, 75, 23, 56, 75, 42, 65, 78)
)
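If the goal is to estimate z at an arbitrary (x, y) pair, one possible reading of the question is bilinear interpolation on the x-y grid. The sketch below does this with two passes of approx: first along x within each y column, then along y. It assumes the data form a complete grid, as they do here. The names zmat and interp2 are made up for illustration.

```r
df <- data.frame(
  x = c(-4, -4, -4, -2, -2, -2, 0, 0, 0, 2, 2, 2, 4, 4, 4),
  y = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  z = c(45, 68, 96, 56, 65, 89, 45, 56, 75, 23, 56, 75, 42, 65, 78)
)

xs <- sort(unique(df$x))
ys <- sort(unique(df$y))

# z as a matrix with one row per x value and one column per y value
zmat <- matrix(df$z[order(df$y, df$x)], nrow = length(xs))

# Interpolate along x in every y column, then along y
interp2 <- function(x0, y0) {
  z_at_x0 <- apply(zmat, 2, function(col) approx(xs, col, xout = x0)$y)
  approx(ys, z_at_x0, xout = y0)$y
}

interp2(-4, 1)     # a grid point, returns the stored z value
interp2(-3, 1.5)   # a point between four grid points
```

Calling approx(df$x, df$z) on its own treats the three z values at each x as ties and collapses them, which may explain why the single-variable results looked wrong.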

Resources