How can I visualize specific data points in a boxplot in R? - r

I have some monthly time-series data and I need to create a boxplot of it using R. So far, no problem. However, in addition to the median, 1st quartile and 3rd quartile, I also need to show specific data points from the time series in the graph, namely the observations from 3 years, 1 year and 3 months in the past. I looked online but I can't seem to find any command for this. Is there a way I can add these observations to the boxplot?

You can use, for example, the text function:
set.seed(123)
x <- rnorm(100, 5, 10)
If you want to display the mean value:
boxplot(x)
text(1, mean(x), "x", col = "red")
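Additionally or alternatively, you can use the function points, passing the box position (1 for a single box) as x and the value to mark as y:
points(1, x[length(x)], col = "blue", pch = 8)  # marks the last observation in x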
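For the specific case in the question (observations 3 years, 1 year and 3 months back), a minimal sketch along the same lines could look like this. It assumes the series is a plain numeric vector of monthly values ordered oldest to newest; y, lags and past are made-up names, and the data are simulated:

```r
set.seed(123)
# Hypothetical monthly series: 120 simulated values, oldest first
y <- rnorm(120, 5, 10)
n <- length(y)

# Lags in months, counted back from the most recent observation
lags <- c("3 years" = 36, "1 year" = 12, "3 months" = 3)
past <- y[n - lags]

boxplot(y)
# A single box sits at x = 1, so every marker uses x = 1
points(rep(1, length(past)), past,
       col = c("red", "blue", "darkgreen"), pch = 19)
# Label each marker to the right of the point
text(rep(1, length(past)), past, labels = names(lags), pos = 4)
```

If the data are stored as a ts object, converting with as.numeric() first keeps the indexing as simple as above.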

Related

Plot two datasets with different scales on the same graph, same axis in R

I have two datasets, that I'd like to see on a single scatterplot with a single axis. One dataset has Y values ranging from 0 to 0.0006, the other between 0 and 1.
Each dataset has 50 entries.
In R, is there a way of changing the scale of the y axis at the 0.0006 mark to show detail in both halves of the graph, e.g., the range of 0 - 0.0006 and 0.0006 - 1 would be the same size on the graph.
I did this using a log scale. This is a sample dataset, which doesn't go all the way to 1 but tops out around 0.07.
I'm still open to other techniques as this one gives too much emphasis to the 0.0006-0 range.
You can scale your data for plotting, then call axis twice:
y1<-runif(50,0,0.0006)
y2<-runif(50,0.0006,1)
x<-runif(50)
y1.scaled<-y1*(0.5/0.0006)
y2.scaled<-(y2-0.0006)*(1-0.5)/(1-0.0006) + 0.5
plot(c(0,1),c(0,1),col=NA,yaxt='n',ylab="",xlab="")
points(x,y1.scaled,pch=20,col="red")
points(x,y2.scaled,pch=21,col="black")
axis(2,at=seq(0,0.5,length.out = 3), labels = c(0,0.0003,0.0006), col="red")
axis(2,at=seq(0.5,1,length.out = 3), labels = seq(0.0006,1,length.out=3))
See this post for how to re-scale a set of numbers with a known min and max to any other min and max:
How to scale down a range of numbers with a known min and max value
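The rescaling used above can be written once as a small helper. This is only a sketch: rescale_range is a hand-written, hypothetical function name, not something from base R.

```r
# Map values from [from_min, from_max] linearly onto [to_min, to_max]
rescale_range <- function(v, from_min, from_max, to_min, to_max) {
  (v - from_min) / (from_max - from_min) * (to_max - to_min) + to_min
}

set.seed(1)
y1 <- runif(50, 0, 0.0006)
y2 <- runif(50, 0.0006, 1)

# Lower half of the plot holds 0 to 0.0006, upper half holds 0.0006 to 1
y1.scaled <- rescale_range(y1, 0, 0.0006, 0, 0.5)
y2.scaled <- rescale_range(y2, 0.0006, 1, 0.5, 1)
```

The same helper can produce the tick positions for axis(), so that labels and points are guaranteed to use the same mapping.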
Assuming you have two different datasources (and that values from either source can be <0.0006) we could combine them, create an indicator for whether or not the value is <0.0006, and then use a facet_wrap with free scales. Something like this:
library(ggplot2)
set.seed(1)
y1<-runif(50,0,0.0006)
y2<-runif(50,0,1)
x<-1:50
df<-as.data.frame(rbind(cbind(y1,x),cbind(y2,x))) #Combine data
df$y1 <- as.numeric(as.character(df$y1))
df$x <- as.numeric(as.character(df$x))
df$group <- (df$y1 <= 0.0006) #Create group
#ggplot with facet
ggplot(data=df) + geom_point(aes(y=y1,x=x)) + facet_wrap(~group,scales="free") #facet on the group column created above

Plotting distribution of multiple measurements in two different groups in R

I have measurements of approximately 1000 variables in 2 groups with 10 replicates in each, in other words I have 2 dataframes with 10 columns and 1000 rows in each.
I would like to show the distribution of my measurements in the two groups, to pick up variables that differ significantly between the groups. My initial idea was a large scatter plot where the x-coordinate is the index of the variable, the y-coordinate is the measurement, and the points are color coded by group. It doesn't work as expected, however: I get a scatter plot matrix instead.
I tried to go with a boxplot,
ratios1 <- as.data.frame(matrix(rnorm(10000) * 100, 1000, 10))
boxplot(t(log2(ratios1)), horizontal = T)
which sort of works, but all the lines for the boxes make the plot undecipherable, even for a single group (see figure below). Then I tried to remove the boxes and add the points afterwards, as suggested here
boxplot(t(log2(ratios1)), horizontal = T, border = "white")
points(t(log2(ratios1)), pch=1)
But that didn't quite work either, as I only got the first variable drawn on the graph.
How can I display this type of information?
First of all, columns correspond to variables and rows to observations, not the other way around.
set.seed(42)
ratios1 <- as.data.frame(matrix(rnorm(10000) * 100, 10, 1000))
You could plot quantiles like this:
library(reshape2)
ratios2 <- melt(ratios1)
library(ggplot2)
ggplot(ratios2, aes(x = as.numeric(variable), y = value)) +
stat_summary(fun.data = function(y) as.data.frame(setNames(as.list(quantile(y, probs = c(0.025, 0.5, 0.975))), c("ymin", "y", "ymax"))),
color = "blue") +
stat_summary(fun.data = function(y) as.data.frame(setNames(as.list(quantile(y, probs = c(0.25, 0.5, 0.75))), c("ymin", "y", "ymax"))),
color = "red") +
xlab("variable")
There are no groups in your data, so I don't know what to do with that. Maybe you could facet by group. However, I don't think this kind of plot would be very useful for your goal of "pick[ing] up variables that differ significantly between the groups". I would do a hypothesis test with the appropriate correction for alpha error inflation.
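A rough sketch of that testing approach, under several assumptions that are not in the question: the two groups are simulated here as matrices group1 and group2 (rows are replicates, columns are variables), a Welch t-test is used per variable, and the Benjamini-Hochberg correction is picked only as an example. Which test and correction are appropriate depends on the real data.

```r
set.seed(42)
n_var <- 1000

# Hypothetical data: 10 replicates (rows) of 1000 variables (columns) per group
group1 <- matrix(rnorm(10 * n_var), nrow = 10)
group2 <- matrix(rnorm(10 * n_var), nrow = 10)
group2[, 1:20] <- group2[, 1:20] + 3  # shift the first 20 variables on purpose

# One two-sample t-test per variable
p_raw <- vapply(seq_len(n_var), function(j) {
  t.test(group1[, j], group2[, j])$p.value
}, numeric(1))

# Correct for multiple testing (Benjamini-Hochberg false discovery rate)
p_adj <- p.adjust(p_raw, method = "BH")

# Indices of variables that remain significant after correction
hits <- which(p_adj < 0.05)
```

Only the variables in hits would then need to be plotted, which keeps the graph readable.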

How to make categorical scatterplot in R with median marking

My data are set up so that one column contains a continuous value (testosterone concentration) and the second column contains one of four "Kit type" values: "EIA," "RIA," "Other," or "All." I want the kit types as categories along the x axis, with testosterone concentration along the y axis. I can't figure out how to make a cross between a boxplot and a scatterplot that shows only the individual data points plus a median marking for each category. How can I do this?
This seemed to get the data points into categories all right, but the summarySE function does not have a median: Categorical scatter plot with mean segments using ggplot2 in R
Without data, I'm only guessing here, but ...
## create some data
set.seed(42)
n <- 100
dat <- data.frame(Testo=rbeta(n, 2, 5),
## Kit must be a factor for levels() below to work (not automatic since R 4.0)
Kit=factor(sample(c('EIA', 'RIA', 'Other', 'All'), size = n, replace = TRUE)))
## show unequal distribution of points, no problem
table(dat$Kit)
## All EIA Other RIA
## 23 30 14 33
## break into individual levels
dat2 <- lapply(levels(dat$Kit), function(lvl) dat$Testo[ dat$Kit == lvl ])
names(dat2) <- levels(dat$Kit)
## parent plot
boxplot(dat2, main = 'Testosterone Levels per Kit')
## adding individual points
for (lvl in seq_along(dat2)) {
points(jitter(rep(lvl, length(dat2[[lvl]]))), dat2[[lvl]],
pch = 16, col = '#88888888')
}
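If the boxes should go away entirely, leaving only the points and a median mark per category as the question describes, one possible base R sketch uses stripchart plus segments. The data are simulated as above, and med is a made-up variable name:

```r
set.seed(42)
n <- 100
# Hypothetical data with the same layout as in the question
dat <- data.frame(
  Testo = rbeta(n, 2, 5),
  Kit   = factor(sample(c("EIA", "RIA", "Other", "All"), size = n, replace = TRUE))
)

# Jittered points only, one column of points per kit type
stripchart(Testo ~ Kit, data = dat, vertical = TRUE, method = "jitter",
           pch = 16, col = "#88888888",
           main = "Testosterone Levels per Kit")

# Median per kit type, in the same order as the factor levels
med <- tapply(dat$Testo, dat$Kit, median)

# Short horizontal line at each median; categories sit at x = 1, 2, ...
segments(seq_along(med) - 0.3, med, seq_along(med) + 0.3, med,
         col = "red", lwd = 2)
```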

How do I extract positive and negative outliers from boxplots into separate variables in R?

I need to plot the outliers from a boxplot on to a map. My lecturer gave me the function to extract all outliers from this boxplot:
outliers = match(names(boxplot(pc3, plot = FALSE)$out), names(pc3))
(pc3 being the data)
I am then plotting them using:
points(Data.1$X[outliers], Data.1$Y[outliers], col = "red", cex = 3, lwd = 2)
However I want to extract the positive outliers into one variable and the negative outliers into a different variable in order to plot them in different colours. How do I do this?
Thank you.
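boxplot flags as outliers the points that lie more than 1.5 times the interquartile range beyond the edges of the box (roughly the 25th and 75th percentiles). You can apply that definition directly; note that boxplot computes the box edges as hinges (see ?boxplot.stats), so quantile can give slightly different bounds in small samples: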
iq.range <- quantile(pc3, probs=c(0.25, 0.75))
lower.bound <- iq.range[1] - 1.5*diff(iq.range)
upper.bound <- iq.range[2] + 1.5*diff(iq.range)
low.out <- pc3[pc3 < lower.bound]
high.out <- pc3[pc3 > upper.bound]
That's computing it from scratch. You can also split the vector that you get from boxplot using the median. Anything above is the higher part.
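The median-split variant might look like the sketch below. pc3 is simulated here as a named numeric vector with a few extreme values added by hand, since the real data are not shown; the site names are invented.

```r
set.seed(1)
# Hypothetical stand-in for pc3: 100 normal values plus four planted extremes
pc3 <- c(rnorm(100), -6, -5, 7, 8)
names(pc3) <- paste0("site", seq_along(pc3))

# Outliers exactly as boxplot reports them
out <- boxplot(pc3, plot = FALSE)$out

# Split them at the median: above is positive, below is negative
high.out <- out[out > median(pc3)]
low.out  <- out[out < median(pc3)]

# Positions in the original data, usable for indexing the map coordinates
high.idx <- match(names(high.out), names(pc3))
low.idx  <- match(names(low.out), names(pc3))
```

The two index vectors can then be passed to separate points() calls with different colours, in the same way as the single outliers vector in the question.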

Make a boxplot without whiskers

I have made box-plots for the onset values of three different groups using the box-plot function in R like so:
boxplot(onset ~ group, data = pulse.dat, range = 0, col = "lightblue")
However, I want to see how the data look without the range, so I want to create a box-plot without the whiskers. I also wouldn't mind any other kind of graph, as long as it displays the median and the 25th and 75th percentiles for each of the 3 groups.
Does anyone know how I can do this in R?
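Use the graphical parameters that boxplot passes on to bxp (through its pars argument; they are documented in ?bxp):
d <- rnorm(100, 100, 10)
boxplot(d, whisklty = 0, staplelty = 0)
whisklty = 0 removes the whisker lines.
staplelty = 0 removes the staples (the caps at the ends of the whiskers).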
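Applied to grouped data like the question's, a sketch could look as follows. pulse.dat is simulated here because the real data are not available, and outline = FALSE is an extra choice made for this example so that no outlier points are drawn where the whiskers used to be:

```r
set.seed(1)
# Hypothetical stand-in for pulse.dat: three groups of 30 onset values
pulse.dat <- data.frame(
  onset = rnorm(90, 100, 10),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Boxes only: no whiskers, no staples, no outlier points
b <- boxplot(onset ~ group, data = pulse.dat,
             whisklty = 0, staplelty = 0, outline = FALSE,
             col = "lightblue")

# Rows 2 to 4 of the returned stats hold lower hinge, median, upper hinge
b$stats[2:4, ]
```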
