Boxplot with axis limits in R

I have data in tab-delimited format with nearly 400 columns of values, e.g.:
X Y Z A B C
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
Now I want to visualize the data using box plots. Since it is difficult to view 400 in a single PDF, I want to split them into groups of 50 (i.e. 50 x 8). Here is the code I used:
boxplot(data[1:50], xlab="Samples", xlim=c(0.001,70), log="xy",
        pch='.', col=rainbow(ncol(data[1:50])))
but I got the following error:
In plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs) : nonfinite axis limits [GScale(-inf,4.4591,2, .); log=1]
I want to view the box plots for the 400 samples, 50 at a time, in 8 different PDFs. Please help me get a better visualization.

Others have already pointed out that actual boxplots are not going to work well. However, there is a very efficient way to visually scan all of your variables: Simply plot their distributions as an image (i.e. heatmap). Here is an example showing how it is really quite easy to get the gist of 400 variables and 80,000 individual data points!
# Simulate some data
set.seed(12345)
n.var = 400
n.obs = 200
data = matrix(rnorm(n.var*n.obs), nrow=n.obs)
# Summarize data
breaks = seq(min(data), max(data), length.out=51)
histdata = apply(data, 2, function(x) hist(x, plot=F, breaks=breaks)$counts)
# Plot
dev.new(width=4, height=4)
image(1:n.var, breaks, t(histdata), xlab='Variable Index', ylab='Histogram Bin')
This will be most useful if all your variables are comparable, or are at least sorted into rational groups. hclust and the heatmap functions can also be helpful here for more complicated displays. Good luck!
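For instance, here is a minimal sketch of the hclust idea, reusing histdata, breaks, and n.var from the code above: cluster the columns so that variables with similar histograms end up adjacent.

## order variables by similarity of their histograms (a sketch)
ord <- hclust(dist(t(histdata)))$order
image(1:n.var, breaks, t(histdata)[ord, ],
      xlab='Variable Index (clustered order)', ylab='Histogram Bin')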

I agree that you will have to do something a bit drastic to distinguish 400 boxes in the same graph. The code below uses two tricks: (1) reverse the usual x-y order so that it's easier to read the labels (plotted on the y axis); (2) send the output to a tall, skinny PDF file so that you can scroll through it at your leisure. I also opted to sort the variables by mean, to make the plot easier to interpret -- that would be optional, but I suspect you'd have a hard time looking up a particular category in a 400-box plot in any case ...
nc <- 400
z <- as.data.frame(matrix(rnorm(nc*100),ncol=nc))
library(ggplot2)
library(reshape2)  ## melt() comes from reshape2, not ggplot2
m <- melt(z)
m <- transform(m,variable=reorder(variable,value))
pdf(width=10,height=50,file="boxplot.pdf")
print(ggplot(m,aes(x=variable,y=value))+geom_boxplot()+coord_flip())
dev.off()

Considering that you are plotting 400 boxes in your box plot, I am not surprised that you are having trouble seeing them. Suppose your monitor is 1024 pixels wide: the plot can devote only about two pixels to each box. Even a larger screen will not increase the pixel budget by much (a 2000-pixel-wide screen will at most show boxes 5 pixels wide).
I would suggest plotting your boxes on two or more separate plots.
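Since no answer shows the 8-file split the question asks for, here is a minimal sketch of that approach. It assumes `data` is the 400-column data frame from the question; I drop log="xy" (the x axis of a boxplot is categorical) and keep log="y" only on the assumption that every value is strictly positive, which is what the nonfinite-axis-limits warning typically indicates was violated.

## split 400 columns into 8 groups of 50 and write one PDF per group
## (a sketch -- assumes `data` is the 400-column data frame; log="y"
## requires every value to be strictly positive)
n.per.page <- 50
col.groups <- split(seq_len(ncol(data)),
                    ceiling(seq_len(ncol(data)) / n.per.page))
for (i in seq_along(col.groups)) {
  pdf(sprintf("boxplot_%02d.pdf", i), width = 11, height = 8)
  boxplot(data[col.groups[[i]]], xlab = "Samples", log = "y",
          pch = ".", col = rainbow(length(col.groups[[i]])),
          las = 2, cex.axis = 0.6)
  dev.off()
}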

Labelling issue after using gap.plot to create an axis break

I've been struggling to get a plot that shows my data accurately, and spent a while getting gap.plot up and running. After doing so, I have an issue with labelling the points.
Just plotting my data ends up with this:
[Plot: abundance data in two tiers, one at ~38,000 and the other between 1 and 50]
As you can see, that doesn't clearly show either the top or the bottom sections of my plots well enough to distinguish anything.
Using gap.plot, I managed to get:
[gap.plot: abundance data with the 100-37,000 range removed; labels only appear on the lower tier]
The code for my two plots is pretty simple:
plot(counts.abund1,pch=".",main= "Repeat 1")
text(counts.abund1, labels=row.names(counts.abund1), cex= 1.5)
gap.plot(counts.abund1[,1],counts.abund1[,2],gap=c(100,38000),gap.axis="y",xlim=c(0,60),ylim=c(0,39000))
text(counts.abund1, labels=row.names(counts.abund1), cex= 1.5)
But I can't figure out why the labels (which are just the letters that the points denote) are not being applied the same way in the two plots.
I'm somewhat out of my depth trying this; I have very little idea how to plot things like this nicely and never had data like it when learning.
The data originally comes from a large (10,000 x 10,000) matrix containing a random assortment of the letters a to z; replacements and "speciation" or "immigration" then result in the first group of letters at ~38,000 and the second group normally below 50.
The code I run after getting that matrix to get the rank abundance is:
##Abundance 1
counts1 <- as.data.frame(as.list(table(neutral.v1)))
counts.abund1<-rankabundance(counts1)
With neutral.v1 being the matrix.
The data frame for counts.abund1 looks like this:
  rank abundance proportion plower pupper accumfreq logabun rankfreq
a    1     38795        3.9    NaN    NaN       3.9     4.6      1.9
x    2     38759        3.9    NaN    NaN       7.8     4.6      3.8
j    3     38649        3.9    NaN    NaN      11.6     4.6      5.7
m    4     38639        3.9    NaN    NaN      15.5     4.6      7.5
and continues for all the variables. I only use rank and abundance right now; the a, x, j, m are the variables each row applies to, and what I want to use as the labels on the plot.
Any advice would be really appreciated. I can't really shorten the code too much or provide the matrix because the type of data is quite specific, as are the quantities in a sense.
As I mentioned, I've been using gap.plot to just create a break in the axis, but if there are better solutions to plotting this type of data I'd be absolutely all ears.
Really sorry that this is a mess of a question, bit frazzled on the whole thing right now.
gap.plot() doesn't draw two plots; it draws one plot by shifting the upper section's values down, drawing an additional box, and rewriting the axis tick labels. So a point's y-coordinate in the upper region matches neither its original value nor the axis tick labels. The real y-coordinate in the upper region is "original value" - diff(gap).
gap.plot(counts.abund1[,1], counts.abund1[,2], gap=c(100,38000), gap.axis="y",
         xlim=c(0,60), ylim=c(0,39000))
# your original call labels the lower tier only, because the upper tier's
# plotted coordinates have been shifted down by diff(gap):
text(counts.abund1, labels=row.names(counts.abund1), cex=1.5)
# shift the label y-coordinates the same way to label the upper tier:
text(counts.abund1[,1], counts.abund1[,2] - diff(c(100, 38000)),
     labels=row.names(counts.abund1), cex=1.5)
# the example data I used
set.seed(1)
counts.abund1 <- data.frame(rank = 1:50,
                            abundance = c(rnorm(25, 38500, 100), rnorm(25, 30, 20)))

Group bar plot with error bars and split y axis

I would like to draw a grouped bar graph with error bars and a split y axis to show both smaller and larger values in the same plot. (As shown in my data, sample 1 has small values compared to the other samples, so I want to make a gap on the y axis between 10 and 200.)
Here is my data,
sample mean part sd
1 4.3161 G 1.2209
1 2.3157 F 1.7011
1 1.7446 R 1.1618
2 1949.13 G 873.42
2 195.07 F 47.82
2 450.88 R 140.31
3 2002.98 G 367.92
3 293.45 F 59.01
3 681.99 R 168.03
4 2717.85 G 1106.07
4 432.83 F 118.02
4 790.97 R 232.62
You can do anything you want with primitive graphic elements. For this reason, I always prefer to design my own plots with just the base R plotting functions, particularly points(), segments(), lines(), abline(), rect(), polygon(), text(), and mtext(). You can easily create curves (e.g. for circles) and more complex shapes using segments() and lines() across granular coordinate vectors that you define yourself. For example, see Plot angle between vectors. This provides much more control over the plot elements you create; however, it often takes more work and careful coding than more prepackaged solutions, so it's a tradeoff.
Data
First, here's your data in runnable form:
df <- data.frame(
sample=c(1,1,1,2,2,2,3,3,3,4,4,4),
mean=c(4.3161,2.3157,1.7446,1949.13,195.07,450.88,2002.98,293.45,681.99,2717.85,432.83,790.97),
part=c('G','F','R','G','F','R','G','F','R','G','F','R'),
sd=c(1.2209,1.7011,1.1618,873.42,47.82,140.31,367.92,59.01,168.03,1106.07,118.02,232.62),
stringsAsFactors=F
);
df;
## sample mean part sd
## 1 1 4.3161 G 1.2209
## 2 1 2.3157 F 1.7011
## 3 1 1.7446 R 1.1618
## 4 2 1949.1300 G 873.4200
## 5 2 195.0700 F 47.8200
## 6 2 450.8800 R 140.3100
## 7 3 2002.9800 G 367.9200
## 8 3 293.4500 F 59.0100
## 9 3 681.9900 R 168.0300
## 10 4 2717.8500 G 1106.0700
## 11 4 432.8300 F 118.0200
## 12 4 790.9700 R 232.6200
OP ggplot
Now, for reference, here's a screenshot of the plot that results from the ggplot code you pasted into your comment:
library(ggplot2);
ggplot(df,aes(x=as.factor(sample),y=mean,fill=part)) +
geom_bar(position=position_dodge(),stat='identity',colour='black') +
geom_errorbar(aes(ymin=mean-sd,ymax=mean+sd),width=.2,position=position_dodge(.9));
Linear Single
Also for reference, here's how you can produce a similar grouped bar plot using base R barplot() and legend(). I've added the error bars with custom calls to segments() and points():
## reshape to wide matrices
dfw <- reshape(df,dir='w',idvar='part',timevar='sample');
dfw.mean <- as.matrix(dfw[grep(perl=T,'^mean\\.',names(dfw))]);
dfw.sd <- as.matrix(dfw[grep(perl=T,'^sd\\.',names(dfw))]);
rownames(dfw.mean) <- rownames(dfw.sd) <- dfw$part;
colnames(dfw.mean) <- colnames(dfw.sd) <- unique(df$sample);
## plot precomputations
ylim <- c(0,4000);
yticks <- seq(ylim[1L],ylim[2L],100);
xcenters <- (col(dfw.sd)-1L)*(nrow(dfw.sd)+1L)+row(dfw.sd)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
## plot
par(xaxs='i',yaxs='i');
barplot(dfw.mean,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean-dfw.sd,y1=dfw.mean+dfw.sd,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean-dfw.sd,dfw.mean+dfw.sd),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,las=1L,cex.axis=0.7);
legend(2,3800,dfw$part,partColors,title=expression(bold('part')),cex=0.7,title.adj=0.5[2:1]);
The issue is plain to see. There's nuance to some of the data (the sample 1 means and variability) that is not well represented in the plot.
Logarithmic
There are two standard options for dealing with this problem. One is to use a logarithmic scale. You can do this with the log='y' argument to the barplot() function. It's also good to override the default y-axis tick selection, since the default base R ticks tend to be a little light on density and short on range. (That's actually true in general, for most base R plot types; I make custom calls to axis() for all the plots I produce in this answer.)
## plot precomputations
ylim <- c(0.1,4100); ## lower limit must be > 0 for log plot
yticks <- rep(10^seq(floor(log10(ylim[1L])),ceiling(log10(ylim[2L])),1),each=9L)*1:9;
xcenters <- (col(dfw.sd)-1L)*(nrow(dfw.sd)+1L)+row(dfw.sd)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
## plot
par(xaxs='i',yaxs='i');
barplot(log='y',dfw.mean,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean-dfw.sd,y1=dfw.mean+dfw.sd,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean-dfw.sd,dfw.mean+dfw.sd),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,yticks,las=1L,cex.axis=0.6);
legend(2,3000,dfw$part,partColors,title=expression(bold('part')),cex=0.7,title.adj=0.5[2:1]);
Right away we see the issue with sample 1 is fixed. But we've introduced a new issue: we've lost precision in the rest of the data. In other words, the nuance that exists in the rest of the data is less visually pronounced. This is an unavoidable result of the "zoom-out" effect of changing from linear to logarithmic axes. You would incur the same loss of precision if you used a linear plot with too large a y-axis range, which is why it's always expected that axes are fitted as closely as possible to the data. This also suggests that a logarithmic y-axis may not be the correct solution for your data. Logarithmic axes are generally advised when the underlying data reflects logarithmic phenomena, i.e. when it ranges over several orders of magnitude. In your data, only sample 1 sits in a different order of magnitude from the remaining data; the rest are concentrated in the same order of magnitude, and are thus not best represented with a logarithmic y-axis.
Linear Multiple
The second option is to create separate plots with completely different y-axis scaling. It should be noted that ggplot faceting is essentially the creation of separate plots. Also, you could create multifigure plots with base R, but I've usually found that that's more trouble than it's worth. It's usually easier to just generate each plot individually, and then lay them out next to each other with publishing or word processing software.
There are different ways of customizing this approach, such as whether you combine the axis labels, where you place the legend, how you size and arrange the different plots relative to each other, etc. Here's one way of doing it:
##--------------------------------------
## plot 1 -- high values
##--------------------------------------
dfw.mean1 <- dfw.mean[,-1L];
dfw.sd1 <- dfw.sd[,-1L];
## plot precomputations
ylim <- c(0,4000);
yticks <- seq(ylim[1L],ylim[2L],100);
xcenters <- (col(dfw.sd1)-1L)*(nrow(dfw.sd1)+1L)+row(dfw.sd1)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
par(xaxs='i',yaxs='i');
barplot(dfw.mean1,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean1-dfw.sd1,y1=dfw.mean1+dfw.sd1,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean1-dfw.sd1,dfw.mean1+dfw.sd1),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,las=1L,cex.axis=0.7);
legend(2,3800,dfw$part,partColors,title=expression(bold('part')),cex=0.7,title.adj=0.5[2:1]);
##--------------------------------------
## plot 2 -- low values
##--------------------------------------
dfw.mean2 <- dfw.mean[,1L,drop=F];
dfw.sd2 <- dfw.sd[,1L,drop=F];
## plot precomputations
ylim <- c(0,6);
yticks <- seq(ylim[1L],ylim[2L],0.5);
xcenters <- (col(dfw.sd2)-1L)*(nrow(dfw.sd2)+1L)+row(dfw.sd2)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
par(xaxs='i',yaxs='i');
barplot(dfw.mean2,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean2-dfw.sd2,y1=dfw.mean2+dfw.sd2,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean2-dfw.sd2,dfw.mean2+dfw.sd2),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,las=1L,cex.axis=0.7);
This solves both problems (small-value visibility and large-value precision). But it also distorts the relative magnitude of samples 2-4 vs. sample 1. In other words, the sample 1 data has been "scaled up" relative to samples 2-4, and the reader must make a conscious effort to read the axes and digest the differing scales in order to properly understand the plots.
The lesson here is that there's no perfect solution. Every approach has its own pros and cons, its own tradeoffs.
Gapped
In your question, you indicate you want to add a gap across the y range 10:200. On the surface, this sounds like a reasonable solution for raising the visibility of the sample 1 data. However, the magnitude of that 190 unit range is dwarfed by the range of the remainder of the plot, so it ends up having a negligible effect on sample 1 visibility.
In order to demonstrate this, I'm going to use some code I've written that transforms input coordinates to a new data domain, allowing inconsistent scaling of different segments of the axis. Theoretically you could use it for both x and y axes, but I've only ever used it for the y-axis.
A few warnings: This introduces some significant complexity, and decouples the graphics engine's idea of the y-axis scale from the real data. More specifically, it maps all coordinates to the range [0,1] based on their cumulative position within the sequence of segments.
At this point, I'm also going to abandon barplot() in favor of drawing the bars manually, using calls to rect(). Technically, it would be possible to use barplot() with my segmentation code, but as I said earlier, I prefer to design my own plots from scratch with primitive graphic elements. This also allows for more precise control over all aspects of the plot.
Here's the code and plot; I'll attempt to give a better explanation of it afterward:
dataCoordToPlot <- function(data,seg) {
    ## data -- double vector of data-world coordinates.
    ## seg -- list of two components: (1) mark, giving the boundaries
    ##   between all segments, and (2) scale, giving the relative scale
    ##   of each segment. Thus, scale must be one element shorter than mark.
    data <- as.double(data);
    seg <- as.list(seg);
    seg$mark <- as.double(seg$mark);
    seg$scale <- as.double(seg$scale);
    if (length(seg$scale) != length(seg$mark)-1L)
        stop('seg$scale must be one element shorter than seg$mark.');
    scaleNorm <- seg$scale/sum(seg$scale);
    cumScale <- c(0,cumsum(scaleNorm));
    int <- findInterval(data,seg$mark,rightmost.closed=T);
    ## values outside the outer segments become NA, which propagates to
    ## the returned vector
    int[int%in%c(0L,length(seg$mark))] <- NA;
    (data - seg$mark[int])/(seg$mark[int+1L] - seg$mark[int])*scaleNorm[int] + cumScale[int];
}; ## end dataCoordToPlot()
## y dimension segmentation
ymax <- 4000;
yseg <- list();
yseg$mark <- c(0,10,140,ymax);
yseg$scale <- diff(yseg$mark);
yseg$scale[2L] <- 30;
yseg$jump <- c(F,T,F);
## plot precomputations
xcenters <- seq(0.5,len=length(unique(df$sample)));
xlim <- range(xcenters)+c(-0.5,0.5);
ylim <- range(yseg$mark);
yinc <- 100;
yticks.inc <- seq(ylim[1L],ylim[2L],yinc);
yticks.inc <- yticks.inc[!yseg$jump[findInterval(yticks.inc,yseg$mark,rightmost.closed=T)]];
yticks.jump <- setdiff(yseg$mark,yticks.inc);
yticks.all <- sort(c(yticks.inc,yticks.jump));
## plot
## define as reusable function for subsequent examples
custom.barplot <- function() {
    par(xaxs='i',yaxs='i');
    plot(NA,xlim=xlim,ylim=dataCoordToPlot(ylim,yseg),axes=F,ann=F);
    abline(h=dataCoordToPlot(yticks.all,yseg),col='lightgrey');
    axis(1L,seq(xlim[1L],xlim[2L]),NA,tck=0);
    axis(1L,xcenters,unique(df$sample));
    axis(2L,dataCoordToPlot(yticks.inc,yseg),yticks.inc,las=1,cex.axis=0.7);
    axis(2L,dataCoordToPlot(yticks.jump,yseg),yticks.jump,las=1,tck=-0.008,hadj=0.1,cex.axis=0.5);
    mtext('sample',1L,2L);
    mtext('mean',2L,3L);
    xgroupRatio <- 0.8;
    xbarRatio <- 0.9;
    partColors <- c(G='green3',F='indianred1',R='dodgerblue');
    partsCanon <- unique(df$part);
    errColors <- c(G='darkgreen',F='darkred',R='darkblue');
    for (sampleIndex in seq_along(unique(df$sample))) {
        xc <- xcenters[sampleIndex];
        sample <- unique(df$sample)[sampleIndex];
        dfs <- df[df$sample==sample,];
        parts <- unique(dfs$part);
        parts <- parts[order(match(parts,partsCanon))];
        barWidth <- xgroupRatio*xbarRatio/length(parts);
        gapWidth <- xgroupRatio*(1-xbarRatio)/(length(parts)-1L);
        xstarts <- xc - xgroupRatio/2 + (match(dfs$part,parts)-1L)*(barWidth+gapWidth);
        rect(xstarts,0,xstarts+barWidth,dataCoordToPlot(dfs$mean,yseg),col=partColors[dfs$part]);
        barCenters <- xstarts+barWidth/2;
        segments(barCenters,dataCoordToPlot(dfs$mean + dfs$sd,yseg),y1=dataCoordToPlot(dfs$mean - dfs$sd,yseg),lwd=2,col=errColors);
        points(rep(barCenters,2L),dataCoordToPlot(c(dfs$mean-dfs$sd,dfs$mean+dfs$sd),yseg),pch=19,col=errColors);
    }; ## end for
    ## draw zig-zag cutaway graphic in jump segments
    zigCount <- 30L;
    jumpIndexes <- which(yseg$jump);
    for (jumpIndex in jumpIndexes) {
        if (yseg$scale[jumpIndex] == 0) next;
        jumpStart <- yseg$mark[jumpIndex];
        jumpEnd <- yseg$mark[jumpIndex+1L];
        lines(seq(xlim[1L],xlim[2L],len=zigCount*2L+1L),dataCoordToPlot(c(rep(c(jumpStart,jumpEnd),zigCount),jumpStart),yseg));
    }; ## end for
    legend(0.2,dataCoordToPlot(3800,yseg),partsCanon,partColors,title=expression(bold('part')),cex=0.7,title.adj=c(NA,0.5));
}; ## end custom.barplot()
custom.barplot();
The key function is dataCoordToPlot(). That stands for "data coordinates to plot coordinates", where "plot coordinates" refers to the [0,1] normalized domain.
The seg argument defines the segmentation of the axis and the scaling of each segment. Its mark component specifies the boundaries of each segment, and its scale component gives the scale factor for each segment. n segments must have n+1 boundaries to fully define where each segment begins and ends, thus mark must be one element longer than scale.
Before being used, the scale vector is normalized within the function to sum to 1, so the absolute magnitudes of the scale values don't matter; it's their relative values that matter.
The algorithm is to find each coordinate's containing segment, compute the distance reached within that segment (accounting for the segment's relative scale), and then add the cumulative distance covered by all prior segments.
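To make the mapping concrete, here is a tiny worked example (my own, using dataCoordToPlot() as defined above) with two segments given equal plot weight:

seg <- list(mark = c(0, 10, 100), scale = c(1, 1));
dataCoordToPlot(c(5, 10, 55), seg);
## [1] 0.25 0.50 0.75
## 5 is halfway through the first segment (0-10) and 55 halfway through
## the second (10-100), so each lands halfway through its segment's
## share of the [0,1] plot range despite the very different data widths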
Using this design, it is possible to take any range of coordinates along the axis dimension and scale them up or down relative to the other segments. An instantaneous gap across a range could be achieved with a scale of zero. Alternatively, you can simply scale down the range so that it has some thickness, but contributes little to the progression of the dimension. In the above plot, I use the latter for the gap, mainly so that I can use the small thickness to add a zigzag aesthetic which visually indicates the presence of the gap.
Also, I should note that I used 10:140 instead of 10:200 for the gap. This is because the sample 2 F part error bar extends down to 147.25 (195.07 - 47.82). The difference is negligible.
As you can see, the result looks basically identical to the Linear Single plot. The gap is not significant enough to raise the visibility of the sample 1 data.
Distorted with Gap
Just to throw some more possibilities into the mix, now venturing into very non-standard and probably questionable waters, we can use the segmentation transformation to scale up the sample 1 order of magnitude, thereby making it much more visible while still remaining within a single plot, directly alongside samples 2-4.
For this example, I preserve the gap from 10:140 so you can see how it looks when not lying prostrate near the baseline.
## y dimension segmentation
ymax <- 4000;
yseg <- list();
yseg$mark <- c(0,10,140,ymax);
yseg$scale <- c(24,1,75);
yseg$jump <- c(F,T,F);
## plot precomputations
xcenters <- seq(0.5,len=length(unique(df$sample)));
xlim <- range(xcenters)+c(-0.5,0.5);
ylim <- range(yseg$mark);
yinc1 <- 1;
yinc2 <- 100;
yticks.inc1 <- seq(ceiling(yseg$mark[1L]/yinc1)*yinc1,yseg$mark[2L],yinc1);
yticks.inc2 <- seq(ceiling(yseg$mark[3L]/yinc2)*yinc2,yseg$mark[4L],yinc2);
yticks.inc <- c(yticks.inc1,yticks.inc2);
yticks.jump <- setdiff(yseg$mark,yticks.inc);
yticks.all <- sort(c(yticks.inc,yticks.jump));
## plot
custom.barplot();
Distorted without Gap
Finally, just to clarify that gaps are not necessary for inconsistent scaling between segments, here's the same plot but without the gap:
## y dimension segmentation
ymax <- 4000;
yseg <- list();
yseg$mark <- c(0,10,ymax);
yseg$scale <- c(25,75);
yseg$jump <- c(F,F);
## plot precomputations
xcenters <- seq(0.5,len=length(unique(df$sample)));
xlim <- range(xcenters)+c(-0.5,0.5);
ylim <- range(yseg$mark);
yinc1 <- 1;
yinc2 <- 100;
yticks.inc1 <- seq(ceiling(yseg$mark[1L]/yinc1)*yinc1,yseg$mark[2L],yinc1);
yticks.inc2 <- seq(ceiling(yseg$mark[2L]/yinc2)*yinc2,yseg$mark[3L],yinc2);
yticks.inc <- c(yticks.inc1,yticks.inc2);
yticks.jump <- setdiff(yseg$mark,yticks.inc);
yticks.all <- sort(c(yticks.inc,yticks.jump));
## plot
custom.barplot();
In principle, there's really no difference between the Linear Multiple solution and the Distorted solutions. Both involve visual distortion of competing orders of magnitude. Linear Multiple simply separates the different orders of magnitude into separate plots, while the Distorted solutions combine them into the same plot.
Probably the best argument in favor of using Linear Multiple is that if you use Distorted you'll probably be crucified by a large mob of data scientists, since that is a very non-standard way of plotting data. On the other hand, one could argue that the Distorted approach is more concise and helps to represent the relative positions of each data point along the number line. The choice is yours.
What you want to plot is a discontinuous y axis.
This issue was covered before in this post and seems not to be possible in ggplot2.
The answers to the mentioned post suggest faceting, log scaled y axis and separate plots to solve your problem.
Please find the reasons detailed by Hadley Wickham here, who thinks that a broken y axis could be "visually distorting".
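For completeness, here is a minimal sketch of the faceting suggestion, assuming the df data frame constructed in the earlier answer; scales='free_y' gives each sample its own y range:

library(ggplot2);
ggplot(df,aes(x=part,y=mean,fill=part)) +
    geom_bar(stat='identity',colour='black') +
    geom_errorbar(aes(ymin=mean-sd,ymax=mean+sd),width=.2) +
    facet_wrap(~sample,scales='free_y');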

Identifying data points amongst background noise for binned data in R

Not sure whether this should go on Cross Validated or not, but we'll see. I recently obtained data from an instrument (masses of compounds from 0 to 630), which I binned into 0.025-wide bins before plotting a histogram, as seen below.
I want to identify the bins that are of high frequency and that stand out against the background noise (the background noise increases as you move from right to left on the x-axis). Imagine drawing a curve on top of the points that have almost blurred together into a black lump, then selecting the bins that sit above that curve for further investigation; that's what I'm trying to do. I plotted a kernel density estimate to see if I could overlay it on top of my histogram and use it to identify points above the curve, but the density plot makes no headway here because the density values are too low (see the second plot). Does anyone have any recommendations for how to solve this problem? The blue line represents the overlaid density function and the red line represents the ideal solution (I need a way of somehow automating this in R).
The data below is only part of my dataset, so it's not really a good representation of my plot (which contains about 300,000 points); as my bins are quite small (0.025), there's just a huge spread of data (about 25,000 bins in total).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
look at a window around each bin, say +- 50 bins
current.bin <- 1
window.size <- 50
window <- bin.densities[(current.bin - window.size):(current.bin + window.size)]
# note the parentheses: ":" binds tighter than "-" and "+" in R
find the 95% upper and lower quantile value (or really any value you think works)
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant)
this.is.too.low <- (bin.densities[current.bin] < lower.quant)
#final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but this is the general approach I would take; a runnable version that also clamps the window at the vector's edges is sketched below. You can play around with the window size and the quantile percentages until the results look reasonable. Again, not exactly super complex math, but hopefully it helps.
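Here is that runnable sketch; the function name and the max()/min() edge clamping are my additions, and the window size and quantile cutoffs remain tuning knobs:

## flag bins that fall outside the local quantile band (a sketch --
## assumes bin.densities is a numeric vector of per-bin counts)
find.outlier.bins <- function(bin.densities, window.size = 50,
                              lo = 0.05, hi = 0.95) {
  n <- length(bin.densities)
  sapply(seq_len(n), function(i) {
    ## clamp the window at the edges of the vector
    window <- bin.densities[max(1, i - window.size):min(n, i + window.size)]
    bin.densities[i] < quantile(window, lo) ||
      bin.densities[i] > quantile(window, hi)
  })
}
## usage: which(find.outlier.bins(bin.densities)) gives the flagged bins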

Use a for loop to create several bubble plots with different legend scales in R

I have been trying to make several bubble plots showing the frequency of observations (as a percentage) of several individuals at different sites. Some individuals were found at the same site, but not all, and the number of locations within each site may vary among individuals. My main problem is that I have more than 3 individuals and more than 3 sites, so I have been trying to come up with a good/fast way of creating this type of bubble plot and legend. I am also having problems with the legend, as I need a function that will place the legend in the same location when creating a new plot. In the legend I want to show different bubble sizes for each frequency (if possible, indicating the value next to each bubble).
Here is an example of my script. Any suggestions or ideas on how to do this will be extremely helpful.
# require libraries
library(maptools)
library(sp)
data<-read.table(text="ind lat long site freq perc
A -18.62303 147.29207 A 449 9.148329258
A -18.6195 147.29492 A 725 14.77180114
A -18.62512 147.3018 A 3589 73.12550937
A -18.62953 147.29422 A 145 2.954360228
B -18.75383 147.25405 B 2 0.364963504
B -18.73393 147.28162 B 1 0.182481752
B -18.62303 147.29207 A 3 0.547445255
B -18.6195 147.29492 A 78 14.23357664
B -18.62512 147.3018 A 451 82.29927007
B -18.62953 147.29422 A 13 2.372262774
C -18.51862 147.39717 C 179 0.863857922
C -18.53281 147.39052 C 20505 98.95757927
C -18.52847 147.40167 C 37 0.178562811",header=TRUE)
# Split data frame for each tag
ind<-data$ind
M<-split(data,ind)
l<-length(M)
### Detection Plots ###
pdf("Plots.pdf",width=11,height=8,paper="a4r")
par(mfrow=c(1,1))
for(j in 1:l){
    # locations
    new.data<-M[[j]]
    site<-as.character(unique(new.data$site))
    fname<-paste(new.data$ind[1],sep="")
    loc<-new.data[,c("long","lat")]
    names(loc)<-c("X", "Y")
    coord<-SpatialPoints(loc)
    coord1<-SpatialPointsDataFrame(coord,new.data)
    # draw some circles with specified radius size
    x<-new.data$long
    y<-new.data$lat
    freq<-new.data$perc
    rad<-freq
    rad1<-round(rad,1)
    title<-paste("Ind","-",fname," / ","Site","-",new.data$site[1],sep="")
    # create bubble plot
    symbols(x,y,circles=rad1,inches=0.4,fg="black",bg="red",xlab="",ylab="")
    points(x,y,pch=1,col="black",cex=0.4)
    par(new=T)
    # map scale
    maps::map.scale(grconvertX(0.4,"npc"),grconvertY(0.1, "npc"),
                    ratio=FALSE,relwidth=0.2,cex=0.6)
    # specifying coordinates for legend
    legX<-grconvertX(0.8,"npc")
    legY1<-grconvertY(0.9,"npc")
    legY2<-legY1-0.001
    legY3<-legY2-0.0006
    legY4<-legY3-0.0003
    # creating the legend
    leg<-data.frame(X=c(legX,legX,legX,legX),Y=c(legY1,legY2,legY3,legY4),
                    rad=c(1000,500,100,25))
    symbols(leg$X,leg$Y,circles=leg$rad,inches=0.3,add=TRUE,fg="black",bg="white")
    mtext(title,3,line=1,cex=1.2)
    mtext("Latitude",2,line=3,padj=1,cex=1)
    mtext("Longitude",1,line=2.5,padj=0,cex=1)
    box()
}
dev.off()
The first plot is actually OK, and only needs the frequency/perc values next to the legend bubbles. However, it does not really work with the others...
You are hardcoding the legend position - make it relative...
legX<-grconvertX(0.8,"npc")
legY1<-grconvertY(0.9,"npc")
# Get the size of the plotting area (measured on the y axis)
ysize <- par()$usr[4]-par()$usr[3]
# Use that to calculate the new positions
legY2<-legY1 - (0.1* ysize)
legY3<-legY1 - (0.2* ysize)
legY4<-legY1 - (0.3* ysize)
This will put the bubbles on the same place on all the plots (in steps of 10% of the plotting area).
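As for the remaining issue of showing the frequency values next to the legend bubbles: a small, untested addition after the symbols() call that draws the legend (reusing the leg data frame from the question's code) should do it; pos=4 puts each label just right of its bubble.

symbols(leg$X,leg$Y,circles=leg$rad,inches=0.3,add=TRUE,fg="black",bg="white")
text(leg$X,leg$Y,labels=leg$rad,pos=4,cex=0.6)  # value beside each bubble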

3d plot or contourplot of 3-tuples where x&y are NOT in a grid and NOT equally spaced in R

I am trying to visualize 3-tuple points that are NOT in a grid, where x and y are NOT equally spaced. Thus I cannot make a matrix, as is usually required, nor can I meet the requirements of the lattice contourplot, which accepts vectors but only in a pretty restrictive form (x and y must form a grid and be equally spaced...).
I don't care whether the result is a 3D surface or a 2D contour plot, but in some way I'd like to visualize a (probably interpolated) surface of my 3-tuples.
Data will look like this:
myX myY myZ
1 458 4 0.54
2 101 5 0.46
3 390 0 0.45
4 186 2 0.84
5 241 3 0.50
6 495 2 0.67
I have tried several plotting functions from graphics, rgl and lattice packages.
I understand that connecting x,y pairs at arbitrary positions is anything but trivial - but is there any plotting function in any package which can handle this? Or can I easily fill in (interpolate) my data beforehand in order to have a full matrix? (I have visualized fitted models, but I want to see the raw data...)
Any help or hint is appreciated!
Cheers,
Niko
A bit hard to understand the question, but I will try to show how one interpolates to a full matrix. I usually use the interp function from the akima package:
set.seed(1)
x <- runif(20)
y <- runif(20)
z <- x^3 + sin(y)
require(akima)
F <- interp(x,y,z)
image(F)
points(x,y)
Here's an example of extrapolation:
F <- interp(x,y,z, linear=FALSE, extrap=TRUE)
image(F)
points(x,y)
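Since interp() returns a list with x, y, and z components, the same object can also feed base R's contour plotting directly if a 2D view is preferred (a small sketch reusing F from above):

contour(F)            # contour lines from the interpolated surface
points(x, y, pch=20)  # overlay the raw sample locations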
