Group bar plot with error bars and spit y axis - r

I would like to draw a group bar graph with error bars and split y axis to show both smaller and larger values in same plot? (as shown in my data sample number 1 has small values compare to other samples, therefore, I want to make a gap on y axis in-between 10-200)
Here is my data,
sample mean part sd
1 4.3161 G 1.2209
1 2.3157 F 1.7011
1 1.7446 R 1.1618
2 1949.13 G 873.42
2 195.07 F 47.82
2 450.88 R 140.31
3 2002.98 G 367.92
3 293.45 F 59.01
3 681.99 R 168.03
4 2717.85 G 1106.07
4 432.83 F 118.02
4 790.97 R 232.62

You can do anything you want with primitive graphic elements. For this reason, I always prefer to design my own plots with just the base R plotting functions, particularly points(), segments(), lines(), abline(), rect(), polygon(), text(), and mtext(). You can easily create curves (e.g. for circles) and more complex shapes using segments() and lines() across granular coordinate vectors that you define yourself. For example, see Plot angle between vectors. This provides much more control over the plot elements you create, however, it often takes more work and careful coding than more prepackaged solutions, so it's a tradeoff.
Data
First, here's your data in runnable form:
df <- data.frame(
sample=c(1,1,1,2,2,2,3,3,3,4,4,4),
mean=c(4.3161,2.3157,1.7446,1949.13,195.07,450.88,2002.98,293.45,681.99,2717.85,432.83,790.97),
part=c('G','F','R','G','F','R','G','F','R','G','F','R'),
sd=c(1.2209,1.7011,1.1618,873.42,47.82,140.31,367.92,59.01,168.03,1106.07,118.02,232.62),
stringsAsFactors=F
);
df;
## sample mean part sd
## 1 1 4.3161 G 1.2209
## 2 1 2.3157 F 1.7011
## 3 1 1.7446 R 1.1618
## 4 2 1949.1300 G 873.4200
## 5 2 195.0700 F 47.8200
## 6 2 450.8800 R 140.3100
## 7 3 2002.9800 G 367.9200
## 8 3 293.4500 F 59.0100
## 9 3 681.9900 R 168.0300
## 10 4 2717.8500 G 1106.0700
## 11 4 432.8300 F 118.0200
## 12 4 790.9700 R 232.6200
OP ggplot
Now, for reference, here's a screenshot of the plot that results from the ggplot code you pasted into your comment:
library(ggplot2);
ggplot(df,aes(x=as.factor(sample),y=mean,fill=part)) +
geom_bar(position=position_dodge(),stat='identity',colour='black') +
geom_errorbar(aes(ymin=mean-sd,ymax=mean+sd),width=.2,position=position_dodge(.9));
Linear Single
Also for reference, here's how you can produce a similar grouped bar plot using base R barplot() and legend(). I've added the error bars with custom calls to segments() and points():
## reshape to wide matrices
dfw <- reshape(df,dir='w',idvar='part',timevar='sample');
dfw.mean <- as.matrix(dfw[grep(perl=T,'^mean\\.',names(dfw))]);
dfw.sd <- as.matrix(dfw[grep(perl=T,'^sd\\.',names(dfw))]);
rownames(dfw.mean) <- rownames(dfw.sd) <- dfw$part;
colnames(dfw.mean) <- colnames(dfw.sd) <- unique(df$sample);
## plot precomputations
ylim <- c(0,4000);
yticks <- seq(ylim[1L],ylim[2L],100);
xcenters <- (col(dfw.sd)-1L)*(nrow(dfw.sd)+1L)+row(dfw.sd)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
## plot
par(xaxs='i',yaxs='i');
barplot(dfw.mean,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean-dfw.sd,y1=dfw.mean+dfw.sd,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean-dfw.sd,dfw.mean+dfw.sd),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,las=1L,cex.axis=0.7);
legend(2,3800,dfw$part,partColors,title=expression(bold('part')),cex=0.7,title.adj=0.5[2:1]);
The issue is plain to see. There's nuance to some of the data (the sample 1 means and variability) that is not well represented in the plot.
Logarithmic
There are two standard options for dealing with this problem. One is to use a logarithmic scale. You can do this with the log='y' argument to the barplot() function. It's also good to override the default y-axis tick selection, since the default base R ticks tend to be a little light on density and short on range. (That's actually true in general, for most base R plot types; I make custom calls to axis() for all the plots I produce in this answer.)
## plot precomputations
ylim <- c(0.1,4100); ## lower limit must be > 0 for log plot
yticks <- rep(10^seq(floor(log10(ylim[1L])),ceiling(log10(ylim[2L])),1),each=9L)*1:9;
xcenters <- (col(dfw.sd)-1L)*(nrow(dfw.sd)+1L)+row(dfw.sd)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
## plot
par(xaxs='i',yaxs='i');
barplot(log='y',dfw.mean,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean-dfw.sd,y1=dfw.mean+dfw.sd,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean-dfw.sd,dfw.mean+dfw.sd),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,yticks,las=1L,cex.axis=0.6);
legend(2,3000,dfw$part,partColors,title=expression(bold('part')),cex=0.7,title.adj=0.5[2:1]);
Right away we see the issue with sample 1 is fixed. But we've introduced a new issue: we've lost precision in the rest of the data. In other words, the nuance that exists in the rest of the data is less visually pronounced. This is an unavoidable result of the "zoom-out" effect of changing from linear to logarithmic axes. You would incur the same loss of precision if you used a linear plot but with too large a y-axis, which is why it's always expected that axes are fitted as closely as possible to the data. This also serves as an indication that a logarithmic y-axis may not be the correct solution for your data. Logarithmic axes are generally advised when the underlying data reflects logarithmic phenomena; that it ranges over several orders of magnitude. In your data, only sample 1 sits in a different order of magnitude from the remaining data; the rest are concentrated in the same order of magnitude, and are thus not best represented with a logarithmic y-axis.
Linear Multiple
The second option is to create separate plots with completely different y-axis scaling. It should be noted that ggplot faceting is essentially the creation of separate plots. Also, you could create multifigure plots with base R, but I've usually found that that's more trouble than it's worth. It's usually easier to just generate each plot individually, and then lay them out next to each other with publishing or word processing software.
There are different ways of customizing this approach, such as whether you combine the axis labels, where you place the legend, how you size and arrange the different plots relative to each other, etc. Here's one way of doing it:
##--------------------------------------
## plot 1 -- high values
##--------------------------------------
dfw.mean1 <- dfw.mean[,-1L];
dfw.sd1 <- dfw.sd[,-1L];
## plot precomputations
ylim <- c(0,4000);
yticks <- seq(ylim[1L],ylim[2L],100);
xcenters <- (col(dfw.sd1)-1L)*(nrow(dfw.sd1)+1L)+row(dfw.sd1)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
par(xaxs='i',yaxs='i');
barplot(dfw.mean1,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean1-dfw.sd1,y1=dfw.mean1+dfw.sd1,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean1-dfw.sd1,dfw.mean1+dfw.sd1),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,las=1L,cex.axis=0.7);
legend(2,3800,dfw$part,partColors,title=expression(bold('part')),cex=0.7,title.adj=0.5[2:1]);
##--------------------------------------
## plot 2 -- low values
##--------------------------------------
dfw.mean2 <- dfw.mean[,1L,drop=F];
dfw.sd2 <- dfw.sd[,1L,drop=F];
## plot precomputations
ylim <- c(0,6);
yticks <- seq(ylim[1L],ylim[2L],0.5);
xcenters <- (col(dfw.sd2)-1L)*(nrow(dfw.sd2)+1L)+row(dfw.sd2)+0.5;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
par(xaxs='i',yaxs='i');
barplot(dfw.mean2,beside=T,col=partColors,ylim=ylim,xlab='sample',ylab='mean',axes=F);
segments(xcenters,dfw.mean2-dfw.sd2,y1=dfw.mean2+dfw.sd2,lwd=2,col=errColors);
points(rep(xcenters,2L),c(dfw.mean2-dfw.sd2,dfw.mean2+dfw.sd2),pch=19,col=errColors);
axis(1L,par('usr')[1:2],F,pos=0,tck=0);
axis(2L,yticks,las=1L,cex.axis=0.7);
This solves both problems (small-value visibility and large-value precision). But it also distorts the relative magnitude of samples 2-4 vs. sample 1. In other words, the sample 1 data has been "scaled up" relative to samples 2-4, and the reader must make a conscious effort to read the axes and digest the differing scales in order to properly understand the plots.
The lesson here is that there's no perfect solution. Every approach has its own pros and cons, its own tradeoffs.
Gapped
In your question, you indicate you want to add a gap across the y range 10:200. On the surface, this sounds like a reasonable solution for raising the visibility of the sample 1 data. However, the magnitude of that 190 unit range is dwarfed by the range of the remainder of the plot, so it ends up having a negligible effect on sample 1 visibility.
In order to demonstrate this I'm going to use some code I've written which can be used to transform input coordinates to a new data domain which allows for inconsistent scaling of different segments of the axis. Theoretically you could use it for both x and y axes, but I've only ever used it for the y-axis.
A few warnings: This introduces some significant complexity, and decouples the graphics engine's idea of the y-axis scale from the real data. More specifically, it maps all coordinates to the range [0,1] based on their cumulative position within the sequence of segments.
At this point, I'm also going to abandon barplot() in favor of drawing the bars manually, using calls to rect(). Technically, it would be possible to use barplot() with my segmentation code, but as I said earlier, I prefer to design my own plots from scratch with primitive graphic elements. This also allows for more precise control over all aspects of the plot.
Here's the code and plot, I'll attempt to give a better explanation of it afterward:
dataCoordToPlot <- function(data,seg) {
## data -- double vector of data-world coordinates.
## seg -- list of two components: (1) mark, giving the boundaries between all segments, and (2) scale, giving the relative scale of each segment. Thus, scale must be one element shorter than mark.
data <- as.double(data);
seg <- as.list(seg);
seg$mark <- as.double(seg$mark);
seg$scale <- as.double(seg$scale);
if (length(seg$scale) != length(seg$mark)-1L) stop('seg$scale must be one element shorter than seg$mark.');
scaleNorm <- seg$scale/sum(seg$scale);
cumScale <- c(0,cumsum(scaleNorm));
int <- findInterval(data,seg$mark,rightmost.closed=T);
int[int%in%c(0L,length(seg$mark))] <- NA; ## handle values outside outer segments; will propagate NA to returned vector
(data - seg$mark[int])/(seg$mark[int+1L] - seg$mark[int])*scaleNorm[int] + cumScale[int];
}; ## end dataCoordToPlot()
## y dimension segmentation
ymax <- 4000;
yseg <- list();
yseg$mark <- c(0,10,140,ymax);
yseg$scale <- diff(yseg$mark);
yseg$scale[2L] <- 30;
yseg$jump <- c(F,T,F);
## plot precomputations
xcenters <- seq(0.5,len=length(unique(df$sample)));
xlim <- range(xcenters)+c(-0.5,0.5);
ylim <- range(yseg$mark);
yinc <- 100;
yticks.inc <- seq(ylim[1L],ylim[2L],yinc);
yticks.inc <- yticks.inc[!yseg$jump[findInterval(yticks.inc,yseg$mark,rightmost.closed=T)]];
yticks.jump <- setdiff(yseg$mark,yticks.inc);
yticks.all <- sort(c(yticks.inc,yticks.jump));
## plot
## define as reusable function for subsequent examples
custom.barplot <- function() {
par(xaxs='i',yaxs='i');
plot(NA,xlim=xlim,ylim=dataCoordToPlot(ylim,yseg),axes=F,ann=F);
abline(h=dataCoordToPlot(yticks.all,yseg),col='lightgrey');
axis(1L,seq(xlim[1L],xlim[2L]),NA,tck=0);
axis(1L,xcenters,unique(df$sample));
axis(2L,dataCoordToPlot(yticks.inc,yseg),yticks.inc,las=1,cex.axis=0.7);
axis(2L,dataCoordToPlot(yticks.jump,yseg),yticks.jump,las=1,tck=-0.008,hadj=0.1,cex.axis=0.5);
mtext('sample',1L,2L);
mtext('mean',2L,3L);
xgroupRatio <- 0.8;
xbarRatio <- 0.9;
partColors <- c(G='green3',F='indianred1',R='dodgerblue');
partsCanon <- unique(df$part);
errColors <- c(G='darkgreen',F='darkred',R='darkblue');
for (sampleIndex in seq_along(unique(df$sample))) {
xc <- xcenters[sampleIndex];
sample <- unique(df$sample)[sampleIndex];
dfs <- df[df$sample==sample,];
parts <- unique(dfs$part);
parts <- parts[order(match(parts,partsCanon))];
barWidth <- xgroupRatio*xbarRatio/length(parts);
gapWidth <- xgroupRatio*(1-xbarRatio)/(length(parts)-1L);
xstarts <- xc - xgroupRatio/2 + (match(dfs$part,parts)-1L)*(barWidth+gapWidth);
rect(xstarts,0,xstarts+barWidth,dataCoordToPlot(dfs$mean,yseg),col=partColors[dfs$part]);
barCenters <- xstarts+barWidth/2;
segments(barCenters,dataCoordToPlot(dfs$mean + dfs$sd,yseg),y1=dataCoordToPlot(dfs$mean - dfs$sd,yseg),lwd=2,col=errColors);
points(rep(barCenters,2L),dataCoordToPlot(c(dfs$mean-dfs$sd,dfs$mean+dfs$sd),yseg),pch=19,col=errColors);
}; ## end for
## draw zig-zag cutaway graphic in jump segments
zigCount <- 30L;
jumpIndexes <- which(yseg$jump);
for (jumpIndex in jumpIndexes) {
if (yseg$scale[jumpIndex] == 0) next;
jumpStart <- yseg$mark[jumpIndex];
jumpEnd <- yseg$mark[jumpIndex+1L];
lines(seq(xlim[1L],xlim[2L],len=zigCount*2L+1L),dataCoordToPlot(c(rep(c(jumpStart,jumpEnd),zigCount),jumpStart),yseg));
}; ## end for
legend(0.2,dataCoordToPlot(3800,yseg),partsCanon,partColors,title=expression(bold('part')),cex=0.7,title.adj=c(NA,0.5));
}; ## end custom.barplot()
custom.barplot();
The key function is dataCoordToPlot(). That stands for "data coordinates to plot coordinates", where "plot coordinates" refers to the [0,1] normalized domain.
The seg argument defines the segmentation of the axis and the scaling of each segment. Its mark component specifies the boundaries of each segment, and its scale component gives the scale factor for each segment. n segments must have n+1 boundaries to fully define where each segment begins and ends, thus mark must be one element longer than scale.
Before being used, the scale vector is normalized within the function to sum to 1, so the absolute magnitudes of the scale values don't matter; it's their relative values that matter.
The algorithm is to find each coordinate's containing segment, find the accumulative distance within the segment reached by the coordinate accounting for the segment's relative scale, and then add to that the cumulative distance reached by all prior segments.
Using this design, it is possible to take any range of coordinates along the axis dimension and scale them up or down relative to the other segments. An instantaneous gap across a range could be achieved with a scale of zero. Alternatively, you can simply scale down the range so that it has some thickness, but contributes little to the progression of the dimension. In the above plot, I use the latter for the gap, mainly so that I can use the small thickness to add a zigzag aesthetic which visually indicates the presence of the gap.
Also, I should note that I used 10:140 instead of 10:200 for the gap. This is because the sample 2 F part error bar extends down to 147.25 (195.07 - 47.82). The difference is negligible.
As you can see, the result looks basically identical to the Linear Single plot. The gap is not significant enough to raise the visibility of the sample 1 data.
Distorted with Gap
Just to throw some more possibilities into mix, now venturing into very non-standard and probably questionable waters, we can use the segmentation transformation to scale up the sample 1 order of magnitude, thereby making it much more visible while still remaining within the single plot, directly alongside samples 2-4.
For this example, I preserve the gap from 10:140 so you can see how it looks when not lying prostrate near the baseline.
## y dimension segmentation
ymax <- 4000;
yseg <- list();
yseg$mark <- c(0,10,140,ymax);
yseg$scale <- c(24,1,75);
yseg$jump <- c(F,T,F);
## plot precomputations
xcenters <- seq(0.5,len=length(unique(df$sample)));
xlim <- range(xcenters)+c(-0.5,0.5);
ylim <- range(yseg$mark);
yinc1 <- 1;
yinc2 <- 100;
yticks.inc1 <- seq(ceiling(yseg$mark[1L]/yinc1)*yinc1,yseg$mark[2L],yinc1);
yticks.inc2 <- seq(ceiling(yseg$mark[3L]/yinc2)*yinc2,yseg$mark[4L],yinc2);
yticks.inc <- c(yticks.inc1,yticks.inc2);
yticks.jump <- setdiff(yseg$mark,yticks.inc);
yticks.all <- sort(c(yticks.inc,yticks.jump));
## plot
custom.barplot();
Distorted without Gap
Finally, just to clarify that gaps are not necessary for inconsistent scaling between segments, here's the same plot but without the gap:
## y dimension segmentation
ymax <- 4000;
yseg <- list();
yseg$mark <- c(0,10,ymax);
yseg$scale <- c(25,75);
yseg$jump <- c(F,F);
## plot precomputations
xcenters <- seq(0.5,len=length(unique(df$sample)));
xlim <- range(xcenters)+c(-0.5,0.5);
ylim <- range(yseg$mark);
yinc1 <- 1;
yinc2 <- 100;
yticks.inc1 <- seq(ceiling(yseg$mark[1L]/yinc1)*yinc1,yseg$mark[2L],yinc1);
yticks.inc2 <- seq(ceiling(yseg$mark[2L]/yinc2)*yinc2,yseg$mark[3L],yinc2);
yticks.inc <- c(yticks.inc1,yticks.inc2);
yticks.jump <- setdiff(yseg$mark,yticks.inc);
yticks.all <- sort(c(yticks.inc,yticks.jump));
## plot
custom.barplot();
In principle, there's really no difference between the Linear Multiple solution and the Distorted solutions. Both involve visual distortion of competing orders of magnitude. Linear Multiple simply separates the different orders of magnitude into separate plots, while the Distorted solutions combine them into the same plot.
Probably the best argument in favor of using Linear Multiple is that if you use Distorted you'll probably be crucified by a large mob of data scientists, since that is a very non-standard way of plotting data. On the other hand, one could argue that the Distorted approach is more concise and helps to represent the relative positions of each data point along the number line. The choice is yours.

What you want to plot is a discontinuous y axis.
This issue was covered before in this post and seems not to be possible in ggplot2.
The answers to the mentioned post suggest faceting, log scaled y axis and separate plots to solve your problem.
Please find the reasons detailed by Hadley Wickham here, who thinks that a broken y axis could be "visually distorting".

Related

In ggplot, how to draw a circle/disk with a line that divides its area according to a given ratio and colored points inside?

I want to visualize proportions using points inside a circle. For example, let's say that I have 100 points that I wish to scatter (somewhat randomly jittered) in a circle.
Next, I want to use this diagram to represent the proportions of people who voted Biden/Harris in 2020 US presidential elections, in each state.
Example #1 -- Michigan
Biden got 50.62% of Michigan's votes. I'm going to draw a horizontal diameter that splits the circle to two halves, and then color the points under the diameter in blue (Democrats' color).
Example #2 -- Wyoming
Unlike Michigan, in Wyoming Biden got only 26.55% of the votes, which is approximately a quarter of the vote. In this case I'd draw a horizontal chord that divides the circle such that the disk's area under the chord is 25% of the entire disk area. Then I'll color the respective points in that area in blue. Since I have 100 points in total, 25 points represent the 25% who voted Biden in Wyoming.
My question: How can I do this with ggplot? I researched this issue, and there's a lot of geometry going on here. First, the kind of area I'm talking about is called a "circular segment". Second, there are many formulas to calculate its area, if we know some other parameters about the shape (such as the radius length, etc.). See this nice demo.
However, my goal isn't to solve geometry problems, but just to represent proportions in a very specific way:
draw a circle
sprinkle X number of points inside
draw a (real or invisible) horizontal line that divides the circle/disk area according to a given proportion
ensure that the points are arranged respective to the split. That is, if we want to represent a 30%-70% split, then have 30% of the points under the line that divides the disk.
color the points under the line.
I understand that this is somewhat an exotic visualization, but I'll be thankful for any help with this.
EDIT
I've found a reference to a JavaScript package that does something very similar to what I'm asking.
I took a crack at this for fun. There's a lot more that could be done. I agree that this is not a great way to visualize proportions, but if it's engaging your audience ...
Formulas for determining appropriate heights are taken from Wikipedia. In particular we need the formulas
a/A = (theta - sin(theta))/(2*pi)
h = 1-cos(theta/2)
where a is the area of the segment; A is the whole area of the circle; theta is the angle described by the arc that defines the segment (see Wikipedia for pictures); and h is the height of the segment.
Machinery for finding heights.
afun <- function(x) (x-sin(x))/(2*pi)
## curve(afun, from=0, to = 2*pi)
find_a <- function(a) {
uniroot(
function(x) afun(x) -a,
interval=c(0, 2*pi))$root
}
find_h <- function(a) {
1- cos(find_a(a)/2)
}
vfind_h <- Vectorize(find_h)
## find_a(0.5)
## find_h(0.5)
## curve(vfind_h(x), from = 0, to= 1)
set up a circle
dd <- data.frame(x=0,y=0,r=1)
library(ggforce)
library(ggplot2); theme_set(theme_void())
gg0 <- ggplot(dd) + geom_circle(aes(x0=x,y0=y,r=r)) + coord_fixed()
finish
props <- c(0.2,0.5,0.3) ## proportions
n <- 100 ## number of points to scatter
cprop <- cumsum(props)[-length(props)]
h <- vfind_h(cprop)
set.seed(101)
r <- runif(n)
th <- runif(n, 0, 2 * pi)
dd <-
data.frame(x = sqrt(r) * cos(th),
y = sqrt(r) * sin(th))
dd2 <- data.frame(x=r*cos(2*pi*th), y = r*sin(2*pi*th))
dd2$g <- cut(dd2$y, c(1, 1-h, -1))
gg0 + geom_point(data=dd2, aes(x, y, colour = g), size=3)
There are a bunch of tweaks that would make this better (meaningful names for the categories; reverse the axis order to match the plot; maybe add segments delimiting the sections, or (more work) polygons so you can shade the sections.
You should definitely check this for mistakes — e.g. there are places where I may have used a set of values where I should have used their first differences, or vice versa (values vs cumulative sum). But this should get you started.

How to visualize a distance matrix on the map by the thickness or color of the line connect the adjacent localities in R?

Suppose I have two datasets: (1) a data frame: coordinates of localities, each with ID; and (2) a linguistic distance matrix which reflects the linguistic distance between these localities.
# My data are similar to this structure
# dataframe
id <- c("A","B","C","D","E")
x_coor <- c(0.5,1,1,1.5,2)
y_coor <- c(5.5,3,7,6.5,5)
my.data <- data.frame(id = id, x_coor = x_coor, y_coor = y_coor)
# linguistic distance matrix
A B C D
B 308.298557
C 592.555483 284.256926
D 141.421356 449.719913 733.976839
E 591.141269 282.842712 1.414214 732.562625
Now, I want to visualize the linguistic distance between every two sites onto a map by the thickness or color of the line connect the adjacent localities in R.
Just like this:
enter image description here
My idea is to generate the delaunay triangulation by deldir or tripack package in R.
# generate delaunay triangulation
library(deldir)
de=deldir(my.data$x_coor,my.data$y_coor)
plot.deldir(de,wlines="triang",col='blue',wpoints = "real",cex = 0.1)
text(my.data$x_coor,my.data$y_coor,my.data$id)
this is the plot:
enter image description here
My question is how to reflect the linguistic distance by the thickness or color of the edges of triangles? Is there any other better method?
Thank you very much!
What you want to do in respect of the line widths can be done "fairly
easily" by the deldir package. You simply call plot.deldir() with the
appropriate value of "lw" (line width).
At the bottom of this answer is a demonstration script "demo.txt" which shows how to do this in the case of your example. In particular this script shows
how to obtain the appropriate value of lw from the "linguistic distance
matrix". I had to make some adjustments in the way this matrix was
presented. I.e. I had to convert it into a proper matrix.
I have rescaled the distances to lie between 0 and 10 to obtain the
corresponding values of the line widths. You might wish to rescale in a different manner.
In respect of colours, there are two issues:
(1) It is not at all clear how you would like to map the "linguistic
distances" to colours.
(2) Unfortunately the code for plot.deldir() is written in a very
kludgy way, whence the "col" argument to segments() cannot be
appropriately passed on in the same manner that the "lw" argument can.
(I wrote the plot.deldir() code a long while ago, when I knew far less about
R programming than I know now! :-))
I will adjust this code and submit a new version of deldir to CRAN
fairly soon.
#
# Demo script
#
# Present the linguistic distances in a useable way.
vldm <- c(308.298557,592.555483,284.256926,141.421356,449.719913,
733.976839,591.141269,282.842712,1.414214,732.562625)
ldm <- matrix(nrow=5,ncol=5)
ldm[row(ldm) > col(ldm)] <- vldm
ldm[row(ldm) <= col(ldm)] <- 0
ldm <- (ldm + t(ldm))/2
rownames(ldm) <- LETTERS[1:5]
colnames(ldm) <- LETTERS[1:5]
# Set up the example data. It makes life much simpler if
# you denote the "x" and "y" coordinates by "x" and "y"!!!
id <- c("A","B","C","D","E")
x_coor <- c(0.5,1,1,1.5,2)
y_coor <- c(5.5,3,7,6.5,5)
# Eschew nomenclature like "my.data". Such nomenclature
# is Micro$oft-ese and is an abomination!!!
demoDat <- data.frame(id = id, x = x_coor, y = y_coor)
# Form the triangulation/tessellation.
library(deldir)
dxy <- deldir(demoDat)
# Plot the triangulation with line widths proportional
# to "linguistic distances". Note that plot.deldir() is
# a *method* for plot, so you do not have to (and shouldn't)
# type the ".deldir" in the plotting command.
plot(dxy,col=0) # This, and plotting with "add=TRUE" below, is
# a kludge to dodge around spurious warnings.
ind <- as.matrix(dxy$delsgs[,c("ind1","ind2")])
lwv <- ldm[ind]
lwv <- 10*lwv/max(lwv)
plot(dxy,wlines="triang",col='grey',wpoints="none",
lw=10*lwv/max(lwv),add=TRUE)
with(demoDat,text(x,y,id,col="red",cex=1.5))

Find correct 2D translation of a subset of coordinates

I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into an other, larger, set of 2D coordinates. Imagine for example having a Polaroid photo of a small piece of the starry sky with you out at night, and you want to hold it up in a position so they match the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# generate points take subset of coordinates to serve as points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield:
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represent the reference data. The left plot represents the points that I want to translate across the reference data in order to find a position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see edit history) using optim() to minimize the sum of distances between points will only work in the limited circumstance where the point distribution used as reference data is in the middle of the point field. The solution that satisfies the question and seems to still be workable for a few thousand points, would be a brute-force delta and comparison algorithm that calculates the differences between each point in the field against a single point of the reference data and then determines how many of the rest of the reference data are within a minimum threshold (which is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for(i in 1:nrow(ref_coords)) {
x2 <- my_coords_start[,1]
y2 <- my_coords_start[,2]
x1 <- ref_coords[,1] + (x2[1] - ref_coords[i,1])
y1 <- ref_coords[,2] + (y2[1] - ref_coords[i,2])
## Calculate all pairwise distances between reference and field data:
dists <- dist( cbind( c(x1, x2), c(y1, y2) ), "euclidean")
## Only take distances for the sampled data:
dists <- as.matrix(dists)[-1*1:length(x1),]
## Calculate the number of distances within the minimum
## distance threshold minus the diagonal portion:
win_thresh <- sum(rowSums(dists < min_dist) > 1)
## If we have more "matches" than our best then calculate a new
## dx and dy:
if (win_thresh > win_thresh_old) {
win_thresh_old <- win_thresh
dx <- (x2[1] - ref_coords[i,1])
dy <- (y2[1] - ref_coords[i,2])
}
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if there's anyone that solves this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.

Create bubble chart with biggest bubble at the center

I'm trying to create a bubble chart using a set of data as follows:
X --> 10
Y --> 20
Z --> 5
Q --> 10
I simply need to have the biggest bubble (based on its number) to be at the centre (give or take) and the rest of the bubbles be around it without overlapping.
All of the R examples I have seen require a two dimensional dataset, and since the data I have are only one dimensional, I like to know if it's at all possible to create such graphs in R.
It would be great if someone could suggest me some useful hints or so. By the way for this task I need to use a SA tools so something like d3js is out of options. However, I am open to using a tool other than R.
I wasn't quite sure if this question should be asked in On Stack Overflow or Cross Validated, so if moderators believe it doesn't belong here, I'll remove it.
This should do, the main idea being that you sort by the value of the radius, so the first is the biggest, then shift the values around it (odd on one side, even on the other) so that the values are decreasing both ways.
Further explanations in the code.
library(plotrix)
library(RColorBrewer)
# Set the random seed, to get reproducible results
set.seed(54321)
# Generate some random values for the radius
num.circles <- 11
rd <- runif(num.circles, 1, 20)
df <- data.frame(labels=paste("Lbl", 1:num.circles), radius=rd)
# Sort by descending radius. The biggest circle is always row 1
df <- df[rev(order(df$radius)),]
# Now we want to put the biggest circle in the middle and the others on either side
# To do so we reorder the data frame taking the even values first reversed, then the odd values.
# This ensure the biggest circle is in the middle
df <- df[c(rev(seq(2, num.circles, 2)), seq(1, num.circles, 2)),]
# Space between the circles. 0.2 * average radius seems OK
space.between <- 0.2 * mean(df$radius)
# Creat an empty plot
plot(0, 0, "n", axes=FALSE, bty="n", xlab="", ylab="",
xlim=c(0, sum(df$radius)*2+space.between*num.circles),
ylim=c(0, 2.5 * max(df$radius)))
# Draw the circle at half the height of the biggest circle (plus some padding)
xx <- 0
mid.y <- max(df$radius) * 1.25
# Some nice degrading tones of blue
colors <- colorRampPalette(brewer.pal(8,"Blues"))(num.circles/2)
for (i in 1:nrow(df))
{
row <- df[i,]
x <- xx + row$radius + i*space.between
y <- mid.y
# Draw the circle
draw.circle(x, y, row$radius,
col=colors[abs(num.circles/2-i)])
# Add the label
text(x, y, row$labels, cex=0.6)
# Update current x position
xx <- xx + row$radius * 2
}
The result:
Live version on RFiddle.

How to count line segment occurrences by pixel in R?

I am trying to convey the concentration of lines in 2D space by showing the number of crossings through each pixel in a grid. I am picturing something similar to a density plot, but with more intuitive units. I was drawn to the spatstat package and its line segment class (psp) as it allows you to define line segments by their end points and incorporate the entire line in calculations. However, I'm struggling to find the right combination of functions to tally these counts and would appreciate any suggestions.
As shown in the example below with 50 lines, the density function produces values in (0,140), the pixellate function tallies the total length through each pixel and takes values in (0, 0.04), and as.mask produces a binary indictor of whether a line went through each pixel. I'm hoping to see something where the scale takes integer values, say 0..10.
require(spatstat)
set.seed(1234)
numLines = 50
# define line segments
L = psp(runif(numLines),runif(numLines),runif(numLines),runif(numLines), window=owin())
# image with 2-dimensional kernel density estimate
D = density.psp(L, sigma=0.03)
# image with total length of lines through each pixel
P = pixellate.psp(L)
# binary mask giving whether a line went through a pixel
B = as.mask.psp(L)
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(L, main="L")
plot(D, main="density.psp(L)")
plot(P, main="pixellate.psp(L)")
plot(B, main="as.mask.psp(L)")
The pixellate.psp function allows you to optionally specify weights to use in the calculation. I considered trying to manipulate this to normalize the pixels to take a count of one for each crossing, but the weight is applied uniquely to each line (and not specific to the line/pixel pair). I also considered calculating a binary mask for each line and adding the results, but it seems like there should be an easier way. I know that you can sample points along a line, and then do a count of the points by pixel. However, I am concerned about getting the sampling right so that there is one and only one point per line crossing of a pixel.
Is there is a straight-forward way to do this in R? Otherwise would this be an appropriate suggestion for a future package enhancement? Is this more easily accomplished in another language such as python or matlab?
The example above and my testing has been with spatstat 1.40-0, R 3.1.2, on x86_64-w64-mingw32.
You are absolutely right that this is something to put in as a future enhancement. It will be done in one of the next versions of spatstat. It will probably be an option in pixellate.psp to count the number of crossing lines rather than measure the total length.
For now you have to do something a bit convoluted as e.g:
require(spatstat)
set.seed(1234)
numLines = 50
# define line segments
L <- psp(runif(numLines),runif(numLines),runif(numLines),runif(numLines), window=owin())
# split into individual lines and use as.mask.psp on each
masklist <- lapply(1:nsegments(L), function(i) as.mask.psp(L[i]))
# convert to 0-1 image for easy addition
imlist <- lapply(masklist, as.im.owin, na.replace = 0)
rslt <- Reduce("+", imlist)
# plot
plot(rslt, main = "")

Resources