How to read a coplot() graph - r

I cannot wrap my mind around reading the plots generated by coplot().
For example, from help(coplot):
## Tonga Trench Earthquakes
coplot(lat ~ long | depth, data = quakes)
What do the gray bars above represent? Why are there 2 rows of lat/long boxes?
How do I read this graph?

I can shed some more light on the second chart's interpretation. The gray bars for both mag and depth represent intervals of their respective variables. Andy gave a nice description of how they are created above.
When you are reading them keep in mind that they are meant to show you the range of the observations for the respective conditioning variable (mag or depth) represented in each column or row. Therefore, in Andy's example the largest mag bar is just showing that the topmost row contains observations for earthquakes of approx. 4.6 to 7. It makes sense that this bar is the largest, since as Andy mentioned, they are created to have roughly similar numbers of observations and stronger earthquakes are not as common as weaker ones. The same logic holds true for depth where a larger range of depths was required to get a roughly proportional number of observations.
Regarding reading the chart, you would read the columns as representing the three depth groups (left to right) and the rows as representing the four mag groups (bottom to top). Thus, as you read up the chart you're progressively slicing the data into groups of observations with increasing magnitudes. So, for example, the bottom row represents earthquakes with magnitudes of 4 to 4.5 with each column representing a different range of depths. Similarly, you read the columns as holding depth constant while allowing you to see various ranges of magnitudes.
Putting it all together, as mentioned by Andy, we can see that as we read up the rows (progressing up in magnitude) the distribution of earthquakes remains relatively unchanged. However, when reading across the columns (progressing up in depth) we see that the distribution does slightly change. Specifically, the grouping of quakes on the right, between longitudes 180 and 185, grows tighter and more clustered towards the top of the cell.
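For reference, the chart being described here is produced by the two-way conditioning call shown in the next answer:
# 3 depth columns (left to right), 4 mag rows (bottom to top)
coplot(lat ~ long | depth * mag, data = quakes, number = c(3, 4))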

This is a method for visualizing interactions in your dataset. More specifically, it lets you see how one set of variables behaves conditional on another set of variables.
In the example given, you're asking to visualize how lat and long vary with depth. Because you didn't specify number, and the formula indicates you're interested in only one conditioning variable, the function assumes you want number = 6 depth cuts (passed to co.intervals, which tries to make the number of data points approximately equal within each interval) and simply maximizes the data-to-ink ratio by stacking individual plot frames. The value of depth increases to the right, starting with the lowest row and moving up (hence the top-right frame represents the largest depth interval). You can set rows or columns to change this behavior, e.g.:
coplot(lat ~ long | depth, data = quakes, columns=6)
but I think the power of this tool becomes more apparent when you inspect two or more conditioning variables. For example:
coplot(lat ~ long | depth * mag, data = quakes, number=c(3,4))
gives a rich view of how earthquakes vary in space, and demonstrates that there is some interaction with depth (the pattern changes from left to right), and little-to-no interaction with magnitude (the pattern does not change from top to bottom).
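To see the exact intervals behind the gray bars, you can call co.intervals() directly (shown here with its documented defaults):
# six overlapping depth intervals with roughly equal numbers of points in each
co.intervals(quakes$depth, number = 6, overlap = 0.5)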
Finally, I would highly recommend reading Cleveland's Visualizing Data -- a classic text.

Related

Extract values from geom_bin2d

I have taken photos of a bird nesting area and have marked the position of each bird on the photo. The resulting data is a list of X and Y positions, which I transformed from pixels to meters.
I want to count how many observations fall in each 1 m² square. I was able to get what I was looking for graphically with geom_bin2d, but I would like to extract the value of each of the squares.
Any functions that would do this? or methods to extract data from geom_bin2d?
Thank you very much!
I have found a few functions (density, bkde2D), but they compute kernel density estimates, which don't produce the same values as geom_bin2d.
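One way to get the counts (a sketch, not from the original thread; x and y stand in for the bird coordinates in meters) is to bin the points yourself with cut() and table(), which matches what geom_bin2d computes for fixed 1 m breaks:
# hypothetical coordinates in meters over a 10 m x 10 m plot
x <- runif(100, 0, 10); y <- runif(100, 0, 10)
# counts per 1 m x 1 m square
counts <- table(cut(x, breaks = 0:10), cut(y, breaks = 0:10))
counts
Alternatively, ggplot_build() exposes the values geom_bin2d actually computed: ggplot_build(p)$data[[1]] has one row per bin, including its count.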

Correlating rasters with divisible resolution

I am using a multibeam echosounder to create a raster stack in R with layers all in the same resolution, which I then convert to a data frame so I can create additive models to describe the distribution of fish around bathymetry variables (depth, aspect, slope, roughness etc.).
The issue I have is that I would like to keep my response variable (fish school volume) fine and my predictor variables (bathymetry) coarse, such that I have, say, 1 x 1 m cells representing the distribution of fish schools and 10 x 10 m cells representing bathymetry (so the coarse cell is divisible by the fine cell with no remainder).
I can easily create these rasters individually, but relating them is the problem. As each coarser cell would contain 10 x 10 = 100 finer cells, I am not sure how to program this in R so that the values end up in the right location relative to an x and a y column (for cell addresses). I realise that in this case each coarse cell value would need to be repeated 100 times in the data frame.
Any advice would be greatly appreciated! Thanks!
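One possible approach (a sketch with made-up stand-in rasters, not from the original post): disaggregate the coarse bathymetry raster by a factor of 10 so it aligns cell-for-cell with the fine raster, which repeats each coarse value 100 times, then stack the layers and convert to a data frame:
library(raster)
# stand-ins: 'fine' is the 1 x 1 m fish raster, 'coarse' the 10 x 10 m bathymetry
fine <- raster(nrows = 100, ncols = 100, xmn = 0, xmx = 100, ymn = 0, ymx = 100)
values(fine) <- runif(ncell(fine))
coarse <- aggregate(fine, fact = 10)         # pretend this is real bathymetry
coarse10 <- disaggregate(coarse, fact = 10)  # back to 1 m cells, values repeated
df <- as.data.frame(stack(fine, coarse10), xy = TRUE)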

Piecewise linear function given three points and two crossover boundaries

Suppose you have three points: (3500, 700), (52500, 5075), and (527500, 36800), as well as two x boundaries, 25000 and 200000. The question is then to construct three lines (each of which goes through one of the points) in such a way that the lines achieve the same value at each boundary point. The catch is that the second line must have a lesser slope than the first line, and the third line must have a lesser slope than the second line.
I do not believe a solution exists, but I would like to know how to set up the problem in order to check for solutions in R. Ideally, I would then like to relax the constraint that the functions be equal at the boundaries to something where they are within 10% of one another.
Edit 1:
Essentially what I want is a line that goes through (3500, 700) and has a value (25000, y_1); a line that goes through (52500, 5075) and has values (25000, y_1) and (200000, y_2); and a third line that goes through (527500, 36800) and has a value (200000, y_2).
Edit 2:
Here is an updated graphic. The dashed lines represent an incorrect solution because the slope of the last segment is larger than the slope of the middle segment (which is 0). What I am essentially looking for is the y = mx + b form of each of the three segments, as if they were infinitely extending lines.
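One way to set up the check in R (a sketch; the parametrization by the middle slope is my own): forcing the lines to agree at each boundary gives two linear equations in the three slopes, so fixing the middle slope m2 determines m1 and m3, and you can scan m2 for values satisfying m1 > m2 > m3:
p1 <- c(3500, 700); p2 <- c(52500, 5075); p3 <- c(527500, 36800)
b1 <- 25000; b2 <- 200000
# equality at b1: p1[2] + m1 * (b1 - p1[1]) = p2[2] + m2 * (b1 - p2[1])
m1_of <- function(m2) (p2[2] + m2 * (b1 - p2[1]) - p1[2]) / (b1 - p1[1])
# equality at b2: p3[2] + m3 * (b2 - p3[1]) = p2[2] + m2 * (b2 - p2[1])
m3_of <- function(m2) (p2[2] + m2 * (b2 - p2[1]) - p3[2]) / (b2 - p3[1])
m2 <- seq(0, 0.2, length.out = 2001)
feasible <- m1_of(m2) > m2 & m2 > m3_of(m2)
range(m2[feasible])  # interval of middle slopes that satisfy the ordering, if any
Relaxing exact equality to agreement within 10% just replaces each equation with a pair of inequalities, which can be scanned the same way.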

Making a heatmap in R varying both color and transparency

Is it possible to generate a heatmap taking into consideration both the color and the transparency, with these two parameters given from two different matrices (matrix 1 defines color, matrix 2 defines alpha)?
A little more information on what I'm after:
I have successfully used R and the heatmap.2 function in the gplots package to generate heatmaps - in this case to visualize miRNA interactions. Here, what I want to show is the probability of each nucleotide along the typical 20-24 nucleotides of the miRNA being engaged in target pairing. My heatmap matrix consists of miRNAs (rows) and positions 1-24 (columns), with a numeric pairing probability in each cell. An example would be changing the alpha of the color determined by the matrix values, such that white = no pairing and dark red = high pairing.
The heatmap.2 function works great for a single such plot, but I would now like to take in overlap information from two different species. Thus, I would need my heatmap to basically consider two matrices:
1) A matrix with the degree of species overlap, e.g. ranging from red-purple-blue for species1-only to species1+2 to species2-only.
2) A matrix with the average degree of pairing, e.g. visualized by the alpha parameter going from a weak-to-strong average pairing (whatever the color) at a given position in matrix 1.
I have tried to use the principles from this post:
Place 1 heatmap on another with transparency in R
But haven't been able to apply its suggestions to my own question.
Thanks in advance!
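One possible way to get both mappings in a single plot (a sketch with made-up matrices, not from the linked post) is to build the heatmap with ggplot2's geom_tile, mapping one matrix to fill and the other to alpha:
library(ggplot2)
# hypothetical example matrices: rows = miRNAs, columns = positions 1-24
overlap <- matrix(runif(5 * 24, -1, 1), nrow = 5)  # species overlap -> colour
pairing <- matrix(runif(5 * 24), nrow = 5)         # pairing probability -> alpha
d <- data.frame(miRNA = rep(seq_len(nrow(overlap)), times = ncol(overlap)),
                position = rep(seq_len(ncol(overlap)), each = nrow(overlap)),
                overlap = as.vector(overlap),
                pairing = as.vector(pairing))
ggplot(d, aes(position, miRNA, fill = overlap, alpha = pairing)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "purple", high = "blue") +
  scale_alpha(range = c(0.1, 1))
Because geom_tile accepts both aesthetics directly, the species-overlap colour scale and the pairing-strength transparency vary independently of one another.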

What method does outline=FALSE use to determine outliers? [duplicate]

This question already has answers here:
In ggplot2, what do the end of the boxplot lines represent?
(4 answers)
Closed 10 years ago.
In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the median, typically approximately 2.5 times the difference between the median and the 0.25 (lower) or 0.75 (upper) quantile. To get there, see ?boxplot.stats: first, look at the definition of out in the output
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which are based on the coef parameter, which is 1.5 by default:
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that are farther from the median than about four times the distance between the median and the relevant quartile: for symmetric data the quartile sits half an interquartile range from the median, and the whisker fence extends another 1.5 interquartile ranges beyond the quartile. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
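A quick numerical check of this rule (my own illustration, not part of the original answer):
set.seed(1)
x <- c(rnorm(50), 5)                    # one clearly extreme point
s <- boxplot.stats(x)
iqr <- s$stats[4] - s$stats[2]          # upper hinge minus lower hinge
s$out                                   # the points boxplot() would flag
# every flagged point lies more than 1.5 * IQR beyond a hinge
all(s$out > s$stats[4] + 1.5 * iqr | s$out < s$stats[2] - 1.5 * iqr)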
For boxplot, outliers are the points that lie above or below the "whiskers". These, by default, extend to the most extreme data points that are no more than the interquartile range times the range argument away from the box. The default range value is 1.5, but you can change it, and so you can also change the outliers list.
You can also see this with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector:
v <- c(runif(10), -0.5, -1)
boxplot(v)
By default, only the -1 value is considered an outlier. You can see it with boxplot.stats:
boxplot.stats(v)$out
[1] -1
But if you change the range argument (or the coef argument for boxplot.stats), then -1 is no longer considered an outlier:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
This is admittedly not immediately evident from the boxplot() documentation. Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
I'll be the first to agree that this definition is unintuitive. Sadly enough, it is well established by now.
