I'm using splot to visualize the fitness histogram for an optimization problem. In this scenario, positive Z values (say in the +2 to +15 range), which represent good solutions, are of particular interest, whereas negative values don't provide much insight; it doesn't matter whether a bad solution has a Z value of -50, -500 or -5000.
With autoranging, all the interesting detail around 0 is 'scaled away' (the surface is mostly flat, since the range has to accommodate the negative peaks), so I'm now using an explicit zrange of [-bestValue:bestValue] to focus the plot on the interesting Z values.
Now the development of the best solutions close to 0 can be traced much better; however, the surface is rendered with 'holes' wherever negative Z values exceed the range, which is very confusing to look at and interpret.
(FWIW, the hidden3d option is enabled.)
Can gnuplot 'fill' the holes in some way, e.g. by clamping negative values in the surface plot instead of just dropping those points from the surface?
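A sketch of that clamping idea in gnuplot syntax: a ternary expression in the using spec pins out-of-range Z values to the bottom of the plotted range instead of dropping them. The file name, column layout and range limits below are placeholders, not from the original setup:

zmin = -15
zmax = 15
set zrange [zmin:zmax]
set hidden3d
splot 'fitness.dat' using 1:2:($3 < zmin ? zmin : $3) with lines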
I have a bimodal, asymmetric distribution which I would like to cut at both ends. The specific requirement is that I would like to calculate symmetric boundaries on the appropriate side of each 'bell'. The figure shows an extreme case of separation between the bells, for simplicity.
In this case the red cuts were selected by eye, and the blue lines, offset by an arbitrary 1500 on each side, represent a value that could potentially be passed through a function for the trim. My goal is to subset everything between the blue lines.
# histogram of the data with 50 bins
hist(p3_cut$x, 50)
# red lines: cut positions picked by eye
abline(v = c(6200, 7600), col = 'red')
# blue lines: the same cuts with an arbitrary 1500 margin on each side
abline(v = c(6200 - 1500, 7600 + 1500), col = 'blue')
My guess is that the problem here is basically to find the 'edges' of each curve. I cannot use the halfway point between the means; I need something that recognizes a frequency change from 0 (or a very low value) to something relatively high.
A somewhat general answer. Depending on the problem, you might need to adjust the bandwidth (the adjust parameter) of the density function:
# get the density of x and normalize so its max is one
dens <- density(x, adjust = 0.1)
dens$y <- dens$y / max(dens$y)
# keep all x where the density is higher than some fraction of the max (here 1%)
min_frac <- 0.01
x_keep <- dens$x[dens$y > min_frac]
# find the position of the gap in x, and get x just before and after the gap
gap_pos <- which.max(diff(x_keep))
left_cut <- x_keep[gap_pos]
right_cut <- x_keep[gap_pos + 1]
Using this code and tuning the adjust parameter of the density function, I was able to calculate almost perfect cuts, at least for this case. I am confident that this approach is flexible enough for most situations similar to this one. I show the results for the proposed cuts.
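To finish the job from the question, a hypothetical last step could mirror the blue lines by applying the arbitrary 1500 margin around the detected cuts (the margin and variable names are assumptions, not part of the answer above):

# subset everything between the 'blue lines': detected cuts +/- the margin
margin <- 1500
x_subset <- x[x >= left_cut - margin & x <= right_cut + margin]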
I am looking to present a variable as a bar plot, with the caveat that the groups I am trying to plot (the size of an object) vary over several orders of magnitude. The other complication is that the variable y also varies over several orders of magnitude when positive, as well as having negative values. I usually think in pictures, so I have sketched something along the lines of what I am looking for below (the colour would simply be a function of the distance from zero, i.e. white at zero, dark blue for very negative, dark red for very positive, etc.):
Here is a real case of the data if required:
x <- c(1.100e-08, 1.200e-08, 1.300e-08, 1.400e-08, 1.600e-08, 1.700e-08, 1.900e-08, 2.100e-08, 2.300e-08, 2.600e-08, 3.100e-08, 3.500e-08, 4.200e-08, 4.700e-08, 5.200e-08, 5.800e-08, 6.400e-08, 7.100e-08, 7.900e-08, 8.800e-08, 9.800e-08, 1.100e-07, 1.230e-07, 1.380e-07, 1.550e-07, 1.760e-07, 3.250e-07, 3.750e-07, 4.250e-07, 4.750e-07, 5.400e-07, 6.150e-07, 6.750e-07, 7.500e-07, 9.000e-07, 1.150e-06, 1.450e-06, 1.800e-06, 2.250e-06, 2.750e-06, 3.250e-06, 3.750e-06, 4.500e-06, 5.750e-06, 7.000e-06, 8.000e-06, 9.250e-06, 1.125e-05, 1.375e-05, 1.625e-05, 1.875e-05, 2.250e-05, 2.750e-05, 3.100e-05)
y <-c(1.592140e+01, -1.493541e+01, -6.255603e+00, -2.191637e+00, -1.274086e+00, -1.343391e+00, -8.869018e-01, -7.717447e-01, -6.140710e-01, -5.637220e-01, -5.404424e-01, -3.473077e-01, -2.279666e-01, -1.945254e-01, -2.485636e-01, -2.363181e-01, -2.197054e-01, -2.119314e-01, -1.897220e-01, -1.656779e-01, -1.478176e-01, -1.364191e-01, -1.297830e-01, -1.408082e-01, -1.514742e-01, -1.311300e-01, -1.358422e-01, -2.718636e+00, -2.231532e+00, -3.479395e+00, -3.572720e+00, -2.297957e+00, -3.265428e+00, -5.449620e+00, -7.741435e+00, -1.172256e+01, 9.368365e+00, 1.078983e+02, 9.542029e+01, 1.484089e+02, 2.293383e+02, 3.678836e+02, 7.965286e+02, 1.349151e+03, 1.577808e+04, 4.554271e+05, 1.821730e+06, 8.092310e+04, 1.015619e+06, 2.113788e+06, 5.208331e+06, 4.534863e+06, 8.086026e+06, 1.577413e+07)
I could also plot this as a scatterplot with a broken axis, but I am currently looking for a nice approach to display such data. Important for me is highlighting the approximate value of x at which y changes sign, as well as the variability and magnitude of both the positive and negative values. Any tips and advice you have for plotting such data would be great.
Edit based upon comments
I realise that on my graph x and y are the wrong way around, apologies for that. Parameter x should indeed be on the x-axis and parameter y on the y-axis.
Taking your suggestions on board, it would be better to plot this data as a scatterplot. Accepting that, I still need to break my axis at a relevant value of y (not x as shown in the figure) and have a log scale above this value and a linear scale below it. Somewhere below the smallest positive value of y seems sensible for this break. Can this be done using base R?
I guess something like this, but with the split on the y-axis rather than the x-axis, and in R of course.
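A minimal base-R sketch of that split: two stacked panels sharing a log-scaled x axis, with the positive y values on a log scale in the top panel and the rest on a linear scale below. The break point (the smallest positive y, as suggested above) and the layout are assumptions, not a definitive answer:

# split-axis sketch: assumes x and y from the question are in the workspace
brk <- min(y[y > 0])                      # assumed break point
op <- par(mfrow = c(2, 1), mar = c(0, 4, 1, 1))
plot(x[y >= brk], y[y >= brk], log = "xy",
     xlim = range(x), xaxt = "n", ylab = "y (log scale)")
par(mar = c(4, 4, 0, 1))
plot(x[y < brk], y[y < brk], log = "x",
     xlim = range(x), xlab = "x", ylab = "y (linear scale)")
par(op)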
In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the median, typically approximately 2.5 times the difference between the median and the 0.25 (lower) or 0.75 (upper) quantile. To get there, see ?boxplot.stats: first, look at the definition of out in the output
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which depend on the coef parameter (1.5 by default):
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that lie more than 1.5 times the interquartile range beyond the quartiles; for roughly symmetric data, that is about four times the median-to-quartile distance away from the median. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
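As an illustration (my own sketch, not from the help page), the default rule can be checked by hand against boxplot.stats; note that the hinges are only approximately the quartiles, so the manual version may differ slightly on small samples:

# one clearly extreme point among standard-normal draws
x <- c(rnorm(100), 10)
boxplot.stats(x)$out          # points beyond the whiskers
# approximate manual equivalent of the default coef = 1.5 rule
q <- quantile(x, c(0.25, 0.75))
iqr <- diff(q)
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]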
For boxplot, outliers are the points that lie above or below the "whiskers". These, by default, extend to the most extreme data points that are no more than the interquartile range times the range argument away from the box. The default range value is 1.5, but you can change it, and in doing so change which points are reported as outliers.
You can also see that with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector:
# note: runif() draws are random, so whether -0.5 is flagged depends on the
# draw; use set.seed() first for a reproducible example
v <- c(runif(10), -0.5, -1)
boxplot(v)
By default, only the -1 value is considered an outlier. You can see this with boxplot.stats:
boxplot.stats(v)$out
[1] -1
But if you change the range argument (or the coef one for boxplot.stats), then -1 is no longer considered an outlier:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
This is admittedly not immediately evident from the boxplot() documentation. Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
I'll be the first to agree that this definition is unintuitive. Sadly, it is well established by now.
R automatically uses powers of ten for the x axis (values are from zero to 500000), but I want just the plain figures in steps of 50000 or so, NOT written as powers of ten.
I tried to set the axis with axis(1, c(0, 100000, ....)), but it is plotted as powers of ten again.
I also tried to scale down the font with cex.axis, but it still uses powers of ten for the x axis. I think R tries to secure enough space between the values on the x axis, but I want to force the full values to be plotted.
The axis currently looks like this:
-4e+05 -2e+05 0e+00 2e+05 4e+05 and so on ...
This link seems to answer your question: http://tolstoy.newcastle.edu.au/R/help/05/09/12499.html
e.g. options(scipen=6) would move the cutoff for scientific notation up, so that it is only used for numbers larger than about 1e6, I believe.
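A quick sketch of both routes (the data and tick positions are made up for illustration; scipen is a penalty, so the exact cutoff depends on the width of the numbers):

x <- seq(0, 5e5, length.out = 50)
y <- sqrt(x)
options(scipen = 6)        # penalize scientific notation in axis labels
plot(x, y)
# or draw the axis by hand with fixed-notation labels
plot(x, y, xaxt = "n")
axis(1, at = seq(0, 5e5, by = 1e5),
     labels = format(seq(0, 5e5, by = 1e5), scientific = FALSE))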
Problem: Suppose you have a collection of points in the 2D plane. I want to know whether this set of points sits on a regular grid (i.e. whether they are a subset of a 2D lattice). I would like some ideas on how to do this.
For now, let's say I'm only interested in whether the points form an axis-aligned rectangular grid (the underlying lattice is rectangular and aligned with the x and y axes), and whether it is a complete rectangle (the subset of the lattice has a rectangular boundary with no holes). Any solution must be quite efficient (better than O(N^2)), since N can be hundreds of thousands or millions.
Context: I wrote a 2D vector field plot generator that works for an arbitrarily sampled vector field. When the sampling is on a regular grid, there are simpler and more efficient interpolation schemes for generating the plot, and I would like to know when I can use that special case. The special case is sufficiently better that it merits the check. The program is written in C.
This might be dumb, but if your points were to lie on a regular grid, wouldn't the peaks in the Fourier transform of the coordinates all fall at exact multiples of the grid's spatial frequency (the reciprocal of the spacing)? You could do a separate Fourier transform of the X and the Y coordinates. If there are no holes in the grid, the FT would be a delta function, I think. The FFT is O(n log n).
P.S. I would have left this as a comment, but my rep is too low.
Not quite sure if this is what you are after, but for a collection of 2D points on a plane you can always fit them to a rectangular grid (down to the precision of your points, anyway); the problem is that the grid they fit may be too sparsely populated by the points to provide any benefit to your algorithm.
To find a rectangular grid that fits a set of points, you essentially need to find the GCD of all the x coordinates and the GCD of all the y coordinates, with the origin at (xmin, ymin). This should be O(n (log n)^2), I think.
How you decide whether the resulting grid is too sparse is not clear, however.
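A tiny sketch of the GCD idea, shown in R for brevity even though the question's program is in C. It assumes coordinates that are exact integers after scaling; real floating-point data would need a tolerance-based variant:

# pitch = GCD of the gaps between successive distinct coordinates
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)
grid_pitch <- function(v) Reduce(gcd2, diff(sort(unique(v))))
grid_pitch(c(0, 2, 4, 8, 10))   # 2: candidate grid spacing along this axis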
If the points all come only from intersections on the grid, then the Hough transform of your set of points might help. If you find that two mutually perpendicular sets of lines occur most often (meaning you find peaks at four values of theta, all 90 degrees apart) and you find regularly repeating peaks in gamma space, then you have a grid. Otherwise not.
Here's a solution that works in O(ND log N), where N is the number of points and D is the number of dimensions (2 in your case).
Allocate D arrays with space for N numbers: X, Y, Z, etc. (Time: O(ND))
Iterate through your point list and add the x-coordinate to list X, the y-coordinate to list Y, etc. (Time: O(ND))
Sort each of the new lists. (Time: O(ND log N))
Count the number of unique values in each list and make sure the difference between successive unique values is the same across the whole list. (Time: O(ND))
If (a) the unique values in each dimension are equally spaced, and (b) the product of the numbers of unique values of each coordinate equals the number of original points (length(uniq(X)) * length(uniq(Y)) * ... == N), then the points are in a regular rectangular grid.
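A compact sketch of this check, in R for brevity (the original program is in C, and exact equality of spacings would need a tolerance on real data):

is_regular_grid <- function(x, y) {
  ux <- sort(unique(x)); uy <- sort(unique(y))
  evenly <- function(u) length(u) < 3 || all(abs(diff(diff(u))) < 1e-9)
  # equally spaced in each dimension, and complete (no duplicate points assumed)
  evenly(ux) && evenly(uy) && length(ux) * length(uy) == length(x)
}
g <- expand.grid(gx = c(0, 1, 2), gy = c(0, 2, 4, 6))   # complete 3x4 grid
is_regular_grid(g$gx, g$gy)           # TRUE
is_regular_grid(g$gx[-1], g$gy[-1])   # FALSE: one point missing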
Let's say a grid is defined by an orientation Or (between 0 and 90 degrees) and a resolution Res. You could compute a cost function that evaluates how well a grid (Or, Res) fits your points. For example, you could compute the average distance of each point to the closest point of the grid.
Your problem is then to find the (Or, Res) pair that minimizes the cost function. To narrow the search space and speed up the search, a heuristic that proposes "good" candidate grids could be used.
This approach is the same as the one used in the Hough transform proposed by jilles. The (Or, Res) space is comparable to the Hough's gamma space.
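A hedged sketch of such a cost function, in R for illustration (Or in radians; the grid is assumed to pass through the origin, which a real search would also have to optimize over):

# mean distance from each point to the nearest node of a candidate grid
grid_cost <- function(px, py, or, res) {
  # rotate the points into the grid's frame
  rx <-  px * cos(or) + py * sin(or)
  ry <- -px * sin(or) + py * cos(or)
  # offset of each rotated coordinate from the nearest lattice node
  dx <- rx - res * round(rx / res)
  dy <- ry - res * round(ry / res)
  mean(sqrt(dx^2 + dy^2))
}
grid_cost(c(0, 1, 2), c(0, 1, 0), or = 0, res = 1)   # 0: perfect fit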