Histogram() doesn't display the values at the edge of the defined range - julia

I can't get Julia to display edge values on histograms, when defining a range for the bins. Here is a minimal example:
using Plots
x = [0,0.5,1]
plot(histogram(x, bins=range(0,1,length=3)))
Defining them explicitly doesn't help (bins=[0,0.3,0.7,1]). It seems that histogram() excludes the limits of the range. I can extend the range to make it work:
plot(histogram(x, bins=[0,0.3,0.7,1.01]))
But I really don't think that should be the way to go. Surprisingly, fixing the number of bins does work (nbins=3) but I need to keep the width of all the bins the same and constant across different runs for comparison purposes.
I have tried with Plots, PlotlyJS and StatsBase (with fit() and its closed attribute) to no avail. Maybe I'm missing something, so I wanted to ask: is it possible to do what I want?

Try:
plot(histogram(x, bins=range(0,nextfloat(1.0),length=3)))
Although this extends the range, it does so in a minimal way - essentially the smallest extension that makes the right end of the histogram closed.
As for equal widths: when dealing with floating-point numbers, "equal widths" can mean different things - equal in terms of real numbers (which are not always representable), or, for example, in terms of the number of representable values, which can differ between [0.0,1.0] and [1.0,2.0].
So hopefully, this scratches the itch in the OP.

https://juliastats.org/StatsBase.jl/latest/empirical/#StatsBase.Histogram
most importantly:
closed: A symbol with value :right or :left indicating on which side bins (half-open intervals or higher-dimensional analogues thereof) are closed. See below for an example.
This is very common in many histogram implementations, for example NumPy:
In [8]: np.histogram([0], bins=[0, 1, 2])
Out[8]: (array([1, 0]), array([0, 1, 2]))
In [9]: np.histogram([1], bins=[0, 1, 2])
Out[9]: (array([0, 1]), array([0, 1, 2]))
NumPy has the inconsistency that the last bin is closed on both sides, but it's perfectly normal for every bin to be closed on only one side.

Related

How do graphing calculators like desmos generate their graphs?

I was wondering how graphing calculators were able to plot functions and relations so quickly.
For a function, I can see just testing all the x values numerically in a domain and outputting it that way. But how does this work for relations (such as x^2 + y^2 = 1)? Numerically testing every possible x and y value isn't that fast, as it would be O(n^2), right? How is it possible?
Thank you.
It's based on the zoom: when you zoom in, the same number of values is rendered. The graph only lets you see a maximum of 5 steps at a time, so it doesn't check all the x values, only the x values up to step*5. It also doesn't render the decimals the way you might think: instead of rendering at x = x/100 to make the line look smooth, it renders at x = x/screenres. This means that, as with 99% of graphics programs, it gets slower the higher your screen resolution is.
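To illustrate the idea of resolution-based sampling, here is a minimal R sketch (the function, the viewport and the 800-pixel screen width are all made-up assumptions): the function is evaluated once per horizontal pixel, so the amount of work stays constant no matter how far you zoom in.
f <- function(x) sin(x) * exp(-x / 10)    # made-up example function

screen_width <- 800                        # assumed horizontal resolution in pixels
xlim <- c(0, 20)                           # current viewport on the x-axis

# One sample per pixel: zooming in changes xlim but not the number of samples.
x <- seq(xlim[1], xlim[2], length.out = screen_width)
plot(x, f(x), type = "l")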

Detect peaks at beginning and end of x-axis

I've been working on detecting peaks within a data set of thousands of y~x relationships. Thanks to this post, I've been using loess and rollapply to detect peaks by comparing the local maximum to the smooth. Since then, I've been working to optimise the span and w thresholds for the loess and rollapply functions, respectively.
However, I have realised that several of my relationships have a peak at the beginning or the end of the x-axis, and those peaks are of interest to me, but they are not being identified. For now, I've tried adding fake values outside of my x range to imitate a peak. For example, if my x values range from -50 to 160, I created x values of -100 and 210 and assigned a y value of 0 to them.
This helped me to identify some of the relationships that have a peak at the beginning or the end. However, for some it does not work.
Besides the fact that I feel uncomfortable adding "fake" values to the relationship, the smoothing frequently shifts the location of the peak and, more importantly, I cannot find a solution that detects these beginning or end peaks. Does anyone know of a solution that works in R?
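For what it's worth, here is a minimal R sketch of the padding idea described above, on made-up data. The span and w values are placeholders standing in for whatever the tuning produces, and the peak test (the centre of a rolling window being the maximum of the loess smooth within that window) follows the approach referenced in the question.
library(zoo)

set.seed(1)
x <- seq(-50, 160, by = 1)
y <- exp(-(x + 45)^2 / 50) + 0.5 * exp(-(x - 60)^2 / 200) + rnorm(length(x), sd = 0.02)

# Pad the series with artificial zero-valued points outside the observed range,
# so a peak sitting right at the boundary has a "valley" on both sides.
x_pad <- c(-100, x, 210)
y_pad <- c(0, y, 0)

span <- 0.1   # loess smoothing parameter (assumed placeholder)
w    <- 10    # half-width of the rolling window (assumed placeholder)

smoothed <- predict(loess(y_pad ~ x_pad, span = span))

# A point is a peak if it is the maximum of its rolling window on the smoothed curve.
is_max <- rollapply(zoo(smoothed), 2 * w + 1,
                    function(v) which.max(v) == w + 1,
                    align = "center", fill = FALSE)
peaks <- x_pad[which(coredata(is_max))]
peaks[peaks >= min(x) & peaks <= max(x)]   # drop the padded endpoints themselves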

Adding jitter on a log transformed axis in R [duplicate]

This question already has an answer here:
height/width arguments to geom_jitter interact with log scales
(1 answer)
Closed 4 months ago.
I'm learning more about ggplot and came across a situation where I had to use jitter on a sqrt-transformed axis where some values were 0. Since you can't take the sqrt of a negative number, the following argument was added to the code:
ggplot(aes(x=x,y=y),data=df) + geom_jitter(alpha=0.1, position = position_jitter(h=0))
Any idea how to perform a similar operation on a log scale? For some reason I thought changing the argument to position_jitter(h=1) would do the trick, but it didn't.
The argument h is short for height and stands for the magnitude of the vertical noise (i.e. the noise on the y-axis) that jitter adds to the data, in both the positive and negative directions.
I guess that in your data set, h=0 was used to prevent the y-values from becoming negative, so that the sqrt function could be applied.
Setting h=1 in the case of a log transformation would therefore not make sense. If your original y-values are all positive, then h=0 will do the trick, for the same reason as in the sqrt case.
If some y-values are 0 (or even negative), then you cannot apply the log function anyway.
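As a minimal sketch of that advice, with made-up data (the data frame and the jitter width are assumptions): horizontal-only jitter (height = 0, which is what h=0 abbreviates) on a log-scaled y-axis, so the jitter can never push a y-value to or below zero.
library(ggplot2)

df <- data.frame(
  x = factor(rep(letters[1:3], each = 50)),
  y = rlnorm(150, meanlog = 0, sdlog = 1)   # strictly positive y-values
)

ggplot(df, aes(x = x, y = y)) +
  geom_jitter(alpha = 0.3, position = position_jitter(width = 0.2, height = 0)) +
  scale_y_log10()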

How to draw millions of lines fast

The x values can be treated as 1:n.
The y values lie in a limited range, [-1, 1].
I want to draw line segments connecting all the points described by the vectors above:
geom_line(aes(x, y))
Everything works well except for the performance: it takes minutes to render the final image. A sample plot is shown below.
Is there any way to improve the performance?
Thank you for your comments. I did try resampling, but it's very hard for me to do a really "smart" resampling, since we care a lot about the "out of local mean" values, which would usually be considered "noise" in many statistical settings. Please allow me to show the problem with an image, though that's not encouraged.
The image above is the original one, while the one below is the resampled one. I marked the "important" information that gets lost with arrows in the original image.
Thanks a lot for all the comments. Eventually I think I can resolve this by aggregating hundreds of values into one line range.
To be more specific, assume there are 1M points:
Group them into 10K groups, with 100 points in each group.
Get the min & max values of each group.
For each group, draw a vertical line from min to max.
By doing this aggregation I can reduce the data to 1/group.size of its original size (a minimal sketch is shown after this post).
Still, it surprises me a little that drawing one line can take tens of microseconds in R. At the very beginning I was wondering whether there is any solution like "hardware acceleration".
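Here is a minimal sketch of the min/max aggregation described above, on made-up data; the dplyr grouping and geom_linerange are one possible way to express it, not necessarily what the poster ended up using.
library(ggplot2)
library(dplyr)

n <- 1e6
group_size <- 100
df <- data.frame(x = 1:n, y = runif(n, -1, 1))   # made-up data in [-1, 1]

# Collapse every group of 100 consecutive points into a single min/max range.
agg <- df %>%
  mutate(group = (x - 1) %/% group_size) %>%
  group_by(group) %>%
  summarise(x = first(x), ymin = min(y), ymax = max(y), .groups = "drop")

# One vertical segment per group instead of ~100 tiny line segments.
ggplot(agg, aes(x = x, ymin = ymin, ymax = ymax)) +
  geom_linerange()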

igraph layout.fruchterman.reingold outliers (example image included)

Sometimes when using a layout algorithm such as layout.fruchterman.reingold you can get some nodes that are outliers, in the sense that they extend out disproportionately from the rest of the structure. Does anyone know how to impose a maximum length on edges (such as 1) so that an edge cannot exceed the maximum length, thereby removing these outliers?
l <- layout.fruchterman.reingold(subgraph)
BTW, I'm aware of, and already employ, a scale factor to rein things in:
l <- layout.fruchterman.reingold(subgraph) * scaleFactor
There is no built-in functionality for that in the Fruchterman-Reingold algorithm (and I suspect that using xmin, ymin, xmax and ymax would not work, because it might simply "compress" the non-outlier part of the network to make more space for the outliers), but you can probably experiment with edge weights. When the FR layout algorithm is used with weights, the algorithm will strive to make edges with a larger weight shorter. You could probably try setting the weights of edges incident on "outlier" vertices (i.e. vertices with degree 1 or 2) to a smaller value. Another possibility is to make the edge weights depend on the degrees of both endpoints such that smaller degrees are mapped to smaller values but larger degrees are not mapped to disproportionately larger values - maybe the geometric mean of the degrees of the two endpoints could be useful here. But there is no "universal" solution as far as I know, so you'll have to experiment a bit.
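As a rough sketch of the degree-based weighting idea (the example graph is made up, and the exact weight formula - here the geometric mean of the endpoint degrees - is just one thing to experiment with, as the answer says):
library(igraph)

set.seed(42)
subgraph <- sample_pa(100, m = 2, directed = FALSE)   # made-up example graph

deg <- degree(subgraph)
ep  <- ends(subgraph, E(subgraph), names = FALSE)     # endpoint ids of every edge
w   <- sqrt(deg[ep[, 1]] * deg[ep[, 2]])              # geometric mean of endpoint degrees

l <- layout.fruchterman.reingold(subgraph, weights = w)
plot(subgraph, layout = l, vertex.size = 4, vertex.label = NA)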
When asking a question with an example that is dependent on non-base functions, please remember to note which package they live in.
(for those wondering, it's in igraph).
igraph's documentation for the fruchterman-reingold layout method contains, in "arguments":
xmin,xmax
The limits for the first coordinate, if one of them or both are NULL then no normalization is performed along this direction.
ymin,ymax
The limits for the second coordinate, if one of them or both are NULL then no normalization is performed along this direction.
zmin,zmax
The limits for the third coordinate, if one of them or both are NULL then no normalization is performed along this direction.
...so, set limits on x and y? Z isn't necessary unless it's a three-dimensional graph.
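Something along these lines, using the argument names from the documentation quoted above (other igraph releases name these minx/maxx and so on, so check the documentation of the installed version):
l <- layout.fruchterman.reingold(subgraph,
                                 xmin = -1, xmax = 1,
                                 ymin = -1, ymax = 1)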
