Adding jitter on a log-transformed axis in R [duplicate]

This question already has an answer here:
height/width arguments to geom_jitter interact with log scales
I'm learning more about ggplot and came across a situation where I had to use jitter on a sqrt-transformed axis where some values were 0. Since you can't take the sqrt of a negative number, the following argument was added to the code:
ggplot(aes(x=x,y=y),data=df) + geom_jitter(alpha=0.1, position = position_jitter(h=0))
Any idea how to perform a similar operation on a log scale? For some reason I thought changing the argument to position_jitter(h=1) would do the trick, but it didn't.

The argument h is short for height and stands for the magnitude of the vertical noise (i.e. the noise on the y-axis) that jitter adds to the data, in both the positive and the negative direction.
I guess that in your data set h=0 was used to prevent the y-values from becoming negative, so that the sqrt function could be applied.
Setting h=1 in the case of a log transformation would therefore not make sense. If your original y-values are all positive, then h=0 will do the trick for the same reason as in the sqrt case.
If some y-values are 0 (or even negative), then you cannot apply the log function anyway.
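For illustration, here is a minimal sketch of the same idea on a log scale (the data frame and the jitter width below are made up, not from the question): vertical jitter is switched off entirely, so the positive y-values stay positive and the log transformation never sees a non-positive value.
library(ggplot2)
# made-up example data with strictly positive y
df <- data.frame(x = rep(1:3, each = 50), y = rexp(150) + 0.1)
ggplot(df, aes(x = x, y = y)) +
  geom_jitter(alpha = 0.1, position = position_jitter(width = 0.2, height = 0)) +  # no vertical noise
  scale_y_log10()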

Related

Histogram() doesn't display the values at the edge of the defined range

I can't get Julia to display edge values on histograms when defining a range for the bins. Here is a minimal example:
using Plots
x = [0,0.5,1]
plot(histogram(x, bins=range(0,1,length=3)))
Defining them explicitly doesn't help (bins=[0,0.3,0.7,1]). It seems that histogram() excludes the limits of the range. I can extend the range to make it work:
plot(histogram(x, bins=[0,0.3,0.7,1.01]))
But I really don't think that should be the way to go. Surprisingly, fixing the number of bins does work (nbins=3) but I need to keep the width of all the bins the same and constant across different runs for comparison purposes.
I have tried with Plots, PlotlyJS and StatsBase (with fit() and its closed attribute) to no avail. Maybe I'm missing something, so I wanted to ask: is it possible to do what I want?
Try:
plot(histogram(x, bins=range(0,nextfloat(1.0),length=3)))
Although this extends the range, it does so in a minimal way; essentially the smallest extension that makes the right end of the histogram closed.
As for equal widths: when dealing with floating-point numbers, "equal widths" can mean different things - equal in terms of real numbers (which are not always representable), or, for example, equal in terms of the number of representable values, which can differ between [0.0,1.0] and [1.0,2.0].
So hopefully this scratches the itch in the OP.
https://juliastats.org/StatsBase.jl/latest/empirical/#StatsBase.Histogram
most importantly:
closed: A symbol with value :right or :left indicating on which side bins (half-open intervals or higher-dimensional analogues thereof) are closed. See below for an example.
This is very common in histogram implementations; NumPy, for example:
In [8]: np.histogram([0], bins=[0, 1, 2])
Out[8]: (array([1, 0]), array([0, 1, 2]))
In [9]: np.histogram([1], bins=[0, 1, 2])
Out[9]: (array([0, 1]), array([0, 1, 2]))
NumPy has the inconsistency that the last bin is closed on both sides, but it's perfectly normal for every bin to be closed on one side only.
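As an aside (not part of the original answer), base R's hist() exposes the same choice through its right and include.lowest arguments, with the analogous last-bin behaviour:
# default is right-closed bins (0,1], (1,2], with the lowest break included
hist(1, breaks = c(0, 1, 2), plot = FALSE)$counts                 # 1 0
# right = FALSE gives left-closed bins [0,1), [1,2), like NumPy,
# and include.lowest (TRUE by default) closes the last bin at 2
hist(1, breaks = c(0, 1, 2), right = FALSE, plot = FALSE)$counts  # 0 1
hist(2, breaks = c(0, 1, 2), right = FALSE, plot = FALSE)$counts  # 0 1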

Detect peaks at beginning and end of x-axis

I've been working on detecting peaks within a data set of thousands of y~x relationships. Thanks to this post, I've been using loess and rollapply to detect peaks by comparing the local maximum to the smooth. Since then, I've been working to optimise the span and w thresholds for the loess and rollapply functions, respectively.
However, I have realised that several of my relationships have a peak at the beginning or the end of the x-axis, and these peaks are of interest to me, but they are not being identified. For now, I've tried to add fake values outside my x-variable range to imitate a peak. For example, if my x values range from -50 to 160, I created x values of -100 and 210 and assigned a y value of 0 to them.
This helped me to identify some of the relationships that have a peak at the beginning or the end.
However, for some it does not work.
Besides the fact that I feel uncomfortable adding 'fake' values to the relationship, the smoothing frequently shifts the location of the peak and, more importantly, I cannot find a solution that reliably detects these beginning or end peaks. Does anyone know how to work out a solution that works in R?
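Not an answer from the original thread, but a minimal, hedged sketch of the kind of loess + rollapply peak detection the question describes, including the zero-valued padding at both ends; the data, the span, and the window size w below are all made up for illustration:
library(zoo)

# flag points where the loess smooth equals the rolling maximum of the smooth
find_peaks <- function(x, y, w = 5, span = 0.1) {
  n <- length(y)
  y_smooth <- loess(y ~ x, span = span)$fitted
  y_max <- rollapply(y_smooth, 2 * w + 1, max, align = "center")
  interior <- (w + 1):(n - w)
  peak_idx <- interior[y_smooth[interior] == y_max]
  list(x = x[peak_idx], y = y[peak_idx])
}

# made-up example data with one peak at the start of the x range
set.seed(1)
x <- seq(-50, 160, by = 1)
y <- dnorm(x, -50, 10) + dnorm(x, 80, 15) + rnorm(length(x), sd = 0.002)

# pad both ends with fake zero-valued points outside the observed x range,
# so a peak at x = -50 can also be flagged
x_pad <- c(min(x) - 50, x, max(x) + 50)
y_pad <- c(0, y, 0)

find_peaks(x_pad, y_pad)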

Does scaling time series (using min/max scaling) affect cross-correlation?

I have two time series and I want to find the correlation between them. However, they were on completely different scales before, so I thought I should normalize them between 0 and 1 to have a better understanding of what is going on. To do so, I did something along the lines of:
ts1 <- (ts1$price - min(ts1$price)) / (max(ts1$price) - min(ts1$price))
ts2 <- (ts2$price - min(ts2$price)) / (max(ts2$price) - min(ts2$price))
However, when I compute the cross-correlation before and after normalizing (using the ccf function in R), I get the same thing. Should that happen? Does scaling a time series not affect cross-correlation (or is it that I am scaling both time series and the effects therefore cancel out)? I would definitely like to have greater intuition about how this works.
Thanks!
That's exactly as expected, no worries.
Translations (subtracting a constant) and scalings (multiplying by a constant) have no effect on correlation. Since min/max scaling is simply the combination of a translation and a scaling (with no shear), it has no effect on cross-correlation.
This is easy to understand if you remember that the definition of correlation already subtracts off the mean of both datasets (making it invariant under translations) and divides by the product of the two standard deviations at the end (making it invariant under positive scalings).
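A quick sketch in R that confirms this with ccf() (the two series below are made up, not the OP's data):
# made-up series: s2 is a lagged, rescaled copy of s1 plus noise
set.seed(42)
n <- 200
s1 <- cumsum(rnorm(n)) + 100                         # arbitrary offset and scale
s2 <- 0.5 * c(rep(0, 5), s1[1:(n - 5)]) + rnorm(n)
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
cc_raw    <- ccf(s1, s2, plot = FALSE)
cc_scaled <- ccf(rescale01(s1), rescale01(s2), plot = FALSE)
all.equal(cc_raw$acf, cc_scaled$acf)   # TRUE: identical up to floating-point error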

Linear Scale vs. Log Scale

This may be a very simple question, but I haven't been in touch with algorithms in a while.
I have a logarithmic scale of 20, 100, 500, 2500, 12500 which maps to 1, 2, 3, 4, 5 respectively. Now, I want to find out where the value 225 would lie on the scale above. And, going the other way round, how would I find out what the value 2.3 corresponds to on the original scale? Would be great if someone could help me with the answer and an explanation.
Note that each step in the scale multiplies the previous step by 5.
So the explicit formula for your output is
y = 4 * 5^x
or
x = log-base-5(y/4)
where
log-base-5(n) = log(n)/log(5)
if you want to compute it in code. That last line is the change-of-base formula. You can use either a natural log or a common log on the right side of the formula; it doesn't matter.
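As a quick check of those formulas in R (nothing assumed beyond the 20, 100, 500, 2500, 12500 <-> 1..5 mapping stated above):
value_to_pos <- function(y) log(y / 4) / log(5)   # position on the 1..5 scale
pos_to_value <- function(x) 4 * 5^x               # value on the original scale
value_to_pos(c(20, 100, 500, 2500, 12500))        # 1 2 3 4 5
value_to_pos(225)                                 # ~2.50
pos_to_value(2.3)                                 # ~162.1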

Drawing an iso line of a 2D implicit scalar field

I have an implicit scalar field defined in 2D: for every point in 2D I can compute an exact scalar value, but it's a somewhat complex computation.
I would like to draw an iso-line of that surface, say the line of the 0 value. The function itself is continuous, but the 0 iso-line can consist of multiple continuous pieces, and it is not guaranteed that all of them are connected.
Calculating the value for each pixel is not an option because that would take too much time - in the order of a few seconds - and this needs to be as close to real time as possible.
What I'm currently using is a recursive division of space which can be thought of as a kind of quad-tree. I take an initial, very coarse sampling of the space and, if I find a square which contains a transition from positive to negative values, I recursively divide it into 4 smaller squares and check again, stopping at the pixel level. The positive-negative transition is detected by sampling a square at its 4 corners.
This works fairly well, except when it doesn't. The iso-lines which are drawn sometimes get cut because the transition detection fails for transitions which happen in a small area of an edge and don't cross a corner of a square.
Is there a better way to do iso-line drawing in this setting?
I've had a lot of success with the algorithms described here: http://web.archive.org/web/20140718130446/http://members.bellatlantic.net/~vze2vrva/thesis.html
which discuss adaptive contouring (similar to what you describe), and also some other issues with contour plotting in general.
There is no general way to guarantee finding all the contours of a function without looking at every pixel. There could be a very small closed contour enclosing a region only about the size of a pixel where the function is positive, inside a region where the function is generally negative. Unless you sample finely enough to place a sample inside the positive region, there is no general way of knowing that it is there.
If your function is smooth enough, you may be able to guess where such small closed contours lie, because the modulus of the function gets small in a region surrounding them. The sampling could then be refined in these regions only.
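For concreteness, here is a minimal R sketch of the corner-sign subdivision described in the question (the field f, the domain, and the cell sizes are all made up; it collects leaf cells rather than drawing line segments, and it inherits exactly the missed-contour caveat above):
# made-up example field: 0 iso-line is the unit circle
f <- function(x, y) x^2 + y^2 - 1

# recursively subdivide a square while its corner values change sign
iso_cells <- function(f, x0, y0, size, min_size) {
  corners <- c(f(x0, y0), f(x0 + size, y0),
               f(x0, y0 + size), f(x0 + size, y0 + size))
  # no sign change at the corners: assume (possibly wrongly) no contour here
  if (all(corners > 0) || all(corners < 0)) return(NULL)
  if (size <= min_size) return(data.frame(x = x0, y = y0, size = size))
  h <- size / 2
  do.call(rbind, list(iso_cells(f, x0,     y0,     h, min_size),
                      iso_cells(f, x0 + h, y0,     h, min_size),
                      iso_cells(f, x0,     y0 + h, h, min_size),
                      iso_cells(f, x0 + h, y0 + h, h, min_size)))
}

# coarse initial grid over [-2,2] x [-2,2], refined down to a "pixel" size of 0.02
starts <- expand.grid(x = seq(-2, 1.5, by = 0.5), y = seq(-2, 1.5, by = 0.5))
cells  <- do.call(rbind, Map(function(x, y) iso_cells(f, x, y, 0.5, 0.02),
                             starts$x, starts$y))
plot(cells$x, cells$y, pch = ".", asp = 1)   # leaf cells straddling the 0 iso-line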
