Convert log10-Data to a 1-11 scale - math

I want to convert my data to a 1-11 scale. At first, I took the raw data and converted them with the following formula, which I took from here:
scale_min + float(x-data_min)*float(scale_max-scale_min)/(data_max-data_min)
However, since my data has extreme outliers, this makes the differences in most of the data impossible to see.
I thought about taking the log10 of the data and then converting those values to my 1-11 scale, however, the above formula does not work with negative values, which can occur with log10.
Is there any other way of mapping log10 values to a 1-11 scale? Or any other way of mapping data with extreme outliers to a 1-11 scale in a way which makes differences in smaller values perceivable, while not losing any of the information?
An example for my values would be
[0.1, 10, 300, 500000]

I don't see a reason why your formula shouldn't work with negative numbers. It is a simple affine function.
Example in python:
from math import log10
data = [0.1, 10, 300, 500000]
logdata = [log10(x) for x in data]
a = min(logdata)
b = max(logdata)
data1_11 = [1 + (y - a) * 10 / (b - a) for y in logdata]
print(data1_11)
# approximately [1.0, 3.99, 6.19, 11.0]

Related

In R, how do I count the number of data points on a scatter plot within a cell of custom dimensions?

Let's just say I have the following scatterplot:
set.seed(665544)
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
plot(x)
I want to divide this scatterplot into square cells of a specified size and then count how many points fall into each unique cell. This will essentially give me the local density value of that cell. What is the best way of doing this? Is there an R package that can help? Perhaps a 2D histogram method like in Matlab?
Quick clarifications:
1.) I'd like the function/method to take the following 3 arguments: dimensions of total area, dimensions of cell (OR number of cells), and the data. It would then perhaps output a matrix where each value corresponds to a cell's point count.
2.) Q: Why do you want to use this method to determine local density? Isn't this much easier:
library(dbscan)
pointdensity(x, eps = .1, type = "frequency")
A: This method calculates the local density around each point. Though easy, this definition of local density then makes it very difficult (optimization algorithms necessary) to assign new data in a way that it matches the local density distribution of the original data set.
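The function described in clarification 1.) can be sketched as follows. This is only an illustration in Python rather than R, using nothing beyond the standard library; the name count_points_in_cells and its argument layout are hypothetical, and it assumes square cells and that points outside the total area are simply ignored.

```python
import math

def count_points_in_cells(points, x_range, y_range, cell_size):
    """Count how many points fall into each square cell of a grid.

    Returns a list of rows (indexed by y cell) of columns (indexed by x cell).
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    n_cols = math.ceil((x_max - x_min) / cell_size)
    n_rows = math.ceil((y_max - y_min) / cell_size)
    counts = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # assumption: points outside the total area are ignored
        # Points lying exactly on the upper edge are put into the last cell.
        col = min(int((x - x_min) // cell_size), n_cols - 1)
        row = min(int((y - y_min) // cell_size), n_rows - 1)
        counts[row][col] += 1
    return counts

demo = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.5), (1.9, 1.9), (2.0, 2.0)]
grid = count_points_in_cells(demo, (0, 2), (0, 2), cell_size=1)
print(grid)  # [[2, 1], [0, 2]]
```

Dividing each count by the cell area would turn the matrix into a local density estimate.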

predict x value with given y using loess

I have a dataset from a biological experiment:
x = c(0.488, 0.977, 1.953, 3.906, 7.812, 15.625, 31.250, 62.500, 125.000, 250.000, 500.000, 1000.000)
y = c(0.933, 1.036, 1.112, 1.627, 2.646, 5.366, 11.115, 2.355, 1.266, 0, 0, 0)
plot(log(x),y)
x represents a concentration and y represents the response in our assay.
The plot can be found here: 1
How can I predict the x-value (concentration) of a pre-defined y-value (in my case 1.5)?
After a loess smoothing I can predict the y-value at a defined x-value. See the example:
smooth_data <- loess(y~log(x))
predict(smooth_data, 1.07) # which gives 1.5
Using the predict function, both 1.07 and 5.185 (values on the log(x) scale) result in y = 1.5
Is there a convenient way to get the estimates from the loess regression at y = 1.5 without manually typing some x values into the predict function?
Any suggestions?
I guess your x and y values are pairs, so f(0.488) = 0.933 and so on?
This is more of a math problem, in my opinion :).
If you could define a function that describes your graph, it would be pretty easy.
You could also draw a straight line between each pair of adjacent points, and for every segment that crosses your y value you would get a corresponding x value. But straight lines wouldn't be very precise.
If you have enough pairs you could also train a neural network. That might give you the best results, but it takes some time and a lot of pairs to train well.
Could you clarify your question a bit and tell us what you are looking for? A way to do it, or a code example?
I hope this helps you at least a little bit :)
Since your function is not monotonic, there is no true inverse, but if you split it into two functions - one for x < maximum and one for x > maximum - you can just create two inverse functions and solve for whatever values of y you want.
smooth_data <- loess(y~log(x))
X = seq(0,6.9,0.1)
P = predict(smooth_data, X)
M = which.max(P)
Inverse1 = approxfun(X[1:M] ~ P[1:M])
Inverse2 = approxfun(X[M:length(X)] ~ P[M:length(X)])
Inverse1(1.5)
[1] 1.068267
predict(smooth_data, 1.068267)
[1] 1.498854
Inverse2(1.5)
[1] 5.185876
predict(smooth_data, 5.185876)
[1] 1.499585
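The same split-at-the-maximum idea can be sketched in Python. This is a minimal illustration, not a port of the R code: it inverts a piecewise-linear curve through sampled points instead of a loess fit, the names invert_branch and invert_unimodal are hypothetical, and it assumes the curve has a single peak.

```python
def invert_branch(target, xs, ys):
    """Return the x at which the piecewise-linear curve through (xs, ys)
    reaches target, or None if this branch never reaches it.
    Assumes ys is monotonic along the branch."""
    for i in range(len(xs) - 1):
        x0, y0, x1, y1 = xs[i], ys[i], xs[i + 1], ys[i + 1]
        if y0 != y1 and min(y0, y1) <= target <= max(y0, y1):
            # Linear interpolation within the segment that brackets target.
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None

def invert_unimodal(xs, ys, target):
    """Split the curve at its maximum and invert both branches."""
    peak = max(range(len(ys)), key=lambda i: ys[i])
    left = invert_branch(target, xs[:peak + 1], ys[:peak + 1])
    right = invert_branch(target, xs[peak:], ys[peak:])
    return left, right

# Synthetic single-peaked curve: y = 4 - (x - 2)^2, which equals 3 at x = 1 and x = 3.
xs = [i / 100 for i in range(401)]
ys = [4 - (x - 2) ** 2 for x in xs]
left, right = invert_unimodal(xs, ys, 3.0)
print(left, right)
```

With a smoothed curve, xs and ys would be a fine grid of predictor values and the corresponding predictions, as X and P are in the R code above.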

Find correct 2D translation of a subset of coordinates

I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into an other, larger, set of 2D coordinates. Imagine for example having a Polaroid photo of a small piece of the starry sky with you out at night, and you want to hold it up in a position so they match the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# take a subset of the reference coordinates to serve as the points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should be very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield:
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represents the reference data. The left plot represents the points that I want to translate across the reference data in order to find the position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see edit history), which used optim() to minimize the sum of distances between points, will only work in the limited circumstance where the points being searched for lie in the middle of the reference point field. A solution that satisfies the question, and still seems workable for a few thousand points, is a brute-force delta-and-comparison algorithm. For each point in ref_coords, it computes the shift that aligns that point with a single point of the sampled data (the first row of my_coords_start), and then counts how many of the sampled points have another point within a minimum distance threshold (the threshold is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for(i in 1:nrow(ref_coords)) {
  x2 <- my_coords_start[,1]
  y2 <- my_coords_start[,2]
  x1 <- ref_coords[,1] + (x2[1] - ref_coords[i,1])
  y1 <- ref_coords[,2] + (y2[1] - ref_coords[i,2])
  ## Calculate all pairwise distances between reference and field data:
  dists <- dist( cbind( c(x1, x2), c(y1, y2) ), "euclidean")
  ## Only take distances for the sampled data:
  dists <- as.matrix(dists)[-1*1:length(x1),]
  ## Calculate the number of distances within the minimum
  ## distance threshold minus the diagonal portion:
  win_thresh <- sum(rowSums(dists < min_dist) > 1)
  ## If we have more "matches" than our best then calculate a new
  ## dx and dy:
  if (win_thresh > win_thresh_old) {
    win_thresh_old <- win_thresh
    dx <- (x2[1] - ref_coords[i,1])
    dy <- (y2[1] - ref_coords[i,2])
  }
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if there's anyone that solves this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.
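For comparison, the brute-force idea can be sketched in plain Python. This is an illustrative variant rather than a translation of the R code above: the function name best_translation and the tol parameter are hypothetical, and it returns the shift to add to the searched points (the direction asked for in the question) instead of shifting the reference points.

```python
import math

def best_translation(ref, pts, tol):
    """Try aligning pts[0] with every reference point and keep the shift
    under which the most shifted points land within tol of a reference point."""
    best_shift, best_count = (0.0, 0.0), -1
    px, py = pts[0]
    for rx, ry in ref:
        dx, dy = rx - px, ry - py  # candidate shift: move pts[0] onto this reference point
        count = 0
        for x, y in pts:
            mx, my = x + dx, y + dy
            if any(math.hypot(mx - qx, my - qy) < tol for qx, qy in ref):
                count += 1
        if count > best_count:
            best_shift, best_count = (dx, dy), count
    return best_shift, best_count

# Tiny made-up example: pts is a subset of ref translated by (-7, -3).
ref = [(0, 0), (10, 0), (0, 10), (7, 3), (20, 20), (13, 17)]
pts = [(0, 0), (13, 17), (6, 14)]
shift, matched = best_translation(ref, pts, tol=0.5)
print(shift, matched)  # (7, 3) 3
```

The cost grows with the number of reference points squared times the number of searched points, so for about 1000 reference points it may be worth searching with only a small subset of the points, as the question suggests.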

Sample regression, x = months, huge bandwidths

I have two vectors, x and y.
x is a vector where each entry represents a month over a period of several years, so if I have, say, 10 years of data, then length(x) = 120.
(I have used as.POSIXct, so they really are "months" in that sense, but couldn't I just have x as a numerical vector like c(1:n), since I already know which month and which year a certain element of c(1:n) corresponds to? I.e., if x = c(1:n) and the series starts in January, I know that x[13] is January of the second year, and so on.)
y is a vector where each element is an observation of a particular variable in a certain month.
So the observed data is paired like this: (january, 0.123), (february, 2.125), and so on.
I have two vectors for the months;
x1 = seq(as.POSIXct("YYYY-MM-DD", tz="GMT"),
as.POSIXct("YYYY-MM-DD", tz="GMT"),
by="month")
x2 = c(1:length(x1))
What I want to do is to run ksmooth:
plot(x1,y)
smooth = ksmooth(x2,y,"normal")
lines(smooth)
The reason that I use x1 in the plot() command is that I don't know how to otherwise get the x-axis in time.
I expected R to automatically find a decent smoothing parameter when I haven't specified one. Instead, smooth$y is equal to the input vector y! Also, a vertical bar is produced in the plot. If I replace x2 by x1 in the code above, smooth$y is NA for all values except the first and last, which equal those of the input y.
So I tried some bandwidths:
h = 0.1: now smooth$y = y, as before. A vertical bar is produced (it is the same color as I specified in the lines() command, so it must have to do with the ksmooth command).
h = 10: I get some non-strange results for smooth$y; however, a vertical bar is produced as before.
Then I tried the crazy idea of very large bandwidths:
h = 1e+06: This produced nothing when I used x1 and x2 as in the code above. When I changed x2 to x1, however, I got some good results. For h = 1e+09 (that's huge!!) I get a very nice result (a curve that fits the data and looks nice).
But is h = 1e+09 reasonable? In all the examples I have looked at, h is somewhere between 0.1 and 10, give or take. I have also heard of a rule of thumb: h should be proportional to n^(-1/5), where n is the number of data points.
I think the one thing that you are missing is that R doesn't find a decent smoothing parameter when you haven't specified anything; it just uses the default bandwidth of 0.5, which is useless in your case.
The other thing you might be missing is that in ksmooth the bandwidth is measured in the units of x. When ksmooth is given POSIXct values for x, it converts them to numeric, which is the number of seconds since the epoch. Your bandwidth is therefore measured in seconds, which is why only huge values such as 1e+06 or more produce any smoothing. When ksmooth is given a month index for x, the default bandwidth is 0.5 months, which is also too small to smooth anything.
What you want to do is specify a reasonable bandwidth for the x that you are using. Here is an example:
x1 = seq(as.POSIXct("2000-01-01", tz="GMT"),
as.POSIXct("2010-12-31", tz="GMT"),
by="month")
x2 = c(1:length(x1))
set.seed(1)
y = runif(length(x1))
plot(x1,y,type='l')
smooth = ksmooth(x2,y,"normal")
lines(x1,smooth$y,col='blue',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=2)$y,col='red',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=10)$y,col='green',lwd=2)
lines(x1,ksmooth(x2,y,'normal',bandwidth=20)$y,col='orange',lwd=2)
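To make the point about bandwidth units concrete, here is a rough Python sketch of a Gaussian kernel smoother. It is not ksmooth itself: the function name is hypothetical, the factor 0.3706506 is an assumption derived from ksmooth's documented scaling (kernel quartiles at +/- 0.25 * bandwidth, i.e. 0.25 / 0.6745), and no cutoff is applied to distant points.

```python
import math

def gaussian_kernel_smooth(x, y, bandwidth):
    """Kernel-weighted average of y at every x; bandwidth is in the units of x."""
    sd = 0.3706506 * bandwidth  # assumed ksmooth-style scaling of the normal kernel
    smoothed = []
    for x0 in x:
        weights = [math.exp(-0.5 * ((xi - x0) / sd) ** 2) for xi in x]
        total = sum(weights)
        smoothed.append(sum(w * yi for w, yi in zip(weights, y)) / total)
    return smoothed

months = list(range(1, 25))                 # x as a month index
seconds = [m * 30 * 86400 for m in months]  # the same x, roughly, in seconds
y = [0.0, 1.0] * 12                         # a deliberately jagged series

too_small = gaussian_kernel_smooth(months, y, 0.5)              # reproduces y
in_months = gaussian_kernel_smooth(months, y, 10)               # visibly smoothed
wrong_unit = gaussian_kernel_smooth(seconds, y, 10)             # 10 seconds: reproduces y
in_seconds = gaussian_kernel_smooth(seconds, y, 10 * 30 * 86400)  # matches in_months
```

A bandwidth of 10 months and a bandwidth of about 2.6e7 seconds describe the same amount of smoothing, which is why the very large values only started to work once x was a POSIXct vector.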

Scipy - data interpolation from one irregular grid to another irregular spaced grid

I am struggling with the interpolation between two grids, and I couldn't find an appropriate solution for my problem.
I have 2 different 2D grids, of which the node points are defined by their X and Y coordinates. The grid itself is not rectangular, but forms more or less a parallelogram (so the X coordinate of (i,j) is not the same as that of (i,j+1), and the Y coordinate of (i,j) is different from the Y coordinate of (i+1,j)).
Both grids have a 37*5 shape and they overlap almost entirely.
For the first grid I have for each point the X-coordinate, the Y-coordinate and a pressure value. Now I would like to interpolate this pressure distribution of the first grid onto the second grid (for which X and Y are also known at each point).
I tried different interpolation methods, but my end result was never correct due to the irregular distribution of my grid points.
Functions such as interp2d or griddata require a 1D array as input, but if I do this, the interpolated solution is wrong (even if I interpolate the pressure values from the original grid onto the original grid again, the new pressure values are miles away from the original values).
For 1D interpolation on different irregular grids I use:
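import numpy

def interpolate(X, Y, xNew):
    # Linear interpolation on an irregular 1D grid.
    # X and Y are NumPy arrays, with X sorted in ascending order.
    # Values outside [X[0], X[-1]] are clamped to the end values of Y.
    if xNew < X[0]:
        print('Interp Warning :', xNew, 'is under the interval [', X[0], ',', X[-1], ']')
        yNew = Y[0]
    elif xNew > X[-1]:
        print('Interp Warning :', xNew, 'is above the interval [', X[0], ',', X[-1], ']')
        yNew = Y[-1]
    elif xNew == X[-1]:
        yNew = Y[-1]
    else:
        # Index of the interval with X[ind] <= xNew < X[ind+1]
        ind = numpy.argmax(numpy.bitwise_and(X[:-1] <= xNew, X[1:] > xNew))
        yNew = Y[ind] + ((xNew - X[ind]) / (X[ind+1] - X[ind])) * (Y[ind+1] - Y[ind])
    return yNew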
but for 2D I thought griddata would be easier to use. Does anyone have experience with an interpolation where my input is a 2D array for the mesh and for the data?
Have another look at interp2d. http://docs.scipy.org/scipy/docs/scipy.interpolate.interpolate.interp2d/#scipy-interpolate-interp2d
Note the second example in the 'x,y' section under 'Parameters'. 'x' and 'y' are 1-D in a loose sense but they can be flattened arrays.
Should be something like this:
import scipy.interpolate
f = scipy.interpolate.interp2d([0.25, 0.5, 0.27, 0.58], [0.4, 0.8, 0.42, 0.83], [3, 4, 5, 6])
znew = f(.25, .4)
print(znew)
[ 3.]
znew = f(.26, .41) # midway between (0.25,0.4,3) and (0.27,0.42,5)
print(znew)
[ 4.01945345] # Should be 4 - close enough?
I would have thought you could pass flattened 'xnew' and 'ynew' arrays to 'f()', but I couldn't get that to work. The 'f()' function does accept the row, column syntax, but that isn't useful to you. Because of this limitation of 'f()' you will have to evaluate 'znew' inside a loop; you might look at nditer for that. Also make sure that it does what you want when '(xnew,ynew)' is outside of the '(x,y)' domain.
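The flatten-then-interpolate idea can be illustrated without any library. The sketch below swaps in plain inverse-distance weighting, which is not what interp2d or griddata compute; the names flatten and idw_interpolate and the power parameter are hypothetical. It only shows that a curvilinear grid can be treated as a list of scattered (x, y, value) triples, and that interpolating a grid onto itself then returns the original values, which is the sanity check mentioned in the question.

```python
def flatten(grid):
    """Turn a 2D list (rows of nodes) into a flat list."""
    return [v for row in grid for v in row]

def idw_interpolate(x_src, y_src, values, x_new, y_new, power=2.0):
    """Inverse-distance-weighted interpolation from one 2D node grid to another.
    All arguments are 2D lists; the result has the shape of x_new."""
    xs, ys, vs = flatten(x_src), flatten(y_src), flatten(values)
    result = []
    for row_x, row_y in zip(x_new, y_new):
        out_row = []
        for xq, yq in zip(row_x, row_y):
            num = den = 0.0
            exact = None
            for x, y, v in zip(xs, ys, vs):
                d2 = (x - xq) ** 2 + (y - yq) ** 2
                if d2 == 0.0:
                    exact = v  # query coincides with a source node
                    break
                w = d2 ** (-power / 2)
                num += w * v
                den += w
            out_row.append(exact if exact is not None else num / den)
        result.append(out_row)
    return result

# Made-up skewed (parallelogram-like) 3x4 grid with a pressure field p = 2x + 3y.
gx = [[j + 0.3 * i for j in range(4)] for i in range(3)]
gy = [[i + 0.1 * j for j in range(4)] for i in range(3)]
p = [[2 * x + 3 * y for x, y in zip(rx, ry)] for rx, ry in zip(gx, gy)]

same = idw_interpolate(gx, gy, p, gx, gy)  # grid onto itself
gx2 = [[x + 0.05 for x in row] for row in gx]
shifted = idw_interpolate(gx, gy, p, gx2, gy)  # onto a slightly shifted grid
```

With SciPy, scipy.interpolate.griddata takes the same flattened layout: the source points as an (n, 2) array, the values as a length-n array, and the target points as another (m, 2) array.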
