I needed to logarithmically distribute the range 2-200 into a given number of intervals. The formula used in another question worked perfectly; I just couldn't explain how or why it works.
Can anyone explain how this function works? Perhaps derive it for me?
It's projecting evenly-spaced intervals to logarithmically-spaced intervals by raising some number X to each interval.
The example distributes the range 1-1000 into 10 intervals. The end result is that some number X to the 10th power will be 1,000 (X^10 = 1,000). So X is 1,000 ^ (1/10) = 1.99526
Now the range
1 2 3 4 5 6 7 8 9 10
is projected to a logarithmic range using the function
f(x) = 1.99526 ^ x
which results in
1.99526, 3.98107, 7.94328, 15.8489, 31.6228, 63.0957, 125.893, 251.189, 501.187, 1000
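For reference, here is a small R sketch of the same projection (the helper name log_breaks is my own, not from the original question); with lo = 1, hi = 1000, n = 10 it reproduces the numbers above, and lo = 2, hi = 200 covers the range from the question:
log_breaks <- function(lo, hi, n) {
  ratio <- (hi / lo)^(1 / n)  # the base X: applying it n times takes lo to hi
  lo * ratio^(1:n)            # logarithmically spaced points
}
log_breaks(1, 1000, 10)  # 1.995262 3.981072 ... 1000
log_breaks(2, 200, 10)   # the 2-200 range from the question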
Related
I have run a simple linear regression in R with two variables and got the following relation:
y = 30000+1.95x
which is a reasonably good fit. My only concern is that, practically speaking, the point (0,0) should be included in the model.
Is there any help with the math that I can get, please?
I needed to post the data somehow... and here it is. This should give a better picture of the problem now.
There are more such data sets available. This is data collected for a marketing strategy.
The objective is to obtain a relation between sales and spend so that we can predict the spend amount that we need in order to obtain a certain amount of sales.
All help will be appreciated.
This is not an answer, but rather a comment with graphics.
I converted the month data to "elapsed months", starting with 1 as the first month, then 2, then 3 etc. This allowed me to view the data in 3D, and as you can see from the 3D scatterplot below, both Spend and Sales are related to the number of months that have passed. I also scaled the financial data in thousands so I could more easily read the plots.
I fit the data to a simple flat surface equation of the form "z = f(x,y)" as shown below, as this equation was suggested to me by the scatterplot. My fit of this data gave me the equation
Sales (thousands) = a + b * Months + c * Spend(thousands)
with fitted parameters
a = 2.1934871882483066E+02
b = 6.3389747441412403E+01
c = 1.0011902575903093E+00
for the following data:
Month Spend Sales
1 120.499 327.341
2 168.666 548.424
3 334.308 978.437
4 311.963 885.522
5 275.592 696.238
6 405.845 1268.859
7 399.824 1054.429
8 343.622 1193.147
9 619.030 1118.420
10 541.674 985.816
11 701.460 1263.009
12 957.681 1960.920
13 479.050 1240.943
14 552.718 1821.106
15 633.517 1959.944
16 527.424 2351.679
17 1050.231 2419.749
18 583.889 2104.677
19 322.356 1373.471
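For reference, a minimal R sketch reproducing this planar fit (a sketch only: it assumes the table above has been read into a data frame called dat with columns Month, Spend and Sales, with the financial columns already in thousands; dat and fit are my own names):
fit <- lm(Sales ~ Month + Spend, data = dat)  # Sales = a + b*Months + c*Spend
coef(fit)  # should come out close to a = 219.3, b = 63.4, c = 1.00 quoted above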
If you want to include the point (0,0) in your regression line, this would mean setting the intercept to zero.
In R you can achieve this by
mod_nointercept <- lm(y ~ 0 + x)
In this model only the slope beta is fitted, and alpha (i.e. the intercept) is fixed at zero.
I am looking to use dbinom() in R to generate a probability. The documentation gives dbinom(x, size, prob, log = FALSE), and I understand what all of the arguments mean except x, which is described as a "vector of quantiles". Can anyone explain what that means in context? Say I would like to find the probability of obtaining the number 5 exactly twice if I sample 10 times from the numbers 1-5. In this case the binomial probability would be
choose(10, 2) * (1/5)^2 * (4/5)^8
In your example the "number of times you see a five" is the quantile of interest. Loosely speaking, a "quantile" is a possible value of a random variable. So if you want to find the probability of seeing a 5 x = 2 times out of size = 10 draws where each number has prob = 1 / 5 of being drawn you would enter dbinom(2, 10, 1 / 5).
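A quick numerical check in R that dbinom() agrees with the hand calculation from the question:
dbinom(2, size = 10, prob = 1/5)   # probability of exactly two fives in ten draws
choose(10, 2) * (1/5)^2 * (4/5)^8  # same value, roughly 0.302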
I am trying to generate random numbers from a custom distribution. I already found this question:
Simulate from an (arbitrary) continuous probability distribution
but unfortunately it does not help me, since the approach suggested there requires a formula for the distribution function. My distribution is a combination of multiple uniform distributions; basically the density function looks like a histogram. An example would be:
f(x) = {
0 for x < 1
0.5 for 1 <= x < 2
0.25 for 2 <= x < 4
0 for 4 <= x
}
You just need the inverse CDF method:
samplef <- function (n) {
  x <- runif(n)                      # n uniform draws on (0, 1)
  ifelse(x < 0.5, 2 * x + 1, 4 * x)  # apply the inverse CDF piecewise
}
Compute the CDF yourself to verify that:
F(x) = 0              x < 1
       0.5 * x - 0.5  1 <= x < 2
       0.25 * x       2 <= x < 4
       1              x >= 4
so that its inverse is:
invF(x) = 2 * x + 1   0 <= x < 0.5
          4 * x       0.5 <= x < 1
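A quick sanity check of the sampler against the target density (using samplef() from above; the probability mass in each interval should come out near 0.5):
set.seed(1)
x <- samplef(1e5)
mean(x >= 1 & x < 2)  # ~0.5, density 0.5 over an interval of width 1
mean(x >= 2 & x < 4)  # ~0.5, density 0.25 over an interval of width 2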
You can combine various efficient methods for sampling from discrete distributions with a continuous uniform.
That is, simulate from the integer part Y=[X] of your variable, which has a discrete distribution with probability equal to the probability of being in each interval (such as via the table method - a.k.a. the alias method), and then simply add a random uniform U on [0,1): X = Y + U.
In your example, you have Y taking the values 1, 2, 3 with probabilities 0.5, 0.25 and 0.25 (which is equivalent to sampling from 1, 1, 2, 3 with equal probability) and then add a random uniform.
If your "histogram" is really large this can be a very fast approach.
In R you could do a simple (if not especially efficient) version of this via
sample(c(1,1,2,3),1)+runif(1)
or
sample(c(1,1,2,3),n,replace=TRUE)+runif(n)
and more generally you could use the probability weights argument in sample.
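For instance, a sketch of that more general form for the example density above (unit-width bins starting at 1, 2 and 3 with probabilities 0.5, 0.25 and 0.25):
n <- 10000
y <- sample(c(1, 2, 3), n, replace = TRUE, prob = c(0.5, 0.25, 0.25))  # discrete part
x <- y + runif(n)                                                      # uniform offset within each unit-width bin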
If you need more speed than this gets you (and for some applications you might, especially with big histograms and really big sample sizes), you can speed up the discrete part quite a bit using approaches mentioned at the link, and programming the workhorse part of that function in a lower level language (in C, say).
That said, even just using the above code with a considerably "bigger" histogram -- dozens to hundreds of bins -- this approach seems - even on my fairly nondescript laptop - to be able to generate a million random values in well under a second, so for many applications this will be fine.
If the custom distribution from which you are drawing a random number is defined by empirical observations, then you can also just draw from an empirical distribution using the remp() function from the fishmethods package.
https://rdrr.io/cran/fishmethods/man/remp.html
I am using the geoR package for spatial interpolation of rainfall. I have to say that I am quite new to geostatistics. Thanks to some video tutorials on YouTube, I understood (well, I think so) the theory behind the variogram. As per my understanding, the number of pairs should decrease with increasing lag distance. For example, if we consider a 100 m long stretch (say a 100 m long cross section of a river bed), the number of pairs for a 5 m lag is 20, the number of pairs for a 10 m lag is 10, and so on. But I am kind of confused by the output from the variog function in the geoR package. An example is given below:
mydata
X Y a
[1,] 415720 432795 2.551415
[2,] 415513 432834 2.553177
[3,] 415325 432740 2.824652
[4,] 415356 432847 2.751844
[5,] 415374 432858 2.194091
[6,] 415426 432774 2.598897
[7,] 415395 432811 2.699066
[8,] 415626 432762 2.916368
this is my dataset, where a is my variable (rainfall intensity) and X, Y are the coordinates of the points. The variogram calculation is shown below:
geodata <- as.geodata(mydata)
variogram <- variog(geodata, coords = geodata$coords, data = geodata$data)
variogram[1:3]
$u
[1] 46.01662 107.37212 138.04987 199.40537 291.43861 352.79411
$v
[1] 0.044636453 0.025991469 0.109742986 0.029081575 0.006289056 0.041963076
$n
[1] 3 8 3 3 3 2
where
u: a vector with distances.
v: a vector with estimated variogram values at distances given in u.
n: number of pairs in each bin
According to this, the number of pairs (n) follows an irregular pattern, whereas the corresponding lag distance (u) is increasing. I find it hard to understand this. Can anyone explain what is happening? Also, any suggestions/advice to improve the variogram calculation for this application (spatial interpolation of rainfall intensity) is highly appreciated, as I am new to geostatistics. Thanks in advance.
On a linear transect of 100 m with 5 m regular spacing between observations, if you'd have 20 pairs at 5 m lag, you'd have 19 pairs at 10 m lag. This idea does not hold for your data, because they are irregularly distributed, and they are distributed over two dimensions. For irregularly distributed data, you often have very few point pairs for the very short distances. The advice for obtaining a better looking variogram is to work with a larger data set: geostatistics starts getting interesting with 30 observations, and fun with over 100 observations.
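A quick way to see where the pair counts come from (a sketch, assuming mydata is the matrix of coordinates and values printed in the question): with 8 irregularly placed points there are only choose(8, 2) = 28 pairs, and how their distances fall into the lag bins depends entirely on the 2D layout.
d <- as.numeric(dist(mydata[, 1:2]))  # the 28 pairwise distances between the 8 points
hist(d, breaks = 6)                   # counts per distance bin are irregular, not monotonically decreasing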
Say I have the following example distributions (vectors) of numbers in C++:
vector 1 vector 2 vector 3
11 4 65
128 6 66
12 4 64
13 4 62
12 5 65
14 5 63
16 7 190
60 3 210
120 4 220
126 5 242
77 6 231
14 4 210
12 7 222
13 6 260
11 8 300
14 6 233
99 80
15 66
13
I need to find a threshold for each vector. I'll eliminate the larger ("bad") numbers in each if they are above that vector's threshold. I want to re-use this method to find a threshold on other similar vectors in the future. The numbers aren't necessarily mostly smaller "good" numbers.
The threshold would ideally be just a hair larger than most of the smaller "good" numbers. For example, the first vector's ideal threshold value would be around 17 or 18, the second's would be about 8, and the third's would be around 68-70.
I realize this is probably simple math but since I'm horrible at math in general, I would really appreciate a code example on how to find this magical threshold, in either C++ or Objective-C specifically, which is why I'm posting this in SO and not on the Math site.
Some things I've tried
float threshold = mean_of_vector;
float threshold = mean_of_vector / 1.5f;
float threshold = ((max_of_vector - min_of_vector) / 2.0f) + mean_of_vector;
Each of these seems to have its own issues, e.g. some include too many of the "good" average numbers (so the threshold was too low), some not enough good numbers (threshold too high), or not enough of the "bad" numbers. Sometimes they'll work with specific vectors of numbers, for example if the standard deviation is high, but not with others where the standard deviation is low.
I'm thinking the method would involve standard deviation and/or some sort of gaussian distribution, but I don't know how to piece them together to get the desired result.
Edit: I am able to re-sort the vectors.
You could just eliminate the values above the 90th or 95th percentile.
Technically, you calculate the p = 0.9 (or 0.95) percentile of the array distribution.
Just sort the array ascending:
int[] data;
Arrays.sort(data); // or use ArrayList<Integer> which has Collections.sort(dataArrayList),
Then calculate position of percentile p:
float p = 0.9f;               // e.g. p = 0.9 for the 90th percentile
float pos = data.length * p;  // position of the percentile
// cut off the fractional part.
int posInt = (int) pos;
// this is the threshold value
int threshold = data[posInt];
Now filter the array by keeping all values < or <= the threshold.
This keeps the 90% of smallest values.
int i = 0;
while (i < data.length && data[i] <= threshold) {
    // output data[i];
    i++;
}
For mathematically "perfect" results you could search for "calculate percentile of a discrete array/values".
As I remember, there are two valid conventions, describing whether one has to round posInt down or up. In my example above I just truncated.
An idea would be to compute both the mean mu and the standard deviation sigma (e.g. using the algorithm described at "Accurately computing running variance" ) and use both of them for the definition of your threshold.
If your data are assumed to be Gaussian, you know that about 97.7% of your data should be below mu + 2*sigma (97.5% below mu + 1.96*sigma), so that can be a good threshold.
Remark: you might want to recompute your threshold once you have rejected the extreme values since these values can have a significant impact on the mean and the standard deviation.
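A compact sketch of this rule (in R for brevity, since the arithmetic is the same in any language; v stands for one of the vectors above):
threshold <- mean(v) + 2 * sd(v)         # Gaussian-style cutoff at mu + 2*sigma
good <- v[v <= threshold]                # keep only the values below the cutoff
threshold2 <- mean(good) + 2 * sd(good)  # optional second pass after rejecting extremes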
EDIT:
I just computed the thresholds using the method I proposed and the results do not look satisfactory for you: for the first case the threshold is around 130 (so maybe taking 1.5 sigma could help get rid of the largest entries), for the second case the threshold is around 8, and for the third case the threshold is around 262.
Actually, I'm not that surprised by these results: for your last example, you want to get rid of more than half of the data! Assuming that the data are Gaussian with just a few extreme values is far from what you have at hand...