Is there a Julia equivalent to numpy.digitize? - julia

Given a bin range min:step:max and a value x, I want to find the bin that x falls in. Is there a better way than
floor(Int, (x-min)/step)
?

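searchsortedfirst(min:step:max, x) does what you need. It returns the index of the first edge that is greater than or equal to x, so index 1 covers values at or below min and length(min:step:max) + 1 covers values greater than max. It is also implemented efficiently for ranges thanks to multiple dispatch.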
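To make the bin conventions concrete, here is a minimal Python sketch (not Julia) that compares the arithmetic formula from the question with a binary search over the edges. The edge values and the helper names bin_by_formula and bin_by_search are hypothetical, chosen only for illustration. Python's bisect.bisect_right counts the edges that are <= x, which is the convention numpy.digitize uses by default.

```python
import bisect
import math

# Hypothetical bin edges for illustration: 0, 10, 20, 30, 40, 50
lo, step, hi = 0, 10, 50
edges = list(range(lo, hi + 1, step))

def bin_by_formula(x):
    """Zero-based bin index from the arithmetic formula in the question."""
    return math.floor((x - lo) / step)

def bin_by_search(x):
    """Zero-based bin index from a binary search over the edges.

    bisect_right returns how many edges are <= x, so subtracting 1
    gives the index of the bin whose left edge is the last one <= x.
    """
    return bisect.bisect_right(edges, x) - 1

for x in (0, 5, 10, 25, 49.9, 50):
    print(x, bin_by_formula(x), bin_by_search(x))
```

Inside the range the two agree. Outside it they differ: the search clamps to -1 below the first edge and to the last edge index above the range, while the formula keeps counting.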

Related

regarding the usage of runif function

I once saw the following R code,
x<-runif(3,max=c(10,20,30))
If min is not set, what is the lower bound for the generated random variable? Also, when max is set up this way, my understanding is that it will iterate over the three values given in c(), one for each generated variable. Is that right?
If you look at the ?runif help page, you'll see the default for min= is 0.
If you specify multiple values for max, the values are recycled, so the first value comes from unif(0,10), the second from unif(0,20) and the third from unif(0,30), and that pattern repeats for as many values as you request. If you only request one value
runif(1, max=c(10,20,30))
that would be the same as
runif(1, max=10)
This is noted in the help page under the Value section
The numerical arguments other than n are recycled to the length of the result.
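The recycling rule quoted from the help page can be imitated in a few lines of Python. This is only a sketch of the idea, not R's implementation; the function name runif_recycled and its parameters lows and highs are hypothetical.

```python
import random
from itertools import cycle, islice

def runif_recycled(n, lows=(0.0,), highs=(1.0,)):
    """Draw n uniform values, recycling lows and highs to length n.

    Hypothetical helper: mimics the way R recycles the min and max
    arguments of runif to the length of the result.
    """
    lows_n = islice(cycle(lows), n)    # repeat the lower bounds as needed
    highs_n = islice(cycle(highs), n)  # repeat the upper bounds as needed
    return [random.uniform(lo, hi) for lo, hi in zip(lows_n, highs_n)]

random.seed(0)
# Upper bounds used for the five draws: 10, 20, 30, 10, 20
draws = runif_recycled(5, highs=(10, 20, 30))
print(draws)
```

With n = 1 only the first upper bound is ever used, which matches the runif(1, max=10) equivalence described above.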
Per the documentation for this function (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Uniform), min takes on the value 0 unless explicitly passed.
And yes, that is correct: the function steps through the values given in c(), one per generated value. If max has fewer values than n (e.g. you're generating 3 random variables and set max=c(1,2)), the values are recycled, so the third draw uses max=1 again (the first element of the vector, not the default). An example showing how it iterates over c():
x<-runif(3,max=c(1,20, 7000000))
x
[1] 0.622216 7.463306 809194.417205

Math calculation formula

I'm trying to find a formula that gives me a number inside a range, given an increment x that I supply.
Sure, this can easily be done in a program with a loop, but I want to know whether it is possible to achieve it with a calculation alone.
For example, say the range is 10-50. If x = 10 (the amount to add) and the current position is 40, the result is 50. If I then give an x of 15, I want the result to be 15, since the value 50 has been reached and the sum has to restart from 10.
Is there a solution for this case?
Thanks.

What seems more useful for this particular problem, Mean or Median and Why?

You are given an array of positive integers. You are told to make all the numbers equal by doing this operation, i.e. to increase/decrease the value of an array element. The cost of the operation will be the amount of increment/decrement (absolute value). Find the minimum cost required to do this task.
Test Example
Array : 2 3 1 5 2
Answer : 5
At first, it appears as though we should change all values to the mean and that should do the trick. But the optimal answer comes from using the median. I understand that the mean is more sensitive to outliers, but I still cannot really understand why changing the values to the mean does not give an optimal answer.
Suppose you decide to change each array element to v. The cost of changing a[i] to v is
|a[i] - v|
So the total cost is
C(v) = Sum over i of |a[i] - v|
We want to choose v to minimise the total cost. The v that does that is a median of a[]. The proof of that is a little awkward in the general case. However, if you consider a list in which every element is unique, you should be able to convince yourself in that case.
By contrast the mean m is what minimises
Sum over i of (a[i] - m)^2
So the mean would be the right choice for a problem where you could add any number to an element, the cost being the square of what you added. The proof that the mean minimises this sum is easy.
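A quick numeric check on the test example makes the difference visible. The sketch below is in Python and the helper name total_cost is hypothetical. It evaluates the absolute-difference cost C(v) at the median and at the mean of the array 2 3 1 5 2.

```python
from statistics import mean, median

def total_cost(values, target):
    """Sum of absolute differences between each value and the target."""
    return sum(abs(v - target) for v in values)

array = [2, 3, 1, 5, 2]

cost_at_median = total_cost(array, median(array))  # median is 2
cost_at_mean = total_cost(array, mean(array))      # mean is 2.6

# Brute force over every integer target between the smallest and largest value
best_integer_cost = min(total_cost(array, v)
                        for v in range(min(array), max(array) + 1))

print(cost_at_median, cost_at_mean, best_integer_cost)
```

The median gives a cost of 5, matching the expected answer, while the mean gives about 5.6. The single value 5 pulls the mean upward, away from where most of the elements sit.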

Is there a way to handle calculations involving exponential of big values in R?

I have looked a bit online and in the site but I did not find any solution. My problem is relatively simple so if you could point me to a possible solution, much appreciated.
test_vec <- c(2,8,709,600)
mean(exp(test_vec))
test_vec_bis <- c(2,8,710,600)
mean(exp(test_vec_bis))
exp(709)
exp(710)
# The numerical limit of R is at exp(709)
How can I calculate the mean of my vector and deal with the Inf values, knowing that R could probably represent the mean itself but not every term in the numerator of the mean calculation?
There is an edge case where you can solve your problem by simply restating it mathematically, but that would require that the length of your vector is extremely large and/or that your large exp. values are close to the numeric limit:
Since the mean sum(x)/n can be written as sum(x/n) and since exp(x)/exp(y) = exp(x-y), you can calculate sum(exp(x-log(n))) instead of mean(exp(x)), which gives you log(n) of extra headroom in the exponent.
mean(exp(test_vec))
[1] 2.054602e+307
sum(exp(test_vec - log(length(test_vec))))
[1] 2.054602e+307
sum(exp(test_vec_bis - log(length(test_vec_bis))))
[1] 5.584987e+307
While this works for your example, most likely this won't work for your real vector.
In this case, you will have to consult packages like Rmpfr as suggested by @fra.
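The same rewrite can be sketched in Python, where math.exp(710) raises OverflowError instead of returning Inf. The helper names mean_exp_shifted and log_mean_exp are hypothetical. The second helper goes one step beyond the answer above and is a different technique: it shifts by the largest element and returns the logarithm of the mean, which stays representable even when the mean itself would overflow.

```python
import math

def mean_exp_shifted(values):
    """Mean of exp(v), computed as sum(exp(v - log(n))) as in the answer above."""
    log_n = math.log(len(values))
    return sum(math.exp(v - log_n) for v in values)

def log_mean_exp(values):
    """log(mean(exp(v))), shifted by the maximum so that no term can overflow.

    Not part of the original answer: this is the usual log-sum-exp shift.
    """
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

test_vec_bis = [2, 8, 710, 600]

result = mean_exp_shifted(test_vec_bis)   # works although exp(710) overflows
log_result = log_mean_exp(test_vec_bis)   # the same quantity on the log scale

print(result, log_result)
```

Working on the log scale only helps if the rest of the calculation can use log(mean) rather than the mean itself. Otherwise an arbitrary-precision package is still needed.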
Here's one way: select only those elements of test_vec_bis whose exponential is finite (< Inf):
mean(exp(test_vec_bis)[which(exp(test_vec_bis) < Inf)])
[1] 1.257673e+260
t2 <- c(2,8,600)
mean(exp(t2))
[1] 1.257673e+260
This assumes you were looking to exclude values that result in Inf, of course.

get location of row with median value in R data frame

I am a bit stuck with this basic problem, but I cannot find a solution.
I have two data frames (dummies below):
x<- data.frame("Col1"=c(1,2,3,4), "Col2"=c(3,3,6,3))
y<- data.frame("ColA"=c(0,0,9,4), "ColB"=c(5,3,20,3))
I need to use the location of the median value of one column in df x to then retrieve a value from df y. For this, I am trying to get the row number of the median value in e.g. x$Col1 to then retrieve the value using something like y[,"ColB"][row.number]
Is there an elegant way/function for doing this? Solutions might need to account for two cases: when the sample has an even number of values, and when it has an odd number. (When the number of values is even, the median might not be found in the sample, because it is calculated as the mean of the two middle values.)
The problem is a little underspecified.
What should happen when the median isn't in the data?
What should happen if the median appears in the data multiple times?
Here's a solution which takes the (absolute) difference between each value and the median, then returns the index of the first row for which that difference vector achieves its minimum.
with(x, which.min(abs(Col1 - median(Col1))))
# [1] 2
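For comparison, the same nearest-to-median lookup can be sketched in Python using the dummy columns from the question. The helper name row_nearest_median is hypothetical, and Python indexes from 0, so row 2 in R corresponds to index 1 here.

```python
from statistics import median

def row_nearest_median(values):
    """Zero-based index of the first value closest to the median."""
    mid = median(values)
    # min() returns the first index that attains the smallest distance
    return min(range(len(values)), key=lambda i: abs(values[i] - mid))

col1 = [1, 2, 3, 4]    # x$Col1 from the question
col_b = [5, 3, 20, 3]  # y$ColB from the question

idx = row_nearest_median(col1)
picked = col_b[idx]
print(idx, picked)
```

As with which.min, ties are resolved by position: the values 2 and 3 are equally close to the median 2.5, and the one that appears first wins.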
The quantile function with type = 1 (i.e. no averaging) may also be of interest, depending on your desired behavior. It returns the lower of the two "sides" of the median, while the which.min method above can depend on the ordering of your data.
quantile(x$Col1, .5, type = 1)
# 50%
# 2
An option using quantile is
with(x, which(Col1 == quantile(Col1, .5, type = 1)))
# [1] 2
This could possibly return multiple row-numbers.
Edit:
If you want it to only return the first match, you could modify it as shown below
with(x, which.min(Col1 != quantile(Col1, .5, type = 1)))
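A rough Python counterpart of the quantile-based variant uses statistics.median_low, which returns the lower of the two middle values when the count is even. The assumption that this lines up with quantile(..., .5, type = 1) is mine and worth checking against your own data. The names rows_at_low_median, all_rows and first_row are hypothetical.

```python
from statistics import median_low

def rows_at_low_median(values):
    """Zero-based indices of every value equal to the low median."""
    target = median_low(values)  # always an element of values
    return [i for i, v in enumerate(values) if v == target]

col1 = [1, 2, 3, 4]  # x$Col1 from the question

all_rows = rows_at_low_median(col1)  # may hold several indices if there are ties
first_row = all_rows[0]              # keep only the first match
print(all_rows, first_row)
```

Because the low median is always one of the data values, the list of matches is never empty for non-empty input.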
Here, something like y$ColB[which(x$Col1 == round(median(x$Col1)))] would do the trick.
The problem is x has an even number of rows, so the median 2.5 is not an integer. In this case you have to choose between 2 or 3.
Note: The above works for your example, not for general cases (e.g. c(-2L,2L) or with rational numbers). For the more general case see @IceCreamToucan's solution.
