Rebalancing table with average value - math

I'm looking for an algorithm that will solve the following problem.
Given a table with the values [6, 2, 3, 1]:
Get the average value of the slice: 12/4 = 3.
Re-balance the values in the slice by deducting from the highest value the amount the other elements need to reach the average, i.e. something like: 1) 0, 2) 6-1, 3) 0, 4) 5-2 (take 1 from the 6 for the second element, then take 2 more from the remaining 5 for the fourth element). As a result I should end up with the slice [3, 3, 3, 3].
Thanks
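For what it's worth, here is a minimal Python sketch of that idea (the names rebalance and moves are mine, and it assumes the total divides evenly by the number of elements): greedily move surplus from elements above the average to elements below it, recording each transfer.

def rebalance(values):
    # Bring every element to the average by moving surplus from
    # elements above the average to elements below it.
    # Assumes sum(values) is evenly divisible by len(values).
    target = sum(values) // len(values)
    result = list(values)
    moves = []  # (from_index, to_index, amount)
    donors = [i for i, v in enumerate(result) if v > target]
    for i, v in enumerate(result):
        need = target - v
        while need > 0:
            d = donors[0]
            give = min(need, result[d] - target)
            result[d] -= give
            result[i] += give
            moves.append((d, i, give))
            need -= give
            if result[d] == target:
                donors.pop(0)
    return result, moves

print(rebalance([6, 2, 3, 1]))
# ([3, 3, 3, 3], [(0, 1, 1), (0, 3, 2)]) - i.e. 6-1 for the second slot, then 5-2 for the fourth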

Related

Multidimensional selection in a 2D array

Given a 2-d array arr of size n x m, a selection is defined as an array of integers such that it contains at least [m/2] integers from each row of arr. The cost of the selection is defined as the maximum difference between any two integers of the selection.
Suppose k is the minimum cost of all the possible selections for the given 2-d array. Find the maximum value of the product of k * the number of integers considered in the selection with the minimum cost.
Example
Suppose n= 3, m = 2, and arr = [[1, 2], [3, 4], [8, 9]]
Some of the possible selections are [2, 3, 8], [1, 2, 3, 9], [1, 3, 4, 8, 9], etc. The costs of these selections are 8 - 2 = 6, 9 - 1 = 8, and 9 - 1 = 8, respectively.
Here the minimum cost of all the possible selections is 6. The possible selections with the cost 6 are [2, 4, 8] and [2, 3, 4, 8]. The maximum value of the required product is obtained using the latter selection i.e. 6 * 4 = 24. Hence the answer is 24.
How do you go about this problem? This is the basic thought process:
Find the selections (how do you do that?)
Calculate the cost of each selection
Determine the least cost and multiply it by the length of that selection.
Unfortunately I am stuck on the very first step which is to find the selections. How can we do that in an efficient manner? Can we use combinations on 2D arrays?
Any help would be appreciated, thank you!
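Purely as a starting point, here is a brute-force Python sketch of that exact three-step plan. It enumerates every selection, so it is only feasible for small n and m, and it assumes [m/2] means the ceiling; best_product is just an illustrative name.

from itertools import combinations, product
from math import ceil

def best_product(arr):
    # Step 1: for each row, list every way to pick at least ceil(m/2) values,
    # then combine one choice per row into a full selection.
    m = len(arr[0])
    need = ceil(m / 2)  # assumption: [m/2] is the ceiling
    per_row = []
    for row in arr:
        choices = []
        for size in range(need, m + 1):
            choices.extend(combinations(row, size))
        per_row.append(choices)
    # Steps 2 and 3: cost = max - min; keep the minimum cost and the
    # largest selection that achieves it.
    best_cost, best_len = None, 0
    for pick in product(*per_row):
        values = [v for part in pick for v in part]
        cost = max(values) - min(values)
        if best_cost is None or cost < best_cost:
            best_cost, best_len = cost, len(values)
        elif cost == best_cost:
            best_len = max(best_len, len(values))
    return best_cost * best_len

print(best_product([[1, 2], [3, 4], [8, 9]]))  # 24

The number of combined selections explodes quickly, so this only makes the steps concrete; it is not the efficient approach the question is asking for.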

How to find number of observation between first observation and observation with maximal value

I have a large data frame and I need a function to automate this search. Basically I want to find how many observations are between the first observation and the observation with maximal value.
Example:
x <- c(2, 1, 9, 3, 4, -6, 5, 11, 6, -7, -1)
Assuming that this is my data I want to count the number of data points between 2 and 11.
I need to do this in R.
Help is highly appreciated :D !!!
We can either use
diff(which(x %in% c(2, max(x)))) -1
#[1] 6
Or subtract the index of the first element (1) from the index of the max value (which.max), minus 1 more so that neither endpoint is counted
which.max(x) - 1 - 1
#[1] 6

Create Vector of Length n in Julia

I want to create a vector/array of length n to be filled afterwards.
How can I do that?
And does it have to be filled already with something?
For example if you want a Vector of Ints of length 10 you can write
v = Vector{Int}(undef, 10)
And more generally, for an Array of Ints with dimensions (2, 3, 4)
a = Array{Int}(undef, (2, 3, 4))
Note that this fills the Vector/Array with garbage values, so this can be a bit dangerous. As an alternative you can use
v = Vector{Int}()
sizehint!(v, 10)
push!(v, 1) # add a one to the end of the Vector
append!(v, (2, 3, 4, 5, 6, 7, 8, 9, 10)) # append the values 2 through 10 to the end of the vector
sizehint! is not necessary, but it can improve performance, because it tells Julia to expect 10 values.
There are other functions such as zeros, ones, or fill that can provide a Vector/Array with the data already filled in.

How to find the averages of any consecutive numbers in a sequence?

This is a bit of a math question, but I post it here too because there's a direct practical purpose and it's related to creating a faster algorithm. I want to identify users that use my app on a weekly basis. For each user I can generate a sequence of times of their interactions, and from that I can generate a sequence of the length of time between each interaction.
So given this sequence of lengths of time, how can I find sections of consecutive numbers that have an average of 7 days or less?
As an example, if I had the following sequence: [1, 11, 1, 8, 12]
[1, 11, 1, 8, 12] would be a valid stretch of numbers with an average of 7 or less, but [11, 1, 8, 12] would not be valid. [1, 8, 12] would again be valid.
Ideally, my output for every valid section would be the starting position of the first item and the length of the section. So [1, 11, 1, 8, 12] would be described as [1, 5] and [1, 8, 12] would be described as [3, 3].
There is a brute-force, computational approach where I take every item in the sequence as a starting point and calculate the average of every possible run of following numbers up to the end of the sequence. The number of calculations grows quickly, though, at a rate of n(n+1)/2 (imagine, for a sequence of length N, checking stretches of length N, N-1, N-2, etc. from each starting point).
I ask broadly if there's a more elegant approach that doesn't require a quadratically growing number of individual calculations for means.
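One small step in that direction (a sketch, not a full sub-quadratic answer): a stretch has an average of 7 or less exactly when the sum of (value - 7) over that stretch is at most 0, so a prefix sum lets you test each candidate stretch in O(1) instead of re-averaging it. A Python sketch, with the function name being mine:

from itertools import accumulate

def stretches_with_avg_at_most(xs, limit=7):
    # Shift every value by -limit; the stretch xs[i:j] has average <= limit
    # exactly when prefix[j] - prefix[i] <= 0.
    prefix = [0] + list(accumulate(x - limit for x in xs))
    hits = []
    for i in range(len(xs)):
        for j in range(i + 1, len(xs) + 1):
            if prefix[j] - prefix[i] <= 0:
                hits.append((i + 1, j - i))  # 1-based start, length
    return hits

print(stretches_with_avg_at_most([1, 11, 1, 8, 12]))
# includes (1, 5) and (3, 3) from the example, along with the shorter stretches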

Algorithm for rating the monotonicity of an array (i.e. judging the "sortedness" of an array)

EDIT: Wow, many great responses. Yes, I am using this as a fitness function for judging the quality of a sort performed by a genetic algorithm. So cost-of-evaluation is important (i.e., it has to be fast, preferably O(n).)
As part of an AI application I am toying with, I'd like to be able to rate a candidate array of integers based on its monotonicity, aka its "sortedness". At the moment, I'm using a heuristic that calculates the longest sorted run, and then divides that by the length of the array:
public double monotonicity(int[] array) {
    if (array.length == 0) return 1d;
    int longestRun = longestSortedRun(array);
    return (double) longestRun / (double) array.length;
}

public int longestSortedRun(int[] array) {
    if (array.length == 0) return 0;
    int longestRun = 1;
    int currentRun = 1;
    for (int i = 1; i < array.length; i++) {
        if (array[i] >= array[i - 1]) {
            currentRun++;
        } else {
            currentRun = 1;
        }
        if (currentRun > longestRun) longestRun = currentRun;
    }
    return longestRun;
}
This is a good start, but it fails to take into account the possibility that there may be "clumps" of sorted sub-sequences. E.g.:
{ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9}
This array is partitioned into three sorted sub-sequences. My algorithm will rate it as only 40% sorted, but intuitively, it should get a higher score than that. Is there a standard algorithm for this sort of thing?
This seems like a good candidate for the Damerau–Levenshtein distance - the number of swaps needed to sort the array. This should be proportional to how far each item is from where it should be in a sorted array.
Here's a simple Ruby snippet that sums the squares of the distances. It seems a good measure of sortedness - the result gets smaller every time two out-of-order elements are swapped.
# For each element, how far is it from its position in the sorted array?
# (Assumes distinct values, since Array#index returns the first match.)
ap = a.sort
sum = 0
a.each_index { |i|
  j = ap.index(a[i]) - i
  sum += j * j
}
# Use floating-point division so small arrays don't round down to 0.
dist = sum.to_f / (a.size * a.size)
I expect that the choice of function to use depends very strongly on what you intend to use it for. Based on your question, I would guess that you are using a genetic system to create a sorting program, and this is to be the ranking function. If that is the case, then speed of execution is crucial. Based on that, I bet your longest-sorted-subsequence algorithm would work pretty well. That sounds like it should define fitness pretty well.
Something like these? http://en.wikipedia.org/wiki/Rank_correlation
Here's one I just made up.
For each pair of adjacent values, calculate the numeric difference between them. If the second is greater than or equal to the first, add that to the sorted total, otherwise add to the unsorted total. When done, take the ratio of the two.
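A quick Python sketch of that idea (reading "the ratio of the two" as the sorted total over the overall total, so a fully sorted array scores 1.0; the function name is mine):

def adjacent_diff_score(xs):
    # Sum the upward adjacent differences and the downward ones separately,
    # then report what fraction of the total movement is "sorted".
    sorted_total = unsorted_total = 0
    for a, b in zip(xs, xs[1:]):
        if b >= a:
            sorted_total += b - a
        else:
            unsorted_total += a - b
    total = sorted_total + unsorted_total
    return 1.0 if total == 0 else sorted_total / total

print(adjacent_diff_score([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # ~0.65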
Compute the lengths of all sorted sub-sequences, then square them and add them up.
If you want to calibrate how much emphasis you put on the largest runs, use a power other than 2.
I'm not sure what the best way to normalize this by length is; maybe divide by the length squared?
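A Python sketch of that, normalizing by the squared length so a fully sorted array scores 1.0 (the normalization is just one option, as noted above):

def squared_run_score(xs, power=2):
    # Split the array into maximal non-decreasing runs, raise each run
    # length to `power`, sum them, and normalize by len(xs) ** power.
    if not xs:
        return 1.0
    runs, current = [], 1
    for a, b in zip(xs, xs[1:]):
        if b >= a:
            current += 1
        else:
            runs.append(current)
            current = 1
    runs.append(current)
    return sum(r ** power for r in runs) / len(xs) ** power

print(squared_run_score([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # (9 + 49) / 100 = 0.58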
What you're probably looking for is Kendall Tau. It's a one-to-one function of the bubble sort distance between two arrays. To test whether an array is "almost sorted", compute its Kendall Tau against a sorted array.
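For example, with SciPy (assuming it is available) you can compute Kendall's tau between the array and its index order; 1.0 means fully sorted and -1.0 means fully reversed:

from scipy.stats import kendalltau

a = [4, 5, 6, 0, 1, 2, 3, 7, 8, 9]
tau, _ = kendalltau(a, range(len(a)))
print(tau)  # ~0.47 for the example above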
I would suggest looking at the Pancake Problem and the reversal distance of permutations. These algorithms are often used to find the distance between two permutations (the identity and the permuted string). This distance measure should take into account more clumps of in-order values, as well as reversals (monotonically decreasing instead of increasing subsequences). There are also approximations that run in polynomial time [PDF].
It really all depends on what the number means and if this distance function makes sense in your context though.
I have the same problem (monotonicity scoring), and I suggest you try the Longest Increasing Subsequence. The most efficient algorithm runs in O(n log n), which is not so bad.
Taking the example from the question, the longest increasing subsequence of {4, 5, 6, 0, 1, 2, 3, 7, 8, 9} is {0, 1, 2, 3, 7, 8, 9} (length 7). Maybe it rates better (70%) than your longest-sorted-run algorithm.
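A Python sketch of the O(n log n) version (using non-decreasing order, to match the >= comparison in the question's code):

from bisect import bisect_right

def lis_length(xs):
    # Patience-sorting trick: tails[k] is the smallest possible tail value
    # of a non-decreasing subsequence of length k + 1.
    tails = []
    for x in xs:
        i = bisect_right(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

a = [4, 5, 6, 0, 1, 2, 3, 7, 8, 9]
print(lis_length(a) / len(a))  # 0.7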
It highly depends on what you're intending to use the measure for, but one easy way to do this is to feed the array into a standard sorting algorithm and measure how many operations (swaps and/or comparisons) need to be done to sort the array.
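As a sketch of that, counting the swaps an insertion sort makes (which equals the number of inversions) gives 0 for a sorted array and n*(n-1)/2 for a reversed one; note that the count itself takes O(n^2) in the worst case:

def insertion_sort_swaps(xs):
    # Count how many adjacent swaps insertion sort needs to sort a copy.
    a, swaps = list(xs), 0
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            swaps += 1
            j -= 1
    return swaps

print(insertion_sort_swaps([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # 12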
Some experiments with a modified Ratcliff & Obershelp (difflib's SequenceMatcher):
>>> from difflib import SequenceMatcher as sm
>>> a = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> c = [ 0, 1, 9, 2, 8, 3, 6, 4, 7, 5 ]
>>> b = [ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9 ]
>>> b.sort()
>>> s = sm(None, a, b)
>>> s.ratio()
0.69999999999999996
>>> s2 = sm(None, c, b)
>>> s2.ratio()
0.29999999999999999
So it kind of does what it needs to. Not too sure how to prove it, though.
How about counting the number of steps with an increasing value vs. the total number of steps? That's O(n).
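That one is nearly a one-liner in Python; a sketch:

def ascending_step_ratio(xs):
    # Fraction of adjacent steps that do not decrease; O(n).
    steps = list(zip(xs, xs[1:]))
    return 1.0 if not steps else sum(b >= a for a, b in steps) / len(steps)

print(ascending_step_ratio([4, 5, 6, 0, 1, 2, 3, 7, 8, 9]))  # 8/9 ≈ 0.89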
