R an algorithm for missing data imputation - r

I have a matrix M[i,j] with a large number of NA representing the population stock in a number of urban areas [i,] over the period 2000-2013 [,j]. I would like to complete missing data assuming that population grows at a constant rate.
I would like to run a loop which, for each row i of the matrix M
1. Computes the average growth rate using all the non-missing observations
2. Fills in the gaps using imputed values
Generating an example dataset:
city1=c(10,40,NA,NA,320,640);
city2=c(NA,300,NA,1200,2400,NA);
city3=c(NA,NA,4000,8000,16000,32000);
mat=rbind(city1,city2,city3)
Since the average growth rate is 4 for city1, 3 for city2, and 2 for city3, the corresponding result matrix should be:
r.city1=c(10,40,80,160,320,640);
r.city2=c(100,300,600,1200,2400,4800);
r.city3=c(1000,2000,4000,8000,16000,32000);
r.mat=rbind(r.city1,r.city2,r.city3)
Do you have any idea of how I could go about this?
Best,
Clément

Simply approximate the growth rate.
You have a 1D array p[n] per city, so start with that (p for population).
The index i={0,1,2,...,n-1} is time in some constant step.
If the growth rate is constant (call it m), then
p[1]=p[0]*m
p[2]=p[0]*(m^2)
p[3]=p[0]*(m^3)
p[i]=p[0]*(m^i)
So now just guess or approximate m and minimize the distance from each known point.
As a starting point, take two consecutive known values:
m0=p[i+1]/p[i];
and then loop around this value, minimizing the error for all known values at once.
You can also use my approx class (in C++) if you want.
If you want to be more precise, use a dynamic growth rate,
for example cubic or spline interpolation,
and add curve fitting or approximation as well.
Here is a C++ example (simple, ugly, slow, and not precise):
#include <math.h> // fabs
const int N=3; // cities
const int n=6; // years
int p[N][n]=
{
    10, 40,  -1,  -1,  320,  640, // city 1 ... -1 = NA
    -1,300,  -1,1200, 2400,   -1, // city 2
    -1, -1,4000,8000,16000,32000, // city 3
};
void estimate()
{
    int i,I,*q;
    float m,m0,m1,dm,d,e,mm;
    for (I=0;I<N;I++) // all cities
    {
        q=p[I]; // pointer to the current city
        // average growth rate over consecutive known values
        for (m0=0.0,m1=0.0,i=1;i<n;i++)
            if ((q[i]>0)&&(q[i-1]>0))
            { m0+=float(q[i])/float(q[i-1]); m1++; } // float division, not integer
        if (m1<0.5) continue; // skip this city if avg m not found
        m0/=m1;
        // find m more closely on interval <0.5*m0,2.0*m0>
        m1=2.0*m0; m0=0.5*m0; dm=(m1-m0)*0.001;
        for (m=-1.0,e=0.0;m0<=m1;m0+=dm)
        {
            // find first valid number
            for (mm=-1,i=0;i<n;i++)
                if (q[i]>0) { mm=q[i]; break; }
            // compute abs error for current m0
            for (d=0.0;i<n;i++,mm*=m0)
                if (q[i]>0) d+=fabs(mm-q[i]);
            // remember the best solution
            if ((m<0.0)||(e>d)) { m=m0; e=d; }
        }
        // now compute missing data using m
        // ascending from the first valid entry
        for (mm=-1,i=0;i<n;i++) if (q[i]>0) { mm=q[i]; break; }
        for (;i<n;i++,mm*=m) if (q[i]<0) q[i]=mm;
        // descending from the first valid entry
        for (mm=-1,i=0;i<n;i++) if (q[i]>0) { mm=q[i]; break; }
        for (;i>=0;i--,mm/=m) if (q[i]<0) q[i]=mm;
    }
}
Result:
// input
[ 10 40 NA NA 320 640 ]
[ NA 300 NA 1200 2400 NA ]
[ NA NA 4000 8000 16000 32000 ]
// output
[ 10 40 52 121 320 640 ]
[ 150 300 599 1200 2400 4790 ]
[ 1000 2000 4000 8000 16000 32000 ]
You can add some rounding and tweak it a bit.
You can also compute the error in both ascending and descending order to be safer,
improve this by starting from a value surrounded by valid values instead of the first valid entry,
or compute each NA gap separately.
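The constant-rate model p[i]=p[0]*(m^i) can also be fitted without a search loop. The sketch below is a different technique from the brute-force scan in the C++ code: it takes logarithms, so the model becomes a straight line, and fits that line by ordinary least squares. The function name impute_constant_growth and the use of None for NA are assumptions made for illustration, and all known values are assumed to be positive.

```python
import math

def impute_constant_growth(row):
    """Fill None entries assuming p[i] = p0 * m**i, fitted on a log scale."""
    # Known points as (index, log(value)); values are assumed positive.
    known = [(i, math.log(v)) for i, v in enumerate(row) if v is not None]
    if len(known) < 2:
        return list(row)  # not enough data to estimate a growth rate
    n = len(known)
    sx = sum(i for i, _ in known)
    sy = sum(y for _, y in known)
    sxx = sum(i * i for i, _ in known)
    sxy = sum(i * y for i, y in known)
    # Least-squares line: log(p[i]) = intercept + slope * i, so m = exp(slope).
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return [v if v is not None else math.exp(intercept + slope * i)
            for i, v in enumerate(row)]

city3 = [None, None, 4000, 8000, 16000, 32000]
filled = impute_constant_growth(city3)
```

Known entries are returned unchanged; only the gaps are replaced by fitted values, which may need rounding for population counts.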

Related

Number of action per year. Combinatorics question

I'm writing a diploma thesis about vaccines. There is a region, its population, and 12 months. There is an array of 12 values from 0 to 1 with a step of 0.01. Each value is the fraction of the population we should vaccinate in that month.
For example, array = [0.1,0,0,0,0,0,0,0,0,0,0,0] means that we should vaccinate 0.1 of the region's population in the first month only.
Another array = [0, 0.23,0,0,0,0,0,0, 0.02,0,0,0] means that we should vaccinate 0.23 of the region's population in the second month and 0.02 in the ninth month.
So the question is: how do I generate (using 3 loops) 12 (months) * 12 (times of vaccinating) * 100 (number of steps from 0 to 1) = 14_400 arrays that contain every version of these combinations?
For now I have this code:
for (int month = 0; month < 12; month++) {
    for (double step = 0; step <= 1; step += 0.01) {
        double[] arr = new double[12];
        arr[month] = step;
    }
}
I need to add a third loop that varies the number of vaccinations per year.
I have no idea how to write it.
I don't know if this is understandable.
I hope you get it; otherwise please ask me.
You have 101 variants for the first month: 0.00, 0.01, ..., 1.00.
And 101 variants for the second month, with the same values.
That gives 101*101 possible combinations for two months.
Continuing, for all 12 months you have 101^12 variants ~ 10^24.
It is not possible to generate and store so many combinations (at least in the current decade).
If the step is larger than 0.01, the combination count might be manageable. The general formula is P=N^M, where N is the number of variants per month and M is the number of months.
You can traverse all combinations by representing every integer in the range 0..P-1 in the base-N numeral system. Or make a digit counter:
fill array D[12] with zeros
repeat
    increment element at the last index by step value
    if it reaches the limit, make it zero
        and increment element at the next index
until the first element reaches the limit
It is similar to counting 08, 09: we cannot increment 9 any further, so it becomes 10, and so on.
s = 1
m = 3
mx = 3
l = [0]*m
i = 0
while i < m:
    print([x/3 for x in l])
    i = 0
    l[i] += s
    while (i < m) and l[i] > mx:
        l[i] = 0
        i += 1
        if i < m:
            l[i] += s
Python code prints 64 ((mx/s+1)^m=4^3) variants like [0.3333, 0.6666, 0.0]
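The same enumeration can be sketched with the standard library instead of a hand-written digit counter. This is only an illustration of P=N^M on a small case: the names month_plans and plan_count are hypothetical, and the iterator is lazy, so nothing is stored, but walking all 101^12 plans is still infeasible.

```python
from itertools import product

def plan_count(months, steps):
    """Number of plans P = N**M with N = steps + 1 levels per month."""
    return (steps + 1) ** months

def month_plans(months, steps):
    """Lazily yield every plan as a tuple of per-month fractions in [0, 1]."""
    levels = [k / steps for k in range(steps + 1)]
    return product(levels, repeat=months)

# Small case matching the counter above: 3 months, 4 levels each.
small = list(month_plans(3, 3))
print(len(small), plan_count(3, 3))

# The original problem size, computed without enumerating anything.
print(plan_count(12, 100))
```

Note that product varies the last position fastest, whereas the counter above varies the first; the set of plans is the same.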

Knapsack variation

So I have an array of coupons, each with a price and the quantity of items that can be bought from it. I can only buy the given item quantity from a coupon, no more, no less. How to find the minimum cost to get the required number of items with coupons (and return -1 if not possible)?
For example, on having 4 coupons: "Buy 3 at 10 dollars", "Buy 2 at 4 dollars", "Buy 2 at 4 dollars" and "Buy 1 at 3 dollars", and 4 items to buy, the minimum cost is 8 dollars.
Knapsack works on finding maximums, but for minimum it'll just keep on not considering any coupon and come up with an answer of 0.
Here's my code:
int minimumCost(coupon_t coupons[], int numCoupons, int units) {
    if (units <= 0 || numCoupons <= 0)
        return 0;
    if (coupons[numCoupons-1].quantity > units)
        return minimumCost(coupons, numCoupons-1, units);
    coupon_t coupon = coupons[numCoupons-1];
    return min(coupon.price + minimumCost(coupons, numCoupons-1, units-coupon.quantity),
               minimumCost(coupons, numCoupons-1, units));
}
Had a little more think about this. The key, as you say, is handling of 0. In typical knapsack code, 0 has two meanings: both "not buying" and "can't buy". Splitting those seems to work:
def minimum_cost(coupons, units, coupon_no=0):
    if units == 0:
        # no more space, so we're not buying anything else
        # (checked first, so filling up with the last coupon still counts)
        return 0
    if units < 0 or coupon_no == len(coupons):
        # special value for "impossible"
        return None
    quantity, price = coupons[coupon_no]
    next_coupon = coupon_no + 1
    if quantity > units:
        return minimum_cost(coupons, units, next_coupon)
    pre_purchase_value_when_used = minimum_cost(coupons, units - quantity, next_coupon)
    value_when_unused = minimum_cost(coupons, units, next_coupon)
    # return whichever is not impossible, or cheaper of two possibilities:
    if pre_purchase_value_when_used is None:
        return value_when_unused
    elif value_when_unused is None:
        return pre_purchase_value_when_used + price
    else:
        return min(pre_purchase_value_when_used + price, value_when_unused)
coupons = [[3, 10], [2, 4], [2, 4], [1, 3]]
units = 4
cost = minimum_cost(coupons, units)
print(cost)
# => 8
(Note that recursion is not dynamic-programming, unless you cache the function results, although it shouldn't be too hard to make it use a table. The key insight about dynamic programming is to use storage to avoid recalculating things we already calculated.)
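To make the caching remark concrete, here is a minimal sketch of the same recursion with memoization, using functools.lru_cache from the standard library. The wrapper name minimum_cost_cached and the inner helper solve are hypothetical; None again stands for "impossible" internally, and the wrapper converts it to the -1 the question asks for.

```python
from functools import lru_cache

def minimum_cost_cached(coupons, units):
    """Cheapest way to buy exactly `units` items, or -1 if impossible."""
    coupons = tuple(tuple(c) for c in coupons)  # (quantity, price) pairs

    @lru_cache(maxsize=None)
    def solve(remaining, idx):
        if remaining == 0:
            return 0        # exactly filled
        if remaining < 0 or idx == len(coupons):
            return None     # overshot, or ran out of coupons
        quantity, price = coupons[idx]
        skip = solve(remaining, idx + 1)
        take = solve(remaining - quantity, idx + 1)
        if take is None:
            return skip
        take += price
        return take if skip is None else min(take, skip)

    result = solve(units, 0)
    return -1 if result is None else result

print(minimum_cost_cached([[3, 10], [2, 4], [2, 4], [1, 3]], 4))  # 8
```

Each (remaining, idx) pair is computed once, so the work is bounded by units times the number of coupons rather than growing exponentially.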

Calculate if trend is up, down or stable

I'm writing a VBScript that sends out a weekly email with client activity. Here is some sample data:
a b c d e f g
2,780 2,667 2,785 1,031 646 2,340 2,410
Since this is email, I don't want a chart with a trend line. I just need a simple function that returns "up", "down" or "stable" (though I doubt it will ever be perfectly stable).
I'm terrible with math so I don't even know where to begin. I've looked at a few other questions for Python or Excel but there's just not enough similarity, or I don't have the knowledge, to apply it to VBS.
My goal would be something as simple as this:
a b c d e f g trend
2,780 2,667 2,785 1,031 646 2,340 2,410 ↘
If there is some delta or percentage or other measurement I could display that would be helpful. I would also probably want to ignore outliers. For instance, the 646 above. Some of our clients are not open on the weekend.
First of all, your data is listed as
a b c d e f g
2,780 2,667 2,785 1,031 646 2,340 2,410
To get a trend line you need to assign numerical values to the variables a, b, c, ...
To assign them, you need a little more information about how the data were taken. Suppose you took data a on 1st January; you can assign it any value, like 0 or 1. If you took data b ten days later, you can assign it 10 or 11. If you took data c thirty days later, you can assign it 30 or 31. The numerical values of a, b, c, ... must be proportional to the time intervals between measurements to get an accurate trend line.
If they are taken at regular intervals (which is most likely your case), let's say every 7 days, then you can assign them regular values a, b, c, ... ~ 1, 2, 3, ... The starting point is entirely your choice, so choose something that makes it easy. It does not affect your final calculation.
Then you need to calculate the slope of the linear regression, which you can find on this url, from which you need to calculate the value of b with the following table.
On first column from row 2 to row 8, I have my values of a,b,c,... which I put 1,2,3, ...
On second column, I have my data.
On third column, I multiplied each cell in first column to corresponding cell in second column.
On fourth column, I squared the value of cell of first column.
On row 10, I added up the values of the above columns.
Finally use the values of row 10.
total_number_of_data*C[10] - A[10]*B[10]
b = -------------------------------------------
total_number_of_data*D[10]-square_of(A[10])
The sign of b determines what you are looking for. If it's positive, the trend is up; if it's negative, it's down; and if it's zero, it's stable.
This was a huge help! Here it is as a function in python
def trend_value(nums: list):
    summed_nums = sum(nums)
    multiplied_data = 0
    summed_index = 0
    squared_index = 0
    for index, num in enumerate(nums):
        index += 1
        multiplied_data += index * num
        summed_index += index
        squared_index += index**2
    numerator = (len(nums) * multiplied_data) - (summed_nums * summed_index)
    denominator = (len(nums) * squared_index) - summed_index**2
    if denominator != 0:
        return numerator/denominator
    else:
        return 0
val = trend_value([2781, 2667, 2785, 1031, 646, 2340, 2410])
print(val) # -139.5
In Python:
def get_trend(numbers):
    rows = []
    total_numbers = len(numbers)
    currentValueNumber = 1
    n = 0
    while n < len(numbers):
        rows.append({'row': currentValueNumber, 'number': numbers[n]})
        currentValueNumber += 1
        n += 1
    sumLines = 0
    sumNumbers = 0
    sumMix = 0
    squareOfs = 0
    for k in rows:
        sumLines += k['row']
        sumNumbers += k['number']
        sumMix += k['row']*k['number']
        squareOfs += k['row'] ** 2
    a = (total_numbers * sumMix) - (sumLines * sumNumbers)
    b = (total_numbers * squareOfs) - (sumLines ** 2)
    c = a/b
    return c
trendValue = get_trend([2781,2667,2785,1031,646,2340,2410])
print(trendValue) # output: -139.5
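Since the question asks for "up", "down" or "stable" rather than a number, the slope can be turned into a label. The sketch below is one possible way to do that: the name trend_label and the tolerance (slope per step within 1% of the mean counts as stable) are assumptions chosen for illustration, not something the answers above specify, and it does not attempt outlier removal.

```python
def trend_label(nums, tolerance=0.01):
    """Return 'up', 'down' or 'stable' from the least-squares slope."""
    n = len(nums)
    if n < 2:
        return "stable"  # a single point has no trend
    xs = range(1, n + 1)
    sum_x = sum(xs)
    sum_y = sum(nums)
    sum_xy = sum(x * y for x, y in zip(xs, nums))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    mean = sum_y / n
    # Hypothetical rule: a slope that is tiny relative to the mean is "stable".
    if abs(slope) <= tolerance * abs(mean):
        return "stable"
    return "up" if slope > 0 else "down"

print(trend_label([2781, 2667, 2785, 1031, 646, 2340, 2410]))  # down
```

The slope itself, or slope divided by the mean as a percentage per step, could be shown next to the label if a delta is wanted in the email.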

Recursively Find A Path

I have a series of numbers ranging from 0-9. Each number represents a position with an x and y co-ordinate. So, position 0 could represent (5, 5) or something similar, always (x, y). Now what I need to do is recursively bash each possible route using 5 positions to get the position given by a user. So for example:
Input = (1, 2) //This is the co-ordinate the user gives.
Now given this input it should take every possible path and find the shortest one. Some paths could be:
start 0 1 2 3 4 input
start 0 1 2 3 5 input
start 0 1 2 3 6 input
start 0 1 2 3 7 input
start 0 1 2 4 3 input
start 1 0 2 3 5 input
and so on....
It could be any combination of 5 numbers from the 0-9. It must end at the input destination and begin at start destination. Numbers cannot be reused. So I need to recursively add all the distances for a given course (ex. start 0 1 2 3 4 input) and find the shortest possible course while going through those 5 points.
Question: What would the base and recursive case be?
Basically, what you want to do is generate every ordered selection (permutation) of size k (the length of the path) from the set of positions, and then calculate the length of each path.
Here's a C# code sample:
void OPTPathForKSteps(List<int> currentPath, List<int> remainingPositions, int remainingSteps)
{
    if (remainingSteps == 0)
    {
        // currentPath now contains a combination of k positions
        // do something with currentPath...
    }
    else
    {
        for (int i = 0; i < remainingPositions.Count; i++)
        {
            int TempPositionIndex = remainingPositions[i];
            currentPath.Add(TempPositionIndex);
            remainingPositions.RemoveAt(i);
            OPTPathForKSteps(currentPath, remainingPositions, remainingSteps - 1);
            remainingPositions.Insert(i, TempPositionIndex);
            currentPath.RemoveAt(currentPath.Count - 1);
        }
    }
}
This is the initial call for the function (assume Positions is an integer list of 0...n positions, and k is the length of the path):
OPTPathForKSteps(new List<int>(), Positions, K);
You can change the function and add arguments so it will return the optimal path and minimal value.
There are other (maybe shorter) ways to create these combinations. The good thing about my implementation is that it is light on memory and doesn't require storing all the possible combinations.
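As a sketch of the "do something with currentPath" step, the version below scores each ordered selection by total Euclidean distance and keeps the shortest. It uses itertools.permutations instead of hand-written recursion, so it shows the scoring rather than the base and recursive cases. The function name shortest_route and the example coordinates are hypothetical; math.dist needs Python 3.8 or later.

```python
from itertools import permutations
from math import dist

def shortest_route(start, goal, positions, k):
    """Try every ordered choice of k positions; return (length, path indices)."""
    best_len, best_path = float("inf"), None
    for path in permutations(range(len(positions)), k):
        points = [start] + [positions[i] for i in path] + [goal]
        # Sum the leg lengths start -> p1 -> ... -> pk -> goal.
        length = sum(dist(a, b) for a, b in zip(points, points[1:]))
        if length < best_len:
            best_len, best_path = length, path
    return best_len, best_path

# Hypothetical coordinates: three positions, routes through two of them.
length, path = shortest_route((0, 0), (3, 0), [(1, 0), (2, 0), (5, 5)], 2)
print(length, path)
```

For 10 positions and paths of 5 there are 10*9*8*7*6 = 30240 candidates, which brute force handles easily.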

Negative & Positive Percentage Calculation

Let's say I have 3 pairs of numbers and I want the % of their difference.
30 - 60
94 - 67
10 - 14
I want a function that calculates the percentage difference between the two numbers in each pair, and most importantly it must support both positive and negative percentages.
Example:
30 - 60 : +100%
94 - 67 : -36% ( just guessing )
10 - 14 : +40%
Thanks
This is pretty basic math.
% difference from x to y is 100*(y-x)/x
The important issue here is whether one of your numbers is a known reference, for example, a theoretical value.
With no reference number, use the percent difference as
100*(y-x)/((x+y)/2)
The important distinction here is dividing by the average, which symmetrizes the definition.
From your example though, it seems that you might want percent error, that is, you are thinking of your first number as the reference number and want to know how the other deviates from that. Then the equation, where x is reference number, is:
100*(y-x)/x
See, e.g., wikipedia, for a small discussion on this.
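The two definitions above can be compared side by side. This is a small sketch with hypothetical function names; x is treated as the reference value in percent_change, and neither function guards against a zero denominator.

```python
def percent_change(x, y):
    """Change from reference x to y: 100*(y-x)/x."""
    return 100 * (y - x) / x

def percent_difference(x, y):
    """Symmetric version: divide by the average of x and y."""
    return 100 * (y - x) / ((x + y) / 2)

for x, y in [(30, 60), (94, 67), (10, 14)]:
    print(x, y, round(percent_change(x, y), 2), round(percent_difference(x, y), 2))
```

For the pair 94 and 67 the change relative to the first number is about -28.72%, not the -36% guessed in the question.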
for x - y the percentage is (y-x)/x*100
Simple math:
function differenceAsPercent($number1, $number2) {
    return number_format(($number2 - $number1) / $number1 * 100, 2);
}
echo differenceAsPercent(30, 60); // 100.00
echo differenceAsPercent(94, 67); // -28.72
echo differenceAsPercent(10, 14); // 40.00
If the percentage is needed for a voting system, then Andrey Korolyov is the only one who answered correctly.
Example
10 votes for 1 vote against = 90%
10 votes for 5 votes against = 50%
10 votes for 3 votes against = 70%
100 votes for 1 vote against = 99%
1000 votes for 1 vote against = 99.9%
1 vote for 10 votes against = -90%
5 votes for 10 votes against = -50%
3 votes for 10 votes against = -70%
1 vote for 100 votes against = -99%
1 vote for 1000 votes against = -99.9%
function perc(a,b){
    console.log( (a > b) ? (a-b)/a*100 : (b - a)/b*-100);
}
$c = ($a > $b) ? ($a-$b)/$a*-100 : ($b-$a)/$b*100;
In Ukraine children learn these math calculations at the age of 12 :)
