Orange performance on association rule mining

I am using orangecontrib.associate.fpgrowth for association rule mining. Based on a couple of experiments, it seems that as the number of products increases above 1000, the running time increases exponentially. However, I am not sure if this is true. Is there any approximation of the algorithm's complexity, and which parameters have the most significant effect on performance (the number of products? the total number of transactions? something else)?

Association rule generation algorithms in general "explode" quite fast. The rules-from-itemsets step, in particular, is akin to enumerating a power set (2^n subsets for n items). I can't elaborate further on the theoretical complexity myself, but I think the runtimes for given support / confidence / average transaction size thresholds are comparable to those found elsewhere:
import time
import numpy as np
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules

for n_trans in (100, 1000, 10000):
    for n_items in (10, 100, 1000):
        # random boolean transaction matrix, ~20% item density
        X = np.random.random((n_trans, n_items)) > .8

        t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
        itemsets = dict(frequent_itemsets(X, .05))
        n_itemsets = len(itemsets)
        t_itemsets = time.perf_counter() - t_start

        t_start = time.perf_counter()
        rules = list(association_rules(itemsets, .01))
        n_rules = len(rules)
        t_rules = time.perf_counter() - t_start

        print('{:5d} {:4d} {:5.1f} ({:7d}) {:4.1f} ({:7d})'.format(
            n_trans, n_items, t_itemsets, n_itemsets, t_rules, n_rules))
Outputs (columns: n_trans, n_items, itemset mining time in seconds (itemset count), rule generation time in seconds (rule count)):
100 10 0.0 ( 24) 0.0 ( 28)
100 100 0.1 ( 1800) 0.0 ( 3880)
100 1000 43.3 ( 470675) 15.3 (2426774)
1000 10 0.1 ( 11) 0.6 ( 2)
1000 100 0.2 ( 452) 0.0 ( 704)
1000 1000 33.5 ( 35448) 0.8 ( 68896)
10000 10 0.1 ( 10) 0.0 ( 0)
10000 100 2.6 ( 100) 0.0 ( 0)
10000 1000 180.8 ( 1000) 0.0 ( 0)

Related

Lua: How do I calculate a random number between 50 and 500, with an average result of 100?

I think this is a log-normal distribution? I'm not sure. The Lua I have is here:
local min = 50
local max = 500
local avg = 100
local fFloat = ( RandomInt( 0, avg ) / 100 ) ^ 2 -- float between 0 and 1, squared
local iRange = max - min -- range of min-max
local fDistribution = ( ( fFloat * iRange ) + min ) / 100
finalRandPerc = fDistribution * RandomInt( min, max ) / 100
It is close to working, but sometimes generates numbers that are slightly too large.
This can be done in a literally infinite number of ways. One other approach is to generate a number from a binomial distribution, multiply it by 450, and add 50. I will leave the task of finding the right parameters for the binomial distribution to you; a sketch of one workable choice follows.
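For instance, in Python (the function name and the choice n = 20 are illustrative assumptions, not from the answer; the scaled value has mean 50 + 450*p, so p = 1/9 targets 100):
import random

def random_50_500_binomial(n=20):
    # Binomial(n, p) scaled onto [50, 500]; p = 1/9 gives a mean of 100
    p = 1.0 / 9.0
    successes = sum(1 for _ in range(n) if random.random() < p)
    return 50 + 450.0 * successes / n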
How do I calculate a random number between 50 and 500, with an average result of 100?
You can use a chi-squared distribution with 4 degrees of freedom, with its tail removed. It is very easy to calculate.
local function random_50_500()
    -- result is from 50 to 500
    -- mean is very near to 100
    local x
    repeat
        x = math.log(math.random()) + math.log(math.random())
    until x > -18
    return x * -25 + 50
end
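To see why this works: -(log(u1) + log(u2)) is a sum of two Exp(1) variables, i.e. Gamma(2, 1), which is a chi-squared variable with 4 degrees of freedom divided by 2; its mean is 2, so x * -25 + 50 has a mean near 100, and the cutoff at -18 caps the result below 500. A Python sketch of the same generator (names are illustrative) with a quick empirical check:
import math
import random

def random_50_500():
    while True:
        # 1 - random() lies in (0, 1], avoiding log(0)
        x = math.log(1 - random.random()) + math.log(1 - random.random())
        if x > -18:  # drop the tail so the result never exceeds 500
            return x * -25 + 50

samples = [random_50_500() for _ in range(100000)]
print(min(samples), max(samples), sum(samples) / len(samples))  # mean ~100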

Looking for a logic to keep a fraction in a range

I need to write some code that calculates a variable showing a consumer's preference to buy a component for his laptop. The preference changes with the tax (T) and the importance of prices in people's purchases (PriceI). I need to include both T and PriceI to find the person's willingness (W) to purchase a laptop. The tax is set by a slider ranging from 50 cents to $6. I want to keep the variable W in a range from 1 to 2, where W = 1 when the tax is at its default minimum value of 50 cents.
So there are 2 variables that influence W:
50<T<600
0.6 < PriceI < 9
Since I want 1 < W < 2, I thought it would work if I first normalized all the data by dividing each by its maximum; then, to get a fraction between 1 and 2, I made the numerator less than 4 and the denominator less than 2, hoping to get a result between 1 and 2:
to setup-WCalculator
  ask consumers [
    set PP ((PriceI / 9) * 2)
    set TT ((T / 600) * 4)
    set W TT / PP
  ]
end
However, NetLogo makes both PP and TT zero, while they should be small values like 0.15! Does the logic for finding W make sense?
Normalization is normally done with a formula such as
TT = (T - Tmin) / (Tmax - Tmin)
or here
TT = (T - 50) / (600 - 50)
That gives a normalized value between 0 and 1 as T ranges between 50 and 600. If you want TTT to range between 1 and x, where x > 1, then you can set
TTT = 1.0 + TT * (x - 1.0)
So
TTT = 1.0 + TT * (4.0 - 1.0) = 1.0 + TT * 3.0
will give you a value between 1 and 4.
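A minimal Python sketch of this recipe (the helper name is illustrative; 50 and 600 are the tax bounds from the question, and lo/hi give the 1-to-2 range the asker wants):
def scale(value, vmin, vmax, lo=1.0, hi=2.0):
    # min-max normalize to [0, 1], then stretch onto [lo, hi]
    normalized = (value - vmin) / (vmax - vmin)
    return lo + normalized * (hi - lo)

print(scale(50, 50, 600))   # tax at its minimum -> 1.0
print(scale(600, 50, 600))  # tax at its maximum -> 2.0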

Julia significantly slower with @parallel

I have this code (primitive heat transfer):
function heat(first, second, m)
    @sync @parallel for d = 2:m - 1
        for c = 2:m - 1
            @inbounds second[c,d] = (first[c,d] + first[c+1,d] + first[c-1,d] +
                                     first[c,d+1] + first[c,d-1]) / 5.0
        end
    end
end
m = parse(Int, ARGS[1])  # size of matrix
firstm = SharedArray(Float64, (m,m))
secondm = SharedArray(Float64, (m,m))
for c = 1:m
    for d = 1:m
        if c == m || d == 1
            firstm[c,d] = 100.0
            secondm[c,d] = 100.0
        else
            firstm[c,d] = 0.0
            secondm[c,d] = 0.0
        end
    end
end
@time for i = 0:opak  # opak (number of iterations) is defined elsewhere by the asker
    heat(firstm, secondm, m)
    firstm, secondm = secondm, firstm
end
This code gives good times when run sequentially, but when I add @parallel it slows down, even when I run on one worker. I just need an explanation of why this is happening. Post code only if it doesn't change the algorithm of the heat function.
Have a look at http://docs.julialang.org/en/release-0.4/manual/performance-tips/. Contrary to what is advised there, you make heavy use of global variables. They are assumed to be able to change type at any time, so they have to be boxed and unboxed every time they are referenced. The question Julia pi approximation slow suffers from the same problem. To make your function faster, pass the global variables as input arguments to the function.
There are some points to consider. One of them is the size of m. If it is small, parallelism adds a lot of overhead for little gain:
julia 36967257.jl 4
# Parallel:
0.040434 seconds (4.44 k allocations: 241.606 KB)
# Normal:
0.042141 seconds (29.13 k allocations: 1.308 MB)
For bigger m you could have better results:
julia 36967257.jl 4000
# Parallel:
0.054848 seconds (4.46 k allocations: 241.935 KB)
# Normal:
3.779843 seconds (29.13 k allocations: 1.308 MB)
Plus two remarks:
1/ the initialisation could be simplified to:
for c = 1:m, d = 1:m
    if c == m || d == 1
        firstm[c,d] = 100.0
        secondm[c,d] = 100.0
    else
        firstm[c,d] = 0.0
        secondm[c,d] = 0.0
    end
end
2/ your finite difference scheme does not look stable. Please take a look at linear multistep methods or ADI/Crank-Nicolson.

Python .1 - .1 = extremely small number w/negative exponent?

This has got to be a well-traveled gotcha of some sort. Define the following function foo():
>>> def foo():
...     x = 1
...     while x != 0:
...         x -= .1
...         if x < 0:
...             x = 0
...         print x
So of course, when we call the function, we get exactly what we expect to get.
>>> foo()
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
1.38777878078e-16 # O_o
0
So, I know that math with integers vs. floating point numbers can get a little weird. Just typing 3 - 2.9 yields a similar answer:
>>> 3 - 2.9
0.10000000000000009
So, in fairness -- this is not causing an issue in the script I'm mucking about with. But surely this creeps up and bites people who would actually be affected by values as astronomically small as 1.38777878078e-16. And in order to prevent there from ever being an issue because of the strangely small number, I've got this gem sitting at the bottom of my controller du jour:
if (x < .1 and x > 0) or x < 0:
x = 0
That can't be the solution... unless it totally is. So... is it? If not, what's the trick here?
This can certainly "creep up and bite people", generally when they try to compare floats:
>>> a = 1 / 10
>>> b = 0.6 - 0.5
>>> a == b
False
Therefore it is common to compare floats using a tolerance:
>>> tolerance = 0.000001
>>> abs(a - b) < tolerance
True
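Since Python 3.5, the standard library also offers math.isclose for this comparison, so you don't need to hand-pick a tolerance:
>>> import math
>>> math.isclose(1 / 10, 0.6 - 0.5)
True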
This program:
def foo():
    x = 1
    while x != 0:
        x -= .1
        if x < 0:
            x = 0
        print '%.20f' % x

foo()
prints out this:
0.90000000000000002220
0.80000000000000004441
0.70000000000000006661
0.60000000000000008882
0.50000000000000011102
0.40000000000000013323
0.30000000000000015543
0.20000000000000014988
0.10000000000000014433
0.00000000000000013878
0.00000000000000000000
You were not printing the numbers out with enough precision to see what was actually going on. Compare this with the output of print '%.20f' % x when you explicitly set x to 0.9 and 0.8 and so forth. You may want to pay particular attention to the result for 0.5.
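For instance (a literal 0.5 versus the running total after five subtractions):
>>> print '%.20f' % 0.5
0.50000000000000000000
>>> print '%.20f' % (1 - .1 - .1 - .1 - .1 - .1)
0.50000000000000011102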
You're missing the point: this isn't something you want a workaround for. Just trust the VM and assume that it does all the computations as they should be done.
What you want to do is format your number. Spot the difference between a value and its representation.
>>> x = 0.9
>>> while x > 0.1:
...     x -= 0.1
...
>>> x
1.3877787807814457e-16
>>> "{:.2f}".format(x)
'0.00'
Here you have an example of showing a value with 2 decimal places. You'll find more on formatting (number formatting too) HERE.

Normalize numbers from 1 - 0.0000X to 1 - 0.0X?

I have a range of numbers going from 1 down to 0.00000X. Most are small numbers like 0.000823. How can I map them so that they are closer together in range? I used the sqrt method, but are there any other suggestions?
Update
Example
I don't have a problem with numbers between 1 and 0.1. My problem is with numbers below 0.1; I need to bring them closer to 0.1:
.00004 -> 0.0004 or 0.004
0.023 -> 0.05 or 0.09
Have you tried logarithms?
If your numbers satisfy eps < x <= 1, the function
y = 1 - C*log(x) where C = 1/log(eps)
will map the numbers to a range 0..1. If the range isn't required, only that the numbers are close together, you can drop the scale factor.
Edit:
This can be expressed without a subtraction of course.
y = 1 + C*log(x) where C = 1/-log(eps)
For example, with an epsilon of 0.0000000001 (10^-10) and base-10 logarithms, you get C = 0.1 and:
0.0000000001 => 0
0.000000001 => 0.1
0.00000001 => 0.2
...
0.1 => 0.9
1 => 1
Edit: If you don't want to change the range from 0.1 ... 1.0 but only smaller numbers, then just rescale the range from 0 ... 0.1. This can be done by multiplying x by 10 before the function is applied and dividing by 10 again afterwards. In this case, apply the scale function only if the value is less than 0.1.
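A small Python sketch of this mapping (using base-10 logarithms; the function names are illustrative, and squeeze_small is the below-0.1 variant from the last edit):
import math

def log_scale(x, eps=1e-10):
    # map x in [eps, 1] onto [0, 1]: y = 1 + C*log10(x), C = 1/-log10(eps)
    C = 1.0 / -math.log10(eps)
    return 1.0 + C * math.log10(x)

def squeeze_small(x):
    # leave [0.1, 1] untouched; log-scale values below 0.1 into (0, 0.1)
    if x >= 0.1:
        return x
    return log_scale(x * 10) / 10

print(squeeze_small(0.000823))  # ~0.079, much closer to 0.1
print(squeeze_small(0.023))     # ~0.094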
Well, a simple way would be to find the minimal value (say, 1 - t) and remap the segment [1-t, 1] to [0, 1]. The mapping function could be linear:
xnew = (xold - 1) / t + 1
(where, of course, t = 1 - min value)
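A sketch of this linear variant in Python (the function name is illustrative):
def linear_remap(x, xmin):
    # map [xmin, 1] linearly onto [0, 1], with t = 1 - xmin
    t = 1.0 - xmin
    return (x - 1.0) / t + 1.0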
