Sliding Window Flink - bigdata

Of particular interest to us are sliding window queries. Given a finite epoch τ = (t0, t1, ..., tn) of n timepoints, a sliding window query is a triple SWQ = (f, l, p), where f stands for the frequency of the query, l stands for the length of the sliding window, and p specifies the required event pattern we would like to identify. For the sake of clarity and without loss of generality, we assume that n is a multiple of f (n mod f = 0). Could someone please write code for this in Flink? I am trying to map these parameters onto Flink, where
f: the frequency at which events are generated inside Flink
l: the number of events in the window, i.e. its volume (size)
p: the pattern, i.e. the variety of events generated
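A minimal PyFlink sketch of one possible mapping, not an authoritative implementation: here f and l are interpreted as a time-based slide and window length in seconds (if they must be event counts, Flink's count windows would be the analogue), the placeholder events, key, and aggregation are hypothetical, and the pattern p is reduced to a plain filter. Written against the PyFlink DataStream API (Flink 1.13+); check the exact imports against your Flink version.

# Sketch of SWQ = (f, l, p) in PyFlink. Assumptions (mine, not from the
# question): f and l are in seconds, events are (key, value) pairs, and the
# pattern p is approximated by a simple predicate.
from pyflink.common.time import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingProcessingTimeWindows

F_SECONDS = 5   # f: the query frequency, i.e. how often the window slides
L_SECONDS = 20  # l: the length of the sliding window

env = StreamExecutionEnvironment.get_execution_environment()

# Placeholder event source; in practice this would be Kafka, a socket, etc.
events = env.from_collection([("sensor-1", 3), ("sensor-1", 7), ("sensor-2", 1)])

(events
 .filter(lambda e: e[1] > 0)                    # crude stand-in for pattern p
 .key_by(lambda e: e[0])
 .window(SlidingProcessingTimeWindows.of(Time.seconds(L_SECONDS),   # length l
                                         Time.seconds(F_SECONDS)))  # slide f
 .reduce(lambda a, b: (a[0], a[1] + b[1]))      # example per-window aggregation
 .print())

env.execute("sliding-window-query-sketch")

For a genuine event pattern p (sequences, varieties of events, and so on) you would normally reach for the FlinkCEP library rather than a filter; the window assigner above only covers f and l.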

Related

In what order do we need to put weights on the scale?

I am doing my programming homework, and I don't know how to solve this problem:
We have a set of n weights, and we put them on a scale one by one until all weights are used. We also have a string of n letters "R" or "L" telling which pan is heavier at each moment; the pans can never be in balance. No two weights have the same mass. Compute an order in which to put the weights on the scale, and on which pan to place each, so that the input string is respected.
Input: a number 0 < n < 51 (the number of weights), then the weights and the string.
Output: n lines, each with a weight and "R" or "L" (the side where you put that weight). If there are many solutions, output any of them.
Example 1:
Input:
3
10 20 30
LRL
Output:
10 L
20 R
30 L
Example 2:
Input:
3
10 20 30
LLR
Output:
20 L
10 R
30 R
Example 3:
Input:
5
10 20 30 40 50
LLLLR
Output:
50 L
10 L
20 R
30 R
40 R
I already tried to solve it with recursion, but without success. Can someone please help me with this problem or just give me hints on how to solve it?
Since you do not show any code of your own, I'll give you some ideas without code. If you need more help, show more of your work and I can show you Python code that solves your problem.
Your problem is suitable for backtracking. Wikipedia's definition of this algorithm is
Backtracking is a general algorithm for finding all (or some) solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons a candidate ("backtracks") as soon as it determines that the candidate cannot possibly be completed to a valid solution.
and
Backtracking can be applied only for problems which admit the concept of a "partial candidate solution" and a relatively quick test of whether it can possibly be completed to a valid solution.
Your problem satisfies those requirements. At each stage you need to choose one of the remaining weights and one of the two pans of the scale. When you place the chosen weight on the chosen pan, you determine if the corresponding letter from the input string is satisfied. If not, you reject the choice of weight and pan. If so, you continue by choosing another weight and pan.
Your overall routine first inputs and prepares the data. It then calls a recursive routine that chooses one weight and one pan at each level. Some of the information needed by each level could be put into mutable global variables, but it is clearer to pass all needed information as parameters. Each call to the recursive routine needs to pass:
the weights not yet used
the input L/R string not yet used
the current state of the weights on the pans, in a format that can easily be printed when finalized (perhaps an array of ordered pairs of a weight and a pan)
the current weight imbalance of the pans. This could be calculated from the previous parameter, but time is saved by passing it separately. It is the total of the weights on the right pan minus the total of the weights on the left pan (or vice versa).
Your base case for the recursion is when the unused-weights and unused-letters are empty. You then have finished the search and can print the solution and quit the program. Otherwise you loop over all combinations of one of the unused weights and one of the pans. For each combination, calculate what the new imbalance would be if you placed that weight on that pan. If that new imbalance agrees with the corresponding letter, call the routine recursively with appropriately-modified parameters. If not, do nothing for this weight and pan.
You still have a few choices to make before coding, such as the data structure for the unused weights. Show me some of your own coding efforts, and then I'll give you my Python code.
Be aware that this could be slow for a large number of weights. For n weights and two pans, the total number of ways to place the weights on the pans is n! * 2**n (that is a factorial and an exponentiation). For n = 50 that is over 3e79, much too large to do. The backtracking avoids most groups of choices, since choices are rejected as soon as possible, but my algorithm could still be slow. There may be a better algorithm than backtracking, but I do not see it. Your problem seems to be designed to be handled by backtracking.
Now that you have shown more effort of your own, here is my un-optimized Python 3 code. This works for all the examples you gave, though I got a different valid solution for your third example.
def weights_on_pans():
    def solve(unused_weights, unused_tilts, placement, imbalance):
        """Place the weights on the scales using recursive
        backtracking. Return True if successful, False otherwise."""
        if not unused_weights:
            # Done: print the placement and note that we succeeded
            for weight, pan in placement:
                print(weight, 'L' if pan < 0 else 'R')
            return True  # success right now
        tilt, *later_tilts = unused_tilts
        for weight in unused_weights:
            for pan in (-1, 1):  # -1 means left, 1 means right
                new_imbalance = imbalance + pan * weight
                if new_imbalance * tilt > 0:  # both negative or both positive
                    # Continue searching since imbalance in proper direction
                    if solve(unused_weights - {weight},
                             later_tilts,
                             placement + [(weight, pan)],
                             new_imbalance):
                        return True  # success at a lower level
        return False  # not yet successful

    # Get the inputs from standard input. (This version has no validity checks.)
    cnt_weights = int(input())
    weights = {int(item) for item in input().split()}
    letters = input()
    # Call the recursive routine with appropriate starting parameters.
    tilts = [(-1 if letter == 'L' else 1) for letter in letters]
    solve(weights, tilts, [], 0)

weights_on_pans()
The main way I can see to speed up that code is to avoid the O(n) operations in the call to solve in the inner loop. That means perhaps changing the data structure of unused_weights and changing how it, placement, and perhaps unused_tilts/later_tilts are modified to use O(1) operations. Those changes would complicate the code, which is why I did not do them.
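For the curious, here is an untested sketch of how that variant might look: the weights live in a list, and ws[i:] holds the weights not yet placed; the chosen weight is swapped into position i, so "removing" it is an O(1) operation, and every mutation is undone on backtrack.

def weights_on_pans_fast():
    # Same algorithm as above, but ws and placement are mutated in place
    # instead of being copied on every recursive call.
    cnt_weights = int(input())
    ws = [int(item) for item in input().split()]
    tilts = [-1 if letter == 'L' else 1 for letter in input()]
    placement = []

    def solve(i, imbalance):
        if i == len(ws):
            for weight, pan in placement:
                print(weight, 'L' if pan < 0 else 'R')
            return True
        for j in range(i, len(ws)):
            ws[i], ws[j] = ws[j], ws[i]        # "remove" ws[j] in O(1)
            for pan in (-1, 1):
                new_imbalance = imbalance + pan * ws[i]
                if new_imbalance * tilts[i] > 0:
                    placement.append((ws[i], pan))
                    if solve(i + 1, new_imbalance):
                        return True
                    placement.pop()            # undo on backtrack
            ws[i], ws[j] = ws[j], ws[i]        # undo the swap
        return False

    solve(0, 0)

weights_on_pans_fast()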

Random factor for simulating a Poisson distribution with a tick-tick simulation

I've got a program that simulates events happening. I have the following constraints:
The distribution should approximate a Poisson distribution, i.e., independent events occurring at a fixed rate.
At time 1000 ticks, the event should have happened 100 times (on average). (1000 and 100 are placeholders; the real numbers are determined experimentally from the real-world system I'm trying to model.)
I'm currently using code of the following form:
def tick():
    doMaintenance()
    while random() < 0.1:
        eventHappens()
I use a while instead of an if to simulate the idea that these events are independent. Otherwise, there would be no chance of two events occurring in the same tick. However, I suspect that means that the random() < 0.1 (where random returns a number in the half-open range [0.0, 1.0)) is slightly inaccurate. (It's okay that I'm quantizing the event occurrence time.)
Can somebody suggest the correct random() < f constant to use if I want (in the general case) a Poisson distribution such that at time t there will be event count e? I believe that such a constant f exists, but its derivation is not obvious to me.
I'm putting this in stackoverflow.com so I can conveniently talk in coding terms and because I'm using a tick-tick simulation which is more familiar to numerical simulation programmers than mathematicians. If this is something more appropriate in math.stackexchange.com, though, let me know.
As MSalters pointed out in comments, even with λ = 100 / 1000 = 0.1, there's a real possibility of having more than one occurrence in a single tick of your clock. I generated Poisson cumulative probabilities for the given λ—these show the progressive rarity of multiple events in a single tick. You can use them to determine the right number of events via an if/else construct:
u = random()
if u <= 0.904837418:
    events = 0
elif u <= 0.9953211598:
    events = 1
elif u <= 0.9998453469:
    events = 2
elif u <= 0.9999961532:
    events = 3
elif u <= 0.9999999233:
    events = 4
elif u <= 0.9999999987:
    events = 5
else:
    events = 6
Since the probabilities of more than 6 events are below 10^-10, they can probably be safely ignored unless you're doing a ridiculous number of time ticks. You can reduce the number of cases if you're not worried about the rarer cases with higher counts.
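If you would rather not hard-code the thresholds, the same inverse-CDF walk can be computed on the fly. Here is a sketch of mine (not from the original answer) using the recurrence P(k) = P(k-1) * λ/k:

import math
import random

def poisson_draw(lam=0.1):
    # Inverse-CDF sampling: accumulate Poisson probabilities until the
    # uniform draw u is covered, building each term from the previous one.
    u = random.random()
    p = math.exp(-lam)  # P(X = 0)
    cdf = p
    k = 0
    while u > cdf:
        k += 1
        p *= lam / k    # P(X = k) = P(X = k-1) * lam / k
        cdf += p
    return k

Each tick would then call poisson_draw(0.1) and fire eventHappens() that many times.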
I'd still recommend an event-scheduling approach with exponential inter-event times (with the same λ): let the events happen when they happen. But the approach outlined above will work if you have to stick with time steps.
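For completeness, a sketch of that event-scheduling alternative (my illustration; random.expovariate is the standard-library way to draw exponential inter-event gaps):

import random

RATE = 0.1  # lambda: expected events per tick (100 events per 1000 ticks)

def event_times(horizon):
    # Draw successive exponential gaps; the number of events in any interval
    # of length w is then Poisson-distributed with mean RATE * w.
    t = random.expovariate(RATE)
    while t < horizon:
        yield t
        t += random.expovariate(RATE)

for when in event_times(1000):
    pass  # schedule eventHappens() at time when, quantizing to a tick if needed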

Can someone explain how probabilistic counting works?

Specifically, I'm asking about the log log counting approach.
I'll try to clarify the use of probabilistic counters, although note that I'm no expert on this matter.
The aim is to count to very, very large numbers using only a little space to store the counter (e.g. a 32-bit integer).
Morris came up with the idea to maintain a "log count", so instead of counting n, the counter holds log₂(n). In other words, given a value c of the counter, the real count represented by the counter is 2ᶜ.
As logs are generally not integers, the problem becomes deciding when the counter c should be incremented, since we can only increment it in steps of 1.
The idea here is to use a "probabilistic counter": on each call to a method Increment on our counter, we update the actual counter value with some probability p. This is useful because it can be shown that the expected value represented by the counter after n calls to Increment is in fact n (though at any one point in time the counter probably carries some error). We are trading accuracy for the ability to count up to very large numbers with little storage space (e.g. a single register).
One scheme to achieve this, as described by Morris, is to have a counter value c represent the actual count 2ᶜ (i.e. the counter holds the log₂ of the actual count). We update this counter with probability 1/2ᶜ where c is the current value of the counter.
Note that choosing this "base" of 2 means that the representable counts are always powers of 2 (hence the term "order of magnitude estimate"). It is also possible to choose another base b > 1 (typically with b < 2) so that the error is smaller, at the cost of a smaller maximum representable count.
The log log comes into play because a number x needs about log₂(x) bits to be represented, so the counter c ≈ log₂(n) itself needs only about log₂(log₂(n)) bits.
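Here is a small sketch (mine, not from the references) of the base-2 Morris counter. Note that the unbiased estimate after n increments is 2^c - 1, since E[2^c] = n + 1:

import random

class MorrisCounter:
    # Stores only the exponent c, roughly log2 of the true count.
    def __init__(self):
        self.c = 0

    def increment(self):
        # Bump the exponent with probability 1 / 2**c.
        if random.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1  # unbiased: E[2**c] = n + 1 after n increments

counter = MorrisCounter()
for _ in range(100000):
    counter.increment()
print(counter.c, counter.estimate())  # c tends to land near log2(100000) ≈ 16.6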
There are in fact many other schemes to approximate counting, and if you are in need of such a scheme you should probably research which one makes sense for your application.
References:
See Philippe Flajolet's paper for a proof of the average value represented by the counter, or a much simpler treatment in the solutions to problem 5-1 in the book "Introduction to Algorithms". The paper by Morris is usually behind a paywall; I could not find a free version to link here.
This is not exactly about the log log counting approach, but I think it can help you.
Using Morris' algorithm, the counter represents an "order of magnitude estimate" of the actual count. The approximation is mathematically unbiased.
To increment the counter, a pseudo-random event is used, such that the incrementing is a probabilistic event. To save space, only the exponent is kept. For example, in base 2, the counter can estimate the count to be 1, 2, 4, 8, 16, 32, and so on through the powers of two. The memory requirement is simply to hold the exponent.
As an example, to increment from 4 to 8, a pseudo-random number would be generated such that a probability of .25 produces a positive change in the counter. Otherwise, the counter remains at 4. (From Wikipedia.)

Big O of Recursive Methods

I'm having difficulty determining the big O of simple recursive methods. I can't wrap my head around what happens when a method is called multiple times. I would be more specific about my areas of confusion, but at the moment I'm trying to answer some homework questions, and since I don't want to cheat, I ask that anyone responding to this post come up with a simple recursive method and provide a simple explanation of its big O. (Preferably in Java... a language I'm learning.)
Thank you.
You can define the order recursively as well. For instance, let's say you have a function f, and calculating f(n) takes k steps. Now suppose f(n+1) calls f(n) once; then f(n+1) takes k plus some constant number of steps. Each invocation adds only a constant number of extra steps, so this method is O(n).
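Concretely (a toy example of mine, in Python to match the other code on this page, though the asker preferred Java):

def total(n):
    # One recursive call per level, constant work per call: O(n) overall.
    if n == 0:
        return 0
    return n + total(n - 1)

print(total(10))  # 55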
Now look at another example. Let's say you implement Fibonacci naively by adding the two previous results:
fib(n) = { if n <= 1: return n; else: return fib(n-1) + fib(n-2) }
Now let's say you can calculate fib(n-2) and fib(n-1) both in about k steps, so calculating fib(n) needs about k + k = 2k steps. Calculating fib(n+1) then takes about twice as many steps as fib(n). The work roughly doubles each time n grows by one, so this looks like O(2^n).
Admittedly, this is not very formal, but hopefully this way you can get a bit of a feel.
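To see that growth concretely, here is a small instrumented version of the naive Fibonacci (my example, in Python):

calls = 0

def fib(n):
    global calls
    calls += 1          # count every invocation
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(20))   # 6765
print(calls)     # 21891 calls for n = 20; the count roughly doubles as n grows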
You might want to refer to the master theorem for finding the big O of recursive methods. Here is the wikipedia article: http://en.wikipedia.org/wiki/Master_theorem
You want to think of a recursive problem like a tree. Then consider each level of the tree and the amount of work required at it. Problems will generally fall into three categories: root-heavy (the first iteration >> the rest of the tree), balanced (each level has an equal amount of work), and leaf-heavy (the last iteration >> the rest of the tree).
Taking merge sort as an example:
define mergeSort(list toSort):
    if (length of toSort <= 1):
        return toSort
    list left = toSort from [0, length of toSort / 2)
    list right = toSort from [length of toSort / 2, length of toSort)
    return merge(mergeSort(left), mergeSort(right))
You can see that each call of mergeSort in turn calls 2 more mergeSorts of 1/2 the original length. We know that the merge procedure will take time proportional to the number of values being merged.
The recurrence relation is then T(n) = 2*T(n/2) + O(n). The 2 comes from the two recursive calls, and the n/2 from each call receiving only half the elements. However, at each level of the tree there are n elements in total to be merged, so the work at each level is O(n).
We know the work is evenly distributed (O(n) each depth) and the tree is log_2(n) deep, so the big O of the recursive function is O(n*log(n)).
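A runnable Python version of that merge sort, for reference (mine; the pseudocode above is language-neutral):

def merge_sort(xs):
    # T(n) = 2*T(n/2) + O(n), which solves to O(n log n).
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])
    right = merge_sort(xs[mid:])
    # Merge step: O(n) work at every level of the recursion tree.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]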

How to calculate the number of messages sent in a distributed system?

Suppose we have n processes forming a general network. We don't know which processes are connected to each other, but we know their number (n). If at each round every process sends a message to all processes it is connected to and receives one message from each of them, and the program executes for r rounds, is there a way to find how many messages have been sent during the program's execution?
As you have pointed out, without the exact network structure it is impossible to put a specific value on the number of messages sent. Instead, we can look at its Big-O value.
Now just to be clear what we mean by Big-O:
Big-O refers to the upper-bound (i.e. worst-case) complexity
It is possible (and quite likely in real systems) that the actual value will be less
Without some function that describes the average case (e.g. each process is connected to, on average, N/2 other processes) we must assume the worst case
By "worst case" for this problem we mean the maximum number of messages are sent
So let us assume the worst case, in which each process is connected to N - 1 other processes.
Let us also define some variables:
S := the set of processes
N := the number of processes in S
We can represent the set S as a complete (every node connects to every other node), undirected graph in which each node corresponds to a process and each edge corresponds to 2 messages sent (one outgoing transmission and one reply).
From here, we see that the number of edges in a complete graph is (N(N-1))/2
So in the worst case, the number of messages sent per round is N(N-1), or N^2 - N.
Now because we are dealing with Big-O notation, we are interested in how this value grows as a function of N.
Since N^2 dominates the N term, N^2 - N is an element of O(N^2).
So the number of messages sent grows as N^2 in the worst case.
It is also possible to arrive at this result using an adjacency matrix, that is an N x N matrix where a 1 in the (i, j)th element refers to an edge from node i to node j.
Because in the original problem each process sends a single message to every connected process, each of which responds with a single message, we can see that for every pair (i, j) and (j, i) there will be a 1 in the matrix (one representing an outgoing message, one a reply). The exception is the pairs where i = j, i.e. a process doesn't send itself a message.
So the matrix will be completely filled with 1s with the exception of the diagonal.
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
Above: the adjacency matrix for N = 4.
So we first want to determine a formula for the number of messages sent per round as a function of the number of nodes.
By the area of a rectangle, the full N x N matrix contains N x N = N^2 cells.
Now we must subtract the diagonal: it consists of exactly the N pairs (i, i) for i = 1, ..., N, so it contains N cells.
So the overall result is that we have N^2 - N messages per round once the diagonal is excluded.
Now, we use the same Big-O reasoning from above to arrive at the same conclusion, the number of messages grows in the worst case as O(N^2).
So now you need only take into account the number of rounds R that have occurred, leaving you with O(R*N^2).
Of course now you must consider whether you really do have the worst case ...
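As a sanity check, here is a small Python sketch (my illustration) that counts messages directly from an adjacency matrix, so an arbitrary network can be compared against the worst case:

def messages_sent(adj, rounds):
    # Every 1 at (i, j) with i != j is one message per round; the reply
    # travels along (j, i) and is counted by that cell.
    n = len(adj)
    per_round = sum(adj[i][j] for i in range(n) for j in range(n) if i != j)
    return rounds * per_round

# Worst case for N = 4: the complete graph, N^2 - N = 12 messages per round.
N = 4
complete = [[int(i != j) for j in range(N)] for i in range(N)]
print(messages_sent(complete, rounds=3))  # 36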
