Can Flink count to zero?

I have found examples for counting in Flink SQL, however I cannot seem to count to zero.
The obvious use case would be monitoring, where an action would need to be taken if an update is NOT received.
Here is what I tried so far:
SELECT COUNT(*) FROM emptysource
GROUP BY TUMBLE(PROCTIME(), INTERVAL '1' SECOND)
When counting from the end of a topic that does not receive any messages, the count never shows zero. In fact, nothing seems to be produced by this job at all, even though it runs successfully.
I have not tried more complicated setups, where we would count per key.
I am primarily interested in Flink SQL, but if another solution in Flink would be needed that would be good to know as well.

I think the problem is that Flink currently doesn't support empty windows: a window is only created once the first element for it arrives. Some additional info can be found here; even though that answer is about event time, the rule for processing time is basically the same.
Possible workarounds depend on the exact use case, but generally they look something like:
Use keyed timers to emit periodic aggregations (this only works once at least one message has arrived for a key; see the sketch after this list)
Use a dummy source just so that windows are created correctly (as described here)
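For the keyed-timer workaround, here is a rough PyFlink DataStream sketch (not Flink SQL), assuming PyFlink 1.14+ and a stream of (key, value) tuples; all names are illustrative, and as noted above a key only starts reporting once it has seen at least one element:
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor

INTERVAL_MS = 1000  # emit a count per key every second, even if it is zero

class PeriodicCount(KeyedProcessFunction):
    def open(self, runtime_context):
        self.count = runtime_context.get_state(ValueStateDescriptor("count", Types.LONG()))

    def process_element(self, value, ctx):
        current = self.count.value()
        if current is None:
            # first element ever seen for this key: start the periodic timer chain
            ctx.timer_service().register_processing_time_timer(
                ctx.timer_service().current_processing_time() + INTERVAL_MS)
            current = 0
        self.count.update(current + 1)

    def on_timer(self, timestamp, ctx):
        # emit the count for the past interval, possibly 0, then re-arm the timer
        yield ctx.get_current_key(), self.count.value() or 0
        self.count.update(0)
        ctx.timer_service().register_processing_time_timer(timestamp + INTERVAL_MS)

env = StreamExecutionEnvironment.get_execution_environment()
# With this bounded demo source the job may finish before a timer fires;
# with a real unbounded source (e.g. Kafka) the counts keep coming every interval.
counts = (env.from_collection([("a", 1), ("a", 2), ("b", 1)],
                              type_info=Types.TUPLE([Types.STRING(), Types.INT()]))
          .key_by(lambda t: t[0])
          .process(PeriodicCount(), output_type=Types.TUPLE([Types.STRING(), Types.LONG()])))
counts.print()
env.execute("periodic-count-sketch")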

Related

spark inconsistency when running count command

A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:
imp_sample.where(col("location").isNotNull()).count()
And I am getting slightly different results every time I run it (141,830, then 142,314)!
Or this:
imp_sample.where(col("location").isNull()).count()
and getting 2,587,013, and then 2,586,943. How is it even possible?
Thank you!
As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get exact fractions of rows. It takes a sample in which each record's probability of being included equals the fraction for its stratum, so the result can vary from run to run.
Regarding your monotonically_increasing_id question in the comments, it only guarantees that each id is larger than the previous one; it doesn't guarantee that ids are consecutive (i, i+1, i+2, ...).
Finally, you can persist a data frame by calling persist() on it.
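To make this concrete, here is a hedged PySpark sketch (made-up data and column names). The sampled DataFrame is evaluated lazily, so which rows it contains is only pinned down once it is persisted; if anything upstream is non-deterministic, two actions over the same lineage can see different samples:
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("sampleBy-consistency").getOrCreate()

raw = spark.range(1000000).withColumn("group", (col("id") % 3).cast("string"))

# Stratified sample: each row of a group is kept with probability 0.1, so the
# resulting counts are only approximately 10% and are not guaranteed to be
# identical between evaluations of the lineage.
imp_sample = raw.sampleBy("group", fractions={"0": 0.1, "1": 0.1, "2": 0.1})

print(imp_sample.count())
print(imp_sample.count())   # may differ if the plan is re-evaluated non-deterministically

# Persisting materializes one concrete sample, so later actions agree.
imp_sample = imp_sample.persist(StorageLevel.MEMORY_AND_DISK)
imp_sample.count()          # first action fills the cache
print(imp_sample.count())   # stable from here on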
Ok, I have suffered majorly from this in the past. I had a seven or eight stage pipeline that normalised a couple of tables, added ids, joined them and grouped them. Consecutive runs of the same pipeline gave different results, although not in any coherent pattern I could understand.
Long story short, I traced this feature to my usage of the function monotonically_increasing_id, supposedly resolved by this JIRA ticket, but still evident in Spark 2.2.
I do not know exactly what your pipeline does, but my fix is to force Spark to persist results after calling monotonically_increasing_id. I never saw the issue again after I started doing this.
Let me know if a judicious persist resolves this issue.
To persist an RDD or DataFrame, call either df.cache() (which defaults to in-memory persistence) or df.persist([some storage level]), for example
df.persist(StorageLevel.DISK_ONLY)
Again, it may not help you, but in my case it forced Spark to flush out and write id values which were behaving non-deterministically given repeated invocations of the pipeline.
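As a hedged illustration of that fix (the input path and column names are made up), the pattern is simply to persist and materialize the DataFrame immediately after assigning the ids, so every downstream action reuses the same values instead of re-running the non-deterministic id generation:
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/data/events")   # hypothetical input

with_ids = events.withColumn("row_id", monotonically_increasing_id())

# Pin the generated ids: without this, each downstream action may re-evaluate
# the lineage and produce different row_id values.
with_ids = with_ids.persist(StorageLevel.DISK_ONLY)
with_ids.count()   # force materialization before any joins / aggregations

with_ids.select("row_id").show(5)   # downstream work now sees stable ids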

DynamoDB atomic counter for account balance

In DynamoDB an Atomic Counter is a number that avoids race conditions
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
What makes a number atomic, and can I add/subtract from a float in non-unit values?
Currently I am doing: "SET balance = balance + :change"
(long version) I'm trying to use DynamoDB for user balances, so accuracy is paramount. The balance can be updated from multiple sources simultaneously. There is no need to pre-fetch the balance, we will never deny a transaction, I just care that when all the operations are finished we are left with the right balance. The operations can also be applied in any order, as long as the final result is correct.
From what I understand, this should be fine, but I haven't seen any atomic-increment examples that apply changes other than 1.
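For reference, in boto3 terms the update looks roughly like this (a hedged sketch; table, key, and attribute names are placeholders, and boto3 represents DynamoDB numbers as Decimal rather than float):
import boto3
from decimal import Decimal

table = boto3.resource("dynamodb").Table("Accounts")   # hypothetical table

def apply_balance_change(account_id, change):
    # Single read-free request; DynamoDB serializes writes to the same item,
    # so concurrent deltas cannot overwrite each other.
    table.update_item(
        Key={"account_id": account_id},
        UpdateExpression="SET balance = balance + :change",
        ExpressionAttributeValues={":change": change},
    )

apply_balance_change("user-123", Decimal("-12.50"))   # non-unit, non-integer delta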
My hesitation arises because questions like Amazon DynamoDB Conditional Writes and Atomic Counters suggest using conditional writes for a similar situation, which sounds like a terrible idea. If I fetch the balance, change it, and do a conditional write, the write could fail if the value has changed in the meantime. However, the balance is the definition of business critical, and I'm always nervous when ignoring documentation.
-Additional Info-
All writes will originate from a Lambda function, and I expect pretty much 100% success rates in writes. However, I also maintain a history of all changes, and in the event the balance ends up in an "unknown" state (e.g. a network timeout), I could lock the table and recalculate the correct balance from the history.
This, I think, gives the best "normal" operation: 99.999% of the time, all updates will work with a single write. Failure could be very costly, as we would need to scan a client's entire history to recreate the balance, but in terms of trade-offs that seems a pretty safe bet.
The documentation for atomic counters is pretty clear, and in my opinion it will not be safe for your use case.
The problem you are solving is pretty common, AWS recommends using optimistic locking in such scenarios.
Please refer to the following AWS documentation,
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.OptimisticLocking.html
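The linked DynamoDBMapper feature is Java-specific, but the same idea can be hand-rolled with boto3: keep a version attribute on the item and make every write conditional on the version that was read. A rough sketch, with made-up names:
import boto3
from decimal import Decimal
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Accounts")   # hypothetical table

def update_balance_with_lock(account_id, change):
    item = table.get_item(Key={"account_id": account_id})["Item"]
    try:
        table.update_item(
            Key={"account_id": account_id},
            UpdateExpression="SET balance = :new_balance, version = :next",
            ConditionExpression="version = :seen",
            ExpressionAttributeValues={
                ":new_balance": item["balance"] + change,
                ":seen": item["version"],
                ":next": item["version"] + 1,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # another writer got in first; re-read and retry
        raise

update_balance_with_lock("user-123", Decimal("-12.50"))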
It appears that this concept is workable, according to an AWS staff reply:
Often application writers will use a combination of both approaches, where you can have an atomic counter for real-time counting, and an audit table for perfect accounting later on.
https://forums.aws.amazon.com/thread.jspa?messageID=470243&#470243
There is also confirmation that the update will be atomic and that any update operation will be consistent:
All non batch requests you send to DynamoDB gets processed atomically - there is no interleaving involved of any sort between requests. Write requests are also consistent, so any write request will update the latest version of the item at the time the request is received.
https://forums.aws.amazon.com/thread.jspa?messageID=621994&#621994
In fact, every write to a given item is strongly consistent: in DynamoDB, all operations against a given item are serialized.
https://forums.aws.amazon.com/thread.jspa?messageID=324353&#324353

Riak: are my 2is broken?

We're having some weird things happening with a cleanup cronjob and Riak:
the objects we store (postboxes) have a 2i for the modification date (a unix timestamp).
There's a cronjob running frequently that deletes all postboxes that have not been modified within 180 days. However, we've found evidence that some (very few) postboxes that were modified in the last three days were deleted by this cronjob.
After reviewing and debugging every line of code several times over, I am confident that this is not a problem with the cronjob.
I also traced back all delete calls to that bucket - and no one else is deleting objects there.
Of course I also checked with Riak to read the postboxes with r=ALL: they're definitely gone. (and they are stored with w=QUORUM)
I also checked the logs: updating the post boxes did succeed (there were no errors reported back from the write operations)
This leaves me with two possible causes for this:
riak loses data (which I am not willing to believe that easily)
the secondary indexes are corrupt and queries to them return wrong keys
So my questions are:
Can 2is actually break?
Is it possible to verify that?
Am I missing something completely different?
Cheers,
Matthias
Secondary index queries in Riak are coverage queries, which means that they will only use one of the stored replicas, and not perform a quorum read.
As you are writing with w=QUORUM, it is possible that one (or more) of the replicas does not get updated if you have n_val set to 3 or higher, while the operation is still deemed successful. If that replica is the one selected for the coverage query, you could end up deleting based on the old value. To avoid this, you will need to perform updates with w=ALL.
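As a rough sketch with the (old) Basho Python client, the change amounts to storing with w='all' so every replica carries the fresh 2i entry before the cleanup job's coverage query can hit a stale one (bucket, key, and index names here are made up):
import time
import riak

client = riak.RiakClient(protocol="pbc", pb_port=8087)
bucket = client.bucket("postboxes")

def store_postbox(key, payload):
    obj = bucket.new(key, data=payload)
    obj.add_index("modified_int", int(time.time()))   # 2i on the modification timestamp
    obj.store(w="all")   # wait for all replicas instead of just a quorum

def stale_postbox_keys(max_age_days=180):
    cutoff = int(time.time()) - max_age_days * 86400
    # 2i range query: a coverage query that consults only one replica per
    # partition, which is why a replica that missed an update can return keys
    # whose real modification time is newer than the cutoff.
    return bucket.get_index("modified_int", 0, cutoff)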

Is there (or has there been considered) anything like 'merge' or 'batch' setting in Firebase?

In doing a bit more programming with Firebase today, I found myself wishing for a couple of features:
1) Merge set:
Say I have a firebase ref that has the value {a:1,b:2,c:3}.
If I do something like ref.set({a:-1,b:-2}) the new value will (unsurprisingly) be {a:-1,b:-2}.
Instead, imagine ref.mergeSet({a:-1,b:-2}) which would have a result in the value of the ref being {a:-1,b:-2,c:3}.
Now, I realize that I could do something like ref.child("a").set(-1) and ref.child("b").set(-2) to achieve this result, but in at least some cases, I'd prefer to get only a single call to my .on() handler.
This segues into my second idea.
2) Batch set:
In my application I'd like a way to force an arbitrary number of calls to .set to only result in one call to .on in other clients. Something like:
ref.startBatch()
ref.child("a").set(1)
ref.child("b").set(2)
....
ref.endBatch()
In batch mode, .set wouldn't result in a call to .on, instead, the minimal number of calls to .on would all result from calling .endBatch.
I readily admit that these ideas are pretty nascent, and I wouldn't be surprised if they conflict with existing architectural features of Firebase, but I thought I'd share them anyway. I find that I'm having to spend more time ensuring consistency across clients when using Firebase than I expected to.
Thanks again, and keep up the great work.
UPDATE: We've added a new update() method to the Firebase web client and PATCH support to the REST API, which allow you to atomically modify multiple siblings at a particular location, while leaving the other siblings unmodified. This is what you described as "mergeSet" and can be used as follows:
ref.update({a: -1, b: -2});
which will update 'a' and 'b', but leave 'c' unmodified.
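For REST clients, the equivalent is the new PATCH support; here is a minimal sketch using Python's requests library and a placeholder database URL:
import requests

DB_URL = "https://<your-db>.firebaseio.com"   # placeholder

# Updates only 'a' and 'b' under /ref and leaves sibling 'c' alone
# (a PUT to the same path would replace the whole node instead).
resp = requests.patch(DB_URL + "/ref.json", json={"a": -1, "b": -2})
resp.raise_for_status()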
OLD ANSWER
Thanks for the detailed feature request! We'd love to hear more about your use case and how these primitives would help you. If you're willing to share more details, email support#firebase.com and we can dig into your scenario.
To answer your question though, the primary reason we don't have these features is related to our architecture and the performance / consistency guarantees that we're trying to maintain. Not to go too deep, but if you imagine that your Firebase data is spread across many servers, it's easier for us to have stronger guarantees (atomicity, ordering, etc.) when modifying data that's close in the tree than when modifying data that's far away. So by limiting these guarantees to data that you can replace with a single set() call, we push you in a direction that will perform well with the Firebase architecture.
In some cases, you may be able to get roughly what you want by just reorganizing your tree. For instance, if you know you always want to set 'a' and 'b' together, you could put them under a common 'ab' parent and do ref.child('ab').set({a:-1, b:-2});, which won't affect the 'c' child.
Like I said, we'd love to hear more about your scenario. We're in beta so that we can learn from developers about how they're using the API and where it's falling short! support#firebase.com :-)

GAE -- Queries on sharded properties

I understand the theory of sharding values in Google App Engine, as outlined here:
http://code.google.com/appengine/articles/sharding_counters.html
but what happens when I want to run a query on a value that I've sharded? I can't simply query against the value, because it's been split up randomly amongst N different counters. Is the solution just to sum these values back up occasionally to update my main entity? I'm curious to see what solutions others have come up with to this problem.
EDIT: I just discovered the Task Queue API, and it looks like it might be a solution to updating the main value in the background. Anyone tried using this in parallel with sharding?
You're right, you can't use the total sum in another datastore query in a single shot, since it's split between the shards. However, you can run an initial query to gather all of the shards, sum them in memory, and then run your original query using that sum.
Beyond that, yes, the task queue is definitely a good approach to doing work like this in the background (a rough sketch of both steps follows below). Take a look at this talk for ideas:
http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
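Putting both suggestions together, here is a hedged sketch using the old Python db and taskqueue APIs; model, field, and handler names are all assumptions, and the shard model is simplified compared to the linked article:
from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp

class GeneralCounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(required=True, default=0)

class Article(db.Model):
    view_count = db.IntegerProperty(default=0)   # denormalized copy you can query/filter on

def shard_total(counter_name):
    # Gather all shards for this counter and sum them in memory.
    shards = GeneralCounterShard.all().filter("name =", counter_name)
    return sum(shard.count for shard in shards)

def schedule_rollup(article_key):
    # Fold the shard total back into the main entity in the background.
    taskqueue.add(url="/tasks/rollup_views", params={"key": str(article_key)}, countdown=60)

class RollupViewsHandler(webapp.RequestHandler):
    def post(self):
        article = db.get(db.Key(self.request.get("key")))
        article.view_count = shard_total("views-" + str(article.key()))
        article.put()

app = webapp.WSGIApplication([("/tasks/rollup_views", RollupViewsHandler)])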
