spark inconsistency when running count command

spark inconsistency when running count command - count

A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:
imp_sample.where(col("location").isNotNull()).count()
And I am getting slightly different results every time I run it (141,830, then 142,314)!
Or this:
imp_sample.where(col("location").isNull()).count()
and getting 2,587,013, and then 2,586,943. How is it even possible?
Thank you!

As per your comment, you are using sampleBy in your pipeline. sampleBydoesn't guarantee you'll get the exact fractions of rows. It takes a sample with probability for each record being included equal to fractions and can vary from run to run.
Regarding your monotonically_increasing_id question in the comments, it only guarantees that the next id is larger than the previous one, however, it doesn't guarantee ids are consecutive (i,i+i,i+2, etc...).
Finally, you can persist a data frame, by called persist() on it.

Ok, I have suffered majorly from this in the past. I had a seven or eight stage pipeline that normalised a couple of tables, added ids, joined them and grouped them. Consecutive runs of the same pipeline gave different results, although not in any coherent pattern I could understand.
Long story short, I traced this feature to my usage of the function monotonically_increasing_id, supposed resolved by this JIRA ticket, but still evident in Spark 2.2.
I do not know exactly what your pipeline does, but please understand that my fix is to force SPARK to persist results after calling monotonically_increasing_id. I never saw the issue again after I started doing this.
Let me know if a judicious persist resolves this issue.
To persist an RDD or DataFrame, call either df.cache (which defaults to in-memory persistence) or df.persist([some storage level]), for example
df.persist(StorageLevel.DISK_ONLY)
Again, it may not help you, but in my case it forced Spark to flush out and write id values which were behaving non-deterministically given repeated invocations of the pipeline.

Related

Can Flink count to zero?

I have found examples for counting in Flink SQL, however I cannot seem to count to zero.
The obvious use case would be monitoring, where an action would need to be taken if an update is NOT received.
Here is what I tried so far:
select count(*) from emptysource
GROUP BY TUMBLE(PROCTIME(), INTERVAL '1' second)
When counting from the end of a topic that does not receive any messages, the count does not show zero. In fact, nothing seems to be produced by this job, though it does run succesfully.
I have not tried more complicated setups, where we would count per key.
I am primarily interested in Flink SQL, but if another solution in Flink would be needed that would be good to know as well.

I think the problem is that Flink currently doesn't support windows that have no elements, so Flink basically creates a window only after a first element for this window arrives. Some additional info can be found here, even though the answer is for event time, but the rule for processing time is basically the same.
Possible workarounds depend on exact usecase but generally can be something like:
Use keyed timers to emit periodical aggregations (this will only work if at leas one message arrives for key)
Use dummy source just so that windows are created correctly (as described here)

Teradata DELETE ALL vs DROP+CREATE

I've been recently assigned on a project using Teradata.
I've been told to strictly use DROP+CREATE instead of DELETE ALL, because the latter "leaves some space allocated someway". This is counter-intuitive to me, and I think it's probably wrong. I searched the web for a comparison between the two methods, but I found nothing.
This only reinforces my belief that DELETE ALL doesn't suffer from the issue above.
However, if this is the case, I must prove it (both practically and theoretically).
So, my question is: is there a difference in space allocation between the two methods? If not, is there an official document (user guide, technical specification, whatever else) that proves it?
Thank you!

There's a discussion here: http://teradataforum.com/teradata/20120403_105705.htm about the very same subject (although it does not really answer the "leaves some space allocated someway" part). They actually recommend DELETE ALL but for other (performance) reasons:
I'll quote just in case the link goes dead:
"Delete all" will be quicker, although being practical there often isn't a lot of difference in the performance of them.
However, especially for a process that is run regularly (say a daily batch process) then I recommend the "delete all" approach. This will do less work as it only removes the data and leaves the definition in place. Remember that if you remove the definition then this requires accessing multiple dictionary tables, and of course you then have to access those same tables (typically) when you re-create the object.
Apart from the performance aspect, the downside of the drop/create approach is that every time you create an object Teradata inserts "default rows" into the AccessRights table, even if subsequent access to the object is controlled via Role security and/or database level security. As you may well know the AccessRights table can easily get large and very skewed. In my experience many sites have a process which cleans this table on a regular basis, removing redundant rows. If your (typically batch) processes regularly drop/create objects, then you're simply adding rows into the table which have previously been removed by a clean process, and which will be removed in the future by the same process. This all sounds like a complete waste of time to me.

Your impression is correct, you didn't find any reference to "DELETE leaves some space allocated" in any place, because it's simply wrong :-)
DELETE ALL is similar to a TRUNCATE in other DBMSes and in most cases use fastpath processing:

First of all, you cannot do DROP/CREATE in one transaction in Teradata (in Oracle there are other problems with everyday DDL) so when ETL processes become complicated you might end up with the dependence where more important business processes depend on less important (like you might see the customers table empty just because the interests rates were not refreshed
or you have an exceeding varchar value in just one minor column)
My opinion: Use transactions and modular programming. In Teradata this means avoiding DDL where possible and using DELETE/UPDATE/MERGE/INSERT instead of DROP/CREATE.
We have a slightly different situation in Postgres where DDL statements are transactional.

Riak: are my 2is broken?

we're having some weird things happening with a cleanup cronjob and riak:
the objects we store (postboxes) have a 2i for modification date (which is a unix timestamp).
there's a cronjob running freqently deleting all postboxes that have not been modified within 180 days. however we've found evidence that postboxes that some (very little) postboxes that were modified in the last three days were deleted by this cronjob.
After reviewing and debugging several times over every line of code, I am confident, that this is not a problem of the cronjob.
I also traced back all delete calls to that bucket - and no one else is deleting objects there.
Of course I also checked with Riak to read the postboxes with r=ALL: they're definitely gone. (and they are stored with w=QUORUM)
I also checked the logs: updating the post boxes did succeed (there were no errors reported back from the write operations)
This leaves me with two possible causes for this:
riak loses data (which I am not willing to believe that easily)
the secondary indexes are corrupt and queries to them return wrong keys
So my questions are:
Can 2is actually break?
Is it possible to verify that?
Am I missing something completely different?
Cheers,
Matthias

Secondary index queries in Riak are coverage queries, which means that they will only use one of the stored replicas, and not perform a quorum read.
As you are writing with w=QUORUM, it is possible that one (or more) of the replicas may not get updated if you have n_val set to 3 or higher while the operation still is deemed successful. If this is the one selected for the coverage query, you could end up deleting based on the old value. In order to avoid this, you will need to perform updates with w=ALL.

Strategy for handling variable time queries?

I have a typical scenario that I'm struggling with from a performance standpoint. The user selects a value from a dropdown and clicks a button. A stored procedure takes that value as an input parameter, executes, and returns the results to a grid. For just one of the values ('All'), the query runs for roughly 2.5 minutes. For the rest of the values the query runs less than 1ms.
Obviously, having the user wait for 2.5 minutes just isn't going to fly. So, what are some typical strategies to handle this?
Some of my own thoughts:
New table that stores the information for the 'All' value and is generated nightly
Cache the data on the caching server
Any help is appreciated.
Thanks!
Update
A little bit more info:
sp returns two result sets. The first is a group by rollup summary and the second is the first result set, disaggregated (roughly 80,000 rows).

I would first look at if your have the proper indexes in place. Using the Query Analyzer and the Database Tuning Assistant is a simple and often effective way of seeing what indexes might help.
If you still have performance problems after creating the appropriate indexes you might then look at adding tables/views to speed things up. If your query does a lot of joins you might consider creating an indexed view that allows you to do a select with no joins on the denormalized data. Since indexed views are persisted you can see big gains from their use.
You can read up on indexed views here:
http://msdn.microsoft.com/en-us/library/dd171921%28v=sql.100%29.aspx
and read about the database tuning adviser here:
http://msdn.microsoft.com/en-us/library/ms166575.aspx
Also, how many records does "All" return? I have seen people get hung up on the "All" scenario before, but if it returns 1 million records or something then the data is not usable to a person anyways...

Caching data is a good thing, but.... if the SP is inherently flawed, then you might want to actually fix it instead of trying to bandage it with caching.
You might also want to (since you didn't mention here) look at the number of rows "All" returns compared to the other selections and think about your indexes.
Also in your SP does the "All" cause it to run a different sets of tsql as in maybe a case or an if... or is it running the same code just with a different "WHERE"?
It might simply be that "ALL" just returns A LOT of records. You may want to implement paging and partial dataset return using ajax... (kinda like return the first 1000 records early so that it can be displayed and also show a throbber on the screen while the rest of the dataset is returned)
These are all options... if the number of records really isnt that different between ALL and the others... then it probably has something to do with the query/index/program flow.

GAE -- Queries on sharded properties

I understand the theory of sharding values in Google App Engine,as outlined here:
http://code.google.com/appengine/articles/sharding_counters.html
but what happens when I want to run a query on a value that I've sharded? I can't simply query against the value, because it's been split up randomly amongst N different counters. Is the solution just to sum these values back up occasionally to update my main entity? I'm curious to see what solutions others have come up with to this problem.
EDIT: I just discovered the Task Queue API, and it looks like it might be a solution to updating the main value in the background. Anyone tried using this in parallel with sharding?

you're right, you can't use the total sum in another datastore query in a single shot, since it's split between the shards. however, you can run an initial query to gather all of the shards, sum them in memory, and then run your original query using that sum.
beyond that, yes, the task queue is definitely a good approach to doing work like this in the background. take a look at this talk for ideas:
http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex