Spark tasks with Cassandra - count

I am new to Spark and Cassandra.
We are using Spark on top of Cassandra to read data, since we have a requirement to read data using non-primary-key columns.
One observation is that the number of tasks for a Spark job increases as the data grows. Due to this we are facing a lot of latency in fetching data.
What would be the reasons for the spark job task count increase?
What should be considered to increase performance in Spark with Cassandra?
Please advise.
Thanks,
Mallikarjun

The input split size is controlled by the configuration spark.cassandra.input.split.size_in_mb. Each split will generate a task in Spark; therefore, the more data in Cassandra, the more tasks are generated and the longer it will take to process (which is what you would expect).
To improve performance, make sure you are aligning the partitions using joinWithCassandraTable. Don't use context.cassandraTable(...) unless you absolutely need all the data in the table, and optimize the retrieved data by using select to project only the columns that you need.
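A minimal sketch of that pattern (the keyspace "ks", table "users", partition key "user_id" and column names are made up; adjust them to your schema):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("join-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// An RDD holding only the partition keys we actually care about.
val wantedIds = sc.parallelize(Seq("id-1", "id-2", "id-3")).map(Tuple1(_))

// joinWithCassandraTable fetches only the matching partitions, and select
// projects only the columns we need, instead of scanning the whole table.
val users = wantedIds
  .joinWithCassandraTable("ks", "users")
  .select("user_id", "email")

users.collect().foreach(println)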
If you only need data from some rows, it can make sense to build a secondary table that stores the ids of those rows.
Secondary indexes could also help to select subsets of the data, but I've seen reports of them not being highly performant.

What would be the reasons for the spark job task count increase?
Following on from maasg's answer, rather than setting spark.cassandra.input.split.size_in_mb on the SparkConf, it can be useful to use the ReadConf config when reading from different keyspaces/datacentres in a single job:
import com.datastax.driver.core.ConsistencyLevel
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

val readConf = ReadConf(
  splitCount = Option(500),
  splitSizeInMB = 64,
  fetchSizeInRows = 1000,
  consistencyLevel = ConsistencyLevel.LOCAL_ONE,
  taskMetricsEnabled = true
)
val rows = sc.cassandraTable(cassandraKeyspace, cassandraTable).withReadConf(readConf)
What should be considered to increase performance in Spark with Cassandra?
As far as increasing performance is concerned, this will depend on the jobs you are running and the types of transformations required. Some general advice to maximise Spark-Cassandra performance (As can be found here) is outlined below.
Your choice of operations and the order in which they are applied is critical to performance.
You must organize your processes with task distribution and memory in mind.
The first thing is to determine if your data is partitioned appropriately. A partition in this context is merely a block of data. If possible, partition your data before Spark even ingests it. If this is not practical or possible, you may choose to repartition the data immediately after the load. You can repartition to increase the number of partitions or coalesce to reduce the number of partitions.
The number of partitions should, as a lower bound, be at least 2x the number of cores that are going to operate on the data. Having said that, you will also want to ensure any task you perform takes at least 100ms to justify the distribution across the network. Note that a repartition will always cause a shuffle, where coalesce typically won’t. If you’ve worked with MapReduce, you know shuffling is what takes most of the time in a real job.
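For example, a rough sketch of the repartition/coalesce trade-off (the numbers are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("partition-example").setMaster("local[4]")
val sc = new SparkContext(conf)

// Stand-in data; in the Cassandra case this would be the RDD returned by the scan.
val data = sc.parallelize(1 to 1000000, numSlices = 4)

// repartition always shuffles and can raise or lower the partition count;
// here it targets the 2x-cores lower bound mentioned above.
val widened = data.repartition(sc.defaultParallelism * 2)

// coalesce merges existing partitions without a full shuffle, so it is the
// cheaper choice when you only need to reduce the partition count.
val narrowed = widened.coalesce(sc.defaultParallelism)

println(s"before: ${data.getNumPartitions}, widened: ${widened.getNumPartitions}, narrowed: ${narrowed.getNumPartitions}")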
Filter early and often. Assuming the data source is not preprocessed for reduction, the earliest and best place to reduce the amount of data Spark will need to process is the initial data query. This is often achieved by adding a where clause. Do not bring in any data not necessary to obtain your target result. Bringing in any extra data will affect how much data may be shuffled across the network and written to disk. Moving data around unnecessarily is a real killer and should be avoided at all costs.
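As a sketch of filtering at the source with the Cassandra connector (the table "ks.events" and its columns are made up, and where() can only push down predicates Cassandra itself can serve, e.g. on clustering or indexed columns):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("filter-early")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val recentErrors = sc.cassandraTable("ks", "events")
  .select("event_id", "payload")                     // project only the needed columns
  .where("event_date = ?", "2016-01-01")             // server-side filter, runs in Cassandra
  .filter(_.getString("payload").contains("ERROR"))  // remaining filtering done as early as possible in Spark

println(recentErrors.count())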
At each step you should look for opportunities to filter, distinct, reduce, or aggregate the data as much as possible prior to proceeding to the operation.
Use pipelines as much as possible. Pipelines are a series of transformations that represent independent operations on a piece of data and do not require a reorganization of the data as a whole (a shuffle). For example: a map from a string to its length is independent, whereas a sort by value requires a comparison against other data elements and a reorganization of data across the network (a shuffle).
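A small example of the distinction (local data, values arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("pipeline-example").setMaster("local[2]")
val sc = new SparkContext(conf)

val lines = sc.parallelize(Seq("spark", "cassandra", "flink", ""))

// filter and map are narrow: each element is handled independently, so Spark
// pipelines them into a single stage with no shuffle.
val lengths = lines.filter(_.nonEmpty).map(_.length)

// sortBy is wide: elements must be compared against each other, which forces
// a shuffle (a reorganization of data across the network).
val sorted = lengths.sortBy(identity)

println(sorted.collect().mkString(", "))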
In jobs which require a shuffle see if you can employ partial aggregation or reduction before the shuffle step (similar to a combiner in MapReduce). This will reduce data movement during the shuffle phase.
Some common tasks that are costly and require a shuffle are sorts, group by key, and reduce by key. These operations require the data to be compared against other data elements which is expensive. It is important to learn the Spark API well to choose the best combination of transformations and where to position them in your job. Create the simplest and most efficient algorithm necessary to answer the question.
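To make the combiner point concrete, here is a small sketch contrasting reduceByKey, which pre-aggregates on each partition before the shuffle, with groupByKey, which ships every value across the network first (the data is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("combine-example").setMaster("local[2]")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

// reduceByKey combines values on each partition before the shuffle
// (like a MapReduce combiner), so far less data crosses the network.
val reduced = pairs.reduceByKey(_ + _)

// groupByKey ships every single value across the network and only then
// lets you aggregate, which is usually much more expensive for counts/sums.
val grouped = pairs.groupByKey().mapValues(_.sum)

println(reduced.collect().toMap == grouped.collect().toMap) // same result, different cost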

Related

Limitations of using sequential IDs in Cloud Firestore

I read in a stackoverflow post that (link here)
By using predictable (e.g. sequential) IDs for documents, you increase the chance you'll hit hotspots in the backend infrastructure. This decreases the scalability of the write operations.
I would appreciate it if anyone could explain in more detail the limitations that can occur when using sequential or user-provided ids.
Cloud Firestore scales horizontally by allocating key ranges to machines. As load increases beyond a certain threshold on a single machine, it will split the range being served by it and assign it to 2 machines.
Let's say you are just starting to write to Cloud Firestore, which means a single server is currently handling the entire range.
When you are writing new documents with random ids and we split the range into 2, each machine will end up with roughly the same load. As load increases, we continue to split into more machines, with each one getting roughly the same load. This scales well.
When you are writing new documents with sequential Ids, if you exceed the write rate a single machine can handle, the system will try to split the range into 2. Unfortunately, one half will get no load, and the other half the full load! This doesn't scale well as you can never get more than a single machine to handle your write load.
In the case where a single machine is serving more load than it can optimally handle, we call this "hot spotting". Sequential ids mean we cannot scale to handle more load. Incidentally, this same concept applies to index entries too, which is why we warn against sequential index values, such as timestamps of "now", as well.
So, how much is too much load? We generally say 500 writes/second is what a single machine will handle, although this will naturally vary depending on a lot of factors, such as how big a document you are writing, number of transactions, etc.
With this in mind, you can see that smaller, more consistent workloads aren't a problem, but if you want something that scales with traffic, sequential document ids or index values will naturally limit you to what a single machine in the database can keep up with.
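A toy simulation of the splitting behaviour (this is not how Firestore is implemented; the split point, machine names and counts are all made up) that shows why random ids spread load after a range split while sequential ids keep hitting one side:

import scala.util.Random

object HotspotSketch {
  def main(args: Array[String]): Unit = {
    val splitPoint = "500000" // pretend the backend split the key range here

    val randomIds     = Seq.fill(1000)(f"${Random.nextInt(1000000)}%06d")
    val sequentialIds = (1 to 1000).map(i => f"$i%06d")

    def loadPerHalf(ids: Seq[String]): Map[String, Int] =
      ids.groupBy(id => if (id < splitPoint) "machine A" else "machine B")
        .map { case (machine, keys) => machine -> keys.size }

    println(s"random ids:     ${loadPerHalf(randomIds)}")     // roughly 50/50
    println(s"sequential ids: ${loadPerHalf(sequentialIds)}") // everything lands on machine A
  }
}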

Embedded key-value db vs. just storing one file per key?

I'm confused about the advantage of embedded key-value databases over the naive solution of just storing one file on disk per key. For example, databases like RocksDB, Badger, SQLite use fancy data structures like B+ trees and LSMs but seem to get roughly the same performance as this simple solution.
For example, Badger (which is the fastest Go embedded db) takes about 800 microseconds to write an entry. In comparison, creating a new file from scratch and writing some data to it takes 150 mics with no optimization.
EDIT: to clarify, here's the simple implementation of a key-value store I'm comparing with the state of the art embedded dbs. Just hash each key to a string filename, and store the associated value as a byte array at that filename. Reads and writes are ~150 mics each, which is faster than Badger for single operations and comparable for batched operations. Furthermore, the disk space is minimal, since we don't store any extra structure besides the actual values.
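For reference, this is roughly the naive store being described, written as a Scala sketch (the directory name and choice of hash are arbitrary):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object FilePerKeyStore {
  private val dir = Paths.get("kvstore")
  Files.createDirectories(dir)

  // Hash the key to a fixed-length filename inside the store directory.
  private def fileFor(key: String) = {
    val digest = MessageDigest.getInstance("SHA-256").digest(key.getBytes(StandardCharsets.UTF_8))
    dir.resolve(digest.map("%02x".format(_)).mkString)
  }

  def put(key: String, value: Array[Byte]): Unit = Files.write(fileFor(key), value)

  def get(key: String): Option[Array[Byte]] =
    if (Files.exists(fileFor(key))) Some(Files.readAllBytes(fileFor(key))) else None

  def main(args: Array[String]): Unit = {
    put("user:42", "hello".getBytes(StandardCharsets.UTF_8))
    println(get("user:42").map(new String(_, StandardCharsets.UTF_8)))
  }
}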
I must be missing something here, because the solutions people actually use are super fancy and optimized using things like bloom filters and B+ trees.
But Badger is not about writing "an" entry:
My writes are really slow. Why?
Are you creating a new transaction for every single key update? This will lead to very low throughput.
To get best write performance, batch up multiple writes inside a transaction using single DB.Update() call.
You could also have multiple such DB.Update() calls being made concurrently from multiple goroutines.
That leads to issue 396:
I was looking for fast storage in Go and so my first try was BoltDB. I need a lot of single-write transactions. Bolt was able to do about 240 rq/s.
I just tested Badger and I got a crazy 10k rq/s. I am just baffled
That is because:
LSM tree has an advantage compared to B+ tree when it comes to writes.
Also, values are stored separately in value log files so writes are much faster.
You can read more about the design here.
One of the main points (hard to replicate with simple reads/writes of files) is:
Key-Value separation
The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.
Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.
In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.
Your question assumes that the only operations needed are single random reads and writes. Those are the worst-case scenarios for log-structured merge (LSM) approaches like Badger or RocksDB. A range query, where all keys or key-value pairs in a range get returned, leverages sequential reads (due to the adjacency of sorted key-value pairs within files) to read data at very high speeds. For Badger, you mostly get that benefit when doing key-only or small-value range queries, since keys are stored in an LSM tree while large values are appended to a not-necessarily-sorted log file. For RocksDB, you'll get fast key-value pair range queries.
The previous answer somewhat addresses the advantage on writes: the use of buffering. If you write many key-value pairs, rather than storing each in a separate file, LSM approaches hold these in memory and eventually flush them in one file write. There's no free lunch, so asynchronous compaction must be done to remove overwritten data and to prevent queries from having to check too many files.
Previously answered here. Mostly similar to the other answers provided here, but it makes one important additional point: files in a filesystem can't occupy the same block on disk. If your records are, on average, significantly smaller than the typical disk block size (4-16 KiB), storing them as separate files will incur substantial storage overhead. For example, a 100-byte value stored as its own file still consumes a full 4 KiB block, roughly a 40x overhead.

Difference between shuffle() and rebalance() in Apache Flink

I am working on my bachelor's final project, which is a comparison between Apache Spark Streaming and Apache Flink (streaming only), and I have just arrived at "Physical partitioning" in Flink's documentation. The problem is that the documentation doesn't explain well how these two transformations work. Directly from the documentation:
shuffle(): Partitions elements randomly according to a uniform distribution.
rebalance(): Partitions elements round-robin, creating equal load per partition. Useful for performance optimisation in the presence of data skew.
Source: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning
Both are done automatically, so what I understand is that they both redistribute the data equally (shuffle() -> uniform distribution, rebalance() -> round-robin) and randomly. Then I deduce that rebalance() distributes the data in a better way ("equal load per partition") so the tasks have to process the same amount of data, whereas shuffle() may create bigger and smaller partitions. In which cases might you prefer shuffle() over rebalance()?
The only thing that comes to my mind is that rebalance() probably requires some processing time, so in some cases the time spent rebalancing might outweigh the time it saves in later transformations.
I have been looking for this and nobody has talked about this, only in a mailing list of Flink, but they don't explain how shuffle() works.
Thanks to Sneftel who has helped me to improve my question asking me things to let me rethink about what I wanted to ask; and to Till who answered quite well my question. :D
As the documentation states, shuffle will randomly distribute the data whereas rebalance will distribute the data in a round robin fashion. The latter is more efficient since you don't have to compute a random number. Moreover, depending on the randomness, you might end up with some kind of not so uniform distribution.
On the other hand, rebalance will always start sending the first element to the first channel. Thus, if you have only few elements (fewer elements than subtasks), then only some of the subtasks will receive elements, because you always start to send the first element to the first subtask. In the streaming case this should eventually not matter because you usually have an unbounded input stream.
The actual reason why both methods exist is historical. shuffle was introduced first. In order to make the batch and streaming APIs more similar, rebalance was then introduced.
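For completeness, a minimal sketch of how the two partitioners are used in a job, written against the Scala DataStream API of the 1.x releases referenced above (element values and the job name are arbitrary):

import org.apache.flink.streaming.api.scala._

object PartitioningExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)

    // rebalance: round-robin redistribution across the downstream subtasks
    source.rebalance.map(_ * 2).print()

    // shuffle: random redistribution across the downstream subtasks
    source.shuffle.map(_ * 2).print()

    env.execute("shuffle vs rebalance")
  }
}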
This statement by Flink is misleading:
Useful for performance optimisation in the presence of data skew.
Since it's used to describe rebalance but not shuffle, it suggests it's the distinguishing factor. My understanding of it was that if some items are slow to process and some fast, the partitioner would use the next free channel to send the item to. But this is not the case; compare the code for rebalance and shuffle. rebalance just advances to the next channel regardless of how busy it is.
// rebalance
nextChannelToSendTo = (nextChannelToSendTo + 1) % numberOfChannels;
// shuffle
nextChannelToSendTo = random.nextInt(numberOfChannels);
The statement can be also understood differently: the "load" doesn't mean actual processing time, just the number of items. If your original partitioning has skew (vastly different number of items in partitions), the operation will assign items to partitions uniformly. However in this case it applies to both operations.
My conclusion: shuffle and rebalance do the same thing, but rebalance does it slightly more efficiently. However, the difference is so small that it's unlikely you'll notice it; java.util.Random can generate 70m random numbers in a single thread on my machine.

divide workload on different hardware using MPI

I have a small network with computers of different hardware. Is it possible to optimize the division of the workload between these machines using MPI, i.e. give nodes with larger RAM and a better CPU more data to compute, minimizing the waiting time between the different nodes before the final reduction?
Thanks!
In my program data are divided into equal-sized batches. Each node in the network will process some of them. The result of each batch will be summed up after all batches are processed.
Can you divide the work into more batches than there are processes? If so, change your program so that instead of each process receiving one batch, the master keeps sending batches to whichever node is available, for as long as there are unassigned batches. It should be a fairly easy modification, and it will make faster nodes process more data, leading to a lower overall completion time. There are further enhancements you can make, e.g. once all batches have been assigned and a fast node is available, you could take an already assigned batch away from a slow node and reassign it to said fast node. But these may not be worth the extra effort.
If you absolutely have to work with as many batches as you have nodes, then you'll have to find some way of deciding which nodes are fast and which ones are slow. Perhaps the most robust way of doing this is to assign small, equally sized test batches to each process, and have them time their own solutions. The master can then divide the real data into appropriately sized batches for each node. The biggest downside to this approach is that if the initial speed measurement is inaccurate, then your efforts at load balancing may end up doing more harm than good. Also, depending on the exact data and algorithm you're working with, runtimes with small data sets may not be indicative of runtimes with large data sets.
Yet another way would be to take thorough measurements of each node's speed (i.e. multiple runs with large data sets) in advance, and have the master balance batch sizes according to this precompiled information. The obvious complication here is that you'll somehow have to keep this registry up to date and available.
All in all, I would recommend the very first approach: divide the work into many smaller chunks, and assign chunks to whichever node is available at the moment.
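To illustrate only the scheduling idea (this is not MPI code, just a thread-based simulation of "the next free node pulls the next batch"; the batch count and per-node delays are made up), note how the faster workers automatically end up processing more batches:

import java.util.concurrent.atomic.{AtomicInteger, AtomicIntegerArray}
import java.util.concurrent.{Executors, TimeUnit}

object DynamicDispatchSketch {
  def main(args: Array[String]): Unit = {
    val totalBatches = 60
    val nextBatch = new AtomicInteger(0)   // the master's queue of unassigned batches

    val nodeDelaysMs = Seq(10L, 10L, 40L)  // two fast nodes, one slow node
    val processed = new AtomicIntegerArray(nodeDelaysMs.size)
    val pool = Executors.newFixedThreadPool(nodeDelaysMs.size)

    nodeDelaysMs.zipWithIndex.foreach { case (delayMs, node) =>
      pool.submit(new Runnable {
        def run(): Unit = {
          var batch = nextBatch.getAndIncrement()
          while (batch < totalBatches) {   // keep pulling work while any is left
            Thread.sleep(delayMs)          // stand-in for the real computation
            processed.incrementAndGet(node)
            batch = nextBatch.getAndIncrement()
          }
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    nodeDelaysMs.indices.foreach(n => println(s"node $n processed ${processed.get(n)} batches"))
  }
}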

Hadoop suitability for recursive data processing

I have a filtering algorithm that needs to be applied recursively and I am not sure if MapReduce is suitable for this job. Without giving too much away, I can say that each object that is being filtered is characterized by a collection of ordered lists or queues.
The data is not huge, just about 250MB when I export from SQL to CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. The filtration algorithm at each node works on the collection of lists assigned to the node, and at the end of the filtration either a list remains the same as before the filtration or the head of the list is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as much as 2000 nodes.
Simple, but it requires perhaps up to 1000 recursions before the algorithm's termination conditions are met.
My question is would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work across a large number of machines. In order to fully benefit from Hadoop your application has to be characterized by at least the following three things:
work with large amounts of data (data which is distributed in the cluster of machines) - which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter-gather) model.
It seems that out of these 3, your application has only the last 2 characteristics (with the observation that you are trying to recursively use a scatter-gather procedure, which means a large number of jobs, equal to the recursion depth; see the last paragraph for why this might not be appropriate for Hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing rather than on distributed data-intensive processing. Of course, using the processing power of a networked cluster is tempting, but this comes at a cost: mainly the time inefficiency caused by network communication (the network being the most contended resource in a Hadoop cluster) and by I/O. In scenarios which are well fitted to the Hadoop framework, these inefficiencies can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As far as I can see, you would need 1000 jobs. The setup and cleanup of all those jobs would be unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
Recursive algorithms are hard in distributed systems since they can quickly lead to starvation. Any middleware that would work for this needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
The litmus test for distributed continuations: try to develop a naive Fibonacci implementation in a distributed context using recursive calls. Here is GridGain's example that implements this using continuations.
Hope it helps.
Quick and dirty, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!
