Stripes vs Batch in ORC & Partitions in Parquet - bigdata

I am new to Big Data, trying to understand different file formats from better query executions perspective , while trying to understand more about layout of the files of ORC & Parquet, came across the below terminology and wanted to validate my understanding.
Stripes - as per my understanding they are group of rows which belongs to same data set, for example if we have data for a different countries in a
file , the data for each country is in one Stripe , this helps in reducing the amount of data transferred from disk to memory as the Stripes not relevant can be skipped using Predicate push down.
VectorizedRowBatch , this helps in vectorization of columnar data for efficient CPU processing . it processes Batch by batch. in this context I could not understand the difference between Stripes & Batch. As I understood stripes helps to reduce the amount of data being transferred from disk to Memory , while Batch helps in processing the records paralelly using vectorization approach.
3.Stripes are only specific to ORC file format , similar concept for Parquet files is "Partitions" which is related data stored in RowGroups. which helps in Predicate push down.
for Parquet we do not have vectorization support , does that mean that ORC is better compared to Parquet for better CPU utilization.
Please correct me if my understanding is wrong for the points mentioned.

Related

How to speedup bulk importing into google cloud datastore with multiple workers?

I have an apache-beam based dataflow job to read using vcf source from a single text file (stored in google cloud storage), transform text lines into datastore Entities and write them into the datastore sink. The workflow works fine but the cons I noticed is that:
The write speed into datastore is at most around 25-30 entities per second.
I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100 but the execution seems to prefer one worker (see graph below: the target workers once increased to 2 but reduced to 1 "based on the ability to parallelize the work in the currently running step").
I did not use ancestor path for the keys; all the entities are the same kind.
The pipeline code looks like below:
def write_to_datastore(project, user_options, pipeline_options):
"""Creates a pipeline that writes entities to Cloud Datastore."""
with beam.Pipeline(options=pipeline_options) as p:
(p
| 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
| 'Create my entity' >> beam.ParDo(
ToEntityFn(), user_options.kind)
| 'Write to datastore' >> WriteToDatastore(project))
Because I have millions of rows to write into the datastore, it would take too long to write with a speed of 30 entities/sec.
Question: The input is just one huge gzipped file. Do I need to split it into multiple small files to trigger multiple workers? Is there any other way I can make the importing faster? Do I miss something in the num_workers setup? Thanks!
I'm not familiar with apache beam, the answer is from the general flow perspective.
Assuming there are no dependencies to be considered between entity data in various input file sections then yes, working with multiple input files should definitely help as all these files could then be processed virtually in parallel (depending, of course, on the max number of available workers).
You might not need to split the huge zipfile beforehand, it might be possible to simply hand off segments of the single input data stream to separate data segment workers for writing, if the overhead of such handoff itself is neglijible compared to the actual data segment processing.
The overall performance limitation would be the speed of reading the input data, splitting it in segments and handoff to the segment data workers.
A data segment worker would further split the data segment it receives in smaller chunks of up to the equivalent of the max 500 entities that can be converted to entities and written to the datastore in a single batch operation. Depending of the datastore client library used it may be possible to perform this operation asyncronously, allowing the split into chunks and conversion to entities to continue without waiting for the previous datastore writes to complete.
The performance limitation at the data segment worker would then be the speed at which the data segment can be split into chunks and the chunk converted to entities
If async ops aren't available or for even higher throughput, yet another handoff of each chunk to a segment worker could be performed, with the segment worker performing the conversion to entities and datastore batch write.
The performance limitation at the data segment worker level would then be just the speed at which the data segment can be split into chunks and handed over to the chunk workers.
With such approach the actual conversion to entities and batch writing them to the datastore (async or not) would no longer sit in the critical path of splitting the input data stream, which is, I believe, the performance limitation in your current approach.
I looked into the design of vcfio. I suspect (if I understand correctly) that the reason I always get one worker when the input is a single file is due to the limit of the _VcfSource and the VCF format constraint. This format has a header part that defines how to translate the non-header lines. This causes that each worker that reads the source file has to work on an entire file. When I split the single file into 5 separate files that share the same header, I successfully get up to 5 workers (but not any more probably due to the same reason).
One thing I don't understand is that the number of workers that read can be limited to 5 (in this case). But why we are limited to have only 5 workers to write? Anyway, I think I have found the alternative way to trigger multiple workers with beam Dataflow-Runner (use pre-split VCF files). There is also a related approach in gcp variant transforms project, in which the vcfio has been significantly extended. It seems to support the multiple workers with a single input vcf file. I wish the changes in that project could be merged into the beam project too.

Embedded key-value db vs. just storing one file per key?

I'm confused about the advantage of embedded key-value databases over the naive solution of just storing one file on disk per key. For example, databases like RocksDB, Badger, SQLite use fancy data structures like B+ trees and LSMs but seem to get roughly the same performance as this simple solution.
For example, Badger (which is the fastest Go embedded db) takes about 800 microseconds to write an entry. In comparison, creating a new file from scratch and writing some data to it takes 150 mics with no optimization.
EDIT: to clarify, here's the simple implementation of a key-value store I'm comparing with the state of the art embedded dbs. Just hash each key to a string filename, and store the associated value as a byte array at that filename. Reads and writes are ~150 mics each, which is faster than Badger for single operations and comparable for batched operations. Furthermore, the disk space is minimal, since we don't store any extra structure besides the actual values.
I must be missing something here, because the solutions people actually use are super fancy and optimized using things like bloom filters and B+ trees.
But Badger is not about writing "an" entry:
My writes are really slow. Why?
Are you creating a new transaction for every single key update? This will lead to very low throughput.
To get best write performance, batch up multiple writes inside a transaction using single DB.Update() call.
You could also have multiple such DB.Update() calls being made concurrently from multiple goroutines.
That leads to issue 396:
I was looking for fast storage in Go and so my first try was BoltDB. I need a lot of single-write transactions. Bolt was able to do about 240 rq/s.
I just tested Badger and I got a crazy 10k rq/s. I am just baffled
That is because:
LSM tree has an advantage compared to B+ tree when it comes to writes.
Also, values are stored separately in value log files so writes are much faster.
You can read more about the design here.
One of the main point (hard to replicate with simple read/write of files) is:
Key-Value separation
The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.
Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.
In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.
Your question assumes that the only operation needed are single random reads and writes. Those are the worst case scenarios for log-structured merge (LSM) approaches like Badger or RocksDB. The range query, where all keys or key-value pairs in a range gets returned, leverages sequential reads (due to the adjacencies of sorted kv within files) to read data at very high speeds. For Badger, you mostly get that benefit if doing key-only or small value range queries since they are stored in a LSM while large values are appended in a not-necessarily sorted log file. For RocksDB, you’ll get fast kv pair range queries.
The previous answer somewhat addresses the advantage on writes - the use of buffering. If you write many kv pairs, rather than storing each in separate files, LSM approaches hold these in memory and eventually flush them in a file write. There’s no free lunch so asynchronous compaction must be done to remove overwritten data and prevent checking too many files for queries.
Previously answered here. Mostly similar to other answers provided here but makes one important, additional point: files in a filesystem can't occupy the same block on disk. If your records are, on average, significantly smaller than typical disk block size (4-16 KiB), storing them as separate files will incur substantial storage overhead.

Spark tasks with Cassandra

I am new to Spark and Cassandra.
We are using Spark on top of Cassandra to read data, since we have requirement to read data using non-primary key columns.
One observation is, number of tasks for a spark job increasing w.r.t data growth. Due to this we are facing lot of latency in fetching data.
What would be the reasons for the spark job task count increase?
What should be considered to increase performance in Spark with Cassandra?
Please suggest me.
Thanks,
Mallikarjun
The input split size is controlled by the configuration spark.cassandra.input.split.size_in_mb. Each split will generate a task in Spark, therefore, the more data in Cassandra, the longer it will take to process (which is what you would expect)
To improve performance, make sure you are aligning the partitions using joinWithCassandraTable. Don't use context.cassandraTable(...) unless you absolutely need all the data in the table and optimize the retrieved data using select to project only the columns that you need.
If you need data from some rows, it would make sense to build a secondary table where the id of those rows is stored.
Secondary indexes could also help to select subsets of the data, but I've seen reports of if being not highly performant.
What would be the reasons for the spark job task count increase?
Following on from maasgs answer, rather than setting the spark.cassandra.input.split.size_in_mb. on the SparkConf, it can be useful to use the ReadConf config when reading from different keyspaces/datacentres in a single job:
val readConf = ReadConf(
splitCount = Option(500),
splitSizeInMB = 64,
fetchSizeInRows = 1000,
consistencyLevel = ConsistencyLevel.LOCAL_ONE,
taskMetricsEnabled = true
)
val rows = sc.cassandraTable(cassandraKeyspace, cassandraTable).withReadConf(readConf)
What should be considered to increase performance in Spark with
Cassandra?
As far as increasing performance is concerned, this will depend on the jobs you are running and the types of transformations required. Some general advice to maximise Spark-Cassandra performance (As can be found here) is outlined below.
Your choice of operations and the order in which they are applied is critical to performance.
You must organize your processes with task distribution and memory in mind.
The first thing is to determine if you data is partitioned appropriately. A partition in this context is merely a block of data. If possible, partition your data before Spark even ingests it. If this is not practical or possible, you may choose to repartition the data immediately following the load. You can repartition to increase the number of partitions or coalesce to reduce the number of partitions.
The number of partitions should, as a lower bound, be at least 2x the number of cores that are going to operate on the data. Having said that, you will also want to ensure any task you perform takes at least 100ms to justify the distribution across the network. Note that a repartition will always cause a shuffle, where coalesce typically won’t. If you’ve worked with MapReduce, you know shuffling is what takes most of the time in a real job.
Filter early and often. Assuming the data source is not preprocessed for reduction, your earliest and best place to reduce the amount of data spark will need to process is on the initial data query. This is often achieved by adding a where clause. Do not bring in any data not necessary to obtain your target result. Bringing in any extra data will affect how much data may be shuffled across the network, and written to disk. Moving data around unnecessarily is a real killer and should be avoided at all costs
At each step you should look for opportunities to filter, distinct, reduce, or aggregate the data as much as possible prior to proceeding to the operation.
Use pipelines as much as possible. Pipelines are a series of transformations that represent independent operations on a piece of data and do not require a reorganization of the data as a whole (shuffle). For example: a map from a string -> string length is independent, where a sort by value requires a comparison against other data elements and a reorganization of data across the network (shuffle).
In jobs which require a shuffle see if you can employ partial aggregation or reduction before the shuffle step (similar to a combiner in MapReduce). This will reduce data movement during the shuffle phase.
Some common tasks that are costly and require a shuffle are sorts, group by key, and reduce by key. These operations require the data to be compared against other data elements which is expensive. It is important to learn the Spark API well to choose the best combination of transformations and where to position them in your job. Create the simplest and most efficient algorithm necessary to answer the question.

How to store a huge hash table in RAM and share it between different applications?

The data contains information like billions of ID-scores pairs. To quickly access these paired information, I plan to use the hash-table container since its time complexity of search is O(1). Considering the the raw data is around 80G, I don't want to load the data into RAM every time when I need to run search application. What I want to do is to generate the hash-table once and then store it in RAM with persistence of filesystem lifetime (the expense of RAM is not a criteria), and search it with different applications.
Based on my limited understanding, I could use "Memory Mapped Files" (boost C++ libraries). But I have questions:
1) Is it possible to keep the hash-table data structure when write it to the mapped file?
2) How much time it will cost to map the existed file to RAM?
Any answers/comments/suggestions are most welcomed!
Thanks,
1) Yes. The file is just bytes, just like memory.
2) Creating the mapping will be effectively instantaneous. Node that you won't be able to map all of it contiguously at once except on a 64-bit OS. Of course, if the file cache can't hold the portion of the map you're using, it will have to be read from disk.
How big are IDs? How big are pairs? How much locality of reference do you have? (Are there heavily-used pair and lightly used pairs?) How often will you be searching for pairs that aren't present? Is the data read-mostly? There may be better ways to do it. I'd strongly suggest starting with a broader question to make sure you're not stuck on a sub-optimal path.

What file format do people use when logging data to a FAT32 file system using a 8bit microcontroller?

Updated question to be less vague.
I plan to log sensor data by time so something like sqlite would be perfect, but it requires too much resources in something like an atmega328p. Most of the searching will be done off the uC.
What do other people use? Flat text files? XML? A more complicated data structure?
Thanks for the feedback. It is good to know what other people are using. I've decided to serialize my data structures and save them in a binary file to eliminate string processing on the uC for now.
I've used flat text files for similar projects, albiet years ago, but I believe it's still a good approach for that environment. Since you don't need to process the data on-chip, you want it to be as efficient as possible (as little overhead as possible).
However, if you want more flexibility and weren't as concerned about space, perhaps saving JSON objects would be better, where each field is identified clearly. A tiny bit of overhead for creating the objects, but allows you to add and remove fields without complex logic on the interpreting side. I would pick JSON over XML just because you have about half the overhead (in space, and probably in processing).
With a small micro-controller like the 328, it is very important to determine the space requirements.
How big is each record? How many records do you want to store? How will you get the records off of the micro-controller?
Like Doug, I usually use a flat text to store data. So each record might contain year, day of the year and a value if I am storing a value once a day.
The file would look like:
11,314,100<cr>
11,315,99<cr>
11,316,98<cr>
11,317,220<cr>
You could store approximately 90-100 records, requiring you to dump the data every three months
If you need more then the 1kEEprom holds (200 5 byte records, 100 10 byte records or simliar) then you will need additional memory using an IC, SD or Flash.
If you want to unplug the memory and plug it into a PC, then the SD or Flash would be best.
You could use a vinculum chip from FTDIChip.com to simplify writing the fat files to flash drive.

Resources