Reduce Elastic Map Reduce runtime - emr

I use Elastic MapReduce to analyze large amounts of data (stored on S3).
What is the most cost-efficient way to reduce the runtime of the job, other than increasing the instance size?
If I create more, smaller files on S3, will it reduce the runtime?
If I create fewer but larger files on S3, will it reduce the runtime?

Related

Handle embedding vectors similarities

I need to compare one vector embedding against 50k (can grow to 100-200k) vector embeddings and find top 10 similarities.
I need to do it fast for online user interactions.
I have access to the Azure cloud.
How should I handle such scenario?
Where do I store all embeddings so my code can access and process them quickly?
Should I use a Python library and handle everything in memory? Should I distribute the calculation somehow for better speed and memory usage?
Is there a graph DB that handles this case and can compare vectors using KNN/ANN?
What's the best approach for this use case?
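
For a sense of scale, the in-memory option raised in the question can be sketched as a brute-force cosine-similarity scan. This is only a minimal sketch, not a recommendation over an ANN index: the function names are hypothetical, all vectors are assumed to be non-zero, and a vectorized library would normally replace the pure-Python loops.

```python
import heapq
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k_similar(query, unit_vectors, k=10):
    """Return the k (cosine_similarity, index) pairs with the highest similarity.

    unit_vectors must already be normalized, so the dot product with the
    normalized query equals the cosine similarity.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, v)), index)
        for index, v in enumerate(unit_vectors)
    )
    return heapq.nlargest(k, scored)
```

Normalizing the stored embeddings once, at load time, leaves only one dot product per stored vector for each query.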

R Studio/AWS Memory Requirements

I am studying the bike/scooter trip behavior of commuters in major cities and I am trying to collect real time data from an API every 15 seconds or so about the location of bike/scooter sharing devices.
Here is an example API of the data:
https://data.lime.bike/api/partners/v1/gbfs/arlington/free_bike_status
I am using RStudio on AWS to collect this data. I have two potential options, and I am trying my best to minimize the cost of collection because AWS is expensive and I am just a graduate student:
1) Keep the memory low on AWS t.micro (1 GB) and save the output file into S3 repeatedly
or
2) Save a final file once daily but use a lot of memory for the instance.
What is the most cost-efficient way to collect this data? Is there a way R can tell me how many GB of memory a data frame is taking up, so I can adjust my AWS instance memory appropriately?

Limitations of using sequential IDs in Cloud Firestore

I read in a Stack Overflow post (link here) that:
By using predictable (e.g. sequential) IDs for documents, you increase the chance you'll hit hotspots in the backend infrastructure. This decreases the scalability of the write operations.
I would like someone to explain in more detail the limitations that can occur when using sequential or user-provided IDs.
Cloud Firestore scales horizontally by allocating key ranges to machines. As load increases beyond a certain threshold on a single machine, it will split the range being served by that machine and assign it to 2 machines.
Let's say you are just starting to write to Cloud Firestore, which means a single server is currently handling the entire range.
When you are writing new documents with random Ids, when we split the range into 2, each machine will end up with roughly the same load. As load increases, we continue to split into more machines, with each one getting roughly the same load. This scales well.
When you are writing new documents with sequential Ids, if you exceed the write rate a single machine can handle, the system will try to split the range into 2. Unfortunately, one half will get no load, and the other half the full load! This doesn't scale well as you can never get more than a single machine to handle your write load.
In the case where a single machine is running more load than it can optimally handle, we call this "hot spotting". Sequential Ids mean we cannot scale to handle more load. Incidentally, this same concept applies to index entries too, which is why we also warn against sequential index values, such as timestamps of the current time.
So, how much is too much load? We generally say 500 writes/second is what a single machine will handle, although this will naturally vary depending on a lot of factors, such as how big a document you are writing, number of transactions, etc.
With this in mind, you can see that smaller more consistent workloads aren't a problem, but if you want something that scales based on traffic, sequential document ids or index values will naturally limit you to what a single machine in the database can keep up with.
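
The range-splitting argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: it splits the key range once at the median of the IDs written so far, which is a simplification and not Firestore's actual splitting algorithm, and the function name and key-space size are hypothetical.

```python
import random


def load_share_after_split(existing_ids, new_ids):
    """Split the key range at the median existing ID and return the fraction
    of new writes that lands in the (lower, upper) half."""
    split_point = sorted(existing_ids)[len(existing_ids) // 2]
    upper = sum(1 for doc_id in new_ids if doc_id >= split_point)
    total = len(new_ids)
    return (total - upper) / total, upper / total


rng = random.Random(0)
KEY_SPACE = 10**9  # hypothetical size of the ID space

# Random IDs: new writes are spread over both halves of the range.
random_existing = [rng.randrange(KEY_SPACE) for _ in range(10_000)]
random_new = [rng.randrange(KEY_SPACE) for _ in range(10_000)]
random_share = load_share_after_split(random_existing, random_new)

# Sequential IDs: every new write is larger than every existing ID,
# so the upper half receives all of the new load.
sequential_existing = list(range(10_000))
sequential_new = list(range(10_000, 20_000))
sequential_share = load_share_after_split(sequential_existing, sequential_new)

print("random IDs     (lower, upper):", random_share)
print("sequential IDs (lower, upper):", sequential_share)
```

In this model the sequential case always sends the whole write load to one side of the split, which is the hot spot described above.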

How to speedup bulk importing into google cloud datastore with multiple workers?

I have an Apache Beam based Dataflow job that reads from a single text file (stored in Google Cloud Storage) using the vcf source, transforms the text lines into datastore Entities, and writes them into the datastore sink. The workflow works fine, but the drawbacks I noticed are:
The write speed into datastore is at most around 25-30 entities per second.
I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100 but the execution seems to prefer one worker (see graph below: the target workers once increased to 2 but reduced to 1 "based on the ability to parallelize the work in the currently running step").
I did not use ancestor path for the keys; all the entities are the same kind.
The pipeline code looks like below:
    import apache_beam as beam

    # vcfio, ToEntityFn and WriteToDatastore are imported or defined
    # elsewhere in the asker's module.
    def write_to_datastore(project, user_options, pipeline_options):
        """Creates a pipeline that writes entities to Cloud Datastore."""
        with beam.Pipeline(options=pipeline_options) as p:
            (p
             | 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
             | 'Create my entity' >> beam.ParDo(
                 ToEntityFn(), user_options.kind)
             | 'Write to datastore' >> WriteToDatastore(project))
Because I have millions of rows to write into the datastore, it would take too long at a speed of 30 entities/sec.
Question: The input is just one huge gzipped file. Do I need to split it into multiple small files to trigger multiple workers? Is there any other way I can make the import faster? Am I missing something in the num_workers setup? Thanks!
I'm not familiar with Apache Beam, so this answer is from a general flow perspective.
Assuming there are no dependencies to be considered between entity data in various input file sections then yes, working with multiple input files should definitely help as all these files could then be processed virtually in parallel (depending, of course, on the max number of available workers).
You might not need to split the huge gzipped file beforehand. It might be possible to simply hand off segments of the single input data stream to separate data segment workers for writing, if the overhead of the handoff itself is negligible compared to the actual data segment processing.
The overall performance limitation would be the speed of reading the input data, splitting it into segments, and handing them off to the data segment workers.
A data segment worker would further split the data segment it receives into smaller chunks of up to 500 entities, the maximum that can be converted and written to the datastore in a single batch operation. Depending on the datastore client library used, it may be possible to perform this operation asynchronously, allowing the split into chunks and the conversion to entities to continue without waiting for the previous datastore writes to complete.
The performance limitation at the data segment worker would then be the speed at which the data segment can be split into chunks and the chunks converted to entities.
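
The chunking step described above can be sketched with a small generator. This is a minimal sketch: the name `chunked` is hypothetical, and the limit of 500 is taken from the batch size mentioned in this answer.

```python
from itertools import islice

MAX_BATCH_SIZE = 500  # per-batch entity limit mentioned above


def chunked(rows, batch_size=MAX_BATCH_SIZE):
    """Yield consecutive lists of at most batch_size items from rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because it consumes the input lazily, a data segment never has to be held in memory as a whole list of chunks.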
If async ops aren't available, or for even higher throughput, yet another handoff of each chunk to a chunk worker could be performed, with the chunk worker performing the conversion to entities and the datastore batch write.
The performance limitation at the data segment worker level would then be just the speed at which the data segment can be split into chunks and handed over to the chunk workers.
With such approach the actual conversion to entities and batch writing them to the datastore (async or not) would no longer sit in the critical path of splitting the input data stream, which is, I believe, the performance limitation in your current approach.
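
The handoff to chunk workers might look roughly like the sketch below. It assumes a thread pool stands in for the workers, and `convert_and_write` is a hypothetical placeholder: the conversion and the batch write are simulated so the example runs without a datastore client.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_and_write(chunk):
    """Hypothetical chunk worker: convert rows to entities, then write them.

    Both steps are stand-ins; a real worker would build datastore entities
    and issue one batch write per chunk.
    """
    entities = [{"row": row} for row in chunk]  # stand-in for the conversion
    return len(entities)  # stand-in for the batch write


def import_chunks(chunks, max_workers=8):
    """Hand every chunk to the pool and return the number of entities written."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(convert_and_write, chunk) for chunk in chunks]
        return sum(future.result() for future in futures)
```

The splitting loop only submits work and never waits on an individual write, which is what keeps conversion and writing off its critical path.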
I looked into the design of vcfio. I suspect (if I understand correctly) that the reason I always get one worker when the input is a single file is a limitation of _VcfSource and a constraint of the VCF format. The format has a header part that defines how to interpret the non-header lines. As a result, each worker that reads the source file has to work on an entire file. When I split the single file into 5 separate files that share the same header, I successfully got up to 5 workers (but no more, probably for the same reason).
One thing I don't understand: the number of workers that read can be limited to 5 (in this case), but why are we limited to only 5 workers for writing? Anyway, I think I have found an alternative way to trigger multiple workers with the Beam DataflowRunner (use pre-split VCF files). There is also a related approach in the gcp variant transforms project, in which vcfio has been significantly extended. It seems to support multiple workers with a single input VCF file. I wish the changes in that project could be merged into the Beam project too.
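
The pre-splitting workaround can be sketched as follows. This is a minimal sketch, not the approach used by vcfio or the variant transforms project: the function name is hypothetical, it treats every line starting with `#` as header (as in the VCF format), and it works on an in-memory list of lines, whereas a very large file would need to be streamed.

```python
def split_vcf_lines(lines, parts):
    """Split VCF lines into up to `parts` shards that all share the header."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if not line.startswith("#")]
    # Ceiling division, so that no more than `parts` shards are produced.
    shard_size = max(1, -(-len(records) // parts))
    return [
        header + records[start:start + shard_size]
        for start in range(0, len(records), shard_size)
    ]
```

Each shard can then be written to its own file and passed to the pipeline as a separate input.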

How to store a huge hash table in RAM and share it between different applications?

The data consists of billions of ID-score pairs. To access these pairs quickly, I plan to use a hash table, since its lookup time complexity is O(1). Considering that the raw data is around 80 GB, I don't want to load the data into RAM every time I need to run a search application. What I want to do is generate the hash table once, store it in RAM with filesystem-lifetime persistence (the cost of RAM is not a concern), and search it from different applications.
Based on my limited understanding, I could use "Memory Mapped Files" (Boost C++ libraries). But I have questions:
1) Is it possible to preserve the hash-table data structure when writing it to the mapped file?
2) How long will it take to map the existing file into RAM?
Any answers/comments/suggestions are most welcomed!
Thanks,
1) Yes. The file is just bytes, just like memory.
2) Creating the mapping will be effectively instantaneous. Note that you won't be able to map all of it contiguously at once except on a 64-bit OS. Of course, if the file cache can't hold the portion of the map you're using, it will have to be read from disk.
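
To illustrate both points, here is a minimal sketch of a hash table stored directly in a memory-mapped file. It uses Python's `mmap` module rather than the Boost library from the question, and the slot layout, the linear probing scheme, and all names are assumptions made for illustration. It also assumes unique integer IDs and more slots than pairs.

```python
import mmap
import struct

# One slot: 64-bit ID, 64-bit float score, 1-byte "occupied" flag (17 bytes).
SLOT = struct.Struct("<QdB")


def build_table(path, pairs, num_slots):
    """Write an open-addressing hash table of (id, score) pairs to a file.

    num_slots must be larger than the number of pairs, otherwise probing
    would never find a free slot.
    """
    with open(path, "wb") as f:
        f.write(b"\x00" * (num_slots * SLOT.size))  # all slots start empty
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as table:
        for key, score in pairs:
            slot = key % num_slots
            # Linear probing: step forward until a free slot is found.
            while SLOT.unpack_from(table, slot * SLOT.size)[2]:
                slot = (slot + 1) % num_slots
            SLOT.pack_into(table, slot * SLOT.size, key, score, 1)


def lookup(table, key):
    """Return the score stored for key in a mapped table, or None if absent."""
    num_slots = len(table) // SLOT.size
    slot = key % num_slots
    for _ in range(num_slots):
        found_key, score, occupied = SLOT.unpack_from(table, slot * SLOT.size)
        if not occupied:
            return None
        if found_key == key:
            return score
        slot = (slot + 1) % num_slots
    return None
```

Another process can open the same file with `mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)` and call `lookup` on the mapping without rebuilding anything, since the table is just bytes at fixed offsets.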
How big are the IDs? How big are the pairs? How much locality of reference do you have? (Are there heavily used pairs and lightly used pairs?) How often will you be searching for pairs that aren't present? Is the data read-mostly? There may be better ways to do it. I'd strongly suggest starting with a broader question to make sure you're not stuck on a sub-optimal path.
