Snowpipe vs Airflow for Continuous data loading into Snowflake - airflow

I had a question related to Snowflake. Actually in my current role, I am planning to migrate data from ADLS (Azure data lake) to Snowflake.
I am currently looking at 2 options:
Creating a Snowpipe to load updated data
Creating an Airflow job for the same.
I am still trying to understand which will be the best way and what the pros and cons of each are.

It depends on what you are trying to do as part of this migration. If it is a plain vanilla (no transformation, no complex validations) as-is migration of data from ADLS to Snowflake, then you may be good with Snowpipe (but please also check whether your scenario is better suited to Snowpipe or Bulk Copy: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#recommended-load-file-size).
If you have many steps before you move the data to Snowflake and there is a chance that you may need to change your workflow in the future, it is better to use Airflow, which will give you more flexibility. In one of my migrations I used Airflow, and in the other one CONTROL-M.

You'll be able to load higher volumes of data with lower latency if you use Snowpipe instead of Airflow. It'll also be easier to manage Snowpipe in my opinion.
Airflow is a batch scheduler, and using it to schedule anything that runs more frequently than every 5 minutes becomes painful to manage. Also, you'll have to manage the scaling yourself with Airflow. Snowpipe is a serverless option that can scale up and down based on the volumes it sees, and you're going to see your data land within 2 minutes.
The only thing that should restrict your usage of Snowpipe is cost. Although, you may find that Snowpipe ends up being cheaper in the long run if you consider that you'll need someone to manage your Airflow pipelines too.

There are a few considerations. Snowpipe can only run a single copy command, which has some limitations itself, and snowpipe imposes further limitations as per Usage Notes. The main pain is that it does not support PURGE = TRUE | FALSE (i.e. automatic purging while loading) saying:
Note that you can manually remove files from an internal (i.e. Snowflake) stage (after they’ve been loaded) using the REMOVE command.
Regrettably, the Snowflake docs are famously vague, as they use an ambiguous colloquial writing style. While they say you 'can' remove the files manually yourself, in reality any user using Snowpipe as advertised for "continuous fast ingestion" must remove the files, or else suffer the performance/cost impact of the copy command having to ignore a very large number of files that have been previously loaded. The docs around the cost and performance of "table directories", which are implicit to stages, talk about 1m files being a lot of files. By way of an official example, the default pipe flush time on the Snowflake Kafka connector for Snowpipe is 120s, so assuming data ingests continually and you make one file per flush, you produce 720 files a day and will hit 1m files in a little under 4 years. Yet using Snowpipe is supposed to imply low latency. If you were to lower the flush to 30s, you may hit the 1m file mark in about a year.
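The accumulation rate can be worked out directly. This is a minimal sketch, assuming exactly one file per flush, uninterrupted ingestion, and no files ever removed from the stage; the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_file_count(flush_seconds, target_files=1_000_000):
    """Days of continuous ingestion needed to accumulate target_files files,
    assuming one new file is written per flush interval."""
    files_per_day = SECONDS_PER_DAY / flush_seconds
    return target_files / files_per_day


for flush in (120, 30):
    days = days_until_file_count(flush)
    print(f"flush every {flush}s: {days:.0f} days (~{days / 365:.1f} years)")
```

More than one file per flush (for example one per partition) would shorten these times proportionally.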
If you want a fully automated process with no manual intervention, this could mean that after you have pushed files into a stage and invoked the pipe, you need logic to poll the API to learn which files were eventually loaded. Your logic can then remove the loaded files. The official Snowpipe Java example code has some logic that pushes files and then polls the API to check when the files are eventually loaded. The Snowflake Kafka connector also polls to check which files the pipe has eventually, asynchronously, completed. Alternatively, you might write an Airflow job to ls @the_stage, look for files whose last_modified is older than some safe threshold, and then rm @the_stage/path/file.gz for those older files.
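The "list, filter by age, remove" idea can be sketched as a small pure function that an Airflow task could call. This is only a sketch under assumptions: the rows are taken to be (name, last_modified) pairs already extracted from the stage listing, last_modified is assumed to be an RFC 1123 string such as 'Thu, 01 Jun 2023 10:00:00 GMT', the names are assumed to be paths relative to the stage, and select_stale_files is a hypothetical helper. Running the resulting statements against Snowflake is left out.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def select_stale_files(list_rows, now, min_age):
    """Return names of staged files last modified at least min_age ago.

    list_rows: iterable of (name, last_modified) pairs, where last_modified
    is an RFC 1123 date string (assumed format of the stage listing).
    """
    cutoff = now - min_age
    stale = []
    for name, last_modified in list_rows:
        modified_at = parsedate_to_datetime(last_modified)
        if modified_at <= cutoff:
            stale.append(name)
    return stale


# Hypothetical listing: one file two hours old, one five minutes old.
rows = [
    ("path/old.gz", "Thu, 01 Jun 2023 10:00:00 GMT"),
    ("path/new.gz", "Thu, 01 Jun 2023 11:55:00 GMT"),
]
now = datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc)

stale_files = select_stale_files(rows, now, timedelta(hours=1))
for name in stale_files:
    # The stage name is a placeholder; the statement is only printed here.
    print(f"rm @the_stage/{name}")
```

The threshold has to be long enough that the pipe has certainly finished with a file before it is removed, which is why polling for load status is the safer option when latency is unpredictable.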
The next limitation is that a copy command is a "copy into your_table" command that can only target a single table. You can however do advanced transformations using SQL in the copy command.
Another thing to consider is that neither latency nor throughput is guaranteed with Snowpipe. The documentation very clearly says you should measure the latency yourself. It would be a completely "free lunch" if Snowpipe, which runs on shared infrastructure to reduce your costs, were to run instantly and as fast as if you were paying for hot warehouses. It is reasonable to assume a higher tail latency when using shared "on-demand" infrastructure (i.e. a low percentage of invocations that have a high delay).
You have no control over the size of the warehouse used by snowpipe. This will affect the performance of any sql transforms used in the copy command. In contrast if you run on Airflow you have to assign a warehouse to run the copy command and you can assign as big a warehouse as you need to run your transforms.
A final consideration is that to use snowpipe you need to make a Snowflake API call. That is significantly more complex code to write than making a regular database connection to load data into a stage. For example, the regular Snowflake JDBC database connection has advanced methods to make it efficient to stream data into stages without having to write oAuth code to call the snowflake API.
Be very clear that if you carefully read the Snowpipe documentation you will see that Snowpipe is simply a restricted copy into table command running on shared infrastructure that is eventually run at some point; yet you yourself can run a full copy command as part of a more complex SQL script on a warehouse that you can size and suspend. If you can live with the restrictions of Snowpipe, can figure out how to remove the files in the stage yourself, and can live with the fact that tail latency is likely to be higher, and throughput lower, than when paying for a dedicated warehouse, then it could be a good fit.

Related

Generic Airflow data staging operator

I am trying to understand how to manage large data with Airflow. The documentation is clear that we need to use external storage, rather than XCom, but I can not find any clean examples of staging data to and from a worker node.
My expectation would be that there should be an operator that can run a staging-in operation, run the main operation, and then stage out again.
Is there such an operator or pattern? The closest I've found is an S3 File Transform, but it runs an executable to do the transform rather than a generic operator, such as a DockerOperator, which we'd want to use.
Other "solutions" I've seen rely on everything running on a single host, and using known paths, which isn't a production ready solution.
Is there such an operator that would support data staging, or is there a concrete example of handling large data with Airflow that doesn't rely on each operator being equipped with cloud copying capabilities?
Yes and no. Traditionally, Airflow is mostly an orchestrator, so it does not usually "do" the work; it usually tells others what to do. You very rarely need to bring actual data to an Airflow worker. The worker is mostly there to tell others where the data is coming from, what to do with it, and where to send it.
There are exceptions (some transfer operators actually download data from one service and upload it to another) - so the data passes through Airflow node, but this is an exception rather than a rule (the more efficient and better pattern is to invoke an external service to do the transfer and have a sensor to wait until it completes).
This is more the "historical" and somewhat "current" way Airflow operates. However, with Airflow 2 and beyond we are expanding this, and it is becoming more and more possible to follow a pattern similar to what you describe. This is where XCom plays a big role.
You can, as of recently, develop Custom XCom Backends that allow for more than metadata sharing; they are also good for sharing the data itself. You can see the docs here https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#custom-backends but there is also this nice article from Astronomer about it https://www.astronomer.io/guides/custom-xcom-backends and a very nice talk from Airflow Summit 2021 (from last week) presenting it: https://airflowsummit.org/sessions/2021/customizing-xcom-to-enhance-data-sharing-between-tasks/ . I highly recommend watching the talk!
Looking at your pattern - XCom Pull is staging-in, Operator's execute() is operation and XCom Push is staging-out.
This pattern will be reinforced, I think by upcoming Airflow releases and some cool integrations that are coming. And there will be likely more cool data sharing options in the future (but I think they will all be based on - maybe slightly enhanced - XCom implementation).
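To make the staging-in / operation / staging-out mapping concrete, here is a minimal sketch in plain Python that deliberately does not import Airflow. A temporary directory stands in for external object storage, and stage_out, stage_in and run_task are hypothetical names. In a real custom XCom backend, the equivalent logic would sit in the serialize_value and deserialize_value hooks of a BaseXCom subclass, so that only a small reference travels through XCom.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Stand-in for a bucket or other external store (assumption for this sketch).
STORE = Path(tempfile.mkdtemp(prefix="xcom_store_"))


def stage_out(value):
    """Write the value to the store and return a small reference to it."""
    key = f"{uuid.uuid4()}.json"
    (STORE / key).write_text(json.dumps(value))
    return key


def stage_in(reference):
    """Resolve a reference back into the stored value."""
    return json.loads((STORE / reference).read_text())


def run_task(input_reference, operation):
    """Stage in, run the operation, stage out; only references are passed on."""
    data = stage_in(input_reference)
    result = operation(data)
    return stage_out(result)


upstream_ref = stage_out({"rows": [1, 2, 3]})
downstream_ref = run_task(upstream_ref, lambda d: {"total": sum(d["rows"])})
print(stage_in(downstream_ref))
```

The point of the design is that the operation itself never needs to know where the data lives; the storage details stay inside the two staging functions.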

How would I use pglogical from a downstream database?

I'd like to use pglogical to replicate a set of tables, but I want to make all of my changes downstream from the master - to avoid risk, I don't want to make any modifications to the master database. I'd also like to start using pglogical now so we get familiar with the technology and can include it across all of our databases on our next release.
I don't need constant updates, so I came up with a plan, a cron job that:
Turns off streaming replication to a standby
Makes this standby a logically-replicating master (just for logical replication, no writes)
  Stop postgresql
  Copy off data dir
  Make config changes
  Start postgresql
  Create pglogical extension
Catches up logical replication
Makes this database a streaming standby without logical replication again
  Stop postgresql
  Replace data dir with previous copy
  Config changes
  Start postgresql
My question - does this approach even make sense? Is there some easy way to accomplish this that I'm totally missing?

Meteor server-side memory usage for thousands of concurrent users

Based on this answer, it looks like the meteor server keeps an in-memory copy of the cache for each connected client. My understanding is that it gets used in order to avoid sending multiple copies of data when dealing with overlapping subscriptions on a client.
The relevant part of the linked answer (emphasis is mine):
The merge box: The job of the merge box is to combine the results (added, changed and removed calls) of all of a client's active publish functions into a single data stream. There is one merge box for each connected client. It holds a complete copy of the client's minimongo cache.
Assuming that answer is still accurate in the current version of meteor, couldn't that create a huge waste of memory on the server as the number of users increases?
As an off-the-cuff calculation, if an app had about a 100kB cache per client, then 10,000 concurrent users would use up 1GB of memory on the server, and 100,000 users a whopping 10GB! This would be true even if each client was looking at almost identical data. It seems plausible for an app to use much more data than that per client, which would further exacerbate the problem.
Does this problem exist in the current version of Meteor? If so, what techniques can be used to limit the amount of memory the server needs to use to manage all the client subscriptions?
Take a look at this post by Arunoda at his meteorhacks.com blog:
http://meteorhacks.com/making-meteor-500-faster-with-smart-collections.html
which talks about his Smart Collections page:
http://meteorhacks.com/introducing-smart-collections.html
He created an alternative Collection stack which has succeeded in its goals for speed, efficiency (memory & CPU) and scalability (you can see a graphed comparison in the post). Admittedly, in his tests RAM usage was negligible with both Collection types, although the way he's implemented things there should be a very obvious difference with the type of use case you mentioned.
Also, you can see in this post on meteor-core:
https://groups.google.com/d/msg/meteor-core/jG1KLObX1bM/39aP4kxqWZUJ
that the Meteor developers are aware of his work and are cooperating in implementing some of the improvements into Meteor itself (but until then his smart package works great).
Important note! Smart Collections rely on access to the Mongo Oplog. This is easy if you're running on your own machine or hosted infrastructure. If you're using a cloud-based database, this option might not be available, or if it is, it will cost a lot more than the smaller packages.

Is migrating from RDS to Elastic MapReduce + Hive the right choice?

First of all, I must make clear that I am a newbie, so please excuse me if I don't use the correct terminology in my question.
This is my scenario:
I need to analyze large quantities of text like tweets, comments, mails, etc. The data is currently inserted into an Amazon RDS MySQL instance as it occurs.
Later I run an R job locally using RTextTools (http://www.rtexttools.com/) over that data to output my desired results. At this point it might be important to make clear that the R script analyzes the data and writes data back into the MySQL table, which will later be used to display it.
The issue I am having lately is that the job takes about 1 hour each time I run it and I need to do it at least 2 times a day...so using my local computer is not an option anymore.
Looking for alternatives I started to read about Amazon Elastic MapReduce instance which at first sight seems to be what I need, but here start my questions and confusions about it.
I read that data for EMR should be pulled from an S3 bucket. If that's the case, then I must start storing my data as JSON or similar within an S3 bucket and not in my RDS instance, right?
At this point I read it is a good idea to create Hive tables and then use RHive to read the data in order for RTextTools to do its job and write the results back to my RDS tables. Is this right?
And now the final and most important question: is taking all this trouble worth it vs. running an EC2 instance with R and running my R scripts there? Will I reduce computing time?
Thanks a lot for your time and any tip in the right direction will be much appreciated
Interesting. I would like to suggest a few things.
You can totally store data in S3. S3 stores objects rather than rows, so you will first have to serialize your data into files (txt, JSON, etc.) and then push them to S3. You can probably get the benefit of CloudFront deployed over S3 for fast retrieval of data. You can also keep using RDS; you will have to analyze the performance difference yourself.
Writing results back to RDS shouldn't be an issue. EMR basically creates two EC2 instances, ElasticMapReduce-master and ElasticMapReduce-slave, which can be used to communicate with RDS.
I think it's worth trying an EC2 instance with R, but to reduce the computation time you might have to go with an expensive EC2 instance, or set up autoscaling and divide the task between different instances. That is like implementing the whole parallel computation logic yourself, whereas with EMR you get the MapReduce logic built in. So first try EMR, and if it doesn't work out well for you, try a new EC2 instance with R.
Let me know how it goes, thank you.
You should consider trying EMR. S3+EMR is very much worth trying out if the 1hour window is a constraint. For your type of processing workloads, you might save cycles by using a scalable on demand hadoop/hive platform. Obviously, there are some learning, re-platforming, and ongoing cluster mgmt costs related to the trial and switch. They are non-trivial. Alternatively, consider services such as Qubole, which also runs on EC2+S3 and provides higher level (and potentially easier to use) abstractions.
Disclaimer: I am a product manager at Qubole.

What administration tasks (if any) should I perform on a sqlite database?

I'm deploying a client application to several thousand Windows machines. Each machine will use a SQLite database as a local datastore, ultimately writing data to a remote server. So the SQLite db won't really grow beyond a few MB over time, since data will be added and then later deleted. SQLite is supposed to be zero-administration, but are there any tasks I need to run occasionally, such as analyze / update statistics / check for consistency, etc.? Or can I assume that once the SQLite db is there it can be used for a couple of years with no worries and no corruption?
If there are such tasks I'll build processes into my app to run them occasionally, but at present I'm not sure if that's necessary/recommended.
SQLite has a VACUUM command, which will compact deleted entries and reduce fragmentation & file size. The ratio of deletions to insertions really dictates how frequently this should be done. It's generally quick and not always necessary.
The best thing to do is set up some scripts to simulate your expected activity in production. Take snapshots of the file size & graph them in your favorite spreadsheet. If after 10 years of simulated activity there isn't a problem, then your usage patterns don't need VACUUM.
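A small simulation along those lines might look like the sketch below. It uses Python's standard sqlite3 module with a made-up queue table and arbitrary row counts, and assumes default database settings (no auto_vacuum). It measures file size around a VACUUM after an add-then-delete cycle. The PRAGMA integrity_check and ANALYZE calls are not part of the answer above; they are included only because the question asks about consistency checks and statistics.

```python
import os
import sqlite3
import tempfile

# Throwaway database file; table name and row counts are illustrative only.
db_path = os.path.join(tempfile.mkdtemp(), "local_store.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT)")

# Simulate the add-then-delete pattern described in the question.
conn.executemany(
    "INSERT INTO queue (payload) VALUES (?)",
    [("x" * 1000,) for _ in range(5000)],
)
conn.commit()
conn.execute("DELETE FROM queue")
conn.commit()
size_before = os.path.getsize(db_path)

# VACUUM rebuilds the file and releases the pages freed by the deletes.
conn.execute("VACUUM")
size_after = os.path.getsize(db_path)

# Consistency check and statistics refresh, as raised in the question.
integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
conn.execute("ANALYZE")
conn.close()

print(f"before VACUUM: {size_before} bytes, after: {size_after} bytes")
print(f"integrity_check: {integrity}")
```

If the file size plateaus on its own in a longer simulation (freed pages are reused by later inserts), VACUUM is mainly cosmetic for that workload.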

Resources