Had a requirement to export the daily snapshot from a table, transform some values then save to a positioned file sent via ftp. The table itself estimating around 2-3m rows, each row has around 20 columns. Given the volume, a little hesitate to use the biztalk sql adapter, thinking to use another ETL tool (e.g. SSIS) to perform the select/transform/export to a flat file then use BizTalk to simplely dump the file to the ftp. The alternative is of course has BizTalk do all the job, poll the table, transform in the map etc.
What's the better approach?
You don't need BizTalk for this requirement, file with 2-3 M of rows, when get suspended in BizTalk it will hold this huge message on to the BizTalk DB.
Just use SSIS FTP Task to transfer your converted file.
SSIS is much better at large batches than BizTalk.
BizTalk's strength is in lots of small messages being processed in parallel.
So SSIS for this scenario to create the file and BizTalk to move it sounds best.
Related
I had a question related to Snowflake. Actually in my current role, I am planning to migrate data from ADLS (Azure data lake) to Snowflake.
I am right now looking for 2 options
Creating Snowpipe to load updated data
Create Airflow job for same.
I am still trying to understand which will be the best way and what is the pro and cons of choosing each.
It depends on what you are trying to as part of this migration. If it is a plain vanilla(no transformation, no complex validations) as-is migration of data from ADLS to Snowflake, then you may be good with SnowPipe(but please also check if your scenario is good for Snowpipe or Bulk Copy- https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#recommended-load-file-size).
If you have many steps before you move the data to snowflake and there are chances that you may need to change your workflow in future, it is better to use Airflow which will give you more flexibility. In one of my migrations, I used Airflow and in the other one CONTROL-M
You'll be able to load higher volumes of data with lower latency if you use Snowpipe instead of Airflow. It'll also be easier to manage Snowpipe in my opinion.
Airflow is a batch scheduler and using it to schedule anything that runs more frequently than 5 minutes becomes painful to manage. Also, you'll have to manage the scaling yourself with Airflow. Snowpipe is a serverless option that can scale up and down based on the volumes sees and you're going to see your data land within 2 minutes.
The only thing that should restrict your usage of Snowpipe is cost. Although, you may find that Snowpipe ends up being cheaper in the long run if you consider that you'll need someone to manage your Airflow pipelines too.
There are a few considerations. Snowpipe can only run a single copy command, which has some limitations itself, and snowpipe imposes further limitations as per Usage Notes. The main pain is that it does not support PURGE = TRUE | FALSE (i.e. automatic purging while loading) saying:
Note that you can manually remove files from an internal (i.e.
Snowflake) stage (after they’ve been loaded) using the REMOVE command.
Regrettably the snowflake docs are famously vague as they use an ambiguous colloquial writing style. While it said you 'can' remove the files manually yourself in reality any user using snowpipe as advertised for "continuous fast ingestion" must remove the files to not suffer performance/cost impacts of the copy command having to ignore a very large number of files that have been previously loaded. The docs around the cost and performance of "table directories" which are implicit to stages talk about 1m files being a lot of files. By way of an official example the default pipe flush time on snowflake kafka connector snowpipe is 120s so assuming data ingests continually, and you make one file per flush, you will hit 1m files in 2 years. Yet using snowpipe is supposed to imply low latency. If you were to lower the flush to 30s you may hit the 1m file mark in about half a year.
If you want a fully automated process with no manual intervention this could mean that after you have pushed files into a stage and invoked the pipe you need logic have to poll the API to learn which files were eventually loaded. Your logic can then remove the loaded files. The official snowpipe Java example code has some logic that pushes files then polls the API to check when the files are eventually loaded. The snowflake kafka connector also polls to check which files the pipe has eventually asynchronously completed. Alternatively, you might write an airflow job to ls #the_stage and look for files last_modified that is in the past greater than some safe threshold to then rm #the_stage/path/file.gz the older files.
The next limitation is that a copy command is a "copy into your_table" command that can only target a single table. You can however do advanced transformations using SQL in the copy command.
Another thing to consider is that neither latency nor throughput is guaranteed with snowpipe. The documentation very clearly says you should measure the latency yourself. It would be a completely "free lunch" if snowpipe that is running on shared infrastructure to reduce your costs were to run instantly and as fast if you were paying for hot warehouses. It is reasonable to assume a higher tail latency when using shared "on-demand" infrastructure (i.e. a low percentage of invocations that have a high delay).
You have no control over the size of the warehouse used by snowpipe. This will affect the performance of any sql transforms used in the copy command. In contrast if you run on Airflow you have to assign a warehouse to run the copy command and you can assign as big a warehouse as you need to run your transforms.
A final consideration is that to use snowpipe you need to make a Snowflake API call. That is significantly more complex code to write than making a regular database connection to load data into a stage. For example, the regular Snowflake JDBC database connection has advanced methods to make it efficient to stream data into stages without having to write oAuth code to call the snowflake API.
Be very clear that if you carefully read the snowpipe documentation you will see that snowpipe is simply a restricted copy into table command running on shared infrastructure that is eventually run at some point; yet you yourself can run a full copy command as part of a more complex SQL script on a warehouse that you can size and suspend. If you can live with the restrictions of snowpipe, can figure out how to remove the files in the stage yourself, and you can live with the fact that tail latency and throughput is likely to be higher than paying for a dedicated warehouse, then it could be a good fit.
TL:DR I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-biquery to write a Node/Express API directly with BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API with MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database, with a react/node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
Do a daily write/convert/transfer/migrate (whatever the correct verb is) from BigQuery database to MongoDB. I already have a node/express API written using mongoose, connected to a MongoDB cluster, and this approach would allow me to keep that API.
Use the nodejs-biquery library to create a node API that is directly connected to BigQuery. My app would change from MERN stack (BQ)ERN stack. I would have to re-write the node/express API to work with BigQuery, but I would no longer need the MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database if I am looking for a single entry, a since its not meant to be used as Mongo or a SQL Database (it has no index, one row retrieve query run slow as full table scan). Most of my APIs calls are for very little data from the database.
I am not sure which approach is best. I don't know if having 2 databases for 1 web application is a bad practice. I don't know if it's possible to do (1) with the daily transfers from one db to the other, and I don't know how slow BigQuery will be if I use it directly with my API. I think if it is easy to add (1) to my data engineering workflow, that this is preferred, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a python script that queries tables from BigQuery, transforms, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), however this is much easier to handle than writing a whole new node/bigquery API.
FWIW in a past life, I worked on a web ecommerce site that had 4 different DB back ends. ( Mongo, MySql, Redis, ElasticSearch) so more than 1 is not an issue at all, but you need to consider one as the DB of record, IE if anything does not match between them, one is the sourch of truth, the other is suspect. For my example, Redis and ElasticSearch were nearly ephemeral - Blow them away and they get recreated from the unerlying mysql and mongo sources. Now mySql and Mongo at the same time was a bit odd and that we were dong a slow roll migration. This means various record types were being transitioned from MySql over to mongo. This process looked a bit like:
- ORM layer writes to both mysql and mongo, reads still come from MySql.
- data is regularly compared.
- a few months elapse with no irregularities and writes to MySql are turned off and reads are moved to Mongo.
The end goal was no more MySql, everything was Mongo. I ran down that tangent because it seems like you could do similar - write to both DB's in whatever DB abstraction layer you used ( ORM, DAO, other things I don't keep up to date with etc.) and eventually move the reads as appropriate to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing was reached before sending it.
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is the data is always stale by a day.
I am planning to create sqlite table on my android app. The data comes from the the server via webservice.
I would like to know what is the best way to do this.
Should I transfer the data from the webservice in a sqlite db file and merge it or should i get all the data as a soap request and parse it in to table or should I use rest call.
The general size of the data is 2MB with 100 columns.
Please advise the best case where I can quickly get this data, with less load on the device.
My Workflow is:
Download a set of 20000 Addresses and save them to device sqlite database. This operation is only once, when you run the app for the first time or when you want to refresh the whole app data.
Update this record when ever there is a change in the server.
Now I can get this data either in JSON, XML or as pure SqLite File from the server . I want to know what is the fastest way to store this data in to Android Database.
I tried all the above methods and I found getting the database file from server and copying that data to the database is faster than getting the data in XML or JSON and parsing it. Please advise if I am right or wrong.
If you are planning to use sync adapters then you will need to implement a content provider (or atleast a stub) and an authenticator. Here is a good example that you can follow.
Also, you have not explained more about what is the use-case of such a web-service to decide what web-service architecture to suggest. But REST is a good style to write your services and using JSON over XML is advisable due to data format efficiency (or better yet give protocol-buffer a shot)
And yes, sync adapters are better to use as they already provide a great set of features that you will want to implement otherwise when written as a background service (e.g., periodic sync, auto sync, exponential backoff etc.)
To have less load on the device you can implement a sync-adapter backed by a content provider. You serialize/deserialize data when you upload/download data from server. When you need to persist data from the server you can use the bulkInsert() method in content-provider and persist all your data in a transaction
I have got a situation which require to Select data from database (MS SQL), then process the data a little before provide it to client to download. The problem is that the table contains a HUGE number of records to select (~1 Billion records). (EXPORT function)
Currently, I got 2 issues:
Select data -> SQL Server response too slow, usually hanged
I used streambuilder to store content of file before write to file for client to download. So, with huge data, IIS got problem with memory -> HANGED.
I have been trying some solutions to deal with them, I appreciate any advices from you for the good solution to deal with huge data in this case.
My thinking of solution for :
1st issue: select data like paging, get around 10000 record per time.
2nd: try to create a temporary file on server, write data to that file, and send to client after all data is available in that file.
In your professional experience, are these solutions possible?
Python --> SQLite --> ASP.NET C#
I am looking for an in memory database application that does not have to write the data it receives to disc. Basically, I'll be having a Python server which receives gaming UDP data and translates the data and stores it in the memory database engine.
I want to stay away from writing to disc as it takes too long. The data is not important, if something goes wrong, it simply flushes and fills up with the next wave of data sent by players.
Next, another ASP.NET server must be able to connect to this in memory database via TCP/IP at regular intervals, say once every second, or 10 seconds. It has to pull this data, and this will in turn update on a website that displays "live" game data.
I'm looking at SQlite, and wondering, is this the right tool for the job, anyone have any suggestions?
Thanks!!!
This sounds like a premature optimization (apologizes if you've already done the profiling). What I would suggest is go ahead and write the system in the simplest, cleanest way, but put a bit of abstraction around the database bits so they can easily by swapped out. Then profile it and find your bottleneck.
If it turns out it is the database, optimize the database in the usual way (indexes, query optimizations, etc...). If its still too slow, most databases support an in-memory table format. Or you can mount a RAM disk and mount individual tables or the whole database on it.
Totally not my field, but I think Redis is along these lines.
The application of SQlite depends on your data complexity.
If you need to perform complex queries on relational data, then it might be a viable option. If your data is flat (i.e. not relational) and processed as a whole, then some python-internal data structures might be applicable.
Perhaps AppFabric would work for you?
http://msdn.microsoft.com/en-us/windowsserver/ee695849.aspx
SQLite doesn't allow remote "connections" as far as I know, it only supports being invoked as an in-process library. However, you could try to use MySQL which, while heavier, supports remote connections and does have in-memory tables.
See http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html