What is the best practice for placement of incremental models in DBT pipelines?

I am currently getting to grips with DBT to populate and keep our data warehouse up to date, and I'm starting to look at where we would benefit from incremental loads to reduce compute resources and processing time.
One thing I can't really grasp from the DBT documentation is where incremental models should sit in the pipeline, and I'm wondering whether anyone out there has real-world examples they can share.
Right now the pipeline consists of source (materialized as views) -> staging (tables) -> production (tables) stages. Things I read online suggest that incremental models should sit as close to the source as possible, which makes sense, but it also means we'd be storing table data at every stage of the pipeline, which feels unnecessary. If I move the incremental load into the staging models instead, that avoids the extra storage, but the source views would always return the full dataset. And if I change the source stage to use an incremental load strategy, should I also make the other stages in the pipeline incremental?
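For concreteness, an incremental model typically filters new rows against what is already in the target table, whichever layer it lives in. A rough sketch using dbt's Python model support (available on some adapters, e.g. Spark/Databricks); the `stg_events` model and the `event_id`/`updated_at` columns are hypothetical:

```python
# Hypothetical dbt Python model, assuming a Spark-backed adapter where
# `session` is a SparkSession. All names below are placeholders.
def model(dbt, session):
    # Materialize incrementally, merging on a hypothetical unique key.
    dbt.config(materialized="incremental", unique_key="event_id")

    src = dbt.ref("stg_events")  # placeholder upstream model

    if dbt.is_incremental:
        # Only pull rows newer than what the target table already holds.
        max_loaded = session.sql(
            f"select max(updated_at) as m from {dbt.this}"
        ).collect()[0]["m"]
        if max_loaded is not None:
            src = src.filter(src["updated_at"] > max_loaded)

    return src
```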
I realise this is a bit of an open-ended question, so apologies for that - I'm just looking for some pointers.

Related

Cyclic Workflow in Data Comparison Process

I am searching for a solution to automate an iterative data comparison process until all data packages are consistent. My first thought is to use something like Apache Airflow, but the iterative nature seems to require a cyclic graph, and Apache Airflow only allows DAGs (directed acyclic graphs). Since I don't have much knowledge of Airflow, I am a bit lost and would appreciate some expert knowledge here.
Current status: I am in a position where I regularly need to compare data packages for consistency and communicate errors to and between the two different parties manually.
On the one hand there is a design dataset, and on the other hand there are measured datasets. Both involve many manual steps by different parties, so if an inconsistency occurs, I contact one or the other party and the error is removed manually. There are also regular changes to both datasets that can introduce new errors into already checked datasets.
I suspect this process has not been automated yet because the datasets are not directly comparable; some transformations need to be done in between. I automated this transformation process over the last few weeks, so all that needs to be done from my side now is to run the script and communicate the errors.
What I would need now is a tool that orchestrates my script against the correct datasets and contacts the relevant people for as long as errors exist. If something changes or is added, the script needs to be run again.
My first guess was that I would need to create a workflow in Apache Airflow, but this iterative process looks to me like a cyclic graph, which is not allowed in Airflow. Do you have any suggestions, or is this a common situation for which Airflow-based solutions also exist?
I think one way to solve your problem could be to have a DAG for the main task of comparing the datasets and sending notifications, and then run a periodic job in cron, Quartz, etc. that triggers that DAG. You are correct that Airflow does not like cyclic workflows.
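A minimal sketch of that shape, assuming Airflow 2.x; the comparison logic, the recipient address, and the daily schedule are placeholders:

```python
# Minimal sketch, assuming Airflow 2.x; comparison logic, recipients and
# schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.operators.python import ShortCircuitOperator


def compare_datasets(**_):
    # Placeholder: run the transformation/comparison script here and return
    # True only if inconsistencies were found, so the notification task runs;
    # returning False short-circuits (skips) the downstream notification.
    errors_found = False
    return errors_found


with DAG(
    dag_id="data_consistency_check",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # or trigger it externally from cron/Quartz
    catchup=False,
) as dag:
    check = ShortCircuitOperator(
        task_id="compare_datasets",
        python_callable=compare_datasets,
    )

    notify = EmailOperator(
        task_id="notify_responsible_party",
        to="data-owners@example.com",  # placeholder address
        subject="Data consistency errors found",
        html_content="The comparison script reported inconsistencies.",
    )

    check >> notify
```

Each scheduled run is independent, so the "loop" lives in the schedule rather than in the graph, which keeps the workflow itself acyclic.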
I worked on Cylc, a cyclic-graph workflow tool. Cyclic workflows (or workflows with loops) are very common in areas such as Numerical Weather Prediction (NWP), which is the reason Cylc was created, and also in other fields such as optimization.
In NWP workflows, some steps may be waiting for datasets, and the workflow may stall and send notifications if the data is not as expected (e.g. some satellite imaging data is missing, and the tide model output file is also missing).
Also, in production, NWP models run multiple times a day, either because you have new observation data, or new input data, or maybe because you want to run ensemble models, etc. So you end up with multiple runs of the workflow in parallel, where the workflow manager is responsible for managing dependencies, optimizing the use of resources, sending notifications, and more.
Cyclic workflows are complicated, that's probably why most implementations opt to support only DAGs.
If you'd like to try Cylc, the team has been trying to make it more generic so that it's not specific to NWP only. It has a new GUI, and the input format and documentation were improved with ease of use in mind.
There are other tools that support cyclic workflows too, such as StackStorm and Prefect, and I am currently checking whether Autosubmit supports them as well. Take a look at these tools if you'd like.
If you are in the life sciences, or are interested in reproducible workflows, the CWL standard also has some ongoing discussion about adding support for loops, which could allow you to achieve something akin to what you described, I reckon.

What is the best way to store preprocessed data in a machine learning pipeline?

In my case, raw data is stored in a NoSQL database. Before training the ML model, I need to preprocess the raw data from NoSQL. Once I have preprocessed the raw data, what is the best way to keep the preprocessed data?
1. Keep it in memory.
2. Keep it in another table in the NoSQL database.
3. Can you recommend other options?
It depends on your use case, the size of the data, your tech stack, and your machine learning framework/library. Truth be told, without knowledge of your data and requirements, no one on SO will be able to give you a complete answer.
In terms of passing data to the model / running the model, load it into memory. Look at batching your data into the model if you hit memory limits, or use an AWS EMR cluster.
For the question on storing the data, I’ll use the previous answer’s example of Spark and try to give some general rules.
If the processed data is "Big" and regularly accessed (e.g. once a month/week/day), then store it in a distributed manner and load it into memory when running the model.
For Spark, your best bet is to write it as partitioned Parquet files or to a Hive data warehouse.
The key thing about those two is that they are distributed. Spark will create N Parquet files containing all your data. When it comes to reading the dataset into memory (before running your model), it can read from many files at once, saving a lot of time. TensorFlow does a similar thing with the TFRecord format.
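A small sketch of that pattern in PySpark; the paths and the `event_date` partition column are hypothetical:

```python
# Minimal PySpark sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("store-preprocessed").getOrCreate()

# Placeholder for the output of your preprocessing step.
processed = spark.read.json("/data/raw/events.json")

# Write the preprocessed data once, partitioned so later reads can be
# parallelised and pruned by partition.
(
    processed.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("/data/processed/events")
)

# Later, before training, read it back into memory across many files at once.
features = spark.read.parquet("/data/processed/events")
```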
If your NoSQL database is distributed, then you can potentially use that.
If it won’t be regularly accessed and is “small”, then just run the code from scratch & load into memory.
If the processing takes no time at all and it’s not used for other work, then there’s no point storing it. It’s a waste of time. Don’t even think about it. Just focus on your model, get the data in memory and get running.
If the data won’t be regularly accessed but is “Big”, then time to think hard!
You need to carefully think about the trade off of processing time vs. data storage capability.
How much will it cost to store this data?
How often is it needed?
Is it business critical?
When someone asks for this, is it always a “needed to be done yesterday” request?
Etc.
The Spark framework is a good solution for what you want to do; you can learn more about it in the Spark documentation, and about Spark for machine learning in the MLlib documentation.

How to set up/monitor huge numbers of equivalent DAGs

I am new to Airflow and am still learning the concepts.
I am trying to monitor a huge number of webpages (>1000) once a day.
At the moment I dynamically create a single DAG for each webpage (data acquisition and processing). This works from a functional point of view. However, looking at the user interface I find the number of DAGs overwhelming, and my questions are:
Is this the right way to do it? (a single DAG for each webpage)
Is there any way to get a better overview of how the monitoring of all webpages is doing?
Since all the DAGs are equivalent and only deal with a different URL, it made me think that grouping these DAGs together, or having a common overview, might be possible or at least a good idea.
E.g. if the acquisition or processing of a certain webpage is failing I would like to see this easily in the UI without having to scroll many pages to find a certain DAG.
You should just have one DAG with multiple tasks. Based on the information you provided, the only thing that seems to change is the URL, so it is better to have one DAG and generate a task per webpage, for example as in the sketch below.
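A minimal sketch of that layout, assuming Airflow 2.x; the URL list and the fetch/processing logic are placeholders:

```python
# Minimal sketch, assuming Airflow 2.x; URLs and processing logic are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

URLS = [
    "https://example.com/page-a",
    "https://example.com/page-b",
    # ... the full list (>1000) could be loaded from a file or an Airflow Variable
]


def acquire_and_process(url, **_):
    # Placeholder: fetch the page and run the processing for this URL.
    print(f"monitoring {url}")


with DAG(
    dag_id="webpage_monitoring",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for i, url in enumerate(URLS):
        PythonOperator(
            task_id=f"monitor_{i:04d}",
            python_callable=acquire_and_process,
            op_kwargs={"url": url},
        )
```

With one DAG, the grid view then shows one row per task, so a failing webpage stands out without scrolling through hundreds of DAGs.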

How to provide exclusive copies of a big data repository to many developers?

Here's a situation I am facing right now at work:
We currently have 300 GB+ of production data (and it grows significantly every day). It's in a MongoDB cluster.
Data science team members are working on a few algorithms that require access to all of this data at once, and those algorithms may update data in place; hence, they have replicated the data into a dev environment for their use until they are sure their code works.
If multiple devs are running their algorithms, then all or some of them may end up with unexpected output because other algorithms are also updating the data.
This problem could easily be solved if everyone had their own copy of the data!
However, given the volume of data, it's not feasible for me to provide them (8 developers right now) with exclusive copies every day. Even if I automate this process, we'd have to wait until the copy completes over the wire.
I am hoping for a future-proof approach, considering we'll be dealing with terabytes of data quite soon.
I assume that many organizations face such issues, and I am wondering how other folks approach such a case.
I'd highly appreciate any pointers, leads, solutions for this.
Thanks
You can try using snapshots on the replicated data so each developer can have their own "copy" of the data. Look up how storage snapshots work and consult your cloud provider to see whether it can provide writable snapshots.
Note that snapshots are created almost instantly, and at the moment of creation they require almost no storage space, because the technology uses pointers rather than the data itself. Unfortunately, each snapshot can grow up to the original volume size, because any change to the data triggers a physical copy; the technology behind the process is usually CoW (copy-on-write). So there is a serious danger that uncontrolled snapshots can "eat" all your free storage space.

How to increase my Web Application's Performance?

I have an ASP.NET web application (Visual Studio 2008) using MS SQL Server 2005. I want to increase the performance of the web site. Does anyone know of an article containing steps to do that, step by step, in SQL (indexes, etc.) and in the code?
Performance tuning is a very specific process. I don't know of any articles that discuss directly how to achieve this, but I can give you a brief overview of the steps I follow when I need to improve performance of an application/website.
Profile.
Start by gathering performance data. At the end of the tuning process you will need some numbers to compare to actually prove you have made a difference. This means you need to choose some specific processes that you monitor and record their performance and throughput.
For example, on your site you might record how long a login takes. You need to keep this very narrow: pick a specific action that you want to record and time it. (Use a tool to do the timing, or put some Stopwatch code in your app to report times.) Also, don't just run it once; run it multiple times. Try to ensure you know the full environment setup so you can duplicate it again at the end.
Try to make this as close to your production environment as possible. Make sure your code is compiled in release mode and running on real, separate servers, not just all on one box, etc.
Instrument.
Now that you know what action you want to improve, and you have a target time to beat, you can instrument your code. This means injecting (manually or automatically) extra code that times each method call, or each line, and records times and/or memory usage right down the call stack.
There are lots of tools out there that can help you with this and automate some of it (Microsoft's CLR Profiler (free), Redgate ANTS (commercial), the higher editions of Visual Studio have profiling built in, and loads more). But you don't have to use automated tools; it's perfectly acceptable to just use the Stopwatch class to time each block of your code. What you are looking for is a bottleneck. The likelihood is that you will find a high proportion of the overall time is spent in a very small bit of code.
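As an illustration of that manual approach (the advice above is about .NET's Stopwatch; this sketch expresses the same idea in Python with a small timing helper, and the block names are arbitrary):

```python
# Language-agnostic illustration of manual instrumentation; the .NET version
# would wrap System.Diagnostics.Stopwatch in the same way.
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.1f} ms")


def login(username):
    # Placeholder blocks standing in for the real call stack.
    with timed("load user record"):
        time.sleep(0.02)
    with timed("verify password"):
        time.sleep(0.15)  # a disproportionate block like this is the bottleneck you hope to find
    with timed("build session"):
        time.sleep(0.01)


with timed("login total"):
    login("demo")
```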
Tune.
Now you have some timing data, you can start tuning.
There are two approaches to consider here. Firstly, take an overall perspective. Consider whether you need to redesign the whole call stack. Are you repeating something unnecessarily? Or are you just doing something you don't need to?
Secondly, now that you have an idea of where your bottleneck is, you can try to figure out ways to improve that bit of code. I can't offer much advice here, because it depends on what your bottleneck is, but just look to optimise it. Perhaps you need to cache data so you don't have to loop over it twice. Or batch up SQL calls so you can do just one. Or tighten your query filters so you return less data.
Re-profile.
This is the most important step that people often miss out. Once you have tuned your code, you absolutely must re-profile it in the same environment that you ran your initial profiling in. It is very common to make minor tweaks that you think might improve performance and actually end up degrading it because of some unknown way that the CLR handles something. This is much more common in managed languages because you often don't know exactly what is going on under the covers.
Now just repeat as necessary.
If you are likely to be performance tuning often I find it good to have a whole batch of automated performance tests that I can run that check the performance and throughput of various different activities. This way I can run these with every release and record performance changes each release. It also means that I can check that after a performance tuning session I know I haven't made the performance of some other area any worse.
When you are profiling, don't always just think about the time to run a single action. Also consider profiling under load, with lots of users logged in. Sometimes apps perform great when there's just one user connected, but when they hit a certain number of users the whole thing suddenly grinds to a halt, perhaps because they are spending more time context switching or swapping memory in and out to disk. If it's throughput you want to improve, you need to figure out what is causing the limit on throughput.
Finally. Check out this huge MSDN article on Improving .NET Application Performance and Scalability. Specifically, you might want to look at chapter 6 and chapter 17.
I think the best we can do from here is give you some pointers:
query less data from the sql server (caching, appropriate query filters)
write better queries (indexing, joins, paging, etc)
minimise any inappropriate blockages such as locks between different requests
make sure session-state hasn't exploded in size
use bigger metal / more metal
use appropriate looping code etc
But to stress: from here anything is guesswork. You need to profile to find the general area of the suckage, and then profile more to isolate the specific area(s); but start by looking at:
sql trace between web-server and sql-server
network trace between web-server and client (both directions)
cache / state servers if appropriate
CPU / memory utilisation on the web-server
I think, first of all, you have to find your bottlenecks and then try to improve those.
This helps you focus your effort exactly where you have a serious problem.
In addition, you need to improve your connection to the DB, for example by using a lazy singleton pattern, and by creating batch requests instead of single requests.
This helps you decrease the number of DB connections.
Check your caching and use suitable loop structures.
Another thing is to use appropriate types; for example, if you need an int, do not create a long, etc.
At the end, you can use a profiler (especially for SQL) and check whether your queries are implemented as well as possible.
