Enable Data Profiling on Apache Airflow

Data Profiling, although present in the documentation, is not visible in a fresh installation. Is there a way to enable it, or has it been deprecated? The installed version is apache-airflow==1.10.3, in case it helps.

Airflow 2 disabled Data Profiling for security reasons (from the breaking changes section):
Due to security concerns, the new web server will no longer support the features in the Data Profiling menu of the old UI, including Ad Hoc Query, Charts, and Known Events.

Data Profiling is part of the default UI when you install Airflow v1.10.3.
If you cannot see the Data Profiling menu, you might need to double-check whether a third party has customized Airflow for you.
Airflow uses Flask as its web framework; you can go to the relevant folder to see if it has been modified.
The file is at [your Airflow source code folder]/www/app.py.
Thanks for voting for the answer.
The default app.py section related to Data Profiling is below:
with app.app_context():
    from airflow.www import views

    admin = Admin(
        app, name='Airflow',
        static_url_path='/admin',
        index_view=views.HomeView(endpoint='', url='/admin', name="DAGs"),
        template_mode='bootstrap3',
    )
    av = admin.add_view
    vs = views
    av(vs.Airflow(name='DAGs', category='DAGs'))
    if not conf.getboolean('core', 'secure_mode'):
        print("create_app", __file__)
        av(vs.QueryView(name='Ad Hoc Query', category="Data Profiling"))
        av(vs.ChartModelView(
            models.Chart, Session, name="Charts", category="Data Profiling"))
        av(vs.KnownEventView(
            models.KnownEvent,
            Session, name="Known Events", category="Data Profiling"))
As you can see from the code above, the menu is tied to secure mode:
if not conf.getboolean('core', 'secure_mode'):
So please also check that secure_mode in airflow.cfg is configured properly; the Data Profiling menu is only shown when secure_mode = False:
# If set to False enables some unsecure features like Charts and Ad Hoc Queries.
# In 2.0 will default to True.
secure_mode = False
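For a quick sanity check, here is a minimal sketch (assuming a standard Airflow 1.10.x install, run in the same Python environment and AIRFLOW_HOME as the webserver) that prints the value the menu code reads:

from airflow.configuration import conf

# The Data Profiling views are only registered when this returns False.
print(conf.getboolean('core', 'secure_mode'))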

Related

Airflow Metadata DB = airflow_db?

I have a project requirement to back-up Airflow Metadata DB to some data warehouse (but not using an Airflow DAG). At the same time, the requirement mentions some connection called airflow_db.
I am quite new to Airflow, so I googled a bit on the topic. I am a bit confused about this part. Our Airflow Metadata DB is PostgreSQL (this is built from docker-compose, so I am tinkering on a local install), but when I look at Connections in Airflow Web UI, it says airflow_db is MySQL.
I initially assumed that they are the same, but by the looks of it, they aren't? Can someone explain the difference and what they are for?
Airflow creates the airflow_db Conn Id with MySQL by default (see the source code).
Default connections are not really useful in a production system; it's just a long list of stuff that you are probably not going to use.
Airflow 1.10.10 introduced the ability not to create the default list by setting:
load_default_connections = False in airflow.cfg (see the PR).
To give more background: the connection list is where hooks find the information needed to connect to a service. It is not related to the backend database. That said, the backend is a database like any other, and if you wish to allow hooks to interact with it, you can define it in the list like any other connection (which is probably why it appears as an option in the defaults).
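If you do want airflow_db to point at your actual PostgreSQL metadata backend, a rough sketch of overwriting the default connection programmatically follows; the host, login, and password are placeholders for a typical docker-compose setup, not values from your environment:

# Hedged sketch: repoint the default airflow_db connection at the real
# PostgreSQL metadata backend. Host/login/password are assumed placeholders.
from airflow import settings
from airflow.models import Connection

session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == 'airflow_db').one_or_none()
if conn is None:
    conn = Connection(conn_id='airflow_db')
    session.add(conn)
conn.conn_type = 'postgres'
conn.host = 'postgres'        # docker-compose service name (assumption)
conn.schema = 'airflow'
conn.login = 'airflow'
conn.set_password('airflow')
session.commit()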

Is it possible to set up MySQL replication with binlog files generated from server A (which we are considering as master) to server B?

We are migrating from Magento Community to Magento Cloud for one of our projects, and we need to access the DB for our custom-developed CRM.
Unfortunately, Magento Cloud does not support DB replication; they have enabled binlogs, but they do not support creating a replication user or setting up a server id. The binlog files can be synced to our CRM server periodically.
Now we want to know whether we can use the binlog files to replicate the database, or whether there is any workaround for doing the same.
We have tried using a tunnel setup, but query execution time is much higher over the tunnel, which would affect our CRM performance badly.
We also need to confirm whether there are any other possibilities for accessing the Magento Cloud DB from our CRM without a performance lag.
Thanks in advance for your suggestions.
Yes, it is possible, but it may be a little fiddly in the setup you are describing. You can replay the binlogs as relay logs. Have a look at this article for more details:
https://lefred.be/content/howto-make-mysql-point-in-time-recovery-faster/
Specifically, these parts are relevant (you'll need to edit them appropriately):
[root@mysql1 mysql]# for i in $(ls /tmp/binlogs/*.0*)
do
  ext=$(echo $i | cut -d'.' -f2);
  cp $i mysql1-relay-bin.$ext;
done
[root@mysql1 mysql]# ls ./mysql1-relay-bin.0* > mysql1-relay-bin.index
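Since the binlog files are synced to your server periodically, you may want to script that copy-and-reindex step. Below is a rough Python sketch of the same idea; the directories and the relay-log basename are assumptions and must match the replica's relay_log setting:

# Hedged sketch of the copy/reindex step above; paths and the relay-log
# basename are assumptions that must match the replica's configuration.
import glob
import os
import shutil

BINLOG_DIR = '/tmp/binlogs'          # where the synced binlogs land (assumed)
DATADIR = '/var/lib/mysql'           # MySQL datadir on the replica (assumed)
RELAY_BASENAME = 'mysql1-relay-bin'  # must match relay_log on the replica

relay_files = []
for path in sorted(glob.glob(os.path.join(BINLOG_DIR, '*.0*'))):
    ext = path.rsplit('.', 1)[1]
    shutil.copy(path, os.path.join(DATADIR, '%s.%s' % (RELAY_BASENAME, ext)))
    relay_files.append('./%s.%s' % (RELAY_BASENAME, ext))

# Rebuild the relay-log index so the SQL thread will pick the files up.
with open(os.path.join(DATADIR, RELAY_BASENAME + '.index'), 'w') as index:
    index.write('\n'.join(relay_files) + '\n')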

Kusto.Explorer - Authentication Trouble

I'm having trouble adding a connection in the Kusto.Explorer desktop app 1.0.3.949. I can login via Web UI but in the desktop app it gives me this error:
This normally represents a permanent error, and retrying is unlikely to help.
Please provide the following information when contacting the Kusto team @ https://aka.ms/kustosupport :
DataSource='https://m1explorer.westus.kusto.windows.net/v1/rest/mgmt',
DatabaseName='NetDefaultDB',
ClientRequestId='KD2RunCommand;5723fa83-9dd5-48fe-a1ee-5d4ddb7f9cd9',
ActivityId='74b41f5e-be7c-46be-88f5-dae1a6d35c30',
Timestamp='2020-08-02T18:48:13.6846740Z'.
In other applications such as the Kuskus VSCode extension or even the Web UI, the problem seems to be that it uses the "common" tenant/authority id as a default. Is there a way to specify the tenant id when adding the connection? It says you can import an .xml file but I'm not sure where or how this can be generated.
Thanks,
Steven
Please try the approach described at:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/tools/kusto-explorer#control-the-user-identity-connecting-to-kustoexplorer
The default security model for new connections is AAD-Federated
security. Authentication is done through the Azure Active Directory
using the default AAD user experience.
If you need finer control over the authentication parameters, you can
expand the "Advanced: Connection Strings" edit box and provide a valid
Kusto connection string value.
For example, users with a presence in multiple AAD tenants sometimes
need to use a particular "projection" of their identities to a
specific AAD tenant. Do this by providing a connection string, such as
the one below (replace words IN CAPITALS with specific values):
Data Source=https://CLUSTER_NAME.kusto.windows.net;Initial Catalog=DATABASE_NAME;AAD Federated Security=True;Authority Id=AAD_TENANT_OF_CLUSTER;User=USER_DOMAIN
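If you want to confirm that pinning the tenant resolves the error before changing Kusto.Explorer, here is a rough sketch using the azure-kusto-data Python SDK; CLUSTER_NAME, DATABASE_NAME, and AAD_TENANT_OF_CLUSTER are placeholders, and device-code authentication is just one convenient option:

# Hedged sketch using the azure-kusto-data package; the cluster, database,
# and tenant values below are placeholders to replace with your own.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://CLUSTER_NAME.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    cluster, authority_id="AAD_TENANT_OF_CLUSTER")
client = KustoClient(kcsb)

# A trivial query just to confirm that authentication against the
# pinned tenant succeeds.
response = client.execute("DATABASE_NAME", "print now()")
print(response.primary_results[0])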

Multiple applications in the same Symfony2 application

This is quite a long question, but there's quite a lot to it.
It feels like it should be a reasonably common use case, so I'm hoping the Stack Overflow community can provide me with a 'best practice in Symfony2' answer.
The solution I describe below works, but there are several consequences I'd like to avoid:
In my local dev environment, if I have used the wrong DB connection, a test will work in dev but fail in production
The routes of the ADMIN API are accessible on the PUBLIC API URL, just denied
If I have a mirror of live in my dev environment (3 separate checkouts with the corresponding parameters.yml files), then the feature tests for the other bundles fail
Is there a 'best practice in Symfony2' way to set up my project?
We're running a LAMP stack. We use git/(Atlassian) stash for version control.
We're using doctrine for the ORM and FOS-REST with OAuth plus symfony firewalls to authenticate and authorise the users.
We're committed to use Symfony2, so I am trying to find a 'best practice' solution:
I have a project with 3 applications:
A public-facing API (which gives read-only access to the data)
A protected API (which provides admin functionality)
A set of batch processes (to e.g. import data and monitor data quality)
Each application uses a set of shared models.
I have created 4 bundles: one for each application and a fourth for the shared models.
Each application must use a different database user to access the database.
There's only one database.
There are several tables; one is called 'prices'
The admin API must be accessible only from one hostname (e.g. admin-api.server1)
The public API must be accessible only from a different hostname (e.g. public-api.server2)
Each application is hosted on a different server
In parameters.yml in my dev environment I have this
// parameters.yml
api_public_db_user: user1
api_public_db_pass: pass1
api_admin_db_user: user2
api_admin_db_pass: pass2
batch_db_user: user3
batch_db_pass: pass3
In config.yml I have this:
// config.yml
doctrine:
    dbal:
        connections:
            api_public:
                user:     "%api_public_db_user%"
                password: "%api_public_db_pass%"
            api_admin:
                user:     "%api_admin_db_user%"
                password: "%api_admin_db_pass%"
            batch:
                user:     "%batch_db_user%"
                password: "%batch_db_pass%"
In my code I can do this (I believe this can be done from the service container too, but I haven't got that far yet):
$entityManager = $this->getContainer()->get('doctrine')->getManager('api_public');
$entityRepository = $this->getContainer()->get('doctrine')->getRepository('CommonBundle:Price', 'api_admin');
When I deploy my code to each of the live servers, I put junk values in the parameters.yml for the other applications
// parameters.yml on the public api server
api_public_db_user: user1
api_public_db_pass: pass1
api_admin_db_user: **JUNK**
api_admin_db_pass: **JUNK**
batch_db_user: **JUNK**
batch_db_pass: **JUNK**
I have locked down my application so that the database isn't accessible (and thus the other API features don't work)
I have also set up Symfony firewall security so that the different routes require different permissions
There's also security in the Apache vhost to deny access to, say, the admin API path from the public API directory.
So, I have secured my application and met the requirement of the security audit, but the dev process isn't ideal and something feels wrong.
As background:
We have previously looked at splitting it up into different applications within the same project (like this question on Symfony2 multiple applications and an API-centric application; we actually followed this method: http://jolicode.com/blog/multiple-applications-with-symfony2), but we ran into difficulties, and in any case Fabien says not to (https://groups.google.com/forum/#!topic/symfony-devs/yneojUuFiqw). That this existed in Symfony1 and was removed in Symfony2 is enough of an argument for me.
We have previously gone down the route of splitting up each bundle and importing it using composer, but this caused too many development overheads (for example, having to modify many repositories to implement a feature; it not being possible to see all of the changes for a feature in a single pull request).
We are receiving an ever growing number of requests to create APIs, and we're similarly worried about putting each application in its own repository.
So, putting each of the three applications in a separate Symfony project / git repository is something we want to avoid too.

How to access a "plone site object" without context?

I have a scheduled job (I'm using the apscheduler.scheduler lib) that needs access to the Plone site object, but I don't have the context in this case. I subscribed to the IProcessingStart event, but unfortunately the getSite() function returns None.
Also, is there a programmatic way to obtain a specific Plone site from the Zope server root?
Additional info:
I have a job like this:
from apscheduler.scheduler import Scheduler
from zope.site import hooks
from Products.CMFCore.utils import getToolByName

sched = Scheduler()

@sched.cron_schedule(day_of_week="*", hour="9", minute="0")
def myjob():
    site = hooks.getSite()
    print site
    print site.absolute_url()
    catalogtool = getToolByName(site, "portal_catalog")
    print catalogtool
The site variable is always None inside an APScheduler job, and we need information about the site to run the job correctly.
We have avoided executing it via a public URL because a user could then trigger the job directly.
Build a context first with setSite(), and perhaps a request object:
from zope.app.component.hooks import setSite
from Testing.makerequest import makerequest
app = makerequest(app)
site = app[site_id]
setSite(site)
This does require that you open a ZODB connection yourself and traverse to the site object yourself.
However, it is not clear how you are accessing the Plone site from your scheduler. Instead of running a full new Zope process, consider calling a URL from your scheduling job. If you integrated APScheduler into your Zope process, you'd have to create a new ZODB connection in the job, traverse to the Plone site from the root, and use the above method to set up the site hooks (needed for a lot of local components anyway).
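If you do keep APScheduler inside the Zope/Plone process, a rough outline of that pattern is below; the site id 'Plone' and the exact import locations are assumptions that vary between Plone versions, so treat it as a sketch rather than a drop-in job:

# Hedged outline, assuming APScheduler runs inside the Zope process and the
# Plone site id is "Plone"; imports differ slightly between Plone versions.
from zope.app.component.hooks import setSite
from Testing.makerequest import makerequest
from Products.CMFCore.utils import getToolByName
import Zope2
import transaction

def myjob():
    app = makerequest(Zope2.app())   # fresh ZODB connection + fake request
    try:
        site = app['Plone']          # traverse to the site (id is an assumption)
        setSite(site)                # register the site's local components
        catalog = getToolByName(site, 'portal_catalog')
        print site.absolute_url(), len(catalog())
        transaction.commit()
    finally:
        app._p_jar.close()           # close the ZODB connection we opened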
