Publish features to Cosmos DB using Azure Databricks Feature Store Client fails on workspace with Unity Catalog enabled - azure-cosmosdb

We are trying to create an online feature store using Cosmos DB, following this documentation: https://learn.microsoft.com/en-us/azure/databricks/machine-learning/feature-store/publish-features.
But I get an error when I publish the table to Cosmos DB: AnalysisException: Catalog 'cosmoscatalog' not found. The issue only happens on Unity Catalog-enabled workspaces; I can publish from a workspace without Unity Catalog enabled.
P.S. If I create the table using the non-Unity workspace, then the Unity-enabled workspace can update the Cosmos DB data. But the Unity-enabled workspace cannot create the Cosmos database/container using fs.publish_table.
I tried the following code:
from databricks.feature_store.online_store_spec import AzureCosmosDBSpec
from databricks.feature_store.client import FeatureStoreClient
fs = FeatureStoreClient()
account_uri = "https://online-feature-store.documents.azure.com:443/"
# Specify the online store.
online_store_spec = AzureCosmosDBSpec(
    account_uri=account_uri,
    write_secret_prefix="secret/write-cosmos",
    read_secret_prefix="secret/read-cosmos",
    database_name="online_feature_store_example",
    container_name="feature_store_online_wine_features"
)
# Push the feature table to online store.
fs.publish_table("online_feature_store_example.wine_static_features", online_store_spec, mode='merge')
The above code works on workspaces without Unity Catalog enabled. However, on a Unity Catalog-enabled workspace, it throws an error: AnalysisException: Catalog 'cosmoscatalog' not found

You need to create the database and container in Cosmos DB with the names you are specifying in AzureCosmosDBSpec.
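If you want to pre-create them from a notebook, here is a minimal sketch using the azure-cosmos Python SDK. It assumes the package is installed on the cluster and that the account's write key is readable from a Databricks secret scope (the scope/key names below are placeholders); the partition key path is also only a placeholder, so verify what the Feature Store client actually expects before relying on it.
from azure.cosmos import CosmosClient, PartitionKey

account_uri = "https://online-feature-store.documents.azure.com:443/"
# Placeholder secret scope/key; adjust to wherever your Cosmos write key lives.
cosmos_key = dbutils.secrets.get(scope="write-cosmos", key="authorization-key")

client = CosmosClient(account_uri, credential=cosmos_key)

# Create the database and container referenced by AzureCosmosDBSpec if they don't exist yet.
database = client.create_database_if_not_exists(id="online_feature_store_example")
container = database.create_container_if_not_exists(
    id="feature_store_online_wine_features",
    # Placeholder partition key path; check what fs.publish_table expects.
    partition_key=PartitionKey(path="/_feature_store_internal__primary_keys"),
)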

Related

Authentication for BigQuery using bigrquery from R in Google Colab

I am trying to access my own data tables stored on Google BigQuery from my Google Colab notebook (with an R runtime) by running the following code:
# install.packages("bigrquery")
library("bigrquery")
bq_auth(path = "mykeyfile.json")
projectid = "work-366734"
sql <- "SELECT * FROM `Output.prepared_data`"
Running
tb <- bq_project_query(projectid, sql)
results in the following access denied error:
Access Denied: BigQuery BigQuery: Permission denied while globbing file pattern. [accessDenied]
For clarification, I already created a service account (under Google Cloud IAM & Admin), gave it the roles ‘BigQuery Admin’ and ‘BigQuery Data Owner’, and extracted the above-mentioned JSON key file ‘mykeyfile.json’ (as suggested here).
Additionally, I added the service account as a principal on the dataset (BigQuery – Sharing – Permissions – Add Principal), but the same error still shows up…
Of course, I already reset/deleted and reinitialized the runtime.
Am I missing giving additional permissions somewhere else?
Thanks!
Not sure if it is relevant, but I add it just in case: I also tried the authentication process via
bq_auth(use_oob = TRUE, cache = FALSE)
which opens an additional window where I have to allow access (using my Google account, which is also the data owner) and enter an authorization code. While this step works, bq_project_query(projectid, sql) still gives the same Access Denied error.
Authorizing access to Google BigQuery using Python and the following commands works flawlessly (using the same account/credentials).
from google.colab import auth
from google.cloud import bigquery  # needed for bigquery.Client
auth.authenticate_user()
project_id = "work-366734"
client = bigquery.Client(project=project_id)
df = client.query( '''
SELECT
*
FROM
`work-366734.Output.prepared_data`
''' ).to_dataframe()

Cosmos DB Zone Redundancy using Azure Libraries for .NET

I currently create a CosmosDB with the following properties:
cosmosDb = await azure.CosmosDBAccounts
.Define(cosmosDbResource.Name)
.WithRegion(cosmosDbResource.Region)
.WithExistingResourceGroup(cosmosDbResource.ResourceGroup.Name)
.WithKind(DatabaseAccountKind.GlobalDocumentDB)
.WithStrongConsistency()
.WithTags(cosmosDbResource.ResourceGroup.Tags)
.CreateAsync();
The only place I have seen to be able to set Zone Redundancy on is the ReadReplication database, like so:
cosmosDb = await azure.CosmosDBAccounts
.Define(cosmosDbResource.Name)
.WithRegion(cosmosDbResource.Region)
.WithExistingResourceGroup(cosmosDbResource.ResourceGroup.Name)
.WithKind(DatabaseAccountKind.GlobalDocumentDB)
.WithStrongConsistency()
.WithReadReplication(Region.USEast, true)
.WithTags(cosmosDbResource.ResourceGroup.Tags)
.CreateAsync();
The problem is that I don't care about a read replication database. I want to set Zone Redundancy on the initial database I create. I noticed that when I create a Cosmos DB account manually in the Azure Portal, it gives me the option to set Zone Redundancy. Is this not possible via the Azure Libraries for .NET SDK?
To specify the write region with Zone Redundancy, do this:
.WithWriteReplication(Region.USWest2, true)
PS: If at all possible, I would recommend you use the Auto-rest generated version of this SDK. The fluent API is generally not as up to date as the Auto-rest generated APIs. The Auto-rest SDK is built directly off of the Cosmos DB swagger spec, and everything downstream is built upon it, including ARM, PowerShell and CLI.
There is also a repository with a fairly complete set of examples that you can use to help build your own management libraries. It includes fluent samples as well, but they are also out of date: Cosmos DB Samples
This is the repo for the Auto-rest generated SDK. Cosmos DB Management SDK for .NET
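Not .NET, but as a rough illustration of what the Auto-rest/ARM surface exposes, here is a hedged sketch using the Python management SDK (azure-mgmt-cosmosdb); the subscription, resource group, account name and region are placeholders:
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import DatabaseAccountCreateUpdateParameters, Location

client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Zone redundancy is set per region on the account's locations.
poller = client.database_accounts.begin_create_or_update(
    "<resource-group>",
    "<account-name>",
    DatabaseAccountCreateUpdateParameters(
        location="westus2",
        database_account_offer_type="Standard",
        locations=[Location(location_name="westus2", failover_priority=0, is_zone_redundant=True)],
    ),
)
account = poller.result()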

How to access on premise Teradata from Azure Databricks

We need to connect to on-premises Teradata from Azure Databricks.
Is that possible at all?
If yes, please let me know how.
I was looking for this information as well and I recently was able to access our Teradata instance from Databricks. Here is how I was able to do it.
Step 1. Check your cloud connectivity.
%sh nc -vz 'jdbcHostname' 'jdbcPort'
- 'jdbcHostname' is your Teradata server.
- 'jdbcPort' is your Teradata server's listening port. By default, Teradata listens on TCP port 1025.
Also check out Databricks' best practices on connecting to other infrastructure.
Step 2. Install Teradata JDBC driver.
Teradata Downloads page provides JDBC drivers by version and archive type. You can also check the Teradata JDBC Driver Supported Platforms page to make sure you pick the right version of the driver.
Databricks offers multiple ways to install a JDBC library JAR for databases whose drivers are not available in Databricks. Please refer to the Databricks Libraries documentation to learn more and pick the option that is right for you.
Once installed, you should see it listed in the Cluster details page under the Libraries tab.
Terajdbc4.jar dbfs:/workspace/libs/terajdbc4.jar
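For example, if you have downloaded the driver JAR onto the driver node's local disk, one way to stage it in DBFS before attaching it as a cluster library is the sketch below (the source path is just an assumption):
# Copy the downloaded driver JAR from local disk to DBFS, then attach
# dbfs:/workspace/libs/terajdbc4.jar to the cluster under the Libraries tab.
dbutils.fs.cp("file:/tmp/terajdbc4.jar", "dbfs:/workspace/libs/terajdbc4.jar")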
Step 3. Connect to Teradata from Databricks.
You can define some variables to programmatically create these connections. Since my instance required LDAP, I added LOGMECH=LDAP to the URL. Without LOGMECH=LDAP, it returns a "username or password invalid" error message.
(Replace the placeholder values shown with the values from your environment.)
driver = "com.teradata.jdbc.TeraDriver"
url = "jdbc:teradata://Teradata_database_server/Database=Teradata_database_name,LOGMECH=LDAP"
table = "Teradata_schema.Teradata_tablename_or_viewname"
user = "your_username"
password = "your_password"
Now that the connection variables are specified, you can create a DataFrame. You can also explicitly set this to a particular schema if you have one already. Please refer to Spark SQL Guide for more information.
Now, let’s create a DataFrame in Python.
My_remote_table = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load()
Now that the DataFrame is created, it can be queried. For instance, you can select particular columns to display within Databricks.
display(My_remote_table.select("EXAMPLE_COLUMN"))
Step 4. Create a temporary view or a permanent table.
My_remote_table.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")
or
My_remote_table.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")
Steps 3 and 4 can also be combined if the intention is simply to create a table in Databricks from Teradata, as in the sketch below. Check out the Databricks documentation SQL Databases Using JDBC for other options.
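A minimal sketch of combining the read and the write in a single chain, reusing the connection variables defined above (the table name is just an example):
# Read the Teradata table over JDBC and persist it directly as a Databricks table.
spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load() \
    .write.format("parquet") \
    .saveAsTable("MY_PERMANENT_TABLE_NAME")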
Here is a link to the write-up I published on this topic.
Accessing Teradata from Databricks for Rapid Experimentation in Data Science and Analytics Projects
If you create a virtual network that can connect to on-prem, then you can deploy your Databricks instance into that VNet. See https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html.
I assume that there is a Spark connector for Teradata. I haven't used it myself, but I'm sure one exists.
You can't. If you run Azure Databricks, all the data needs to be stored in Azure. But you can pull the data from Teradata using a REST API and then save it in Azure.

Azure Cosmos DB - Gremlin API to clone existing collection into another collection

I have created a gremlin api database in Azure Cosmos DB and have data in one collection.
However, I want to know if there is a way to clone the data into another collection in another database.
I want to copy graph data from Dev environment to stage and prod environments.
You can use the existing tools for the Cosmos DB SQL API (earlier known as DocumentDB); Cosmos DB lets you query graph data via the SQL API as well.
Something like select * from c fetches the JSON representation of how Cosmos DB stores your graph data.
The simplest approach is the Cosmos DB Data Migration Tool:
1. Set the input source to Cosmos DB SQL API/DocumentDB, use your dev endpoint, and run the query select * from c.
2. Set the output type to JSON and export your data.
3. Use the exported JSON as the input source, set your prod graph database as the output (again choosing DocumentDB/Cosmos SQL API as the output type), and run it.
This should push your dev graph data to prod.
You can also use other Azure tools such as Data Factory, which work with DocumentDB.
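If you prefer scripting it, here is a minimal sketch using the azure-cosmos Python SDK; the endpoints, keys and database/container names are placeholders, and it copies the raw JSON documents, so both containers should share the same partition key definition.
from azure.cosmos import CosmosClient

# Placeholder endpoints/keys for the dev (source) and prod (target) accounts.
source = CosmosClient("https://dev-account.documents.azure.com:443/", credential="<dev-key>")
target = CosmosClient("https://prod-account.documents.azure.com:443/", credential="<prod-key>")

src_container = source.get_database_client("graphdb").get_container_client("graph")
dst_container = target.get_database_client("graphdb").get_container_client("graph")

# Read every document (the SQL API view of the graph) and upsert it into the target container.
for doc in src_container.query_items(query="SELECT * FROM c", enable_cross_partition_query=True):
    dst_container.upsert_item(doc)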
I just used CosmicClone to clone a Cosmos DB graph database from one account to another: https://github.com/microsoft/CosmicClone. It cloned 500k records in 20 minutes. It looks like it would also work to clone a single collection within a database.

wso2 api analytics schema

I have installed WSO2 API Manager and API Manager Analytics, and I want to change the Analytics datasources from H2 to MySQL.
In the tutorial, they mention creating the equivalent database schema for WSO2_ANALYTICS_PROCESSED_DATA_STORE_DB.
Where can I find the schemas for these databases?
Thanks
prabhat
You need to perform the following steps to configure your wso2am-analytics on MySQL.
Step 1: Create all required databases in MySQL. Just create them, nothing else:
create database wso2am_stats_db; -- use the existing one from wso2-am
create database wso2metrics_db; -- use the existing one from wso2-am
create database wso2_processed_data_store;
create database wso2carbon_db;
create database wso2_geolocation_db;
create database wso2_event_store;
Step 2: Configure these databases in the configuration files under /wso2am-analytics-2.0.0/repository/conf/datasources/. There are 5 of them in which you need to edit the database details:
analytics-datasources.xml
geolocation-datasources.xml
master-datasources.xml
metrics-datasources.xml
stats-datasources.xml
Step 3: Once you are done, start your server with -Dsetup (only once); this will create all the tables needed. For example: sh bin/wso2server.sh -Dsetup
For more, read this article: Setting up MySQL
