Relative path in absolute URI exception while accessing DynamoDB via Glue Data Catalog in PySpark running on EMR

I am executing a PySpark application on AWS EMR that is configured to use the AWS Glue Data Catalog as its metastore. I have a table set up in AWS Glue that points to a DynamoDB table. Now, in my PySpark script, I am trying to access the Glue table. I am able to run show tables and can see the Glue table, but when I try to query the table, I get the exception below:
pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: arn:aws:dynamodb:<region>:<acct_id>:table/DDBTABLE;'
My query in the PySpark script:
spark.sql("select * from ddbtable").show()
I couldn't find any good reference on this. I see people talking about an issue with spark.sql.warehouse.dir, but I'm not sure how that is related to the Glue Data Catalog. Any inputs?

I contacted AWS support, and apparently this is an issue with EMR (as of 5.23.0) when using the Glue Data Catalog and accessing a Glue table that connects to DynamoDB. They are still working on it and in the meantime have provided the workaround below.
Edit the Glue table's properties to include the following:
Update: set the Location property to a dummy S3 location, so that it is of the form s3://dummy-path
Add: add the DynamoDB-specific information below under the table parameters:
"dynamodb.table.name": "ddb-table",
"dynamodb.column.mapping": "col:col",
"storage_handler": "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler"
For updating the Glue table, refer here.
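If you prefer to script the change, here is a minimal sketch using boto3; the database name, table name, region, and dummy S3 path are placeholders for illustration:

import boto3

glue = boto3.client("glue", region_name="<region>")

db_name = "default"        # placeholder: your Glue database
table_name = "ddbtable"    # placeholder: your Glue table

# Start from the existing definition so only the workaround fields change.
table = glue.get_table(DatabaseName=db_name, Name=table_name)["Table"]

# Workaround: dummy S3 location plus DynamoDB-specific parameters.
table["StorageDescriptor"]["Location"] = "s3://dummy-path"
table.setdefault("Parameters", {}).update({
    "dynamodb.table.name": "ddb-table",
    "dynamodb.column.mapping": "col:col",
    "storage_handler": "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler",
})

# get_table returns read-only fields that update_table's TableInput does not accept.
for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
            "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"):
    table.pop(key, None)

glue.update_table(DatabaseName=db_name, TableInput=table)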

Related

specify a database name in databricks sql connection parameters

I am using Airflow 2.0.2 to connect to Databricks using the airflow-databricks-operator. The SQL operator doesn't let me specify the database where the query should be executed, so I have to prefix the table_name with the database_name. I also tried reading through the docs of databricks-sql-connector here -- https://docs.databricks.com/dev-tools/python-sql-connector.html -- and still couldn't figure out whether I could give the database name as a parameter in the connection string itself.
I tried setting database/schema/namespace in the **kwargs, but no luck. The query executor keeps saying that the table is not found, because the query keeps getting executed in the default database.
Right now it's not supported. The primary reason is that if you have multiple statements, the connector could reconnect between their execution, and the effect of use would be lost. databricks-sql-connector also doesn't allow setting a default database.
Right now you can work around that by adding an explicit use <database> statement to the list of SQL statements to execute (the sql parameter can be a list of strings, not just a single string), as in the sketch below.
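A minimal sketch of that workaround, assuming the DatabricksSqlOperator from a recent apache-airflow-providers-databricks release; the connection id, endpoint name, database, and table names are placeholders:

from airflow.providers.databricks.operators.databricks_sql import DatabricksSqlOperator

# Passing a list of statements keeps the "use" in the same task as the query,
# working around the connector's lack of a default-database setting.
select_from_my_db = DatabricksSqlOperator(
    task_id="select_from_my_db",
    databricks_conn_id="databricks_default",   # placeholder connection id
    sql_endpoint_name="my-sql-endpoint",       # placeholder SQL endpoint
    sql=[
        "use my_database",
        "select * from my_table limit 10",
    ],
)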
P.S. I'll have a look; maybe I'll add setting of the default catalog/database in a future version.

List available collections for database in ArangoDB using HTTP interface?

I am trying to use ArangoDB's HTTP interface to dump all collections belonging to a specific database.
I am able to view all available databases using the following command:
curl http://localhost:8529/_api/database
However, once I find a database name (for example, "test") I am unable to dump the collections belonging to this database. Ultimately, I would like to dump the collections for this database, and then all results within a chosen collection.
I have followed the documentation provided here: https://www.arangodb.com/docs/stable/http/general.html, however I am still unable to find the relevant documentation for this request.
You can get the list of all collections with
curl http://localhost:8529/_db/DBNAME/_api/collection
as is implied in this part of the documentation: https://www.arangodb.com/docs/stable/http/collection.html#address-of-a-collection
The rest of the interface works accordingly, e.g.
curl http://localhost:8529/_db/DBNAME/_api/collection/COLLNAME
to get information about a single collection (that is already included in the output of the first call).
You can find the complete Swagger documentation with just two clicks in the web interface.
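Putting those calls together, a small Python sketch that lists the collections of a database and dumps the documents of each one via the cursor (AQL) API; the endpoint, database name, and batch size are placeholders, and authentication is omitted:

import requests

BASE = "http://localhost:8529"  # placeholder ArangoDB endpoint
DB = "test"                     # placeholder database name

# List the collections of the database (excluding internal system collections).
collections = requests.get(
    f"{BASE}/_db/{DB}/_api/collection", params={"excludeSystem": "true"}
).json()["result"]

for coll in collections:
    print("collection:", coll["name"])
    # Dump the documents of the collection through the cursor API.
    cursor = requests.post(
        f"{BASE}/_db/{DB}/_api/cursor",
        json={"query": f"FOR doc IN {coll['name']} RETURN doc", "batchSize": 100},
    ).json()
    for doc in cursor["result"]:
        print(doc)
    # When cursor["hasMore"] is true, fetch further batches with
    # PUT {BASE}/_db/{DB}/_api/cursor/<cursor id>.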

How to access on premise Teradata from Azure Databricks

We need to connect to on-premise Teradata from Azure Databricks.
Is that possible at all?
If yes, please let me know how.
I was looking for this information as well and I recently was able to access our Teradata instance from Databricks. Here is how I was able to do it.
Step 1. Check your cloud connectivity.
%sh nc -vz 'jdbcHostname' 'jdbcPort'
- 'jdbcHostname' is your Teradata server.
- 'jdbcPort' is your Teradata server's listening port. By default, Teradata listens on TCP port 1025.
Also check out Databricks' best practices on connecting to other infrastructure.
Step 2. Install Teradata JDBC driver.
Teradata Downloads page provides JDBC drivers by version and archive type. You can also check the Teradata JDBC Driver Supported Platforms page to make sure you pick the right version of the driver.
Databricks offers multiple ways to install a JDBC library JAR for databases whose drivers are not available in Databricks. Please refer to the Databricks Libraries to learn more and pick the one that is right for you.
Once installed, you should see it listed in the Cluster details page under the Libraries tab.
Terajdbc4.jar dbfs:/workspace/libs/terajdbc4.jar
Step 3. Connect to Teradata from Databricks.
You can define some variables so the connection can be created programmatically. Since my instance required LDAP, I added LOGMECH=LDAP to the URL. Without LOGMECH=LDAP it returned a "username or password invalid" error message.
(Replace the placeholder values with the values from your environment.)
driver = "com.teradata.jdbc.TeraDriver"
url = "jdbc:teradata://Teradata_database_server/Database=Teradata_database_name,LOGMECH=LDAP"
table = "Teradata_schema.Teradata_tablename_or_viewname"
user = "your_username"
password = "your_password"
Now that the connection variables are specified, you can create a DataFrame. You can also explicitly set this to a particular schema if you have one already. Please refer to Spark SQL Guide for more information.
Now, let’s create a DataFrame in Python.
My_remote_table = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load()
Now that the DataFrame is created, it can be queried. For instance, you can select particular columns and display them within Databricks.
display(My_remote_table.select("EXAMPLE_COLUMN"))
Step 4. Create a temporary view or a permanent table.
My_remote_table.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")
or
My_remote_table.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")
Steps 3 and 4 can also be combined if the intention is simply to create a table in Databricks from Teradata. Check out the Databricks documentation SQL Databases Using JDBC for other options.
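For example, a minimal sketch of that combined form, reusing the connection variables from step 3 (the target table name is a placeholder):

# Read from Teradata over JDBC and persist directly as a Databricks table.
spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load() \
    .write.format("parquet") \
    .saveAsTable("MY_PERMANENT_TABLE_NAME")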
Here is a link to the write-up I published on this topic.
Accessing Teradata from Databricks for Rapid Experimentation in Data Science and Analytics Projects
If you create a virtual network that can connect to on-prem, then you can deploy your Databricks instance into that VNet. See https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html.
I assume that there is a Spark connector for Teradata. I haven't used it myself, but I'm sure one exists.
You can't. If you run Azure Databricks, all the data needs to be stored in Azure. But you can call the data from Teradata using a REST API and then save the data in Azure.

Using the Nexus3 API how do I get a list of artifacts in a repository

We are migrating from Nexus Repository Manager 2.1.4 to Nexus 3.1.0-04. With version 2 we have been able to use the API to get a list of artifacts by repository, however we are struggling to find a way to do this with the Nexus 3 API.
Having read https://books.sonatype.com/nexus-book/reference3/scripting.html chapter 16 we have been able to get artifact information for a specific blob using a groovy script like:
import org.sonatype.nexus.blobstore.api.BlobId
def properties = blobStore.blobStoreManager.get("default").get(new BlobId("7f6379d32f8dd78f98b5b181166703b6")).getProperties()
return [headers: properties.headers, metrics: properties.metrics]
However we can't find a way to iterate over the contents of a blob store. We can get a blob store object:
blobStore.blobStoreManager.get("default")
however the API does not appear to give us a way to get a list of all blobs within that store. We need to get a list of the blobIDs within a blob store.
Is there a way to do this via the Nexus 3 API?
One of our internal team members put this together. It doesn't use the blobStore, but I believe it accomplishes what you are trying to do (and a bit more): https://gist.github.com/kellyrob99/2d1483828c5de0e41732327ded3ab224
For some background, think of a blobStore as just where we store the bits, with no information about them. OrientDB has Component/Asset records and stores all the info about them. You'll generally want to use that instead of the blobStore for Asset information as a result.
Once your migration is done, it can be worth investigating an upgrade of your Nexus version.
That way, you will be able to use the new (still in beta) REST API for Nexus. It's available by default from version 3.3.0 onwards: http://localhost:8082/swagger-ui/
Basically, you retrieve the json output from this URL: http://localhost:8082/service/siesta/rest/beta/assets?repositoryId=YOURREPO
Only 10 records will be displayed at a time and you will have to use the continuationToken provided to request the next 10 records for your repository by calling: http://localhost:8082/service/siesta/rest/beta/assets?continuationToken=46525652a978be9a87aa345bdb627d12&repositoryId=YOURREPO
More information here: http://blog.sonatype.com/nexus-repository-new-beta-rest-api-for-content
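A small Python sketch of that pagination loop; the host and repository name are the placeholders from the URLs above, and the response field names (items, downloadUrl, continuationToken) are assumptions based on the current Nexus REST API, so adjust them to your version:

import requests

BASE = "http://localhost:8082/service/siesta/rest/beta"
REPO = "YOURREPO"  # placeholder repository name

# Page through all assets in the repository, following the continuationToken.
params = {"repositoryId": REPO}
while True:
    page = requests.get(f"{BASE}/assets", params=params).json()
    for asset in page.get("items", []):
        print(asset.get("downloadUrl"))
    token = page.get("continuationToken")
    if not token:
        break  # no more pages
    params = {"repositoryId": REPO, "continuationToken": token}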

Spark newbie (ODBC/SparkSQL)

I have a Spark cluster set up and have tried both native Scala and Spark SQL on my dataset, and the setup seems to work for the most part. I have the following questions.
Regarding ODBC/external connectivity to the cluster, what should I expect?
- does the admin/developer shape the data and persist/cache a few RDDs that will be exposed? (thinking along the lines of Hive tables)
- what would be the equivalent of connecting to a "Hive metastore" in Spark/Spark SQL?
Is thinking along the lines of Hive flawed?
My other questions are:
- when I issue Hive queries (and, say, create tables and such), it uses the same Hive metastore as Hadoop/Hive
- where do the tables get created when I issue SQL queries using sqlContext?
- if I persist the table, is it the same concept as persisting an RDD?
Appreciate your answers
Nithya
(This is written with Spark 1.1 in mind; be aware that new features tend to be added quickly, and some of the limitations mentioned below might very well disappear at some point in the future.)
You can use Spark SQL with Hive syntax and connect to a Hive metastore, which will result in your Spark SQL Hive commands being executed on the same data space as if they were executed through Hive directly.
To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
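A minimal PySpark sketch of that setup for the Spark 1.x era; the application name and table name are placeholders, and hive-site.xml must be available on the configuration path:

from pyspark import SparkContext
from pyspark.sql import HiveContext

# HiveContext picks up hive-site.xml to locate the Hive metastore,
# so the query below runs against the same warehouse as Hive itself.
sc = SparkContext(appName="spark-sql-hive-example")
sqlContext = HiveContext(sc)

rows = sqlContext.sql("SELECT * FROM some_hive_table LIMIT 10")  # returns a SchemaRDD
for row in rows.collect():
    print(row)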
The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects that has an associated schema. You can use it just like any other RDD, including cache and persist, and the effect is the same (the fact that the data comes from Hive has no influence here).
If your Hive command creates data, e.g. "CREATE TABLE ...", the corresponding table gets created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
Executing Hive SQL through Spark provides you with all the caching benefits of Spark: executing a 2nd SQL query on the same data set within the same spark context will typically be much faster than the first query.
Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially an equivalent of HiveServer2 and thus allows you to execute Spark SQL commands through a JDBC connection.
Note that not all Hive features are available (yet?), see details here.
Finally, you can also discard the Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available this way is smaller than when using the Hive syntax.
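As a sketch of that Hive-free path with the Spark 1.x API; the file path, table alias, and column name are placeholders:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-sql-example")
sqlContext = SQLContext(sc)  # plain SQLContext, no Hive metastore involved

# Load a Parquet file, register it as a temporary table, and query it with SQL.
events = sqlContext.parquetFile("hdfs:///data/events.parquet")  # placeholder path
events.registerTempTable("events")
counts = sqlContext.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type")
print(counts.collect())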
