Why does an AWS Glue DataBrew project, when created, try to load all data from the RDS table? - aws-databrew

When we create a DataBrew project and refer to a Dataset based on a JDBC connection, we find that RDS (a MySQL database) executes the query SELECT * FROM table. But our table contains a huge amount of data. Why is the complete data being loaded? Can we restrict it from loading all the data? Loading the whole table into the DataBrew project is a major concern for us, especially since the DataBrew project has a default sample size of only 500 records. Our main concern is why the query (SELECT * FROM table) is sent to RDS at all.
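One possible mitigation, sketched below rather than taken from the thread: recent versions of the DataBrew API allow a JDBC dataset to be defined by a custom SQL query instead of a bare table name, so only that query's result set is pulled from RDS. The connection, table, bucket, and dataset names are placeholders, and the QueryString field may not exist in older SDK versions.

import boto3

databrew = boto3.client("databrew", region_name="us-east-1")

# Hedged sketch: define the dataset with a restrictive custom query so that
# DataBrew does not fall back to loading the whole table.
databrew.create_dataset(
    Name="orders-sample",                                    # placeholder name
    Input={
        "DatabaseInputDefinition": {
            "GlueConnectionName": "my-rds-jdbc-connection",  # placeholder
            "QueryString": "SELECT * FROM orders WHERE order_date >= '2023-01-01'",
            "TempDirectory": {"Bucket": "my-databrew-temp-bucket"},  # placeholder
        }
    },
)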

Related

Relative path in absolute URI Exception while accessing DynamoDB via Glue Data Catalogue in PySpark running on EMR

I am executing a PySpark application on AWS EMR that is configured to use the AWS Glue Data Catalog as its metastore. I have a table set up in AWS Glue that points to a DynamoDB table. Now, in my PySpark script, I am trying to access that Glue table. I am able to run show tables and can see the Glue table, but when I try to query it, I get the exception below:
pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: arn:aws:dynamodb:<region>:<acct_id>:table/DDBTABLE;'
My query in pyspark script:
spark.sql("select * from ddbtable").show()
I couldn't find any good reference on this. I see people talking about an issue with spark.sql.warehouse.dir, but I'm not sure how that relates to the Glue Data Catalog. Any inputs?
I contacted AWS Support, and apparently this is an issue with EMR (as of 5.23.0) when using the Glue Data Catalog and accessing a Glue table that connects to DynamoDB. They are still working on it and in the meantime have provided the workaround below.
Edit the properties of the Glue table as follows:
update: set the Location property to a dummy S3 location so that it is of the form s3://dummy-path
add: add the DynamoDB-specific information below under parameters,
"dynamodb.table.name": "ddb-table",
"dynamodb.column.mapping": "col:col",
"storage_handler": "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler"
For updating the Glue table, refer here.
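A rough boto3 sketch of that workaround (not from the original answer); the database and table names are placeholders, and the set of read-only fields that has to be stripped before calling update_table may vary slightly between API versions.

import boto3

glue = boto3.client("glue")

# Fetch the current table definition (names are placeholders).
table = glue.get_table(DatabaseName="my_glue_db", Name="ddbtable")["Table"]

# get_table returns read-only fields that update_table's TableInput rejects.
for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
            "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"):
    table.pop(key, None)

# Apply the workaround: dummy S3 location plus DynamoDB-specific parameters.
table["StorageDescriptor"]["Location"] = "s3://dummy-path"
table.setdefault("Parameters", {}).update({
    "dynamodb.table.name": "ddb-table",
    "dynamodb.column.mapping": "col:col",
    "storage_handler": "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler",
})

glue.update_table(DatabaseName="my_glue_db", TableInput=table)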

Attaching a database in sqlite

I'm trying to attach one database to another so that data from it can be used in queries.
I'm using
ATTACH DATABASE '{full url}' AS {database_name};
No errors actually occur, but when I run
PRAGMA database_list
Only the main database shows.
I found these limitations for ATTACH DATABASE that may or may not apply to your case:
The database names main and temp are reserved names within your database connection and cannot be used for attached databases.
The database name called main is reserved for the primary database and the database name called temp is reserved for the database that holds temporary tables.
Attached databases must use the same text encoding as the main database.
When the database connection is closed, the attached database will automatically be detached.
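A small self-contained sketch (Python's built-in sqlite3 module, file names invented) that attaches a second database and checks the result. One thing worth noting: ATTACH only applies to the connection that issued it, so running PRAGMA database_list on a different connection will show only main.

import sqlite3

# Create a throwaway auxiliary database to attach (file names are placeholders).
aux = sqlite3.connect("aux.db")
aux.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT)")
aux.commit()
aux.close()

conn = sqlite3.connect("main.db")
# The file name can be bound as a parameter; the schema name cannot.
conn.execute("ATTACH DATABASE ? AS auxdb", ("aux.db",))

# Should now list both 'main' and 'auxdb' for this connection.
print(conn.execute("PRAGMA database_list").fetchall())

# Cross-database query using the schema-qualified name.
print(conn.execute("SELECT count(*) FROM auxdb.products").fetchone())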

Can we insert data into the Firebase Realtime Database?

One child node of my Firebase Realtime Database has become huge (around 20 GB), and I need to purge it and insert last month's data, extracted from a backup, back into the Realtime Database using the Python Admin SDK.
In the documentation, I see the following options:
set - Write or replace data to a defined path, like messages/users/
update - Update some of the keys for a defined path without replacing all of the data
push - Add to a list of data in the database. Every time you push a new node onto a list, your database generates a unique key, like messages/users//
transaction - Use transactions when working with complex data that could be corrupted by concurrent updates
However, I want to add/insert the data from the Firebase backup. It has to be an insert because the app is used in production and I cannot afford to overwrite existing data.
Is there any method available to insert/add the data and not overwrite the data?
Any help/support is greatly appreciated.
There is no way to do this in Firebase Realtime Database without reading the current value of each location.
The only operation that allows you to update data based on its existing value is a transaction. A Firebase transaction gives you the (likely) current value at a location, and you then return what the new value should become.
But if the data you're restoring is (largely) the same as the data you have in the database, you might be able to use an update() call with sufficiently deep paths.
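As an illustration of that last point (not part of the original answer), here is a minimal Python Admin SDK sketch using a single multi-location update. The service-account file, database URL, and paths are all hypothetical; each deep path replaces whatever currently exists at exactly that path, leaving siblings untouched.

import firebase_admin
from firebase_admin import credentials, db

# Placeholder credentials and database URL.
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://example-project-default-rtdb.firebaseio.com",
})

# One multi-location update: each key is a deep path relative to the root.
# Only these exact paths are written; sibling data is left untouched.
db.reference("/").update({
    "messages/users/user_123/msg_aaa": {"text": "restored from backup"},
    "messages/users/user_456/msg_bbb": {"text": "restored from backup"},
})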

cosmosdb - archive data older than n years into cold storage

I researched in several places and could not find any direction on what options there are for archiving old data from Cosmos DB into cold storage. I see that for DynamoDB in AWS it is mentioned that you can move DynamoDB data into S3, but I am not sure what the options are for Cosmos DB. I understand there is a time-to-live option where data is deleted after a certain date, but I am interested in archiving rather than deleting. Any direction would be greatly appreciated. Thanks
I don't think there is a single-click built-in feature in CosmosDB to achieve that.
Still, since you mentioned appreciating any direction, I suggest you consider the DocumentDB Data Migration Tool.
Notes about the Data Migration Tool:
You can specify a query to extract only the cold data (for example, by a creation date stored within the documents).
It supports exporting to various targets (JSON file, blob storage, DB, another Cosmos DB collection, etc.).
It compacts the data in the process - it can merge documents into a single array document and zip it.
Once you have the configuration set up, you can script this to be triggered automatically using your favorite scheduling tool.
You can easily reverse the source and target to restore the cold data to the active store (or to dev, test, backup, etc.).
To remove the exported data you could use the mentioned TTL feature, but that could cause data loss should your export step fail. I would suggest writing and executing a stored procedure that queries and deletes all exported documents with a single call. That SP would not execute automatically, but it could be included in the automation script and executed only if the data was exported successfully first.
See: Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs.
UPDATE:
These days Cosmos DB has added the Change Feed, which really simplifies writing a carbon copy somewhere else.
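For completeness, a hedged Python sketch of the same archive-then-delete idea using the azure-cosmos and azure-storage-blob SDKs instead of the Data Migration Tool. The endpoint, keys, container names, the createdAt field, and the pk partition key property are all assumptions.

import json
from azure.cosmos import CosmosClient
from azure.storage.blob import BlobServiceClient

# Placeholder connection settings.
cosmos = CosmosClient("https://example-account.documents.azure.com:443/", credential="<account-key>")
container = cosmos.get_database_client("appdb").get_container_client("events")

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
archive_blob = blob_service.get_blob_client(container="cosmos-archive", blob="events-cold.json")

# 1. Query the cold documents (createdAt is an assumed document property).
cold_docs = list(container.query_items(
    query="SELECT * FROM c WHERE c.createdAt < @cutoff",
    parameters=[{"name": "@cutoff", "value": "2021-01-01T00:00:00Z"}],
    enable_cross_partition_query=True,
))

# 2. Archive first ...
archive_blob.upload_blob(json.dumps(cold_docs), overwrite=True)

# 3. ... and only delete once the export has succeeded, so a failed export never loses data.
for doc in cold_docs:
    container.delete_item(item=doc["id"], partition_key=doc["pk"])  # '/pk' partition key assumed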

Aggregating multiple distributed MySQL databases

I have a project that requires us to maintain several MySQL databases on multiple computers. They will have identical schemas.
Periodically, each of those databases must send their contents to a master server, which will aggregate all of the incoming data. The contents should be dumped to a file that can be carried via flash drive to an internet-enabled computer to send.
Keys will be namespace'd, so there shouldn't be any conflict there, but I'm not totally sure of an elegant way to design this. I'm thinking of timestamping every row and running the query "SELECT * FROM [table] WHERE timestamp > last_backup_time" on each table, then dumping this to a file and bulk-loading it at the master server.
The distributed computers will NOT have internet access. We're in a very rural part of a 3rd-world country.
Any suggestions?
Your
SELECT * FROM [table] WHERE timestamp > last_backup_time
will miss DELETEd rows.
What you probably want to do is use MySQL replication via USB stick. That is, enable the binlog on your source servers and make sure the binlog is not thrown away automatically. Copy the binlog files to USB stick, then PURGE MASTER LOGS TO ... to erase them on the source server.
On the aggregation server, turn the binlog into an executable script using the mysqlbinlog command, then import that data as an SQL script.
The aggregation server must have a copy of each source server's database, but it can keep that copy under a different schema name as long as your SQL uses only unqualified table names (i.e. never uses schema.table syntax to refer to a table). The import of the mysqlbinlog-generated script (with a proper USE command prefixed) will then mirror the source server's changes on the aggregation server.
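A rough sketch (Python, driving the standard mysqlbinlog and mysql client binaries via subprocess) of that replay step on the aggregation server. Paths, schema names, and credential handling are placeholders; newer mysqlbinlog versions also offer a --rewrite-db option as an alternative to prefixing a USE statement.

import subprocess

# Binlog files carried over on the USB stick (paths are placeholders).
binlogs = ["/mnt/usb/binlog.000041", "/mnt/usb/binlog.000042"]

# Turn the binlogs into an SQL script.
script = subprocess.run(
    ["mysqlbinlog", *binlogs],
    capture_output=True, text=True, check=True,
).stdout

# Prefix a USE so unqualified statements land in this source's schema copy.
script = "USE source_site_a;\n" + script

# Replay against the aggregation server; credentials here come from ~/.my.cnf.
subprocess.run(["mysql"], input=script, text=True, check=True)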
Aggregation across all databases can then be done using fully qualified table names (i.e. using schema.table syntax in JOINs or INSERT ... SELECT statements).
