I found an ingestExternalFile() method in the RocksDB API. But when I try to ingest SST files from one RocksDB instance into another, it gives me the exception below:
Exception in thread "main" org.rocksdb.RocksDBException: External file version not found
at org.rocksdb.RocksDB.ingestExternalFile(Native Method)
at org.rocksdb.RocksDB.ingestExternalFile(RocksDB.java:2142)
Can anybody help me?
No, you cannot.
There will be a mismatch in file versions, CF ids and sequence numbers when you try to do that. Usually, the SST files to ingest are created using SstFileWriter. This writer sets the sequence numbers of all the rows in the SST file, and the file's global sequence number, to 0. When such a file is ingested, the DB assigns it a suitable global sequence number.
SST files taken directly from another DB were not written this way, which is why ingesting them from one DB into another causes problems.
Support for ingesting a compacted SST file from one RocksDB instance into another might be added in future versions.
I have a case where I need to ingest CSV files into Cosmos DB.
So I have one dataset to process the CSV and another to prepare the Cosmos DB schema.
In the pipeline, I have a CopyData task mapping from CSV and then writing in Cosmos.
In the CopyData Source parameter, I specify the Azure Blob Storage location where the CSV files are stored.
Until now, there was no problem.
The thing is, I now need a way to ensure that the blobs are ingested in alphabetical order of their file names.
Is there a way?
It's hard to sort by fileNames in ADF.
One way to achieve this:
Save all your file names in a CSV file, then use the Sort activity in Data Flow and overwrite this file. Finally, use Lookup and For Each activities to copy the blobs to Cosmos DB.
Another way:
Pass the childItems from the Get Metadata activity's output to an Azure Function and sort the file names there. Finally, loop over the Function's output with a For Each activity and copy each blob to Cosmos DB.
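The sorting step inside the Azure Function could look roughly like the sketch below. It only shows the sorting logic; the HTTP trigger wrapper is left out. It assumes the Function receives the childItems array as a list of objects with "name" and "type" fields, and that a case-insensitive alphabetical order is what you want. The function name sort_child_items is made up for illustration.

```python
def sort_child_items(child_items):
    """Return the file names from a childItems list in alphabetical order.

    Assumption: each item looks like {"name": "a.csv", "type": "File"}.
    Folders are skipped; sorting ignores case, with the exact name as tie-breaker.
    """
    names = [item["name"] for item in child_items if item.get("type") == "File"]
    return sorted(names, key=lambda name: (name.lower(), name))
```

The returned list can then be handed to the For Each activity, set to run sequentially so the order is preserved.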
Using Airflow, I extract data from a MySQL database, transform it with Python and load it into a Redshift cluster.
Currently I use 3 Airflow tasks: they pass the data to each other by writing CSV files to local disk.
How could I do this without writing to disk?
Should I write one big task in Python? (That would lower visibility.)
Edit: this is a question about Airflow, and best practice for choosing the granularity of tasks and how to pass data between them.
It is not a general question about data migration or ETL. In this question, ETL is only used as an example of a workload for Airflow tasks.
There are different ways you can achieve this:
If you are using the AWS RDS service for MySQL, you can use AWS Data Pipeline to transfer data from MySQL to Redshift. AWS Data Pipeline has a built-in template for this. You can even schedule incremental data transfers from MySQL to Redshift:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html
How large is your table?
If your table is not too large, you can read the whole table into Python as a Pandas DataFrame or as tuples and then transfer it to Redshift.
Even if you have a large table, you can still read it in chunks and push each chunk to Redshift.
Pandas is somewhat inefficient in terms of memory usage if you read the whole table into it.
Creating separate tasks in Airflow will not help much. You can either write a single function and call it in the DAG using PythonOperator, or write a Python script and execute it in the DAG using BashOperator.
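As a rough illustration of the "single function, read in chunks" idea, here is a minimal sketch. To keep it self-contained it uses in-memory sqlite3 connections as stand-ins for both MySQL and Redshift; the table names (users, users_clean), the transform, and the function name are all made up. With real databases you would swap in the appropriate DB-API connections and SQL placeholders.

```python
import sqlite3


def transfer_in_chunks(source, dest, chunk_size=1000):
    """Copy rows from source to dest chunk by chunk, without writing files.

    source and dest are DB-API connections (sqlite3 here as a stand-in).
    Returns the number of rows copied.
    """
    cursor = source.cursor()
    cursor.execute("SELECT id, name FROM users ORDER BY id")
    total = 0
    while True:
        rows = cursor.fetchmany(chunk_size)  # at most chunk_size rows in memory
        if not rows:
            break
        # Placeholder transform step: upper-case the name column.
        transformed = [(row_id, name.upper()) for row_id, name in rows]
        dest.executemany(
            "INSERT INTO users_clean (id, name) VALUES (?, ?)", transformed
        )
        dest.commit()
        total += len(rows)
    return total
```

A function like this could be the python_callable of a single PythonOperator task, at the cost of the per-step visibility mentioned in the question.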
One possibility is to use the GenericTransfer operator from Airflow. See docs
This only works with smallish datasets, and Airflow's MySqlHook uses MySQLdb, which does not support Python 3.
Otherwise, I don't think there are any options other than writing to disk when using Airflow.
How large is your database?
Your approach of writing CSV files to local disk works well for a small database, so if that is your case you can write a Python task for it.
As the database gets larger, there will be more COPY commands and the uploads become more error-prone, because you are dealing with billions of rows of data spread across multiple MySQL tables.
You will also have to figure out exactly which CSV file something went wrong in.
It is also important to determine your throughput and latency requirements and whether you expect frequent schema changes.
In conclusion, you should consider a third-party option like Alooma to extract data from a MySQL database and load it into your Redshift cluster.
I have done a similar task before, but my system was in GCP.
What I did there was write the queried data out into Avro files, which can be ingested into BigQuery easily and very efficiently.
So there is one task in the DAG that queries the data and writes it to an Avro file in Cloud Storage (the S3 equivalent), and one task after that which calls the BigQuery operator to ingest the Avro file.
You can probably do something similar with a CSV file in an S3 bucket, followed by a Redshift COPY command from that CSV file. I believe a Redshift COPY from a file in S3 is the fastest way to ingest data into Redshift.
These tasks are implemented as PythonOperators in Airflow.
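For the Redshift variant suggested above, the load task mostly comes down to issuing a COPY statement that points at the CSV file in S3. A minimal sketch of building that statement is shown below; the helper name and all argument values (table, bucket, key, IAM role) are hypothetical, and the CSV is assumed to have one header row. The arguments are interpolated directly into the SQL, so they should only come from trusted configuration.

```python
def build_copy_statement(table, bucket, key, iam_role):
    """Build a Redshift COPY statement for a CSV file stored in S3.

    All arguments are placeholders supplied by the caller; the file is
    assumed to be CSV with a single header line to skip.
    """
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )
```

The resulting string would be executed against the cluster from within the PythonOperator task, using whatever database connection you already have.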
You can pass information between tasks using XCom. You can read up on it in the documentation and there is also an example in the set of sample DAGs installed with Airflow by default.
I am streaming an Avro-encoded file over the network from an S3-compliant object store and trying to read it into a data structure.
Issue: Occasionally (once or twice every day or two on a test node running continuously), halfway through the file, it hits an "Invalid sync!" exception in the nextRawBlock() method of the DataFileStream class.
I would like to find the root cause of this and fix it. I have been trying to reproduce it in a test app, but without success. I am looking for ideas on:
what might potentially cause this
any better ways of reproducing it
More details
a) The Avro file is not downloaded to disk. I get a handle to the file stream using S3ObjectInputStream, feed it to the DataFileStream constructor, and then read from the stream directly.
b) The app reads records from the Avro-encoded file in batches of 500 records at a time.
c) The file contains a header section with a Long count and a KV Map of String to Integer. After that it contains an array of records, where each record contains a String and a long array. The schema uses Avro's union construct to enable this.
d) Number of records in the file on average is around 5M
e) This entire download happens in separate thread and not in any user request.
f) The file is uploaded to the store by a separate process.
Other observation:
a) Upon failure, the app closes the stream and tries to download and read the stream again. What I observe is that this pushes the node into a high old-gen memory state, slowing down user requests.
Does anybody know where OpenStack Swift stores the "Rings"? Is there a distributed algorithm, or is it just one table somewhere on some of the Storage Nodes with information about all (!) the physical object locations? I cannot believe the latter, because from my understanding of Object Storage it should scale to exabytes, and that would need lots of entries in such a table.
This page could not help me: http://docs.openstack.org/developer/swift/overview_ring.html
Thanks in advance for your help!
Ring Builder
The rings are built and managed manually by a utility called the ring-builder. The ring-builder assigns partitions to devices and writes an optimized Python structure to a gzipped, serialized file on disk for shipping out to the servers. The server processes just check the modification time of the file occasionally and reload their in-memory copies of the ring structure as needed.
So the ring is stored on all servers.
If you were asking about the path of the .ring.gz files, it is under /etc/swift by default.
These ring files can also be regenerated from the .builder files when a rebalance is run with swift-ring-builder.
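To see why the ring stays small no matter how many objects are stored, here is a simplified sketch of the idea. This is not Swift's actual code: the real ring also mixes a cluster-specific hash prefix/suffix into the path and balances by zones and weights, while the toy build_ring below just assigns devices round-robin. The point it illustrates is that the table maps a fixed number of partitions to devices, and an object's partition is computed from a hash of its path, so there are no per-object entries.

```python
import hashlib
import struct

PART_POWER = 4                 # 2**4 = 16 partitions (tiny, for illustration)
PART_SHIFT = 32 - PART_POWER   # keep only the top PART_POWER bits of the hash


def get_partition(account, container, obj):
    """Map an object path to a partition number using the top bits of its MD5."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    digest = hashlib.md5(path).digest()
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT


def build_ring(devices, replicas=3):
    """Toy partition-to-device table: one row per replica, one column per partition."""
    parts = 2 ** PART_POWER
    return [
        [devices[(part + replica) % len(devices)] for part in range(parts)]
        for replica in range(replicas)
    ]


def get_nodes(ring, account, container, obj):
    """Return the partition and the devices holding its replicas."""
    part = get_partition(account, container, obj)
    return part, [replica_row[part] for replica_row in ring]
```

The table size depends only on the partition count and replica count, which is why every server can keep a full copy in memory.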
I'm working with IronPython 2.6 for .NET 4 and the sqlite3 module from IronPython.SQLite.
I have written a GUI program that runs in four frames of an MDI window. Each of the four programs receives data from a serial port and stores it in a SQLite database, one database per program.
In between inserting the received data into the database, the program queries the database every 100 ms for the latest data items.
I'm already using a mutex around the cursor.execute() call to prevent problems with simultaneous commands (insert or select).
During runtime the program (sporadically) runs into an exception.
When trying to query data:
System.Exception: database disk image is malformed
or when trying to insert data:
System.Exception: database or disk is full
Is it possible that a database query shortly after a database insert (or the other way around) could cause such exceptions and corrupt the database?
I would be very grateful for any advice on how to solve this issue.