Comparing bz2 files in unix

I manage a number of databases on unix servers, and do daily backups of these databases using mysqldump. Since some of these databases are very large (20+ GB), I usually compress the backup .sql files using bzip2 to get .bz2 files.
As part of the backup process, I check that the size of the new backup file is greater than or equal to the size of the previous backup file - we are adding data to these databases on a daily basis, but very rarely remove data from these databases.
The check on the backup file size is a check on the quality of the backup - given that our databases primarily only grow in size, if the new backup is smaller than the old backup, it means either a) something has been removed from the database (in which case, I should check out what...) or b) something went wrong with the backup (in which case, I should check out why...).
However, if I compare the sizes of the bz2 files - for example, using test on the output of stat -c %s - the bz2 files may have shrunk even though the database has grown, presumably because of more efficient compression.
So - how can I compare the size of the backup files?
One option is to decompress the previous backup file from .bz2 to .sql and compare the sizes of the .sql files. However, given that these are quite large files (20+ GB), the decompression can take a while...
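If the decompression route is acceptable CPU-wise, the intermediate .sql never has to hit disk - piping the decompressed stream through wc -c yields the uncompressed size directly. A minimal sketch (the tiny sample files stand in for the real previous/latest dumps):

```shell
# Tiny stand-ins for yesterday's and today's mysqldump output;
# in real life previous.sql.bz2 / latest.sql.bz2 already exist.
cd "$(mktemp -d)"
printf 'CREATE TABLE t1 (id INT);\n' > previous.sql
printf 'CREATE TABLE t1 (id INT);\nINSERT INTO t1 VALUES (1);\n' > latest.sql
bzip2 previous.sql latest.sql

# Compare the *uncompressed* sizes without writing .sql files back to disk.
old_size=$(bzcat previous.sql.bz2 | wc -c)
new_size=$(bzcat latest.sql.bz2 | wc -c)
if [ "$new_size" -ge "$old_size" ]; then
    echo "backup size OK"
else
    echo "backup shrank - investigate" >&2
fi
```

This still pays the decompression CPU cost, but avoids the disk-space and cleanup concerns of keeping .sql copies around.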
Another option is to keep the previous backup file as .sql, and again do the comparison of the .sql files. This is my preferred option, but requires some care to make sure we don't end up with lots of .sql files lying around - because that would eat up our hard drive pretty quickly.
Alternatively, someone in the SO community might have a better or brighter idea...?

It's possible to split the input files into parts (100MB chunks, for example) and compare them separately. Since the size might stay the same even when the content differs, you generally shouldn't rely on size alone to look for differences - instead use something like cmp to see whether the files differ.
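A sketch of the chunk-and-compare idea (file names and the tiny chunk size are illustrative; in production the 20+ GB dump would be split into 100M parts):

```shell
# Make a small stand-in "dump" and split it into fixed-size parts
# (named part_aa, part_ab, ...; use -b 100M on the real file).
cd "$(mktemp -d)"
seq 1 1000 > backup.sql
split -b 1K backup.sql part_

# Keep a copy of the parts from the previous run.
mkdir -p previous
cp part_* previous/

# A changed dump shows up as one or more differing parts;
# cmp -s is silent and only sets its exit status.
for p in part_*; do
    cmp -s "$p" "previous/$p" || echo "$p differs"
done
```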
It's also possible to just cat the bz2 files of the individual parts together and get a perfectly valid multi-stream bz2 file, which can be decompressed as a whole without any problems. You might want to look into pbzip2, a parallel implementation of bzip2 that uses exactly this mechanism to produce a multi-stream bz2 file and speed up compression on smp/multi-core systems.
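The multi-stream property is easy to demonstrate with tiny stand-in parts (pbzip2 does the same thing with the chunks it compresses in parallel):

```shell
cd "$(mktemp -d)"
printf 'first part\n'  > part1
printf 'second part\n' > part2

# Compress each part independently...
bzip2 -c part1 > p1.bz2
bzip2 -c part2 > p2.bz2

# ...then simply concatenate the compressed streams:
cat p1.bz2 p2.bz2 > whole.bz2

# The result is a valid multi-stream .bz2 that decompresses as a whole.
bunzip2 -c whole.bz2
```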
As to why I would suggest splitting the files into parts: Depending on your mysql setup, it might be possible that some of your parts never change, and data might actually mostly get appended at the end - if you can make sure of this, you would only have to compare small parts of the whole dump, which would speed up the process.
Still, be aware that all of the data could change without anything being added or removed, as mysql may reorder data (the OPTIMIZE TABLE command, for example, can result in this).
Another way of splitting the data is possible if you use InnoDB: tell mysql (via my.cnf) to use one file per table. You could then a) bzip those files individually and only compare the tables that might actually have changed (useful if some tables hold static data), and/or b) record the last-modified date of each table file and compare that first (again, only really useful for tables with static data).
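For reference, the file-per-table setting is a one-line my.cnf change (note that tables created before the change only move into their own files once rebuilt):

```
[mysqld]
innodb_file_per_table = 1
```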

Related

Read in an RDS file from memory using aws.s3

I am using the cloudyr aws.s3 package to pull down an RDS file from S3. I have my own custom R runtime in a Lambda. aws.s3 has a handy method called s3readRDS("s3://pathtoyourfile"). This works well, but has a limitation: it saves the RDS file to disk, then must read it back in using readRDS. This is fine for smaller files, but for larger files it's a no-go, as we have limited disk storage.
Right now, I'm kind of stuck with these largish data files and the ability to pull them out into a database is just not feasible at the moment due to the target group, so I'm trying to minimize our cost and maximize throughput and this is the last nail in that coffin.
According to the documentation:
"Some users may find the raw vector response format of \code{get_object} unfamiliar. The object will also carry attributes, including \dQuote{content-type}, which may be useful for deciding how to subsequently process the vector. Two common strategies are as follows. For text content types, running \code{\link[base]{charToRaw}} may be the most useful first step to make the response human-readable. Alternatively, converting the raw vector into a connection using \code{\link[base]{rawConnection}} may also be useful, as that can often then be passed to parsing functions just like a file connection would be. "
Based on this (and an example using load()), the code below looks like it should work, but it does not.
foo <- readRDS(rawConnection(get_object("s3://whatever/foo.rds")))
Error in readRDS(rawConnection(get_object("s3://whatever/foo.rds", :
unknown input format
I can't seem to get the data stream into a form that readRDS or unserialize can make sense of. I know the file is correct, as the save-to-disk/load-from-disk approach works fine. But I want to know how to make "foo" into an unserialized object without the save/load.

Is copying 100 GB, with continuous changes to the file, between datacenters with rsync a good idea?

I have a datacenter A which has a 100GB file changing every millisecond. I need to copy the file to datacenter B so that, in case of failure in datacenter A, I can use the copy in B. As the file is changing every millisecond, can rsync handle it with the other datacenter 250 miles away? Is there any possibility of getting a corrupted file? And since the file is continuously updating, when can we call the copy in datacenter B finished?
rsync is a relatively straightforward file copying tool with some very advanced features. This would work great for files and directory structures where change is less frequent.
If a single file with 100GB of data is changing every millisecond, that would be a potential data change rate of 100TB per second. In reality I would expect the change rate to be much smaller.
Although it is possible to resume data transfer and potentially partially reuse existing data, rsync is not made for continuous replication at that interval. rsync works at the file level; it is not a block-level replication tool. However, there is an --inplace option, which may be able to provide the kind of file synchronization you are looking for: https://superuser.com/questions/576035/does-rsync-inplace-write-to-the-entire-file-or-just-to-the-parts-that-need-to
When it comes to distance, the 250 miles may result in at least 2ms of additional latency, if accounting for the speed of light, which is not all that much. In reality this would be more due to cabling, routers and switches.
rsync by itself is probably not the right solution. This question seems to be more about physics, link speed and business requirements than anything else. It would be good to know the exact change rate, and to know if you're allowed to have gaps in your restore points. This level of reliability may require a more sophisticated solution like log shipping, storage snapshots, storage replication or some form of distributed storage on the back end.
No, rsync is probably not the right way to keep the data in sync based on your description.
100GB of data is of no use to anybody without the means to maintain it and extract information. That implies structured elements such as records and indexes. rsync knows nothing about this structure, and therefore cannot ensure that writes to the file will transition it from one valid state to another. It certainly cannot guarantee any sort of consistency if the file is concurrently updated at either end and copied via rsync.
Rsync might be the right solution, but it is impossible to tell from what you have said here.
If you are talking about provisioning real time replication of a database for failover purposes, then the best method is to use transaction replication at the DBMS tier. Failing that, consider something like drbd for block replication but bear in mind you will have to apply database crash recovery on the replicated copy before it will be usable at the remote end.

How to calculate duration for a BerkeleyDB dump/load operation for a given BDB file?

I'm using a 3rd party application that uses BerkeleyDB for its local datastore (called BMC Discovery). Over time, its BDB files fragment and become ridiculously large, and BMC Software scripted a compact utility that basically uses db_dump piped into db_load with a new file name, and then replaces the original file with the rebuilt file.
The time it takes for large files is insanely long - it can take hours, while other files of the same size take half that time. It seems to really depend on the level of fragmentation in the file and/or the type of data in it (I assume?).
The utility provided uses a crude method to guesstimate the duration based on the total size of the datastore (which is composed of multiple BDB files) - e.g. if it's larger than 1G it says "will take a few hours", and if larger than 100G, "will take many hours". This doesn't help at all.
I'm wondering if there would be a better, more accurate way, using the commands provided with BerkeleyDB.6.0 (on Red Hat), to estimate the duration of a db_dump/db_load operation for a specific BDB file ?
Note: Even though this question mentions a specific 3rd party application, it's just to put you in context. The question is generic to BerkeleyDB.
db_dump/db_load are the usual (portable) way to defragment.
Newer BDB (like last 4-5 years, certainly db-6.x) has a db_hotbackup(8) command that might be faster by avoiding hex conversions.
(solutions below would require custom coding)
There is also a DB->compact(3) call that "optionally returns unused Btree, Hash or Recno database pages to the underlying filesystem.". This will likely lead to a sparse file which will appear ridiculously large (with "ls -l") but actually only uses the blocks necessary to store the data.
Finally, there is db_upgrade(8) / db_verify(8), both of which might be customized with DB->set_feedback(3) to do a callback (i.e. a progress bar) for long operations.
Before anything, I would check configuration using db_tuner(8) and db_stat(8), and think a bit about tuning parameters in DB_CONFIG.
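(As an aside, those tuning parameters live in a DB_CONFIG file in the database environment directory; an illustrative fragment - the cache-size value below is a placeholder, not a recommendation:)

```
set_cachesize 0 536870912 1
```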

Best way to store 100,000+ CSV text files on server?

We have an application which will need to store thousands of fairly small CSV files. 100,000+ and growing annually by the same amount. Each file contains around 20-80KB of vehicle tracking data. Each data set (or file) represents a single vehicle journey.
We are currently storing this information in SQL Server, but the size of the database is getting a little unwieldy and we only ever need to access the journey data one file at time (so the need to query it in bulk or otherwise store in a relational database is not needed). The performance of the database is degrading as we add more tracks, due to the time taken to rebuild or update indexes when inserting or deleting data.
There are 3 options we are considering:
We could use the FILESTREAM feature of SQL to externalise the data into files, but I've not used this feature before. Would Filestream still result in one physical file per database object (blob)?
Alternatively, we could store the files individually on disk. There could end up being half a million of them after 3+ years. Will the NTFS file system cope OK with that amount?
If lots of files is a problem, should we consider grouping the datasets/files into small databases (one per user), so that each user has their own? Is there a very lightweight database like SQLite that can store files?
One further point: the data is highly compressible. Zipping the files reduces them to only 10% of their original size. I would like to utilise compression if possible to minimise disk space used and backup size.
I have a few thoughts, and this is very subjective, so your mileage and other readers' mileage may vary, but hopefully it will still get the ball rolling for you even if other folks want to put differing points of view...
Firstly, I have seen performance issues with folders containing too many files. One project got around this by creating 256 directories, called 00, 01, 02... fd, fe, ff and inside each one of those a further 256 directories with the same naming convention. That potentially divides your 500,000 files across 65,536 directories giving you only a few in each - if you use a good hash/random generator to spread them out. Also, the filenames are pretty short to store in your database - e.g. 32/af/file-xyz.csv. Doubtless someone will bite my head off, but I feel 10,000 files in one directory is plenty to be going on with.
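A sketch of the two-level fan-out described above (the file name is a placeholder; md5sum is just a convenient, evenly distributed hash):

```shell
cd "$(mktemp -d)"
name="file-xyz.csv"

# First four hex chars of a hash of the name pick the two directory levels.
h=$(printf '%s' "$name" | md5sum | cut -c1-4)
d1=$(printf '%s' "$h" | cut -c1-2)
d2=$(printf '%s' "$h" | cut -c3-4)

mkdir -p "$d1/$d2"
path="$d1/$d2/$name"
echo "$path"    # short enough to store in the database, e.g. 32/af/file-xyz.csv
```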
Secondly, 100,000 files of 80kB amounts to 8GB of data, which is really not very big these days - a small USB flash drive in fact - so I think any arguments about compression are not that valid - storage is cheap. What could be important, though, is backup. If you have 500,000 files you have lots of 'inodes' to traverse, and I think the statistic used to be that many backup products can only traverse 50-100 'inodes' per second - so you are going to be waiting a very long time. Depending on the downtime you can tolerate, it may be better to take the system offline and back up from the raw block device - at say 100MB/s you can back up 8GB in 80 seconds, and I can't imagine a traditional, file-based backup can get close to that. Alternatives may be a filesystem that permits snapshots, so you can back up from a snapshot. Or a mirrored filesystem which permits you to split the mirror, back up from one copy and then rejoin the mirror.
As I said, pretty subjective and I am sure others will have other ideas.
I work on an application that uses a hybrid approach, primarily because we wanted our application to be able to work (in small installations) in freebie versions of SQL Server...and the file load would have thrown us over the top quickly. We have gobs of files - tens of millions in large installations.
We considered the same scenarios you've enumerated, but what we eventually decided to do was to have a series of moderately large (2gb) memory mapped files that contain the would-be files as opaque blobs. Then, in the database, the blobs are keyed by blob-id (a sha1 hash of the uncompressed blob), and have fields for the container-file-id, offset, length, and uncompressed-length. There's also a "published" flag in the blob-referencing table. Because the hash faithfully represents the content, a blob is only ever written once. Modified files produce new hashes, and they're written to new locations in the blob store.
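The write-once, content-addressed behaviour can be sketched in a few lines of shell (the real system is C# over memory-mapped container files; blob_put and the store/ directory are hypothetical names for this sketch):

```shell
cd "$(mktemp -d)"

# Store a file under the hash of its content; identical bytes are written once.
blob_put() {
    id=$(sha1sum < "$1" | cut -d' ' -f1)        # key = SHA-1 of the content
    mkdir -p store
    [ -e "store/$id" ] || cp "$1" "store/$id"   # skip if already present
    echo "$id"
}

printf 'same journey data' > a.csv
printf 'same journey data' > b.csv
id_a=$(blob_put a.csv)
id_b=$(blob_put b.csv)    # same content -> same id, no second copy written
```

A modified file hashes to a new id and lands in a new slot, which is exactly the "modified files produce new hashes" behaviour described above.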
In our case, the blobs weren't consistently text files - in fact, they're chunks of files of all types. Big files are broken up with a rolling-hash function into roughly 64k chunks. We attempt to compress each blob with lz4 compression (which is way fast compression - and aborts quickly on effectively-incompressible data).
This approach works really well, but isn't lightly recommended. It can get complicated. For example, grooming the container files in the face of deleted content. For this, we chose to use sparse files and just tell NTFS the extents of deleted blobs. Transactional demands are more complicated.
All of the goop for db-to-blob-store is c# with a little interop for the memory-mapped files. Your scenario sounds similar, but somewhat less demanding. I suspect you could get away without the memory-mapped I/O complications.

Is there any size/row limit in .txt file?

The question may look like a duplicate, but I am not getting the answer I am looking for.
The problem is, in unix, one of our 4GL binaries is fetching data from a table using a cursor and writing the data to a .txt file.
The table contains around 50 million records.
The binary takes a lot of time and does not complete; the .txt file also stays at 0 bytes.
I want to know the possible reasons why the records are not being written to the .txt file.
Note: There is enough disk space available.
Also, for 30 million records, I can get the data in the .txt file as expected.
The information you provide is insufficient to tell for sure why the file is not written.
In UNIX, a text file is just like any other file - a collection of bytes. No specific limit (or structure) is enforced on "row size" or "row count", although obviously some programs may have limits on the maximum supported line size and such (depending on their implementation).
When a program starts writing data to a file (i.e. once the internal buffer is flushed for the first time), the file will no longer be zero-sized, so clearly your binary is doing something else all that time (unless it wipes out the file as part of its cleanup).
Try running your executable via strace to see the file I/O activity - that would give some clues as to what is going on.
Try closing the writer if you are using one to write to the file. It achieves the dual purpose of closing the resource along with flushing the remaining contents of the buffer.
CPU calculated output needs to be flushed if you are using any mechanism of buffered writer. I have encountered such situations a few times and in almost all cases, the issue was that of flushing the output.
In Java specifically, the usual best practice for writing data involves buffers: when the buffer fills up, its contents get written to the file, but data that hasn't yet filled the buffer does not - and it is lost if the program closes without flushing the buffered writer.
So, in your case, if the processing time it takes is reasonable and the output is still not in the file, it may mean the output has been calculated and is sitting in RAM, but was never written to the file (i.e. the disk) because it was not flushed.
You can also consider the answers to this question.
