Is it necessary to keep data.fs.old after packing? - plone

My Data.fs was 500 MB, so I packed it and then backed it up, resulting in 100 MB.
My hosting account is only 500 MB, so I am wondering whether it is safe to delete Data.fs.old (500 MB)?

You don't need to keep it, and you can let the pack process remove it for you. To remove it automatically, specify the pack-keep-old option in the zeoserver section:
[zeo]
recipe = plone.recipe.zeoserver
...
pack-keep-old = false
pack-days = 2
The pack process will still create a new file during packing, so you do need to have enough free disk space to hold both your Data.fs and a copy of it.
Via the pack-days option you can specify how many days of history you want to preserve. If you trust your system backup, you can set this to a low value, which will save some more disk space.

No, you can safely delete it; it's a backup of the database from before the pack.
If the pack went haywire for some reason, you could use it to restore the database to the state it was in before packing, by simply moving it back to Data.fs.
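For example, a rough sketch of that rollback in Python (paths are illustrative - in a typical buildout the filestorage lives under var/filestorage; stop the ZEO server / Zope instance first):

import shutil

# Illustrative paths; adjust to your buildout layout.
filestorage = "var/filestorage"

# Keep the possibly-broken packed file around, then put the pre-pack
# backup back in place as Data.fs.
shutil.move(filestorage + "/Data.fs", filestorage + "/Data.fs.broken")
shutil.move(filestorage + "/Data.fs.old", filestorage + "/Data.fs")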

Related

Is copying a 100 GB file that changes continuously between datacenters with rsync a good idea?

I have a datacenter A with a 100 GB file that changes every millisecond. I need to copy the file to datacenter B, so that in case of a failure in datacenter A I can use the copy in B. Since the file is changing every millisecond, can rsync handle this when the other datacenter is 250 miles away? Is there any possibility of ending up with a corrupted file? And since the file is continuously updated, at what point can we call the copy in datacenter B a finished file?
rsync is a relatively straightforward file copying tool with some very advanced features. This would work great for files and directory structures where change is less frequent.
If a single file with 100GB of data is changing every millisecond, that would be a potential data change rate of 100TB per second. In reality I would expect the change rate to be much smaller.
Although it is possible to resume a transfer and partially reuse existing data, rsync is not made for continuous replication at that interval. rsync works at the file level; it is not a block-level replication tool. It does, however, have an --inplace option, which may be able to provide the kind of file synchronization you are looking for: https://superuser.com/questions/576035/does-rsync-inplace-write-to-the-entire-file-or-just-to-the-parts-that-need-to
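For illustration only (this is not an endorsement of rsync for this use case), such a transfer might be wrapped in a scheduled job along these lines; the host name and paths are made up:

import subprocess

# -a preserves attributes; --inplace updates the destination file in place
# instead of building a temporary copy next to it.
subprocess.run(
    ["rsync", "-a", "--inplace",
     "/data/bigfile.bin",
     "dc-b.example.com:/data/bigfile.bin"],
    check=True)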
When it comes to distance, the 250 miles may result in at least 2ms of additional latency just from the speed of light, which is not all that much. In reality it would be more, due to cabling routes, routers and switches.
rsync by itself is probably not the right solution. This question seems to be more about physics, link speed and business requirements than anything else. It would be good to know the exact change rate, and to know if you're allowed to have gaps in your restore points. This level of reliability may require a more sophisticated solution like log shipping, storage snapshots, storage replication or some form of distributed storage on the back end.
No, rsync is probably not the right way to keep the data in sync based on your description.
100 GB of data is of no use to anybody without the means to maintain it and extract information, which implies structured elements such as records and indexes. rsync knows nothing about this structure and therefore cannot ensure that writes to the file transition it from one valid state to another. It certainly cannot guarantee any sort of consistency if the file is concurrently updated at either end while being copied via rsync.
Rsync might be the right solution, but it is impossible to tell from what you have said here.
If you are talking about provisioning real-time replication of a database for failover purposes, then the best method is to use transaction replication at the DBMS tier. Failing that, consider something like DRBD for block-level replication, but bear in mind you will have to apply database crash recovery to the replicated copy before it will be usable at the remote end.

What happens when rocksdb can't flush a memtable to a sst file?

Our service (which uses rocksdb) was out of disk space for approximately 30 minutes.
We manually deleted some files which freed-up 650MiB.
However, even with those free 650MiB, rocksdb kept complaining:
IO error: Failed to pwrite for:
/path/to/database/454250.sst: There is not enough space on the disk.
Is it possible that the memtable got so big that it needed more than 650MB of disk space?
Looking at other sst files in the database folder, they don't take up more than ~40MiB.
If not, what other reasons could there be for these error messages?
There are two cases where this can happen:
1) RocksDB persists in-memory data via WAL files, and a WAL file is removed only once the corresponding memtables have been flushed. When you have multiple column families, where some of them have higher insert rates (their memtables fill faster) and others have lower rates, the .log files (RocksDB WAL files) cannot be removed. This is because a WAL file contains transactions from all column families, and it cannot be removed until all of the column families have been persisted via a flush.
This can lead to stagnant .log files, resulting in disk space issues.
2) Assume the memtable size is configured to be 1 GB and the number of memtables to merge is 3: you actually wait for 3 memtables to fill and then the flush gets triggered. Even if you had configured the target file size to 50 MB (since you mentioned your SSTs are around 40 MB), that one flush would generate roughly 60 SSTs of about 50 MB each, totalling around 3 GB.
But the free space you have is around 650 MB, which might be the problem.
There are various options that influence the flush behaviour in RocksDB. You can take a look at:
write_buffer_size - the size of each memtable.
min_write_buffer_number_to_merge - the number of immutable memtables to merge during a flush; in other words, when the immutable memtable count reaches this value, a flush to disk is triggered.
target_file_size_base - the size of the SSTs resulting from a compaction or flush.
target_file_size_multiplier - determines the size of the SSTs at each level.
You can also take a look at SST compression techniques. Let me know if it helps.
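For reference, a minimal sketch of how these options might be set, assuming the python-rocksdb bindings (the option names mirror the C++ ones; the values are illustrative, not recommendations):

import rocksdb

opts = rocksdb.Options(create_if_missing=True)
# Size of each memtable (write buffer), in bytes.
opts.write_buffer_size = 64 * 1024 * 1024
# Flush only once this many immutable memtables have accumulated.
opts.min_write_buffer_number_to_merge = 2
# Upper bound on the number of memtables held in memory at once.
opts.max_write_buffer_number = 4
# Target size of the SST files produced by flushes/compactions...
opts.target_file_size_base = 50 * 1024 * 1024
# ...and the growth factor for SST sizes at deeper levels.
opts.target_file_size_multiplier = 1

db = rocksdb.DB("/path/to/database", opts)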

Best way to store 100,000+ CSV text files on server?

We have an application which will need to store thousands of fairly small CSV files. 100,000+ and growing annually by the same amount. Each file contains around 20-80KB of vehicle tracking data. Each data set (or file) represents a single vehicle journey.
We are currently storing this information in SQL Server, but the size of the database is getting a little unwieldy, and we only ever need to access the journey data one file at a time (so there is no need to query it in bulk or otherwise store it in a relational database). The performance of the database is degrading as we add more tracks, due to the time taken to rebuild or update indexes when inserting or deleting data.
There are 3 options we are considering:
We could use the FILESTREAM feature of SQL to externalise the data into files, but I've not used this feature before. Would Filestream still result in one physical file per database object (blob)?
Alternatively, we could store the files individually on disk. There could end up being half a million of them after 3+ years. Will the NTFS file system cope OK with that amount?
If lots of files is a problem, should we consider grouping the datasets/files into a small database (one per user), so that each user has their own? Is there a very lightweight database like SQLite that can store files?
One further point: the data is highly compressible. Zipping the files reduces them to only 10% of their original size. I would like to utilise compression if possible to minimise disk space used and backup size.
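For what it's worth, here is a minimal sketch of the SQLite idea combined with compression, using only the Python standard library (the table and column names are made up); each user would get one small .db file and each journey is stored as a zlib-compressed blob:

import sqlite3
import zlib

def store_journey(db_path, journey_id, csv_text):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS journeys"
                " (id TEXT PRIMARY KEY, csv_gz BLOB)")
    # Compress before storing; the tracking CSVs reportedly shrink to ~10%.
    con.execute("INSERT OR REPLACE INTO journeys VALUES (?, ?)",
                (journey_id, zlib.compress(csv_text.encode("utf-8"))))
    con.commit()
    con.close()

def load_journey(db_path, journey_id):
    con = sqlite3.connect(db_path)
    row = con.execute("SELECT csv_gz FROM journeys WHERE id = ?",
                      (journey_id,)).fetchone()
    con.close()
    return zlib.decompress(row[0]).decode("utf-8")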
I have a few thoughts, and this is very subjective, so your mileage and other readers' mileage may vary, but hopefully it will still get the ball rolling for you even if other folks want to put differing points of view...
Firstly, I have seen performance issues with folders containing too many files. One project got around this by creating 256 directories, called 00, 01, 02... fd, fe, ff and inside each one of those a further 256 directories with the same naming convention. That potentially divides your 500,000 files across 65,536 directories giving you only a few in each - if you use a good hash/random generator to spread them out. Also, the filenames are pretty short to store in your database - e.g. 32/af/file-xyz.csv. Doubtless someone will bite my head off, but I feel 10,000 files in one directory is plenty to be going on with.
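A sketch of that layout in Python (the hashing scheme and root directory are just an example):

import hashlib
import os

def path_for(filename, root="tracks"):
    # Spread files across 256 x 256 directories (00/00 ... ff/ff) based on a
    # hash of the filename, so that no single directory grows too large.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    directory = os.path.join(root, digest[0:2], digest[2:4])
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

# e.g. path_for("file-xyz.csv") might give "tracks/32/af/file-xyz.csv",
# and "32/af/file-xyz.csv" is the short path you store in the database.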
Secondly, 100,000 files of 80kB amounts to 8GB of data which is really not very big these days - a small USB flash drive in fact - so I think any arguments about compression are not that valid - storage is cheap. What could be important though, is backup. If you have 500,000 files you have lots of 'inodes' to traverse and I think the statistic used to be that many backup products can only traverse 50-100 'inodes' per second - so you are going to be waiting a very long time. Depending on the downtime you can tolerate, it may be better to take the system offline and back up from the raw, block device - at say 100MB/s you can back up 8GB in 80 seconds and I can't imagine a traditional, file-based backup can get close to that. Alternatives may be a filesystem that permits snapshots, so you can back up from a snapshot. Or a mirrored filesystem which permits you to split the mirror, back up from one copy and then rejoin the mirror.
As I said, pretty subjective and I am sure others will have other ideas.
I work on an application that uses a hybrid approach, primarily because we wanted our application to be able to work (in small installations) in freebie versions of SQL Server...and the file load would have thrown us over the top quickly. We have gobs of files - tens of millions in large installations.
We considered the same scenarios you've enumerated, but what we eventually decided to do was to have a series of moderately large (2gb) memory mapped files that contain the would-be files as opaque blobs. Then, in the database, the blobs are keyed by blob-id (a sha1 hash of the uncompressed blob), and have fields for the container-file-id, offset, length, and uncompressed-length. There's also a "published" flag in the blob-referencing table. Because the hash faithfully represents the content, a blob is only ever written once. Modified files produce new hashes, and they're written to new locations in the blob store.
In our case, the blobs weren't consistently text files - in fact, they're chunks of files of all types. Big files are broken up with a rolling-hash function into roughly 64k chunks. We attempt to compress each blob with lz4 compression (which is way fast compression - and aborts quickly on effectively-incompressible data).
This approach works really well, but isn't lightly recommended. It can get complicated. For example, grooming the container files in the face of deleted content. For this, we chose to use sparse files and just tell NTFS the extents of deleted blobs. Transactional demands are more complicated.
All of the goop for db-to-blob-store is C# with a little interop for the memory-mapped files. Your scenario sounds similar, but somewhat less demanding. I suspect you could get away without the memory-mapped I/O complications.
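A heavily simplified sketch of that idea (in Python rather than C#; no memory mapping, chunking, compression or sparse-file grooming, and all names are made up) - content-addressed blobs appended to a container file, with the location details you would then record in the database:

import hashlib
import os

def append_blob(container_path, data):
    # Key the blob by the SHA-1 of its content; the caller can look the id up
    # in the database first and skip the write if the blob already exists.
    blob_id = hashlib.sha1(data).hexdigest()
    with open(container_path, "ab") as container:
        container.seek(0, os.SEEK_END)
        offset = container.tell()
        container.write(data)
    # These fields go into the blob-referencing table.
    return {"blob_id": blob_id,
            "container": os.path.basename(container_path),
            "offset": offset,
            "length": len(data)}

def read_blob(container_path, offset, length):
    with open(container_path, "rb") as container:
        container.seek(offset)
        return container.read(length)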

Comparing bz2 files in unix

I manage a number of databases on unix servers, and do daily backups of these databases using mysqldump. Since (some of) these databases are very large (20+Gb), I usually zip the backup .sql files using bzip2, to get compressed bz2 files.
As part of the backup process, I check that the size of the new backup file is greater than or equal to the size of the previous backup file - we are adding data to these databases on a daily basis, but very rarely remove data from these databases.
The check on the backup file size is a check on the quality of the backup - given that our databases primarily only grow in size, if the new backup is smaller than the old backup, it means either a) something has been removed from the database (in which case, I should check out what...) or b) something went wrong with the backup (in which case, I should check out why...).
However, if I compare the sizes of the bz2 files - for example, comparing (using test) the output of stat %s - even though the database has increased in size, the bz2 files may have shrunk, presumably because of more efficient compression.
So - how can I compare the size of the backup files?
One option is to decompress the previous backup file from .bz2 to .sql, and compare the sizes of these .sql files. However, given that these are quite large files (20+Gb), the compression/decompression can take a while...
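One way to soften this would be to stream-decompress and just count the bytes, rather than writing the .sql back to disk - something like this sketch (it still costs CPU time, but no scratch space):

import bz2

def uncompressed_size(path, chunk_size=16 * 1024 * 1024):
    # Stream-decompress the .sql.bz2 and count the bytes without storing them.
    total = 0
    with bz2.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

# e.g. fail the check if uncompressed_size("backup-new.sql.bz2") is smaller
# than uncompressed_size("backup-old.sql.bz2")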
Another option is to keep the previous backup file as .sql, and again do the comparison of the .sql files. This is my preferred option, but requires some care to make sure we don't end up with lots of .sql files lying around - because that would eat up our hard drive pretty quickly.
Alternatively, someone in the SO community might have a better or brighter idea...?
It's possible to split the input files into parts (100MB chunks for example) and compare them separately. As size might actually also stay the same even with different input, you should generally not use it for looking for differences - instead use something like cmp to see if the files differ.
It's also possible to just cat the bz2 files of the individual parts together and get a perfectly valid multi-stream bz2 file, which can be uncompressed again in whole without any problems. You might want to look into pbzip2, which is a parallel implementation of bzip2 and uses exactly this mechanic - compressing the parts in parallel into a multi-stream bz2 file - to speed up the process on SMP/multi-core systems.
As to why I would suggest splitting the files into parts: Depending on your mysql setup, it might be possible that some of your parts never change, and data might actually mostly get appended at the end - if you can make sure of this, you would only have to compare small parts of the whole dump, which would speed up the process.
Still, be aware that the whole data set could change without anything being added or removed, as mysql might reorder the data (the OPTIMIZE command, for example, could result in this).
Another way of splitting the data is possible if you use InnoDB - in that case you can just tell mysql (using my.cnf) to use one file per table, so you could a) bzip those files individually and only compare the tables which might actually have changed (in case you have static data in some of the tables) and/or b) save the last modified date of a table file, and compare that beforehand (again, this is only really useful in case you have tables with only static data)

encrypting and/or decrypting large files (AES) on a memory and storage constrained system, with "catastrophe recovery"

I have a fairly generic question, so please pardon if it is a bit vague.
So, let's assume a file of 1 GB that needs to be encrypted and later decrypted on a given system.
The problem is that the system has less than 512 MB of free memory and about 1.5 GB of storage space (give or take), so with the file "onboard" we have about 500 MB of "hard drive scratch space" and less than 512 MB of RAM to "play with".
The system is not unlikely to experience an "unscheduled power down" at any moment during encryption or decryption, and needs to be able to successfully resume the encryption/decryption process after being powered up again (and this seems like an extra-unpleasant nut to tackle).
The questions are:
1) is it at all doable :) ?
2) what would be the best strategy to go about
a) encrypting/decrypting with so little scratch space (can't have the entire file lying around while decrypting/encrypting, need to truncate it "on the fly" somehow...)
and
b) implementing a disaster recovery that would work in such a constrained environment?
P.S.:
The cipher used has to be AES.
I looked into AES-CTR specifically but it does not seem to bode all that well for the disaster recovery shenanigan in an environment where you can't keep the entire decrypted file around till the end...
[edited to add]
I think I'll be doing it the Iserni way after all.
It is doable, provided you have a means to save the AES status vector together with the file position.
1. Save AES status and file position P to files STAGE1 and STAGE2
2. Read one chunk (say, 10 megabytes) of encrypted/decrypted data
3. Write the decrypted/encrypted chunk to external scratch SCRATCH
4. Log the fact that SCRATCH is completed
5. Write SCRATCH over the original file at the same position
6. Log the fact that SCRATCH has been successfully copied
7. Goto 1
If you get a hard crash after stage 1, and STAGE1 and STAGE2 disagree, you just restart and assume the file with the earliest P to be good.
If you get a hard crash during or after stage 2, you lose 10 megabytes worth of work: but the AES and P are good, so you just repeat stage 2.
If you crash at stage 3, then on recovery you won't find the marker of stage 4, and so will know that SCRATCH is unreliable and must be regenerated. Having STAGE1/STAGE2, you are able to do so.
If you crash at stage 4, you will BELIEVE that SCRATCH must be regenerated, even if you could avoid this -- but you lose nothing in regenerating except a little time.
By the same token, if you crash during 5, or before 6 is committed to disk, you just repeat stages 5 and 6. You know you don't have to regenerate SCRATCH because stage 4 was committed to disk. If you crash after stage 1, you will still have a good SCRATCH to copy.
All this assumes that 10 MB is more than a cache's (OS + hard disk if writeback) worth of data. If it is not, raise to 32 or 64 MB. Recovery will be proportionately slower.
It might help to flush() and sync(), if these functions are available, after every write-stage has been completed.
Total write time is a bit more than twice normal, because of the need of "writing twice" in order to be sure.
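A condensed sketch of that loop in Python, assuming AES-CTR via the pycryptodome package, a 10 MB chunk size, and a single atomically-replaced JSON checkpoint standing in for the STAGE1/STAGE2 marker files (all file names are illustrative; decryption is the same loop, since CTR encryption and decryption are identical):

import json
import os
from Crypto.Cipher import AES  # pip install pycryptodome

CHUNK = 10 * 1024 * 1024  # 10 MB, a multiple of the 16-byte AES block size

def save_state(state, state_path):
    # Commit the checkpoint atomically: write a temp file, fsync, then rename.
    tmp = state_path + ".tmp"
    with open(tmp, "w") as t:
        json.dump(state, t)
        t.flush()
        os.fsync(t.fileno())
    os.replace(tmp, state_path)

def encrypt_resumable(path, key, nonce,
                      state_path="state.json", scratch_path="scratch.bin"):
    # key: 16/24/32 bytes; nonce: e.g. 8 bytes, kept alongside the file.
    size = os.path.getsize(path)
    # Stage 1: load the checkpoint (file position P + "SCRATCH is good" flag).
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"pos": 0, "scratch_ready": False}

    with open(path, "r+b") as f:
        while state["pos"] < size:
            pos = state["pos"]
            if not state["scratch_ready"]:
                # Stages 2-3: read one plaintext chunk, encrypt it, write SCRATCH.
                f.seek(pos)
                plain = f.read(CHUNK)
                # In CTR mode the "AES status vector" is just the block counter,
                # which can be derived from the byte offset.
                cipher = AES.new(key, AES.MODE_CTR, nonce=nonce,
                                 initial_value=pos // 16)
                with open(scratch_path, "wb") as s:
                    s.write(cipher.encrypt(plain))
                    s.flush()
                    os.fsync(s.fileno())
                # Stage 4: log that SCRATCH is complete.
                state["scratch_ready"] = True
                save_state(state, state_path)
            # Stage 5: write SCRATCH over the original file at the same position.
            with open(scratch_path, "rb") as s:
                data = s.read()
            f.seek(pos)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            # Stage 6: log success and advance to the next chunk.
            state = {"pos": pos + len(data), "scratch_ready": False}
            save_state(state, state_path)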
You have to work with the large file in chunks. Break a piece of the file off, encrypt it, and save it to disk; once saved, discard the unencrypted piece. Repeat. To decrypt, grab an encrypted piece, decrypt it, store the unencrypted chunk. Discard the encrypted piece. Repeat. When done decrypting the pieces, concatenate them.
Surely this is doable.
The "largest" (not large at all however) problem is that when you encrypt say 128 Mb of original data, you need to remove them from the source file. To do this you need to copy the remainder of the file to the beginning and then truncate the file. This would take time. During this step power can be turned off, but you don't care much -- you know the size of data you've encrypted (if you encrypt data by blocks with size multiple to 16 bytes, the size of the encrypted data will be equal to size that was or has to be removed from the decrypted file). Unfortunately it seems to be easier to invent the scheme than to explain it :), but I really see no problem other than extra copy operations which will slowdown the process. And no, there's no generic way to strip the data from the beginning of the file without copying the remainder to the beginning.
