I have some questions about the .xdf file:
What is this exactly?
How does this type of file work?
How does Microsoft R work with this type of file?
What are the advantages over data.frames?
I'm really looking forward to your answers.
Greetings R123456789
An XDF file is a compressed binary file format with user-selectable levels of compression. Some quick facts can be found here: https://support.microsoft.com/en-us/help/3104260/qa-what-is-the-.xdf-file-format
XDF files come in two forms, Standalone and Composite. A Standalone XDF file is a single file stored on disk with the .xdf extension. A Composite XDF file is represented by a directory, which contains metadata and data subdirectories. In the Composite form, the metadata and data files in their directories are split and individually compressed as XDF part files.
It is a proprietary implementation inside of Microsoft R Server. I can expand on this answer, but I would need you to refine the question "How does this type of file work?"
An XDF file is stored on disk and does not sit in memory. RxXdfData() creates a data source object that points to the XDF file without loading it; a call such as rxImport() or rxDataStep() with no output file reads the XDF file, decompresses it, and returns it in memory as a data frame. Many Microsoft R "rx" functions can take a path to an XDF file directly as a data source or sink, and will manage reading segments into memory as required.
The advantage of using XDF as a data source or sink is that Microsoft R Server does not need to buffer the entire file in memory to work with it. XDF allows partial reads and writes, and compression saves disk space. It is also faster than reading from or writing to flat files, because metadata is used to index the XDF. The main disadvantage is performance relative to memory: data that already fits in memory (data.frames) will generally be faster to operate on than data on disk.
Note: As with all files, the underlying operating system controls when a file is written from memory to disk. For the purpose of your question, you can assume that the XDF file resides on disk as a standard file.
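The idea of reading segments into memory as required, rather than loading a whole file as a data frame, can be illustrated outside of R. The following is only a minimal Python sketch under stated assumptions, not the RevoScaleR API and not the XDF format: it computes the mean of one column of a CSV file on disk while holding at most one chunk of rows in memory. The function name `chunked_mean` and the default chunk size are hypothetical.

```python
import csv
from itertools import islice

def chunked_mean(path, column, chunk_rows=10_000):
    """Mean of a numeric column, reading at most chunk_rows rows at a time."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # Only this chunk of rows is held in memory at any moment.
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            total += sum(float(row[column]) for row in chunk)
            count += len(chunk)
    return total / count if count else float("nan")
```

Memory use is bounded by the chunk size rather than the file size, which is the trade-off described above: less memory pressure, at the cost of going back to disk for every pass over the data.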
Related
I'm loading an R rds file into Julia with
using RData
objs = load(rds, convert=true)
The original rds file is ~3GB. When I run the load function above, memory usage spikes to ~40GB.
Any ideas what's going on?
By default, rds files are compressed with gzip. Try decompressing your file to see how big it actually is (on Windows you could use 7-zip for that). The compression ratio for a data frame can easily be around 80-90%, so your numbers look plausible.
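If you would rather not unpack the file to disk, the uncompressed size can also be measured by streaming. This is a minimal Python sketch, assuming the file was written with the default gzip compression; the function name `gzip_uncompressed_size` is hypothetical.

```python
import gzip

def gzip_uncompressed_size(path, chunk_size=1 << 20):
    """Return the decompressed size in bytes of a gzip file, streaming it in chunks."""
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # decompress about 1 MiB at a time
            if not chunk:
                break
            total += len(chunk)
    return total
```

Comparing the result with the size of the file on disk gives the actual compression ratio, which tells you how much of the memory spike is explained by decompression alone.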
What's the best practice for storing a large (expanding) number of small files on a server, without running into inode limitations?
For a project, I am storing a large number of small files on a server with 2TB of disk space, but my limitation is the 2560000 allowed inodes. Recently the server used up all the inodes and was unable to write new files. I subsequently moved some files into databases, but others (images and JSON files) remain on the drive. I am currently at 58% inode usage, so a solution is needed soon.
The reason for storing the files individually is to limit the number of database calls. Basically, the scripts check whether the file exists and, if so, return results accordingly. Performance-wise this makes sense for my application, but as stated above it has limitations.
As I understand it, moving the files into sub-directories does not help, because each inode points to a file (or a directory file), so in fact I would just use up more inodes.
Alternatively I might be able to bundle the files together in an archive type of file, but that will require some sort of indexing.
Perhaps I am going about this all wrong, so any feedback is greatly appreciated.
On the advice of arkascha I looked into loop devices and found some documentation about losetup. Remains to be tested.
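The archive idea mentioned in the question can be sketched with a ZIP file, which already carries its own index (the central directory), so many small records occupy a single inode. This is a minimal Python sketch under stated assumptions, not a tested production design; the function names `add_record` and `read_record` are hypothetical.

```python
import zipfile

def add_record(archive_path, name, data):
    """Append one small file (bytes) to the archive under the given name."""
    # Mode "a" creates the archive if it does not exist yet.
    with zipfile.ZipFile(archive_path, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)

def read_record(archive_path, name):
    """Return the bytes stored under name, or None if there is no such entry."""
    try:
        with zipfile.ZipFile(archive_path, "r") as zf:
            return zf.read(name)  # raises KeyError for an unknown name
    except (KeyError, FileNotFoundError):
        return None
```

The "does the file exist?" check becomes a lookup in the archive index instead of a filesystem call. Two caveats: writing the same name twice adds a duplicate entry rather than replacing the old one, and opening the archive re-reads its index each time, so a long-running process would probably keep the archive open or split records across several archives.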
I'm working on some code in IDL that retrieves data files through FTP that are Unix compressed (.Z) files. I know IDL can work with .gz compressed files with the /compress keyword; however, it doesn't seem capable of playing nicely with the .Z compression.
What are my options for working with these files? The files I am downloading come from another institution, so I have no control over the compression being used. Downloading and decompressing the files manually before running the code is an absolute last resort. I don't always know in advance which files I need from the FTP site, so the code grabs the ones it needs in real time based on the parameters.
I'm currently running on Windows 7 but once the code is finished it will be used on a Unix system as well (computer cluster).
You can use SPAWN as you note in your comment (assuming you can find an equivalent of the Unix uncompress command that runs on Windows), or for higher speed you can use an external C function with CALL_EXTERNAL to do the decompression. Just by coincidence, I posted an answer on stackexchange the other day with just such a C function to decompress .Z files here.
I have a web server that has many compressed archive files (zip files) available for download. I would like to drastically reduce the disk footprint those archives take on the server.
The key insight is that those archives are in fact slightly different versions of the same uncompressed content. If you uncompressed any two of these many archives and ran a diff on the results, I expect you would find that the diff is about 1% of the total archive size.
Those archives are actually JAR files, but I believe the compression details are irrelevant. It does explain, though, why serving those archives in a specific compressed format is non-negotiable: it is the basic purpose of the server.
In itself, it is not a problem for me to install differential storage for the content of those archives, drastically reducing the disk footprint of the set of archives. There are numerous ways of doing this, using delta encoding or a compressed filesystem that understands sharing (e.g. I believe btrfs understands block sharing, or I could use snapshotting to enforce it).
The question is, how do I produce compressed zips from those files? The server I have has very little computational power, certainly not enough to recreate JARs on the fly from the block-sharing content.
Is there a programmatic way to expose the shared content at the uncompressed level to the compressed level? An easily-translatable-to-zip incremental compressed format?
Should I look for a caching solution coupled with generating JARs on the fly? This would at least alleviate the computational pain of generating the JARs that are requested most often.
There is specialized hardware that can produce zips very fast, but I'd rather avoid the expense. It's also not a very scalable solution as the number of requests to the server grows.
If the 1% differences are smeared across all of the entries in all of the jar files, then there's not much you can do without having to recompress a lot.
If, on the other hand, the 1% differences are concentrated in a few percent of the jar entries, with most of the jar entries unchanged, then there's hope. You can keep all of the individual jar entries in their own jar files on the server, and for each jar file you want to serve, just keep a list of those individual jar entry files to combine. If there isn't already a fast utility that takes a set of jar files and merges them into a single jar file, it would be easy to write one.
One approach I've used in the past is to log for some time the actual requests for the zip files. If you find that the requests are highly skewed, then you may be able to use caching to alleviate the cost of producing zip files on the fly.
Basically, implement your differential storage along the lines you suggest. Also allocate some amount, say 10%, of your total storage to an LRU cache (or whatever other replacement algorithm you prefer) for the actual zipped files. Every time a user requests a zip, you serve it from the cache if it is there, or generate it on the fly and put it in the cache if not.
In the general case this may not work well, but in the common case that actual requests are typically to a small concentrated number of files, it may solve the problem.
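The caching scheme described above can be sketched as a size-bounded LRU cache. This is a minimal in-memory Python sketch, assuming a hypothetical `build` callable that generates the bytes of an archive from the differential storage; the class name `ZipCache` is also hypothetical, and a real server would more likely keep the cached archives on disk.

```python
from collections import OrderedDict

class ZipCache:
    """Size-bounded LRU cache mapping an archive name to its generated bytes."""

    def __init__(self, max_bytes, build):
        self.max_bytes = max_bytes
        self.build = build            # hypothetical callable: name -> bytes
        self.entries = OrderedDict()  # least recently used entry comes first
        self.used = 0

    def get(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return self.entries[name]
        data = self.build(name)             # cache miss: generate on the fly
        self.entries[name] = data
        self.used += len(data)
        # Evict least recently used entries until the budget is met again,
        # always keeping the entry that was just added.
        while self.used > self.max_bytes and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        return data
```

With skewed traffic, most calls to `get` hit the cache and the expensive `build` step runs only for rarely requested archives, which is why logging the real request distribution first is worthwhile.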
Otherwise, I see your options as:
Use delta encoding on disk and then change the format your clients expect for responses. For example, instead of zip, you can serve them a format which is basically the bits of the delta-encoded files they need to reconstruct the file. On the server side, you save most of the work since you are just serving files more or less unmodified from disk, and then the client has to put them together (the existing client already has to unzip the files, so perhaps this is not an undue burden).
Carefully look at the .zip format and store your files in a specialized way that does most of the .zip work ahead of time. For example, something like a delta encoding, but with the actual hard part of match-finding stored on disk, such that encoding a file can be a very fast process. This would require someone with sophisticated knowledge of the zip format to design, however.
I am implementing a file sharing system in ASP.NET MVC3. I suppose most file sharing sites store files in a standard binary format on a server's file system, right?
I have two options storage-wise: a file system, or a binary data field in a database.
Are there any advantages to storing files (including large ones) in a database, rather than on the file system?
MORE INFO:
The expected average file size is 800 MB. Typically, about 3 files per minute will be requested for download.
If the files are as big as that, then using the filesystem is almost certainly a better option. Databases are designed to contain relational data grouped into small rows and are optimized for consulting and comparing the values in these relations. Filesystems are optimized for storing fairly large blobs and recalling them by name as a bytestream.
Putting files that big into a database will also make it difficult to manage the space occupied by the database. The tools to query space used in a filesystem, and remove and replace data are better.
The only caveat to using the filesystem is that your application has to run under an account that has the necessary permission to write to the (portion of the) filesystem you use to store these files.
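A common way to combine the two approaches is to keep the file bytes on the filesystem and only a small metadata row in the database. The following is a minimal Python sketch of that layout using SQLite, not ASP.NET code; the table layout and the function names `init_db`, `store_file`, and `lookup_path` are hypothetical.

```python
import os
import shutil
import sqlite3
import uuid

def init_db(db):
    """Create the (hypothetical) metadata table if it does not exist."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "stored_name TEXT PRIMARY KEY, original_name TEXT, size INTEGER)"
    )

def store_file(db, storage_dir, original_name, source):
    """Stream a file-like object to disk and record its metadata in the database."""
    stored_name = uuid.uuid4().hex  # generated name avoids collisions and unsafe user input
    path = os.path.join(storage_dir, stored_name)
    with open(path, "wb") as out:
        shutil.copyfileobj(source, out, 1 << 20)  # copy in 1 MiB chunks
    db.execute(
        "INSERT INTO files (stored_name, original_name, size) VALUES (?, ?, ?)",
        (stored_name, original_name, os.path.getsize(path)),
    )
    db.commit()
    return stored_name

def lookup_path(db, storage_dir, stored_name):
    """Return the on-disk path for a stored file, or None if it is unknown."""
    row = db.execute(
        "SELECT stored_name FROM files WHERE stored_name = ?", (stored_name,)
    ).fetchone()
    return os.path.join(storage_dir, row[0]) if row else None
```

The database rows stay small and quick to query, while an 800 MB upload is streamed to disk in chunks and never has to fit in memory or in a database field.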
Use SQL Server FILESTREAM when:
Objects that are being stored are, on average, larger than 1 MB.
Fast read access is important.
You are developing applications that use a middle tier for application logic.
Here is the MSDN link: https://msdn.microsoft.com/en-us/library/gg471497.aspx
How to use it: https://www.simple-talk.com/sql/learn-sql-server/an-introduction-to-sql-server-filestream/