Overcoming inode limitation - unix

What's the best practice for storing a large (and growing) number of small files on a server without running into inode limitations?
For a project, I am storing a large number of small files on a server with 2TB of HD space, but my limitation is the 2560000 allowed inodes. Recently the server used up all the inodes and was unable to write new files. I subsequently moved some files into databases, but others (images and JSON files) remain on the drive. I am currently at 58% inode usage, so a solution is needed soon.
The reason for storing the files individually is to limit the number of database calls. Basically, the scripts check whether a file exists and, if so, return results accordingly. Performance-wise this makes sense for my application, but as stated above it has limitations.
As I understand it, moving the files into sub-directories does not help, because each directory is itself a file with its own inode, so in fact I would just use up more inodes.
Alternatively, I might be able to bundle the files together into an archive-type file, but that would require some sort of indexing.
Perhaps I am going about this all wrong, so any feedback is greatly appreciated.
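A minimal sketch of the archive idea mentioned above, assuming the small files are grouped into zip bundles (the bundle name below is hypothetical): the zip's own central directory then acts as the index, so an existence check and a read cost one inode per bundle instead of one per file.

    import zipfile

    BUNDLE = "bucket-0001.zip"             # hypothetical bundle holding many small files

    def add_file(name, payload: bytes):
        # appending to an existing zip extends it and updates the central directory
        with zipfile.ZipFile(BUNDLE, "a", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(name, payload)

    def read_file(name):
        # the central directory doubles as the index: members are found by name
        with zipfile.ZipFile(BUNDLE, "r") as zf:
            if name not in zf.namelist():
                return None                # mirrors the "check if the file exists" step
            return zf.read(name)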

On the advice of arkascha I looked into loop devices and found some documentation about losetup. Remains to be tested.
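Independently of the loop-device route, it may help to monitor inode headroom so the drive never silently fills up again; a minimal sketch using os.statvfs, whose f_files/f_ffree fields report total and free inodes (the 90% threshold is just an example):

    import os

    def inode_usage(path="/"):
        st = os.statvfs(path)
        used = st.f_files - st.f_ffree     # total inodes minus free inodes
        return used / st.f_files           # fraction of inodes in use

    if inode_usage("/") > 0.9:             # threshold is an arbitrary example
        print("warning: over 90% of inodes are in use")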

Related

webdav access a textfile line by line

I have looked all over (spent about 7 hours). I have found numerous articles on how to map a drive (Google Drive, OneDrive etc). What I cannot seem to find an answer to is this: once I have mapped the drive, can I use the files on that drive just like I use files on a server? Open the file, read a record, write a record. I have created a file, mapped a network drive, written records to the file and retrieved records from the file. I have a home-grown database that is implemented as a large binary (as opposed to text) file. I have to go to a byte position and read a fixed number of bytes. If WebDAV is copying the file to my computer and then writing it back, that would make my file access way too slow, and I cannot seem to find an answer. Some programmers I have talked to say I cannot even do that, yet I can. Any direction would be very much appreciated.
Charlie
That's likely because standard WebDAV doesn't allow updating only a range of a resource, so the whole thing needs to be written back.
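For reading, a plain HTTP range request may be enough if the server honours the Range header (many WebDAV servers do, since resources are served over HTTP), so a fixed-length record can be fetched from a byte position without pulling the whole file; writing a range back is what standard WebDAV lacks. A sketch using the requests library, with a hypothetical URL and credentials:

    import requests

    URL = "https://example.com/dav/records.bin"   # hypothetical WebDAV resource
    AUTH = ("user", "password")                    # hypothetical credentials

    def read_record(offset, length):
        # ask only for the bytes of one fixed-length record
        headers = {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}
        resp = requests.get(URL, headers=headers, auth=AUTH)
        if resp.status_code != 206:                # 206 Partial Content = range honoured
            raise RuntimeError("server ignored the Range header")
        return resp.content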

Storing and serving many compressed archives with shared underlying content

I have a web server that has many compressed archive files (zip files) available for download. I would like to drastically reduce the disk footprint those archives take on the server.
The key insight is that those archives are in fact slightly different versions of the same uncompressed content. If you uncompressed any two of these many archives and ran a diff on the results, I expect you would find that the diff is about 1% of the total archive size.
Those archives are actually JAR files, but the compression details are, I believe, irrelevant. This does explain why serving those archives in a specific compressed format is non-negotiable: it is the basic purpose of the server.
In itself, it is not a problem for me to set up differential storage for the content of those archives, drastically reducing the disk footprint of the set of archives. There are numerous ways of doing this, using delta encoding or a compressed filesystem that understands sharing (e.g. I believe btrfs supports block sharing, or I could use snapshotting to enforce it).
The question is, how do I produce compressed zips from those files? The server I have has very little computational power, certainly not enough to recreate JARs on the fly from the block-shared content.
Is there a programmatic way to expose the shared content at the uncompressed level to the compressed level? An easily-translatable-to-zip incremental compressed format?
Should I look for a caching solution coupled with generating JARs on the fly? This would at least alleviate the computational cost of generating the most frequently requested JARs.
There is specialized hardware that can produce zips very fast, but I'd rather avoid the expense. It's also not a very scalable solution as the number of requests to the server grows.
If the 1% differences are smeared across all of the entries in all of the jar files, then there's not much you can do without having to recompress a lot.
If on the other hand the 1% differences are concentrated in a few percent of the jar entries, with most of the jar entries unchanged, then there's hope. You can keep all of the individual jar entries in their own jar files on the server, and for each jar file you want to serve, just keep a list of those individual jar-entry files to combine. It would be easy to write a fast utility that takes a set of jar files and merges them into a single jar file, if one doesn't exist already.
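A rough sketch of such a merge utility using Python's zipfile module; note that this simple version decompresses and re-deflates every entry, whereas a genuinely fast tool would copy the already-compressed streams verbatim:

    import zipfile

    def merge_jars(entry_jars, out_path):
        # combine the per-entry jars listed for one served jar into a single archive;
        # the first jar providing a given entry name wins (hypothetical policy)
        seen = set()
        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as out:
            for jar in entry_jars:
                with zipfile.ZipFile(jar) as src:
                    for info in src.infolist():
                        if info.filename in seen:
                            continue
                        seen.add(info.filename)
                        out.writestr(info, src.read(info.filename))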
One approach I've used in the past is to log for some time the actual requests for the zip files. If you find that the requests are highly skewed, then you may be able to use caching to alleviate the cost of producing zip files on the fly.
Basically, implement your differential storage along the lines you suggest. Also allocate some amount, say 10%, of your total storage for an LRU cache (or whatever other replacement algorithm you prefer) of the actual zipped files. Every time a user requests a zip, you serve it from the cache if it is ready, or generate it on the fly and put it in the cache if not.
In the general case this may not work well, but in the common case where actual requests are concentrated on a small number of files, it may solve the problem.
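A minimal sketch of that cache, assuming a hypothetical cache directory, a 10 GiB budget, and a build_zip callback that does the expensive on-the-fly generation:

    import os

    CACHE_DIR = "/var/cache/jarcache"      # hypothetical cache location
    CACHE_BUDGET = 10 * 1024**3            # e.g. ~10% of storage, here 10 GiB
    os.makedirs(CACHE_DIR, exist_ok=True)

    def serve_zip(name, build_zip):
        # return a path to the requested zip, building and caching it on a miss;
        # build_zip(name, dest_path) is the expensive on-the-fly generator
        path = os.path.join(CACHE_DIR, name)
        if os.path.exists(path):
            os.utime(path)                 # refresh mtime so LRU keeps hot files
            return path
        build_zip(name, path)              # cache miss: generate once
        evict_lru()
        return path

    def evict_lru():
        files = [os.path.join(CACHE_DIR, f) for f in os.listdir(CACHE_DIR)]
        files.sort(key=os.path.getmtime)   # oldest access first
        total = sum(os.path.getsize(f) for f in files)
        while files and total > CACHE_BUDGET:
            victim = files.pop(0)
            total -= os.path.getsize(victim)
            os.remove(victim)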
Otherwise, I see your options as:
Use delta encoding on disk and then change the format your clients expect for responses. For example, instead of zip, you can serve them a format which is basically the bits of the delta-encoded files they need to reconstruct the file. On the server side, you save most of the work since you are just serving files more or less unmodified from disk, and then the client has to put them together (the existing client already has to unzip the files, so perhaps this is not an undue burden).
Carefully look at the .zip format and store your files in a specialized way that does most of the .zip work ahead of time. For example, something like a delta encoding, but with the actual hard part of match-finding stored on disk, such that encoding a file can be a very fast process. This would require someone with sophisticated knowledge of the zip format to design, however.

A file storage format for file sharing site

I am implementing a file sharing system in ASP.NET MVC3. I suppose most file sharing sites store files in a standard binary format on a server's file system, right?
I have two options storage-wise: the file system, or a binary data field in a database.
Are there any advantages to storing files (including large ones) in a database, rather than on the file system?
MORE INFO:
The expected average file size is 800 MB, and roughly 3 files per minute are requested for download by users.
If the files are as big as that, then using the filesystem is almost certainly a better option. Databases are designed to contain relational data grouped into small rows and are optimized for consulting and comparing the values in these relations. Filesystems are optimized for storing fairly large blobs and recalling them by name as a bytestream.
Putting files that big into a database will also make it difficult to manage the space occupied by the database. The tools for querying the space used in a filesystem, and for removing and replacing data, are better.
The only caveat to using the filesystem is that your application has to run under an account that has the necessary permission to write the (portion of the) filesystem you use to store these files.
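The question is about ASP.NET MVC3, but the filesystem-plus-metadata pattern is language-agnostic; here is a minimal sketch in Python, with a hypothetical uploads directory and SQLite table, that keeps the large payload on disk and only a reference in the database:

    import os, sqlite3, uuid, shutil

    UPLOAD_ROOT = "/srv/uploads"           # hypothetical storage root
    db = sqlite3.connect("files.db")
    db.execute("CREATE TABLE IF NOT EXISTS files "
               "(id TEXT PRIMARY KEY, name TEXT, path TEXT, size INTEGER)")

    def store_upload(src_path, original_name):
        file_id = uuid.uuid4().hex
        # fan out into sub-directories so no single directory gets huge
        dest_dir = os.path.join(UPLOAD_ROOT, file_id[:2])
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, file_id)
        shutil.copyfile(src_path, dest)    # the 800 MB payload lives on disk
        db.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
                   (file_id, original_name, dest, os.path.getsize(dest)))
        db.commit()
        return file_id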
Use FileStream when:
Objects that are being stored are, on average, larger than 1 MB.
Fast read access is important.
You are developing applications that use a middle tier for application logic.
Here is the MSDN link: https://msdn.microsoft.com/en-us/library/gg471497.aspx
How to use it: https://www.simple-talk.com/sql/learn-sql-server/an-introduction-to-sql-server-filestream/

How does the file system on UNIX find files?

Suppose that a request is made to ls somefile. How does the file system in UNIX handle this request from an algorithmic perspective? Is it an O(1) query, an O(log(N)) lookup over the files starting from the current directory node, an O(N) linear search, or some combination depending on certain parameters?
It can be O(n). Classic Unix file systems, based on the old-school BSD Fast File System and the like, identify files by inode numbers, and their names are assigned at the directory level, not at the file level. This allows you to have the same file present in multiple locations at the same time, via hard links. As such, a "directory" in most Unix systems is just a file that lists filenames and inode numbers for all the files stored "in" that directory.
Searching for a particular filename in a directory just means opening that directory file and parsing through it until you find the filename's entry.
Of course, there are many different file systems available for Unix systems these days, and some will have completely different internal semantics for finding files, so there's no one "right" answer.
It's O(n) since the file system has to read the directory off physical media initially, but buffer caches will speed that up significantly, depending on the Virtual File System (VFS) implementation in your flavor of *nix. (Notice how the first time you access a file it's slower than the second time you execute the exact same command?)
To learn more, read IBM's article on the anatomy of the Unix file system.
Typical flow for a program like ls would be
Opendir on the current path.
Readdir for the current path.
Filter the entries returned by readdir through the filter provided on the command line, so typically O(n).
This is the generic flow; however, there are many optimizations in place for special and frequent cases (like caching of inode numbers for recent and frequently used paths).
It also depends on how directory files are organized. In classic Unix file systems they are ordered by time of creation, forcing a read of every entry and pushing the look-up time to O(n). In NTFS, the equivalent of directory files is sorted by name.
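A minimal sketch of the opendir/readdir/filter flow described above, using Python's os.scandir; the entries are still visited one by one, which is where the O(n) comes from:

    import os, fnmatch, sys

    def simple_ls(path=".", pattern="*"):
        # opendir/readdir equivalent: os.scandir yields directory entries one by one,
        # so matching a name is a linear scan over the entries
        matches = []
        with os.scandir(path) as entries:
            for entry in entries:
                if fnmatch.fnmatch(entry.name, pattern):
                    matches.append(entry.name)
        return sorted(matches)

    if __name__ == "__main__":
        for name in simple_ls(*sys.argv[1:]):
            print(name)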
I can't answer your question. Maybe if you take a peek into the source code, you could answer your question yourself and explain to us how it works.
ls.c
ls.h

Writing to and reading from the same file, at the same time (disk being asynchronous?)

We're creating a web service where we're writing files to disk. Sometimes these files will be read at the same time as they are written.
If we do this - writing to and reading from the same file - we sometimes end up with files that are the same length, but where some of the data inside is not the same. With a 350 MB file we get maybe 20-40 bytes that differ.
This problem mostly occurs when we have 3-4 files being written and read at the same time. Could this be because there is no guarantee that after a "write" the data has actually reached the disk, i.e., the disk writes are asynchronous?
Also, the computer we're testing on is just a standard MacBook Pro, so no fancy disks of any kind.
The bug might be somewhere else, but we just wanted to ask the question and see if anybody knew something about this writing+reading thing.
All modern OSs support concurrent reading and writing of files (given a single writer, obviously), so this is not an OS-level bug. But do make sure you do not have multiple threads/processes trying to append data to the file.
Check your application code. Check the buffers you are using. Make sure your application is synchronized and there are no race conditions between readers and writers.
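Whatever the underlying bug turns out to be, one common way to guarantee that readers never see half-written data is to write to a temporary file and atomically rename it into place; a minimal sketch, assuming writers and readers use the same POSIX filesystem:

    import os, tempfile

    def atomic_write(path, data: bytes):
        # write to a temp file in the same directory, then rename it into place;
        # os.replace is atomic on POSIX, so readers see either the old file or the
        # complete new one, never a half-written mix
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dir_name)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())       # make sure the bytes hit the disk
            os.replace(tmp, path)          # atomic swap
        except BaseException:
            os.unlink(tmp)
            raise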
