How to scale a document storage system? - asp.net

I maintain a web application (ASP.NET/IIS7/SQL2K8/Win2K8) that needs to access documents, actually hundreds of thousands of documents, and growing. Currently, they are all on a Windows 2K8 Server fileshare, being accessed by UNC path (SMB). The files are in a single flat directory and I'm trying to plan how to best improve this solution. I don't want to use the SQL Filestream attribute as it would be significant effort to migrate it all into that, and would really lock in to SQL Server. I also need to find a way to replicate the data for disaster recovery, so perhaps a solution can help with that too.
Options could be:
Segment files into multiple directories?
Application would add metadata for which directory it's on (or segment by other means)
Segment files into separate servers? (virtualize)
Backup becomes more complicated.
Application would add metadata for which server it's on
NAS Storage
SAN Storage
Put a service (WCF) in front of the files and have the app talk to the service
bonus of being reusable across many applications
Assuming I'm going to store on filesystem and not in database (I've read those disccusions here), which would be a more scalable solution?

You've got a couple issues:
- managing a large volume of (static?) files
- preparing for backups and disaster recovery of said files
I'll throw this out there, even though I'm not a fan of the answer, but you might poke around with the free SharePoint 2010 Foundation that's included with server 2k8. If you're having issues with finding the documents you need (either by search, taxonomy via tagging or other metadata) as well as document expiration and you don't want to buy a full blown document management system, this might be a solution. Of course it introduces new problems...
If your only desire is to have these files available to spit out on the web, then the file store like you're using now really is the simplest solution. For DR/redundancy purposes, I'd look at a) running them on a raid/SAN of some sort and b) auto-syncing them with the cloud (either azure or amazon). For b) you can get apps that make the cloud appear as a mapped drive and then use an rsync type software to keep the cloud up to date.
If you want to build something new and cool, you might think about moving the entire file archive into the cloud and just write a table in a db to manage the file name, old location, new cloud location and a redirector code that can provide the access tokens to requestors.
3 different approaches... your choice.

Related

How to sync data of multiple Azure Web Site instances

I have an Azure website running on several instances on Basic compute mode. I want to synchronize local data (just a few small numbers like number of online users, total app users, not whole tables or files) between the instances. How can I achieve this? I've seen Refreshing Data on all Azure Instances but it didn't help much. My current naive approach is writing these values to my database, and in my application code, keeping local changes, and syncing the actual data from the database every few minutes, which is OK for my application. But I'm looking for a more elegant way to achieving sync between instances. Is this possible? If yes, how?
There is not an API that will support syncing state between instances. However, here are a few ideas you might consider.
Local File System
The web site instances share a common file system. So, you could just write data to a .txt file in your App_Data folder and all the instances will use that same file. This would at least eliminate the dependency on a database for such simple data.
Caching
Another option is to write the data to a cache. Your options for this would be to use the Managed Cache service or the new Redis Cache (in preview).
Storage Tables
This may not be any better than your current solution. But, it is cheap and very efficient.

Multiple deployers single Content Delivery database (Broker DB)

In the publishing scenario I have, we have multiple deployers pushing content to both file system and database (broker). Pages and Binaries are put on the file system, everything else in the Broker. We have one of the deployers putting the content into the database. Is this the recommended best practice?
If the storage configurations in all deployers also put the content into the database, how does Tridion handle this? Could this cause duplicate entries, locking failures etc?
I'm afraid at the time of writing I don't have access to an environment to test how this would work.
SDL best practice is to have a one-to-one relationship between a deployer and a publication; that means so long as two deployers do not publish the same content (from the same publication) then they will not collide providing, if a file system, there is separation between the deployed sites e.g. www/pub1 & www/pub2.
Your explanation of your scenario needs some additional information to make it complete but it sounds most likely that there are multiple broker databases (albeit hosted on a single database server). This is the most common setup when dealing with multiple file systems on webservers, combined with a single database server.
I personally do not like this set up as I think it would be better to host file system content in a shared location & share single DB. Or better still deploy everything to the database and uses something like DD4T/CWA.
I have seen (and even recommended based on customer limitations) similar configurations where you have multiple deployers configured as destinations of a given target.
Only one of the deployers can write to the database for the same transaction, otherwise you'll have concurrency issues. So one deployer writes to the database, while all others write to the file system.
All brokers/web applications are configured to read from the database.
This solves the issue of deploying to multiple servers and/or data centers where using a shared file system (preferred approach) is not feasible - be it for cost or any other reason).
In short - not a best practice, but it is known to work.
Julian's and Nuno's approaches cover most of the common scenarios. Indeed a single database is a single point of failure, but in many installations, you are expected to run multiple schemas on the same database server, so you still have a single point of failure even if you have multiple "Broker DBs".
Another alternative to consider is totally independent delivery nodes. This might even mean running a database server on your presentation box. These days it's all virtual anyway so you could run separate small database servers. (Licensing costs would be an important constraint)
Each delivery server has it's own database and file system. Depending on how many you want, you might not want to set up multiple destinations/deployers, so you deploy to one, and use file system replication and database log shipping to mirror the content to the rest.
Of course, you could configure two deployment systems (or three) for redundancy, assuming you can manage all the clustering etc.
OK - to come clean - I've never built one like this, but I'm fairly sure elements of this kind of design will become more common as virtualisation increases, and licensing models which support it. (Maybe we have to wait for Tridion to support an open source database!)

Handling Images and file attachments in a Content Management System

Assumptions: Microsoft stack (ASP.NET; SQL Server).
Some content management systems handle user-generated content (images, file attachments) by storing it in the file system. Others store these items in the back end database.
Some examples of both:
In the filesystem: Community Server, Graffiti CMS
In the database: Microsoft Sharepoint
I can see pros and cons of each approach.
In the filesystem
Lightweight
Avoids bloating the database
Backup and restore potentially simpler
In the Database
All content together in one repository (the database)
Complete separation of concerns (content vs format)
Easier deployment of web site (e.g. directly from Subversion repository)
What's the best approach, and why? What are the pros and cons of keeping user files in the database? Is there another approach?
I'm making this question Community Wiki because it is somewhat subjective.
If you are using SQL Server 2008 or higher, you can use the FileStream functionality to get the best of both worlds. That is, you can access documents from the database (for queries, etc), but still have access to the file via the file system (using SMB). More details here.
Erick
I picked the file system because it made editing of documents in place easier, that is when the user edits a file or document it can be saved in the location it is loaded from with no intervention by the program or user.
IMO, as of right now with the current functionality available in databases, the file system is the better choice.
The file system has no limit on the size of the files and with content this could easily be files larger than 2 GB.
It makes the database size much smaller which means less pressure on memory.
You can design your system to use UNCs and NASs or even cloud storage where as you cannot do this with FILESTREAM.
The biggest downside with using the file system is the potential for orphaning files and keeping the database information on files in sync with the actual files on disk. Admittedly, this is a huge issue but until solutions like FILESTREAM are more flexible, it is the price you have to pay.
Actually its door #3 Chuck.
I think storing images in the database is bad news unless you need to keep them private, otherwise, just put them on a CDN and store the URLs of the images instead. I've built some huge sites for ecommerce and putting the load on a CDN like Akumai or Amazon Cloudfront is a real nice way to speed up your website dramatically. I'm not a big fan of burning your bandwidth, CPU and memory for serving up images. Seems a silly waste of resources now days since CDNs are so cheap. Also, it does allow deployment to not care because your stuff is already in a globally accessible region. You can take a look at my profile to see the sites I've done and see how they are using CDNs to offload static requests. Just makes sense and gets even better if you can gzip it.

How do I design the file storage issue?

I am working on an application that creates video files and stores them in a folder in the C:\ drive. I speculate that there will be a large number of these files in the future and we would run out of disk space at some point of time (on our VPS). When the time comes that we have to upgrade, we either plan to use one of the Cloud providers to store files or our existing provider can add another disk (say D:\ drive).
Either way, I would want to design the app now in a way that in future, moving to different locations would not be an issue and would be transparent to the end user.
The code that creates these files supports 2 ways:
myObj.SetOutputToDisk(<path to store>); or
myObj.SetOutputToMemoryStream(ms);
If we go with the Cloud architecture, I assume we might have the following combination:
Cloud Files + Existing VPS or
Cloud Files + Cloud Windows Server
Given the unknowns at this time, how would I go about designing this?
Serve the files up from a subdomain. Say: media.yourdomain.com.
That way, you can trivially repoint DNS records to the new storage provider at some point in the future.
Also, I'd recommend storing the media files on another physical disk to the OS disk. So have a D:\ drive and store the media there.
You might want to look at the Managed Extensibility Framework as a way of adding extensions to your app for new storage methods without the need to rebuild the whole thing.
You need some way to record the storage location and method used, I'd expect some kind of database store that you could migrate to the cloud later if required.
Your question is very vague, you haven't put much work in yourself and as such you are unlikely to get the level of detail you are hoping for in the answers. At least try to implement the system and then ask specific questions around issues that you are having problems with.

Store pictures as files or in the database like MSSQL for a web app?

I'm building an ASP .NET web solution that will include a lot of pictures and hopefully a fair amount of traffic. I do really want to achieve performance.
Should I save the pictures in the Database or on the File system? And regardless the answer I'm more interested in why choosing a specific way.
Store the pictures on the file system and picture locations in the database.
Why? Because...
You will be able to serve the pictures as static files.
No database access or application code will be required to fetch the pictures.
The images could be served from a different server to improve performance.
It will reduce database bottleneck.
The database ultimately stores its data on the file system.
Images can be easily cached when stored on the file system.
In my recently developed projects, I stored images (and all kinds of binary documents) as image columns in database tables.
The advantage of having files stored in the database is obviously that you do not end up with unreferenced files on the harddisk if a record is deleted, since synchronization between database (= meta data) and harddisk (= file storage) is not built-in and has to be programmed manually.
Using today's technology, I suggest you store images in SQL Server 2008 FILESTREAM columns (at least that's what I am going to do with my next project), since they combine the advantage of storing data in database AND having large binaries in separate files (at least according to advertising ;) )
The adage has always been "Files in the filesystem, file metadata in the database"
Better to store files as files. Different databses handle Blob data differently, so if you have to migrate your back end you might get into trouble.
When serving the impages an < img src= to a file that already exists on the server is likely to be quicker than making a temporary file from the database field and pointing the < img tag to that.
I found this answer from googling your question and reading the comments at http://databases.aspfaq.com/database/should-i-store-images-in-the-database-or-the-filesystem.html
i usually like to have binary files in the database because :
data integrity : no unreferenced file, no path in the db without any file associated
data consistency : take a database dump and that's all. no "O i forgot to targz this data directory."
Storing images in the database adds a DB overhead to serve single images and makes it hard to offload to alternate storage (S3, Akami) if you grow to that level. Storing them in the database makes it much easier to move your app to a different server since it's only the DB that needs to move now.
Storing images on the disk makes it easy to offload to alternate storage, makes images static elements so you don't have to mess about with HTTP headers in your web app to make the images cacheable. The downside is if you ever move your app to a different server you need to remember to move the images too; something that's easily forgotten.
For web based applications, you're going to get better performance out of using the file system for storing your images. Doing so will allow you to easily implement caching of the images at multiple levels within your application. There are some advantages to storing images in a database, but most of the time those advantages come with client based applications.
Just to add some more to the already good answers so far. You can still get the benefits of caching from both the web level maybe and the database level if you go the route keeping you images in the database.
I think for the database you can achieve this by how you store the images with relation to the textual data associated with them and if you can the access to the images into a particular query so that the database can cache the query (just theory though so feel free to nuke me on that part).
With the web side, I would guess since you're question is tagged up with asp.net that you would go the route of using a http handler to serve up the images. Then you have all the benefits of the framework at your disposal and you can keep you domain logic cleaner with only having to pass the key to your image to the http handler.
Here is a step-by-step example (general approach, Spring implementation, Eclipse) of storing images in file system and holding their metadata in DB --
http://www.devmanuals.com/tutorials/java/spring/spring3/mvc/Spring3MVCImageUpload.html
Here is an example too -- http://www.journaldev.com/2573/spring-mvc-file-upload-example-tutorial-single-and-multiple-files
Also you can investigate a codebase of this project -- https://github.com/jdmr/fileUpload . Pay attention to this controller.

Resources