Using a hash function for fast image duplication prevention - math

I'm developing a website where people can upload images to use as textures.
I want to limit the load on the web server by avoiding storing the same image twice.
Since users upload images frequently, I need a very fast way to determine whether an uploaded image already has a copy on the server; if it does, I want to discard the upload and reuse the existing image.
Is it OK to use a hash function for this task if I can accept a probability below 1e-5 that two different images get the same hash? Which hash function is best suited to my task?
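A cryptographic hash is a common fit here: with something like SHA-256, an accidental collision between two different files is astronomically less likely than 1e-5, so the hash of the uploaded bytes can serve as the deduplication key. A minimal sketch in C# (the ComputeKey name and the class around it are illustrative, not an existing API):

using System;
using System.IO;
using System.Security.Cryptography;

public static class ImageDeduplication
{
    // Hash the raw upload bytes; two byte-identical uploads always produce the same key,
    // and a SHA-256 collision between different files is far rarer than 1e-5.
    public static string ComputeKey(Stream uploadedImage)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(uploadedImage);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}

On upload you would compute the key, check it against the keys you already store (for example in a unique-indexed database column), and only write the file if the key is new. Note this only catches byte-identical duplicates; a re-encoded or resized copy of the same picture will hash differently.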

Related

Serving Lazy Thumbnail Images from Azure Blob Storage - What is the overhead of Exists?

I have a website where users upload images. These images are shown on various sections of the site with various thumbnail dimensions. Since the site is still under rapid development, I don't yet want to commit to a set number of thumb sizes. Thus I believe I should be generating thumbnails on a lazy basis.
Of the two options below, which is the more performant way to do this?
1. When I go to serve the thumbnail, convert the dimensions into a canonical filename (like "bighouse-thumb-160x120") and check if the file exists in blob storage using client.GetContainerReference(containerName).GetBlockBlobReference(key).Exists(). If it does not exist, generate it and save it.
2. When I go to serve the thumbnail, query my SQL database to see if the thumbnail exists. If it exists, get the blob URI from the DB and emit that as HTML. If it does not exist, generate it and update the SQL database.
I've used #2 in the past, but design-wise it duplicates state, which is bad. If querying Azure for the existence of blobs is scalable, I'd rather do that. I don't really understand the threading model in ASP.NET: if I have 200 users requesting thumbs, will my Azure Exists calls all happen in parallel? Even if they do, two round trips seem like a lot of overhead. I assume round-tripping the database is faster and lends itself more easily to generic caching solutions.
What is the right answer?
Regardless of the overhead, I would pre-generate thumbnails when you upload/store the image. This way you move the burden of generating thumbnails from something that is done many times (retrieving an image) to something that is executed much less often (storing an image).
Consider the following scenario, when you lazily generate thumbnails on the first view:
Check for an existing thumbnail (it won't exist yet - it's the first view, remember ;))
Generate a thumbnail
Store the thumbnail
Send the thumbnail to the client
With pre-generated thumbnails the process is much shorter:
Send the thumbnail to the client
Done.
With 'lazy generating', the check for an existing thumbnail can be expensive due to network overhead (on every hit!), generating the thumbnail can be hugely expensive memory- and CPU-wise, and then you have to store it, with network overhead again. You can even offload generating the thumbnail(s) to a separate process, possibly started by queue messages, to move the burden of generating the images even further away from your web servers.
However, this brings up the question of what you should do when you introduce a new thumbnail/image size. When you pre-generate the thumbnails you can write a simple tool to create the new sizes and store them, and if you went the separate-process route it's even simpler: just upgrade the separate process, generate a queue message for every existing image, and let it do its work.
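As a rough illustration of the queue-based offload, here is a minimal sketch using the classic Microsoft.WindowsAzure.Storage SDK that the question's Exists() call suggests; the "thumbnail-requests" queue name and the idea of enqueueing the blob name right after upload are assumptions, not part of any prescribed design:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public static class ThumbnailRequests
{
    // Called right after the original image is stored: drop a message on a queue
    // so a separate worker process can generate the thumbnails off the web server.
    public static void Enqueue(string connectionString, string imageBlobName)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudQueue queue = account.CreateCloudQueueClient()
                                  .GetQueueReference("thumbnail-requests");
        queue.CreateIfNotExists();
        queue.AddMessage(new CloudQueueMessage(imageBlobName));
    }
}

The worker then dequeues messages, generates every configured size for that blob, and writes the results back to blob storage.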

What is the best way to handle time consuming dynamic generated reports downloads?

A website serves continuously updated content (think stock exchange), must generate reports on demand, and lets users download the resulting files. Users can customize the downloaded report based on lots of parameters.
What is the best practice for handling downloads of highly customized report files such as .xls?
How can caching be used to improve performance?
It might be good to mention that the data is stored in RavenDB and the reports are expected to handle result sets of around 100K rows.
Here are some pointers:
Make sure you have static indexes defined in RavenDB to match all possible reports. You don't want to use dynamically generated temp indexes for this.
Probably one or more parameters will drastically change the query, so you may have some conditional logic to choose which of several queries to run. This is especially true for different groupings, as they'll require a different map-reduce index.
Choose whether you want to limit your result set using standard paging with Skip and Take operators, or whether you are going to stream unbounded result sets.
However you build the actual report, do it in memory. Do not try to write it to disk first. Managing file permissions, locks, and cleanup is not worth the hassle. Plus, you risk taking servers down if they run out of disk space.
Preferably you should build the response and stream it out to your user in a single step, so as not to require large amounts of memory on the server. Make sure you understand the yield keyword in C#, and work with IEnumerable and IQueryable directly whenever possible. Don't use .ToList() or .ToArray(), which will pull the whole result set into memory (see the sketch after these pointers).
With regard to caching, you could consider a front-end cache like Memcached, but I'm not sure it will help you here. You probably want data that is as fresh as possible from your database, and introducing any sort of cache requires you to understand how and when to invalidate it. Keep in mind that Raven already has several caching layers built in. Build your solution without a cache first, and then add caching if you need it.
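To make the streaming point concrete, here is a minimal sketch; ReportRow, FormatLine, and StreamReport are illustrative placeholders, not part of RavenDB or any real reporting API:

using System.Collections.Generic;

public class ReportRow
{
    public string Symbol;
    public decimal Price;
}

public static class ReportStreamer
{
    // Yields one formatted line per row as it is enumerated; nothing is buffered
    // into a List or array, so memory use stays flat regardless of result size.
    public static IEnumerable<string> StreamReport(IEnumerable<ReportRow> rows)
    {
        foreach (var row in rows)
            yield return FormatLine(row);
    }

    private static string FormatLine(ReportRow row)
    {
        return row.Symbol + "," + row.Price;
    }
}

The action or handler can then write each yielded line straight to the response stream, so a 100K-row report never has to exist in memory all at once.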

Save image url or save image file in sql database?

We can save an image in two ways:
1. Upload the image to the server and save the image URL in the database.
2. Save the image directly into the database.
Which one is better?
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
If your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient.
If your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database).
In between those two, it's a bit of a toss-up depending on your use.
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean, mean and very efficient, assuming you don't always need to select the employee photo as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(....... define the fields here ......)
ON Data -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
Like many questions, the answer is "it depends." Systems like SharePoint use option 2. Many ticket-tracking systems (I know for sure Trac does this) use option 1.
Think also of any (potential) limitations. As your volume increases, are you going to be limited by the size of your database? This has particular relevance to hosted databases and applications where increasing the size of your database is much more expensive than increasing your storage allotment.
Saving the image to the server will work better for a website, given that these are incidental to your website, like per-customer branding images - if you're setting up the next Flickr, obviously the answer would be different :). You'd want to set up one server to act as a file server, share out the /uploaded_images directory (or whatever you name it), and set up an application variable defining the base URL of uploaded images. Why is it better? Cost. File servers are dirt-cheap commodity hardware. You can back up the file contents using dirt-cheap commodity (even consumer-grade) backup software. And if your file server croaks and someone loses a day of uploaded images? Who cares - they just upload them again. Our database server is an enterprise cluster running on an SSD SAN. Our backups and transaction logs are shipped to remote sites over expensive bandwidth and kept even on tape for x period. We use it for all the data where we need the ACID (atomicity, consistency, isolation, durability) guarantees of an RDBMS. We don't use it for company logos.
Store them in the database unless you have a good reason not to.
Storing them in the filesystem is premature optimization.
With a database you get referential integrity, you can back everything up at once, integrated security, etc.
The book SQL Antipatterns calls storing files in the filesystem an anti-pattern.

Is it good to store Images in DB and retrieve? [duplicate]

Possible Duplicate:
Storing Images in DB - Yea or Nay?
Friends,
I have a requirement to show dynamically changing images in a DataList. What I did was store the images in the DB using the image datatype and retrieve them from there. Is storing images in the DB a good technique?
FYI, users can upload the images.
Regards,
Abhi
The answer is: it depends... Studies have been done (http://research.microsoft.com/pubs/64525/tr-2006-45.pdf) which basically concluded that if objects are larger than one megabyte on average, NTFS has a clear advantage over SQL Server. If the objects are under 256 kilobytes, the database has a clear advantage. Inside this range, it depends on how write-intensive the workload is and the storage age of a typical replica in the system.
I would store them as physical files on the server, but store the file path in the database, not the actual image. Storing the image in the database will increase its size dramatically over time.
Image data will increase your DB size unnecessarily, so it's not good practice to store it; instead, store the file path in your DB, which is not that big.
Storing images in the DB should only be done if you have a strong requirement or use case for it.
The thing you have to address if you store paths etc. is maintaining referential integrity with your images. What if somebody moves files? What if somebody uploads a new file with the same name? (I'd suggest uploads get renamed to reflect some kind of key rather than keeping their original name of bob.jpg.) You'll also need to look at segmenting your directories to keep the listing sensible, and finding the images may be harder than if you store them in a DB.
However, on the upside, you can form a CDN by distributing your images over diverse servers, subdomains, the cloud, etc. if you don't jam them all into your database.
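As a rough sketch of the "rename to a key" and directory-segmenting ideas above (SaveUpload, uploadsRoot, and the prefix scheme are hypothetical names, not an existing API):

using System;
using System.IO;

public static class UploadStore
{
    // Saves the upload under a generated key instead of the user's file name,
    // segmenting directories by the first two characters of the key.
    public static string SaveUpload(Stream upload, string uploadsRoot, string originalName)
    {
        string extension = Path.GetExtension(originalName);        // keep ".jpg", ".png", ...
        string key = Guid.NewGuid().ToString("N") + extension;     // collision-free storage name
        string directory = Path.Combine(uploadsRoot, key.Substring(0, 2));
        Directory.CreateDirectory(directory);
        using (var file = File.Create(Path.Combine(directory, key)))
        {
            upload.CopyTo(file);
        }
        return key;                                                // store this key/path in the DB row
    }
}

The returned key (plus the relative path) is what would go into the database row for the image, so nothing depends on the original file name.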
Depends on the size of the images and the DB you use.
For SQL Server it is a pretty bad idea if they are larger than 1 MB and you do not use NTFS-backed FILESTREAM storage for your BLOB fields.
See for example http://www.simple-talk.com/sql/learn-sql-server/an-introduction-to-sql-server-filestream/
If you have a document-oriented database like CouchDB, it might be OK.
I would store them as physical files on the server, but store the file path in the database, not the actual image, and then look the file up by the location stored in the database. Storing the image in the database will increase its size dramatically over time.
Storing them in the database is also useful if you need to scale your site across multiple web servers.
If they are static there is no need, as they can be deployed with your site, but things like avatars are generally better stored in the DB so they are available to all cluster members.

Invisible caching in ASP.Net MVC

I'm creating a page that has some dynamically generated images built from data that comes from web services. The image generation takes a fair amount of time due to the time involved in hitting the web services, so I need to do some caching.
One option would be to use the OutputCache parameter to cache the images, but I don't like forcing some unlucky user to wait for a really long time. I'd rather write the images to files in the background and serve static html.
What's the best way to do this? I'm thinking about creating a special URL that triggers a refresh by writing the images to disk, and setting up a scheduled task of some sort to hit that refresh URL. Any better ideas?
It seems it's possible to use memcached with ASP.NET; how hard would that be to set up? It may be overkill for this situation (an internal tool) and I've already got the disk-based version working, but I'm curious.
We do something similar, although we simply precompute the files/images and store them in the HttpRuntime.Cache. This way our views can still be generated as-is, but they typically pull from cached data rather than generating on-the-fly.
On the off chance that the cached data isn't available, we have getter functions to generate them:
// Return the cached graph for this id, generating and caching it on first request.
public static object GetGraph(int id)
{
    string key = "image_" + id;
    if (HttpRuntime.Cache[key] == null)
        HttpRuntime.Cache[key] = _imageGen.GenerateGraph(id);
    return HttpRuntime.Cache[key];
}
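If the cached graphs should refresh periodically rather than live until the app domain recycles, the same idea works with an absolute expiration via Cache.Insert. A minimal variant of the getter above, reusing its _imageGen field; the 30-minute window is an assumed value:

// Cache with an absolute expiration so a stale graph is regenerated
// on the next request after the window lapses.
public static object GetGraphWithExpiry(int id)
{
    string key = "image_" + id;
    object cached = HttpRuntime.Cache[key];
    if (cached == null)
    {
        cached = _imageGen.GenerateGraph(id);
        HttpRuntime.Cache.Insert(
            key,
            cached,
            null,                                   // no cache dependency
            DateTime.UtcNow.AddMinutes(30),         // assumed refresh interval
            System.Web.Caching.Cache.NoSlidingExpiration);
    }
    return cached;
}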
