Nginx cache behavior for large files and disk full - nginx

Let's assume that Nginx is configured as a reverse proxy to serve very large files from a storage server. The cache is configured to cache everything with no limits (no max_size) for demo purposes. The server nginx is installed on has 50 GB of disk space.
I was wondering how nginx behaves in these situations:
In case "max-size" is not specified, I understand that nginx can use all the available disk space. But when the disk is full, what is the behavior? It removes the oldest cache?
If thousands of files are cached and a 50 GB file needs to be cached. Nginx will then clean the cache of those thousands of files to make room for one big file?
Nginx receives a request for a 60 GB file. According to the configuration, it must cache it for future requests. But the disk is only 50 GB. Does it start caching the 50 GB file knowing that it will not be able to succeed? Or does it understand that this is not possible and just passes the request without caching.
Thank you

I can answer the first two:
Yes, following LRU, but you need min_free (or max_size) specified in proxy_cache_path; see the proxy_cache_path documentation.
When the size is exceeded or there is not enough free space, it removes the least recently used data.
As for the second question: yes, as in the first, the least recently used entries are evicted to make room.
Hope it helps, and I hope someone else can enlighten us on the 3rd question.
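As a concrete sketch of such a configuration (the paths, zone name and sizes here are illustrative assumptions, and min_free is only available in newer nginx releases):

    proxy_cache_path /var/cache/nginx/bigfiles levels=1:2 keys_zone=bigfiles:100m
                     max_size=40g min_free=5g inactive=7d use_temp_path=off;

    upstream storage_backend {
        server 10.0.0.2:80;   # hypothetical storage server
    }

    server {
        listen 80;

        location / {
            proxy_pass http://storage_backend;
            proxy_cache bigfiles;
            proxy_cache_valid 200 206 24h;
            # Admit a file into the cache only on its second request, so a single
            # download of a huge object does not immediately evict thousands of
            # smaller entries.
            proxy_cache_min_uses 2;
        }
    }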

Related

I want to cache static frequently used content on disk

We are going to deploy a storage server without RAID (we have lots of data but limited storage for now, and the data is not critical), so we will assign a subdomain to each of 12 x 8 TB drives for our clients to download from.
Clients will download content through static URLs over HTTP (http://subdomain1.xyzwebsite.com/folder1/file1.mkv). Our server is powerful, with 128 GB of RAM, 6 x 2-core processors and a 10-gigabit LAN card, but without RAID, multiple clients downloading from the same drive becomes a bottleneck. To overcome this I started looking into Varnish Cache, but I am not satisfied that I understand how it will serve the data (I do not understand setting the object size or manually setting the cache location to RAM or disk).
NOTE: each file size can range from 500 MB to 4 GB
We do not want a separate server for caching; we want to use this powerful server for it. My idea for a solution: the data lives on one drive, and if possible, frequently used content (files downloaded in the last 12 or 24 hours) is copied/mirrored/cached to a second drive, and the same file is served under the same subdomain.
NOTE: nginx knows which files are accessed via the access.log.
Scenario:
There are 12 drives (plus 2 separate OS drives which I am not counting here). I will store data on 11 drives and use the 12th drive as a copy/mirror/cache for all of them. I know how HTTP works: even if I add multiple IPs to the same domain, a client downloads from only one IP at a time (I will add multiple IP addresses on the same server). My idea is that data will be served via round robin, so if one client is downloading from one IP, another client might get to download from the second IP.
Now I don't know how to implement it; I tried searching for solutions but have not found any. There are two main problems:
How to copy/mirror/cache only the frequently used data from the 11 drives onto 1 drive and serve it from there.
If I add a second IP address entry to the same subdomain and the data is not yet on the 12th drive, how will it be fetched?
An nginx- or Varnish-based solution on the same server is required; if a RAM-based cache can be done as well, even better.
Varnish can be used for this, but unfortunately not the open source version.
Varnish Enterprise features the so-called Massive Storage Engine (MSE), which uses both disk and RAM to store large volumes of data.
Instead of storing each object in its own file, MSE uses pre-allocated large files with filesystem-like behavior. This is much faster and less prone to disk fragmentation.
In MSE you can configure how individual disks should behave and how much storage per disk is used. Each disk or group of disks can be tagged.
Based on Varnish Enterprise's MSE VMOD, you can then control what content is stored on each disk or group of disks.
You can decide how content is distributed to disk based on content type, URL, content size, disk usage and many other parameters. You can also choose not to persist content on disk, but just keep content in memory.
Regardless of this MSE VMOD, "hot content" will be automatically buffered from disk into memory. There are also "waterlevel" settings you can tune to decide how to automatically ensure that enough space is always available.
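Since the question also allows an nginx-based setup, here is a minimal sketch of using the 12th drive as a proxy_cache in front of a backend that reads from the content drives; all paths, ports and sizes are assumptions for illustration, not a tested configuration:

    proxy_cache_path /mnt/disk12/nginx_cache levels=1:2 keys_zone=hotfiles:200m
                     max_size=7000g inactive=24h use_temp_path=off;

    server {
        listen 80;
        server_name subdomain1.xyzwebsite.com;

        location / {
            # Backend (e.g. a second server block on this machine) that serves
            # files straight from the content drive for this subdomain.
            proxy_pass http://127.0.0.1:8081;
            proxy_cache hotfiles;
            proxy_cache_valid 200 206 24h;
            # Copy a file onto the cache drive only after a second request within
            # the inactive window, approximating "frequently used in 24 hours".
            proxy_cache_min_uses 2;
            # Collapse concurrent cache misses for the same file into a single
            # read from the content drive.
            proxy_cache_lock on;
        }
    }

On a miss nginx simply proxies the file from the content drive, and once it qualifies it also writes a copy to the cache drive, which is roughly what the second problem ("how will it fetch it") asks about.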

nginx least_conn is still doing round_robin

I have a fully working config doing load balancing over 2 upstream servers.
I want to use "least_conn".
When I put least_conn in, it is still doing round robin.
I can confirm that other settings like "weight" and "ip_hash" are working as expected.
Is there some other configuration/setting that also affects whether least_conn is honored?
This is using nginx 1.18.0.
For future reference: it was indeed working. It turns out that there is quite a large amount of buffering (by default roughly 1 GB between a temp file and some in-memory buffers), and the files being transferred were a little smaller than that.
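A sketch of the configuration and of the buffering that was hiding the effect; the hostnames are placeholders, and the two buffering values shown are nginx's defaults:

    upstream backend {
        least_conn;
        server app1.internal:8080;
        server app2.internal:8080;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://backend;
            # Defaults, written out explicitly: with buffering on, nginx reads the
            # upstream response into memory buffers and then into a temp file of up
            # to proxy_max_temp_file_size. The upstream connection is released as
            # soon as the response fits in that space, so the per-server connection
            # counts that least_conn balances on drop back to zero almost
            # immediately, which looks just like round robin for responses smaller
            # than roughly 1 GB.
            proxy_buffering on;
            proxy_max_temp_file_size 1024m;
        }
    }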

What happens when nginx proxy_buffer_size is exceeded?

I am running a node server in AWS Elastic Beanstalk with Docker, which also uses nginx. One of my endpoints is responsible for image manipulation such as resizing etc.
My logs show a lot of ESOCKETTIMEDOUT errors, which suggest the cause could be an invalid URL.
This is not the case, as handling that scenario is fairly basic, and when I open the apparently invalid URL, it loads an image just fine.
My research has so far led me to make the following changes:
Increase the timeout of the request module to 2000
Set the container's UV_THREADPOOL_SIZE env variable to the maximum of 128
While 1 has helped in improving response times somewhat, I don't see any improvements from 2. I have now come across the following warning in my server logs:
an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/0/12/1234567890 while reading upstream
This makes me think that the ESOCKETTIMEDOUT errors could be due to the proxy_buffer_size being exceeded. But, I am not sure and I'd like some opinion on this before I continue making changes based on a hunch.
So I have 2 questions:
Would exceeding the nginx proxy_buffer_size result in an error if (a) the size is exceeded when manipulating a large image, or (b) the volume of requests maxes out the buffer space?
What are the cost impacts, if any, of increasing the size (AWS memory, instance size, etc.)?
I have come across this helpful article but wanted some more opinion on if this would even help in my scenario.
When the proxy buffers are exceeded, nginx creates a temporary file to use as a kind of "swap", which uses your storage, and if that storage is billable your cost will increase. When you increase the proxy_buffer_size value you will use more RAM, which means you may have to pay for a larger instance, or try your luck with the current one.
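To make that concrete, the directives involved look roughly like this; the values are illustrative only, not a recommendation:

    location /images/ {
        proxy_pass http://node_app;   # hypothetical upstream for the Node endpoint

        # Buffer for the first part of the upstream response (the headers).
        proxy_buffer_size 8k;

        # In-memory buffers for the response body.
        proxy_buffers 16 32k;

        # When the body no longer fits in the buffers above, nginx writes the
        # rest to a temporary file (the "buffered to a temporary file" warning).
        # A value of 0 disables the temp file, and proxy_buffering off streams
        # the response to the client without buffering it at all.
        proxy_max_temp_file_size 512m;
    }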
There are two things you should never make the user wait on: e-mail and image processing. They can lead to timeouts or even whole-application unavailability. You can always use larger timeouts, or even more robust instances for those endpoints, but when it scales you WILL have problems.
I suggest you approach this differently: return an image placeholder response and process the images asynchronously. When they are available as versioned, resized images, you can serve them normally. There is an AWS article about doing something like this with Lambda.

What are the options for transferring 60GB+ files over a network?

I'm about to start developing an application to transfer very large files, without any rush but with a need for reliability. I would like people who have worked on such a case to give me an insight into what I'm about to get into.
The environment will be an intranet FTP server, so far using active FTP on normal ports on Windows systems. I might also need to zip up the files before sending; I remember once working with a library that would zip in memory, and there was a limit on the size... ideas on this would also be appreciated.
Let me know if I need to clarify anything else. I'm asking for general/higher-level gotchas, if any, not detailed help. I've done apps with normal sizes (up to 1 GB) before, but for this one it seems I'd need to limit the speed so I don't kill the network, or things like that.
Thanks for any help.
I think you can get some inspiration from torrents.
Torrents generally break the file up into manageable pieces and calculate a hash of each one. They then transfer the pieces one by one; each piece is verified against its hash and accepted only if it matches. This is a very effective mechanism: it lets the transfer happen from multiple sources and also lets it restart any number of times without worrying about corrupted data.
For a transfer from a server to a single client, I would suggest that you create a header that includes metadata about the file, so the receiver always knows what to expect, knows how much has been received, and can check the received data against the hashes.
I have implemented this idea in practice in a client-server application, though the data size was much smaller (say 1500 KB) and reliability and redundancy were the important factors. This way you can also effectively control the amount of traffic you want to allow through your application.
I think the way to go is to use the rsync utility as an external process from Python.
Quoting from here: "[rsync compares] the pieces, using checksums, to possibly existing files in the target site, and transports only those pieces that are not found from the target site. In practice this means that if an older or partial version of a file to be copied already exists in the target site, rsync transports only the missing parts of the file. In many cases this makes the data update process much faster, as all the files are not copied each time the source and target site get synchronized."
And you can use the -z switch to get transparent on-the-fly compression for the data transfer, with no need to bottle up either end compressing the whole file.
Also, check the answers here:
https://serverfault.com/questions/154254/for-large-files-compress-first-then-transfer-or-rsync-z-which-would-be-fastest
And from rsync's man page, this might be of interest:
--partial
By default, rsync will delete any partially transferred file if the transfer is interrupted. In some circumstances it is more desirable to keep partially transferred files. Using the --partial option tells rsync to keep the partial file, which should make a subsequent transfer of the rest of the file much faster.

Harvesting Dynamic HTTP Content to produce Replicating HTTP Static Content

I have a slowly evolving dynamic website served from J2EE. The response time and load capacity of the server are inadequate for client needs. Moreover, ad hoc requests can unexpectedly affect other services running on the same application server/database. I know the reasons and can't address them in the short term. I understand HTTP caching hints (expiry, etags....) and for the purpose of this question, please assume that I have maxed out the opportunities to reduce load.
I am thinking of doing a brute force traversal of all URLs in the system to prime a cache and then copying the cache contents to geodispersed cache servers near the clients. I'm thinking of Squid or Apache HTTPD mod_disk_cache. I want to prime one copy and (manually) replicate the cache contents. I don't need a federation or intelligence amongst the slaves. When the data changes, invalidating the cache, I will refresh my master cache and update the slave versions, probably once a night.
Has anyone done this? Is it a good idea? Are there other technologies that I should investigate? I can program this, but I would prefer a solution built from configuring open source technologies.
Thanks
I've used Squid before to reduce load on dynamically-created RSS feeds, and it worked quite well. It just takes some careful configuration and tuning to get it working the way you want.
Using a primed cache server is an excellent idea (I've done the same thing using wget and Squid). However, it is probably unnecessary in this scenario.
It sounds like your data is fairly static and the problem is server load, not network bandwidth. Generally, the problem exists in one of two areas:
Database query load on your DB server.
Business logic load on your web/application server.
Here is a JSP-specific overview of caching options.
I have seen huge performance increases by simply caching query results. Even adding a cache with a duration of 60 seconds can dramatically reduce load on a database server. JSP has several options for in-memory cache.
Another area available to you is output caching. This means that the content of a page is created once, but the output is used multiple times. This reduces the CPU load of a web server dramatically.
My experience is with ASP, but the exact same mechanisms are available on JSP pages. In my experience, with even a small amount of caching you can expect a 5-10x increase in max requests per sec.
I would use tiered caching here; deploy Squid as a reverse proxy server in front of your app server as you suggest, but then deploy a Squid at each client site that points to your origin cache.
If geographic latency isn't a big deal, then you can probably get away with just priming the origin cache like you were planning to do and then letting the remote caches prime themselves off that one based on client requests. In other words, just deploying caches out at the clients might be all you need to do beyond priming the origin cache.
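Since the rest of this page leans heavily on nginx, the same origin-cache idea can also be sketched with nginx's proxy_cache in front of the J2EE server; the names, paths and times below are hypothetical:

    proxy_cache_path /var/cache/nginx/site levels=1:2 keys_zone=site:50m
                     max_size=10g inactive=30d use_temp_path=off;

    upstream j2ee_origin {
        server 10.0.0.5:8080;   # hypothetical application server
    }

    server {
        listen 80;

        location / {
            proxy_pass http://j2ee_origin;
            proxy_cache site;
            # Cache rendered pages for a day; the nightly re-priming described
            # above would refresh them before they expire.
            proxy_cache_valid 200 24h;
            # Keep serving stale pages if the origin is slow or down.
            proxy_cache_use_stale error timeout updating;
        }
    }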

Resources