Find the last evicted URL/key from Squid

I am trying to find the last URL/key evicted from Squid, for analysis purposes. So far I have tried the different log formats available in Squid. The Squid documentation says that cache_store_log "shows which objects are ejected from the cache, and which objects are saved and for how long", but I have not been able to identify the last evicted key from that log.
1) Please suggest other alternatives for finding the last evicted key/URL.
2) Is there a way to list all the URLs stored in Squid at a given point in time?

Related

Calculate the number of visits based on downloaded GB

I have a website hosted on Firebase that went completely viral for a day. Since I wasn't expecting that, I didn't install any analytics tool. However, I would like to know the number of visits or downloads. The only metric I have available is the GB downloaded: 686.8 GB. But I am confused, because if I open the website with the Chrome console I get two different metrics for the size of the page: 319 KB transferred and 1.2 MB of resources. Furthermore, not all of those things are transferred from Firebase; some come from other CDNs, as you can see in the screenshots. What is the proper way of calculating the visits I had?
The "transferred" metric is how much bandwidth was used after compression was applied.
The "resources" metric is how much space those resources take up before they are compressed (for transfer).
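As a very rough upper bound (assuming, purely for illustration, that a typical page load transferred about 319 KB from Firebase): 686.8 GB is roughly 686,800,000 KB, and 686,800,000 / 319 ≈ 2.15 million page loads. That figure is page loads, not unique humans; it ignores the assets served from other CDNs and it counts bots too, which is why the classification below matters.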
True analytics requires an understanding of what is out on the web. Traffic falls into three classifications:
Humans, who are flesh and blood and overwhelmingly (though not exclusively) use web browsers.
Spiders (or search engines) that request pages with the notion that they obey robots.txt and will list your website in their results for relevant search queries.
Rejects (basically spammers and the unknowns) which include (though are far from limited to) content/email scrapers, brute-force password guessers, vulnerability scanners and POST spammers.
With this classification in place, what you're asking in effect is, "How many human visitors am I receiving?" The easiest way to obtain that information is to:
Determine which user-agent requests are human (not easy; it is behavior-based).
Determine the length of time a single visit from a human should count as.
Assign human visitors a session.
I presume you understand what a cookie is and how it differs from a session cookie. Obviously, when you sign in to a website you are assigned a session; if that session cookie is not sent to the server on a page request, you are in effect signed out. You can make session cookies last a long time, and how long comes down to factors such as convenience for the visitor and whether you count those sessions directly or use them in conjunction with something else.
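As a minimal sketch of that session idea (the visits.txt counter file and the 'counted' flag are purely illustrative; a database table would be more robust), counting a visit the first time a session is seen could look like this:
<?php
session_start();
if (empty($_SESSION['counted'])) {
    $_SESSION['counted'] = true;
    // visits.txt is a hypothetical counter file; use a database table in practice
    $count = (int) @file_get_contents(__DIR__ . '/visits.txt');
    file_put_contents(__DIR__ . '/visits.txt', $count + 1, LOCK_EX);
}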
Now your next thought likely is, "But how do I count downloads?" Thankfully you mention PHP on your website, so I can give you some code that should make sense to you. If you just link directly to the file, the best you could do is count clicks via a click event on the anchor element, and if the download gets canceled because it was a mistake or something else, that makes click counting more subjective than my suggestion. Granted, my suggestion can still be subjective (e.g. they decide they don't actually want the file and cancel before completion), and whether they ever use the download is another aspect to consider. That being said, if you want the server to give you a download count, you'd want to do the following:
You may want to use Apache's rewrite module (or the equivalent for other HTTP servers) so that PHP handles the download.
You may need to ensure Apache hands such requests to PHP (e.g. AddType application/x-httpd-php5 .exe .msi .dmg) so your server knows to let PHP run for the requested file.
You'll want to use PHP's file_exists() with an absolute file path on the server for the sake of security.
You'll want to ensure that you set the correct mime for the file via PHP's header() as you should expect browsers to be horrible at guessing.
You absolutely need to use die() or exit() afterwards to avoid Gecko (Firefox) bugs: if your script leaks even whitespace, the browser will interpret it as part of the file, likely corrupting the download.
Here is the code for PHP itself:
// Derive the requested file name from the URL (everything after the last slash).
$p = explode('/', strrev($_SERVER['REQUEST_URI']));
$file = strrev($p[0]);
// $path_absolute and $mime are assumed to have been set earlier in the script.
if (!file_exists($path_absolute.$file)) {
    header('HTTP/1.1 404 Not Found');
    die();
}
header('HTTP/1.1 200 OK');
header('Content-Type: '.$mime);
echo file_get_contents($path_absolute.$file);
die();
For counting downloads, if you want to get a little fancy, you could create a couple of database tables: one for the files (download_files) and a second for the requests (download_requests). Throw in basic SQL queries and you're collecting data. Record the client IP in a form that also handles IPv6 (see "Storing IPv6 Addresses in MySQL") and you'll be able to tell from a query how many unique downloads you have.
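As a rough sketch of that idea (the DSN, credentials, table and column names, and $fileId are all illustrative, not a prescribed schema), logging one row per download request with an IPv6-safe address could look like this:
<?php
// inet_pton() returns 4 bytes for IPv4 and 16 bytes for IPv6,
// so a VARBINARY(16) column covers both.
$pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO download_requests (file_id, ip, requested_at) VALUES (:file_id, :ip, NOW())'
);
$stmt->execute([
    ':file_id' => $fileId,                            // id looked up in download_files
    ':ip'      => inet_pton($_SERVER['REMOTE_ADDR']), // binary form of the client IP
]);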
Back to human visitors: it takes a very thorough study to understand the differences between humans and bots. Things like Captcha are garbage and are utterly annoying. You can get a rough start by requiring a cookie to be sent back on requests though not all bots are ludicrously stupid. I hope this at least gets you on the right path.

How to circumvent FTP server slowdown

I have a problem with an FTP server that slows dramatically after returning a few files.
I am trying to access data from a government server at the National Snow and Ice Data Center, using an R script and the RCurl library, which is a wrapper for libcurl. The line of code I am using is this (as an example for a directory listing):
getURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/")
or this example, to download a particular file:
getBinaryURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/MOD10A2.A2013209.h26v04.005.2013218193414.hdf")
I have to make the getURL() and getBinaryURL() requests frequently because I am picking through directories looking for particular files and processing them as I go.
In each case, the server very quickly returns the first 5 or 6 files (which are ~1 Mb each), but then my script often has to wait for 10 minutes or more until the next files are available; in the meantime the server doesn't respond. If I restart the script or try curl from the OSX Terminal, I again get a very quick response for the first few files, then a massive slowdown.
I am quite sure that the server's behavior has something to do with preventing DoS attacks or limiting bandwidth used by bots or ignorant users. However, I am new to this stuff and I don't understand how to circumvent the slowdown. I've asked the people who maintain the server, but I don't have a definitive answer yet.
Questions:
Assuming for a moment that this problem is not unique to the particular server, would my goal generally be to keep the same session open, or to start new sessions with each FTP request? Would the server be using a cookie to identify my session? If so, would I want to erase or modify the cookie? I don't understand the role of handles, either.
I apologize for the vagueness but I'm wandering in the wilderness here. I would appreciate any guidance, even if it's just to existing resources.
Thanks!
The solution was to release the curl handle after making each FTP request. However, that didn't work at first because R was hanging onto the handle even though it had been removed. The solution (provided by Bill Dunlap on the R help list) was to call garbage collection. In summary, the successful code looked like this:
for (file in filelist) {
  curl <- getCurlHandle()               # create a new curl handle
  getURL(url = file, curl = curl, ...)  # download the file
  rm(curl)                              # remove the curl handle
  gc()                                  # the magic call to garbage collection, without which the above does not work
}
I still suspect that there may be a more elegant way to accomplish the same thing using the RCurl library, but at least this works.

System.Security.Cryptography.CryptographicException: Not enough storage is available to process this command

Our ASP.NET app was working fine, then the DBA decided to encrypt the DB password in the web.config. Now I'm getting this error:
System.Security.Cryptography.CryptographicException: Not enough storage is available to process this command.
There is only one other question on SO that lists this error, and that user resorted to a refactor instead of identifying a solution.
The weird thing is that we have plenty of space (RAM, HDD, etc.). Even weirder, three of the people on my team don't have this problem (with the exact same URL). Another guy had it yesterday, but it works for him today.
I'm worried about when we move this to prod, especially if this needs some kind of incremental storage or permissions for EACH user.
Edit: The other error that seems to show up is:
"Failed to decrypt using provider 'RsaProtectedConfigurationProvider'"
It turns out that this is a generic error message that appears whenever the server has trouble decrypting with RSA. Not very helpful, because it is misleading at worst and very vague at best.
In our case, the error was only happening to me because our dev servers are load-balanced (which I didn't know until today). The encryption key was generated on one machine (server1) and installed on both servers. Whenever I got load-balanced onto server2, I saw this error (as would anyone else hitting server2).
The solution is to export the RSA key container (including the private key) from server1 and import it onto server2; this is typically done with aspnet_regiis (-px with -pri to export, -pi to import), though the exact steps depend on your setup.

How to get rid of ConflictError on ZEO workers?

Looking at my ZEO workers' logs I see quite a lot of:
2013-10-18T11:59:54 INFO ZPublisher.Conflict ConflictError at
/VirtualHostBase/http/www.domain.com:80/Plone/VirtualHostRoot/:
database conflict error (oid 0x533cd5, class
persistent.mapping.PersistentMapping) (78 conflicts (0 unresolved)
since startup at Mon Oct 14 04:09:45 2013)
As these are logged as INFO, should I assume they are not harmful at all?
And I guess that if there are conflicts it is because there are too many writes to the ZODB?
The conflicts are indeed caused by two requests trying to change the same PersistentMapping at the same time. One of them is then forced to retry the commit.
Use these entries to pinpoint bottlenecks in your application; perhaps replace the specific mapping with a BTrees.OOBTree, which minimizes conflicts by spreading key-value pairs out over separate persistent buckets.
Without traffic data, and without knowing what that specific PersistentMapping holds or what your application does with it, it is impossible to say whether 78 conflicts in 4 days is a lot or a little, or whether it is worth your while switching to a different container.
Conflict errors are not, in themselves, harmful: Zope will automatically retry the request a few times to resolve the error. But they are a sign of write contention in the database, and a lot of them indicates that you have a bottleneck in your current configuration; your users will soon be complaining of poor performance.
You should probably begin analysis to determine if you've some add-on package that's doing excessive or very inefficient writes to the database. The worst case, for example, would be some code that's trying to write to the database on every page load like a traffic logger. The ZODB is optimized for reading, not writing, and those operations should be redesigned to put their data stores somewhere other than the ZODB.
If it's just content writes that are the problem, look to reduce catalog indexes and metadata. If at all possible, replace old Archetypes-style content with Dexterity content types. Dexterity is far more efficient in content creation.

Multiple requests to server question

I have a DB with user accounts information.
I've scheduled a cron job which updates the DB with any new user data it fetches from their accounts.
I was thinking that this may cause a problem, since all requests come from the same IP address and the server may block requests from that IP.
Is this the case?
If so, how do I avoid being banned? Should I be using a proxy?
Thanks
You get banned for suspicious (or malicious) activity.
If you are running a normal business application inside a normal company intranet you are unlikely to get banned.
Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.
should I be using a proxy?
A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.
Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?
If this is the case, you may want to set up a few different cron jobs and do the work in batches. That way you reduce the load on the remote server and lower the chance that wherever you are getting this data from will block your access.
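If it is screen scraping, a rough sketch of the batching idea (purely illustrative: $allAccountIds and fetchAccount() stand in for your own account list and fetch logic) might look like this:
<?php
// Split the accounts into batches and let each cron run handle only one batch,
// spreading the load on the remote server across the day.
$batches    = array_chunk($allAccountIds, 100);     // e.g. 100 accounts per run
$batchIndex = ((int) date('G')) % count($batches);  // pick a batch by the hour of day
foreach ($batches[$batchIndex] as $accountId) {
    $data = fetchAccount($accountId);               // your remote fetch / scrape
    // ... update the local DB with $data ...
    sleep(2);                                       // brief pause between requests
}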
Edit
Okay, so if you have not got permission to do scraping, you are obviously going to want to do it responsibly (no matter the site). Try to gather as much data as you can from as few requests as possible, and spread them out over the course of the whole day, or even at times that are likely to be low load. I wouldn't try to use a proxy; that wouldn't really help the remote server, and it would be a pain in the ass for you.
I'm no iPhone programmer, and this might not be possible, but you could try having the individual iPhones grab the data so all the source traffic isn't from the same IP. Just an idea; otherwise, just try to be a bit discreet.
Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site; a minimal example request that follows the first two tips is sketched after the list.
Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.
Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."
Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??
Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.
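A rough sketch of such a request using PHP's curl extension (the URL, bot name and contact address are placeholders, not anything prescribed by SO):
<?php
$ch = curl_init('https://example.com/feed.json');    // prefer a JSON/RSS endpoint over scraping HTML
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING       => 'gzip',                 // request and transparently decode gzip responses
    CURLOPT_USERAGENT      => 'MySyncBot/1.0 (+https://example.com/bot; admin@example.com)',
    CURLOPT_TIMEOUT        => 30,
]);
$body = curl_exec($ch);
curl_close($ch);
// ...process $body, then wait (e.g. 15 minutes or more) before the next pull...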
