How to circumvent FTP server slowdown - r

I have a problem with an FTP server that slows dramatically after returning a few files.
I am trying to access data from a government server at the National Snow and Ice Data Center, using an R script and the RCurl library, which is a wrapper for libcurl. The line of code I am using is this (as an example for a directory listing):
getURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/")
or this example, to download a particular file:
getBinaryURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/MOD10A2.A2013209.h26v04.005.2013218193414.hdf")
I have to make the getURL() and getBinaryURL() requests frequently because I am picking through directories looking for particular files and processing them as I go.
In each case, the server very quickly returns the first 5 or 6 files (which are ~1 MB each), but then my script often has to wait for 10 minutes or more until the next files are available; in the meantime the server doesn't respond. If I restart the script or try curl from the OS X Terminal, I again get a very quick response for the first few files, then a massive slowdown.
I am quite sure that the server's behavior has something to do with preventing DOS attacks or limiting bandwidth used by bots or ignorant users. However, I am new to this stuff and I don't understand how to circumvent the slowdown. I've asked the people who maintain the server but I don't have a definitive answer yet.
Questions:
Assuming for a moment that this problem is not unique to the particular server, would my goal generally be to keep the same session open, or to start new sessions with each FTP request? Would the server be using a cookie to identify my session? If so, would I want to erase or modify the cookie? I don't understand the role of handles, either.
I apologize for the vagueness but I'm wandering in the wilderness here. I would appreciate any guidance, even if it's just to existing resources.
Thanks!

The solution was to release the curl handle after making each FTP request. However, that didn't work at first because R was hanging onto the handle even though it had been removed. The solution (provided by Bill Dunlap on the R help list) was to call garbage collection. In summary, the successful code looked like this:
for (file in filelist) {
  curl <- getCurlHandle()               # create a new curl handle
  getURL(url = file, curl = curl, ...)  # download the file
  rm(curl)                              # remove the handle
  gc()                                  # the magic call to garbage collection, without which the above does not work
}
I still suspect that there may be a more elegant way to accomplish the same thing using the RCurl library, but at least this works.

Related

Streaming stdout to a web page

This seems like it should be a really simple thing to achieve, unfortunately web development was never my strong point.
I have a bunch of scripts, and I would like to launch them from a webpage and see the realtime stdout text on the page. Some of the scripts take a long time to run so the normal single response isn't good enough (I have this working already).
As far as I can see, my options are
stdout to a file, and periodically (every couple of seconds) send a request from the client and respond with the contents of this file.
Chunked HTTP responses? I'm not sure if this is what they are used for- I tried to implement this already but I think I may be misunderstanding their purpose.
Websockets (I'm using a Luvit server so this isn't an option).
... Something else?
I'm sure there must be a standard way of achieving this; I see other sites doing it all the time. TeamCity, for example. Or chat rooms (vanilla TCP sockets?).
Any pointers in the right direction appreciated. Simplest method possible, if that's just sending lots of scheduled requests from the client then so be it.
That heavily reminds me of Common Gateway Interfaces.
Your own ideas all sound like the right direction. Since you are using a shell script, with some potentially nontrivial interactions with the web server, I think it makes sense to point out where to dig for examples of this kind of code, which was common a long time ago and was very error-prone, basically always.
Practically, your script is a CGI script, doing typical things.
In the earlier days and years of the internet, that was the "normal way" to implement web pages that are not just static files (HTML or others).
The page is basically implemented as a shell script (or any other program reading from stdin and writing to stdout).
Part of what you are doing/proposing is very similar, and I think there are useful lessons to learn from old CGI code.
For example, getting the buffering right from inside the script, over stdout, through the web server, and onto the client's page can be tricky.
So digging out old examples could help a lot.
(Much of this may be obvious to you, the OP, personally, so take the "you" as potential reader)
The tricky part in general will be the buffering, I expect. If you are used to explicitly handling stdin/stdout buffers in the shell for programs that do not support it, you can imagine the kind of things to expect. If you are not used to it: I remember CGI being worse, as you have to get the buffering of the HTTP server in sync too (let's hope it is handled automatically), so maybe start asking questions and digging for examples early.
The CGI-style way would be exactly what you have implemented now, and if the buffering is right, that should be as real-time as it can get. But I understand that you get timeouts because of the long runtime? Or do you have strongly varying runtimes?
In terms of getting it as real-time as possible, there is nothing better than writing stdout to the http stream.
(I assume we accept the overhead of going through a HTTP server.)
Also, I'm thinking of line buffering, so not flushing every char - is that good enough for the use case? (i.e. no animated progress indicator lines/ ANSI escapes that you want to see in real time)
Then maybe the best approach is to work around issues like timeouts but keep the concept. If real time is not that important, other approaches may be better in many ways, of course. One point is that other methods could be required for scalability.
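The "write stdout straight to the HTTP stream" idea can be sketched in Python (chosen only for illustration; the original scripts and the Luvit server are not Python). This is a minimal sketch, not a full server: `stream_lines` and `encode_chunk` are hypothetical helper names, and it assumes line buffering is good enough and that the child process flushes its own output (for example by running it unbuffered).

```python
import subprocess


def stream_lines(cmd):
    """Yield each line of cmd's stdout as soon as the child writes it.

    Assumption: the child flushes line by line. Many programs switch to
    block buffering when stdout is a pipe, which defeats real-time output.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors show up too
        text=True,
        bufsize=1,                 # line-buffered on our side of the pipe
    )
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()


def encode_chunk(data: bytes) -> bytes:
    """Frame data as one HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    return b"%x\r\n%s\r\n" % (len(data), data)


# Zero-length chunk that terminates a chunked response body.
LAST_CHUNK = b"0\r\n\r\n"
```

A handler that has sent `Transfer-Encoding: chunked` would write `encode_chunk(line.encode())` and flush for every line from `stream_lines(...)`, then finish with `LAST_CHUNK`. Whether the web server in between passes each chunk on immediately is exactly the buffering question raised above.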

ASP.NET page to reflect server status

I'm looking to create a webpage that will reflect the status of one of my company's servers automatically. Frequently there will be a minor error that only lasts 2-3 minutes, and it would be great to have this reflected on a self-generated page, which might prevent 50-60 unhappy clients from calling in simultaneously and asking what's wrong.
I'm not quite sure where to begin - would anyone have any suggestions for good resources to study? Programming examples? I'm not referring to the basics of writing an ASP.NET page, of course, but rather process interaction in Windows.
Thanks.
To pull this off, you'd need a separate page that essentially runs server diagnostics; otherwise the page wouldn't know if the server was up or down. Also, the page would need to be isolated from the sort of problems that kill other people's requests, such as cache hit problems, memory starvation, high CPU usage, or insufficient bandwidth. So ideally the diagnostics would run in a separate app-pool, a separate virtual directory, or on a separate machine.
Many of the interesting diagnostics would require a WMI call, but some you can get from the My.Computer namespace.
Also, are you going to do this on every server, or do you want one web server to display the status of several different servers?
It also depends on the type of errors your servers are encountering.
If they are going down completely, or are losing internet connection, then pinging them after an interval of time will let you know if they are up or not.
If you have a specific process running on a server that becomes unavailable, that can be a little more tricky.
Your best bet is to find a way to make a simple request to the services/applications that are important and see if you get a response. If you do, the server is likely up; if not, it is likely down.
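As a rough illustration of "make a simple request and see if you get a response", here is a minimal Python sketch (the question is about ASP.NET, so treat this as language-neutral pseudocode that happens to run). It only checks whether a TCP port accepts connections; `is_port_open` and `check_services` are hypothetical names, and the hosts and ports you would pass in are assumptions about your own setup.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


def check_services(targets, timeout=3.0):
    """Map each service name to an up/down boolean.

    targets is a dict like {"name": (host, port)}.
    """
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in targets.items()
    }
```

A status page could call `check_services` on a timer and render the result. An open port only shows that something is listening, so a real application-level request is a stronger check where one is available.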
Anything you can do to reduce the number of support calls you get is a good idea, but I'd also focus some time and try to figure out why your servers are going down so often.
Also, telling your users that the server is down, but not giving a reason why may not give the effect you are looking for. Users will still be confused and frustrated when they can't get their work done.
I know you were looking to build a webpage to display the server diagnostics, but there are plenty of server monitoring tools that produce webpages for an easy dashboard view of the history.
A quick google returned the following link:
http://www.webdesignbooth.com/10-really-useful-server-monitoring-tools/

How do I email myself data from a R script?

I'm hoping to take advantage of Amazon spot instances which come at a lower cost but can terminate anytime. I want to set it up such that I can send myself data mid-way through a script so I can pick up from there in the future.
How would I email myself a .rdata file?
difficulty: The ideal solution will not involve RCurl since I am unable to install that package on my machine instance.
The same way you would on the command-line -- I like the mpack binary for that which you find in Debian and Ubuntu.
So save data to a file /tmp/foo.RData (or generate a temporary name) and then
system("mpack -s Data /tmp/foo.RData you@some.where.com")
in R. That assumes the EC2 instance has mail setup, of course.
Edit: Per request for a windoze alternative: blat has been recommended by others for this task.
There is a good article on this in R News from 2007. Amongst other things, the author describes some tactics for catching errors as they occur, and automatically sending email alerts when this happens -- helpful for long simulations.
Off topic: the article also gives tips about how the linux/unix tools screen and make can be very useful for remote monitoring and automatic error reporting. These may also be relevant in cases when you are willing to let R email you.
What you're asking is probably best solved not by email but by using an EBS volume. The volume will persist regardless of the instance (note though that I'm referring to an EBS volume as opposed to an EBS-backed instance).
In another question, I mention a bunch of options for checkpointing and related tools, if you would like to use a separate function for storing your data during the processing.

Multiple requests to server question

I have a DB with user accounts information.
I've scheduled a CRON job which updates the DB with every new user data it fetches from their accounts.
I was thinking that this may cause a problem since all requests are coming from the same IP address and the server may block requests from that IP address.
Is this the case?
If so, how do I avoid being banned? should I be using a proxy?
Thanks
You get banned for suspicious (or malicious) activity.
If you are running a normal business application inside a normal company intranet you are unlikely to get banned.
Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.
should I be using a proxy?
A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.
Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?
If this is the case, you may want to set up a few different cron jobs and do it in batches. That way you reduce the load on the remote server and lower the chance that the site you are getting this data from blocks your access.
Edit
Okay, so if you have not got permission to do scraping, obviously you are going to want to do it responsibly (no matter the site). Try to gather as much data as you can from as few requests as possible, and spread them out over the course of the whole day, or even during times that are likely to be low load. I wouldn't try to use a proxy; that wouldn't really help the remote server, but it would be a pain in the ass for you.
I'm no iPhone programmer, and this might not be possible, but you could try having the individual iPhones grab the data so all the source traffic isn't from the same IP. Just an idea; otherwise just try to be a bit discreet.
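One way to picture "do it in batches and spread them over the day" is the following minimal Python sketch. `plan_batches` is a hypothetical helper, and the batch size and time window are assumptions you would tune to whatever the remote site tolerates.

```python
def plan_batches(items, batch_size, window_seconds):
    """Split items into batches and space their start times evenly.

    Returns a list of (start_offset_seconds, batch) pairs covering the
    window, so no single burst of requests hits the remote server.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = [
        items[i:i + batch_size]
        for i in range(0, len(items), batch_size)
    ]
    if not batches:
        return []
    gap = window_seconds / len(batches)
    return [(round(i * gap), batch) for i, batch in enumerate(batches)]
```

For example, five accounts in batches of two over 24 hours (86400 seconds) gives three batches starting at offsets 0, 28800 and 57600 seconds. Each offset could become its own cron entry or a sleep inside one long-running job.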
Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site.
Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.
Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."
Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??
Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.
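The first two tips (ask for gzip, identify yourself) can be sketched with Python's standard library. This is only an illustration under assumptions: the bot name `example-fetcher/0.1`, the contact URL, and the function names are all made up, and the remote site's own rules still apply.

```python
import gzip
import urllib.request


def build_request(url, contact):
    """Build a request that asks for gzip and carries an identifying User-Agent.

    contact is a URL or address where the site's operators can reach you.
    """
    return urllib.request.Request(url, headers={
        "Accept-Encoding": "gzip",
        "User-Agent": "example-fetcher/0.1 (+%s)" % contact,
    })


def decode_body(raw, content_encoding):
    """Decompress the body if the server actually answered with gzip."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url, contact, timeout=30.0):
    """Fetch url politely and return the decoded body as bytes."""
    request = build_request(url, contact)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

`urllib` does not decompress responses on its own, which is why `decode_body` checks the `Content-Encoding` header instead of assuming the server honoured the request.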

Need to check uptime on a large file being hosted

I have a dynamically generated rss feed that is about 150M in size (don't ask)
The problem is that it keeps crapping out sporadically and there is no way to monitor it without downloading the entire feed to get a 200 status. Pingdom times out on it and returns a 'down' error.
So my question is, how do I check that this thing is up and running
What type of web server, and server side coding platform are you using (if any)? Is any of the content coming from a backend system/database to the web tier?
Are you sure the problem is not with the client code accessing the file? Most clients have timeouts and downloading large files over the internet can be a problem depending on how the server behaves. That is why file download utilities track progress and download in chunks.
It is also possible that other load on the web server or the number of users is impacting the server. With little memory available, certain servers may not be able to serve a file of that size to many users. You should review how the server is sending the file and make sure it is chunking it up.
I would recommend that you do a HEAD request to check, at minimum, that the URL is accessible and that the server is responding. The next step might be to set up your download test inside or very close to the data center hosting the file to monitor further. This may reduce cost and is going to reduce interference.
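A HEAD check like the one recommended here could look as follows in Python. This is a minimal sketch using only the standard library; `head_status` is a hypothetical name, and it assumes the server answers HEAD for the feed URL, which not every dynamically generated endpoint does.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=10.0):
    """Send a HEAD request and return the HTTP status code.

    Returns None if the server cannot be reached at all. No response
    body is transferred, so the size of the file does not matter.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()
```

A monitor would treat a 200 as "up" and anything else, including `None`, as worth alerting on. Note that a dynamically generated feed may still do all its work before answering a HEAD, so a timeout remains possible.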
Found an online tool that does what I needed
http://wasitup.com uses head requests so it doesn't time out waiting to download the whole 150MB file.
Thanks for the help BrianLy!
Looks like pingdom does not support the head request. I've put in a feature request, but who knows.
I hacked this capability into mon for now (mon is a nice compromise between paying someone else to monitor and doing everything yourself). I have switched entirely to https, so I modified the https monitor to do it. I did it the dead-simple way: copied the https.monitor file and called it https.head.monitor. In the new monitor file I changed the following line (you might also want to update the function name and the place where it's called):
get_https to head_https
Now in mon.cf you can call a head request:
monitor https.head.monitor -u /path/to/file
