Parsing SEC Edgar fundamental data - edgar

I'm planning on downloading fundamental data (cash flow, income statement, balance sheet, etc.) from SEC EDGAR for all of the stocks in the S&P 500 index. Does anyone know if there is an upper limit on the total amount of data/files that can be downloaded over FTP? Is there a daily limit on how much data can be downloaded?
Thanks in advance.

I hope you were able to get the data/files you needed -- time is running out for FTP. The SEC has announced it will permanently terminate FTP service at the end of the year.
https://www.sec.gov/edgar/searchedgar/ftpusers.htm
In the meantime it sounds like they just ask you to be courteous:
To preserve equitable server access, we ask that bulk FTP transfer requests be performed between 9 PM and 6 AM Eastern time. Please use efficient scripting, downloading only what you need and space out requests to minimize server load.

I use a base URL like this to download filings, since FTP is shut down:
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK="+str(cik)+"&type="+str(type)+"&dateb="+str(priorto)+"&owner=exclude&output=xml&count="+str(count)
You just need to fill in cik, type, priorto, and count with the relevant values.
I haven't run into any limits, except that the SEC blocks me if I send multiple requests simultaneously. To avoid that, I put a sleep(0.5) between requests.
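As a minimal sketch, assuming the requests library and made-up values for the CIK, filing type, date, and count (the SEC also asks automated clients to identify themselves via the User-Agent header):

import time
import requests

cik = "0000320193"        # example CIK (Apple); substitute the company you want
filing_type = "10-K"      # example filing type
priorto = "20170101"      # example "filed before" date
count = 40                # example number of filings to list

base_url = (
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany"
    f"&CIK={cik}&type={filing_type}&dateb={priorto}"
    f"&owner=exclude&output=xml&count={count}"
)

headers = {"User-Agent": "your-name your-email@example.com"}  # identify yourself
response = requests.get(base_url, headers=headers)
response.raise_for_status()
print(response.text[:500])   # XML listing of matching filings

time.sleep(0.5)              # pause before the next request, as suggested above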

Regarding the shutdown of FTP: you can replace the ftp://ftp.sec.gov/ part of your URI with https://www.sec.gov/Archives/ and continue downloading in pretty much the same way as before.
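For example, a one-line rewrite (the filing path below is just a placeholder, not a real accession number):

old_uri = "ftp://ftp.sec.gov/edgar/data/320193/0000320193-17-000000.txt"   # hypothetical path
new_uri = old_uri.replace("ftp://ftp.sec.gov/", "https://www.sec.gov/Archives/")
# -> https://www.sec.gov/Archives/edgar/data/320193/0000320193-17-000000.txt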

Related

How to circumvent FTP server slowdown

I have a problem with an FTP server that slows dramatically after returning a few files.
I am trying to access data from a government server at the National Snow and Ice Data Center, using an R script and the RCurl library, which is a wrapper for libcurl. The line of code I am using is this (as an example for a directory listing):
getURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/")
or this example, to download a particular file:
getBinaryURL(url="ftp://n5eil01u.ecs.nsidc.org/SAN/MOST/MOD10A2.005/2013.07.28/MOD10A2.A2013209.h26v04.005.2013218193414.hdf")
I have to make the getURL() and getBinaryURL() requests frequently because I am picking through directories looking for particular files and processing them as I go.
In each case, the server very quickly returns the first 5 or 6 files (which are ~1 MB each), but then my script often has to wait 10 minutes or more until the next files are available; in the meantime the server doesn't respond. If I restart the script or try curl from the OS X Terminal, I again get a very quick response for the first few files, then a massive slowdown.
I am quite sure that the server's behavior has something to do with preventing DOS attacks or limiting bandwidth used by bots or ignorant users. However, I am new to this stuff and I don't understand how to circumvent the slowdown. I've asked the people who maintain the server but I don't have a definitive answer yet.
Questions:
Assuming for a moment that this problem is not unique to the particular server, would my goal generally be to keep the same session open, or to start new sessions with each FTP request? Would the server be using a cookie to identify my session? If so, would I want to erase or modify the cookie? I don't understand the role of handles, either.
I apologize for the vagueness but I'm wandering in the wilderness here. I would appreciate any guidance, even if it's just to existing resources.
Thanks!
The solution was to release the curl handle after making each FTP request. However, that didn't work at first because R was holding onto the handle even though it had been removed. The fix (provided by Bill Dunlap on the R-help list) was to call garbage collection explicitly. In summary, the successful code looked like this:
library(RCurl)
for(file in filelist){
  curl <- getCurlHandle()               # create a new curl handle for this request
  getURL(url = file, curl = curl, ...)  # download the file (other options elided)
  rm(curl)                              # remove the curl handle
  gc()                                  # the magic call to garbage collection, without which the above does not work
}
I still suspect that there may be a more elegant way to accomplish the same thing using the RCurl library, but at least this works.

How to detect a reasonable number of concurrent requests I can safely perform on someone's server?

I crawl some data from the web because there is no API. Unfortunately, it's quite a lot of data from several different sites, and I quickly learned that I can't just make thousands of requests to the same site in a short while... I want to fetch the data as fast as possible, but I don't want to cause a DoS attack :)
The problem is that every server has different capabilities and I don't know them in advance. The sites belong to my clients, so my intention is to prevent any possible downtime caused by my script. So no policy like "I'll try a million requests first, and if that fails, I'll try half a million, and if that fails..." :)
Is there any best practice for this? How does Google's crawler know how many requests it can make to the same site in the same period? Maybe they "shuffle their playlist", so there are not as many concurrent requests to a single site. Could I detect this somehow via HTTP? Wait on a single request, measure the response time, roughly guess how well provisioned the server is, and then somehow work out a maximum number of concurrent requests?
I use a Python script, but this doesn't matter much for the answer - just to let you know in which language I'd prefer your potential code snippets.
The Google spider is pretty damn smart. On my small site it hits me with one page per minute, to the second. They obviously have a page queue that is filled keeping timing and sites in mind. I also wonder whether they are smart enough not to hit multiple domains on the same server -- so some recognition of IP ranges as well as URLs.
Separating the job of queueing up the URLs to be spidered at a specific time from the actual spidering would be a good architecture for any spider. All of your spiders could use a urlToSpiderService.getNextUrl() method which would block (if necessary) until the next URL is due to be spidered.
I believe that Google looks at the number of pages on a site to determine the spider speed. The more pages you have to refresh in a given time, the faster they need to hit that particular server. You certainly should be able to use that as a metric, although before you've done an initial crawl it would be hard to determine.
You could start out at one page every minute and then as the pages-to-be-spidered for a particular site increases, you would decrease the delay. Some sort of function like the following would be needed:
public Duration delayBetweenPages(String domain) {
    // number of pages in the to-do queue for this domain (toDoQueue is an assumed field)
    long pages = Math.max(1, toDoQueue.countFor(domain));
    // spread them over the overall refresh period you want to complete in (assumed field)
    Duration delay = refreshPeriod.dividedBy(pages);
    // if more than a minute, just return a minute; if less than some minimum, return the minimum
    if (delay.compareTo(Duration.ofMinutes(1)) > 0) return Duration.ofMinutes(1);
    if (delay.compareTo(minimumDelay) < 0) return minimumDelay;
    return delay;
}
Could I detect this stuff somehow via HTTP?
With the modern internet, I don't see how you can. Certainly, if the server is taking a couple of seconds to respond or is returning 500 errors, you should throttle way back; but a typical connection and download is sub-second these days for a large percentage of servers, and I'm not sure there is much to be learned from any stats in that area.
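Since the asker mentioned Python, here is a minimal sketch of that kind of adaptive throttling: start slow, back off hard on errors or slow responses, and only speed up cautiously. The thresholds, delays, and timeout are assumptions for illustration, not measured values.

import time
import requests

def polite_fetch(urls, start_delay=5.0, min_delay=0.5, max_delay=120.0):
    """Fetch URLs one at a time, adapting the delay to how the server responds."""
    delay = start_delay
    for url in urls:
        t0 = time.monotonic()
        try:
            resp = requests.get(url, timeout=30)
            elapsed = time.monotonic() - t0
            if resp.status_code >= 500 or elapsed > 2.0:
                delay = min(delay * 2, max_delay)    # server struggling: back off
            else:
                delay = max(delay * 0.9, min_delay)  # healthy: speed up slowly
            yield url, resp
        except requests.RequestException:
            delay = min(delay * 2, max_delay)        # network error: back off too
            yield url, None
        time.sleep(delay)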

Building ASP.Net web page to test user's connection (bandwidth)

We need to add a feature to our website that allows the user to test his connection speed (upload/download). After some research I found that the way this is usually done is to download/upload a file and divide the file size by the time the task took, as here: http://www.codeproject.com/KB/IP/speedtest.aspx
However, this is not possible in my case; it has to be a web application, so I can't download/upload files to/from the user's machine for obvious security reasons. From what I have read, the download/upload tasks should be handled on the server, but how?
This website has exactly what I have in mind, but in PHP: http://www.brandonchecketts.com/open-source-speedtest. Unfortunately, I don't have any background in PHP and this task is really urgent.
I really don't need to use something like Speedtest.
thanks
The formula for getting the download speed in bit/s:
data size in bytes * 8 / (time in seconds to download - latency to the server in seconds).
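A quick worked example of that formula (all numbers are made up for illustration):

payload_bytes = 2_000_000   # size of the blob served for the test
elapsed_s = 1.35            # time from starting the request to receiving the last byte
latency_s = 0.05            # estimated latency to the server
speed_bps = payload_bytes * 8 / (elapsed_s - latency_s)
print(round(speed_bps / 1e6, 1), "Mbit/s")   # ~12.3 Mbit/s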
Download:
Host a blob of data (an HTML document, an image, whatever) of known size on your server, make an AJAX GET request to fetch it, and measure the time between the start of the download and when it finishes.
This can be done with JavaScript alone, but I would recommend including a timestamp for when the payload was generated, so you can estimate the latency.
Upload:
Make an AJAX POST request to the server, filled with junk data of a specific size, and measure the time it takes to complete.
List of things to keep in mind:
You will not be able to measure a higher bandwidth than your server's.
Concurrent users will drain the bandwidth; build in a limit if this will be a problem.
It is hard to build a service that measures bandwidth precisely: the one you linked to reports 5 Mbit/s to me when my actual speed is around 30 Mbit/s according to Bredbandskollen.
Other links on the subject
What's a good way to profile the connection speed of Web users?
Simple bandwidth / latency test to estimate a users experience
Note: I have not built any service like this, but it should do the trick.

Multiple requests to server question

I have a DB with user accounts information.
I've scheduled a cron job that updates the DB with any new user data it fetches from their accounts.
I was thinking that this may cause a problem since all requests are coming from the same IP address and the server may block requests from that IP address.
Is this the case?
If so, how do I avoid being banned? Should I be using a proxy?
Thanks
You get banned for suspicious (or malicious) activity.
If you are running a normal business application inside a normal company intranet you are unlikely to get banned.
Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.
should I be using a proxy?
A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.
Is the cron job that fetches data from this "database" on the same server? Or are you fetching data for a user from a remote server using screen scraping or something like that?
If it's the latter, you may want to set up a few different cron jobs and do it in batches. That way you reduce the load on the remote server and lower the chance of whoever you are getting this data from blocking your access.
Edit
Okay, so if you have not got permission to do the scraping, you will obviously want to do it responsibly (no matter the site). Try to gather as much data as you can from as few requests as possible, and spread them out over the course of the whole day, or even at times that are likely to see low load. I wouldn't try to use a proxy; that wouldn't really help the remote server, and it would be a pain in the ass for you.
I'm no iPhone programmer, and this might not be possible, but you could try having the individual iPhones grab the data so all the traffic isn't coming from the same IP. Just an idea; otherwise, just try to be a bit discreet.
Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site.
Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.
Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."
Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??
Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.
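As a minimal sketch of what the first two tips look like in practice, assuming the Python requests library (the User-Agent string, URL list, and handler are placeholders; put your own contact details in the header):

import time
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "my-data-sync-bot/1.0 (+https://example.com/bot; admin@example.com)",
    "Accept-Encoding": "gzip",   # requests decompresses gzip responses transparently
})

for url in urls_to_fetch:        # assumed to be defined elsewhere
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    process(resp.text)           # hypothetical handler for the fetched page
    time.sleep(5)                # space requests out; adjust to what the site owner considers reasonable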

Need to check uptime on a large file being hosted

I have a dynamically generated rss feed that is about 150M in size (don't ask)
The problem is that it keeps crapping out sporadically, and there is no way to monitor it without downloading the entire feed to get a 200 status. Pingdom times out on it and returns a 'down' error.
So my question is: how do I check that this thing is up and running?
What type of web server, and server side coding platform are you using (if any)? Is any of the content coming from a backend system/database to the web tier?
Are you sure the problem is not with the client code accessing the file? Most clients have timeouts, and downloading large files over the internet can be a problem depending on how the server behaves. That is why file-download utilities track progress and download in chunks.
It is also possible that other load on the web server, or the number of users, is impacting the server. If it has little memory available, it may not be able to serve a file of that size to many users at once. You should review how the server is sending the file and make sure it is chunking it up.
I would recommend that you do a HEAD request to check, at a minimum, that the URL is accessible and the server is responding. The next step might be to set up your download test inside, or very close to, the data center hosting the file. That may reduce cost and will reduce interference.
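For instance, a minimal HEAD-based check in Python might look like this (the feed URL and timeout are placeholders):

import requests

FEED_URL = "https://example.com/huge-feed.rss"   # placeholder for the real feed URL

def feed_is_up(url, timeout=10):
    """Return True if the server answers a HEAD request successfully."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.ok   # any status below 400 counts as up; the body is never downloaded
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("up" if feed_is_up(FEED_URL) else "down")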
Found an online tool that does what I needed:
http://wasitup.com uses HEAD requests, so it doesn't time out waiting to download the whole 150 MB file.
Thanks for the help BrianLy!
It looks like Pingdom does not support HEAD requests. I've put in a feature request, but who knows.
I hacked this capability into mon for now (mon is a nice compromise between paying someone else to do the monitoring and doing everything yourself). Since I have switched entirely to HTTPS, I modified the HTTPS monitor to do it, in the dead-simple way: I copied the https.monitor file and called it https.head.monitor. In the new monitor file I changed the line that says (you might also want to update the function name and the place where it's called):
get_https to head_https
Now in mon.cf you can call a head request:
monitor https.head.monitor -u /path/to/file
