Grab website content that's not in the source code - web-scraping

I want to grab some financial data from sites like http://www.fxstreet.com/rates-charts/currency-rates/
Up to now I have been using liburl to fetch the page source and a regexp search to extract the data, which I then store in a file.
There is one problem, though:
On the page as I see it in the browser, the data is updated almost every second. When I open the source code, however, the data I'm looking for changes only every two minutes.
So my program only gets the data at a much lower time resolution than is possible.
I have two questions:
(i) How is it possible that source code which remains static for two minutes produces a table that changes every second? What is the mechanism?
(ii) How do I get the data at one-second resolution, i.e. how do I read a changing table that's not shown in the source code?
thanks in advance,
David

You can use the network panel in Firebug to examine the HTTP requests being sent out (typically to fetch data) while the page is open. The page you referenced appears to be sending POST requests to http://ttpush.fxstreet.com/http_push/, then receiving and parsing a JSON response.

Try sending a POST request to http://ttpush.fxstreet.com/http_push/connect and see what you get.
It will continuously load new data.
EDIT:
You can use liburl or Python; it doesn't really matter. Under HTTP, when you browse the web, you send GET or POST requests.
Go to the website, open the Developer Tools (Chrome) or Firebug (Firefox plugin), and you will see that after all the data is loaded there is a request that doesn't close; it stays open.
When a website needs to fetch data continuously, there are a few techniques it can use:
1) Make separate requests (using AJAX) every few seconds. This opens a connection for each request, which is wasteful if you want frequent data updates.
2) Use long polling or server push: make one request that fetches the data. It stays open, and the server flushes data to the socket (to your browser) whenever it needs to. The TCP connection remains open, and when it times out you can reopen it. This is usually more efficient than the first technique, at the cost of a connection that stays open.
3) Use XMPP or some other non-HTTP protocol. This is used mainly for chat, e.g. Facebook and MSN I think, and probably Google's and some others.
The website you posted uses the second technique: when it detects a POST request to that page, it keeps the connection open and dumps data continuously.
What you need to do is make a POST request to that page. You need to see which parameters (if any) have to be sent. It doesn't matter how you make the request, as long as you send the right parameters.
You need to read the response using a delimiter: probably every time they want the client to process data, they send \n or some other delimiter.
Hope this helps. If you still can't get around this, let me know and I'll go into more technical detail.
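To make the "read the response with a delimiter" step concrete, here is a minimal Python sketch. It assumes the server sends one JSON object per newline-terminated record; that framing is a guess, as are the helper names iter_messages and read_chunks. Check the real traffic in the network panel first.

```python
import json
from typing import Iterable, Iterator


def iter_messages(chunks: Iterable[bytes], delimiter: bytes = b"\n") -> Iterator[dict]:
    """Yield one parsed JSON object per delimiter-terminated record.

    `chunks` is any iterable of byte strings, e.g. repeated reads from an
    open HTTP response. A record may be split across several chunks.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last delimiter is complete; keep the tail
        # in the buffer until the rest of that record arrives.
        *complete, buffer = buffer.split(delimiter)
        for record in complete:
            if record.strip():
                yield json.loads(record)


def read_chunks(response, size: int = 4096) -> Iterator[bytes]:
    """Read an open http.client response piece by piece until it closes."""
    while True:
        chunk = response.read1(size)  # returns as soon as some data is available
        if not chunk:
            return
        yield chunk


# Hypothetical usage (the empty POST body and the JSON framing are assumptions):
#
# import urllib.request
# req = urllib.request.Request(
#     "http://ttpush.fxstreet.com/http_push/connect", data=b"", method="POST")
# with urllib.request.urlopen(req) as resp:
#     for message in iter_messages(read_chunks(resp)):
#         print(message)
```

The parsing is kept separate from the network read so you can test it against captured data before pointing it at the live connection.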

Related

Can a web server begin responding before the client has sent the full request?

I am writing a web application for an academic research group. The researchers need to be able to upload large data sets (100MB - 1GB) in CSV format. I've written the server to process the data as it comes in. This means that if there is an error in the first row of the CSV, we can return an error straight away.
However, when this happens, the browser reports that "The connection was reset" or similar. Clearly, my web server is responding in a way that doesn't make sense.
If I explicitly close the HTTP request stream (this is Kotlin on the JVM, by the way) before returning the error to the browser, then the problem goes away. However, it turns out that the close implementation of the request stream first reads the whole stream to its end. So at that point the user still has to wait 30+ minutes to find out that there is an error in the first row of their CSV.
Is what I am trying to do possible? Does the HTTP protocol permit a web server, in any circumstances, to begin responding before the full request body has been sent? If not, can you suggest a workaround that would allow me to deliver a user experience where the user doesn't have to wait for the whole file to be uploaded before finding out if there are any problems?
The answer is yes: according to the HTTP spec, servers should be able to send responses early, and the client should then stop sending the request body. Most browsers, however, don't implement this correctly.
In theory, your HTTP server needs to return a 4xx error code with a response body, then reset the connection to prevent the upload from continuing in the background. See the answers linked below for a more detailed description of the issue. A couple of browser versions do support this, so if you're doing this in lab conditions where you can control the client being used, the links below will help.
https://stackoverflow.com/a/14483857/2274303
https://stackoverflow.com/a/18370751/2274303
[edit]
To answer your question about a workaround: chunking the uploads using JavaScript is a good way to mitigate internet connectivity issues, but if you want to parse the data in real time it's not as simple as arbitrarily breaking the file into pieces. You need to make sure you're not splitting the file in the middle of a line, otherwise validation will fail even if the data is valid. That brings up the issue of parsing a 1GB file in JavaScript, which isn't a good idea imo.
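The "don't split in the middle of a line" rule can be sketched in a few lines. This is Python rather than browser JavaScript, purely to illustrate the logic; line_aligned_chunks is a made-up name, and it assumes rows never contain embedded newlines inside quoted fields.

```python
import io
from typing import BinaryIO, Iterator


def line_aligned_chunks(stream: BinaryIO, target_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield chunks of roughly `target_size` bytes that end on a line break.

    Only the final chunk may lack a trailing newline, and only if the
    file itself does not end with one.
    """
    while True:
        chunk = stream.read(target_size)
        if not chunk:
            return
        # Extend the chunk to the end of the current line so no row is split.
        if not chunk.endswith(b"\n"):
            chunk += stream.readline()
        yield chunk


# Small demonstration with an in-memory "file" and a tiny chunk size.
sample = io.BytesIO(b"a,b\n1,2\n3,4\n5,6\n")
for piece in line_aligned_chunks(sample, target_size=5):
    print(piece)
```

Each chunk can then be sent as its own request and validated on arrival, so an error in the first rows comes back after the first chunk instead of after the whole upload.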
If you want to use JavaScript, continue uploading the entire file at once via an AJAX request, so you can get the response outside of the main DOM and force a redirect or cancel the upload. Depending on which JS libraries you're using, there are different ways of doing this.
None of this solves the reverse scenario. What if the file is 95% uploaded before there's an error? The researcher will need to either upload the whole thing again or edit the file to only include the rows from the error going forward. That means your application needs to support partial uploads and know to pick up where it left off. All these things are possible, but you're probably not going to find a simple workaround to get this working well.
Without understanding the dataset and what kind of validation you are doing it's hard to come up with a full solution. If parsing each row doesn't depend on the previous rows being valid, you could always upload the whole file, then display the rows with errors at the end and ask them to upload a second file with just the corrections.
The normal flow for an HTTP web server looks like this:
1) Server listens for request
2) Client creates request
3) Client sends request to server
4) Server processes request
5) Server creates response
6) Server sends response to client
7) Client processes response
The client starts the connection for communication and the server is able to respond on that connection, however if you close the connection the server will need to send a response on another connection. The browser may not allow the server to start a new connection that the client didn't request.
You may be able to respond by reading the first line and creating an error quickly, but the client will not read the response until it is done sending the request.
By sending the file in chunks or asynchronously sending lines of the file, you will be able to give feedback more immediately. You will be sending many smaller requests with the ability to respond in between.
The question was about the HTTP protocol. I feel like this would be allowed by the protocol if you wrote a custom app and web app; however, if you are using browsers then you must use HTTP as the browser vendors have implemented it. In a custom app you could check for interruptions, but most browsers will probably send the full request before listening for a response, which is also a reason AJAX took off 20 years ago.

How to efficiently stream video over HTTP directly from SQL Server?

I'm trying to implement a video-streaming service. I use ASP.NET Web API, and from what I've found, PushStreamContent is exactly what I want. It works very well: it sends HTTP response 206 (Partial Content) to the client, keeps the connection alive, and pushes (writes) streams of bytes to the output.
However, I can't scale, because I can't retrieve partial binary data from the database. For example, say I have a 300MB video in my SQL Server table (a varbinary field), and I use Entity Framework to get the record and then push it to the client using PushStreamContent.
This hugely impacts RAM: each seek action by the client uses an extra 600MB of RAM. Look at it in action:
1) First request for video
2) Second request (seeking to the middle of the video)
3) Third request (seeking into the last quarter of the video)
This cannot scale at all. With 10 users watching this movie, our server is down.
What should I do? How can I stream video directly from SQL Server table without loading the entire video into RAM with Entity Framework and then pushing it to client via PushStreamContent?
You could use the SUBSTRING function on VARBINARY fields to return portions of your data. But I suspect you'd prefer a solution that doesn't require jumping from one chunk to the next.
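A rough sketch of how the SUBSTRING idea could line up with the byte ranges the player requests. It is written in Python only to show the arithmetic; the same mapping applies in C#. The table and column names (Videos, Content, Id), the ? placeholder style and the 1MB cap are assumptions for illustration, not taken from the question.

```python
import re
from typing import Tuple

# Hypothetical table and column names; "?" placeholders as used by some drivers.
CHUNK_QUERY = "SELECT SUBSTRING(Content, ?, ?) FROM Videos WHERE Id = ?"


def range_to_substring_args(range_header: str, total_size: int,
                            max_chunk: int = 1 << 20) -> Tuple[int, int]:
    """Map an HTTP 'bytes=first-last' range to (start, length) for SUBSTRING.

    HTTP byte offsets are 0-based, SUBSTRING's start argument is 1-based.
    The length is capped at `max_chunk` so that a single request never pulls
    the whole video into memory.
    """
    match = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if not match:
        raise ValueError(f"unsupported Range header: {range_header!r}")
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else total_size - 1
    last = min(last, total_size - 1, first + max_chunk - 1)
    if first > last:
        raise ValueError("range not satisfiable")
    return first + 1, last - first + 1


# A seek to byte 1,000,000 of a 300MB video reads only one capped chunk.
print(range_to_substring_args("bytes=1000000-", 300 * 1024 * 1024))
```

The server would answer with 206 and a Content-Range header describing the bytes it actually returned, and the player asks for the next range when it needs more.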
You may also want to review this similar question.

Receiving JSON data into my Web Service

I'm hoping this will be a quick answer (probably a 'No').
I have set up a web service on Server B to receive HTTP POST data in JSON format from Server A. I don't have code level access to Server A, but I can manually trigger it to send data to my web service.
My current problem is that I have asked the Server A guys to send me a sample of what is being sent so I can program for the formats etc, but they are taking their sweet time responding.
I know the sending is working, and my WS is responding with my default return string (though Server A is seeing it as an error rather than success... I don't know what they are expecting back for a successful transmission yet).
I am wondering if it is possible to receive and analyse the data without knowing exactly what is being sent? This way I can start my next phase of coding without needing to wait for them to provide a sample. Plus, I'm not sure how much the format will change for different jobs, so it would be good to be able to accept whatever is sent and be able to look at it.
EDIT: To add more background.
Server A is a production application that we use. We have just found out that they have an API that can send data to us (HTTP POST in JSON format) each time one of our users completes whatever they are doing. We want to then store this data to build tables/stats for our clients to view (but that is another story).
You... can try putting together some dummy data, if you know enough about the type of data you'll be getting... But if you don't even know what shape it'll be, I don't know how in the world you would "analyze" it. Unless by "analyze," you mean get the size or something generic like that...
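If "analyse" just means finding out what shape the data has, one option is to accept any body, log it raw, and summarise its structure. A minimal Python sketch, assuming the body really is JSON as described; describe_shape and inspect_payload are hypothetical helper names, and the sample payload is invented.

```python
import json
from typing import Any


def describe_shape(value: Any) -> Any:
    """Replace every leaf value with its type name, keeping the nesting."""
    if isinstance(value, dict):
        return {key: describe_shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe the first element only; assumes lists are homogeneous.
        return [describe_shape(value[0])] if value else []
    return type(value).__name__


def inspect_payload(raw_body: bytes) -> Any:
    """Parse a raw POST body as JSON and return its shape.

    Raises ValueError if the body is not valid JSON, in which case the
    raw bytes are the thing to log and look at.
    """
    return describe_shape(json.loads(raw_body))


# Invented sample payload, only to show the output format.
print(inspect_payload(b'{"user": "x", "score": 1.5, "tags": ["a", "b"]}'))
```

Triggering Server A a few times with different jobs and comparing the logged shapes would show how much the format varies before the sample arrives.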

Measuring time between client starts request in browser until server gets it

Is there a way to measure this?
Certainly, for GET requests, no usable headers are sent consistently by clients.
One idea I got is to get that from query string, but is that possible? Something like (pseudo-code follows)
http://server/default.aspx?t=(new Date().getTime())
Another one that would work is to have users hit a very small page that appends a query string as above, but I wanted to avoid a redirection if possible.
(The overall goal is to gather such statistics per request. The server processing time and the server-to-client time are more doable, under some assumptions.)
Thanks in advance.
I've done this through an AJAX request after the initial page load, where you have control over the request from the very beginning. Pass the UNIX time in the query string, and when it reaches the server take the difference. I'm not familiar with IIS 7, so you'd have to make sure that time zones are accounted for. This number could be very erratic, since it's basically just measuring latency and DNS lookups, which are different for every client.
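The server side of the query-string approach might look roughly like this. It is a Python sketch, not IIS code; the parameter name t comes from the question's pseudo-code, and request_latency_ms is a made-up helper.

```python
import time
from typing import Optional
from urllib.parse import parse_qs, urlparse


def request_latency_ms(url: str, now_ms: Optional[float] = None) -> Optional[float]:
    """Return server receive time minus the client's `t` timestamp, in ms.

    Returns None if `t` is missing or not a number. `now_ms` can be passed
    in explicitly, which makes the function easy to test.
    """
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    try:
        sent_ms = float(values[0])
    except ValueError:
        return None
    if now_ms is None:
        now_ms = time.time() * 1000.0  # milliseconds since the Unix epoch
    return now_ms - sent_ms


# The client would build the URL with something like "?t=" + new Date().getTime()
print(request_latency_ms("http://server/default.aspx?t=1000", now_ms=1250.0))
```

JavaScript's getTime() and Python's time.time() both count from the Unix epoch in UTC, so in this sketch the time zone drops out. What remains is any difference between the client's clock and the server's clock, which can easily be larger than the latency being measured.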
Does the request start from an initial page that you control? In that case you can send the server time with the initial response, and then increment that time second by second with JavaScript while the user is on the page. This way you have a server-synchronized time on the page, and when a request is sent from that page you can send the synced time with it. Then all you need to do is calculate the difference on the server.

Is there a way using HTTP to allow the server to update the content in a client browser without client requesting for it?

It is quite easy to update the interface by sending jQuery ajax request and updating with new content. But I need something more specific.
I want to send a response to the client without the client having requested it, and update the content when something new is found on the server. There should be no need to send an AJAX request every time: when the server has new data, it sends a response to every client.
Is there any way to do this using HTTP or some specific functionality inside the browser?
Websockets, Comet, HTTP long polling.
It is called server push (you can also find it under the name Comet). Search using these keywords and you will find a bunch of examples, tools and so on. No special protocol is required for that.
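To show what long polling amounts to on the server side, here is a small in-process Python sketch. The class and method names (UpdateChannel, publish, wait_for_updates) are invented for illustration. In a real application each pending HTTP request would block in wait_for_updates and be answered as soon as publish is called.

```python
import threading
from typing import List, Tuple


class UpdateChannel:
    """Holds a growing list of updates; readers block until something new arrives."""

    def __init__(self) -> None:
        self._updates: List[str] = []
        self._cond = threading.Condition()

    def publish(self, update: str) -> None:
        """Store a new update and wake every waiting long-poll request."""
        with self._cond:
            self._updates.append(update)
            self._cond.notify_all()

    def wait_for_updates(self, seen: int, timeout: float = 30.0) -> Tuple[List[str], int]:
        """Block until there are updates past index `seen`, or until `timeout`.

        Returns the new updates (possibly empty after a timeout) and the
        cursor the client should send with its next request.
        """
        with self._cond:
            self._cond.wait_for(lambda: len(self._updates) > seen, timeout=timeout)
            new = self._updates[seen:]
            return new, seen + len(new)


# One simulated client waits while another thread publishes shortly afterwards.
channel = UpdateChannel()
threading.Timer(0.1, channel.publish, args=("price changed",)).start()
print(channel.wait_for_updates(seen=0, timeout=5.0))
```

The client simply re-issues the request with the returned cursor after every response or timeout, so from the browser's point of view the server appears to push.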
Aaah! You are trying to break the principles of the web :) You see, if the web was pure MVC (model-view-controller), the 'server' could actually send messages to the client(s) and ask them to update. The issue is that the server could be load balanced and the same request could be sent to different servers. Now if you were to send a message back to the client, you'd have to know everyone who is connected to the server. Let's say the site is quite popular and you have about 100,000 people connecting to it every day. You'd actually have to store the IPs of each of them to know where on the internet they are located and to be able to "push" them a message.
Caveats:
What if they are no longer browsing your website? You see currently there is no way to log out automatically if you close your browser. The server needs to check after a fixed timeout if you have logged out (or you send a new nonce with every response to prevent the server from doing that check)
What about a system restart/crash etc? You'd lose all the IPs that you were keeping track of and you are back to square one - you have people connected to you but until you receive new requests you can't really "send" them data when they may be expecting it as per your model.
Let's take the example of Facebook's news feed or the "Most recent" link close to the top right: sometimes while you are browsing your wall you see that the number next to "Most recent" has gone up, or a new 'feed' has come to the top of your wall! It's the client sending periodic requests to the server to find out what was updated, rather than the other way round.
You see, it keeps it simple and RESTful. You may feel it's inefficient for the client to "poll" the server to pull the data and you'd prefer push, but the design of the server gets simplified :)
I suggest AJAX pulling is the best way to go: you are distributing computation to the client and keeping it simple (KIS principle :)
Of course you can get around it, the question is, is it worth it?
Hope this helps :)
RFC 6202 might be a good read.

Resources