How to crawl a wordpress blog? - wordpress

I write a c program to crawl blogs. It works well until it meets this blog: www.ipujia.com. I send the HTTP request:
GET http://www.ipujia.com/ HTTP/1.0
to the website and get the response as below:
HTTP/1.1 301 Moved Permanently
Date: Sun, 27 Feb 2011 13:15:26 GMT
Server: Apache/2.2.16 (Unix) mod_ssl/2.2.16 OpenSSL/0.9.8e-fips-rhel5
mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4
Perl/v5.8.8
X-Powered-By: PHP/5.2.14
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
Last-Modified: Sun, 27 Feb 2011 13:15:27 GMT
Location: http://http/www.ipujia.com/
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
This is strange because I cannot get the index page following the Location. Does anyone have any ideas?

The Location field in the response contains a malformed URI.
Location: http://http/www.ipujia.com/ (notice the protocol error)
Should be
Location: http://www.ipujia.com/
Unless you are in control of the server there is little you could do here.
To solve it could you not parse the "Location" response and attempt to extract a valid URI from the it?

Related

HTTP caching does not work

Opening same URL from browser, and the server returns below header.
Repeating again, why it always give HTTP 200, instead of 304.
Any idea?
HTTP/1.1 200 OK Server: Cowboy Connection: keep-alive X-Powered-By:
Express Content-Type: text/html Cache-Control: public, max-age=600
Date: Sat, 07 Nov 2015 04:41:50 GMT Transfer-Encoding: chunked Via:
1.1 vegur

How to tell Chrome not use local cache prematurely

The page (just a static .html) is served with the following headers:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 29 Jul 2015 02:59:37 GMT
Content-Type: text/html; charset=utf-8
Last-Modified: Wed, 29 Jul 2015 02:53:23 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Content-Encoding: gzip
Then, I close the tab, open the page again (by typing the url manually) and get the next response:
HTTP/1.1 304 Not Modified
Server: nginx
Date: Wed, 29 Jul 2015 02:58:45 GMT
Last-Modified: Wed, 29 Jul 2015 02:53:23 GMT
Connection: keep-alive
ETag: "55b84023-1ad"
I repeat it several times and then Chrome stops requesting it from the server and serves directly from its cache.
Status Code:200 OK (from cache)
Content-Encoding:gzip
Content-Type:text/html; charset=utf-8
Date:Wed, 29 Jul 2015 02:59:56 GMT
Last-Modified:Wed, 29 Jul 2015 02:53:23 GMT
Server:nginx
Vary:Accept-Encoding
Is there a way (server side) to tell Chrome not to do that, but respect HTTP caching headers on every request?

PHP sending different headers on different hosts

I have two identical instances of an application - one on my local machine the other one on a VPS.
The problem ist, that my local instance sends the expected headers:
[rico#local]$ curl -I app.local
HTTP/1.1 302 Found
Server: nginx/1.6.0
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.5.14
Set-Cookie: PHPSESSID=9sh2e9uo9b56sdvhrruakigvd7; path=/
Cache-Control: max-age=0, must-revalidate, no-store, nocache, private
Date: Thu, 14 Aug 2014 14:51:30 GMT
Location: /login
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Whereas the instance on the VPS send some other (wrong) headers:
[rico#local]$ curl -I app.vps
HTTP/1.1 302 Found
Server: nginx/1.6.0
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.5.9-1ubuntu4.3
Set-Cookie: PHPSESSID=me5h90pj59g0rh62rjlghtjsi0; path=/
Cache-Control: max-age=315360000
Date: Thu, 14 Aug 2014 14:52:40 GMT
Location: /login
Expires: Thu, 31 Dec 2037 23:55:55 GMT
What / who might be rewriting my headers?

cURL response missing Location field

I am curling to a server that I own and getting a response. But this response is missing a Location field. Can anyone help me figure out why this is happening?
This is the curl:
curl -I http://example.server.net/
And this the the response I get:
HTTP/1.1 200 OK
Date: Fri, 18 Jul 2014 15:08:53 GMT
Server: Apache/2.2.22 (Ubuntu)
Content-Length: 15
Vary: Accept-Encoding
Content-Type:
As you can see there is no Location field that responds with something like:
Location: http://example.server.net
I know it is a valid field from other curls. I am looking for reasons why some basic fields might not be shown and possible solutions
Thanks!
You will get Location header when response code is 3xx.
For example:
$ curl -I http://g.cn
HTTP/1.1 301 Moved Permanently
Location: http://www.google.cn/
Date: Fri, 18 Jul 2014 15:29:10 GMT
Expires: Fri, 18 Jul 2014 15:29:10 GMT
Cache-Control: private, max-age=2592000
Content-Type: text/html; charset=UTF-8
Server: gws
Content-Length: 218
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic,80:quic

Weird HTTP Response Arduino

So, I wrote a program than is supposes to connected to a server, and it returns the time. It works on my server, but when I tried to use it on another server, it responses oddly. Here is the response from my server:
HTTP/1.1 200 OK
Date: Tue, 07 Jan 2014 00:06:20 GMT
Server: Apache/2.2.22 (Debian)
X-Powered-By: PHP/5.4.4-14+deb7u5
Set-Cookie: PHPSESSID=jlscamqbddtqibf9j7m0fu27p5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Content-Length: 6
Connection: close
Content-Type: text/html
4:06pm
which works great. Now here is the response from the other server (doesn't work):
HTTP/1.1 200 OK
Date: Tue, 07 Jan 2014 00:06:34 GMT
Server: Apache
Set-Cookie: PHPSESSID=krlqmoqgpiqm9b9u27agup53c7; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Content-Type: text/html
6
4:06pm
0
As you can see I'm getting some weird stuff before and after the expected response. The code on the server is exactly the same. And the code on the Arduino is the same except for the a couple strings.
Here is a pastebin of the code I am using: http://pastebin.com/TFF5h2Gw
Sorry there aren't a lot of comments and it's kinda jumbled together. I omitted a little bit of code that is used by other stuff that I haven't even gotten to test yet because I can't even get the time.
What you are seeing is a chunk-encoded response. That is okay as all HTTP/1.1 capable clients are supposed to understand this transport encoding. What is wierd is that the server is not explicitly marking the response as being chunk-encoded (This is usually done via the Transer-Encoding: chunked header).
A quick way to get rid of this is to issue a HTTP/1.0 request.

Resources