Python 3.5 requests for crawling - python-requests

I have a coding problem regarding Python 3.5 web crawling.
I am trying to use 'requests.get' to extract the real link from 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'. An example of the code is below:
import requests
response = requests.get('http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3')
c = response.url
I expected 'c' to be 'caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'. (I removed http:// from the link as I can't post two links in one question.)
However, it doesn't work and keeps returning the same link I put in.
Can anyone help with this? Many thanks in advance.
Thanks a lot to Charlie.
I have found the solution. I first use .content.decode to read the response body, but that is mixed up with a lot of irrelevant info. I then use re.findall to extract the redirect URL, which should be the first quoted URL in the response body. Then I use requests.get to retrieve the info. Below is the code:
import re
import requests
url = 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'
rep1 = requests.get(url)
cont = rep1.content.decode('utf-8')           # decode the response body
extract_cont = re.findall('"([^"]*)"', cont)  # collect every double-quoted string
redir_url = extract_cont[0]                   # the redirect target is the first one
rep = requests.get(redir_url)                 # fetch the real page

You may consider looking into the response headers for a 'location' header.
response.headers['location']
You may also consider looking at the response history, which contains a response object for each hop in a chain of redirects:
response.history
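For example, a minimal sketch of both checks using the sample URL (note that requests follows ordinary 3xx redirects automatically unless you pass allow_redirects=False):
import requests
url = 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'
# Disable automatic redirect handling so a 3xx Location header stays visible
response = requests.get(url, allow_redirects=False)
if response.is_redirect:
    print(response.headers['location'])
# With redirects enabled (the default), each hop is recorded in .history
response = requests.get(url)
for hop in response.history:
    print(hop.status_code, hop.url)
print(response.url)  # final URL after all HTTP-level redirects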

Your sample URL doesn't redirect; the response is a 200, and the page then performs a JavaScript window.location change. The requests library doesn't support this type of redirect.
<script>window.location.replace("http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm")</script>
<noscript><META http-equiv="refresh" content="0;URL='http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'"></noscript>
If you know you will always be using this one service, you could parse the response, maybe using regex.
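A minimal sketch of that regex approach, keyed to the window.location.replace line shown above (the pattern is an assumption about this one service's markup):
import re
import requests
url = 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'
response = requests.get(url)
# Pull the target out of the window.location.replace("...") call
match = re.search(r'window\.location\.replace\("([^"]+)"\)', response.text)
if match:
    print(match.group(1))  # the real destination URL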
If you don't know which service will be used and want to handle every possible situation, you might need to instantiate a WebKit instance or something similar and somehow determine when it finally finishes. I'm sure there's a page-load-complete event you could use, but you might still have pages that change window.location after the page has loaded, using a timer. This will be very heavyweight and still won't cover every conceivable type of redirect.
I recommend starting with writing a special handler for each type of edge case and fallback on a default handler that just looks at the response.url. As new edge cases come up, write new handlers. It's kind of the 'trial and error' approach.
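A rough sketch of that handler-plus-fallback structure (the handler names and regex patterns here are illustrative, not exhaustive):
import re
def handle_js_redirect(response):
    # Pages that redirect via window.location.replace("...")
    match = re.search(r'window\.location\.replace\("([^"]+)"\)', response.text)
    return match.group(1) if match else None
def handle_meta_refresh(response):
    # Pages that redirect via <META http-equiv="refresh" content="0;URL='...'">
    match = re.search(r"URL='([^']+)'", response.text)
    return match.group(1) if match else None
def resolve_final_url(response):
    # Try each special-case handler; fall back to the URL requests ended on
    for handler in (handle_js_redirect, handle_meta_refresh):
        target = handler(response)
        if target:
            return target
    return response.url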

Related

Is it possible to return HTTP code 200, but give a "better" url without using 3xx?

Consider StackOverflow, where each question has a unique ID, but URLs are usually rewritten to include a stub. For readability and other reasons, the stub helps users know they are at the right place.
I have a site that returns 200 when calling a URL like:
http://stackoverflow.com/questions/28057406/
But want the URL to update to:
http://stackoverflow.com/questions/28057406/is-it-possible-to-return-http-code-200-but-give-a-better-url-without-using-3x
The first call is technically valid and the code can retrieve the object and render it perfectly fine, but I'd like to update the URL to use the stubified one.
I'd prefer to do this without a redirect, as just getting the ID causes a database call to get the object, which would mean that with a redirect the process would be:
Call http://stackoverflow.com/questions/28057406/
Retrieve item 28057406 from the database to get the name to make the stub
Redirect to http://stackoverflow.com/questions/28057406/is-it-possible-to-return-http-code-200-but-give-a-better-url-without-using-3x
New HTTP call, so retrieve item 28057406 from the database again to render the final page.
If possible I'd like to not use Javascript either.
So, is it possible to return Location as part of an HTTP header with a status code of 200 and the actual page, or am I stuck using 3xx calls or JavaScript?
If you are just doing HTTP, you can either choose to redirect, or not choose to redirect... You can also (with Content-Location) tell the client that the canonical address is actually somewhere else... but no browser will respond to that.
To avoid the database call, you could of course just cache the result.
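For illustration, a server-side sketch of that Content-Location idea (Flask and the route layout are my assumptions, not part of the question):
from flask import Flask, make_response
app = Flask(__name__)
# Toy stand-in for the database lookup
QUESTIONS = {28057406: ('is-it-possible-to-return-http-code-200', '<html>question body</html>')}
@app.route('/questions/<int:qid>/')
def question(qid):
    stub, body = QUESTIONS[qid]
    resp = make_response(body)  # 200 with the actual page
    # Advertise the canonical stubified URL without redirecting
    resp.headers['Content-Location'] = '/questions/%d/%s' % (qid, stub)
    return resp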
If you are in a browser, however, you can dynamically update the current address without forcing a refresh, using window.history.pushState.
For more information about that call, see this other SO answer:
Modify the URL without reloading the page

Trouble entering POST parameters into url for ".do" page

I'm doing some heavy web scraping using Python. In some cases, post data is sent not through a form submit but through some Javascript, which I cannot interact with via this approach. In order to circumvent this, I've been appending names and values for the post requests to the url and then visiting that url.
This method was working fine until I came across a site that used this kind of structure: [sitename].com/?[pagename].do/. I admit total ignorance about this .do extension, though some light searching tells me that it has to do with Struts and a Java-based backend. In this case it seems to be a way of dynamically generating a table; I'm trying to filter the results of that table. What I want to enter is something like [sitename].com/?[pagename].do?[name]=[value]&[name]=[value], but this doesn't work, nor does it even seem like it should work. I attempted it using several variations in syntax. It seems like something I don't quite understand is going on here.
I wish I could direct you to the actual site, but unfortunately I cannot due to the sensitive nature of the project. Let me know, though, if there's any additional information that would be helpful in providing an answer. Thanks in advance.
Edit: This is not really a "my code isn't working" question, as it's the underlying functionality that I would like to emulate in my code which is troubling me, but I'll do my best to get grittier. I'm contractually bound not to share the names of the sites that we're studying, but I will try to model the problem. I am hoping that someone with some familiarity with the back-end activity that sends this .do page to the browser will be able to shed light.
import urllib
import urllib2

## case 1: a site that I have success in scraping
url = 'http://[sitename]/[pagename]'
values = {'s': '40', 'pg': '1'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)  # a Request with a data payload is sent as a POST
response = urllib2.urlopen(req)
the_page = response.read()
print the_page  # I get the filtered data that I am looking for

## case 2: the site that poses a problem for the encoding of POST parameters
# This site uses a .do file to generate the content I want to filter;
# note that the page name is preceded by ?.
url = 'http://[sitename]/?[pagename].do/'
values = {'s': '40', 'pg': '1'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
# I am taken back to the root of the site, the same result I would get if I entered
# nonsense POST parameters that did not correspond with actual control names.
print the_page
Here, also, is an example of some JavaScript on the page that accomplishes what I'd like to do with my scraper:
function page_next(id) {
    $("#loading").fadeIn("normal");
    $.post("/?dumps.do/", {s: id, pg: 2},
        function(data) {
            var content = $(data).find('#dumps');
        }
    );
}
I don't know what site you are parsing, but this: [sitename].com/?[pagename].do/ is not something I would call default Struts behaviour, assuming it's indeed a Struts application.
Having a .do extension was indeed something Struts used to use for request mapping, but the URL in that case should be [sitename].com/[pagename].do, not [sitename].com/?[pagename].do/.
In the second form, the action is in fact a parameter in a query string. This is why this syntax is broken: [sitename].com/?[pagename].do?[name]=[value]&[name]=[value]. You want to send a query string to the action but the action itself is a parameter in the query string.
But that's not the issue. The issue is that the site is doing something with that parameter and expects to receive its data in a certain way, a way you were not able to reverse-engineer.
Assuming again that this is a Struts application, Struts uses a front controller to intercept all action.do URLs and then uses the action to invoke a particular class in the application, a class that is mapped to that particular action. The format for this should be [sitename].com/[pagename].do. That would be similar to having, say, [sitename].com/[pagename].php.
But having the action as a parameter makes me think that the site has a different front controller (not that of Struts) that takes the parameter from the query string and passes it downstream to the Struts framework.
There could be a lot of reasons for this funky way of handling requests, including making it harder for others to scrape the site, although this one seems kind of straightforward:
$.post("/?dumps.do/", {s: id, pg: 2}, ...
Have you tried doing a POST to the root of the application with the action in the query string?
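In code, that might look something like this (sketched with requests rather than urllib2; the placeholder host and the parameter values come from the question):
import requests
# Mirror the jQuery call above: $.post("/?dumps.do/", {s: id, pg: 2}, ...)
response = requests.post('http://[sitename].com/?dumps.do/', data={'s': '40', 'pg': '2'})
print(response.text)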

Nesting HTTP GET parameters (request within a request)

I want to call a JSP with GET parameters within the GET parameter of a parent JSP. The URL for this would be http://server/getMap.jsp?lat=30&lon=-90&name=http://server/getName.jsp?lat1=30&lon1=-90
getName.jsp will return a string that goes in the name parameter of getMap.jsp.
I think the problem here is that &lon1=-90 at the end of the URL will be given to getMap.jsp instead of getName.jsp. Is there a way to distinguish which GET parameter goes to which URL?
One idea I had was to encode the second URL (e.g. = -> %3D and & -> %26) but that didn't work out well. My best idea so far is to allow only one parameter in the second URL, comma-delimited. So I'll have http://server/getMap.jsp?lat=30&lon=-90&name=http://server/getName.jsp?params=30,-90 and leave it up to getName.jsp to parse its variables. This way I leave the & alone.
NOTE - I know I can approach this problem from a completely different angle and avoid nested URLs altogether, but I still wonder (for the sake of knowledge!) if this is possible or if anyone has done it...
This has been done a lot, especially with ad-serving technologies and URL redirects.
An encoded URL should work fine, though; you just need to encode it completely. A generator can be found here.
So this:
http://server/getMap.jsp?lat=30&lon=-90&name=http://server/getName.jsp?lat1=30&lon1=-90
becomes this:
http://server/getMap.jsp?lat=30&lon=-90&name=http%3A%2F%2Fserver%2FgetName.jsp%3Flat1%3D30%26lon1%3D-90
I am sure that JSP has a function for this; look for "urlencode". Your JSP will see the contents of the GET variable "name" as the unencoded string: "http://server/getName.jsp?lat1=30&lon1=-90"
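The encoding step itself is the same in any language; in Python, for instance, a small sketch:
from urllib.parse import quote, unquote
inner = 'http://server/getName.jsp?lat1=30&lon1=-90'
outer = 'http://server/getMap.jsp?lat=30&lon=-90&name=' + quote(inner, safe='')
print(outer)  # ...name=http%3A%2F%2Fserver%2FgetName.jsp%3Flat1%3D30%26lon1%3D-90
# The outer page's query-string parser hands back the decoded inner URL:
print(unquote(quote(inner, safe='')))  # http://server/getName.jsp?lat1=30&lon1=-90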

If I POST to a URL, how can I tell the client to look at a fragment of the response?

I would like to POST an entity as follows
POST /example.org/MyEntity/100
Based on the passed entity, the server would like to draw the user's attention to a particular part of the response using a fragment identifier, e.g.
/example.org/MyEntity/100#InterestingPart
How do I return this new URL to the client? I am assuming I could do some form of redirect using a 3xx response code, but I actually do not want the client to make another request, because the only difference between the two URLs is the fragment. At the moment it seems that a 307 return code would be the most appropriate, because according to the spec you should not automatically redirect a POST.
Is there are better way?
Update: My client is not limited to the constraints of a web browser. I am just looking at this from the perspective of HTTP.
Update 2: Based on my reading of RFC 2616, I can see nothing stopping me from returning a 200 and a Location header that contains the fragment identifier. Does anyone know of a reason why I cannot do that?
I think the only sensible solution is to give the action URL a static fragment identifier, like <form method="post" action="/action#anchored">, and then put an anchor wherever you want the user to look while generating the page.
But to answer Update 2: no, there's no reason to avoid it.
My inclination is to return 201 and have the Location header point to the URI you want the client to GET.
I didn't look, but IIRC nothing dictates that the Location header must point to the resource created, so it should be spec-legal.
You should normally redirect every POST to avoid problems with refreshing the page and the use of the back button. This is known as the PRG (POST Redirect Get) pattern:
http://blog.httpwatch.com/2007/10/03/60-of-web-users-can%E2%80%99t-be-wrong-%E2%80%93-don%E2%80%99t-break-the-back-button/
Although this incurs the cost of another round trip to the server, it makes your web application much more user-friendly.
You could then add the fragment onto the redirected URL.
There's an example of PRG with a fragment on this page:
http://www.httpwatch.com/httpgallery/redirection/
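A minimal PRG-with-fragment sketch (Flask is my choice here; the pattern itself is framework-agnostic):
from flask import Flask, redirect
app = Flask(__name__)
@app.route('/MyEntity/<int:eid>', methods=['POST'])
def update_entity(eid):
    # ...apply the POSTed changes here...
    # 303 See Other makes the client re-request with GET; the fragment rides
    # along, so the browser scrolls straight to the interesting part.
    return redirect('/MyEntity/%d#InterestingPart' % eid, code=303)
@app.route('/MyEntity/<int:eid>', methods=['GET'])
def show_entity(eid):
    return '<html><body><a id="InterestingPart">the interesting part</a></body></html>'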
POSTing to the URI:
http://example.org/MyEntity/100
implies to me that a MyEntity resource called "100" already exists. If that's the case, why not use PUT instead? Is this an update or a create operation?
An alternative might be:
POST http://example.org/MyEntities
Now your service has a choice to make from at least two possibilities:
Return 201 Created. Set the Location header to be the URI you want the client to use (e.g.: http://example.org/MyEntities/100#InterestingPart). Add the representation of the new resource to the body.
Return 204 No Content. Same as above, but no body. This option requires a subsequent GET to fetch the representation, which sounds like what you're trying to avoid.
Neither approach requires redirection and both can return as specific a URI as you desire.
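A sketch of the first option (again assuming Flask; the 201 status plus the fragment-bearing Location header are the point):
from flask import Flask, jsonify, request
app = Flask(__name__)
entities = {}
@app.route('/MyEntities', methods=['POST'])
def create_entity():
    new_id = len(entities) + 1
    entities[new_id] = request.get_json(silent=True) or {}
    resp = jsonify(entities[new_id])  # representation of the new resource in the body
    resp.status_code = 201            # Created
    # Point the client at the new resource, fragment and all
    resp.headers['Location'] = '/MyEntities/%d#InterestingPart' % new_id
    return resp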
I am curious though, why is the #InterestingPart significant? Why not just return the entire representation and its URI http://example.org/MyEntities/100 in the Location header - and let clients decide for themselves what's interesting or not? If the answers have something to do with only a small part of the resource being of interest (or being modified) during a request, how feasible would it be to break MyResource into a main resource and one or more subordinate resources? For example:
/MyResources/100/CoolThings
/MyResources/100/CoolThings/42
/MyResources/100/InterestingThings
/MyResources/100/InterestingThings/109

Checking The Date A Webpage Has Been Updated?

I want to be able to run a little script that I can populate with a list of URLs and that pulls in each page and checks when it was last updated. Has anyone done this?
I can only find a manual way of doing this, using JavaScript, by pasting this into the browser URL field:
javascript:alert(document.lastModified)
Any ideas greatly received :)
The following will step through an array of URLs and display the last modified date or, if it's not present, the date of the server request.
string[] urls = { "http://boflynn.net", "http://slashdot.org" };
foreach (string url in urls)
{
    System.Net.HttpWebRequest req =
        (System.Net.HttpWebRequest) System.Net.WebRequest.Create(url);
    System.Net.HttpWebResponse resp =
        (System.Net.HttpWebResponse) req.GetResponse();
    Console.WriteLine("{0} - {1}", url, resp.LastModified);
}
If you use urllib2 (or perhaps httplib might be better still) in a Python script, you can inspect the returned headers for the Last-Modified field.
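For example, a minimal sketch, written for Python 3 (where urllib2 became urllib.request):
from urllib.request import urlopen
urls = ['http://boflynn.net', 'http://slashdot.org']
for url in urls:
    with urlopen(url) as response:
        # Last-Modified is optional; fall back to Date if the server omits it
        stamp = response.headers.get('Last-Modified') or response.headers.get('Date')
        print(url, '-', stamp)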
It depends on what you mean by "last updated". Sure, there is the Last-Modified HTTP header, but it can be very misleading. For example, if the page is being served up dynamically, there is a good chance that this field will be the current time, even if the content of the page itself (the part useful to humans) has not been updated in a rather long time. This page itself is a good example of this phenomenon.
If you are truly interested in the last time the content was updated, then I don't have an immediate answer.
