The following code makes a single request to the Yahoo API. How can I make requests to multiple sources using urllib.request.Request? I am aware of grequests. Is there any performance difference between the two?
Any suitable modules on this topic?
import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
assembled_request = urllib.request.Request(YAHOO, None, headers)
response = urllib.request.urlopen(assembled_request)
html_data = response.read()
There is no way to "bundle" multiple requests into one; the overhead of building the request object on the client is trivial anyway. You are mostly just waiting for the request to reach the server and for the server to respond. If you need to send requests in bulk, doing it asynchronously (or concurrently) is the best way.
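For example, here is a minimal sketch that issues several urllib.request calls concurrently using a thread pool (the URLs and header are placeholders, not from the original question):

import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs to fetch concurrently.
URLS = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def fetch(url):
    # Build and send one request; return the response body as bytes.
    request = urllib.request.Request(url, None, HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()

# Issue all requests concurrently and collect the responses in order.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, URLS))

The total wall-clock time is then roughly that of the slowest request rather than the sum of all of them.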
If you are using a web API to query some data, there is often an equivalent of a get_multiple() method that you can use instead of calling get() X times. This might be the kind of thing you are looking for.
For example:
www.example.com/get_cat.html?brown=1
Might yield a brown cat object.
While:
www.example.com/get_cats.html?brown=1
Might yield all the brown cat objects that the database contains.
These kinds of methods save time and bandwidth for both the server and client.
Please forgive me if this question is too stupid.
We know that in the browser it is possible to go to Inspect -> Network -> XHR -> Headers and get Request Headers. It is then possible to add these Headers to the Scrapy request.
However, is there a way to get these Request Headers automatically using the Scrapy request, rather than manually?
I tried to use: response.request.headers but this information is not enough:
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}
We see many more request headers in the browser. How can I get this information?
Scrapy uses these headers to scrape the webpage. Sometimes, if a website requires special header keys (like an API), you'll notice that Scrapy won't be able to scrape the webpage.
However, there is a workaround: in a DownloaderMiddleware, you can use Selenium, so the requested webpage is downloaded by an automated browser. You will then be able to extract the complete headers, since Selenium drives an actual browser.
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver

## Set up the driver (configure ChromeOptions as needed)
options = webdriver.ChromeOptions()
driver = webdriver.Chrome("my/path/to/driver", options=options)

## Get the URL
driver.get("https://my.test.url.com")

## Print request headers
for request in driver.requests:
    print(request.url)               # <--------------- Request url
    print(request.headers)           # <----------- Request headers
    print(request.response.headers)  # <-- Response headers
You can use the above code to get the request headers. It must be placed inside a Scrapy DownloaderMiddleware so the two can work together.
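For context, here is a rough sketch of what such a middleware could look like; the class name, settings entry, and logging are illustrative assumptions, not part of the original answer:

# settings.py (illustrative):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.SeleniumWireMiddleware": 543}

from scrapy.http import HtmlResponse
from seleniumwire import webdriver

class SeleniumWireMiddleware:
    def __init__(self):
        # One shared browser instance for the whole crawl.
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # Let the real browser fetch the page instead of Scrapy's downloader.
        self.driver.get(request.url)
        # Log the full headers captured by Selenium Wire.
        for req in self.driver.requests:
            spider.logger.info("%s %s", req.url, req.headers)
        # Hand the rendered page back to Scrapy as a normal response.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )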
I'm trying to scrape data from this website, using httr and rvest. After several rounds of scraping (around 90-100 requests), the website automatically redirects me to another URL with a captcha.
this is the normal url: "https://fs.lianjia.com/ershoufang/pg1"
this is the captcha url: "http://captcha.lianjia.com/?redirect=http%3A%2F%2Ffs.lianjia.com%2Fershoufang%2Fpg1"
When my spider comes across the captcha URL, it tells me to stop, and I solve the captcha by hand in the browser. But when I run the spider and send a GET request again, it is still redirected to the captcha URL. Meanwhile, everything works normally in the browser; even if I type in the captcha URL, it redirects me back to the normal URL.
Even when I use a proxy, I still get the same problem: in the browser I can browse the website normally, while the spider keeps being redirected to the captcha URL.
I was wondering:
Is my way of using the proxy correct?
Why does the spider keep being redirected while the browser isn't, given they come from the same IP?
Thanks.
This is my code:
library(httr)
library(rvest)

a <- GET(url, use_proxy(proxy, port), timeout(10),
         add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
                     'Connection' = 'keep-alive',
                     'Accept-Language' = 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,en-US;q=0.2,fr;q=0.2,zh-TW;q=0.2',
                     'Accept-Encoding' = 'gzip, deflate, br',
                     'Host' = 'ajax.api.lianjia.com',
                     'Accept' = '*/*',
                     'Accept-Charset' = 'GBK,utf-8;q=0.7,*;q=0.3',
                     'Cache-Control' = 'max-age=0'))
b <- a %>% read_html() %>% html_nodes('div.leftContent') %>% html_nodes('div.info.clear') %>%
  html_nodes('div.title') %>% html_text()
Finally, I turned to RSelenium. It's slow, but there are no more captchas; even when one appears, I can solve it directly in the browser.
You are getting CAPTCHAs because that is how the website tries to prevent non-human scripts from scraping its data. When you scrape the data, it detects you as a robotic script. This happens because your script sends very frequent GET requests with the same parameters. Your program needs to behave like a real user (visiting the website at random intervals, with different browsers and IPs).
You can avoid getting CAPTCHAs by varying these parameters, as described below, so your program appears like a real user:
Add randomness when sending GET requests. For example, you can use the Sys.sleep function (with a randomly drawn duration) to sleep before each GET request.
Vary the user agent (Mozilla, Chrome, IE, etc.), cookie acceptance, and encoding.
Vary your source location (IP address and server info).
Varying this information will help you avoid the CAPTCHA validation to some extent.
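The question uses R, but the idea is language-agnostic. As a minimal sketch of randomized delays plus a rotating user agent, here is the pattern in Python (the agent strings are placeholders); in R the same is done with Sys.sleep and httr's add_headers:

import random
import time
import urllib.request

# A small pool of user-agent strings to rotate through (values are illustrative).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

def polite_get(url):
    # Sleep a random amount of time before each request.
    time.sleep(random.uniform(2, 8))
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    request = urllib.request.Request(url, None, headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()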
So I understand the concept of server-sent events (EventSource):
A client connects to an endpoint via EventSource
Client just listens to messages sent from the endpoint
The thing I'm confused about is how it works on the server. I've had a look at different examples, but the one that comes to mind is Mozilla's: http://hacks.mozilla.org/2011/06/a-wall-powered-by-eventsource-and-server-sent-events/
Now this may be just a bad example, but it kinda makes sense how the server side would work, as I understand it:
Something changes in a datastore, such as a database
A server-side script polls the datastore every N seconds
If the polling script notices a change, a server-sent event is fired to the clients
Does that make sense? Is that really how it works from a barebones perspective?
The HTML5 doctor site has a great write-up on server-sent events, but I'll try to provide a (reasonably) short summary here as well.
Server-sent events are, at their core, a long-running HTTP connection, a special MIME type (text/event-stream), and a user agent that provides the EventSource API. Together, these form the foundation of a unidirectional connection between a server and a client, over which messages can be sent from server to client.
On the server side, it's rather simple. All you really need to do is set the following http headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Be sure to respond with status code 200 and not 204 or any other code, as anything else will cause compliant user agents to disconnect. Also, make sure not to end the connection on the server side. You are now free to start pushing messages down that connection. In Node.js (using Express), this might look something like the following:
app.get("/my-stream", function(req, res) {
res.status(200)
.set({ "content-type" : "text/event-stream"
, "cache-control" : "no-cache"
, "connection" : "keep-alive"
})
res.write("data: Hello, world!\n\n")
})
On the client, you just use the EventSource API, as you noted:
var source = new EventSource("/my-stream")
source.addEventListener("message", function(message) {
console.log(message.data)
})
And that's it, basically.
Now, in practice, what actually happens here is that the connection is kept alive by the server and the client by means of a mutual contract. The server will keep the connection alive for as long as it sees fit. Should it want to, it may terminate the connection and respond with a 204 No Content next time the client tries to connect. This will cause the client to stop trying to reconnect. I'm not sure if there's a way to end the connection in a way that the client is told not to reconnect at all, thereby skipping the client trying to reconnect once.
As mentioned, the client will keep the connection alive as well, and will try to reconnect if it is dropped. The algorithm for reconnecting is specified in the spec and is fairly straightforward.
One super important bit that I've so far barely touched on, however, is the MIME type. It defines the format of the messages coming down the connection. Note, however, that it doesn't dictate the format of the contents of the messages, just the structure of the messages themselves. The MIME type is extremely straightforward. Messages are essentially key/value pairs of information. The key must be one of a predefined set:
id - the id of the message
data - the actual data
event - the event type
retry - milliseconds the user agent should wait before retrying a failed connection
Any other keys should be ignored. Messages are then delimited by the use of two newline characters: \n\n
The following is a valid message (the trailing blank line is shown explicitly as \n for clarity):
data: Hello, world!
\n
The client will see this as: Hello, world!.
As is this:
data: Hello,
data: world!
\n
The client will see this as: Hello,\nworld!.
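A message can also carry the other fields. For example:

id: 42
event: update
retry: 10000
data: Hello, world!
\n

A client listening with source.addEventListener("update", ...) receives this message as an "update" event rather than a "message" event, and on reconnection the user agent sends the last seen id back to the server in the Last-Event-ID request header.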
That pretty much sums up what server-sent events are: a long-running, non-cached HTTP connection, a MIME type, and a simple JavaScript API.
For more information, I strongly suggest reading the specification. It's small and describes things very well (although the server-side requirements could perhaps be summarized a bit better). It also covers, for instance, the expected behavior with certain HTTP status codes.
You also need to make sure to call res.flushHeaders(), otherwise Node.js won't send the HTTP headers until you call res.end(). See this tutorial for a complete example.
Maybe you guys can help me with this. I am trying to implement reCAPTCHA in my Node.js application and no matter what I do, I keep getting "invalid-site-private-key" as a response.
Here are the things I checked and double-checked and tried:
Correct Keys
Keys are not swapped
Keys are "global keys" as I am testing on localhost and thought it might be an issue with that
Tested in production environment on the server - same problem
The last thing I can think of is that my POST request to the reCAPTCHA API itself is incorrect, as the concrete format of the body is not explicitly documented (the parameters are documented, I know). So this is the request body I am currently sending (the key and IP are changed, but I checked them on my side):
privatekey=6LcHN8gSAABAAEt_gKsSwfuSfsam9ebhPJa8w_EV&remoteip=10.92.165.132&challenge=03AHJ_Vuu85MroKzagMlXq_trMemw4hKSP648MOf1JCua9W-5R968i2pPjE0jjDGXTYmWNjaqUXTGJOyMO3IKKOGtkeg_Xnn2UVAfoXHVQ-0VCHYPNwrj3PQgGj22EFv7RGSsuNfJCynmwTO8TnwZZMRjHFrsglar2zQ&response=Coleshill areacce
Is there something wrong with this format? Do I have to send special headers? Am I completely wrong? (I have been working for 16 hours straight now, so this might be ..)
Thank you for your help!
As stated in the comments above, I was able to solve the problem myself with the help of broofa and the node-recaptcha module available at https://github.com/mirhampt/node-recaptcha.
But first, to complete the missing details from above:
I didn't use any module, my solution is completely self-written based on the documentation available at the reCAPTCHA website.
I didn't send any request headers, as nothing was stated in the documentation. Everything it says about the request before explaining the necessary parameters is the following:
"After your page is successfully displaying reCAPTCHA, you need to configure your form to check whether the answers entered by the users are correct. This is achieved by doing a POST request to http://www.google.com/recaptcha/api/verify. Below are the relevant parameters."
-- "How to Check the User's Answer" at http://code.google.com/apis/recaptcha/docs/verify.html
So I built a query string myself (which is a one-liner, though there is a module for that as well, as I've now learned) containing all parameters and sent it to the reCAPTCHA API endpoint. All I received was the error code invalid-site-private-key, which (as we know by now) is really a misleading way of sending a 400 Bad Request. Maybe they should think about fixing this; then people would not wonder what's wrong with their keys.
These are the header parameters which are obviously necessary (they imply you're sending a form):
Content-Length which has to be the length of the query string
Content-Type which has to be application/x-www-form-urlencoded
Another thing I learned from the node-recaptcha module is that the query string should be sent UTF-8 encoded.
My solution now looks like this; you may use it or build on it, but error handling is not implemented yet. It's written in CoffeeScript.
http = require 'http'

module.exports.check = (remoteip, challenge, response, callback) ->
  privatekey = 'placeyourprivatekeyhere'
  request_body = "privatekey=#{privatekey}&remoteip=#{remoteip}&challenge=#{challenge}&response=#{response}"
  response_body = ''

  options =
    host: 'www.google.com'
    port: 80
    method: 'POST'
    path: '/recaptcha/api/verify'

  req = http.request options, (res) ->
    res.setEncoding 'utf8'
    res.on 'data', (chunk) ->
      response_body += chunk
    res.on 'end', () ->
      callback response_body.substring(0, 4) == 'true'

  req.setHeader 'Content-Length', request_body.length
  req.setHeader 'Content-Type', 'application/x-www-form-urlencoded'
  req.write request_body, 'utf8'
  req.end()
Thank you :)
+1 to #florian for the very helpful answer. For posterity, I thought I'd provide some information about how to verify what your captcha request looks like to help you make sure that the appropriate headers and parameters are being specified.
If you are on a Mac or a Linux machine, or have access to one locally, you can use the netcat command to set up a quick server. I guess there are netcat ports for Windows, but I have no experience with them.
nc -l 8100
This command creates a TCP socket listening on port 8100 and waits for a connection. You can then change the captcha verify URL in your server code from http://www.google.com/recaptcha/... to http://localhost:8100/. When your code makes the POST to the verify URL, you should see your request printed to the screen by netcat:
POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Content-Length: 277
Host: localhost:8100
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
privatekey=XXX&remoteip=127.0.0.1&challenge=03AHJYYY...&response=some+words
Using this, I was able to see that my private-key was corrupted.
I have:
Request request = new Request(Method.GET, "https://www.awebsite.com/login");
Client client = new Client(Protocol.HTTPS);
Response response = client.handle(request);
...
response.getEntity().write(System.out);
But I don't know how to set the login parameters...
I want code that
does the escaping etc
can switch between get/post easily
Since Restlet is a REST-based platform, I'm thinking I might need to use some parameter "representation", but that seems a bit strange. I'd think this would be common enough to build in such a representational exception.
If by "login parameters" you mean sending credentials using Basic HTTP Authentication, it's done using Request.setChallengeResponse() like so:
Request request = new Request(Method.GET, "https://www.awebsite.com/login");
request.setChallengeResponse(new ChallengeResponse(ChallengeScheme.HTTP_BASIC, username, password));
This will work for any Request, using any HTTP method.
If, however, the server to which you're trying to authenticate expects credentials using some protocol other than Basic HTTP Auth, then you'll need to explain that protocol -- i.e. does it use cookies, headers, tokens, etc.
BTW, you might get faster/better responses by posting to the Restlet-Discuss mailing list; I've been on there for a year and a half and it's a great community.