I just started with the Scrapy documentation and was wondering if anyone could provide a proper line-by-line explanation of the following code:
def parse(self, response):
    filename = response.url.split("/")[-2] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
Have you seen http://doc.scrapy.org/en/stable/intro/tutorial.html#our-first-spider?
parse(): a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument.
# a method called parse that takes one argument: response
def parse(self, response):
    # get the URL (string) from the response object [1]
    # split [2] the string on the "/" character
    # generate a filename from the list of split strings
    filename = response.url.split("/")[-2] + '.html'
    # open [3] a file called filename and write [4] into it the body
    # of the response (i.e. the contents of the scraped page)
    with open(filename, 'wb') as f:
        f.write(response.body)
[1] http://doc.scrapy.org/en/stable/topics/request-response.html#scrapy.http.Response
[2] https://docs.python.org/2/library/stdtypes.html#str.split
[3] https://docs.python.org/2/library/functions.html#open
[4] https://docs.python.org/2/library/stdtypes.html#file.write
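To see why the [-2] index is used, here is a minimal sketch of what the filename expression evaluates to for a URL with a trailing slash (the example URL is an assumption, not taken from the question):
url = "http://www.example.com/Computers/Books/"
parts = url.split("/")
# parts is ['http:', '', 'www.example.com', 'Computers', 'Books', '']
# the trailing slash leaves an empty string at the end, so [-2] is the last real path segment
filename = parts[-2] + '.html'
print(filename)  # Books.html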
You have a spider which downloads a web page and saves the response to a file.
The spider uses the parse method you defined as the callback for the response it receives:
line1: define the parse method, which receives the response as a parameter. The response is what you get back from the web server.
line2: build a filename in which the response data will be saved. The name is the next-to-last piece of the URL after splitting it on the '/' character, with '.html' appended.
line3: open the file for writing in binary mode ('wb').
line4: write the HTML data from response.body into the file.
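For context, here is a minimal sketch of how that parse method sits inside a complete spider; the spider name and start URL below are placeholders, not from the original question:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com/Computers/Books/"]

    def parse(self, response):
        # called with the downloaded Response for each start URL
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Running it with scrapy crawl example would save a file named after the next-to-last URL segment (Books.html here), assuming the page downloads successfully.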
Say I have a function that takes some *args (or **kwargs?) for an HTTP request, and I want to pass different arguments each time the function is called - something like:
def make_some_request(self, *args):
    response = requests.get(*args)
    return response
where *args might be e.g. url, headers and parameters in one case and url and timeout in another case. How can this be formatted to get the request to look like
response = requests.get(url, headers=headers, params=parameters)
on the first function call and
response = requests.get(url, timeout=timeout)
on the second function call?
I wondered if this was possible using *args or **kwargs but the format doesn't seem quite right with either.
You can do this with **kwargs:
import requests

class Client:
    def make_some_request(self, *args, **kwargs):
        response = requests.get(*args, **kwargs)
        return response

client = Client()
client.make_some_request("https://domain.tld/path?param=value", timeout=...)
client.make_some_request("https://domain.tld/path?param=value", headers=..., params=...)
*args will take care of passing the positional argument (the URL), whereas the named arguments (timeout, headers, params) will be passed to requests.get by **kwargs.
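If it helps to see what actually arrives inside such a method, here is a small illustrative sketch (the function name and argument values are made up for demonstration):
def show_call(*args, **kwargs):
    # *args collects the positional arguments as a tuple,
    # **kwargs collects the keyword arguments as a dict
    print("args:", args)
    print("kwargs:", kwargs)

show_call("https://domain.tld/path", timeout=5)
# args: ('https://domain.tld/path',)
# kwargs: {'timeout': 5}
show_call("https://domain.tld/path", headers={"X-Test": "1"}, params={"page": 2})
# args: ('https://domain.tld/path',)
# kwargs: {'headers': {'X-Test': '1'}, 'params': {'page': 2}}
Passing *args and **kwargs straight into requests.get re-expands them, so the two client calls above end up equivalent to the two requests.get calls shown in the question.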
I'm using this API to search through books. I need to create a request with given parameters. When I use the requests library with the params argument, it creates a bad URL, which gives me the wrong response. Let's look at the examples:
import requests
params = {'q': "", 'inauthor': 'keyes', 'intitle': 'algernon'}
r = requests.get('https://www.googleapis.com/books/v1/volumes?', params=params)
print('URL', r.url)
The URL is https://www.googleapis.com/books/v1/volumes?q=&inauthor=keyes&intitle=algernon
This works, but gives a different response than the format described in the Working with Volumes documentation.
Should be: https://www.googleapis.com/books/v1/volumes?q=inauthor:keyes+intitle:algernon
The requests documentation only covers passing params as a dict, which it joins with &.
I'm looking for a library or any other solution; hopefully I don't have to build the query string myself with e.g. f-strings.
You need to build the query parameter yourself; the way you are doing it now is not what you want.
In this code you are sending 3 separate query parameters, but that is not what you want. You actually want to send 1 parameter (q) whose value combines them.
import requests
params = {'q': "", 'inauthor': 'keyes', 'intitle': 'algernon'}
r = requests.get('https://www.googleapis.com/books/v1/volumes?', params=params)
print('URL', r.url)
Try the code below instead, which does what you require:
import requests
params = {'inauthor': 'keyes', 'intitle': 'algernon'}
new_params = 'q='
new_params += '+'.join('{}:{}'.format(key, value) for key, value in params.items())
print(new_params)
r = requests.get('https://www.googleapis.com/books/v1/volumes?', params=new_params)
print('URL', r.url)
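As a small variation, the query building could be wrapped in a helper so the caller only deals with a dict; build_query is a hypothetical name, not part of requests or the Books API:
import requests

def build_query(fields):
    # join the field filters into a single q parameter, e.g.
    # {'inauthor': 'keyes', 'intitle': 'algernon'} -> 'q=inauthor:keyes+intitle:algernon'
    return 'q=' + '+'.join('{}:{}'.format(key, value) for key, value in fields.items())

params = build_query({'inauthor': 'keyes', 'intitle': 'algernon'})
r = requests.get('https://www.googleapis.com/books/v1/volumes', params=params)
print('URL', r.url)
The point in both versions is that the whole query is handed to requests as one ready-made string, so inauthor and intitle are not split into separate top-level parameters.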
I am using VS Code + Git Bash to scrape this data into JSON, but I am not getting any data - the JSON file is empty.
import scrapy

class ContactsSpider(scrapy.Spider):
    name = 'contacts'
    start_urls = [
        'https://app.cartinsight.io/sellers/all/amazon/'
    ]

    def parse(self, response):
        for contacts in response.xpath("//td[@title='Show Contact']"):
            yield {
                'show_contacts_td': contacts.xpath(".//td[@id='show_contacts_td']").extract_first()
            }
        next_page = response.xpath("//li[@class='stores-desc hidden-xs']").extract_first()
        if next_page is not None:
            next_page_link = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
The URL https://app.cartinsight.io/sellers/all/amazon/ you want to scrape redirects to https://app.cartinsight.io/. The second URL contains nothing matching the XPath "//td[@title='Show Contact']", so the for loop in the parse method is skipped and you are not getting your desired results.
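A quick way to confirm this from inside the spider is to log where the request actually lands before extracting anything; this is only a debugging sketch using the standard response.url and response.status attributes:
import scrapy

class ContactsSpider(scrapy.Spider):
    name = 'contacts'
    start_urls = ['https://app.cartinsight.io/sellers/all/amazon/']

    def parse(self, response):
        # after any redirects, response.url is the page that was actually downloaded
        self.logger.info('landed on %s (status %s)', response.url, response.status)
        rows = response.xpath("//td[@title='Show Contact']")
        self.logger.info('matched %d rows', len(rows))
If the logged URL differs from the start URL, the redirect is the reason the XPath finds nothing.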
I'm just a beginner with Python, but I plan to work hard and become an expert soon :)
I tried to convert the response to JSON and got this error:
Response not in valid JSON format
and
<class 'requests.models.Response'>
url="https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna"
try:
response = requests.get(url)
if not response.status_code == 200:
print("HTTP error",response.status_code)
else:
try:
import json
response.content.decode('utf-8')
response = json.dumps(response)
loaded_response = json.loads(response)
except:
print("Response not in valid JSON format")
except:
print("Something went wrong with requests.get")
print(type(response))
Response not in valid JSON format means that your response data is not in a valid JSON format.
It's not very clear what you want, but the URL you are requesting, https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna, does not return JSON formatted data; it returns a full HTML document (try opening the URL in your browser).
Therefore you cannot convert it to a JSON object.
JSON formatted data would look something like:
{ "name":"John", "age":30, "car":null }
If you would like to test your code, you can use a URL from a placeholder service, e.g. https://jsonplaceholder.typicode.com/todos/1
A (GET) request to this URL will return JSON formatted placeholder data so you can test and experiment with your code.
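As a minimal sketch of that idea (using the placeholder URL above; the 'title' field belongs to the placeholder data, not your Wikipedia page):
import requests

url = "https://jsonplaceholder.typicode.com/todos/1"
response = requests.get(url)
if response.status_code == 200:
    try:
        # requests can decode a JSON body directly; this fails if the body
        # is not valid JSON (for example, an HTML page)
        data = response.json()
        print(data["title"])
    except ValueError:
        print("Response not in valid JSON format")
else:
    print("HTTP error", response.status_code)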
You can learn more about JSON here: https://www.w3schools.com/js/js_json_intro.asp
Good luck!
import Network.URI
import Network.HTTP
import Network.Browser
get :: URI -> IO String
get uri = do
    let req = Request uri GET [] ""
    resp <- browse $ do
        setAllowRedirects True -- handle HTTP redirects
        request req
    return $ rspBody $ snd resp

main = do
    case parseURI "http://cn.bing.com/search?q=hello" of
        Nothing -> putStrLn "Invalid search"
        Just uri -> do
            body <- get uri
            writeFile "output.txt" body
Here is the diff between the Haskell output and the curl output.
It's probably not a good idea to use String as the intermediate data type here, as it will cause character conversions both when reading the HTTP response and when writing to the file. This can cause corruption if these conversions are not consistent, as appears to be the case here.
Since you just want to copy the bytes directly, it's better to use a ByteString. I've chosen to use a lazy ByteString here, so that it does not have to be loaded into memory all at once, but can be streamed lazily into the file, just like with String.
import Network.URI
import Network.HTTP
import Network.Browser
import qualified Data.ByteString.Lazy as L
get :: URI -> IO L.ByteString
get uri = do
    let req = Request uri GET [] L.empty
    resp <- browse $ do
        setAllowRedirects True -- handle HTTP redirects
        request req
    return $ rspBody $ snd resp

main = do
    case parseURI "http://cn.bing.com/search?q=hello" of
        Nothing -> putStrLn "Invalid search"
        Just uri -> do
            body <- get uri
            L.writeFile "output.txt" body
Fortunately, the functions in Network.Browser are overloaded, so the change to lazy bytestrings only involves changing the request body to L.empty, replacing writeFile with L.writeFile, and changing the type signature of the function.