Need help scraping the contents of this page with Scrapy

Can someone please tell me how to scrape the data (names and numbers) from this page using Scrapy? The data is loaded dynamically. If you check the Network tab you'll find a POST request to https://www.icab.es/rest/icab-api/collegiates. So I copied it as cURL and sent the request through Postman, but I am getting an error. Could someone please help me?
URL: https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales/?extraSearch=false&probono=false

This is a very good question! But next time you may want to add your code and format it a little better. See: How to ask
Solution:
You need to recreate the request. I inspected it with Burp Suite.
I got the headers for the URL in start_urls, and both the headers and the body for the json_url.
If you try to get the json_url straight from start_requests you'll get a 401 error, so we first visit the start_urls URL and only then request the json_url.
The complete code:
import scrapy


class Temp(scrapy.Spider):
    name = "tempspider"
    allowed_domains = ['icab.es']
    start_urls = ['https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales']
    json_url = 'https://www.icab.es/rest/icab-api/collegiates'

    def start_requests(self):
        # Headers for the regular HTML page; visiting it first gets us the session cookies.
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Origin": "https://www.icab.es",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "www.icab.es",
            "Pragma": "no-cache",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Sec-GPC": "1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        }
        yield scrapy.Request(url=self.start_urls[0], headers=headers, callback=self.parse)

    def parse(self, response):
        # Headers and body for the JSON API, copied from the browser's POST request.
        headers = {
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Pragma": "no-cache",
            "Sec-GPC": "1",
            'Accept': 'application/json',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.9',
            'Content-Type': 'application/json',
            'Host': 'www.icab.es',
            'Sec-Ch-Ua': '"Chromium";v="91", " Not;A Brand";v="99"',
            'Sec-Ch-Ua-Mobile': '?0',
            'Origin': 'https://www.icab.es',
            'Referer': 'https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Dest': 'empty',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
            "X-KL-Ajax-Request": "Ajax_Request",
        }
        body = '{"filters":{"keyword":"","name":"","surname":"","street":"","postalCode":"","collegiateNumber":"","dedication":"","language":"","paginationFirst":"1","paginationLast":"25","paginationOrder":"surname","paginationOrderAscDesc":"ASC"}}'
        yield scrapy.Request(url=self.json_url, headers=headers, body=body, method='POST', callback=self.parse_json)

    def parse_json(self, response):
        json_response = response.json()
        members = json_response['members']
        for member in members:
            yield {
                'randomPosition': member['randomPosition'],
                'collegiateNumber': member['collegiateNumber'],
                'surname': member['surname'],
                'name': member['name'],
                'gender': member['gender'],
            }
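To try it out, you can save the spider to a file and run it with Scrapy's standalone runner (the file and output names here are just examples):

scrapy runspider tempspider.py -o members.csv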
Output:
{'randomPosition': '27661107', 'collegiateNumber': '35080', 'surname': 'Abad Bamala', 'name': 'Ana', 'gender': 'M'}
{'randomPosition': '98668217', 'collegiateNumber': '14890', 'surname': 'Abad Calvo', 'name': 'Encarnacion', 'gender': 'M'}
{'randomPosition': '53180188', 'collegiateNumber': '29746', 'surname': 'Abad de Brocá', 'name': 'Laura', 'gender': 'M'}
{'randomPosition': '41073111', 'collegiateNumber': '31865', 'surname': 'Abad Esteve', 'name': 'Joan Domènec', 'gender': 'H'}
{'randomPosition': '63371735', 'collegiateNumber': '29647', 'surname': 'Abad Fernández', 'name': 'Dolors', 'gender': 'M'}
{'randomPosition': '30290704', 'collegiateNumber': '45016', 'surname': 'Abad Hernández', 'name': 'Laura', 'gender': 'M'}
{'randomPosition': '57510617', 'collegiateNumber': '16083', 'surname': 'Abad Mariné', 'name': 'Jose Antonio', 'gender': 'H'}
................
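Note that the body above only requests the first 25 records (paginationFirst 1 through paginationLast 25). If you need the whole directory, one possible approach is to keep re-posting with a shifted window until no members come back. Below is a minimal sketch, under the assumption that the endpoint honours larger paginationFirst/paginationLast values and returns an empty members list once the window is past the end; build_body is a hypothetical helper, not part of the original code.

import json
import scrapy

# Two methods to add to the spider above (a sketch only). Assumption: the API
# treats paginationFirst/paginationLast as a 1-based window over the full
# result set and returns an empty "members" list past the last record.
def build_body(self, first, last):
    return json.dumps({
        "filters": {
            "keyword": "", "name": "", "surname": "", "street": "",
            "postalCode": "", "collegiateNumber": "", "dedication": "",
            "language": "",
            "paginationFirst": str(first),
            "paginationLast": str(last),
            "paginationOrder": "surname",
            "paginationOrderAscDesc": "ASC",
        }
    })

def parse_json(self, response):
    members = response.json().get('members', [])
    for member in members:
        yield member
    if members:  # keep paging until an empty window comes back
        first = response.meta.get('first', 1) + 25
        yield scrapy.Request(
            url=self.json_url,
            headers=response.request.headers,
            body=self.build_body(first, first + 24),
            method='POST',
            callback=self.parse_json,
            meta={'first': first},
        )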

Related

How to use Flurl to get the response body with query string parameters

I want to use Flurl to get the response body with query string parameters. I tried using Postman and the results were as I expected, but I can't reproduce them with Flurl.
var strUrl = await "https://example.com/api/v2/xxx/yyy?need_personalize=true&promotionid=2007354722&sort_soldout=true"
    .WithHeaders(new
    {
        user_agent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
        content_type = "application/json",
        referer = "https://example.com/xxx",
        cookie = myCookies
    })
    .PostJsonAsync(new
    {
        need_personalize = true,
        promotionid = 2007354722,
        sort_soldout = true
    });
bodyUrl = await strUrl.GetStringAsync();
The result I got: 200
{
  "version": "6f1e0da21b667876ba3853b59a8bb812",
  "error_msg": null,
  "error": 10010
}
The resulting response should look like this:
{
  "version": "6f1e0da21b667876ba3853b59a8bb812",
  "data": {
    "selling_out_item_brief_list": [
      {
        "itemid": 466536326,
        "from": null
      }
    ],
    "items": [],
    "mega_sale_items": [],
    "item_brief_list": [
      {
        "itemid": 6720842242,
        "from": null,
        "is_soldout": false
      }
    ],
    "promotionid": 2007354722
  },
  "error_msg": null,
  "error": 0
}
Can anyone help me?

Filtering out login logs from Gitlab production_json.log with jq

I'm trying to filter the login events out of the production_json.log of an Omnibus GitLab server.
The JSON elements that I want to filter for look like this:
{
  "method": "POST",
  "path": "/users/sign_in",
  "format": "html",
  "controller": "SessionsController",
  "action": "create",
  "status": 302,
  "duration": 146.22,
  "view": 0,
  "db": 16.64,
  "location": "https://maschm.ddnss.de/",
  "time": "2021-01-05T11:44:30.180Z",
  "params": [
    {
      "key": "utf8",
      "value": "✓"
    },
    {
      "key": "authenticity_token",
      "value": "[FILTERED]"
    },
    {
      "key": "user",
      "value": {
        "login": "root",
        "password": "[FILTERED]",
        "remember_me": "0"
      }
    }
  ],
  "remote_ip": "46.86.21.18",
  "user_id": 1,
  "username": "root",
  "ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.2 Safari/605.1.15",
  "queue_duration": 7.3,
  "correlation_id": "JtnY93e2ti8"
}
I only want output for such elements.
jq is new to me. I'm using this command now:
sudo tail -f /var/log/gitlab/gitlab-rails/production_json.log |
  jq --unbuffered '
    if .remote_ip != null and .method == "POST" and
       .path == "/users/sign_in" and .action == "create"
    then
      .ua + " " + .remote_ip
    else
      ""
    end
  '
The output looks like this:
""
""
""
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.2 Safari/605.1.15 46.86.21.18"
""
""
""
""
""
""
I have two questions:
How can I avoid the "" output? (There should be no output at all for other JSON elements.)
Is if the correct jq construct for this kind of filtering?
You could use empty instead of "" to solve the problem, but using select() to filter out unwanted stream elements is a cleaner solution.
jq --unbuffered '
select(
.remote_ip != null and
.method == "POST" and
.path == "/users/sign_in" and
.action == "create"
) |
.ua + " " + .remote_ip
'
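For comparison, the same select() logic expressed in Python (a rough sketch, not a replacement for the jq answer; the field names come from the sample log element above, and the script name in the usage line is made up):

import json
import sys

# Print "<ua> <remote_ip>" only for POSTs to /users/sign_in, mirroring the
# jq select() filter above; every other log line is skipped silently.
for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue
    if (event.get("remote_ip") is not None
            and event.get("method") == "POST"
            and event.get("path") == "/users/sign_in"
            and event.get("action") == "create"):
        print(event.get("ua", ""), event["remote_ip"])

Usage (hypothetical script name):

sudo tail -f /var/log/gitlab/gitlab-rails/production_json.log | python3 filter_signins.py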

Response is 404 when a term matcher is added on the API path when using pact-stub-server

describe('Getting asset for player', () => {
  before(() => {
    return provider.addInteraction({
      given: 'GET call',
      uponReceiving: 'Get asset for player',
      withRequest: {
        method: 'GET',
        path: term({
          matcher: '/api/assets/[0-9]+',
          generate: '/api/assets/10006'
        }),
      },
      willRespondWith: {
        status: 200,
        headers: { 'Content-Type': 'application/json' },
        body: assetByPlayer
      }
    });
  });

  it('Get the asset by player', () => {
    return request.get(`http://localhost:${PORT}/api/assets/10006`)
      .set({ 'Accept': 'application/json' }).then((response) => {
        return expect(Promise.resolve(response.statusCode)).to.eventually.equals(200);
      }).catch(err => {
        console.log("Error in asset with player listing", err);
      });
  });
});
The generated JSON file is here: https://pastebin.com/TqRbTmNS
When I use the JSON file in another code base with pact-stub-server, it receives the request from the UI as
===> Received Request ( method: GET, path: /api/assets/10006, query: None, headers: Some({"actasuserid": "5", "content-type": "application/vnd.nativ.mio.v1+json", "host": "masteraccount.local.nativ.tv:30044", "accept": "application/json", "authorization": "Basic bWFzdGVydXNlcjptYXN0ZXJ1c2Vy", "connection": "close", "content-length": "2"}), body: Present(2 bytes) )
but it does not send any response.
But if I just remove the matchingRules part
"matchingRules": {
"$.path": {
"match": "regex",
"regex": "\/api\/assets\/[0-9]+"
}
}
it starts to work again
===> Received Request ( method: GET, path: /api/assets/10006, query: None, headers: Some({"authorization": "Basic bWFzdGVydXNlcjptYXN0ZXJ1c2Vy", "accept": "application/json", "content-length": "2", "connection": "close", "host": "masteraccount.local.nativ.tv:30044", "content-type": "application/vnd.nativ.mio.v1+json", "actasuserid": "5"}), body: Present(2 bytes) )
<=== Sending Response ( status: 200, headers: Some({"Content-Type": "application/json"}), body: Present(4500 bytes) )
and I can see that the data is present.
Could you let me know what is wrong here?

How do I continuously pipe to curl and post with chunking?

I know that curl can post data to a URL:
$ curl -X POST http://httpbin.org/post -d "hello"
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "hello": ""
  },
  "headers": {
    "Accept": "*/*",
    "Content-Length": "5",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "curl/7.50.1"
  },
  "json": null,
  "origin": "64.238.132.14",
  "url": "http://httpbin.org/post"
}
And I know I can pipe to curl to achieve the same thing:
$ echo "hello" | curl -X POST http://httpbin.org/post -d #-
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "hello": ""
  },
  "headers": {
    "Accept": "*/*",
    "Content-Length": "5",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "curl/7.50.1"
  },
  "json": null,
  "origin": "64.238.132.14",
  "url": "http://httpbin.org/post"
}
Now here is where it gets tricky. I know about HTTP transfer-coding and chunking, for example to send a multi-line file:
('silly' is a file that contains several lines of text, as seen here)
$ cat silly | curl -X POST --header "Transfer-Encoding: chunked" "http://httpbin.org/post" --data-binary @-
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "hello\nthere\nthis is a multiple line\nfile\n": ""
  },
  "headers": {
    "Accept": "*/*",
    "Content-Length": "41",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "curl/7.50.1"
  },
  "json": null,
  "origin": "64.238.132.14",
  "url": "http://httpbin.org/post"
}
Now what I want to be able to do is to have curl read a line from stdin, send it as a chunk, and then come back and read stdin again (which allows me to keep it going continuously). This was my first attempt:
curl -X POST --header "Transfer-Encoding: chunked" "http://httpbin.org/post" -d @-
And it does work ONE TIME ONLY when I hit Ctrl-D, but that obviously ends the execution of curl.
Is there any way to tell curl "Send (using chunk encoding) what I've given you so far, and then come back to stdin for more" ?
Thanks so much, I've been scratching my head for a while on this one!
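As an aside that may clarify the desired behaviour (not a statement about what curl itself supports): Python's requests library switches to Transfer-Encoding: chunked when the request body is a generator with no known length, so "read a line from stdin, send it as a chunk, then go back for more" can be sketched like this:

import sys
import requests

def stdin_chunks():
    # Yield each stdin line as soon as it is read; requests sends every
    # yielded value as its own chunk of the request body.
    for line in sys.stdin:
        yield line.encode("utf-8")

# With a generator body, requests uses chunked transfer encoding automatically.
response = requests.post("http://httpbin.org/post", data=stdin_chunks())
print(response.status_code)
print(response.text)

The response still only arrives once stdin is closed, since httpbin answers after the request body is complete.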

Importing Postman Collection Fails

I'm trying to import a Postman collection and I'm getting this error in an alert dialog:
Import Failed
TypeError: null is not an object (evaluating 'postmanBodyData.length')
And then this in the console:
JS Exception Line 54. TypeError: null is not an object (evaluating 'postmanBodyData.length')
Here's a sample of a collection that failed to import:
{
  "id": "5eb54264-f906-b6d7-9ee4-d045875c8ad4",
  "name": "SO Test",
  "order": [
    "ee9c4b31-f6b3-0799-5d9d-298d8257d6d0",
    "513b4473-f1c3-469e-ce67-edaf33faf2d0"
  ],
  "timestamp": 1448497158415,
  "requests": [
    {
      "id": "513b4473-f1c3-469e-ce67-edaf33faf2d0",
      "url": "http://stackoverflow.com/questions/33901145/importing-postman-collection-fails?noredirect=1#comment55564842_33901145",
      "method": "GET",
      "headers": "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\nUpgrade-Insecure-Requests: 1\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.69 Safari/537.36\nAccept-Encoding: gzip, deflate, sdch\nAccept-Language: en-US,en;q=0.8\nCookie: prov=41dcce2f-3878-4f81-b102-86ced2fc0edd; __qca=P0-107192378-1422497046148; gauthed=1; _ga=GA1.2.828174835.1422497046; __cfduid=df57f13c8f66daf4cca857b9bde72d0981447728327\n",
      "data": null,
      "dataMode": "params",
      "version": 2,
      "name": "http://stackoverflow.com/questions/33901145/importing-postman-collection-fails?noredirect=1#comment55564842_33901145",
      "description": "",
      "descriptionFormat": "html",
      "collectionId": "5eb54264-f906-b6d7-9ee4-d045875c8ad4"
    },
    {
      "id": "ee9c4b31-f6b3-0799-5d9d-298d8257d6d0",
      "url": "http://stackoverflow.com/posts/33901145/ivc/2e31?_=1448497117271",
      "method": "GET",
      "headers": "Accept: */*\nX-Requested-With: XMLHttpRequest\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.69 Safari/537.36\nReferer: http://stackoverflow.com/questions/33901145/importing-postman-collection-fails?noredirect=1\nAccept-Encoding: gzip, deflate, sdch\nAccept-Language: en-US,en;q=0.8\nCookie: prov=41dcce2f-3878-4f81-b102-86ced2fc0edd; __qca=P0-107192378-1422497046148; gauthed=1; _ga=GA1.2.828174835.1422497046; __cfduid=df57f13c8f66daf4cca857b9bde72d0981447728327\n",
      "data": null,
      "dataMode": "params",
      "version": 2,
      "name": "http://stackoverflow.com/posts/33901145/ivc/2e31?_=1448497117271",
      "description": "",
      "descriptionFormat": "html",
      "collectionId": "5eb54264-f906-b6d7-9ee4-d045875c8ad4"
    }
  ]
}
This bug was due to passing an empty body incorrectly:
"data": null,
"dataMode": "params",
It has been fixed in v1.1.2
