How to scrape content? - web-scraping

I created the following code to fetch the content:
import requests

r = requests.post(url='https://icecat.us/index.php/product/offers')
print(r)
print(r.content)
Requests returns HTTP status 200 OK, but r.content is empty, so no content is retrieved, even though the response shown in the browser's developer tools is clearly not empty.
What am I missing? Why is the content not retrieved correctly?
Thanks for your advice!

This endpoint wants a GET request with the values passed as query parameters. In requests, query-string values go in the params argument of requests.get; data sets the request body, which is what requests.post uses.
import requests

payload = {
    'num': '37963146',
    'lang': 'us',
    'offers_country': '0'
}
r = requests.get(
    url='https://icecat.us/index.php/product/offers',
    params=payload,
    headers={
        'X-Requested-With': 'XMLHttpRequest'
    }
)
print(r)
print(r.content)
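If the endpoint returns JSON, as the developer-tools view suggests, it can be parsed straight from the response. A minimal sketch, assuming the body is valid JSON:
data = r.json()  # raises a ValueError if the body is not valid JSON
print(data)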
BTW, I saw this posted on Upwork.

Related

Passing files in POST using Requests in Python

I'm getting the error below when making a POST call with requests:
{'detail': [{'loc': ['body', 'files'], 'msg': 'field required', 'type': 'value_error.missing'}]}
I tried:
response = requests.post("url", headers={mytoken}, params=p, files=files)
files = {"file 1": open("sample.pdf", 'rb'), "file 2": open("sample11.pdf", 'rb')}
I want a 200 status but I'm getting a 422 validation error. Any idea why? It's for API testing purposes; I'm new to this and have been debugging it the whole day, but still couldn't figure it out.
It is not clear from the question what kind of request the server is expecting, nor what exact code snippet you are using.
From the question, the snippet looks as follows:
response = requests.post("url", headers={mytoken}, params=p, files=files)
files = {"file 1": open("sample.pdf", 'rb'), "file 2": open("sample11.pdf", 'rb')}
If so, you are defining files after you send the request; maybe that's why the server complained about the missing files field.
See the example below for how you can send two files to an endpoint expecting files.
import requests
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

post_url = "https://exampledomain.local/upload"
headers = {"Authorization": "Bearer <your_token_here>"}
params = {"file_type": "pdf"}

# open the files before sending the request; the with block closes them afterwards
with open("sample1.pdf", "rb") as file1, open("sample2.pdf", "rb") as file2:
    files = {"file1": file1, "file2": file2}
    response = requests.post(post_url, files=files, headers=headers, params=params)

logger.debug(response.text)
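Note also that the error's loc of ['body', 'files'] suggests the endpoint expects the multipart field itself to be named files (an assumption based on the error's shape, which looks like FastAPI's validation output). If so, requests can send several files under one field name by passing a list of tuples instead of a dict:
# hypothetical: the endpoint wants every upload under the single field name "files"
with open("sample1.pdf", "rb") as f1, open("sample2.pdf", "rb") as f2:
    files = [("files", f1), ("files", f2)]
    response = requests.post(post_url, files=files, headers=headers, params=params)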

Locust does respond with 2xx but fails to gather request statistics

I am trying to do local load testing with Locust. I got the test environment up and running, and a local build is also working. I am testing the responses of a local path, and the response I get in the terminal is correct. But the Locust UI, and also the statistics after terminating the test, give me 100% failed results.
To create the Locust code (I am pretty new to it) I took the Postman-generated content and adjusted it. This is the code for Locust:
from locust import HttpLocust, TaskSet, task, between
import requests

url = "http://localhost:8080/registry/downloadCounter"
payload = "[\n {\n \"appName\": \"test-app\",\n \"appVersion\": \"1.6.0\"\n }\n]"

class MyTaskSet(TaskSet):
    @task(2)
    def index(self):
        self.client.get("")
        headers = {
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        }
        response = requests.request("POST", url, headers=headers, data=payload)
        print(response.text.encode('utf8'))

class MyLocust(HttpLocust):
    task_set = MyTaskSet
    wait_time = between(2.0, 4.0)
For the Locust swarm I used just basic numbers:
Number of total users to simulate: 1
Hatch Rate: 3
Host: http://localhost:8080/registry/downloadCounter
I do not get any results there; the table stays blank. I guess it has something to do with the JSON format, but I am not able to find the solution myself.
I also put a screenshot of the terminal response after termination in this post.
Thank you in advance for your help!
Best regards
This helped. The key change is issuing the request through self.client, which Locust instruments for its statistics, instead of going through the plain requests module:
from locust import HttpLocust, TaskSet, task, between

url = "http://localhost:8080/registry/downloadCounter"
payload = "[\n {\n \"appName\": \"test-app\",\n \"appVersion\": \"1.6.0\"\n }\n]"
headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}

class MyTaskSet(TaskSet):
    @task(2)
    def index(self):
        response = self.client.post(url=url, data=payload, headers=headers)
        print(response.text.encode('utf8'))
        print(response.status_code)

class MyLocust(HttpLocust):
    task_set = MyTaskSet
    wait_time = between(2.0, 4.0)
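Since Locust's client is a thin wrapper around a requests session, you can also pass the payload as a Python object via json=, which serializes it and sets the Content-Type header for you. A minimal sketch of the same task under that assumption (the header dict becomes unnecessary):
@task(2)
def index(self):
    # json= serializes the list to JSON and sets Content-Type: application/json
    response = self.client.post(url, json=[{"appName": "test-app", "appVersion": "1.6.0"}])
    print(response.status_code)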

Python Request Session JIRA REST post http 405

Using a Python requests Session I can connect to JIRA and retrieve issue information ...
session = requests.Session()
headers = {"Authorization": "Basic %s" % bas64_val}
session.post(jira_rest_url, headers=headers)
jira = session.get(jira_srch_issue_url + select_fields)
# select_fields = the fields I want from the issue
Now I'm trying to post a payload via the JIRA API, using a fixed issue URL, e.g. "https://my_jira_server.com:1234/rest/api/latest/issue/KEY-9876".
This should be a case of the following, given https://developer.atlassian.com/jiradev/jira-apis/about-the-jira-rest-apis/jira-rest-api-tutorials/jira-rest-api-example-edit-issues:
payload = { "update": {
    "fixVersions": [ {"set": "release-2.139.0"} ]
}}
posted = session.post(jira_task_url, data=payload)
# returns <Response [405]>
# jira_task_url = https://my_jira_server.com:1234/rest/api/latest/issue/KEY-9876
But this doesn't appear to work! Digging into the HTTP 405 response suggests that my payload is not properly formatted, which is notably not the easiest thing to diagnose.
What am I doing wrong here? Any help would be much appreciated.
Please note, I am not looking to use the Python jira module; I am using requests.Session to manage several sessions for different systems, i.e. JIRA, TeamCity, etc.
Found the solution! I had two problems:
1) The actual syntax structure should have been:
fix_version = { "update": { "fixVersions": [ {"set": [{ "name": "release-2.139.0" }]} ] } }
2) To ensure the payload is actually presented as JSON, use json.dumps(), which takes an object and produces a string, AND set 'content-type' to 'application/json':
payload = json.dumps(fix_version)
app_json = { 'content-type': 'application/json' }
session.put("https://.../rest/api/latest/issue/KEY-9876", headers=app_json, data=payload)
Rather than trying to define the JSON manually! Note also the switch from session.post to session.put: editing an issue is a PUT in the JIRA REST API, which is what the 405 (Method Not Allowed) was really complaining about.
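Putting both fixes together, here is a minimal sketch, assuming the session already carries your auth headers and keeping the elided issue URL from above; requests' json= keyword runs json.dumps() internally and sets the Content-Type header in one step:
import requests

session = requests.Session()
fix_version = {"update": {"fixVersions": [{"set": [{"name": "release-2.139.0"}]}]}}

# json= serializes the dict and sets Content-Type: application/json in one go
posted = session.put("https://.../rest/api/latest/issue/KEY-9876", json=fix_version)
print(posted.status_code)  # the JIRA edit-issue endpoint returns 204 on success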

Can't yield json response from server with redux-saga

I am trying to use the stack React + Redux + redux-saga and understand the minimal plumbing needed.
I made a GitHub repo to reproduce the error I got:
https://github.com/kasra0/react-redux-saga-test.git
Running the app: npm run app
URL: http://localhost:3000/
The app consists of a simple combo box and a button.
Once a value is selected from the combo, clicking the button dispatches an action that simply fetches some JSON data.
The server receives the right request (based on the selected value), but at the line let json = yield call([res, 'json']) I get an error.
The error message I got in the browser:
index.js:2177 uncaught at equipments at equipments
at takeEvery
at _callee
SyntaxError: Unexpected end of input
at runCallEffect (http://localhost:3000/static/js/bundle.js:59337:19)
at runEffect (http://localhost:3000/static/js/bundle.js:59259:648)
at next (http://localhost:3000/static/js/bundle.js:59139:9)
at currCb (http://localhost:3000/static/js/bundle.js:59212:7)
at <anonymous>
It comes from one of my sagas:
import {call, takeEvery, apply, take} from 'redux-saga/effects'
import action_types from '../redux/actions/action_types'

let process_equipments = function* (...args) {
    let {department} = args[0]
    let fetch_url = `http://localhost:3001/equipments/${department}`
    console.log('fetch url : ', fetch_url)
    let res = yield call(fetch, fetch_url, {mode: 'no-cors'})
    // -> this is the line where something goes wrong
    let json = yield call([res, 'json'])
}

export function* equipments() {
    yield takeEvery(action_types.EQUIPMENTS, process_equipments)
}
I did something wrong in the plumbing but I can't find where.
Thanks a lot for your help!
Kasra
Just another way to call .json(), without using the call effect:
let res = yield call(fetch,fetch_url, {mode: 'no-cors'})
// instead of const json = yield call([res, res.json]);
let json = yield res.json();
console.log(json)
From the redux-saga viewpoint the code is essentially correct: the two promises are executed sequentially by the call effect.
let res = yield call(fetch,fetch_url, {mode: 'no-cors'})
let json = yield call([res, 'json'])
But using fetch in no-cors mode means the response body will not be available to your code, because the response is opaque in that mode: https://fetch.spec.whatwg.org/#http-fetch
If you want to fetch information from a different origin, use cors mode together with the appropriate HTTP header on the server, like Access-Control-Allow-Origin. For more information, see https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS
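Concretely, that means switching the fetch to cors mode and having the server at localhost:3001 reply with the Access-Control-Allow-Origin header. A minimal sketch of the adjusted saga, assuming the server can be configured that way:
import {call} from 'redux-saga/effects'

let process_equipments = function* (...args) {
    let {department} = args[0]
    let fetch_url = `http://localhost:3001/equipments/${department}`
    // cors mode: the response is no longer opaque, so its body can be read,
    // provided the server sends Access-Control-Allow-Origin
    let res = yield call(fetch, fetch_url, {mode: 'cors'})
    let json = yield call([res, 'json'])
    console.log(json)
}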

scraping content of ASP.NET based website using scrapy

I've been trying to scrape some lists from this website http://www.golf.org.au; it's ASP.NET based. I did some research, and it appears that I must pass some values in a POST request to make the website fetch the data into the tables. I did that, but I'm still failing. Any idea what I'm missing?
Here is my code:
# -*- coding: utf-8 -*-
import scrapy

class GolfscraperSpider(scrapy.Spider):
    name = "golfscraper"
    allowed_domains = ["golf.org.au", "www.golf.org.au"]
    ids = ['3012801330', '3012801331', '3012801332', '3012801333']
    start_urls = []
    for id in ids:
        start_urls.append('http://www.golf.org.au/handicap/%s' % id)

    def parse(self, response):
        scrapy.FormRequest('http://www.golf.org.au/default.aspx?s=handicap',
            formdata={
                '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                'ctl11$ddlHistoryInMonths': '48',
                '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
                '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                'gaHandicap': '6.5',
                'golflink_No': '2012003003',
                '__VIEWSTATEGENERATOR': 'CA0B0334',
            },
            callback=self.parse_details)

    def parse_details(self, response):
        for name in response.css('div.rnd-course::text').extract():
            yield {'name': name}
Yes, ASP.NET pages are tricky to scrape. Most probably some little parameter is missing.
Solutions for this:
Instead of creating the request through scrapy.FormRequest(...), use the scrapy.FormRequest.from_response() method (see the code example below). This captures most or even all of the hidden form data and uses it to prepopulate the FormRequest's data.
It also seems you forgot to return the request; maybe that's another potential problem.
As far as I recall, __VIEWSTATEGENERATOR also changes each time and has to be extracted from the page.
If this doesn't work, fire up your Firefox browser with the Firebug plugin or Chrome's developer tools, do the request in the browser, and then compare the full request header and body data against the same data in your request. There will be some difference.
Example code with all my suggestions:
def parse(self, response):
    req = scrapy.FormRequest.from_response(response,
        formdata={
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            'ctl11$ddlHistoryInMonths': '48',
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'gaHandicap': '6.5',
            'golflink_No': '2012003003',
            '__VIEWSTATEGENERATOR': 'CA0B0334',
        },
        callback=self.parse_details)
    self.logger.info(req.headers)
    self.logger.info(req.body)
    return req
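As a side note, from_response() already copies the hidden ASP.NET fields (__VIEWSTATE, __EVENTVALIDATION, __VIEWSTATEGENERATOR and friends) out of the page's form, so the formdata can usually be pared down to just the values you actually change. A minimal sketch under that assumption:
def parse(self, response):
    # the hidden ASP.NET state fields are pre-populated by from_response,
    # so only the overridden values need to be listed explicitly
    yield scrapy.FormRequest.from_response(
        response,
        formdata={
            '__EVENTTARGET': 'ctl11$ddlHistoryInMonths',
            'ctl11$ddlHistoryInMonths': '48',
        },
        callback=self.parse_details,
    )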
