Instagram seems to have changed its request rules, so the HTML fetched with requests.get no longer contains what it should - python-requests

Instagram seems to have changed its request rules, and the HTML fetched with requests.get no longer contains the data it used to. How can I get past this page to the content?
import json
import random
import re
import requests

def ins_info_catch(ins_id):
    url = 'https://www.instagram.com/' + ins_id + '/'
    header = {'cookie': random.choice(my_cookie), 'User-Agent': set_user_agent}
    r = requests.get(url, headers=header)
    if r.status_code == 429:  # Too Many Requests
        exit()
        return 'error 429'
    elif r.status_code != 200:
        print(f'URL error {str(r.status_code)}')
    else:
        json_match = re.search(r'window\._sharedData = (.*);</script>', r.text)
        if json_match is None:
            print('NoneType')
        else:
            if 'ProfilePage' in json.loads(json_match.group(1))['entry_data']:
                profile_json = json.loads(json_match.group(1))['entry_data']['ProfilePage'][0]['graphql']['user']
                print(profile_json)
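One pattern that can help when the fetched page turns out to be a login or rate-limit wall is to retry with a different cookie and a backoff delay. Below is a minimal sketch of such a wrapper; it assumes the my_cookie list and set_user_agent value from the code above exist, and the check for 'accounts/login' in the final URL is only a guess at how a blocked request shows up.
# Minimal retry sketch, not a confirmed fix: rotate cookies and back off
# when the response looks like a rate-limit or login wall.
import random
import time
import requests

def fetch_profile_html(ins_id, retries=3):
    url = 'https://www.instagram.com/' + ins_id + '/'
    for attempt in range(retries):
        header = {'cookie': random.choice(my_cookie),  # my_cookie from the code above
                  'User-Agent': set_user_agent}        # set_user_agent from the code above
        r = requests.get(url, headers=header)
        # Assumption: a redirect to the login page means this cookie/IP is blocked
        if r.status_code == 200 and 'accounts/login' not in r.url:
            return r.text
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s before retrying
    return None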

Related

Scrapy get all urls from a domain and go beyond the domain by the depth of 2

I'm trying to scrape an online newspaper. I want to get all the URLs within the domain, and if any external URLs (articles from other domains) are mentioned in an article, I may want to fetch those too. In other words, I want the spider to go to a depth of 3 (is that two clicks away from start_urls?). Can someone let me know if the snippet is right or wrong?
Any help is greatly appreciated.
Here is my code snippet:
start_urls = ['www.example.com']
master_domain = tldextract.extract(start_urls[0]).domain
allowed_domains = ['www.example.com']
rules = (Rule(LinkExtractor(deny=(r"/search", r'showComment=', r'/search/')),
              callback="parse_item", follow=True),
         )

def parse_item(self, response):
    url = response.url
    master_domain = self.master_domain
    self.logger.info(master_domain)
    current_domain = tldextract.extract(url).domain
    referer = response.request.headers.get('Referer')
    depth = response.meta.get('depth')
    if current_domain == master_domain:
        yield {'url': url,
               'referer': referer,
               'depth': depth}
    elif current_domain != master_domain:
        if depth < 2:
            yield {'url': url,
                   'referer': referer,
                   'depth': depth}
        else:
            self.logger.debug('depth is greater than 3')
Open your settings and add
DEPTH_LIMIT = 2
For more details, see the DEPTH_LIMIT entry in the Scrapy settings documentation.
There is no need to check the domain with
if current_domain == master_domain:
when you have allowed_domains set; the spider will automatically follow only the domains listed in allowed_domains.
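To make that concrete, here is a minimal sketch of a CrawlSpider that relies on allowed_domains plus the DEPTH_LIMIT setting instead of manual domain checks; the spider name, domain, and parse logic are placeholders, not taken from the question.
# Minimal sketch with placeholder names: domain filtering is left to
# allowed_domains and depth limiting to the DEPTH_LIMIT setting.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NewsSpider(CrawlSpider):
    name = 'news'                              # placeholder
    start_urls = ['https://www.example.com']   # placeholder
    allowed_domains = ['www.example.com']      # only these domains are followed
    custom_settings = {'DEPTH_LIMIT': 2}       # stop following links beyond depth 2

    rules = (
        Rule(LinkExtractor(deny=(r'/search', r'showComment=')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'referer': response.request.headers.get('Referer'),
            'depth': response.meta.get('depth'),
        }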

Python - requests fail silently

import requests
is working properly for all my requests, like so:
url = 'http://www.stackoverflow.com'
response = requests.get(url)
but the following URL does not return any results:
url = 'http://www.billboard.com'
response = requests.get(url)
It stalls and fails silently, returning nothing.
How do I force requests to throw an exception,
so I can tell whether I'm being blacklisted or something else is going on?
Requests won't raise an exception for a bad HTTP response, but you can call raise_for_status to raise an HTTPError exception yourself, for example:
response = requests.get(url)
response.raise_for_status()
Another option is status_code, which holds the HTTP code.
response = requests.get(url)
if response.status_code != 200:
    print('HTTP', response.status_code)
else:
    print(response.text)
If a site returns HTTP 200 for bad requests, but has an error message in the response body or has no body, you'll have to check the response content.
error_message = 'Nothing found'
response = requests.get(url)
if error_message in response.text or not response.text:
    print('Bad response')
else:
    print(response.text)
If a site takes too long to respond, you can set a maximum timeout for the request. If the site doesn't respond within that time, a ReadTimeout exception will be raised.
try:
    response = requests.get(url, timeout=5)
except requests.exceptions.ReadTimeout:
    print('Request timed out')
else:
    print(response.text)
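Putting those checks together, here is a hedged sketch of a small helper that combines a timeout, raise_for_status and an empty-body check; the helper name and the 5-second timeout are arbitrary choices, not part of the answer above.
# Combined sketch: timeout, HTTP errors and an empty body are all surfaced
# instead of failing silently.
import requests

def fetch_or_raise(url, timeout=5):
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()          # raises HTTPError on 4xx/5xx
    if not response.text:
        raise ValueError('Empty response body from ' + url)
    return response.text

try:
    print(fetch_or_raise('http://www.billboard.com'))
except (requests.exceptions.RequestException, ValueError) as exc:
    print('Request failed:', exc)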
Routing the request through a Tor SOCKS proxy with requesocks:
import requesocks
# Initialize a new wrapped requests object
session = requesocks.session()
# Use Tor for both HTTP and HTTPS
session.proxies = {'http': 'socks5://localhost:9050',
                   'https': 'socks5://localhost:9050'}
# Fetch the page that was failing
response = session.get('https://www.billboard.com')
print(response.text)
I was able to get:
raise ConnectionError(e)
requesocks.exceptions.ConnectionError: HTTPSConnectionPool(host='www.billboard.com', port=None): Max retries exceeded with url: https://www.billboard.com/
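As a side note, newer versions of requests can talk to a SOCKS proxy directly once the requests[socks] extra (PySocks) is installed, so a separate requesocks package shouldn't be needed. A minimal sketch, assuming Tor is listening on localhost:9050:
# Sketch only: route the same request through a local Tor SOCKS5 proxy.
# Assumes `pip install requests[socks]` and Tor on port 9050.
import requests

proxies = {'http': 'socks5h://localhost:9050',   # socks5h resolves DNS via the proxy
           'https': 'socks5h://localhost:9050'}
response = requests.get('https://www.billboard.com', proxies=proxies, timeout=10)
print(response.status_code)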

Using HTTPBuilder to execute an HTTP DELETE request

I'm trying to use the Groovy HTTPBuilder library to delete some data from Firebase via an HTTP DELETE request. If I use curl, the following works:
curl -X DELETE https://my.firebase.io/users/bob.json?auth=my-secret
Using the RESTClient class from HTTPBuilder works if I use it like this:
def client = new RESTClient('https://my.firebase.io/users/bob.json?auth=my-secret')
def response = client.delete(requestContentType: ContentType.ANY)
However, when I tried breaking the URL down into its constituent parts, it didn't work:
def client = new RESTClient('https://my.firebase.io')
def response = client.delete(
    requestContentType: ContentType.ANY,
    path: '/users/bob.json',
    query: [auth: 'my-secret']
)
I also tried using the HTTPBuilder class instead of RESTClient
def http = new HTTPBuilder('https://my.firebase.io')
// perform a DELETE request, expecting a TEXT response
http.request(Method.DELETE, ContentType.ANY) {
    uri.path = '/users/bob.json'
    uri.query = [auth: 'my-secret']
    // response handler for a success response code
    response.success = { resp, reader ->
        println "response status: ${resp.statusLine}"
    }
}
But this also didn't work. Surely there's a more elegant approach than stuffing everything into a single string?
There's an example of using HttpURLClient in the tests to do a delete, which in its simplest form looks like:
def http = new HttpURLClient(url:'https://some/path/')
resp = http.request(method:DELETE, contentType:JSON, path: "destroy/somewhere.json")
def json = resp.data
assert json.id != null
assert resp.statusLine.statusCode == 200
Your example is very close to the test for delete in HTTPBuilder.
A few differences I see are:
Your path is absolute, not relative
Your http url path doesn't end with a trailing slash
You're using content type ANY where the test uses JSON. Does the target need the content type to be correct? (Probably not, as you're not setting it in the curl example, unless curl is doing some voodoo on your behalf.)
Alternatively, you could use Apache's HttpDelete, but it requires more boilerplate. For an HTTP connection, this is some code I've got that works. You'll have to adapt it for HTTPS, though.
def createClient() {
    HttpParams params = new BasicHttpParams()
    HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1)
    HttpProtocolParams.setContentCharset(params, "UTF-8")
    params.setBooleanParameter(ClientPNames.HANDLE_REDIRECTS, true)
    SchemeRegistry registry = new SchemeRegistry()
    registry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80))
    ClientConnectionManager ccm = new PoolingClientConnectionManager(registry)
    HttpConnectionParams.setConnectionTimeout(params, 8000)
    HttpConnectionParams.setSoTimeout(params, 5400000)
    HttpClient client = new DefaultHttpClient(ccm, params)
    return client
}

HttpClient client = createClient()
def url = new URL("http", host, Integer.parseInt(port), "/dyn/admin/nucleus$component/")
HttpDelete delete = new HttpDelete(url.toURI())
// if you have any basic auth, you can plug it in here
def auth = "USER:PASS"
delete.setHeader("Authorization", "Basic ${auth.getBytes().encodeBase64().toString()}")
// convert a data map to NVPs
def data = [:]
List<NameValuePair> nvps = new ArrayList<NameValuePair>(data.size())
data.each { name, value ->
    nvps.add(new BasicNameValuePair(name, value))
}
delete.setEntity(new UrlEncodedFormEntity(nvps))
HttpResponse response = client.execute(delete)
def status = response.statusLine.statusCode
def content = response.entity.content
I adapted the code above from a POST version, but the principle is the same.

Scrapy: sending http requests and parse response

I have looked at the Scrapy docs, but can Scrapy send an HTTP form (e.g. user name, password, ...) and parse the result of submitting that form?
There's an example on the same page: http://scrapy.readthedocs.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
You just have to pass a callback function to the request and then parse the result in parse_page2 ;)
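For the form part of the question specifically, a hedged sketch using Scrapy's FormRequest.from_response is shown below; the URL, form field names, and the login check are placeholders, not details from the question.
# Minimal sketch with placeholder names: submit a login form and parse
# the page that comes back in a callback.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'                                   # placeholder
    start_urls = ['http://www.example.com/login']    # placeholder

    def parse(self, response):
        # Fill the form found on the page and submit it
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},  # placeholder fields
            callback=self.after_login,
        )

    def after_login(self, response):
        # Parse the response to the form submission here
        if b'authentication failed' in response.body:
            self.logger.error('Login failed')
            return
        yield {'url': response.url}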

Problem with HTTP Request

I'm trying to make an HTTP request with Python:
import httplib
import urllib
import urllib2

class DownloadManager():
    def __init__(self, servername):
        self.conn = httplib.HTTPConnection(servername)
        print self.conn

    def download(self, modname):
        params = urllib.urlencode({"name": modname})
        self.conn.request("GET", "/getmod", params)
        resp = self.conn.getresponse()
        print resp.status
        print resp.reason
        if resp.status == 200:
            url = resp.read()
        else:
            return
        mod = urllib2.urlopen(url)
        return mod.read()
But I'm getting:
400
Bad request
In server log I see:
WARNING 2011-08-15 06:58:39,692 dev_appserver.py:4013] Request body in GET is not permitted: name=Test
INFO 2011-08-15 06:58:39,692 dev_appserver.py:4248] "GET /getmod HTTP/1.1" 400 -
What's wrong?
The GET request method can't have anything in the body. If you want to pass arguments via the GET method, you have to add the url-encoded parameters to the URL after a question mark '?' character:
params = urllib.urlencode({"name" : modname})
self.conn.request("GET", "/getmod?%s" % params)
However, what it appears you really want to do is a POST request:
params = urllib.urlencode({"name" : modname})
self.conn.request("POST", "/getmod", params)
