Python3 requests-html: Unfortunately, automated access to this page was denied - web-scraping

Hello StackOverflow community,
A few months ago, I created a scraper with Python 3 and requests-html together with BeautifulSoup in order to scrape car ads from https://www.mobile.de. The scraper uses the search URL below to fetch a list of all available car ads and then iterates through the detail pages.
Here is the code:
from bs4 import BeautifulSoup, SoupStrainer
from requests_html import HTMLSession
import re
url = 'https://suchen.mobile.de/fahrzeuge/search.html?&damageUnrepaired=NO_DAMAGE_UNREPAIRED&isSearchRequest=true&makeModelVariant1.makeId=25200&makeModelVariant1.modelId=g29&scopeId=C&sfmr=false'
session = HTMLSession()
r = session.get(url)
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(r.content,'lxml', parse_only=only_a_tags)
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://suchen\.mobile\.de/fahrzeuge/details\.html")}):
    print(link.get("href"))
For the past few days, the scraper has not been able to fetch car ads from the website. When it iterates through all <a> tags to find the detail pages of the car ads (which always start with https://suchen.mobile.de/fahrzeuge/details.html), no results are shown. Previously, the links to the car ad detail pages were printed.
When I print the HTML content, I only get the following block page:
b'<!DOCTYPE html>\n<html>\n <!--\nLeider koennen wir Dir an dieser Stelle keinen Zugriff auf unsere Daten gewaehren.\nSolltest Du weiterhin Interesse an einem Bezug unserer Daten haben, wende Dich bitte an:\n\nUnfortunately, automated access to this page was denied.\nIf you are interested in accessing our data, please contact us:\n\nPhone:\n+49 (0) 30 8109-7573\n\nMail:\nDatenpartner#team.mobile.de\n -->\n <head>\n <meta charset="UTF-8">\n\n <title>Ups, bist Du ein Mensch? / Are you a human?</title>\n <link rel="stylesheet" href="https://static.classistatic.de/shared/mde-style/2.1.0/style.css">\n <link rel="icon" type="image/x-icon" href="">\n <script src=\'https://www.google.com/recaptcha/api.js\'></script>\n </head>\n <body>\n <header id="mdeHeader" class="header">\n <div class="header-meta-container header-hidden-small">\n <!-- placeholder for desktop meta -->\n </div>\n <div class="header-navbar clearfix">\n <div class="header-corporate">\n <i class="gicon-mobilede-logo"></i>\n <span class="claim header-hidden-small">Deutschlands gr\xc3\xb6\xc3\x9fter Fahrzeugmarkt</span>\n </div>\n </div>\n </header>\n <div class="g-container">\n <h2 class="u-pad-bottom-18 u-margin-top-18">Ups, bist Du ein Mensch? 
/ Are you a human?</h2>\n\n\n <div id="root"></div>\n <div class="cBox cBox--content">\n <p><b>\n Um fortzufahren muss dein Browser Cookies unterst\xc3\xbctzen und JavaScript aktiviert sein.<br>\n To continue your browser has to accept cookies and has to have JavaScript enabled.</b>\n </p>\n\n <p>\n Bei Problemen wende Dich bitte an:<br>\n In case of problems please contact:\n </p>\n <p>\n Phone: 030 81097-601<br>\n Mail: service#team.mobile.de\n </p>\n\n <p>\n Sollte grunds\xc3\xa4tzliches Interesse am Bezug von mobile.de Daten bestehen, wende Dich bitte an:<br/>\n If you are primarily interested in purchasing data from mobile.de, please contact:\n </p>\n <p>\n Mail: Datenpartner#team.mobile.de\n </p>\n </div>\n <hr class="u-pad-top-9 u-pad-bottom-18"/>\n <div id="footer"></div>\n <script async src="https://www.mobile.de/api/consent/static/js/consentBanner.js"></script>\n <script type="text/javascript" src="https://www.mobile.de/youre-blocked/app.js"></script><script type="text/javascript" >var _cf = _cf || []; _cf.push([\'_setFsp\', true]); _cf.push([\'_setBm\', true]); _cf.push([\'_setAu\', \'/static/16b9372bb8fti233b6fc758bf7a4291f0\']); </script><script type="text/javascript" src="/static/16b9372bb8fti233b6fc758bf7a4291f0"></script></body>\n</html>\n'
When I first created the scraper, I also received the "Unfortunately, automated access to this page was denied." message when using urllib, so I switched to requests-html and everything worked great.
I have already tried the following approaches, but none of them has worked so far :(
proxy rotation (I thought my IP address might have been blocked)
a different user agent in the header via the fake_useragent library
I hope you are able to help, as I currently don't know what else I could try.
Thanks a lot in advance for helping me with this issue :)

Use a Selenium Webdriver to first navigate to the search page and then run the query from there.
I just got the same message on my own machine when running your code. When I visit the site manually, I also see a reCAPTCHA. Even opening it directly with Selenium generates the reCAPTCHA.
Were I working to defeat you, I would just require the reCAPTCHA whenever a direct connection was made to search results. That would be my guess for how you are being blocked. When I use a WebDriver to first navigate to the search page, I do not get challenged.
Here is the code that I used.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# implicitly_wait takes seconds and applies to every later element lookup, so it is set once
driver.implicitly_wait(5)  # not good practice, but quick and easy
driver.get("https://suchen.mobile.de/fahrzeuge/search.html")
# Accept the cookie consent banner, pick a fuel type, then submit the search form
driver.find_element(By.ID, "gdpr-consent-accept-button").click()
driver.find_element(By.ID, "fuels-PETROL-ds").click()
driver.find_element(By.ID, "dsp-upper-search-btn").click()
This is not going to work forever, but it works at least for now.
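Once the results page has loaded, the link extraction from the question can be applied to the page source. The following is only a sketch using the standard library: the names `DetailLinkParser` and `extract_detail_links` are made up for illustration, and the block-page check assumes the denial text quoted in the question is still what the site returns.

```python
from html.parser import HTMLParser

# URL prefix of the ad detail pages, taken from the question
DETAIL_PREFIX = "https://suchen.mobile.de/fahrzeuge/details.html"
# Text from the block page quoted in the question (compared in lower case)
BLOCK_MARKER = "automated access to this page was denied"


class DetailLinkParser(HTMLParser):
    """Collects href values of <a> tags that point to ad detail pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(DETAIL_PREFIX):
            self.links.append(href)


def extract_detail_links(html: str) -> list[str]:
    """Return detail-page links, or raise if the block page was served."""
    if BLOCK_MARKER in html.lower():
        raise RuntimeError("received the bot-detection page instead of search results")
    parser = DetailLinkParser()
    parser.feed(html)
    return parser.links
```

With the WebDriver from above, this would be called as `extract_detail_links(driver.page_source)`. Raising on the block page makes the failure visible instead of silently printing nothing.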

Google Analytics duplicate URL with Unicode characters

When I checked Google Analytics > Acquisition > Search Console > Landing Pages,
I noticed that I have two URLs for each blog post.
for example:
blog/429/legal/اسقاط-کافه-خیارات-به-چه-معناست/
and
/blog/429/legal/%D8%A7%D8%B3%D9%82%D8%A7%D8%B7-%DA%A9%D8%A7%D9%81%D9%87-%D8%AE%DB%8C%D8%A7%D8%B1%D8%A7%D8%AA-%D8%A8%D9%87-%DA%86%D9%87-%D9%85%D8%B9%D9%86%D8%A7%D8%B3%D8%AA/
Both refer to the same blog post.
The main problem is the statistics:
URL #2 has 0 impressions, clicks, and CTR but the correct position, while URL #1 has the correct impressions, clicks, and CTR but an incorrect position.
My blog posts have a canonical tag, and I have checked all internal links. Every link uses the same form (for example: example.com/blog/429/legal/اسقاط-کافه-خیارات-به-چه-معناست/).
So:
1- What is the source of the problem?
2- How can I fix it?
This is a Google Search Console issue. The URL is sent to Google Analytics in encoded form, and Analytics decodes it for display. When Analytics retrieves the URL from Search Console, it shows it exactly as received. I don't think there is an effective fix within the two tools, but you can export the data and process it elsewhere. For example, JavaScript (in a spreadsheet with Google Apps Script) can decode a URI in a single operation, after which you can build a table in the spreadsheet that matches the two forms and compares the metrics.
<div id="get_url_encoded">/blog/429/legal/%D8%A7%D8%B3%D9%82%D8%A7%D8%B7-%DA%A9%D8%A7%D9%81%D9%87-%D8%AE%DB%8C%D8%A7%D8%B1%D8%A7%D8%AA-%D8%A8%D9%87-%DA%86%D9%87-%D9%85%D8%B9%D9%86%D8%A7%D8%B3%D8%AA/</div>
<br /><br />
<div id="set_url_decoded"></div>
<script>
var uri = document.getElementById("get_url_encoded").innerHTML;
var uri_dec = decodeURIComponent(uri);
document.getElementById("set_url_decoded").innerHTML = uri_dec;
</script>
https://jsfiddle.net/michelepisani/de058c4o/5/
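The matching step on exported data can be sketched in Python as well. This is a minimal sketch under assumptions: the row layout `(path, impressions, clicks)` and the function name `merge_by_decoded_path` are invented for illustration and do not correspond to any specific export format.

```python
from collections import defaultdict
from urllib.parse import unquote


def merge_by_decoded_path(rows):
    """Sum metrics of rows whose paths are equal after percent-decoding.

    rows: iterable of (path, impressions, clicks) tuples (hypothetical layout).
    """
    merged = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for path, impressions, clicks in rows:
        # Decode %XX escapes and normalise leading/trailing slashes so that
        # "blog/1/x/" and "/blog/1/x/" land on the same key.
        stripped = unquote(path).strip("/")
        key = f"/{stripped}/" if stripped else "/"
        merged[key]["impressions"] += impressions
        merged[key]["clicks"] += clicks
    return dict(merged)
```

Only additive metrics are summed here; position is an average and would need to be weighted by impressions if it is merged too.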

Can't webscrape with R the site of Fitch Ratings

I'm trying to scrape the Fitch Ratings website, but so far I can't get what I want: the list of ratings. When I scrape with R, it returns the header of the website, and the body contains an "iframe" from Google Tag Manager that "hides" the content that matters.
website: https://www.fitchratings.com/site/search?content=research&filter=RESEARCH%20LANGUAGE%5EPortuguese%2BGEOGRAPHY%5EAmericas%2BREPORT%20TYPE%5EHeadlines%5ERating%20Action%20Commentary
return:
[1] <head>\n<title>Search - Fitch Ratings</title>\n<!-- headerScripts --><!-- --><meta http-equiv="Content-Type" content="text/html; chars ...
[2] <body id="search-results">\n <div id="privacy-policy-tos-modal-container"></div>\n <!-- Google Tag Manager (noscript) -- ...
_____________
What I want:
Date;Research;Type;Text
04 Sep 2019; Fitch afirma Rating de Qualidade(...);Rating Action Commentary;Fitch Ratings-Sao Paulo - 04 September 2019: A Fitch Ratings Afirmou hoje, o Rating de Qualidade de Gestão de Ivnestimento 'Excelente' (...)
02 Sep 2019; Fitch Eleva Rating (...); Rating Action Commentary; Fitch Ratings - Sao Paulo - 02 September 2019: A Fitch Ratings elevou hoje (...)
Code below:
library(rvest)  # provides read_html()
html_of_site <- read_html(url("https://www.fitchratings.com/site/search?content=research&filter=RESEARCH%20LANGUAGE%5EPortuguese%2BGEOGRAPHY%5EAmericas%2BREPORT%20TYPE%5EHeadlines%5ERating%20Action%20Commentary"))
html_of_site
Short Answer: Don't scrape this website.
Long Answer: Technically it is possible to scrape this site, but you need your code to act like a human. What this means is that you would need to convince Fitch Group's server that you are indeed a human visitor and not a bot.
To do this you need to:
Send the same headers that your browser would send to the site
Keep track of any cookies the site sends back to you and return them in subsequent requests if necessary
Evaluate any scripts sent back by the server (to actually load the data you want).
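For a site whose policy does allow automated access, the first two steps can be sketched with the Python standard library. The header values and the function name `make_browser_like_opener` are illustrative assumptions, and this sketch does not cover the third step: evaluating scripts needs a real browser engine.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_browser_like_opener(user_agent: str):
    """Build an opener that sends browser-style headers and keeps cookies."""
    jar = CookieJar()
    # HTTPCookieProcessor stores cookies from responses in the jar and
    # sends them back on later requests made through the same opener.
    opener = build_opener(HTTPCookieProcessor(jar))
    # Illustrative header set; copy the real values from your own browser.
    opener.addheaders = [
        ("User-Agent", user_agent),
        ("Accept", "text/html,application/xhtml+xml"),
        ("Accept-Language", "en-US,en;q=0.8"),
    ]
    return opener, jar
```

Every request made with `opener.open(...)` then shares the same headers and cookie jar.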
I wasn't able to access the site policy for thefitchgroup.com, but I assume it includes clauses about what bots are and are not allowed to do on the site. Since this company likely sells the data you are trying to scrape, you should probably avoid scraping this site.
In general, don't scrape sites without reading the site policies first. If the data you are scraping is not free without scraping it, then you probably shouldn't be scraping it.

Mailchimp RSS campaign only includes 1 post

I'm setting up an RSS campaign with Mailchimp and have hit a roadblock. The import seems to work and the design looks great, but we only ever get one post, the most recent one, into the email.
The RSS feed is: https://our.news/feed/trending
We have verified that pubDate is included and properly formatted on all items, e.g.:
<item>
<title>The FBI is warning you to reboot your router to prevent a new attack here’s everything you need to do</title>
<link>https://our.news/2018/05/30/the-fbi-is-warning-you-to-reboot-your-router-to-prevent-a-new-attack-heres-everything-you-need-to-d/</link>
<comments>https://our.news/2018/05/30/the-fbi-is-warning-you-to-reboot-your-router-to-prevent-a-new-attack-heres-everything-you-need-to-d/#comments</comments>
<pubDate>Wed, 30 May 2018 07:33:04 +0000</pubDate>
<dc:creator><![CDATA[OurBot]]></dc:creator>
<category><![CDATA[Headlines]]></category>
<guid isPermaLink="false">https://our.news/?p=103857</guid>
<description><![CDATA[BUSINESSINSIDER.COM – On Friday, the FBI said anyone who uses a router to connect to the internet should reboot their routers. That will “temporarily disrupt...]]></description>
<wfw:commentRss>https://our.news/2018/05/30/the-fbi-is-warning-you-to-reboot-your-router-to-prevent-a-new-attack-heres-everything-you-need-to-d/feed/</wfw:commentRss>
<slash:comments>1</slash:comments>
<media:content xmlns:media="http://search.yahoo.com/mrss/" medium="image" type="image/jpeg" url="https://dsezcyjr16rlz.cloudfront.net/wp-content/uploads/2018/05/30023303/httpsamp.businessinsider.comimages5b0d64001ae66220008b47d5640320.1.jpg.jpg" width="150" height="75" />
</item>
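The pubDate check mentioned above can be automated. Here is a minimal sketch, assuming the feed XML has already been downloaded into a string. The function name `items_with_bad_pubdate` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def items_with_bad_pubdate(rss_xml: str) -> list[str]:
    """Return titles of <item> elements whose pubDate is missing or unparseable."""
    root = ET.fromstring(rss_xml)
    bad = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        raw = item.findtext("pubDate")
        try:
            if raw is None:
                raise ValueError("missing pubDate")
            # RSS dates use the RFC 822 format, the same one email headers use
            parsedate_to_datetime(raw.strip())
        except (TypeError, ValueError):
            # Older Python versions raise TypeError for unparseable dates
            bad.append(title)
    return bad
```

An empty list means every item carries a parseable date.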
The email template we're using is simple; the relevant section is:
*|RSSITEMS:[$count=5]|*
<span style="float:left">*|RSSITEM:IMAGE|* </span>
*|RSSITEM:TITLE|*
*|END:RSSITEMS|*
This happens in Preview Mode, in the Test Email, AND in the actual weekly campaign sends. The campaign is set to send weekly, and when it does, it only includes the first item from the list. Ideally, we'd like this to always just include the most recent 5 items. Anyone have any ideas?
Try using a FEEDBLOCK instead:
*|FEEDBLOCK:https://www.url.com/test.xml|*
*|FEEDITEMS:[$count=5]|*
<span style="float:left">*|FEEDITEM:IMAGE|* </span>
*|FEEDITEM:TITLE|*
*|END:FEEDITEMS|*
*|END:FEEDBLOCK|*

Analytics Experiment Parser blocking error

I've tried to implement Google Analytics Content Experiments on one of my pages. I added the following code:
<!-- Google Analytics Content Experiment code -->
<script>function utmx_section(){}function utmx(){}(function(){var
k='122644017-3',d=document,l=d.location,c=d.cookie;
if(l.search.indexOf('utm_expid='+k)>0)return;
function f(n){if(c){var i=c.indexOf(n+'=');if(i>-1){var j=c.
indexOf(';',i);return escape(c.substring(i+n.length+1,j<0?c.
length:j))}}}var x=f('__utmx'),xx=f('__utmxx'),h=l.hash;d.write(
'<sc'+'ript src="'+'http'+(l.protocol=='https:'?'s://ssl':
'://www')+'.google-analytics.com/ga_exp.js?'+'utmxkey='+k+
'&utmx='+(x?x:'')+'&utmxx='+(xx?xx:'')+'&utmxtime='+new Date().
valueOf()+(h?'&utmxhash='+escape(h.substr(1)):'')+
'" type="text/javascript" charset="utf-8"><\/sc'+'ript>')})();
</script><script>utmx('url','A/B');</script>
<!-- End of Google Analytics Content Experiment code -->
to the head, but my console gives the following output:
A Parser-blocking, cross-origin script, https://ssl.google-analytics.com/ga_exp.js?utmxkey=122644017-3&utmx=&utmxx=&utmxtime=1475762834841, is invoked via document.write. This may be blocked by the browser if the device has poor network connectivity.
Failed to execute 'write' on 'Document': It isn't possible to write into a document from an asynchronously-loaded external script unless it is explicitly opened.
I have tried googling this error and searched this site before posting, but I have no idea why I am getting it. Could anyone here point me in the right direction to solve this?

flickr and wordpress integration don't do it

Here is the code that shows photos from Flickr.
This user, 53335537@N04, doesn't show anything at all,
but this user, 85173533@N00, works great.
<script type="text/javascript">
jQuery.noConflict();
jQuery(document).ready(function() {
    var cesc = new flickrshow('flickrbox', {
        'autoplay': true,
        'hide_buttons': false,
        'interval': 3500,
        'page': 1,
        'per_page': 10,
        'user': '53335537@N04'
    });
});
</script>
Question: why doesn't this user work?
This address works fine, so Flickr is not blocking it:
http://www.flickr.com/photos/53335537@N04
Here is the live page (upper right): http://www.notrepanorama.com/1-la-table-et-ses-partenaires/
It seems to call this URL: http://api.flickr.com/services/rest/?api_key=6cb7449543a9595800bc0c365223a4e8&extras=url_s,url_m,url_z,url_l&format=json&jsoncallback=flickrshow_jsonp_22262679527&page=1&per_page=10&license=1,2,3,4,5,6,7&method=flickr.photos.search&user_id=53335537@N04&
which returns an empty result set:
flickrshow_jsonp_22262679527({"photos":{"page":1, "pages":0, "perpage":10, "total":"0", "photo":[]}, "stat":"ok"})
Removing the license=1,2,3,4,5,6,7 param causes results to be returned
So this user has apparently not licensed his images under one of the listed licenses. Flickrshow has this to say about that parameter:
"A comma seperated list of the allowable licenses within your slideshow. If set to null, no license restrictions will be set so please ensure you have permission to display the images. See the Flickr API for more information on license codes."
Here's the relevant doc page from flickr: http://www.flickr.com/services/api/flickr.photos.licenses.getInfo.html
It seems that embedding that user's images, since they're marked "all rights reserved", is legally questionable (although I'm guessing in this case, the embedder and the photo owner are the same person). flickrshow only displays images with CC licenses by default, it seems.
So, in the end: either relicense the photos, or override flickrshow's license filter (probably by adding 'license':null, to your params)
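To reproduce the difference outside the slideshow, the two request URLs can be built side by side. This is only a sketch: the parameter names are copied from the URL quoted above, while the helper name `build_search_url` and the placeholder API key are assumptions for illustration.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_search_url(user_id, api_key, licenses=None, per_page=10):
    """Build a flickr.photos.search URL, with or without a license filter."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "user_id": user_id,
        "per_page": per_page,
        "page": 1,
        "format": "json",
    }
    if licenses:
        # Comma-separated license codes, as in the URL flickrshow generates
        params["license"] = ",".join(str(code) for code in licenses)
    return f"{API_ENDPOINT}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a working key
filtered = build_search_url("53335537@N04", "YOUR_API_KEY", licenses=range(1, 8))
unfiltered = build_search_url("53335537@N04", "YOUR_API_KEY")
```

Opening both URLs in a browser with a valid key should show whether the license filter is what empties the result set.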
