I got timeout error with scrapy when crawling this page - web-scraping

I can't crawl this page: https://www.adidas.pe/. Running scrapy crawl my_spider returns:
2018-12-17 15:36:39 [scrapy.core.engine] INFO: Spider opened
2018-12-17 15:36:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-17 15:36:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-12-17 15:36:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.adidas.pe/> from <GET http://adidas.pe/>
2018-12-17 15:37:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-17 15:38:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
I tried to change settings.py:
COOKIES_ENABLED = True
ROBOTSTXT_OBEY = False
and it doesn't work.

You could try changing USER_AGENT in settings.py; that works for me. My settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for adidas project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'adidas'
SPIDER_MODULES = ['adidas.spiders']
NEWSPIDER_MODULE = 'adidas.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
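If you prefer not to change the project-wide settings, the same user agent can also be set per spider through custom_settings. A minimal sketch (the spider class itself is hypothetical; only the user agent and the site come from above):

import scrapy


class AdidasSpider(scrapy.Spider):
    # Hypothetical spider, shown only to illustrate per-spider settings.
    name = 'adidas_example'
    start_urls = ['https://www.adidas.pe/']

    # custom_settings overrides the project settings for this spider only.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'),
        'ROBOTSTXT_OBEY': False,
    }

    def parse(self, response):
        # Just confirm the page downloaded; real parsing goes here.
        self.logger.info('Got %s (%d bytes)', response.url, len(response.body))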

Related

Scrapy playwright gets stuck on Telnet console listening

I am doing a practice project about scraping dynamically loaded content with scrapy-playwright, but I have hit a wall and cannot figure out what the issue is. The spider simply refuses to start the crawling process and gets stuck on the "Telnet console listening on 127.0.0.1:6023" part.
I set up the project as recommended in the tutorial.
This is how the relevant part of my settings.py looks (I also played around with other settings, such as CONCURRENT_REQUESTS and COOKIES_ENABLED, to try to fix it, but nothing changed):
import asyncio
from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
And this is the spider itself:
import scrapy
from scrapy import Request
from scrapy_playwright.page import PageMethod


class roksh_crawler(scrapy.Spider):
    name = "roksh_crawler"

    def start_requests(self):
        yield Request(
            url="https://www.roksh.com/",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("screenshot", path="example.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        screenshot = response.meta["playwright_page_methods"][0]
        # screenshot.result contains the image's bytes
I tried to take a screenshot of the page, but nothing else works either, so I assume this is not the issue.
And here is the log I am getting:
2022-11-24 09:54:19 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: roksh_crawler)
2022-11-24 09:54:19 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.0.1, Twisted 21.7.0, Python 3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.3,
Platform Windows-10-10.0.19045-SP0
2022-11-24 09:54:19 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'roksh_crawler',
'CONCURRENT_REQUESTS': 32,
'NEWSPIDER_MODULE': 'roksh.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['roksh.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-24 09:54:19 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-24 09:54:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-24 09:54:19 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-24 09:54:19 [scrapy.extensions.telnet] INFO: Telnet Password: 7aad12ee78cfff92
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-24 09:54:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-24 09:54:19 [scrapy.core.engine] INFO: Spider opened
2022-11-24 09:54:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:54:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-24 09:55:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:56:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:57:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:58:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 09:59:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:00:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:01:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:02:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-24 10:03:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
and this goes on infinitely.
I also tried with different URLs but got the same result, so I assume the problem is on my end, not on the server's. Also, if I run the spider without Playwright (i.e. if I take the DOWNLOAD_HANDLERS out of the settings), it works, albeit it only returns the source HTML, which is not my desired result.
It works fine for me.
Just remove or comment out these lines in your settings.py file:
# import asyncio
# from scrapy.utils.reactor import install_reactor
# install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
# asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
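For reference, scrapy-playwright itself usually needs only the download handlers and the reactor setting; here is a sketch of what would remain in settings.py after removing those lines (no manual install_reactor call or event loop policy, based only on the settings shown in the question):

# settings.py (sketch): scrapy-playwright only needs these two settings;
# Scrapy installs the asyncio reactor itself based on TWISTED_REACTOR.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"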

How can I use Scrapy to download all my Quora answers?

I'm trying to use Scrapy to download my Quora answers, but I can't even seem to be able to download my page. Using the simple
scrapy shell 'http://it.quora.com/profile/Ferdinando-Randisi'
returns this error
2017-10-05 22:16:52 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: quora)
2017-10-05 22:16:52 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quora.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['quora.spiders'], 'BOT_NAME': 'quora', 'LOGSTATS_INTERVAL': 0}
....
2017-10-05 22:16:53 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-05 22:16:53 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-05 22:16:53 [scrapy.core.engine] INFO: Spider opened
2017-10-05 22:16:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://it.quora.com/robots.txt> from <GET http://it.quora.com/robots.txt>
2017-10-05 22:16:55 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://it.quora.com/robots.txt> (referer: None)
2017-10-05 22:16:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://it.quora.com/profile/Ferdinando-Randisi> from <GET http://it.quora.com/profile/Ferdinando-Randisi>
2017-10-05 22:16:56 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://it.quora.com/profile/Ferdinando-Randisi> (referer: None)
2017-10-05 22:16:58 [root] DEBUG: Using default logger
What's wrong? Error 429 is associated with too many requests, but I'm making only one request. Why would that be too many?
The site blocks Scrapy based on the user agent string. Try mimicking a browser, e.g. Chromium:
scrapy shell "http://it.quora.com/profile/Ferdinando-Randisi" -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"

Login with scrapy doesn't work

I have recently started using scrapy and was setting it up for a typical task of scraping a webpage which requires authentication.
My idea is to start with the login page, submit the form and then download the data from other login protected pages.
I can see that I am authenticated, however, I see that I am stuck in a loop of redirects when it goes to the download page.
My spider class looks like below:
from scrapy import Spider, Request, FormRequest
from scrapy.shell import inspect_response


class MySpiderWithLogin(Spider):
    name = 'my-spider'
    download_url = 'https://example.com/files/1.zip'
    login_url = 'https://example.com/login'
    login_user = '...'
    login_password = '...'

    def start_requests(self):
        # let's start by sending a first request to the login page
        yield Request(self.login_url, self.parse_login)

    def parse_login(self, response):
        # got the login page, let's fill in the login form...
        return FormRequest.from_response(
            response,
            formdata={'username': self.login_user, 'password': self.login_password},
            callback=self.start_crawl,
            dont_filter=True)

    def start_crawl(self, response):
        # OK, we're in, let's start crawling the protected pages
        yield Request(self.download_url, dont_filter=True)

    def parse(self, response):
        # do stuff with the logged-in response
        inspect_response(response, self)
        return
What I see after running the spider is a redirect loop, as below. I have abstracted login_page, download_page and some of their query parameters, namely ticket, jsession_id and cas_check.
2016-12-21 18:06:36 [scrapy] DEBUG: Redirecting (302) to <GET login_page>
2016-12-21 18:06:39 [scrapy] DEBUG: Crawled (200) <GET login_page>
2016-12-21 18:06:39 [scrapy] DEBUG: Redirecting (302) to <GET login_page/j_spring_cas_security_check;jsessionid=bar?ticket=foo> from <POST login_page/j_spring_cas_security_check;jsessionid=bar?ticket=foo>
2016-12-21 18:06:42 [scrapy] DEBUG: Redirecting (302) to <GET home_page>
2016-12-21 18:06:44 [scrapy] DEBUG: Crawled (200) <GET home_page>
2016-12-21 18:06:44 [scrapy] DEBUG: Redirecting (302) to <GET login_page> from <GET download_page>
2016-12-21 18:06:47 [scrapy] DEBUG: Redirecting (302) to <GET download_page?ticket=biz> from <GET login_page>
2016-12-21 18:06:50 [scrapy] DEBUG: Redirecting (302) to <GET download_page> from <GET download_page?ticket=biz>
2016-12-21 18:06:54 [scrapy] DEBUG: Redirecting (302) to <GET login_page> from <GET download_page>
....
....
2016-12-21 18:07:34 [scrapy] DEBUG: Discarding <GET download_page?ticket=biz_100>: max redirections reached
I have set my User-Agent to that of a browser in settings.py, but to no effect here. Any ideas what could possibly be wrong?
Here's the payload of the login form from a successful browser request, for reference:
url: login_page/jsessionid=...?service=../j_spring_cas_security_check%3Bjsessionid%3D...
method: POST
payload:
- username: ...
- password: ...
- lt: e1s1
- _eventId: submit
- submit: Sign In
UPDATE
Using Python's requests library works like a charm for the same URL. It is also worth mentioning that the website uses Jasig CAS for authentication, which makes the download URL accessible only once for a given ticket. For any further access, a new ticket needs to be issued.
I am guessing that might be the reason why Scrapy's Request gets stuck in redirects, as it might not be built around the one-time-access scenario.
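Given the browser payload above, one thing worth trying is sending the extra form fields explicitly. A minimal sketch of parse_login under that assumption (the field names are taken from the payload shown earlier; lt is a per-request CAS token that FormRequest.from_response should pick up from the form's hidden inputs, so it is not hard-coded):

    def parse_login(self, response):
        # Sketch only: replicate the browser's payload as closely as possible.
        return FormRequest.from_response(
            response,
            formdata={
                'username': self.login_user,
                'password': self.login_password,
                '_eventId': 'submit',
                'submit': 'Sign In',
            },
            callback=self.start_crawl,
            dont_filter=True)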

Broken Adobe CQ page rendering

I am trying to get this page (CQ 5.4):
http://localhost:4502/etc/replication/agents.author.html
But see the next:
Resource dumped by HtmlRendererServlet
Resource path: /etc/replication/agents.author
Resource metadata: {sling.resolutionPathInfo=.html, sling.resolutionPath=/etc/replication/agents.author}
Resource type: cq:Page
Resource super type: -
Resource properties..
In 'system/console' > 'Recent requests' we can see the render process.
0 (2013-12-16 02:33:09) TIMER_START{Request Processing}
0 (2013-12-16 02:33:09) COMMENT timer_end format is {<elapsed msec>,<timer name>} <optional message>
0 (2013-12-16 02:33:09) LOG Method=GET, PathInfo=/etc/replication/agents.author.html
0 (2013-12-16 02:33:09) TIMER_START{ResourceResolution}
1 (2013-12-16 02:33:09) TIMER_END{1,ResourceResolution} URI=/etc/replication/agents.author.html resolves to Resource=JcrNodeResource, type=cq:Page, superType=null, path=/etc/replication/agents.author
1 (2013-12-16 02:33:09) LOG Resource Path Info: SlingRequestPathInfo: path='/etc/replication/agents.author', selectorString='null', extension='html', suffix='null'
1 (2013-12-16 02:33:09) TIMER_START{ServletResolution}
1 (2013-12-16 02:33:09) TIMER_START{resolveServlet(JcrNodeResource, type=cq:Page, superType=null, path=/etc/replication/agents.author)}
1 (2013-12-16 02:33:09) TIMER_END{0,resolveServlet(JcrNodeResource, type=cq:Page, superType=null, path=/etc/replication/agents.author)} Using servlet org.apache.sling.servlets.get.DefaultGetServlet
1 (2013-12-16 02:33:09) TIMER_END{0,ServletResolution} URI=/etc/replication/agents.author.html handled by Servlet=org.apache.sling.servlets.get.DefaultGetServlet
1 (2013-12-16 02:33:09) LOG Applying Requestfilters
DefaultGetServlet is used instead of Page.jsp (Using servlet org.apache.sling.servlets.get.DefaultGetServlet).
All bundles are active.
Log outputs:
==> request.log <==
17/Dec/2013:01:29:49 -0800 [3677] -> GET /etc/replication/agents.author.html HTTP/1.1
17/Dec/2013:01:29:49 -0800 [3677] <- 200 text/html 3ms
==> access.log <==
<myIp> - admin 17/Dec/2013:01:29:49 -0800 "GET /etc/replication/agents.author.html HTTP/1.1" 200 1232 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"
==> request.log <==
17/Dec/2013:01:29:50 -0800 [3678] -> GET /favicon.ico HTTP/1.1
==> error.log <==
17.12.2013 01:29:50.332 *INFO* [82.209.214.162 [1387272590327] GET /favicon.ico HTTP/1.1] org.apache.sling.engine.impl.SlingRequestProcessorImpl service: Resource /favicon.ico not found
==> request.log <==
17/Dec/2013:01:29:50 -0800 [3678] <- 404 text/html 6ms
==> access.log <==
<myip> admin 17/Dec/2013:01:29:50 -0800 "GET /favicon.ico HTTP/1.1" 404 393 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"
What has broken? Why did this happen?
In the bundles I found only the following differences (comparing the broken instance with an identical unbroken one):
Day CRX Sling - Token Authentication com.day.crx.sling.crx-auth-token 2.2.0.54 cq5 Active
Day CRX Sling - Token Authentication com.day.crx.sling.crx-auth-token 2.2.0.61 cq5 Active
I also have one more active bundle on my broken instance:
Day Communique 5 WCM Geometrixx Go com.day.cq.wcm.cq-wcm-geometrixx-go 5.4.0
I turned it off, but it didn't help.
After that I uploaded cq-content-5.4.jar in the package manager, installed it, and restarted the system.
But I still get the same error (Resource dumped by HtmlRendererServlet) for all pages, so that did not help either.
It can also happen due to the Apache Sling Resource Resolver Factory settings. We faced the same issue and found that the Resource Search Path setting was missing all of its entries. Verify that the default entries include the following:
/apps
/libs
/apps/foundation/components/primary
/libs/foundation/components/primary
I realise this is long after the fact, but perhaps it may be of use to future developers.
If you use the default GET servlet you need to configure a setting in the Apache Sling Get Servlet to render HTML. This can be found in the Apache Felix web console. You can access it here:
/system/console/configMgr
Then do a search for "Sling Get servlet". Inside your configuration you need to toggle the config setting for "Enable HTML" to select whether the HTML renderer for the default Get servlet is enabled or not.

Cannot upload image to wordpress (nginx+varnish+apache)

I'm running two servers.
One is a gateway running nginx for dispatching requests for different domains to different servers.
The other one is the server for my WordPress installation.
I'm using Varnish in front of Apache to do caching stuffs (only caching, no load balancing). I've turned off KeepAlive and set Timeout to 20 seconds for Apache.
Now I'm uploading an image of 160 KB and it fails, while my server configuration allows a maximum size of 20 MB. After I submit the upload form in WordPress, I can see from the status line of my browser that the file is uploaded several times (mostly 2 or 3). When I use the async uploading plugin of WordPress, I can also see the progress bar growing from 0% to 100% over and over again, until it fails.
When it fails, it gets stuck at the path /wp-admin/media-upload.php?inline=&upload-page-form= and Chrome says "Error 101 (net::ERR_CONNECTION_RESET): The connection was reset." I've tried Firefox; it's exactly the same.
I cannot see anything relevant in the error logs of Varnish and Apache. However, I do see multiple lines of the following in the nginx access log:
220.255.1.18 - - [01/Jan/2013:12:16:36 +0800] "POST /wp-admin/media-upload.php?inline=&upload-page-form= HTTP/1.1" 400 0 "http://MY-DOMAIN/wp-admin/media-new.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11"
220.255.1.29 - - [01/Jan/2013:12:16:41 +0800] "POST /wp-admin/media-upload.php?inline=&upload-page-form= HTTP/1.1" 400 0 "http://MY-DOMAIN/wp-admin/media-new.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11"
220.255.1.23 - - [01/Jan/2013:12:16:51 +0800] "POST /wp-admin/media-upload.php?inline=&upload-page-form= HTTP/1.1" 400 0 "http://MY-DOMAIN/wp-admin/media-new.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11"
220.255.1.26 - - [01/Jan/2013:12:17:03 +0800] "POST /wp-admin/media-upload.php?inline=&upload-page-form= HTTP/1.1" 400 0 "http://MY-DOMAIN/wp-admin/media-new.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11"
So what's the problem? How can I fix it?
