Scrapy-Splash not loading content - web-scraping

I started using Selenium a few months ago, then Scrapy. Following tutorials from Udemy, YouTube, and Stack Overflow questions, all my scrapes were successful until I started working with this page: response.css and response.xpath didn't work, so I moved to scrapy-splash. I installed Docker, ran many tests, and got successful responses. I have tried every solution I have found and nothing works; parse() doesn't even print. I tried Python 3.8 and 2.7 with scrapy-splash.
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    splash.private_mode_enabled = false
    splash:go(splash.args.url)
    splash:wait(2)
    html = splash:html()
    splash.private_mode_enabled = true
    return html
end
"""
class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["url"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='execute',
                                args={
                                    'wait': 1,
                                    "lua_source": LUA_SCRIPT})

    def parse(self, response):
        print('Result:')
        print(".breadcrumbs-link = %s" % (response.css('body').extract()))  # OUTPUT: [...HTML ELEMENTS...]
        print(".breadcrumbs-link = %s" % (response.xpath("//td'][1]").extract()))
(Face python 3.8) F:\Selenium\>scrapy crawl quotes
2020-07-31 04:31:00 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapysplash)
2020-07-31 04:31:00 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.0 (tags/v
3.8.0:fa919fd, Oct 14 2019, 19:21:23) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-7
-SP0
2020-07-31 04:31:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-31 04:31:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapysplash',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'scrapysplash.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapysplash.spiders']}
2020-07-31 04:31:00 [scrapy.extensions.telnet] INFO: Telnet Password: f075a705cb8e0509
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-31 04:31:00 [scrapy.core.engine] INFO: Spider opened
2020-07-31 04:31:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 04:31:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-31 04:31:00 [py.warnings] WARNING: d:\selenium\python\lib\site-packages\scrapy_splash\request.py:41: ScrapyDeprecationWarning: Call to deprecated fu
nction to_native_str. Use to_unicode instead.
url = to_native_str(url)
2020-07-31 04:31:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://url/robots.txt> (referer: None)
2020-07-31 04:31:01 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET url>
2020-07-31 04:31:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-31 04:31:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 223,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 370,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.355421,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 31, 8, 31, 1, 228466),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 1,
'robotstxt/forbidden': 1,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 31, 8, 31, 0, 873045)}
2020-07-31 04:31:01 [scrapy.core.engine] INFO: Spider closed (finished)
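Note that in the log above the only successful request is for robots.txt; the page request itself is dropped by RobotsTxtMiddleware ("Forbidden by robots.txt", IgnoreRequest) because ROBOTSTXT_OBEY is True, so parse() is never reached. For reference, a minimal scrapy-splash configuration as described in the scrapy-splash README (a sketch; SPLASH_URL is an assumption and must match the address of the running Splash Docker container):

# settings.py (sketch based on the scrapy-splash README; adjust to your project)
SPLASH_URL = 'http://localhost:8050'  # assumption: Splash container listening here

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# The crawl above stops at robots.txt; only disable this if the site's terms
# of use allow crawling the page.
# ROBOTSTXT_OBEY = False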

Related

Trying to scrape multiple data points in same table with duplicate titles

I am trying to scrape multiple data points from the left-most table at the link below. My issue is that I need to collect the "Total Qty:" under each month but am struggling with it. I have tried getall and a few other options; I want to collect the Qty for each month and record it in the CSV output under each month's name. The problem is that the list lengths change for each part, so it is difficult for me to identify the correct way to grab this data. Any help would be appreciated.
https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&showbulk=0&currency=1
def parse_details(self, response):
    yield {
        'element_id': response.meta.get('element_id'),
        'USA_NEW_times_sold Month 1': response.xpath('//*[@class="pcipgOddColumn"]')[0].xpath(
            './/td[contains(text(),"Total Qty:")]/following::td//text()').get('')
    }
I have tried getall, along with changing the paths, but I am struggling with that portion.
When all else fails you can always iterate through the elements to search for the one you are looking for. Try finding the months, then use the following-sibling axis to search for the very next row that contains the Total Qty:
def parse(self, response):
    column = response.xpath('//td[@valign="top"][1]')
    for i in column.css('td.pcipgSubHeader'):
        month = i.xpath('./b/text()').get()
        for j in i.xpath('./../following-sibling::tr'):
            if j.xpath('./td/text()').re('Total Qty:'):
                qty = j.xpath('.//td/b/text()').get()
                yield {'month': month, 'qty': qty}
                break
With the above method I get this output:
{'month': 'February 2023', 'qty': '13'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'January 2023', 'qty': '8'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'December 2022', 'qty': '3'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'November 2022', 'qty': '2'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'October 2022', 'qty': '6'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'September 2022', 'qty': '10'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'January 2023', 'qty': '12'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'November 2022', 'qty': '2'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'October 2022', 'qty': '11'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'August 2022', 'qty': '2'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'November 2022', 'qty': '6'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'October 2022', 'qty': '2'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'January 2023', 'qty': '9'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'December 2022', 'qty': '2'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'November 2022', 'qty': '2'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'October 2022', 'qty': '2'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'September 2022', 'qty': '3'}
2023-02-10 23:19:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&sh
owbulk=0&currency=1>
{'month': 'November 2022', 'qty': '3'}
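If the goal is one CSV column per month, as described in the question, one option is to collect all month/qty pairs into a single item before yielding. A minimal sketch built on the parse method above; the setdefault call keeps the first value when a month repeats across sub-tables, as it does in the output above:

def parse(self, response):
    # one row per part, with one column per month
    row = {'element_id': response.meta.get('element_id')}
    column = response.xpath('//td[@valign="top"][1]')
    for i in column.css('td.pcipgSubHeader'):
        month = i.xpath('./b/text()').get()
        for j in i.xpath('./../following-sibling::tr'):
            if j.xpath('./td/text()').re('Total Qty:'):
                row.setdefault(month, j.xpath('.//td/b/text()').get())
                break
    yield row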

Selecting text where each char is stored in separate span

I am trying to scrape a code chunk from this documentation page that hosts code in a peculiar way. Namely, the code chunk is divided so that the function call and each argument have their own spans; the parentheses and even the comma have their own spans. I am at a loss trying to extract the code snippet under 'Usage' with a Scrapy spider.
Here's the code for my spider, which also scrapes the documentation text.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = response.css('pre::text').getall()
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
When trying to extract the text of the block using response.css('pre::text').getall(), somehow only the punctuation is returned, not the entire function call. This also includes the example at the bottom of the page, which I'd rather avoid but do not know how.
Is there a better way to do this? I thought ::text would be perfect for this use case.
Try iterating through the pre elements and extracting the text from them individually.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = []
        for pre in response.css('pre'):
            code.append("".join(pre.css("::text").getall()))
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
OUTPUT:
2023-01-23 16:35:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html> (referer: None) ['cached']
2023-01-23 16:35:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html>
{'docu': 'Read in csv for year and plot all the accidents in state\non a map.', 'code': [['1'], ['fars_map_state', '(', 'state.num', ',', ' ', 'year', ')', '\n'], ['1'], ['fars_map_state', '(', '1', ',', ' ', '2013
', ')', '\n']]}
2023-01-23 16:35:16 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-23 16:35:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 336,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 29477,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.108742,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 1, 24, 0, 35, 16, 722131),
'httpcache/hit': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
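If only the 'Usage' block is wanted and the trailing example should be skipped, one option is to anchor the selector on the section heading instead of taking every pre on the page. A hypothetical sketch, assuming the Usage section is introduced by a heading whose text contains "Usage" (not verified against the page):

# Take only the first <pre> that follows the Usage heading, then join its spans.
usage_pre = response.xpath('//h2[contains(., "Usage")]/following-sibling::pre[1]')
usage_code = "".join(usage_pre.css("::text").getall())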

scrapy stops scraping elements that are addressed

Here are my spider code and the log I got. The problem is that the spider seems to stop scraping items somewhere in the middle of page 10 (while there are 352 pages to be scraped). When I check the XPath expressions of the remaining elements, they look the same in my browser.
Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse

parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']

    def start_requests(self):
        yield scrapy.Request(url='https://arzdigital.com',
                             callback=self.parse, dont_filter=True)

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
            next_page = response.xpath("//div[@class='arz-last-posts__get-more']/a[@class='arz-btn arz-btn-info arz-round arz-link-nofollow']/@href").get()
            if next_page:
                yield scrapy.Request(url=next_page, callback=self.parse, dont_filter=True)
            else:
                next_pages = response.xpath("//div[@class='arz-pagination']/ul/li[@class='arz-pagination__item arz-pagination__next']/a[@class='arz-pagination__link']/@href").get()
                if next_pages:
                    yield scrapy.Request(url=next_pages, callback=self.parse, dont_filter=True)
        except AttributeError:
            logging.error("The element didn't exist")
Here is the log, when the spider stops:
2021-12-04 11:06:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/10/>
{'post_title': 'ولادیمیر پوتین: ارزهای دیجیتال در نوع خود ارزشمند هستند', 'post_link': 'https://arzdigital.com/russias-putin-says-crypto-has-value-but-maybe-not-for-trading-oil-html/', 'post_date': '2021-10-16', 'likes': 17, 'commnents': 1}
2021-12-04 11:06:51 [scrapy.core.scraper] ERROR: Spider error processing <GET https://arzdigital.com/latest-posts/page/10/> (referer: https://arzdigital.com/latest-posts/page/9/)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\arzdigital\arzdigital\spiders\criptolern.py", line 32, in parse
likes=int(post.xpath(".//div[#class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[#class='arz-last-post__info']/div[#class='arz-post__info-likes']/span[2]/text()").get())
ValueError: invalid literal for int() with base 10: '۱,۸۵۱'
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 11:06:51 [scrapy.extensions.feedexport] INFO: Stored csv feed (242 items) in: dataset.csv
2021-12-04 11:06:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4112,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 292561,
'downloader/response_count': 12,
'downloader/response_status_count/200': 12,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 7, 36, 51, 830291),
'item_scraped_count': 242,
'log_count/DEBUG': 254,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'request_depth_max': 10,
'response_received_count': 12,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2021, 12, 4, 7, 36, 47, 423017)}
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Spider closed (finished)
I can't find the problem or tell whether it is related to a wrong XPath expression. Thanks for any help!
EDIT:
So I guess it's better to see two files here.
The first is settings.py:
BOT_NAME = 'arzdigital'
SPIDER_MODULES = ['arzdigital.spiders']
NEWSPIDER_MODULE = 'arzdigital.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'arzdigital.middlewares.ArzdigitalDownloaderMiddleware': None,
    'arzdigital.middlewares.UserAgentRotatorMiddleware': 400
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 60
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 120
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = False
FEED_EXPORT_ENCODING='utf-8'
And the second file is middlewares.py:
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random, logging

class UserAgentRotatorMiddleware(UserAgentMiddleware):
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/2010010 1 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWeb Kit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            self.user_agent = random.choice(self.user_agent_list)
            request.headers.setdefault('User-Agent', self.user_agent)
        except IndexError:
            logging.error("Couldn't fetch the user agent")
Your code is working as you expect; the problem was in the pagination portion. I've moved the pagination into start_urls, which is always accurate and more than twice as fast as following a "next page" link.
Code
import scrapy
import logging

# base url  = https://arzdigital.com/latest-posts/
# start_url = https://arzdigital.com/latest-posts/page/2/

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(
                    ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
        except AttributeError:
            logging.error("The element didn't exist")
Output:
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'تأکید مقام رسمی سابق وزارت دفاع آمریکا مبنی بر تشویق سرمایه گذاری بر روی بلاکچین', 'post_link': 'https://arzdigital.com/blockchain-investment/', 'post_date': '2017-07-27', 'likes': 4, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'ریسک سرمایه گذاری از طریق ICO', 'post_link': 'https://arzdigital.com/ico-risk/', 'post_date': '2017-07-27', 'likes': 9, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': '\xa0ای.سی.او چیست؟', 'post_link': 'https://arzdigital.com/what-is-ico/', 'post_date': '2017-07-27', 'likes': 7, 'commnents': 7}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'چرا\xa0فراریت بیت کوین و واحدهای مشابه آن، نسبت به سایر واحدهای پولی بیش\u200cتر است؟', 'post_link': 'https://arzdigital.com/bitcoin-currency/', 'post_date': '2017-07-27', 'likes': 6, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'اتریوم کلاسیک Ethereum Classic چیست ؟', 'post_link': 'https://arzdigital.com/what-is-ethereum-classic/', 'post_date': '2017-07-24', 'likes': 10, 'commnents': 2}
2021-12-04 17:25:19 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 17:25:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 111431,
'downloader/request_count': 353,
'downloader/request_method_count/GET': 353,
'downloader/response_bytes': 8814416,
'downloader/response_count': 353,
'downloader/response_status_count/200': 352,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 46.29503,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 11, 25, 19, 124154),
'httpcompression/response_bytes': 55545528,
'httpcompression/response_count': 352,
'item_scraped_count': 7920
.. so on
settings.py file
Please make sure that in your settings.py file you change only the uncommented portion below, nothing else.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 10
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'gs_spider.pipelines.GsSpiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
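Separately, the traceback in the question shows what actually crashes: the likes counter on that page is rendered with Persian digits and a thousands separator ('۱,۸۵۱'), which int() rejects while the separator is present. A small normalisation helper (a sketch, not part of the answer above) would make the count parsing robust:

# Map Eastern Arabic (Persian) digits to ASCII digits before calling int().
PERSIAN_DIGITS = str.maketrans('۰۱۲۳۴۵۶۷۸۹', '0123456789')

def to_int(text, default=0):
    """Turn a counter such as '۱,۸۵۱' into 1851; return default when empty."""
    if not text:
        return default
    cleaned = text.strip().translate(PERSIAN_DIGITS).replace(',', '').replace('،', '')
    return int(cleaned) if cleaned.isdigit() else default

# usage: likes = to_int(likes_text), where likes_text is the .get() result of
# the likes XPath in the spider above, instead of calling int() on it directly.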

Unable to understand the ValueError: invalid literal for int() with base 10: 'تومان'

My crawler isn't working properly and I can't find the solution.
Here is the related part of my spider:
def parse(self, response):
    original_price = 0
    discounted_price = 0
    star = 0
    discounted_percent = 0
    try:
        for product in response.xpath("//ul[@class='c-listing__items js-plp-products-list']/li"):
            title = product.xpath(".//div/div[2]/div/div/a/text()").get()
            if product.xpath(".//div/div[2]/div[2]/div[1]/text()"):
                star = float(str(product.xpath(".//div/div[2]/div[2]/div[1]/text()").get()))
            if product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()"):
                discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
            if product.xpath(".//div/div[2]/div[3]/div/div/div/text()"):
                discounted_price = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div/text()").get().strip()).replace(',', ''))
            if product.xpath(".//div/div[2]/div[3]/div/div/del/text()"):
                original_price = int(str(product.xpath(".//div/div[2]/div[3]/div/div/del/text()").get().strip()).replace(',', ''))
                discounted_amount = original_price - discounted_price
            else:
                original_price = print("not available")
                discounted_amount = print("not available")
            url = response.urljoin(product.xpath(".//div/div[2]/div/div/a/@href").get())
This is my log:
2020-10-21 16:49:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.digikala.com/search/category-book/> from <GET https://www.digikala.com/search/category-book>
2020-10-21 16:49:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/search/category-book/> (referer: None)
2020-10-21 16:49:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.digikala.com/search/category-book/> (referer: None)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\digi_allbooks\digi_allbooks\spiders\allbooks.py", line 31, in parse
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
ValueError: invalid literal for int() with base 10: 'تومان'
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 16:49:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 939,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 90506,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 13, 19, 57, 630044),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 10, 21, 13, 19, 55, 914304)}
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Spider closed (finished)
I guess it says a string is being passed to int(), which raises the ValueError, but the XPath I'm using targets a number, not a string.
I don't understand the error, so I can't find the solution. Can someone help me out, please?
In at least one of the iterations, this line is scraping تومان instead of an integer:
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
From a Google search it seems this is a monetary unit. You need to work on your XPath expressions, or have the spider ignore this value, since there is no discount on this item.
This XPath may be a better option for your purpose (I haven't checked all items, though):
product.xpath(".//div[@class='c-price__discount-oval']/span/text()").get()
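A more defensive variant (again a sketch, not verified against the page) is to pull only the digits out of the selected text, so that a stray currency label like 'تومان' can never reach int():

# Selector.re_first() returns the first regex match from the selected nodes, or None.
raw = product.xpath(".//div[@class='c-price__discount-oval']/span/text()").re_first(r'\d+')
discounted_percent = int(raw) if raw else 0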

How to create a DAG from a task in Airflow

I have a requirement where there is a parent DAG with only one task, which creates certain parameters (not fixed). Let's call them params1, params2, and params3. Now I want to create three DAGs from the task in the parent DAG, each of which will have the params available in the context of its tasks. I went through the following link on creating dynamic DAGs and tried it -
https://airflow.incubator.apache.org/faq.html#how-can-i-create-dags-dynamically
import logging
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator

# XcomManager and BigQueryManager are the project's own helper classes
# (their imports are not shown in the question).


class ParentBigquerySql(object):
    def __init__(self):
        pass

    def run(self, **context):
        logging.info('Running job')
        batch_id = 100
        #parent_sql = '''SELECT max(run_start_date) AS run_start_date,
        #                       max(run_end_date) AS run_end_date
        #                FROM `vintel_rel_2_0_staging_westfield.in_venue_batch_dates_daily`'''
        parent_sql = '''SELECT run_start_date, run_end_date
                        from vintel_rel_2_0_staging_westfield.in_venue_batch_dates_daily
                        order by 1 ,2'''
        params = self.get_params(batch_id, parent_sql)
        XcomManager.push_query_params(context, params)
        return params

    def get_params(self, batch_id, parent_sql):
        batch_id = str(batch_id)
        result = BigQueryManager.read_query_to_table(parent_sql)
        t_list = []
        if result and type(result) is not list and result.error_result:
            #LogManager.info("Error in running the parent jobs - %s." % (result.error_result))
            #LogManager.info("Not populating cache... ")
            pass
        elif len(result) > 0:
            for row in result:
                if len(row) > 0:
                    run_start_date = row[0]
                    run_end_date = row[1]
                    if run_start_date and run_end_date:
                        t_list.append({'min_date': run_start_date, 'max_date': run_end_date})
        params = {}
        params['date_range'] = t_list
        return params


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 3, 23),
    'retries': 1,
    'provide_context': True,
    'retry_delay': timedelta(minutes=2),
}

dag = DAG('parent_dynamic_job_dag',  # give the dag a name
          schedule_interval='@once',
          default_args=default_args
          )


def pull_child11(**context):
    logging.info(" Date range " + str(context['date_range']))


def conditionally_trigger(context, dag_run_obj):
    return dag_run_obj


def create_dag_from_task(**context):
    job = ParentBigquerySql()
    job.run(**context)
    logging.info("Context data")
    logging.info(context)
    params = XcomManager.pull_query_params(context)
    logging.info("Xcomm parameters: " + str(params))
    tl = []
    counter = 1
    for d1 in params['date_range']:
        dyn_dag_id = 'child_dag_id' + str(counter)
        dag_args = {
            'owner': 'airflow',
            'depends_on_past': False,
            'start_date': context['execution_date'],
            'execution_date': context['execution_date'],
            'retries': 1,
            'provide_context': True,
            'retry_delay': timedelta(minutes=2),
        }
        dyn_dag = DAG(dyn_dag_id,  # give the dag a name
                      schedule_interval='@once',
                      default_args=dag_args
                      )
        t1 = PythonOperator(
            task_id='child' + str(counter),
            dag=dyn_dag,
            provide_context=True,
            python_callable=pull_child11,
            op_kwargs={'dag_id': 10, 'date_range': d1}
        )
        t2 = TriggerDagRunOperator(task_id='test_trigger_dag',
                                   trigger_dag_id='child_dag_id' + str((counter + 1)),
                                   python_callable=conditionally_trigger,
                                   dag=dyn_dag)
        t1.set_downstream(t2)
        logging.info("Updating globals for the dag " + dyn_dag_id)
        #trigger_op.execute(context)
        globals()[dyn_dag_id] = dyn_dag  # assign DAG objects to the global namespace
        if counter > 2:
            break
        counter = counter + 1


push1 = PythonOperator(
    task_id='100-Parent',
    dag=dag,
    provide_context=True,
    python_callable=create_dag_from_task,
    op_kwargs={'dag_id': 100})

push11 = PythonOperator(
    task_id='101-Child',
    dag=dag,
    provide_context=True,
    python_callable=pull_child11,
    op_kwargs={'dag_id': 100, 'date_range': {'start_date': 'temp_start_date', 'end_date': 'temp_end_date'}})

t2 = TriggerDagRunOperator(task_id='test_trigger_dag',
                           trigger_dag_id='child_dag_id1',
                           python_callable=conditionally_trigger,
                           dag=dag)

push1.set_downstream(push11)
push11.set_downstream(t2)
I am getting the following error -
[2018-05-01 09:24:27,764] {__init__.py:45} INFO - Using executor SequentialExecutor
[2018-05-01 09:24:27,875] {models.py:189} INFO - Filling up the DagBag from /mnt/test_project /airflow/dags
[2018-05-01 09:25:02,074] {models.py:1197} INFO - Dependencies all met for <TaskInstance: parent_dynamic_job_dag.test_trigger_dag 2018-04-23 00:00:00 [up_for_retry]>
[2018-05-01 09:25:02,081] {base_executor.py:49} INFO - Adding to queue: airflow run parent_dynamic_job_dag test_trigger_dag 2018-04-23T00:00:00 --local -sd DAGS_FOLDER/test_dynamic_parent_child.py
[2018-05-01 09:25:07,003] {sequential_executor.py:40} INFO - Executing command: airflow run parent_dynamic_job_dag test_trigger_dag 2018-04-23T00:00:00 --local -sd DAGS_FOLDER/test_dynamic_parent_child.py
[2018-05-01 09:25:08,235] {__init__.py:45} INFO - Using executor SequentialExecutor
[2018-05-01 09:25:08,431] {models.py:189} INFO - Filling up the DagBag from /mnt/test_project /airflow/dags/test_dynamic_parent_child.py
[2018-05-01 09:26:44,207] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run parent_dynamic_job_dag test_trigger_dag 2018-04-23T00:00:00 --job_id 178 --raw -sd DAGS_FOLDER/test_dynamic_parent_child.py']
[2018-05-01 09:26:45,243] {base_task_runner.py:98} INFO - Subtask: [2018-05-01 09:26:45,242] {__init__.py:45} INFO - Using executor SequentialExecutor
[2018-05-01 09:26:45,416] {base_task_runner.py:98} INFO - Subtask: [2018-05-01 09:26:45,415] {models.py:189} INFO - Filling up the DagBag from /mnt/test_project /airflow/dags/test_dynamic_parent_child.py
[2018-05-01 09:27:49,798] {base_task_runner.py:98} INFO - Subtask: [2018-05-01 09:27:49,797] {models.py:189} INFO - Filling up the DagBag from /mnt/test_project /airflow/dags
[2018-05-01 09:27:50,108] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-05-01 09:27:50,108] {base_task_runner.py:98} INFO - Subtask: File "/Users/manishz/anaconda2/bin/airflow", line 27, in <module>
[2018-05-01 09:27:50,109] {base_task_runner.py:98} INFO - Subtask: args.func(args)
[2018-05-01 09:27:50,109] {base_task_runner.py:98} INFO - Subtask: File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-05-01 09:27:50,110] {base_task_runner.py:98} INFO - Subtask: pool=args.pool,
[2018-05-01 09:27:50,110] {base_task_runner.py:98} INFO - Subtask: File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-05-01 09:27:50,110] {base_task_runner.py:98} INFO - Subtask: result = func(*args, **kwargs)
[2018-05-01 09:27:50,111] {base_task_runner.py:98} INFO - Subtask: File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/models.py", line 1493, in _run_raw_task
[2018-05-01 09:27:50,111] {base_task_runner.py:98} INFO - Subtask: result = task_copy.execute(context=context)
[2018-05-01 09:27:50,112] {base_task_runner.py:98} INFO - Subtask: File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/operators/dagrun_operator.py", line 67, in execute
[2018-05-01 09:27:50,112] {base_task_runner.py:98} INFO - Subtask: dr = trigger_dag.create_dagrun(
[2018-05-01 09:27:50,112] {base_task_runner.py:98} INFO - Subtask: AttributeError: 'NoneType' object has no attribute 'create_dagrun'
[2018-05-01 09:28:14,407] {jobs.py:2521} INFO - Task exited with return code 1
[2018-05-01 09:28:14,569] {jobs.py:1959} ERROR - Task instance <TaskInstance: parent_dynamic_job_dag.test_trigger_dag 2018-04-23 00:00:00 [failed]> failed
[2018-05-01 09:28:14,573] {models.py:4584} INFO - Updating state for <DagRun parent_dynamic_job_dag @ 2018-04-23 00:00:00: backfill_2018-04-23T00:00:00, externally triggered: False> considering 3 task(s)
[2018-05-01 09:28:14,576] {models.py:4631} INFO - Marking run <DagRun parent_dynamic_job_dag @ 2018-04-23 00:00:00: backfill_2018-04-23T00:00:00, externally triggered: False> failed
[2018-05-01 09:28:14,600] {jobs.py:2125} INFO - [backfill progress] | finished run 1 of 1 | tasks waiting: 0 | succeeded: 2 | kicked_off: 0 | failed: 1 | skipped: 0 | deadlocked: 0 | not ready: 0
Traceback (most recent call last):
File "/Users/manishz/anaconda2/bin/airflow", line 27, in <module>
args.func(args)
File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/bin/cli.py", line 185, in backfill
delay_on_limit_secs=args.delay_on_limit)
File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/models.py", line 3724, in run
job.run()
File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/jobs.py", line 198, in run
self._execute()
File "/Users/manishz/anaconda2/lib/python2.7/site-packages/airflow/jobs.py", line 2441, in _execute
raise AirflowException(err)
airflow.exceptions.AirflowException: ---------------------------------------------------
Some task instances failed:
%s
But the above code is not running the child DAGs. Any idea what's happening here?
Thanks in advance,
Manish
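For reference, the FAQ entry linked in the question builds the child DAGs at module import time and assigns them to globals(), so the scheduler's DagBag can see them; a DAG object created inside a running task is never loaded into the DagBag, which is consistent with the "'NoneType' object has no attribute 'create_dagrun'" error above. A minimal sketch of that pattern (names and date ranges are placeholders):

import logging
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_range(date_range, **context):
    logging.info("Date range %s", date_range)


def build_child_dag(dag_id, date_range):
    # DAG is built while the DAG file is parsed, not while a task runs
    dag = DAG(dag_id, schedule_interval='@once',
              start_date=datetime(2017, 3, 23))
    PythonOperator(task_id='child', python_callable=print_range,
                   op_kwargs={'date_range': date_range},
                   provide_context=True, dag=dag)
    return dag


# The date ranges would normally come from a config file or an Airflow Variable
# that the parent DAG writes, not from XCom inside a task (placeholder values here).
date_ranges = [{'min_date': '2017-01-01', 'max_date': '2017-01-31'}]

for counter, d1 in enumerate(date_ranges, start=1):
    dag_id = 'child_dag_id%d' % counter
    globals()[dag_id] = build_child_dag(dag_id, d1)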
