How can I restart the same spider once it's finished, so that it can fetch the next list of URLs to process? Since my database is too large, I can't pass all the websites at once. I need the spider to run in a loop: fetch 100 websites, process them, fetch the next 100, and so on. Is there a way to call the spider once it finishes processing 100 websites? Please help me with this issue, as I am new to Scrapy. Or is there any option for scheduling the spider to run after a specified interval of time?
In the current code, I can get the URLs from the domains and store them in the database. But I need the spider to run all the time. Is there a way to run it once and have it keep running until there is no website left to process? Please help.
import MySQLdb
import MySQLdb.cursors
import tldextract
from time import gmtime, strftime
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = "first"

    # Fetch the next batch of 100 websites at class-definition time.
    con = MySQLdb.connect(host="localhost", user="user",
                          passwd="pwd", db="db")
    cur = con.cursor(MySQLdb.cursors.DictCursor)
    cur.execute("select website from table limit 100")

    urls = []
    domains = []
    row = cur.fetchone()
    while row is not None:
        p = "%s" % (row["website"])
        domains.append(p)
        urls.append("http://www.%s" % p)
        row = cur.fetchone()

    allowed_domains = domains
    start_urls = urls
    cur.close()
    con.close()
    print(start_urls)

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True),)

    # Separate connection used from inside the callback to store results.
    connection = MySQLdb.connect(host="localhost", user="user",
                                 passwd="pwd", db="db")
    cursor = connection.cursor(MySQLdb.cursors.DictCursor)

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        topDomain = tldextract.extract(response.url)
        tld = topDomain.domain + '.' + topDomain.suffix
        add_url = "INSERT INTO crawl_db (url,tld,time) VALUES (%s,%s,%s)"
        add_url_data = (item['url'], tld, strftime("%Y-%m-%d %H:%M:%S", gmtime()))
        self.cursor.execute(add_url, add_url_data)
        self.connection.commit()
Thank You.
What about passing each job an index indicating its order, or two values indicating the offset and limit for the SQL query? Check here for how to use it.
I assume you could use a bash script to run every job, passing those parameters; to run a Scrapy job with extra parameters, check here.
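A minimal sketch of that idea, assuming the same MySQL table as in the question (the BatchSpider name and the offset/limit argument names are illustrative, not from the original post):
import MySQLdb
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class BatchSpider(CrawlSpider):
    name = "first"
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True),)

    def __init__(self, offset=0, limit=100, *args, **kwargs):
        super(BatchSpider, self).__init__(*args, **kwargs)
        con = MySQLdb.connect(host="localhost", user="user", passwd="pwd", db="db")
        cur = con.cursor()
        # Only fetch the slice of websites this particular job is responsible for.
        cur.execute("select website from table limit %s offset %s",
                    (int(limit), int(offset)))
        rows = cur.fetchall()
        cur.close()
        con.close()
        self.allowed_domains = [row[0] for row in rows]
        self.start_urls = ["http://www.%s" % row[0] for row in rows]

    def parse_url(self, response):
        # same storage logic as in the question
        pass
Each run could then be launched from the bash loop with something like scrapy crawl first -a offset=200 -a limit=100, incrementing the offset between runs.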
Now, if you want to run "something" when a job finishes, you could use an Extension with the spider_closed signal.
Check this tutorial on how to create your own extension, and execute whatever you want in the spider_closed method. Remember to activate it.
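A minimal sketch of such an extension (the module and class names are made up); to activate it, add it to the EXTENSIONS setting, e.g. EXTENSIONS = {'myproject.extensions.BatchFinished': 500}:
from scrapy import signals


class BatchFinished(object):
    """Runs custom code when the spider finishes its current batch."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # Put whatever should happen after the 100 websites are done here,
        # e.g. mark them as processed in the DB or trigger the next batch.
        spider.log("Finished batch for spider %s" % spider.name)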
I made it work by writing a new Python script and calling the spider from it as follows:
import os

while True:
    os.system("scrapy crawl first")
So it runs the spider, gets the URLs, and processes them, and then runs the process again with a different set of URLs.
Later I will add a condition to stop once the database doesn't have any URLs left to process.
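A possible sketch for that stop condition (the count query is an assumption; adjust it to however your table marks unprocessed websites):
import os
import MySQLdb

while True:
    con = MySQLdb.connect(host="localhost", user="user", passwd="pwd", db="db")
    cur = con.cursor()
    cur.execute("select count(*) from table")  # websites still waiting to be crawled
    remaining = cur.fetchone()[0]
    cur.close()
    con.close()
    if remaining == 0:
        break
    os.system("scrapy crawl first")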
Thank you.
Is there a way to create the Mount objects of DockerOperator dynamically so that I can use the filesystem connection stored in Airflow db? I don't want to change the dag code if the connection changes.
At the moment I need to hardcode the paths like this
incoming_path = "/incoming/XYZ"
output_path = "/output_path/ABC"

input_mount = {"source": incoming_path,
               "target": incoming_path,
               "type": "bind"}

output_mount = {"source": output_path,
                "target": output_path,
                "type": "bind"}

create_stuff = DockerOperator(
    task_id=f'create_stuff',
    user=1234,
    queue='default',
    image='image_name',
    api_version='auto',
    auto_remove=True,
    mount_tmp_dir=False,
    mounts=[Mount(**input_mount),
            Mount(**output_mount)],
    environment={
        'AF_EXECUTION_DATE': "{{ ds }}",
        'AF_OWNER': "{{ task.owner }}",
    },
    command=f"do stuff",
    entrypoint='',
    docker_url='unix://var/run/docker.sock',
    network_mode='bridge'
)
I tried to use the FSHook outside an Operator, but then it returns an empty string when I call:
with DAG(...) as dag:
    ...

    # THIS WORKS
    @task
    def task1():
        incoming_hook = FSHook('fs_incoming')
        incoming_path = incoming_hook.get_path()
        ...

    # THIS RETURNS AN EMPTY STRING
    incoming_hook = FSHook('fs_incoming')
    incoming_path = incoming_hook.get_path()
So another phrasing of the question would be: is there a way to get the path from the connection outside an Operator?
I'm using Airflow 2.4.1
Given that you just need to share a common string between multiple DAGs, I'd recommend either using an Airflow Variable or using some kind of shared config file in your /dags directory. You can find more detail in this answer.
Forcing the use of a hook here is unnecessary as Airflow Connections are "used for storing credentials and other information necessary for connecting to external services". I don't believe that applies here, but if you did need to use a Connection, you wouldn't need a Hook; you could just use the Connection API to access Connection properties.
Hooks give you additional functionality for actually interacting with external systems. FSHook would allow you to actually interact with a filesystem rather than just share the path value across DAGs.
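For illustration, both options look roughly like this. This is a sketch: the Variable name incoming_path is made up, and it assumes the fs_incoming connection stores its path in the extra field, which is what FSHook reads.
from airflow.models import Variable
from airflow.models.connection import Connection

# Option 1: a shared Airflow Variable, readable at DAG-parse time.
incoming_path = Variable.get("incoming_path", default_var="/incoming/XYZ")

# Option 2: read the Connection directly instead of going through a Hook.
conn = Connection.get_connection_from_secrets("fs_incoming")
incoming_path_from_conn = conn.extra_dejson.get("path", "")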
I have a task where I make a request to download a Confluence page and copy it into another page with the same content but different metadata (for example, the title of the page is given via metadata), and I'd like to do this automatically. My idea: the script gets certain parameters (title, author, etc.) and puts these into the metadata of the second page. Does anyone have an idea how to do this?
If you are using ScriptRunner/Groovy Runner, you can use Groovy code like this:
import com.atlassian.confluence.pages.Page
import com.atlassian.confluence.pages.PageManager
import com.atlassian.sal.api.component.ComponentLocator

def PAGE_ID = 123456

def pageManager = ComponentLocator.getComponent(PageManager.class)
def page = pageManager.getPage(PAGE_ID)

// Copy title, body and space from the source page into a new page object.
def newPage = new Page()
newPage.setTitle(page.getTitle())
newPage.setBody(page.getBodyAsString())
newPage.setSpace(page.getSpace())

pageManager.saveContentEntity(newPage, null)
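If you are not running ScriptRunner, a hedged alternative is the Confluence REST content API. The sketch below assumes a /rest/api/content endpoint and basic auth; the base URL, credentials, and function name are all illustrative:
import requests

BASE = "https://confluence.example.com"   # illustrative base URL
AUTH = ("user", "api_token_or_password")  # illustrative credentials


def copy_page(source_id, new_title, space_key):
    # Fetch the source page, including its storage-format body.
    src = requests.get(
        "%s/rest/api/content/%s" % (BASE, source_id),
        params={"expand": "body.storage"},
        auth=AUTH,
    ).json()

    # Create a new page with the same body but different metadata (title, space).
    payload = {
        "type": "page",
        "title": new_title,
        "space": {"key": space_key},
        "body": {"storage": {"value": src["body"]["storage"]["value"],
                             "representation": "storage"}},
    }
    return requests.post("%s/rest/api/content" % BASE, json=payload, auth=AUTH).json()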
I'm learning Airflow and I'm trying to understand how connections work.
I have a first dag with the following code:
c = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxx',
    password='xxxxxxxxx'
)

def list_keys():
    hook = S3Hook(aws_conn_id=c.conn_id)
    logging.info(f"Listing Keys from {bucket}/{prefix}")
    keys = hook.list_keys(bucket, prefix=prefix)
    for key in keys:
        logging.info(f"- s3://{bucket}/{key}")
In this case it works fine. The connection is passed correctly to the S3Hook.
Then I have a second dag:
redshift_connection = Connection(
    conn_id='redshift',
    conn_type='postgres',
    login='duser',
    password='xxxxxxxxxx',
    host='xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    schema='db'
)

aws_connection = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxxx',
    password='xxxxxxxx'
)

def load_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook(aws_connection.conn_id)
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook(redshift_connection.conn_id)
    sql_stmnt = sql_statements.COPY_STATIONS_SQL.format(aws_connection.login, aws_connection.password)
    redshift_hook.run(sql_stmnt)

dag = DAG(
    's3_to_Redshift',
    start_date=datetime.datetime.now()
)

create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id=redshift_connection.conn_id,
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
    dag=dag
)
This dag returns the following error: The conn_id 'redshift' isn't defined
Why is that? What is the difference between my first and second dag? Why does the connection seem to work in the first example and not in the second?
Thanks.
Connections are usually created using the UI or CLI as described here and stored by Airflow in the database backend. The operators and the respective hooks then take a connection ID as an argument and use it to retrieve the usernames, passwords, etc. for those connections.
In your case, I suspect you created a connection with the ID aws_credentials using the UI or CLI. So, when you pass its ID to S3Hook it successfully retrieves the credentials (from the Airflow's database backend, not from the Connection object that you created).
But, you did not create a connection with the ID redshift, therefore, AwsHook complains that it is not defined. You have to create the connection as described in the documentation first.
Note: The reason for not defining connections in the DAG code is that the DAG code is usually stored in a version control system (e.g., Git). And it would be a security risk to store credentials there.
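For illustration, once a connection with ID redshift has been created via the UI or CLI, the DAG code only needs that ID. This is a sketch using the legacy import path matching the question's code; newer Airflow versions use the airflow.providers.postgres package instead:
from airflow.hooks.postgres_hook import PostgresHook


def load_data_to_redshift(*args, **kwargs):
    # The ID alone is enough: credentials are looked up in Airflow's metadata
    # database, not taken from a Connection object defined in the DAG file.
    redshift_hook = PostgresHook(postgres_conn_id="redshift")
    redshift_hook.run("SELECT 1")  # placeholder statement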
I want to run an Airflow dag like so:
I have 2 airflow workers W1 and W2.
In W1 I have scheduled a single task (W1-1) but in W2, I want to create X number of tasks (W2-1, W2-2 ... W2-X).
The number X and the bash command for each task will be derived from a DB call.
All tasks for worker W2 should run in parallel after W1 completes.
This is my code
dag = DAG('deploy_single', catchup=False, default_args=default_args, schedule_interval='16 15 * * *')

t1 = BashOperator(
    task_id='dummy_task',
    bash_command='echo hi > /tmp/hi',
    queue='W1_queue',
    dag=dag)

get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"

db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)

cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()

i = 1
for record in records:
    t = BashOperator(
        task_id='script_test_' + str(i),
        bash_command="{full_command} ".format(full_command=str(record[0])),
        queue=str(record[1]),
        dag=dag)
    t.set_upstream(t1)
    i += 1

cursor.close()
connection.close()
However, when I run this, the task on W1 completed successfully but all tasks on W2 failed. In the airflow UI, I can see that it can resolve the correct number of tasks (10 in this case) but each of these 10 failed.
Looking at the logs, I saw that on W2 (which is on a different machine), airflow could not find the db_creds.json file.
I do not want to provide the DB creds file to W2.
My question is how can an airflow task be created dynamically in this case?
Basically, I want to run a DB query on the Airflow server and assign tasks to one or more workers based on the results of that query. The DB will contain updated info about which engines are active, etc., and I want the DAG to reflect this. From the logs, it looks like each worker runs the DB query. Providing DB access to each worker is not an option.
Thank you @viraj-parekh and @cwurtz.
After much trial and error, I found the correct way to use Airflow Variables for this case.
Step 1) We create another script called gen_var.py and place it in the dag folder. This way, the scheduler will pick it up and generate the variables. If the code for generating variables were within the deploy_single dag, we would run into the same dependency issue, since the worker would try to process the dag too.
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
import json
import psycopg2
from airflow.models import Variable
from psycopg2.extensions import AsIs
get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"
db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()
hosts = {}
i = 1
for record in records:
comm_dict = {}
comm_dict['full_command'] = str(record[0])
comm_dict['queue_name'] = str(record[1])
hosts[i] = comm_dict
i += 1
cursor.close()
connection.close()
Variable.set("hosts",hosts,serialize_json=True)
Note the call to serialize_json. Airflow will try to store the variable as a string. If you want it to be stored as a dict, use serialize_json=True. Airflow will still store it as a string internally, via json.dumps.
Step 2) Simplify the dag and read this "hosts" variable (now deserialized to get back the dict) like so:
hoztz = Variable.get("hosts", deserialize_json=True)
for key in hoztz:
    host = hoztz.get(key)
    t = BashOperator(
        task_id='script_test_' + str(key),
        bash_command="{full_command} ".format(full_command=str(host.get('full_command'))),
        queue=str(host.get('queue_name')),
        dag=dag)
    t.set_upstream(t1)
Hope it helps someone else.
One way to do this would be to store the information in an Airflow Variable.
You can fetch the information needed to dynamically generate the DAG (and necessary configs) in a Variable and have W2 access it from there.
Variables are an airflow model that can be used to store static information (information that does not have an associated timestamp) that all tasks can access.
I have been working on a project with Scrapy. With help from this lovely community I have managed to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am trying to scrape information from the "older" pages as well. I have researched "crawlspider", rules, and link extractors, and believed I had the proper code. I want the spider to perform the same loop on subsequent pages. Unfortunately, at the moment, when I run it, it just spits out the first page and doesn't continue to the "older" pages.
I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February of 2004... I am new to data mining, and not sure if it is actually a realistic goal to be able to scrape every post. If it is, I would like to, though. Please, any help is appreciated. Thanks!
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}
If your intention is to fetch the data by traversing multiple pages, you don't need to go for Scrapy. If you still want a solution related to Scrapy, then I suggest you opt for Splash to handle the pagination.
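For reference, if you did go the Splash route, the basic wiring looks roughly like this. This is a sketch, assuming the scrapy-splash package is installed and a Splash instance is running on localhost:8050; the spider name is made up:
import scrapy
from scrapy_splash import SplashRequest

# settings.py (sketch):
# SPLASH_URL = 'http://localhost:8050'
# DOWNLOADER_MIDDLEWARES = {
#     'scrapy_splash.SplashCookiesMiddleware': 723,
#     'scrapy_splash.SplashMiddleware': 725,
#     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# }
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'


class RotoSplashSpider(scrapy.Spider):
    name = "roto_splash"

    def start_requests(self):
        # Route the request through Splash so parse() receives the rendered page.
        yield SplashRequest('http://www.rotoworld.com/playernews/nfl/football/',
                            self.parse, args={'wait': 1})

    def parse(self, response):
        # response.body now contains the JavaScript-rendered HTML.
        pass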
I would do something like below to get the items (assuming you have already installed selenium on your machine):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

while True:
    for item in wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='pb']"))):
        player = item.find_element_by_xpath(".//div[@class='player']/a").text
        player = player.encode()  # it should handle the encoding issue; I'm not totally sure, though
        print(player)

    try:
        idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='date']"))).text
        if "Jun 9" in idate:  # put here any date you wanna go back to (last limit: where the scraper will stop)
            break
        wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='cp1_ctl00_btnNavigate1']"))).click()
        wait.until(EC.staleness_of(item))
    except:
        break

driver.quit()
My suggestion: Selenium
If you want to change pages automatically, you can use Selenium WebDriver.
Selenium lets you interact with the page: click on buttons, write into inputs, and so on. You'll need to change your code to scrape the data and then click on the "older" button. Then it will change the page and keep scraping.
Selenium is a very useful tool. I'm using it right now on a personal project. You can take a look at my repo on GitHub to see how it works. In the case of the page you're trying to scrape, you cannot go to "older" by just changing the link to be scraped, so you need to use Selenium to change between pages.
Hope it helps.
No need to use Selenium in this case. Before scraping, open the URL in a browser and press F12 to inspect the code and watch the requests in the Network tab. When you press Next ("OLDER" in your case), you can see a new set of requests in the Network tab. They provide everything you need. Once you understand how it works, you can write a working spider.
import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.<DOMAIN>/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}

        older = response.css('input#cp1_ctl00_btnNavigate1')
        if not older:
            return

        inputs = response.css('div.aspNetHidden input')
        inputs.extend(response.css('div.RW_pn input'))
        formdata = {}
        for input in inputs:
            name = input.css('::attr(name)').extract_first()
            value = input.css('::attr(value)').extract_first()
            formdata[name] = value or ''

        formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
        formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
        del formdata['ctl00$cp1$ctl00$btnFilterResults']
        del formdata['ctl00$cp1$ctl00$btnNavigate1']

        action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'

        yield FormRequest(
            action_url,
            formdata=formdata,
            callback=self.parse
        )
Be careful: you need to replace every <DOMAIN> placeholder in my code with the correct one.