How to recover from "cloudscraper.exceptions.CloudflareChallengeError" - web-scraping

The following code was working well without any exceptions or errors. However, after just two months, I run the code and it gave me CloudflareChallengeError. The problem is that when I run it many times the error appears on different rows, for example the first time after the first request, the second time after the 20th request. I don't find ant solution or explanation for that!.
def expandQuery(text):
print('\n****Expanding the Query****')
scraper = cloudscraper.create_scraper()
textList=text.strip().split(' ')
print(textList)
extraWordsAlMaani=[]
for x in range(10):
print(x)
for i in textList:
response = scraper.get('https://www.almaany.com/ar/thes/ar-ar/'+i +'').text
soup = BeautifulSoup(response, features='html.parser')
AllParagraph= soup.body.find('section', {'class': 'container'})
AllParagraph=AllParagraph.find('div', {'class': 'col-md-12'})
AllParagraph=AllParagraph.find('div', {'class': 'row', 'id':'page-content'})
if AllParagraph is not None:
AllParagraph=AllParagraph.find('div',{'class': 'mainbar-column'})
if AllParagraph is not None:
AllParagraph=AllParagraph.find('div', {'class': 'panel panel-default'})
if AllParagraph is not None:
AllParagraph=AllParagraph.find('div', {'class': 'panel-body'})
if AllParagraph is not None:
AllParagraph=AllParagraph.find('ul', {'class': 'list-inline'})
if AllParagraph is not None:
AllParagraph=AllParagraph.find('li').text
if AllParagraph is not None:
extraWordsAlMaani = extraWordsAlMaani + (AllParagraph).split(' , ')
extraWordsAlMaani = ' '.join(set(extraWordsAlMaani))
print('Original text is ', text)
print('extraWordsAlMaani text is ', extraWordsAlMaani)
Statements = pd.read_csv('/gdrive/MyDrive/FullSystem/Statements-Final.csv')
for index, row in Statements.iterrows():
expandQuery(row['Text'])
The error:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.

Related

How to see the templated values in airflow?

When I do something like:
some_value = "{{ dag_run.get_task_instance('start').start_date }}"
print(f"some interpolated value: {some_value}")
I see this in the airflow logs:
some interpolated value: {{ dag_run.get_task_instance('start').start_date }}
but not the actual value itself. How can I easily see what the value is?
Everything in the DAG task run comes through as kwargs (before 1.10.12 you needed to add provide_context, but all context is provided after version 2).
To get something out of kwargs, do something like this in your python callable:
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
Additional info:
To get the kwargs out, add them to your callable, so:
def my_func(**kwargs):
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
And just call this from your DAG task like:
my_task = PythonOperator(
task_id='my_task'
, dag=dag
, python_callable=my_func)
I'm not sure what your current code structure is because you haven't provided more info I'm afraid.

Passing Result of a Python Operator as a parameter to BigQueryInsertJobOperator

I have a python operator and BigQueryInsertJobOperator in my DAG. The result returned by the python operator should be passed to BigQueryInsertJobOperator in the params field.
Below is the script I am running.
def get_columns():
field = "name"
return field
with models.DAG(
"xcom_test",
default_args=default_args,
schedule_interval="0 0 * * *",
tags=["xcom"],
)as dag:
t1 = PythonOperator(task_id="get_columns", python_callable=get_columns, do_xcom_push=True)
t2 = BigQueryInsertJobOperator(
task_id="bigquery_insert",
project_id=project_id,
configuration={
"query": {
"query": "{% include 'xcom_query.sql' %}",
"useLegacySql": False,
}
},
force_rerun=True,
provide_context=True,
params={
"columns": "{{ ti.xcom_pull(task_ids='get_columns') }}",
"project_id": project_id
},
)
The xcom_query.sql looks below
INSERT INTO `{{ params.project_id }}.test.xcom_test`
{{ params.columns }}
select 'Andrew' from `{{ params.project_id }}.test.xcom_test`
While running this, the columns params are converted to a string and hence resulting in an error. Below is how the query was converted.
INSERT INTO `project.test.xcom_test`
{{ ti.xcom_pull(task_ids='get_columns') }}
select 'Andrew' from `project.test.xcom_test`
Any idea what am I missing ?
I found the reason why my dag is failing.
The "params" field for BigQueryInsertJobOperator is not a templatized field and hence calling "task_instance.xcom_pull" will not work this way.
But instead, you can directly access the 'task_instance' variable from jinja template.
INSERT INTO `{{ params.project_id }}.test.xcom_test`
({{ task_instance.xcom_pull(task_ids='get_columns') }})
select
'Andrew' from `{{ params.project_id }}.test.xcom_test`
https://marclamberti.com/blog/templates-macros-apache-airflow/ - This article explains how to identifies template parameters in airflow

Airflow Template_fields added but variable like {{ ds }} is not working

I want to pass airflow variables to SQL query template file like this (in sql/test.sql file):
select 'test', '{{ params.test_ds }}', '{{ test_dt }}' from test_table;
I created a Operator inherited from PostgresOperator:
class EtlOperator(PostgresOperator):
template_fields = ('sql', 'test_dt', 'params')
template_ext = PostgresOperator.template_ext
#apply_defaults
def __init__(self, test_dt, params, *args, **kwargs):
super(EtlRunIdOperator, self).__init__(*args, **kwargs)
self.test_dt = test_dt
self.params = params
def execute(self, context):
super(EtlRunIdOperator, self).execute(context)
I created this task:
test_task00 = EtlOperator(
task_id=f'test_task00',
postgres_conn_id='redshift',
sql='sql/test.sql',
params={
'test_ds': '{{ ds }}'
},
database='default',
test_dt='{{ execution_date }}',
provide_context=True, # tried without it too
dag=dag
)
However, no matter the params or test_dt that are part of template_fields, the SQL still not parsing the variables, results like this:
INFO - Executing: select 'test', '{{ ds }}', '' from test_table;
Is there anywhere wrong in my configurations?

Python script runs but does not work and no errors are thrown

This program is an API based program that has been working for a few months and all of a sudden has went days without pushing anything to Discord. The script looks fine in CMD, but no errors are being thrown. I was wondering if there was a way to eliminate possible issues such as an API instability issue or something obvious. The program is supposed to go to the site www.bitskins.com and pull skins based on parameters set and push them as an embed to a Discord channel every 10 minutes.
There are two files that run this program.
Here is the one that uses Bitskins API (bitskins.py):
import requests, json
from datetime import datetime, timedelta
class Item:
def __init__(self, item):
withdrawable_at= item['withdrawable_at']
price= float(item['price'])
self.available_in= withdrawable_at- datetime.timestamp(datetime.now())
if self.available_in< 0:
self.available= True
else:
self.available= False
self.suggested_price= float(item['suggested_price'])
self.price= price
self.margin= round(self.suggested_price- self.price, 2)
self.reduction= round((1- (self.price/self.suggested_price))*100, 2)
self.image= item['image']
self.name= item['market_hash_name']
self.item_id= item['item_id']
def __str__(self):
if self.available:
return "Name: {}\nPrice: {}\nSuggested Price: {}\nReduction: {}%\nAvailable Now!\nLink: https://bitskins.com/view_item?app_id=730&item_id={}".format(self.name, self.price, self.suggested_price, self.reduction, self.item_id)
else:
return "Name: {}\nPrice: {}\nSuggested Price: {}\nReduction: {}%\nAvailable in: {}\nLink: https://bitskins.com/view_item?app_id=730&item_id={}".format(self.name, self.price, self.suggested_price, self.reduction, str(timedelta(seconds= self.available_in)), self.item_id)
def __lt__(self, other):
return self.reduction < other.reduction
def __gt__(self, other):
return self.reduction > other.reduction
def get_url(API_KEY, code):
PER_PAGE= 30 # the number of items to retrieve. Either 30 or 480.
return "https://bitskins.com/api/v1/get_inventory_on_sale/?api_key="+ API_KEY+"&code=" + code + "&per_page="+ str(PER_PAGE)
def get_data(url):
r= requests.get(url)
data= r.json()
return data
def get_items(code, API_KEY):
url= get_url(API_KEY, code)
try:
data= get_data(url)
if data['status']=="success":
items= []
items_dic= data['data']['items']
for item in items_dic:
tmp= Item(item)
if tmp.reduction>=25 and tmp.price<=200: # Minimum discount and maximum price to look for when grabbing items. Currently set at minimum discount of 25% and maxmimum price of $200.
items.append(tmp)
return items
else:
raise Exception(data["data"]["error_message"])
except:
raise Exception("Couldn't connect to BitSkins.")
# my_token = pyotp.TOTP(my_secret)
# print(my_token.now()) # in python3
And here is the file with Discord's API (solution.py):
#!/bin/env python3.6
import bitskins
import discord
import pyotp, base64, asyncio
from datetime import timedelta, datetime
TOKEN= "Not input for obvious reasons"
API_KEY= "Not input for obvious reasons"
my_secret= 'Not input for obvious reasons'
client = discord.Client()
def get_embed(item):
embed=discord.Embed(title=item.name, url= "https://bitskins.com/view_item?app_id=730&item_id={}".format(item.item_id), color=0xA3FFE8)
embed.set_author(name="Skin Bot", url="http://www.reactor.gg/",icon_url="https://pbs.twimg.com/profile_images/1050077525471158272/4_R8PsrC_400x400.jpg")
embed.set_thumbnail(url=item.image)
embed.add_field(name="Price:", value="${}".format(item.price))
embed.add_field(name="Discount:", value="{}%".format(item.reduction), inline=True)
if item.available:
tmp= "Instantly Withdrawable"
else:
tmp= str(timedelta(seconds= item.available_in))
embed.add_field(name="Availability:", value=tmp, inline=True)
embed.add_field(name="Suggested Price:", value="${}".format(item.suggested_price), inline=True)
embed.add_field(name="Profit:", value="${}".format(item.margin), inline=True)
embed.set_footer(text="Made by Aqyl#0001 | {}".format(datetime.now()), icon_url="https://www.discordapp.com/assets/6debd47ed13483642cf09e832ed0bc1b.png")
return embed
async def status_task(wait_time= 60* 5):
while True:
print("Updated on: {}".format(datetime.now()))
code= pyotp.TOTP(my_secret)
try:
items= bitskins.get_items(code.now(), API_KEY)
for item in items:
await client.send_message(client.get_channel("656913641832185878"), embed=get_embed(item))
except:
pass
await asyncio.sleep(wait_time)
#client.event
async def on_ready():
wait_time= 60 * 10 # 10 mins in this case
print('CSGO BitSkins Bot')
print('Made by Aqyl#0001')
print('Version 1.0.6')
print('')
print('Logged in as:')
print(client.user.name)
print('------------------------------------------')
client.loop.create_task(status_task(wait_time))
try:
client.run(TOKEN)
except:
print("Couldn't connect to the Discord Server.")
You have a general exception, this will lead to catching exceptions that you really don't want to catch.
try:
items= bitskins.get_items(code.now(), API_KEY)
for item in items:
await client.send_message(client.get_channel("656913641832185878"), embed=get_embed(item))
except:
pass
This is the same as catching any exception that appears there (Exceptions that inherit BaseException
To avoid those problems, you should always catch specific exceptions. (i.e. TypeError).
Example:
try:
raise Exception("Example exc")
except Exception as e:
print(f"Exception caught! {e}")

How to parse json string in airflow template

Is it possible to parse JSON string inside an airflow template?
I have a HttpSensor which monitors a job via a REST API, but the job id is in the response of the upstream task which has xcom_push marked True.
I would like to do something like the following, however, this code gives the error jinja2.exceptions.UndefinedError: 'json' is undefined
t1 = SimpleHttpOperator(
http_conn_id="s1",
task_id="job",
endpoint="some_url",
method='POST',
data=json.dumps({ "foo": "bar" }),
xcom_push=True,
dag=dag,
)
t2 = HttpSensor(
http_conn_id="s1",
task_id="finish_job",
endpoint="job/{{ json.loads(ti.xcom_pull(\"job\")).jobId }}",
response_check=lambda response: True if response.json().state == "complete" else False,
poke_interval=5,
dag=dag
)
t2.set_upstream(t1)
You can add a custom Jinja filter to your DAG with the parameter user_defined_filters to parse the json.
a dictionary of filters that will be exposed
in your jinja templates. For example, passing
dict(hello=lambda name: 'Hello %s' % name) to this argument allows
you to {{ 'world' | hello }} in all jinja templates related to
this DAG.
dag = DAG(
...
user_defined_filters={'fromjson': lambda s: json.loads(s)},
)
t1 = SimpleHttpOperator(
task_id='job',
xcom_push=True,
...
)
t2 = HttpSensor(
endpoint='job/{{ (ti.xcom_pull("job") | fromjson)["jobId"] }}',
...
)
However, it may be cleaner to just write your own custom JsonHttpOperator plugin (or add a flag to SimpleHttpOperator) that parses the JSON before returning so that you can just directly reference {{ti.xcom_pull("job")["jobId"] in the template.
class JsonHttpOperator(SimpleHttpOperator):
def execute(self, context):
text = super(JsonHttpOperator, self).execute(context)
return json.loads(text)
Alternatively, it is also possible to add the json module to the template by doing and the json will be available for usage inside the template. However, it is probably a better idea to create a plugin like Daniel said.
dag = DAG(
'dagname',
default_args=default_args,
schedule_interval="#once",
user_defined_macros={
'json': json
}
)
then
finish_job = HttpSensor(
task_id="finish_job",
endpoint="kue/job/{{ json.loads(ti.xcom_pull('job'))['jobId'] }}",
response_check=lambda response: True if response.json()['state'] == "complete" else False,
poke_interval=5,
dag=dag
)

Resources