Why can't I access information in tbody? - web-scraping

I am doing web scraping with BeautifulSoup but cannot find any `tr` inside `tbody`; the `tr` elements are clearly there in the page source ([screenshot of the source][1]), yet `find_all` only returns the `tr` in `thead`.
Link I am scraping: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year
Here is my code:
```python
from urllib.request import urlopen  # needed for urlopen below
from bs4 import BeautifulSoup

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
tr = soup.find_all("tr")
print(tr)
```
[1]: https://i.stack.imgur.com/NFwEV.png

The data is requested from an API that returns JSON, i.e. it is added dynamically, so it doesn't appear in the response to your request for the landing page. You can find the API endpoint that supplies the data in the browser's network tab.
You can set the pageSize parameter to a number larger than the expected result set and then check whether you need to make further requests.
```python
import requests

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
```
Otherwise, you can make an initial call, check how many more requests are needed, and alter the appropriate parameters in the URL; the pageCount is returned in the response.
The relevant parts of the response for a pageSize of 20 look like this:
```
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '20',
 'pageCount': 68,
```
All the relevant info for a loop to get all results is there.
After altering pageSize to a larger number, you can see the following:
```
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '2000',
 'pageCount': 1,
```
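If the row count ever exceeded what a single request returns, pageCount and pageNum are enough to drive a pagination loop. A minimal sketch under that assumption (the `base` template string and the `all_rows` list are just illustrative names, not part of the original answer):

```python
import requests

# Illustrative split of the endpoint into a template; only pageNum and pageSize vary.
base = ('https://cpj.org/api/datamanager/reports/entries?distinct(personId)'
        '&includes=organizations,fullName,location,status,typeOfDeath,charges,'
        'startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName'
        '&pageNum={page}&pageSize={size}&in(status,%27Killed%27)'
        '&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))'
        '&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)')

size = 200
all_rows = []

with requests.Session() as s:
    first = s.get(base.format(page=1, size=size)).json()
    all_rows.extend(first['data'])
    # pageCount tells us how many pages exist at this pageSize
    for page in range(2, first['pageCount'] + 1):
        all_rows.extend(s.get(base.format(page=page, size=size)).json()['data'])

print(len(all_rows))  # should equal rowCount
```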
You can convert to a table using pandas:
```python
import requests
import pandas as pd

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
df = pd.DataFrame(r['data'])
print(df)
```
Sample of df: (screenshot omitted)
Example of checking the actual count and making an additional request for the remaining records:
```python
import requests
import pandas as pd

request_number = 1000

with requests.Session() as s:
    r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
    df = pd.DataFrame(r['data'])
    actual_number = r['rowCount']
    if actual_number > request_number:
        request_number = actual_number - request_number
        r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=2&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
        df2 = pd.DataFrame(r['data'])
        final = pd.concat([df, df2])
    else:
        final = df
```
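If you also want the combined result on disk, pandas can write it straight out; for example (the file name is just a placeholder):

```python
# Save the combined DataFrame to CSV; the file name is illustrative.
final.to_csv('cpj_confirmed_killed.csv', index=False)
```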

To get the tabular content using the selectors you see when inspecting elements, you can try pyppeteer, which I've shown below. The approach is asynchronous, so I'd suggest going for it only if you can't find an API to play with:
```python
import asyncio
from pyppeteer import launch

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"

async def get_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    await page.waitForSelector("table.js-report-builder-table tr td")
    for tr in await page.querySelectorAll("table.js-report-builder-table tr"):
        tds = [await page.evaluate('e => e.innerText', td) for td in await tr.querySelectorAll("th,td")]
        print(tds)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_table(url))
```
The output looks like:
```
['Name', 'Organization', 'Date', 'Location', 'Attack', 'Type of Death', 'Charge']
['Abadullah Hananzai', 'Radio Azadi,Radio Free Europe/Radio Liberty', 'April 30, 2018', 'Afghanistan', 'Killed', 'Murder', '']
['Abay Hailu', 'Agiere', 'February 9, 1998', 'Ethiopia', 'Killed', 'Dangerous Assignment', '']
['Abd al-Karim al-Ezzo', 'Freelance', 'December 21, 2012', 'Syria', 'Killed', 'Crossfire', '']
['Abdallah Bouhachek', 'Révolution et Travail', 'February 10, 1996', 'Algeria', 'Killed', 'Murder', '']
['Abdel Aziz Mahmoud Hasoun', 'Masar Press', 'September 5, 2013', 'Syria', 'Killed', 'Crossfire', '']
['Abdel Karim al-Oqda', 'Shaam News Network', 'September 19, 2012', 'Syria', 'Killed', 'Murder', '']
```
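If you'd rather end up with a DataFrame than printed lists, the same rows can be collected and handed to pandas. A sketch along the lines of the script above, assuming the table selector stays the same (the `get_table_df` name and the header handling are illustrative):

```python
import asyncio
import pandas as pd
from pyppeteer import launch

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"

async def get_table_df(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    await page.waitForSelector("table.js-report-builder-table tr td")
    rows = []
    for tr in await page.querySelectorAll("table.js-report-builder-table tr"):
        cells = [await page.evaluate('e => e.innerText', td)
                 for td in await tr.querySelectorAll("th,td")]
        rows.append(cells)
    await browser.close()
    # First row holds the headers, the rest are data rows
    return pd.DataFrame(rows[1:], columns=rows[0])

if __name__ == '__main__':
    df = asyncio.get_event_loop().run_until_complete(get_table_df(url))
    print(df.head())
```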

Related

How can I use Google Cloud Functions to run a web scraper?

Thanks in advance for your help.
I'm currently running a web scraper (this is the first time I've ever done something like this). It pulls addresses from the URL and then matches each address against the user's input. This will be going into a chat bot, and I'm wondering how I can make it run on Google Cloud Functions. What's the process for doing this? Is there a tutorial anywhere?
This is my code so far; there is a small items file too.
```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import DataItem
from fuzzywuzzy import fuzz
from urllib.parse import urljoin
import scrapy


class AddressesSpider(scrapy.Spider):
    name = 'Addresses'
    allowed_domains = ['find-energy-certificate.service.gov.uk']
    postcode = "bh10+4ah"
    start_urls = ['https://find-energy-certificate.service.gov.uk/find-a-certificate/search-by-postcode?postcode=' + postcode]

    ## def start_requests(self):
    ##     self.first = input("Please enter the address you would like to match: ")
    ##     yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):
        first = input("Please enter the address you would like to match: ")
        highest_ratios = []
        highest_item = None
        for row in response.xpath('//table[@class="govuk-table"]//tr'):
            address = row.xpath("normalize-space(.//a[@class='govuk-link']/text())").extract()[0].lower()
            address = address.rsplit(',', 2)[0]
            link = row.xpath('.//a[@class="govuk-link"]/@href').extract()
            details = row.xpath("normalize-space(.//td/following-sibling::td)").extract()
            ratio = fuzz.token_set_ratio(address, first)

            item = DataItem()
            item['link'] = link
            item['details'] = details
            item['address'] = address
            item['ratioresult'] = ratio

            if len(highest_ratios) < 3:
                highest_ratios.append(item)
            elif ratio > min(highest_ratios, key=lambda x: x['ratioresult'])['ratioresult']:
                highest_ratios.remove(min(highest_ratios, key=lambda x: x['ratioresult']))
                highest_ratios.append(item)

        highest_ratios_100 = [item for item in highest_ratios if item['ratioresult'] == 100]

        if highest_ratios_100:
            for item in highest_ratios_100:
                yield item
        else:
            yield max(highest_ratios, key=lambda x: x['ratioresult'])

        if len(highest_ratios_100) > 1:
            for i, item in enumerate(highest_ratios_100):
                print(f"{i+1}: {item['address']}")
            selected = int(input("Please select the correct address by entering the number corresponding to the address: ")) - 1
            selected_item = highest_ratios_100[selected]
        else:
            selected_item = highest_ratios_100[0] if highest_ratios_100 else max(highest_ratios, key=lambda x: x['ratioresult'])

        new_url = selected_item['link'][0]
        new_url = str(new_url)
        if new_url:
            base_url = 'https://find-energy-certificate.service.gov.uk'
            print(f'Base URL: {base_url}')
            print(f'New URL: {new_url}')
            new_url = urljoin(base_url, new_url)
            print(f'Combined URL: {new_url}')
            yield scrapy.Request(new_url, callback=self.parse_new_page)

    def parse_new_page(self, response):
        Postcode = response.xpath('normalize-space((//p[@class="epc-address govuk-body"]/text())[last()])').extract()
        Town = response.xpath('normalize-space((//p[@class="epc-address govuk-body"]/text())[last()-1])').extract()
        First = response.xpath(".//p[@class='epc-address govuk-body']").extract()
        Type = response.xpath('normalize-space(//dd[1]/text())').extract_first()
        Walls = response.xpath("//th[contains(text(), 'Wall')]/following-sibling::td[1]/text()").extract()
        Roof = response.xpath("//th[contains(text(), 'Roof')]/following-sibling::td[1]/text()").extract()
        Heating = response.xpath("//th[text()='Main heating']/following-sibling::td[1]/text()").extract_first()
        CurrentScore = response.xpath('//body[1]/div[2]/main[1]/div[1]/div[3]/div[3]/svg[1]/svg[1]/text[1]/text()').re_first("[0-9+]{1,2}")
        Maxscore = response.xpath('//body[1]/div[2]/main[1]/div[1]/div[3]/div[3]/svg[1]/svg[2]/text[1]/text()').re_first("[0-9+]{2}")
        Expiry = response.xpath('normalize-space(//b)').extract_first()
        FloorArea = response.xpath('//dt[contains(text(), "floor area")]/following-sibling::dd/text()').re_first("[0-9+]{2,3}")
        Steps = response.xpath("//h3[contains(text(),'Step')]/text()").extract()
        yield {
            'Postcode': Postcode,
            'Town': Town,
            'First': First,
            'Type': Type,
            'Walls': Walls,
            'Roof': Roof,
            'Heating': Heating,
            'CurrentScore': CurrentScore,
            'Maxscore': Maxscore,
            'Expiry': Expiry,
            'FloorArea': FloorArea,
            'Steps': Steps
        }
```
I've tried googling and having a look around, but I can't work out how to deploy this as a project to run on Google Cloud Functions. Or can I just copy the code into the console somewhere?
You can try running your spider from a script. However, a better solution would be to wrap scrapy in its own child process.
For example:
```python
from multiprocessing import Process, Queue
from ... import MySpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def my_cloud_function(event, context):

    def script(queue):
        try:
            settings = get_project_settings()
            settings.setdict({
                'LOG_LEVEL': 'ERROR',
                'LOG_ENABLED': True,
            })

            process = CrawlerProcess(settings)
            process.crawl(MySpider)
            process.start()
            queue.put(None)
        except Exception as e:
            queue.put(e)

    queue = Queue()

    # wrap the spider in a child process
    main_process = Process(target=script, args=(queue,))
    main_process.start()  # start the process
    main_process.join()   # block until the spider finishes

    result = queue.get()  # check the process did not return an error
    if result is not None:
        raise result

    return 'ok'
```
You can refer to this tutorial for more info.
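As for deployment: the usual layout is a main.py containing my_cloud_function, the Scrapy project package next to it, and a requirements.txt listing scrapy plus your other dependencies, deployed with the gcloud CLI using my_cloud_function as the entry point. Before deploying, a quick local smoke test can save a round trip; a minimal, hedged sketch (the None arguments simply stand in for the event and context that Cloud Functions would pass):

```python
# Local smoke test for the handler above; assumes the Scrapy project
# (and MySpider) is importable from the same directory as main.py.
if __name__ == '__main__':
    print(my_cloud_function(event=None, context=None))  # expect 'ok' on success
```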

Company name extraction with bert-base-ner: easy way to know which words relate to which?

Hi, I'm trying to extract the full company name from a string description of the company with bert-base-NER. I'm also open to trying other methods, but I couldn't really find one. The issue is that although it tags the organizations correctly, it tags them word by word (token by token), so I can't easily extract the full company name without concatenating and rebuilding it myself.
Is there an easier way or a better model for this?
Here is my code:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
ner_results = nlp(text1)
print(ner_results)
```
Here is my output for one text string:
```
[{'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11}, {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87}]
```
I faced a similar issue and solved it with a different model, "xlm-roberta-large-finetuned-conll03-english", which returns complete organization names rather than broken word pieces. The code below extracts the full list of organizations from a document.
```python
from transformers import pipeline
from pdfminer.high_level import extract_text
import docx2txt
import time

start = time.time()

model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)


def text_extraction(file):
    """
    Extract text from both PDF and Word files.
    """
    if file.endswith(".pdf"):
        return extract_text(file)
    else:
        resume_text = docx2txt.process(file)
        if resume_text:
            return resume_text.replace('\t', ' ')
        return None


# Organisation name extraction
def org_name(file):
    # Extract the complete text from the document
    extracted_text = text_extraction(file)
    classifier = token_classifier(extracted_text)
    # Keep only the entities grouped as organizations
    values = [item for item in classifier if item["entity_group"] == "ORG"]
    # Pull out the organization names themselves
    res = [sub['word'] for sub in values]
    final1 = list(set(res))             # Remove duplicates
    final = list(filter(None, final1))  # Remove empty strings
    print(final)


org_name("your file name")

end = time.time()
print("The time of execution of above program is :", round((end - start), 2))
```

My result only gets one row of data; can you help me get more data and save it to another file?

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'http://h1.nobbd.de/index.php?start='

for page in range(1, 10):
    req = requests.get(URL + str(page) + '=')
    soup = BeautifulSoup(req.text, 'html.parser')

    h1 = soup.find_all('div', attrs={'class', 'report-wrapper'})
    for hack in h1:
        h2 = hack.find_all("div", attrs={"class", "report"})
        for i in h2:
            layanan = i.find_all('b')[0].text.strip()
            report = i.find_all('a')[2].text.strip()
            bug_hunter = i.find_all('a')[1].text.strip()
            mirror = i.find("a", {"class": "title"})['href']
            date = i.find_all("div", {"class": "date"})
            for d in date:
                waktu = d.text

            data = {"Company": [layanan], "Title:": [report], "Submit:": [bug_hunter], "Link:": [mirror], "Date:": [waktu]}
            df = pd.DataFrame(data)
```
`df.head()` currently shows only:

| index | Company | Title: | Submit: | Link: | Date: |
| --- | --- | --- | --- | --- | --- |
| 0 | Reddit | Application level DOS at Login Page ( Accepts Long Password ) | e100_speaks | https://hackerone.com/reports/1168804 | 03 Feb 2022 |
What happens?
Based on your question's code, you overwrite your dataframe on every iteration; that's why you only get one result.
How to fix?

- Create an empty list before your loops.
- Append each extracted dict to this list.
- Create your dataframe from that list of dicts.
Example
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []

url = 'http://h1.nobbd.de/index.php?start='

for page in range(1, 3):
    req = requests.get(url + str(page))
    soup = BeautifulSoup(req.text, 'html.parser')

    h1 = soup.find_all('div', attrs={'class', 'report-wrapper'})
    for hack in h1:
        h2 = hack.find_all("div", attrs={"class", "report"})
        for i in h2:
            layanan = i.find_all('b')[0].text.strip()
            report = i.find_all('a')[2].text.strip()
            bug_hunter = i.find_all('a')[1].text.strip()
            mirror = i.find("a", {"class": "title"})['href']
            date = i.find_all("div", {"class": "date"})
            for d in date:
                waktu = d.text

            data.append({'Company': layanan, 'Title': report, 'Submit': bug_hunter, 'link': mirror, 'Date': waktu})

df = pd.DataFrame(data)
```
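To cover the "save another file" part of the question, write the assembled DataFrame out once the loops have finished; for example (the file name is just a placeholder):

```python
# Persist the collected reports; the file name is illustrative.
df.to_csv('h1_reports.csv', index=False)
print(df.shape)  # number of rows collected across all pages
```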

Instaloader data scraping using specific hashtag and timeframe

I need help using Instaloader to scrape Instagram posts that include #slowfashion from a specific timeframe.
I want to scrape the visual and textual data from the posts (specifically, the image(s) posted, their descriptions, and the comments).
```python
from datetime import datetime
from itertools import dropwhile, takewhile

import instaloader

# Use parameters to save different metadata
L = instaloader.Instaloader(download_pictures=True, download_videos=False, download_comments=False, save_metadata=True)

# Login
username = input("Enter your username: ")
L.interactive_login(username=username)

# User query
search = input("Enter Hashtag: ")
limit = int(input("How many posts to download: "))

# Hashtag object
hashtags = instaloader.Hashtag.from_name(L.context, search).get_posts()

# Download period
SINCE = datetime(2021, 5, 1)
UNTIL = datetime(2021, 3, 1)

no_of_downloads = 0
for post in takewhile(lambda p: p.date > UNTIL, dropwhile(lambda p: p.date > SINCE, hashtags)):
    if no_of_downloads == limit:
        break
    print(post.date)
    L.download_post(post, "#" + search)
    no_of_downloads += 1
```
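Regarding the captions and comments: save_metadata=True already stores each post's caption in the per-post JSON, and comments can be saved by passing download_comments=True or fetched directly on each Post object. A hedged sketch of collecting them in Python, reusing the variables defined above and assuming Instaloader's documented Post.caption and Post.get_comments() behaviour (the records list is illustrative):

```python
records = []  # illustrative container for the textual data

for post in takewhile(lambda p: p.date > UNTIL, dropwhile(lambda p: p.date > SINCE, hashtags)):
    if no_of_downloads == limit:
        break
    L.download_post(post, "#" + search)  # images and metadata on disk
    records.append({
        "shortcode": post.shortcode,
        "date": post.date,
        "caption": post.caption,  # the post's description text
        "comments": [c.text for c in post.get_comments()],  # needs the login above
    })
    no_of_downloads += 1
```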

I'm getting json.decoder.JSONDecodeError while running a Python script

I got this code from the internet for extracting data from the Justdial website.
While running it I get the following error:
```
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
Please help me run this code, as I'm not familiar with Python. What changes are needed to make it work?
Thank you in advance.
Here is my code:
```python
import csv
import json
import requests
from bs4 import BeautifulSoup

print(25 * "=")
print("Just Dial Scraper")
print(25 * "=")

url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act'\
      '=pagination&city={0}&search={1}&page={2}'

what = input("Enter your Query: ")
what = what.replace(' ', '+')
where = input("Enter the Location: ")

with open(what + "_" + where + '.csv', 'w') as f:
    f.write('company, address, phone\n')
    page = 1
    while True:
        print('Scraping Page', page)
        resp = requests.get(url.format(where, what, page))
        if not resp.json()['paidDocIds']:
            print(25 * "-")
            print('Scraping Finished')
            print(25 * "-")
            break

        markup = resp.json()['markup'].replace('\/', '/')
        soup = BeautifulSoup(markup, 'html.parser')

        for thing in soup.find_all('section'):
            csv_list = []
            if thing.get('class') == ['jcar']:
                # Company name
                for a_tag in thing.find_all('a'):
                    if a_tag.get('onclick') == "_ct('clntnm', 'lspg');":
                        csv_list.append(a_tag.get('title'))

                # Address
                for span_tag in thing.find_all('span'):
                    if span_tag.get('class') == ['mrehover', 'dn']:
                        csv_list.append(span_tag.get_text().strip())

                # Phone number
                for a_tag in thing.find_all('a'):
                    if a_tag.get('href').startswith('tel:'):
                        csv_list.append(a_tag.get('href').split(':')[-1])

                csv_list = ['"' + item + '"' for item in csv_list]
                writeline = ','.join(csv_list) + '\n'
                f.write(','.join(csv_list) + '\n')
        page += 1
```
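The error itself means the response body is not valid JSON, typically because the endpoint returned an empty body or an HTML block page instead of the expected payload. A hedged first debugging step is to look at the raw response before calling .json() (the city and query below are just sample values):

```python
import requests

# Illustrative probe: inspect what Justdial actually returns before parsing it as JSON.
test_url = ('http://www.justdial.com/functions/ajxsearch.php?national_search=0'
            '&act=pagination&city=Mumbai&search=restaurants&page=1')  # sample city/query
resp = requests.get(test_url, headers={'User-Agent': 'Mozilla/5.0'})

print(resp.status_code)  # anything other than 200 explains the failure
print(resp.text[:500])   # if this is HTML or empty, resp.json() will raise JSONDecodeError
```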
