I'm trying to get the coordinates ('latitude' and 'longitude') from this JSON-LD piece of code.
> <script type="application/ld+json">
> {"#context":"http://schema.org","#graph":[
> {"#type":"Place","address":
> {"#type":"PostalAddress","streetAddress":"XX, XX"},"geo":
> {"#type":"GeoCoordinates","latitude":50.08872,"longitude":20.0297}}]}
> </script>
The closest I got was:
import json
import requests
from bs4 import BeautifulSoup
req = requests.get(link)
soup = BeautifulSoup(req.text, 'html.parser')
text_ = json.loads("".join(soup.find("script", {"type": "application/ld+json"}).contents))
But even this script gives me a previous JSON-LD block (the first one in the full HTML code).
I'd appreciate even just getting the JSON-LD block as a string.
Thanks
import json
from bs4 import BeautifulSoup
data = """<script type="application/ld+json">
{"#context":"http://schema.org","#graph":[
{"#type":"Place","address":
{"#type":"PostalAddress","streetAddress":"XX, XX"},"geo":
{"#type":"GeoCoordinates","latitude":50.08872,"longitude":20.0297}}]}
</script>"""
soup = BeautifulSoup(data, 'html.parser')
goal = soup.select_one("script").string
match = json.loads(goal)
print(type(match))
print(match)
Output:
<class 'dict'>
{'@context': 'http://schema.org', '@graph': [{'@type': 'Place', 'address': {'@type': 'PostalAddress', 'streetAddress': 'XX, XX'}, 'geo': {'@type': 'GeoCoordinates', 'latitude': 50.08872, 'longitude': 20.0297}}]}
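To pull the coordinates out of the parsed dict, walk the @graph list and read the geo object; a minimal sketch, assuming the structure shown in the question:
# `match` is the dict produced by json.loads() above.
place = match['@graph'][0]       # the "Place" node
geo = place['geo']               # its "GeoCoordinates" node
latitude = geo['latitude']       # 50.08872
longitude = geo['longitude']     # 20.0297
print(latitude, longitude)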
I am trying to get the company location from this website: https://slashdot.org/software/p/monday.com/
I am able to get close with the following code, but I am unable to navigate there.
Code:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://slashdot.org/software/p/monday.com/'
profile = requests.get(url)
soup = bs(profile.content, 'lxml')
location = soup.select_one('div:nth-of-type(4).field-row').text
I feel like this is getting me in the area, but I've been unable to navigate over to "United States." Can someone show me what I am doing wrong?
Desired output:
United States
Thanks!
To get the desired data you can use the :-soup-contains() pseudo-class selector and put the results into a dict to get both the key and the value:
from bs4 import BeautifulSoup
import requests

url = 'https://slashdot.org/software/p/monday.com/'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')

# Select the "Headquarters" label cell and the cell right after it.
label = soup.select_one('.field-row div:-soup-contains("Headquarters")')
value = soup.select_one('.field-row div:-soup-contains("Headquarters") + div')
d = {label.text.replace(':', ''): value.text}
print(d)
Output:
{'Headquarters': 'United States'}
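If you only want the value itself (the desired output from the question), index into that dict:
print(d['Headquarters'])  # United States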
At the moment I have a bit of Frankenstein code (consisting of BeautifulSoup and Scrapy parts) that seems to do the job of reading the info from the page 1 URLs. I shall try to redo everything in Scrapy as soon as the pagination issue is resolved.
So what the code is meant to do:
1. Read all subcategories (BeautifulSoup part).
The rest are Scrapy code parts:
2. Using the above URLs, read the sub-subcategories.
3. Extract the last page number and loop over the above URLs.
4. Extract the necessary product info from the above URLs.
All except part 3 seem to work.
I have tried to use the below code to extract the last page number, but I am not sure how to integrate it into the main code.
def parse_paging(self, response):
    try:
        for next_page in ('?pn=1' + response.xpath('//ul[@class="pagination pull-left"]/noscript/a/text()').extract()[-1]):
            print(next_page)
            # yield scrapy.Request(url=response.urljoin(next_page))
    except:
        pass
Below is the main code.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin
category_list = []
sub_category_url = []
root_url = 'https://uk.rs-online.com/web'
page = requests.get(root_url)
soup = BeautifulSoup(page.content, 'html.parser')
cat_up = [a.find_all('a') for a in soup.find_all('div',class_='horizontalMenu sectionUp')]
category_up = [item for sublist in cat_up for item in sublist]
cat_down = [a.find_all('a') for a in soup.find_all('div',class_='horizontalMenu sectionDown')]
category_down = [item for sublist in cat_down for item in sublist]
for c_up in category_up:
    sub_category_url.append('https://uk.rs-online.com' + c_up['href'])
for c_down in category_down:
    sub_category_url.append('https://uk.rs-online.com' + c_down['href'])
# print(k)
class subcategories(scrapy.Spider):
    name = 'subcategories'

    def start_requests(self):
        urls = sub_category_url
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        products = response.css('div.card.js-title a::attr(href)').extract()  # xpath("//div[contains(@class, 'js-tile')]/a/@href")
        for p in products:
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        for quote in response.css('tr.resultRow'):
            yield {
                'product': quote.css('div.row.margin-bottom a::text').getall(),
                'stock_no': quote.css('div.stock-no-label a::text').getall(),
                'brand': quote.css('div.row a::text').getall(),
                'price': quote.css('div.col-xs-12.price.text-left span::text').getall(),
                'uom': quote.css('div.col-xs-12.pack.text-left span::text').getall(),
            }
process = CrawlerProcess()
process.crawl(subcategories)
process.start()
I would be exceptionally grateful if you could provide any hints on how to deal with the above issue.
Let me know if you have any questions.
I would suggest extracting the next page number with the following, and then constructing the next page URL from that number:
next_page_number = response.css('.nextPage::attr(ng-click)').re_first(r'\d+')
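One possible way to wire this into the spider, as a sketch under the assumption that the listing pages accept a ?pn=<number> query string (as the question's parse_paging attempt suggests), is to read the next page number inside the callback that handles a listing page and yield a follow-up request:
def parse_product(self, response):
    # ... existing extraction of the product rows here ...

    # Hypothetical pagination step: read the next page number and requeue the same callback.
    next_page_number = response.css('.nextPage::attr(ng-click)').re_first(r'\d+')
    if next_page_number:
        # Assumes the listing paginates via a "?pn=<number>" query string, as in the question.
        next_page_url = response.urljoin('?pn=' + next_page_number)
        yield scrapy.Request(next_page_url, callback=self.parse_product)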
I've been working on this for a week and am determined to get this working!
My ultimate goal is to write a webscraper where you can insert the county name and the scraper will produce a csv file of information from mugshots - Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site organization is: primary site page --> state page --> county page (120 mugshots with names and URLs) --> URL with the data I am ultimately after, plus next links to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names from the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a CSV of 120 names and URLs.
I closely followed this article, which was very helpful.
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd
county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')
base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'
data = {'Name': [],'URL': []}
def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class': 'label'})
    url = mugshot.find('a', attrs={'class': 'image-preview'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)

def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class': 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
        next_page_text = mugshot.find('a class', attrs={'next page'})
        if next_page_text == 'Next':
            next_page_text = mugshot.get_text()
            next_page_url = mugshot.get('href')
            next_page_url = base_url + next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)

parse_page(search_url)
Any ideas on how to get the pagination to work and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna
It seems you want to know the logic for getting the content from URLs collected from each page while traversing the next pages. This is how you can parse all the links from each page, including the next page, and then use those links to get the content from their inner pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "https://mugshots.com/"
base = "https://mugshots.com"
def get_next_pages(link):
    print("**" * 20, "current page:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base, item.get("href")))
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url, next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)
I am trying to extract some information from HTML of a web page.
But neither the regex method nor the list comprehension method works.
At http://bitly.kr/RWz5x, there is a key called encparam enclosed in getjason inside a JavaScript tag, which is the 49th of all the script elements on the page.
Thank you for your help in advance.
import re
import requests
from bs4 import BeautifulSoup

sam = requests.get('http://bitly.kr/RWz5x')
#html = sam.text
html = sam.content
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')
#your_script = [script for script in scripts if 'encparam' in str(script)][0]
#print(your_script)
#print(scripts)
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, scripts.text))
Send your request to the following URL, which you can find in the Sources tab:
import requests
from bs4 import BeautifulSoup as bs
import re
res = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
soup = bs(res.content, 'lxml')
r = re.compile(r"encparam: '(.*)'")
data = soup.find('script', text=r).text
encparam = r.findall(data)[0]
print(encparam)
It is likely you can avoid bs4 altogether:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"encparam: '(.*)'")
encparam = p.findall(r.text)[0]
print(encparam)
If you actually want the encparam part in the string:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"(encparam: '\w+')")
encparam = p.findall(r.text)[0]
print(encparam)
I am trying to find all the # elements on a particular webpage using Beautiful Soup.
import requests
from bs4 import BeautifulSoup as Soup
source = "https://www.runinrabbit.com/"
def getPageContents(source):
    req = requests.get(source)
    print("req : ", req, type(req))
    print("***************************")
    content = Soup(req.text, 'html.parser')
    print("content data", type(content), content)
    return content
In the content, I am getting everything else but the tagged values.
E.g., strings with tags like the ones below are not getting printed in my function getPageContents:
#marathoner, #winner, #runinrabbit, #topoathletic, #hartfordmarathon, #rabbitpro, #marathon, #olympictrials, #runnergirl, #winning, #finisher, #run, #running, #runner, #runnersofinstagram, #runnersworld, #runnerscommunity, #breezyback, #lightweight, #simple, #runinrabbit, #borntorunfree, #breezyback, #lightweight, #simple, #runinrabbit, #borntorunfree", #racerollcall, #racetime, #runfast, #goodluck, #RADrabbit, #rabbitELITE, #rabbitELITEtrail, #rabbitPRO, #runinrabbit, #borntorunfree"
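If the hashtags are actually present in the HTML that requests returns, one generic way to pull them out is a regular expression over the page text; a minimal sketch, with the caveat that if the tags are injected by JavaScript (e.g. an embedded social feed), requests alone will never see them and a browser-driven tool such as Selenium would be needed:
import re
import requests
from bs4 import BeautifulSoup as Soup

source = "https://www.runinrabbit.com/"
req = requests.get(source)
content = Soup(req.text, 'html.parser')

# Collect every "#word" style token from the visible text of the page (only works
# if the hashtags are present in the server-rendered HTML).
hashtags = re.findall(r'#\w+', content.get_text())
print(hashtags)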