Web Scraping: Some Page URLs Are Missing

I was trying to scrape the first 9 pages of a website, but it looks like page 5 and page 7 are missing. This makes Python raise an AttributeError. I think an if statement can solve this, but I can't figure out what the condition should be.
Here is my code:
import requests
from bs4 import BeautifulSoup
base_url = "http://cbcs.fastvturesults.com/student/1sp15me00"
for page in range(1, 10, 1):
    r = requests.get(base_url + str(page))
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    items = soup.find(class_="text-muted")
    if ??????????:
        pass
    else:
        print("{}\n{}".format(items.previous_sibling, items.text))

The error occurs when you try to access attributes of items while items is None. soup.find() returns None when BeautifulSoup cannot find anything with class_="text-muted".
The solution:
if not items:
    continue
Note that pass (from your solution) does nothing by itself: execution simply moves on to the next statement in the loop body. continue ends the current iteration and moves on to the next iteration.
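The difference between pass and continue can be checked without any network access. The sketch below is only an illustration: the list pages and its contents are made up, with None standing in for what soup.find() returns on a missing page. (In the question's if/else form, pass would also work, because the print sits in the else branch.)

```python
# Made-up stand-ins for the result of soup.find(); None marks a missing page.
pages = ["page1", None, "page3"]

# With `continue`, the rest of the loop body is skipped for the missing page.
with_continue = []
for items in pages:
    if not items:
        continue
    with_continue.append(items.upper())

# With `pass`, nothing is skipped: the statement after the `if` still runs,
# so the None value reaches it as well.
with_pass = []
for items in pages:
    if not items:
        pass
    with_pass.append(items)

print(with_continue)
print(with_pass)
```

Calling items.upper() in the second loop would fail on None, which is the same kind of AttributeError the question describes.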

You don't need an else block here. Checking that items is not None is enough. Try the approach below:
items = soup.find(class_="text-muted")
if items:
    print("{}\n{}".format(items.previous_sibling, items.text))

Related

Scraping Multiple Pages without manually getting the amount of pages

We are currently busy with a property web scraper and are trying to scrape multiple pages without manually entering the page range (there are 5 pages).
for num in range(0, 5):
    url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p" + str(num)
How do you output the URLs of all pages without manually typing the page range?
Output
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p1
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p2
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p3
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p4
https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/p5
Maybe by using the ul class="pagination" to count the number of pages?

You can use the pagination class to fetch the last a tag, read its data-pagenumber attribute, and then use that number to build all the links. The code below does this.
Code:
import requests
from bs4 import BeautifulSoup
# url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467"
url = "https://www.property24.com/for-sale/woodstock/cape-town/western-cape/10164"
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
# Read data-pagenumber from the last <a> in the pagination list.
noofpages = soup.find("ul", {"class": "pagination"}).find_all("a")[-1]["data-pagenumber"]
for i in range(1, int(noofpages) + 1):
    print(f"{url}/p{i}")
Let me know if you have any questions :)
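The URL-building step can be separated from the download so it can be tried on its own. This is a minimal sketch, assuming the data-pagenumber values have already been read from the pagination links; build_page_urls and the example.com address are hypothetical names used only for illustration. Taking the maximum instead of the last link is a defensive choice of this sketch, not something the answer above requires.

```python
def build_page_urls(base_url, page_numbers):
    """Return one URL per page, from page 1 up to the highest page number.

    page_numbers holds the data-pagenumber attribute values as strings.
    """
    last_page = max(int(number) for number in page_numbers)
    base = base_url.rstrip("/")
    return [f"{base}/p{i}" for i in range(1, last_page + 1)]


# Hypothetical attribute values, as they might appear on the pagination links.
urls = build_page_urls("https://example.com/listing", ["1", "2", "3", "5"])
for page_url in urls:
    print(page_url)
```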

Request returns not actual value

I have written the following code and it works fine. I really enjoyed it because I am quite new to Python requests, and even to Python 3, but the following day I noticed that the price variable had not been updated. It has not changed on any run for a week (it stays at 709.49, if that matters). I don't think it is a secret, so I have pasted the whole code below, with the link to the website.
So I want to ask whether I wrote something the wrong way, or whether the web page cannot be read with a simple request. Could you tell me what happened?
Here is the original code:
import requests
import re
from bs4 import BeautifulSoup
pattern = r'\d+\.?\d*'  # raw string, so the backslashes reach the regex engine unchanged
site_doc = requests.get('https://bitbay.net/pl/kurs-walut/kurs-ethereum-pln').text
soup = BeautifulSoup(site_doc, 'html.parser')
price = str(soup.select('title'))
price = re.findall(pattern, price)
print(price)
Thanks in advance!

The reason this doesn't work is that the content you are trying to get is rendered by JavaScript, so it is not part of the HTML that requests downloads. For this, I'd recommend using Selenium to get the JavaScript-rendered content.
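The number-extraction step from the question does not depend on how the HTML is obtained, so it can be checked on a fixed string first. This is a minimal sketch: the title text in sample_title is made up (only the value 709.49 comes from the question), and extract_numbers is a hypothetical helper name.

```python
import re

# Same pattern as in the question, written as a raw string.
PRICE_PATTERN = re.compile(r"\d+\.?\d*")


def extract_numbers(title_text):
    """Return every number found in the text as a float."""
    return [float(match) for match in PRICE_PATTERN.findall(title_text)]


# Made-up title; the real page title may look different.
sample_title = "<title>Kurs Ethereum: 709.49 PLN</title>"
prices = extract_numbers(sample_title)
print(prices)
```

If this step behaves as expected on a fixed string, a value that never changes points at the downloaded HTML rather than at the regular expression.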

Scrapy does not find text in Xpath or Css

I've been at this one for a few days, and no matter what I try, I cannot get Scrapy to extract text that is in one element.
To spare you all the code, here are the important pieces. The setup does grab everything else off the page, just not this text.
from scrapy.selector import Selector
start_url = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
#BASIC ITEM AND SPIDER YADA, SPARE YOU THE DETAILS
hxs = Selector(response)
response_css = response.css("body")
desc_data = hxs.xpath('//*[@id="DETAILS_TRUNC_TEXT"]//text()').extract()
desc_data2 = response_css.css('#DETAILS_TRUNC_TEXT::text').extract()
Both return empty lists. Yes, I found the XPath and CSS selector via Chrome, and the other selectors work just fine, as I'm able to find other data on the site. Please help me find out why this isn't working.

To get the data you need to use a browser simulator such as Selenium so that it can capture dynamically generated content. You also need to add some delay to let the web page load its content fully. This is how you can go about it:
from selenium import webdriver
from scrapy import Selector
import time
driver = webdriver.Chrome()
URL = "https://www.tripadvisor.com/VacationRentalReview-g34416-d12428323-On_the_Beach_Wide_flat_beach_Sunsets_Gulf_view_Sharks_teeth_Shells_Fish-Manasota_Key_F.html"
driver.get(URL)
time.sleep(5)  # Without this line you won't get anything, because the content of that page takes some time to load.
sel = Selector(text=driver.page_source)
item = sel.css('#DETAILS_TRUNC_TEXT::text').extract()  # It is working
item_ano = sel.xpath('//*[@id="DETAILS_TRUNC_TEXT"]//text()').extract()  # It is also working
print(item, item_ano)
driver.quit()
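A fixed time.sleep(5) either waits too long or not long enough. A common alternative is to poll until the content appears, up to a timeout. Selenium provides WebDriverWait in selenium.webdriver.support.ui for this purpose; the helper below is not that API but a hypothetical, standard-library-only illustration of the same idea, with wait_until and fake_page_ready as made-up names.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "is the element on the page yet?"; succeeds on the third call.
attempts = []


def fake_page_ready():
    attempts.append(1)
    return "loaded" if len(attempts) >= 3 else None


outcome = wait_until(fake_page_ready, timeout=2.0, interval=0.01)
print(outcome, len(attempts))
```

With a real driver, the condition would be a function that looks for the DETAILS_TRUNC_TEXT element and returns it once it exists.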

I tried your XPath and CSS selectors in the Scrapy shell and also got nothing.
Then I used the view(response) command and found out that the site is dynamic.
In the page as Scrapy sees it, the details under Overview do not show up, and that's why, no matter what you try, you still get nothing.
Solutions: try Selenium (see the solution that SIM provided in the previous answer) or Splash.
Good Luck. :)

Scrapy spider will not crawl on start urls

I am brand new to Scrapy. I have worked my way through the tutorial and am trying to figure out how to apply what I have learned so far to a seemingly basic task. I know very little Python so far and am using this as a learning experience, so I apologize if I ask a simple question.
My goal for this program is to follow this link http://ucmwww.dnr.state.la.us/ucmsearch/FindDocuments.aspx?idx=xwellserialnumber&val=971683 and to extract the well serial number to a CSV file. Eventually I want to run this spider on several thousand different well files and retrieve specific data. However, I am starting with the basics first.
Right now the spider doesn't crawl any web page that I enter. No errors are reported when I run it; it just states that 0 pages were crawled. I can't quite figure out what I am doing wrong. I am positive the start URL is OK, as I have checked it. Do I need a specific type of spider to accomplish what I am trying to do?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
class Sonrisdataaccess(Spider):
    name = "serial"
    allowed_domains = ["sonris.com"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498"]
    def parse(self, response):
        questions = Selector(response).xpath('/html/body/table[1]/tbody/tr[2]/td[1]')
        for question in questions:
            item = SonrisdataaccessItem()
            item['serial'] = question.xpath('/html/body/table[1]/tbody/tr[2]/td[1]').extract()[0]
            yield item
Thank you for any help, I greatly appreciate it!

First of all, I do not understand what you are doing in your for loop: once you have a selector, you do not need to query the whole HTML again with an absolute path.
Nevertheless, the interesting part is that the browser represents the table quite differently from the HTML that Scrapy downloads. If you look at the response in your parse method, you will see that there is no tbody element in the first table (browsers add it when they build the page). This is why your selection does not return anything.
So to get the first serial number (as in your XPath), change your parse function to this:
def parse(self, response):
    item = SonrisdataaccessItem()
    item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
    yield item
For later changes you may have to alter the XPath expression to get more data.
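The effect of the extra tbody step can be reproduced on a small document. This is a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors (its path support is more limited), and the table in RAW_HTML is made up to mimic markup without a tbody element.

```python
import xml.etree.ElementTree as ET

# Made-up markup as a server might send it: the table has no <tbody>.
RAW_HTML = """<html><body>
<table>
  <tr><th>Serial</th><th>Name</th></tr>
  <tr><td>972498</td><td>Example well</td></tr>
</table>
</body></html>"""

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path copied from the browser, which shows a <tbody> that is not in the source.
with_tbody = root.find("body/table[1]/tbody/tr[2]/td[1]")

# Same path without the tbody step.
without_tbody = root.find("body/table[1]/tr[2]/td[1]")

print(with_tbody)
print(without_tbody.text)
```

The first lookup finds nothing because no tbody element exists in the downloaded markup; the second reaches the cell.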

Programmatically find and change html on pages in my Plone site

I want to search for all documents inside a fairly large Plone site that contain a specific snippet of html in the body (list items with headings inside them, urgh ...) and then change that html (drop the headings).
Pointers on how to do that are much appreciated!

You should create a browser view (or run the instance in debug mode) and run this code:
from Products.CMFCore.utils import getToolByName
import re
ctool = getToolByName(context, 'portal_catalog')
results = ctool.searchResults(portal_type='Document')
for i in results:
    obj = i.getObject()
    text = obj.getField('text').get(obj)
    # <find and remove your html using the regular expression module>
    obj.reindexObject()
If you need to do this many times, consider adding a custom index to simplify the job.
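The regular-expression step left open in the loop above could look something like the following. It is only a sketch: drop_headings_in_list_items is a hypothetical helper, the sample markup is made up, and it only handles a heading that comes directly after the opening li tag. Regular expressions are fragile on HTML in general, so it is worth testing on a copy of the content first.

```python
import re

# An opening <li>, optional whitespace, then a complete <h1>..<h6> element.
HEADING_IN_LI = re.compile(
    r"(<li\b[^>]*>)\s*<(h[1-6])\b[^>]*>(.*?)</\2>",
    flags=re.IGNORECASE | re.DOTALL,
)


def drop_headings_in_list_items(html):
    """Remove the heading tags inside list items but keep the heading text."""
    return HEADING_IN_LI.sub(r"\1\3", html)


# Made-up body text: one heading outside the list, one inside a list item.
sample = "<h2>Keep me</h2><ul><li><h3>First</h3></li><li>Second</li></ul>"
cleaned = drop_headings_in_list_items(sample)
print(cleaned)
```

Headings outside list items are left alone, since the pattern only matches after an opening li tag.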

I have not tried it in a while, but check out GoReplace.
