URL doesn't change on the next page - web scraping

I want to get some data from this page. When I navigate to the next page, the URL doesn't change. Here is my code for scraping the first page:
import requests
from bs4 import BeautifulSoup

url = "https://www.airyrooms.com/search?s=26-02-2018.28-02-2018.GEO.103859.Bandung"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("div", {"class": "styles-propertySearchResultDisplayContainer-1XpMp"})

i = 0
for item in g_data:
    try:
        i = i + 1
        print(item.contents[0].find_all("div", {"class": "styles-titlePopUp-17tHZ"})[0].text)              # name
        print(item.contents[0].find_all("span", {"class": "styles-propertyLocationLink-1iVPv"})[0].text)   # location
        print(item.contents[0].find_all("span", {"class": "styles-lineThrough-xyCPH"})[0].text)            # initial price
        print(item.contents[0].find_all("div", {"class": "styles-value-3pvw_"})[1].text)                   # price after discount
    except:
        pass
print(i)
I don't know how to get data from the other pages.
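When the URL stays the same between pages, the next page is almost always loaded by JavaScript, so requests alone never sees it. Below is a minimal sketch of one approach with Selenium, reusing the container class from the code above; the pagination selector ("button.next-page") is a placeholder, not the site's actual markup, so inspect the real page and adjust it.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

url = "https://www.airyrooms.com/search?s=26-02-2018.28-02-2018.GEO.103859.Bandung"
driver = webdriver.Chrome()
driver.get(url)

for page in range(3):  # scrape the first three pages as an example
    # wait until at least one result container has been rendered
    WebDriverWait(driver, 15).until(
        lambda d: d.find_elements_by_css_selector(
            "div.styles-propertySearchResultDisplayContainer-1XpMp"))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for item in soup.find_all("div", {"class": "styles-propertySearchResultDisplayContainer-1XpMp"}):
        name = item.find("div", {"class": "styles-titlePopUp-17tHZ"})
        if name:
            print(name.text)
    # placeholder selector -- inspect the real pagination control and adjust
    next_btn = driver.find_elements_by_css_selector("button.next-page")
    if not next_btn:
        break
    next_btn[0].click()
driver.quit()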

Related

Python Requests doesn't render full code from page

I'm trying to capture each agent's data from this page using Python requests.
But response.text doesn't contain the markup shown in the browser's code inspector (see snapshot).
Below is my script:
import requests
import re

response = requests.get('https://exprealty.com/agents/#/?city=Aberdeen+WA&country=US')
result = re.search('Mike Arthur', response.text)
try:
    print(result.group())
except:
    print('Nothing found.')
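The agent list on that page is most likely rendered by JavaScript after the initial document loads, so the name never appears in response.text. A minimal sketch of one workaround with Selenium, assuming the name does appear in the rendered DOM (the 15-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://exprealty.com/agents/#/?city=Aberdeen+WA&country=US')
# wait (up to 15 s) until the rendered page source contains the agent's name
WebDriverWait(driver, 15).until(lambda d: 'Mike Arthur' in d.page_source)
print('Mike Arthur' in driver.page_source)  # True once the JavaScript has rendered
driver.quit()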

How to get data with BeautifulSoup without having "None"?

I am new to web scraping and I have a problem with it.
I want to get the names of the courses in a specific search result on Udemy (from this link: https://www.udemy.com/courses/search/?src=ukw&q=veri+bilimi).
Here is my code:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.udemy.com/courses/search/?src=ukw&q=veri+bilimi")
print(result.status_code)
src = result.content
soup = BeautifulSoup(src, "lxml")
print(soup.find("div", attrs={"class":"udlite-focus-visible-target udlite-heading-md course-card--course-title--2f7tE"}))
It returns "None" instead of the course names. Unfortunately, I can't see where my mistake is.
Can you help me?
The Udemy website uses JavaScript to load the course titles, which requests can't access. You need to use Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

url = "https://www.udemy.com/courses/search/?src=ukw&q=veri+bilimi"

driver = webdriver.Chrome()
driver.get(url)
time.sleep(6)  # give the JavaScript 6 seconds to render the results

soup = BeautifulSoup(driver.page_source, "lxml")
course_titles = soup.find_all("div", attrs={"class": "udlite-focus-visible-target udlite-heading-md course-card--course-title--2f7tE"})
for title in course_titles:
    print(title.get_text())
See the Selenium installation guide if you need it.
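A fixed time.sleep(6) can be flaky on slow connections. Here is a sketch of an alternative using Selenium's explicit waits; the partial class name is taken from the question and may change, since Udemy's class suffixes are generated:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.udemy.com/courses/search/?src=ukw&q=veri+bilimi"
driver = webdriver.Chrome()
driver.get(url)

# wait (up to 20 s) until at least one course-title div is present,
# instead of sleeping a fixed amount of time
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, "div[class*='course-card--course-title']")))

for el in driver.find_elements_by_css_selector("div[class*='course-card--course-title']"):
    print(el.text)
driver.quit()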

Loading the full HTML after clicking a button to load additional elements with Selenium

I want to scrape a page and collect all the links. The page shows 30 entries, and to view the full list it's necessary to click a "Load all" button.
I'm using the following code:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchfrom=header&lid=1&entry=edgar%20degas&searchtype=p&action=paging&pg=all')

labtn = driver.find_element_by_css_selector('a.load-all')
labtn.click()

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
soup = BeautifulSoup(source_code, 'lxml')

url_list = []
for div in soup.find_all(class_='image-container'):
    for childdiv in div.find_all('a'):
        url_list.append(childdiv['href'])
print(url_list)
Here's the HTML mark-up
<div class="loadAllbtn">
<a class="load-all" id="loadAllUpcomingPast" href="javascript:void(0);">Load all</a>
</div>
I am still getting the original 30 links and the initial code. It seems that I'm not properly using Selenium and would like to know what I'm doing wrong.
Selenium itself works so far: Node.js is installed, and I managed to take a screenshot and save it to a file.
When you click "Load all", the page makes an additional request to fetch all the items. You need to wait for the server's response:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait

driver = webdriver.PhantomJS()
driver.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchfrom=header&lid=1&entry=edgar%20degas&searchtype=p&action=paging&pg=all')

labtn = driver.find_element_by_css_selector('a.load-all')
labtn.click()

wait(driver, 15).until(lambda x: len(driver.find_elements_by_css_selector("div.detailscontainer")) > 30)
The above code waits up to 15 seconds until the number of items exceeds 30. Then you can scrape the page source, which contains the complete list of items.
P.S. Note that you don't need to use these lines of code
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
to get the page source. Just use:
source_code = driver.page_source
P.P.S. You also don't need BeautifulSoup to extract the links to each item. You can do it with Selenium directly:
links = [link.get_attribute('href') for link in driver.find_elements_by_css_selector('div.image-container>a')]
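Putting the pieces together, here is a sketch of the whole flow. PhantomJS support has since been removed from Selenium, so this assumes headless Chrome instead; the selectors are the ones already used in the question and answer above:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('http://www.christies.com/lotfinder/searchresults.aspx?'
           '&searchfrom=header&lid=1&entry=edgar%20degas&searchtype=p&action=paging&pg=all')

# click "Load all" and wait until more than the initial 30 results are present
driver.find_element_by_css_selector('a.load-all').click()
WebDriverWait(driver, 15).until(
    lambda d: len(d.find_elements_by_css_selector('div.detailscontainer')) > 30)

# collect the links directly with Selenium, no BeautifulSoup needed
links = [a.get_attribute('href')
         for a in driver.find_elements_by_css_selector('div.image-container > a')]
print(len(links), 'links found')
driver.quit()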

Detecting valid search parameters for a site? (Web scraping)

I'm trying to scrape a bunch of search results from the site:
http://www.wileyopenaccess.com/view/journals.html
Currently the results show up on 4 pages. The 4th page could be accessed with http://www.wileyopenaccess.com/view/journals.html?page=4
I'd like some way to get all of the results on one page for easier scraping, but I have no idea how to determine which request parameters are valid. I tried a couple of things like:
http://www.wileyopenaccess.com/view/journals.html?per_page=100
http://www.wileyopenaccess.com/view/journals.html?setlimit=100
to no avail. Is there a way to detect the valid parameters of this search?
I'm using BeautifulSoup; is there some obvious way to do this that I've overlooked?
Thanks
You cannot pass any magic parameters to get all the results at once, but you can follow the Next button to collect every page, which works regardless of how many pages there are:
import requests
from bs4 import BeautifulSoup

def get_all_pages():
    response = requests.get('http://www.wileyopenaccess.com/view/journals.html')
    soup = BeautifulSoup(response.text, "html.parser")
    yield soup.select("div.journalRow")
    nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")
    while nxt:
        response = requests.get(nxt["href"])
        soup = BeautifulSoup(response.text, "html.parser")
        yield soup.select("div.journalRow")
        nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")

for page in get_all_pages():
    print(page)
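If you only need the journal links rather than the raw tags, a small extension of the loop would look like this (a sketch; it assumes each div.journalRow contains an anchor pointing to the journal, which I have not verified):

for page in get_all_pages():
    for row in page:
        link = row.find("a")  # assumed: one anchor per journal row
        if link and link.get("href"):
            print(link.get_text(strip=True), link["href"])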

Python-requests: Can't scrape all the html code from a page

I am trying to scrape the content of the Financial Times search page.
Using requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I cannot find it in the requests response, unlike the articles' titles and hyperlinks.
from bs4 import BeautifulSoup
import requests

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")

def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print(ref.get('href'))
            print(ref.get('title'))
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page
Or:
def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print(ref.get('page next'))
If you can see the required links in the browser but cannot get them via requests or urllib, it can mean one of two things:
1. There is something wrong with your logic. Let's assume it's not that.
2. The remaining possibility is Ajax: the parts of the page you are looking for are loaded by JavaScript after the document.onload event fires, so you cannot get something that isn't in the raw HTML in the first place.
My solutions (more like suggestions) are:
1. Reverse engineer the network requests. Difficult, but universally applicable; I do this personally. You might want to use the re module.
2. Use something that renders JavaScript, i.e. simulate the browsing. Check out the webdriver component of Selenium, Qt, etc. This is easier, but more memory hungry and consumes far more network resources than option 1. A sketch of this approach follows below.
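A minimal sketch of the second suggestion for this particular page: render it with Selenium and then look for the next-page element. The "page next" class is taken from the question's own attempt and is not verified, and authentication is omitted:

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://search.ft.com/search?q=SABMiller+PLC'  # use the full search URL from the question here
driver = webdriver.Chrome()  # or Firefox/PhantomJS
driver.get(url)

soup = BeautifulSoup(driver.page_source, "lxml")
# the "page next" class comes from the question; adjust it if the markup differs
next_li = soup.find("li", class_="page next")
if next_li and next_li.find("a"):
    print(next_li.find("a").get("href"))
else:
    print("No next-page link found")
driver.quit()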
