I am trying to get the movie names from the https://www.sunnxt.com/movie/inside/ website. When I inspect the page, I can see the elements with the movie names, but when I apply a CSS selector or an XPath expression, it does not return them.
When I click on View Source, the code there is different, and all the movie data is placed inside a <script></script> tag.
Please help me get all the movie names.
To retrieve the movie names you have to induce WebDriverWait in conjunction with expected_conditions set to visibility_of_all_elements_located, as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver=webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get("https://www.sunnxt.com/movie/inside/")
movieList = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[@class='title']//following::div[@class='home_movie_list_wrap']//a//h2")))
for item in movieList:
    print(item.text)
driver.quit()
Console Output:
Gulaebaghavali
KALAKALAPPU 2
Motta Shiva Ketta Shiva
Annadurai
Aramm
Kaththi Sandai
Meesaya Murukku
Spyder
Sathriyan
Bogan
Brindavanam
Vivegam
Bairavaa
Karuppan
Muthina Kathirika
Dharmadurai
Thozha
Pichaikkaran
Devi
Aranmanai 2
Jackson Durai
Hello Naan Pei Pesuren
Kathakali
Kodi
Related
I used the code below, but unfortunately I get an error :-(
import requests
from bs4 import BeautifulSoup
url='https://finance.yahoo.com/quote/AAPL'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
price = soup.find_all('div', {'class' : 'My(6px) Pos(r) smartphone_Mt(6px)'})(0).find()
print(price)
Do you know why it says:
TypeError Traceback (most recent call last)
<ipython-input-23-4a9fd9081b7f> in <module>
8 #print(soup)
9
---> 10 price = soup.find_all('div', {'class' : 'My(6px) Pos(r) smartphone_Mt(6px)'})(0).find()
11 print(price)
TypeError: 'ResultSet' object is not callable
Is there anybody who can help?
BeautifulSoup's find_all method returns a list (a ResultSet). You need to treat it as a list and print the results one by one. The find() tacked on at the end does not accomplish anything useful here. Also, if you want to refer to an element of a list by its index, you need to use square brackets, like so: my_list[0]. Finally, your class value is incomplete.
Here is a correct way of dealing with a list in your scenario (although that list has just one element):
import requests
from bs4 import BeautifulSoup
url='https://finance.yahoo.com/quote/AAPL'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# print(soup)
prices = soup.select('div[class ^= "My(6px) Pos(r) smartphone_Mt(6px)"]')
print(len(prices))
for price in prices:
    print(price.text)
Result in terminal:
1
144.11-1.99 (-1.36%)As of 11:20AM EDT. Market open.
Advertisement
BeautifulSoup documentation can be found here: https://beautiful-soup-4.readthedocs.io/en/latest/
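Since the answer mentions indexing the list with square brackets, here is a minimal sketch of that variant, reusing the prices list from the snippet above (assuming it is non-empty):
# Index the ResultSet with [0]; parentheses like (0) try to call it, hence the TypeError.
if prices:
    first = prices[0]                      # a Tag, not a callable
    print(first.get_text(strip=True))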
Hello, I would like to scrape the pH values of this page on the website. I have tried many things, but nothing really useful: I get the list but not the element. I also tried a DataFrame, but Colaboratory asks me for a dtype and I don't know where I have to introduce it.
Could you please help me? I'm starting with scraping 🙈
[Screenshot: pH elements on the Farmi page]
I would like to add the name of the product and the AMM to the "pH" data. I do get them, so I tried to move to a DataFrame, which would be easier to manipulate, but I failed. I don't know how I could do it. Any help please, I would be so grateful.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import csv
import re
res=requests.get("https://www.farmi.com/Soufflet-FR/fr_FR/EUR/Sant%C3%A9-du-v%C3%A9g%C3%A9tal-/Herbicides/Roundup-Flash-Plus/p/19795520")
soup=BeautifulSoup(res.text,"lxml")
all_data=soup.find("ph")
title=soup.find('title')
amm=soup.select('body > main > div.lg\:flex-grow > div.main__inner-wrapper > div.main-container > div > div.lg\:flex-grow.lg\:w-8\/12 > div > div > div.lg\:flex-grow.lg\:w-5\/12.lg\:pl-20.lg\:pr-40 > div.name > p')
tags=[]
print(title,amm,all_data)
for d in all_data:
    main_data=d.find("li",text=re.compile("Ph "))
    if main_data is not None:
        tags.append(d)
final_dict={}
for t in tags:
    name=t.find("li").get_text(strip=True).replace(" ","")
    print(name)
    final_dict[name]=t.find("p").get_text(strip=True).replace(" ","")
print(title,amm,final_dict)
complete_data = title,amm,final_dict
print(complete_data)
data_table=pd.DataFrame(complete_data)  # <-- here it fails
from bs4 import BeautifulSoup
import requests
import re
res=requests.get("https://www.farmi.com/Soufflet-FR/fr_FR/EUR/Sant%C3%A9-du-v%C3%A9g%C3%A9tal-/Herbicides/Roundup-Flash-Plus/p/19795520")
soup=BeautifulSoup(res.text,"lxml")
First find the data that is required and filter it with the find_all method, then iterate over it, using the re module to keep only the entries whose text matches:
all_data=soup.find_all("dl")
tags=[]
for d in all_data:
    main_data=d.find("li",text=re.compile("Ph "))
    if main_data is not None:
        tags.append(d)
Now iterate over tags to get the data as key-value pairs:
final_dict={}
for t in tags:
    name=t.find("li").get_text(strip=True).replace(" ","")
    print(name)
    final_dict[name]=t.find("p").get_text(strip=True).replace(" ","")
Output:
{'Ph :': '4.5-5.5 (10 g/l à 23°C) ', 'Ph max :': '5.5 ', 'Ph min :': '4.5 '}
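If the end goal is a pandas DataFrame (as in the question), one possible way is to turn final_dict into rows together with the product name and AMM text; a minimal sketch, reusing title and amm from the question's snippet (the column names are illustrative, not prescribed):
import pandas as pd

product_name = title.get_text(strip=True) if title else ""
amm_text = amm[0].get_text(strip=True) if amm else ""
# One row per pH entry, repeating the product name and AMM.
rows = [{"product": product_name, "amm": amm_text, "field": k, "value": v}
        for k, v in final_dict.items()]
data_table = pd.DataFrame(rows)
print(data_table)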
//a[contains(@class,'inprogress')] - selects active matches
//span[contains(@itemprop,'name')] - selects all matches
How do I select only the matches that aren't active? (The active ones are colored red.)
https://www.fudbal91.com/previews/2022-03-30
You can use not(), like
//a[not(contains(@class,'inprogress'))]
And if you want to use both, then combine them:
//a[not(contains(@class,'inprogress'))]//span[contains(@itemprop,'name')]
from selenium import webdriver
from selenium.webdriver.common.by import By
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
import time
url = 'https://www.fudbal91.com/previews/2022-03-30'
#driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.get(url)
time.sleep(2)
all_items = driver.find_elements(By.XPATH, '//a[not(contains(@class,"inprogress"))]//span[contains(@itemprop,"name")]')
print('len(all_items):', len(all_items))
for item in all_items:
    print(item.text)
I am interested in downloading data from a national public dataset on vaccines in Italy.
https://app.powerbi.com/view?r=eyJrIjoiMzg4YmI5NDQtZDM5ZC00ZTIyLTgxN2MtOTBkMWM4MTUyYTg0IiwidCI6ImFmZDBhNzVjLTg2NzEtNGNjZS05MDYxLTJjYTBkOTJlNDIyZiIsImMiOjh9
In particular I am interested in downloading the last table.
I tried to scrape it from the HTML, but it seems that the data is not stored directly in the HTML source of the page.
So I thought of using the code below in Python 3.9:
import pytest
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import pyperclip
import pyautogui
import win32api
import win32gui
from win32con import *
driver = webdriver.Chrome()
LINK = ".tableEx .bodyCells div:nth-child(1) > .pivotTableCellWrap:nth-child("
TO_ADD ="1)"
data = []
action = ActionChains(driver)
driver.get("https://app.powerbi.com/view?r=eyJrIjoiMzg4YmI5NDQtZDM5ZC00ZTIyLTgxN2MtOTBkMWM4MTUyYTg0IiwidCI6ImFmZDBhNzVjLTg2NzEtNGNjZS05MDYxLTJjYTBkOTJlNDIyZiIsImMiOjh9")
driver.set_window_size(784, 835)
driver.execute_script("window.scrollTo(0,0)")
driver.execute_script("window.scrollTo(0,0)")
for i in range(1,293):
    print(i)
    if i%10==0:
        win32api.SetCursorPos((1625,724))
        win32api.mouse_event(MOUSEEVENTF_WHEEL, x, y, -3, 0)
        time.sleep(6)
    else:
        pass
    action = ActionChains(driver)
    TO_ADD = str(i)+")"
    action.move_to_element(driver.find_element(By.CSS_SELECTOR, LINK+TO_ADD)).perform()
    action.context_click().send_keys(Keys.ARROW_DOWN).send_keys(Keys.ARROW_DOWN).perform()
    time.sleep(0.5)
    x,y = pyautogui.locateCenterOnScreen(r'C:\Users\migli\Documents\Learning\copy.png')
    pyautogui.moveTo(x,y,0.1)
    pyautogui.click()
    time.sleep(0.5)
    x,y = pyautogui.locateCenterOnScreen(r'C:\Users\migli\Documents\Learning\copiaselez.png')
    pyautogui.moveTo(x,y,0.1)
    pyautogui.click()
    data.append(pyperclip.paste())
Images used for clicking (the screenshots referenced by the pyautogui calls above): copiaselez, copy
This seems to achieve what I am trying to do, but it blocks around the 14th cycle and I don't know why. Maybe I should scroll down the page in some manner; I tried to do that manually while the code was running, inserting a sleep around the 10th cycle, but that errors out as well.
I also thought about using an API, but one does not seem to exist.
Any idea is welcome.
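One thing I might still try is to let Selenium scroll the cell itself into view instead of wheeling at the fixed screen position (1625, 724), which breaks as soon as the layout shifts; a rough sketch, assuming the same LINK/TO_ADD selector as above and that Power BI actually renders the row once its cell is scrolled to:
cell = driver.find_element(By.CSS_SELECTOR, LINK + TO_ADD)
# Scroll the cell into the visible area of the virtualized table.
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", cell)
time.sleep(1)  # give the table a moment to render the newly visible rows
ActionChains(driver).context_click(cell).perform()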
I'm trying to scrape price information, and the HTML of the website looks like this
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
I want to get 999. (I don't want the dollar sign or the .00) I currently have
product_price_sn = product.css('.def-price i').extract()
I know it's wrong but not sure how to fix it. Any idea how to scrape that price information? Thanks!
You can use this XPath: //span[@class="def-price"]/text()
Make sure you use /text() and not //text(); otherwise it will return all text nodes inside the span tag.
Or this CSS selector: .def-price::text. When using the CSS selector, don't write .def-price ::text (with a space), because that returns all text nodes, just like //text() in XPath.
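To illustrate the difference on the fragment from the question, here is a small sketch using parsel, the selector library that Scrapy uses under the hood:
from parsel import Selector

html = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
sel = Selector(text=html)

# /text(): only the span's own text nodes -> the "999" part (plus whitespace)
print(sel.xpath('//span[@class="def-price"]/text()').getall())
# //text(): every descendant text node -> also "$" and ".00"
print(sel.xpath('//span[@class="def-price"]//text()').getall())
# .def-price::text mirrors /text(); ".def-price ::text" (with a space) mirrors //text()
print(sel.css('.def-price::text').getall())
print(sel.css('.def-price ::text').getall())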
Using scrapy response.xpath object
from scrapy.http import Request, HtmlResponse as Response
content = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''.encode('utf-8')
url = 'https://stackoverflow.com/questions/62849500'
''' mocking scrapy request object '''
request = Request(url=url)
''' mocking scrapy response object '''
response = Response(url=url, request=request, body=content)
''' using xpath '''
print(response.xpath('//span[@class="def-price"]/text()').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip())
# outputs "999"
''' using css selector '''
print(response.css('.def-price::text').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.css('.def-price::text').extract()).strip())
# outputs "99"
Using lxml html parser
from lxml import html
parser = html.fromstring("""
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
"""
)
print(parser.xpath('//span[@class="def-price"]/text()'))
# outputs ['\n ', '\n "999"\n ']
print(''.join(parser.xpath('//span[@class="def-price"]/text()')).strip())
# outputs "999"
With BeautifulSoup, you can use the CSS selector .def-price and then .find_all(text=True, recursive=0) to get all immediate text.
For example:
from bs4 import BeautifulSoup
txt = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
soup = BeautifulSoup(txt, 'html.parser')
print( ''.join(soup.select_one('.def-price').find_all(text=True, recursive=0)).strip() )
Prints:
"999"
Scrapy implements an extension for this, since it isn't standard for CSS selectors. So this should work for you:
product_price_sn = product.css('.def-price::text').extract()
Here is what the docs say:
Per W3C standards, CSS selectors do not support selecting text nodes
or attribute values. But selecting these is so essential in a web
scraping context that Scrapy (parsel) implements a couple of
non-standard pseudo-elements:
to select text nodes, use ::text
to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value of
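As a small illustration of both pseudo-elements on the fragment from the question, here is a sketch using parsel, which provides Scrapy's selectors (the datasku value is left as the placeholder from the question):
from parsel import Selector

html = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
sel = Selector(text=html)

print(sel.css('.def-price::attr(datasku)').get())              # the attribute value ('....' here)
print(''.join(sel.css('.def-price::text').getall()).strip())   # '"999"'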