Unable to get a result with a CSS selector or XPath expression

I am trying to get the movie names from the https://www.sunnxt.com/movie/inside/ website. When I inspect the elements I can see the movie names, but when I use a CSS selector or XPath expression it does not return them.
When I click on "view source" I see the code is different there, and all the movie data is placed between <script></script> tags.
Please help me get all the movie names.

To retrieve the movie names you have to induce WebDriverWait in conjunction with expected_conditions set to visibility_of_all_elements_located, as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get("https://www.sunnxt.com/movie/inside/")
# wait until all the movie title <h2> elements are visible, then collect them
movieList = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[@class='title']//following::div[@class='home_movie_list_wrap']//a//h2")))
for item in movieList:
    print(item.text)
driver.quit()
Console Output :
Gulaebaghavali
KALAKALAPPU 2
Motta Shiva Ketta Shiva
Annadurai
Aramm
Kaththi Sandai
Meesaya Murukku
Spyder
Sathriyan
Bogan
Brindavanam
Vivegam
Bairavaa
Karuppan
Muthina Kathirika
Dharmadurai
Thozha
Pichaikkaran
Devi
Aranmanai 2
Jackson Durai
Hello Naan Pei Pesuren
Kathakali
Kodi
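Since the question notes that the movie data also sits inside a <script></script> tag in the raw page source, a Selenium-free alternative is to fetch the HTML with requests and pull the titles out of those script blocks directly. The sketch below is only a rough outline under assumptions: the regex presumes the embedded data is JSON-like with a "title" field, which would need to be confirmed against the real source.
import re
import requests
from bs4 import BeautifulSoup

# fetch the raw page; the movie list is assumed to be embedded in a <script> block
res = requests.get("https://www.sunnxt.com/movie/inside/")
soup = BeautifulSoup(res.text, "lxml")

titles = []
for script in soup.find_all("script"):
    text = script.string or ""
    # assumption: each movie entry carries a "title" field inside the embedded JSON
    titles.extend(re.findall(r'"title"\s*:\s*"([^"]+)"', text))

for title in titles:
    print(title)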

Related

Website scraping with BeautifulSoup: TypeError: 'ResultSet' object is not callable

I used the code below, but unfortunately I get an error :-(
import requests
from bs4 import BeautifulSoup
url='https://finance.yahoo.com/quote/AAPL'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
price = soup.find_all('div', {'class' : 'My(6px) Pos(r) smartphone_Mt(6px)'})(0).find()
print(price)
Do you know why it says:
TypeError Traceback (most recent call last)
<ipython-input-23-4a9fd9081b7f> in <module>
8 #print(soup)
9
---> 10 price = soup.find_all('div', {'class' : 'My(6px) Pos(r) smartphone_Mt(6px)'})(0).find()
11 print(price)
TypeError: 'ResultSet' object is not callable
Is there anybody who can help?
BeautifulSoup's find_all method returns a list-like ResultSet. You need to treat it as a list and print the results one by one. That trailing find() at the end I cannot explain from my knowledge of bs4. Also, if you want to refer to an element in a list by index, you need to use square brackets, like so: my_list[0]. Also, your class string is incomplete.
Here is a correct way of dealing with a list in your scenario (although that list has just one element):
import requests
from bs4 import BeautifulSoup
url='https://finance.yahoo.com/quote/AAPL'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# print(soup)
prices = soup.select('div[class ^= "My(6px) Pos(r) smartphone_Mt(6px)"]')
print(len(prices))
for price in prices:
    print(price.text)
Result in terminal:
1
144.11-1.99 (-1.36%)As of 11:20AM EDT. Market open.
BeautifulSoup documentation can be found here: https://beautiful-soup-4.readthedocs.io/en/latest/
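If you prefer to keep the original find_all call, the square-bracket indexing mentioned above is the fix; a minimal sketch, assuming the class string from the question still matches the live page:
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/AAPL'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# find_all returns a ResultSet (list-like); index it with [0], don't call it with (0)
matches = soup.find_all('div', {'class': 'My(6px) Pos(r) smartphone_Mt(6px)'})
if matches:
    print(matches[0].get_text())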

How can I scrape the Ph elements of this list page?

Hello, I would like to scrape the pH values of this page on the website. I have tried many things, but nothing really useful: I get the list but not the element. I tried a DataFrame too, but Colaboratory asks me for a dtype and I don't know where I have to introduce it.
Could you please help me? I'm starting with scraping 🙈
pH elements on the Farmi page
I would like to add the name of the product and the AMM along with the pH data. I get them, so I tried to move to a DataFrame that would be easier to manipulate, but I failed; I don't know how I could do it. Any help please, I would be so grateful.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import csv
import re

res = requests.get("https://www.farmi.com/Soufflet-FR/fr_FR/EUR/Sant%C3%A9-du-v%C3%A9g%C3%A9tal-/Herbicides/Roundup-Flash-Plus/p/19795520")
soup = BeautifulSoup(res.text, "lxml")
all_data = soup.find("ph")
title = soup.find('title')
amm = soup.select('body > main > div.lg\:flex-grow > div.main__inner-wrapper > div.main-container > div > div.lg\:flex-grow.lg\:w-8\/12 > div > div > div.lg\:flex-grow.lg\:w-5\/12.lg\:pl-20.lg\:pr-40 > div.name > p')
tags = []
print(title, amm, all_data)
for d in all_data:
    main_data = d.find("li", text=re.compile("Ph "))
    if main_data is not None:
        tags.append(d)
final_dict = {}
for t in tags:
    name = t.find("li").get_text(strip=True).replace("&nbsp", "")
    print(name)
    final_dict[name] = t.find("p").get_text(strip=True).replace("&nbsp", "")
print(title, amm, final_dict)
complete_data = title, amm, final_dict
print(complete_data)
data_table = pd.DataFrame(complete_data)  # here it fails
from bs4 import BeautifulSoup
import requests
import re
res=requests.get("https://www.farmi.com/Soufflet-FR/fr_FR/EUR/Sant%C3%A9-du-v%C3%A9g%C3%A9tal-/Herbicides/Roundup-Flash-Plus/p/19795520")
soup=BeautifulSoup(res.text,"lxml")
First find the required data with the find_all method, then iterate over it and use the re module to keep only the entries whose text matches:
all_data=soup.find_all("dl")
tags=[]
for d in all_data:
    main_data = d.find("li", text=re.compile("Ph "))
    if main_data is not None:
        tags.append(d)
Now iterate over tags to collect the data as key-value pairs:
final_dict={}
for t in tags:
    name = t.find("li").get_text(strip=True).replace("&nbsp", "")
    print(name)
    final_dict[name] = t.find("p").get_text(strip=True).replace("&nbsp", "")
Output:
{'Ph :': '4.5-5.5 (10 g/l à 23°C) ', 'Ph max :': '5.5 ', 'Ph min :': '4.5 '}
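To get the result into the DataFrame the question asks for, one option is to build a single-row frame from that dictionary and add the product name and AMM as extra columns. This is a minimal sketch on top of the code above, assuming title and amm were obtained as in the question's own soup.find / soup.select calls; the column names are arbitrary choices, not anything the site dictates.
import pandas as pd

# final_dict comes from the loop above; title and amm from the question's code
row = {
    "product": title.get_text(strip=True) if title else None,
    "amm": amm[0].get_text(strip=True) if amm else None,
}
row.update(final_dict)  # adds the 'Ph :', 'Ph max :', 'Ph min :' columns

df = pd.DataFrame([row])  # one row per product; no explicit dtype needed
print(df)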

How to select with two conditions using XPath

//a[contains(@class,'inprogress')] - selects active matches
//span[contains(@itemprop,'name')] - selects all matches
How do I select only matches that aren't active? (Active ones are colored red.)
https://www.fudbal91.com/previews/2022-03-30
You can use not() like
//a[not(contains(@class,'inprogress'))]
And if you want to use both, then combine them:
//a[not(contains(@class,'inprogress'))]//span[contains(@itemprop,'name')]
from selenium import webdriver
from selenium.webdriver.common.by import By
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
import time
url = 'https://www.fudbal91.com/previews/2022-03-30'
#driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.get(url)
time.sleep(2)
all_items = driver.find_elements(By.XPATH, '//a[not(contains(@class,"inprogress"))]//span[contains(@itemprop,"name")]')
print('len(all_items):', len(all_items))
for item in all_items:
    print(item.text)

How to get data from app PowerBI as external user

I am interested in downloading data from a national public dataset on vaccines in Italy.
https://app.powerbi.com/view?r=eyJrIjoiMzg4YmI5NDQtZDM5ZC00ZTIyLTgxN2MtOTBkMWM4MTUyYTg0IiwidCI6ImFmZDBhNzVjLTg2NzEtNGNjZS05MDYxLTJjYTBkOTJlNDIyZiIsImMiOjh9
In particular I am interested in downloading the last table.
I tried to scrape the HTML, but it seems the data is not stored directly in the HTML source page.
Then I thought to use the code below in Python 3.9:
import pytest
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import pyautogui
import pyperclip  # needed for pyperclip.paste() below
import win32api
import win32gui
from win32con import *
driver = webdriver.Chrome()
LINK = ".tableEx .bodyCells div:nth-child(1) > .pivotTableCellWrap:nth-child("
TO_ADD ="1)"
data = []
action = ActionChains(driver)
driver.get("https://app.powerbi.com/view?r=eyJrIjoiMzg4YmI5NDQtZDM5ZC00ZTIyLTgxN2MtOTBkMWM4MTUyYTg0IiwidCI6ImFmZDBhNzVjLTg2NzEtNGNjZS05MDYxLTJjYTBkOTJlNDIyZiIsImMiOjh9")
driver.set_window_size(784, 835)
driver.execute_script("window.scrollTo(0,0)")
driver.execute_script("window.scrollTo(0,0)")
for i in range(1, 293):
    print(i)
    if i % 10 == 0:
        # every 10 cells, scroll the table with the mouse wheel so the next rows are visible
        win32api.SetCursorPos((1625, 724))
        win32api.mouse_event(MOUSEEVENTF_WHEEL, x, y, -3, 0)
        time.sleep(6)
    else:
        pass
    action = ActionChains(driver)
    TO_ADD = str(i) + ")"
    action.move_to_element(driver.find_element(By.CSS_SELECTOR, LINK + TO_ADD)).perform()
    action.context_click().send_keys(Keys.ARROW_DOWN).send_keys(Keys.ARROW_DOWN).perform()
    time.sleep(0.5)
    x, y = pyautogui.locateCenterOnScreen(r'C:\Users\migli\Documents\Learning\copy.png')
    pyautogui.moveTo(x, y, 0.1)
    pyautogui.click()
    time.sleep(0.5)
    x, y = pyautogui.locateCenterOnScreen(r'C:\Users\migli\Documents\Learning\copiaselez.png')
    pyautogui.moveTo(x, y, 0.1)
    pyautogui.click()
    data.append(pyperclip.paste())
Images for clicking:
copiaselez
copy
This seems to achieve what I am trying to do, but it blocks around the 14th cycle and I don't know why. Maybe I should scroll down the page in some manner; I tried to do it manually during the run by inserting a sleep around the 10th cycle, but that errors as well.
I also thought about using an API, but one doesn't seem to exist.
Any idea is accepted.
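One direction worth trying for the scrolling problem: instead of simulating the mouse wheel at a fixed screen position, scroll the table's own container with JavaScript so the next rows are rendered before the next cell is copied. This is only a sketch; ".tableEx" comes from the selector already used above, but ".mid-viewport" is a guess at Power BI's scrollable area and would need to be confirmed in DevTools.
# sketch only: scroll the Power BI table container directly instead of using win32 mouse events
# ".tableEx .mid-viewport" is an assumed selector for the scrollable area; verify it on the live page
scroll_container = driver.find_element(By.CSS_SELECTOR, ".tableEx .mid-viewport")
driver.execute_script("arguments[0].scrollTop += arguments[0].clientHeight;", scroll_container)
time.sleep(2)  # give the report time to render the newly visible rows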

CSS selector or XPath that gets information between two i tags?

I'm trying to scrape price information, and the HTML of the website looks like this
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
I want to get 999. (I don't want the dollar sign or the .00) I currently have
product_price_sn = product.css('.def-price i').extract()
I know it's wrong but not sure how to fix it. Any idea how to scrape that price information? Thanks!
You can use this xpath //span[@class="def-price"]/text()
Make sure you are using /text() and not //text(). Otherwise it will return all text nodes inside span tag.
or
This CSS selector: .def-price::text. When using the CSS selector, don't use .def-price ::text (with a space); it will return all text nodes, like //text() in XPath.
Using scrapy response.xpath object
from scrapy.http import Request, HtmlResponse as Response
content = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''.encode('utf-8')
url = 'https://stackoverflow.com/questions/62849500'
''' mocking scrapy request object '''
request = Request(url=url)
''' mocking scrapy response object '''
response = Response(url=url, request=request, body=content)
''' using xpath '''
print(response.xpath('//span[@class="def-price"]/text()').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip())
# outputs "999"
''' using css selector '''
print(response.css('.def-price::text').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.css('.def-price::text').extract()).strip())
# outputs "99"
Using lxml html parser
from lxml import html
parser = html.fromstring("""
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
"""
)
print(parser.xpath('//span[@class="def-price"]/text()'))
# outputs ['\n ', '\n "999"\n ']
print(''.join(parser.xpath('//span[@class="def-price"]/text()')).strip())
# outputs "999"
With BeautifulSoup, you can use the CSS selector .def-price and then .find_all(text=True, recursive=0) to get all of the element's immediate text nodes.
For example:
from bs4 import BeautifulSoup
txt = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
soup = BeautifulSoup(txt, 'html.parser')
print( ''.join(soup.select_one('.def-price').find_all(text=True, recursive=0)).strip() )
Prints:
"999"
Scrapy implements an extension for that, as it isn't standard for CSS selectors. So this should work for you:
product_price_sn = product.css('.def-price::text').extract()
Here is what the docs say:
Per W3C standards, CSS selectors do not support selecting text nodes
or attribute values. But selecting these is so essential in a web
scraping context that Scrapy (parsel) implements a couple of
non-standard pseudo-elements:
to select text nodes, use ::text
to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value of
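As a small illustration of those pseudo-elements outside a full spider, here is a standalone parsel sketch using the HTML from the question:
from parsel import Selector

html = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00</i>
</span>'''

sel = Selector(text=html)
print(sel.css('.def-price::text').getall())       # direct text nodes of the span (includes "999")
print(sel.css('.def-price::attr(datasku)').get()) # value of the datasku attribute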
