How to extract the title and rating of a movie from the IMDB database? - web-scraping

I'm very new to web scraping in Python. I want to extract the movie name, release year, and rating from the IMDB database. This is the IMDB page with 250 movies and ratings: https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm. I use the modules BeautifulSoup and requests. Here is my code:
movies = bs.find('tbody',class_='lister-list').find_all('tr')
When I tried to extract the movie name, rating, and year, I got the same AttributeError for each of them. A row of the table looks like this:
<td class="titleColumn">
Glass Onion: une histoire à couteaux tirés
<span class="secondaryInfo">(2022)</span>
<div class="velocity">1
<span class="secondaryInfo">(
<span class="global-sprite telemeter up"></span>
1)</span>
</div>
</td>
<td class="ratingColumn imdbRating">
<strong title="7,3 based on 207 962 user ratings">7,3</strong>
Here is the code that fails:
title = movies.find('td',class_='titleColumn').a.text
rating = movies.find('td',class_='ratingColumn imdbRating').strong.text
year = movies.find('td',class_='titleColumn').span.text.strip('()')
AttributeError Traceback (most recent call last)
<ipython-input-9-2363bafd916b> in <module>
----> 1 title = movies.find('td',class_='titleColumn').a.text
2 title
~\anaconda3\lib\site-packages\bs4\element.py in __getattr__(self, key)
   2287     def __getattr__(self, key):
   2288         """Raise a helpful exception to explain a common code fix."""
-> 2289         raise AttributeError(
   2290             "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2291         )
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Can someone help me to solve the problem? Thanks in advance!

To collect the results as a list of dicts, you can try the next example.
from bs4 import BeautifulSoup
import requests
import pandas as pd
data = []
res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm")
#print(res)
soup = BeautifulSoup(res.content, "html.parser")
for card in soup.select('.chart.full-width tbody tr'):
    data.append({
        "title": card.select_one('.titleColumn a').get_text(strip=True),
        "year": card.select_one('.titleColumn span').text,
        'rating': card.select_one('td[class="ratingColumn imdbRating"]').get_text(strip=True)
    })
df = pd.DataFrame(data)
print(df)
#df.to_csv('out.csv', index=False)
Output:
title year rating
0 Avatar: The Way of Water (2022) 7.9
1 Glass Onion (2022) 7.2
2 The Menu (2022) 7.3
3 White Noise (2022) 5.8
4 The Pale Blue Eye (2022) 6.7
.. ... ... ...
95 Zoolander (2001) 6.5
96 Once Upon a Time in Hollywood (2019) 7.6
97 The Lord of the Rings: The Fellowship of the Ring (2001) 8.8
98 New Year's Eve (2011) 5.6
99 Spider-Man: No Way Home (2021) 8.2
[100 rows x 3 columns]
Update: to extract the data using the find_all and find methods:
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0'}
data = []
res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm", headers=headers)
#print(res)
soup = BeautifulSoup(res.content, "html.parser")
for card in soup.table.tbody.find_all("tr"):
    data.append({
        "title": card.find("td", class_="titleColumn").a.get_text(strip=True),
        "year": card.find("td", class_="titleColumn").span.get_text(strip=True),
        'rating': card.find('td', class_="ratingColumn imdbRating").get_text(strip=True)
    })
df = pd.DataFrame(data)
print(df)

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
find_all returns a ResultSet (a list of elements), meaning that movies is a list. You need to iterate over it with for movie in movies:
for movie in movies:
    title = movie.find('td', class_='titleColumn').a.text
    rating = movie.find('td', class_='ratingColumn imdbRating').strong.text
    year = movie.find('td', class_='titleColumn').span.text.strip('()')
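For completeness, here is a minimal end-to-end sketch of that fix, collecting each row into a list of dicts. It mirrors the asker's selectors; the User-Agent header, the missing-rating guard, and the comma-to-dot conversion (the sample above shows a localized rating like "7,3") are my own additions:
from bs4 import BeautifulSoup
import requests

# the User-Agent header is an addition; some endpoints refuse bare requests
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm", headers=headers)
bs = BeautifulSoup(res.content, "html.parser")

movies = bs.find('tbody', class_='lister-list').find_all('tr')

data = []
for movie in movies:
    # unrated rows may lack a <strong> tag, hence the guard
    strong = movie.find('td', class_='ratingColumn imdbRating').strong
    data.append({
        "title": movie.find('td', class_='titleColumn').a.text,
        "year": movie.find('td', class_='titleColumn').span.text.strip('()'),
        # localized pages may use a decimal comma, e.g. "7,3"
        "rating": float(strong.text.replace(',', '.')) if strong else None,
    })
print(data[:3])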

Related

Rvest: using css selector pulls data from different tab in URL

I am very new to scraping, and am trying to pull data from a section of this website - https://projects.fivethirtyeight.com/soccer-predictions/premier-league/. The data I'm trying to get is in the second tab, "Matches," and is the section titled "Upcoming Matches."
I have attempted to do this with SelectorGadget and rvest, as follows:
library(rvest)
url <- "https://projects.fivethirtyeight.com/soccer-predictions/premier-league/"
url %>%
  read_html() %>%
  html_nodes(".prob, .name") %>%
  html_text()
This returns values, but they correspond to the first tab on the page, "Standings." How can I reference the correct section that I am trying to pull?
First: I don't know R, only Python.
When you click Matches, the page uses JavaScript to generate the matches, loading JSON data from:
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_forecast.json
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json
https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_clinches.json
I checked only one of them - 2021_premier-league_matches.json - and I see it has data for Completed Matches.
I made an example in Python:
import requests
url = 'https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json'
response = requests.get(url)
data = response.json()
for item in data:
    # search date
    if item['datetime'].startswith('2022-03-16'):
        print('team1:', item['team1_code'], '|', item['team1'])
        print('prob1:', item['prob1'])
        print('score1:', item['score1'])
        print('adj_score1:', item['adj_score1'])
        print('chances1:', item['chances1'])
        print('moves1:', item['moves1'])
        print('---')
        print('team2:', item['team2_code'], '|', item['team2'])
        print('prob2:', item['prob2'])
        print('score2:', item['score2'])
        print('adj_score2:', item['adj_score2'])
        print('chances2:', item['chances2'])
        print('moves2:', item['moves2'])
        print('----------------------------------------')
Result:
team1: BHA | Brighton and Hove Albion
prob1: 0.30435
score1: 0
adj_score1: 0.0
chances1: 1.244
moves1: 1.682
---
team2: TOT | Tottenham Hotspur
prob2: 0.43627
score2: 2
adj_score2: 2.1
chances2: 1.924
moves2: 1.056
----------------------------------------
team1: ARS | Arsenal
prob1: 0.22114
score1: 0
adj_score1: 0.0
chances1: 0.569
moves1: 0.514
---
team2: LIV | Liverpool
prob2: 0.55306
score2: 2
adj_score2: 2.1
chances2: 1.243
moves2: 0.813
----------------------------------------
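If you want the Upcoming Matches specifically, one option is to filter out fixtures that already have a score. This is a sketch under my assumption (not verified against the JSON) that unplayed matches carry a null score1:
import requests

url = 'https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json'
data = requests.get(url).json()

# assumption: fixtures not yet played have no score filled in
upcoming = [item for item in data if item.get('score1') is None]
for item in upcoming:
    print(item['datetime'], item['team1'], 'vs', item['team2'],
          '| prob1:', item['prob1'], '| prob2:', item['prob2'])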

Webscraping RequestGet from Airbnb not working properly

This query returns 0 or 20 randomly every time I run it. Yesterday, when I looped through the pages, I always got 20, and I was able to scrape 20 listings across 15 pages. But now I can't run my code properly because the listings sometimes come back as 0.
I tried adding headers to the request and a random time.sleep (5-10 s) before each request, but I am still facing the same issue. I also tried connecting to a hotspot to change my IP, but the problem persists. Does anyone understand why?
import time
from random import randint
from bs4 import BeautifulSoup
import requests #to connect to url
airbnb_url = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click'
soup = BeautifulSoup(requests.get(airbnb_url).content, 'html.parser')
listings = soup.find_all('div', '_8s3ctt')
print(len(listings))
It seems Airbnb returns 2 versions of the page: one "normal" HTML version and another where the listings are stored inside a <script> tag. To parse the <script> version of the page, you can use the next example:
import json
import requests
from bs4 import BeautifulSoup
def find_listing(d):
    if isinstance(d, dict):
        if "__typename" in d and d["__typename"] == "DoraListingItem":
            yield d["listing"]
        else:
            for v in d.values():
                yield from find_listing(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_listing(v)
airbnb_url = "https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click"
soup = BeautifulSoup(requests.get(airbnb_url).content, "html.parser")
listings = soup.find_all("div", "_8s3ctt")
if len(listings):
    # normal page:
    print(len(listings))
else:
    # page that has listings stored inside <script>:
    data = json.loads(soup.select_one("#data-deferred-state").contents[0])
    for i, l in enumerate(find_listing(data), 1):
        print(i, l["name"])
Prints (when the <script> version is returned):
1 Mariandl (MHO103) for 36 persons.
2 central and friendly! For Families and Friends
3 Sonnenheim for 5 persons.
4 MO's Apartments
5 MO's Apartments
6 Beautiful home in Mayrhofen with 3 Bedrooms
7 Quaint Apartment in Finkenberg near Ski Lift
8 Apartment 2 Villa Daringer (5 pax.)
9 Modern Apartment in Schwendau with Garden
10 Holiday flats Dornau, Mayrhofen
11 Maple View
12 Laubichl Lodge by Apart Hotel Therese
13 Haus Julia - Apartment Edelweiß Mayrhofen
14 Melcherhof,
15 Rest coke
16 Vacation home Traudl
17 Luxurious Apartment near Four Ski Lifts in Mayrhofen
18 Apartment 2 60m² for 2-4 persons "Binder"
19 Apart ZEMMGRUND, 4-9 persons in Mayrhofen/Tirol
20 Apartment Ahorn View
EDIT: To print lat, lng:
...
for i, l in enumerate(find_listing(data), 1):
    print(i, l["name"], l["lat"], l["lng"])
Prints:
1 Mariandl (MHO103) for 36 persons. 47.16522 11.85723
2 central and friendly! For Families and Friends 47.16209 11.859691
3 Sonnenheim for 5 persons. 47.16809 11.86694
4 MO's Apartments 47.166969 11.863186
...
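If you prefer the listings in tabular form, here is a small sketch reusing find_listing() and the data dict from the example above; the pandas step is my addition:
import pandas as pd

# rows built from the same generator used in the answer above
rows = [{"name": l["name"], "lat": l["lat"], "lng": l["lng"]}
        for l in find_listing(data)]
df = pd.DataFrame(rows)
print(df.head())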

Python code to scrape ticker symbols from Yahoo finance

I have a list of more than 1,000 companies in which I could invest. I need the ticker symbols for all these companies. I run into difficulties when I try to strip the output of the soup, and when I try to loop through all the company names.
Please see an example of the site: https://finance.yahoo.com/lookup?s=asml. The idea is to replace asml and use 'https://finance.yahoo.com/lookup?s=' + company, so I can loop through all the companies.
companies=df
Company name
0 Abbott Laboratories
1 ABBVIE
2 Abercrombie
3 Abiomed
4 Accenture Plc
This is the code I have now; the strip code doesn't work, and the loop over all the companies isn't working either.
#Create a function to scrape the data
def scrape_stock_symbols():
    Companies = df
    url = 'https://finance.yahoo.com/lookup?s=' + Companies
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    Company_Symbol = Soup.find_all('td', attrs={'class': 'data-col0 Ta(start) Pstart(6px) Pend(15px)'})
    for i in company_symbol:
        try:
            row = i.find_all('td')
            company_symbol.append(row[0].text.strip())
        except Exception:
            if company not in company_symbol:
                next(Company)
    return (company_symbol)
#Loop through every company in companies to get all of the tickers from the website
for Company in companies:
    try:
        (temp_company_symbol) = scrape_stock_symbols(company)
    except Exception:
        if company not in companies:
            next(Company)
Another difficulty is that the symbol lookup from Yahoo Finance retrieves many company names.
I will have to clean the data afterwards. I want to set the AMS exchange as the standard: if a company is listed on multiple exchanges, I am only interested in the AMS ticker symbol. The final goal is to create a new dataframe:
Company name Company_symbol
0 Abbott Laboratories ABT
1 ABBVIE ABBV
2 Abercrombie ANF
Here's a solution that doesn't require any scraping. It uses a package called yahooquery (disclaimer: I'm the author), which utilizes an API endpoint that returns symbols for a user's query. You can do something like this:
import pandas as pd
import yahooquery as yq
def get_symbol(query, preferred_exchange='AMS'):
    try:
        data = yq.search(query)
    except ValueError:  # Will catch JSONDecodeError
        print(query)
    else:
        quotes = data['quotes']
        if len(quotes) == 0:
            return 'No Symbol Found'
        symbol = quotes[0]['symbol']
        for quote in quotes:
            if quote['exchange'] == preferred_exchange:
                symbol = quote['symbol']
                break
        return symbol
companies = ['Abbott Laboratories', 'ABBVIE', 'Abercrombie', 'Abiomed', 'Accenture Plc']
df = pd.DataFrame({'Company name': companies})
df['Company symbol'] = df.apply(lambda x: get_symbol(x['Company name']), axis=1)
Company name Company symbol
0 Abbott Laboratories ABT
1 ABBVIE ABBV
2 Abercrombie ANF
3 Abiomed ABMD
4 Accenture Plc ACN
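With over 1,000 names it may be worth pacing the lookups. Here is a minimal sketch reusing get_symbol() and df from above; the half-second pause is an arbitrary choice of mine, not a documented Yahoo limit:
import time

symbols = []
for name in df['Company name']:
    symbols.append(get_symbol(name))
    time.sleep(0.5)  # pause between lookups to avoid hammering the endpoint
df['Company symbol'] = symbols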

Webscraping bs4 - sorting results from different URLs into a table

I have written the script below to scrape a website.
I have left out the URL; if you need it, write to me and I will supply it.
The current output is kind of messy, but it does the job.
I'm very new to scraping, so if you have any suggestions on how to improve the scraping itself, please tell me.
Im looking for help to structure the results into a table that looks like this:
| source | columns... |
| -------- | -------------- |
| url1 | values |
| url2 | values |
Columns: Antal aktier, Börsvärde MSEK, Direktavkastning %, P/E-tal, P/S-tal, etc...
values from data1: 59840000, 5084,00, 0,00, 11,11, 0,59, etc...
values from data2: 14532434, 2284,50, 2,70, 9,73, 0,52, etc...
Ideas on how to solve this are very welcome.
Script:
import bs4
import requests
import re
from bs4 import BeautifulSoup as bs
URL1 = "XXX"
URL2 ="YYY"
r1 = requests.get(URL1)
r2 = requests.get(URL2)
soup1 = bs(r1.content)
soup2 = bs(r2.content)
data1 = soup1.find_all('dl', attrs= {"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})
data2 = soup2.find_all('dl', attrs= {"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})
print(data1[1])
print(data2[1])
Webscraping output:
<dl class="border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder">
<dt><span>Antal aktier</span></dt>
<dd><span>59 840 000</span></dd>
<dt><span>Börsvärde MSEK</span></dt>
<dd><span>5 084,00</span></dd>
<dt><span>Direktavkastning %</span></dt>
<dd><span>0,00</span></dd>
<dt><span>P/E-tal</span></dt>
<dd><span>11,11</span></dd>
<dt><span>P/S-tal</span></dt>
<dd><span>0,59</span></dd>
<dt><span>Kurs/eget kapital </span></dt>
<dd><span>2,60</span></dd>
<dt><span>Omsättning/aktie SEK</span></dt>
<dd><span>132,00</span></dd>
<dt><span>Vinst/aktie SEK</span></dt>
<dd><span>6,98</span></dd>
<dt><span>Eget kapital/aktie SEK</span></dt>
<dd><span>29,55</span></dd>
<dt><span>Försäljning/aktie SEK</span></dt>
<dd><span>-</span></dd>
<dt><span>Effektivavkastning %</span></dt>
<dd><span>0,00</span></dd>
<dt><span>Antal ägare hos Avanza</span></dt>
<dd><span>16 041</span></dd>
</dl>
<dl class="border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder">
<dt><span>Antal aktier</span></dt>
<dd><span>14 532 434</span></dd>
<dt><span>Börsvärde MSEK</span></dt>
<dd><span>2 284,50</span></dd>
<dt><span>Direktavkastning %</span></dt>
<dd><span>2,70</span></dd>
<dt><span>P/E-tal</span></dt>
<dd><span>9,73</span></dd>
<dt><span>P/S-tal</span></dt>
<dd><span>0,52</span></dd>
<dt><span>Kurs/eget kapital </span></dt>
<dd><span>2,73</span></dd>
<dt><span>Omsättning/aktie SEK</span></dt>
<dd><span>303,47</span></dd>
<dt><span>Vinst/aktie SEK</span></dt>
<dd><span>16,16</span></dd>
<dt><span>Eget kapital/aktie SEK</span></dt>
<dd><span>58,34</span></dd>
<dt><span>Försäljning/aktie SEK</span></dt>
<dd><span>-</span></dd>
<dt><span>Effektivavkastning %</span></dt>
<dd><span>2,70</span></dd>
<dt><span>Antal ägare hos Avanza</span></dt>
<dd><span>3 994</span></dd>
</dl>
There are many ways you can solve this!
Looking at your output tells us that the <dt> elements hold the column names and the <dd> elements hold the values. So we can iterate through them and append the data to lists.
column_list = []
value_list = []
columns = soup1.find_all('dt')
for col in columns:
    column_list.append(col.text.strip())  # strip() removes extra space from the text
values = soup1.find_all('dd')
for val in values:
    value_list.append(val.text.strip())
for i in range(len(column_list)):
    print(column_list[i] + ': ' + value_list[i])
Now you can use the data in your lists as you wish. It currently gives output like this:
Kortnamn: AAPL
ISIN: US0378331005
Marknad: NASDAQ
Bransch: Teknik
Handlas i: USD
Beta: 1,1927
Volatilitet %: 24,99
Belåningsvärde %: 60
Säkerhetskrav %: 150
Superränta: Ja
Blankningsbar: Nej
Antal aktier: 17 001 802 000
Börsvärde MUSD: 2 226 555,99
Direktavkastning %: 0,62
P/E-tal: 38,24
P/S-tal: 8,16
Kurs/eget kapital: 31,05
Omsättning/aktie USD: 16,05
Vinst/aktie USD: 3,42
Eget kapital/aktie USD: 4,25
Försäljning/aktie USD: -
Effektivavkastning %: 0,62
Antal ägare hos Avanza: 34 331
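To build the table the question asks for (one row per source URL, one column per label), here is a minimal sketch reusing soup1, soup2, URL1, and URL2 from the question's script. The pandas part is my addition, and it assumes the <dt>/<dd> pairs on each page all belong to the stats list you want:
import pandas as pd

def extract_row(soup, source):
    # pair each <dt> label with its <dd> value
    keys = [dt.text.strip() for dt in soup.find_all('dt')]
    vals = [dd.text.strip() for dd in soup.find_all('dd')]
    row = {'source': source}
    row.update(dict(zip(keys, vals)))
    return row

table = pd.DataFrame([extract_row(soup1, URL1), extract_row(soup2, URL2)])
print(table.to_string(index=False))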

Data scraping with a list in Excel

I have a list in Excel: one code in Column A and another in Column B.
There is a website where I need to input both details into two different boxes, which then takes me to another page.
That page contains certain details which I need to scrape into Excel.
Any help with this?
Ok. Give this a shot:
import pandas as pd
import requests
df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'
results = pd.DataFrame()
for row in df.itertuples():
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except:
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        continue
    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)
    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]
    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except:
        branchData = pd.DataFrame()
        print('No Branch Data.')
    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = results.append(temp_df, sort=False).reset_index(drop=True)
results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...
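One caveat if you run this answer today: DataFrame.append was removed in pandas 2.0, so on a newer pandas you would collect the per-row frames in a list and concatenate once at the end, roughly like this:
import pandas as pd

# pandas >= 2.0: gather frames in a list instead of calling results.append(...)
frames = []  # replaces results = pd.DataFrame()
# inside the scraping loop, use frames.append(temp_df) instead
results = pd.concat(frames, sort=False).reset_index(drop=True) if frames else pd.DataFrame()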
