Unable to scrape a table - web-scraping

I'm attempting to scrape the data from a table on the following website: https://droughtmonitor.unl.edu/DmData/DataTables.aspx
import requests
from bs4 import BeautifulSoup
url = 'https://droughtmonitor.unl.edu/DmData/DataTables.aspx'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
drought_table = soup.find('table', {'id':'datatabl'}).find('tbody').find_all('tr')
For some reason I'm getting no output. I've also tried to use pandas for the same job:
import pandas as pd
url = 'https://droughtmonitor.unl.edu/DmData/DataTables.aspx'
table = pd.read_html(url)
df = table[0]
But I also ended up with an empty DataFrame.
What could be causing this?

By checking the browser's network tool, it's clear that the site uses a separate Fetch/XHR request to load the table.
Image: network monitor
You can use this code to get table data:
import requests
import json
headers = {
    'Content-Type': 'application/json; charset=utf-8',
}
params = (
    ('area', "'conus'"),
    ('statstype', "'1'"),
)
response = requests.get(
    'https://droughtmonitor.unl.edu/DmData/DataTables.aspx/ReturnTabularDMAreaPercent_national',
    headers=headers, params=params,
)
table = json.loads(response.content)
# Code generated by https://curlconverter.com/
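If you want the result back as a table, like the original pandas attempt, the parsed JSON can be fed into a DataFrame. A minimal sketch, assuming the usual ASP.NET page-method shape (a "d" key wrapping the payload, sometimes as a JSON-encoded string); check the keys against the actual response:
import pandas as pd
# Assumption: ASP.NET page methods usually wrap the payload in a "d" key,
# and it may arrive as a JSON-encoded string rather than a list.
payload = table.get('d', table) if isinstance(table, dict) else table
if isinstance(payload, str):
    payload = json.loads(payload)
df = pd.DataFrame(payload)
print(df.head())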

Related

Newbie, Scraping Issue , FUTBIN web scraping issue

I'm new to web scraping and I was trying to scrape the FUTBIN (FUT 22) player database at "https://www.futbin.com/players". My code is below, and I don't know why it can't get any results from the FUTBIN page even though it was successful on other webpages like IMDB.
CODE:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://www.futbin.com/players")
src = request.content
soup = BeautifulSoup(src, features="html.parser")
results = soup.find("a", class_="player_name_players_table get-tp")
print(results)
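A useful first check is whether the player data is present in the static HTML at all, or whether the site blocks the default requests user agent or renders the table with JavaScript. A minimal diagnostic sketch; the class name is taken from the question above and may not match FUTBIN's current markup:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent; some sites serve a block page to the
# default python-requests agent.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
request = requests.get("https://www.futbin.com/players", headers=headers)
print(request.status_code)  # 403/503 usually means the request was blocked

soup = BeautifulSoup(request.content, features="html.parser")
# An empty list here means the rows are most likely rendered by JavaScript
# or fetched from a separate API, so they never appear in the static HTML.
print(soup.find_all("a", class_="player_name_players_table"))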

Data Scraping for Pagination of the Products to get all products details

I want to scrape all the product data for the 'Cushion cover' category, which has the URL 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'.
I analysed the page and the data is in a script tag, but how do I get the data from all of the pages? I need the URLs of all the products from all the pages, and the data is also available through an API for the different pages: API = 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If we keep changing the page number in the link above, we get the data for the respective pages, but how do I collect that data from all of the different pages?
Please suggest an approach.
import requests
import pandas as pd
import json
import csv
from lxml import html
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
print(prodresp)
partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = partjson[0]
You are almost at your goal. You can paginate with a for loop and the range function to pull all the pages; we know the total number of pages is 192, which is why the pagination is built this way. To get all the product URLs (or any other data item) from all of the pages, you can follow the next example.
Script:
import requests
import pandas as pd
import json
from lxml import html
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc'
data = []
for page in range(0, 192):
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodResphtml = html.fromstring(prodresp.text)
    # The page data lives in the __NEXT_DATA__ script tag
    partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
    partjson = json.loads(partjson[0])
    # Collect the product URL of every catalog hit on this page
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        link = 'https://www.noon.com/' + item['url']
        data.append(link)
df = pd.DataFrame(data, columns=['URL'])
# df.to_csv('product.csv', index=False)  # to save the data to your system
print(df)
Output:
URL
0 https://www.noon.com/graphic-geometric-pattern...
1 https://www.noon.com/classic-nordic-decorative...
2 https://www.noon.com/embroidered-iconic-medusa...
3 https://www.noon.com/geometric-marble-texture-...
4 https://www.noon.com/traditional-damask-motif-...
... ...
9594 https://www.noon.com/geometric-printed-cushion...
9595 https://www.noon.com/chinese-style-art-printed...
9596 https://www.noon.com/chinese-style-art-printed...
9597 https://www.noon.com/chinese-style-art-printed...
9598 https://www.noon.com/chinese-style-art-printed...
[9599 rows x 1 columns]
I used the re library; in other words, I used a regex, which works well for scraping pages that render their data with JavaScript.
import requests
import pandas as pd
import json
import csv
from lxml import html
import re
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
url = "https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/"
prodresp = requests.get(url, headers=headers, timeout=30)
# Pull the embedded JSON straight out of the HTML with a regex
jsonpage = re.findall(r'type="application/json">(.*?)</script>', prodresp.text)
jsonpage = json.loads(jsonpage[0])
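From here you still need to walk the parsed JSON to get the product links. A short sketch, assuming the same 'props' > 'pageProps' > 'props' > 'catalog' > 'hits' structure used in the answer above; verify it against the actual payload:
# Assumed key path, matching the previous answer; adjust if the payload differs.
links = []
for item in jsonpage['props']['pageProps']['props']['catalog']['hits']:
    links.append('https://www.noon.com/' + item['url'])
print(links[:5])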

How to scrape multiple URLs and print them in an individual text file?

I've been trying to learn bs4 for the past few days. I successfully scraped a page and wrote the result to a text file, so I tried to scrape multiple pages. The results print successfully in the terminal, but when I try to write them to text files, only the last file gets saved and the rest are not created. Since I'm new to coding, I can't figure out the actual reason.
import bs4
import requests
from fake_useragent import UserAgent
import io
urls = ['https://en.m.wikipedia.org/wiki/Grove_(nature)','https://en.wikipedia.org/wiki/Azadirachta_indica','https://en.wikipedia.org/wiki/Olive']
user_agent = UserAgent()
for url in urls:
    page = requests.get(url, headers={"user-agent": user_agent.chrome})
    tree = bs4.BeautifulSoup(page.text, 'html.parser')
    title = tree.find('title').get_text()
    text = tree.find_all('p')[1].get_text()
    name = title + '.txt'
with io.open(name, "w", encoding="utf-8") as text_file:
    text_file.write(text)
print('files are ready')
You create the file outside the loop. Put the with statement inside the for loop, like this:
import bs4
import requests
from fake_useragent import UserAgent
import io
urls = ['https://en.m.wikipedia.org/wiki/Grove_(nature)','https://en.wikipedia.org/wiki/Azadirachta_indica','https://en.wikipedia.org/wiki/Olive']
user_agent = UserAgent()
for url in urls:
    page = requests.get(url, headers={"user-agent": user_agent.chrome})
    tree = bs4.BeautifulSoup(page.text, 'html.parser')
    title = tree.find('title').get_text()
    text = tree.find_all('p')[1].get_text()
    name = title + '.txt'
    # Open a new file for each URL, inside the loop
    with io.open(name, "w", encoding="utf-8") as text_file:
        text_file.write(text)
print('files are ready')

Why is this CSS selector returning no results?

I am following along with a web-scraping example in Automate the Boring Stuff with Python, but my CSS selector is returning no results.
import bs4
import requests
import sys
import webbrowser
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.find_all(".r a")
numopen = min(5, len(linkelems))
for i in range(numopen):
    webbrowser.open('https://google.com' + linkelems[i].get('href'))
Has Google since modified how they store search links?
From inspecting the search page elements, I see no reason this selector would not work.
There are two problems:
1.) Instead of soup.find_all(".r a"), use soup.select(".r a"). Only the .select() method accepts CSS selectors.
2.) The Google page needs you to specify a User-Agent header to return the correct page.
import bs4
import sys
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]), headers=headers)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.select(".r a")
for a in linkelems:
    print(a.text)
Prints (for example):
Googling ...
Tree - Wikipediaen.wikipedia.org › wiki › Tree
... and so on.
A complementary answer to Andrej Kesely's answer.
If you don't want to deal with figuring out what selectors to use or how to bypass blocks from Google, then you can try to use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that bypassing blocks, data extraction, and more are already done for the end user. All that needs to be done is to iterate over the structured JSON and pick the data you want.
Example code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",               # search engine
    "q": "fus ro dah",                # query
    "api_key": os.getenv("API_KEY"),  # environment variable with your API key
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    print(link)
------------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.nexusmods.com/skyrimspecialedition/mods/14094/
https://tenor.com/search/fus-ro-dah-gifs
'''
Disclaimer: I work for SerpApi.

Get multiple results for different ids from single python requests

I want to get info for different user_ids from an API using python requests. I can use a loop and change the id every time, but that is slow. Is there a simpler way to do this?
import requests
from pprint import pprint
url = "....../api"
paras = {
    'username': 'guest',
    'password': '123456',
    'method': 'location_info',
    'barcode': ['1150764', '1150765'],
}
r = requests.get(url, params=paras, verify = False)
pprint(r.json())
The result only returns the info for the latter barcode, '1150765'. Is there a way to query 100 barcodes at the same time?
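When a parameter value is a list, requests sends it as repeated keys (barcode=1150764&barcode=1150765), and many backends keep only the last value, which matches what you are seeing; whether a true batch query is possible depends entirely on that API. If it has no batch endpoint, a common workaround is to send the per-barcode requests concurrently. A minimal sketch, reusing the placeholder URL and parameters from the question:
import requests
from concurrent.futures import ThreadPoolExecutor
from pprint import pprint

url = "....../api"  # same placeholder endpoint as above
barcodes = ['1150764', '1150765']  # extend to your 100 barcodes

def fetch(barcode):
    # One request per barcode, reusing the other parameters
    paras = {
        'username': 'guest',
        'password': '123456',
        'method': 'location_info',
        'barcode': barcode,
    }
    return requests.get(url, params=paras, verify=False).json()

# Run the requests in parallel instead of one after another
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, barcodes))

for res in results:
    pprint(res)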
