I am trying to download all PDF files containing scanned school books from a website. I tried using wget, but it doesn't work. I suspect this is because the site is an ASP page with selection options to choose the course/year.
I also tried selecting a certain year/course and saving the HTML page locally, but that doesn't work either:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
from urllib.parse import urlparse
import wget

def get_pdfs(my_url):
    links = []
    html = urlopen(my_url).read()
    html_page = bs(html, features="lxml")
    og_url = html_page.find("meta", property="og:url")
    base = urlparse(my_url)
    print("base", base)
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            if og_url:
                print("currentLink", current_link)
                links.append(og_url["content"] + current_link)
            else:
                links.append(base.scheme + "://" + base.netloc + current_link)
    for link in links:
        try:
            wget.download(link)
        except:
            print(" \n \n Unable to Download A File \n")

my_url = 'https://www.svpo.nl/curriculum.asp'
get_pdfs(my_url)

my_url_local_html = r'C:\test\en_2.html'  # year 2 English books page saved locally to extract pdf links
get_pdfs(my_url_local_html)
Snippet of my_url_local_html with the links to the PDFs:
<li><a target="_blank" href="https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf">Chapter 5 - Going extreme</a></li>
<li><a target="_blank" href="https://www.ib3.nl/curriculum/engels\020 TB 2 Ch 6.pdf">Chapter 6 - A matter of taste</a></li>
You need to specify the payload, for example Engels and 2e klas:
url = "https://www.svpo.nl/curriculum.asp"
payload = 'vak=Engels&klas_en_schoolsoort=2e klas'
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
response = requests.post(url, data=payload, headers=headers)
for link in BeautifulSoup(response.text, "lxml").find_all('a'):
current_link = link.get('href')
if current_link.endswith('pdf'):
print(current_link)
OUTPUT:
https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\020 TB 2 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\030 TB 2 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\040 TB 2 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\050 TB 2 Ch 9.pdf
https://www.ib3.nl/curriculum/engels\060 TB 2 Reading matters.pdf
https://www.ib3.nl/curriculum/engels\080 TB 3 Ch 1.pdf
https://www.ib3.nl/curriculum/engels\090 TB 3 Ch 2.pdf
https://www.ib3.nl/curriculum/engels\100 TB 3 Ch 3.pdf
https://www.ib3.nl/curriculum/engels\110 TB 3 Ch 4.pdf
https://www.ib3.nl/curriculum/engels\120 TB 3 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\130 TB 3 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\140 TB 3 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\150 TB 3 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\160 TB 3 Reading matters.pdf
https://www.ib3.nl/curriculum/engels\170 TB 3 Grammar.pdf
https://www.ib3.nl/curriculum/engels\Grammar Survey StSt 2.pdf
https://www.ib3.nl/curriculum/engels\StSt 2 Reading Matters.pdf
https://www.ib3.nl/curriculum/engels\StSt2 Yellow Pages.pdf
https://www.ib3.nl/curriculum/engels\050 WB 2 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\060 WB 2 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\070 WB 2 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\080 WB 2 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\090 WB 2 Ch 9.pdf
https://www.ib3.nl/curriculum/engels\110 WB 3 Ch 1.pdf
https://www.ib3.nl/curriculum/engels\115 WB 3 Ch 2.pdf
https://www.ib3.nl/curriculum/engels\120 WB 3 Ch 3.pdf
https://www.ib3.nl/curriculum/engels\125 WB 3 Ch 4.pdf
https://www.ib3.nl/curriculum/engels\130 WB 3 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\135 WB 3 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\140 WB 3 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\145 WB 3 Ch 8.pdf
UPDATE:
To save a PDF from a link:

with open('somefilename.pdf', 'wb') as f:
    url = r'https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf'.replace(' ', '%20').replace('\\', '/')
    response = requests.get(url)
    f.write(response.content)

@KJ right, you need to replace the spaces and the backslash.
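Putting the two pieces together, here is a minimal sketch (assuming the same Engels / 2e klas payload as above, and that every PDF href follows the pattern shown) that downloads each file under its original name:

import os
import requests
from bs4 import BeautifulSoup

url = "https://www.svpo.nl/curriculum.asp"
payload = 'vak=Engels&klas_en_schoolsoort=2e klas'
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

response = requests.post(url, data=payload, headers=headers)
for link in BeautifulSoup(response.text, "lxml").find_all('a'):
    href = link.get('href')
    if href and href.endswith('pdf'):
        # normalize the backslash and percent-encode the spaces
        pdf_url = href.replace('\\', '/').replace(' ', '%20')
        # keep the original file name, e.g. "010 TB 2 Ch 5.pdf"
        filename = os.path.basename(href.replace('\\', '/'))
        with open(filename, 'wb') as f:
            f.write(requests.get(pdf_url).content)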
This query returns 0 or 20 seemingly at random every time I run it. Yesterday, when I looped through the pages, I always got 20 and was able to scrape 20 listings across 15 pages. But now I can't run my code properly because the listings sometimes come back as 0.
I tried adding headers to the GET request and a random 5-10 s sleep before each request, but I'm still facing the same issue. I also tried connecting to a hotspot to change my IP, with no luck. Does anyone understand why?
import time
from random import randint
from bs4 import BeautifulSoup
import requests #to connect to url
airbnb_url = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click'
soup = BeautifulSoup(requests.get(airbnb_url).content, 'html.parser')
listings = soup.find_all('div', '_8s3ctt')
print(len(listings))
It seems Airbnb returns two versions of the page: one with "normal" HTML and another where the listings are stored inside a <script> tag. To parse the <script> version of the page, you can use the following example:
import json
import requests
from bs4 import BeautifulSoup


def find_listing(d):
    if isinstance(d, dict):
        if "__typename" in d and d["__typename"] == "DoraListingItem":
            yield d["listing"]
        else:
            for v in d.values():
                yield from find_listing(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_listing(v)


airbnb_url = "https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click"

soup = BeautifulSoup(requests.get(airbnb_url).content, "html.parser")
listings = soup.find_all("div", "_8s3ctt")

if len(listings):
    # normal page:
    print(len(listings))
else:
    # page that has listings stored inside <script>:
    data = json.loads(soup.select_one("#data-deferred-state").contents[0])
    for i, l in enumerate(find_listing(data), 1):
        print(i, l["name"])
Prints (when the <script> version is returned):
1 Mariandl (MHO103) for 36 persons.
2 central and friendly! For Families and Friends
3 Sonnenheim for 5 persons.
4 MO's Apartments
5 MO's Apartments
6 Beautiful home in Mayrhofen with 3 Bedrooms
7 Quaint Apartment in Finkenberg near Ski Lift
8 Apartment 2 Villa Daringer (5 pax.)
9 Modern Apartment in Schwendau with Garden
10 Holiday flats Dornau, Mayrhofen
11 Maple View
12 Laubichl Lodge by Apart Hotel Therese
13 Haus Julia - Apartment Edelweiß Mayrhofen
14 Melcherhof,
15 Rest coke
16 Vacation home Traudl
17 Luxurious Apartment near Four Ski Lifts in Mayrhofen
18 Apartment 2 60m² for 2-4 persons "Binder"
19 Apart ZEMMGRUND, 4-9 persons in Mayrhofen/Tirol
20 Apartment Ahorn View
EDIT: To print lat, lng:
...

for i, l in enumerate(find_listing(data), 1):
    print(i, l["name"], l["lat"], l["lng"])
Prints:
1 Mariandl (MHO103) for 36 persons. 47.16522 11.85723
2 central and friendly! For Families and Friends 47.16209 11.859691
3 Sonnenheim for 5 persons. 47.16809 11.86694
4 MO's Apartments 47.166969 11.863186
...
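If you want the results in tabular form, here is a small sketch (assuming the <script> version was returned, so that the find_listing helper and the data dict from the else branch above are available) that collects name, lat and lng into a pandas DataFrame and saves it:

import pandas as pd

rows = [
    {"name": l["name"], "lat": l["lat"], "lng": l["lng"]}
    for l in find_listing(data)
]
df = pd.DataFrame(rows)
df.to_csv("mayrhofen_listings.csv", index=False)
print(df.head())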
I am trying to use Beautiful Soup to scrape the list of ticker symbols from this page: https://www.barchart.com/options/most-active/stocks
My code returns a lot of HTML from the page, but I can't find any of the ticker symbols with Ctrl+F. It would be much appreciated if someone could let me know how I can access these!
Code:
from bs4 import BeautifulSoup as bs
import requests
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://www.barchart.com/options/most-active/stocks"
page = requests.get(url, headers=headers)
html = page.text
soup = bs(html, 'html.parser')
print(soup.find_all())
The table data is loaded from Barchart's internal JSON API rather than being present in the server-rendered HTML, so query that endpoint directly; the request needs the XSRF-TOKEN cookie sent back as the X-XSRF-TOKEN header:

import requests
from urllib.parse import unquote
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url[:25])
        req.headers.update(
            {'X-XSRF-TOKEN': unquote(r.cookies.get_dict()['XSRF-TOKEN'])})
        params = {
            "list": "options.mostActive.us",
            "fields": "symbol,symbolType,symbolName,hasOptions,lastPrice,priceChange,percentChange,optionsImpliedVolatilityRank1y,optionsTotalVolume,optionsPutVolumePercent,optionsCallVolumePercent,optionsPutCallVolumeRatio,tradeTime,symbolCode",
            "orderBy": "optionsTotalVolume",
            "orderDir": "desc",
            "between(lastPrice,.10,)": "",
            "between(tradeTime,2021-08-03,2021-08-04)": "",
            "meta": "field.shortName,field.type,field.description",
            "hasOptions": "true",
            "page": "1",
            "limit": "500",
            "raw": "1"
        }
        r = req.get(url, params=params).json()
        df = pd.DataFrame(r['data']).iloc[:, :-1]
        print(df)


main('https://www.barchart.com/proxies/core-api/v1/quotes/get?')
Output:
symbol symbolType ... tradeTime symbolCode
0 AMD 1 ... 08/03/21 STK
1 AAPL 1 ... 08/03/21 STK
2 TSLA 1 ... 08/03/21 STK
3 AMC 1 ... 08/03/21 STK
4 PFE 1 ... 08/03/21 STK
.. ... ... ... ... ...
495 BTU 1 ... 08/03/21 STK
496 EVER 1 ... 08/03/21 STK
497 VRTX 1 ... 08/03/21 STK
498 MCHP 1 ... 08/03/21 STK
499 PAA 1 ... 08/03/21 STK
[500 rows x 14 columns]
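To get just the ticker symbols the question asked for, have main return the DataFrame instead of printing it (i.e. replace print(df) with return df); then the symbols are simply the symbol column:

df = main('https://www.barchart.com/proxies/core-api/v1/quotes/get?')
tickers = df['symbol'].tolist()
print(tickers[:10])  # e.g. ['AMD', 'AAPL', 'TSLA', 'AMC', 'PFE', ...]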
I want to change the number of results on this page, https://fifatracker.net/players/, to more than 100 and then export the table to Excel, which would make things much easier for me. I tried to scrape it with Python following a tutorial, but I can't make it work. If there is a way to extract the table from all the pages, that would also help me.
As stated, it's restricted to 100 per request. Simply iterate through the query payload on the API to get each page:
import pandas as pd
import requests

url = 'https://fifatracker.net/api/v1/players/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}

page = 1
payload = {
    "pagination": {
        "per_page": "100", "page": page},
    "filters": {
        "attackingworkrate": [],
        "defensiveworkrate": [],
        "primarypositions": [],
        "otherpositions": [],
        "nationality": [],
        "order_by": "-overallrating"},
    "context": {
        "username": "guest",
        "slot": "1", "season": 1},
    "currency": "eur"}

jsonData = requests.post(url, headers=headers, json=payload).json()
current_page = jsonData['pagination']['current_page']
last_page = jsonData['pagination']['last_page']

dfs = []
for page in range(1, last_page+1):
    if page == 1:
        pass
    else:
        payload['pagination']['page'] = page
        jsonData = requests.post(url, headers=headers, json=payload).json()

    players = pd.json_normalize(jsonData['result'])
    dfs.append(players)
    print('Page %s of %s' % (page, last_page))

df = pd.concat(dfs).reset_index(drop=True)
Output:
print(df)
slug ... info.contract.loanedto_clubname
0 lionel-messi ... NaN
1 cristiano-ronaldo ... NaN
2 robert-lewandowski ... NaN
3 neymar-jr ... NaN
4 kevin-de-bruyne ... NaN
... ... ...
19137 levi-kaye ... NaN
19138 phillip-cancar ... NaN
19139 julio-pérez ... NaN
19140 alan-mclaughlin ... NaN
19141 tatsuki-yoshitomi ... NaN
[19142 rows x 92 columns]
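Since the goal was an Excel export, you can then write the combined frame out; a minimal sketch, assuming the df built above (to_excel requires the openpyxl package to be installed):

# write all rows to a single Excel sheet
df.to_excel('fifatracker_players.xlsx', index=False)

# or, if you prefer a CSV that Excel can also open:
df.to_csv('fifatracker_players.csv', index=False)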
My program returns different numbers each time. If I run each page individually it gives the right results. I want to get all the links that have 3 or more votes.
from bs4 import BeautifulSoup as bs
import requests
import pandas

pg = 1
url = "https://stackoverflow.com/search?page="+str(pg)+"&tab=Relevance&q=scrappy%20python"
src = requests.get(url).text
soup = bs(src, 'html.parser')
pages = soup.findAll('a', {'class': 's-pagination--item js-pagination-item'})
number_of_pages = len(pages)
print(number_of_pages)

qualified = []
while pg <= number_of_pages:
    print("In Page :"+str(pg))
    url = "https://stackoverflow.com/search?page=" + str(pg) + "&tab=Relevance&q=scrappy%20python"
    src = requests.get(url).text
    soup = bs(src, 'html.parser')
    a_links = soup.findAll('a', {'class': 'question-hyperlink'})
    span_links = soup.findAll('span', {'class': 'vote-count-post'})
    hrefs = []
    for a_link in a_links:
        hrefs.append(a_link.get('href'))
    for link in range(len(span_links)):
        vote = span_links[link].strong.text
        n = int(vote)
        if n > 2:
            the_link = 'https://stackoverflow.com' + hrefs[link]
            qualified.append(the_link)
    print(len(qualified))
    pg += 1
print(len(qualified)) shows the length of the full, cumulative list; that is your error. To get how many links each page contributes, add i = 0 right after while pg <= number_of_pages:, add i += 1 after if n > 2:, and then print(i) before or after pg += 1.
The code will then look like this:
from bs4 import BeautifulSoup as bs
import requests
import pandas

pg = 1
url = "https://stackoverflow.com/search?page="+str(pg)+"&tab=Relevance&q=scrappy%20python"
src = requests.get(url).text
soup = bs(src, 'html.parser')
pages = soup.findAll('a', {'class': 's-pagination--item js-pagination-item'})
number_of_pages = len(pages)
print(number_of_pages)

qualified = []
while pg <= number_of_pages:
    i = 0
    print("In Page :"+str(pg))
    url = "https://stackoverflow.com/search?page=" + str(pg) + "&tab=Relevance&q=scrappy%20python"
    src = requests.get(url).text
    soup = bs(src, 'html.parser')
    a_links = soup.findAll('a', {'class': 'question-hyperlink'})
    span_links = soup.findAll('span', {'class': 'vote-count-post'})
    hrefs = []
    for a_link in a_links:
        hrefs.append(a_link.get('href'))
    for link in range(len(span_links)):
        vote = span_links[link].strong.text
        n = int(vote)
        if n > 2:
            i += 1
            the_link = 'https://stackoverflow.com' + hrefs[link]
            qualified.append(the_link)
    print(i)
    pg += 1
#print(qualified)
Output:
6
In Page :1
1
In Page :2
4
In Page :3
2
In Page :4
3
In Page :5
2
In Page :6
2
Hi, I have products that are made up of a couple of options. Each option has a SKU code. You can only select one option from each SKU group, and the options have to be concatenated in the order of the SKUGroup.
So, for example, I would have a list of options in a DB table that looks like this:
OptID PID SKU Price SKUGroup
156727 93941 C 171.00 1
156728 93941 BN 171.00 1
156729 93941 PN 171.00 1
156718 93940 W 115.20 2
156719 93940 CA 115.20 2
156720 93940 BA 115.20 2
156721 93940 BNA 115.20 2
156722 93940 BN 115.20 2
156723 93940 BS 115.20 2
156716 93939 CHR 121.50 3
156717 93939 NK 138.00 3
And a few finished product SKUs would look something like:
C-W-CHR 407.70
C-W-NK 424.20
C-CA-CHR 407.70
C-CA-NK 424.20
I am trying to make a script that will create a listing of every possible combination of SKUs and the price of the combined options.
I need this done in Classic ASP (VBScript) and I'm not that familiar with it, so I'm looking for all the help I can get.
Thanks!
I would start by connecting to the database and creating three recordsets.
Set connection = CreateObject("ADODB.Connection")
connection.Open ConnectionString
Set rsOption1 = CreateObject("ADODB.recordset")
Set rsOption2 = CreateObject("ADODB.recordset")
Set rsOption3 = CreateObject("ADODB.recordset")
rsOption1.Open "SELECT * FROM TableName WHERE SKUGroup = 1", connection, 3,3
rsOption2.Open "SELECT * FROM TableName WHERE SKUGroup = 2", connection, 3,3
rsOption3.Open "SELECT * FROM TableName WHERE SKUGroup = 3", connection, 3,3
Then you can use nested loops to get the combinations. Something like this (untested; it probably will not work as-is, but it gives you an idea of how to do it, and it assumes that you have to select one option from each group):
For i = 0 To rsOption1.RecordCount - 1
    rsOption1.Move i, 1
    For j = 0 To rsOption2.RecordCount - 1
        rsOption2.Move j, 1
        For k = 0 To rsOption3.RecordCount - 1
            rsOption3.Move k, 1
            'Write rsOption1.Fields(2).Value & "-" & rsOption2.Fields(2).Value & _
            '    "-" & rsOption3.Fields(2).Value & " " & _
            '    FormatCurrency(rsOption1.Fields(3).Value + rsOption2.Fields(3).Value + rsOption3.Fields(3).Value)
        Next
    Next
Next
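For clarity, the underlying logic is just a cross product of the three groups with a price sum; here is a sketch of that logic in Python using the sample rows from the question (illustration only, since the actual script has to be the Classic ASP above):

from itertools import product

# (SKU, price) rows per SKUGroup, taken from the sample table
group1 = [("C", 171.00), ("BN", 171.00), ("PN", 171.00)]
group2 = [("W", 115.20), ("CA", 115.20), ("BA", 115.20),
          ("BNA", 115.20), ("BN", 115.20), ("BS", 115.20)]
group3 = [("CHR", 121.50), ("NK", 138.00)]

# one option from each group, concatenated in SKUGroup order
for (s1, p1), (s2, p2), (s3, p3) in product(group1, group2, group3):
    print(f"{s1}-{s2}-{s3} {p1 + p2 + p3:.2f}")

This prints C-W-CHR 407.70, C-W-NK 424.20, and so on, matching the expected finished-product SKUs listed in the question.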