Using Python, I'm trying to read the values on http://utahcritseries.com/RawResults.aspx. I can read the page just fine, but am having difficulty changing the value of the year combo box, to view data from other years. How can I read the data for years other than the default of 2002?
The page appears to do an HTTP POST once the year combo box has changed. The name of the control is ctl00$ContentPlaceHolder1$ddlSeries. I tried setting a value for this control using urllib.urlencode(postdata), but I must be doing something wrong: the data on the page is not changing. Can this be done in Python?
I'd prefer not to use Selenium, if at all possible.
I've been using code like this (from Stack Overflow user dbr):
import urllib

postdata = {'ctl00$ContentPlaceHolder1$ddlSeries': 9}
src = urllib.urlopen(
    "http://utahcritseries.com/RawResults.aspx",
    data=urllib.urlencode(postdata)
).read()
print src
But it seems to be pulling up the same 2002 data. I've tried using Firebug to inspect the headers, and I see a lot of extraneous and random-looking data being sent back and forth. Do I need to post these values back to the server as well?
Use the excellent mechanize library:
from mechanize import Browser

b = Browser()
b.open("http://utahcritseries.com/RawResults.aspx")
b.select_form(nr=0)  # the page's first (and only) server-side form
year = b.form.find_control(type='select')  # the year dropdown
year.get(label='2005').selected = True
src = b.submit().read()  # mechanize sends the form's hidden fields back for you
print src
Mechanize is available on PyPI: easy_install mechanize
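mechanize works here because it finds the form and resubmits every field in it, including the hidden ones. The "extraneous and random-looking data" you saw in Firebug is ASP.NET's page state (fields such as __VIEWSTATE and __EVENTVALIDATION), and it does have to be posted back. If you'd rather stay with plain HTTP, here is a rough sketch of the same idea using requests and BeautifulSoup; the __EVENTTARGET detail is an assumption about how this particular dropdown triggers its postback:

import requests
from bs4 import BeautifulSoup

url = 'http://utahcritseries.com/RawResults.aspx'
with requests.Session() as s:
    # First GET the page to pick up ASP.NET's hidden state fields
    soup = BeautifulSoup(s.get(url).text, 'html.parser')
    postdata = {
        inp['name']: inp.get('value', '')
        for inp in soup.find_all('input', type='hidden')
    }
    # An auto-postback dropdown usually identifies itself via __EVENTTARGET
    postdata['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$ddlSeries'
    # ...and this is the year selection itself
    postdata['ctl00$ContentPlaceHolder1$ddlSeries'] = '9'
    src = s.post(url, data=postdata).text
    print(src)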
I'm trying to get data from this page
https://bscscan.com/tokenholdings?a=0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d
But the website owner doesn't provide API endpoints for this purpose, so I tried to achieve it in different ways:
- using dryscrape, but the library seems to be abandoned;
- using requests, but the data is provided dynamically by JavaScript;
- using requests-html, but even in this case the data doesn't seem to be loaded.
I would like to avoid Selenium because it's slow, but I don't know how to solve this issue. Does anyone have a solution that could work? The data I need is the table containing the tokens of the wallet. Thank you in advance and have a nice day.
You can do it with requests-html; for example, let's grab the symbol of the first row:
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://bscscan.com/tokenholdings'
token = {'a': '0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d'}
r = session.get(url, params=token)
r.html.render(sleep=2)  # execute the page's JavaScript, then wait 2 seconds
binance_row = r.html.find('tbody tr', first=True)  # first row of the holdings table
symbol = binance_row.find('td')[2].text  # third cell holds the token symbol
print(symbol)
Output:
BNB
https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date in HISTORIC PRICES, and click 'GO'. The table is updated, but I don't see any relevant HTTP requests sent in devtools.
So this means the data was already downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract it? (Note that I don't want to use methods like Selenium; I want to stay with raw HTTP requests to get the raw data.)
EDIT: A websocket was mentioned in the comments, but I can't see one in Developer Tools. I'm adding the websocket tag anyway in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract JavaScript-rendered content without Selenium. You can always make use of a headless browser (you don't see any browser window on your screen; the only pitfall is that you have to wait until the page fully loads) and it won't bother you anymore.
In other words, all the other scraping libraries work from URLs and forms; Scrapy can post forms but cannot run JavaScript.
Selenium will save the day, and all you lose is a couple of seconds per attempt. You can grab the rendered page with driver.page_source and parse it directly (as HTML text) with BeautifulSoup or whatever parser you prefer.
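A minimal sketch of that approach (it assumes Chrome plus a matching chromedriver on your PATH; the fixed sleep is the crude version of "waiting until the page fully loads"):

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # no browser window on screen
driver = webdriver.Chrome(options=options)
driver.get('https://www.nyse.com/quote/XNYS:A')
time.sleep(5)  # crude wait for the page's JavaScript to render

soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the rendered HTML
driver.quit()
print(soup.title.string)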
You can do it with requests-html; for example, let's grab the first row of the table:
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)  # give the page's JavaScript time to build the table
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (here 7 seconds, but maybe less), but if you want to make multiple requests you can do them asynchronously!
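For example, a sketch using requests-html's AsyncHTMLSession, which lets the renders run concurrently (the second ticker URL is just an illustration, and run() doesn't guarantee result order):

from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def first_row(url):
    r = await asession.get(url)
    await r.html.arender(sleep=7)  # async counterpart of render()
    return r.html.find('.flex_tr', first=True).text

# run() takes callables that return coroutines and gathers the results
results = asession.run(
    lambda: first_row('https://www.nyse.com/quote/XNYS:A'),
    lambda: first_row('https://www.nyse.com/quote/XNYS:AA'),
)
for row in results:
    print(row)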
I wrote some code which checks whether a product is back in stock and, when it is, sends me an email to notify me. This works when the things I'm looking for are in the HTML.
However, sometimes certain objects are loaded through JavaScript. How could I edit my code so that the web scraping also works with JavaScript?
This is my code thus far:
import time
import requests

while True:
    # Get the url of the IKEA page
    url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
    # Get the text from that page and put everything in lower case
    productpage = requests.get(url).text.lower()
    # Set the strings that should be on the page if the product is not available
    outofstockstrings = ['niet beschikbaar voor levering', 'alleen beschikbaar in de winkel']
    # Check whether the strings are in the text of the webpage
    if any(x in productpage for x in outofstockstrings):
        time.sleep(1800)
        continue
    else:
        # send me an email and break the loop
        break
Instead of scraping and analyzing the HTML, you could use the unofficial stock API that the IKEA website itself uses. That API returns JSON data, which is much easier to analyze, and you'll also get estimates of when the product will be back in stock.
There is even a project written in JavaScript/Node which gives you this kind of information straight from the command line: https://github.com/Ephigenia/ikea-availability-checker
You can easily check the stock amount of the chair in all stores in the Netherlands:
npx ikea-availability-checker stock --country nl 20336841
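If you'd rather stay in Python, the same idea works with requests once you've found the JSON endpoint the product page calls (look in the Network tab of your browser's developer tools). A sketch; the endpoint URL and field names below are placeholders, not the real API:

import time
import requests

# Placeholder URL and field name: copy the real request and response
# shape from the Network tab of your browser's developer tools.
api_url = 'https://example.com/stock-api/nl/20336841'

while True:
    data = requests.get(api_url).json()
    if data.get('stock', 0) > 0:  # hypothetical field name
        print('Back in stock!')   # send the email here instead
        break
    time.sleep(1800)              # check again in 30 minutes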
I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful; I am open to any other approach, though.
Thank you in advance.
This is just pseudocode to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically; this is just raw

for pep_id in page_ids:
    if pep_id == '0':
        # the initial page
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        ## Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        ## Enter some parsing logic
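The endpoint looks like a JSON API rather than an HTML page, so the parsing logic can start from response.json(). A sketch, assuming the payload carries a jobs list with the company URL in each item; these field names are assumptions, so print the payload once to confirm the real structure:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset=0'

data = requests.get(url, headers=headers).json()
# 'jobs' and 'url' are assumed field names; inspect the payload first.
for job in data.get('jobs', []):
    print(job.get('url'))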
I am trying to scrape data from a web page. Since the page has dynamic content, I used PhantomJS to handle it. But with the code I am using, I can only download the data initially shown on the page; I need to input a date range and submit it to get all the data I want.
Here is the code I used:
library(xml2)
library(rvest)

url <- "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/ptf.xhtml"
connection <- "pr.js"
writeLines(sprintf("var page = require('webpage').create();
var fs = require('fs');
page.open('%s', function(){
  console.log(page.content); // page source
  fs.write('pr.html', page.content, 'w');
  phantom.exit();
});", url), con = connection)

# 'path' is the directory that contains the phantomjs binary
system_input <- paste(path, "phantomjs", " ", connection, sep = "")
system(system_input)
Thanks to this code, I have the HTML output of the dynamically created web page.
But, as I stated, I also need to submit a date input, which I couldn't achieve.
The URL is: https://seffaflik.epias.com.tr/transparency/piyasalar/gop/ptf.xhtml
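The .xhtml URL suggests the page is a JSF application, so the date form POSTs back a hidden javax.faces.ViewState field, much like the ASP.NET page in the first question. Sketched in Python for brevity; the date field names are assumptions you would confirm in your browser's developer tools:

import requests
from bs4 import BeautifulSoup

url = 'https://seffaflik.epias.com.tr/transparency/piyasalar/gop/ptf.xhtml'
with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')
    # JSF keeps its page state in this hidden input
    viewstate = soup.find('input', {'name': 'javax.faces.ViewState'})['value']
    postdata = {
        'javax.faces.ViewState': viewstate,
        # hypothetical field names; copy the real ones from devtools:
        'form:startDate': '01.01.2019',
        'form:endDate': '31.01.2019',
    }
    html = s.post(url, data=postdata).text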