Scrapy: download a zip file from an ASP.NET site

I need some help getting Scrapy to download a file from an ASP.NET site. Normally one would click the link in a browser and the file would begin downloading, but that is not possible with Scrapy, so what I am trying to do is the following:
import re

from scrapy.http import FormRequest

def retrieve(self, response):
    print('Response URL: {}'.format(response.url))
    pattern = re.compile('(dg[^\']*)')
    for file in response.xpath('//table[@id="dgFile"]/tbody/tr/td[2]/a'):
        file_url = file.xpath('@href').extract_first()
        target = re.search(pattern, file_url).group(1)
        viewstate = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        viewstategenerator = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value').extract_first()
        eventvalidation = response.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()
        data = {
            '__EVENTTARGET': target,
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': viewstategenerator,
            '__EVENTVALIDATION': eventvalidation
        }
        yield FormRequest.from_response(
            response,
            formdata=data,
            callback=self.end
        )
I am trying to submit this form data to the page in order to receive the zip file back as a response; however, it is not working as I hoped it would. Instead I simply get the same page back as a response.
In a situation like this, is it even possible to use Scrapy to download the file? Does anyone have any pointers?
I have also tried Selenium+PhantomJS, but I ran into a dead end trying to transfer the session from Scrapy to Selenium. I would be willing to use Selenium for this one function, but I need to use Scrapy for this project.
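One way past the session-transfer dead end, for what it's worth, is to copy the cookies across by hand. A rough sketch, assuming headless Chrome in place of PhantomJS and that the session cookies are visible in the response's Set-Cookie headers:

from selenium import webdriver

def scrapy_response_to_selenium(response):
    """Open a Selenium browser that shares the session of a Scrapy response."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    # Selenium only accepts cookies for the domain currently loaded,
    # so visit the page once before injecting them.
    driver.get(response.url)
    for raw in response.headers.getlist('Set-Cookie'):
        name, _, value = raw.decode('utf-8').split(';', 1)[0].partition('=')
        driver.add_cookie({'name': name, 'value': value})
    driver.get(response.url)  # reload, now carrying the Scrapy session cookies
    return driver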

Related

How to figure out where the raw data in a table comes from?

https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date in HISTORIC PRICES, and click 'GO'. The table updates, but I don't see any relevant HTTP requests being sent in devtools.
This suggests the data was already downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract it? (Note that I don't want to use methods like Selenium; I want to stay with raw HTTP requests to get the raw data.)
EDIT: A websocket is mentioned in the comments, but I can't see one in Developer Tools. I am adding the websocket tag anyway in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract JavaScript-rendered content without Selenium. You can always make use of a headless browser (you don't see any browser window on your screen; the only pitfall is that you have to wait until the page fully loads) and it won't bother you anymore.
In other words, most other scraping libraries work with URLs and forms: Scrapy can post forms, but it cannot run JavaScript.
Selenium will save the day; all you lose is a couple of seconds per attempt (it will be milliseconds if run in the foreground). You can grab the page source with driver.page_source and parse it directly, as HTML text, with BeautifulSoup or whatever you prefer.
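A minimal sketch of that approach, assuming headless Chrome and borrowing the .flex_tr row class from the requests-html answer below (selector and wait time are not verified against the live page):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.nyse.com/quote/XNYS:A')
time.sleep(7)  # crude wait for the JavaScript-rendered table to appear
soup = BeautifulSoup(driver.page_source, 'html.parser')
for row in soup.select('.flex_tr'):
    print(row.get_text(' ', strip=True))
driver.quit()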
You can do it with requests-html; for example, let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)  # executes the page's JavaScript; sleep gives it time to finish
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (here 7 seconds, but maybe less); if you want to make multiple requests, you can do them asynchronously!
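A rough sketch of that asynchronous variant, using requests-html's AsyncHTMLSession (the second URL is just an illustrative placeholder):

from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def get_quote(url):
    r = await asession.get(url)
    await r.html.arender(sleep=7)  # async counterpart of render()
    return r.html.find('.flex_tr', first=True).text

# run() schedules the coroutines concurrently and returns their results
results = asession.run(
    lambda: get_quote('https://www.nyse.com/quote/XNYS:A'),
    lambda: get_quote('https://www.nyse.com/quote/XNYS:AA'),  # placeholder second ticker
)
print(results)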

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company who has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful. I am open to any other approach, though.
Thank you in advance.
This is just pseudo-code to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically; this is just raw

for pep_id in page_ids:
    if pep_id == '0':  # the initial page has no offset parameter
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        ## Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        ## Enter some parsing logic
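If the endpoint returns JSON, the parsing step could look something like the sketch below; note that the 'jobs', 'company', and 'url' field names are guesses and need to be checked against the real payload:

data = page.json()  # hypothetical response shape; inspect the real payload first
for job in data.get('jobs', []):
    company = job.get('company', {})
    print(company.get('url'))  # the per-company URL the question asks for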

Scraping data using Python3 from JS generated content

I need to scrape a website (say "www.example.com") from a python3 program which has a form with two elements as follows:
1: Textbox
2: Dropdown
Need to run queries with several options (e.g. 'abc' and '1') filled in/selected in the above form and scrape the pages thus generated. After filling in the form and submitting it, the generated page has a URL, as seen in the browser, of "www.example.com/abc/1". The results on this page are fetched through JavaScript, as can be verified in the page source. A synopsis of the relevant JavaScript is below:
<script type="text/rfetchscript">
    $(document).ready(function(){
        $.ajax({
            url: "http://clients.example.com/api/search",
            data: JSON.parse('{"textname":"abc", "dropval":"1"}'),
            method: 'POST',
            dataType: 'json',
            // Logic to fetch the data
</script>
I have tried to get the results of the page by using methods of requests, urllib:
1:
resp = requests.get('http://www.example.com/abc/1')
2:
req = urllib.request.Request('http://www.example.com/abc/1')
x = urllib.request.urlopen(req)
SourceCode = x.read()
3: Also tried scrapy.
But all of the above return only the static data, as seen in "view page source", and not the actual results that can be seen in the browser.
Looking for help on the right approach here.
Scraping pages with urllib or requests will only return the page source, since neither can execute the JavaScript the server returns. If you want to load the content just like your browser does, you have to use Selenium, optionally with a Chrome or Firefox driver. If you want to keep using urllib or requests, you have to find out which resources the site loads, for example via the Network tab in your Chrome browser; the data you are interested in is probably loaded from a JSON endpoint.
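In this particular case the page source already reveals the AJAX endpoint and its payload, so a third option is to replicate that POST directly and skip the browser entirely. A sketch based on the snippet in the question (the endpoint and payload are taken from it; everything else is an assumption):

import requests

# Replicates the $.ajax call shown in the question's page source.
# jQuery sends a plain data object form-encoded, so data= is used here;
# if the API turns out to expect a JSON body, switch to json=payload.
payload = {'textname': 'abc', 'dropval': '1'}
resp = requests.post('http://clients.example.com/api/search',
                     data=payload,
                     headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()
results = resp.json()  # the script declares dataType 'json'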

Staying Logged In Using Requests and Python

I am trying to log onto a website using Python and requests. I'm pretty sure I am logging in properly. Next, I go to a different page and try to download a file from it; however, in order to download the file you have to be logged in. When I try to download it, the site redirects me to the login page, saying I haven't logged in. I am stuck and don't know what to do. By the way, the website is grabcad.com; what I'm basically trying to do is press the "Download All" button featured on a page such as
http://grabcad.com/library/apple-ipod-touch-5th-gen-1
import requests

payload = {'member[email]': 'username', 'member[password]': 'pass'}
with requests.Session() as s:
    rObject = s.post('http://www.grabcad.com/login', data=payload)
    cookies = rObject.cookies
    # downloadUrl is something I obtain earlier and I know it's correct;
    # it's the URL requested when you press the "Download All" button
    rObject = s.get('http://www.grabcad.com' + downloadUrl, cookies=cookies)
    path = 'C:\\User\\Desktop\\filename'
    with open(path, 'wb') as f:
        for chunk in rObject.iter_content():
            f.write(chunk)
So I took an altogether different route to solve the problem: I simply used mechanize, which is an automated browser tool for Python.
# How to use mechanize to log in, specifically for grabcad
import mechanize

b = mechanize.Browser()
b.open('http://grabcad.com/login')
b.form = list(b.forms())[1]
control = b.form.find_control("member[email]")
control2 = b.form.find_control("member[password]")
control.value = 'username'
control2.value = 'pass'
b.submit()

# Download part
# downloadUrl is obtained earlier and is simply the URL for the download
path = 'C:\\User\\Desktop\\filename'
b.retrieve('https://www.grabcad.com' + downloadUrl, path)
How are you ensuring that you're logged in correctly? I would print out the HTML after sending that POST request from the session object and make sure it isn't a login page or an invalid-password page. Cookies automatically persist across requests made on the session object, so I believe the initial login isn't successful (http://docs.python-requests.org/en/latest/user/advanced/#session-objects).
Personally, I would use Selenium for this, though.
I have correctly logged into grabcad with the following code:
import requests
s = requests.session()
payload = {'member[email]': 'yourEmail', 'member[password]': 'yourPassword'}
p = s.post('https://grabcad.com/login', data=payload) # Ensure you're posting to HTTPS
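Once that login succeeds, the same session object should be able to fetch the download directly; a minimal sketch, reusing the downloadUrl and path placeholders from the question:

# downloadUrl and path are the same placeholders used in the question
r = s.get('https://grabcad.com' + downloadUrl, stream=True)
r.raise_for_status()
with open(path, 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):  # stream the zip to disk in chunks
        f.write(chunk)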

Python-getting data from an asp.net AJAX application

Using Python, I'm trying to read the values on http://utahcritseries.com/RawResults.aspx. I can read the page just fine, but am having difficulty changing the value of the year combo box, to view data from other years. How can I read the data for years other than the default of 2002?
The page appears to do an HTTP POST once the year combo box has changed. The name of the control is ctl00$ContentPlaceHolder1$ddlSeries. I tried setting a value for this control using urllib.urlencode(postdata), but I must be doing something wrong: the data on the page does not change. Can this be done in Python?
I'd prefer not to use Selenium, if at all possible.
I've been using code like this (from Stack Overflow user dbr):
import urllib

postdata = {'ctl00$ContentPlaceHolder1$ddlSeries': 9}
src = urllib.urlopen(
    "http://utahcritseries.com/RawResults.aspx",
    data=urllib.urlencode(postdata)
).read()
print src
But it seems to pull up the same 2002 data. I've tried using Firebug to inspect the headers, and I see a lot of extraneous, random-looking data being sent back and forth. Do I need to post these values back to the server as well?
Use the excellent mechanize library:
from mechanize import Browser
b = Browser()
b.open("http://utahcritseries.com/RawResults.aspx")
b.select_form(nr=0)
year = b.form.find_control(type='select')
year.get(label='2005').selected = True
src = b.submit().read()
print src
Mechanize is available on PyPI: easy_install mechanize
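A follow-up note: the extraneous, random-looking data in the question is ASP.NET's hidden __VIEWSTATE/__EVENTVALIDATION state, and it does need to be posted back. If you would rather stay with plain HTTP instead of mechanize, here is a hand-rolled sketch (assuming requests and BeautifulSoup; not verified against the live site):

import requests
from bs4 import BeautifulSoup

url = 'http://utahcritseries.com/RawResults.aspx'
s = requests.Session()
soup = BeautifulSoup(s.get(url).text, 'html.parser')

# Collect every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...) for the postback
postdata = {inp['name']: inp.get('value', '')
            for inp in soup.select('input[type=hidden]') if inp.get('name')}
postdata['ctl00$ContentPlaceHolder1$ddlSeries'] = '9'

src = s.post(url, data=postdata).text
print(src)

Mechanize does exactly this hidden-field bookkeeping for you, which is why the answer above works.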
