Staying Logged In Using Requests and Python

I am trying to log in to a website using Python and requests. I'm fairly sure the login itself works. Next, I go to a different page and try to download a file from it. However, downloading the file requires being logged in, and when I try, the site redirects me to the login page saying I haven't logged in. I am stuck and don't know what to do! By the way, the website is grabcad.com; what I'm basically trying to do is press the "Download all" button featured on a page such as
http://grabcad.com/library/apple-ipod-touch-5th-gen-1
import requests

payload = {'member[email]': 'username', 'member[password]': 'pass'}

with requests.Session() as s:
    rObject = s.post('http://www.grabcad.com/login', data=payload)
    cookies = rObject.cookies
    # downloadUrl is something I obtain earlier and I know it's correct.
    # It's the URL requested when you press the "Download all" button.
    rObject = s.get('http://www.grabcad.com' + downloadUrl, cookies=cookies)

    path = 'C:\\User\\Desktop\\filename'
    with open(path, 'wb') as f:
        for chunk in rObject.iter_content():
            f.write(chunk)

So I took an altogether different route to solve the problem: I simply used mechanize, which is an automated browser tool for Python.
# How to use mechanize to log in, specifically for grabcad
import mechanize

b = mechanize.Browser()
b.open('http://grabcad.com/login')
b.form = list(b.forms())[1]  # the login form is the second form on the page
control = b.form.find_control("member[email]")
control2 = b.form.find_control("member[password]")
control.value = 'username'
control2.value = 'pass'
b.submit()

# Download part
# downloadUrl is obtained earlier and is simply the URL for the download
path = 'C:\\User\\Desktop\\filename'
b.retrieve('https://www.grabcad.com' + downloadUrl, path)

How are you ensuring that you're logged in correctly? I would print out the HTML after sending that POST request from the session object and check that it isn't a login page or an invalid-password page. Cookies automatically persist across requests made on a session object, so I suspect the initial login isn't succeeding (see http://docs.python-requests.org/en/latest/user/advanced/#session-objects).
Personally, I would use Selenium for this, though.
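For reference, a minimal Selenium sketch of that route (the field names are taken from the requests payload above; using the Firefox driver is an assumption):

from selenium import webdriver

driver = webdriver.Firefox()  # assumption: any installed driver works
driver.get('http://grabcad.com/login')
# Fill in the same fields the requests payload targets.
driver.find_element_by_name('member[email]').send_keys('username')
password_field = driver.find_element_by_name('member[password]')
password_field.send_keys('pass')
password_field.submit()  # submits the enclosing login form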
I have correctly logged into grabcad with the following code:
import requests
s = requests.session()
payload = {'member[email]': 'yourEmail', 'member[password]': 'yourPassword'}
p = s.post('https://grabcad.com/login', data=payload) # Ensure you're posting to HTTPS
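Building on that, a minimal sketch of verifying the login and then downloading within the same session, so the cookies are reused automatically (downloadUrl and path are the placeholders from the question, and the login check is a rough heuristic):

import requests

payload = {'member[email]': 'yourEmail', 'member[password]': 'yourPassword'}

with requests.Session() as s:
    p = s.post('https://grabcad.com/login', data=payload)
    # If the response still contains the login form, the credentials
    # were rejected (rough heuristic: look for the password field).
    if 'member[password]' in p.text:
        raise RuntimeError('Login failed')
    # The session now carries the auth cookies, so this request is
    # authenticated without passing cookies explicitly.
    r = s.get('https://grabcad.com' + downloadUrl, stream=True)  # downloadUrl from the question
    with open(path, 'wb') as f:  # path from the question
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)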

Related

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful. I am open to any other approach, though.
Thank you in advance.
This is just pseudocode to give you an idea of what you're looking for:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically; this is just raw

for pep_id in page_ids:
    if pep_id == '0':
        # the initial page
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        # enter some parsing logic here
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        # enter some parsing logic here
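The offsets can also be generated instead of hard-coded; a minimal sketch, assuming an empty or failing response marks the last page (that stop condition is an assumption, and the parsing is still left out):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='

offset = 0
while True:
    page = requests.get(base_url + str(offset), headers=headers)
    # Assumption: past the last page the API returns an error or an empty body.
    if page.status_code != 200 or not page.text.strip():
        break
    # ... parsing logic goes here ...
    offset += 10  # the listing steps by 10, matching the IDs above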

Rcrawler - How to crawl account/password protected sites?

I am trying to crawl and scrape a website's tables. I have an account with the website, and I found out that Rcrawler could help me with getting parts of the table based on specific keywords, etc. The problem is that on the GitHub page there is no mention of how to crawl a site with account/password protection.
An example of signing in would be:
login <- list(username = "username", password = "password")
Do you have any idea if Rcrawler has this functionality? For example, something like:
Rcrawler(Website = "http://www.glofile.com",
         list(username = "username", password = "password"),
         no_cores = 4, no_conn = 4,
         ExtractCSSPat = c(".entry-title", ".entry-content"),
         PatternsNames = c("Title", "Content"))
I'm confident my code above is wrong, but I hope it gives you an idea of what I want to do.
To crawl or scrape password-protected websites in R, more precisely those behind HTML-based authentication, you need a web driver to simulate a login session. Fortunately, this is possible since Rcrawler v0.1.9, which implements the PhantomJS web driver (a browser, but without a graphical interface).
The following example will try to log in to a blog website.
library(Rcrawler)
Download and install the web driver:
install_browser()
Run the browser session:
br <- run_browser()
If you get an error, disable your antivirus or allow the program in your system settings.
Run an automated login action, which returns a logged-in session if successful:
br <- LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
                   LoginCredentials = c('demo', 'rc#pass#r'),
                   cssLoginFields = c('#user_login', '#user_pass'),
                   cssLoginButton = '#wp-submit')
Finally, if you already know the private pages you want to scrape/download, use:
DATA <- ContentScraper(... , browser =br)
Or simply crawl/scrape/download all pages:
Rcrawler(Website = "http://glofile.com/",no_cores = 1 ,no_conn = 1,LoggedSession = br ,...)
Don't use many parallel connections (no_cores/no_conn), as many websites reject multiple sessions from one user.
Stay legit and honor robots.txt by setting Obeyrobots = TRUE.
You can also access the browser functions directly, for example:
br$session$getUrl()
br$session$getTitle()
br$session$takeScreenshot(file = "image.png")

Scrapy download zip file from ASP.NET site

I need some help getting Scrapy to download a file from an ASP.NET site. Normally, from a browser, one would click the link and the file would begin downloading, but that is not possible with Scrapy, so what I am trying to do is the following:
import re

from scrapy.http import FormRequest

def retrieve(self, response):
    print('Response URL: {}'.format(response.url))
    pattern = re.compile('(dg[^\']*)')
    for file in response.xpath('//table[@id="dgFile"]/tbody/tr/td[2]/a'):
        file_url = file.xpath('@href').extract_first()
        target = re.search(pattern, file_url).group(1)
        viewstate = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        viewstategenerator = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value').extract_first()
        eventvalidation = response.xpath('//*[@id="__EVENTVALIDATION"]/@value').extract_first()
        # ASP.NET postback fields use double leading underscores
        data = {
            '__EVENTTARGET': target,
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': viewstategenerator,
            '__EVENTVALIDATION': eventvalidation
        }
        yield FormRequest.from_response(
            response,
            formdata=data,
            callback=self.end  # pass the callback itself, not its result
        )
I am trying to submit the information to the page in order to receive the zip file back as a response; however, this is not working as I hoped. Instead, I simply get the same page back as a response.
In a situation like this, is it even possible to use Scrapy to download this file? Does anyone have any pointers?
I have also tried to use Selenium+PhantomJS, but I ran into a dead end trying to transfer the session from Scrapy to Selenium. I would be willing to use Selenium for this one function, but I need to use Scrapy for this project.
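On the session-transfer dead end: a common approach is to replay the cookies Scrapy has collected into the Selenium driver. A minimal sketch, assuming the cookies are available as a name/value dict (the helper and the dict are hypothetical; PhantomJS matches the setup described above):

from selenium import webdriver

def transfer_session(url, cookie_dict):
    """Open the site in Selenium and replay cookies captured by Scrapy."""
    driver = webdriver.PhantomJS()
    # Selenium only accepts cookies for the currently loaded domain,
    # so load the page first.
    driver.get(url)
    for name, value in cookie_dict.items():
        driver.add_cookie({'name': name, 'value': value})
    # Reload so the server sees the authenticated session.
    driver.get(url)
    return driver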

Scraping login protected website with a challenge form?

I'm trying to do some web scraping from steamspy.com, specifically the total playtime hours for a certain game. That info is behind the site's login wall, so I've been trying to figure out how to get R past it for HTML mining.
I tried this method for passing login credentials via POST(), but it doesn't seem to work. I noticed that the login handler for that example used POST, whereas, looking at the source code for steamspy, it seems to use a challenge form, and I wasn't sure how to proceed in R.
My attempt thus far looks like this:
library(httr)

handle <- handle("http://steamspy.com")
path <- "/login/"
login <- list(
  jschl_vc = "bc4e...",
  pass = "148..."
)
response <- POST(handle = handle, path = path, body = login)
I found the values for jschl_vc and pass by inspecting the source code after I logged in. The code above doesn't work and gives me:
Error in curl::curl_fetch_memory(url, handle = handle) : Failure
when receiving data from the peer
probably because I'm trying to POST to a challenge form. Is there a way to proceed that I'm missing?

Flask remote authentication issue with Graphite

I have a Flask app which sends a request to a Graphite server to authenticate and redirect to its dashboard (the setting REMOTE_USER_AUTHENTICATION = True has been changed). The request is as follows:
url = 'https://graphite.localdomain/dashboard'
s = requests.Session()
r = s.get(url, auth=('userx', 'passwordx'),verify=False)
print r.cookies.__dict__
return (r.text, r.status_code, r.headers.items())
The authentication from the request to the graphite server is good, I get 200's for valid users, and 401's for invalid users.
"print r.cookies.__dict__" will output...
{'_now': 1429303134, '_policy': <cookielib.DefaultCookiePolicy instance
at 0x7f263ec2b638>, '_cookies': {'graphite.localdomain': {'/':
{'sessionid': Cookie(version=0, name='sessionid',
value='**********masked**********', port=None, port_specified=False,
domain='graphite.localdomain', domain_specified=False,
domain_initial_dot=False, path='/', path_specified=True, secure=False,
expires=1430512734, discard=False, comment=None, comment_url=None,
rest={'httponly': None}, rfc2109=False)}}}, '_cookies_lock': <_RLock
owner=None count=0>}
...which appears right, because it looks identical to the one I get from logging in to Graphite directly. But when I return the response object (see "Return a requests.Response object from Flask"), the browser reports content-encoding errors in both Chrome and Firefox. If I change that to something like...
return r.content
...the dashboard page appears, but it's missing everything because the CSS and JS resources are 404'ing.
I am obviously not understanding something; any help would be greatly appreciated.
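One plausible culprit: requests decodes the body automatically, so forwarding Graphite's original Content-Encoding header alongside the already-decoded text makes the browser try to decompress plain text. A minimal sketch of filtering such headers before returning (the exclusion list is an assumption):

# r.text is already decoded, so drop the headers that describe
# the original (compressed) body before handing them to Flask.
excluded = {'content-encoding', 'content-length', 'transfer-encoding', 'connection'}
headers = [(k, v) for k, v in r.headers.items() if k.lower() not in excluded]
return (r.text, r.status_code, headers)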
