Logging in and scraping a site like ft.com with BeautifulSoup - web-scraping

I have this url: https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c
It corresponds to an article that requires signing up. I signed up and can see the content in my browser. However, when I use this code with the URL above, I don't get the article:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen(url), 'lxml')
with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')
Instead, it redirects me to the signup page. Is there any way to be logged in while scraping so that I can access the article?

Here are the basics.
Go to the login page. If you use Chrome, right-click the email input and choose 'Inspect' from the context menu to reveal the form element that submits your email address. It looks like this.
<form name="enter-email-form" action="/login/submitEmail" class="js-email-lookup-form" method="POST" data-test-id="enter-email-form" novalidate="true">
  <input type="hidden" name="location" value="https://www.ft.com/content/87d644fc-73a4-11e7-aca6-c6bd07df1a3c">
  <input type="hidden" name="continueUrl" value="">
  <input type="hidden" name="readerId" value="">
  <input type="hidden" name="loginUrl" value="/login?location=https%3A%2F%2Fwww.ft.com%2Fcontent%2F87d644fc-73a4-11e7-aca6-c6bd07df1a3c">
  <div class="lgn-box__title">
    <h1 class="lgn-heading--alpha">Sign in</h1>
  </div>
  <div class="o-forms-group">
    <label for="email" class="o-forms-label">Email address</label>
    <input type="email" id="email" class="o-forms-text js-email" name="email" maxlength="64" autocomplete="off" autofocus="" required="">
    <input type="password" id="password" name="password" style="display:none">
    <label for="password"></label>
  </div>
  <div class="o-forms-group">
    <button class="o-buttons o-buttons--standout o-buttons--big" type="submit" name="Next">Next</button>
  </div>
</form>
You will need to gather the action attribute from the form element and all the name-value pairs from the input elements. You send these in a POST request with the requests library.
You do this once for your email address and once for your password. After that you should be able to issue the GET for the article URL with requests.
I must warn you that I haven't actually tried this with that particular site.
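The gathering step could be sketched roughly as follows. This is a minimal sketch that has not been run against ft.com: it uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the names FormFields and read_form are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFields(HTMLParser):
    """Collect the action of the first <form> and the name/value pairs of its inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._inside = False  # True only while parsing the first form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
            self._inside = True
        elif tag == "input" and self._inside and attrs.get("name"):
            # Inputs without a value attribute are sent as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def read_form(html, page_url):
    """Return (absolute action URL, dict of input name -> value)."""
    parser = FormFields()
    parser.feed(html)
    return urljoin(page_url, parser.action or ""), parser.fields
```

With a requests.Session() you would fetch the login page, call read_form on response.text, set fields["email"], and send session.post(url, data=fields), then repeat the same steps for the password form. Using one session for everything matters because it carries the login cookies over to the final GET of the article.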

If you want to scrape a website using BeautifulSoup, I'd recommend the MechanicalSoup library. It is a very lightweight layer on top of BeautifulSoup (to parse HTML) and requests (to fetch pages), and it handles things like filling in a form properly (which is what you need here), following relative links, and so on.
MechanicalSoup is also limited in the sense that it doesn't interpret JavaScript code, hence won't work on a website relying on JavaScript, but it reduces the manual effort compared to using BeautifulSoup and urllib or requests directly.
(Note: I'm one of the authors of MechanicalSoup)
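As a rough illustration, a two-step login with MechanicalSoup might look like the sketch below. It is untested against ft.com; the function name login_two_step is hypothetical, and the assumption that the second page contains a single form with a "password" input is a guess. The browser is passed in as a parameter so the function itself has no import-time dependencies.

```python
def login_two_step(browser, login_url, email, password):
    """Drive an email-then-password login with a MechanicalSoup-style browser."""
    browser.open(login_url)

    # Step 1: the email form, selected by the name seen in the page markup.
    browser.select_form('form[name="enter-email-form"]')
    browser["email"] = email
    browser.submit_selected()

    # Step 2. ASSUMPTION: the page returned by step 1 has one form
    # with an input named "password".
    browser.select_form()
    browser["password"] = password
    return browser.submit_selected()
```

In real use you would create the browser with mechanicalsoup.StatefulBrowser(), call login_two_step with it, and then open the article URL with the same browser; the parsed page is then available as a BeautifulSoup object via browser.page (get_current_page() in older releases).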

Related

How can I request data with device access code for default gateway router 5268AC?

I would like to write a script that collects data from a router, in this case the 5268AC. So far I've been able to use requests.get() to read pages that do not require the device access code. To get Wi-Fi information, for example the default SSID, I need to enter the access code printed on the router via the browser. I keep getting error 500 when trying requests.post(url).
url="http://192.168.1.254/xslt?PAGE=C_2_1"
From inspection, I believe this is the form I'm trying to fill in. The key value is ADM_PASSWORD.
<h2>Login</h2>
<p>Device access code required. Please enter the device access code, then click Submit.</p>
<form name="pagepost" method="post" action="xslt?PAGE=login_post" id="pagepost">
<input type="hidden" name="NONCE" value="0abc59f54121398" />
<input type="hidden" name="THISPAGE" value="" />
<input type="hidden" name="NEXTPAGE" value="C_2_1" />
<input type="hidden" name="CMSKICK" value="" />
<div>
<div class="form-group">
<label for="ADM_PASSWORD">Access code</label>
<span>
<input type="password" id="ADM_PASSWORD" name="ADM_PASSWORD" size="16" maxlength="16" autofocus="autofocus" required="required" autocomplete="off" />
</span>
</div>
</div>
<p align="right">
<input type="submit" class="button" value="Submit" />
I tried this but got 500 status error.
payload = { 'ADM_PASSWORD':'*access code*' }
response = requests.post(url, headers=headers, data=payload)
Is there a way to collect this information via requests instead of using the GUI?
First, in Chrome/Firefox you can use DevTools (Network tab) to see what the browser sends when you log in.
The form has action="xslt?PAGE=login_post", so you should send the request to
url = "http://192.168.1.254/xslt?PAGE=login_post"
It is a relative path, so if you had opened a page such as http://192.168.1.254/login/ then it would need
url = "http://192.168.1.254/login/xslt?PAGE=login_post"
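The two cases above can be checked with the standard library's urljoin, which resolves a relative action the same way a browser does. The /login/ page is the hypothetical one from the answer, not a page the router is known to serve.

```python
from urllib.parse import urljoin

action = "xslt?PAGE=login_post"

# Page served from the root: the action replaces the last path segment.
root_url = urljoin("http://192.168.1.254/xslt?PAGE=C_2_1", action)

# Hypothetical page under /login/: the action is resolved inside that directory.
nested_url = urljoin("http://192.168.1.254/login/", action)

print(root_url)
print(nested_url)
```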
You have to send every input you see in the form, especially the hidden ones.
Sometimes the server even needs the submit button's value.
And you should use the name= attribute as the key, not id=.
payload = {
    'NONCE': '0abc59f54121398',
    'THISPAGE': '',
    'NEXTPAGE': 'C_2_1',
    'CMSKICK': '',
    'ADM_PASSWORD': '*access code*'
}
If you use post() with data= or json=, requests sets the correct Content-Type and Content-Length automatically, so you don't have to set them in headers.
You may also want to run get() on the page with the form first, because it may set cookies that the server/device also checks.
The only real problem is a page that uses JavaScript to generate extra data in the form. Then you would need Selenium, and you would have to write the whole thing with Selenium.
Sometimes I send the request to https://httpbin.org/post, which sends back all the headers, cookies and POST data it received from me, so I can compare them with the data I see in DevTools.
Alternatively, I run the local proxy server Charles and use it with both the browser and Python to check whether the code sends the same data as the browser.
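Putting these steps together, the flow might look like the sketch below. It rests on two assumptions that are not confirmed for the 5268AC: that requesting the protected page returns the login form, and that NONCE changes on every load (so it is read fresh rather than hard-coded). The names InputCollector and login are made up for illustration, and session is expected to be a requests.Session().

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class InputCollector(HTMLParser):
    """Collect the form action and every named <input>, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action") or ""
        elif tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def login(session, page_url, access_code):
    """GET the protected page, then POST the login form it returns."""
    response = session.get(page_url)  # the session also keeps any cookies set here
    parser = InputCollector()
    parser.feed(response.text)

    payload = dict(parser.fields)     # NONCE, THISPAGE, NEXTPAGE, CMSKICK, ...
    payload["ADM_PASSWORD"] = access_code
    post_url = urljoin(page_url, parser.action)  # resolve the relative action
    return session.post(post_url, data=payload)
```

Calling login(requests.Session(), "http://192.168.1.254/xslt?PAGE=C_2_1", code) and then reusing the same session for later get() calls keeps the cookies from the login.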

Send amp-form data to mailchimp with Nextjs

I have a standard AMP form with an email input and a submit input, and a Mailchimp endpoint.
<form
  method="post"
  action-xhr={`https://${DATACENTER}.api.mailchimp.com/3.0/lists/${LIST_ID}/members`}
  target="_top"
>
  <fieldset>
    <input
      type="email"
      name="email"
      placeholder="Enter your email"
      required
      className="email-input"
    />
    <input
      type="submit"
      value="Sign Up"
      className="sign-input"
    />
  </fieldset>
</form>
Now the problem is that I need to configure headers to provide an Authorization API key, and to set up CORS.
AMP requires XHR to send the data, and I have no idea how to do that inside Next.js, or in a serverless function for that matter.
Maybe try this solution - https://www.miguoliang.com/how-to-build-a-mailchimp-embed-form-in-amp-pages.html
You'll have to build your own API endpoint (for example with Mailchimp's Python library) that holds the credentials, and have the form post its data to that endpoint instead of to Mailchimp directly.
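The server-side call such an endpoint would make could be sketched as follows. This uses the standard library's urllib rather than Mailchimp's Python library, and only builds the request without sending it. It assumes the Mailchimp Marketing API v3 conventions (HTTP Basic auth with the API key as the password, and a JSON body with email_address and status); the function name build_subscribe_request is hypothetical.

```python
import base64
import json
import urllib.request


def build_subscribe_request(datacenter, list_id, api_key, email):
    """Build (but do not send) the POST that adds a member to a Mailchimp list."""
    url = f"https://{datacenter}.api.mailchimp.com/3.0/lists/{list_id}/members"
    body = json.dumps({"email_address": email, "status": "subscribed"}).encode("utf-8")
    # Basic auth: the username is ignored, the API key is the password.
    token = base64.b64encode(f"anystring:{api_key}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )
```

The endpoint would read the email from the AMP form submission, pass the result to urllib.request.urlopen, and set the CORS headers AMP expects on its own response (not shown here). The API key stays on the server and never reaches the browser.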

Integrating a custom built form with a subscriber list provider (Aweber)

I have created my own form because every other form out there for WordPress is horrible.
<div class="exc-form">
  <form action="action_page.php">
    <fieldset>
      <div id="form_jgrsh">
        First name:<br>
        <input type="text" name="firstname" placeholder="Please enter first name">
        <br>
        Last name:<br>
        <input type="text" name="lastname" placeholder="Please enter last name">
        <br>
        Email:<br>
        <input type="text" name="email" placeholder="Please enter email">
        <div class="chk_bx">
          <input type="checkbox" name="concent" value="agree">By entering my information and pressing submit I agree to subscribing to the Stockhouse AND Junior Gold Report (please check box before submiting)
          <br>
        </div>
        <input type="submit" class="submit-btn" value="Submit">
      </div>
    </fieldset>
  </form>
</div>
It's super simple and isn't anything special. I'm under the impression PHP will be needed as well.
I use Aweber as my list provider, and I've contacted them for support and they want me to use their horrible, ugly form builder, and won't help me integrate my own form to use their list.
I've never done something like this before, so I'm looking for some guidance on how to send subscriber information to a list. Aweber offers a list ID, and most plugins I've seen require Aweber authentication before they link up to the list.
https://labs.aweber.com/snippets/lists - this is the closest resource I could find from Aweber.
Here is the authentication snippet: https://labs.aweber.com/snippets/authentication

Analytic funnelling, tracking two URLs with different parameters, both being triggered as the same URL.

I want to be able to track goals but I need to know whether they came from a page with a social media parameter or a digital marketing parameter in the URL.
I currently have a subscription form which returns a URL with a specific parameter depending on which page we're on. We're using WordPress.
<?php if(is_page( 'internet-marketing-software')): ?>
<div class="free-trial" style="display:none;">
<div class="sign-up-button" style="/* display:none; */">
<form name="signup" id="signup" action="http://dmtrk.net/signup.ashx" method="post" onsubmit="return validate_signup(this)">
<input type="hidden" name="addressbookid" value="1922561">
<input type="hidden" name="userid" value="52978">
<input type="hidden" name="ReturnURL" value="http://test-site.com/?signup=false&step2=true&digital-marketing=true&form=form-banner">
<input id="input" type="text" name="Email" placeholder="name#email.com">
<input type="hidden" id="double" name="double" value="double">
<input id="submit" class="banner" type="Submit" name="Submit" value="sign up">
</form>
</div>
<div class='trial-desc'>
<p>Interested? <span>Start 30 day FREE trial now!</span>
</p>
</div>
</div>
<?php endif; ?>
I load a similar piece of code in the header this time with the condition
if(is_page( 'social-media'))
and which returns the url
http://test-site.com/?signup=false&step2=true&social-media=true&form=form-banner.
So, depending on the page, each user is taken to either the social-media or the digital-marketing URL.
When this URL is triggered, a double opt-in email is sent with a link taking them to the true goal page http://test-site.com/?signup=true.
In Google Analytics I have set up two goals with the funnelling capability turned on.
I currently have the destination setup as:
RegEx /?signup=true
Funnelling On
Step &digital-marketing=true
And then for social:
RegEx /?signup=true
Funnelling On
Step &social-media=true
Looking at my reports each time a goal is triggered, no matter which URL was used they both register as a goal.
I'm not sure what I'm doing wrong, is there something I'm overlooking with Analytics and URL parameters?
To follow up on the comments (I cannot comment): you should mark the previous page (the one coming from digital marketing or social) as a "required step" so that the goal registers in only one of the paths, not both. See the button on the right.

Posting data to ASP.NET application

I have an application in which I wish to allow users to enter login details for their own websites. One of the authentication methods is 'forms'. The way I had envisaged it working is that users enter the method and action of their login form, and the name/value for each credential item, e.g. one for username, one for password. My application would then post this data in order to simulate a login, get the returned authentication cookie and be able to work on their site as if logged in.
In principle, this sounded like a reasonable kind of thing to do. However, as I'm sure you're aware, ASP.NET has a lot of inputs, and also hidden ones, e.g. __VIEWSTATE, which are all always posted back to the server whenever the ASP.NET form is submitted e.g. when a real user logs in. When my app tries to login however, it doesn't have the full list of inputs on that page, and their values, e.g. the always changing __VIEWSTATE.
My question: is there a way to post data to an ASPX page, posting only certain inputs, and excluding others, e.g. __VIEWSTATE?
If the page were, say, PHP it would probably look like this:
Ex. 1:
...
<div id="header">
  <form action="search.php" method="POST">
    <div id="search">
      <input type="text" name="query" id="SearchQueryText" value="Search query" />
      <input type="button" name="submit" id="SearchSubmitButton" value="Search!" />
    </div>
  </form>
  <form action="login.php" method="POST">
    <input type="text" name="uname" id="Username" value="Username" />
    <input type="text" name="passwd" id="Password" value="Password" />
    <input type="button" name="submit" id="LoginSubmitButton" value="Login" />
  </form>
  ...
</div>
...
in ASP.NET Web Forms, however, through the use of server controls, it'd probably look like:
Ex. 2:
...
<body>
  <form name="AspNetForm" method="post" action="/Products/SomethingOrOther.aspx" id="Form" enctype="multipart/form-data">
    <div id="header">
      <div id="search">
        <input type="text" id="ctl00$SearchComponent$SearchBox" name="ctl00$SearchComponent$SearchBox" value="Search query" />
        <input type="submit" id="ctl00$SearchComponent$SearchSubmit" name="ctl00$SearchComponent$SearchSubmit" value="Search!">
      </div>
      <div id="login">
        <input type="text" id="ctl00$LoginComponent$Username" name="ctl00$LoginComponent$Username" value="Username" />
        <input type="text" id="ctl00$LoginComponent$Password" name="ctl00$LoginComponent$Password" value="Password">
        <input type="submit" id="ctl00$LoginComponent$LoginSubmit" name="ctl00$LoginComponent$LoginSubmit" value="Login">
      </div>
    </div>
    ...
  </form>
</body>
...
With example 1, submitting the login form is a simple case of POSTing uname=something&passwd=somethingelse to login.php, however, in ASP.NET, because all inputs are wrapped in a 'global' <form>, to submit the login inputs, you have to submit the global form, and therefore all the inputs.
So what I'm after, is a way to submit only certain inputs in that global form, e.g. not __VIEWSTATE, which we can't know without probing the page beforehand.
You can use AJAX to post the values back to a specific page. In general, Web Forms is designed to post back all data on the page when you trigger a server-side event. You then choose which elements/values to use in your code. If you don't want to use view state on an element, you can disable it (e.g. EnableViewState=False).
You can use an ASP.NET page the same way as classic ASP.
Point the HTML form's action at the .aspx page that should receive the data.
Then you can use ASP.NET's Request object to retrieve the data from the form. Likewise, you can build an HTML form as a string and place it in a Panel control.
Then you can use an ASP.NET button as the submit button.
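From the client's side, the "probing the page beforehand" that the question mentions could be sketched like this. It assumes the hidden fields have already been scraped from a fresh GET of the login page; __EVENTVALIDATION is named only as an example of another hidden field Web Forms commonly emits, and the function name build_postback is hypothetical.

```python
def build_postback(page_fields, credentials, submit_name, submit_value):
    """Merge scraped form fields with credentials and the clicked button."""
    payload = dict(page_fields)          # e.g. __VIEWSTATE, __EVENTVALIDATION as served
    payload.update(credentials)          # overwrite the username/password inputs
    payload[submit_name] = submit_value  # tells Web Forms which button was pressed
    return payload
```

For the second example above, the credentials would be keyed by the generated names (ctl00$LoginComponent$Username and ctl00$LoginComponent$Password) and the submit name would be ctl00$LoginComponent$LoginSubmit. The resulting dict can be sent as the body of a POST to the form's action.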
