How to keep splash cookies - web-scraping

I am currently trying to scrape a website and want to stay logged in as I scrape. Unfortunately, from what I understand, Splash resets cookies on every SplashRequest. I am using Splash with Scrapy to scrape a site that relies on JavaScript. My question is: how do I keep my cookies from being reset?
After scraping the web myself for a solution, I know it has something to do with Lua scripts or cookie middleware, but I have no idea how to use them. If anyone could help it would be great. All the sites that talk about this are really unclear, so please be as clear as possible.

Yes, you can set cookies and return cookies in Lua scripts. If the login page and the scraping page use the same script, your script should look like this:
function main(splash)
    splash:init_cookies(splash.args.cookies)
    -- ... your script
    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end
If you use different scripts for login and scraping, you can return cookies from the login script and send them along with the next SplashRequest:
yield SplashRequest(url=url, callback=self.item_parse, endpoint='execute', args={
    'lua_source': self.scrape_script,
    'cookies': cookies,  # everything in args is exposed to the Lua script as splash.args
})
In scrape_script you then restore them with the command:
splash:init_cookies(splash.args.cookies)
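Putting the pieces together, a minimal spider sketch might look like the following. This is a sketch under assumptions: scrapy-splash is installed and enabled in settings, self.login_script is a Lua script that performs the login and returns splash:get_cookies(), self.scrape_script is the script above, and the URLs and callback names are placeholders.
import scrapy
from scrapy_splash import SplashRequest

class LoginSpider(scrapy.Spider):
    name = 'login_spider'

    def start_requests(self):
        # Run the login script first; it should end with
        # return { cookies = splash:get_cookies() }
        yield SplashRequest(
            'https://example.com/login',          # placeholder login URL
            callback=self.after_login,
            endpoint='execute',
            args={'lua_source': self.login_script},
        )

    def after_login(self, response):
        # The 'execute' endpoint returns the Lua table as JSON; scrapy-splash
        # exposes it as response.data, so the returned cookies are available here.
        cookies = response.data['cookies']
        yield SplashRequest(
            'https://example.com/protected',      # placeholder page to scrape
            callback=self.item_parse,
            endpoint='execute',
            args={'lua_source': self.scrape_script, 'cookies': cookies},
        )

    def item_parse(self, response):
        # scrape_script called splash:init_cookies(splash.args.cookies),
        # so this page was rendered with the logged-in session.
        pass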

Related

Is there a way to specify trusted origins for post requests in google web app?

Let's say I created a Google Sheet to capture users' email addresses.
On my website there is a small form; once the submit button is clicked, an AJAX request is fired to a Google web app that writes the data to the sheet:
// Select and cache all the fields ($form is the jQuery-wrapped sign-up form)
var $inputs = $form.find("input, select, button, textarea");
// Serialize the data in the form
var serializedData = $form.serialize();
// Fire off the request
request = $.ajax({
    url: "https://script.google.com/macros/s/longURLcode/exec",
    type: "post",
    data: serializedData
});
In the Google Apps Script you then use doPost(e) or doGet(e) to handle any incoming HTTP request. To allow this to work as a sign-up mechanism, the web app's permissions have to be set to (I think!?):
Execute the app as: Me (myemail#gmail.com)
Who has access to the app: Anyone, even anonymous
Given everything on the Google Script side is set up properly, this works like a charm. So what's wrong?
Problem:
Anyone can either look into the source code of my webpage or use the dev tools to extract the URL of the Google web app after clicking submit. This URL can then, in theory, be used to flood the sheet with countless (undefined) entries.
Questions:
1) Is there a way to limit accepted HTTP requests to certain origins? I tried to do this by accessing the HTTP headers within doPost(), but there seems to be no way to do so.
2) Is "Who has access to the app: Anyone, even anonymous" the wrong approach? I thought this was necessary since you can only choose Google users here, and a plain URL (mywebsite.com) seemed to fall into the anonymous category.
3) I don't think this is possible, but maybe I missed an option: is there a way to NOT expose the Google web app URL to anyone? I guess not, because you can monitor any request with the dev tools.
4) Is using Sheets for capturing this kind of data just a terrible idea in general (partly for the above reasons), and should I find another solution ASAP?

Observe http request and simulate the same request in code

Is there a way to observe an HTTP request in the browser, save that request (header data and parameters), and simulate the same request in code?
What I want is to "simulate" a browser in my project, so I get the same response back as if the user were using a normal browser.
I don't know exactly how to phrase the question, but what I want is to simulate the authentication on some websites and scrape the same data as when I am in the browser.
What I wanted was to crawl a website that is secured with authentication using plain HTTP requests, building the request headers in my code. It was not only about sending a POST request with name + password, but also some other hidden parameters that are only generated when a user visits the website - on the client side, with JavaScript.
Maybe it is possible to understand the algorithm behind generating those hidden parameters, but it could take a long time because of the complexity.
The best way to crawl a website in an automated way, without caring about building the correct headers, is to use a "headless" browser, which is nothing more than a normal browser without a GUI. You can control it from your code. A list of those headless browsers can be found here.
So there is no need to observe and record the request and simulate it in code - just use a headless browser.
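For example, here is a minimal sketch using Selenium to drive headless Chrome from Python; the URL and element names are placeholders, and any other headless browser works along the same lines.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")        # run Chrome without a GUI

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/login")           # placeholder URL
    driver.find_element(By.NAME, "username").send_keys("my_user")
    driver.find_element(By.NAME, "password").send_keys("my_password")
    driver.find_element(By.NAME, "submit").click()

    # The browser has executed the site's JavaScript, so any generated hidden
    # parameters and cookies are handled for you; just read the result.
    html = driver.page_source
finally:
    driver.quit()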

Logging into a webpage via HTTP Request

So I have a webpage ("http://data.terapeak.com/verify/") and I don't see any & tags in the URL, so I am unsure how to post data to it. I need to do this via HTTP request rather than a browser control. I am creating a double-threaded batch searching program. I have already made this work using a single browser control, but that won't allow for multi-threading, at least with my current knowledge, because even when creating a new instance of an existing frmBrw it requires me to set the thread apartment to single. If I set it to single, I am unable to have it send the data to the Excel sheet that both threads need to access. I hope this is clear... The basic question is: how can I log into this form via HTTP request?
This isn't going to be easy to answer without further details; however, I suspect you'll need to provide the variables via an HTTP POST request.
Can you successfully log in to this page in your browser? If so, run a proxy tool such as Fiddler and check the HTTP requests the browser makes to the server. You should see the form variables being passed over. You then need to mimic this in code.
How to: Send Data Using the WebRequest Class
Hope this gets you started
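To illustrate the idea of replaying the captured request, here is a rough sketch using Python's requests library; the form field names and hidden values are placeholders that you would copy from what Fiddler shows, and the linked WebRequest article follows the same pattern in .NET.
import requests

session = requests.Session()              # keeps cookies across requests

login_data = {
    "username": "my_user",                # placeholder field name/value
    "password": "my_password",            # placeholder field name/value
    # include any hidden fields (tokens, viewstate, ...) that Fiddler shows
}

resp = session.post("http://data.terapeak.com/verify/", data=login_data)
print(resp.status_code)

# Later requests on the same session carry the login cookies,
# so each worker thread can use its own session object.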

How do you determine cookies are disabled *without* javascript and *without* redirecting?

If I disable JavaScript and cookies, Amazon.com detects that cookies are disabled without a redirect. If you click the cart link, there's only a GET on the cart page.
I'm guessing Amazon.com is most likely not using ASP.NET, but how would you accomplish detecting disabled cookies in ASP.NET without the use of JavaScript and without redirecting? Is it possible to detect whether cookies are disabled in one round trip?
I believe what you're describing is impossible. Amazon doesn't appear to do that. As proof:
Disable JavaScript
Clear your cookies (but leave them enabled)
Go here: http://www.amazon.com/gp/cart/view.html/ref=gno_cart
You'll get the message "Please Enable Cookies in your Web Browser to Continue." But if you reload the page, the message will go away, because cookies got set on the first viewing.
The reason this doesn't work is that when a page response sets cookies, the server can't tell they've been properly set until the next request. You can get around that using JavaScript, of course, but without that there's no way for the server to know in advance whether a request comes from a browser that will accept cookies.
You don't need a redirect to get at the cookies. All you need is delayed-load content.
Basically, I believe the following would work:
The 'GET /index.html' response sets the cookies (they come in the response headers and are stored before index.html is fully received and rendered).
You can then check for cookies while serving, say, 'GET /TinyImage.gif', provided you don't run into caching problems and you serve the image dynamically.
So, the final problem is: how do you inform the user about your findings from the TinyImage request? Definitely not easily, but if you use an IFrame instead of a simple img tag, you can essentially have two GET requests for a single page render.
Or, you can be really, really insane and actually stall the first GET until the second GET confirms the browser settings. This is for some HTTP wizards, but if you can wrap your head around Comet (not AJAX, Comet!), it can come in handy.
It's definitely possible, just tricky. Would I try doing so in ASP.NET? Can't promise anything but it will be a neat thing to share.
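To make the flow concrete, here is a rough sketch of the delayed-load idea in Python/Flask rather than ASP.NET; the route names and the cookie name are made up, and an ASP.NET version would serve the image from a dynamic handler instead.
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route('/index.html')
def index():
    # Serve the page and set a probe cookie; the page references /cookie-check
    # (via an img or iframe), so the browser issues a second request.
    resp = make_response('<html><body><img src="/cookie-check"></body></html>')
    resp.set_cookie('cookie_probe', '1')
    return resp

@app.route('/cookie-check')
def cookie_check():
    # A cookie-enabled browser sends the probe back on this second request;
    # its absence means cookies are disabled.
    if 'cookie_probe' in request.cookies:
        return 'cookies enabled'
    return 'cookies disabled'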
I guess it may load the page in the JavaScript/cookies-off configuration and then use JavaScript to do the check and switch to the cookies-enabled functionality if needed.
Could you set a Cookie in Page_Init for instance, then see if you could read from it in Page_PreRender?
Not sure that's even possible, but that's the only way I could think of.

PageMethods security

I'm trying to 'AJAX-ify' my site in order to improve the UI experience. In terms of performance, I'm also trying to get rid of the UpdatePanel. I've come across a great article over at Encosia showing a way of posting using PageMethods. My question is, how secure are page methods in a production environment? Being public, can anyone create a JSON script to POST directly to the server, or are there cross-domain checks taking place? My PageMethods would also write the data into the database (after filtering).
I'm using Forms Authentication in my pages and, on page load, it redirects unauthenticated users to the login page. Would the Page Methods on this page also need to check authentication if the user POSTs directly to the method, or is that authentication inherited for the entire page? (Essentially, does the entire page cycle occur even if a user has managed to post only to the PageMethod)?
Thanks
PageMethods are as secure as the handler in which they reside.
FormsAuthentication will protect everything except the Login page.
On an unprotected handler, like login, you should expose only methods that 1) are not sensitive or 2) validate the user.
EDIT: in response to comments and other answers regarding CSRF and XSS please see http://weblogs.asp.net/scottgu/archive/2007/04/04/json-hijacking-and-how-asp-net-ajax-1-0-mitigates-these-attacks.aspx
You're trying to protect against CSRF attacks.
These attacks can be prevented by requiring an authorization code in the POST parameters, and supplying the auth code in the initial page load. (The auth code should be per-IP address and per-user, and should expire quickly)
For added security, you can make each auth-code only usable once, and have each request return a new auth-code. (However, if any request fails, you'll need to reload the page)
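The mechanics are language-agnostic; they are roughly sketched in Python below. The secret, lifetime, and encoding are assumptions; in ASP.NET the code would be issued when the page renders and checked inside each PageMethod, and single-use codes would additionally need server-side bookkeeping.
import hashlib
import hmac
import time

SECRET = b'server-side secret'            # never sent to the client
LIFETIME = 300                            # seconds before a code expires

def issue_auth_code(user_id, client_ip):
    # Embed the returned code in the rendered page, e.g. in a hidden field.
    ts = str(int(time.time()))
    msg = f'{user_id}|{client_ip}|{ts}'.encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f'{ts}:{sig}'

def check_auth_code(code, user_id, client_ip):
    # Reject codes that are malformed, expired, or forged.
    try:
        ts, sig = code.split(':')
    except ValueError:
        return False
    if time.time() - int(ts) > LIFETIME:
        return False
    msg = f'{user_id}|{client_ip}|{ts}'.encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)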
I am working on a project that heavily utilizes ASP.NET WebForms Page Methods, which I talk to using Ajax. This is far more convenient for me than writing all my code in JavaScript.
However, securing the page methods became an issue that troubled me. I found that I can access the page methods via Postman and Fiddler, which would let attackers play with your APIs.
My solution, which I discovered accidentally, is quite simple. Reading a cookie inside the page method makes it return an error for any client that is NOT the website.
[WebMethod]
[ScriptMethod(UseHttpGet = false, ResponseFormat = ResponseFormat.Json)]
public static string GetAnything(object dat)
{
    // If the caller did not send the expected cookie, myguid is null and the
    // next line throws, so the framework returns a generic error response.
    HttpCookie myguid = HttpContext.Current.Request.Cookies.Get(Constants.Session.PreventHacking);
    var hackguid = myguid.Value ?? "";
    // ... other page method contents
    return "anything";
}
A Postman request to this method would return:
{
    "Message": "There was an error processing the request.",
    "StackTrace": "",
    "ExceptionType": ""
}
A more detailed error is shown if you are on localhost.
I understand there are browser add-ons that can intercept API calls by sitting right beside the website. I have not tested this; a separate security fix would have to be built for it, however.
I'll update here once I perform some tests.
Think of PageMethods like a mini web service local to the page. The fact is they have no extra checks and verifications in place except those applied to the entire website and those you choose to put in.
Using PageMethods is a smart idea from the point of view of encapsulation, and if you're going to use them it doesn't hurt to put some extra security measures in place.
