How to enable JS in Colly - web-scraping

I've had a lot of experience with Scrapy, but for some reasons I have to use Colly in this project. I'm trying to scrape data from a website, but it returns "To regain access, please make sure that cookies and JavaScript are enabled before reloading the page."
The relevant part of my code is as follows:
func crawl(search savedSearch) {
	c := colly.NewCollector()
	extensions.RandomUserAgent(c)

	/* for debugging, to see what the result is:
	c.OnHTML("*", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
		os.Exit(1)
	})
	*/

	c.OnHTML(".result-list__listing", func(e *colly.HTMLElement) {
		listingId, _ := strconv.Atoi(e.Attr("data-id"))
		if !listingExist(search.id, listingId) {
			fmt.Println("Listing found " + strconv.Itoa(listingId))
			saveListing(search.id, listingId)
			notifyUser(search.user, listingId)
		} else {
			fmt.Println("item is already crawled")
		}
	})
I see "Automatic cookie and session handling" in the docs, so it seems the problem is JavaScript. How can I overcome this? As a first attempt: how can I enable JS in Colly?

Colly is a good choice for static HTML pages, but it cannot execute JavaScript. If you need to scrape JS-driven pages, you will need a different strategy: drive a real (headless) browser through a browser-automation protocol such as the Chrome DevTools Protocol or WebDriver. Libraries for this exist in many languages, including Go.
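A common pattern in Go is to let headless Chrome render the page first and then parse the rendered HTML with your existing logic. Here is a minimal sketch using the chromedp package; the URL is a placeholder, the `.result-list__listing` selector is taken from the question, and whether this gets you past the site's bot check is an assumption:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless Chrome session.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/search"), // placeholder URL
		chromedp.WaitVisible(".result-list__listing"),   // wait for the JS-rendered listings
		chromedp.OuterHTML("html", &html),               // grab the rendered DOM
	)
	if err != nil {
		fmt.Println("ERR:", err)
		return
	}
	// Hand `html` to your existing parsing code
	// (e.g. goquery, which Colly uses internally).
	fmt.Println(len(html))
}
```

There is no switch to "turn JS on" in Colly itself; the usual split is to render in a real browser and parse the output separately.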

Related

Google Play Cross App Scripting Vulnerability: How do I protect calls to evaluateJavascript?

My app is caught up in Google's Cross App Scripting security warning and I can't seem to get a version of the app that doesn't trigger Google's warning.
The majority of the functionality is a WebView wrapper for a web app. That's where the warning is.
I think I've followed the directions in Google's tutorial for Option 2, which are as follows:
1. Update your targetSdkVersion.
It has to be above 16 and I've done that.
2. Protect calls to evaluateJavascript
The WebView does accept URLs from Intents, but those are checked ahead of time to always be trusted. And all external URLs that might appear inside the app are opened externally, i.e. in Chrome.
3. Prevent unsafe file loads
The WebView never opens file:// URIs.
The code below is the relevant section from the class and method that Google is indicating has a problem. I think I've correctly filtered out all code paths there so that the only URIs that open would be my own domain.
I've already been through two levels of Google support and all they say is to follow the directions in their tutorial. I think I've done that:
https://support.google.com/faqs/answer/9084685
String rootUrl = "https://example.com";
Intent intent = getIntent();
if (intent.getStringExtra("action_url") != null) {
    if (intent.getStringExtra(NotificationIntentService.NOTIFICATIONS_DESTINATION) != null) {
        myWebView.loadUrl(rootUrl + intent.getStringExtra(NotificationIntentService.NOTIFICATIONS_DESTINATION));
    } else if (intent.getStringExtra("action_url").matches("^https://example.com/")) {
        myWebView.loadUrl(intent.getStringExtra("action_url"));
    }
} else {
    if (retrieveHasRegistered(context)) {
        myWebView.loadUrl(rootUrl + "/android?registered");
    } else {
        myWebView.loadUrl(rootUrl + "/android");
    }
}

Allow access to web application only when my server redirects to it

My question is very similar to this one.
The idea is the following: I have an app written in Node (specifically Sails.js); it is a simple form for invoices. And another one in Laravel.
So what I want is that the user can only access that form (the Sails app) if a controller of the Laravel app redirects to it.
The link above says that I could use sessions, but as you can see these are very different applications, so I'm looking for the simplest and best way to do it.
Any advice is welcome, or if you have a better approach to solve this, please let me know. Thanks
Probably the simplest way is to check the Referer header in your Sails controller and do a simple comparison. (Note that the Referer header can be spoofed or stripped by the client, so treat this as a convenience check, not real security.)
For example:
getinvoice: function(req, res, next) {
    var referer = req.headers.referer;
    if (referer != 'http://somedomain.com/pageallowedtocallgetinvoice') {
        return res.forbidden();
    } else {
        ...
    }
}

casperjs and a/b testing

I have a signup.js test that automates signing up for my web app (obviously). We're currently A/B testing a new flow that takes you to a different page ('.com/signupa' vs '.com/signupb'), and I'm wondering what the best way is to reflect this in my test.
Options:
use evaluateOrDie and make it die at .com/signupb (this seems dumb)
flesh out a test for .com/signupb and make it go that route if it hits that page (is this possible?) Something like:
casper.waitForResource("classic.png",
    function success() {
        this.echo('on the old signup flow');
        <continue with regular signup test>
    },
    function fail() {
        this.test.assertExists("classic.png");
        <do something else>
    });
Any other ideas greatly appreciated!
My preference would be to hide some information in each of your pages, so you can cleanly switch on them. E.g.
<span id="version1"></span>
vs.
<span id="version2"></span>
Then, after submitting the form:
casper.then(function(){
    if (this.exists('#version2')) {
        testNewSite(this);
    } else {
        testOldSite(this);
    }
});
But keying on something you already know is only in one of the pages, like the "classic.png" you show in your question, is also fine. (It just feels a little more brittle: the web development team can break your tests by renaming that image, or putting an image with that name in the new version, etc.)

Retrieve comments from website using disqus

I would like to write a scraping script to retrieve comments from CNN articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1
I realize that CNN uses Disqus for their comment discussion. As the comment loading is not page-based (i.e., prev page, next page) but dynamic (i.e., you need to click "load next 25"), I have no idea how to retrieve all 5000+ comments for this article.
Any idea or suggestion?
Thanks so much!
I needed to get comments by scraping a page that loaded its Disqus comments via AJAX. Because they were not rendered on the server, I had to call the Disqus API. From the page source, you will need the identifier code:
var identifier = "456643"; // take note of this from the page source
// this is the "ident" url query param in the following js request
Also, look in the JS source code to get the page's public key and forum name, and place these in the URL where appropriate.
I used Node.js to test this, i.e.:
var request = require("request");
var publicKey = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";
var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?api_key=" + publicKey + "&thread:ident=456643&forum=nameOfForumFromSource";
// note: the request module's callback signature is (error, response, body)
request(disqusUri, function(err, res, body){
    if (err) {
        console.log("ERR: " + err);
        return;
    }
    console.log(body);
});
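For completeness, here is a rough Go equivalent of the same request. The API key, thread identifier, and forum name are placeholders that you must copy out of the page's JS source, exactly as described above:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// buildListPostsURL assembles the Disqus threads/listPosts endpoint URL.
func buildListPostsURL(apiKey, ident, forum string) string {
	q := url.Values{}
	q.Set("api_key", apiKey)
	q.Set("thread:ident", ident)
	q.Set("forum", forum)
	return "https://disqus.com/api/3.0/threads/listPosts.json?" + q.Encode()
}

// fetchComments performs the GET request and returns the raw JSON body.
func fetchComments(uri string) (string, error) {
	resp, err := http.Get(uri)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	apiKey := "YOUR_PUBLIC_API_KEY" // placeholder: copy the real key from the page source
	uri := buildListPostsURL(apiKey, "456643", "nameOfForumFromSource")
	fmt.Println(uri)

	if apiKey == "YOUR_PUBLIC_API_KEY" {
		return // fill in a real key before actually making the request
	}
	body, err := fetchComments(uri)
	if err != nil {
		fmt.Println("ERR:", err)
		return
	}
	fmt.Println(body)
}
```

Using url.Values also takes care of escaping the `thread:ident` parameter name for you.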
Another option for scraping (other than just requesting the page), which might be less robust (depends on your needs) but will solve your problem, is to use some kind of wrapper around a full-fledged web browser and literally script the usage pattern to extract the relevant data. Since you didn't mention which programming language you know, I'll give three examples: 1) Watir - Ruby, 2) WatiN - IE & Firefox via .NET, 3) Selenium - IE via C#/Java/Perl/PHP/Ruby/Python.
I'll provide a little example using WatiN & C#:
IE browser = new IE();
browser.GoTo(YOUR CNN URL);
List visibleComments = browser.List(Find.ById("dsq-comments"));
// do your scraping thing
Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
moreComments.Click();
// wait until the ajax call has ended by searching for some indicator
browser.WaitUntilContainsText(SOME TEXT);
// do your scraping thing
Notice: I'm not familiar with Disqus, but it might be a better option to force all the comments to show by looping the Link & Click parts of the code I posted until all the comments are visible, and then scrape the dsq-comments List element.
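If you end up in Go rather than .NET, the same click-until-done loop can be sketched with headless Chrome via the chromedp package. The selectors are the Disqus class/id names mentioned above; the URL and the timing values are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	if err := chromedp.Run(ctx, chromedp.Navigate("https://example.com/article")); err != nil {
		fmt.Println("ERR:", err)
		return
	}

	// Keep clicking the "load more" link until it disappears.
	for {
		clickCtx, cancelClick := context.WithTimeout(ctx, 5*time.Second)
		err := chromedp.Run(clickCtx,
			chromedp.Click(".dsq-paginate-append-text", chromedp.NodeVisible))
		cancelClick()
		if err != nil {
			break // link no longer appears: assume all comments are loaded
		}
		// Crude wait for the AJAX batch to be appended.
		chromedp.Run(ctx, chromedp.Sleep(2*time.Second))
	}

	// Now scrape the fully populated comment list.
	var comments string
	if err := chromedp.Run(ctx, chromedp.OuterHTML("#dsq-comments", &comments)); err == nil {
		fmt.Println(len(comments))
	}
}
```

A fixed sleep is the simplest wait; polling for a growing comment count would be the more reliable equivalent of WaitUntilContainsText.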

Run Javascript on the body of a Gmail message

I want to display LaTeX math in the Gmail messages that I receive, so that for example $\mathbb P^2$ would show as a nice formula. Now, there are several JavaScript libraries available (for example, this one, or MathJax) which would do the job; I just need to call them at the right time to manipulate the Gmail message.
I know that this is possible to do in "basic HTML" and "print" views. Is it possible to do in the standard Gmail view? I tried to insert a call to the javascript right before the "canvas_frame" iframe, but that did not work.
My suspicion is that manipulating a Gmail message by any Javascript would be a major security flaw (think of all the malicious links one could insert) and that Google does everything to prevent this. And so the answer to my question is probably 'no'. Am I right in this?
Of course, it would be very easy for Google to implement viewing of LaTeX and MathML math simply by using MathJax on their servers. I made the corresponding Gmail Lab request, but no answer, and no interest from Google apparently.
So, again: is this possible to do without Google's cooperation, on the client side?
I think one of the better ways to do this might be to embed images using the Google Charts API.
<img src="http://chart.apis.google.com/chart?cht=tx&chl=x=\frac{-b%20\pm%20\sqrt{b^2-4ac}}{2a}">
To Learn more: https://developers.google.com/chart/image/ [note, the API has been officially deprecated, but will work until April 2015]
If you really must use LaTeX and some js library, I think one way you could accomplish this is by injecting a script tag into the iframe.
I hope this is a good starting point.
Example:
// ==UserScript==
// @name        Test Gmail Alterations
// @version     1
// @author      Justen
// @description Test Alter Email
// @include     https://mail.google.com/mail/*
// @include     http://mail.google.com/mail/*
// @license     GPL version 3 or any later version; http://www.gnu.org/copyleft/gpl.html
// ==/UserScript==

(function GmailIframeInject() {
    GM_log('Starting GMail iFrame Injection');

    var GmailCode = function() {
        // Your code here;
        // the ':pd' (div id) changes, so you might have to do some extra work
        var mail = document.getElementById(':pd');
        mail.innerHTML = '<h1>Hello, World!</h1>';
    };

    var iframe = document.getElementById('canvas_frame');
    var doc = null;

    if (iframe) {
        GM_log('Got iFrame');
        doc = iframe.contentDocument;
    } else {
        GM_log('ERROR: Could not get iframe with id canvas_frame');
        return;
    }

    if (doc) {
        GM_log('Injecting GmailCode');
        var code = "(" + GmailCode + ")();";
        doc.body.appendChild(doc.createElement('script')).innerHTML = code;
    } else {
        GM_log('ERROR: Could not get iframe content document');
        return;
    }
})();
Well, there are already Greasemonkey scripts that do things to Gmail as far as I know (like this one). Is this a possible security hole? Of course, anything you'd do with executable code has that risk. Google seems to move at glacial speeds on things they're not interested in. They really do seem to function based on internal championing of ideas, so the best way forward is to find sympathetic Googlers if you want them to include something in Gmail. Otherwise stick to Greasemonkey; at least you'll have an easy install path for other people who'd like to see the same functionality.
