I've had a lot of experiences with Scrapy but or some reasons in this project I should use colly. I'm trying to scrape data from a website but it returns To regain access, please make sure that cookies and JavaScript are enabled before reloading the page.
the part of my codes as follow:
func crawl(search savedSearch) {
c := colly.NewCollector()
extensions.RandomUserAgent(c)
/* for debugging to see what is the result
c.OnHTML("*", func(e *colly.HTMLElement) {
fmt.Println(e.Text)
os.Exit(1)
})*/
c.OnHTML(".result-list__listing", func(e *colly.HTMLElement) {
listingId, _ := strconv.Atoi(e.Attr("data-id"))
if !listingExist(search.id, listingId) {
fmt.Println("Listing found " + strconv.Itoa(listingId))
saveListing(search.id, listingId)
notifyUser(search.user, listingId)
}else{
fmt.Println("item is already crawled")
}
})
I see in the doc "Automatic cookie and session handling" so it might be the problem is js, how can I overcome this problem? first, try could be how can I enable js in colly?
Colly is the best choice for HTML pages. If you need to scrape JS-driven pages, you will need to use a different strategy. Browsers have a mutual protocol to work on JS and they have different libraries for different language including Go.
How can I let the user know when they are getting a hot code push?
At the moment the screen will go blank during the push, and the user will feel it's rather weird. I want to reassure them the app is updating.
Is there a hook or something which I can use?
Here's the shortest solution I've found so far that doesn't require external packages:
var ALERT_DELAY = 3000;
var needToShowAlert = true;
Reload._onMigrate(function (retry) {
if (needToShowAlert) {
console.log('going to reload in 3 seconds...');
needToShowAlert = false;
_.delay(retry, ALERT_DELAY);
return [false];
} else {
return [true];
}
});
You can just copy that into the client code of your app and change two things:
Replace the console.log with an alert modal or something informing the user that the screen is about to reload.
Replace ALERT_DELAY with some number of milliseconds that you think are appropriate for the user to read the modal from (1).
Other notes
I'd recommend watching this video on Evented Mind, which explains what's going on in a little more detail.
You can also read the comments in the reload source for further enlightenment.
I can image more complex reload logic, especially around deciding when to allow a reload. Also see this pacakge for one possible implementation.
You could send something on Meteor.startup() in your client-side code. I personally use Bert to toast messages.
i have a signup.js test that automates signing up for my web app (obviously). we're currently a/b testing a new flow that takes you to a different page ('.com/signupa' vs '.com/signupb') and i'm wondering what the best way to reflect this in my test.
options:
use evaluateOrDie and make it die at .com/signupb (this seems dumb)
flesh out test for .com/signupb and make it go that route if it hits that test (is this possible?) something like..
casper.waitForResource("classic.png",
function success() {
this.echo('on the old signup flow ');
<continue with regular signup test>
},
function fail() {
this.test.assertExists("classic.png");
<do something else>
});
any other ideas greatly appreciated!
My preference would be to hide some information in each of your pages, so you can cleanly switch on them. E.g.
<span id="version1"></span>
vs.
<span id="version2"></span>
Then, after submitting the form:
casper.then(function(){
if (this.exists('#version2')) {
testNewSite(this);
} else {
testOldSite(this);
}
});
But detecting on something you already know is only in one of the pages, like the "classic.png" you show in your question, is also fine. (It just feels a little more brittle: the web development team can break your tests by renaming that image, or putting an image with that name in the new version, etc., etc.)
I need to add a map in my adddon and I know how to do what I need in a "common webpage", like I did here: http://jsfiddle.net/hCymP/6/
The problem is I really don't know how to to the same in a Firefox Addon. I tryed importing the scripts with LoadSubScript and also tryed adding a chrome html with the next line:
<script src="https://maps.googleapis.com/maps/api/js?v=3.exp&sensor=false"></script>
But nothing works. The best solution I found was to add part of the code in this file (the code of the script src) in my function, to import this file with loadSubScript, and all my function is executed but an empty div is returned.
Components.utils.import("resource://gre/modules/Services.jsm");
window.google = {};
window.google.maps = {};
window.google.maps.modules = {};
var modules = window.google.maps.modules;
var loadScriptTime = (new window.Date).getTime();
window.google.maps.__gjsload__ = function(name, text) { modules[name] = text;};
window.google.maps.Load = function(apiLoad) {
delete window.google.maps.Load;
apiLoad([0.009999999776482582,[[["https://mts0.googleapis.com/vt?lyrs=m#227000000\u0026src=api\u0026hl=en-US\u0026","https://mts1.googleapis.com/vt?lyrs=m#227000000\u0026src=api\u0026hl=en-US\u0026"],null,null,null,null,"m#227000000"],[["https://khms0.googleapis.com/kh?v=134\u0026hl=en-US\u0026","https://khms1.googleapis.com/kh?v=134\u0026hl=en-US\u0026"],null,null,null,1,"134"],[["https://mts0.googleapis.com/vt?lyrs=h#227000000\u0026src=api\u0026hl=en-US\u0026","https://mts1.googleapis.com/vt?lyrs=h#227000000\u0026src=api\u0026hl=en-US\u0026"],null,null,null,null,"h#227000000"],[["https://mts0.googleapis.com/vt?lyrs=t#131,r#227000000\u0026src=api\u0026hl=en-US\u0026","https://mts1.googleapis.com/vt?lyrs=t#131,r#227000000\u0026src=api\u0026hl=en-US\u0026"],null,null,null,null,"t#131,r#227000000"],null,null,[["https://cbks0.googleapis.com/cbk?","https://cbks1.googleapis.com/cbk?"]],[["https://khms0.googleapis.com/kh?v=80\u0026hl=en-US\u0026","https://khms1.googleapis.com/kh?v=80\u0026hl=en-US\u0026"],null,null,null,null,"80"],[["https://mts0.googleapis.com/mapslt?hl=en-US\u0026","https://mts1.googleapis.com/mapslt?hl=en-US\u0026"]],[["https://mts0.googleapis.com/mapslt/ft?hl=en-US\u0026","https://mts1.googleapis.com/mapslt/ft?hl=en-US\u0026"]],[["https://mts0.googleapis.com/vt?hl=en-US\u0026","https://mts1.googleapis.com/vt?hl=en-US\u0026"]],[["https://mts0.googleapis.com/mapslt/loom?hl=en-US\u0026","https://mts1.googleapis.com/mapslt/loom?hl=en-US\u0026"]],[["https://mts0.googleapis.com/mapslt?hl=en-US\u0026","https://mts1.googleapis.com/mapslt?hl=en-US\u0026"]],[["https://mts0.googleapis.com/mapslt/ft?hl=en-US\u0026","https://mts1.googleapis.com/mapslt/ft?hl=en-US\u0026"]]],["en-US","US",null,0,null,null,"https://maps.gstatic.com/mapfiles/","https://csi.gstatic.com","https://maps.googleapis.com","https://maps.googleapis.com"],["https://maps.gstatic.com/intl/en_us/mapfiles/api-3/13/11","3.13.11"],[3047554353],1.0,null,null,null,null,1,"",null,null,1,"https://khms.googleapis.com/mz?v=134\u0026",null,"https://earthbuilder.googleapis.com","https://earthbuilder.googleapis.com",null,"https://mts.googleapis.com/vt/icon"], loadScriptTime);
};
//I can't use document.write but use loadSubScript insthead
Services.scriptloader.loadSubScript("chrome://googleMaps/content/Google-Maps-V3.js", window, "utf8"); //chrome://MoWA/content/Google-Maps-V3.js", window, "utf8");
var mapContainer = window.content.document.createElement('canvas');
mapContainer.setAttribute('id', "map");
mapContainer.setAttribute('style',"width: 500px; height: 300px");
mapContainer.style.backgroundColor = "red";
var mapOptions = {
center: new window.google.maps.LatLng(latitude, longitude),
zoom: 5,
mapTypeId: window.google.maps.MapTypeId.ROADMAP
}
var map = new window.google.maps.Map(mapContainer,mapOptions);
return mapContainer;
Can you help me? I'm developing a "Firefox for Android" addon and that's why I need to do things like *window.content.*document.createElement because document is not declared, only window and I think thats may be the problem... But I can't declare everything if I don't know what Google Maps uses.
Added: I also read that Google Maps API Team has specific code that disallows you from copying the main script locally. In particular, that code "expires" every so many hours. I'm combined part of this script: https://maps.googleapis.com/maps/api/js?v=3.exp&sensor=false because I can't execute this directly (Error: write called on an object that does not implement interface HTMLDocument). So I don't have any alternative!
Use an iframe (type=content if in XUL) to display web content. There you can include whatever scripts you like. The content in the iframe will not have any special privileges, or at least should not. If you need to communicate with the privileged add-on part of your code, you can use e.g. regular HTML events (createEvent, addEventListener and friends) or the postMessage web API to pass messages.
Do not try to load remote code directly into other pages, or worse, into the browser, as this is a compatibility and security nightmare.
Because loading remote code and/or code not properly reviewed for running in a privileged context, the platform will refuse to load such scripts from remote sources (http, etc.) via loadSubScript, etc.
Should be noted, that if you'd later like to host your add-on on addons.mozilla.org and still do include remote scripts in privileged code, your add-on will be rejected until you fix it.
Also, mozilla might blocklist your add-on even if you host elsewhere if it is discovered that there are known security vulnerabilities in your add-on, per the Add-on Guidelines.
I would like to write a scraping script to retrieve comments from cnn articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1
I realize that cnn uses disqus for their comment discussion. As the comment loading is not webpage-based (ie, prev page, next page) and is dynamic (ie, need to click "load next 25"), I have no idea how to retrieve all the 5000+ comments for this article.
Any idea or suggestion?
Thanks so much!
I needed to get comments via scraping a page that had disqus comments via ajax. Because they were not rendered on the server, I had to call the disqus api. In the source code, you will need the identifier code:
var identifier = "456643" // take note of this from the page source
// this is the ident url query param in the following js request
also,look in the js source code to get the pages public key, and forum name. Place these in the url where appropriate.
I used javascript nodejs to test this, ie :
var request = require("request");
var publicKey = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";
var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?&api_key=" + publicKey + "&thread:ident=456643&forum=nameOfForumFromSource";
request(disqusUri, function(res,status,err){
console.log(res.body);
if(err){
console.log("ERR: " + err);
}
});
The option for scraping (other then getting the page), which might be less robust (depends on you're needs) but will offer a solution for the problem you have, is to use some kind of wrapper around a full fledged web browser and literally code the usage pattern and extract the relevant data. Since you didn't mention which programming language you know, I'll give 3 examples: 1) Watir - ruby, 2) Watin - IE & Firefox via .net, 3) Selenium - IE via C#/Java/Perl/PHP/Ruby/Python
I'll provide a little example using Watin & C#:
IE browser = new IE();
browser.GoTo(YOUR CNN URL);
List visibleComments = Browser.List(Find.ById("dsq-comments"));
//do your scraping thing
Link moreComments = Browser.Link(Find.ByClass("dsq-paginate-append-text");
moreComments.click();
//wait util ajax ended by searching for some indicator
Browser.WaitUntilContainsText(SOME TEXT);
//do your scraping thing
Notice:
I'm not familiar with disqus, but it might be a better option to force all the comments to show by looping the Link & click parts of the code I posted until all the comments are visible and the scrape the List element dsq-comments