Retrieve comments from website using disqus - web-scraping

I would like to write a scraping script to retrieve comments from CNN articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1
I realize that CNN uses Disqus for its comment discussion. Since the comments are not paginated as separate web pages (i.e., prev page, next page) but are loaded dynamically (i.e., you need to click "load next 25"), I have no idea how to retrieve all 5000+ comments for this article.
Any idea or suggestion?
Thanks so much!

I needed to scrape comments from a page that loaded its Disqus comments via AJAX. Because they were not rendered on the server, I had to call the Disqus API. You will need the identifier code from the page source:
var identifier = "456643" // take note of this from the page source
// this is the ident url query param in the following js request
Also, look in the JS source code to get the page's public key and forum name. Place these in the URL where appropriate.
I used Node.js to test this, i.e.:
var request = require("request");
var publicKey = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";
var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?api_key=" + publicKey + "&thread:ident=456643&forum=nameOfForumFromSource";
// request's callback receives (error, response, body), in that order
request(disqusUri, function(err, res, body){
  if(err){
    console.log("ERR: " + err);
    return;
  }
  console.log(body);
});
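The request above returns only the first page of posts, while the question asks for all 5000+ comments. Here is a minimal sketch of paging through the results. It assumes the JSON response carries a `cursor` object with `hasNext` and `next` fields and that the endpoint accepts a `cursor` query parameter, so check the Disqus API documentation before relying on it. `buildListPostsUrl`, `nextCursor`, `collectAllPosts`, and the caller-supplied `fetchJson` are hypothetical names, not part of any library:

```javascript
// Build a listPosts URL; opts.cursor is optional.
function buildListPostsUrl(opts) {
  let url = "https://disqus.com/api/3.0/threads/listPosts.json" +
    "?api_key=" + encodeURIComponent(opts.publicKey) +
    "&forum=" + encodeURIComponent(opts.forum) +
    "&thread:ident=" + encodeURIComponent(opts.identifier);
  if (opts.cursor) {
    url += "&cursor=" + encodeURIComponent(opts.cursor);
  }
  return url;
}

// Return the cursor for the next page, or null when there is none.
// Assumes the response shape { cursor: { hasNext, next }, response: [...] }.
function nextCursor(page) {
  return page.cursor && page.cursor.hasNext ? page.cursor.next : null;
}

// Fetch page after page until the API reports no further results.
// fetchJson is a caller-supplied function (hypothetical) that takes a URL
// and resolves to the parsed JSON body.
async function collectAllPosts(opts, fetchJson) {
  const posts = [];
  let cursor = null;
  do {
    const page = await fetchJson(buildListPostsUrl({ ...opts, cursor }));
    posts.push(...page.response);
    cursor = nextCursor(page);
  } while (cursor);
  return posts;
}
```

Passing the fetch function in keeps the paging logic independent of the HTTP library, so the `request` call from the answer can be wrapped in a promise and supplied as `fetchJson`.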

Another option for scraping (other than fetching the page directly), which might be less robust (depending on your needs) but will solve the problem you have, is to use a wrapper around a full-fledged web browser, script the usage pattern, and extract the relevant data. Since you didn't mention which programming language you know, I'll give 3 examples: 1) Watir - Ruby, 2) WatiN - IE & Firefox via .NET, 3) Selenium - IE via C#/Java/Perl/PHP/Ruby/Python
I'll provide a little example using WatiN & C#:
IE browser = new IE();
browser.GoTo(YOUR CNN URL);
List visibleComments = browser.List(Find.ById("dsq-comments"));
//do your scraping thing
Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
moreComments.Click();
//wait until the ajax call has ended by searching for some indicator
browser.WaitUntilContainsText(SOME TEXT);
//do your scraping thing
Note:
I'm not familiar with Disqus, but it might be a better option to force all the comments to show by looping the Link & Click parts of the code I posted until all the comments are visible, and then scrape the List element dsq-comments.
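The loop described in the note could be sketched as follows. This is plain JavaScript rather than WatiN, and the `page` object with its `findMoreLink`, `waitForNewComments`, and `readComments` methods is a hypothetical wrapper you would write around whichever browser-automation tool you use. None of these names come from a real library:

```javascript
// Click the "load more" link until it disappears (or a safety cap is hit),
// then read every comment in one pass.
function loadAllComments(page, maxClicks) {
  let clicks = 0;
  while (clicks < maxClicks) {
    const link = page.findMoreLink(); // hypothetical: returns null when no link is left
    if (!link) break;
    link.click();
    page.waitForNewComments();        // hypothetical: blocks until the AJAX load finishes
    clicks += 1;
  }
  return page.readComments();         // hypothetical: returns the comments now visible
}
```

The `maxClicks` cap is there so the loop cannot run forever if the link never goes away.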

Related

Check if a webpage is online using VB in asp.net

I have been toying with getting this working for a while now, to no avail.
I want to check if a website is available and display its status in a label. I have no code to share, as I haven't got anywhere close. I am using VS 2013 to build an ASP.NET website in VB.
I thought I could just ping the website, but after multiple tests from the command prompt, the website doesn't respond to pings even when it is up.
I know the page will be taken offline periodically for updates, as this happened last week, and when it is, you get "page cannot be displayed". I need to test if the page is online and return true if it is and false if not.
The simple way is to make an HttpWebRequest and examine the result. If the page doesn't exist, you will get an error that indicates that. I'm not as familiar with WebClient, but that apparently works as well.
This past question gives you examples of both.
You can easily do this using HttpWebRequest.
try {
HttpWebRequest myHttpWebRequest = (HttpWebRequest) WebRequest.Create("siteAddress");
HttpWebResponse myHttpWebResponse = (HttpWebResponse) myHttpWebRequest.GetResponse();
myHttpWebResponse.Close();
}
catch(WebException e) {
//There is a problem accessing the site
}

Framebuster with exceptions

I have a question about writing a frame-buster-buster. I have already read Frame Buster Buster ... buster code needed but I need an extra tweak.
My content from my blog at [http://my_domain.c0m/blog] is being displayed at another site showing three "views". One view is a feed and doesn't particularly bother me. The other two bother me and I wish to break both. I also want to permit exceptions for domains that have permission to frame.
In one view, it appears that the content from the top of my blog's HTML is first copied to create a "snapshot" [http://the_other_domain.c0m/copy_of_blog], then that copy is framed in [http://the_other_domain.c0m/ ]. So, in this case, the parent and the 'child' copy are both hosted at [http://the_other_domain.c0m/]. Google Translate does a similar thing, but I find this OK. So, I would like to break this frame while also permitting exceptions for Google and for people who have made a copy to their PCs and would like to view it in a utility that might frame.
In the other view, it appears the content from my site is framed directly. So in this case [http://my_domain.c0m/blog_post] is framed by [http://the_other_domain.c0m/]. I would like to bust out of this frame. However, my difficulty is that I can't figure out how to do so while keeping the exceptions for Google Translate or individual PC users' frames at home.
My solution so far (I am not particularly familiar with javascript. So, please don't laugh too hard at the redundancy and lack of knowledge):
I was able to bust the first frame using:
<SCRIPT type="text/javascript" >
var topWindow = String(top.location)
var topWord=topWindow.split("/")
var selfWindow = String(self.location)
var selfWord=selfWindow.split("/")
var correctLocation ="http://my_domain.c0m/blog"
var correctWord2="my_domain.c0m"
var http="http:"
if( ( (topWord[2] != correctWord2) || (selfWord[2] != correctWord2) )
&& (topWord[2] != 'translate.googleusercontent.com' ) && (topWord[0] == http ) ){
document.write("message expressing my opinion about the asshattery going in here.]" )
setTimeout("redirect_after_pause()",8000)
}else{
//document.write("<p><font color='purple'>Hi there! Javascript is working.</font> </p> " )
}
function redirect_after_pause() {
var correctLocation ="http://my_domain.c0m/blog"
top.location=correctLocation
}
</SCRIPT>
I know this is inefficient. But it works and achieves my goal of making an exception for a) translations at googleusercontent, which my readers in France requested, and b) cases where a user is framing in a utility that downloads to their PC (which I think has URIs beginning with "FILE:").
Now the difficulty: This does not work for the view where content hosted at my domain is framed at the other domain. I believe I have tracked the problem down to var topWindow = String(top.location) not being permitted in my child window. In principle, this would work:
<script type="text/javascript">
if(top != self) top.location.replace(location);
</script>
However, I think it breaks Google Translate, which uses a top frame hosted at [http://translate.google.com] to hold their translation of my content. I suspect it similarly breaks things for readers who display a local copy on their PC, if that copy is displayed in a frame.
If someone can guide me toward a solution I can implement to break both frames while permitting my exceptions, I would appreciate it.
BTW: It does appear that the site in question is using a frame-buster buster. I poked around and found this inside their /static/common.js?1345250291 code:
enable_iframe_buster_buster:function(){var a=this,b=0;window.onbeforeunload=function(){b++};clearInterval(this.locks.iframe_buster_buster);this.locks.iframe_buster_buster=setInterval(function(){0<b&&(b-=2,a.flags.iframe_story_locations_fetched&&!a.flags.iframe_view_not_busting&&_.contains(["page","story"],a.story_view)&&NEWSBLUR.reader.active_feed&&($(".NB-feed-frame").attr("src",""),window.top.location="/reader/buster",$(".task_view_feed").click()))},1)},disable_iframe_buster_buster:function(){clearInterval(this.locks.iframe_buster_buster)}
That's deep inside some particularly dense JavaScript. Whatever it does, it doesn't seem to affect my ability to bust the frame for the case where my content is copied and hosted at [http://the_other_domain.c0m/]. I haven't yet fully explored whether it defeats simple framebusters, because I only recently recognized that "var topWindow = String(top.location)" is forbidden in a child frame whose domain differs from the parent frame's.
Whether or not the frame-buster is present, I'd like help with solutions here. I know that if one site is now framing my content in this way it is only a matter of time before the obnoxious technique catches on and I would like to code in solutions that bust both methods gracefully while providing myself with exceptions. Thanks in advance.
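One possible approach, sketched under stated assumptions: reading `top.location` throws when the parent is on a different domain, so the check can fall back to `document.referrer`, which for a framed page is normally the URL of the framing page (it can be empty, depending on the browser and referrer settings). `shouldBustFrame`, `ALLOWED_FRAME_HOSTS`, and `CANONICAL_URL` are hypothetical names, and busting when the parent is unknown is a design choice rather than an established rule:

```javascript
// Hosts that are allowed to frame the blog (hypothetical list).
const ALLOWED_FRAME_HOSTS = [
  "my_domain.c0m",
  "translate.google.com",
  "translate.googleusercontent.com"
];
const CANONICAL_URL = "http://my_domain.c0m/blog";

// Decide whether to break out of the frame.
//   isFramed     : result of (top !== self)
//   selfProtocol : location.protocol of the framed page, e.g. "http:" or "file:"
//   parentUrl    : best guess at the framing page's URL, or "" if unknown
function shouldBustFrame(isFramed, selfProtocol, parentUrl) {
  if (!isFramed) return false;
  if (selfProtocol === "file:") return false; // local copy viewed on a reader's PC
  let host = "";
  try {
    host = new URL(parentUrl).hostname;
  } catch (e) {
    return true; // parent unknown or unparsable: treat it as not permitted
  }
  return ALLOWED_FRAME_HOSTS.indexOf(host) === -1;
}

// Browser-only wiring; skipped when the script runs outside a page.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  let parentUrl = document.referrer;
  try {
    parentUrl = String(window.top.location.href); // throws for a cross-domain parent
  } catch (e) {
    // keep document.referrer as the fallback
  }
  if (shouldBustFrame(window.top !== window.self, window.location.protocol, parentUrl)) {
    // A copied snapshot lives on the other domain, so send the reader to the real blog.
    const onOwnDomain = window.location.hostname === "my_domain.c0m";
    window.top.location.replace(onOwnDomain ? window.location.href : CANONICAL_URL);
  }
}
```

This sketch does not try to defeat the other site's own buster-buster code, and modern browsers may restrict a framed page from navigating the top window, so it should be tested against the actual site.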

hide google analytic code

Is it possible not to show the Google Analytics code on the master page? I mean, can I place it somewhere where no one can see it but still get all the analytics for the application?
You can use the Google Analytics code for mobile to do what you ask.
http://code.google.com/mobile/analytics/docs/web/
How this works: the Google mobile code calls Google Analytics from your server, in code-behind, rather than from the client page using JavaScript. So you can completely hide this call from your clients, and all the data is still sent to Google Analytics, let's say on page load.
First, on your Google Analytics detail page, get the mobile code for ASP.NET, and then look at the code below, which is the part you need to focus on and change so that you can make a global function to call on Page Load. To avoid possible delays, I also suggest making the call to Google Analytics on a new thread, until ASP.NET 4.5 comes out, which includes that option on WebRequest.
string utmGifLocation = "http://www.google-analytics.com/__utm.gif";
// Construct the gif hit url.
string utmUrl = utmGifLocation + "?" +
"utmwv=" + Version +
"&utmn=" + GetRandomNumber() +
"&utmhn=" + HttpUtility.UrlEncode(domainName) +
"&utmr=" + HttpUtility.UrlEncode(documentReferer) +
"&utmp=" + HttpUtility.UrlEncode(documentPath) +
"&utmac=" + account +
"&utmcc=__utma%3D999.999.999.999.999.1%3B" +
"&utmvid=" + visitorId +
"&utmip=" + GetIP(GlobalContext.Request.ServerVariables["REMOTE_ADDR"]);
SendRequestToGoogleAnalytics(utmUrl);
private void SendRequestToGoogleAnalytics(string utmUrl)
{
try
{
WebRequest connection = WebRequest.Create(utmUrl);
((HttpWebRequest)connection).UserAgent = GlobalContext.Request.UserAgent;
connection.Headers.Add("Accept-Language",
GlobalContext.Request.Headers.Get("Accept-Language"));
using (WebResponse resp = connection.GetResponse())
{
// Ignore response
}
}
catch (Exception ex)
{
if (GlobalContext.Request.QueryString.Get("utmdebug") != null)
{
throw new Exception("Error contacting Google Analytics", ex);
}
}
}
All of this is a little hack on the Google Analytics mobile code, but the general idea works in your case. Get the Google Analytics SDK here.
http://code.google.com/apis/analytics/docs/tracking/home.html
What Google is actually trying to achieve here: Google says that there is no reason for a mobile phone with limited and costly bandwidth to make the call to Google Analytics itself. So Google makes a code-behind call to Google Analytics whenever a page is opened. The mobile device only needs to read a tiny image, and in code-behind the request for that image makes the real call to Google. On your side, you do not need to place an image; you can call Google Analytics directly by slightly changing the function that Google provides.
Hope this helps.
Since you have to put the analytics script in your header, there is no easy way of doing this. Do you want to hide your ID#? There may be a way to reference a variable for your ID#, but without a bunch of extra coding there is no way.
If it's there, people can see it if they look for it. If it's not there, you can't get the analysis.
You could get a little sneaky, and have the analytics on a page that gets loaded into an invisible iframe, but someone that wants to find it will.

Run Javascript on the body of a Gmail message

I want to display LaTeX math in the Gmail messages that I receive, so that for example $\mathbb P^2$ would show as a nice formula. Now, there are several JavaScript libraries available (for example, this one, or MathJax) which would do the job; I just need to call them at the right time to manipulate the Gmail message.
I know that this is possible to do in "basic HTML" and "print" views. Is it possible to do in the standard Gmail view? I tried to insert a call to the javascript right before the "canvas_frame" iframe, but that did not work.
My suspicion is that manipulating a Gmail message by any Javascript would be a major security flaw (think of all the malicious links one could insert) and that Google does everything to prevent this. And so the answer to my question is probably 'no'. Am I right in this?
Of course, it would be very easy for Google to implement viewing of LaTeX and MathML math simply by using MathJax on their servers. I made the corresponding Gmail Lab request, but no answer, and no interest from Google apparently.
So, again: is this possible to do without Google's cooperation, on the client side?
I think one of the better ways to do this might be to embed images using the Google Charts API.
<img src="http://chart.apis.google.com/chart?cht=tx&chl=x=\frac{-b%20\pm%20\sqrt{b^2-4ac}}{2a}">
To Learn more: https://developers.google.com/chart/image/ [note, the API has been officially deprecated, but will work until April 2015]
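A minimal sketch of that idea, assuming formulas are written inline as `$...$` and that the message text is available as a string. The function names are hypothetical, and the endpoint is the one from the example above, which is deprecated, so treat this as an illustration only:

```javascript
const CHART_ENDPOINT = "http://chart.apis.google.com/chart?cht=tx&chl=";

// Build the image URL for one LaTeX formula.
function latexToChartUrl(tex) {
  return CHART_ENDPOINT + encodeURIComponent(tex);
}

// Replace each inline $...$ formula in a string with an <img> tag.
function replaceInlineMath(text) {
  return text.replace(/\$([^$]+)\$/g, function (match, tex) {
    return '<img src="' + latexToChartUrl(tex) + '">';
  });
}
```

Note that any text containing two literal dollar signs (prices, for instance) would also be matched, so a real script would need a stricter delimiter rule.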
If you really must use LaTeX and some js library, I think one way you could accomplish this is by injecting a script tag into the iframe.
I hope this is a good starting point.
Example:
// ==UserScript==
// @name Test Gmail Alterations
// @version 1
// @author Justen
// @description Test Alter Email
// @include https://mail.google.com/mail/*
// @include http://mail.google.com/mail/*
// @license GPL version 3 or any later version; http://www.gnu.org/copyleft/gpl.html
// ==/UserScript==
(function GmailIframeInject() {
GM_log('Starting GMail iFrame Injection');
var GmailCode = function() {
// Your code here;
// The ':pd' (div id) changes, so you might have to do some extra work
var mail = document.getElementById(':pd');
mail.innerHTML = '<h1>Hello, World!</h1>';
};
var iframe = document.getElementById('canvas_frame');
var doc = null;
if( iframe ) {
GM_log('Got iFrame');
doc = iframe.contentDocument;
} else {
GM_log('ERROR: Could not get iframe with id canvas_frame');
return
}
if( doc ) {
GM_log('Injecting GmailCode');
var code = "(" + GmailCode + ")();"
doc.body.appendChild(doc.createElement('script')).innerHTML=code;
} else {
GM_log('ERROR: Could not get iframe content document');
return;
}
})();
Well, there are already Greasemonkey scripts that do things to Gmail, as far as I know (like this one). Is this a possible security hole? Of course; anything you do with executable code has that risk. Google seems to move at glacial speeds on things they're not interested in. They really do seem to function based on internal championing of ideas, so the best way forward is to go find sympathetic Googlers if you want them to include something in Gmail. Otherwise stick to Greasemonkey; at least you'll have an easy install path for other people who'd like to see the same functionality.

Drupal node.save and JSONP

I am having an issue calling Drupal's node.save using MooTools' JSONP. Here is an example.
Here is my request:
callback Request.JSONP.request_map.request_1
method node.save
sessid 123123123123123
node {"type":"blog","title":"New Title","body":"This is the blog body"}
Here is my result
HTTP/1.0 500 Internal Server Error
I got this working before, but I used AMFPHP and was able to send objects to Drupal. I am assuming that this has to do with Drupal expecting an object, but since it is a GET, the node gets sent as a string. Is there any way of getting around this without hacking the code?
Here is my code:
$('newBlogSubmit').addEvent('click', function()
{
var node = {
type : "blog",
title:"New Title",
body :"This is the blog body"
}
var string = JSON.encode(node);
string.escapeRegExp()
var sessID = _sessID;
DrupalService.getInstance().node_save(string, sessID, drupal_handleBlogSubmit);
});
My Drupal Service JS Code:
//NODE
DrupalService.prototype.node_save = function(node, sessid, callback){
var dataObj = {
method : "node.save",
sessid : sessid,
node : node
}
DrupalService.getInstance().request(dataObj, callback);
}
//SEND REQUEST AND CALLBACK FUNCTION
DrupalService.prototype.request = function(dataObject, callback){
new JsonP('http://myDrupalSite.com/services/json', {data: dataObject,onComplete: callback}).request();
}
I am trying to connect the dots, but I am not too familiar with Drupal. I would guess all I need to do is turn the string back into an object. Any ideas where I should be looking, or if there is an existing patch?
A first question could be why you use MooTools, since Drupal comes with jQuery and uses it extensively throughout the different modules and Drupal core itself.
Anyway, I don't know MooTools, so I can't help you there, but if your request is ending in an internal server error, you have a problem with your Drupal code or your JS code. So even if I knew exactly what you were doing, I couldn't tell you the problem without looking at the Drupal code for your http://myDrupalSite.com/services/json callback.
In general, what you want to make sure is:
You make a POST request, as Drupal will cache GETs, and semantically you are posting data (the node) to the server.
Your data should be sent as POST params; this will make them end up in the PHP $_POST variable.
Your callback should validate the data and act accordingly, creating a node when the data is intact. You don't need session IDs, since the script will have the same session the browser has.
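As a rough sketch of the first two points: JSONP works by injecting a script tag, so it can only issue GET requests, and sending a POST means using an ordinary XHR request to the same origin instead. The helper below only builds a form-encoded body. `buildNodeSaveBody` is a hypothetical name, and whether the services endpoint expects the node as a JSON string in a `node` field is an assumption carried over from the request shown in the question:

```javascript
// Build an application/x-www-form-urlencoded body for a node.save call.
// The node object is serialized to JSON so the server side can decode it
// back into an object (assumed behavior, see the note above).
function buildNodeSaveBody(node, sessid) {
  const params = new URLSearchParams();
  params.set("method", "node.save");
  if (sessid) {
    params.set("sessid", sessid); // optional; the answer suggests it may be unnecessary
  }
  params.set("node", JSON.stringify(node));
  return params.toString();
}
```

The resulting string would be sent as the POST body with a `Content-Type` of `application/x-www-form-urlencoded`.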
I've answered a similar question in detail, which was about altering a field instead of saving a node, but much of the work is still the same. You can take a look at that post, although it uses jQuery and not MooTools.

Resources