PhantomJS onResourceReceived URL decode issue - web-scraping

I'm creating a web-scraping bot in PhantomJS. I'm using onResourceReceived to sniff the requests the site makes and log them with this simple code:
page.onResourceReceived = function(response)
{
    if (response.url.match("XXXXXXX"))
    {
        console.log(response.url);
    }
};
My problem is that response.url is automatically rewritten to a URL-decoded version of the request. I need to check some parameters, but instead of receiving something like this:
xxx.com?...&events=event20%2Cevent4%%2Cevent89%3D7%2Cevent50%2Cevent51%2Cevent52%2Cevent53%2Cevent54%2Cevent55%2Cevent56&...
I get this:
xxx.com?... &events=event20%2Cevent4%2Cevent89%3D7&....
It looks like when %3D is reached, the value is cut off and parsing continues with the next property.
Is there a way to access the raw version of this data?
Thanks a lot for the help.
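
One thing worth trying (a sketch, not a verified fix): PhantomJS also exposes page.onResourceRequested, which fires before the response comes back, and it is worth checking whether requestData.url still carries the raw, encoded form there. The handler signature below is the standard PhantomJS one; whether the URL survives undecoded at that point is an assumption to test.
page.onResourceRequested = function(requestData, networkRequest)
{
    // requestData.url may still hold the raw encoded URL here - verify this
    if (requestData.url.match("XXXXXXX"))
    {
        console.log("requested: " + requestData.url);
    }
};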

Related

Understand Dynamic Links Firebase

I would like to understand Firebase Dynamic Links better, because I am very new to this subject.
What I would like to know:
Is FirebaseDynamicLinks.instance.getInitialLink() supposed to return "only" the last dynamic link created with the "initial" URL (before it was shortened)?
Or why doesn't FirebaseDynamicLinks.instance.getInitialLink() take a String url as a parameter?
FirebaseDynamicLinks.instance.getDynamicLink(String url) doesn't read custom parameters if the URL was shortened, so how can we retrieve custom parameters from a shortened link?
My use case is quite simple: I am trying to share an object through messages in my application, so I want to save the dynamic link in my database and be able to read it to run a query according to specific parameters.
FirebaseDynamicLinks.instance.getInitialLink() returns the link that opened the app and if the app was not opened by a dynamic link, then it will return null.
Future<PendingDynamicLinkData?> getInitialLink()
Attempts to retrieve the dynamic link which launched the app.
This method always returns a Future. That Future completes to null if
there is no pending dynamic link, or on any call to this method after
the first attempt.
https://pub.dev/documentation/firebase_dynamic_links/latest/firebase_dynamic_links/FirebaseDynamicLinks/getInitialLink.html
FirebaseDynamicLinks.instance.getInitialLink() does not accept a String url as a parameter because it is just meant to return the link that opened the app.
Looks like there's no straightforward answer to getting the query parameters back from a shortened link. Take a look at this discussion to see if any of the workarounds fit your use case.

Scraping login protected website with a challenge form?

I'm trying to do some web scraping from steamspy.com, specifically the total playtime hours for a certain game. That info is behind the site's login wall, so I've been trying to figure out how to get R past it for HTML mining.
I tried this method for passing login credentials via POST() but it doesn't seem to work. I noticed that the login handler for that example used POST, whereas looking at the source code for steamspy it seems to use a challenge form, and I wasn't sure how to proceed with R.
My attempt thus far looks like this:
library(httr)

handle <- handle("http://steamspy.com")
path <- "/login/"
login <- list(
  jschl_vc = "bc4e...",
  pass = "148..."
)
response <- POST(handle = handle, path = path, body = login)
I found the values for jschl_vc and pass by inspecting the source code after I logged in. The code above doesn't work and gives me:
Error in curl::curl_fetch_memory(url, handle = handle) : Failure
when receiving data from the peer
probably because I'm trying to POST to a challenge form. Is there a way to proceed that I'm missing?
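
For what it's worth, the jschl_vc and pass fields are part of Cloudflare's JavaScript challenge, which a plain POST from R cannot satisfy: the challenge expects a browser to execute a script and submit the computed answer. One workaround (a sketch under that assumption, not steamspy-specific advice) is to let a headless browser such as PhantomJS sit through the challenge and hand you the page content afterwards. The URL and wait time below are illustrative guesses:
var page = require('webpage').create();
page.open('http://steamspy.com/', function(status)
{
    // Cloudflare's interstitial typically resolves itself after ~5 seconds,
    // so wait a little longer before reading the real page
    window.setTimeout(function()
    {
        console.log(page.content); // scrape the HTML from here
        phantom.exit();
    }, 8000);
});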

Check if a webpage is online using VB in asp.net

I have been toying with getting this working for a while now, to no avail.
I want to check if a website is available and display its status in a label. I have no code to share as I haven't gotten anywhere close. I am using VS 2013, building an ASP.NET website in VB.
I thought I could just ping the website, but after multiple tests from the command prompt the site doesn't respond to pings even when it is up.
I know the page is taken offline periodically for updates, as happened last week; when it is, you get "page cannot be displayed". I need to test if the page is online and return true if it is and false if not.
The simple way is to do an HttpWebRequest and examine the result. If the page doesn't exist, you will get an error that indicates that. I'm not as familiar with WebClient, but that apparently works as well.
This past question gives you examples of both.
You can easily do this using HttpWebRequest.
try {
    HttpWebRequest myHttpWebRequest = (HttpWebRequest) WebRequest.Create("siteAddress");
    HttpWebResponse myHttpWebResponse = (HttpWebResponse) myHttpWebRequest.GetResponse();
    myHttpWebResponse.Close();
    // reaching this point means the site is online
}
catch (WebException e) {
    // there is a problem accessing the site
}

Retrieve comments from website using disqus

I would like to write a scraping script to retrieve comments from cnn articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1
I realize that CNN uses Disqus for their comment discussion. As the comment loading is not page-based (i.e., prev page/next page) but dynamic (you need to click "load next 25"), I have no idea how to retrieve all 5000+ comments for this article.
Any idea or suggestion?
Thanks so much!
I needed to get comments by scraping a page that loaded Disqus comments via AJAX. Because they were not rendered on the server, I had to call the Disqus API. In the page source, you will need the identifier code:
var identifier = "456643" // take note of this from the page source
// this is the ident url query param in the following js request
Also, look in the JS source code to get the page's public key and forum name. Place these in the URL where appropriate.
I used Node.js to test this, i.e.:
var request = require("request");
var publicKey = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";
var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?&api_key=" + publicKey + "&thread:ident=456643&forum=nameOfForumFromSource";
request(disqusUri, function(err, res, body)
{
    if (err) {
        console.log("ERR: " + err);
        return;
    }
    console.log(body);
});
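
Since the question asks for all 5000+ comments, note that listPosts pages its results. Assuming the standard Disqus v3.0 cursor shape ({cursor: {hasNext, next}, response: [...]} - worth confirming against the API docs), a loop along these lines could walk every page; the key and limit below are placeholders:
var request = require("request");

var publicKey = "YOUR_PUBLIC_KEY"; // from the page's JS source, as above
var baseUri = "https://disqus.com/api/3.0/threads/listPosts.json" +
    "?api_key=" + publicKey +
    "&thread:ident=456643&forum=nameOfForumFromSource&limit=100";

function fetchPage(cursor, comments)
{
    var uri = cursor ? baseUri + "&cursor=" + encodeURIComponent(cursor) : baseUri;
    request(uri, function(err, res, body)
    {
        if (err) {
            console.log("ERR: " + err);
            return;
        }
        var data = JSON.parse(body);
        comments = comments.concat(data.response); // each entry is one comment
        if (data.cursor && data.cursor.hasNext) {
            fetchPage(data.cursor.next, comments); // keep walking the pages
        } else {
            console.log("fetched " + comments.length + " comments");
        }
    });
}

fetchPage(null, []);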
Another option for scraping (other than just getting the page), which might be less robust (depending on your needs) but will solve the problem you have, is to use some kind of wrapper around a full-fledged web browser and literally code the usage pattern to extract the relevant data. Since you didn't mention which programming language you know, I'll give three examples: 1) Watir - Ruby, 2) WatiN - IE & Firefox via .NET, 3) Selenium - IE via C#/Java/Perl/PHP/Ruby/Python.
I'll provide a little example using WatiN & C#:
IE browser = new IE();
browser.GoTo("YOUR CNN URL");
List visibleComments = browser.List(Find.ById("dsq-comments"));
// do your scraping thing
Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
moreComments.Click();
// wait until the AJAX load has ended by searching for some indicator
browser.WaitUntilContainsText("SOME TEXT");
// do your scraping thing
Notice: I'm not familiar with Disqus, but it might be a better option to force all the comments to show by looping the Link & Click parts of the code I posted until all the comments are visible, and then scrape the list element dsq-comments.

Drupal node.save and JSONP

I am having an issue calling Drupal's node.save using MooTools' JSONP. Here is an example.
Here is my request:
callback: Request.JSONP.request_map.request_1
method: node.save
sessid: 123123123123123
node: {"type":"blog","title":"New Title","body":"This is the blog body"}
Here is my result:
HTTP/1.0 500 Internal Server Error
I got this working before, but I used AMFPHP and was able to send objects to Drupal. I am assuming this has to do with Drupal expecting an object, but since it is a GET request the node gets transformed into a string. Is there any way of getting around this without hacking the code?
Here is my code:
$('newBlogSubmit').addEvent('click', function()
{
    var node = {
        type : "blog",
        title: "New Title",
        body : "This is the blog body"
    };
    var string = JSON.encode(node);
    string.escapeRegExp();
    var sessID = _sessID;
    DrupalService.getInstance().node_save(string, sessID, drupal_handleBlogSubmit);
});
My Drupal Service JS Code:
// NODE
DrupalService.prototype.node_save = function(node, sessid, callback){
    var dataObj = {
        method : "node.save",
        sessid : sessid,
        node : node
    };
    DrupalService.getInstance().request(dataObj, callback);
};

// SEND REQUEST AND CALLBACK FUNCTION
DrupalService.prototype.request = function(dataObject, callback){
    new JsonP('http://myDrupalSite.com/services/json', {data: dataObject, onComplete: callback}).request();
};
I am trying to connect the dots but am not too familiar with Drupal. I would guess all I need to do is turn the string back into an object. Any ideas where I should be looking, or is there an existing patch?
A first question could be why you use MooTools, since Drupal comes with jQuery and uses it extensively throughout the different modules and Drupal core itself.
Anyway, I don't know MooTools, so I can't help you there, but if your request is ending in an internal server error, you have a problem with your Drupal code or your JS code. So even if I knew exactly what you were doing, I couldn't tell you the problem without looking at the Drupal code behind your http://myDrupalSite.com/services/json callback.
In general, what you want to make sure is:
You make a POST request, as Drupal will cache GETs, and the semantics here are that you are posting data - the node - to the server.
Your data should be sent as POST params; this will make it end up in PHP's $_POST variable.
Your callback should validate the data and act accordingly, creating a node when the data is intact. You don't need session IDs, since the script will have the same session the browser has.
I've answered a similar question in detail, which was about altering a field instead of saving a node, but much of the work is still the same. You can take a look at that post, although it uses jQuery rather than MooTools.
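
To make the three points above concrete, here is a minimal sketch of the POST approach using the jQuery that ships with Drupal instead of MooTools. The /services/json endpoint and the method/node parameter names are taken from the question; what your Services callback actually expects may differ, and the server side still has to JSON-decode the node:
$('#newBlogSubmit').click(function()
{
    var node = {
        type : "blog",
        title: "New Title",
        body : "This is the blog body"
    };
    $.post('http://myDrupalSite.com/services/json', {
        method : 'node.save',
        node : JSON.stringify(node) // arrives in $_POST['node'] as a string
    }, function(result) {
        console.log(result);
    });
});
Note that JSONP itself cannot help here: it works by injecting a script tag, so it can only ever issue GET requests, which is exactly what the advice above steers you away from.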
