Scraping Javascript generated data

I'm working on a project with the World Bank analyzing their procurement processes.
The WB maintains websites for each of their projects, containing links and data for the associated contracts issued (example). Contract-related data is available under the procurement tab.
I'd like to be able to pull a project's contract information from this site, but the links and associated data are generated using embedded Javascript, and the URLs of the pages displaying contract awards and other data don't seem to follow a discernable schema (example).
Is there any way I can scrape the browser rendered data in the first example through R?

The main page calls a javascript function
javascript:callTabContent('p','P090644','','en','procurement','procurementId');
The important part here is the project ID, P090644. This, together with the required language (en), is passed as a parameter to a form at http://www.worldbank.org/p2e/procurement.html.
This form call can be replicated with the URL http://www.worldbank.org/p2e/procurement.html?lang=en&projId=P090644.
Code to extract the relevant project description URLs follows:
library(XML)
projID <- "P090644"
projDetails <- paste0("http://www.worldbank.org/p2e/procurement.html?lang=en&projId=", projID)
pdData <- htmlParse(projDetails)
# Collect every href found inside the contract awards table
pdDescriptions <- xpathSApply(pdData, '//*/table[@id="contractawards"]//*/@href')
#> pdDescriptions
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005718"
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005702"
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005709"
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005715"
Note that Excel links are also provided, which may be of use to you. They may contain the data you intend to scrape from the description links:
procNotice<-paste0("http://search.worldbank.org/wprocnotices/projectdetails/",projID,".xls")
conAward<-paste0("http://search.worldbank.org/wcontractawards/projectdetails/",projID,".xls")
conData<-paste0("http://search.worldbank.org/wcontractdata/projectdetails/",projID,".xls")
require(gdata)
pnData<-read.xls(procNotice)
caData<-read.xls(conAward)
cdData<-read.xls(conData)
UPDATE:
To find what is being posted we can examine what happens when the javascript function is called. Using Firebug or something similar we intercept the request header which starts:
POST /p2e/procurement.html HTTP/1.1
Host: www.worldbank.org
and has parameters:
lang=en
projId=P090644
Alternatively we can examine the javascript at http://siteresources.worldbank.org/cached/extapps/cver116/p2e/js/script.js and look at the function callTabContent:
function callTabContent(tabparam, projIdParam, contextPath, langCd, htmlId, anchorTagId) {
if (tabparam == 'n' || tabparam == 'h') {
$.ajax( {
type : "POST",
url : contextPath + "/p2e/"+htmlId+".html",
data : "projId=" + projIdParam + "&lang=" + langCd,
success : function(msg) {
if(tabparam=="n"){
$("#newsfeed").replaceWith(msg);
} else{
$("#cycle").replaceWith(msg);
}
stickNotes();
}
});
} else {
$.ajax( {
type : "POST",
url : contextPath + "/p2e/"+htmlId+".html",
data : "projId=" + projIdParam + "&lang=" + langCd,
success : function(msg) {
$("#tabContent").replaceWith(msg);
$('#map_container').hide();
changeAlternateColors();
$("#tab_menu a").removeClass("selected");
$('#'+anchorTagId).addClass("selected");
stickNotes();
}
});
}
}
Examining the content of the function, we can see that it simply POSTs the relevant parameters to a form and then updates the webpage.
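The request that callTabContent sends can be sketched outside the browser as well. The following is a minimal JavaScript sketch, not the site's own code: the helper name buildTabRequest is hypothetical, the default host is taken from the request header shown above, and the form-encoded content type is what jQuery uses by default for a string payload.

```javascript
// Hypothetical helper: builds the same POST request that callTabContent()
// issues for a given tab (htmlId), project ID and language code.
function buildTabRequest(projId, lang, htmlId, contextPath) {
  // Assumption: when no context path is given, use the host seen in the
  // intercepted request header.
  const base = contextPath || 'http://www.worldbank.org';
  // Same parameter order as the original data string: projId first, then lang.
  const body = new URLSearchParams({ projId: projId, lang: lang }).toString();
  return {
    url: base + '/p2e/' + htmlId + '.html',
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: body
    }
  };
}

// Usage (needs network access and a runtime that provides fetch):
// const req = buildTabRequest('P090644', 'en', 'procurement');
// fetch(req.url, req.options)
//   .then(function (res) { return res.text(); })
//   .then(function (html) { console.log(html.length); });
```

Keeping the request construction separate from the network call makes it easy to check the URL and body before sending anything to the server.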

I am not sure I have understood every detail of your problem.
But what I know for sure is that CasperJS works great for JavaScript-generated content.
You can have a look at it here: http://casperjs.org/
It's written in JavaScript and has a bunch of useful functions, all well documented at the link above.
I used it recently for a personal project, and it can be set up easily with a few lines of code.
Give it a go!
Hope that helps.

Related

How to send a POST request from GTM custom tag template?

I'm developing a simple custom tag template for Google Tag Manager. It's supposed to bind to some events and send event data to our servers as JSON in the body of a POST request.
The sandboxed GTM Javascript runtime provides the sendPixel() API. However, that only provides GET requests.
How does one send a POST request from within this sandboxed runtime?

You can use a combination of the injectScript and copyFromWindow APIs, documented under Custom Template APIs.
Basically, the workflow goes like this:
1. Build a simple script that contains a function attached to the window object that sends a normal XHR POST request. The script I made and use can be found here: https://storage.googleapis.com/common-scripts/basicMethods.js
2. Upload that script somewhere publicly accessible so you can import it into your template.
3. Use the injectScript API to add the script to your custom template.
4. The injectScript API wants you to provide an onSuccess callback function. Within that function, use the copyFromWindow API to grab the POST request function you created in your script and save it as a variable.
5. You can now use that variable to send a POST request the same way you would use a normal JS function.
The script I included above also includes JSON encode and Base64 encode functions, which you can use the same way via the copyFromWindow API.
I hope that helps. If you need specific code examples for any part, I can help.

Following @Ian Mitchell's answer, I've made a similar solution.
This is the basic code pattern that can be used inside the GTM template's code section in such a scenario:
const injectScript = require('injectScript');
const callInWindow = require('callInWindow');
const log = require('logToConsole');
const queryPermission = require('queryPermission');
const postScriptUrl = 'https://myPostScriptUrl'; // provide your script URL
const endpoint = 'https://myEndpoint'; // provide your endpoint URL
// Provide your data; the template's `data` object contains all properties from the Fields tab.
// A separate name is used so that the template's `data` object is not shadowed.
const gtmData = {
  sessionId: data.sessionId,
  name: data.name,
  description: data.description
};
// Add the permission to inject a script from 'https://myPostScriptUrl' in the GTM template's Permissions tab
if (queryPermission('inject_script', postScriptUrl)) {
  injectScript(postScriptUrl, onSuccess, data.gtmOnFailure, postScriptUrl);
} else {
  log('postScriptUrl: Script load failed due to permissions mismatch.');
  data.gtmOnFailure();
}
function onSuccess() {
  // Add the permission to call the `sendData` global in the GTM template's Permissions tab
  callInWindow('sendData', gtmData, endpoint);
  data.gtmOnSuccess();
}
It's important to remember to add all necessary permissions in the GTM template. The appropriate permissions appear automatically in the Permissions tab once the corresponding APIs are used in the code section.
Your script at 'https://myPostScriptUrl' may look like this:
function sendData(data, endpoint) {
  var xhr = new XMLHttpRequest();
  var stringifiedData = JSON.stringify(data);
  xhr.open('POST', endpoint);
  xhr.setRequestHeader('Content-type', 'application/json');
  // Register the handler before sending; log any non-2xx response
  xhr.onload = function () {
    if (xhr.status < 200 || xhr.status >= 300) {
      console.error(xhr.status + '> ' + xhr.statusText);
    }
  };
  xhr.send(stringifiedData);
}

It is not strictly necessary to load an external script. While still a workaround, you can also pass a fetch reference into the tag through a "JavaScript Variable" type variable:
1. Create a GTM variable of type "JavaScript Variable" with the content "fetch", thus referencing "window.fetch". Name it, e.g., "js.fetchReference".
2. Add a text field to your Custom Tag, e.g. named "fetchReference".
3. Use data.fetchReference in your Custom Tag's code like you would normally use window.fetch.
4. Make sure the tag instance actually references the variable created in step 1 with {{js.fetchReference}}.
I jotted this down with screenshots at https://hume.dev/articles/post-request-custom-template/
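A minimal sketch of how the tag code might use that reference follows. It is written as plain JavaScript so it can be tried outside GTM. The function name postViaFetchReference, the endpoint and the payload are hypothetical, and it assumes the template field is exposed as data.fetchReference. Inside the sandboxed runtime, JSON would have to be obtained with require('JSON') rather than used as a global.

```javascript
// Hypothetical helper: sends `payload` as a JSON POST body through the fetch
// reference that was passed into the tag via a template field.
function postViaFetchReference(data, endpoint, payload) {
  // Assumption: the field is named "fetchReference" and is bound to a
  // "JavaScript Variable" that points at window.fetch.
  if (typeof data.fetchReference !== 'function') {
    data.gtmOnFailure();
    return false;
  }
  data.fetchReference(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  });
  // The request is fire-and-forget here; the tag reports success once sent.
  data.gtmOnSuccess();
  return true;
}
```

The sketch does not inspect the response. It only signals that the request was dispatched, which mirrors how a tracking pixel behaves.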

Create Native Client MediaStreamVideoTrack and send to javascript

According to the docs for the Native Client MediaStreamVideoTrack there is a constructor that "Constructs a MediaStreamVideoTrack that outputs given frames to a new video track, which will be consumed by Javascript."
My idea was to put frames into this video track, which could later be displayed by JavaScript in a video tag or passed to an RTCPeerConnection.
I don't know if I'm doing it correctly, but the docs for PostMessage state that passing a resource should be supported. Yet with the simple Native Client code below, I only get a warning in the browser console: "Failed to convert a PostMessage argument from a PP_Var to a Javascript value. It may have cycles or be of an unsupported type."
virtual void HandleMessage(const pp::Var& var_message) {
if (!var_message.is_dictionary()) {
LogToConsole(PP_LOGLEVEL_ERROR, pp::Var("Invalid message!"));
return;
}
pp::VarDictionary var_dictionary_message(var_message);
std::string command = var_dictionary_message.Get("command").AsString();
if (command == "create_track") {
pp::MediaStreamVideoTrack video_track = pp::MediaStreamVideoTrack::MediaStreamVideoTrack(this);
pp::VarDictionary dictionary;
dictionary.Set(pp::Var("track"), pp::Var(video_track));
PostMessage(dictionary);
}
}
Am I doing something wrong, or is this just not supported? :)

A MediaStreamVideoTrack can only be created by the page (not the plugin) and passed to the plugin by PostMessage. See this example code:
https://code.google.com/p/chromium/codesearch#chromium/src/ppapi/examples/media_stream_video/media_stream_video.cc
https://code.google.com/p/chromium/codesearch#chromium/src/ppapi/examples/media_stream_video/media_stream_video.html
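The page-side half of that flow might look roughly like the sketch below. It is only an illustration under assumptions: the helper name postTrackToPlugin, the message shape with a 'command' key set to 'init', and the element id in the usage comment are hypothetical and not taken from the linked example.

```javascript
// Hypothetical helper: takes a MediaStream obtained by the page and hands its
// first video track to the Native Client module via postMessage.
function postTrackToPlugin(stream, plugin) {
  const tracks = stream.getVideoTracks();
  if (tracks.length === 0) {
    throw new Error('stream has no video track');
  }
  // The track is created by the page and passed to the plugin, not the
  // other way around.
  plugin.postMessage({ command: 'init', track: tracks[0] });
  return tracks[0];
}

// Usage in the page (browser only):
// const plugin = document.getElementById('plugin'); // hypothetical <embed> id
// navigator.mediaDevices.getUserMedia({ video: true })
//   .then(function (stream) { postTrackToPlugin(stream, plugin); });
```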

How to feed site details on controller in dashlet in alfresco

How can I directly get site details (such as ID and name) in main() of the controller JS of a dashlet in Alfresco?
I can use "Alfresco.constants.SITE" in the FTL file to read the site ID, but I need to know whether there is any key for reading this data in the controller.
janaka

There isn't a service on the Share side which provides that information, because the information you want is only held on the repository. As such, you'll need to call one of the REST APIs on the Repo to get the information you need.
Your code would want to look something like:
// Call the repository for the site profile
var json = remote.call("/api/sites/" + page.url.templateArgs.site);
if (json.status == 200)
{
// Create javascript objects from the repo response
var obj = eval('(' + json + ')');
if (obj)
{
var siteTitle = obj.title;
var siteShortName = obj.shortName;
}
}
You can see a fuller example of this in various Alfresco dashlets, eg the Dynamic Welcome dashlet

Query string is removed from Open Graph URL

In relation to this question: Dynamic generation of Facebook Open Graph meta tags
I have followed these instructions but the api seems to remove my query string so that the url passed into the aggregation contains none of my dynamic information. If I enter the url with the query string into the debugger it doesn't remove it and works fine. I can confirm my og:url meta tag does also contain the same query string not just the base url. What am I doing wrong?

I was having a similar issue and solved it like this:
So assuming you're doing your POST request as shown in the tutorial, your JavaScript probably looks something like this:
function postNewAction()
{
  var passString = '&object=http://yoursite.com/appnamespace/object.php';
  FB.api('/me/APP_NAMESPACE:ACTION' + passString, 'post',
    function(response) {
      if (!response || response.error) {
        // response may be missing entirely, so guard before reading the message
        alert(response && response.error ? response.error.message : 'Unknown error');
      }
      else {
        alert('Post was successful! Action ID: ' + response.id);
      }
    }
  );
}
And since you say you want to generate meta tags dynamically, you're probably adding a parameter to the url (passString) there like so:
passString = '&object=http://yoursite.com/appnamespace/object.php?user=' + someuser;
This is wrong.
What you need to do is to make the url a 'pretty url' and use htaccess to decipher it. So:
passString = '&object=http://yoursite.com/appnamespace/object/someuser';
Then your htaccess file will tell your site that that url actually equates to
http://yoursite.com/appnamespace/object/object.php?user=someuser
Then you can use GET to store the user parameter with php and insert it however you like into your meta tags.
In case you're wondering, the og:url meta tag's content will be:
$url = 'http://yoursite.com/appnamespace/object/object.php?user=' . $_GET['user'];
Does that help?
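The pretty-URL construction described above can be sketched as a small helper. This is an illustration only: the name buildObjectUrl is hypothetical, and percent-encoding the user segment is an added precaution that is not part of the original answer.

```javascript
// Hypothetical helper: appends the dynamic value as a path segment instead of
// a query string, so the object URL contains no '?' at all.
function buildObjectUrl(baseUrl, user) {
  // Drop any trailing slashes so exactly one '/' separates base and segment.
  const trimmedBase = baseUrl.replace(/\/+$/, '');
  return trimmedBase + '/' + encodeURIComponent(user);
}

// Usage:
// var passString = '&object=' + buildObjectUrl('http://yoursite.com/appnamespace/object', someuser);
```

The rewrite rule on the server then has to map the last path segment back to the user parameter, as described above.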

Visualization api - hide data from displaying on browser using servlet

Generally a servlet extends HttpServlet, but in the code below
the servlet extends DataSourceServlet,
and the page is produced like this:
the text begins with google.visualization.Query.setResponse
and ends with {c:[{v:'Bob'},{v:'Jane'}]}]}}); in the browser.
Code: http://code.google.com/apis/visualization/documentation/dev/dsl_csv.html
Can you please guide me on how to make the servlet page silent,
so that it does not write its output to the browser and I can directly call the JavaScript page that draws the chart?
I want to integrate all the code, but I am not able to stop this browser output from appearing.
I am new to servlets, please help.

OK, I will explain my question again.
I am writing this servlet code:
http://code.google.com/apis/visualization/documentation/dev/dsl_csv.html#intro
The URL to execute is /CsvDataSourceServlet?url=http://localhost:8084/WebApplication1/F2.csv
When I execute this code, I get the output in my browser. I don't understand how that code is opening my browser and showing
{c:[{v:'Bob'},{v:'Jane'}]}]}}); etc.
Why is this happening? Why is the browser opening to show the result?
Can we figure something out from this code?
http://code.google.com/apis/visualization/documentation/dev/dsl_csv.html#intro
where F2.csv is my *.csv file.
After executing the code, I have to display the result, which I do using the following JavaScript code:
All Examples
// Load the Visualization API and the ready-made Google table visualization.
google.load('visualization', '1', {'packages':['annotatedtimeline']});
// Set a callback to run when the API is loaded.
google.setOnLoadCallback(init);
// Send the queries to the data sources.
function init() {
//var query = new google.visualization.Query('simpleexample?tq=select name,population');
//query.send(handleSimpleDsResponse);
var query = new google.visualization.Query('CsvDataSourceServlet?url=http://localhost:8084/WebApplication1/F2.csv');
query.send(handleCsvDsResponse);
}
// Handle the csv data source query response
function handleCsvDsResponse(response) {
if (response.isError()) {
alert('Error in query: ' + response.getMessage() + ' ' + response.getDetailedMessage());
return;
}
var data = response.getDataTable();
var chart = new google.visualization.AnnotatedTimeLine(document.getElementById('csv_div'));
chart.draw(data, {displayAnnotations: true});
}
CSV Data Source
An organization chart.
The data is taken from the csv data source.
