CasperJS fillXPath https - web-scraping

I'm new to CasperJS, so forgive me if my question is too basic, but I've searched Google for about two hours and haven't found anything that solves my problem.
I want to open the Google Maps page and fill in its search form.
Here is my code:
var casper = require('casper').create({
    pageSettings: {
        javascriptEnabled: true,
        loadImages: true,
        loadPlugins: true,
        userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:37.0) Gecko/20100101 Firefox/37.0"
    }
});
var x = require('casper').selectXPath;
var fields = {};

casper.start('https://www.google.fr/maps/', function() {
    this.echo(this.getCurrentUrl()); // "http://www.google.fr/"
});

casper.then(function() {
    fields['//*[#id="searchboxinput"]'] = '36 Quai des Orfèvres 75001 Paris';
    casper.test.assertExists(x('.//form[#name="searchbox_form"]'));
    this.fillXPath('form[name="searchbox_form"]', fields, false);
});

casper.run();
I right-clicked on the form and used Inspect Element to get the XPath.
Even the assert fails.
I don't understand where I'm making a mistake.
Thanks.
Edit:
I added
this.getCurrentUrl()
in the casper.start() callback, and it returns
about:blank
It seems that CasperJS doesn't want to load the HTTPS page.
What can I do to get around this problem?
PS: The API doesn't provide the information I want.
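For what it's worth, this about:blank symptom on HTTPS pages usually means the PhantomJS engine underneath CasperJS failed the SSL handshake (older PhantomJS versions default to SSLv3, which many servers have disabled). A common workaround, assuming a PhantomJS-based CasperJS install, is to relax the SSL options on the command line; script.js below stands in for the file containing the code above:
casperjs --ssl-protocol=any --ignore-ssl-errors=true script.js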

Related

Is there any way to scrape Google Search results without getting blocked by Captcha?

Say I wanted to scrape results from searching "hi google" (just an example). I'm using Puppeteer with Node.js to scrape. I use the following code:
const puppeteer = require('puppeteer');

const scrape = async function () {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://www.google.com/search?q=hi+google&rlz=1C1CHBF_enUS879US879&oq=hi+google&aqs=chrome..69i57j0l3j46j69i60l3.1667j0j7&sourceid=chrome&ie=UTF-8", { waitUntil: "networkidle2" });
    await page.setViewport({ width: 1366, height: 663 });
    await page.waitForSelector('.xpd');
    let data = await page.evaluate(() => {
        return document.querySelectorAll('.xpd')[16];
    });
    await browser.close();
    return data;
};

scrape().then(function (result) {
    console.log(result);
});
When the browser launches, it immediately goes to a reCAPTCHA page.
Is there any way to get around this issue? I've done some research online, but the results are either 1. quite theoretical, with no clear way to implement them in my code, or 2. Python solutions, and I'm not sure how those would look with Puppeteer. The most helpful suggestion I came across was randomizing the timing of the scraping to make the requests seem human-like, but as you can see it's not working even for retrieving a single data element; it just immediately takes you to a reCAPTCHA page.
Thanks.
This comes down to a large number of factors.
First of all, you'll want to use puppeteer-extra-plugin-stealth (https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth).
This library patches the most common ways in which Puppeteer gets detected.
Secondly, you'll also want to emulate realistic mouse movements. I've found the library ghost-cursor to work very well for that (https://github.com/Xetera/ghost-cursor).
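Here is a minimal sketch of how the two libraries fit together (the search URL and the .xpd selector are illustrative, taken from the question rather than anything these libraries require):
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const { createCursor } = require('ghost-cursor');

puppeteer.use(StealthPlugin()); // patches common headless-detection vectors

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    const cursor = createCursor(page); // generates human-like mouse paths

    await page.goto('https://www.google.com/search?q=hi+google', { waitUntil: 'networkidle2' });
    await cursor.click('.xpd'); // click via a realistic mouse movement
    await browser.close();
})();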
That alone will not be enough, however. You also need to use non-spammed residential proxies or, ideally, 4G proxies.
4G proxies work off a pooled system based on location: they rotate, and they are shared among all mobile-data users on that network in the area.
I recommend using https://rsocks.net UK or USA proxies, or ideally building your own 4G proxy locally to avoid any saturation.
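For reference, a minimal sketch of routing Puppeteer through a proxy (the endpoint and credentials are placeholders):
// inside an async function:
const browser = await puppeteer.launch({
    // --proxy-server is a standard Chromium switch; point it at your proxy endpoint
    args: ['--proxy-server=http://my-4g-proxy.example:8000'],
});
const page = await browser.newPage();
// only needed if the proxy requires authentication
await page.authenticate({ username: 'user', password: 'pass' });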
You will still encounter some CAPTCHAs, so it is worth implementing a solving service such as 2captcha as well.
To further increase your success rate, you will want to use Google account cookies that have history and legitimate or "farmed" activity on them.
The more the cookies attached to the account are used for normal browsing, the more trust your session will have.
To scrape Google search pages, I don't recommend browser automation; you can get what you need from a simple request instead, which needs far fewer resources. For example, you can use axios to make the request and cheerio to parse the HTML with jQuery-like syntax. To reduce the chance of being blocked, always add a User-Agent header to your request. Here is a simple example:
const cheerio = require("cheerio");
const axios = require("axios");

const searchString = "hi google";

const AXIOS_OPTIONS = {
    headers: {
        // https://www.whatismybrowser.com/detect/what-is-my-user-agent/
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    },
    params: { q: `${searchString}`, hl: "en", gl: "us" },
};

async function getResults() {
    return axios.get(`http://www.google.com/search`, AXIOS_OPTIONS).then(function ({ data }) {
        let $ = cheerio.load(data);
        const someResult = $("necessary_selector").text().trim();
        return someResult;
    });
}

getResults().then(console.log);
Unfortunately, even this doesn't always bypass Google's blocking. You can learn more about dealing with blocking in the blog post "Reducing the chance of being blocked while web scraping".

Why is my UrlFetchApp function failing to log in successfully

I'm trying to use Google Apps Script to log in to an ASP.NET website and scrape some data that I typically have to retrieve manually. I've used Chrome Developer Tools to get the correct payload names (TEXT_Username, TEXT_Password, __VIEWSTATE, __VIEWSTATEGENERATOR), and I also got an ASP.NET session ID to send along with my POST request.
When I run my function(s), the response code is 200 if followRedirects is set to false and 302 if followRedirects is set to true. Unfortunately, in neither case do the functions successfully authenticate with the website; instead, the HTML returned is that of the login page.
I've tried different header variants and parameters, but I can't seem to log in successfully.
A couple of other points: when I log in in Chrome with the Developer Tools open, the response code appears to be 302 Found.
Does anyone have any suggestions on how I can successfully log in to this site? Do you see any errors in my functions that could be the cause of my problems? I'm open to any and all suggestions.
My GAS functions follow:
function login(cookie, viewState, viewStateGenerator) {
    var payload = {
        "__VIEWSTATE": viewState,
        "__VIEWSTATEGENERATOR": viewStateGenerator,
        "TEXT_Username": "myUserName",
        "TEXT_Password": "myPassword",
    };
    var header = { 'Cookie': cookie };
    Logger.log(header);
    var options = {
        "method": "post",
        "payload": payload,
        "followRedirects": false,
        "headers": header
    };
    var browser = UrlFetchApp.fetch("http://tnetwork.trakus.com/tnet/Login.aspx?", options);
    Utilities.sleep(1000);
    var html = browser.getContentText();
    var response = browser.getResponseCode();
    var cookie2 = browser.getAllHeaders()['Set-Cookie'];
    Logger.log(response);
    Logger.log(html);
}
function loginPage() {
    var options = {
        "method": "get",
        "followRedirects": false,
    };
    var browser = UrlFetchApp.fetch("http://tnetwork.trakus.com/tnet/Login.aspx?", options);
    var html = browser.getContentText();
    // Utilities.sleep(500);
    var response = browser.getResponseCode();
    var cookie = browser.getAllHeaders()['Set-Cookie'];
    login(cookie);
    var regExpGen = new RegExp("<input type=\"hidden\" name=\"__VIEWSTATEGENERATOR\" id=\"__VIEWSTATEGENERATOR\" value=\"(.*)\" \/>");
    var viewStateGenerator = regExpGen.exec(html)[1];
    var regExpView = new RegExp("<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"(.*)\" \/>");
    var viewState = regExpView.exec(html)[1];
    var response = login(cookie, viewState, viewStateGenerator);
    return response;
}
I call the script by running the loginPage() function. This function obtains the cookie (session ID) and then calls the login function, passing the session ID (cookie) along.
Here is what I see in the Network section of Chrome's Developer Tools when I log in using the browser:
Remote Address: 66.92.89.141:80
Request URL: http://tnetwork.trakus.com/tnet/Login.aspx
Request Method: POST
Status Code:302 Found
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language: en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length: 252
Content-Type:application/x-www-form-urlencoded
Cookie: ASP.NET_SessionId=jayaejut5hopr43xkp0vhzu4; userCredentials=username=myUsername; .ASPXAUTH=A54B65A54A850901437E07D8C6856B7799CAF84C1880EEC530074509ADCF40456FE04EC9A4E47D1D359C1645006B29C8A0A7D2198AA1E225C636E7DC24C9DA46072DE003EFC24B9FF2941755F2F290DC1037BB2B289241A0E30AF5CB736E6E1A7AF52630D8B31318A36A4017893452B29216DCF2; __utma=260442568.1595796669.1421539534.1425211879.1425214489.16; __utmc=260442568; __utmz=260442568.1421539534.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=190106350.1735963725.1421539540.1425152706.1425212185.18; __utmc=190106350; __utmz=190106350.1421539540.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
Host:tnetwork.trakus.com
Origin:http://tnetwork.trakus.com
Referer:http://tnetwork.trakus.com/tnet/Login.aspx?
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
Form Data:
__VIEWSTATE: O7YCnq5e471jHLqfPre/YW+dxYxyhoQ/VetOBeA1hqMubTAAUfn+j9HDyVeEgfAdHMl+2DG/9Gw2vAGWYvU97gml5OXiR9E/9ReDaw9EaQg836nBvMMIjE4lVfU=
__VIEWSTATEGENERATOR:F4425990
TEXT_Username:myUsername
TEXT_Password:myPassword
BUTTON_Submit: Log In
Update: It appears that the website is using an HttpOnly cookie. As a result, I don't think I'm capturing the whole cookie, and therefore my header is not correct. I believe I need to set followRedirects to false and handle the redirect and cookie manually. I'm currently researching this process, but I welcome input from anyone who has been down this road.
I was finally able to log in to the page successfully. The issue seems to be that UrlFetchApp was unable to follow the redirect. I credit this Stack Overflow post: how to fetch a wordpress admin page using google apps script
That post described the following process, which led to my successful login:
1. Set followRedirects to false.
2. Submit the POST and capture the cookies.
3. Use the captured cookies to issue a GET to the appropriate URL.
Here is the relevant code:
var url = "http://myUrl.com/";
var options = {
    "method": "post",
    "payload": {
        "TEXT_Username": "myUserName",
        "TEXT_Password": "myPassword",
        "BUTTON_Submit": "Log In",
    },
    "testcookie": 1,
    "followRedirects": false
};
var response = UrlFetchApp.fetch(url, options);
if (response.getResponseCode() == 200) {
    // Incorrect user/pass combo
} else if (response.getResponseCode() == 302) {
    // Logged-in
    var headers = response.getAllHeaders();
    if (typeof headers['Set-Cookie'] !== 'undefined') {
        // Make sure that we are working with an array of cookies
        var cookies = typeof headers['Set-Cookie'] == 'string' ? [headers['Set-Cookie']] : headers['Set-Cookie'];
        for (var i = 0; i < cookies.length; i++) {
            // We only need the cookie's value - it might have path, expiry time, etc. here
            cookies[i] = cookies[i].split(';')[0];
        }
        url = "http://myUrl/Calendar.aspx";
        options = {
            "method": "get",
            // Set the cookies so that we appear logged-in
            "headers": {
                "Cookie": cookies.join(';')
            }
        };
        ...
I notice that the Chrome payload you provided includes BUTTON_Submit: Log In, but your POST payload does not. I have found that POSTs in GAS go much more smoothly when I explicitly set a submit variable in my payload objects. In any case, if you're trying to emulate what Chrome is doing, this is a good first step.
So in your case, it's a one-line change:
var payload = {
    "__VIEWSTATE": viewState,
    "__VIEWSTATEGENERATOR": viewStateGenerator,
    "TEXT_Username": "myUserName",
    "TEXT_Password": "myPassword",
    "BUTTON_Submit": "Log In"
};

HTTP requests in Intel XDK

I built an app on the Intel XDK platform before the Feb 23rd update, and now that the software has updated, the emulator just crashes when I try to run it.
Previously, I sent a request to a login-processing PHP page in the following way:
$(document).ready(function () {
    $('form.login').submit(function () {
        var user = $(this).find("[name='user']").val();
        var pass = $(this).find("[name='pass']").val();
        var sublogin = $(this).find("[name='sublogin']").val();
        // ...
        $.ajax({
            type: "POST",
            url: "http://www.domain.com/data/apps/project1/process.php",
            data: {
                user: user,
                pass: pass,
                sublogin: sublogin,
            },
            success: function (response) {
                if (response == "1") {
                    $("#responsecontainer").html(response);
                    window.location.href = "menu.html";
                }
                // Login failed
                else {
                    $("#responsecontainer").html(response);
                }
                //alert(response);
            }
        });
        this.reset();
        return false;
    });
});
However, it seems that this is the piece of code causing the problems; if I remove it, the project no longer crashes.
When I read through the Intel XDK documents, they only show HTTP requests that fetch XML files.
So I was hoping somebody might know why this is causing the problem, or how I might restructure it so that Intel XDK doesn't crash.
There is a regression bug with regard to relative location URLs referenced through the emulator; a fix is being worked on. This affects the emulator only: your app should work fine in the Test tab using App Preview on the device, and in a build.
Until we come up with a fix for the emulator crash, here is a workaround. The issue arises when you try to change the location of the current page with window.location.href = "menu.html"; and the emulator is not able to resolve the relative path during the ajax call.
Please use the following code as a workaround.
var newLocation = 'menu.html';
if (window.tinyHippos) {
    // special case for emulator
    newLocation = getWebRoot() + newLocation;
}
document.location.href = newLocation;

function getWebRoot() {
    "use strict";
    var path = window.location.href;
    path = path.substring(0, path.lastIndexOf('/'));
    path += '/';
    return path;
}
Swati

How to fetch a wordpress admin page using google apps script?

I need to fetch a page inside my WordPress blog admin area. The following script:
function fetchAdminPage() {
    var url = "http://www.mydomain.invalid/wp/wp-admin/wp-login.php";
    var options = {
        "method": "post",
        "payload": {
            "log": "admin",
            "pwd": "password",
            "wp-submit": "Login",
            "redirect_to": "http://www.mydomain.invalid/wp/wp-admin/edit-comments.php",
            "testcookie": 1
        }
    };
    var response = UrlFetchApp.fetch(url, options);
    ...
}
is executed without errors. However, response.getContentText() returns the login page, and I am not able to access http://www.mydomain.invalid/wp/wp-admin/edit-comments.php, which is the page I want to fetch.
Any idea how to do this?
There might be an issue with Google Apps Script and POSTing to a URL that answers with a redirection header.
It seems it might not be possible to follow the redirect from a POST - here's a discussion of the issue:
https://issuetracker.google.com/issues/36754794
Would it be possible to modify your code to not follow redirects, capture the cookies, and then make a second request for your page? I haven't actually used GAS, but here's my best guess from reading the documentation:
function fetchAdminPage() {
    var url = "http://www.mydomain.invalid/wp/wp-admin/wp-login.php";
    var options = {
        "method": "post",
        "payload": {
            "log": "admin",
            "pwd": "password",
            "wp-submit": "Login",
            "testcookie": 1
        },
        "followRedirects": false
    };
    var response = UrlFetchApp.fetch(url, options);
    if (response.getResponseCode() == 200) {
        // Incorrect user/pass combo
    } else if (response.getResponseCode() == 302) {
        // Logged-in
        var headers = response.getAllHeaders();
        if (typeof headers['Set-Cookie'] !== 'undefined') {
            // Make sure that we are working with an array of cookies
            var cookies = typeof headers['Set-Cookie'] == 'string' ? [headers['Set-Cookie']] : headers['Set-Cookie'];
            for (var i = 0; i < cookies.length; i++) {
                // We only need the cookie's value - it might have path, expiry time, etc. here
                cookies[i] = cookies[i].split(';')[0];
            }
            url = "http://www.mydomain.invalid/wp/wp-admin/edit-comments.php";
            options = {
                "method": "get",
                // Set the cookies so that we appear logged-in
                "headers": {
                    "Cookie": cookies.join(';')
                }
            };
            response = UrlFetchApp.fetch(url, options);
        }
    }
    ...
}
You would obviously need to add some debugging and error handling, but this should get you through.
What happens here is that we first POST to the login form. Assuming everything goes correctly, that should give us back a response code of 302 (Found). If that's the case, we then process the headers, looking specifically for the Set-Cookie header. If it's set, we strip the unneeded parts and store the cookie values.
Finally, we make a new GET request to the desired admin page (in this case /wp/wp-admin/edit-comments.php), but this time we attach a Cookie header containing all of the cookies acquired in the previous step.
If everything works as expected, you should get your admin page :)
I would advise storing the cookie information (in case you're going to make multiple requests to your page) in order to save time, resources, and requests; see the sketch below.
Again, I haven't actually tested the code, but in theory it should work. Please test it and come back to me with any findings.
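As for storing the cookies between runs, here is a minimal sketch using Apps Script's CacheService (the cache key, the 25-minute TTL, and the loginAndGetCookies helper are illustrative placeholders for the login flow above):
function getSessionCookies() {
    var cache = CacheService.getScriptCache();
    var cached = cache.get('wpAdminCookies');
    if (cached) {
        return cached; // reuse cookies from an earlier login
    }
    // loginAndGetCookies() stands in for the POST-and-capture logic above,
    // returning the joined cookie string
    var cookies = loginAndGetCookies();
    cache.put('wpAdminCookies', cookies, 1500); // cache for 25 minutes
    return cookies;
}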

Alfresco - submitting dynamic forms to upload.post with javascript

I'm encountering issues with a dashlet that I'm trying to develop for Alfresco. It's a simple drag-and-drop file upload dashlet using HTML5's drag-and-drop and File APIs. For the drop event listener, I call the following function, which is seemingly the cause of all the problems:
function handleFileSelect(evt) {
    var files = evt.target.files || evt.dataTransfer.files,
        tmpForm, tmpDest, tmpMeta, tmpType, tmpName, tmpData;
    dropZone.className = "can-drop";
    evt.stopPropagation();
    evt.preventDefault();
    for (var i = 0, f; f = files[i]; i++) {
        tmpForm = document.createElement('form');
        tmpDest = document.createElement('input');
        tmpDest.setAttribute('type', 'text');
        tmpDest.setAttribute('name', 'destination');
        tmpDest.setAttribute('value', destination);
        tmpForm.appendChild(tmpDest);
        tmpMeta = document.createElement('input');
        tmpMeta.setAttribute('type', 'text');
        tmpMeta.setAttribute('name', 'mandatoryMetadata');
        tmpMeta.setAttribute('value', window.metadataButton.value);
        tmpForm.appendChild(tmpMeta);
        tmpType = document.createElement('input');
        tmpType.setAttribute('type', 'text');
        tmpType.setAttribute('name', 'contenttype');
        tmpType.setAttribute('value', "my:document");
        tmpForm.appendChild(tmpType);
        tmpName = document.createElement('input');
        tmpName.setAttribute('type', 'text');
        tmpName.setAttribute('name', 'filename');
        tmpName.setAttribute('value', f.name);
        tmpForm.appendChild(tmpName);
        tmpData = document.createElement('input');
        tmpData.setAttribute('type', 'file');
        tmpData.setAttribute('name', 'filedata');
        tmpData.setAttribute('value', f);
        tmpForm.appendChild(tmpData);
        Alfresco.util.Ajax.request({
            url: Alfresco.constants.PROXY_URI_RELATIVE + "api/upload",
            method: 'POST',
            dataForm: tmpForm,
            successCallback: {
                fn: function (response) {
                    console.log("SUCCESS!!");
                    console.dir(response);
                },
                scope: this
            },
            failureCallback: {
                fn: function (response) {
                    console.log("FAILED!!");
                    console.dir(response);
                },
                scope: this
            }
        });
    }
}
The server responds with a 500, and if I turn on debug-level logging for web scripts, upload.post returns:
DEBUG [repo.jscript.ScriptLogger] ReferenceError: "formdata" is not defined.
This, to me at least, indicates that the form above isn't being submitted properly (if at all). Digging through it all with the Chrome dev tools, I noticed that the request payload looks drastically different from that of something such as a REST client. The code above results in a request using Content-Type: application/x-www-form-urlencoded, whereas a REST client, or Alfresco Share's standard uploaders, use Content-Type: multipart/form-data. If I need to submit the form using multipart/form-data, what is the easiest way to write out the request body (with the boundaries, Content-Dispositions, etc...) to include the file being uploaded?
I ditched the idea of creating a form element through JavaScript, and instead assume that any browser that supports the File API and the Drag and Drop API will likely also support the XMLHttpRequest Level 2 API. As per HTML5 File Upload to Java Servlet, the code above now reads:
function handleFileSelect(evt) {
    var files = evt.target.files || evt.dataTransfer.files,
        xhr, formData;
    dropZone.className = "can-drop";
    evt.stopPropagation();
    evt.preventDefault();
    for (var i = 0, f; f = files[i]; i++) {
        formData = new FormData();
        formData.append('destination', destination);
        formData.append('mandatoryMetadata', window.metadataButton.value);
        formData.append('contenttype', "my:document");
        formData.append('filename', f.name);
        formData.append('filedata', f);
        formData.append('overwrite', false);
        // a fresh request per file, so one upload doesn't abort the previous
        xhr = new XMLHttpRequest();
        xhr.open("POST", Alfresco.constants.PROXY_URI_RELATIVE + "api/upload");
        xhr.send(formData);
    }
}
with the necessary event listeners to be added later. It would seem that the stock Alfresco AJAX methods heavily modify the underlying requests being made, making it very difficult to simply send a FormData object.
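For reference, here is a sketch of the kind of event listeners alluded to above; attach them to each xhr before calling xhr.send(formData) (the handler bodies are illustrative):
// progress events on the upload side are part of XMLHttpRequest Level 2
xhr.upload.addEventListener('progress', function (e) {
    if (e.lengthComputable) {
        console.log('Uploaded ' + Math.round((e.loaded / e.total) * 100) + '%');
    }
});
xhr.addEventListener('load', function () {
    console.log(xhr.status === 200 ? 'SUCCESS!!' : 'FAILED!! status=' + xhr.status);
});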
