Scraping a website with the POST method - web-scraping

Hi, I'm scraping the website cleartrip.com.
I get the page info with this method:
$url = "http://www.cleartrip.com/m/flights/results?from=CCU&to=DEL&depart_date=22/06/2012&adults=1&childs=0&infants=0&dep_time=0&class=Economy&airline=&carrier=&x=57&y=16&flexi_search=no&tb=n";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "android Mozilla/5.0 (Linux; U; Android 0.5; en-us)");
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
The problem is that I only want the flights that have a stop. That information is passed through the POST method, but I don't know how to get it.

You'll probably need to search the DOM for the element containing the info you want, possibly with a regular expression.
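As a rough sketch of the DOM search suggested above, the snippet below uses PHP's DOMDocument and DOMXPath rather than a regular expression. The sample markup and the class names "flight" and "stops" are assumptions made up for illustration; the real page will use different markup, so the XPath expression has to be adapted after inspecting the scraped HTML.

```php
<?php
// Sample markup standing in for $curl_scraped_page. The class names
// "flight" and "stops" are hypothetical, not taken from the real site.
$html = '<ul>'
      . '<li class="flight"><span class="stops">Non-stop</span> AI-101</li>'
      . '<li class="flight"><span class="stops">1 stop</span> 6E-202</li>'
      . '</ul>';

// Return the text of every flight row whose "stops" label is not "Non-stop".
function findFlightsWithStops(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world HTML is rarely fully valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//li[@class="flight"][span[@class="stops" and not(contains(., "Non-stop"))]]'
    );

    $flights = [];
    foreach ($nodes as $node) {
        $flights[] = trim($node->textContent);
    }
    return $flights;
}

$flights = findFlightsWithStops($html);
print_r($flights);
```

In practice the string returned by curl_exec would be passed in place of the sample $html.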

Related

Login to Zoho CRM with http call

I'm trying to automate a login to Zoho CRM. I'm trying to log in by sending my credentials in an HTTP call, but it doesn't seem to work. I would like to know if anyone has achieved this.
What I tried:
POST to https://accounts.zoho.com/login
with body:
{
LOGIN_ID: "username",
PASSWORD: "password",
IS_AJAX: "true",
remember :-1,
servicename: "ZohoCRM"
}
The response I get:
Status 200
showErrorAndReload('Please\x20reload\x20the\x20page\x20and\x20try\x20again.');
I just recently had to do a lot of research and trial and error to figure this one out. Be sure to use Postman or something else to test this out for your app.
Get the iamcsr value from the cookie you receive by visiting https://accounts.zoho.com/
Here is what the value looks like
Note: You can't reuse or hardcode that value in. Your app has to generate it each login.
Insert the following values into the link below and POST it.
login_id = your Zoho account login
password = your Zoho account password
unix_timestamp = generate a current unix timestamp with milliseconds
iamcsr = value received from cookie
The parameter values for remember, servicename, and serviceurl always remain the same.
https://accounts.zoho.com/signin/auth?LOGIN_ID={login_id}&PASSWORD={password}&cli_time={unix_timestamp}&remember=2592000&iamcsrcoo={iamcsr}&servicename=AaaServer&serviceurl=https://accounts.zoho.com/u/
If the login succeeded, you will receive the following response:
showsuccess('https\x3A\x2F\x2Faccounts.zoho.com\x2Fu\x2F',"",'', '', '-1', 'dXM\x3D');
This will get you through the login portion and will authenticate you to do further actions assuming your app has stored cookies for the current session.
I made a PHP script based on Christian Barahona's answer:
Get session cookie
$useragent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36';
$cookie_file_path = 'cookie.txt';
$curl = curl_init('https://accounts.zoho.com/signin?servicename=AaaServer&serviceurl=%2Fu%2Fh/');
curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_exec($curl);
Login
// Read the cookies collected by the handle
$cookies = curl_getinfo($curl, CURLINFO_COOKIELIST);
foreach ($cookies as $cookie) {
    $splitted = preg_split('/\s+/', $cookie);
    if ($splitted[0] == "accounts.zoho.com") {
        if ($splitted[sizeof($splitted) - 2] == "iamcsr") {
            $iamscr = $splitted[sizeof($splitted) - 1];
        }
    }
}
// Return the current Unix timestamp in milliseconds
function milliseconds() {
    $mt = explode(' ', microtime());
    return ((int)$mt[1]) * 1000 + ((int)round($mt[0] * 1000));
}
$postValues = array(
    'LOGIN_ID' => '*******',
    'PASSWORD' => '*******',
    'cli_time' => milliseconds(),
    'remember' => '2592000',
    'iamcsrcoo' => $iamscr,
    'servicename' => 'AaaServer',
    'serviceurl' => 'https://accounts.zoho.com/u/h'
);
$postValuesFormatted = http_build_query($postValues);
curl_setopt($curl, CURLOPT_URL, 'https://accounts.zoho.com/signin/auth');
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, $postValuesFormatted);
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_exec($curl);
Curl any page while logged in
curl_setopt($curl, CURLOPT_URL, 'https://accounts.zoho.com/u/h');
curl_setopt($curl, CURLOPT_POST, false);
$result = curl_exec($curl);
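The cookie-reading loop in the script above splits each line on whitespace and counts fields from the end. A slightly more explicit variant is sketched below, assuming the lines returned by CURLINFO_COOKIELIST are in the tab-separated Netscape cookie format (domain, subdomain flag, path, secure, expiry, name, value). The helper name findCookieValue and the sample lines are made up for illustration; the cookie values are placeholders, not real Zoho values.

```php
<?php
// Find a cookie value in Netscape-format lines (tab-separated fields:
// domain, subdomain flag, path, secure, expiry, name, value).
// findCookieValue is a hypothetical helper, not part of cURL.
function findCookieValue(array $cookieLines, string $domain, string $name): ?string
{
    foreach ($cookieLines as $line) {
        $fields = explode("\t", $line);
        if (count($fields) < 7) {
            continue; // not a complete cookie line
        }
        // HttpOnly cookies carry a "#HttpOnly_" prefix on the domain field.
        $cookieDomain = preg_replace('/^#HttpOnly_/', '', $fields[0]);
        // Ignore a leading dot so ".example.com" and "example.com" compare equal.
        if (ltrim($cookieDomain, '.') === ltrim($domain, '.') && $fields[5] === $name) {
            return $fields[6];
        }
    }
    return null; // cookie not present
}

// Made-up sample lines; the values are placeholders.
$sample = [
    "accounts.zoho.com\tFALSE\t/\tTRUE\t0\tiamcsr\tabc123",
    "#HttpOnly_accounts.zoho.com\tFALSE\t/\tTRUE\t0\tJSESSIONID\txyz789",
];
$iamcsr = findCookieValue($sample, 'accounts.zoho.com', 'iamcsr');
```

Returning null when the cookie is missing makes it easy to stop before the login POST instead of sending an empty iamcsrcoo value.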

Cron with Digest Authentication in Symfony

My website is on an Infomaniak cloud server.
I can create a cron task with Digest Authentication.
So I wrote a method, "updateAction()", that updates my database.
If I run this method in my browser (http://site1.com/update/), I authenticate with my username and password. (It works and the database is updated.)
I created a cron job on my Infomaniak cloud server with Digest Authentication and the URL http://site1.com/update/
I receive this mail notification:
Cron name : Update
Date : 19/01/2018 16:51:18
Url : https://site1.com/update/
Result : OK
Detail : 204
Reponse : No response
So the HTTP response is "204: No Content",
and my database is not updated.
Could you help me?
So the solution is:
$url = "http://site1.com/update/";
define('USERNAME', 'user');
define('PASSWORD', '12345');

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// CURLAUTH_BASIC as posted; use CURLAUTH_DIGEST if the server expects Digest authentication
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, USERNAME . ":" . PASSWORD);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
$resp = curl_exec($ch);
if ($resp === false) {
    // cURL reports failures through its return value, not through exceptions
    error_log('cURL error: ' . curl_error($ch));
}
curl_close($ch);

Alfresco login page bypassing

I created a valid ticket using a web service call. The code is shown below:
$url="http://serverip:port/alfresco/service/api/login?u=xxx&pw=xxx";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
$response =curl_exec($ch);
Now, using this ticket, I want to authenticate with Alfresco without entering the username and password again. I also want to create a valid JSESSIONID cookie in the browser with this ticket. Is that feasible?
My purpose is to integrate a PHP application with Alfresco. The PHP application already has an authentication system, so I want to bypass Alfresco's authentication.
You need to append the parameter below to your subsequent requests
alf_ticket="TICKET_WHICH_YOU_GET"
for further authentication.
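A minimal sketch of appending that parameter to a request URL is shown below. The helper name withAlfTicket, the endpoint path, and the ticket value are placeholders chosen for illustration; only the alf_ticket parameter name comes from the answer above.

```php
<?php
// Append alf_ticket to a URL, using "?" or "&" depending on whether
// the URL already has a query string. withAlfTicket is a hypothetical helper.
function withAlfTicket(string $url, string $ticket): string
{
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . 'alf_ticket=' . urlencode($ticket);
}

// Placeholder endpoint and ticket value, for illustration only.
$url = withAlfTicket('http://serverip:port/alfresco/service/some/endpoint', 'TICKET_abc');
echo $url, PHP_EOL;
```

The resulting URL can then be passed to curl_init in the same way as the login call in the question.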
I finally resolved the issue by calling the login page URL http://ip:port/share/page/ via cURL with the login parameters (username and password). I got the JSESSIONID in the cURL response, then set that JSESSIONID in the browser. Now, when you open http://ip:port/share/page/, the login page is bypassed.
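The JSESSIONID normally arrives in a Set-Cookie response header, so it only shows up in the string returned by curl_exec when CURLOPT_HEADER is enabled. The sketch below shows one way to pull it out of raw header text; the helper name extractSessionId and the sample headers are made up for illustration and are not taken from an actual Alfresco response.

```php
<?php
// Extract the JSESSIONID value from raw HTTP response headers.
// extractSessionId is a hypothetical helper name.
function extractSessionId(string $rawHeaders): ?string
{
    // Header names are case-insensitive, hence the "i" flag;
    // "m" lets "^" match at the start of every header line.
    if (preg_match('/^Set-Cookie:\s*JSESSIONID=([^;\s]+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null; // no session cookie in these headers
}

// Made-up sample headers standing in for a cURL response
// fetched with CURLOPT_HEADER set to true.
$sampleHeaders = "HTTP/1.1 302 Found\r\n"
    . "Set-Cookie: JSESSIONID=0123456789ABCDEF; Path=/share; HttpOnly\r\n"
    . "Location: /share/page/\r\n\r\n";

$sessionId = extractSessionId($sampleHeaders);
```

If the function returns null, it is worth checking whether headers were actually included in the response before concluding that the server sent no session cookie.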
Following your suggestion, we tried the cURL call below, but there is no JSESSIONID in the response. Can you please check and let me know how to resolve this?
$post = [
'username' => 'user',
'password' => 'pass',
];
$ch = curl_init('http://ip:port/share/page/dologin/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
// execute!
$response = curl_exec($ch);
// close the connection, release resources used
curl_close($ch);
// do anything you want with your response
var_dump($response);
Please suggest a solution.

httrack follow redirects

I'm trying to mirror web pages recursively, starting from a URL supplied by the user (with a depth limit, of course). Wget didn't pick up links from CSS/JS, so I decided to use httrack.
I try to mirror a site like this:
# httrack http://onet.pl -r6 --ext-depth=6 -O ./a "+*"
This website uses a redirect (301) to http://www.onet.pl:80, and httrack just downloads an index.html page containing:
<a HREF="onet.pl/index.html" >Page has moved</a>
and nothing more! When I run:
# httrack http://www.onet.pl -r6 --ext-depth=6 -O ./a "+*"
it does what I want.
Is there any way to make httrack follow redirects? Currently I just prepend "www." to the URL passed to httrack, but that's not a real solution (it doesn't cover all use cases). Are there any better website mirroring tools for Linux?
On the main httrack forum, one of the developers said that it's not possible.
The proper solution is to use another web mirroring tool.
You could use this script to first determine the real target URL and then run httrack against that URL:
function getCorrectUrl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $out = curl_exec($ch);
    curl_close($ch);
    if ($out === false) {
        // request failed; fall back to the URL we were given
        return $url;
    }
    // line endings are the wonkiest piece of this whole thing
    $out = str_replace("\r", "", $out);
    // only look at the headers
    $headers_end = strpos($out, "\n\n");
    if ($headers_end !== false) {
        $out = substr($out, 0, $headers_end);
    }
    $headers = explode("\n", $out);
    foreach ($headers as $header) {
        // header names are case-insensitive, so match "location: " as well
        if (stripos($header, "Location: ") === 0) {
            return trim(substr($header, 10));
        }
    }
    // no redirect header found: the original URL is already the target
    return $url;
}
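Once the target URL is known, the httrack command can be assembled from PHP. The sketch below only builds the command string, reusing the flags from the question; the helper name buildHttrackCommand is made up for illustration, and escapeshellarg is used because the URL comes from user input.

```php
<?php
// Build the httrack command line for an already-resolved URL.
// buildHttrackCommand is a hypothetical helper; the flags are the
// ones used in the question (-r6 --ext-depth=6 -O <dir> "+*").
function buildHttrackCommand(string $targetUrl, string $outputDir): string
{
    return sprintf(
        'httrack %s -r6 --ext-depth=6 -O %s %s',
        escapeshellarg($targetUrl),  // quote user-supplied input for the shell
        escapeshellarg($outputDir),
        escapeshellarg('+*')         // filter pattern from the question
    );
}

$command = buildHttrackCommand('http://www.onet.pl', './a');
echo $command, PHP_EOL;
```

With getCorrectUrl available, the first argument would be getCorrectUrl($url), and the resulting string could be handed to exec or passthru. Note that getCorrectUrl resolves a single redirect hop; for longer chains, one alternative is to enable CURLOPT_FOLLOWLOCATION and read CURLINFO_EFFECTIVE_URL after the request.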

HTTP code 200 for non-URL strings

$agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
$ch=curl_init();
curl_setopt ($ch, CURLOPT_URL,$url );
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch,CURLOPT_VERBOSE,false);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch,CURLOPT_SSLVERSION,3);
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST, FALSE);
$page=curl_exec($ch);
//echo curl_error($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
I'm using cURL in my project to get the HTTP codes of web sites. But when I try "asdasd" ($url="asdasd"), the returned HTTP code is 200, even though "asdasd" isn't a web site. Why is the HTTP code 200?
You might want to check the return value from curl_exec first. It says in the manual:
Returns TRUE on success or FALSE on failure. However, if the CURLOPT_RETURNTRANSFER option is set, it will return the result on success, FALSE on failure.
For curl_getinfo there's a side note:
Information gathered by this function is kept if the handle is re-used. This means that unless a statistic is overridden internally by this function, the previous info is returned.
So that HTTP 200 might well be the result of a previous cURL invocation.
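A minimal sketch of that check is shown below: the status code is only read when curl_exec did not return FALSE. The function name fetchHttpCode is made up for illustration, and the example call uses a deliberately unsupported scheme so that the transfer fails without touching the network.

```php
<?php
// Return the HTTP status code for $url, or null if the transfer itself failed.
// fetchHttpCode is a hypothetical helper name.
function fetchHttpCode(string $url): ?int
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);

    $page = curl_exec($ch);
    if ($page === false) {
        // The transfer failed (bad URL, DNS error, timeout, ...),
        // so there is no meaningful HTTP status to report.
        return null;
    }
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

// An unsupported scheme makes cURL fail immediately, without a network request.
$code = fetchHttpCode('nosuchscheme://asdasd');
var_dump($code);
```

Using a fresh handle for each call also avoids the stale-info situation described in the curl_getinfo note.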
Another possible cause: if you develop locally and use a managed DNS service, you may get managed error pages for unresolvable names, as OpenDNS does.
