How to capture data from third party web site?

For example, I would like to capture the data for the 30 latest events shown in the scrolling info panel at this URL:
http://hazmat.globalincidentmap.com/home.php#
Any idea how to capture it?

What language are you using? In Java, you can get the page's HTML content using something like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

URL url;
InputStream is = null;
BufferedReader br;
String line;
try {
    url = new URL("http://hazmat.globalincidentmap.com/home.php");
    is = url.openStream(); // throws an IOException
    br = new BufferedReader(new InputStreamReader(is));
    while ((line = br.readLine()) != null) {
        // Parse each HTML line here until you find what you want,
        // for example "eventdetail.php?ID", and then read the content
        // of the surrounding <td> tag or whatever else you need.
    }
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
} finally {
    try {
        // Closing the underlying stream is enough; the reader only wraps it
        if (is != null) is.close();
    } catch (IOException ioe) {
        // nothing useful to do if closing fails
    }
}
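The parsing step left as a comment in the loop above could be sketched roughly as follows. This is only an illustration: the markup of the real page is not shown in this thread, so the pattern assumes links of the form <a href="eventdetail.php?ID=123">Title</a>, and the class name EventLinkExtractor, the method extractEvents, and the sample HTML are all made up for the example. For anything beyond a quick job, a real HTML parser is usually more robust than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EventLinkExtractor {
    // Assumed markup: <a href="eventdetail.php?ID=123">Title</a>. The real page may differ.
    private static final Pattern EVENT_LINK = Pattern.compile(
            "<a[^>]*href=\"eventdetail\\.php\\?ID=(\\d+)\"[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    /** Returns up to limit entries formatted as "ID: title", in page order. */
    static List<String> extractEvents(String html, int limit) {
        List<String> events = new ArrayList<>();
        Matcher m = EVENT_LINK.matcher(html);
        while (events.size() < limit && m.find()) {
            events.add(m.group(1) + ": " + m.group(2).trim());
        }
        return events;
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the downloaded page content
        String sample = "<td><a href=\"eventdetail.php?ID=101\">Chemical spill</a></td>"
                + "<td><a href=\"eventdetail.php?ID=102\"> Gas leak </a></td>";
        System.out.println(extractEvents(sample, 30));
    }
}
```

To use it with the download code, you would collect the lines read from the stream into one string and pass that string to extractEvents with a limit of 30.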
Example in PHP:
$c = curl_init('http://hazmat.globalincidentmap.com/home.php');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
if (curl_error($c)) {
    die(curl_error($c));
}
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);
curl_close($c);
Then parse the contents of the $html variable.

Related

Xamarin.Forms failing to use EvoHtmlToPdfclient in order to convert html string to a pdf file

I'm using Xamarin.Forms and trying to convert an HTML string to a PDF file with EvoPdfConverter. When execution reaches the line htmlToPdfConverter.ConvertHtmlToFile(htmlData, "", myDir.ToString()); in the snippet below, the app just freezes and does nothing. It seems to be trying to connect to the given IP address and failing, but I don't get any errors or exceptions; even the catch block is never reached. Does anybody know how to resolve this issue? Here is my code:
public void ConvertHtmlToPfd(string htmlData)
{
ServerSocket s = new ServerSocket(0);
HtmlToPdfConverter htmlToPdfConverter = new
HtmlToPdfConverter(GetLocalIPAddress(),(uint)s.LocalPort);
htmlToPdfConverter.TriggeringMode = TriggeringMode.Auto;
htmlToPdfConverter.PdfDocumentOptions.CompressCrossReference = true;
htmlToPdfConverter.PdfDocumentOptions.PdfCompressionLevel = PdfCompressionLevel.Best;
if (ContextCompat.CheckSelfPermission(Android.App.Application.Context, Manifest.Permission.WriteExternalStorage) != Permission.Granted)
{
ActivityCompat.RequestPermissions((Android.App.Activity)Android.App.Application.Context, new String[] { Manifest.Permission.WriteExternalStorage }, 1);
}
if (ContextCompat.CheckSelfPermission(Android.App.Application.Context, Manifest.Permission.ReadExternalStorage) != Permission.Granted)
{
ActivityCompat.RequestPermissions((Android.App.Activity)Android.App.Application.Context, new String[] { Manifest.Permission.ReadExternalStorage }, 1);
}
try
{
// create the HTML to PDF converter object
if (Android.OS.Environment.IsExternalStorageEmulated)
{
root = Android.OS.Environment.ExternalStorageDirectory.ToString();
}
htmlToPdfConverter.LicenseKey = "4W9+bn19bn5ue2B+bn1/YH98YHd3d3c=";
htmlToPdfConverter.PdfDocumentOptions.PdfPageSize = PdfPageSize.A4;
htmlToPdfConverter.PdfDocumentOptions.PdfPageOrientation = PdfPageOrientation.Portrait;
Java.IO.File myDir = new Java.IO.File(root + "/Reports");
try
{
myDir.Mkdir();
}
catch (Exception e)
{
string message = e.Message;
}
Java.IO.File file = new Java.IO.File(myDir, filename);
if (file.Exists()) file.Delete();
htmlToPdfConverter.ConvertHtmlToFile(htmlData, "", myDir.ToString());
}
catch (Exception ex)
{
string message = ex.Message;
}
}

Could you try passing a base URL as the second parameter of the ConvertHtmlToFile call? You passed an empty string. The base URL lets the converter resolve relative URLs found in the HTML to full URLs. The converter may stall while trying to retrieve content from invalid resource URLs.
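To show what resolving relative URLs against a base URL means in general, here is a small Java sketch using the standard java.net.URI class. It does not use the EvoPdf API at all; the class name BaseUrlDemo and the example.com URLs are placeholders chosen for the illustration.

```java
import java.net.URI;

class BaseUrlDemo {
    /** Resolves a possibly relative reference against a base URL. */
    static String resolve(String baseUrl, String reference) {
        return URI.create(baseUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/reports/"; // hypothetical base URL
        // A path-relative reference is appended to the base directory
        System.out.println(resolve(base, "img/logo.png"));
        // A root-relative reference replaces the whole path
        System.out.println(resolve(base, "/css/site.css"));
    }
}
```

Without a usable base, references such as img/logo.png stay relative and cannot be fetched, which is consistent with the delays described above.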

Invoke PDF, Server returned HTTP response code: 400

English is not my native language; please excuse any typing errors.
I hope I have improved the question.
I run a script with Selenium WebDriver for automated testing:
I click a link on an HTML page.
The link opens a new browser tab.
The new tab displays a PDF/A document.
I want to parse the PDF and verify its content.
When I open the URL with HttpURLConnection, I get HTTP response status code 400,
and the InputStream cannot be parsed.
Against localhost it works.
I can also open the "current URL" link without any problems.
Selenium v2.45.0
Tested on IE 9 and Firefox 39.0.
How can I investigate the problem?
I have read about HTTP response code 400
but didn't find a solution for my problem.
Could it be a redirect problem?
Thank you
Here is my code:
public class httperrorStack {
public static void main ( String[] args )
{
// Selenium WebDriver
WebDriver driver = null;
//driver = new FirefoxDriver();
// Test with browser IE 9; same problem with Firefox 39.0
System.setProperty("webdriver.ie.driver","C:\\dir\\IEDriverServer.exe");
driver = new InternetExplorerDriver();
String baseUrl = "http://##.###.###.##:#####/wps/portal"; //ATU2
// launch direct to Base URL
driver.get(baseUrl);
// launch to Documentlink: opens PDF/A-Document in a new Browser Tab
driver.findElement(By.xpath("//a[@title='document_link']")).click();
//Get all the window handles in a set
Set <String> handles =driver.getWindowHandles();
Iterator<String> it = handles.iterator();
//iterate through windows
while (it.hasNext()){
String parent = it.next();
String newwin = it.next();
driver.switchTo().window(newwin);
URL url = null;
try {
// The CurrentURL of the documentlink
// url = "http://##.###.###.###:#####/wps/myportal/dirname/dirname/dir-name/!dir/p/b1/about50characters/In"
url = new URL(driver.getCurrentUrl());
} catch (MalformedURLException e1) {
e1.printStackTrace();
}
// verify connection
HttpURLConnection connection = null;
HttpURLConnection httpConn = (HttpURLConnection)connection;
try {
httpConn = (HttpURLConnection)url.openConnection();
} catch (IOException e1) {
e1.printStackTrace();
}
InputStream is = null;
// Response Code = 400
try {
if (httpConn.getResponseCode() >= 400) {
is = httpConn.getErrorStream();
// httpConn.getResponseCode() = 400
//httpConn.getErrorStream() = "sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@1d082e88"
} else {
is = httpConn.getInputStream();
}
} catch (IOException e1) {
e1.printStackTrace();
}
try{
connection = (HttpURLConnection)url.openConnection();
InputStream is2;
if (connection.getResponseCode() >= 400) {
is2 = connection.getErrorStream();
//connection.getErrorStream() = "sun.net.www.protocol.http.HttpURLConnection$HttpInputStream@60704c"
} else {
is2 = connection.getInputStream();
}
}
catch (Exception e){
e.printStackTrace();
}
BufferedInputStream fileToParse = null;
try {
fileToParse = new BufferedInputStream(
url.openStream());
} catch (IOException e) {
e.printStackTrace();
// output of StackTrace
/*java.io.IOException: Server returned HTTP response code: 400 for URL: http://##.###.###.##:#####/wps/redirect
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.URL.openStream(Unknown Source)
at mypackage.httperrorStack.main(httperrorStack.java:134)*/
}
// Close Window
driver.close();
driver.switchTo().window(parent);
driver.quit();
// exit the program explicitly
System.exit(0);
}
}
}

Wikipedia page parsing program caught in endless graph cycle

My program is caught in a cycle that never ends, and I can't see how it gets into this trap or how to avoid it.
It's parsing Wikipedia data, and I think it's just following a connected component around and around.
Maybe I can store the pages I've already visited in a set, and if a page is in that set I won't go back to it?
This is my project; it's quite small, only three short classes.
This is a link to the data it generates; I stopped it short, otherwise it would have gone on and on.
This is the laughably small toy input that generated that mess.
It's the same project I was working on when I asked this question.
What follows is the entirety of the code.
The main class:
public static void main(String[] args) throws Exception
{
String name_list_file = "/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/02/test/people_test.txt";
String single_name;
try (
// read in the original file, list of names, w/e
InputStream stream_for_name_list_file = new FileInputStream( name_list_file );
InputStreamReader stream_reader = new InputStreamReader( stream_for_name_list_file , Charset.forName("UTF-8"));
BufferedReader line_reader = new BufferedReader( stream_reader );
)
{
while (( single_name = line_reader.readLine() ) != null)
{
//replace this by a URL encoder
//String associated_alias = single_name.replace(' ', '+');
String associated_alias = URLEncoder.encode( single_name , "UTF-8");
String platonic_key = single_name;
System.out.println("now processing: " + platonic_key);
Wikidata_Q_Reader.getQ( platonic_key, associated_alias );
}
}
//print the struc
Wikidata_Q_Reader.print_data();
}
The Wikipedia reader / value grabber:
static Map<String, HashSet<String> > q_valMap = new HashMap<String, HashSet<String> >();
//public static String[] getQ(String variable_entity) throws Exception
public static void getQ( String platonic_key, String associated_alias ) throws Exception
{
//get the corresponding wikidata page
//check the validity of the URL
String URL_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=Search";
URL wikidata_page = new URL(URL_czech);
HttpURLConnection wiki_connection = (HttpURLConnection)wikidata_page.openConnection();
InputStream wikiInputStream = null;
try
{
// try to connect and use the input stream
wiki_connection.connect();
wikiInputStream = wiki_connection.getInputStream();
}
catch(IOException e)
{
// failed, try using the error stream
wikiInputStream = wiki_connection.getErrorStream();
}
BufferedReader wiki_data_pagecontent = new BufferedReader(
new InputStreamReader(
wikiInputStream ));
String line_by_line;
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
// if we can determine it's a disambig page we need to send it off to get all
// the possible senses in which it can be used.
Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
if (disambig_indicator.matches())
{
//off to get the different usages
Wikipedia_Disambig_Fetcher.all_possibilities( platonic_key, associated_alias );
}
else
{
//get the Q value off the page by matching
Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
"wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
"href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
if ( match_Q_component.matches() )
{
String Q = match_Q_component.group(1);
// 'Q' should be appended to an array, since each entity can hold multiple
// Q values on that basis of disambig
put_to_hash( platonic_key, Q );
}
}
}
wiki_data_pagecontent.close();
// \\ // ! PRINT IT ! // \\ // \\ // \\ // \\ // \\ // \\
for (Map.Entry<String, HashSet<String> > entry : q_valMap.entrySet())
{
System.out.println(entry.getKey()+" : " + Arrays.deepToString(q_valMap.entrySet().toArray()) );
}
}
// add a Q value to the set stored in the hash map under the appropriate entity key
public static HashSet<String> put_to_hash(String key, String value )
{
HashSet<String> valSet;
if (q_valMap.containsKey(key)) {
valSet = q_valMap.get(key);
} else {
valSet = new HashSet<String>();
q_valMap.put(key, valSet);
}
valSet.add(value);
return valSet;
}
// print the Q values collected for every entity
public static void print_data()
{
System.out.println("THIS IS THE FINAL DATA SET!!!");
// \\ // ! PRINT IT ! // \\ // \\ // \\ // \\ // \\ // \\
for (Map.Entry<String, HashSet<String> > entry : q_valMap.entrySet())
{
System.out.println(entry.getKey()+" : " + Arrays.deepToString(q_valMap.entrySet().toArray()) );
}
}
Dealing with disambiguation pages:
public static void all_possibilities( String platonic_key, String associated_alias ) throws Exception
{
System.out.println("this is a disambig page");
//if it's a disambig page we know we can go right to the Wikipedia
//get it's normal wiki disambig page
String URL_czech = "https://en.wikipedia.org/wiki/" + associated_alias;
URL wikidata_page = new URL(URL_czech);
HttpURLConnection wiki_connection = (HttpURLConnection)wikidata_page.openConnection();
InputStream wikiInputStream = null;
try
{
// try to connect and use the input stream
wiki_connection.connect();
wikiInputStream = wiki_connection.getInputStream();
}
catch(IOException e)
{
// failed, try using the error stream
wikiInputStream = wiki_connection.getErrorStream();
}
// parse the input stream using Jsoup
Document docx = Jsoup.parse(wikiInputStream, null, wikidata_page.getProtocol()+"://"+wikidata_page.getHost()+"/");
//this can handle the less structured ones.
Elements linx = docx.select( "p:contains(" + associated_alias + ") ~ ul a:eq(0)" );
for (Element linq : linx)
{
System.out.println(linq.text());
String linq_nospace = URLEncoder.encode( linq.text() , "UTF-8");
Wikidata_Q_Reader.getQ( platonic_key, linq_nospace );
}
}
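The idea floated in the question, keeping a set of already visited pages and refusing to revisit them, can be sketched like this. It is a toy model, not a patch for the project above: the link graph is an in-memory map standing in for the real Wikidata and Wikipedia requests, and the names VisitedSetCrawler, crawl, and order are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VisitedSetCrawler {
    // page -> pages it links to (a stand-in for fetching and parsing real pages)
    private final Map<String, List<String>> links;
    private final Set<String> visited = new HashSet<>();
    private final List<String> order = new ArrayList<>();

    VisitedSetCrawler(Map<String, List<String>> links) {
        this.links = links;
    }

    /** Visits the page and everything reachable from it, each page at most once. */
    void crawl(String page) {
        // Set.add returns false when the page is already present, which breaks the cycle
        if (!visited.add(page)) {
            return;
        }
        order.add(page);
        for (String next : links.getOrDefault(page, List.of())) {
            crawl(next);
        }
    }

    List<String> order() {
        return order;
    }

    public static void main(String[] args) {
        // A disambiguation page whose entries link straight back to it
        Map<String, List<String>> toy = Map.of(
                "Mercury", List.of("Mercury (planet)", "Mercury (element)"),
                "Mercury (planet)", List.of("Mercury"),
                "Mercury (element)", List.of("Mercury"));
        VisitedSetCrawler crawler = new VisitedSetCrawler(toy);
        crawler.crawl("Mercury");
        System.out.println(crawler.order());
    }
}
```

In the code from the question, the equivalent check would presumably sit at the top of getQ, keyed on associated_alias, since getQ and all_possibilities call each other.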

Check Httpconnection is open or not in blackberry

Before making an HttpConnection from my BlackBerry application, I want to check whether it is open. Without that check, when I tried to make a connection I got a net.rim.device.api.io.ConnectionClosedException.
EDIT: Posted the code from the OP's answer.
Below is my code for the http connection.
public String makePostRequest(String[] paramName, String[] paramValue) {
StringBuffer postData = new StringBuffer();
HttpConnection connection = null;
InputStream inputStream = null;
OutputStream out = null;
try {
connection = (HttpConnection) Connector.open(this.url);
connection.setRequestMethod(HttpConnection.POST);
for (int i = 0; i < paramName.length; i++) {
postData.append(paramName[i]);
postData.append("=");
postData.append(paramValue[i]);
postData.append("&");
}
String encodedData = postData.toString();
byte[] postDataByte = encodedData.getBytes("UTF-8");
connection.setRequestProperty("Content-Language", "en-US");
connection.setRequestProperty("Content-Type",
"application/x-www-form-urlencoded");
// Content-Length must be the number of bytes sent, not the number of characters
connection.setRequestProperty("Content-Length",
Integer.toString(postDataByte.length));
connection.setRequestProperty("Cookie", Constants.COOKIE_TOKEN);
out = connection.openOutputStream();
out.write(postDataByte);
DebugScreen.Log("Output stream..."+out);
DebugScreen.Log("Output stream..."+connection.getResponseCode());
// get the response from the input stream..
inputStream = connection.openInputStream();
DebugScreen.Log("Input stream..."+inputStream);
byte[] data = IOUtilities.streamToBytes(inputStream);
response = new String(data);
} catch ( Exception e) {
UiApplication.getUiApplication().invokeLater(new Runnable() {
public void run() {
WaitingScreen.removePopUP();
Status.show(Constants.CONNETION_ERROR);
}
});
DebugScreen.Log("Exception inside the make connection..makePostRequest."
+ e.getMessage());
DebugScreen.Log("Exception inside the make connection..makePostRequest."
+ e.getClass());
}finally {
try {
if(inputStream != null){
inputStream.close();
inputStream = null;
}
if(out != null){
out.close();
out = null;
}
if(connection != null){
connection.close();
connection = null;
}
} catch ( Exception ex) {
UiApplication.getUiApplication().invokeLater(new Runnable() {
public void run() {
WaitingScreen.removePopUP();
}
});
DebugScreen.Log("Exception from the connection2 class.."
+ ex.getMessage());
DebugScreen.Log("Exception from the connection2 class.."
+ ex.getClass());
}
}
return response;
}

"Before making httpconnection from blackberry application i want to check if it is open or not."
That doesn't make sense. You want to make sure it is open before you open it, and you can't. You have to try to open it and handle the exception if it fails. That's what the exception is for.
The best way to test whether any resource is available is to try to use it. You can't predict availability in advance; you have to try.
"Because without checking that when i tried to make a connection i got class net.rim.device.api.io.ConnectionClosedException."
So it wasn't available, and now you know. That's the correct behaviour. You're already doing the right thing. There is no question here to answer.
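The "try it and handle the failure" approach can be sketched in plain Java as below. This is not BlackBerry API code: the Opener interface is a hypothetical stand-in for a call such as Connector.open(url), and the names TryToUse and attempt are made up for the illustration.

```java
import java.io.IOException;
import java.util.Optional;

class TryToUse {
    /** Hypothetical stand-in for an operation that may fail, such as opening a connection. */
    interface Opener<T> {
        T open() throws IOException;
    }

    /** Attempts the operation; returns its result, or empty if it failed with an IOException. */
    static <T> Optional<T> attempt(Opener<T> opener) {
        try {
            return Optional.of(opener.open());
        } catch (IOException e) {
            // The failure itself is the availability check: report it and carry on
            System.err.println("Connection unavailable: " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> ok = TryToUse.<String>attempt(() -> "response body");
        Optional<String> failed = TryToUse.<String>attempt(() -> {
            throw new IOException("connection closed");
        });
        System.out.println(ok.isPresent() + " " + failed.isPresent());
    }
}
```

There is no separate "is it open?" step; the caller simply branches on whether a result came back.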

How to check Rss Feed link Is valid or not

I want to check whether a provided RSS feed link is valid and whether it is working right now.

Just load the URL and check that it is actually an RSS feed.
try {
    var feedDoc = XDocument.Load(url);
    return ValidateRss(feedDoc); // implementation left as an exercise for the reader.
}
catch (HttpException) { // perhaps others
    return false;
}
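One possible shape for the validation step that the answer leaves as an exercise is sketched below, in Java rather than C#. It assumes that "valid" only means well-formed XML whose root element is rss and which contains a channel element, so it covers RSS 2.0 style feeds but not Atom. The names RssCheck and looksLikeRss are invented, and the input is the feed text that has already been downloaded.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class RssCheck {
    /** Returns true if the text is well-formed XML with an rss root that contains a channel. */
    static boolean looksLikeRss(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted feeds cannot pull in external entities
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return "rss".equals(root.getTagName())
                    && root.getElementsByTagName("channel").getLength() > 0;
        } catch (Exception e) {
            // Malformed XML, a rejected DOCTYPE, or a parser configuration problem
            return false;
        }
    }

    public static void main(String[] args) {
        String feed = "<rss version=\"2.0\"><channel><title>News</title></channel></rss>";
        System.out.println(looksLikeRss(feed));
        System.out.println(looksLikeRss("<html><body/></html>"));
    }
}
```

Whether the link is "working right now" is answered separately by whether the download itself succeeds.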

Use the code below to check an RSS URL:
using System.ServiceModel.Syndication;
using System.Xml;

public static bool IsValidFeedUrl(string url)
{
    try
    {
        // Dispose the reader even if parsing throws
        using (XmlReader reader = XmlReader.Create(url))
        {
            Rss20FeedFormatter formatter = new Rss20FeedFormatter();
            formatter.ReadFrom(reader);
        }
        return true;
    }
    catch
    {
        // Any failure (network error, malformed XML, not RSS 2.0) means the feed is not usable
        return false;
    }
}
