How to find out if a URL is 404 or 301 in crawler4j

Is it possible to find out whether a URL returned 404 or 301 in crawler4j?
@Override
public void visit(Page page) {
String url = page.getWebURL().getURL();
System.out.println("URL: " + url);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
System.out.println("Text length: " + text.length());
System.out.println("Html length: " + html.length());
System.out.println("Number of outgoing links: " + links.size());
}
}
I use this in my crawler code. Can anyone tell me how?

As of crawler4j version 3.3 (released Feb 2012), crawler4j supports handling HTTP status codes for fetched pages.
See StatusHandlerCrawlerExample for an example.
You can also parse pages with Jsoup (a Java HTML parser with the best of DOM, CSS, and jQuery). There is an example that shows how to download a page from a given URL and get the page's status code. I think you should use crawler4j for crawling and Jsoup for page fetching.
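Independently of crawler4j and Jsoup, the status code of a single URL can be checked with the JDK alone. The following is a minimal sketch, not the crawler4j API: the class name `StatusProbe` and its methods are made up for illustration. It turns off automatic redirect following so that a 301 is reported as 301 rather than as the status of the redirect target.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of crawler4j or Jsoup.
final class StatusProbe {

    /** Maps a raw HTTP status code to a coarse label. */
    static String classify(int status) {
        if (status == 301 || status == 308) return "permanent redirect";
        if (status == 302 || status == 303 || status == 307) return "temporary redirect";
        if (status == 404) return "not found";
        if (status >= 200 && status < 300) return "ok";
        return "other";
    }

    /** Sends a HEAD request and returns the status code without following redirects. */
    static int statusOf(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // report 301/302 instead of following them
        conn.setRequestMethod("HEAD");          // headers only, no body download
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();      // does not throw for 4xx/5xx
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more URLs on the command line.
        for (String url : args) {
            int status = statusOf(url);
            System.out.println(url + " -> " + status + " (" + classify(status) + ")");
        }
    }
}
```

Some servers answer HEAD differently from GET, so switching the request method to GET may be needed for those hosts.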

Related

Widevine DRM Content on Exoplayer 2.0

I am trying to play Widevine-encrypted content in an Android TV application using ExoPlayer. I have my video URL, which is served from a CDN and acquired with a ticket. I have my Widevine license URL, a ticket, and an auth token for the license server.
I am creating a drmSessionManager and adding the headers required by the license server as follows:
static final String USER_AGENT = "user-agent";
UUID drmSchemeUuid = C.WIDEVINE_UUID;
mediaDrm = FrameworkMediaDrm.newInstance(drmSchemeUuid);
HttpMediaDrmCallback drmCallback = new HttpMediaDrmCallback("my-license-server", new DefaultHttpDataSourceFactory(USER_AGENT));
// Headers required by the license server
HashMap<String, String> keyRequestProperties = new HashMap<>();
keyRequestProperties.put("ticket-header", ticket);
keyRequestProperties.put("token-header", token);
drmCallback.setKeyRequestProperty("ticket-header", ticket);
drmCallback.setKeyRequestProperty("token-header", token);
new DefaultDrmSessionManager(drmSchemeUuid, mediaDrm, drmCallback, keyRequestProperties)
After this, ExoPlayer handles most of the work, and the following breakpoints are hit:
response = callback.executeKeyRequest(uuid, (KeyRequest) request);
in class DefaultDrmSession
return executePost(dataSourceFactory, url, request.getData(), requestProperties) in HttpMediaDrmCallback
Everything looks fine up to this point: the URL is correct and the headers are set properly.
In the following piece of code, the dataSpec also looks fine. It tries to POST a request to the license server with the correct data, but when the connection is made the response code is 405.
in class : DefaultHttpDataSource
in method : public long open(DataSpec dataSpec)
this.dataSpec = dataSpec;
this.bytesRead = 0;
this.bytesSkipped = 0;
transferInitializing(dataSpec);
try {
connection = makeConnection(dataSpec);
} catch (IOException e) {
throw new HttpDataSourceException("Unable to connect to " + dataSpec.uri.toString(), e,
dataSpec, HttpDataSourceException.TYPE_OPEN);
}
try {
responseCode = connection.getResponseCode();
responseMessage = connection.getResponseMessage();
} catch (IOException e) {
closeConnectionQuietly();
throw new HttpDataSourceException("Unable to connect to " + dataSpec.uri.toString(), e,
dataSpec, HttpDataSourceException.TYPE_OPEN);
}
When I use Postman to make a request to the URL, a GET request returns the following body with a response code of 405:
{
"Message": "The requested resource does not support http method 'GET'." }
A POST request also returns response code 405, but with an empty body.
In both cases the following header is also returned, which I take to mean that the server should accept both GET and POST requests:
Access-Control-Allow-Methods: GET, POST
I have no access to the configuration of the DRM server, and my contacts who are responsible for the DRM server tell me that POST requests must be working, since other clients have managed to play the content from the same DRM server.
I am quite confused at the moment and think I may be missing some configuration in ExoPlayer, since I am quite new to DRM.
Any help would be greatly appreciated.
We figured out the solution. The ticket supplied for the DRM license server was wrong. This works as it is supposed to now and the content is getting played. Just in case anyone somehow gets the same problem or is in need of a basic Widevine content playing code, this works fine at the moment.
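For anyone debugging a similar 405, it can help to replay the license request outside the player, as was done with Postman above. The sketch below uses only the JDK; the class name `LicenseRequestProbe`, the example host, and the empty request body are assumptions for illustration. A real Widevine key request body is opaque data produced by MediaDrm, so this only shows how the server reacts to the HTTP method and headers.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical debugging helper, not part of ExoPlayer.
final class LicenseRequestProbe {

    /** Builds (but does not yet send) a POST request carrying the given headers. */
    static HttpURLConnection prepare(String licenseUrl, Map<String, String> headers) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(licenseUrl).toURL().openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // a POST carries a request body
            headers.forEach(conn::setRequestProperty);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the body and returns the HTTP status code reported by the server. */
    static int send(HttpURLConnection conn, byte[] body) throws IOException {
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: LicenseRequestProbe <licenseUrl> <ticket> <token>");
            return;
        }
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("ticket-header", args[1]); // header names taken from the question
        headers.put("token-header", args[2]);
        // Empty body is a placeholder; the real key request comes from MediaDrm.
        System.out.println("HTTP " + send(prepare(args[0], headers), new byte[0]));
    }
}
```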
Best regards.

JxBrowser: how to get response content before a redirect

I use JxBrowser to track data exchange through the NetworkDelegate interface. I can get all response data except when a redirect occurs.
When a redirect occurs, the onDataReceived event for the original URL is not fired. So how can I get the response content when a redirect occurs, or how can I disable automatic redirection?
Thanks for any help!
There is no data for the original URL. The response in this case usually contains the status line and Location header. For example:
HTTP/1.1 301 Moved Permanently
Location: http://www.example.org/index.asp
You can obtain this information in JxBrowser using the following approach:
Browser browser = new Browser();
browser.getContext().getNetworkService().setNetworkDelegate(new DefaultNetworkDelegate() {
@Override
public void onBeforeRedirect(BeforeRedirectParams params) {
System.out.println("URL: " + params.getURL());
System.out.println("New URL: " + params.getNewURL());
System.out.println("Response code: " + params.getResponseCode());
}
});

Image Url validation in asp.net

I have image URLs and need to check whether each URL is responding.
For example, below are four image URLs. The first is not valid and only the third is valid, but the second and fourth also respond as if they were valid images,
even though there is no image.
http://media.expedia.com/hotels/1000000/90000/84900/84853/84853_744_b.jpg
http://www.iceportal.com/brochures/media/show.aspx?brochureid=ICE19044&did=3073&mtype=3073&type=pic&lang=en&publicid=4175749&resizing=X
http://images.trvl-media.com/hotels/1000000/30000/20400/20313/20313_166_b.jpg
http://www.iceportal.com/brochures/ice/ErrorPages/404.htm?aspxerrorpath=/brochures/media/show_A.aspx
Here is my code:
public static bool CheckUrlExists(string url)
{
try
{
Uri u = new Uri(url);
WebRequest w = WebRequest.Create(u);
w.Method = WebRequestMethods.Http.Head;
using (StreamReader s = new StreamReader(w.GetResponse().GetResponseStream()))
{
return (s.ReadToEnd().Length >= 0);
}
}
catch
{
return false;
}
}
With this code I only catch URLs that return a 404 error, not URLs that show 'Sorry, requested brochure is temporarily un-published' or any other kind of message.
You will need more complex logic to validate whether the URL points to an image. If a resource is missing from the server or is otherwise unavailable, you may get an HTTP error like the infamous 404, which will trigger a WebException. However, that is only part of the story.
Your second URL returns HTTP 200, claiming that the resource is there when in fact it is missing. What you really get is an HTML document explaining that the resource is not available. This is bad practice, but not unheard of.
At the very least, you should examine the MIME type (Content-Type header, see WebResponse.ContentType) of the resource you test. A content type of image/* suggests an image-type resource. If you cannot detect a known MIME type (e.g. if you receive application/octet-stream), you can issue an HTTP GET for the resource and run image type detection on the downloaded content.
I would suggest using HttpWebRequest and HttpWebResponse to do this. They are subclasses of WebRequest and WebResponse and as such are more granular for what you're trying to achieve. The following code works with the example URIs provided:
public static bool CheckUrlExists(string url)
{
    try
    {
        Uri u = new Uri(url);
        HttpWebRequest w = (HttpWebRequest)WebRequest.Create(u);
        w.AllowAutoRedirect = false; // report 302 instead of following the redirect
        w.Method = WebRequestMethods.Http.Head;
        // Dispose the response so the underlying connection is released
        using (HttpWebResponse response = (HttpWebResponse)w.GetResponse())
        {
            return response.StatusCode == HttpStatusCode.OK; // check the HTTP status code
        }
    }
    catch (WebException)
    {
        return false;
    }
}
What's important here is that I'm checking the HTTP status code. Your catch will already handle the 404s, but the problem URIs ultimately lead to a 200 (OK). With AllowAutoRedirect set to false, the HttpWebRequest instance returns a 302 (redirect) status code instead of following the redirect through to the "Sorry, requested brochure is temporarily un-published." page, which returns 200 (OK). This should serve your purpose.
Also: catching a WebException allows you to examine the status code (400+, 500+, etc.).
Be aware, however, that you may be redirected to a new location for the image you're requesting. Given that, you might want to use PeterK's MIME type check.
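The two checks discussed in these answers (status code and MIME type) can be combined into one decision. The sketch below expresses that logic in Java rather than C#, purely as an illustration. The class and method names are hypothetical, and the status and Content-Type values are assumed to come from a request made with automatic redirects disabled.

```java
import java.util.Locale;

// Hypothetical helper illustrating the combined check; not a .NET API.
final class ImageResponseCheck {

    /**
     * Returns true only when the response is 200 OK and its Content-Type
     * declares an image, e.g. "image/jpeg" or "image/png; charset=binary".
     */
    static boolean looksLikeImage(int status, String contentType) {
        if (status != 200 || contentType == null) {
            return false; // redirects, errors, and missing headers are rejected
        }
        // Drop parameters such as "; charset=utf-8" before comparing
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("image/");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeImage(200, "image/jpeg"));
        System.out.println(looksLikeImage(200, "text/html; charset=utf-8"));
        System.out.println(looksLikeImage(302, "image/jpeg"));
    }
}
```

An HTML error page served with status 200 is rejected by the MIME check, and a 302 is rejected by the status check, which covers both problem URLs from the question.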

Redirect to external url - link works in browser but not redirect method

With reference to .net mvc redirect to external url.
So you have your controller setup with the redirect as below:
public ActionResult SiteDetails(short id)
{
return Redirect("localhost:1234/Controller/Action");
}
But:
- Nothing happens when you call the action.
- You expect a redirect, and nothing happens.
- You expect a redirect to another MVC site, and nothing happens.
- On top of that, in the debugger the URL string passed to Redirect works when copied into the browser.
Why doesn't this work?
The string must include 'http://':
public ActionResult SiteDetails(short id)
{
return Redirect("http://localhost:1234/Controller/Action");
}
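One way to see why the scheme matters is to parse both strings with a URI parser. The sketch below uses Java's `java.net.URI` for illustration (the class name `RedirectTargetCheck` is hypothetical, and this is not how ASP.NET MVC validates redirect targets). Without `http://`, the parser reads `localhost` as the scheme and finds no host at all.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; illustrates URI parsing, not ASP.NET MVC internals.
final class RedirectTargetCheck {

    /** True when the target has an http or https scheme and a host. */
    static boolean isAbsoluteHttpUrl(String target) {
        try {
            URI uri = new URI(target);
            String scheme = uri.getScheme();
            boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return http && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        String broken = "localhost:1234/Controller/Action";
        String fixed = "http://localhost:1234/Controller/Action";
        // For the broken string the scheme is "localhost" and the host is null
        System.out.println(new URI(broken).getScheme() + " / " + new URI(broken).getHost());
        System.out.println(isAbsoluteHttpUrl(broken));
        System.out.println(isAbsoluteHttpUrl(fixed));
    }
}
```

A browser's address bar guesses the missing scheme for you, which is why the same string works when pasted there.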

Get Root/Base Url In Spring MVC

What is the best way to get the root/base url of a web application in Spring MVC?
Base Url = http://www.example.com or http://www.example.com/VirtualDirectory
I prefer to use
final String baseUrl =
ServletUriComponentsBuilder.fromCurrentContextPath().build().toUriString();
It returns a completely built URL, scheme, server name and server port, rather than concatenating and replacing strings which is error prone.
If base url is "http://www.example.com", then use the following to get the "www.example.com" part, without the "http://":
From a Controller:
@RequestMapping(value = "/someURL", method = RequestMethod.GET)
public ModelAndView doSomething(HttpServletRequest request) throws IOException{
//Try this:
request.getLocalName();
// or this
request.getLocalAddr();
}
From JSP:
Declare this on top of your document:
<c:set var="baseURL" value="${pageContext.request.localName}"/> //or ".localAddr"
Then, to use it, reference the variable:
<a href="${baseURL}">Go Home</a>
You can also create your own method to get it:
public String getURLBase(HttpServletRequest request) throws MalformedURLException {
URL requestURL = new URL(request.getRequestURL().toString());
String port = requestURL.getPort() == -1 ? "" : ":" + requestURL.getPort();
return requestURL.getProtocol() + "://" + requestURL.getHost() + port;
}
Explanation
I know this question is quite old but it's the only one I found about this topic, so I'd like to share my approach for future visitors.
If you want to get the base URL from a WebRequest you can do the following:
ServletUriComponentsBuilder.fromRequestUri(HttpServletRequest request);
This will give you the scheme ("http" or "https"), host ("example.com"), port ("8080") and the path ("/some/path"), while fromRequest(request) would give you the query parameters as well. But as we want to get the base URL only (scheme, host, port) we don't need the query params.
Now you can just delete the path with the following line:
ServletUriComponentsBuilder.fromRequestUri(HttpServletRequest request).replacePath(null);
TLDR
Finally our one-liner to get the base URL would look like this:
//request URL: "http://example.com:8080/some/path?someParam=42"
String baseUrl = ServletUriComponentsBuilder.fromRequestUri(request)
.replacePath(null)
.build()
.toUriString();
//baseUrl: "http://example.com:8080"
Addition
If you want to use this outside a controller or somewhere, where you don't have the HttpServletRequest present, you can just replace
ServletUriComponentsBuilder.fromRequestUri(HttpServletRequest request).replacePath(null)
with
ServletUriComponentsBuilder.fromCurrentContextPath()
This will obtain the HttpServletRequest through spring's RequestContextHolder. You also won't need the replacePath(null) as it's already only the scheme, host and port.
request.getRequestURL().toString().replace(request.getRequestURI(), request.getContextPath())
Simply :
/*
* Returns the base URL from a request.
*
* @example: http://myhost:80/myapp
* @example: https://mysecuredhost:443/
*/
String getBaseUrl(HttpServletRequest req) {
return ""
+ req.getScheme() + "://"
+ req.getServerName()
+ ":" + req.getServerPort()
+ req.getContextPath();
}
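The concatenation above always appends the port, which yields URLs such as https://mysecuredhost:443/. A variant that leaves out default ports can be sketched with the JDK alone. This is an illustrative helper, not a Spring API: the class name `BaseUrl` is hypothetical, and the context path is passed in as a plain string where a servlet would use `request.getContextPath()`.

```java
import java.net.URI;

// Hypothetical helper; in a servlet the inputs would come from HttpServletRequest.
final class BaseUrl {

    /** Builds scheme://host[:port]contextPath, omitting port 80 for http and 443 for https. */
    static String of(String requestUrl, String contextPath) {
        URI uri = URI.create(requestUrl);
        String scheme = uri.getScheme();
        int port = uri.getPort(); // -1 when the URL carries no explicit port
        boolean defaultPort = port == -1
                || ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        return scheme + "://" + uri.getHost() + (defaultPort ? "" : ":" + port) + contextPath;
    }

    public static void main(String[] args) {
        System.out.println(of("http://example.com:8080/some/path?someParam=42", ""));
        System.out.println(of("https://mysecuredhost:443/myapp/page", "/myapp"));
    }
}
```

Behind a reverse proxy the host and port seen by the servlet may differ from what the browser used, so the result is only as accurate as the request URL it is given.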
In controller, use HttpServletRequest.getContextPath().
In JSP use Spring's tag library: or jstl
Either inject a UriComponentsBuilder:
@RequestMapping(yaddie yadda)
public void doit(UriComponentsBuilder b) {
//b is pre-populated with context URI here
}
Or make it yourself (similar to Salim's answer):
// Get full URL (http://user:pwd@www.example.com/root/some?k=v#hey)
URI requestUri = new URI(req.getRequestURL().toString());
// and strip last parts (http://user:pwd@www.example.com/root)
URI contextUri = new URI(requestUri.getScheme(),
requestUri.getAuthority(),
req.getContextPath(),
null,
null);
You can then use UriComponentsBuilder from that URI:
// http://user:pwd@www.example.com/root/some/other/14
URI complete = UriComponentsBuilder.fromUri(contextUri)
.path("/some/other/{id}")
.buildAndExpand(14)
.toUri();
In JSP
<c:set var="scheme" value="${pageContext.request.scheme}"/>
<c:set var="serverPort" value="${pageContext.request.serverPort}"/>
<c:set var="port" value=":${serverPort}"/>
base url
reference https://github.com/spring-projects/greenhouse/blob/master/src/main/webapp/WEB-INF/tags/urls/absoluteUrl.tag
@RequestMapping(value="/myMapping",method = RequestMethod.POST)
public ModelAndView myAction(HttpServletRequest request){
//then follow this answer to get your Root url
}
Root URL of the servlet
If you need it in a JSP, get it in the controller and add it as an object to the ModelAndView.
Alternatively, if you need it on the client side, use JavaScript to retrieve it:
http://www.gotknowhow.com/articles/how-to-get-the-base-url-with-javascript
I think the answer to this question: Finding your application's URL with only a ServletContext shows why you should use relative url's instead, unless you have a very specific reason for wanting the root url.
If you are only interested in the host part of the URL shown in the browser, read it directly from request.getHeader("host"):
import javax.servlet.http.HttpServletRequest;
@GetMapping("/host")
public String getHostName(HttpServletRequest request) {
request.getLocalName(); // returns the hostname of the machine where the server is running
request.getLocalAddr(); // returns the IP address of the machine where the server is running
return request.getHeader("host");
}
If the request URL is https://localhost:8082/host, this returns:
localhost:8082
Here:
In your .jsp file inside the [body tag]
<input type="hidden" id="baseurl" name="baseurl" value=" " />
In your .js file
var windowurl = window.location.href; // assumed: the original snippet never defines windowurl
var baseUrl = windowurl.split('://')[1].split('/')[0]; // host and port, e.g. "localhost:8082"
var xhr = new XMLHttpRequest();
var url = 'http://' + baseUrl + '/your url in your controller';
xhr.open("POST", url); // POST because that is what I was trying to do
xhr.send(); // sends the request

Resources