How to let crawler4j fetch page by relative path?

How to let crawler4j fetch page by relative path? - crawler4j

With Crawler4j, I can fetch page linked by a complete url, such as:
<a href='http://www.domain.com/thelink'>
However I found that if the link is relative, such as:
<a href='/thelink'>
Crawler4j will bypass this link(page), and I even have no chance to see the link in shouldVisit(Page referringPage, WebURL url) method.
I do not see any configuration about this in Crawler4j Github page, do I miss something?

As described in the related issue on the project page, it seems that this behaviour is related to the fact, that this specific web-page does a lot of rendering content using ajax / javascript.
However, crawler4j is not able to render javascript styling on demand as it does not include a javascript engine for this purpose. In addition, the script tag is not scanned for URLS yet.

Related

GitHub Page Custom URL Styling and Pics don't render

My site, Jekyll static pages and a blog, https://omnebonum.github.io/dsu/ on GitHub pages looks like it is supposed to (such as it is). I have a custom URL set up that points the repo to democracystraightup.org. That works but when I go that URL the pages show up without any CSS or pics.
I know this isn't super specific information, but you can check them both out if you like, and any general insights would be appreciated.

All your css, javascript and other assets are returning 404 errors:
The paths are all pointing to subdirectories of http://democracystraightup.org/dsu/ (e.g. http://democracystraightup.org/dsu/js/gallery.js).
Thats wrong.
You need to drop the /dsu to make it work.
For example http://democracystraightup.org/js/gallery.js works totally fine.

Web Component that can request another HTML URL and inject it into it's shadow DOM

I spent some time today with Lit trying to make a simple WebComponent that makes a HTTP GET to a URI, which returns a fully formed HTML document, and I want to inject said HTML document into the WebComponent's shadow DOM; basically this WebComponent acts as a simple proxy for embedding an externally hosted (but trusted) web snippet on my web page. I ran into a few problems:
Lit considers all HTML unsafe, so i had to mark it with Lit's unsafeHTML directive.
Then, I noticed none of the script or link tags in the injected HTML were being followed, so I parsed the incoming HTML as a HtmlDocument, located all the script/link tags, and "re-created" them using document.createElement(...) and returned them in my render(). I'm now noticing that images arent showing up either.
I don't like scraping scripts/links and re-creating them and jamming them into my web component anyhow, but I'm curious - what's the right way to approach this syndicating/consuming syndicated HTML pages/fragments?
Is this a solved problem w/ oEmbed already?
Is this simpler to do with a different WebComponent library?
This seems way harder than it should be at this point.

I think that it has little to do with WebComponents but rather with the HTML5 specs. lit-html uses innerHTML to create elements.
Script elements inserted using innerHTML do not execute when they are inserted
There are still ways to execute JS but this has little to do with your question.
unsafeHTML(`<img src="triggerError" onerror="alert('BOOM')">`)
Regarding the images, it may be a path issue?
This should work:
unsafeHTML(`<img src='http://placehold.it/350x350'>`)

Can not load html file in amp iframe

I want to create an html page and use it in the src attribute of the amp-iframe tag.
The amp-iframe tag then (as i understood from some examples) creates the iframe and loads the html page.
In drupal though, i can not find a way to use this html page in a twig template.
Is there a way to find a path inside drupal for this html page and use this path in the src?
I know drupal works with templates and twig files but in the amp examples everyone uses html.

Problem could be that you're trying to add amp-iframe content that's on the same origin as your page domain. That's forbidden for security reasons (mostly to do with the way the same-origin policy uses synthetic origins inside iframes).
The fix is to make sure that external JS is served from a different origin to your AMP. So if your AMPs are on example.com then you should serve the iframed JS from SOMEOTHERORIGIN.example.com
Share your code for better under standing of your issue.
Also refer - https://github.com/ampproject/amphtml/blob/master/spec/amp-iframe-origin-policy.md

Change HTTP status code for page in Adobe CQ5 (AEM)

I'm trying to support a CQ5 (5.5) installation developed by an outside firm for my company.
It appears that my company wanted a pretty 404 page that looked like the rest of the site, and using the custom Sling 404.jsp error handler to redirect to a regular page that merely says "Page Not Found" was the easiest way to do it. The problem is that the 404 page actually returns a 200 status code since it really is just a regular content page that bears a "Not Found" message on it.
This is causing us problems with Google and the GoogleBot, since Google believes all the old search links to now non-existent pages are still valid (200 status code).
Is there any way to configure CQ to return the appropriate 404 status code for the "not found" HTML page that we display? When I am in the CQ Author mode editing the page, I find nothing in page properties or in components that could be added to the page.
Any help would be appreciated, as CQ is not exactly my area of expertise.

You'll have to overlay /libs/sling/servlet/errorhandler/404.jsp file in order to do so - copy it to /apps/sling/servlet/errorhandler/404.jsp and change according to your specification.
And if you are looking specifically into setting appropriate response status code - you can do it by setting respective response property:
response.setStatus(404);
UPDATE: instead of redirecting to the page_not_found.html you might want to include it to the 404.jsp after setting response status:
<sling:include path="path/page_not_found.html" />

You can set the response code fairly easily with this sort of code: response.setStatus(SlingHttpServletResponse.SC_NOT_FOUND);
So for example, a quick-and-dirty implementation on your page_not_found.jsp would be as follows:
<%
response.setStatus(SlingHttpServletResponse.SC_NOT_FOUND);
%>
(or a longer-term/better implementation would be to set it via a tag and a tag library to avoid scriptlets)
If your page_not_found.html page is a static HTML page and not rendered via a jsp, you may need to change your 404.jsp so it redirects to a page that is rendered via a jsp for this approach to work. The status code is set by the server rendering the response. It is not something intrinsic in the HTML itself, so you won't be able to set this in a regular, static HTML page. Something must be done on the server to set this status code. Also see How to Return Specific HTTP Status Code in a Plain HTML Page

WebRequest retrieved site loads different then original

I am using WebRequest to retrieve a html page from the web and then displaying it using Response.Write.
The resulting page looks different from the original mostly in font and layout.
What could be the possible reasons and how to fix it?

Most probably, the HTML you retrieve contains relative URLs for loading images, stylesheets, scripts. These URLs are not correct for the page as you serve it from your site. You can fix this by converting all of the relative URLs into absolute URLs or by including a BASE tag in the head of the HTML, pointing to the URL of the original page.
Be advised though that deeplinking to images and other resources is considered bad practice. The source site may not like what you are doing.

The reason might be that the original html page contains relative (to the original site) paths to the stylesheet files so when you render the html in your site it cannot find the css.

Does the remote web site include CSS, JavaScript, or images?
If so, are any of the above resources referenced with relative links (i.e.: /javascript/script.js)?
If so, when the browser receives the HTML from your server, the relative links (which were originally relative to the source server) are now relative to your server.
You can fix this by either changing the HTML to use absolute links (i.e.: http://www.server.com/javascript/script.js). This is more complicated than it sounds: you'll need to catch <link href="..."/>, <a href="..."/>, <form action="..."/>, <script src="..."/>, <img src="..."/>, etc.
A more limited solution would be to place the actual resources onto your server in the same structure as they exist on the original server.

The remote site might look at the User-Agent and serve different content based on that.
Also, you should compare the HTML you can retrieve from the remote site, with the HTML you get by visiting the site in a browser. If they are not different, you are probably missing images and/or css and javascript, because of relative paths, as already suggested in another answer.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to let crawler4j fetch page by relative path? - crawler4j

Related

GitHub Page Custom URL Styling and Pics don't render

Web Component that can request another HTML URL and inject it into it's shadow DOM

Can not load html file in amp iframe

Change HTTP status code for page in Adobe CQ5 (AEM)

WebRequest retrieved site loads different then original

Categories

Resources