Scraping video links from lazy-loaded videos - web-scraping

I am trying to scrape a video from a page using a package called icrawler. The video is not rendered immediately when the page loads, so the HTML I get back does not contain the video tag, although the tag is there when I open the page in a browser and inspect it.
How do I wait for the page to load the video before crawling it?

The page most likely loads the video using JavaScript, so you would need a library capable of rendering the HTML and executing JavaScript.
I took a quick look at icrawler and, according to its docs, it uses Cheerio, which, quoting from Cheerio's docs, "does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript".
The same docs mention that you could use something like PhantomJS (which seems to be abandoned) or jsdom. Another alternative is to use Selenium.
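As a rough illustration of the Selenium route, here is a minimal sketch in Python (not Node, where icrawler lives). It assumes the selenium package and a Chrome driver are installed. The function names, the 15-second timeout and the idea of reading src attributes from video and source tags are assumptions made for the example, not something taken from icrawler or from the page in question.

```python
from html.parser import HTMLParser


class VideoSourceParser(HTMLParser):
    """Collect src attributes from <video> tags and their nested <source> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self._video_depth = 0  # > 0 while we are inside a <video> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self._video_depth += 1
            if attrs.get("src"):
                self.sources.append(attrs["src"])
        elif tag == "source" and self._video_depth and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "video" and self._video_depth:
            self._video_depth -= 1


def extract_video_sources(html):
    """Return every video URL found in an HTML string."""
    parser = VideoSourceParser()
    parser.feed(html)
    return parser.sources


def fetch_rendered_html(url, timeout=15):
    """Load url in a real browser and return the HTML once a <video> exists."""
    # Imported here so the parsing helper above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until JavaScript has inserted a <video> element (or time out).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "video"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Calling extract_video_sources(fetch_rendered_html(page_url)) would then return the video URLs, where page_url stands for the page being scraped. If the wait times out, Selenium raises a TimeoutException instead of returning HTML without the video.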

Related

Removing render blocking JS and CSS causing issue in my WordPress website

I'm trying to improve the speed of my website. I'm using PageSpeed Insights to check my site's performance, and it told me to remove render-blocking JavaScript and CSS. I did that, and now it's causing problems in my website's design. What should I do to remove render-blocking resources without breaking the design?
Render Blocking CSS
Render blocking CSS will always show on Google Page Speed Insights if you are using external resources for your CSS.
What you need to do is to inline all of your 'above the fold' styles in <style></style> tags in the head of your web page.
I will warn you: this is NOT easy, and plugins that claim to do this often do not work. It requires effort.
To explain what is happening:
1. A user navigates to your site and the HTML starts downloading.
2. As the HTML downloads, the browser tries to work out how to render it correctly, and it expects styling for those elements.
3. Once the HTML has downloaded, if the browser hasn't found styles for the elements that appear above the fold (the initially visible part of the page), it cannot render anything yet.
4. The browser looks for your style sheets, and once they have downloaded it can render the page.
Point 4 is the render blocking, as those resources stop the page from rendering the initial view.
To inline your above-the-fold styles, you need to work out every element that displays without scrolling the page, then find all the styles associated with those elements and inline them.
Render Blocking JS
This one is simpler to fix.
If you are able to use the async attribute on your external JS, then use that.
However, be warned that in a lot of cases this will break your site if you have not designed for it in the first place.
This is because async downloads your JS files in parallel and executes each one as soon as it arrives. If a script depends on another script (e.g. you are using jQuery) and it finishes loading first, it will throw an error. For example, your main.js file uses jQuery but downloads before it; you call $('#element') and get a "$ is not defined" error because jQuery has not been downloaded yet.
The safer attribute to use, if you do not have the knowledge required to implement async without errors, is defer.
With defer, the script is downloaded in parallel while the HTML is parsed but is not executed until parsing has finished. Deferred scripts also execute in the order specified in the HTML.
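The ordering difference between async and defer can be sketched with a toy model in Python. This is not browser code: it assumes all downloads start at the same moment, and the script names and download times are made up for the example.

```python
def execution_order(scripts, mode):
    """Return script names in the order this toy model would run them.

    scripts: list of (name, download_ms) tuples in document order.
    mode: "async" runs each script as soon as its download finishes;
          "defer" runs scripts in document order after parsing.
    """
    if mode == "async":
        # sorted() is stable, so scripts with equal times keep document order.
        return [name for name, _ in sorted(scripts, key=lambda s: s[1])]
    if mode == "defer":
        return [name for name, _ in scripts]
    raise ValueError(f"unknown mode: {mode!r}")


# jQuery is listed first but is larger, so it finishes downloading later.
scripts = [("jquery.js", 120), ("main.js", 30)]
print(execution_order(scripts, "async"))  # main.js first: $ is not defined yet
print(execution_order(scripts, "defer"))  # jquery.js first: main.js can use $
```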
Add async to the script tag and put the CSS and JS at the end of the page.

How to scrape data in a page with jquery button click using HtmlAgility pack

I am trying to scrape data from a shopping website, where the pages have similar content, using HtmlAgilityPack.
There is a button to load more items, but it is not built with an <a> tag. Clicking it loads more items on the same page.
If it were built with an <a> tag, I could take the URL from the tag's href attribute and load a new page for the next items, so there would be no problem.
But here there is no new URL, and the items are loaded on the same page.
Is there any way to implement this? How can I trigger the load more button to get more items?
HtmlAgilityPack is only an HTML parser; it can only parse a static HTML document. What you want may be accomplished using Selenium WebDriver.
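A minimal sketch of that Selenium approach, written in Python rather than C#, could look like the following. The function name, the button selector, the click limit and the fixed pause are all assumptions for illustration; the real site's button would need its own selector, and an explicit wait would be more robust than sleeping.

```python
import time


def click_until_gone(driver, button_css, max_clicks=50, pause=1.0):
    """Click the "load more" button until it disappears or max_clicks is hit.

    driver is any object exposing Selenium's find_elements(by, value) method.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
        buttons = driver.find_elements("css selector", button_css)
        visible = [b for b in buttons if b.is_displayed()]
        if not visible:
            break  # no button left, so everything has been loaded
        visible[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the new items to be inserted
    return clicks
```

With a real browser, this would be called on a Selenium driver after driver.get(...) has loaded the listing page. Once the function returns, driver.page_source holds the fully expanded HTML, which can be handed to a static parser such as HtmlAgilityPack.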
Another possibility: if the number of load actions is small enough that you can complete the loading manually, do so and save the resulting HTML locally, and only afterwards use HtmlAgilityPack to parse the static HTML you stored (instead of parsing the HTTP response).
Share the link of the site you are talking about so I can add some code snippets to exemplify.

How to let crawler4j fetch page by relative path?

With Crawler4j, I can fetch a page linked by a complete URL, such as:
<a href='http://www.domain.com/thelink'>
However, I found that if the link is relative, such as:
<a href='/thelink'>
Crawler4j skips this link (page), and I do not even get a chance to see the link in the shouldVisit(Page referringPage, WebURL url) method.
I do not see any configuration for this on the Crawler4j GitHub page. Am I missing something?
As described in the related issue on the project page, this behaviour seems to be related to the fact that this specific web page renders a lot of its content using AJAX / JavaScript.
However, crawler4j cannot execute JavaScript on demand, as it does not include a JavaScript engine for this purpose. In addition, script tags are not yet scanned for URLs.
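For reference, a relative href such as '/thelink' is normally resolved against the URL of the page it appears on. The sketch below shows that resolution with Python's standard library. It is not crawler4j's code, and the page URL and the other.html link are made up for the example.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn each href found on page_url into an absolute URL."""
    return [urljoin(page_url, href) for href in hrefs]


links = resolve_links(
    "http://www.domain.com/section/index.html",
    ["http://www.domain.com/thelink", "/thelink", "other.html"],
)
for link in links:
    print(link)
# http://www.domain.com/thelink
# http://www.domain.com/thelink
# http://www.domain.com/section/other.html
```

Resolution like this only helps when the link is present in the static HTML; a link that JavaScript adds after the page loads is never seen by a crawler that does not execute scripts.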

Meteor: Exporting rendered template for offline use

I have an online tool for users to build and preview slideshow presentations (uploading images, editing text).
Would there be a way to "export" the content of a rendered slideshow for offline use? This would mean the user could view the presentation locally in a browser using only static files.
Use
var myRenderedHTML = Blaze.toHTMLWithData(templateYouWantToCache, dataUsedToRenderTemplate);
Then use something like the FileSaver.js library to force a download of that content as an HTML file (as in the last demo on this page).

How to add background music in asp.net without bgsound and embed

How can I add background music when a website loads for the first time, without using embed and bgsound? I am using Visual Studio 2010, and these two are not supported in it.
I am developing the website using a master page, and I want to put the code in the master page.
What is the best practice so that some music starts playing for some time when I open my website? I am not much of an expert in .NET with C#, so I am having some trouble with this.
Does the browser also matter?
Regards,
suparna
Add the following code anywhere in the body of your HTML document to embed a music file and play it automatically when a visitor browses your website.
<audio src="music/yoursongname.mp3" autoplay="autoplay" loop="loop"></audio>
Change the "src" attribute so that it contains the path and filename of the music file that you want to embed.
Note: Add the "loop" attribute if you want the music file to play over and over.
