I am trying to scrape some comments from YouTube with Scrapy. However, when I enter the Scrapy shell and open the response with view(response), I cannot find any comments, only the loading spinner.
scrapy shell https://www.youtube.com/watch?v=kkl7-NzqxWo
view(response)
This shows me an infinite spinner in the comment section. How can I load the comments as well so that I can scrape them?
This is because Scrapy does not execute JavaScript, so the comments are loaded by an extra request to https://www.youtube.com/comment_service_ajax?action_get_comments=1... (check the Network tab in the Chrome DevTools panel).
You can:
inspect those extra requests and parse the responses yourself (see the sketch after this list)
use Scrapy + Splash
use other scraping tools that support JS rendering
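As a rough illustration of the first option, here is a minimal Scrapy sketch that replays that AJAX call. The token-extraction regexes and the session_token form field are assumptions about what the watch page currently embeds; this is an internal, undocumented API, so expect it to change without notice.

import json
import re
from urllib.parse import quote

import scrapy
from scrapy import FormRequest

class YoutubeCommentsSpider(scrapy.Spider):
    name = "youtube_comments"
    start_urls = ["https://www.youtube.com/watch?v=kkl7-NzqxWo"]

    def parse(self, response):
        # The watch page embeds a continuation token and a session token in
        # inline JSON; these patterns are guesses and will need adjusting to
        # whatever YouTube currently serves.
        ctoken = re.search(r'"continuation":"([^"]+)"', response.text)
        xsrf = re.search(r'"XSRF_TOKEN":"([^"]+)"', response.text)
        if not (ctoken and xsrf):
            self.logger.error("Tokens not found; has the markup changed?")
            return
        yield FormRequest(
            "https://www.youtube.com/comment_service_ajax"
            "?action_get_comments=1&pbj=1&ctoken=" + quote(ctoken.group(1)),
            formdata={"session_token": xsrf.group(1)},
            callback=self.parse_comments,
        )

    def parse_comments(self, response):
        # The endpoint answers with JSON; inspect the payload in the Network
        # tab before relying on any particular key.
        yield {"raw": json.loads(response.text)}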
I am trying to download content from a website which has a sort of paywall.
You have a number of free articles you can read and then it requires a subscription for you to read more.
However, if you open the link in incognito mode, you can read one more article for each incognito window you open.
So I am trying to download some pages from this site using Python's requests library.
I request the URL and then parse the result using bs4. However, it only works for the first page in the list; the following ones have no article content, just the "buy a subscription" message.
How can I avoid this?
You can try turning off JavaScript in the browser; it may work, but not reliably.
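If the free-article counter lives in cookies, which the fresh-incognito-window behaviour suggests, starting a brand-new requests session per article may mimic that. Here is a minimal sketch under that assumption, with placeholder URLs; it will not help if the site also counts by IP address or browser fingerprint.

import requests
from bs4 import BeautifulSoup

# Hypothetical list of article URLs behind the metered paywall.
urls = [
    "https://example.com/article-1",
    "https://example.com/article-2",
]

headers = {"User-Agent": "Mozilla/5.0"}  # some sites block the default UA

for url in urls:
    # A brand-new session per article: no cookies survive between requests,
    # which is roughly what opening a fresh incognito window does.
    with requests.Session() as session:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else "(no title)"
        print(title)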
So I am using Inno Setup 6, which natively supports downloading files from the internet during installation. I have figured out how to download files given a direct link, from this thread: Inno Setup: Install file from Internet
However, I can't for the life of me figure out how to download the latest version of a file given a permalink URL. My specific example is to download the Microsoft Hosting package.
https://dotnet.microsoft.com/permalink/dotnetcore-current-windows-runtime-bundle-installer
Going to this page automatically downloads the latest package.
Inno doesn't like this link (or I don't know how to get Inno to use it) since it doesn't point to the direct file. If I use the direct link (https://download.visualstudio.microsoft.com/download/pr/24847c36-9f3a-40c1-8e3f-4389d954086d/0e8ae4f4a8e604a6575702819334d703/dotnet-hosting-5.0.6-win.exe) this works for obvious reasons.
I'd like to always download the latest, but I'm not sure how to accomplish this. Any suggestions?
Adding super basic code being used...
// Queue the file on the TDownloadWizardPage: URL, destination file name,
// and an optional expected SHA-256 (empty here, so no hash check).
DownloadPage.Clear;
DownloadPage.Add('https://dotnet.microsoft.com/permalink/dotnetcore-current-windows-runtime-bundle-installer', 'dotnet-hosting.exe', '');
DownloadPage.Show;
You would have to retrieve the HTML page, find the URL in the HTML code and use it in your download code.
See Inno Setup - HTTP request - Get www/web content
It would be quite unreliable. Microsoft can change the HTML any time.
You would be better off setting up your own web page (web service) that provides an up-to-date link to your installer. The web page can even do what I suggested: retrieve the URL from Microsoft's download page. If Microsoft changes the HTML, you can fix your web page at any time, which you cannot do with an installer that has already shipped.
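As a sketch of that idea, assuming Flask: your installer always downloads from your stable endpoint, and updating the target is a one-line edit on the server (the handler could equally scrape Microsoft's page on demand, as described above). The route name is made up for illustration.

from flask import Flask, redirect

app = Flask(__name__)

# The one line you edit when Microsoft publishes a new hosting bundle;
# installers in the field keep pointing at your stable endpoint below.
LATEST_HOSTING_BUNDLE = (
    "https://download.visualstudio.microsoft.com/download/pr/"
    "24847c36-9f3a-40c1-8e3f-4389d954086d/"
    "0e8ae4f4a8e604a6575702819334d703/dotnet-hosting-5.0.6-win.exe"
)

@app.route("/latest-hosting-bundle")
def latest_hosting_bundle():
    # 302-redirect the installer to whatever URL is current.
    return redirect(LATEST_HOSTING_BUNDLE)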
Without realizing it, you are asking two different questions here. That is because these "permalinks" aren't really permalinks but redirects to some dynamic resource that has a link to what you are looking for.
So first, addressing the Microsoft "permalink": you need to realize that under the hood you are accessing a URL that redirects to some page which will point to the latest version. Then, under the hood, that page invokes a JavaScript function, IF YOU ARE ACCESSING IT VIA A WEB BROWSER, to download the installer. Note that both the page pointed to and the code that invokes the installer WILL eventually change. In fact, the code itself logs a "warning" when people attempt to download directly:
If you do a view source you'll see:
<script>
$(function () {
recordDownload('.NET', 'runtime-aspnetcore-5.0.6-windows-hosting-bundle-installer');
window.open("https://download.visualstudio.microsoft.com/download/pr/24847c36-9f3a-40c1-8e3f-4389d954086d/0e8ae4f4a8e604a6575702819334d703/dotnet-hosting-5.0.6-win.exe", "_self");
});
function recordManualDownload() {
ga("send", "event", "Download.Warning", "Direct Link Used", "runtime-aspnetcore-5.0.6-windows-hosting-bundle-installer");
}
</script>
So you can download the HTML from this page and use some regex to get the direct download link, but beware: the link is going to change every time Microsoft releases a new version. Furthermore, WHEN (not if but when) MS decides to rebrand, this entire process might break. So the best you can do here is download the HTML and try to parse the download URL out of this "permalink" page.
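To prototype that scrape-and-regex idea outside of Inno Setup, a quick sketch in Python might look like the following. The regex is an assumption based on the window.open call quoted above and will break whenever Microsoft changes the markup.

import re
import urllib.request

PERMALINK = ("https://dotnet.microsoft.com/permalink/"
             "dotnetcore-current-windows-runtime-bundle-installer")

# The permalink redirects to a download page; urlopen follows the redirect
# and hands back that page's HTML. A browser-like User-Agent is set in case
# the site rejects the default one.
request = urllib.request.Request(PERMALINK, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8", errors="replace")

# Assumption: the direct installer URL still appears inside a
# window.open("...") call, as in the script block quoted above.
match = re.search(r'window\.open\("([^"]+?\.exe)"', html)
if match:
    print("Latest installer:", match.group(1))
else:
    print("Markup changed; the regex needs updating.")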
As an alternative, you can download the latest .NET PowerShell install script as described here.
If possible, execute that script directly. If not, look at the function Get-AkaMSDownloadLink within the install script to see how it builds the URL for the latest version. You would probably be better served building and using that URL than attempting to scrape it out of some arbitrary HTML.
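For instance, fetching and running the script could look like this sketch. The dot.net short link is Microsoft's documented home for the install script, but the path and parameter set may evolve, and note that -Runtime aspnetcore installs the ASP.NET Core runtime rather than the full hosting bundle, so treat this only as a starting point.

import subprocess
import urllib.request

# Official short link to the PowerShell install script; verify it is still
# current before relying on it.
SCRIPT_URL = "https://dot.net/v1/dotnet-install.ps1"
urllib.request.urlretrieve(SCRIPT_URL, "dotnet-install.ps1")

# Run the script; -Runtime aspnetcore pulls the latest ASP.NET Core runtime
# from the Current channel. Parameter names follow the script's docs and may
# change with the script itself.
subprocess.run(
    [
        "powershell", "-NoProfile", "-ExecutionPolicy", "Bypass",
        "-File", "dotnet-install.ps1",
        "-Runtime", "aspnetcore",
        "-Channel", "Current",
    ],
    check=True,
)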
Now, on to the second question you might not have realized you were asking: how to automate this for any random installer. The answer is you can't. Some might have a permalink that directly points to the latest version, but you are always going to find cases like Microsoft's. The best you can do is hard-code some links in some service, as @martin-prikryl suggested, and when they break, update the links in that service.
I am testing my home page (http://stayuncle.com/home) speed with Google PageSpeed Insights. In the results, I am getting a few unidentified JavaScript files. I have no idea how they got into the results.
Can someone help me understand how they got there?
If you open up a network tab and view the results when the page loads, you will see that those scripts load with the page. The script isn't coming from Google PageSpeed but from your own site. It seems to come from a metrics script and is pushing Mixpanel results.
This is the URL inside the script.
http://popcornmetrics.com/legal
Unfortunately, I am not able to see what initiated the script. You might want to go through each of your JavaScript files and check whether it is loaded from one of them!
What you're seeing is the PopcornMetrics script and library installed on your website (http://stayuncle.com/home). Our new library has a minified and gzipped version which will show better results in Google PageSpeed.
I am working from the Google Analytics Embed API demo code.
When I am logged in to Google Chrome as the owner of the Analytics account, navigating to the demo page displays my Google Analytics data correctly.
I followed the instructions on the page and embedded the code into a simple page of my own.
Authentication works, as indicated by the displayed message "You are logged in as: me(at)gmail.com", but there is nothing more: no graph, no message.
I am reasonably certain that the page is coded correctly as I have:
Basic Dashboard (basic.html)
Multiple Views (multipleviews.html), and
Interactive Charts (ic.html)
all working and displaying correctly (they display but not styled like the demo).
Why will the page not display the graphics?
As Eike pointed out in the comments, you've simply copied and pasted the code from the demo without downloading the components to your own server. If you open up your JavaScript console, you'll notice that you have 404 errors saying the browser can't find those components. Here's a screenshot of what I see on your site:
To add those components to your site, you have a number of options. I've answered a similar question in one of the repo's GitHub issues, but I'll copy it here for convenience.
The built and minified versions of those components are located in the build/javascript/embed-api/components directory. You can simply download those files and add them as script tags on your page, or include them in your site's main, bundled script.
If you're using an AMD script loader like RequireJS, you can also just point to those built files as they're wrapped in a UMD wrapper.
If you're using a tool like browserify or webpack, you can npm install this repo and require the files in the src/javascript/embed-api/components directory.
I'm trying to make a simple web browser in node-webkit to polyfill features that Chromium doesn't support yet (the time element, etc.). I have had success listening for the iframe.onload event and then appending a script tag with the polyfills, but this still means that features I've polyfilled won't be detected by Modernizr or other feature detection.
I've tried loading the page using the http node module, appending a script tag, and then turning the page source into a data URI for the frame, but data URIs essentially turn external pages into static HTML with no scripting, which renders many web pages unusable.
Also, loading a page through node's http module is proving extremely slow compared to loading through an iframe.
So, is there any other way? Ideally I would run a script in the iframe before any other scripts run.
Yes, I am using nwfaketop and nwdisable on the iframe.
The 'document-start' event should be helpful. See https://github.com/rogerwang/node-webkit/wiki/Window#document-start
See also Window.eval() in https://github.com/rogerwang/node-webkit/wiki/Window#windowevalframe-script