How to get dynamically loaded content from a webpage (AlphaFold database) - web-scraping

I think similar topics may have been discussed before, but I still could not find a solution. Here is the example page:
https://alphafold.ebi.ac.uk/entry/P82662
I want to extract the target download URL:
https://alphafold.ebi.ac.uk/files/AF-P82662-F1-model_v4.pdb
When I "Inspect" this page in Chrome, I could see the desired URL here:
<a _ngcontent-brn-c12="" class="vf-button vf-button--secondary vf-button--sm" style="color: #3B6FB6;" href="https://alphafold.ebi.ac.uk/files/AF-P82662-F1-model_v4.pdb">PDB file </a>
I used requests.get to fetch the HTML but could not find the target URL in it, and realized the link is probably loaded dynamically. Then, under the Network tab (Fetch/XHR), I could only see a similar URL with a different file extension:
https://alphafold.ebi.ac.uk/files/AF-P82662-F1-model_v4.cif
So how can I get this URL string using requests?
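One possible approach, offered as a sketch rather than a confirmed answer: the .cif URL visible under Fetch/XHR and the wanted .pdb URL differ only in their extension, so the target can be derived by string manipulation instead of by parsing the rendered page. The helper names below are hypothetical, and the AF-<accession>-F1-model_v4 pattern is assumed from the two URLs in the question; it may not hold for other entries or model versions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


def swap_extension(url: str, new_ext: str) -> str:
    """Return `url` with the file extension of its path replaced by `new_ext`."""
    parts = urlsplit(url)
    new_path = str(PurePosixPath(parts.path).with_suffix(new_ext))
    return urlunsplit(parts._replace(path=new_path))


def accession_from_entry_url(entry_url: str) -> str:
    """Take the last path segment of an entry URL, e.g. 'P82662'."""
    return PurePosixPath(urlsplit(entry_url).path).name


def pdb_url_for(accession: str, version: int = 4) -> str:
    """Build the file URL from the pattern seen in the question (an assumption)."""
    return f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v{version}.pdb"


# The URL observed in the Network tab, taken from the question.
cif_url = "https://alphafold.ebi.ac.uk/files/AF-P82662-F1-model_v4.cif"
print(swap_extension(cif_url, ".pdb"))
```

The resulting string could then be passed to requests.get to download the file; that step is left out here so the sketch does not depend on network access.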

Related

Download links in Blazor

I'm programmatically creating links to download files of different formats, like .csv and .yml.
My code is
<a href="/organs/@organ/@policy/@folder.Name/@file.Name" download>@file.Name</a>
So let's say these links get created:
https://localhost:44372/organs/Heart-Lung/03-15-2020/data/fakedata.yml
https://localhost:44372/organs/Heart-Lung/03-15-2020/data/testdata.csv
The .csv works as it should: I click and it downloads. The .yml, however, opens a new web page at that link, which then says that it can't be found.
I'm really not sure what I'm doing wrong. Is there a way to force it to download, or should I be doing this a different way?
I ended up using JavaScript to do this, based on https://wellsb.com/csharp/aspnet/blazor-jsinterop-save-file/
Most of the code is the same, but I changed the link to the following, which passes in the variable and keeps it as a clickable link that doesn't navigate anywhere:
<a href="javascript:void(0)" @onclick=@(() => DownloadFile(file.FullName))>@file.Name</a>

Using Google Sheets /copy url parameter in iframe usage case

I am designing an exercise in an e-learning application. I have a template Google Spreadsheet and can display the spreadsheet within the software as an iframe.
Document URL (it's just a sample table)
https://docs.google.com/spreadsheets/d/1zmZ-oW8lC2Hus-G70O3CkhmGE5qqnoOGmSZMH6x526U/edit?usp=sharing
I learned about the /copy parameter that can be added to a url to generate a copy of that document so that the editing does not overwrite the original. Source: https://www.makeuseof.com/tag/make-copy-trick-sharing-google-drive-documents/
Spreadsheet URL with copy parameter
https://docs.google.com/spreadsheets/d/1zmZ-oW8lC2Hus-G70O3CkhmGE5qqnoOGmSZMH6x526U/copy
However, when I run that url as the iframe's source, it returns an error:
I learned that the issue is not specific to my iframe, having tried the URL on the W3Schools iframe demo with the same result. Source: https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe
From this StackOverflow answer, I understand that the iframe will not permit executing JavaScript within it, and I expect that JavaScript is being used to generate the copy.
Answer: Google Spreadsheets redirect
My question becomes: is there an alternative way for an end user (student) to generate a Google Sheets copy URL and have that appear in an iframe in a homework lesson?
Desired Result:
This in the iframe (or equivalent):
and be able to edit that copy as their own document

Google Tag Manager - Event - Track Download on .aspx

I am trying to track file downloads on a website through Google Tag Manager events. What I usually do, and have already done on many websites, is check whether the "Click URL" contains any of the file types I am looking for, such as pdf, docx and so on. The issue here is that the Click URLs do not contain that information and end with .aspx. What is the best method to solve this issue?
I'd rephrase the question: what do the links (elements) that point to downloadable files have in common that distinguishes them from everything that is not a download link? Whatever that is can be used as a trigger condition in GTM.
Consider the following example:
<a class="button file_download" id="file_link_12" href="index.aspx?action=download&file_id=12">Download file</a>
If any of these can be identified in your HTML source code, you can use them as trigger conditions that describe the elements pointing to downloadable files, e.g.:
Click Classes contains file_download
Click ID contains file_link
Click text equals Download file
Click URL contains action=download
If you are able to alter the HTML code, then you can choose whichever of the methods above best suits your needs.
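To see what each of the four conditions would match, they can be checked against the sample element outside GTM. The following is only an illustration in Python, not GTM configuration; the AnchorParser class and trigger_matches function are hypothetical names introduced for this sketch.

```python
from html.parser import HTMLParser

# The sample element from the answer above.
SAMPLE = ('<a class="button file_download" id="file_link_12" '
          'href="index.aspx?action=download&file_id=12">Download file</a>')


class AnchorParser(HTMLParser):
    """Collect the attributes and text of the last <a> element seen."""

    def __init__(self):
        super().__init__()
        self.attrs = {}
        self.text = ""
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.attrs = dict(attrs)
            self.text = ""
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.text += data


def trigger_matches(html: str) -> dict:
    """Evaluate the four example conditions against one anchor element."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    attrs = parser.attrs
    return {
        "Click Classes contains file_download": "file_download" in (attrs.get("class") or ""),
        "Click ID contains file_link": "file_link" in (attrs.get("id") or ""),
        "Click text equals Download file": parser.text.strip() == "Download file",
        "Click URL contains action=download": "action=download" in (attrs.get("href") or ""),
    }


for condition, matched in trigger_matches(SAMPLE).items():
    print(condition, "->", matched)
```

Any one of these conditions is enough to single out the sample link, so the choice comes down to which attribute is applied consistently across the site's download links.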

How to let crawler4j fetch page by relative path?

With Crawler4j, I can fetch a page linked by a complete URL, such as:
<a href='http://www.domain.com/thelink'>
However, I found that if the link is relative, such as:
<a href='/thelink'>
Crawler4j will bypass this link (page), and I don't even get a chance to see the link in the shouldVisit(Page referringPage, WebURL url) method.
I do not see any configuration for this on the Crawler4j GitHub page. Am I missing something?
As described in the related issue on the project page, this behaviour seems to stem from the fact that this specific web page renders a lot of its content using AJAX / JavaScript.
However, crawler4j cannot render JavaScript-generated content on demand, as it does not include a JavaScript engine for this purpose. In addition, script tags are not yet scanned for URLs.
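As background for the question itself: a relative href only becomes a fetchable URL once it is resolved against the page it was found on. The sketch below shows that resolution rule using Python's standard library. It is not crawler4j code, and the referring page URL is a made-up example.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found.
referring_page = "http://www.domain.com/some/page.html"

# A root-relative href is resolved against the scheme and host of the page.
print(urljoin(referring_page, "/thelink"))

# An absolute href is returned unchanged.
print(urljoin(referring_page, "http://www.domain.com/thelink"))
```

Both calls resolve to the same absolute URL. Resolution only helps for links a crawler can actually see; links injected by JavaScript never appear in the fetched HTML in the first place.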

Open PDF in browser instead of downloading it

After uploading a PDF to the Media Archive, I am trying to link to it from a page on a site.
While editing content, I use the hyperlink tool then select the PDF I want to link to via the URL input box.
After saving and publishing the content, clicking the link downloads the PDF, and I don't see any apparent way to make it viewable in the browser using the current Media ID that Composite provides. When rendered, we get this:
pdf
Is there a way that I can reference a PDF without using the Media ID and simply use the file name instead?
Here is the Request/Response header info:
After reading what Pauli Østerø said, I understand the problem but am still not able to think of a solution.
I can get the PDF to view in the browser by adding ?download=false to the href URL via Developer Tools. But when I try to add ?download=false to the href through Composite, it doesn't take effect and I get the console output: "Resource interpreted as Document but transferred with MIME type application/pdf: "http://c1.wittenauers.com/media/4afb7bc8-f703-469d-a9b2-a524d8f93dcb/ryc7iw/CompositeDocumentation.PDF"."
Here is the network trace that was asked for by Pauli. In the image, I included the bit where I add ?download=false to the URL, in source view, just in case there could be another way to add it.
Edit: URL and headers for the page.
Here is the link to the page that contains the link:
http://c1.wittenauers.com/cafe/test
Here are the headers for the page containing the link:
From what you're experiencing, it seems to me that Composite has gotten the MIME type of your uploaded file wrong and is therefore not correctly telling the browser that this file is a PDF, so the browser doesn't know what to do with it.
Try deleting the file and uploading it again.
Try adding ?download=false at the end of the href to the file. You probably need to go into source mode of the content editor.
This is the exact line in the source code that is responsible for this behavior, and the logic is as follows:
If there is no query string parameter named download, the disposition is determined by the MIME type. Only png, gif and jpeg will be shown inline; everything else will be sent as an attachment.
If there is a query string parameter named download with a value of false, it overrides the MIME type check and always forces the Content-Disposition to be inline.
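The two rules above can be restated as a small decision function. This is a Python sketch of the described logic only, not Composite's actual source; the function name is hypothetical, and the handling of download values other than false is an assumption, since the answer does not cover that case.

```python
from urllib.parse import parse_qs, urlsplit

# MIME types the answer says are shown inline by default.
INLINE_MIME_TYPES = {"image/png", "image/gif", "image/jpeg"}


def content_disposition(url: str, mime_type: str) -> str:
    """Return 'inline' or 'attachment' following the rules described above."""
    download = parse_qs(urlsplit(url).query).get("download")

    if download is None:
        # No download parameter: decide by MIME type alone.
        return "inline" if mime_type in INLINE_MIME_TYPES else "attachment"

    if download[0] == "false":
        # download=false overrides the MIME type check.
        return "inline"

    # Assumption: any other value is treated as a request to download.
    return "attachment"


print(content_disposition("/media/abc/doc.pdf", "application/pdf"))
print(content_disposition("/media/abc/doc.pdf?download=false", "application/pdf"))
```

Under these rules a PDF is sent as an attachment by default, which matches the behaviour in the question, and only the explicit download=false parameter makes it display in the browser.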
I made a quick test here to show that the behaviour is as expected, at least in my Chrome browser on Windows 8.
Force download: https://www.dokument24.dk/media/9fdd29da-dde8-41f7-ba4c-1117059fdf06/z8srMQ/test/Prisblad%202015%20inkl%20moms.pdf
Show in browser: https://www.dokument24.dk/media/9fdd29da-dde8-41f7-ba4c-1117059fdf06/z8srMQ/test/Prisblad%202015%20inkl%20moms.pdf?download=false
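Expanding on Pauli's answer, you can add the following snippet to your page template to automatically add '?download=false' to all PDF links.
// Append download=false to every PDF link so it opens inline in the browser.
$("a").each(function () {
  // Match on the path only, case-insensitively, so ".PDF" links are included.
  if (/\.pdf$/i.test(this.pathname) && !/[?&]download=/.test(this.search)) {
    // Use "&" when the link already has a query string, "?" otherwise.
    this.search += (this.search ? "&" : "?") + "download=false";
  }
});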
