I've learnt about wget and have downloaded a few directories from the web. However, I've hit some roadblocks.
I'm trying to download from a site which requires a password and username, which I have access to.
There are no apparent directories that I could find from inspecting elements. The site was loading up the documents in a reader.js (whatever that is) and it seemed that each page was being fetched as I clicked the arrow button instead of the whole document.
Any ideas would be helpful :)
The site was loading up the documents in a reader.js (whatever that
is) and it seemed that each page was being fetched as I clicked the
arrow button instead of the whole document.
This suggest that JavaScript is used to get document, wget is not right tool in such situation. You would need another tool, namely one which provides browser automation, I suggest giving try PhantomJS (if you are tolerating writing in JavaScript-like language) or selenium (if you are tolerating writing in at least one of officially supported languages).
Related
So I am using InnoSetup 6 which natively supports downloading files from the internet during installation. I have figured out downloading files given a direct link, from this thread Inno Setup: Install file from Internet
However, I can't for the life of me figure out how to download the latest version of a file given a permalink URL. My specific example is to download the Microsoft Hosting package.
https://dotnet.microsoft.com/permalink/dotnetcore-current-windows-runtime-bundle-installer
Going to this page automatically downloads the latest package.
Inno doesn't like this link (or I don't know how to get Inno to use it) since it doesn't point to the direct file. If I use the direct link (https://download.visualstudio.microsoft.com/download/pr/24847c36-9f3a-40c1-8e3f-4389d954086d/0e8ae4f4a8e604a6575702819334d703/dotnet-hosting-5.0.6-win.exe) this works for obvious reasons.
I'd like to always download the latest, but I'm not sure how to accomplish this. Any suggestions?
Adding super basic code being used...
DownloadPage.Clear;
DownloadPage.Add('https://dotnet.microsoft.com/permalink/dotnetcore-current-windows-runtime-bundle-installer', 'dotnet-hosting.exe', '');
DownloadPage.Show;
You would have to retrieve the HTML page, find the URL in the HTML code and use it in your download code.
See Inno Setup - HTTP request - Get www/web content
It would be quite unreliable. Microsoft can change the HTML any time.
You better setup your own webpage (web service) that will provide an up to date link to your installer. The web page can even do what I suggested: retrieve the URL from the Microsoft's download page. In case Microsoft changes the HTML, you can fix your web page any time. What you cannot do with the installer.
Without realizing it you are asking two different question here. That is because these "permalinks" aren't really permalinks but redirects to some dynamic resource that has a link to what you are looking for.
So first, addressing the Microsoft "permalink", you need to realize that under the hood you are accessing a URL that redirects to some page which will point to the latest. Then under the hood, that page invokes a JavaScript function, IF YOU ACCESSING VIA A WEB BROWSER, to download the installer. Note that both the page pointed to and the code to invoke the installer WILL eventually change. In fact, the code itself logs a "warning" when people attempt to download directly:
If you do a view source you'll see:
<script>
$(function () {
recordDownload('.NET', 'runtime-aspnetcore-5.0.6-windows-hosting-bundle-installer');
window.open("https://download.visualstudio.microsoft.com/download/pr/24847c36-9f3a-40c1-8e3f-4389d954086d/0e8ae4f4a8e604a6575702819334d703/dotnet-hosting-5.0.6-win.exe", "_self");
});
function recordManualDownload() {
ga("send", "event", "Download.Warning", "Direct Link Used", "runtime-aspnetcore-5.0.6-windows-hosting-bundle-installer");
}
</script>
So you can download the HTML from this page and use some regex to get the directo downloadlink but beware, the link is going to change every time Microsoft releases a new version. Furthermore, WHEN (not if but when) MS decides to rebrand this entire process might break. So the best you can do here is try to download the html and try parse the download URL from this "permalink"
As an alternative. you can to download the latest DotNet powershell install script as described here.
If possible, execute that script directly. If not look at the function Get-AkaMSDownloadLink within the install script to see how it builds the url to get the latest version. You would probably be better served using that building and using that URL as opposed to attempting to download from some arbitrary HTML code.
Now, onto the second question you might not have realized your were asking is how to automate this for any random installer. The answer is you can't. Some might have a permalink that directly points to the latest but you are always going to find cases like Microsoft. Best you can down is hard code some links in some service, as #martin-prikryl suggested, and when the break update the links in those services.
I'm trying to download a municipal planning plan together with all the relevant documents.
All documents can be reached from the following link
I've tried the following command (that worked well for other sites) and some variations without success.
wget -E -k -r -l 3 "http://www.mavat.moin.gov.il/MavatPS/Forms/SV4.aspx?tid=4&et=1&mp_id=ppnCWTcsST9gG0%2fa0ayWnjFyZ%2bo14s221Ujlpi7UvR4jIRAHLKhJ8lOLSkomZ%2fvlHk8b2T0oENpI6Wh2hKzxQJCw9BPJP8gav%2ftgiKlk5S0%3d"
The same plan in their new site I can't get the files either,
https://mavat.iplan.gov.il/SV4/1/5000931297/310
I'd appreciate any help.
Well, these days, and especially with .net web sites?
We don't use hyper-links with a simple (full) path name to actual files from the web server. In fact in most cases one will not even give the web server rights to those folders. (they are not exposed to Internet Services).
So, no actual links as a full "url" to documents exist.
What happens is when you click on a button or button link? Then the code behind on the web server runs. (and that is code you don't have). And further more, that code behind can browser, read, retrieve any file from any folder on the server or other servers. But links from the web site don't exist and it not even possible to type in a url to resolve to a actual file name on the server.
So the server side code (not internet services) goes and grabs the document. In fact, the documents could be in a database. So, the code behind on the server side runs and pulls the binary data from the database (which represents a valid PDF file). Or the code behind reads the file from disk and then STREAMS the file for a download.
Now, this is often done for reasons of security. It means that no valid URL exists to get at a document.
Not only is this done for security, but from a developer point of view, it often better to retrieve a row from a database. That row can have the information you SEE rendered on that form, but the web page is not static, and the display of information is thus a developer coding a pull of rows from a database, and then you simply "assign" that data to some type of control - save datagrid, or listview or whatever. (this assignment of data is only 1 or two lines of code, and then the control + web server renders that datagrid control.
So, this is done since the developer thus only assigns the result of a database query to the control when then renders on the form. Thus, to add or remove documents? Then you only have to edit the database for the information on the web page to render.
As a result? There is no direct links to the actual documents on the server. To retrieve a document, you would have to send to the web site the exact command required.
You can hit f12 (most browsers support this). This will put your browser into developer mode. If we do this, and then select elements (select element feature). Now click on a pdf link. You get this:
<img src="../images/ft/file_PDF.gif" style="cursor:pointer"
onclick="openDoc('99000526871729',
'AABA7BE646E182B67DB1C15220E531DF36BBB591D8EEA7757435B2606C08E6F9')">
So, note above. The above code event openDoc is the SERVER side code you have to run to retrive a document. There is thus NO link. And you not going to be able to wire up, or run your OWN web page that hits that server and runs the routine "onclick".
However, the onclick DOES expose the internal database document numbers used to pull/read and retrieve a given document. But the path name, and how the code gets/grabs this file? You have no idea, and HAVE to run server side code (c#, or vb.net) code. That code as noted grabs the file and then uses code to "stream" the file when you download or click on a link.
So for simple HTML like pages? Well, for those that took a one day HTML course? Sure, such web sites will have scr=some path name to a valid url). And these simple systems thus allow you to enter a URL to grab/get a document. And those documents are fully exposed to the web site, and a simple valid URL path name to a file exists. Not so with asp.net, and as noted, this is not only done for security, but it a better over all developer experience to write code that grabs the files as opposed to rendering full path link names to files.
There are many additional benefits. For example, the database that drives this likely has a setting (or some settings) that contain the path names to the documents. If they run out of storage, or say want to move older files to a much slower storage system, which of course is much lower cost? Then can move the files, and update the path name columns in the database. The web site will continue to work, since we NEVER using a exposed URL on the web site. And as noted, actual direct URL's don't exist, and the web server (IIS) as opposed to the code behind will not even have rights to the file names.
As a result?
You not be able to simply pull the web page, and THEN extract the URL's to file names.
What you might be able to do is write code that loads the web page, and then scans all the event code stubs for the links, and have your code click on each button with web browser automation. But, even that don't allow you to enter file names into the download prompts.
So, what you ask is not easy, likely not possible, and a very difficult task. And the simple reason is that site does not use simple HTML and static links to files, and it never actually exposes a direct link to files, and even worse yet is the web server does not have or even allow a URL direct link to a site - they don't exist, and the web site will not even have rights or even allow such URL's to file names. (only the .net code behind does - not internet services).
and grabs the document and then code "streams" the file to to the web site or link you clicked on. So the simple HTML coders in the past would create say a folder (usually a virtual folder) that points to the files on some server/folder. But with .net, it easier (and far more secure).
Modern development tools don't use old fashioned ideas like a URL's to directly retrieve a file - they are designed differently.
In some cases, URL's are allowed or created, and this is done for reasons of sharing links. So if you have a cute video or document? Then the designers of the system will often permit use of parameters in the URL, so you can share a link to someone else. This page has no such provisions. So, you can share a link to the page, but no actual URL to documents or even provisions to allow URL's to a document even exists.
So this quite much means to retrieve a document, you have to go to that web page, and ONLY when you click on a document will the web site "stream" down that one particular document in question.
I am trying to restrict the user from downloading the page as .html or .aspx file from browser.
Or is there a way to change the content of file if its downloaded?
This is a complex area, with lots of moving parts. The short answer is "there is no way to do this with 100% success; there are a few things you can do which make it harder".
Firstly, you can include JavaScript to disable the right click context menu. This doesn't stop Ctrl+S, but might discourage casual attempts.
Secondly, you can use DRM in the browser (though this is primarily aimed at protecting media content. As browser support is all over the show, this isn't realistic right now.
Thirdly, you could write your site as a single page web application, and build some degree of authentication into the "retrieve content" logic. This way, saving the page to disk wouldn't bring the content along, just the "page furniture". However, any mechanism you include to only download content when you think you should is likely to be easily subverted by anyone who is moderately motivated.
Also, any steps you take to stop people persisting your pages locally are likely to break the caching mechanisms on which the internet depends for performance, so your site would likely be dramatically slower.
No you can't stop them.
Consider how the web actually works here: once the user has visited your website and loaded your page into their browser, they have already downloaded it - the web page was transmitted from your server to their computer and appeared on their screen.
All they have to do then is click the Save button to keep it permanently on their disk. That doesn't involve downloading it again, it just copies the page data from a temporary folder to a permanent one. Of course it's also possible for people to use another HTTP client (i.e. not a browser, but maybe an existing program, or some code they wrote themselves) to visit the URL of your page and save the returned contents.
It's not clear what problem think you would solve by stopping people from saving pages. Saving the page is something done within the browser - you as a site developer don't control the user's browser, so you can't prevent that. And if you stop them from downloading your page in the first place then - by definition - you also stop them from using your website...which kind of defeats the point of having one :-).
If you've got some sort of worry about security, you'll have to clarify exactly what you are concerned about, and maybe you can get advice about a sensible way to deal with it.
I have a hyperlink to an executable like so: Run Now
I'm trying to make the download dialog box appear without the save function as it is to only run only on the user's computer.
Is there any way to manipulate the file download dialog box?
FYI: Running on Windows Server '03' - IIS.
Please no suggestions for a WCF program.
Okay I found it for anyone stumbling upon this conundrum in the future.
Add the following tag to your head section: <meta name="DownloadOptions" content="nosave" /> and the file download dialog box will not display the "save" option.
For the user to not open/run but save replace "nosave" with "noopen"
Not unless you have some control over a user's machine. If your application can run on limited resources, you might want to consider doing it in Silverlight.
IMO, having a website launching an executable is a pretty bad idea.... even worst if that website is open to the general public (not on intranet). I don't know what that app is doing but it sure is NOT, 1) cross browser, 2) cross platform, and 3) safe for your users.
If you are on intranet, you might get away with giving the full server path (on a shared drive) to the executable and change security settings on your in-house machines.
Other than that, you won't succeed in a open environment such as the Internet.
From your comments, if the user downloading the file is the issue, then there's no way to get around it, as they have to download the file in order to be able to run it.
There's any number of ways to get around whatever you could manage in browser, from proxies like Fiddler intercepting the data, or lower level things like packet sniffing. Or even simply going into the browser's temp/cache folder and copying the file out once it's running.
You could probably get around most laymen by having a program that they can download that registers a file extension with Windows. Then the file downloaded from this site would have the URL of the actual data obfuscated somehow (crypto/encoding/ROT-13/etc). The app would then go and grab the file. The initial program could even have whatever functionality provided by what you want to download, but it needs the downloaded key.
But this is moving into the area of DRM and security by obscurity. If an attacker wants your file, and it's on the Internet, they will get the file.
A colleague set up an intranet application where the users can upload documents. These documents are displayed afterwards in an IFRAME using <IFRAME src="document.doc"></IFRAME> - of course this only works in IE. While this works with some users, others (including myself) do not see the document, but rather a download dialogue allowing them to download the document.
I vaguely remember that there was a recent security issue with displaying MS Office documents in IFRAMES, but could not find any information whether there was a security update blocking this. Anyone here who has a clue?
I am not looking for alternatives for the IFrame, I just want to know why some users are displayed the download box while other users see the inline document.
If you get a download dialogue instead of displaying the document inline in an iframe, then:
you probably haven't installed the Office Web Components. You can change the components Office has installed from its Add/Remove Programs entry in the Control Panel. But,
DON'T. There have been endless security holes in OWC. Installing a plugin means a great deal of new net-facing code and subsequently a great potential for exploitable bugs, especially in software that wasn't originally intended to be net-facing like Office.
Install the absolute minimum number of plugins you can get away with (these days usually just Flash). Don't install every plugin an application offers you, don't install a PDF plugin, and definitely don't install a load of plugins for Office documents.
Is viewing an Office document in a little annoying scrolly box tucked into a web page really compelling enough to justify the risk? I suggest that no, it's in fact much much less usable than just downloading the document to the desktop and opening it in a proper document editor/viewer.
You might consider an RTE like CKEditor. It allows the user to cut-and-paste from Word (I assume you are primarily concerned with Word docs given your problem description) and then to view and edit. CKEditor claims to be "compatible with all major browsers."