How to download website source files in Python? - web-scraping

Given a website (for example stackoverflow.com) I want to download all the files under:
(Right Click) -> Inspect -> Sources -> Page
Please try it yourself and see which files you get.
How can I do that in Python?
I know how to retrieve the page source but not the source files.
I have searched for this multiple times with no success, and there is a lot of confusion between the sources (files) and the page source.
Please note, I'm looking for an approach or example rather than ready-to-use code.
For example, I want to gather all of the files listed under top (the root frame in the Sources panel).

To download website source files (i.e. to mirror a website / copy its source files), you may try the PyWebCopy library.
To save any single page -
from pywebcopy import save_webpage

save_webpage(
    url="https://httpbin.org/",
    project_folder="E://savedpages//",
    project_name="my_site",
    bypass_robots=True,
    debug=True,
    open_in_browser=True,
    delay=None,
    threaded=False,
)
To save a full website -
from pywebcopy import save_website

save_website(
    url="https://httpbin.org/",
    project_folder="E://savedpages//",
    project_name="my_site",
    bypass_robots=True,
    debug=True,
    open_in_browser=True,
    delay=None,
    threaded=False,
)
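If the library is not already installed, it can typically be obtained with pip install pywebcopy.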
You can also check out tools like HTTrack, which comes with a GUI for downloading (mirroring) website files.
On the other hand, to download a web page's source code (the HTML itself) -
import requests

url = 'https://stackoverflow.com/questions/72462419/how-to-download-website-source-files-in-python'
html_output_name = 'test2.html'

# Fetch the page with a browser-like user agent
req = requests.get(url, headers={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'})

# Write the HTML to disk; the with-block closes the file automatically
with open(html_output_name, 'w', encoding='utf-8') as f:
    f.write(req.text)
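If what you are after is specifically the individual files listed under Sources (scripts, stylesheets, images), one possible approach beyond the libraries above is to parse the downloaded HTML and fetch every referenced asset yourself. Below is a minimal sketch assuming requests and BeautifulSoup; the example URL, output folder, and tag/attribute list are illustrative choices, not part of the original answer.

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

page_url = 'https://stackoverflow.com/'   # example page; replace with your target
out_dir = 'site_files'
os.makedirs(out_dir, exist_ok=True)

headers = {'user-agent': 'Mozilla/5.0'}
html = requests.get(page_url, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')

# Collect absolute URLs of scripts, stylesheets and images referenced by the page
asset_urls = set()
for tag, attr in (('script', 'src'), ('img', 'src'), ('link', 'href')):
    for node in soup.find_all(tag):
        if node.get(attr):
            asset_urls.add(urljoin(page_url, node[attr]))

# Download each asset under a name derived from its URL path
for asset_url in asset_urls:
    name = os.path.basename(urlparse(asset_url).path) or 'index'
    data = requests.get(asset_url, headers=headers).content
    with open(os.path.join(out_dir, name), 'wb') as f:
        f.write(data)

Note that this only covers assets referenced directly in the HTML; files pulled in dynamically by JavaScript would need a browser-driven tool instead.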

The easiest way to do this is definitely not with Python.
As you seem to know, you can download the code of a single page with Command-click > View Page Source or from the Sources tab of the browser's inspect element. To download all the files in a website's structure, you should use a web scraper.
For Mac, SiteSucker is your best option if you don't care about having all of the site assets (videos, images, etc. hosted on the website) downloaded locally on your computer. Videos especially could take up a lot of space, so this is sometimes helpful. (SiteSucker is not free, but pretty cheap.) The GUI in SiteSucker is self-explanatory, so there's no learning curve.
If you do want all assets to be downloaded locally on your computer (for example, if you want to access a site's content offline), HTTrack is the best option, in my opinion, for Mac and Windows. (Free.) HTTrack is harder to use than SiteSucker, but gives you more options about which files to grab, and again downloads everything locally. There are many good tutorials/pages about how to use the HTTrack GUI, like this one: http://www.httrack.com/html/shelldoc.html
You could also use wget (Free) to download content, but wget does not have a GUI and has less flexibility, so I prefer HTTrack.

Related

How to download source code of website

#include <IE.au3>
#include <Inet.au3>

Call("Logowanie")
Sleep(4000)

Func Logowanie()
    Global $oie = _IECreate("http://pl.ikariam.gameforge.com/")
    Local $login = _IEGetObjByName($oie, "name")
    Local $haslo = _IEGetObjByName($oie, "password")
    Local $przycisk = _IEGetObjById($oie, "loginBtn")
    Local $serwer = _IEGetObjByName($oie, "uni_url")
    _IEFormElementSetValue($login, "<mylogin>")
    _IEFormElementSetValue($haslo, "<mypassword>")
    _IEFormElementSetValue($serwer, "s30-pl.ikariam.gameforge.com")
    _IEAction($przycisk, "click")
EndFunc
This code logs me in to the website but I don't know how to download the website's source code to do some stuff. Could you help?
You can read the source of a website using _IEDocReadHTML:
$sHtml = _IEDocReadHTML($oie)
ConsoleWrite($sHtml)
If you only want to download a single, relatively simple page, the easiest way to get the code is: control-click > View Page Source. You can also get it from the Sources tab in inspect element, which will also give you some of the other files in the site's file structure. If you use this method, make sure to convert any relative links in your downloaded code to absolute links (i.e. convert '../../style.css' to its full URL starting with https://).
This won't give you everything, because code on the back-end is not accessible through any means. For a simple website though, there may not be any code on the back end and this will give you exactly what you're looking for.
This can be very finicky and will not scale to downloading a large website with many pages. The most robust way to get code from any website is by using dedicated web-scrapers rather than trying to go into inspect element and look at the “site sources” tab.
For Mac, SiteSucker is your best option if you don't care about having all of the site assets (videos, images, etc. hosted on the website) downloaded locally on your computer. Videos especially could take up a lot of space, so this is sometimes helpful. (SiteSucker is not free, but pretty cheap.)
If you do want all assets to be locally downloaded on your computer (you may want to do this if you want to access a site’s content offline, for example), HTTrack is the best option, in my opinion, for Mac and Windows. (Free)
You could also use wget (Free) to download content, but wget does not have a GUI and has less flexibility, so I prefer HTTrack.

How to access https page via wget or curl?

Let's say I want to save the contents of my Facebook page. Obviously Facebook uses HTTPS, and thus SSL, so how do I download the contents of a secure page using wget?
I found a lot of sources online... and modified my command, but it doesn't save the page I want.
wget --secure-protocol=auto "https://www.facebook.com/USERNAMEHERE" -O index.html
Actually this is the result I'm getting in index.html:
"Update Your Browser
You’re using a web browser that isn’t supported by Facebook.
To get a better experience, go to one of these sites and get the latest version of your preferred browser:"
The problem is not the SSL/HTTPS. The problem is that Facebook sees "wget" as the user agent and tells you to update your browser.
You have to fool Facebook with the --user-agent switch and imitate a modern browser.
wget --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" https://facebook.com/USERNAME -O index.html
Then you will see the actual Facebook page if you open index.html in a modern browser.
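If you prefer to stay in Python, the same user-agent trick can be reproduced with requests; here is a minimal sketch (the user-agent string below is just an example of a modern browser, and USERNAMEHERE is the placeholder from the question):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0 Safari/537.36'}
resp = requests.get('https://www.facebook.com/USERNAMEHERE', headers=headers)

# Save the returned HTML, just as wget's -O index.html does
with open('index.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)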

WebResource.axd not loaded when using latest Google Chrome Browser

I'm having a strange issue:
I can't login at http://maskatel.info/login, when I try to click the login button (the blue button that says Connexion), nothing happens at all.
So I opened up the developer tools in Chrome (f12) and saw the following JavaScript error every time I click the button: Uncaught ReferenceError: WebForm_PostBackOptions
I then found out that this function should be in WebResource.axd, so I went to the Resources tab in the developer tools and found that this 'file' is not there and is not loaded in the HTML source.
I tried a lot of different things without any success, and finally tried another browser: the page works fine in every other browser. That same page previously worked perfectly in Chrome on the same computer.
So then I clicked the small gear in the Chrome developer tools, went to the Overrides section, changed the user agent to something else, and refreshed the page: it works perfectly with any other user-agent string. The correct (non-overridden) user agent for my browser is Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36
So right now I really don't know what to do next:
Is this issue related to the latest version of Chrome? I have not found any information on release dates for Chrome.
It could also be a DotNetNuke problem, but I doubt it since nothing there changed before or after the problem appeared.
It could also be ASP.NET related (I renamed App_Browsers to App_Browsers2 and still no luck).
Any help would be appreciated.
A data file which addresses this issue is available to download from the following url.
http://51degrees.mobi/portals/0/downloads/51Degrees.mobi-Premium-20130823.dat
.NET users will need to perform the following steps.
Download the above data file.
Replace the file 51Degrees.mobi-Premium.dat in the App_Data folder of the web site with the downloaded data file, renaming it to 51Degrees.mobi-Premium.dat.
Restart the application pool servicing the web site to apply the new data file.
Some configurations may place the 51Degrees.mobi-Premium.dat file in a location other than App_Data. The file's current location for the web site can be found in the 51Degrees.mobi.config file, located in either the web site's root folder or the App_Data folder. See the following page for more details.
https://51degrees.mobi/Support/Documentation/NET/WebApplications/Config/Detection.aspx
Please contact us if you have any issues deploying this data file.
We are having this problem on all our DNN6 sites at work (we can't update to DNN7 since we are stuck on SQL Server 2005 and Windows 2003 boxes). DNN support ticket response was:
"This is a known issue with the Google Chrome update to version 29, the browser is having many issues with ASP.Net pages. The current workaround is to use a different web browser until Google can release a new update."
but I know big ASP.NET sites like Redbox and msdn.microsoft.com are working fine, so it's definitely not a global problem.
Our servers are patched regularly by our infrastructure folks and are usually up to date, so I'm not sure what specifically is the issue.
I have personal sites on DNN6 (3essentials hosting) that are working fine, so it's definitely not all DNN6/7 sites that are having problems. Maybe it's DNN6 sites running on Windows 2003 boxes?
It looks like someone has found the culprit over at Google. It is related to 51Degrees, which reports version 0 for the Chrome 29 user-agent string.
More details at https://code.google.com/p/chromium/issues/detail?id=277303
I tried to update the premium data (it is a professional edition installation), but I only get the same version that was already there, dating from 2013-08-15 and having 109 properties.
Then I tried renaming App_Data/51Degrees.mobi-Premium.dat to add a .old at the end, but the system re-downloads that file (the same one, it looks like) to that directory.
So I went ahead and commented out the fiftyone configuration in the web.config file, which instantly made the site work again in Chrome 29.
Let's hope there will be an update with a better solution, but at least I think the culprit has finally been found.
On a DNN 7.1.0 site that uses the Popup feature in DNN (the login window opens in a modal popup), the login functionality appears to work fine.
I would recommend you try the Popup option, and if that doesn't work, look at upgrading to the latest release of DNN.
update: I tested the same 7.1.0 site using /login instead of the login popup and it also still works fine, so I would encourage you to look at upgrading your DNN instance.

Downloading all pdf files from google scholar search results using wget

I'd like to write a simple web spider or just use wget to download pdf results from google scholar. That would actually be quite a spiffy way to get papers for research.
I have read the following pages on stackoverflow:
Crawl website using wget and limit total number of crawled links
How do web spiders differ from Wget's spider?
Downloading all PDF files from a website
How to download all files (but not HTML) from a website using wget?
The last page is probably the most inspirational of all. I did try using wget as suggested there.
This is my Google Scholar search result page, but nothing was downloaded.
Given that my level of understanding of webspiders is minimal, what should I do to make this possible? I do realize that writing a spider is perhaps very involved and is a project I may not want to undertake. If it is possible using wget, that would be absolutely awesome.
wget -e robots=off -H --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" -r -l 1 -nd -A pdf "http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23"
A few things to note:
Use of filetype:pdf in the search query
One level of recursion
-A pdf for only accepting pdfs
-H to span hosts
-e robots=off and use of --user-agent will ensure best results. Google Scholar rejects a blank user agent, and pdf repositories are likely to disallow robots.
The limitation of course is that this will only hit the first page of results. You could expand the depth of recursion, but this will run wild and take forever. I would recommend using a combination of something like Beautiful Soup and wget subprocesses, so that you can parse and traverse the search results strategically.
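As a rough illustration of that last suggestion, here is a hedged sketch (assuming requests and Beautiful Soup are available, and reusing the query plus a browser-like user agent; Google Scholar may still throttle or CAPTCHA automated requests). It fetches one results page, pulls out links that point directly at PDFs, and hands each one to wget via a subprocess.

import subprocess
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = 'https://scholar.google.com/scholar'
params = {'q': 'filetype:pdf liquid films', 'hl': 'en'}
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0'

html = requests.get(base, params=params, headers={'User-Agent': user_agent}).text
soup = BeautifulSoup(html, 'html.parser')

# Keep only the result links that point straight at a .pdf file
pdf_links = {urljoin(base, a['href'])
             for a in soup.find_all('a', href=True)
             if a['href'].lower().endswith('.pdf')}

for link in pdf_links:
    # Let wget do the actual downloading, as in the command above
    subprocess.run(['wget', '--user-agent=' + user_agent, link])

Paging through more results would typically mean re-requesting with Scholar's start parameter (start=10, 20, ...) and repeating the parse.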

Can I make the download dialog box appear without "save" option?

I have a hyperlink to an executable like so: Run Now
I'm trying to make the download dialog box appear without the save option, as the file is only meant to be run on the user's computer.
Is there any way to manipulate the file download dialog box?
FYI: Running on Windows Server '03' - IIS.
Please no suggestions for a WCF program.
Okay I found it for anyone stumbling upon this conundrum in the future.
Add the following tag to your head section: <meta name="DownloadOptions" content="nosave" /> and the file download dialog box will not display the "save" option.
To let the user save the file but not open/run it, replace "nosave" with "noopen".
Not unless you have some control over a user's machine. If your application can run on limited resources, you might want to consider doing it in Silverlight.
IMO, having a website launch an executable is a pretty bad idea... even worse if that website is open to the general public (not on an intranet). I don't know what that app is doing, but it sure is NOT 1) cross-browser, 2) cross-platform, or 3) safe for your users.
If you are on an intranet, you might get away with giving the full server path (on a shared drive) to the executable and changing security settings on your in-house machines.
Other than that, you won't succeed in an open environment such as the Internet.
From your comments, if the user downloading the file is the issue, then there's no way to get around it, as they have to download the file in order to be able to run it.
There are any number of ways to get around whatever you could manage in the browser, from proxies like Fiddler intercepting the data to lower-level things like packet sniffing. Or even simply going into the browser's temp/cache folder and copying the file out once it's running.
You could probably get around most laymen by offering a program they can download that registers a file extension with Windows. The file served from this site would then contain the URL of the actual data, obfuscated somehow (crypto/encoding/ROT-13/etc.), and the app would go and grab it. The initial program could even provide whatever functionality the download is supposed to offer, but it would need the downloaded key.
But this is moving into the area of DRM and security by obscurity. If an attacker wants your file, and it's on the Internet, they will get the file.
