Download entire history of a Wikipedia page in HTML Format - web-scraping

I want to download the entire revision history of a single article on Wikipedia in HTML format. Thanks to this question, I am getting the entire history of a Wikipedia page in JSON format, but I want to get it as HTML, with images and everything.
I tried to convert that JSON to other formats but it did not work. Is there any way of doing this?

If you visit the page, there is a link view history which shows the list of all previous versions, and every version has a link curr which compares that old version with the current one.
Every such link has &diff=...&oldid=... in it, and if you remove &diff=... and keep &oldid=... then you get just the old version as HTML (with a header informing you that you are viewing an old version).
See the page for Stack_Overflow:
Current version:
https://en.wikipedia.org/wiki/Stack_Overflow
or
https://en.wikipedia.org/w/index.php?title=Stack_Overflow&oldid=1074237099
The oldest version:
https://en.wikipedia.org/w/index.php?title=Stack_Overflow&oldid=273483259
This way you could get HTML for all versions.
And if you use &diff=... with the ID of a different version (it doesn't have to be the current one), then you can also see the differences between two versions. For example, comparing the oldest version against the current one:
https://en.wikipedia.org/w/index.php?title=Stack_Overflow&diff=1074237099&oldid=273483259
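If you want to automate this, here is a minimal R sketch (untested; it assumes the httr and jsonlite packages, and the 50-revision cap and file names are only illustrative). It lists revision IDs through the MediaWiki API and saves each old version as rendered HTML:

library(httr)
library(jsonlite)

# ask the MediaWiki API for the article's revision IDs
res <- GET("https://en.wikipedia.org/w/api.php", query = list(
  action = "query", prop = "revisions", titles = "Stack_Overflow",
  rvprop = "ids", rvlimit = 50, format = "json", formatversion = 2
))
revs <- fromJSON(content(res, as = "text"))$query$pages$revisions[[1]]

# fetch each old version as the server-rendered HTML page
for (id in revs$revid) {
  url <- paste0("https://en.wikipedia.org/w/index.php?title=Stack_Overflow&oldid=", id)
  download.file(url, destfile = paste0("Stack_Overflow_", id, ".html"), quiet = TRUE)
}

Note that the saved pages still reference images and stylesheets on Wikipedia's servers; for a fully self-contained copy you would have to download those assets separately.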

Related

RoboHelp is missing output preset for CHM generation

Adobe RoboHelp 2020 Trial Version:
The list of available output presets is also missing Responsive HTML5 and Mobile App.
I did have a problem with the PDF output generation. An error message advised me to install the Java runtime, as it was missing. After installing Java, the PDF generated, but its bookmarks did not work at all.
I have worked laboriously at learning how to use RoboHelp, and it also took several days to create my project. I desperately need to generate CHM output for a Windows program that I developed. I was devastated when I finally went to generate CHM output (Microsoft HTML Help) and found it was not on the list of presets.
That said, I am at my wits' end, having searched for potential solutions. Can someone please suggest a solution?
The Adobe RoboHelp 2020 Trial Version does contain the functionality to generate CHM help files.
To add and output Microsoft HTML Help (CHM) in your presets, click the + icon in the Outputs panel.
Select the type Microsoft HTML Help.
Enter a short name like CHM.
After that it is available for permanent use, e.g. in the Quick Generate window.
I updated RH 2020.
I selected the Microsoft HTML Help preset.
I configured the preset: I selected the TOC file and specified the output folder, which is on a different drive than the project folder. (I knew that it could not be in a location whose parent folder was the project folder.)
Then I clicked the command button to generate the CHM. The progress display continuously indicated the various tasks being performed during the generation process.
Ultimately, I clicked the View Output button. The program displayed an error message indicating that the CHM file could not be opened. I checked the output destination folder, and the file existed. However, it was only 1 KB. I double-clicked the CHM file, but the same error message appeared as when I attempted to open the file from within RoboHelp.
As an aside, I tried the same process with RoboHelp's sample project, i.e., Compass Travel. And, wouldn't you know it? The CHM file generated and displayed properly; even the bookmarks functioned as expected.
That said, I had mentioned in the original post that I had generated a PDF output, but its bookmarks did not work.
Clarification:
The H1 bookmarks display; however, clicking any one of them does nothing. Most of the H1 folders have several H2 subfolders, and there are two H3-level items.

R - Web scraping item price

I'm trying to write an R script that checks prices on a popular Swiss website.
Following the methodology explained here: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ I tried to use rvest for that:
library(rvest)
url <- "https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344"
webpage <- read_html(url)
Unfortunately, I have limited HTML/CSS knowledge and the content of webpage is very obscure to me.
I tried inspecting the page with Google Chrome, and it looks like the price is located in an element with the class priceEnergyWrapper--2ZNIJ, but I cannot find any trace of it in webpage. I did not have more luck using SelectorGadget.
Can anybody help me get the price out of webpage?
Since the price is dynamically generated, you will need RSelenium.
Your code should be something like:
library(RSelenium)
# start a Selenium server and a Chrome client; the returned client is already open
driver <- rsDriver(browser = "chrome")
rem_driver <- driver[["client"]]
rem_driver$navigate("https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344")
This asks Selenium to open the page and wait for it to load completely, so all the HTML that you would see by clicking View Page Source should be available.
Now do:
rem_driver$findElement(using = 'class name', value = 'priceEnergyWrapper--2ZNIJ')
You should now have the HTML needed to get the price value out, which at the time of checking the website was 25 CHF.
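To actually pull the text out of that element, a small follow-up sketch (untested, per the caveat below; getElementText() returns a list of strings):

price_elem <- rem_driver$findElement(using = 'class name', value = 'priceEnergyWrapper--2ZNIJ')
price_elem$getElementText()[[1]]   # e.g. "25 CHF"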
PS: I do not scrape websites for others unless I am sure that the owners do not object to crawlers/scrapers/bots. Hence, my code only sketches how to go about it with Selenium; I have not tested it personally. However, you should more or less get the general idea and the reason for using a tool like Selenium. You should also find out whether you are legally allowed to scrape this website, and do the same for any other site in the future.
Additional resources to read about RSelenium:
https://ropensci.org/tutorials/rselenium_tutorial/

Web scraping Oracle (ATG) Commerce

I am new to web scraping, and I use the following tools and methods to scrape:
I use R (with packages RCurl, XML, etc.) to read the web pages (given a URL), and the htmlTreeParse function to parse the HTML page.
Then, in order to get the data I want, I first use the developer tools in Chrome to inspect the code.
When I know which node the data are in, I use xpathApply to get them.
Usually it works well, but I had an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
When you click on the link, the page loads, but it is in fact page 1 (of the products).
You have to load the URL again (by entering it a second time) in order to get page 2.
When I use the usual process to read the data, the htmlTreeParse function always gives me page 1.
I tried to understand this website a bit more:
It seems to be built with Oracle Commerce (ATG Commerce).
The "real" URL is hidden, and when you click on a filter (for instance, selecting a brand), you get a URL with a requestid: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't reveal which selection I made.
Could you please help:
How can I access more products?
Thank you
I found the solution: Selenium! I think it is the ultimate tool for web scraping. I have posted several questions concerning web scraping; now, with RSelenium, almost everything is possible.
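For this particular site, a hedged RSelenium sketch (untested) might look like the following; driving a real browser keeps the session state that ATG expects, and the rendered HTML can then be handed back to the usual XML workflow:

library(RSelenium)
library(XML)

driver <- rsDriver(browser = "chrome")
remdr <- driver[["client"]]

# a real browser session carries the cookies/state ATG expects,
# so page 2 actually renders instead of falling back to page 1
remdr$navigate("http://www.sephora.fr/Parfum/Parfum-Femme/C309/2")

html <- remdr$getPageSource()[[1]]   # the fully rendered page
doc <- htmlParse(html)               # back to the usual xpathApply workflow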

How can I automatically add author and copyright info to newly created ipython notebooks?

I'd like to prepend a simple author and copyright clause to the beginning of newly created IPython notebooks. Is this possible? If so, how can it be done?
We want to support this in metadata at the notebook's top level, but nobody has taken the time to write a proposal for the metadata structure, how to edit it, and how to show it.
This would be useful for viewing on nbviewer, but also for conversion to LaTeX and other formats. It might be slightly more complicated than it first appears, as you probably want the authors to be more than just first name/last name (a full embedded vCard, for example).
If you want to work on that, you are welcome to; otherwise, in the meantime, I suggest adding a simple markdown cell at the top with that info.
This should be easy to do on a bunch of notebooks at once, as they are easily parsable JSON.
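For instance, a rough R sketch of that batch edit (assuming the jsonlite package and the nbformat-4 layout, where cells sit at the notebook's top level; older v3 notebooks keep them under worksheets instead, and the author text here is just a placeholder):

library(jsonlite)

# the markdown cell to prepend; the author/copyright text is an example
header <- list(
  cell_type = "markdown",
  metadata  = setNames(list(), character(0)),   # serializes as {}
  source    = "**Author:** Jane Doe  \n**Copyright:** 2014"
)

for (path in Sys.glob("*.ipynb")) {
  nb <- fromJSON(path, simplifyVector = FALSE)
  nb$cells <- c(list(header), nb$cells)
  write_json(nb, path, auto_unbox = TRUE, pretty = TRUE)
}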

Where can one find source archives of old SQLite versions?

The downloads page on www.sqlite.org appears to only have links to the current version, and I would like to get a previous version. I cannot see any obvious links to historical versions on the site, and (unless I'm missing something obvious) there does not appear to be a SourceForge project.
Can someone point me at an archive of old SQLite source releases if such a thing exists?
Nigel.
I found this in their old mailing list archives. Hopefully this helps:
Older versions of SQLite are available from the website, but there are no direct links on the web pages. You need to manually edit the links to get the file you need.
The 2.1 version of the database file implies that it was created with a 2.X.Y version of SQLite. You should get the latest such version, which is 2.8.17 (I believe).
If you go to the download page http://www.sqlite.org/download.html and right-click on the link to download the latest Windows binary file, then select Copy Link Location (at least in Firefox; in IE the command is Copy Shortcut). Now open a new tab or window and paste the link into the address bar. You can edit the link and replace the version number with the version you want to download. In your case you need to change http://www.sqlite.org/sqlitedll-3_5_6.zip to http://www.sqlite.org/sqlitedll-2_8_17.zip and then press Enter to start the download.
I know this question is kind of old, but there's an easier way to get the URL of the older zip files.
Using a combination of the answers here, you can construct the download URL of the zip file for the specific version you want.
Determine the version number you want to get (we'll use 3.8.7.4 as an example)
Look on the timeline (http://www.sqlite.org/src/timeline?t=release) to figure out the year in which your desired version was released (3.8.7.4 was released in 2014)
Normalize the version number into exactly 7 digits: expand each piece into 2 digits with leading zeroes, except for the initial 3, which remains 1 digit. For example, 3.8.7.4 becomes 3080704 and 3.13.0 becomes 3130000. (If there is no 4th period-delimited piece, use 00.)
Using your normalized version number, plug it into one of these formats, depending on what you're looking for (replace the text '7DIGITS' in the URLs below with your normalized version number, and replace the text YEAR with the year in which the version was released):
Source: http://www.sqlite.org/YEAR/sqlite-src-7DIGITS.zip
Amalgamated: http://www.sqlite.org/YEAR/sqlite-amalgamation-7DIGITS.zip
So our example version becomes:
http://www.sqlite.org/2014/sqlite-src-3080704.zip and http://www.sqlite.org/2014/sqlite-amalgamation-3080704.zip
I haven't tried this for every version, but my 3 test versions worked. I would imagine the other download types (like precompiled binaries, documentation, etc.) work as well.
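If you do this often, the rule is easy to script. A small R sketch of the normalization (the function name is made up for illustration):

# build the source-zip URL from a dotted version and its release year
sqlite_src_url <- function(version, year) {
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  parts <- c(parts, rep(0L, 4 - length(parts)))   # pad to 4 pieces: 3.13.0 -> 3.13.0.0
  digits <- sprintf("%d%02d%02d%02d", parts[1], parts[2], parts[3], parts[4])
  sprintf("http://www.sqlite.org/%d/sqlite-src-%s.zip", year, digits)
}

sqlite_src_url("3.8.7.4", 2014)
# [1] "http://www.sqlite.org/2014/sqlite-src-3080704.zip"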
Direct Access To The Sources
Also, if you want to compile it yourself: access to all SQLite source code is maintained in a CVS repository that is available for read-only access by anyone. You can interactively view the repository contents and download individual files by visiting this link.
Also, www.sqlite.org/src/timeline?t=release will show when every SQLite version was released.
Check out from CVS at the date you want and compile. Instructions on how to check out from CVS are here.
Note: use the -D option to check out by date.
The idea from TomWitt2 is fantastic (I had spent hours looking for a solution), but the links now seem to have been modified, so follow the steps below to get your archived version:
Get the latest version link from the download page here http://www.sqlite.org/download.html
Get your normalized 7-digit number as described in the answer by TomWitt2
Just replace the number in the link and you are ready to go
I've tried a few solutions on this page and elsewhere; all that I've seen seem outdated and no longer work. I did the steps below as of 5/4/2016 with success.
Go to http://system.data.sqlite.org/index.html/finfo?name=www/downloads.wiki to view the history of the SQLite downloads wiki.
Search (ctrl+f) for the version you want (ex. 1.0.91.0)
Select the commit ID and it will open that old version of the page complete with download links to source code as well as precompiled binaries and setups.
You won't always find the version in a search (ex. 1.0.98.0), but it should be reasonably easy to find the correct page by the surrounding versions/commits.
You can also check archive.org for the downloads page:
http://web.archive.org/web/20150101000000*/http://system.data.sqlite.org/index.html/doc/trunk/www/downloads.wiki
Find the date your desired version was released from the SQLite news page (you may need to pick the next archive date after that). Select your desired link (sometimes the download page was archived; more often, it seems, it was not). If the download page was not archived, edit the address bar to remove the archive.org-related info, and you should be able to navigate directly to the SQLite download page for that version.
Follow this link to the official website and under "3.0 Obtaining Code Directly From the Version Control System" you can read further directions:
get the list of release check-ins (this link)
choose the required check-in
download source code archive
The oldest release available now is from 2007-08-13.
You are correct to point out that https://www.sqlite.org/download.html only links to the most recent release, but a page that always links the current release, combined with the Wayback Machine preserving (hopefully) every version of said page, is all you need to download your favorite release:
http://web.archive.org/web/*/https://www.sqlite.org/download.html
Provided that the binaries themselves have not been removed, of course; fortunately, they seem intact to me.
In fact, I just downloaded SQLite 3.9.2 through this page:
http://web.archive.org/web/20151231190003/https://www.sqlite.org/download.html
You can get every release of SQLite from https://www.sqlite.org/chronology.html; this page contains the full history of SQLite releases.
