Python, save as mht - python-3.4

I can bring up a web page, no problem. I can save the web page as HTML, no problem. I need to save the web page as MHT so I can get all the HTML that stays hidden unless the page is saved as MHT. In researching this, I'm coming up with absolutely nothing on how to save as MHT using Python. As mentioned, I tried writing to an .mht file using the standard code for saving as HTML, but that simply doesn't work. I'm not surprised it doesn't, but it was worth a shot.
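import urllib.request

url = 'https://www.thewebsite.com'
html = urllib.request.urlopen(url).read()  # raw bytes of the page's HTML
# Write the bytes unchanged; str(html) would store the b'...' repr instead.
# The result is still plain HTML, not an MHT archive, whatever the extension.
with open('websitetest.mht', 'wb') as m:
    m.write(html)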
The site I'm trying to save has 'hidden code' that comes across when the page is saved as MHT, but not when it is saved as HTML. That is why I'm trying to save as MHT: I get all the code and can then go through it to pull out what I need to compile a database.

There is a very handy GitHub project written in Python 2.7 (you need to make simple modifications to make it compatible with Python 3.4). It has code for packing and unpacking MHT files. I think this is what you are looking for:
Un/packs an MHT (MHTML) archive into/from separate files, writing/reading them in directories to match their Content-Location.
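For the packing direction, an MHT file is a MIME multipart/related message, so Python's standard email package can assemble one. The sketch below is not the GitHub project's code: build_mhtml and the shape of the resources mapping are hypothetical, and it assumes you have already fetched the page and its resources yourself. It cannot capture content that a browser only generates by running JavaScript.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html_text, resources=None):
    """Pack a page and its resources into MHTML bytes.

    resources (hypothetical shape): {resource_url: (mime_type, raw_bytes)}
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # The first part is the root HTML document.
    root = MIMEText(html_text, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Every other part is tied back to the HTML through Content-Location.
    for res_url, (mime_type, data) in (resources or {}).items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        part["Content-Location"] = res_url
        archive.attach(part)

    return archive.as_bytes()
```

Writing the returned bytes to a file opened in 'wb' mode gives the archive on disk. Each resource URL passed in should match the URL used in the page's src or href attributes, since that is how the parts are matched up.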

I recently came across the same issue:
I wanted to convert an HTML page to MHT format.
I followed Tim Golden's Python Stuff and was able to achieve it using win32com.
http://timgolden.me.uk/python/win32_how_do_i/create-an-mhtml-archive.html
import win32com.client as win32

# Issues found:
# 1. For local files, pass the path in URL format, e.g. file://directory01/directory02/index.html, with %20-style escaping for special characters.
# 2. The same applies to files referenced inside the HTML file, e.g. src="file://reference/directory01/smiley.png".
# 3. Rare: if an alt attribute accompanies src, images are not embedded into the MHT correctly; try removing the alt attribute from the page before calling CreateMHTMLBody.
URL = r'C:\WorkSpace\chetan_index.html'
message = win32.gencache.EnsureDispatch('CDO.Message')
message.CreateMHTMLBody(URL, 0)  # 0: suppress nothing; download all images and other resources
stream = win32.gencache.EnsureDispatch(message.GetStream())
stream.SaveToFile(r'C:\temp\saved_mht.mht', 2)  # 2: overwrite an existing file; 1: do not overwrite
stream.Close()
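The first issue noted in the comments (local paths have to be passed in URL form with %20-style escaping) can be handled with a small helper. This is a minimal sketch: windows_path_to_file_url is a hypothetical name, it only covers simple drive-letter paths, and the three-slash file:/// form it produces is an assumption about what CreateMHTMLBody accepts.

```python
from urllib.parse import quote


def windows_path_to_file_url(path):
    """Turn a drive-letter path such as C:\\dir\\my page.html into a file URL."""
    forward = path.replace("\\", "/")
    # Keep "/" and the drive colon literal; escape spaces and other specials.
    return "file:///" + quote(forward, safe="/:")


print(windows_path_to_file_url(r"C:\WorkSpace\my page.html"))
# file:///C:/WorkSpace/my%20page.html
```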

Related

Why can't I view or open downloaded PNG file from URL

I have downloaded an image (in fact several images, using a for loop) with the code below. However, these images will not open, though they seem to have downloaded completely. They do not open even in a plain photo editor, Paint, or similar tools. I would appreciate input on what should be done.
Below is the code that I tried with for loop:
p <- c("http://assets.pokemon.com/assets/cms2/img/pokedex/full/001.png",
"http://assets.pokemon.com/assets/cms2/img/pokedex/full/002.png",
"http://assets.pokemon.com/assets/cms2/img/pokedex/full/003.png",
"http://assets.pokemon.com/assets/cms2/img/pokedex/full/003_f2.png",
"http://assets.pokemon.com/assets/cms2/img/pokedex/full/004.png")
p
for (url in p)
download.file(url, destfile=file.path("C:/Users/xyz/Desktop/test",basename(url)))
library(imager)
# loading only first image for viewing
i <- load.image("C:/Users/xyz/Desktop/test/001.png")
plot(i)
Then I just downloaded a single file giving a simple destination name and tried to load and display using the below code.
download.file("http://assets.pokemon.com/assets/cms2/img/pokedex/full/001.png",
destfile=file.path("C:/Users/xyz/Desktop/test","first_img.png"))
i_s <- load.image("C:/Users/xyz/Desktop/test/first_img.png")
plot(i_s)
In both the cases I am getting the below error message.
Error in read.bitmap(file) :
File f: C:/Users/xyz/Desktop/test/001.png does not appear to be a PNG, BMP, JPEG, or TIFF
Likewise, if I try to open the downloaded images using Photos, a photo editor, Adobe, Paint, etc., I get similar messages such as "format not supported" or "unable to load photo". However, note that if I simply copy and paste the image URL into the browser, the image displays perfectly.
Appreciate inputs on what can be done here.
Looks like you have to set mode = "wb" in download.file. The manual says:
The choice of binary transfer (‘mode = "wb"’ or ‘"ab"’) is important on Windows, since unlike Unix-alikes it does distinguish between text and binary files and for text transfers changes ‘\n’ line endings to ‘\r\n’ (aka ‘CRLF’).
On Windows, if ‘mode’ is not supplied (‘missing()’) and ‘url’ ends in one of ‘.gz’, ‘.bz2’, ‘.xz’, ‘.tgz’, ‘.zip’, ‘.jar’, ‘.rda’, ‘.rds’ or ‘.RData’, ‘mode = "wb"’ is set so that a binary transfer is done to help unwary users.
So for the single file try:
download.file("http://assets.pokemon.com/assets/cms2/img/pokedex/full/001.png",
destfile=file.path("C:/Users/vsvas/Desktop/test","first_img.png"),
mode = "wb")
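To see why a text-mode transfer breaks a PNG, note that every PNG file starts with a fixed 8-byte signature that contains both a \r\n pair and a lone \n. The sketch below is in Python rather than R, uses hypothetical helper names, and mimics the \n to \r\n rewrite the manual describes on an in-memory byte string instead of a real download.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every valid PNG


def looks_like_png(data):
    """Return True if the bytes start with the PNG signature."""
    return data[:8] == PNG_SIGNATURE


def simulate_text_mode(data):
    """Mimic a text-mode write on Windows: each \\n becomes \\r\\n."""
    return data.replace(b"\n", b"\r\n")


original = PNG_SIGNATURE + b"rest of the file"
damaged = simulate_text_mode(original)

print(looks_like_png(original))  # True
print(looks_like_png(damaged))   # False
```

Reading the first 8 bytes of a downloaded file in binary mode and passing them to looks_like_png is a quick way to tell whether the file was damaged in transfer.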

Is there a way to (de)serialize and hardcode an image in a R-markdown file?

I want to load an image into a .rmd file.
I am working on a university project where I have to hand in this .rmd file at the end. As a restriction, this file has to run out of the box, so unfortunately it is not an option to load the image from a given file path as I can't submit a folder containing the image or similar (and I don't want to upload it somewhere and access it via URL either).
I was looking for a way to serialize the image information and hard code it into the file but I couldn't find anything helpful related to that.
So in short, I want to do the following:
serialize image
hard code serialized image as variable in .rmd
deserialize hard coded variable and plot image data in .rmd
Is there a simple way to do this?
This seems to be unfeasible. I've tried:
image <- load.image(path)
serialized_image <- serialize(image, NULL)
some_file <- file("test", "wb")
write(serialized_image, "test")
and then I wanted to open test and copy the content manually with CTRL+C, CTRL+V into a variable in my .rmd but it turned out that the content is way too large for doing that (around 30 million lines of bytes..), so theoretically this would be a solution but unless you want a 30 million lines .rmd, this is not an option.

Figuring out file extension before downloading

I'm doing a project that requires going into a database of the Brazilian equivalent of the FTC and downloading a few files (which I will later process), and I want to automate this using R.
My problem is that when naming the file, I have to tell it the file extension, and I don't know what it will be (usually it will be a scanned PDF, but sometimes it will be an HTML file). Here is an example:
https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_processo_exibir.php?0c62g277GvPsZDAxAO1tMiVcL9FcFMR5UuJ6rLqPEJuTUu08mg6wxLt0JzWxCor9mNcMYP8UAjTVP9dxRfPBcbZvmE_iaYkTbpPedZsRpa1llf9W8WXxdUJxor5q0IiE
I want the first and the tenth file. Downloading them is easy:
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao", destfile = 'C:/teste/teste1', mode = 'wb')
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH", destfile = 'C:/teste/teste2', mode = 'wb')
The thing is, I don't know which one is a pdf file and which one is an html file without manually trying to open them with another program. Is there any way to tell R to automatically add the correct file extension when downloading?
If you use the httr package, you can get the content-type header which will help you decide what type of file it is. You can use the HEAD() function to get the headers of the files. For example with your URLs
urls <- c(
"https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao",
"https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH"
)
You can write a helper function
library(httr)  # provides HEAD() and headers()
get_content_type <- function(x) {
  unname(sapply(x, function(x) headers(HEAD(x))[["content-type"]]))
}
get_content_type(urls)
# [1] "application/pdf;" "text/html; charset=ISO-8859-1"
This returns the MIME type for each URL; you can grep for things like "pdf" to save as a PDF or "html" for web pages. I'm not sure what other types of files might be available. There is no single "correct" file name for a given file type, so you'd need to make that decision yourself.
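The mapping from a content-type header to an extension can be sketched as follows. This is in Python rather than R, and the EXTENSIONS table, the extension_for name, and the ".bin" fallback are all hypothetical choices; only the two header strings come from the output above.

```python
# Hypothetical mapping; extend it for any other types the site serves.
EXTENSIONS = {"application/pdf": ".pdf", "text/html": ".html"}


def extension_for(content_type, default=".bin"):
    """Pick a file extension from a raw Content-Type header value."""
    # Drop parameters such as "; charset=ISO-8859-1" and a trailing ";".
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(mime_type, default)


for header in ["application/pdf;", "text/html; charset=ISO-8859-1"]:
    print(header, "->", extension_for(header))
```

Matching on the exact MIME type after stripping parameters avoids false hits that a plain substring search could produce on an unexpected header.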

Reading data files in blogdown

I am a new user of blogdown using Hugo. I would like to create a new post that includes R code to read a data file.
The data file is in my static folder, local path C:\mydir\myblogdown\static\data\savedrecs.txt. Since I was successful in referring to an image using a relative path like this, ![](/images\myimage.jpg), I tried reading the data using something similar for the data file, read.csv("/data\savedrecs.txt"), but that didn't work.
I started playing around with the list.files() function to see if I could find a relative path that worked in my local version of the post. list.files("../../static/data") worked, showing me ## [1] "savedrecs.txt".
I tried searching around other folks' blogdown repos on Github to see how they might have referred to a data file, but the only example I found referred to a data file using a URL.
I suspect that this may be due to the location of your data file. In my working example, the Rmd form of my blog post is in the directory, called p0bs/content/post. I also add my data file (a CSV in my case) to this directory.
I then treat the rest of the post as I would in a standard Rmarkdown website, with Rmd chunks (that are named without spaces). In your case, that code would include:
read.csv("savedrecs.txt")
I hope that helps you.

Importing the contents of a word document into R

I am new to R and have been working as follows for a while. I have the code written in a Word document, and I copy and paste it into R to run it. This works fine, but when the code is long (a hundred pages), R takes a significant amount of time before it starts running the code. This does not seem like a very effective way of working, and I am sure there are other ways to run R code.
One alternative that comes to mind is to import the contents of the Word document into R, but I am unsure how to do that. I have tried read.table, but it does not work. I have looked on the internet for how to import data, but most explanations are for data tables or for internet files in the form of data tables and similar. I have tried saving the document as CSV, but Word does not offer CSV. I have also tried Rich Text Format and the XML package, but again the package instructions are for importing tables and the like. I am wondering if there is an effective way for R to import a Word document exactly as it appears in Word.
Thank you
It's hard to say what the easiest solution would be, without examining the word document. Assuming it only contains code and nothing else, it should be pretty easy to convert it all to plain text from within Word. You can do that by going to File -> Save As, and use 'plain text' under 'Save as type'.
Then edit the filename extension to .R from .txt, download a proper text editor (I can recommend RStudio for R), and open your code in it. Then you will be able to run the code from inside the editor without using copy / paste.
No, read.table won't do it.
Microsoft Word has its own format, which includes a lot of meta data over and above the text you enter into it. You'll need a reader/parser that understands the Word format.
A Java developer would use a library like Apache POI to read and parse it into word tokens and n-grams.
Look for Natural Language Processing tools, like the R packages listed in this CRAN task view:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
