Figuring out file extension before downloading - r

I'm doing a project that requires going into a database of the brazillian equivalent of the FTC and downloading a few files (which I will later process), and I want to automate this using R.
My problem is that when naming the file, I have to tell it the file extension, and I don't know what it will be (usually it will be a scanned pdf, but sometimes it will be an html file). Here an example:
https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_processo_exibir.php?0c62g277GvPsZDAxAO1tMiVcL9FcFMR5UuJ6rLqPEJuTUu08mg6wxLt0JzWxCor9mNcMYP8UAjTVP9dxRfPBcbZvmE_iaYkTbpPedZsRpa1llf9W8WXxdUJxor5q0IiE
I want the first and the tenth file. Downloading them is easy:
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao", destfile = 'C:/teste/teste1', mode = 'wb')
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH", destfile = 'C:/teste/teste2', mode = 'wb')
The thing is, I don't know which one is a pdf file and which one is an html file without manually trying to open them with another program. Is there any way to tell R to automatically add the correct file extension when downloading?

If you use the httr package, you can get the content-type header which will help you decide what type of file it is. You can use the HEAD() function to get the headers of the files. For example with your URLs
urls <- c(
"https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao",
"https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH"
)
You can write a helper function
get_content_type <- function(x) {
unname(sapply(x, function(x) headers(HEAD(x))[["content-type"]]))
}
get_content_type(urls)
# [1] "application/pdf;" "text/html; charset=ISO-8859-1"
These return mime-type, but you can grep for things like "pdf" to save as a PDF or "html" for web pages. Not sure what other types of files might be available. There is no "correct" file name for a given file type so you'd need to make that decision yourself.

Related

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[2]
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
I seriously doubt Excel is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kind of files (In particular Excel and h5) are binary rather than text files. This needs an added setting to warn R that it is a binary file and should be handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or, for use downloader, and ty something like this.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.csv", mode="wb")

Cannot import csv file into R

I'm trying to import the csv files from my working directory. There are 3 such files, but for some reason R insists on recognizing only one of them. I can't determine what the pattern is, and if the recognized file is moved out of the folder then nothing is recognized. Here is my code:
files = list.files(pattern="*\\.csv$")
Each of the files is for sure a csv file which I confirmed by inspecting the "Type" column in the windows folder navigator, and to be safe I also saved a copy as CSV and still had the same problem.
Is there an aspect to this I'm unaware of?
Thanks!
The issue turned out to be that the file extension for the file that worked was ".csv" and for the ones that didn't was ".CSV". I do not know how or why something like that can happen, but the pattern parameter of the list.files function is case sensitive.
Using the parameter setting ignore.case = TRUE solved this issue.

Python, save as mht

I can bring up a web page, no problem. I can save the webpage...as html, no problem. I need to save the webpage as mht so I can can get all the html that gets hidden without saving as mht. In researching I'm coming up with absolutely nothing as to how to save as mht using python. Like I said above I can try to save it as a mht file, using the standard coded for saving as html but that simply doesn't work...not surprised it doesn't work either, but it was worth a shot.
url = 'https://www.thewebsite.com'
html = urllib.request.urlopen(url).read()
m = open('websitetest.mht', 'w')
m.write(str(html))
m.close()
The site I'm trying to save does 'hidden code' that comes across when saved as mht, but not when saved as html. Hence why I'm trying to save as mht so I get all the code and then can go through the code to pull off what I need to compile a database.
There is a very handy Github project coded in Python 2.7 (you need to make simple modifications to make it compatible with Python 3.4). This project has code for packing/unpacking MHT files. I think this is what you are looking for:
Un/packs an MHT (MHTML) archive into/from separate files, writing/reading them in directories to match their Content-Location.
Recently came accross the same issue,
I wanted to convert html page to mht format.
Followed Tim Golden's Python stuff and was able to achieve it using win32com.
http://timgolden.me.uk/python/win32_how_do_i/create-an-mhtml-archive.html
import win32com.client as win32
URL = r'C:\WorkSpace\chetan_index.html' # issues found 1> One while using local files, pass the path in url format like file://directory01/directory02/index.html with %20 formating for special characters
# 2> Also same to be followed for files reffered internally inside html file i.e. src="file://reference/directory01/smiley.png"
# 3> Rare issue, if alt tag is found with src, images are not embedded into mht coreectly, trying poping alt tag from web page and then call CreateMHTMLBody
message = win32.gencache.EnsureDispatch('CDO.Message')
message.CreateMHTMLBody(URL, 0) # 0 - suppress none , download all images and others
stream = win32.gencache.EnsureDispatch(message.GetStream())
stream.SaveToFile(r'C:\temp\saved_mht.mht', 2) # 2, for overwrite existing file, 1 for not to overwrite
stream.Close()

Downloading sound files using urls in R

I am trying to download some sound files through R (mostly mp3). I've started off using download.file() like below. However, the sound files downloaded this way sound horrible and it's like as if they're playing way too fast. Any ideas?
download.file("http://www.mfiles.co.uk/mp3-downloads/frederic-chopin-piano-sonata-2-op35-3-funeral-march.mp3","test.mp3")
Even better than if the above function would work, is there a way do download files without having to specify the extension? Sometimes I only have the redirecting page.
Thanks!
Try explicitly setting binary mode with mode="wb":
download.file("http://www.mfiles.co.uk/mp3-downloads/frederic-chopin-piano-sonata-2-op35-3-funeral-march.mp3",
tf <- tempfile(fileext = ".mp3"),
mode="wb")
(You can view the filename with cat(tf).)

Opening an .R files EDIT: gzip files saved with .R extension. Is there any other way to access them?

I have a number of R files with an .R extension. I've tried various ways to see what is inside these file, including Xcode, vim, etc.
What I find is utterly indecipherable. For example, it looks like this Lçæ§o‡dµ’Ò6ÇìùëfiFŒÀ±y2Â8á∫˝É, but pages of it.
Is it safe to say that these files are fundamentally corrupt? Or should I be using R to open these files to see what's actually in them?
EDIT: I've never worked with a file like this. After using load() in R, how would I read the data? I have used
> data <- load("~/filename.RData")
> data
The output is [1] "filename".
EDIT2: It appears these are gzip files saved with an .R extension. I can using load() to read the data into R. Is there any other way I can access these data files?
"filename" is now loaded and it is stored in an object of the same name. You should be able to see what it is inside by running:
filename

Resources