Download a file directly to memory in R

I'm wondering if this is possible.
I've got a link to a PDF file and I need to get its MD5 sum (my earlier question). For that I'll need to download it first, and I want to do it in the most efficient way. I'm guessing that since the files are small and RAM is faster than disk, the obvious path would be to store the file in RAM, calculate the MD5 sum, and throw it away.
So far I'm using this:
library(tools)  # md5sum() lives in the tools package

link <- "xxxxx.pdf"
download.file(link, "temp.pdf", quiet = TRUE, mode = "wb")  # download to disk
result <- md5sum("temp.pdf")                                # hash the file
unlink("temp.pdf")                                          # clean up
I just read about tempfile() and that I could use it instead of "temp.pdf", but it looks like I'd just be adding an extra line to the code.
Are there better methods to get the files?
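One possible in-memory approach, sketched below, skips the disk entirely: download the body into a raw vector with httr and hash those bytes with the digest package. Both packages are assumptions on my part (neither appears in the question); passing serialize = FALSE makes digest() hash the raw bytes themselves, which should match md5sum() on the same file.
library(httr)    # assumption: for GET() / content()
library(digest)  # assumption: for digest()

link <- "xxxxx.pdf"
resp <- GET(link)                   # response body is held in memory
bytes <- content(resp, as = "raw")  # raw vector of the file's bytes
# serialize = FALSE hashes the raw vector directly
result <- digest(bytes, algo = "md5", serialize = FALSE)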

Related

Does R have an equivalent to python's io for saving file like objects to memory?

In Python we can import io and then make a file-like object with some_variable = io.BytesIO(), then download any type of file to it and interact with it as if it were a locally saved file, except that it's in memory. Does R have something like that? To be clear, I'm not asking about what any particular OS does when you save some R object to a temp file.
This is something of a duplicate of Can I write to and access a file in memory in R?, but that question is about 9 years old, so perhaps the functionality exists now, either in base R or in a package.
Yes, readBin.
readBin("/path", raw(), file.info("/path")$size)
This is a working example:
tfile <- tempfile()
writeBin(serialize(iris, NULL), tfile)
x <- readBin(tfile, raw(), file.info(tfile)$size)
unserialize(x)
…and you get back your iris data.
This is just an example; for R objects it is far more convenient to use saveRDS()/readRDS().
However, if the object is, say, an image you want to analyse, readBin() gives you its raw in-memory representation.
For text files, you should then use:
rawToChar(x)
but again there are readLines(), read.table(), etc., for these tasks.
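For something closer to io.BytesIO — a connection you can hand to functions that expect a file — base R also has rawConnection(). This minimal sketch is my addition, not part of the answer above:
# Write into an in-memory connection instead of a file
con <- rawConnection(raw(0), "r+")
writeBin(serialize(iris, NULL), con)
x <- rawConnectionValue(con)  # retrieve the accumulated raw vector
close(con)
unserialize(x)                # ...and you get iris back again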

Create parquet file directory from CSV file in R

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI, but I recently came across arrow and would like to explore it more.
The problem is that the flat files I typically work with are large enough that they cannot be read into R without help. So I would ideally prefer a way to make the conversion without actually needing to read the dataset into R in the first place.
Any help you can provide would be much appreciated!
arrow::open_dataset() can work on a directory of files and query them without reading everything into memory. If you do want to rewrite the data into multiple files, potentially partitioned by one or more columns in the data, you can pass the Dataset object to write_dataset().
One (temporary) caveat: as of {arrow} 3.0.0, open_dataset() only accepts a directory, not a single file path. We plan to accept a single file path or list of discrete file paths in the next release (see issue), but for now if you need to read only a single file that is in a directory with other non-data files, you'll need to move/symlink it into a new directory and open that.
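As a minimal sketch of that directory-based workflow (the paths and the partition column here are hypothetical):
library(arrow)

# Open a directory of CSVs lazily, then rewrite as partitioned Parquet
ds <- open_dataset("csv_dir/", format = "csv")
write_dataset(ds, "parquet_dir/",
              format = "parquet",
              partitioning = "year")  # hypothetical partition column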
You can do it this way:
library(arrow)
library(dplyr)

csv_file <- "obs.csv"
dest <- "obs_parquet/"

# Declare the column types up front so arrow doesn't have to guess them
sch <- arrow::schema(checklist_id = float32(),
                     species_code = string())

# Stream the CSV rather than reading it into memory
csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)

# Write it back out as Parquet, split into ~1M-row files
write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite")
In my case (a 56GB csv file), I ran into a really weird situation with the resulting parquet tables, so double-check your parquet tables to spot any funky new rows that didn't exist in the original csv. I filed a bug report about it:
https://issues.apache.org/jira/browse/ARROW-17432
If you hit the same issue, use the Python Arrow library to convert the csv into parquet and then load it into R. The code is also in the Jira ticket.

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as a webpage using R, and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (Excel webpage format) and the file extension (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services write files with an .xls or .xlsx extension without them being valid Excel files. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than with csv files. The file will have been stored in a format Excel can deal with (hence your warning message), but R's libraries are stricter and fail when they don't see the file type they were expecting.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp <- readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element with double brackets, so you get the data frame itself rather than a one-element list, e.g.
df <- dTemp[[2]]
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
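A minimal sketch of that last step, assuming df now holds the cleaned data frame (the output file name is hypothetical):
library(WriteXLS)
# Writes an old-style .xls workbook (WriteXLS requires a Perl installation)
WriteXLS(df, ExcelFileName = "output.xls")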
I seriously doubt the Excel file is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and HDF5) are binary rather than text files, and this needs an added setting to warn R that the file is binary and should be handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or, using the downloader package, try something like this:
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile = "localcopy.xlsx", mode = "wb")

Figuring out file extension before downloading

I'm doing a project that requires going into a database of the Brazilian equivalent of the FTC and downloading a few files (which I will later process), and I want to automate this using R.
My problem is that when naming the file, I have to tell it the file extension, and I don't know what it will be (usually it will be a scanned PDF, but sometimes it will be an HTML file). Here's an example:
https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_processo_exibir.php?0c62g277GvPsZDAxAO1tMiVcL9FcFMR5UuJ6rLqPEJuTUu08mg6wxLt0JzWxCor9mNcMYP8UAjTVP9dxRfPBcbZvmE_iaYkTbpPedZsRpa1llf9W8WXxdUJxor5q0IiE
I want the first and the tenth file. Downloading them is easy:
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao", destfile = 'C:/teste/teste1', mode = 'wb')
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH", destfile = 'C:/teste/teste2', mode = 'wb')
The thing is, I don't know which one is a pdf file and which one is an html file without manually trying to open them in another program. Is there any way to tell R to automatically add the correct file extension when downloading?
If you use the httr package, you can get the content-type header, which will help you decide what type of file it is. You can use the HEAD() function to get the headers of the files. For example, with your URLs:
urls <- c(
  "https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao",
  "https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH"
)
You can write a helper function:
library(httr)

get_content_type <- function(x) {
  unname(sapply(x, function(x) headers(HEAD(x))[["content-type"]]))
}
get_content_type(urls)
# [1] "application/pdf;" "text/html; charset=ISO-8859-1"
These return MIME types, but you can grep for things like "pdf" to save as a PDF or "html" for web pages. I'm not sure what other types of files might be available. There is no single "correct" extension for a given file type, so you'd need to make that decision yourself.
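As a rough sketch of how that decision could be wired up (the extension mapping below is my own assumption and only covers the two types seen above):
# Map a content-type string to a file extension; extend as needed
ext_for <- function(ct) {
  if (grepl("pdf", ct)) ".pdf"
  else if (grepl("html", ct)) ".html"
  else ""
}

types <- get_content_type(urls)
for (i in seq_along(urls)) {
  download.file(urls[i],
                destfile = paste0("C:/teste/teste", i, ext_for(types[i])),
                mode = "wb")
}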

Downloading sound files using urls in R

I am trying to download some sound files through R (mostly mp3). I've started off using download.file() as below. However, the sound files downloaded this way sound horrible, as if they're playing way too fast. Any ideas?
download.file("http://www.mfiles.co.uk/mp3-downloads/frederic-chopin-piano-sonata-2-op35-3-funeral-march.mp3","test.mp3")
Even better than getting the above to work: is there a way to download files without having to specify the extension? Sometimes I only have the redirecting page.
Thanks!
Try explicitly setting binary mode with mode="wb":
download.file("http://www.mfiles.co.uk/mp3-downloads/frederic-chopin-piano-sonata-2-op35-3-funeral-march.mp3",
tf <- tempfile(fileext = ".mp3"),
mode="wb")
(You can view the filename with cat(tf).)
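For the second part of the question (not covered by the answer above), one hedged approach is to let httr follow the redirects and take the extension from the final resolved URL. This is an assumption on my part, and the URL below is hypothetical:
library(httr)

# HEAD() follows redirects; resp$url holds the final resolved URL
resp <- HEAD("http://example.com/page-that-redirects-to-audio")
ext <- tools::file_ext(resp$url)  # e.g. "mp3"
tf <- tempfile(fileext = paste0(".", ext))
download.file(resp$url, tf, mode = "wb")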
