download large zipped csv over https, unzip, and load - r

I'm trying to follow this example to download a zipped file over https, extract the csv file (14GB), and load the data into a dataframe. I created a small example (<1MB).
library(data.table)
temp <- tempfile()
download.file("https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip",
temp, method="curl")
data <- fread(unz(temp, "example.csv"))
unlink(temp)
Is my mistake obvious?

This works fine for me (download.file does too, but I'm on R 3.2.2 on OS X, so this is more "portable" given the updates to download.file since 3.1.2):
library(httr)
response <- GET("https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip?dl=1",
                write_disk("example.csv.zip"),
                progress())
fil <- unzip("example.csv.zip")
read.csv(fil[1], stringsAsFactors=FALSE)
## v1 v2 v3
## 1 1 2 3
## 2 1 2 3
## 3 1 2 3
I didn't try it without the ?dl=1 (and I add that by rote, not due to the edit queue suggestion).
Honestly, though, I'd probably skip doing the download in R and just use curl on the command line in an automated workflow for files of the size you've indicated (and I'd do that if the processing language were Python [et al.], too).
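For files that big, a minimal sketch of that command-line route driven from R might look like the following (it assumes curl and unzip are available on the PATH; the URL and csv name are the ones from the small example above):
library(data.table)
# download with curl (-L follows the Dropbox redirect), then unzip and read
system("curl -L -o example.csv.zip 'https://www.dropbox.com/s/h130oe03krthcl0/example.csv.zip?dl=1'")
system("unzip -o example.csv.zip")
data <- fread("example.csv")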

In my case I was trying to download the zip file over http and simply create a stream to unzip that file into a folder.
After some searching I was able to write the following code, which does the job.
Here are the few steps you have to follow:
Install the unzipper package
Import unzipper and http into your code file:
import unzipper from 'unzipper';
import http from 'http';
Now you have to download the zip file and create a stream for it; here is the complete code:
import unzipper from 'unzipper';
import http from 'http';
var self = this;
http.get('http://yoururl.com/file.zip', function(res) {
  res.pipe(unzipper.Extract({ path: 'C:/cmsdata/' })).on('close', function() {
    // Here you can perform any action after completion of stream unzipping
  });
});
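For completeness in R, a hedged equivalent of the same idea (download a zip over http and extract it into a folder), reusing the placeholder URL and output directory from the snippet above:
zipfile <- tempfile(fileext = ".zip")
download.file("http://yoururl.com/file.zip", destfile = zipfile, mode = "wb")
unzip(zipfile, exdir = "C:/cmsdata/")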

Related

Read open excel file in R

Is there a way to read an open excel file into R?
When an excel file is open in Excel, Excel puts a lock on the file, such that the reading method in R cannot access the file.
Can you circumvent this lock?
Thanks
Edit: this occurs under Windows with original Excel.
I too do not have a problem opening xlsx files that are already open in Excel, but if you do, I have a workaround that might work:
path_to_xlsx <- "C:/Some/Path/to/test.xlsx"
temp <- tempdir()
file.copy(path_to_xlsx, to = paste0(temp, "/test.xlsx"))
df <- openxlsx::read.xlsx(paste0(temp, "/test.xlsx"))
This copies the file (which should not be blocked) to a temporary directory, and then loads the file from there. Again, I'm not sure if this is needed, as I do not have the problem you have.
You could try something like this using the ps package. I've used it on Windows and Mac to read files (downloaded from some web resource and currently open in Excel) with openxlsx2, but it should work with other packages or programs too.
# get the path to the open file via the ps package
library(ps)
p <- ps()
# get the pid for the current program, in my case Excel on Mac
ppid <- p$pid[grepl("Excel", p$name)]
# get the list of open files for that program
pfiles <- ps_open_files(ps_handle(ppid))
pfile <- pfiles[grepl(".xlsx", pfiles$path),]
# return the path to the file
sel <- grepl("^(.|[^~].*)\\.xlsx", basename(pfile$path))
path <- pfile$path[sel]
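From there, assuming a single match was found, you can read the workbook straight from the recovered path with whichever reader you prefer, e.g.:
dat <- readxl::read_excel(path)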
What do you mean by "the reading method in R", and by "cannot access the file" (i.e. what code are you using and what error message do you get exactly)? I'm successfully importing Excel files that are currently open, with something like:
dat <- readxl::read_excel("PATH/TO/FILE.xlsx")
If the file is being edited in Excel, R imports the last saved version.
EDIT: I've now tried it on both Linux and Windows and it still works, at least with version 1.3.1 of 'readxl'.

Download a large zipped CSV file, unzip and read into R on Linux

I wish to read a large CSV (~8 GB) into my environment, but I am having issues.
My data is a publicly available dataset:
# CREATE A TEMP FILE TO STORE THE DOWNLOADED DATA
temp <- tempfile()
# DOWNLOAD THE FILE FROM THE CMS
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
destfile = temp)
This is where I'm running into difficulty: I am unfamiliar with Linux working directories and where temp folders are created.
When I use list.dirs() or list.files() I don't see any reference to this temp file.
I am working in an R project and my working directory is as follows:
getwd()
[1] "/home/myName/myProjectName"
I'm able to read in the first part of the file, but my system crashes after about 4 GB.
# UNZIP THE NPI FILE
npi <- unz(temp, "npidata_pfile_20050523-20220213.csv")
I then came across this post, which has a function for decompressing large zip files using system2 and the command-line unzip utility. However, due to my limited R knowledge and Linux experience, I couldn't get the function to point to the downloaded file in the temp folder.
Checking the path for temp above, I get the following:
temp
[1] "/tmp/Rtmpl6SHIJ/file7e5e6c1fc693"
Using the system2 function from the link above I tried the following:
x <- decompress_file(directory = temp,
                     file = "NPPES_Data_Dissemination_February_2022.zip")
But I get the following error about setting the working directory:
Any pointers on how I can get this file unzipped, given its size, and read into memory would be much appreciated.
It might be a file permission issue. To get around it, work in a directory you're already in, or one you know you have access to.
# DOWNLOAD THE FILE
# to a directory you can access, and name the file. No need to overcomplicate this.
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = "/home/myName/myProjectname/npi.zip")
# use the decompress function if you need to, though unzip might work
x <- decompress_file(directory = "/home/myName/myProjectname/",
                     file = "npi.zip")
# remove .zip file if you need the space back
file.remove("/home/myName/myProjectname/npi.zip")
temp is the path to the file, not just the directory. By default, tempfile does not add a file extension. It can be done by using tempfile(fileext = ".zip")
Consequently, decompress_file cannot set the working directory to a file. Try this:
x <- decompress_file(directory = dirname(temp), file = basename(temp))
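Putting the two answers together, here is a hedged end-to-end sketch (the csv name inside the zip is the one from the question; for an archive this size the command-line unzip, called via system2, tends to be more dependable than unz()):
library(data.table)
temp <- tempfile(fileext = ".zip")
download.file("https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2022.zip",
              destfile = temp, mode = "wb")
# extract next to the downloaded zip, then read the large csv with fread
system2("unzip", args = c("-o", shQuote(temp), "-d", shQuote(dirname(temp))))
npi <- fread(file.path(dirname(temp), "npidata_pfile_20050523-20220213.csv"))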

how to download files from a sharepoint folder with R

I have some files in my company's SharePoint folder. I need to use R to download all files in that folder. How can I do that with R?
I googled the question and got this link. It's about uploading a file to a SharePoint folder.
Uploading files to SharePoint from R
The code from the link is like below:
saveToSharePoint <- function(fileName)
{
cmd <- paste("curl --max-time 7200 --connect-timeout 7200 --ntlm --user","username:password",
"--upload-file /home/username/FolderNameWhereTheFileToTransferExists/",fileName,
"teamsites.OrganizationName.com/sites/PageTitle/Documents/UserDocumentation/FolderNameWhereTheFileNeedsToBeCopied/",fileName, sep = " ")
system(cmd)
}
saveToSharePoint("SomeFileName.Ext")
It's about uploading one specific file, but what I need is to download files from a SharePoint folder.
So I modified the code to copy from the SharePoint folder: I changed --upload-file to --download-file.
copyFromSharePoint <- function(fileName)
{
cmd <- paste("curl --max-time 7200 --connect-timeout 7200 --ntlm --user","username:password",
"--download-file teamsites.OrganizationName.com/sites/PageTitle/Documents/FolderNameWhereTheFileToTransferExists/",fileName,
"home/username/UserDocumentation/FolderNameWhereTheFileNeedsToBeCopied/",fileName, sep = " ")
system(cmd)
}
copyFromSharePoint("SomeFileName.Ext")
However, that does not seem to work. Error message below:
curl: option --download-file: is unknown
Does anyone know how I can do it with R?
In addition, what if I need to download all the files in the folder, not one specific file? Does anyone know how I can do that with R?
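As a side note on the error itself: curl has no --download-file option; downloading is curl's default action, and the destination is set with --output (or -o). A hedged fix of the modified function along those lines, keeping the same placeholder credentials, hostnames and paths:
copyFromSharePoint <- function(fileName)
{
  cmd <- paste0("curl --max-time 7200 --connect-timeout 7200 --ntlm --user username:password",
                " --output /home/username/UserDocumentation/FolderNameWhereTheFileNeedsToBeCopied/", fileName,
                " teamsites.OrganizationName.com/sites/PageTitle/Documents/FolderNameWhereTheFileToTransferExists/", fileName)
  system(cmd)
}
copyFromSharePoint("SomeFileName.Ext")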
Solution
I made a simple R Package called sharepointr to upload and download files from SharePoint.
The way I made the whole download/upload to sharepoint work from R was to:
Make a SharePoint App Registration
Add permissions to App Registration
Getting "tenant_id" and "resource_id" using cURL
Make get/post calls to API
It is all described in the package Readme.md
Example
# Installing the package
install.packages("devtools")
devtools::install_github("esbeneickhardt/sharepointr")
# Setting parameters
client_id <- "insert_from_first_step"
client_secret <- "insert_from_first_step"
tenant_id <- "insert_from_fourth_step"
resource_id <- "insert_from_fourth_step"
site_domain <- "yourorganisation.sharepoint.com"
sharepoint_url <- "https://yourorganisation.sharepoint.com/sites/MyTestSite"
# Getting a SharePoint Token
sharepoint_token <- get_sharepoint_token(client_id, client_secret, tenant_id, resource_id, site_domain)
# Getting sharepoint digest value
sharepoint_digest_value <- get_sharepoint_digest_value(sharepoint_token, sharepoint_url)
# Downloading a file
sharepoint_path <- "Shared Documents/test"
sharepoint_file_name <- "Mappe.xlsx"
out_path <- "C:/Users/User/Desktop/"
download_sharepoint_file(sharepoint_token, sharepoint_url, sharepoint_digest_value, sharepoint_path, sharepoint_file_name, out_path)

Reading csv files from microsoft Azure using R

I have recently started working with Databricks and Azure.
I have Microsoft Azure Storage Explorer. I ran a jar program on Databricks
which outputs many csv files in Azure Storage Explorer under the path
..../myfolder/subfolder/output/old/p/
The usual thing I do is to go to the folder p, download all the csv files
by right-clicking the p folder and clicking download to my local drive,
and then read these csv files in R to do any analysis.
My issue is that sometimes my runs can generate more than 10000 csv files,
which take a lot of time to download to the local drive.
I wondered if there is a tutorial/R package which would help me read in
the csv files from the path above without downloading them. For example,
is there any way I can set
..../myfolder/subfolder/output/old/p/
as my working directory and process all the files the same way I do now?
EDIT:
the full url to the path looks something like this:
https://temp.blob.core.windows.net/myfolder/subfolder/output/old/p/
According to the official document CSV Files for Azure Databricks, you can read a csv file directly in R from an Azure Databricks notebook, as shown in the R example of the Read CSV files notebook example section.
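A hedged sketch of that direct approach in a SparkR notebook cell (the dbfs:/mnt/... path is a placeholder and assumes the storage container is mounted or otherwise reachable from the cluster):
library(SparkR)
# read every csv under the folder into a Spark DataFrame, then collect it locally if it fits in memory
sdf <- read.df("dbfs:/mnt/mycontainer/myfolder/subfolder/output/old/p/",
               source = "csv", header = "true", inferSchema = "true")
df <- collect(sdf)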
Alternatively, I used the R package reticulate and the Python package azure-storage-blob to read a csv file directly from a blob URL with a SAS token for Azure Blob Storage.
Here are my steps, as below.
I created an R notebook in the Azure Databricks workspace.
Install the R package reticulate via the code install.packages("reticulate").
Install the Python package azure-storage-blob as in the code below.
%sh
pip install azure-storage-blob
Run a Python script to generate a container-level SAS token and use it to get a list of blob URLs with the SAS token; please see the code below.
library(reticulate)
py_run_string("
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta
account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)
sas_token = blob_service.generate_container_shared_access_signature(container_name, permission=BlobPermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))
blob_names = blob_service.list_blob_names(container_name, prefix = 'myfolder/')
blob_urls_with_sas = ['https://'+account_name+'.blob.core.windows.net/'+container_name+'/'+blob_name+'?'+sas_token for blob_name in blob_names]
")
blob_urls_with_sas <- py$blob_urls_with_sas
Now I can use different ways in R to read a csv file from the blob URL with the SAS token, such as below.
5.1. df <- read.csv(blob_urls_with_sas[[1]])
5.2. Using R package data.table
install.packages("data.table")
library(data.table)
df <- fread(blob_urls_with_sas[[1]])
5.3. Using R package readr
install.packages("readr")
library(readr)
df <- read_csv(blob_urls_with_sas[[1]])
Note: for the reticulate library, please refer to the RStudio article Calling Python from R.
Hope it helps.

How to extract data from file on an FTP server without downloading it all in R? - encoding error?

I am trying to get a large dataset (3+ GB) from the following server:
ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR
I know RCurl is a good package for getting data from FTP. The file is a compressed netcdf file. I need to uncompress it to read it into R using ncdf4. It's compressed as bz2.
Importantly, the file is larger than I want on my hard drive, so saving a copy locally is not an ideal option. How can I access data on the file without saving a copy to my disk first?
Here's my attempt so far:
library(RCurl); library(ncdf4)
d = getURL('ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR/2015/144/20150524-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc.bz2')
d = bzfile(d, open = 'r')
d = nc_open(d)
But I'm stuck at this cryptic error after the first line:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
embedded nul in string: 'BZh91AY&SY¦ÁÀÉ\0033[ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿáåÏ\035\017)³îÎ\u009dÍØcn]sw7½ÎkÜÞõï=uÎׯv]ìçn\u009dÎn½îê·±Þìê÷wS­M\u008có·+ÎçW¹Ý=Ù×¹\u009cγ­ÜëÞs½ÛN¹²w;\u009buÍÝ]{·k^çuªnìº-³6«[+Üå;\033m»Û½ow:w¹ïo{uyîî\u00937¬\\Ƶl¶½\u009dÖVìç¯{ÎõïoSm]Ý×\u009eî\u008dæî®î®î\vÛÕïgW\036î®wqîÝ\\ïw«6½Þï\036Ýrë§=¬Fg·\\íåÔÙº÷gu·3\u009bKmÛ\027­Þ»\u0092îî\016îêwwm»\u009b­·s;MÞÁ½½­ÎóÍso^»q¯o;k\033iµ\u009bÛuyÝÞní5w:ï]ÓuÎo[«\033:åÞvEÜíÎç½ÝË­\u009eìQNöÔ\u008e\u0094vmÝȯg»e lÍ^\u008a©'
It seems to be an encoding issue based on other similar problems, but I tried both .encoding = 'UTF-8' and .encoding = 'ISO-8859-1' as shown in the getURL() documentation, and neither works.
I've seen other answers to problems like this but they all seem to involve editing the source file. I don't have write access to this file, however. Any help?
I'd use httr for this
library("httr")
library("ncdf4")
url <- 'ftp://podaac-ftp.jpl.nasa.gov/allData/ghrsst/data/L4/GLOB/JPL/MUR/2015/144/20150524-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc.bz2'
res <- GET(url, write_disk(basename(url)))
# uncompress - I used OSX's default compression tool
nc_open(sub("\\.bz2", "", res$request$output$path))
The only step I didn't sort out programmatically is uncompressing the bz2 file; I just did that with OS X's default tool.
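If you want that step in R as well, a hedged option is R.utils::bunzip2() (this assumes the R.utils package is installed; by default it writes the decompressed copy next to the original):
# decompress the downloaded .bz2, keeping the original, then open the netCDF
R.utils::bunzip2(res$request$output$path, remove = FALSE, overwrite = TRUE)
nc_open(sub("\\.bz2$", "", res$request$output$path))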
I don't know much at all about R, but you should be able to do this with curl in FTP mode by changing the output to stdout rather than a local filename and then using bzip2 to uncompress the file you want from its standard input.
So, for example, I can do this:
curl --output - --user user:password 'ftp://127.0.0.1/somefile.bz2' | bzip2 -dc ...
Maybe you can start that from within R? Or make a fifo with:
mkfifo fifo
curl ....
and then read from the file called fifo in R.
Or, since R has a system() command, you could do:
system('mkfifo fifo; curl ..... | bzip2 -dc .... > fifo &')
and then read from the file called fifo in R.
