Use R package "googledrive" to load in R a file from my googledrive - r

I have an xlsx file in my Google Drive. It is too big to be converted automatically to a Google Sheet (which is why the googlesheets package did not work), and I can't even preview it by clicking on it in Drive. The only way to see it is to download it as an .xlsx. While I could download it manually and load it as a local xlsx file, I am trying instead to use the googledrive package.
So far what I have is:
library(googledrive)
drive_find(n_max = 50)
drive_download("filename_without_extension.xlsx",type = "xlsx")
but I got the following error:
'file' does not identify at least one Drive file.
Maybe it is me not specifying the path where the file lives in the Drive. For example : Work\Data\Project1\filename.xlsx
Could you give me an idea on how to load in R the file called filename.xlsx that is nested in the drive like that?
I read the documentation but couldn't figure out how to do that. Thanks in advance.

You should be able to do this by:
library(googledrive)
drive_download("~/Work/Data/Project1/filename.xlsx")
The type parameter is only for Google native spreadsheets, and does not apply to raw files.
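As a minimal sketch of the whole round trip, assuming the googledrive and readxl packages are installed and Drive authentication already works, the file can be downloaded to a temporary location and read from there. The helper name load_drive_xlsx is hypothetical, and the path in the usage comment is the example path from the question.

```r
# Hypothetical helper: download an .xlsx from Drive by path and read it into R.
load_drive_xlsx <- function(drive_path, ...) {
  # Temporary local copy, so nothing is left in the working directory
  local_path <- tempfile(fileext = ".xlsx")
  on.exit(unlink(local_path), add = TRUE)

  # No `type` argument: the file is a raw .xlsx, not a native Google Sheet
  googledrive::drive_download(drive_path, path = local_path, overwrite = TRUE)

  # Extra arguments (e.g. sheet = 2) are passed on to readxl
  readxl::read_excel(local_path, ...)
}

# Usage (needs an authenticated Drive session):
# df <- load_drive_xlsx("~/Work/Data/Project1/filename.xlsx")
```

Downloading to a tempfile keeps the "load it in R" step self-contained; if the download itself is slow, keeping a local copy instead may be the better trade-off.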

I want to share my approach.
I do it this way because the xlsx file keeps being updated: it is a query result that comes from an ERP.
When I tried to fetch it by Google Drive ID, I got errors because the ID changes each time the ERP updates the file.
This is my context; yours may be completely different. The file changes just two or three times a month. Even though it is a "big" xlsx file (78-80K records with 19 factors), I use it for just a few seconds to calculate some values and then I can discard it. It makes no sense to store it (storing is more expensive than uploading).
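library(googledrive)
library(googlesheets4) # watch out: it is not the CRAN version yet 0.1.1.9000
library(glue) # for glue(), used below to build the file URL
library(readxl) # for read_xlsx(), used below to read the downloaded file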
drive_folder_owner<-"carlos.sxxx#xxxxxx.com" # this is my account in this gDrive folder.
drive_auth(email =drive_folder_owner) # previously authorized account
googlesheets4::sheets_auth(email =drive_folder_owner) # Yes, I know, should be the same, but they are different.
d1<-drive_find(pattern = "my_file.xlsx",type = drive_mime_type("xlsx")) # This is me finding the file created by the ERP, and I do shorten the search using the type
meta<-drive_get(id=d1$id)[["drive_resource"]] # Get the file's metadata (the drive_resource) from googledrive
n_id<-glue("https://drive.google.com/open?id=",d1$id[[1]]) # here I am creating a path for reading
meta_name<- paste(getwd(),"/Files/",meta[[1]]$originalFilename,sep = "") # and a path to temporary save it.
drive_download(file=as_id(n_id),overwrite = TRUE, path = meta_name) # Now read and save locally.
V_CMV<-data.frame(read_xlsx(meta_name)) # store to data frame
file.remove(meta_name) # delete from R Server
rm(d1,n_id) # Delete temporary variables

Related

Is there a way of reading shapefiles directly into R from an online source?

I am trying to find a way of loading shapefiles (.shp) from an online repository/folder/url directly into my global environment in R, for the purpose of making plots in ggplot2 using geom_sf. In the first instance I'm using my Google Drive to store these files but I'd ideally like to find a solution that works with any folder with a valid url and appropriate access rights.
So far I have tried a few options, the first 2 involving zipping the source folder on Google Drive where the shapefiles are stored and then downloading and unzipping in some way. I have included reproducible examples using a small test shapefile:
Using utils::download.file() to retrieve the compressed folder and unzipping using either base::system('unzip..') or zip::unzip() (loosely following this thread: Downloading County Shapefile from ONS):
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Download the zipped file/folder
download.file("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing", destfile = "data/test_shp.zip")
# Unzip folder using unzip (fails)
unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Unzip folder using system (also fails)
system("unzip data/test_shp.zip")
If you can't run the above code then FYI the 2 error messages are:
Warning message:
In unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", :
error 1 in extracting from zip file
AND
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of data/test_shp.zip or
data/test_shp.zip.zip, and cannot find data/test_shp.zip.ZIP, period.
Worth noting here that I can't even manually unzip this folder outside R so I think there's something going wrong with the download.file() step.
Using the googledrive package:
# Create destination data folder (if there isn't one)
if(!dir.exists('data')) dir.create('data')
# Specify googledrive url:
test_shp = drive_get(as_id("https://drive.google.com/file/d/1BYTCT_VL8EummlAsH1xWCd5rC4bZHDMh/view?usp=sharing"))
# Download zipped folder
drive_download(test_shp, path = "data/test_shp.zip")
# Unzip folder
zip::unzip(zipfile = "data/test_shp.zip", exdir = "data/test_shp", junkpaths = TRUE)
# Load test.shp
test_shp <- read_sf("data/test_shp/test.shp")
And that works!
...Except it's still a hacky workaround, which requires me to zip, download, unzip and then use a separate function (such as sf::read_sf or st_read) to read in the data into my global environment. And, as it's using the googledrive package it's only going to work for files stored in this system (not OneDrive, DropBox and other urls).
I've also tried sf::read_sf, st_read and fastshp::read.shp directly on the folder url but those approaches all fail as one might expect.
So, my question: is there a workflow for reading shapefiles stored online directly into R or should I stop looking? If there is not, but there is a way of expanding my above solution (2) beyond googledrive, I'd appreciate any tips on that too!
Note: I should also add that I have deliberately ignored any option requiring the package rgdal due to its imminent permanent retirement, so I am looking for options that are at least somewhat future-proof (I understand all packages drop off the map at some point). Thanks in advance!
I ran into a similar problem recently, having to read in shapefiles directly from Dropbox into R.
As a result, this solution only applies for the case of Dropbox.
The first thing you will need to do is create a refreshable token for Dropbox using rdrop2, given recent changes from Dropbox that limit single token use to 4 hours. You can follow this SO post.
Once you have set up your refreshable token, identify all the files in your spatial data folder on Dropbox using:
shp_files_on_db<- drop_dir("Dropbox path/to your/spatial data/", dtoken = refreshable_token) %>%
filter(str_detect(name, "adm2"))
My 'spatial data' folder contained two sets of shapefiles: adm1 and adm2. I used the above code to choose only those associated with adm2.
Then create a vector of the names of the shp, csv, shx, dbf, cpg files in the 'spatial data' folder, as follows:
shp_filenames<- shp_files_on_db$name
I choose to read in shapefiles into a temporary directory, avoiding the need to have to store the files on my disk – also useful in a Shiny implementation. I create this temporary directory as follows:
# create a new directory under tempdir
dir.create(dir1 <- file.path(tempdir(), "testdir"))
# Later on, once the shapefile has been read, you can delete this temporary directory.
# Do not run these two lines before the download step below.
# unlink(dir1, recursive = TRUE)
# dir.exists(dir1) # test that it no longer exists
Now download the Dropbox files to this temporary directory:
for (i in seq_along(shp_filenames)){
  drop_download(paste0("Dropbox path/to your/spatial data/", shp_filenames[i]),
                dtoken = refreshable_token,
                local_path = dir1)
}
And finally, read in your shapefile as follows:
#path to the shapefile in the temporary directory
path1_shp<- paste0(dir1, "/myfile_adm2.shp")
#reading in the shapefile using the sf package - a recommended replacement for rgdal
shp1a <- st_read(path1_shp)
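For sources other than Dropbox, the same temporary-directory idea can be sketched for any URL that serves a zipped shapefile directly. This is a minimal sketch, not a tested general solution: the helper name read_zipped_shp is hypothetical, and it assumes the URL returns the zip archive itself. A Google Drive "view" link returns an HTML page instead, which would be consistent with the "not a zipfile" error in the question; a direct-download form of the link would be needed there.

```r
# Hypothetical helper: download a zipped shapefile from a direct-download URL
# and read it with sf, using only temporary files.
read_zipped_shp <- function(url) {
  zip_path <- tempfile(fileext = ".zip")
  out_dir  <- tempfile("shp_")
  on.exit(unlink(c(zip_path, out_dir), recursive = TRUE), add = TRUE)

  # mode = "wb" keeps the archive byte-for-byte intact (matters on Windows)
  utils::download.file(url, destfile = zip_path, mode = "wb")
  utils::unzip(zip_path, exdir = out_dir, junkpaths = TRUE)

  # Locate the .shp; the .dbf/.shx/.prj siblings sit next to it
  shp <- list.files(out_dir, pattern = "\\.shp$", full.names = TRUE)
  if (length(shp) != 1L) {
    stop("Expected exactly one .shp in the archive, found ", length(shp))
  }
  sf::read_sf(shp)
}

# Usage (the URL is a placeholder):
# test_shp <- read_zipped_shp("https://example.org/data/test_shp.zip")
```

The data are read fully into memory before the temporary files are removed, so nothing remains on disk afterwards.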

How can I parse an XML file in R, which has probably been generated using SSRS?

In my job I have to perform analytics on data shared by an external organisation through user access granted on a web portal. Various reports are available there, which I can view and download in many formats. Two of these formats are very useful, namely MS Excel and 'XML file with report data'. The Excel file is normally heavily formatted (with sub-totals, merged cells, etc.) to suit Excel users, so converting it to a data frame/table is a big hassle. I therefore prefer to download the 'xml' file, parse it, save it as csv, and then carry out my analysis in R.
However, whenever I try to parse the xml file directly in R (to avoid the intermediate conversion to csv), I never succeed. So far I have tried the XML and xml2 packages in R, but to no avail.
Recently I tried this code.
library("XML")
library("methods")
setwd("C:\\Users\\Administrator\\Desktop\\")
res <- xmlParse("Skil.xml")
> res <- xmlParse("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
rootnode <- xmlRoot(res)
rootsize <- xmlSize(rootnode)
> rootsize
[1] 2
xmldataframe <- xmlToDataFrame("Skil.xml")
> xmldataframe <- xmlToDataFrame("Skil.xml")
xmlns: URI RptSancDig_VoucherCompilationSheet is not absolute
> xmldataframe
Textbox24 Textbox63 DDOName_Collection
1 <NA> <NA> <NA>
2
Just to mention the file size of Skil.xml is about 12.1 Mb, and is successfully parsed in Excel.
I have also tried read_xml() function of xml2 but to no avail.
I would have happily shared a sample file to try, but I am unable to do so. Moreover, I am also unable to generate a sample file in that kind of xml format.
Can someone help?
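The console output above suggests that the "URI ... is not absolute" message is a warning about the report's default namespace rather than a parse failure, and that the data may sit in attributes, which xmlToDataFrame does not read. Since no sample file is available, the following is only a sketch on a made-up document: every element and attribute name is hypothetical, and the real report's structure may differ. It uses xml2 to strip the default namespace and then turns attributes into columns.

```r
library(xml2)

# Made-up stand-in for the report; element and attribute names are hypothetical.
xml_text <- '<Report xmlns="RptSample" Name="RptSample">
  <Tablix1>
    <Details_Collection>
      <Details VoucherNo="101" Amount="2500.50" />
      <Details VoucherNo="102" Amount="310.00" />
    </Details_Collection>
  </Tablix1>
</Report>'

# The relative namespace URI triggers a warning; the document still parses
doc <- suppressWarnings(read_xml(xml_text))

# Drop the default namespace so plain XPath expressions match
xml_ns_strip(doc)

# One row per repeated element, one column per attribute
rows <- xml_find_all(doc, ".//Details")
report_df <- as.data.frame(do.call(rbind, xml_attrs(rows)),
                           stringsAsFactors = FALSE)
report_df$Amount <- as.numeric(report_df$Amount)
```

For the real file, read_xml("Skil.xml") would replace the inline string, and xml_name(xml_children(doc)) can help discover which element actually repeats.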

readxl::read_xls returns "libxls error: Unable to open file"

I have multiple .xls files (~100MB each) from which I would like to load multiple sheets into R as data frames. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which run for a very long time (>15 mins) without finishing, so I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per one sheet and it cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline) like XLConnect::loadWorkbook can.
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places, I have seen a recommendation to use the function readxl::read_xls, which seems to be widely recommended for this task and should be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls used above is available here.
Any advice would be appreciated in terms of making the first functions run faster or the read_xls run at all - thank you!
UPDATE:
It seems that some users are able to open the file above using the readxl::read_xls function, while others are not, both on Mac and Windows, using the most up to date versions of R, Rstudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each excel sheet in this way (for example, for sheets "Overall" and "Area"):
install.packages("readxl")
library(readxl)
library(data.table)
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get dt like this (for example, only part of the dataset for the "Area" sheet):
Equally, you can use the read_xls function instead of read_excel.
I checked: it also works correctly and is even a little faster, since read_excel is a wrapper around the read_xls and read_xlsx functions of the readxl package.
You can also use the excel_sheets function from the readxl package to list the names of all sheets in your Excel file.
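To read several sheets in one go, excel_sheets can be combined with lapply. This is a minimal sketch; the helper name read_all_sheets is hypothetical, and the demonstration uses an example workbook that ships with readxl rather than the file from the question.

```r
library(readxl)

# Read every sheet of a workbook into a named list of data frames.
read_all_sheets <- function(path, ...) {
  path <- path.expand(path)  # expand "~" before handing the path to readxl
  sheet_names <- excel_sheets(path)
  sheets <- lapply(sheet_names, function(s) read_excel(path, sheet = s, ...))
  names(sheets) <- sheet_names
  sheets
}

# Demonstration on a workbook bundled with readxl
demo_path <- readxl_example("datasets.xls")
all_sheets <- read_all_sheets(demo_path)
names(all_sheets)
```

Each list element can then be converted with as.data.table() if data.table is preferred.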
UPDATE
Benchmarking is done with microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile and readxl::read_excel.
Note that XLConnect is a Java-based solution, so it requires a lot of RAM.
I found that I was unable to open the file with read_xls immediately after downloading it, but if I opened the file in Excel, saved it, and closed it again, then read_xls was able to open it without issue.
My suggested workaround for handling hundreds of files is to build a little C# command-line utility that opens, saves, and closes an Excel file. Source code is below; the utility can be compiled with Visual Studio Community Edition.
using System.IO;
using Excel = Microsoft.Office.Interop.Excel;

namespace resaver
{
    class Program
    {
        static void Main(string[] args)
        {
            string srcFile = Path.GetFullPath(args[0]);
            Excel.Application excelApplication = new Excel.Application();
            excelApplication.Application.DisplayAlerts = false;
            Excel.Workbook srcworkBook = excelApplication.Workbooks.Open(srcFile);
            srcworkBook.Save();
            srcworkBook.Close();
            excelApplication.Quit();
        }
    }
}
Once compiled, the utility can be called from R using e.g. system2().
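A call from R might look like the sketch below. The executable location and the folder of .xls files are hypothetical placeholders, and the loop assumes the utility exits with status 0 on success.

```r
# Hypothetical: path to the compiled utility and the folder of .xls files
resaver_exe <- "C:/tools/resaver.exe"
xls_files <- list.files("path/to/files", pattern = "\\.xls$", full.names = TRUE)

for (f in xls_files) {
  # normalizePath() gives Excel an absolute path; shQuote() protects spaces
  status <- system2(resaver_exe, args = shQuote(normalizePath(f)))
  if (status != 0) warning("Re-saving failed for ", f)
}
```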
I will propose a different workflow. If you happen to have LibreOffice installed, you can convert your Excel files to csv programmatically. I use Linux, so I do it in bash, but I'm sure it is possible on macOS too.
So open a terminal and navigate to the folder with your excel files and run in terminal:
for i in *.xls
do soffice --headless --convert-to csv "$i"
done
Now in R you can use data.table::fread to read your files with a loop:
Scenario 1: the structure of files is different
If the structure of files is different, then you wouldn't want to rbind them together. You could run in R:
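library(data.table)
# full.names = TRUE returns paths that fread() can open from any working directory
files <- dir("path/to/files", pattern = "\\.csv$", full.names = TRUE)
all_files <- list()
for (i in seq_along(files)){
  # strip the directory and the .csv extension to get the element name
  fileName <- gsub("(^.*/)(.*)(\\.csv$)", "\\2", files[i])
  all_files[[fileName]] <- fread(files[i])
}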
If you want to extract your named elements within the list into the global environment, so that they can be converted into objects, you can use list2env:
list2env(all_files, envir = .GlobalEnv)
Please be aware of two things: First, in the gsub call, the direction of the slash. And second, list2env may overwrite objects in your Global Environment if they have the same name as the named elements within the list.
Scenario 2: the structure of files is the same
In that case it's likely you want to rbind them all together. You could run in R:
library(data.table)
files <- dir("path/to/files", pattern = "\\.csv$", full.names = TRUE)
# rbindlist() takes a list of tables, so read every file first and bind once
joined <- rbindlist(lapply(files, fread), fill = TRUE)
On my system, I had to use path.expand.
R> file = "~/blah.xls"
R> read_xls(file)
Error:
filepath: ~/Dropbox/signal/aud/rba/balsheet/data/a03.xls
libxls error: Unable to open file
R> read_xls(path.expand(file)) # fixed
Re-saving the file solves the problem easily.
I ran into this problem before too, and I found the answer in this discussion.
I used read_excel() to open those files.
I was seeing a similar error and wanted to share a short-term solution.
library(readxl)
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
MLBpayrolls <- read_excel("MLBpayrolls.xls", sheet = "MLB Payrolls", na = "n/a")
Yields (on some systems in my classroom but not others):
Error: filepath: MLBPayrolls.xls libxls error: Unable to open file
The temporary solution was to paste the URL of the xls file into Firefox and download it via the browser. Once this was done we could run the read_excel line without error.
This was happening today on Windows 10, with R 3.6.2 and R Studio 1.2.5033.
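One possible explanation, offered as an assumption rather than a confirmed diagnosis: on Windows, download.file() may write in text mode for file types it does not treat as binary, which can alter the bytes of an .xls file. Requesting binary mode explicitly is cheap to try. The helper name download_binary is hypothetical.

```r
# Hypothetical helper: download a file in binary mode so its bytes are unchanged.
download_binary <- function(url, destfile) {
  utils::download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  invisible(destfile)
}

# Usage with the file from the example above:
# xls_path <- file.path(tempdir(), "MLBpayrolls.xls")
# download_binary("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", xls_path)
# MLBpayrolls <- readxl::read_excel(xls_path, sheet = "MLB Payrolls", na = "n/a")
```

Reusing the same variable for the download destination and the read path also avoids the mismatch in letter case between the two file names, which matters on case-sensitive file systems.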
If you have downloaded the .xls file from the internet, even MS Excel will first show a prompt asking you to confirm that you trust the source. I am guessing this is the reason R (read_xls) can't open it either: the file is considered unsafe. Save it as an .xlsx file and then use read_xlsx() or read_excel().
Even though this is not a code-based solution, I just changed the file type: instead of xls, I saved the file as csv or xlsx and then opened it as a regular file.
It worked for me because, when I opened my xls file, this message popped up: "The file format and extension of 'file.xls' don't match. The file could be corrupted or unsafe..."

How to upload a R data frame into a google drive ?

I am using the googledrive package from CRAN, but its drive_upload function lets you upload a local file, not a data frame. Can anybody help with this?
Just save the data frame in question to a local file. The most basic options would be saving to CSV or saving an RData file.
Example:
test <- data.frame(a = 1)
save(test, file = "test.Rds") # write the object to a local file
rm(test)                      # remove it from the workspace
load("test.Rds")              # restore it from the file
exists("test")                # TRUE: the object is back
Since it has been clarified that a temporary file cannot be used, we could use a file connection instead.
test <- data.frame(a = 1)
tempFileCon <- file()
write.csv(test, file = tempFileCon)
And now we have the file connection in memory, which we can pass to other functions. Caveat: refer to it by its object name, not by a quoted string as you would with actual files.
Unfortunately I could find no way to push the data frame up directly. To document the basics for others who land on this question: the following code writes a local .csv and then uploads it with the tidyverse's googledrive package, converting it to a Google Sheet.
write_csv(iris, 'df_iris.csv')
drive_upload('df_iris.csv', type='spreadsheet')
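The same idea can be wrapped so that the intermediate file lives in a temporary location and is removed afterwards. This is a minimal sketch assuming an authenticated googledrive session; the helper name upload_df_as_sheet is hypothetical.

```r
# Hypothetical helper: upload a data frame to Drive as a Google Sheet
# without leaving a file behind in the working directory.
upload_df_as_sheet <- function(df, name) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)

  utils::write.csv(df, tmp, row.names = FALSE)

  # type = "spreadsheet" asks Drive to convert the CSV to a Google Sheet
  googledrive::drive_upload(media = tmp, name = name, type = "spreadsheet")
}

# Usage (needs an authenticated Drive session):
# upload_df_as_sheet(iris, "df_iris")
```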
You can achieve this using gs_add_row from googlesheets package. This API accepts dataframes directly as input parameter and uploads data to the specified google sheet. Local files are not required.
From the help section of ?gs_add_row:
"If input is two-dimensional, internally we call gs_add_row once per input row."
This can be done in two ways. Like mentioned by others, a local file can be created and this can be uploaded. It is also possible to create a new spreadsheet in your drive. This spreadsheet will be created in the main folder of your drive. If you want it stored somewhere else, you can move it after creation.
# install the packages
install.packages(c("googledrive", "googlesheets4"))
# load the libraries
library(googledrive)
library(googlesheets4)
## With local storage
# Locally store the file
write.csv(x = iris, file = "iris.csv")
# Upload the file
drive_upload(media = "iris.csv", type='spreadsheet')
## Direct storage
# Create an empty spreadsheet. It is stored as an object with a sheet_id and drive_id
ss <- gs4_create(name = "my_spreadsheet", sheets = "Sheet 1")
# Put the data.frame in the spreadsheet and provide the sheet_id so it can be found
sheet_write(data=iris, ss = ss, sheet ="Sheet 1")
# Move your spreadsheet to the desired location
drive_mv(file = ss, path = "my_creations/awesome location/")

R extension write local data

I am creating a package and would like to store settings data locally, since it is unique for each user of the package and so that the setting does not have to be set each time the package is loaded.
How can I do this in the best way?
You could keep the necessary data in an object and save it using saveRDS()
whenever a change is made, or when the user leaves or issues a save command.
It saves the R object as-is under a file name in the specified path.
saveRDS(<obj>, "path/to/filename.rds")
And you can load it the next time the package starts using readRDS().
The good thing about readRDS() is that you can assign any name to the object, so you don't have to remember its old name. Unlike load(), readRDS() does not restore the original object name; it only returns the object, so nothing pollutes your namespace.
newly.assigned.name <- readRDS("path/to/filename.rds")
Where to store
Windows
Maybe here:
You can use %systemdrive%%homepath% environment variable to accomplish
this.
The two command variables when concatenated gives you the desired
user's home directory path as below:
Running echo %systemdrive% on command prompt gives:
C:
Running echo %homepath% on command prompt gives:
\Users\
When used together it becomes:
C:\Users\
Linux/OsX
Either in the package location of the user,
path.to.package <- find.package("name.of.your.package",
                                lib.loc = NULL, quiet = FALSE,
                                verbose = getOption("verbose"))
# and then construct the path to the final destination with
destination.folder.path <- file.path(path.to.package,
                                     "subfoldername", "filename")
# You should use `file.path()` to construct such paths, because it joins the parts with a separator that works on Unix-derived systems (Linux/Mac OS X) as well as on Windows.
Or use the user's $HOME directory and store the data there in a file whose name begins with ".", which is the convention on Unix systems (Linux/Mac OS X) for files that hold software configuration,
e.g. ".your-packages-name.rds".
If anybody has a better solution, please help!
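One alternative worth considering, sketched here under the assumption that R 4.0.0 or later is available: tools::R_user_dir() returns a per-user, per-package directory on every platform, which avoids writing into the installed package folder (often read-only). The package name "mypackage" and the helper names are hypothetical.

```r
# Hypothetical helpers for a package called "mypackage".
# tools::R_user_dir() (R >= 4.0.0) returns a per-user, per-package directory.
settings_file <- function(dir = tools::R_user_dir("mypackage", which = "config")) {
  file.path(dir, "settings.rds")
}

save_settings <- function(settings,
                          dir = tools::R_user_dir("mypackage", which = "config")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  saveRDS(settings, settings_file(dir))
  invisible(settings)
}

load_settings <- function(dir = tools::R_user_dir("mypackage", which = "config"),
                          default = list()) {
  path <- settings_file(dir)
  if (file.exists(path)) readRDS(path) else default
}

# Demonstration in a throwaway directory instead of the real config location
demo_dir <- file.path(tempdir(), "mypackage-config")
save_settings(list(api_url = "https://example.org", verbose = TRUE), dir = demo_dir)
restored <- load_settings(dir = demo_dir)
```

Returning a default when no file exists lets the package start cleanly on first use, before the user has saved anything.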
