I'm trying to get data for RAIS (a Brazilian employee registry dataset) that is shared using a Google Drive public folder. This is the address:
https://drive.google.com/folderview?id=0ByKsqUnItyBhZmNwaXpnNXBHMzQ&usp=sharing&tid=0ByKsqUnItyBhU2RmdUloTnJGRGM#list
Data is divided into one folder per year, and within each folder there is one file per state to download. I would like to automate the downloading process in R for all years, or, failing that, at least within each year's folder. Downloaded files should keep the same names they get when downloaded manually.
I know a little R, but no web programming or web scraping. This is what I have so far:
By manually downloading the first of the 2012 files, I could see the URL my browser used for the download:
https://drive.google.com/uc?id=0ByKsqUnItyBhS2RQdFJ2Q0RrN0k&export=download
Thus, I suppose the file id is: 0ByKsqUnItyBhS2RQdFJ2Q0RrN0k
Searching the HTML code of the 2012 page, I was able to find that ID and the file name associated with it: AC2012.7z.
All the other IDs and file names are in that section of the HTML code, so, assuming I can download one file correctly, I should be able to generalize to the other files.
In R, I tried the following code to download the file:
url <- "https://drive.google.com/uc?id=0ByKsqUnItyBhS2RQdFJ2Q0RrN0k&export=download"
download.file(url,"AC2012.7z")
unzip("AC2012.7z")
It does download, but I get an error when trying to uncompress the file (both within R and manually with 7-Zip). There must be something wrong with the file downloaded in R, as its size (3.412 KB) does not match what I get from manually downloading the file (3.399 KB).
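One likely culprit (a hedged guess, not something I can confirm for this exact file): on Windows, download.file() defaults to a text-mode transfer, which can silently corrupt binary files such as .7z archives, and base R's unzip() only reads .zip archives anyway. A minimal sketch, assuming the archive package is available for extracting 7z files:
url <- "https://drive.google.com/uc?id=0ByKsqUnItyBhS2RQdFJ2Q0RrN0k&export=download"
# mode = "wb" forces a binary (byte-for-byte) download
download.file(url, "AC2012.7z", mode = "wb")
# unzip() cannot read .7z; archive_extract() handles 7z via libarchive
archive::archive_extract("AC2012.7z")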
For anyone trying to solve this problem today, you can use the googledrive package.
library(googledrive)
ls_tibble <- googledrive::drive_ls(GOOGLE_DRIVE_URL_FOR_THE_TARGET_FOLDER)
for (file_id in ls_tibble$id) {
  googledrive::drive_download(as_id(file_id))
}
This will (1) trigger an authentication page to open in your browser to authorise the Tidyverse libraries using gargle to access Google Drive on behalf of your account and (2) download all the files in the folder at that URL to your current working directory for the current R session.
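If the folder is shared publicly ("anyone with the link"), a hedged variant that skips the OAuth step: drive_deauth() tells googledrive to make unauthenticated requests, which only works for public files.
library(googledrive)
# Sketch for a public folder: no Google sign-in required
drive_deauth()
ls_tibble <- drive_ls(as_id(GOOGLE_DRIVE_URL_FOR_THE_TARGET_FOLDER))
for (file_id in ls_tibble$id) {
  drive_download(as_id(file_id), overwrite = TRUE)
}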
Related
I need to download an Excel file shared within my company on OneDrive, but unfortunately I can't manage it. The link looks like this:
https://company.sharepoint.com/:x:/p/user_name/XXXXXXXXXXXXXXXXXX
After adding the parameter download=1 to the URL, the browser downloads the file automatically, but I can't come up with R code that downloads such a file.
I tried to download the file with this function
httr::GET(paste0(url), authenticate("username","password",type="any"))
I also tried to get a list of files using Microsoft365R, but even after being granted access by IT, list_sharepoint_sites() returns an empty list.
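For what it's worth, a hedged sketch of the direct-link approach with httr: it assumes the download=1 link is reachable without extra SharePoint authentication cookies, which is not guaranteed, and the output file name is a made-up placeholder.
library(httr)
# Append download=1 to the sharing link and stream the response straight to disk
url <- "https://company.sharepoint.com/:x:/p/user_name/XXXXXXXXXXXXXXXXXX?download=1"
resp <- GET(url, write_disk("shared_file.xlsx", overwrite = TRUE))
stop_for_status(resp)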
I have a Google Form that accepts file uploads, all of them in zip format.
The form generated a spreadsheet where each row has a link with a unique URI that represents a zip. I want to download all of these zip files to disk.
Using gspread it is easy to get all the URIs. However, they do not have .zip extensions, and they appear to be Google Drive paths.
I've tried extracting the IDs from the URIs and using the requests package to fetch:
https://drive.google.com/uc?export=download&id=DRIVE_FILE_ID
https://drive.google.com/u/2/uc?id=DRIVE_FILE_ID&export=download
but neither of these approaches seemed to work.
It seems like the URI links to a preview of the contents of the zip, but I can't figure out how to simply download it programmatically. Clicking on hundreds of links and downloading each by hand isn't really an option.
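If working from R is an option, a hedged sketch using the googledrive package; ids is assumed to already hold the file IDs extracted from the spreadsheet URIs, and the account used must have access to the uploads.
library(googledrive)
# Sketch: download each uploaded archive by its unambiguous Drive file ID
ids <- c("DRIVE_FILE_ID_1", "DRIVE_FILE_ID_2")  # placeholder IDs
for (id in ids) {
  drive_download(as_id(id), overwrite = TRUE)
}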
I am trying to download a .tif file from my Google Drive folder (exported there from Google Earth Engine) using the googledrive library. However, when calling the map function, I get the following error:
Error: 'file' identifies more than one Drive file.
I have already managed to download other .tif files with this code, which worked without any error. Why do I get this error, and how do I resolve it? As you can see in the Drive folder (it's public), the folder contains only one file, so why does 'file' identify more than one Drive file?
Code:
library(googledrive)
library(purrr)
## Store the URL to the folder
folder_url <- "https://drive.google.com/drive/folders/1Qdp0GN7_BZoU70OrpbEL-vIBBxBa1_Db"
## Identify this folder on Google Drive
## let googledrive know this is a file ID or URL, as opposed to file name
folder <- drive_get(as_id(folder_url))
## Identify files in the folder
files <- drive_ls(folder, pattern = "*.tif")
# Download all files in folder
map(files$name, overwrite = T, drive_download)
Google Drive API's Files: list method returns an array by default.
Even if the results contain only one file, or no files at all, it will still be an array.
All you need to do is retrieve the first element of this array; you can verify this by testing with the "Try this API" panel.
In R (where indexing starts at 1), that would be something like files$name[1] to retrieve the name of the first (even if it is the only) file.
Also, you should add a conditional check that the list of files is not empty before you retrieve the file name.
Google Drive allows multiple files with exactly the same name in the same folder. The googledrive library does not accept this and will therefore throw an error. However, even after deleting the duplicate files, the error wasn't solved. It seems that Google Drive also keeps some kind of hidden record/cache of the files, even after they are deleted. Only by deleting the entire folder and recreating it was I able to resolve the error.
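A hedged workaround sketch: downloading by file ID rather than by name sidesteps duplicate-name collisions entirely. purrr::walk2() is used here instead of map() because the downloads are called for their side effect, not their return value.
library(googledrive)
library(purrr)
folder_url <- "https://drive.google.com/drive/folders/1Qdp0GN7_BZoU70OrpbEL-vIBBxBa1_Db"
files <- drive_ls(drive_get(as_id(folder_url)), pattern = "\\.tif$")
# Each file is identified by its ID (never ambiguous) and saved under its original name
walk2(files$id, files$name,
      ~ drive_download(as_id(.x), path = .y, overwrite = TRUE))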
Suppose I have a dropbox folder with several files in it.
For example:
https://www.dropbox.com/sh/rgiolfumqlhm9ng/AACs8AwiDmU98JR9UFm842-Ba?dl=0
How can I use R to list the files in this folder (the equivalent of list.files())? I need the sub-URL locations so that I can then read the files in.
I've seen how to do this in Python with the API, but I am working in R.
The trick is to use the API described in the link, and then have the original owner authenticate a token for you. Once you have this token file, you can access the Dropbox folder from R. The rest is straightforward from the link.
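A hedged sketch with the rdrop2 package; the token file name and folder path below are made-up placeholders, and the token itself must come from the folder's owner as described above.
library(rdrop2)
# Load the token the folder owner authenticated (e.g. saved with saveRDS(drop_auth(), ...))
token <- readRDS("dropbox_token.rds")
# List the folder's contents, then download each file to the working directory
contents <- drop_dir("path/to/shared_folder", dtoken = token)
for (remote_path in contents$path_display) {
  drop_download(remote_path, dtoken = token, overwrite = TRUE)
}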
Either I'm missing something here, or this isn't possible...
My real goal is to eliminate the "do you want to open or save" message on Excel files linked from a LOCAL INTRANET SITE ONLY. I was NOT able to limit this to the local intranet, and using regedit I could only remove the message from every Excel file downloaded from who knows where.
It was suggested that I create a new file extension, do the same regedit, and just name our custom Excel files with that different extension. Trying that out, I created a new key called .sxls in HKEY_CLASSES_ROOT with two string values: (Default) REG_SZ STER.XLS.8 and Content Type REG_SZ application/vnd.ms-excel. It would in all respects be an .xls file, but with a different extension. I then went into HKEY_CURRENT_USER->Software->Microsoft->Windows->Shell->AttachmentExecute->{002DF01etc} and added a binary value STER.XLS.8.
Not only was the "do you want to open or save" message not suppressed, Excel also said filename.sxls is in a different format than specified by the file extension.
So, help me out... I need either to 1) limit the AttachmentExecute registry setting to a new file type, 2) limit it to files downloaded from the local intranet, or 3) eliminate the prompt through the ASP.NET web app.
Thanks, John