Importing data into R (.RData) from GitHub

I want to put some R code plus the associated data file (RData) on Github.
So far, everything works okay. But when people clone the repository, I want them to be able to run the code immediately. At the moment this isn't possible, because they first have to change their working directory (setwd) to the directory that the .RData file was cloned (i.e. downloaded) to.
Therefore, I thought it might be easier to change the R code so that it links to the .RData file on GitHub. But I cannot get this to work using the following snippet. I think there may be some text/binary issue.
x <- RCurl::getURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
y <- load(x)
Any help would be appreciated.
Thanks

This works for me:
githubURL <- "https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData"
load(url(githubURL))
head(df)
# X Y Z
# 1 16602794 -4183983 94.92019
# 2 16602814 -4183983 91.15794
# 3 16602834 -4183983 87.44995
# 4 16602854 -4183983 83.79617
# 5 16602874 -4183983 80.19643
# 6 16602894 -4183983 76.65052
EDIT: in response to the OP's comment.
From the documentation:
Note that the https:// URL scheme is not supported except on Windows.
So you could try this:
download.file(githubURL,"myfile")
load("myfile")
which works for me as well, but this will clutter your working directory. If that doesn't work, try setting method="curl" in the call to download.file(...).
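If the leftover file is a concern, the same download.file() approach can be wrapped so the download goes to a temporary file that is removed afterwards (a sketch; load_remote_rdata is a made-up name, not an existing function):

```r
# Fetch an .RData file into a temporary location, load it, then clean up.
load_remote_rdata <- function(url) {
  tmp <- tempfile(fileext = ".RData")
  on.exit(unlink(tmp))
  # mode = "wb" matters on Windows: .RData is a binary format
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  load(tmp, envir = .GlobalEnv)
}
```

Called as load_remote_rdata(githubURL), it returns the names of the loaded objects, just like load().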

I've had trouble with this before as well, and the solution I've found to be the most reliable is to use a tiny modification of source_url from the fantastic [devtools][1] package. This works for me (on a Mac).
load_url <- function(url, ..., sha1 = NULL) {
  # based very closely on code for devtools::source_url
  stopifnot(is.character(url), length(url) == 1)
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  request <- httr::GET(url)
  httr::stop_for_status(request)
  writeBin(httr::content(request, type = "raw"), temp_file)
  file_sha1 <- digest::digest(file = temp_file, algo = "sha1")
  if (is.null(sha1)) {
    message("SHA-1 hash of file is ", file_sha1)
  } else {
    if (nchar(sha1) < 6) {
      stop("Supplied SHA-1 hash is too short (must be at least 6 characters)")
    }
    file_sha1 <- substr(file_sha1, 1, nchar(sha1))
    if (!identical(file_sha1, sha1)) {
      stop("SHA-1 hash of downloaded file (", file_sha1,
           ")\n  does not match expected value (", sha1, ")", call. = FALSE)
    }
  }
  load(temp_file, envir = .GlobalEnv)
}
I use a very similar modification to get text files from github using read.table, etc. Note that you need to use the "raw" version of the github URL (which you included in your question).
[1]: https://github.com/hadley/devtools
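That text-file variant might look something like this (a sketch along the same lines as load_url() above; read_table_url is a hypothetical name, not the exact code the answer used):

```r
# Hypothetical analogue of load_url() for delimited text files: same
# download-and-verify skeleton, but finishing with read.table() instead
# of load(). Extra arguments (header, sep, ...) pass through to read.table().
read_table_url <- function(url, ...) {
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  request <- httr::GET(url)
  httr::stop_for_status(request)
  writeBin(httr::content(request, type = "raw"), temp_file)
  read.table(temp_file, ...)
}
```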

load takes a filename, so write the download to a temporary file first. Note that .RData is a binary format, so fetch it as raw bytes with getBinaryURL and write it with writeBin (writeLines would corrupt it):
x <- RCurl::getBinaryURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
writeBin(x, tmp <- tempfile())
y <- load(tmp)

Related

googledrive::drive_mv gives error "Parent specified via 'path' is invalid: x Does not exist"

This is a weird one and I am hoping someone can figure it out. I have written a function that uses googlesheets4 and googledrive. One thing I'm trying to do is move a Google Drive document (spreadsheet) from the base folder to a specified folder. I had this working perfectly yesterday, so I don't know what happened; it just stopped working when I came in this morning.
The weird thing is that if I step through the function, it works fine. It's just when I run the function all at once that I get the error.
I am using a folder ID instead of a name and using drive_find to get the correct folder ID. I am also using a sheet ID instead of a name. The folder already exists and like I said, it was working yesterday.
outFolder <- 'exact_outFolder_name_without_slashes'
createGoogleSheets <- function(outFolder) {
  folder_id <- googledrive::drive_find(n_max = 10, pattern = outFolder)$id
  data <- data.frame(Name = c("Sally", "Sue"), Data = c("data1", "data2"))
  sheet_id <- NA
  nameDate <- NA
  tempData <- data.frame()
  for (i in 1:nrow(data)) {
    nameDate <- data[i, "Name"]
    tempData <- data[i, ]
    googlesheets4::gs4_create(name = nameDate, sheets = list(sheet1 = tempData)
    sheet_id <- googledrive::drive_find(type = "spreadsheet", n_max = 10, pattern = nameDate)$id
    googledrive::drive_mv(file = as_id(sheet_id), path = as_id(folder_id))
  } # end 'for'
} # end 'function'
I don't think this will be a reproducible example. The offending code is within the for loop that is within the function and it works fine when I run through it step by step. folder_id is defined within the function but outside of the for loop. sheet_id is within the for loop. When I move folder_id into the for loop, it still doesn't work although I don't know why it would change anything. These are just the things I have tried. I do have the proper authorization for google drive and googlesheets4 by using:
googledrive::drive_auth()
googlesheets4::gs4_auth(token = drive_token())
<error/rlang_error>
Error in `as_parent()`:
! Parent specified via `path` is invalid:
x Does not exist.
Backtrace:
  global createGoogleSheets(inputFile, outPath, addNames)
  googledrive::drive_mv(file = as_id(sheet_id), path = as_id(folder_id))
  googledrive:::as_parent(path)
Run `rlang::last_trace()` to see the full context.
Backtrace:
    x
    -global createGoogleSheets(inputFile, outPath, addNames)
     -googledrive::drive_mv(file = as_id(sheet_id), path = as_id(folder_id))
      \-googledrive:::as_parent(path)
       \-googledrive:::drive_abort(c(invalid_parent, x = "Does not exist."))
        \-cli::cli_abort(message = message, ..., .envir = .envir)
         \-rlang::abort(message, ..., call = call, use_cli_format = TRUE)
I have tried changing the folder_id to the exact path of my Google Drive (W:/My Drive...) and got the same error. I should mention I have also tried deleting the folder and re-creating it fresh.
Anybody have any ideas?
Thank you in advance for your help!
I can't comment because I don't have the reputation yet, but I believe you're missing a parenthesis in your for-loop.
You need that SECOND parenthesis below:
for (i in 1:nrow(tempData) ) {
...
}
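For reference, here is how the body of the question's loop parses once the call is closed (a sketch wrapped in a function so it only needs to parse here, since actually running it requires Google authentication; names are taken from the question):

```r
# The question's loop with the missing ")" on the gs4_create() call restored.
createGoogleSheets_fixed <- function(data, folder_id) {
  for (i in 1:nrow(data)) {
    nameDate <- data[i, "Name"]
    tempData <- data[i, ]
    googlesheets4::gs4_create(name = nameDate,
                              sheets = list(sheet1 = tempData))  # closing ")" restored
    sheet_id <- googledrive::drive_find(type = "spreadsheet",
                                        n_max = 10, pattern = nameDate)$id
    googledrive::drive_mv(file = googledrive::as_id(sheet_id),
                          path = googledrive::as_id(folder_id))
  }
}
```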

Give a character string to define a file dependency in drake

I am learning drake to define my analysis workflow, but I am having trouble declaring data files as dependencies.
I use the function file_in() inside drake_plan(), but it only works if I give the path to the file directly. If I build the path with file.path(), or pass a variable storing that file path, it doesn't work.
Examples:
# preparation
library(drake)
path.data <- "data"
dir.create(path.data)
write.csv(iris, file.path(path.data, "iris.csv"))
Working plan:
# working plan
working_plan <- drake_plan(
  iris_data = read.csv(file_in("data/iris.csv")),
  strings_in_dots = "literals"
)
working_config <- make(working_plan)
vis_drake_graph(working_config)
This plan works fine, and the file data/iris.csv is considered a dependency.
Not working plan:
# not working
notworking_plan <- drake_plan(
  iris_data = read.csv(file_in(file.path(path.data, "iris.csv"))),
  strings_in_dots = "literals"
)
notworking_config <- make(notworking_plan)
vis_drake_graph(notworking_config)
Here it is trying to read the file iris.csv instead of data/iris.csv.
Working, but with a dependency problem:
# working but "data/iris.csv" is not considered as a dependency
file.name <- file.path(path.data, "iris.csv")
notworking_plan <- drake_plan(
  iris_data = read.csv(file_in(file.name)),
  strings_in_dots = "literals"
)
notworking_config <- make(notworking_plan)
vis_drake_graph(notworking_config)
This last one works fine but the file is not considered a dependency, so drake doesn't re-run the plan if this file is changed.
So, is there a way to tell drake file dependencies from variables?
Per tidy evaluation (tidyeval) rules, if you add !! in front of the file.path() call, it will be evaluated rather than quoted.
Also, in the new version of drake, the strings_in_dots = "literals" argument is deprecated.
library(drake)
path.data <- "data"
dir.create(path.data)
write.csv(iris, file.path(path.data, "iris.csv"))
# now working
notworking_plan <- drake_plan(
  iris_data = read.csv(file_in(!!file.path(path.data, "iris.csv")))
)
notworking_plan
#> # A tibble: 1 x 2
#> target command
#> <chr> <expr>
#> 1 iris_data read.csv(file_in("data/iris.csv"))
Created on 2019-05-08 by the reprex package (v0.2.1)
According to an answer from the developers on GitHub, the code inside file_in() is not evaluated, which is why file.path() cannot be used within it.

git commit throws error '[<-'

Does anybody have an idea how I can fix this? git commit -a -m "message here" works fine for other projects, and previous commits today were all OK.
Now, it throws the error:
Error in `[<-`(`*tmp*`, 1, "Date", value = "2016-07-29") :
  Indizierung außerhalb der Grenzen
Ausführung angehalten
The German error message translates roughly to:
index out of bounds; execution halted
Please let me know if you need any further information.
Edit: @Carsten guessed right! I have a hook running. But I cannot see why it should stop working from one minute to the next... (It still does not work.)
#!C:/R/R-3.2.2/bin/x64/Rscript
# License: CC0 (just be nice and point others to where you got this)
# Author: Robert M Flight <rflight79@gmail.com>, github.com/rmflight

inc <- TRUE # default

# get the environment variable and modify if necessary
tmpEnv <- as.logical(Sys.getenv("inc"))
if (!is.na(tmpEnv)) {
  inc <- tmpEnv
}

# check that there are files that will be committed, don't want to increment version if there won't be a commit
fileDiff <- system("git diff HEAD --name-only", intern = TRUE)
if ((length(fileDiff) > 0) && inc) {
  currDir <- getwd() # this should be the top level directory of the git repo
  currDCF <- read.dcf("DESCRIPTION")
  currVersion <- currDCF[1, "Version"]
  splitVersion <- strsplit(currVersion, ".", fixed = TRUE)[[1]]
  nVer <- length(splitVersion)
  currEndVersion <- as.integer(splitVersion[nVer])
  newEndVersion <- as.character(currEndVersion + 1)
  splitVersion[nVer] <- newEndVersion
  newVersion <- paste(splitVersion, collapse = ".")
  currDCF[1, "Version"] <- newVersion
  currDCF[1, "Date"] <- strftime(as.POSIXlt(Sys.Date()), "%Y-%m-%d")
  write.dcf(currDCF, "DESCRIPTION")
  system("git add DESCRIPTION")
  cat("Incremented package version and added to commit!\n")
}
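The version-increment logic in the hook is easy to check in isolation, which helps when tracking a failure down with print statements (a standalone sketch; bump_version is a made-up name, not part of the original hook):

```r
# Increment the last component of a dotted version string, as the hook does
# for the Version field of DESCRIPTION.
bump_version <- function(v) {
  parts <- strsplit(v, ".", fixed = TRUE)[[1]]
  n <- length(parts)
  parts[n] <- as.character(as.integer(parts[n]) + 1L)
  paste(parts, collapse = ".")
}

bump_version("0.4.12")  # "0.4.13"
```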
Thanks to @Carsten: using print statements, I could track the error down to the hook file. In the end it was a trivial bug: the Date field had accidentally been deleted (i.e. was missing) from the DESCRIPTION file.

Issue with curl: brackets in URLs

I have a vector of URLs that I would like to download from R, using curl on Mac OS X:
## URLs
grab = c("http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1079_33994-C81-I620_5-ANI-L056-00001[006154]ready//DA_2011-06-03_STINGA SIMONA_30381371.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_67011-C27-I620_6-ANI-L141-00001[045849]ready//DA_2012-05-28_SORIN VASILE_1308151.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34934-C93-I620_6-ANI-L058-00001[005631]ready//DI_2011-05-25_CONSTANTIN CATALIN IONITA_50364334.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_66964-C65-I620_5-ANI-L141-00001[045952]ready//DA_2012-05-24_DORINA ORZAC_1312037.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_67290-C65-I620_5-ANI-L141-00001[045768]ready//DI_2012-06-01_JIPA CAMELIA_1304833.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DA_2011-06-09_NICOLE MOT_50364493.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34937-C74-I620_7-ANI-L058-00001[005634]ready//DA_2011-06-14_PETRE ECATERINA_50364543.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1566_67978-C85-I780_2-ANI-L144-00001[046398]ready//DA_2012-05-25_RAMONA GHIERGA_1332323.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DA_2011-06-05_LOVIN G. ADINA_50364475.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP2135_40131-C90-I780_3-ANI-L069-00001[009742]ready//DI_2011-05-25_VARTOLOMEI PAUL-CONSTANTIN_467652.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1086_34373-C11-I620_3-ANI-L057-00001[005657]ready//DI_2011-05-16_CAZACU LILIANA_40437536.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34935-C93-I620_6-ANI-L058-00001[005632]ready//DI_2011-06-07_ROSCA EUGEN-CONSTANTIN_50364400.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP181_27399-C11-I780_2-ANI-L051-00001[005421]ready//DI_2010-11-03_DIAMANDI SAVA-CONSTANTIN_40429564.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DI_2011-06-07_ZAMFIRESCU I. IULIA_50364498.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1563_67587-C71-I780_3-ANI-L143-00001[046079]ready//DI_2012-05-21_MAZURU C. EMILIA_1317509.pdf"
)
My first attempt returned HTTP Error 400:
## fails on Mac OSX 10.9 (HTTP 400)
## for(x in grab) download.file(x, destfile = gsub("(.*)//D", "D", x))
I learnt that this was due to the URLs containing brackets, so I applied the --globoff fix this way:
## also fails despite fixing HTTP Err 400 (files are zero-sized)
for(x in grab) download.file(x, destfile = gsub("(.*)//D", "D", x), method = "curl", extra = "--globoff")
… and the files now download, but are all empty (zero-sized).
What am I getting wrong?
P.S. I'm willing to switch to Python or shell to get the files, but would prefer to keep the code 100% R.
Have you tried URL encoding the brackets?
%5B = [
%5D = ]
A bit late, but URLencode is what you use to ensure you have a well-formed URL.
> x <- "http://example.com/[xyz]//file with spaces.pdf"
> URLencode(x)
[1] "http://example.com/%5bxyz%5d//file%20with%20spaces.pdf"
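Since these URLs are otherwise plain ASCII, another minimal option is to percent-encode just the brackets before downloading (a sketch; enc is a made-up helper name). With the brackets encoded, curl no longer needs --globoff:

```r
# Percent-encode only the square brackets, leaving the rest of the URL intact.
enc <- function(u) gsub("\\[", "%5B", gsub("\\]", "%5D", u))

enc("http://example.com/a[1]ready//file.pdf")
# "http://example.com/a%5B1%5Dready//file.pdf"
```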

Get function's title from documentation

I would like to get the title of a base function (e.g. rnorm) in one of my scripts. The title is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the Rd files as \title{}, or the top line in the rendered documentation.
Is there any simple way to do this without calling the Rd_db function from tools and parsing all the Rd files, which has a very big overhead for such a simple task? Another thing: I tried parse_Rd too, but:
I do not know which Rd file holds my function,
I have no Rd files on my system (just rdb, rdx and rds).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what pulls in the location of the help files, and that the default for the associated find.package() call is NULL. It turns out that there is neither a help page for that function nor is it exported, so I tested the usual suspects for which package it lives in (base, tools, utils) and ended up with utils:
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}; here I have shown you how to find the specific Rd file to parse, and it sounds as though you already know how to do that.
EDIT: @Hadley has provided a method for getting all of the help text once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1",
               utils:::index.search("rnorm", find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the topic you're interested in; you can then manipulate that to extract whatever you want.
There may be simpler ways but, unfortunately, very little of the needed code is exported and documented. I really wish there were a base help package.
pkg_topic <- function(package, topic, file = NULL) {
  # Find "file" name given topic name/alias
  if (is.null(file)) {
    topics <- pkg_topics_index(package)
    topic_page <- subset(topics, alias == topic, select = file)$file
    if (length(topic_page) < 1)
      topic_page <- subset(topics, file == topic, select = file)$file
    stopifnot(length(topic_page) >= 1)
    file <- topic_page[1]
  }
  rdb_path <- file.path(system.file("help", package = package), package)
  tools:::fetchRdDB(rdb_path, file)
}
pkg_topics_index <- function(package) {
  help_path <- system.file("help", package = package)
  file_path <- file.path(help_path, "AnIndex")
  if (length(readLines(file_path, n = 1)) < 1) {
    return(NULL)
  }
  topics <- read.table(file_path, sep = "\t",
    stringsAsFactors = FALSE, comment.char = "", quote = "", header = FALSE)
  names(topics) <- c("alias", "file")
  topics[complete.cases(topics), ]
}