How to get the GitHub repo URL for all the packages on CRAN?

I would like to extract the GitHub repo URL for every package on CRAN. My first attempt was to read the CRAN index page and get the table of all package names, which also contains the URL of each package's description page, because I want to extract the GitHub repo URL from the description page. But I can't build the complete URLs. Could you please help me with this? Or is there a better way to get the repo URL for all packages?
Here is some supplementary information:
Actually, I want to filter for the packages that do have an official GitHub repo, such as xfun or fddm. I found that I can extract the username and repo name from each package's CRAN description and put them into a GitHub-formatted URL, since most of them follow the same pattern: https://github.com/{username}/{reponame}. For example, for the package xfun it would be https://github.com/yihui/xfun.
So far I have built a few of these URLs (three of them).
Now I am wondering how I could get the URL for all of them. I know the glue package can substitute elements into a URL template, and to build the URLs by replacing the elements (username and reponame) I have tried the map() and map_dfr() functions. But they return an error: Error in parse_url(url) : length(url) == 1 is not TRUE
Here is my code:
get <- map_dfr(dat, ~{
  username <- dat$user
  reponame <- dat$package
  pkg_url <- GET(glue::glue("https://github.com/{username}/{reponame}"))
})
Could you please help me with this? Thanks a lot ! :)

I want to suggest a different method for getting where you want.
As discussed in the comments, not all R packages have public GitHub repos.
Here is a version of some code from an answer by Dirk Eddelbuettel to another question; it retrieves information from CRAN's package database, including the package name and the URL field. If a package has a public GitHub repo, it is very likely that the authors have included that information in the URL field. There may be a few packages where the GitHub repo is guessable (i.e. the GitHub user name matches, say, the identifier in the maintainer's e-mail address, and the repo name matches the package name), but doing all that guessing (and querying GitHub to see whether each guess was correct) seems like a lot of work for a relatively low return.
getPackageRDS <- function() {
  description <- sprintf("%s/web/packages/packages.rds",
                         getOption("repos")["CRAN"])
  con <- if (substring(description, 1L, 7L) == "file://") {
    file(description, "rb")
  } else {
    url(description, "rb")
  }
  on.exit(close(con))
  db <- readRDS(gzcon(con))
  rownames(db) <- NULL
  return(db)
}
dd <- as.data.frame(getPackageRDS())
dd2 <- subset(dd, grepl("github.com", URL))
## clean up (multiple URLs, etc.)
dd2$URL <- sapply(strsplit(dd2$URL, "[, \n]"),
                  function(x) trimws(grep("github.com", x, value = TRUE)[1]))
As of today (25 May 2021) there are 17665 packages in total on CRAN, of which 6184 have "github.com" in the URL field. Here are the first few results:
Package URL
5 abbyyR http://github.com/soodoku/abbyyR
12 ABCoptim http://github.com/gvegayon/ABCoptim
16 abctools https://github.com/dennisprangle/abctools
18 abdiv https://github.com/kylebittinger/abdiv
20 abess https://github.com/abess-team/abess
23 ABHgenotypeR http://github.com/StefanReuscher/ABHgenotypeR
The URL field may still not be completely clean, but this should get you most of the way there.
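If you also want the username/repository pairs (to rebuild URLs of the form https://github.com/{username}/{reponame}, as in the question), here is a small sketch of extracting them from the cleaned URL field; the user, repo and repo_url columns are my own additions, not part of the code above:
## Split each cleaned GitHub URL into username and repository name
repo_path <- sub("^https?://(www\\.)?github\\.com/", "", dd2$URL)
repo_path <- sub("/+$", "", repo_path)            # drop trailing slashes
parts     <- strsplit(repo_path, "/", fixed = TRUE)
dd2$user  <- sapply(parts, `[`, 1)
dd2$repo  <- sapply(parts, `[`, 2)
dd2$repo_url <- ifelse(is.na(dd2$repo), NA,
                       paste0("https://github.com/", dd2$user, "/", dd2$repo))
head(dd2[, c("Package", "user", "repo", "repo_url")])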
An alternative approach would be to use the githubinstall package, which works by downloading a data frame that has been generated by crawling GitHub looking for R packages.
library(githubinstall)
dd3 <- gh_list_packages()
At present there are 34491 packages in this list, so obviously it includes a lot of packages that are not on CRAN. You could intersect this list of packages with the information from available.packages() ...
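For example, a minimal sketch of that intersection, assuming (as the githubinstall documentation describes) that gh_list_packages() returns username and package_name columns:
## Keep only the GitHub-hosted packages that are also on CRAN
cran_pkgs <- rownames(available.packages())
dd3_cran  <- dd3[dd3$package_name %in% cran_pkgs, ]
nrow(dd3_cran)
head(dd3_cran)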

Related

Can I get the URL of what will be used by install.packages?

When running install.packages("any_package") on Windows I get the message:
trying URL
'somepath.zip'
I would like to get this path without downloading the file. Is that possible?
In other words, I'd like to get the CRAN link to the Windows binary of the latest release (ideally via a new function that takes the same parameters as install.packages and returns the proper URL(s) as output).
I need a way that works from the R console (no manual checking of the CRAN page, etc.).
I am not sure if this is exactly what you are looking for. This builds the URL from the repository information and assembles the file name from the list of available packages.
# Get the repository name
repos <- getOption("repos")

# Get the contrib URL for binary packages
# contrib.url(repos, "both")
contriburl <- contrib.url(repos, "binary")
# "https://mirrors.nics.utk.edu/cran/bin/windows/contrib/3.5"

# Make a data.frame of available packages
df <- as.data.frame(available.packages())

# Find the package of interest
pkg <- "tidyr" # example
# ofinterest <- grep(pkg, df$Package)
ofinterest <- match(pkg, df$Package) # returns a single value

# Assemble the file name; assumes it is always a zip file
name <- paste0(df[ofinterest, ]$Package, "_", df[ofinterest, ]$Version, ".zip")

# Make the final URL
finalurl <- paste0(contriburl, "/", name)
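If you want to confirm that the assembled URL actually resolves without downloading the file, one option (my addition, not part of the original answer) is to fetch only the HTTP headers with base R's curlGetHeaders():
# Fetch only the headers for the assembled URL (no download)
h <- curlGetHeaders(finalurl)
attr(h, "status")  # 200 means the binary exists at that URL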
Here are a couple of functions which, respectively:
get the latest R version from RStudio's website
get the URL of the latest released Windows binary
The first is a variation of code I found in the installr package. It seems there is no clean way of getting the latest version, so we have to scrape a web page.
The second is really just @Dave2e's code optimized and refactored into a function (with a fix for outdated R versions), so please direct upvotes to his answer.
get_package_url <- function(pkg) {
  version <- try(
    available.packages()[pkg, "Version"],
    silent = TRUE)
  if (inherits(version, "try-error"))
    stop("Package '", pkg, "' is not available")
  contriburl <- contrib.url(getOption("repos"), "binary")
  url <- file.path(
    dirname(contriburl),
    get_last_R_version(2),
    paste0(pkg, "_", version, ".zip"))
  url
}

get_last_R_version <- function(n = 3) {
  page <- readLines(
    "https://cran.rstudio.com/bin/windows/base/",
    warn = FALSE)
  line <- grep("R-[0-9.]+.+-win\\.exe", page, value = TRUE)
  long <- gsub("^.*?R-([0-9.]+.+)-win\\.exe.*$", "\\1", line)
  paste(strsplit(long, "\\.")[[1]][1:n], collapse = ".")
}
get_package_url("data.table")
# on my system with R 3.3.1
# [1] "https://lib.ugent.be/CRAN/bin/windows/contrib/3.5/data.table_1.11.4.zip"

How to quickly replicate/update local library under $R_LIBS_USER?

Suppose that
the version of R I have installed is 3.3.1;
my environment variable $R_LIBS_USER is set to $HOME/my/R_lib/%V; and
I have a directory $HOME/my/R_lib/3.3.1 containing a large number of packages I've installed over time.
Now I want to upgrade my version of R, to 3.4.1, say.
I'm looking for a convenient way to install a new collection of packages under the new directory $HOME/my/R_lib/3.4.1 that is the "version-3.4.1-equivalent" of the library I currently have under $HOME/my/R_lib/3.3.1.
(In other words, I'm looking for functionality similar to Python's pip installer's "freeze" option, which essentially produces the input you would have to give the installer later to reproduce the current installation.)
You can use the function installed.packages() for that purpose. Namely:
installed.packages(lib.loc = "$HOME/my/R_lib/3.3.1")
The returned object contains a lot of information (most fields from each package's DESCRIPTION file), but the names of the packages are in the first column. So something like the following should do the trick:
inst <- installed.packages(lib.loc = "$HOME/my/R_lib/3.3.1")
install.packages(inst[,1], lib="$HOME/my/R_lib/3.4.1", dependencies=FALSE)
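If you want something closer to pip freeze, i.e. a plain-text file you can keep and replay later, here is a minimal sketch along the same lines (the file name is my own choice, and the $HOME placeholders mirror the paths above):
## "Freeze": record the package names from the old library in a text file
inst <- installed.packages(lib.loc = "$HOME/my/R_lib/3.3.1")
writeLines(inst[, "Package"], "$HOME/my/R_lib/freeze-3.3.1.txt")

## "Replay": reinstall the same set into the new library later
pkgs <- readLines("$HOME/my/R_lib/freeze-3.3.1.txt")
install.packages(pkgs, lib = "$HOME/my/R_lib/3.4.1", dependencies = FALSE)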
To answer the additional question in the comments: if your old library contained packages from sources other than CRAN, you will have to do some gymnastics based on the content of the DESCRIPTION files, and the result will depend on how well each package author documented them.
The fields argument of installed.packages lets you select additional fields from those files. Fields of interest for determining the source of a package are Repository, URL and Maintainer. Here are some ideas on how to split them apart:
CRAN vs non-CRAN:
inst <- installed.packages(lib.loc = "$HOME/my/R_lib/3.3.1",
                           fields = c("URL", "Repository", "Maintainer"))
inst <- as.data.frame(inst, row.names = NA, stringsAsFactors = FALSE)
cran <- inst[inst$Repository %in% "CRAN", ]
non_cran <- inst[!inst$Repository %in% "CRAN" & !inst$Priority %in% "base", ]
Bioconductor packages:
bioc <- inst[grepl("Bioconductor", inst$Maintainer), ]
source("https://bioconductor.org/biocLite.R")
biocLite(pkgs = bioc$Package)
Github packages:
git <- non_cran[grepl("github", non_cran$URL), ]
install.packages("devtools")
library(devtools)
for (i in seq(nrow(git))) {
  # strip either http:// or https:// before the user/repo part
  install_github(repo = gsub("^https?://(www\\.)?github\\.com/", "", git$URL[i]))
}
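To complete the picture, here is a sketch (my addition, not part of the original answer) of reinstalling the plain CRAN subset into the new library, alongside the Bioconductor and GitHub steps above:
## Reinstall the CRAN-only subset into the new 3.4.1 library
install.packages(cran$Package, lib = "$HOME/my/R_lib/3.4.1",
                 dependencies = FALSE)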

How to use getReturns with the Yahoo finance API

I have problems with the R package getReturns. I have encountered this error since 17 May:
Warning in file(file, "rt") :
cannot open URL 'http://ichart.finance.yahoo.com/table.csv?s=AAPL&a=4&b=28&c=2014&d=4&e=27&f=2017&g=w&ignore=.csv': HTTP status was '404 Not Found'
It looks like the ichart API is no longer running. Can anyone help me with this issue? Does anyone know how to fix it? I have encountered the same issue with the quantmod R package.
I have encountered this problem as well. Yahoo! has taken down ichart and the open-source libraries that rely on it are now broken. Yahoo! also has no plans to introduce a replacement. For more information, see this post on Yahoo!'s Forums.
I switched to eodhistoricaldata.com after Yahoo failed; several weeks ago I found it to be a good alternative with an API very similar to Yahoo Finance's.
Basically, for almost all R scripts I use I just changed this:
URL <- paste0("ichart.finance.yahoo.com/table.csv?s=", symbols[i])
to:
URL <- paste0("eodhistoricaldata.com/api/table.csv?s=", symbols[i])
Then add an API key and it will work in the same way as before. It has saved me a lot of time on my R scripts.
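For completeness, a sketch of what adding the key might look like; the api_token query parameter is my assumption about their API, so check the provider's documentation for the exact format:
## Assumed query format -- verify the parameter name and path
## against eodhistoricaldata.com's own documentation.
api_key <- "YOUR_API_KEY"
URL <- paste0("https://eodhistoricaldata.com/api/table.csv?s=", symbols[i],
              "&api_token=", api_key)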
You can follow an earlier post of mine, which might help you.
I tried:
library(quantmod)
# Create an object containing the Pfizer ticker symbol
symbol <- "PFE"
# Use getSymbols to import the data
getSymbols(symbol, src="yahoo", auto.assign=T)
# because src='google' throws error, yahoo was used, and even that is down
When I tried other source, it worked:
# "quantmod::oanda.currencies" contains a list of currencies provided by Oanda.com
currency_pair <- "GBP/CAD"
# Load British Pound to Canadian Dollar exchange rate data
getSymbols(currency_pair, src="oanda")
str(GBPCAD)
It seems there are issues with both the google and yahoo sources when using the quantmod package.
I suggest you use Quandl instead. Go to the Quandl website, register for free, create an API key, and then copy it in below:
# Install Quandl
install.packages("Quandl")
# or from github
install.packages("devtools")
library(devtools)
install_github("quandl/quandl-r")
# Load the Quandl package
library(Quandl)
# use API for full access
Quandl.api_key("xxxxxx")
# Download APPLE stock data
mydata = Quandl::Quandl.datatable("ZACKS/FC", ticker="AAPL")
For HDFC at BSE, you can use:
hdfc = Quandl("BSE/BOM500180")
for more details:
https://www.quandl.com/data/BSE-Bombay-Stock-Exchange?keyword=HDFC

Package that downloads data from the internet during installation

Is anyone aware of a package that downloads a dataset from the internet during the installation process and then prepares and saves it so that it is available when loading the package using library(packageName)? Are there any drawbacks in this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?
EDIT: Some background. The data is three tab-separated files in a ZIP archive, published by a federal statistics agency and generally freely accessible. I have R code which downloads, extracts and prepares the data; in the end three data frames are created, which could be saved in .RData format.
I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.
I made this mockup earlier, while you were posting your edit. I presume it would work, but I have not tested it. I've commented it so you can see what you would need to change. The idea here is to check whether an expected object is available in the current working environment. If it is not, check whether the file containing the data is in the current working directory. If that is not found either, prompt the user to download the file, then proceed from there.
myFunction <- function(this, that, dataset = NULL) {
  # 'dataset' defaults to NULL so the check below works
  # when the user does not supply one.
  # We're giving the user a chance to specify the dataset.
  # Maybe they have already downloaded it and saved it.
  if (is.null(dataset)) {
    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    # contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (isTRUE(list.files(
            pattern = "^DATAFILE.RData$") == "DATAFILE.RData")) {
        load("DATAFILE.RData")
        # If neither of those are successful, prompt the user
        # to download the dataset.
      } else {
        ans = readline(
"DATAFILE.RData dataset not found in working directory.
OBJECTNAME object not found in workspace. \n
Download and load the dataset now? (y/n) ")
        if (ans != "y")
          return(invisible())
        # I usually use RCurl in case the URL is https
        require(RCurl)
        baseURL = c("http://some/base/url/")
        # Here, we actually download the data
        temp = getBinaryURL(paste0(baseURL, "DATAFILE.RData"))
        # Here we load the data
        load(rawConnection(temp), envir = .GlobalEnv)
        message("OBJECTNAME data downloaded from \n",
                paste0(baseURL, "DATAFILE.RData \n"),
                "and added to your workspace\n\n")
        rm(temp, baseURL)
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}
Two packages, hosted at Github
Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:
Check to see if the data package is installed
Check to see if the version of the data package you have installed matches the version at Github, which we are going to assume is the most up to date version.
When it fails any of the checks, it asks the user if they want to update their installation of the package. In this case, for demonstration, I've linked to one of my packages in progress at Github. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.
CheckVersionFirst <- function() {
  # Check to see if installed
  if (!"StataDCTutils" %in% installed.packages()[, 1]) {
    Checks <- "Failed"
  } else {
    # Compare version numbers
    require(RCurl)
    temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    CurrentVersion <- gsub("^\\s|\\s$", "",
                           gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    if (packageVersion("StataDCTutils") == CurrentVersion) {
      Checks <- "Passed"
    }
    if (packageVersion("StataDCTutils") < CurrentVersion) {
      Checks <- "Failed"
    }
  }
  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans = readline(
        "'StataDCTutils' is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      require(devtools)
      install_github("StataDCTutils", "mrdwab")
    })
  # Some cool things you want to do after you are sure the data is there
}
Try it out with CheckVersionFirst().
Note: This would succeed only if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to Github!
So, to clarify/recap/expand, the basic idea would be to:
Periodically push the updated version of your data package to Github, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
Integrate this CheckVersionFirst() function as an .onLoad event in your code package; a minimal sketch is shown after this list. (Obviously modify the function to match your account and package name.)
Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data....
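Here is a minimal sketch of what that .onLoad hook could look like in the code package (putting it in R/zzz.R is just the usual convention; adjust as needed):
## In the code package, e.g. in R/zzz.R
.onLoad <- function(libname, pkgname) {
  ## CheckVersionFirst() uses readline(), so only prompt in
  ## interactive sessions.
  if (interactive()) CheckVersionFirst()
}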
This method may not be the most efficient, but it is a good workaround. If you are making a package that needs regularly updated data, first make a package that contains that data. It does not need any functions, but I like the concept of a setter (which you might not need in this case) and a getter.
Then, when you make your main package, declare the data package as a dependency. This way, whenever someone installs your package, they will always have the latest data.
On your part, you'll just have to swap out the data in your data package and upload it to the repository you want.
If you don't know how to build a package, check ?package.skeleton and R CMD build / R CMD check.
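For reference, here is a minimal sketch of starting such a data package with package.skeleton(); all names are placeholders:
## Create the skeleton of a 'data' package around an existing object,
## then edit the generated files and build/check it from the shell:
##   R CMD build myDataPkg
##   R CMD check myDataPkg_<version>.tar.gz
mydata <- data.frame(x = 1:3)   # placeholder for the real dataset
package.skeleton(name = "myDataPkg", list = "mydata")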

available CRAN vignettes

There's the available.packages() function to list all packages available on CRAN. Is there a similar function to find all available vignettes? If not how would I get a list of all vignettes and the packages they're associated with?
As a corner case to keep in mind, the data.table package has 3 vignettes associated with it.
EDIT: Per Andrie's response, I realize I wasn't clear. I know about the vignette() function for finding all the locally installed vignettes; I'm after a way to get all the vignettes of all packages on CRAN.
I seem to recall looking at this in response to some SO question (I can't find it now) and deciding that, since the information isn't included in the output of available.packages(), nor in the result of applying readRDS to CRAN's web/packages/packages.rds file (a trick from Jeroen Ooms), I couldn't think of a non-scraping way to do it ...
Here's my scraper, applied to the first 100 packages (yielding 44 vignettes):
pkgs <- unname(available.packages()[, 1])[1:100]
vindex_urls <- paste0(getOption("repos"), "/web/packages/", pkgs,
                      "/vignettes/index.rds")
getf <- function(x) {
  ## I think there should be a way to do this directly
  ## with readRDS(url(...)) but I can't get it to work
  suppressWarnings(
    download.file(x, "tmp.rds", quiet = TRUE))
  readRDS("tmp.rds")
}
library(plyr)
## llply returns a list (one element per package), which is what
## the mapply/rbind step below expects
vv <- llply(vindex_urls,
            .progress = "text",
            function(x) {
              if (inherits(z <- try(getf(x), silent = TRUE),
                           "try-error")) NULL else z
            })
tmpf <- function(x, n) {
  if (is.null(x)) NULL else data.frame(pkg = n, x)
}
vframe <- do.call(rbind, mapply(tmpf, vv, pkgs, SIMPLIFY = FALSE))
rownames(vframe) <- NULL
head(vframe[, c("pkg", "Title")])
There may be ways to clean this up or make it more compact, but it seems to work OK. Your scrape-once/update-occasionally strategy seems reasonable. Or, if you wanted, you could scrape daily (or weekly, or whatever seems reasonable) and save/post the results somewhere publicly accessible, then include a function with that URL hard-coded in the package ... or even create a nicely formatted HTML table, with links, that the whole world could use (and then add Viagra ads to the page, and $$PROFIT$$ ...)
edit: I wrapped both the download and the readRDS call in a function, so I can wrap the whole thing in try.
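As an aside about the comment inside getf(): reading the index without a temporary file may work by wrapping the URL connection in gzcon(); this is a sketch I have not verified against every mirror:
## Possible direct read of the (gzip-compressed) index.rds from a URL,
## avoiding the temporary file -- untested sketch.
getf2 <- function(x) {
  con <- gzcon(url(x, open = "rb"))
  on.exit(close(con))
  readRDS(con)
}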
The functions vignette() and browseVignettes() list all vignettes of packages installed on your machine.
vignette(package="data.table")
Vignettes in package ‘data.table’:
datatable-faq Frequently asked questions (source, pdf)
datatable-intro Quick introduction (source, pdf)
datatable-timings Timings of common tasks (source, pdf)
browseVignettes() is especially helpful since it creates a web page with hyperlinks:
browseVignettes(package="data.table")
Vignettes found by browseVignettes(package = "data.table")
Vignettes in package data.table
Frequently asked questions - PDF R LaTeX/noweb
Quick introduction - PDF R LaTeX/noweb
Timings of common tasks - PDF R LaTeX/noweb
