Can I get the URL of what will be used by install.packages?

When running install.packages("any_package") on Windows I get the message:
trying URL
'somepath.zip'
I would like to get this path without downloading. Is that possible?
In other words, I'd like to get the CRAN link to the Windows binary of the latest release (the best would actually be to be able to call a new function with the same parameters as install.packages and get the proper URL(s) as output).
I need a way that works from the R console (no manual checking of the CRAN page, etc.).

I am not sure if this is what you are looking for. This builds the URL from the repository information and constructs the file name from the list of available packages.
# Get the repository name
repos <- getOption("repos")
# Get the URL for binary packages, e.g.
# "https://mirrors.nics.utk.edu/cran/bin/windows/contrib/3.5"
contriburl <- contrib.url(repos, "binary")
# Make a data.frame of available packages
df <- as.data.frame(available.packages())
# Find the package of interest
pkg <- "tidyr"  # example
ofinterest <- match(pkg, df$Package)  # returns a single value
# Assemble the file name; this assumes it is always a zip file
name <- paste0(df[ofinterest, ]$Package, "_", df[ofinterest, ]$Version, ".zip")
# Make the final URL
finalurl <- paste0(contriburl, "/", name)
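If you want to check that finalurl actually resolves without downloading the package, one option is simply to open and close a connection to it (a sketch; needs network access):
# Opening the connection errors if the URL is unreachable, but it does
# not download the package itself.
con <- url(finalurl, open = "rb")
close(con)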

Here are a couple of functions which respectively:
get the latest R version from RStudio's website
get the URL of the last released Windows binary
The first is a variation of code I found in the installr package. It seems there's no clean way of getting the latest version, so we have to scrape a webpage.
The second is really just @Dave2e's code optimized and refactored into a function (with a fix for outdated R versions), so please direct upvotes to his answer.
get_package_url <- function(pkg) {
  version <- try(
    available.packages()[pkg, "Version"],
    silent = TRUE)
  if (inherits(version, "try-error"))
    stop("Package '", pkg, "' is not available")
  contriburl <- contrib.url(getOption("repos"), "binary")
  url <- file.path(
    dirname(contriburl),
    get_last_R_version(2),
    paste0(pkg, "_", version, ".zip"))
  url
}
get_last_R_version <- function(n = 3) {
  page <- readLines(
    "https://cran.rstudio.com/bin/windows/base/",
    warn = FALSE)
  line <- grep("R-[0-9.]+.+-win\\.exe", page, value = TRUE)
  long <- gsub("^.*?R-([0-9.]+.+)-win\\.exe.*$", "\\1", line)
  paste(strsplit(long, "\\.")[[1]][1:n], collapse = ".")
}
get_package_url("data.table")
# on my system with R 3.3.1
# [1] "https://lib.ugent.be/CRAN/bin/windows/contrib/3.5/data.table_1.11.4.zip"

How to get the github repo url for all the packages on CRAN?

I would like to extract the GitHub repo URL for all the packages on CRAN. I have tried reading the CRAN page and getting the table of all package names, which also contains the URL of each package's description page, since I want to extract the GitHub repo URL from the description page. But I can't get the complete URL. Could you please help me with this? Or is there a better way to get the repo URL for all packages?
This is my supplementary:
Actually, I want to filter the packages that do have an official GitHub repo, such as xfun or fddm. I found I can extract the username and repo name from the description of packages on CRAN and put them into a GitHub-formatted URL, since most of them follow the same format: https://github.com/{username}/{reponame}. For example, for the package xfun it would be https://github.com/yihui/xfun.
So far I have built a few of them (three, shown in a screenshot omitted here).
I am wondering how I could get the URL for all of them. I know the glue package can replace the elements in a URL, and to build the URLs by replacing the elements (username and reponame) I have tried the map() and map_dfr() functions. But they return the error: Error in parse_url(url) : length(url) == 1 is not TRUE
Here is my code:
get <- map_dfr(dat, ~{
  username <- dat$user
  reponame <- dat$package
  pkg_url <- GET(glue::glue("https://github.com/{username}/{reponame}"))
})
Could you please help me with this? Thanks a lot! :)
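As an aside, the error itself arises because GET() accepts only a single URL, while glue() here produces one URL per row of dat; iterating row-wise avoids it. A minimal sketch (the column names user and package are taken from the code above):
library(httr)
urls <- glue::glue("https://github.com/{dat$user}/{dat$package}")
# GET() takes one URL at a time, so loop over the vector
responses <- lapply(urls, GET)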
I want to suggest a different method for getting where you want.
As discussed in the comments, not all R packages have public GitHub repos.
Here is a version of some code from an answer to another question by Dirk Eddelbuettel that retrieves information from CRAN's database, including the package name and the URL field. If a package has a public GitHub repo, it is very likely that the authors have included that information in the URL field. There may be a few packages where the repo information is guessable (e.g. the GitHub user name matches the identifier in the maintainer's e-mail address and the repo name matches the package name), but it seems like a lot of work to do all that guessing (and to query GitHub to see whether each guess was correct) for a relatively low return.
getPackageRDS <- function() {
  description <- sprintf("%s/web/packages/packages.rds",
                         getOption("repos")["CRAN"])
  con <- if (substring(description, 1L, 7L) == "file://") {
    file(description, "rb")
  } else {
    url(description, "rb")
  }
  on.exit(close(con))
  db <- readRDS(gzcon(con))
  rownames(db) <- NULL
  return(db)
}
dd <- as.data.frame(getPackageRDS())
dd2 <- subset(dd, grepl("github.com", URL))
## clean up (multiple URLs, etc.)
dd2$URL <- sapply(strsplit(dd2$URL, "[, \n]"),
                  function(x) trimws(grep("github.com", x, value = TRUE)[1]))
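If you also want the user and repo name as separate columns, a regular expression along these lines may work (the pattern is an assumption about the typical URL shape):
# Extract user and repo from the cleaned URL field; non-matching
# entries come back as NA.
m <- regmatches(dd2$URL, regexec("github\\.com/([^/]+)/([^/#[:space:]]+)", dd2$URL))
dd2$user <- vapply(m, function(x) if (length(x) >= 3) x[2] else NA_character_, character(1))
dd2$repo <- vapply(m, function(x) if (length(x) >= 3) x[3] else NA_character_, character(1))
head(dd2[, c("Package", "user", "repo")])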
As of today (25 May 2021) there are 17665 packages in total on CRAN, of which 6184 have "github.com" in the URL field. Here are the first few results:
Package URL
5 abbyyR http://github.com/soodoku/abbyyR
12 ABCoptim http://github.com/gvegayon/ABCoptim
16 abctools https://github.com/dennisprangle/abctools
18 abdiv https://github.com/kylebittinger/abdiv
20 abess https://github.com/abess-team/abess
23 ABHgenotypeR http://github.com/StefanReuscher/ABHgenotypeR
The URL field may still not be completely clean, but this should get you most of the way there.
An alternative approach would be to use the githubinstall package, which works by downloading a data frame that has been generated by crawling GitHub looking for R packages.
library(githubinstall)
dd3 <- gh_list_packages()
At present there are 34491 packages in this list, so it obviously includes a lot of packages that are not on CRAN. You could intersect this list with the information from available.packages().
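For example (a sketch; I believe the package-name column is called package_name, but check the function's help page):
# Keep only the GitHub-hosted packages that are also on CRAN
cran_pkgs <- rownames(available.packages())
dd3_on_cran <- dd3[dd3$package_name %in% cran_pkgs, ]
nrow(dd3_on_cran)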

STRINGdb r environment; error in plot_network

I'm trying to use STRINGdb in R, and I'm getting the following error when I try to plot the network:
Error in if (grepl("The document has moved", res)) { : argument is of length zero
Code:
library(STRINGdb)
#(specify organism)
string_db <- STRINGdb$new(version = "10", species = 9606, score_threshold = 0)
filt_mapped = string_db$map(filt, "GeneID", removeUnmappedRows = TRUE)
head(filt_mapped)
(I have columns titled GeneID, logFC, FDR, and STRING_id, with 156 rows)
filt_mapped_hits = filt_mapped$STRING_id
head(filt_mapped_hits)
(156 observations)
string_db$plot_network(filt_mapped_hits, add_link = FALSE)
Error in if (grepl("The document has moved", res)) { : argument is of length zero
You are using a version of Bioconductor, and by extension of the STRINGdb package, that is quite a few years old.
If you update to the newest one, it will work. However, the updated package only supports the latest version of STRING (currently version 11), so the underlying network may change a bit.
The more detailed reason is this:
STRING's hardware infrastructure recently underwent major changes, which forced a different server setup.
Now all the old calls are forwarded to a different URL; however, the cURL call as it was implemented does not follow our redirects, which breaks the STRINGdb package functionality.
We cannot update the old Bioconductor package, and our server setup can't really be changed.
That said, the fix for an old version is relatively simple.
In the STRINGdb library there is a script with all the methods, "rstring.r".
There you'll find the "get_png" method. In it, replace this line:
urlStr = paste("http://string-db.org/version_", version, "/api/image/network", sep="" )
With this line:
urlStr = paste("http://version", version, ".string-db.org/api/image/network", sep="" )
Load the library again and it should create the PNG as before.
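You can confirm the redirect from R before patching anything (a sketch using the httr package, which follows redirects by default; needs network access):
library(httr)
old <- GET("http://string-db.org/version_10/api/image/network")
old$url          # the address the old endpoint forwards to
status_code(old) # status after following the redirect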

How to find out if I am using a package

My R script has evolved over many months with many additions and subtractions. It is long and rambling, and I would like to find out which packages I am actually using in the code so I can start deleting library() references. Is there a way of finding redundant dependencies in an R script?
I saw this question so I tried:
library(mvbutils)
library(MyPackage)
library(dplyr)
foodweb(funs = find.funs(asNamespace("EndoMineR")),
        where = asNamespace("EndoMineR"), prune = "filter")
But that really tells me where I am using a function from a package whereas I don't necessarily remember which functions I have used from which package.
I tried packrat, but it is geared towards projects, whereas mine is a directory of scripts that I am trying to build into a package.
To do this, first parse your file, then use getAnywhere to check, for each terminal token, in which namespace it is defined. Compare the result with the search path and you have the answer. I wrote a function that takes a file name as its argument and returns the packages used in that file. Note that it can only find packages that are loaded when the function is executed, so be sure to load all "candidates" first.
findUsedPackages <- function(sourcefile) {
  ## get the parse tree
  parse_data <- getParseData(parse(sourcefile))
  ## extract terminal tokens
  terminal_tokens <- parse_data[parse_data$terminal == TRUE, "text"]
  ## get loaded packages/namespaces
  search_path <- search()
  ## helper function to find the package a token belongs to
  findPackage <- function(token) {
    ## get info on where the token is defined
    token_info <- getAnywhere(token)
    ## return the package that comes first in the search path
    token_info$where[which.min(match(token_info$where, search_path))]
  }
  packages <- lapply(unique(terminal_tokens), findPackage)
  ## remove elements that do not belong to a loaded namespace
  packages <- packages[sapply(packages, length) > 0]
  packages <- do.call(c, packages)
  packages <- unique(packages)
  ## do not return base and .GlobalEnv
  packages[-which(packages %in% c("package:base", ".GlobalEnv"))]
}
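For example, loading the candidate packages first so their namespaces are on the search path ("my_script.R" is a hypothetical file name):
library(dplyr)
library(mvbutils)
findUsedPackages("my_script.R")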

Installing pdftotext on Windows (for use with R, 'tm' package)

I am having trouble using R's 'tm' package to read in .pdf files.
Specifically, I try to run the following code:
library(tm)
filename = "myfile.pdf"
tmp1 <- readPDF(PdftotextOptions="-layout")
doc <- tmp1(elem=list(uri=filename),language="en",id="id1")
doc[1:15]
...which gives me the error:
Error in readPDF(PdftotextOptions = "-layout") : unused argument (PdftotextOptions = "-layout")
I assume this is because the pdftotext program (part of xpdf, http://www.foolabs.com/xpdf/download.html) has not been installed correctly on my machine, so R cannot access it.
What are the steps to install xpdf/pdftotext correctly so that the above R code can be executed? (I am aware of similar questions already posted; however, they don't address the same issue.)
PdftotextOptions is not a parameter of readPDF. readPDF has a control parameter, which expects a list. So the correct use would be:
if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  tmp1 <- readPDF(control = list(text = "-layout"))
  doc <- tmp1(elem = list(uri = filename), language = "en", id = "id1")
}
Set the working directory to the folder containing the xpdf binaries:
setwd('C:/xpdf/bin64')
It works for me.
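An alternative to changing the working directory is to put the xpdf binaries on the PATH for the current R session, so pdftotext is found from anywhere (the folder shown is an assumption; adjust it to wherever you unpacked xpdf):
# Prepend the xpdf binary folder to PATH for this session only
Sys.setenv(PATH = paste("C:/xpdf/bin64", Sys.getenv("PATH"), sep = ";"))
Sys.which("pdftotext")  # should now return the full path to pdftotext.exe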

Package that downloads data from the internet during installation

Is anyone aware of a package that downloads a dataset from the internet during the installation process and then prepares and saves it so that it is available when loading the package using library(packageName)? Are there any drawbacks to this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?
EDIT: Some background. The data consists of three tab-separated files in a ZIP archive, owned by a federal statistics office and generally freely accessible. I have R code which downloads, extracts, and prepares the data; in the end, three data frames are created which could be saved in .RData format.
I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.
I did this mockup before, while you were posting your edit. I presume it would work, but I haven't tested it. I've commented it so you can see what you would need to change. The idea here is to check whether an expected object is available in the current working environment. If it is not, check whether the file containing the data is in the current working directory. If that is not found, prompt the user to download the file, then proceed from there.
myFunction <- function(this, that, dataset = NULL) {
  # We're giving the user a chance to specify the dataset.
  # Maybe they have already downloaded it and saved it.
  if (is.null(dataset)) {
    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    # contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (isTRUE(list.files(
        pattern = "^DATAFILE.RData$") == "DATAFILE.RData")) {
        load("DATAFILE.RData")
        # If neither of those are successful, prompt the user
        # to download the dataset.
      } else {
        ans = readline(
          "DATAFILE.RData dataset not found in working directory.
OBJECTNAME object not found in workspace. \n
Download and load the dataset now? (y/n) ")
        if (ans != "y")
          return(invisible())
        # I usually use RCurl in case the URL is https
        require(RCurl)
        baseURL = c("http://some/base/url/")
        # Here, we actually download the data
        temp = getBinaryURL(paste0(baseURL, "DATAFILE.RData"))
        # Here we load the data
        load(rawConnection(temp), envir = .GlobalEnv)
        message("OBJECTNAME data downloaded from \n",
                paste0(baseURL, "DATAFILE.RData \n"),
                "and added to your workspace\n\n")
        rm(temp, baseURL)
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}
Two packages, hosted at GitHub
Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:
Check to see if the data package is installed
Check to see if the version of the data package you have installed matches the version at GitHub, which we are going to assume is the most up-to-date version.
When any of the checks fails, it asks the user whether they want to update their installation of the package. For demonstration, I've linked to one of my packages in progress at GitHub. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.
CheckVersionFirst <- function() {
  # Check to see if installed
  if (!"StataDCTutils" %in% installed.packages()[, 1]) {
    Checks <- "Failed"
  } else {
    # Compare version numbers
    require(RCurl)
    temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    CurrentVersion <- gsub("^\\s|\\s$", "",
                           gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    if (packageVersion("StataDCTutils") == CurrentVersion) {
      Checks <- "Passed"
    }
    if (packageVersion("StataDCTutils") < CurrentVersion) {
      Checks <- "Failed"
    }
  }
  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans = readline(
        "'StataDCTutils' is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      require(devtools)
      install_github("StataDCTutils", "mrdwab")
    })
  # Some cool things you want to do after you are sure the data is there
}
Try it out with CheckVersionFirst().
Note: This will only work if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to GitHub!
So, to clarify/recap/expand, the basic idea would be to:
Periodically push the updated version of your data package to GitHub, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
Integrate this CheckVersionFirst() function as an .onLoad event in your code package. (Obviously modify the function to match your account and package name).
Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data....
This method may not be efficient, but it is a good workaround. If you are making a package that needs regularly updated data, first make a package that contains that data. It does not need any functions, but I like the concept of a setter (which you might not need in this case) and a getter.
Then, when you make your package, have the 'data' package as a dependency. This way, whenever someone installs your package, they will always have the latest data.
On your part, you'll just have to swap out the data in your 'data' package and upload it to the repo you want.
If you don't know how to build a package, check ?package.skeleton and R CMD check, R CMD build.
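As a minimal sketch of that setup (object and package names here are hypothetical):
# Assuming the three prepared data frames are named df1, df2 and df3,
# scaffold the data package around them:
package.skeleton(name = "mydatapkg", list = c("df1", "df2", "df3"))
# Then, in the code package's DESCRIPTION, declare:  Depends: mydatapkg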
