Importing data into R from a Google spreadsheet

There seems to have been a change in the Google spreadsheet publishing options. It is no longer possible to publish to the web as a CSV or tab-delimited file (see this recent post). Thus the usual way of using RCurl to import data into R from a Google spreadsheet no longer works:
require(RCurl)
u <- "https://docs.google.com/spreadsheet/pub?hl=en_GB&hl=en_GB&key=0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE&single=true&gid=4&output=csv"
tc <- getURL(u, ssl.verifypeer=FALSE)
net <- read.csv(textConnection(tc))
Does anyone have a work-around?

I just wrote a simple package to solve exactly this problem: downloading a Google sheet using just the URL.
install.packages('gsheet')
library(gsheet)
gsheet2tbl('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo')
More detail is here: https://github.com/maxconway/gsheet

Use the googlesheets4 package, a Google Sheets R API by Jenny Bryan. It is the best way to analyze and edit Google Sheets data in R. Not only can it pull data from Google Sheets, but you can edit the data in Google Sheets, create new sheets, etc.
The package can be installed with install.packages("googlesheets4").
There's a vignette for getting started; see her GitHub repository for more. You can also install the latest development version of the package from that GitHub page, if desired.
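As a quick illustration (not part of the original answer), here is a minimal sketch assuming a publicly readable sheet; the sheet ID is the one from the gsheet example above and is used purely for illustration:
library(googlesheets4)
gs4_deauth()  # skip OAuth; only works for sheets shared as "anyone with the link"
dat <- read_sheet("https://docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo")
head(dat)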

I am working on a solution for this. Here is a function that works on your data as well as a few of my own Google Spreadsheets.
First, we need a function to read from Google sheets. readGoogleSheet() will return a list of data frames, one for each table found on the Google sheet:
readGoogleSheet <- function(url, na.string="", header=TRUE){
  stopifnot(require(XML))
  # Suppress warnings because Google docs seems to have incomplete final line
  suppressWarnings({
    doc <- paste(readLines(url), collapse=" ")
  })
  if(nchar(doc) == 0) stop("No content found")
  htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc)
  ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE)
  lapply(ret, function(x){ x[ x == na.string] <- NA; x})
}
Next, we need a function to clean the individual tables. cleanGoogleTable() removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:
cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){
  if(!is.data.frame(dat)){
    dat <- dat[[table]]
  }
  if(is.na(dropFirstCol)) {
    firstCol <- na.omit(dat[[1]])
    if(all(firstCol == ".") || all(firstCol == as.character(seq_along(firstCol)))) {
      dat <- dat[, -1]
    }
  } else if(dropFirstCol) {
    dat <- dat[, -1]
  }
  if(skip > 0){
    dat <- dat[-seq_len(skip), ]
  }
  if(nrow(dat) == 1) return(dat)
  if(nrow(dat) >= 2){
    if(all(is.na(dat[2, ]))) dat <- dat[-2, ]
  }
  if(header && nrow(dat) > 1){
    header <- as.character(dat[1, ])
    names(dat) <- header
    dat <- dat[-1, ]
  }
  # Keep only desired columns
  if(!is.na(ncols)){
    ncols <- min(ncols, ncol(dat))
    dat <- dat[, seq_len(ncols)]
  }
  # Keep only desired rows
  if(nrows > 0){
    nrows <- min(nrows, nrow(dat))
    dat <- dat[seq_len(nrows), ]
  }
  # Rename rows
  rownames(dat) <- seq_len(nrow(dat))
  dat
}
Now we are ready to read your Google sheet:
> u <- "https://docs.google.com/spreadsheets/d/0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE/pubhtml"
> g <- readGoogleSheet(u)
> cleanGoogleTable(g, table=1)
2012-Jan Mobile internet Tanzania
1 Airtel Zantel Vodacom Tigo TTCL Combined
> cleanGoogleTable(g, table=2, skip=1)
BUNDLE FEE VALIDITY MB Cost Sh/MB
1 Daily Bundle (20MB) 500/= 1 day 20 500 25.0
2 1 Day bundle (300MB) 3,000/= 1 day 300 3,000 10.0
3 Weekly bundle (3GB) 15,000/= 7 days 3,000 15,000 5.0
4 Monthly bundle (8GB) 70,000/= 30 days 8,000 70,000 8.8
5 Quarterly Bundle (24GB) 200,000/= 90 days 24,000 200,000 8.3
6 Yearly Bundle (96GB) 750,000/= 365 days 96,000 750,000 7.8
7 Handset Browsing Bundle(400 MB) 2,500/= 30 days 400 2,500 6.3
8 STANDARD <NA> <NA> 1 <NA> <NA>

I'm not sure whether other use cases are more complex or whether something has changed in the meantime, but after publishing the spreadsheet in CSV format this simple one-liner worked for me:
myCSV<-read.csv("http://docs.google.com/spreadsheets/d/1XKeAajiH47jAP0bPkCtS4OdOGTSsjleOXImDrFzxxZQ/pub?output=csv")
R version 3.3.2 (2016-10-31)

Here is an easy way to fetch a Google sheet, even if you're behind a proxy:
require(RCurl)
fileUrl <- "https://docs.google.com/spreadsheets/d/[ID]/export?format=csv"
fileCSV <- getURL(fileUrl,.opts=list(ssl.verifypeer=FALSE))
fileCSVDF <- read.csv(textConnection(fileCSV))

A simpler way.
Be sure to match your URL carefully to the format of the example one here. You can get everything except the /export?format=csv piece from the Google Spreadsheets edit page. Then manually add this piece to the URL and use it as shown here.
library(RCurl)
library(mosaic)
mydat2 <- fetchGoogle(paste0("https://docs.google.com/spreadsheets/d/",
"1mAxpSTrjdFv1UrpxwDTpieVJP16R9vkSQrpHV8lVTA8/export?format=csv"))
mydat2

Scrape the HTML table using the httr and XML packages.
library(XML)
library(httr)
url <- "https://docs.google.com/spreadsheets/d/12MK9EFmPww4Vw9P6BShmhOolH1C45Irz0jdzE0QR3hs/pubhtml"
readSpreadsheet <- function(url, sheet = 1){
  library(httr)
  r <- GET(url)
  html <- content(r)
  sheets <- readHTMLTable(html, header=FALSE, stringsAsFactors=FALSE)
  df <- sheets[[sheet]]
  dfClean <- function(df){
    nms <- t(df[1,])
    names(df) <- nms
    df <- df[-1,-1]
    row.names(df) <- seq(1,nrow(df))
    df
  }
  dfClean(df)
}
df <- readSpreadsheet(url)
df

Publish as CSV doesn't seem to be supported (or at least isn't currently supported) in the new Google Sheets, which is the default for any new sheet you create. You can, though, create a sheet in the old Google Sheets format, which does support publish as CSV, through this link... https://g.co/oldsheets.
More details on the new vs. old Sheets are here: https://support.google.com/drive/answer/3541068?p=help_new_sheets&rd=1

Thanks for this solution! It works as well as the old one. I used another fix to get rid of the blank first line: if you simply exclude it, you might accidentally delete a valid observation when the line is 'unfrozen'. The extra instruction in the function deletes any rows which have no time stamp.
readSpreadsheet <- function(url, sheet = 1){
  library(httr)
  r <- GET(url)
  html <- content(r)
  sheets <- readHTMLTable(html, header=FALSE, stringsAsFactors=FALSE)
  df <- sheets[[sheet]]
  dfClean <- function(df){
    nms <- t(df[1,])
    names(df) <- nms
    df <- df[-1,-1]
    df <- df[df[,1] != "",] ## only select rows with time stamps
    row.names(df) <- seq(1,nrow(df))
    df
  }
  dfClean(df)
}

It is still (as of May 2015) possible to get a CSV file out of Google Spreadsheets, using the hidden URL <sheeturl>/export?format=csv trick.
However, after solving this problem, one encounters another: numbers are formatted according to the locale of the sheet, e.g. you may get 1,234.15 in a "US" sheet or 1.234,15 in a "German" sheet. To check or set the sheet's locale, go to File > Spreadsheet Settings in Google Docs.
Now you need to remove the thousands separator from the numeric columns so that R can parse them; depending on how large your numbers are, this may need to be done several times for each column. A simple function I wrote to accomplish this:
# helper function to load google sheet and adjust for thousands separator (,)
getGoogleDataset <- function(id) {
  download.file(paste0('https://docs.google.com/spreadsheets/d/', id, '/export?format=csv'), 'google-ds.csv', 'curl');
  lines <- scan('google-ds.csv', character(0), sep="\n");
  pattern <- "\"([0-9]+),([0-9]+)";
  for (i in seq_along(lines)) {   # was 0:length(lines), which starts at the invalid index 0
    while (length(grep(pattern, lines[i])) > 0) {
      lines[i] <- gsub(pattern, "\"\\1\\2", lines[i]);
    }
  }
  return(read.csv(textConnection(lines)));
}
You will need to require(utils) and have curl installed, but no other extra packages.
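As an alternative sketch (not part of the original answer), the readr package can handle locale-specific number formats directly, which avoids the regex pass; the sheet ID below is a placeholder:
library(readr)
id <- "YOUR_SHEET_ID"  # placeholder: replace with your sheet's ID
url <- paste0("https://docs.google.com/spreadsheets/d/", id, "/export?format=csv")
# declare the German-style marks up front so read_csv parses 1.234,15 correctly
dat <- read_csv(url, locale = locale(decimal_mark = ",", grouping_mark = "."))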

Related

Gather multiple data sets from an URL/FTP site and merge them into a single dataframe for wrangling

Okay R community. I have a myriad of code pieces going on here from data.table, dplyr, base, etc. My goal is to download a group of files from NOAA into a single data frame for wrangling. Currently, my code is ugly, to say the least, and of course not working. I should have all of the 1950 data set, then right below it the 1951 data, and so on.
library(data.table)
library(XML)
library(RCurl)
library(tidyverse)
#hard code website addressess
noaa.url <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
noaa.ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
#set fixed name of files for download
details.str <- "StormEvents_details-ftp_*"
fatalities.str <- "StormEvents_fatalities-ftp_"
locations.str <- "StormEvents_locations-ftp_"
#test function to download file using manual operation
index.storm <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz"
storm.1950 <- fread(index.storm )
storm.1951 <- fread("https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1951_c20210803.csv.gz")
#test append
storm.append <- rbind(storm.1950, storm.1951)
#create a list of colnames
detail.colnames <- colnames(storm.1950)
#-------------------------------Begin Website Scrape-------------------------------------
#create a directory from the NOAA website. Must use the FTP directory. Will get 404 error if using the http site
dir_list <-
read.table(
textConnection(
getURLContent(noaa.ftp)
),
sep = "",
strip.white = TRUE)
#subset the data we want
dir_list <- dir_list %>%
select("V9","V5")
#fix column names
colnames(dir_list) <- c("FileName", "FileSize")
#create new table for loop through list with complete website directory. This will get just the storm details we want
details.dir <- dir_list %>%
select(1) %>%
filter(str_detect(FileName,"details")) %>%
mutate(FileName = paste0(noaa.url,FileName))
#how many rows to get. could use this in counter for loop if needed
total.count <- count(details.dir)
total.count
#subset just first 5 rows
details.dirsub <- head(details.dir,5)
details.dirsub
#very basic loop and apply a list. Note: files get larger as years go on.
for (x in details.dirsub) {
something = details.dirsub$FileName
#print(something)
storm.append = lapply(something, fread) #lapply is creating a join not an append
#storm.append = rbindlist(fread(something)) #does not work
return(storm.append)
}
#expand the list into a dataframe for wrangling
storm.full <- as.data.frame(do.call(cbind, storm.append))
# try to set colnames if use sapply instead of lapply
#colnames(storm.full)
#setnames(storm.full, detail.colnames)
#filter by GEORGIA -- can not filter because lapply is creating joins instead of append. tried rbindlist() but errors.
storm.georgia <- storm.full %>%
filter(STATE == "GEORGIA")
If I understand correctly, the OP wants
to read all data files
whose file names include the string "details"
from a certain FTP directory,
combine them into one large data.frame
but keep only rows which are related to GEORGIA.
This is what I would do using my favourite tools:
library(data.table)
library(RCurl)
library(magrittr)
noaa_ftp <- "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
state_to_select <- "GEORGIA"
storm.georgia <-
  getURL(noaa_ftp, dirlistonly = TRUE) %>%
  fread(header = FALSE) %>%
  .[V1 %like% "details.*csv.gz", V1] %>%
  lapply(function(x) {
    url <- file.path(noaa_ftp, x)  # use noaa_ftp as defined above (the OP's object was noaa.ftp)
    cat(url, "\n")
    fread(url)[STATE == state_to_select]
  }) %>%
  rbindlist()
This filters each file directly after reading in order to reduce memory allocation. The result consists of one data.table with nearly 50k rows and 51 columns:
tables()
NAME NROW NCOL MB COLS KEY
1: storm.georgia 48,257 51 28 BEGIN_YEARMONTH,BEGIN_DAY,BEGIN_TIME,END_YEARMONTH,END_DAY,END_TIME,...
Total: 28MB
Please note that there are inconsistencies in the data files, as can be seen from the last lines of the output:
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1990_c20220425.csv.gz'
Content type 'unknown' length 385707 bytes (376 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1988_c20220425.csv.gz'
Content type 'unknown' length 255646 bytes (249 KB)
ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz
trying URL 'ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1986_c20220425.csv.gz'
Content type 'unknown' length 298130 bytes (291 KB)
Warning messages:
1: In fread(url) :
Stopped early on line 33957. Expected 51 fields but found 65. Consider fill=TRUE and comment.char=. First discarded non-empty line:
<<199712,7,400,199712,8,1800,2071900,5623952,"UTAH",49,1997,"December","Winter
Storm","Z",1,"NW CACHE","SLC","07-DEC-97 04:00:00","MST","08-DEC-97
18:00:00","20","0","0","0","200K",,,,,,,,,,,,,,,,,,,,,,,,"A strong
trough moved through northern Utah on the 7th. The cold moist airmass
remained unstable into the 8th in a strong northwest flow.
Lake-enhanced snowbands developed along the Wasatch Front on the 8th
as well. Criteria snow fell across much of the state and locally
strong winds occurred >>
2: In fread(url) :
Found and resolved improper quoting out-of-sample. First healed line 23934:
<<199703,14,1543,199703,14,1543,2059347,5589290,"FLORIDA",12,1997,"March","Waterspout","C",87,"MONROE","MFL","14-MAR-97
15:43:00","EST","14-MAR-97
15:43:00","0","0","0","0",,,,,,,,,,,,,,,,,"S OF "7 MILE" BRIDGE",,,"S
OF "7 MILE" BRIDGE",,,,,"TWO WATERSPOUTS OBSERVED BY SKYWARN
SPOTTER...LOST IN RAIN SHAFT.",,"PDC">>. If the fields are not quoted
(e.g. field separator does not appear within any field), try quote=""
to avoid this warning.

How to loop through multiple URLs in R and save in data frame

I have been unable to loop through multiple URLs and save the results in a data frame. Below is code that can retrieve only one URL at a time and save it in a data frame.
The part of the url that changes is a number at the very end of the url that refers to a date.
I am trying to scrape all the data from, for example, 20190901 through 20190915 and store it in the same data frame.
Here is the Code:
library(rvest)
library(dplyr)
# Specifying URL
url <- 'https://classic.sportsbookreview.com/betting-odds/mlb-baseball/?date=20190901'
# Reading the HTML code from website
oddspage <- read_html(url)
# Using CSS selectors to scrape away teams
awayHtml <- html_nodes(oddspage,'.eventLine-value:nth-child(1) a')
#Using CSS selectors to scrape scores
awayScoreHtml <- html_nodes(oddspage,'.first.total')
awayScore <- html_text(awayScoreHtml)
awayScore <- as.numeric(awayScore)
homeScoreHtml <- html_nodes(oddspage, '.score-periods+ .score-periods .total')
homeScore <- html_text(homeScoreHtml)
homeScore <- as.numeric(homeScore)
# Converting away data to text
away <- html_text(awayHtml)
# Using CSS selectors to scrape home teams
homeHtml <- html_nodes(oddspage,'.eventLine-value+ .eventLine-value a')
# Converting home data to text
home <- html_text(homeHtml)
# Using CSS selectors to scrape Away Odds
awayPinnacleHtml <- html_nodes(oddspage,'.eventLine-consensus+ .eventLine-book.eventLine-book-value:nth-child(1) b')
awayBookmakerHtml <- html_nodes(oddspage,'.eventLine-book:nth-child(12) .eventLine-book-value:nth-child(1) b')
# Converting Away Odds to Text
awayPinnacle <- html_text(awayPinnacleHtml)
awayBookmaker <- html_text(awayBookmakerHtml)
# Converting Away Odds to numeric
awayPinnacle <- as.numeric(awayPinnacle)
awayBookmaker <- as.numeric(awayBookmaker)
# Using CSS selectors to scrape Pinnacle Home Odds
homePinnacleHtml <- html_nodes(oddspage,'.eventLine-consensus+ .eventLine-book .eventLine-book-value+ .eventLine-book-value b')
homeBookmakerHtml <- html_nodes(oddspage,'.eventLine-book:nth-child(12) .eventLine-book-value:nth-child(2) b')
# Converting Home Odds to Text
homePinnacle <- html_text(homePinnacleHtml)
homeBookmaker <- html_text(homeBookmakerHtml)
# Converting Home Odds to Numeric
homePinnacle <- as.numeric(homePinnacle)
homeBookmaker <- as.numeric(homeBookmaker)
# Create Data Frame
df <- data.frame(away,home,awayScore,homeScore,awayPinnacle,homePinnacle,awayBookmaker,homeBookmaker)
View(df)
I am very new to coding and I have not been able to successfully apply any of the techniques used in similar questions.
Put all your code in a function and make the date a parameter used to generate the URL:
get_data <- function(date) {
url <- paste0('https://classic.sportsbookreview.com/betting-odds/mlb-baseball/?date=', date)
#...Rest of the code as it is
#...
}
Create a vector of dates using sprintf
date_vec <- sprintf('201909%02d', 1:15)
date_vec
# [1] "20190901" "20190902" "20190903" "20190904" "20190905" "20190906"
# [7] "20190907" "20190908" "20190909" "20190910" "20190911" "20190912"
#[13] "20190913" "20190914" "20190915"
Use lapply to extract data for each date and combine them.
all_data <- do.call(rbind, lapply(date_vec, get_data))
You can also use map_df from purrr.
all_data <- purrr::map_df(date_vec, get_data)
However, you might need to add some checks in your function for pages which don't return any values for a particular field.
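For example, here is a minimal sketch of get_data() with a guard for dates that have no games; the CSS selectors are the ones from the question and may need adjusting:
library(rvest)
get_data <- function(date) {
  url <- paste0('https://classic.sportsbookreview.com/betting-odds/mlb-baseball/?date=', date)
  oddspage <- read_html(url)
  away <- html_text(html_nodes(oddspage, '.eventLine-value:nth-child(1) a'))
  # if the page has no games for this date, return an empty data frame so
  # rbind()/map_df() can still combine the results
  if (length(away) == 0) return(data.frame())
  home <- html_text(html_nodes(oddspage, '.eventLine-value+ .eventLine-value a'))
  data.frame(date = date, away = away, home = home, stringsAsFactors = FALSE)
}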

change a for loop to a function to scrape a website

I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo, and I want the extraction to look more human-like (it is not too much data).
My question is, how can I take links apply a function and set a sort of delay timer to read in the contents of the for loop?
I can paste my own version of the for loop which does not work.
This is the function I came up with
##First argument is the link you need
##The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link ", define_link))
  return(html_table(first))
}
##I added the following tryCatch function
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time), error = function(e) NA)
  return(out)
}
##You can now retrieve the data using the links vector in two ways
##Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i) link_try_catch(links[i], 1))
##OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i) extract_function(i, 1))
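One small addition (my own suggestion, not part of the answer above): if a fixed delay looks too regular, a randomised pause spreads the requests out a bit more:
# randomise the pause between 1 and 3 seconds for each link
p3 <- lapply(links[1:10], function(i) link_try_catch(i, runif(1, min = 1, max = 3)))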
Hope it helps

Web scraping of key stats in Yahoo! Finance with R

Is anyone experienced in scraping data from the Yahoo! Finance key statistics page with R? I am familiar with scraping data directly from HTML using read_html(), html_nodes(), and html_text() from the rvest package. However, the MSFT key stats page is a bit complicated, and I am not sure whether all the stats are kept in XHR, JS, or the document itself. I am guessing the data is stored in JSON. If anyone knows a good way to extract and parse the data from this page with R, kindly answer my question. Many thanks in advance!
Or if there is a more convenient way to extract these metrics via quantmod or Quandl, kindly let me know; that would be an extremely good solution!
I know this is an older thread, but I used it to scrape Yahoo Analyst tables so I figure I would share.
# Yahoo webscrape Analysts
library(XML)
symbol <- "HD"
url <- paste0('https://finance.yahoo.com/quote/', symbol, '/analysts?p=', symbol)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
earningEstimates <- readHTMLTable(tableNodes[[1]])
revenueEstimates <- readHTMLTable(tableNodes[[2]])
earningHistory <- readHTMLTable(tableNodes[[3]])
epsTrend <- readHTMLTable(tableNodes[[4]])
epsRevisions <- readHTMLTable(tableNodes[[5]])
growthEst <- readHTMLTable(tableNodes[[6]])
Cheers,
Sody
I gave up on Excel a long time ago. R is definitely the way to go for things like this.
library(XML)
stocks <- c("AXP","BA","CAT","CSCO")
for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]],
                          header = c("data1", "data2", "data3", "data4", "data5", "data6",
                                     "data7", "data8", "data9", "data10", "data11", "data12")))
  # ADD COLUMN TO IDENTIFY STOCK
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata)-1))]
# SAVE TO CSV
write.table(stockdata, "C:/Users/your_path_here/Desktop/MyData.csv", sep=",",
row.names=FALSE, col.names=FALSE)
# REMOVE TEMP OBJECTS
rm(df, stockdatalist)
When I use the methods shown here with the XML library, I get a warning:
Warning in readLines(page) : incomplete final line found on
'https://finance.yahoo.com/quote/DIS/key-statistics?p=DIS'
We can use rvest and xml2 for a cleaner approach. This example demonstrates how to pull a key statistic from the key-statistics Yahoo! Finance page. Here I want to obtain the float of an equity. I don't believe float is available from quantmod, but some of the key stats values are. You'll have to reference the list.
library(xml2)
library(rvest)
getFloat <- function(stock){
  url <- paste0("https://finance.yahoo.com/quote/", stock, "/key-statistics?p=", stock)
  tables <- read_html(url) %>%
    html_nodes("table") %>%
    html_table()
  float <- as.vector(tables[[3]][4,2])
  last <- substr(float, nchar(float), nchar(float))  # final character is the unit suffix (k/M/B)
  float <- gsub("[a-zA-Z]", "", float)
  float <- as.numeric(as.character(float))
  if(last == "k"){
    float <- float * 1000
  } else if (last == "M") {
    float <- float * 1000000
  } else if (last == "B") {
    float <- float * 1000000000
  }
  return(float)
}
getFloat("DIS")
[1] 1.81e+09
That's a lot of shares of Disney available.
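If you would rather stay within quantmod for the handful of stats it does expose, here is a hedged sketch using getQuote() and yahooQF(); the exact field labels accepted depend on your quantmod version, so treat these names as assumptions and check yahooQF() if they are rejected:
library(quantmod)
# request a couple of key-stat style fields for Disney
getQuote("DIS", what = yahooQF(c("Market Capitalization", "Earnings/Share")))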

available.packages by publication date

Is it possible to get the publication date of CRAN packages from within R? I would like to get a list of the k most recently published CRAN packages, or alternatively all packages published after date dd-mm-yy. Similar to the information on the available_packages_by_date.html?
The available.packages() command has a "fields" argument, but this only extracts fields from the DESCRIPTION. The date field on the package description is not always up-to-date.
I can get it with a smart regex from the HTML page, but I am not sure how reliable and up-to-date this HTML file is... At some point Kurt might decide to give the layout a makeover, which would break the script. An alternative is to use timestamps from the CRAN FTP, but I am also not sure how good that solution is. Is there a formally structured file somewhere with publication dates? I assume the HTML page is automatically generated from some DB.
It turns out there is an undocumented file "packages.rds" which contains the publication dates (not times) of all packages. I suppose these data are used to recreate the HTML file every day.
Below is a simple function that extracts publication dates from this file:
recent.packages.rds <- function(){
  mytemp <- tempfile();
  download.file("http://cran.r-project.org/web/packages/packages.rds", mytemp);
  mydata <- as.data.frame(readRDS(mytemp), row.names=NA);
  mydata$Published <- as.Date(mydata[["Published"]]);
  #sort and get the fields you like:
  mydata <- mydata[order(mydata$Published), c("Package", "Version", "Published")];
}
The best approach is to take advantage of the fact that the package DESCRIPTION is published on the CRAN mirror, and since the DESCRIPTION comes from the built package, it contains information about exactly when it was packaged:
pkgs <- unname(available.packages()[, 1])[1:20]
desc_urls <- paste("http://cran.r-project.org/web/packages/", pkgs, "/DESCRIPTION", sep = "")
desc <- lapply(desc_urls, function(x) read.dcf(url(x)))
sapply(desc, function(x) x[, "Packaged"])
sapply(desc, function(x) x[, "Date/Publication"])
(I'm restricting it to the first 20 packages here to illustrate the basic idea)
Here is a function that uses the HTML and regular expressions. I would still rather get the information from a more formal place, though, in case the HTML layout ever changes.
recent.packages <- function(number=10){
  #html is malformed
  maxlines <- number*2 + 11
  mytemp <- tempfile()
  if(getOption("repos") == "@CRAN@"){   # "@CRAN@" is the unset-repository sentinel
    repo <- "http://cran.r-project.org"
  } else {
    repo <- getOption("repos");
  }
  newurl <- paste(repo, "/web/packages/available_packages_by_date.html", sep="");
  download.file(newurl, mytemp);
  datastring <- readLines(mytemp, n=maxlines)[12:maxlines];
  #we only find packages from after 2010-01-01
  myexpr1 <- '201[0-9]-[0-9]{2}-[0-9]{2} </td> <td> <a href="../../web/packages/[a-zA-Z0-9\\.]{2,}/'
  myexpr2 <- '^201[0-9]-[0-9]{2}-[0-9]{2}'
  myexpr3 <- '[a-zA-Z0-9\\.]{2,}/$'
  newpackages <- unlist(regmatches(datastring, gregexpr(myexpr1, datastring)));
  newdates <- unlist(regmatches(newpackages, gregexpr(myexpr2, newpackages)));
  newnames <- unlist(regmatches(newpackages, gregexpr(myexpr3, newpackages)));
  newdates <- as.Date(newdates);
  newnames <- substring(newnames, 1, nchar(newnames)-1);
  returndata <- data.frame(name=newnames, date=newdates);
  return(head(returndata, number));
}
So here is a solution that uses the directory listing from the FTP. It is a little tricky because the FTP gives the date in Linux format with either a timestamp or a year. Other than that, it does its job. I'm still not convinced this is reliable, though: if packages are copied over to another server, all timestamps might be reset.
recent.packages.ftp <- function(){
  setwd(tempdir())
  download.file("ftp://cran.r-project.org/pub/R/src/contrib/", destfile=tempfile(), method="wget", extra="--no-htmlify");
  #because of --no-htmlify the destfile argument does not work
  datastring <- readLines(".listing");
  unlink(".listing");
  myexpr1 <- "(?<date>[A-Z][a-z]{2} [0-9]{2} [0-9]{2}:[0-9]{2}) (?<name>[a-zA-Z0-9\\.]{2,})_(?<version>[0-9\\.-]*).tar.gz$"
  matches <- gregexpr(myexpr1, datastring, perl=TRUE);
  packagelines <- as.logical(sapply(regmatches(datastring, matches), length));
  #subset proper lines
  matches <- matches[packagelines];
  datastring <- datastring[packagelines];
  N <- length(matches)
  #from the ?regexpr manual
  parse.one <- function(res, result) {
    m <- do.call(rbind, lapply(seq_along(res), function(i) {
      if(result[i] == -1) return("")
      st <- attr(result, "capture.start")[i, ]
      substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
    }))
    colnames(m) <- attr(result, "capture.names")
    m
  }
  #parse all records
  mydf <- data.frame(date=rep(NA, N), name=rep(NA, N), version=rep(NA, N))
  for(i in 1:N){
    mydf[i,] <- parse.one(datastring[i], matches[[i]]);
  }
  row.names(mydf) <- NULL;
  #convert dates
  mydf$date <- strptime(mydf$date, format="%b %d %H:%M");
  #Linux only displays times for packages less than six months old.
  #However, strptime will assume the current year for packages that don't have a timestamp.
  #Therefore, for dates that are in the future, we subtract a year. We allow some margin for timezones.
  infuture <- (mydf$date > Sys.time() + 31*24*60*60);
  mydf$date[infuture] <- mydf$date[infuture] - 365*24*60*60;
  #sort and return
  mydf <- mydf[order(mydf$date),];
  row.names(mydf) <- NULL;
  return(mydf);
}
You could process the page http://cran.r-project.org/src/contrib/, and split the fields by whitespace in order to obtain the fully specified package source filename, which includes the version # and a .gz suffix.
There are a few other items in the list that are not package files, such as the .rds files, various subdirectories, and so on.
Barring changes in how the directory structure is presented or the locations of the files, I can't think of anything more authoritative than this.
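A rough sketch of that approach (my own, assuming the listing is served as plain HTML and that package sources end in .tar.gz):
listing <- readLines("https://cran.r-project.org/src/contrib/")
# keep only anchors that point at package source tarballs
hrefs <- unlist(regmatches(listing, gregexpr('href="[^"]*\\.tar\\.gz"', listing)))
tarballs <- gsub('^href="|"$', "", hrefs)
head(tarballs)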
