Obtain the publication date of a package's first version - r

The documentation of an R package only includes the date of its last update/publication.
Version numbering does not follow a pattern common to all packages.
Therefore, it is quite difficult to know at a glance whether a package is old or new. Sometimes you need to decide between two packages with similar functionality, and knowing the age of each could guide the decision.
My first approach was to plot downloads per year by tracking CRAN downloads. This method also gives the relative popularity/usage of a package. However, it takes a lot of memory and time to run, so I would rather have a faster way to look into the history of a single package.
Is there a quick way to find or visualize the release date of the first version of a specific package, or even to compare several packages at once?
The purpose is to make it easier to build a mental map of all the packages available in R, especially for newcomers. Getting to know packages and managing them is probably the main challenge that makes people give up on R.
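For the downloads-per-year approach mentioned above, here is a minimal sketch using the cranlogs package (assumptions: it is installed, and its cran_downloads() function only counts downloads from the RStudio CRAN mirror, whose logs start in late 2012):
library(cranlogs)
dl <- cran_downloads(packages = c("AER", "lme4"),
                     from = "2013-01-01", to = "2014-12-31")
dl$year <- format(dl$date, "%Y")
# total downloads per package and year
aggregate(count ~ package + year, data = dl, FUN = sum)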

Just for fun:
## not all repositories have the same archive structure!
archinfo <- function(pkgname, repos="http://www.cran.r-project.org") {
    pkg.url <- paste(contrib.url(repos), "Archive", pkgname, sep="/")
    r <- readLines(pkg.url)
    ## lame scraping code
    r2 <- gsub("<[^>]+>", " ", r)                ## drop HTML tags
    r2 <- r2[-(1:grep("Parent Directory", r2))]  ## drop header
    r2 <- r2[grep(pkgname, r2)]                  ## drop footer
    strip.white <- function(x) gsub("(^ +| +$)", "", x)
    r2 <- strip.white(gsub("&nbsp;", "", r2))    ## more cleaning
    r3 <- do.call(rbind, strsplit(r2, " +"))     ## pull out data frame
    data.frame(
        pkgvec=gsub(paste0("(", pkgname, "_|\\.tar\\.gz)"), "", r3[,1]),
        pkgdate=as.Date(r3[,2], format="%d-%b-%Y"),
        ## assumes English locale for month abbreviations
        size=r3[,4])
}
AERinfo <- archinfo("AER")
lme4info <- archinfo("lme4")
comb <- rbind(data.frame(pkg="AER", AERinfo),
              data.frame(pkg="lme4", lme4info))
We can't compare version numbers directly because everyone uses different numbering schemes ...
library(dplyr) ## overkill
comb2 <- comb %>% group_by(pkg) %>% mutate(numver=seq(n()))
If you want to arrange by package date:
comb2 <- arrange(comb2,pkg,pkgdate)
Pretty pictures ...
library(ggplot2); theme_set(theme_bw())
ggplot(comb2,aes(x=pkgdate,y=numver,colour=pkg))+geom_line()

As Andrew Taylor suggested, the CRAN Archive contains all previous versions of a package, and the date of each release is shown there.
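A quick way to get just the first-release date is to reuse the archinfo() helper above and keep the earliest date it returns; the same archive-layout assumptions apply, and a package that has never been updated has no Archive page at all:
## earliest archived date per package
sapply(c("AER", "lme4"), function(p) format(min(archinfo(p)$pkgdate)))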

Related

Retain SPSS value labels when working with data

I am analysing student-level data from PISA 2015. The data is available in SPSS format here.
I can load the data into R using the read_sav function in the haven package. I need to be able to edit the data in R and then save/export the data in SPSS format with the original value labels that are included in the SPSS download intact. The code I have used is:
library(haven)
student <- read_sav("CY6_MS_CMB_STU_QQQ.sav", user_na = TRUE)
student2 <- data.frame(student)
# some edits to data
write_sav(student2, "testdata1.sav")
When my colleague (who works in SPSS) tries to open the "testdata1.sav" the value labels are missing. I've read through the haven documentation and can't seem to find a solution for this. I have also tried read/write.spss in the foreign package but have issues loading in the dataset.
I am using R version 3.4.0 and the latest build of haven.
Does anyone know if there is a solution for this? I'd be very grateful of your help. Please let me know if you require any additional information to answer this.
library(foreign)
df <- read.spss("spss_file.sav", to.data.frame = TRUE)
This may not be exactly what you are looking for, because it uses the labels as the data. So if you have an SPSS file with 0 for "Male" and 1 for "Female," you will have a df with values that are all Males and Females. It gets you one step further, but perhaps isn't the whole solution. I'm working on the same problem and will let you know what else I find.
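If you would rather keep the numeric codes instead of turning the labels into the data, read.spss() also has a use.value.labels argument; a minimal sketch (the attribute inspection at the end simply shows where foreign stored whatever label metadata it found):
library(foreign)
df_codes <- read.spss("spss_file.sav", to.data.frame = TRUE,
                      use.value.labels = FALSE)
str(attributes(df_codes))  # variable/value label metadata, if any was read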
library("sjlabelled")
student <- sjlabelled::read_spss("CY6_MS_CMB_STU_QQQ.sav")
student2 <- student
write_spss(student2, "testdata1.sav")
I did not try it, but I hope it works. The sjlabelled package handles non-ASCII characters, such as German umlauts, well.
But keep in mind that R stores the labels as attributes. These attributes can be lost during data transformations (subsetting the data, for example), and once they are lost in R they won't show up in SPSS either. The sjlabelled::copy_labels function is helpful in those cases:
student2 <- copy_labels(student2, student) #after data transformations and before export to spss
I think you need to recover the value labels in the data frame after importing the dataset into R, and then write that data frame to a .sav file.
# load libraries (purrr is needed for map(), dplyr for the pipe and mutate_at())
library(haven)
library(purrr)
library(dplyr)
# load dataset
student <- read_sav("CY6_MS_CMB_STU_QQQ.sav", user_na = TRUE)
# map to find the class of each column
map_dataset <- map(student, function(x) attr(x, "class"))
# run a for loop to identify all haven-labelled variables
factor_variable <- c()
for (i in 1:length(map_dataset)) {
  if (!is.null(map_dataset[[i]])) {
    name <- names(map_dataset[i])
    factor_variable <- c(factor_variable, name)
  }
}
# convert all haven-labelled variables into factors
student2 <- student %>%
  mutate_at(vars(factor_variable), as_factor)
# write dataset
write_sav(student2, "testdata1.sav")
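A more compact variant of the same idea, assuming a reasonably recent dplyr (with across()) and haven, where is.labelled() picks out the haven-labelled columns:
library(haven)
library(dplyr)
student2 <- student %>%
  mutate(across(where(is.labelled), as_factor))
write_sav(student2, "testdata1.sav")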

Loading intraday data into R for handling it with quantmod

I need to modify this example code to use it with intraday data, which I should get from here and from here. As I understand it, the code in that example works well with any historical data (or does it?), so my problem boils down to loading the initial data in the necessary format (I mean daily or intraday).
As I also understand from the answers to this question, it is impossible to load intraday data with getSymbols(). I tried downloading the data to my hard drive and reading it with read.csv(), but this approach didn't work either. Finally, I found a few solutions to this problem in various articles (e.g. here), but all of them seem very complicated and "artificial".
So, my question is how to load the given intraday data into the given code elegantly and correctly from a programmer's point of view, without reinventing the wheel.
P.S. I am very new to time-series analysis in R and quantstrat, so if my question seems obscure, let me know what you need in order to answer it.
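If the intraday quotes are already saved as a CSV on disk, a minimal sketch of getting them into an xts object that quantmod/quantstrat functions can work with is below; the file name and column layout (a parseable timestamp column plus OHLC/volume columns) are assumptions, so adjust them to the actual download:
library(xts)
raw <- read.csv("intraday_quotes.csv", stringsAsFactors = FALSE)
idx <- as.POSIXct(raw$timestamp, tz = "GMT")   # assumed timestamp column
intra <- xts(as.matrix(raw[, c("Open", "High", "Low", "Close", "Volume")]),
             order.by = idx)
head(intra)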
I don't know how to do this without "reinventing the wheel" because I'm not aware of any existing solutions. It's pretty easy to do with a custom function though.
intradataYahoo <- function(symbol, ...) {
  # ensure xts is available
  stopifnot(require(xts))
  # construct URL
  URL <- paste0("http://chartapi.finance.yahoo.com/instrument/1.0/",
                symbol, "/chartdata;type=quote;range=1d/csv")
  # read the metadata from the top of the file and put it into a usable list
  metadata <- readLines(paste(URL, collapse=""), 17)[-1L]
  # split into name/value pairs, set the names as the first element of the
  # result and the values as the remaining elements
  metadata <- strsplit(metadata, ":")
  names(metadata) <- sub("-", "_", sapply(metadata, `[`, 1))
  metadata <- lapply(metadata, function(x) strsplit(x[-1L], ",")[[1]])
  # convert GMT offset to numeric
  metadata$gmtoffset <- as.numeric(metadata$gmtoffset)
  # read data into an xts object; timestamps are in GMT, so we don't set it
  # explicitly. I would set it explicitly, but timezones are provided in
  # an ambiguous format (e.g. "CST", "EST", etc).
  Data <- as.xts(read.zoo(paste(URL, collapse=""), sep=",", header=FALSE,
                          skip=17, FUN=function(i) .POSIXct(as.numeric(i))))
  # set column names and metadata (as xts attributes)
  colnames(Data) <- metadata$values[-1L]
  xtsAttributes(Data) <- metadata[c("ticker","Company_Name",
                                    "Exchange_Name","unit","timezone","gmtoffset")]
  Data
}
I'd consider adding something like this to quantmod, but it would need to be tested. I wrote this in under 15 minutes, so I'm sure there will be some issues.

R: use LaF (reads fixed-column-width data FAST) with SAScii (parses a SAS dictionary for import instructions)

I'm trying to quickly read into R an ASCII fixed-column-width dataset, based on a SAS import file (the file that declares the column widths, etc.).
I know I can use the SAScii R package to translate the SAS import file (parse.SAScii) and actually import the data (read.SAScii). It works, but it is too slow, because read.SAScii uses read.fwf to do the data import, which is slow. I would like to replace that with a fast import method, laf_open_fwf from the "LaF" package.
I'm almost there, using parse.SAScii() and laf_open_fwf(), but I'm not able to correctly connect the output of parse.SAScii() to the arguments of laf_open_fwf().
Here is the code; the data is from PNAD 2013, the Brazilian national household survey:
# Set working dir.
setwd("C:/User/Desktop/folder")
# installing packages:
install.packages("SAScii")
install.packages("LaF")
library(SAScii)
library(LaF)
# Download and unzip data and documentation files
# Data
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dados.zip"
download.file(file_url,"Dados.zip", mode="wb")
unzip("Dados.zip")
# Documentation files
file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dicionarios_e_input_20150814.zip"
download.file(file_url,"Dicionarios_e_input.zip", mode="wb")
unzip("Dicionarios_e_input.zip")
# importing with read.SAScii(), based on read.fwf(): Works fine
dom.pnad2013.teste1 <- read.SAScii("Dados/DOM2013.txt","Dicionarios_e_input/input DOM2013.txt")
# importing with parse.SAScii() and laf_open_fwf() : stuck here
dic_dom2013 <- parse.SAScii("Dicionarios_e_input/input DOM2013.txt")
head(dic_dom2013)
data <- laf_open_fwf("Dados/DOM2013.txt",
column_types=????? ,
column_widths=dic_dom2013[,"width"],
column_names=dic_dom2013[,"Varname"])
I'm stuck on the last command, passing the import arguments to laf_open_fwf().
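A sketch of one way to fill in the missing column_types argument, assuming parse.SAScii() returns its usual varname / width / char columns (check names(dic_dom2013) first). LaF's documented column types are "double", "integer", "categorical" and "string", and parse.SAScii() marks gaps in the layout with negative widths, which would need separate handling if your input file has any:
# map the char flag from parse.SAScii() onto LaF column types
col_types <- ifelse(dic_dom2013$char, "string", "double")
data <- laf_open_fwf("Dados/DOM2013.txt",
                     column_types  = col_types,
                     column_widths = dic_dom2013$width,
                     column_names  = dic_dom2013$varname)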
UPDATE: here are two solutions, using packages LaF and readr.
Solution using readr (8 seconds)
readr is based on LaF but surprisingly faster. More info on readr here
# Load Packages
library(readr)
library(data.table)
library(SAScii)
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
setDT(dic_pes2013) # convert to data.table
# read to data frame
pesdata2 <- read_fwf("Dados/PES2013.txt",
                     fwf_widths(dic_pes2013[, width],
                                col_names = dic_pes2013[, varname]),
                     progress = interactive())
Takeaway: readr seems to be the best option: it's faster, you don't need to worry about column types, the code is shorter, and it shows a progress bar :)
Solution using LaF (20 seconds)
LaF is one of the fastest (maybe THE fastest) ways to read fixed-width files in R, according to this benchmark. It took me 20 sec. to read the person-level file (PES) into a data frame.
Here is the code:
# Parse SAS file
dic_pes2013 <- parse.SAScii("./Dicionários e input/input PES2013.sas")
# Read .txt file using LaF. This is virtually instantaneous
pesdata <- laf_open_fwf("./Dados/PES2013.txt",
                        column_types = rep("character", length(dic_pes2013[, "width"])),
                        column_widths = dic_pes2013[, "width"],
                        column_names = dic_pes2013[, "varname"])
# convert to data frame. This took me 20 sec.
system.time(pesdata <- pesdata[, ])
Note that I've used character in column_types. I'm not quite sure why the command returns an error if I try integer or numeric. This shouldn't be a problem, since you can convert all columns to numeric like this:
# convert all columns to numeric
varposition <- grep("V", colnames(pesdata))
pesdata[varposition] <- sapply(pesdata[varposition], as.numeric)
sapply(pesdata, class)
You can try read.SAScii.sqlite, also by Anthony Damico. It's 4x faster and leads to no RAM issues (as the author himself describes). But it imports the data into a self-contained SQLite database file (no SQL server needed), not into a data.frame. You can then open it in R through a database connection. Here is the GitHub address for the code:
https://github.com/ajdamico/usgsd/blob/master/SQLite/read.SAScii.sqlite.R
In the R console, you can just run:
source("https://raw.githubusercontent.com/ajdamico/usgsd/master/SQLite/read.SAScii.sqlite.R")
Its arguments are almost the same as those of the regular read.SAScii.
I know you are asking for a tip on how to use LaF. But I thought this could also be useful to you.
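A minimal sketch of reading the resulting SQLite file back into R with DBI/RSQLite; the database file name and table name below are placeholders, so use whatever you passed to read.SAScii.sqlite():
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "pnad2013.db")   # placeholder file name
dbListTables(con)                                    # see what tables were created
dom <- dbReadTable(con, "dom2013")                   # placeholder table name
dbDisconnect(con)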
I think the best choice is to use fwf2csv() from the descr package (C++ code). I will illustrate the procedure with PNAD 2013. Be aware that I'm assuming you already have the dictionary with three variables (beginning of the field, size of the field, variable name) and the data at Dados/.
library(bit64)
library(data.table)
library(descr)
library(reshape)
library(survey)
library(xlsx)
end_dom <- dicdom$beggining + dicdom$size - 1
fwf2csv(fwffile='Dados/DOM2013.txt', csvfile='dadosdom.csv', names=dicdom$variable, begin=dicdom$beggining, end=end_dom)
dadosdom <- fread(input='dadosdom.csv', sep='auto', sep2='auto', integer64='double')

How to bind two xts data environments in R

I am completely new to R, and I am learning how to program in R to get historical stock index data. I am planning to build a daily update script to keep the historical index data current. I use an environment called "indexData" to store the data as xts objects. But unfortunately, rbind and merge do not work on environments; they only work on objects. I am wondering whether there are any workarounds or packages I could use to solve this. My code is the following:
indexData<-new.env()
startDate<-"2013-11-02"
getSymbols(Symbols=indexList,src='yahoo',from=startDate,to="2013-11-12",env=indexData)
startDate<-(end(indexData$FTSE)+1)
NewIndexData<-new.env()
getSymbols(Symbols=indexList,src='yahoo',from=startDate,env=NewIndexData)
rbind(indexData,NewIndexData) #it does not work for data environments
I'd much appreciate any suggestions!
If you use auto.assign=FALSE, you get explicit variables -- and those you can extend as shown below.
First we get two IBM price series:
R> IBM <- getSymbols("IBM",from="2013-01-01",to="2013-12-01",auto.assign=FALSE)
R> IBM2 <- getSymbols("IBM",from="2013-12-02",to="2013-12-28",auto.assign=FALSE)
Then we check their dates:
R> c(start(IBM), end(IBM))
[1] "2013-01-02" "2013-11-29"
R> c(start(IBM2), end(IBM2))
[1] "2013-12-02" "2013-12-27"
Finally, we bind the rows together (rbind() appends the later observations to the earlier ones; merge() would instead join the two sets of columns side by side) and check the dates of the combined series:
R> allIBM <- rbind(IBM, IBM2)
R> c(start(allIBM), end(allIBM))
[1] "2013-01-02" "2013-12-27"
R>
You can approximate this by pulling the individual symbols out of the two environments, but I think that is harder for you as someone beginning in R. So my recommendation would be to not work with environments -- look into lists instead.
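For completeness, a sketch of the environment route mentioned above, assuming both environments contain xts objects stored under the same symbol names (as in the question's getSymbols() calls):
symbols <- ls(indexData)
combined <- setNames(lapply(symbols, function(s)
  rbind(get(s, envir = indexData), get(s, envir = NewIndexData))), symbols)
# optionally write the combined series back into a single environment
list2env(combined, envir = indexData)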

searchTwitter() in the twitteR package for R 2.15.3: high number of duplicates

Removing duplicates from the searchTwitter() output works fine; the problem is that the number of original tweets the searchTwitter() function provides is always 100, no matter whether n = 1000 or n = 3000.
This is the code I've used:
tweets <- searchTwitter("#rstats", n = 1000)
tweets.df <- do.call("rbind", lapply(tweets, as.data.frame))
df.undup <- tweets.df[duplicated(tweets.df) == FALSE,]
dim(df.undup)
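As an aside, twitteR also ships a helper that does the same thing as the do.call("rbind", lapply(...)) line above, assuming the version in use exports it:
tweets.df <- twListToDF(tweets)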
The resulting data frame always has 100 rows, so the number of original tweets is 100.
dim(df.undup)
[1] 100  12
My question is: does this have something to do with the Twitter API, and how could I get around this problem?
I'm using R version 2.15.3 on Mac OS X 10.7.5.
Unfortunately, the versions of the twitteR package presently available on CRAN aren't working properly. You can grab the most recent versions, which appear to work better than those on CRAN, from Geoff Jentry's webpage:
http://geoffjentry.hexdump.org/twitteR_1.1.5.tar.gz
It requires ROAuth 0.9.4 (also not yet updated on CRAN)
http://geoffjentry.hexdump.org/ROAuth_0.9.4.tar.gz
I have a feeling you may have trouble getting it to work on Mac OS X unless you can compile the packages from source (i.e. unless you do not require binary packages).
I am still getting dupes with these new versions, but not as many.
I've managed to install the packages. Here is the code I've used, just in case someone is interested. But the problem remains the same: the number of original tweets is still just 100.
I wonder why we are getting different results regarding duplicates.
install.packages("~/Downloads/ROAuth_0.9.4.tar.gz",
                 repos = NULL, type = "source",
                 INSTALL_opts = "--no-multiarch")
install.packages("~/Downloads/twitteR_1.1.5.tar.gz",
                 repos = NULL, type = "source",
                 INSTALL_opts = "--no-multiarch")
library(twitteR)
library(ROAuth)
tweets <- searchTwitter("#rstats", n = 1000)
tweets.df <- do.call("rbind", lapply(tweets, as.data.frame))
df.undup <- tweets.df[duplicated(tweets.df) == FALSE,]
dim(df.undup)
[1] 100  12

Resources