How can I read the fertility dataset from the Countr package?

I am trying to use the "fertility" dataset from the "Countr" package, as mentioned in:
https://cran.r-project.org/web/packages/Countr/vignettes/exampleFertility.pdf
But I get an empty dataset. Here is my code:
library("Countr")
d <- data("fertility", package="Countr")
nrow(d)
and I get NULL:
> nrow(d)
NULL

When you load the package "Countr" (or any other package), all datasets provided by it become available immediately:
library("Countr")
dim(fertility)
## [1] 1243 9
If you wish to load fertility without attaching "Countr", then use data(). In a new R session do:
data("fertility", package = "Countr")
dim(fertility)
Note that data() makes available the dataset(s) as a side effect.
The return value is just the name of the dataset:
d <- data("fertility", package = "Countr")
d
## [1] "fertility"
Of course, d is a character vector, so its dim is NULL.
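If you want the object itself in a variable, here is a small sketch using data()'s envir argument, so nothing is created in the global workspace as a side effect:
# Load the dataset into a private environment and pull it out of there,
# instead of letting data() create 'fertility' in the global workspace.
e <- new.env()
data("fertility", package = "Countr", envir = e)
d <- e$fertility
nrow(d)
## [1] 1243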
If only the argument "package" is specified, the result gives information about the available datasets in that package:
data(package = "Countr")$results[ , c("Package", "Item", "Title")]
##      Package  Item        Title
## [1,] "Countr" "fertility" "Fertility data"
## [2,] "Countr" "football"  "Football data"

library(Countr)
d <- fertility
nrow(d)
# [1] 1243


User Based Recommendation in R

I am trying to do user-based recommendation in R using the recommenderlab package, but all the time I am getting zero (no) predictions out of the model.
My code is:
library("recommenderlab")
# Loading to pre-computed affinity data
movie_data<-read.csv("D:/course/Colaborative filtering/data/UUCF Assignment Spreadsheet_user_row.csv")
movie_data[is.na(movie_data)] <- 0
rownames(movie_data) <- movie_data$X
movie_data$X <- NULL
# Convert it as a matrix
R<-as.matrix(movie_data)
# Convert R into realRatingMatrix data structure
# realRatingMatrix is a recommenderlab sparse-matrix like data-structure
r <- as(R, "realRatingMatrix")
r
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
recom <- predict(rec, r["1648"], n=5)
recom
as(recom, "list")
All the time I am getting output like:
as(recom, "list")
$`1648`
character(0)
I am using user-row data from this link:
https://drive.google.com/file/d/0BxANCLmMqAyIQ0ZWSy1KNUI4RWc/view
In that data, column A contains the user id; all the other columns are the movie ratings, one column per movie.
Thanks.
The line of code movie_data[is.na(movie_data)] <- 0 is the source of the error. For a realRatingMatrix (unlike a binaryRatingMatrix), the movies that are not rated by a user are expected to be NA values, not zero values. For example, the following code gives the correct predictions:
library("recommenderlab")
movie_data<-read.csv("UUCF Assignment Spreadsheet_user_row.csv")
rownames(movie_data) <- movie_data$X
movie_data$X <- NULL
R<-as.matrix(movie_data)
r <- as(R, "realRatingMatrix")
rec=Recommender(r,method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
recom <- predict(rec, r["1648"], n=5)
as(recom, "list")
# [[1]]
# [1] "X13..Forrest.Gump..1994." "X550..Fight.Club..1999."
# [3] "X77..Memento..2000." "X122..The.Lord.of.the.Rings..The.Return.of.the.King..2003."
# [5] "X1572..Die.Hard..With.a.Vengeance..1995."
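If your source data really does use 0 as a placeholder for "not rated", a minimal sketch of the fix (assuming a genuine rating of 0 cannot occur in this data) is to convert those zeros back to NA before the coercion:
# Turn placeholder zeros back into missing values before building
# the realRatingMatrix; assumes 0 is never a real rating here.
R[R == 0] <- NA
r <- as(R, "realRatingMatrix")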

getting Error while installing DMwR package

Hi, I am getting this error message while installing the DMwR package from RGui 3.3.1.
Error in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open the connection
In addition: Warning messages:
1: In unzip(zipname, exdir = dest) : error 1 in extracting from zip file
2: In read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) :
cannot open compressed file 'bitops/DESCRIPTION', probable reason 'No such file or directory'
Approach 1:
The error being reported is an inability to open a connection. On Windows that is often a firewall problem, and it is covered in the Windows R FAQ. The usual first attempt used to be to enable internet2.dll; from a console session you could use:
setInternet2(TRUE)
From the NEWS for R version 3.3.1 Patched (2016-09-13 r71247):
(Windows only) Function setInternet2() has no effect and will be removed in due course. The choice between methods "internal" and "wininet" is now made by the method arguments of url() and download.file() and their defaults can be set via options. The out-of-the-box default remains "wininet" (as it has been since R 3.2.2).
You are using version 3.3.1, which is why this no longer works.
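If you still need to influence how R downloads files on R >= 3.3, the replacement is the download-method option; a minimal sketch ("wininet" is already the out-of-the-box default on Windows):
# Replacement for the removed setInternet2(): choose the download
# method via options(); "wininet" is already the Windows default.
options(download.file.method = "wininet")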
Approach 2:
The error suggests that the package requires another package, bitops, which is not available. That package is not among DMwR's direct dependencies, but one of those dependencies requires it in turn (in this case: ROCR).
Try installing the Windows binary directly (the URL points at a zip file, so it is passed as the package argument with repos = NULL rather than as a repository):
install.packages("https://cran.r-project.org/bin/windows/contrib/3.3/bitops_1.0-6.zip",
                 repos = NULL, type = "win.binary")
The package DMwR lists abind, zoo, xts, quantmod and ROCR as imports. So, in addition to those five packages, you must install the DMwR package itself. Install these packages manually, in the following sequence:
install.packages('abind')
install.packages('zoo')
install.packages('xts')
install.packages('quantmod')
install.packages('ROCR')
install.packages("DMwR")
library("DMwR")
Approach 3:
chooseCRANmirror()
Select a CRAN mirror from the popup list, then install the packages:
install.packages("bitops")
install.packages("DMwR")
Package ‘DMwR’ was removed from the CRAN repository.
Formerly available versions can be obtained from the archive.
https://CRAN.R-project.org/package=DMwR
You can use the SMOTE() function as written in the CRAN package. Copy the following code into a new R script, run it, and save it for future use if you want. Once you have run this code, you should be able to use SMOTE() the way you have been trying to.
# ===================================================
# Creating a SMOTE training sample for classification problems
#
# If called with learner=NULL (the default) it does not
# learn any model, simply returning the SMOTEd data set
#
# NOTE: It does not handle NAs!
#
# Examples (the minority class is determined internally as the
# least frequent class, so it is not passed as an argument):
# ms <- SMOTE(Species ~ ., iris, perc.under=400, perc.over=300,
#             learner='svm', gamma=0.001, cost=100)
# newds <- SMOTE(Species ~ ., iris, perc.under=300, k=3, perc.over=400)
#
# L. Torgo, Feb 2010
# ---------------------------------------------------
SMOTE <- function(form, data,
                  perc.over = 200, k = 5,
                  perc.under = 200,
                  learner = NULL, ...
                  )
  # INPUTS:
  # form       a model formula
  # data       the original training set (with the unbalanced distribution)
  # perc.over/100 is the number of new cases (smoted cases) generated
  #            for each rare case. If perc.over < 100 a single case
  #            is generated uniquely for a randomly selected perc.over
  #            of the rare cases
  # k          the number of neighbours to consider as the pool from where
  #            the new examples are generated
  # perc.under/100 is the number of "normal" cases that are randomly
  #            selected for each smoted case
  # learner    the learning system to use
  # ...        any learning parameters to pass to learner
{
  # the column where the target variable is
  tgt <- which(names(data) == as.character(form[[2]]))
  # the minority class label (the least frequent class)
  minCl <- levels(data[, tgt])[which.min(table(data[, tgt]))]

  # get the cases of the minority class
  minExs <- which(data[, tgt] == minCl)

  # generate synthetic cases from these minExs
  # (the target column is temporarily swapped into the last position)
  if (tgt < ncol(data)) {
    cols <- 1:ncol(data)
    cols[c(tgt, ncol(data))] <- cols[c(ncol(data), tgt)]
    data <- data[, cols]
  }
  newExs <- smote.exs(data[minExs, ], ncol(data), perc.over, k)
  if (tgt < ncol(data)) {
    newExs <- newExs[, cols]
    data <- data[, cols]
  }

  # get the undersample of the "majority class" examples
  selMaj <- sample((1:NROW(data))[-minExs],
                   as.integer((perc.under/100) * nrow(newExs)),
                   replace = TRUE)

  # the final data set (the undersample + the rare cases + the smoted exs)
  newdataset <- rbind(data[selMaj, ], data[minExs, ], newExs)

  # learn a model if required
  if (is.null(learner)) return(newdataset)
  else do.call(learner, list(form, newdataset, ...))
}
# ===================================================
# Obtain a set of smoted examples for a set of rare cases.
# L. Torgo, Feb 2010
# ---------------------------------------------------
smote.exs <- function(data, tgt, N, k)
  # INPUTS:
  # data are the rare cases (the minority "class" cases)
  # tgt is the column index of the target variable
  # N is the percentage of over-sampling to carry out;
  # and k is the number of nearest neighbours to use for the generation
  # OUTPUTS:
  # The result of the function is a (N/100)*T set of generated
  # examples with rare values on the target
{
  nomatr <- c()
  T <- matrix(nrow = dim(data)[1], ncol = dim(data)[2] - 1)
  for (col in seq.int(dim(T)[2]))
    if (class(data[, col]) %in% c('factor', 'character')) {
      T[, col] <- as.integer(data[, col])
      nomatr <- c(nomatr, col)
    } else T[, col] <- data[, col]

  if (N < 100) { # only a percentage of the T cases will be SMOTEd
    nT <- NROW(T)
    idx <- sample(1:nT, as.integer((N/100) * nT))
    T <- T[idx, ]
    N <- 100
  }

  p <- dim(T)[2]
  nT <- dim(T)[1]

  ranges <- apply(T, 2, max) - apply(T, 2, min)

  nexs <- as.integer(N/100) # the number of artificial exs generated
                            # for each member of T
  new <- matrix(nrow = nexs * nT, ncol = p) # the new cases

  for (i in 1:nT) {
    # the k NNs of case T[i,]
    xd <- scale(T, T[i, ], ranges)
    for (a in nomatr) xd[, a] <- xd[, a] == 0
    dd <- drop(xd^2 %*% rep(1, ncol(xd)))
    kNNs <- order(dd)[2:(k + 1)]

    for (n in 1:nexs) {
      # select randomly one of the k NNs
      neig <- sample(1:k, 1)

      # the attribute values of the generated case
      difs <- T[kNNs[neig], ] - T[i, ]
      new[(i - 1)*nexs + n, ] <- T[i, ] + runif(1) * difs
      for (a in nomatr)
        new[(i - 1)*nexs + n, a] <- c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)]
    }
  }

  newCases <- data.frame(new)
  for (a in nomatr)
    newCases[, a] <- factor(newCases[, a],
                            levels = 1:nlevels(data[, a]),
                            labels = levels(data[, a]))

  newCases[, tgt] <- factor(rep(data[1, tgt], nrow(newCases)),
                            levels = levels(data[, tgt]))
  colnames(newCases) <- colnames(data)
  newCases
}
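As a quick sanity check, here is a minimal usage sketch on an artificially imbalanced two-class version of iris (hypothetical data, using only the SMOTE() defined above):
# Build an imbalanced data set (100 "common" vs 50 "rare"), then
# rebalance it with the SMOTE() function defined above.
data(iris)
imb <- iris[, c(1, 2, 5)]
imb$Species <- factor(ifelse(imb$Species == "setosa", "rare", "common"))
table(imb$Species)    # common: 100, rare: 50
newds <- SMOTE(Species ~ ., imb, perc.over = 200, perc.under = 200, k = 3)
table(newds$Species)  # roughly balanced now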
DMwR has been removed from the CRAN library; there are instructions on how to retrieve it from the archive. Either follow the link https://packagemanager.rstudio.com/client/#/repos/2/packages/DMwR or copy and paste the three lines of code below:
install.packages("devtools")
devtools::install_version('DMwR', '0.4.1')
library("DMwR")
EDIT: This is the error I got while installing the DMwR package in 2022, but it looks like the error in the original question had a different cause.
The reason is that the package 'DMwR' was built under R version 3.4.3, so the solution is explained in detail in the accepted answer above. In short, just run the script below:
install.packages('abind')
install.packages('zoo')
install.packages('xts')
install.packages('quantmod')
install.packages('ROCR')
install.packages("DMwR")
library("DMwR")

How to write rownames into a spreadsheet with the googlesheets package in R?

I would like to write a data frame to a Google spreadsheet with the googlesheets package, but the row names aren't written to the first column.
My data frame looks like this:
> str(stats)
'data.frame': 4 obs. of 2 variables:
$ Offensive: num 194.7 87 62.3 10.6
$ Defensive: num 396.28 51.87 19.55 9.19
> stats
                                 Offensive  Defensive
Annualized Return               194.784261 396.278385
Annualized Standard Deviation    87.04125   51.872826
Worst Drawdown                   22.26618    9.546208
Annualized Sharpe Ratio (Rf=0%)   1.61126    0.9193734
I load the library as recommended in the documentation, create the spreadsheet and worksheet, then write the data with the gs_edit_cells() command:
> install.packages("googlesheets")
> library("googlesheets")
> suppressPackageStartupMessages(library("dplyr"))
> mySpreadsheet <- gs_new("mySpreadsheet")
> mySpreadsheet <- mySpreadsheet %>% gs_ws_new("Stats")
> mySpreadsheet <- mySpreadsheet %>% gs_edit_cells(ws = "Stats", input = stats, trim = TRUE)
Everything goes well, but googlesheets doesn't create a column with the row names. Only the two data columns are created (Offensive and Defensive).
I have tried converting the data frame into a matrix, but the result is the same.
Any idea how I could achieve this?
Thank you
It doesn't look like there is a row-names argument for gs_edit_cells(). If you just want the row names to show up in the first column of the sheet, you could try:
stats$Rnames <- rownames(stats)  # add a column equal to the row names
stats <- stats[, c("Rnames", "Offensive", "Defensive")]  # reorder so the names come first
# names(stats) <- c("", "Offensive", "Defensive")  # optional, if you want the names column to have no header
From here, just pass stats to the googlesheets functions like you did before.
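Alternatively, if you have the tibble package available, rownames_to_column() does the same in one step; a sketch reusing the mySpreadsheet workflow from the question ("Metric" is an arbitrary header name):
# Promote the row names to a regular first column, then write as before.
library(tibble)
stats2 <- rownames_to_column(stats, var = "Metric")
mySpreadsheet <- mySpreadsheet %>% gs_edit_cells(ws = "Stats", input = stats2, trim = TRUE)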

Scrape number of articles on a topic per year from NYT and WSJ?

I would like to create a data frame that scrapes the NYT and WSJ and has the number of articles on a given topic per year. That is:
     NYT WSJ
2011   2   3
2012  10   7
I found this tutorial for the NYT, but it is not working for me. When I get to line 30, I get this error:
> cts <- as.data.frame(table(dat))
Error in provideDimnames(x) :
length of 'dimnames' [1] not equal to array extent
Any help would be much appreciated.
Thanks!
PS: This is my code that is not working (an NYT API key is needed: http://developer.nytimes.com/apps/register):
# Need to install from source http://www.omegahat.org/RJSONIO/RJSONIO_0.2-3.tar.gz
# then load:
library(RJSONIO)
### set parameters ###
api <- "API key goes here" ###### <<<API key goes here!!
q <- "MOOCs" # Query string, use + instead of space
records <- 500 # total number of records to return, note limitations above
# calculate parameter for offset
os <- 0:(records/10-1)
# read first set of data in
uri <- paste ("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[1], "&fields=date&api-key=", api, sep="")
raw.data <- readLines(uri, warn="F") # get them
res <- fromJSON(raw.data) # tokenize
dat <- unlist(res$results) # convert the dates to a vector
# read in the rest via loop
for (i in 2:length(os)) {
  # concatenate the URL for each offset
  uri <- paste("http://api.nytimes.com/svc/search/v1/article?format=json&query=", q, "&offset=", os[i], "&fields=date&api-key=", api, sep="")
  raw.data <- readLines(uri, warn="F")
  res <- fromJSON(raw.data)
  dat <- append(dat, unlist(res$results)) # append
}
# aggregate counts for dates and coerce into a data frame
cts <- as.data.frame(table(dat))
# establish date range
dat.conv <- strptime(dat, format="%Y%m%d") # need to convert dat into POSIX format for this
daterange <- c(min(dat.conv), max(dat.conv))
dat.all <- seq(daterange[1], daterange[2], by="day") # all possible days
# compare dates from counts dataframe with the whole data range
# assign 0 where there is no count, otherwise take count
# (take out PSD at the end to make it comparable)
dat.all <- strptime(dat.all, format="%Y-%m-%d")
# can't seem to compare POSIX objects with %in%, so coerce them to character for this:
freqs <- ifelse(as.character(dat.all) %in% as.character(strptime(cts$dat, format="%Y%m%d")), cts$Freq, 0)
plot (freqs, type="l", xaxt="n", main=paste("Search term(s):",q), ylab="# of articles", xlab="date")
axis(1, 1:length(freqs), dat.all)
lines(lowess(freqs, f=.2), col = 2)
UPDATE: the repo is now at https://github.com/rOpenGov/rtimes
There is an RNYTimes package created by Duncan Temple-Lang (https://github.com/omegahat/RNYTimes), but it is outdated because the NYTimes API is on v2 now. I've been working on one for political endpoints only, which is not relevant for you.
I'm rewiring RNYTimes right now. Install it from GitHub; you need to install devtools first to get install_github:
install.packages("devtools")
library(devtools)
install_github("rOpenGov/RNYTimes")
Then try your search with that, e.g.:
library(RNYTimes); library(plyr)
moocs <- searchArticles("MOOCs", key = "<yourkey>")
This gives you the number of articles found:
moocs$response$meta$hits
[1] 121
You could get the word count for each article with:
as.numeric(sapply(moocs$response$docs, "[[", 'word_count'))
[1] 157 362 1316 312 2936 2973 355 1364 16 880
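To get at the per-year counts the question asks for, here is a sketch (assuming each element of moocs$response$docs carries a pub_date string, as the v2 Article Search API returns):
# Tabulate the number of matching articles per year.
dates <- sapply(moocs$response$docs, "[[", "pub_date")
years <- substr(dates, 1, 4)  # pub_date strings start with YYYY
as.data.frame(table(Year = years))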

List and description of all packages in CRAN from within R

I can get a list of all the available packages with the function:
ap <- available.packages()
But how can I also get a description of these packages from within R, so I can have a data.frame with two columns: package and description?
Edit of an almost ten-year-old accepted answer: what you likely want is not to scrape (unless you want to practice scraping) but to use an existing interface: tools::CRAN_package_db(). Example:
> db <- tools::CRAN_package_db()[, c("Package", "Description")]
> dim(db)
[1] 18978 2
>
The function (currently) brings back 66 columns, of which the two of interest here are a part.
I actually think you want "Package" and "Title", as the "Description" can run to several lines. So here is the former; just put "Description" in the final subset if you really want "Description":
R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+ contrib.url(getOption("repos")["CRAN"], "source")
+ description <- sprintf("%s/web/packages/packages.rds",
+ getOption("repos")["CRAN"])
+ con <- if(substring(description, 1L, 7L) == "file://") {
+ file(description, "rb")
+ } else {
+ url(description, "rb")
+ }
+ on.exit(close(con))
+ db <- readRDS(gzcon(con))
+ rownames(db) <- NULL
+
+ db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle()) # I shortened one Title here...
Package Title
[1,] "abc" "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA" "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd" "The Analysis of Biological Data"
[4,] "abind" "Combine multi-dimensional arrays"
[5,] "abn" "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>
Dirk has provided a terrific answer. After finishing my solution and then seeing his, I debated for some time whether to post mine, for fear of looking silly. But I decided to post it anyway, for two reasons:
it is informative to beginning scrapers like myself
it took me a while to do, so why not :)
I approached this thinking I'd need to do some web scraping, and chose crantastic as the site to scrape from. First I'll provide the code, and then two scraping resources that have been very helpful to me as I learn:
library(RCurl)
library(XML)

URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = TRUE,
                                    strip.white = TRUE, as.is = FALSE, sep = ",",
                                    na.strings = c("999", "NA", " "))[, 1])

Trim <- function(x) {
  gsub("^\\s+|\\s+$", "", x)
}

packs <- unique(Trim(packs))
u1 <- "http://crantastic.org/packages/"
len.samps <- 10 # for demo purposes; use:
# len.samps <- length(packs) # for all of them
URL2 <- paste0(u1, packs[seq_len(len.samps)])

scraper <- function(urls) { # function to grab the description
  doc <- htmlTreeParse(urls, useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//p")[[3]]
  return(nodes)
}

info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))
info2 <- sapply(info, function(x) { # replace errors with NA
  if (class(x)[1] != "XMLInternalElementNode") {
    NA
  } else {
    Trim(gsub("\\s+", " ", xmlValue(x)))
  }
})

pack_n_desc <- data.frame(package = packs[seq_len(len.samps)],
                          description = info2) # make a data frame of it all
Resources:
- talkstats.com thread on web scraping (great beginner examples)
- w3schools.com site on HTML (very helpful)
I wanted to try to do this using an HTML scraper (rvest) as an exercise, since available.packages() in the OP doesn't contain the package Descriptions.
library('rvest')

url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
webpage <- read_html(url)
data_html <- html_nodes(webpage, 'tr td')
length(data_html)
P1 <- html_nodes(webpage, 'td:nth-child(1)') %>% html_text(trim=TRUE) # the package name
P2 <- html_nodes(webpage, 'td:nth-child(2)') %>% html_text(trim=TRUE) # the description
P1 <- P1[nzchar(P1)] # remove empty ("") items
length(P1); length(P2)
mdf <- data.frame(P1, P2, row.names=NULL)
colnames(mdf) <- c("PackageName", "Description")
# This is the problem! It lists large sets column-by-column,
# instead of row-by-row. Try with the full list to see what happens.
print(mdf, right=FALSE, row.names=FALSE)
# PackageName Description
# A3 Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
# abbyyR Access to Abbyy Optical Character Recognition (OCR) API
# abc Tools for Approximate Bayesian Computation (ABC)
# abc.data Data Only: Tools for Approximate Bayesian Computation (ABC)
# ABC.RAP Array Based CpG Region Analysis Pipeline
# ABCanalysis Computed ABC Analysis
# For small sets we can use either:
# mdf[1:6,] #or# head(mdf, 6)
However, although this works quite well for a small subset, I ran into a display problem with the full list, where the data would be shown either column-by-column or unaligned. It would have been great to have this paged and properly formatted in a new window somehow. I tried using page(), but I couldn't get it to work very well.
EDIT:
The recommended method is not the above, but rather Dirk's suggestion (from the comments below). Note the columns are selected by name rather than by position, since the column order of CRAN_package_db() can change:
db <- tools::CRAN_package_db()
colnames(db)
mdf <- data.frame(Package = db$Package, Description = db$Description)
print(mdf, right=FALSE, row.names=FALSE)
However, this still suffers from the display problem mentioned above.
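One way to sidestep that display problem (a sketch, applicable to either approach): open the table in the data viewer or export it instead of printing to the console:
# View() opens a spreadsheet-style viewer (RStudio/RGui);
# write.csv() lets any external tool page through the full list.
View(mdf)
write.csv(mdf, "cran_packages.csv", row.names = FALSE)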
