Datasets built into R [duplicate]

Can someone please explain how to get the list of built-in datasets and the packages they belong to?

There are several ways to find the included datasets in R:
1: Using data() will give you a list of the datasets of all loaded packages (not only the ones from the datasets package); the datasets are ordered by package.
2: Using data(package = .packages(all.available = TRUE)) will give you a list of all datasets in the packages installed on your computer (i.e. including those that are not loaded).
3: Using data(package = "packagename") will give you the datasets of that specific package, so data(package = "plyr") will give the datasets in the plyr package.
If you want to know in which package a dataset is located (e.g. the acme dataset), you can do:
dat <- as.data.frame(data(package = .packages(all.available = TRUE))$results)
dat[dat$Item=="acme", c(1,3,4)]
which gives:
Package Item Title
107 boot acme Monthly Excess Returns
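If you need to do this lookup repeatedly, you can wrap it in a small helper (my own convenience function; findDataset is a made-up name, not part of any package):
findDataset <- function(name) {
  # all datasets in all installed packages, as a data frame
  res <- as.data.frame(data(package = .packages(all.available = TRUE))$results)
  res[res$Item == name, c("Package", "Item", "Title")]
}
findDataset("acme")  # returns the boot row shown above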

I often also need to know the structure of the available datasets, so I created dataStr in my misc package.
dataStr <- function(package="datasets", ...)
{
  # dataset names of the given package
  d <- data(package=package, envir=new.env(), ...)$results[,"Item"]
  # reduce entries like "beaver1 (beavers)" to the object name
  d <- sapply(strsplit(d, split=" ", fixed=TRUE), "[", 1)
  d <- d[order(tolower(d))]
  # print the class and structure of each dataset
  for(x in d){ message(x, ": ", class(get(x))); message(str(get(x)))}
}
dataStr()
Please note that the console output is quite long.
This is the type of output:
[...]
warpbreaks: data.frame
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
WorldPhones: matrix
num [1:7, 1:7] 45939 60423 64721 68484 71799 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:7] "1951" "1956" "1957" "1958" ...
..$ : chr [1:7] "N.Amer" "Europe" "Asia" "S.Amer" ...
WWWusage: ts
Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...
Edit: To get more informative output, and to use it for unloaded packages or all the packages on the search path, use the revised online version via
source("https://raw.githubusercontent.com/brry/berryFunctions/master/R/dataStr.R")

Here is a comprehensive list of datasets in R packages, maintained by Prof. Vincent Arel-Bundock:
https://vincentarelbundock.github.io/Rdatasets/
Rdatasets is a collection of 1892 datasets that were originally
distributed alongside the statistical software environment R and some
of its add-on packages. The goal is to make these data more broadly
accessible for teaching and statistical software development.
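The site also serves every dataset as a plain CSV; assuming its current URL scheme (csv/<package>/<dataset>.csv), you can load one straight into R:
# e.g. the iris dataset from the base datasets package
iris_copy <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv")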

Run
help(package = "datasets")
in the RStudio console and you'll get all available datasets in the tidy Help tab on the right.
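If you are not in RStudio, the same index can be printed to the console with plain base R:
library(help = "datasets")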

Related

Visualizing network of sentences in Textrank

I'm using the Textrank method explained here to get a summary of the text. Is there a way to plot the output of textrank_sentences as a network of all the textrank_ids connected to each other?
library(textrank)
data(joboffer)
library(udpipe)
tagger <- udpipe_download_model("english")  # download step, missing from the original snippet
tagger <- udpipe_load_model(tagger$file_model)
joboffer <- udpipe_annotate(tagger, job_rawtxt)
joboffer <- as.data.frame(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]
tr <- textrank_sentences(data = sentences, terminology = terminology)
This question is rather old, but it is a good one and deserves an answer.
Yes! textrank returns all the information that you need; just look at the output of str(tr). Part of it says:
$ sentences_dist:Classes ‘data.table’ and 'data.frame': 666 obs. of 3 variables:
..$ textrank_id_1: int [1:666] 1 1 1 1 1 1 1 1 1 1 ...
..$ textrank_id_2: int [1:666] 2 3 4 5 6 7 8 9 10 11 ...
..$ weight : num [1:666] 0.1429 0.4167 0 0.0625 0 ...
This gives which sentences are connected in the form of a lower triangular matrix. Two sentences are connected if the weight of their connection is greater than zero. To visualize the graph, use the non-zero weights as an edgelist and build the graph.
Links = which(tr$sentences_dist$weight > 0)
EdgeList = cbind(tr$sentences_dist$textrank_id_1[Links],
                 tr$sentences_dist$textrank_id_2[Links])
library(igraph)
SGraph1 = graph_from_edgelist(EdgeList, directed=FALSE)
set.seed(42)
plot(SGraph1)
We see that 11 of the nodes (sentences) are not connected to any other node.
For example, sentences 36 and 15:
tr$sentences$sentence[c(36,15)]
[1] "Contact:"
[2] "Integration of the models into the existing architecture."
But other nodes do connect up; for example, node 1 is connected to node 2.
tr$sentences$sentence[c(1,2)]
[1] "Statistical expert / data scientist / analytical developer"
[2] "BNOSAC (Belgium Network of Open Source Analytical Consultants),
is a Belgium consultancy company specialized in data analysis and
statistical consultancy using open source tools."
because those sentences share the (important) words "statistical", "data", and "analytical".
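To see exactly which terms create that link, you can intersect the lemmas of the two sentences (a quick check of my own, using the terminology object built in the question):
intersect(terminology$lemma[terminology$textrank_id == 1],
          terminology$lemma[terminology$textrank_id == 2])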
The singleton nodes take up a lot of space in the graph, making the other nodes rather crowded, so I will also show the graph with them removed.
which(degree(SGraph1) == 0)
[1] 4 7 15 20 21 23 25 26 29 30 36
SGraph2 = delete.vertices(SGraph1, which(degree(SGraph1) == 0))
set.seed(42)
plot(SGraph2)
That shows the relations between sentences somewhat better, but I expect that you can find a nicer layout for the graph that better shows the relations. However, that is not the thrust of the question and I leave it to you to make the graph pretty.
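If you want the layout to reflect connection strength, one option (a sketch of mine, not part of the original answer) is to copy the weights onto the graph; igraph's Fruchterman-Reingold layout then treats heavier edges as stronger springs:
# edges in SGraph1 are in the same order as the rows of EdgeList
E(SGraph1)$weight <- tr$sentences_dist$weight[Links]
SGraph3 <- delete.vertices(SGraph1, which(degree(SGraph1) == 0))
set.seed(42)
plot(SGraph3, layout = layout_with_fr(SGraph3))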

Error in `colnames<-`(`*tmp*`, value = c("x", "y", "x_1", "x_2", "y_1", : length of 'dimnames' [2] not equal to array extent for Panel Data

I encounter the above error while attempting to run the Granger causality test for panel data using the pgrangertest function from the plm package. I read several questions by users facing a similar issue and tried the suggestions given there; however, none of them solved my problem.
Essentially, I have panel data which looks something like this:
>head(granger_data)
panel_id time_id close_close_ret log_volume
25-2 25 2 0.004307257 4.753590
25-3 25 3 -0.001912046 8.249836
25-4 25 4 0.011417821 8.628377
25-5 25 5 0.018744691 9.134754
25-6 25 6 -0.024913157 8.920122
25-7 25 7 -0.008604260 8.724370
str(granger_data)
'data.frame': 105209 obs. of 4 variables:
$ panel_id : Factor w/ 938 levels "25","26","27",..: 1 1 1 1 1 1 1 1 1 1 ...
$ time_id : Factor w/ 323 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ close_close_ret: num NA 0.00431 -0.00191 0.01142 0.01874 ...
$ log_volume : num 4.88 4.75 8.25 8.63 9.13 ...
Now, I want to run the Granger causality test for panel data using the pgrangertest function from the plm package, and while doing so I encounter the following problem:
> vol_ret <- pgrangertest(log_volume ~ close_close_ret,data = granger_data)
Error in `colnames<-`(`*tmp*`, value = c("x", "y", "x_1", "y_1")) :
length of 'dimnames' [2] not equal to array extent
I even read the source code of the function and tried to understand where the error came from, but I couldn't figure it out.
The panel Granger test requires a time series of length > 5 + 3*order per individual; otherwise the second-order moments of the individual Wald statistics do not exist. pgrangertest in package plm has a check for that since version 1.7-0 of the package. From its NEWS file:
pgrangertest: better detection of infeasibility of test due to lacking data.
It now gives an informative error message when you supply too short a time series for an individual, as in the case you encountered, e.g.:
Error in pgrangertest(inv ~ value, data = pG, order = 1) :
Condition for test = "Ztilde" not met for all individuals: length of
time series must be larger than 5+3*order (>5+3*1=8)
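A practical workaround (a sketch of mine, using the column names from the question) is to drop the individuals whose series are too short before running the test:
library(plm)
ord <- 1
# keep only panels with more than 5 + 3*order observations
long_enough <- names(which(table(granger_data$panel_id) > 5 + 3 * ord))
granger_sub <- droplevels(granger_data[granger_data$panel_id %in% long_enough, ])
vol_ret <- pgrangertest(log_volume ~ close_close_ret, data = granger_sub, order = ord)
Note that leading NAs (like the one in close_close_ret above) can still shorten the usable series, so you may need to drop those rows as well.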

arrange multiple graphs using a for loop in ggplot2

I want to produce a pdf which shows multiple graphs, one for each NetworkTrackingPixelId.
I have a data frame similar to this:
> head(data)
NetworkTrackingPixelId Name Date Impressions
1 2421 Rubicon RTB 2014-02-16 168801
2 2615 Google RTB 2014-02-16 1215235
3 3366 OpenX RTB 2014-02-16 104419
4 3606 AppNexus RTB 2014-02-16 170757
5 3947 Pubmatic RTB 2014-02-16 68690
6 4299 Improve Digital RTB 2014-02-16 701
I was thinking to use a script similar to the one below:
# create a vector which stores the NetworkTrackingPixelIds
tp <- data %.%
  group_by(NetworkTrackingPixelId) %.%
  select(NetworkTrackingPixelId)
# create a for loop to print the line graphs
for (i in tp) {
  print(ggplot(data[which(data$NetworkTrackingPixelId == i), ],
               aes(x = Date, y = Impressions)) +
          geom_point() + geom_line())
}
I was expecting this command to produce many graphs, one for each NetworkTrackingPixelId. Instead, the result is a single graph which aggregates all the NetworkTrackingPixelIds.
Another thing I've noticed is that the variable tp is not a real vector.
> is.vector(tp)
[1] FALSE
Even if I force it..
tp <- as.vector(data %.%
group_by(NetworkTrackingPixelId) %.%
select(NetworkTrackingPixelId))
> is.vector(tp)
[1] FALSE
> str(tp)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1397 obs. of 1 variable:
$ NetworkTrackingPixelId: int 2421 2615 3366 3606 3947 4299 4429 4786 6046 6286 ...
- attr(*, "vars")=List of 1
..$ : symbol NetworkTrackingPixelId
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 63
..$ : int 24 69 116 162 205 253 302 351 402 454 ...
..$ : int 1 48 94 140 184 232 281 330 380 432 ...
[I've cut a bit this output]
- attr(*, "group_sizes")= int 29 29 2 16 29 1 29 29 29 29 ...
- attr(*, "biggest_group_size")= int 29
- attr(*, "labels")='data.frame': 63 obs. of 1 variable:
..$ NetworkTrackingPixelId: int 8799 2615 8854 8869 4786 7007 3947 9109 9126 9137 ...
..- attr(*, "vars")=List of 1
.. ..$ : symbol NetworkTrackingPixelId
Since I don't have your dataset, I will use the mtcars dataset to illustrate how to do this using dplyr and data.table. Both packages are among the finest examples of the split-apply-combine paradigm in R. Let me explain:
Step 1: Split data by gear.
dplyr uses the function group_by.
data.table uses the argument by.
Step 2: Apply a function.
dplyr uses do, to which you can pass a function that operates on each piece x.
data.table evaluates the j expression in the context of each piece.
Step 3: Combine
There is no combine step here, since we are saving the charts created to file.
library(dplyr)
mtcars %.%
  group_by(gear) %.%
  do(function(x){ggsave(
    filename = sprintf("gear_%s.pdf", unique(x$gear)), qplot(wt, mpg, data = x)
  )})
library(data.table)
mtcars_dt = data.table(mtcars)
mtcars_dt[, ggsave(
    filename = sprintf("gear_%s.pdf", unique(gear)), qplot(wt, mpg)),
  by = gear
]
UPDATE: To save all files into one pdf, here is a quick solution.
plots = mtcars %.%
  group_by(gear) %.%
  do(function(x) {
    qplot(wt, mpg, data = x)
  })
pdf('all.pdf')
invisible(lapply(plots, print))
dev.off()
I recently had a project that required producing a lot of individual pngs for each record, and I found I got a huge speed-up from some pretty simple parallelization. I am not sure whether this is more performant than the dplyr or data.table technique, but it may be worth trying:
require(foreach)
require(doParallel)
workers <- makeCluster(4)
registerDoParallel(workers)
foreach(i = seq_len(nrow(mtcars)), .packages = c('ggplot2')) %dopar% {
  j <- qplot(wt, mpg, data = mtcars[i, ])
  # note: the file is named by gear, so rows sharing a gear value overwrite one
  # another; put i in the filename if you want one png per record. This also
  # assumes an images/ subdirectory already exists.
  png(file = paste(getwd(), '/images/', mtcars[i, c('gear')], '.png', sep = ''))
  print(j)
  dev.off()
}
stopCluster(workers)
Unless I'm missing something, generating plots by a subsetting variable is very simple. You can use split(...) to split the original data into a list of data frames by NetworkTrackingPixelId, and then pass those to ggplot using lapply(...). Most of the code below is just to create a sample dataset.
# create sample data
set.seed(1)
names <- c("Rubicon","Google","OpenX","AppNexus","Pubmatic")
dates <- as.Date("2014-02-16")+1:10
df <- data.frame(NetworkTrackingPixelId = rep(1:5, each = 10),
                 Name = sample(names, 50, replace = T),
                 Date = dates,
                 Impressions = sample(1000:10000, 50))
# end create sample data
pdf("plots.pdf")
lapply(split(df, df$NetworkTrackingPixelId),
       function(gg) ggplot(gg, aes(x = Date, y = Impressions)) +
         geom_point() + geom_line() +
         ggtitle(paste("NetworkTrackingPixelId:", unique(gg$NetworkTrackingPixelId))))
dev.off()
This generates a pdf containing 5 plots, one for each NetworkTrackingPixelId.
I think you would be better off writing a function for plotting, then using lapply over every NetworkTrackingPixelId.
For example, your function might look like:
plot.function <- function(ntpid){
  sub <- subset(dataset, networktrackingpixelid == ntpid)
  ggobj <- ggplot(data = sub, aes(...)) + geom...
  ggsave(filename = sprintf("%s.pdf", ntpid), plot = ggobj)
}
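You could then run it over all the ids with, for instance (keeping the placeholder object and column names used above):
lapply(unique(dataset$networktrackingpixelid), plot.function)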
It would be helpful if you posted a reproducible example, but I hope this works! Not sure about the vector issue, though.
Cheers!

Reading durations

I have a CSV file containing times per competitor of each section of a triathlon. I am having trouble reading the data so that R can use it. Here is an example of how the data looks (I've removed some columns for clarity):
"Place","Division","Gender","Swim","T1","Bike","T2","Run","Finish"
1, "40-49","M","7:45","0:55","27:07","0:29","18:53","55:07"
2, "UNDER 18","M","5:41","0:28","30:41","0:28","18:38","55:55"
3, "40-49","M","6:27","0:26","29:24","0:40","20:16","57:11"
4, "40-49","M","7:57","0:35","29:19","0:23","19:20","57:32"
5, "40-49","M","6:28","0:32","31:00","0:34","19:19","57:51"
6, "40-49","M","7:42","0:30","30:02","0:37","19:11","58:02"
....
250 ,"18-29","F","13:20","3:23","1:06:40","1:19","38:00","2:02:40"
251 ,"30-39","F","13:01","2:42","1:02:12","1:20","43:45","2:02:58"
252 ,50 ,"F","20:45","1:33","58:09","3:17","40:14","2:03:56"
253 ,"30-39","M","13:14","1:14","DNF","1:11","25:10","DNF bike"
254 ,"40-49","M","10:04","1:41","56:36","2:32",,"D.N.F"
My first naive attempt to plot the data went like this.
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> pairs(~ Bike + Run + Swim, data=tri)
The times are not being imported in a sensible way so the charts don't make sense.
I have found the difftime type and have tried to use it to parse the times in the data file.
There are some rows with DNF or similar in place of times; I'm happy for rows with times that can't be parsed to be discarded. There are two formats for the times: "%M:%S" and "%H:%M:%S".
I think I need to create a new data frame from the data but I am having trouble parsing the times. This is what I have so far.
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> str(tri)
'data.frame': 254 obs. of 12 variables:
$ Place : num 1 2 3 4 5 6 7 8 9 10 ...
$ Race.. : num 237 274 268 226 267 247 264 257 273 272 ...
$ First.Name: chr ** removed names ** ...
$ Last.Name : chr ** removed names ** ...
$ Division : chr "40-49" "UNDER 18" "40-49" "40-49" ...
$ Gender : chr "M" "M" "M" "M" ...
$ Swim : chr "7:45" "5:41" "6:27" "7:57" ...
$ T1 : chr "0:55" "0:28" "0:26" "0:35" ...
$ Bike : chr "27:07" "30:41" "29:24" "29:19" ...
$ T2 : chr "0:29" "0:28" "0:40" "0:23" ...
$ Run : chr "18:53" "18:38" "20:16" "19:20" ...
$ Finish : chr "55:07" "55:55" "57:11" "57:32" ...
> as.numeric(as.difftime(tri$Bike, format="%M:%S"), units="secs")
This converts all the times that are under one hour, but for times over an hour the hours are interpreted as minutes. Substituting "%H:%M:%S" for "%M:%S" parses times over an hour but produces NA otherwise. What is the best way to convert both kinds of times?
EDIT: Adding a simple example as requested.
> times <- c("27:07", "1:02:12", "DNF")
> as.numeric(as.difftime(times, format="%M:%S"), units="secs")
[1] 1627 62 NA
> as.numeric(as.difftime(times, format="%H:%M:%S"), units="secs")
[1] NA 3732 NA
The output I would like would be 1627 3732 NA
Here's a quick hack at a solution, although there may be a better one:
cdifftime <- function(x) {
  x2 <- gsub("^([0-9]+:[0-9]+)$", "00:\\1", x)  ## prepend 00: to %M:%S elements
  res <- as.difftime(x2, format = "%H:%M:%S")
  units(res) <- "secs"
  as.numeric(res)
}
times <- c("27:07", "1:02:12", "DNF")
cdifftime(times)
## [1] 1627 3732 NA
You can apply this to the relevant columns:
tri[4:9] <- lapply(tri[4:9],cdifftime)
A couple of notes from trying to replicate your example:
you may want to use na.strings="DNF" to set "did not finish" values to NA automatically
you need to make sure strings are not read in as factors, e.g. (1) set options(stringsAsFactors=FALSE); (2) use stringsAsFactors=FALSE when calling read.csv; or (3) use as.is=TRUE, ditto.
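Putting those notes together, a minimal end-to-end sketch (the filename and the exact DNF strings are assumptions; match them to your file):
tri <- read.csv("triathlon.csv", as.is = TRUE,
                na.strings = c("DNF", "DNF bike", "D.N.F"))
tri[4:9] <- lapply(tri[4:9], cdifftime)  # Swim through Finish columns
pairs(~ Bike + Run + Swim, data = tri)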

R: Is it possible to turn combinations of vectors into data sets?

I am a beginner in R.
After watching a number of tutorials on regression analysis (on YouTube), I decided to make up my own data set and apply what I learned to it. This is what I did!
I wanted to randomly create a list of salaries, ages and marital status.
Salaries
salary = sample(2000:3000, 250, replace = T)
Ages
ages = sample(20:50, 250, replace = T)
MaritalStatus
marSt = sample(c("MARRIED", "SINGLE"), 250, repeat = T)
Then, I combined the three sets of data with:
dataset = cbind(salary, ages, marSt)
Finally, I tried to run a regression on what I thought was my new data set with this command:
data.reg = lm(salary~ages+marSt, data = dataset)
... only for me to be told that there was an error and that the object "dataset" was actually NOT a dataset.
My question is two fold:
(i) Is it possible to create data sets from combinations of vectors?
(ii) If no, is there any way in R to create data sets without importing them from other sources?
Thank you very much, and please bear in mind that I am a beginner, so don't be too sophisticated in your response.
You probably want a data.frame, not a matrix (as returned by cbind):
dataset <- data.frame(salary, ages, marSt)
also, repeat is not an argument of sample(); you probably mean replace=TRUE. You would do well to read an introduction to R.
This may help:
salary = sample(2000:3000, 250, replace = T)
ages = sample(20:50, 250, replace = T)
marSt = sample(c("MARRIED", "SINGLE"), 250, replace = T)
# dataset = cbind(salary, ages, marSt) #WHAT YOU DID
dataset = data.frame(salary, ages, marSt) #WHAT YOU SHOULD HAVE DONE
data.reg = lm(salary~ages+marSt, data = dataset)
Also, str() allows you to look at the structure of objects, so you can see the difference between what you did and what I did:
str(cbind(salary, ages, marSt))
str(data.frame(salary, ages, marSt))
Output:
> str(cbind(salary, ages, marSt))
chr [1:250, 1:3] "2388" "2530" "2518" "2450" "2008" "2502" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "salary" "ages" "marSt"
> str(data.frame(salary, ages, marSt))
'data.frame': 250 obs. of 3 variables:
$ salary: int 2388 2530 2518 2450 2008 2502 2264 2185 2207 2048 ...
$ ages : int 24 21 35 31 50 39 22 21 36 29 ...
$ marSt : Factor w/ 2 levels "MARRIED","SINGLE": 1 2 2 2 2 2 2 1 1 2 ...
EDIT:
baptiste beat me to this one but I'm leaving my answer up as it adds to the explanation given by baptiste
