I apologise if this question has been asked already (I haven't been able to find it). I was under the impression that I could access datasets in R using data(), for example, from the datasets package. However, this doesn't work for time series objects. Are there other examples where this is not the case? (And why?)
data("ldeaths") # no dice
ts("ldeaths") # works
(However, this works for data("austres"), which is also a time-series object).
The data function is designed to load package data sets and all their attributes, time series or otherwise.
I think the issue you're having is that there is no stand-alone data set called ldeaths in the datasets package. ldeaths does exist as one of three objects in the UKLungDeaths data set. The other two are fdeaths and mdeaths.
The following should lazily load all data sets.
data(UKLungDeaths)
Then, typing ldeaths in the console or using it as an argument in some function will load it.
str(ldeaths)
While it is uncommon for package authors to include multiple objects in one data set, it does happen. This line from the data function documentation gives a 'heads up' about this:
"For each given data set, the first two types (‘.R’ or ‘.r’, and ‘.RData’ or ‘.rda’ files) can create several variables in the load environment, which might all be named differently from the data set"
That is the case here: although the data set contains three time series objects, none of them is named UKLungDeaths.
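A quick way to check this yourself is to load the data set into a scratch environment and list what appears in it. This is only a minimal sketch; the names e and loaded_names are my own choices for illustration.

```r
# Load the UKLungDeaths data set into a scratch environment (e is an
# arbitrary name) so we can see exactly which objects it creates.
e <- new.env()
data("UKLungDeaths", package = "datasets", envir = e)

# List the objects the data set created, then inspect one of them.
loaded_names <- ls(e)
print(loaded_names)
str(e$ldeaths)
```

Loading into a separate environment also avoids overwriting anything in your workspace while you look around.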
This happens when the package author uses the save function to write multiple R objects to a single external file. In the wild, I've seen folks use the save function to bundle a description file with the data set, although this would not be the proper way to document something in a full-on package. If you're really curious, go read the documentation on the save function.
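To make that concrete, here is a small sketch of how one file ends up holding several objects. The objects a and b and the temporary file are made up for the example; they have nothing to do with the datasets package itself.

```r
# Two throwaway objects (hypothetical, for illustration only).
a <- 1:3
b <- letters[1:3]

# Write both objects into one .RData file.
f <- tempfile(fileext = ".RData")
save(a, b, file = f)

# Load the file into a fresh environment: both objects come back under
# their own names, and nothing is named after the file.
e2 <- new.env()
restored <- load(f, envir = e2)
print(restored)
```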
Justin
Related
I have a large R Markdown file with many different outputs. The dataset is still being collected, and I often reknit the file to get an update including the most recent data. I would like to automatically see what has changed from the last time without needing to page through the entire output.
A) Is there an easier strategy than writing code to extract all the values from the output and formatting a side-by-side presentation myself?
B) The output includes several figures. I would like to compare these as well, but I would be happy with a solution that only compares numbers.
C) I would also be satisfied with a function or package that saves a defined subset of variables and lets me compare them to the values of variables saved with the same name in the past.
Stuck in a problem. There are two datasets A and B. Say they're datasets of two factories. Factory A is performing really well whereas Factory B is not. I have the data-set of Factory A (data being output from the manufacturing units) as well as Factory B, both having the same variables. How can I identify the problematic variable in Factory B which needs to be fixed so that Factory B starts performing well too? Therefore, I need to identify the problematic variable which needs immediate attention.
Looking forward to your response.
p.s: coding language being used is R
Well, this is a shameless plug for the dataMaid package, which I helped write and which sort of does what you are asking. The idea of the dataMaid package is to run a battery of tests on the variables in a data frame and produce a report that a human investigator (preferably someone with knowledge about the context) can look through in order to identify potential problems.
A super simple way to get started is to load the package and use the
clean function on a data frame (if you try to clean the same data
frame several times then it may be necessary to add the replace=TRUE
argument to overwrite the existing report).
devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(trees)
clean(trees)
This will create a report with summaries and error checks for each
variable in the trees data frame. A summary of all the variables is provided and for the trees data it looks like this
while the information from each variable may look like this
Here we get a status about the variable type, summary statistics, a plot and - in this case - an indicator that there might be a problem with outliers.
The dataMaid package can also be used interactively by running checks for the individual variables or for all variables in the dataset
data(toyData)
check(toyData$var2) # Individual check of var2
check(toyData) # Check all variables at once
By default the standard battery of tests is run depending on the
variable type, but it is possible to extend the package by providing your own checks.
In your case I would run the package on both datasets to get two reports, and any major differences in those would raise a flag about what could be problematic.
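If you want a rough numeric first pass before reading the two reports, something like the following base R sketch could help. It is not part of dataMaid: the function compare_means and the simulated factory_a and factory_b data frames are made up for illustration, and a standardised mean difference is just one of many possible ways to rank variables.

```r
# Simulated stand-ins for the two factories (made-up data, same variables).
set.seed(1)
factory_a <- data.frame(temp = rnorm(100, 70, 2), pressure = rnorm(100, 30, 1))
factory_b <- data.frame(temp = rnorm(100, 70, 2), pressure = rnorm(100, 34, 1))

# Hypothetical helper: standardised mean difference (B minus A) for every
# numeric variable the two data frames share, largest absolute shift first.
compare_means <- function(a, b) {
  vars <- intersect(names(a), names(b))
  is_num <- vapply(vars, function(v) is.numeric(a[[v]]) && is.numeric(b[[v]]),
                   logical(1))
  num_vars <- vars[is_num]
  shift <- vapply(num_vars, function(v) {
    pooled_sd <- sqrt((var(a[[v]], na.rm = TRUE) + var(b[[v]], na.rm = TRUE)) / 2)
    (mean(b[[v]], na.rm = TRUE) - mean(a[[v]], na.rm = TRUE)) / pooled_sd
  }, numeric(1))
  shift[order(abs(shift), decreasing = TRUE)]
}

shifts <- compare_means(factory_a, factory_b)
print(shifts)
```

The variables at the top of the result are the ones whose distributions differ most between the two data sets, so they are reasonable candidates to look at first in the reports.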
background
I am the maintainer of two packages that I would like to add to CRAN. They were rejected because some functions assign variables to .GlobalEnv. Now I am trying to find a different but as convenient way to handle these variables.
the packages
Both packages belong to the Database of Odorant Responses, DoOR. We collect published odor-response data in the DoOR.data package; the algorithms in the DoOR.functions package merge these data into a single consensus dataset.
intended functionality
The data package contains a precomputed consensus dataset. The user is able to modify the underlying original datasets (e.g. add their own datasets, remove some...) and compute a new consensus dataset. Thus the functions must be able to access the modified datasets.
The easiest way was to load the complete package data into the .GlobalEnv (via a function) and then modify the data there. This was also straightforward for the user, who saw the relevant datasets in their "main" environment. The problem is that writing into the user's environment is bad practice and CRAN wouldn't accept the package this way (understandable).
things I tried
assigning only modified datasets to .GlobalEnv, not explicitly but via parent.frame() - Hadley pointed out that this is still bad; in the end we are writing into the user's environment.
writing only modified datasets into a dedicated new environment, door.env <- new.env().
door.env is not in the search path, thus data in it is ignored by the functions
putting it into the search path with attach(door.env), as I learned, creates a new environment in the search path, thus any further edits in door.env will again be ignored by the functions
it is complicated for the user to see and edit the data in door.env; I'd rather have a solution where a user would not have to learn environment handling
So, bottom line: with all the solutions I tried I ended up with multiple copies of datasets in different environments, and I am afraid that this confuses the average user of our package (including me :))
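To illustrate the attach() behaviour described above, here is a minimal reproduction. The variable x and the search-path name "door_copy" are made up for this example; only door.env comes from the package.

```r
# A dedicated environment holding one (made-up) value.
door.env <- new.env()
assign("x", 1, envir = door.env)

# attach() copies the contents into a NEW environment on the search path.
attach(door.env, name = "door_copy")

# A later edit in door.env does not reach the attached copy.
assign("x", 2, envir = door.env)
value_in_copy <- get("x", envir = as.environment("door_copy"))
value_in_env  <- get("x", envir = door.env)
print(c(copy = value_in_copy, door.env = value_in_env))

# Clean up the search path again.
detach("door_copy", character.only = TRUE)
```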
Hopefully someone has an idea of where to store data, easily accessible to user and functions.
EDIT: the data is stored as *.RData files under /data in the DoOR.data package. I tried using LazyData = true to avoid loading all the data into the .GlobalEnv. This works well, but the problems with the manipulated/updated data remain.
I would like to find out what the "R way" would be to let users do the following with R: I have a file that can contain the data of one or more analysis runs of some other software. My R package should provide additional ways to calculate statistics or produce plots for those analyses. So the first step for a user is to read in the file (with one or more analyses), then select an analysis and work with it.
An analysis is uniquely identified by two names (an analysis name and an analysis type where the type should later correspond to an S3 class).
What I am not sure about is how to best represent the collection of analyses that is returned when reading in the file: should this be an object or simply a list of lists (since there are two ids for identifying an analysis, the first list could be indexed by name and the second by type). Using a list feels very low-level and clumsy though.
If the read function returns a special kind of container object what would be a good method to access one of the contained objects based on name and type?
There are probably many ways how to do this, but since I only started to work with R in a way where others should eventually use my code, I am not sure how to best follow existing R-conventions for how to design this.
I am loading time series data using the read.zoo function. I noticed that when I load a time series using the zoo package, it isn't displayed as a data frame, and when I click on it, it is displayed as shown in the picture.
One cannot discern what the data looks like from this, whereas data pulled in using read.csv/read.table is labeled as a data.frame and displayed in a neat manner when clicked on. I know I can simply use the View(data) command, but this is very cumbersome. I am sorry to be picky, but it would be nice to simply click on the data and have it displayed with the appropriate columns and rows.
I also noticed that when I generate variables using the data set, the new variables are never attached to the data set in which they were created, so I must use the data=merge(data,newvariable) command to combine them with the initial data.
Are there any techniques that can be employed to fix these two issues?