Convention for R function to read a file and return a collection of objects - r

I would like to find out what the "R way" would be to let users do the following: I have a file that can contain the data of one or more analysis runs of some other software. My R package should provide additional ways to calculate statistics or produce plots for those analyses. So the first step for a user is to read in the file (with one or more analyses), then select an analysis and work with it.
An analysis is uniquely identified by two names (an analysis name and an analysis type where the type should later correspond to an S3 class).
What I am not sure about is how to best represent the collection of analyses that is returned when reading in the file: should this be an object or simply a list of lists (since there are two ids for identifying an analysis, the first list could be indexed by name and the second by type). Using a list feels very low-level and clumsy though.
If the read function returns a special kind of container object what would be a good method to access one of the contained objects based on name and type?
There are probably many ways to do this, but since I have only just started writing R code that others will eventually use, I am not sure how best to follow existing R conventions when designing this.
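To make the "container object" option concrete, here is a minimal sketch of one possible shape for it. It is not an established convention: the names `new_analysis_set` and `get_analysis`, the class names, and the field names `name`, `type` and `values` are all made up for illustration, and the reader function itself is replaced by hand-built data.

```r
# Hypothetical constructor: wraps a list of analyses in an S3 container.
# Each analysis is a plain list with at least `name` and `type` fields.
new_analysis_set <- function(analyses) {
  stopifnot(is.list(analyses))
  structure(list(analyses = analyses), class = "analysis_set")
}

# Hypothetical accessor: look up one analysis by its (name, type) pair.
get_analysis <- function(x, name, type) {
  stopifnot(inherits(x, "analysis_set"))
  hit <- Filter(function(a) a$name == name && a$type == type, x$analyses)
  if (length(hit) != 1L) {
    stop("expected exactly one analysis named '", name, "' of type '", type,
         "', found ", length(hit))
  }
  a <- hit[[1L]]
  # Put the analysis type first in the class vector so S3 dispatch can use it.
  class(a) <- c(a$type, "analysis")
  a
}

# print method so the container shows a short overview instead of raw lists.
print.analysis_set <- function(x, ...) {
  for (a in x$analyses) cat(a$name, " [", a$type, "]\n", sep = "")
  invisible(x)
}

# Stand-in for what a file reader might return (made-up data).
runs <- new_analysis_set(list(
  list(name = "run1", type = "regression", values = c(1, 2, 3)),
  list(name = "run1", type = "clustering", values = c(4, 5)),
  list(name = "run2", type = "regression", values = c(6, 7))
))

print(runs)
one <- get_analysis(runs, "run1", "clustering")
class(one)
```

The container stays a thin wrapper around a list, so users can still inspect it with the usual tools, while the accessor hides how the analyses are indexed.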

Related

R - Long multi-use functions or short functions?

I am building an R package that contains a framework of functions that all support a certain type of project. I was wondering whether it's better design to have robust functions that can each do several similar operations, or to have a separate function for each operation.
For example (purely made up), let's say I had a function that builds a certain type of table. There are two different ways that I can build the table, "way 1" and "way 2". The final output of either way is the same type of table, but how the table is made is very different; the difference goes well beyond changing an argument or two depending on the 'way'.
For this example, should I build one larger function that takes an argument, build_way = 'way 1', or should I build a separate function for each way?
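The two options are not mutually exclusive. As a rough sketch of a middle ground, each "way" can live in its own small function while a thin wrapper picks one through `match.arg()`. The function names and the table contents below are invented for illustration only.

```r
# Hypothetical internal builders; each implements one construction strategy.
build_table_way1 <- function(x) {
  data.frame(value = x, source = "way 1")
}
build_table_way2 <- function(x) {
  data.frame(value = rev(x), source = "way 2")
}

# Thin user-facing wrapper: validates `build_way` and delegates.
build_table <- function(x, build_way = c("way 1", "way 2")) {
  build_way <- match.arg(build_way)  # errors on anything not listed above
  switch(build_way,
    "way 1" = build_table_way1(x),
    "way 2" = build_table_way2(x)
  )
}

t1 <- build_table(1:3)                       # defaults to "way 1"
t2 <- build_table(1:3, build_way = "way 2")
```

Whether the wrapper is worth having depends on how much the two ways share in arguments and documentation; if they share very little, exporting the two functions directly may be clearer.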

Comparison of good vs bad dataset using R

I am stuck on a problem. There are two datasets, A and B. Say they're the datasets of two factories. Factory A is performing really well whereas Factory B is not. I have the dataset of Factory A (the data being output from the manufacturing units) as well as that of Factory B, both having the same variables. How can I identify the problematic variable in Factory B which needs to be fixed so that Factory B starts performing well too? In other words, I need to identify the problematic variable which needs immediate attention.
Looking forward to your response.
p.s: coding language being used is R
Well, this is a shameless plug for the dataMaid package, which I helped write and which sort of does what you are asking. The idea of the dataMaid package is to run a battery of tests on the variables in a data frame and produce a report that a human investigator (preferably someone with knowledge about the context) can look through in order to identify potential problems.
A super simple way to get started is to load the package and use the clean function on a data frame (if you try to clean the same data frame several times then it may be necessary to add the replace=TRUE argument to overwrite the existing report).
devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(trees)
clean(trees)
This will create a report with summaries and error checks for each variable in the trees data frame. The report starts with a summary of all the variables, followed by a section for each individual variable. Each section gives the variable type, summary statistics, a plot and, where relevant, a flag for potential problems, such as an indicator that there might be a problem with outliers.
The dataMaid package can also be used interactively by running checks for the individual variables or for all variables in the dataset
data(toyData)
check(toyData$var2) # Individual check of var2
check(toyData) # Check all variables at once
By default the standard battery of tests is run depending on the variable type, but it is possible to extend the package by providing your own checks.
In your case I would run the package on both datasets to get two reports, and any major differences in those would raise a flag about what could be problematic.
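Alongside reading the two reports, a quick numeric comparison in base R can help rank which variables differ most between the datasets. The following is only a sketch under assumptions: the data are simulated stand-ins, `compare_means` is a made-up helper, and a large difference shows where the factories differ, not that the variable causes the poor performance.

```r
# Made-up data standing in for the two factories (same variables in both).
set.seed(1)
factory_a <- data.frame(temp = rnorm(200, 70, 2), pressure = rnorm(200, 30, 1))
factory_b <- data.frame(temp = rnorm(200, 70, 2), pressure = rnorm(200, 34, 1))

# Standardised difference in means per numeric variable:
# (mean_B - mean_A) / sd_A. Larger absolute values mean B departs more from A.
compare_means <- function(a, b) {
  vars <- intersect(names(a), names(b))
  vars <- vars[vapply(a[vars], is.numeric, logical(1))]
  d <- vapply(vars, function(v) {
    (mean(b[[v]], na.rm = TRUE) - mean(a[[v]], na.rm = TRUE)) /
      sd(a[[v]], na.rm = TRUE)
  }, numeric(1))
  sort(abs(d), decreasing = TRUE)
}

ranking <- compare_means(factory_a, factory_b)
print(ranking)  # variables ordered from most to least different
```

Someone who knows the manufacturing process should then judge whether the top-ranked variables are plausible culprits.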

Using `data()` for time series objects in R

I apologise if this question has been asked already (I haven't been able to find it). I was under the impression that I could access datasets in R using data(), for example, from the datasets package. However, this doesn't work for time series objects. Are there other examples where this is not the case? (And why?)
data("ldeaths") # no dice
ts("ldeaths") # works
(However, this works for data("austres"), which is also a time-series object).
The data function is designed to load package data sets and all their attributes, time series or otherwise.
I think the issue you're having is that there is no stand-alone data set called ldeaths in the datasets package. ldeaths does exist as one of three data sets in the UKLungDeaths data set. The other two are fdeaths and mdeaths.
The following should lazily load all data sets.
data(UKLungDeaths)
Then, typing ldeaths in the console or using it as an argument in some function will load it.
str(ldeaths)
While it is uncommon for package authors to include multiple objects in one data set, it does happen. This line from the data function documentation gives a 'heads up' about this:
"For each given data set, the first two types (‘.R’ or ‘.r’, and ‘.RData’ or ‘.rda’ files) can create several variables in the load environment, which might all be named differently from the data set"
That is the case here, as while there are three time series objects contained in the data set, not one of them is named UKLungDeaths.
This happens when the package author uses the save function to write multiple R objects to a single external file. In the wild, I've seen folks use the save function to bundle a description file with the data set, although this would not be the proper way to document something in a full-on package. If you're really curious, go read the documentation on the save function.
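To see both points for yourself, here is a small sketch. It loads UKLungDeaths into a scratch environment to list what it creates, then uses save() and load() on two throwaway objects (`a` and `b`, made up for the example) to show that one file can restore several differently named objects.

```r
# Load the data set into a scratch environment to see what it creates.
e <- new.env()
data("UKLungDeaths", package = "datasets", envir = e)
loaded <- sort(ls(e))
print(loaded)

# save() can bundle several objects in one file; load() restores them all
# under their original names, none of which need to match the file name.
a <- 1:3
b <- letters[1:2]
f <- tempfile(fileext = ".rda")
save(a, b, file = f)

e2 <- new.env()
restored <- load(f, envir = e2)  # load() returns the names it created
print(restored)
```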
Justin

RangedSummarizedExperiment for DESeq2

I'm trying to use the DESeq2 package in R for differential gene expression, but I'm having trouble creating the required RangedSummarizedExperiment object from my input data. I have found several tutorials and vignettes for doing this, but they all seem to apply to a raw data set that is different from mine. My data has gene names as row names and patient id as column names, and the data is simply integer count data. There has to be a simple way to create the RangedSummarizedExperiment object from this type of input data, but I haven't yet found a way. Can anybody help? Thanks.
I had a similar problem understanding how to use this data structure. I eventually managed to do without it by using DESeqDataSetFromMatrix. You can see an example in the first code block of the question "Modify r object with rpy2" (that code is pure R; the rpy2 stuff comes after). In that example, I have genes as rows and samples as columns, so it is likely you will be able to adopt the same approach.
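For reference, a minimal sketch of that approach could look like the following. The counts are simulated, and the `condition` column with its control/treated grouping is an assumption standing in for whatever sample information you actually have; the DESeq2 call is guarded so the rest runs even when the package is not installed.

```r
# Made-up integer counts: genes as rows, patients as columns.
set.seed(42)
counts <- matrix(rpois(6 * 4, lambda = 50), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("patient", 1:4)))

# Sample table: one row per column of `counts`, in the same order.
# The `condition` column is a hypothetical grouping variable.
coldata <- data.frame(
  condition = factor(c("control", "control", "treated", "treated")),
  row.names = colnames(counts)
)
stopifnot(identical(rownames(coldata), colnames(counts)))

# Only runs when DESeq2 (from Bioconductor) is installed.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData   = coldata,
    design    = ~ condition
  )
  print(dim(dds))
}
```

The important part is that the rows of the sample table line up with the columns of the count matrix; the design formula then refers to columns of that sample table.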

The internal implementation of R's dataset

I am trying to build a data processing program. Currently I use a matrix of doubles to represent the data table: each row is an instance and each column represents a feature. I also have an extra vector holding the target value for each instance; it is of double type for regression and of integer type for classification.
I want to make it more general. I am wondering what kind of structure R uses to store a dataset, i.e. the internal implementation in R.
Maybe if you inspect the rpy2 package, you can learn something about how data structures are represented (and can be accessed).
The data structure R uses for this is the `data.frame`; a detailed introduction to data frames can be found here:
http://cran.r-project.org/doc/manuals/R-intro.html#Data-frames
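As a quick illustration of what that means in practice, a data frame is a list of equal-length column vectors with a class attribute on top, so each column can have its own type. The column names and values below are made up.

```r
# Features of mixed types plus a classification target stored as a factor.
df <- data.frame(
  height = c(1.62, 1.80, 1.75),           # double
  visits = c(3L, 0L, 7L),                 # integer
  label  = factor(c("yes", "no", "yes"))  # factor: integer codes + levels
)

typeof(df)                        # "list": the underlying storage type
class(df)                         # "data.frame": the S3 class on top
vapply(df, typeof, character(1))  # storage type of each column
levels(df$label)                  # "no" "yes"
unclass(df$label)                 # the integer codes behind the factor
```

A factor column covers the classification case from the question (integer codes plus labels), while a double column covers regression.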
