Trouble including data in package - r

I've created my first package, hosted on github. I'm trying to include three data frames in the package. Per the Writing R Extensions guide, I saved each data frame as a separate .RData file in the data subdirectory. However, I can't seem to access the data when I load the package.
When I install the package from a clean R session using
require(devtools)
install_github("Lloyd.et.al.Cell.abundance.metaanalysis", "adsteen")
the package functions, documentation, and vignette seem to load correctly. Better yet,
data(package="Lloyd.et.al.Cell.abundance.metaanalysis")
shows the three data frames that are encoded as .RData files, named all_data, corrected_seds, and corrected_sw.
The problem is, I can't seem to actually access the data. I have LazyData: true at line 21 of the DESCRIPTION file, so I would expect head(all_data) to show the data frame, but it returns an error: object 'all_data' not found. I can't seem to find a way to use load() to load the data, either.
What am I doing wrong?

Related

How do I export a data frame to Excel?

I am trying to export a data frame from R to Excel. I am using the 'writexl' package, but it does not seem to work.
The code is as following:
install.packages('writexl')
library(writexl)
write_xlsx(data_frame, "H:\\folder1.xlsx")
There does not seem to be any error produced, and the code appears to run; however, when I look in 'folder1', the data frame is not there.
Is there anything I am doing incorrectly?
I've found the openxlsx package to be easier to use than the xlsx package. It also doesn't have a java dependency. The main command for directly writing a data frame to an Excel file is write.xlsx. You can also create worksheets, do lots of fancy formatting and write multiple tables to a worksheet (see the vignettes here for some examples), but start with write.xlsx for the basic creation of Excel files.
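For instance, a minimal sketch with openxlsx (the data frame name and output path are placeholders, not taken from the question):
library(openxlsx)
# write.xlsx() expects a full file name ending in .xlsx, not a folder name
write.xlsx(data_frame, file = "H:/reports/data_frame.xlsx")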

Try to do a basic Performance Attribution analysis using the "pa" package in R

I loaded a package called "pa" in R, trying to do some basic performance attribution analysis for my portfolio. The pa package comes with its own data frame, namely "year".
I tried data(year) in the console, and it works.
My problem.....
I tried to create/import a CSV file, which I call Rtestcsv (with all my holdings and Bloomberg codes).
When I type data(Rtestcsv) in the console, an error message pops up saying the data set is not found. Why?
Is it true that the "pa" package created by Yang Lu can only be applied to his own data frame?
In order to use the "pa" package for a simple performance analysis, what do I have to do with my own data frame? The CSV file seems to have been successfully imported, and it appears in the top right-hand Environment pane.
I would appreciate it if someone could shed some light on this, as I am new to R.
The data() function is mainly used to load preexisting data sets shipped with installed packages, not your own .csv or .xlsx files. To import your own data you can use the read.csv() function; however, I would recommend the read_csv() function from the readr package, as it's much faster. (Make sure you install readr first.)
library(readr)
rtestcsv <- read_csv("/path/to/file.csv")
The directory portion of the path can be omitted if the file is in your working directory. You should then be able to access your data frame as a variable; once it's in your environment, you can do with it as you please. As for how you should further wrangle your data frame, that all depends on what kind of analysis you plan on doing and what functions you want to use.
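As a rough sketch of the next step, here is how the imported data frame could be passed to pa's brinson() function; the column names below are assumptions modeled on the package's own year data, so rename them to match your file:
library(pa)
library(readr)
rtestcsv <- read_csv("/path/to/file.csv")
# each *.var argument must name a column that actually exists in your data
br <- brinson(x = as.data.frame(rtestcsv), date.var = "date",
              cat.var = "sector", bench.weight = "benchmark",
              portfolio.weight = "portfolio", ret.var = "return")
summary(br)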

Example input data with example output using relative pathway in vignette of R package?

I'm putting together an R package. I would like to show example code in the vignette, where example data files (included in the package) are used to generate an (example) output file.
I read about using example data in Hadley Wickham's post (http://r-pkgs.had.co.nz/data.html), and believe I should keep my example data as raw data, as it must be parsed to generate the output.
So, I created a directory in my package structure
/Users/userName/myPackage/inst/extdata/
with subdirectories InputFiles and OutputFiles.
And I put the example file (exampleData.csv) inside of the InputFiles subdirectory (/Users/userName/myPackage/inst/extdata/InputFiles).
My vignette is located in:
/Users/userName/myPackage/vignettes/myPackage.Rnw
It contains the following syntax:
<<eval=FALSE>>=
fileString = "/Users/userName/myPackage/inst/extdata/InputFiles/exampleData.csv"
doFunction1(fileString)
doFunction2(fileString)
doFunction3(fileString, output = "/Users/userName/myPackage/inst/extdata/OutputFiles")
@
I am having two problems with developing this vignette and its example datasets:
1) I am unsure if my use of the extdata directory is appropriate. It seemed to be the best directory name and location for my example files, according to the aforementioned Hadley Wickham reference.
2) I am unsure how to make the paths relative instead of absolute, as they are currently. As you can see, this example code does not run automatically; instead, I have it in an R chunk with eval=FALSE so that it is simply listed for users to run themselves. After running the example code, users can also check that the output file was indeed created in /Users/userName/myPackage/inst/extdata/OutputFiles. What is the best way to let the user avoid absolute paths when following the example? Is it possible to just follow a relative path from within the package directory myPackage?
My data files consist of .csv, .htm, and .text files. In the past, when constructing a package, I have saved a data frame as an .rda file, and then the user could simply use:
data(example)
to read that file, without having to write out the entire path. Is there a similar function that can be used to read .csv, .htm, and .text files, and then output them to an example output location, without having to use the full path? Would it be possible to have help functions that also read in the input files and write to the output files? Would this cause a conflict on CRAN if various example help functions in the /man folder physically save the example output file to the example output folder?
The standard way to refer to a file in a package is:
# gives root package directory
system.file(package="myPackage")
# specific file
system.file("extdata/InputFiles/exampleData.csv", package="myPackage")
# best is to use cross-platform way to write a file path:
system.file("extdata", "InputFiles", "exampleData.csv", package="myPackage")
When developing with devtools, system.file() is shimmed so that files under inst/ resolve just as they would in the installed package, so you never need to worry about absolute paths. This should work in a vignette. Note that a vignette, I think, only ever uses the installed version of a package, not the one you may have loaded in your development environment (specifically, devtools::load_all() does not change the code which is used to build the vignette; you must install() it first).
Finally, using data() is a bit old-fashioned. Hadley and others recommend using lazy data, so the data appears in the namespace automatically. Try the following in your DESCRIPTION:
LazyData: true
LazyDataCompression: xz
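Putting this together, a minimal sketch of the vignette chunk (doFunction1 through doFunction3 are the question's own functions; writing the output to tempdir() is an assumption, chosen because an installed package library should be treated as read-only):
fileString <- system.file("extdata", "InputFiles", "exampleData.csv",
                          package = "myPackage")
outDir <- file.path(tempdir(), "OutputFiles")
dir.create(outDir, showWarnings = FALSE)
doFunction1(fileString)
doFunction2(fileString)
doFunction3(fileString, output = outDir)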

How to put datasets into an R package

I am creating my own R package and I was wondering what are the possible methods that I can use to add (time-series) datasets to my package. Here are the specifics:
I have created a package subdirectory called data and I am aware that this is the location where I should save the datasets that I want to add to my package. I am also cognizant of the fact that the files containing the data may be .rda, .txt, or .csv files.
Each series of data that I want to add to the package consists of a single column of numbers (e.g. of the form 340 or 4.5), and each series differs in length.
So far, I have saved all of the datasets into a .txt file. I have also successfully loaded the data using the data() function. Problem not solved, however.
The problem is that each series loads as a factor, except for the longest one. The series that load as factors contain missing values (of the form '.'), which I had to add in order to make each column the same length. I tried saving the data as unequal columns, but I received an error message after calling data().
A consequence of adding missing values to get the data to load is that, once the data is loaded, I need to remove the NAs in order to get on with my analysis. So this is clearly not a good way of doing things.
Ideally (I suppose), I would like the data to load as numeric vectors or as a list. In this way, I wouldn't need the NA's appended to the end of each series.
How do I solve this problem? Should I save all of the data into one single file? If so, in what format should I do it? Perhaps I should save the datasets into a number of files? Again, in which format? What is the best practical way of doing this? Any tips would greatly be appreciated.
I'm not sure I understood your question correctly, but if you edit your data in your favorite format and save it with
save(myediteddata, file = "data.rda")
the data should load exactly the way you saw it in R.
To make all files in the data directory available on load, add
LazyData: true
to the DESCRIPTION file of your package.
If this doesn't help, you could post one of your files and an example of the format you want; this will help us help you ;)
In addition to saving as rda files you could also choose to load them as numeric with:
read.table( ... , colClasses="numeric")
Or as non-factor-text:
read.table( ..., as.is=TRUE) # which does pretty much the same as stringsAsFactors=FALSE
read.table( ..., colClasses="character")
It also appears that the data() function would accept these arguments, since it is documented to be a simple wrapper for read.table(..., header=TRUE).
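Since the series have unequal lengths, saving them as a single named list sidesteps the NA padding entirely. A minimal sketch (the series names and values are made up):
series_a <- c(340, 352, 361)  # hypothetical series of unequal lengths
series_b <- c(4.5, 4.7)
my_series <- list(a = series_a, b = series_b)
save(my_series, file = "data/my_series.rda")
# after loading the package: my_series$a and my_series$b are numeric vectors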
Preferred saving location of your data depends on its format.
As Hadley suggested:
If you want to store binary data and make it available to the user,
put it in data/. This is the best place to put example datasets.
If you want to store parsed data, but not make it available to the
user, put it in R/sysdata.rda. This is the best place to put data
that your functions need.
If you want to store raw data, put it in inst/extdata.
I suggest you have a look at the linked chapter as it goes into detail about working with data when developing R packages.
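For the second case, internal data that your functions need, a sketch using the usethis helper (the lookup object here is hypothetical):
lookup <- c(small = 1, medium = 2, large = 3)  # internal lookup table
usethis::use_data(lookup, internal = TRUE)     # writes R/sysdata.rda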
You'll need to create the data file and include it in the R package, and you may want to also document it. Here's how to do both.
Create the data file and include it in R package
Create a directory inside the package called /data and place any data in it. Use only .rda and .RData files.
When creating the rda/RData file from an R object, make sure the R object is named what you want it to be named when it's used in the package and use save() to create it. Example:
save(river_fish, file = "data/river_fish.rda", version = 2)
Add this on a new line in the file called DESCRIPTION:
LazyData: true
Documenting the dataset
Document the dataset by placing a roxygen block directly above a string containing the dataset's name (the string must match the object's name):
#' This is data to be included in my package
#'
#' @author My Name \email{blahblah@@roxygen.org}
#' @references \url{data_blah.com}
"river_fish"
The dataset documentation in dplyr provides some nice examples of this.
Notes
To access the data in the package, run river_fish or whatever the name of the dataset is. Nothing more is needed.
Using version = 2 when calling save() ensures your data object is available to older versions of R (i.e. prior to 3.5.0); in other words, it prevents this warning:
WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.
There is no need to use load() in the R package; just call the object directly (e.g. river_fish alone will yield the data from data/river_fish.rda). But in the event you do wish to load an rda/RData file for some reason (e.g. playing around or testing), this will do it:
load("data/river_fish.rda")

When should data go in /data, and when should it go in /inst/extdata?

The Writing R Extensions manual states:
The data subdirectory is for data files, either to be made available via lazy-loading or for loading using data(). (The choice is made by the ‘LazyData’ field in the DESCRIPTION file: the default is not to do so.) It should not be used for other data files needed by the package, and the convention has grown up to use directory inst/extdata for such files.
But it is still not clear what data is "required" by a package. I would like to use data for the following (not always mutually exclusive) reasons:
documentation
function examples
function tests
vignettes
to provide access to an original data set
to make data available to functions within the package (e.g. a lookup table / dictionary)
But it is not clear which of these should go in the data folder, and which should go in inst/extdata. And are there any conditions under which "data" should go elsewhere?
Related questions: Previous questions (e.g. inst and extdata folders in R Packaging and Using inst/extdata with vignette during package checking R 2.14.0) give some instructions on use, but don't tell me how to decide which directory to use. Another question, R - where should I place RDA file - /R, /data, /inst/extdata?, gets the closest, but seems to focus specifically on RDA and RData files.
The data directory supplies data for the data() function and is expected to follow certain customs in terms of file formats and extensions.
The inst/extdata directory becomes extdata/ when installed. It is more of a wild west: you can do whatever you want with it, but you are expected to write your own accessors.
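For example, a minimal accessor sketch (the package name and file name are hypothetical):
# returns the installed path of a file shipped under inst/extdata
example_path <- function(file) {
  system.file("extdata", file, package = "myPackage", mustWork = TRUE)
}
dat <- read.csv(example_path("exampleData.csv"))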
It may be useful to look at empirics. On my machine, among some 240 installed packages, a full 77 (not quite a third) have data/, but only 4 (including one of mine) have extdata.
