How does lazydata loading work in R package installation?

I want to expose data that is already published in my data/ directory of my R package skeleton. See this link for "External data" sharing basics: http://r-pkgs.had.co.nz/data.html.
My data is stored in .txt format. If you don't want to load the data lazily (i.e., by loading the package with require(myRpackage) and then calling data(datasetName)), you can read it in normally using base R functions such as read.table() or read.csv2().
My dataset is called "publishedData.txt" in this example, and can be loaded as below, which works beautifully:
tmp = read.table("/dir/to/R/package/data/publishedData.txt", sep="\t", header=TRUE)
However, when I go to re-install my R package with this shiny new and wonderful data, I get the following failure message, over and over (pasted below).
Downloading GitHub repo myGitRepo/myRpackage#master
from URL https://api.github.com/repos/myGitRepo/myRpackage/zipball/master
Installing myRpackage
library='/Library/Frameworks/R.framework/Versions/3.5/Resources/library' --install-tests
* installing *source* package ‘myRpackage’ ...
** R
** data
*** moving datasets to lazyload DB
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 1 did not have 215 elements
ERROR: lazydata failed for package ‘myRpackage’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/myRpackage’
Installation failed: Command failed (1)
Note: the above GitHub repo isn't real. I'm writing a generic post, so don't try to install this fake R package yourself.
My question: how do I debug the lazydata load when I don't know how it is performed? That is, what code decides whether the data in publishedData.txt in my data/ folder is okay or not? I can see from the error that scan() is involved, and I'd expect it to use sep="\t" for a .txt file; beyond that, I'm not sure what's tripping it up.
Things I've tried:
I've scrubbed my header names as best as I can (e.g., removing non-alphabetical characters from column and row name strings).
I've also removed every column (other than the rownames column) that contains string data instead of numerical data, just in case stringsAsFactors defaults to TRUE during lazydata loading (which would slow things down a lot).
Also, I've restarted R after each re-install attempt...
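I've also tried to reproduce the failure outside of R CMD INSTALL. From my reading of ?data and "Writing R Extensions" (this part is my assumption, not something from the install log), the lazydata step essentially runs data() over the files in data/, and a .txt file is read with read.table(..., header = TRUE), i.e. the default whitespace separator rather than sep = "\t". A minimal sketch of that check:
# assumption: this mirrors how the installer reads a data/*.txt file
tmp <- try(read.table("/dir/to/R/package/data/publishedData.txt", header = TRUE))
# count.fields() reports how many fields each line parses into, which helps
# locate the mismatch behind "line 1 did not have 215 elements"
table(count.fields("/dir/to/R/package/data/publishedData.txt", sep = "\t"))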

Okay, so I figured out a way to get this to work, without having to actually understand what was tripping it up.
Say your dataset loads fine with read.table(), but the package won't reinstall because the lazydata step fails as described above. Chances are, your headers / rownames are off. A quick solution is just to do this:
# Load your data into R the way it works
tmp = read.table("/dir/to/R/package/data/publishedData.txt", sep="\t", header=TRUE)
# Write data to same file with these arguments
write.table(tmp, file="/dir/to/R/package/data/publishedData.txt", sep="\t", row.names = TRUE, col.names = TRUE)
Then update your GitHub repo with git and try to reinstall the R package; it will work this time around. The difference in the .txt file is the header row: written this way, the first line has no label above the rownames column, so it starts directly with the name of column 1 of your data matrix. Every following line starts with a row name followed by the data values. So the header line technically has one fewer element than the data lines, which is exactly the layout read.table() (and the lazydata step) recognises as a table with row names; parsed any other way, the lines look inconsistent.
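To make that concrete, here is a tiny made-up example (not my real data) of the layout write.table() produces with row.names = TRUE:
m <- matrix(1:4, 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
write.table(m, sep = "\t", row.names = TRUE, col.names = TRUE)
#> "c1"  "c2"
#> "r1"  1   3
#> "r2"  2   4
# the header line has two fields, the data lines have three; the extra
# leading field on each data line is taken as the row names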
Hope it helps someone else. :-)

Related

unable to open .dat files on R even with haven installed

So I use SGA Tools for processing my images, and it gives back results in .dat files. In order to work on this data in R, I tried to import the .dat file using the haven package. I installed haven and loaded its library, but I am still not able to import the data, and it gives this error message.
Error: Failed to parse C:/Users/QuRana/Desktop/SGA Tools/Plate_Image_Example (1).dat: This version of the file format is not supported.
When I run install.packages("haven"), haven installs fine, but then when I load the library using library(haven), nothing appears on my console except for this:
> library(haven)
Then when I use this code:
datatrial1 <- read_dta("C:/Users/QuRana/Desktop/SGA Tools/Plate_Image_Example (1).dat")
It gives me the error mentioned above. When I try converting my .dat file to a .csv file and loading that instead, the imported data has extra "\t" characters before the values in every column except the first one, like this:
Flags: S - Colony spill or edge interference C - Low colony circularity
# row\tcol\tsize\tcircularity\tflags
1\t1\t4355\t0.9053\t
1\t2\t4456\t0.8401\t
1\t3\t3439\t0.8219\t
1\t4\t3215\t0.8707\t
All the \t's before the numeric values are not what I want. Another issue I am facing is that I cannot install the gitter package on my R version, which is R 4.2.2.
You can read your tab-separated file like so: read.delim("file_path", header = TRUE, sep = "\t")
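A slightly fuller sketch, assuming the file is plain tab-separated text laid out exactly as shown in the question (a "Flags:" note line, then a commented header line, then the data) rather than a Stata file, which is why haven::read_dta() refuses it:
# skip = 2 drops the "Flags:" line and the "# row ..." header line
sga <- read.delim("C:/Users/QuRana/Desktop/SGA Tools/Plate_Image_Example (1).dat",
                  sep = "\t", header = FALSE, skip = 2,
                  col.names = c("row", "col", "size", "circularity", "flags"))
head(sga)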

Error when exporting R data frame using openxlsx ("Error in zipr")

Usually I use the openxlsx package and the write.xlsx function when exporting R data frames into .xlsx files. Since yesterday - probably after I was using the XLConnect package - something got messed up and the write.xlsx function doesn't work anymore. This is the error I get:
Error in zipr(zipfile = tmpFile, include_directories = FALSE, files = list.files(path = tmpDir, :
  unused argument (include_directories = FALSE)
Unfortunately, I don't understand what this error means. Thanks for any helpful advice.
Edit: The function works when I use an older openxlsx version (4.1.0).
I was getting the same error.
I think the problem is with the dependencies of openxlsx. There is a "zipR" package that might get picked up when you install openxlsx, while the actual dependency is the zip package:
https://cran.r-project.org/web/packages/zip/index.html
https://cran.r-project.org/web/packages/zipR/zipR.pdf
I installed "zip" along with openxlsx and I don't get the error anymore.
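In code, the fix was simply this (a sketch of what I ran; reinstalling openxlsx afterwards may not even be strictly necessary):
install.packages("zip")       # the real dependency
install.packages("openxlsx")  # reinstall openxlsx afterwards
library(openxlsx)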
I do not really understand the error message here, but my computer does not allow me to save files to "c:/". If I remove the "c:/" part, it works fine and saves the file to the current working directory.
library(openxlsx)
df <- data.frame('x' = c(1,2,3),
'y' = c(3,2,1))
openxlsx::write.xlsx(df, "test.xlsx")
You could also try another package: writexl
writexl::write_xlsx(df, "text5.xlsx")
This works on my machine.
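If reinstalling doesn't help, another option, following the questioner's edit about version 4.1.0 still working, is to pin that older release with the remotes package:
# install a specific archived version of openxlsx from CRAN
install.packages("remotes")
remotes::install_version("openxlsx", version = "4.1.0")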

creating reproducible example using reprex package in r where a local file is being read

I often use reprex::reprex to create reproducible examples of R code to get help from others to get rid of errors in my code. Usually, I create minimal examples using datasets like iris or mtcars and it works well.
But I always fail to use reprex whenever I need my own data, since the problem is so specific that I can't rely on datasets from the datasets library.
In that case, I get the following error:
# loading needed libraries
library(ggplot2)
library(cowplot)
library(devtools)
# reading the datafile
data <- utils::read.csv(file = "data.csv")
#> Warning in file(file, "rt"): cannot open file 'data.csv': No such file or
#> directory
#> Error in file(file, "rt"): cannot open the connection
Created on 2018-02-19 by the reprex package (v0.2.0).
There is a great discussion from the pre-reprex era elsewhere (How to make a great R reproducible example?). The author recommends using something like dput():
If you have some data that would be too difficult to construct using
these tips, then you can always make a subset of your original data,
using eg head(), subset() or the indices. Then use eg. dput() to
give us something that can be put in R immediately
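For example (a tiny sketch with a built-in dataset), dput() prints a structure() call that can be pasted straight into the reprex:
dput(head(mtcars, 2))
# prints something like this (abridged to two columns here),
# which you then paste into the reprex:
df <- structure(list(mpg = c(21, 21), cyl = c(6, 6)),
                row.names = c("Mazda RX4", "Mazda RX4 Wag"),
                class = "data.frame")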
But the same discussion also mentions:
If your data frame has a factor with many levels, the dput output
can be unwieldy because it will still list all the possible factor
levels even if they aren't present in the subset of your data.
So, if I want to work with my full dataset, this is not a good option.
In summary:
Does anyone know how to create a reprex that is standalone even if it relies on a local file containing all of your data?
By default, reprex strongly encourages execution in the session temp directory. But sometimes it is unavoidable to refer to a specific local file, so yes, there has to be a way to do this.
To request that all work be done in current working directory, set outfile = NA. (More generally, you can use the outfile argument to specify a base file name and path.)
If I submit this reprex, with working directory set to my home directory:
reprex({
  getwd()
  writeLines(c("V1,V2", "a,b"), "precious_data.csv")
  list.files(pattern = "*.csv")
  read.csv("precious_data.csv")
},
outfile = NA,
venue = "so"
)
I get this output:
getwd()
#> [1] "/Users/jenny"
writeLines(c("V1,V2","a,b"), "precious_data.csv")
list.files(pattern = "*.csv")
#> [1] "precious_data.csv"
read.csv("precious_data.csv")
#> V1 V2
#> 1 a b
Created on 2018-09-19 by the reprex package (v0.2.1)
Using outfile = NA or outfile = "path/to/desired/file/base" is the general pattern for asserting control over the location of all files generated by reprex().

Trouble providing data sets with package

I have two data sets full and raw that I placed in the data/ directory of my package. However, when I load my package, they are not available. I tried looking for them using the data function, but did not see them.
data(raw, package = "pkg")
Warning message:
In data(raw, package = "pkg") : data set 'raw' not found
Do I have to export them somehow?
I noticed that when I tried to open the file using load() from another computer, it read in as a string. Maybe I'm not writing the data frames properly? I used:
save(df.full, file = "full.RData")
save(df.raw, file = "raw.RData")
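Or do the object names need to match the dataset/file names? This is my guess at the conventional layout (an assumption on my part):
# object name == dataset name == file name
full <- df.full
raw  <- df.raw
save(full, file = "data/full.RData")
save(raw,  file = "data/raw.RData")
# plus LazyData: true in DESCRIPTION if the data should be available
# as soon as the package is loaded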

Package that downloads data from the internet during installation

Is anyone aware of a package that downloads a dataset from the internet during the installation process and then prepares and saves it so that it is available when loading the package using library(packageName)? Are there any drawbacks to this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?
EDIT: Some background. The data is three tab-separated files in a ZIP archive, published by a federal statistics office and generally freely accessible. I have R code which downloads, extracts, and prepares the data; in the end three data frames are created which could be saved in .RData format.
I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.
I did this mockup before, while you were posting your edit. I presume it would work, but it's not tested. I've commented it so you can see what you would need to change. The idea here is to check whether an expected object is available in the current working environment. If it is not, check whether the file that contains the data is in the current working directory. If that is not found either, prompt the user to download the file, then proceed from there.
myFunction <- function(this, that, dataset = NULL) {
  # We're giving the user a chance to specify the dataset.
  # Maybe they have already downloaded it and saved it.
  if (is.null(dataset)) {
    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    # contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (isTRUE(list.files(
        pattern = "^DATAFILE.RData$") == "DATAFILE.RData")) {
        load("DATAFILE.RData")
        # If neither of those are successful, prompt the user
        # to download the dataset.
      } else {
        ans = readline(
          "DATAFILE.RData dataset not found in working directory.
          OBJECTNAME object not found in workspace. \n
          Download and load the dataset now? (y/n) ")
        if (ans != "y")
          return(invisible())
        # I usually use RCurl in case the URL is https
        require(RCurl)
        baseURL = c("http://some/base/url/")
        # Here, we actually download the data
        temp = getBinaryURL(paste0(baseURL, "DATAFILE.RData"))
        # Here we load the data
        load(rawConnection(temp), envir = .GlobalEnv)
        message("OBJECTNAME data downloaded from \n",
                paste0(baseURL, "DATAFILE.RData \n"),
                "and added to your workspace\n\n")
        rm(temp, baseURL)
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}
Two packages, hosted at Github
Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:
Check to see if the data package is installed
Check to see if the version of the data package you have installed matches the version at Github, which we are going to assume is the most up to date version.
When it fails any of the checks, it asks the user if they want to update their installation of the package. In this case, for demonstration, I've linked to one of my packages in progress at Github. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.
CheckVersionFirst <- function() {
  # Check to see if installed
  if (!"StataDCTutils" %in% installed.packages()[, 1]) {
    Checks <- "Failed"
  } else {
    # Compare version numbers
    require(RCurl)
    temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    CurrentVersion <- gsub("^\\s|\\s$", "",
                           gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    if (packageVersion("StataDCTutils") == CurrentVersion) {
      Checks <- "Passed"
    }
    if (packageVersion("StataDCTutils") < CurrentVersion) {
      Checks <- "Failed"
    }
  }
  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans = readline(
        "StataDCTutils is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      require(devtools)
      install_github("StataDCTutils", "mrdwab")
    })
  # Some cool things you want to do after you are sure the data is there
}
Try it out with CheckVersionFirst().
Note: This would succeed only if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to Github!
So, to clarify/recap/expand, the basic idea would be to:
Periodically push the updated version of your data package to Github, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
Integrate this CheckVersionFirst() function as an .onLoad event in your code package (a minimal sketch follows this list). Obviously modify the function to match your account and package name.
Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data....
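A minimal sketch of step 2 (my wiring, not part of the original answer), e.g. in the code package's R/zzz.R:
.onLoad <- function(libname, pkgname) {
  # run the version/installation check whenever the code package is loaded
  CheckVersionFirst()
}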
This method may not be efficient, but it is a good workaround. If you are making a package that needs regularly updated data, first make a package which has that data. It does not need any functions, but I like the concept of a setter (which you might not need in this case) and a getter.
Then when you make your package, have the 'data'-package as a dependency. This way, whenever someone installs your package, he/she will always have the latest data.
On your part, you'll just have to swap out the data in your 'data' package, and upload it to the repo you want.
If you don't know how to build a package, check ?package.skeleton and R CMD check / R CMD build.
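For completeness, making the 'data' package a dependency is just a field in the code package's DESCRIPTION ("myDataPkg" is a placeholder name):
Depends: myDataPkg (>= 1.0.0)
With that in place, installing the code package also pulls in (at least) the stated version of the data package.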
