How to use Rscript with readr to get data from AWS S3

I have some R code using the readr package that works well on a local computer: I use list.files() to find files with a specific extension and then use readr to operate on the files found.
My question: I want to do something similar with files in AWS S3 and I am looking for some pointers on how to use my current R code to do the same.
Thanks in advance.
What I want:
Given an AWS S3 folder/file structure like this
- /folder1/subfolder1/quant.sf
- /folder1/subfolder2/quant.sf
- /folder1/subfolder3/quant.sf
and so on, where every subfolder has the same file 'quant.sf', I would like to get a data frame of the S3 paths and then use the R code shown below to operate on all the quant.sf files.
Below, I am showing R code that works currently with data on a Linux machine.
get_quants <- function(path1, ...) {
  additionalPath <- list(...)
  suppressMessages(library(tximport))
  suppressMessages(library(readr))
  # Find all quant.sf files under path1 and build their full paths
  salmon_filepaths <- file.path(path1, list.files(path1, recursive = TRUE, pattern = "quant.sf"))
  # Extract sample names from the directory structure
  samples <- data.frame(samples = gsub(".*?quant/salmon_(.*?)/quant.sf", "\\1", salmon_filepaths))
  row.names(samples) <- samples[, 1]
  names(salmon_filepaths) <- samples$samples
  # If no tx2gene mapping is available, we will only get isoform-level counts
  salmon_tx_data <- tximport(salmon_filepaths, type = "salmon", txOut = TRUE)
  ## Get transcript count summarization
  write.csv(as.data.frame(salmon_tx_data$counts), file = "tx_NumReads.csv")
  ## Get TPM
  write.csv(as.data.frame(salmon_tx_data$abundance), file = "tx_TPM_Abundance.csv")
  if (length(additionalPath) > 0) {
    tx2geneFile <- additionalPath[[1]]
    my_tx2gene <- read.csv(tx2geneFile, sep = "\t", stringsAsFactors = FALSE, header = FALSE)
    salmon_tx2gene_data <- tximport(salmon_filepaths, type = "salmon", txOut = FALSE, tx2gene = my_tx2gene)
    ## Get gene count summarization
    write.csv(as.data.frame(salmon_tx2gene_data$counts), file = "tx2gene_NumReads.csv")
    ## Get TPM
    write.csv(as.data.frame(salmon_tx2gene_data$abundance), file = "tx2gene_TPM_Abundance.csv")
  }
}

I find it easiest to use the aws.s3 R package for this. In this case, you would use the s3read_using() and s3write_using() functions to read from and write to S3. Like this:
library(aws.s3)
my_tx2gene <- s3read_using(FUN = read.csv, object = "[path_in_s3_to_file]", sep = "\t", stringsAsFactors = FALSE, header = FALSE)
It is basically a wrapper around whatever function you want to use for file input/output. It works great with read_json, saveRDS, or anything else!
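To address the listing part of the question (finding every quant.sf under a prefix), one option is to combine get_bucket_df() from aws.s3, which returns a data frame of object keys, with save_object(), which downloads each file locally; tximport expects local paths, so downloading first keeps the existing get_quants() function unchanged. A minimal sketch, assuming your credentials are already set and using "my-bucket" and "folder1/" as placeholder bucket and prefix names:
library(aws.s3)

# List every object under the prefix and keep only the quant.sf files.
objects <- get_bucket_df(bucket = "my-bucket", prefix = "folder1/", max = Inf)
quant_keys <- objects$Key[grepl("quant\\.sf$", objects$Key)]

# Download each quant.sf into a local directory that mirrors the S3 layout,
# then reuse the existing get_quants() function on that local copy.
local_dir <- "s3_quants"
for (key in quant_keys) {
  dest <- file.path(local_dir, key)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  save_object(object = key, bucket = "my-bucket", file = dest)
}

# Data frame of S3 keys and their local copies, as described in the question.
quant_paths <- data.frame(s3_key = quant_keys,
                          local_path = file.path(local_dir, quant_keys),
                          stringsAsFactors = FALSE)
get_quants(local_dir)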

Related

Using R to merge many large CSV files across sub-directories

I have over 300 large CSV files with the same filename, each in a separate sub-directory, that I would like to merge into a single dataset using R. I'm asking for help on how to remove columns I don't need in each CSV file, while merging in a way that breaks the process down into smaller chunks that my memory can more easily handle.
My objective is to create a single CSV file that I can then import into STATA for further analysis using code I have already written and tested on one of these files.
Each of my CSVs is itself rather large (about 80 columns, many of which are unnecessary, and each file has tens to hundreds of thousands of rows), and there are almost 16 million observations in total, or roughly 12GB.
I have written some code which manages to do this successfully for a test case of two CSVs. The challenge is that neither my work nor my personal computers have enough memory to do this for all 300+ files.
The code I have tried is here:
library(here)  ## loads the here package to find files

## Creates a list of the files, read as [1], [2], ... [n]
(allfiles = list.files(path = here("data"),        ## path to the data folder
                       pattern = "candidates.csv",  ## identifies the relevant files
                       full.names = TRUE,           ## returns the full file name
                       recursive = TRUE))           ## searches in sub-directories

## Reads a single file
read_fun = function(path) {
  test = read.csv(path, header = TRUE)
  test
}

## Tests that file [1] can be read
(test = read.csv(allfiles[1], header = TRUE))

library(purrr)  ## loads package to unlock map_dfr
library(dplyr)  ## loads package to unlock map_dfr

(combined_dat = map_dfr(allfiles, read_fun))
I expect the result to be a single RDS file, and this works for the test case. Unfortunately, the amount of memory this process requires when looking at 15.5m observations across all my files causes RStudio to crash, and no RDS file is produced.
I am looking for help on how to 1) reduce the load on my memory by stripping out some of the variables in my CSV files I don't need (columns with headers junk1, junk2, etc); and 2) how to merge in a more manageable way that merges my CSV files in sequence, either into a few RDS files to themselves be merged later, or through a loop cumulatively into a single RDS file.
However, I don't know how to proceed with these - I am still new to R, and any help on how to proceed with both 1) and 2) would be much appreciated.
Thanks,
Twelve GB is quite a bit for one object. It's probably not practical to use a single RDS or CSV unless you have far more than 12GB of RAM. You might want to look into using a database, a technology that is made for this kind of thing. I'm sure Stata can also interact with databases. You might also want to read up on how to interact with large CSVs using various strategies and packages.
Creating a large CSV isn't at all difficult. Just remember that you have to work with said giant CSV sometime in the future, which probably will be difficult. To create a large CSV, just process each component CSV individually and then append them to your new CSV. The following reads in each CSV, removes unwanted columns, and then appends the resulting dataframe to a flat file:
library(dplyr)
library(readr)
library(purrr)

load_select_append <- function(path) {
  # Read in CSV. Let every column be of class character.
  df <- read_csv(path, col_types = paste(rep("c", 82), collapse = ""))
  # Remove variables beginning with "junk"
  df <- select(df, -starts_with("junk"))
  # If the file exists, append to it without column names; otherwise create it
  # with column names.
  if (file.exists("big_csv.csv")) {
    write_csv(df, "big_csv.csv", col_names = FALSE, append = TRUE)
  } else {
    write_csv(df, "big_csv.csv", col_names = TRUE)
  }
}

# Get the paths to the CSVs.
csv_paths <- list.files(path = "dir_of_csvs",
                        pattern = "\\.csv.*",
                        recursive = TRUE,
                        full.names = TRUE)

# Apply the function to each path.
walk(csv_paths, load_select_append)
When you're ready to work with your CSV you might want to consider using something like the ff package, which enables interaction with on-disk objects. You are somewhat restricted in what you can do with an ffdf object, so eventually you'll have to work with samples:
library(ff)
df_ff <- read.csv.ffdf(file = "big_csv.csv")
df_samp <- df_ff[sample.int(nrow(df_ff), size = 100000),]
df_samp <- mutate(df_samp, ID = factor(ID))
summary(df_samp)
#### OUTPUT ####
values ID
Min. :-9.861 17267 : 6
1st Qu.: 6.643 19618 : 6
Median :10.032 40258 : 6
Mean :10.031 46804 : 6
3rd Qu.:13.388 51269 : 6
Max. :30.465 52089 : 6
(Other):99964
As far as I know, chunking and on-disk interactions are not possible with RDS or RDA, so you are stuck with flat files (or you go with one of the other options I mentioned above).
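If you prefer to stay within the tidyverse, readr also offers read_csv_chunked(), which processes a large flat file in pieces so the whole thing never has to fit in memory at once. A minimal sketch, assuming the combined "big_csv.csv" from above and using a chunk-wise filter on the ID column as the example operation:
library(readr)
library(dplyr)

# Process "big_csv.csv" in chunks of 100,000 rows; each chunk is filtered
# and the results are row-bound into a single, much smaller data frame.
filtered <- read_csv_chunked(
  "big_csv.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    filter(chunk, !is.na(ID))
  }),
  chunk_size = 100000,
  col_types = cols(.default = col_character())
)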

R with testthat: Get list of exported functions

I ship a text file with all exported functions listed. To make sure that all functions are listed, I would like to create a unit test via testthat and compare all exported functions with the ones in the text file. My current approach reads in the file and compares it with ls("package:myPackage"), but this call returns a long list of all functions of all imported packages. Any ideas how to solve this?
A completely different approach would be to generate this file automatically, but I think the first approach is easier to realise. Hopefully.
Thanks to @Emmanuel-Lin, here is my solution:
require(data.table)

test_that("Function List", {
  # Read NAMESPACE and extract the exported functions
  funnamespace = data.table(read.table(system.file("NAMESPACE", package = "PackageName"),
                                       stringsAsFactors = FALSE))
  funnamespace[, c("status", "fun") := tstrsplit(V1, "\\(")]
  funnamespace[, fun := tstrsplit(fun, "\\)")]
  # Read the function list
  funlist = read.csv2(system.file("subdirectory", "functionList.txt", package = "PackageName"),
                      stringsAsFactors = FALSE)
  # Test
  expect_equal(funnamespace[status == "export", fun], funlist[, 1])
})
Obviously, I was too lazy to work out the regular expression that would replace the two tstrsplit calls with one.
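A simpler alternative, if parsing the NAMESPACE file feels fragile: base R's getNamespaceExports() returns the exported names of a package directly, without picking up imports. A minimal sketch along the same lines (the path to functionList.txt is the same assumption as above):
test_that("Function List", {
  # Exported names straight from the package namespace (no imports included)
  exported <- sort(getNamespaceExports("PackageName"))
  # The shipped text file listing the exported functions
  funlist <- read.csv2(system.file("subdirectory", "functionList.txt", package = "PackageName"),
                       stringsAsFactors = FALSE)
  expect_equal(exported, sort(funlist[, 1]))
})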

How can I source configuration data in an R package function?

If I were to use a configuration file in a normal R script, I would do this:
config.R
a <- 1
b <- 2
c <- 3
RScript
source('config.R')
d = a + b + c
# do stuff
How would I do this inside an R package? Can I keep a config file and source it inside an R function? Or should I include a,b,c in every function? What's the best practice?
If the configs are to be contained in the R package itself:
Store the config file(s) in the inst/configs folder.
After the package installation, the configs are contained in the configs folder of the package installation directory (see .libPaths()).
Source the config file using the package installation directory from within a package function:
# Exported by the package as myPackage::load_config (you cannot assign into a
# namespace with `::`, so the function is simply defined and exported)
load_config <- function(config_file_name = "default_config.R",
                        config_file_path = system.file("configs",
                                                       package = getPackageName(),
                                                       mustWork = TRUE)) {
  env <- new.env()    # all values are then contained in a separate environment
  # env <- globalenv()  # to make the variables visible in the client's environment
  config_file_FQN <- file.path(config_file_path, config_file_name)
  source(config_file_FQN, local = env, keep.source = TRUE)
  return(env)
}
The client can then trigger the configuration and use it (e.g. pass it around):
# client call
myConf <- myPackage::load_config()
print(myConf$YourVariableName)
Or store the environment with the configured variables within the package
as a package-global variable, see this example code (sorry, too much to explain here):
https://github.com/aryoda/tryCatchLog/blob/master/R/zzz.R#L47
1: One option would be to have these as default values in your functions, as in
my_fun <- function(..., a = 1, b = 2) and so on.
2: Given that what you have in a package is functions, you can declare these values in your main functions, so the other functions called by them have access to them.
3: Another option would be to keep them as functions:
a <- function() 1
Now you can call a() whenever you want, as in a() + 2.
4: Another option would be to use environments. I haven't used those much. I think you'll find this useful, in particular the section on Package state; a sketch of that pattern follows below.
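For option 4, a common pattern is a package-local environment that holds the configuration and is accessed through small getter/setter functions, which is roughly what the "Package state" section describes. A minimal sketch; the names .pkg_config, set_config, and get_config are illustrative, not from the original answer:
# An environment created when the package is built; it lives inside the
# package namespace, so it is not visible in the user's global environment.
.pkg_config <- new.env(parent = emptyenv())

# Store a configuration value.
set_config <- function(name, value) {
  assign(name, value, envir = .pkg_config)
  invisible(value)
}

# Retrieve a configuration value, with an optional default.
get_config <- function(name, default = NULL) {
  if (exists(name, envir = .pkg_config, inherits = FALSE)) {
    get(name, envir = .pkg_config, inherits = FALSE)
  } else {
    default
  }
}

# Usage from other package functions:
# set_config("a", 1)
# get_config("a")  # 1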

Extract multiple json documents from aws s3 using R

I am currently experimenting with extracting documents from AWS S3 using R. I have successfully managed to extract one document and create a data frame from it. I would like to be able to extract multiple documents which are within multiple sub-folders of eventstore/footballStats/.
The code below demonstrates one document being pulled, which works.
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat")) # runs an update for aws S3
library(aws.s3)
# Set credentials for S3 ####
Sys.setenv("AWS_ACCESS_KEY_ID" = "KEY","AWS_SECRET_ACCESS_KEY" = "AccessKey")
# Extracts 1 document raw vector representation of an S3 documents ####
DataVector <-get_object("s3://eventstore/footballStats/2017-04-22/13/01/doc1.json")
I have then tried the code below to pull all documents from the folder and sub-folders, but I receive an error.
DataVector <-get_object("s3://eventstore/footballStats/2017-04-22/*")
ERROR :
chr "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error>
<Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><K"| __truncated__
Is there an alternative R package I should be using? Or does get_object() only work for one document, meaning I should use another function from the aws.s3 library?
Based on the hints from Drj and Thomas, I was able to solve this.
### Displays buckets in S3 ####
bucketlist()

### Builds a data frame of the files in a bucket ###
dfBucket <- get_bucket_df('eventstore', 'footballStats/2017-04-22/')

# Creates paths based on the data in the bucket
path <- dfBucket$Key

### Extracts all data into values ####
s3Data <- NULL
for (lineN in path) {
  url <- paste('s3://eventstore/', lineN, sep = "")
  s3Vector <- get_object(url)
  s3Value <- rawToChar(s3Vector)
  s3Data <- c(s3Data, s3Value)
}
To create a data frame from the data, use tidyjson and dplyr. See the link below for a well-explained introduction:
https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html
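As a rough illustration of that last step, the character vector of JSON documents collected above can be turned into a flat data frame with tidyjson's spread_all(). This is only a sketch, since the actual structure of the footballStats JSON is not shown in the question:
library(tidyjson)
library(dplyr)

# s3Data is the character vector of JSON documents built in the loop above.
# spread_all() promotes every scalar field to its own column; nested arrays
# would need gather_array()/gather_object() depending on the real structure.
footballStats_df <- s3Data %>%
  as.tbl_json() %>%
  spread_all()

glimpse(footballStats_df)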

Run separate functions on multiple elements of list based on regex criteria in data.frame

The following works, but I feel I'm missing a functional programming technique, better indexing, or a better way of structuring my data. After a month it will take a while to remember exactly how this works, rather than being easy to maintain. It seems like a workaround when it shouldn't be. I want to use a regex to decide which function to use for each expected group of files. When a new file format comes along, I can write the read function, then add the function along with its regex to the data.frame to run it alongside all the rest.
I have different formats of Excel and csv files that need to be read in and standardized. I want to maintain a list or data.frame of the filename regexes and the appropriate function to use for each. Sometimes there will be new file formats that aren't matched yet, and old formats without new files, but then it gets complicated, which is something I would prefer to avoid.
# Files to read in, based on filename
fileexamples <- data.frame(
  filename = c('notanyregex.xlsx', 'regex1today.xlsx', 'regex2today.xlsx', 'nomatch.xlsx',
               'regex1yesterday.xlsx', 'regex2yesterday.xlsx', 'regex3yesterday.xlsx'),
  readfunctionname = NA
)

# Regex and corresponding read function
filesourcelist <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
  greptext             readfunction
  '.*regex1.*'         'readsheettype1'
  '.*nonematchthis.*'  'readsheetwrench'
  '.*regex2.*'         'readsheettype2'
  '.*regex3.*'         'readsheettype3'
")

# List of grepped files: one vector of matching row indices per regex
fileindex <- lapply(filesourcelist$greptext, function(greptext, files) {
  grepmatches <- grep(pattern = greptext, x = data.frame(files)[, 1], ignore.case = TRUE)
}, files = fileexamples$filename)

# Assign the read function name to each file based on the grep index
for (i in 1:length(fileindex)) {
  fileexamples[fileindex[[i]], 'readfunctionname'] <- filesourcelist$readfunction[i]
}
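Once readfunctionname is filled in, the remaining step the question alludes to, actually calling the chosen reader for each file, can be done with base R's match.fun(). A hedged sketch, assuming readsheettype1() and the other readers exist and each takes a filename:
# Call the matched read function on each file that found a match.
# Unmatched files (readfunctionname is NA) are skipped.
matched <- !is.na(fileexamples$readfunctionname)

results <- Map(function(fn_name, filename) {
  reader <- match.fun(fn_name)   # look up the function by its name
  reader(filename)
}, fileexamples$readfunctionname[matched], fileexamples$filename[matched])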
