Using tidyverse to read data from an S3 bucket - r

I'm trying to read a .csv file stored in an S3 bucket, and I'm getting errors. I'm following the instructions here, but either they don't work or I'm making a mistake, and I can't figure out what I'm doing wrong.
Here's what I'm trying to do:
# I'm working on a SageMaker notebook instance
library(reticulate)
library(tidyverse)
sagemaker <- import('sagemaker')
sagemaker.session <- sagemaker$Session()
region <- sagemaker.session$boto_region_name
bucket <- "my-bucket"
prefix <- "data/staging"
bucket.path <- sprintf("https://s3-%s.amazonaws.com/%s", region, bucket)
role <- sagemaker$get_execution_role()
client <- sagemaker.session$boto_session$client('s3')
key <- sprintf("%s/%s", prefix, 'my_file.csv')
my.obj <- client$get_object(Bucket=bucket, Key=key)
my.df <- read_csv(my.obj$Body) # This is where it all breaks down:
##
## Error: `file` must be a string, raw vector or a connection.
## Traceback:
##
## 1. read_csv(my.obj$Body)
## 2. read_delimited(file, tokenizer, col_names = col_names, col_types = col_types,
## . locale = locale, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, n_max = n_max, guess_max = guess_max,
## . progress = progress)
## 3. col_spec_standardise(data, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, guess_max = guess_max, col_names = col_names,
## . col_types = col_types, tokenizer = tokenizer, locale = locale)
## 4. datasource(file, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment)
## 5. stop("`file` must be a string, raw vector or a connection.",
## . call. = FALSE)
When working with Python, I can read a CSV file using something like this:
import pandas as pd
# ... Lots of boilerplate code
my_data = pd.read_csv(client.get_object(Bucket=bucket, Key=key)['Body'])
This is very similar to what I'm trying to do in R, and it works in Python... so why doesn't it work in R?
Can you point me in the right direction?
Note: Although I could use a Python kernel for this, I'd like to stick to R, because I'm more fluent with it than with Python, at least when it comes to dataframe crunching.

I'd recommend trying the aws.s3 package instead:
https://github.com/cloudyr/aws.s3
Pretty simple - set your env variables:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1",
           "AWS_SESSION_TOKEN" = "mytoken")
and then once that is out of the way:
aws.s3::s3read_using(read.csv, object = "s3://bucket/folder/data.csv")
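If you'd rather stay with readr, s3read_using() takes the reader as its FUN argument, so something along these lines should also work (a sketch; the bucket and key below are just the ones from your question):
library(aws.s3)
library(readr)
my.df <- s3read_using(FUN = read_csv,
                      object = "s3://my-bucket/data/staging/my_file.csv")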
Update: I see you're already familiar with boto and trying to use reticulate, so I'm leaving this easy wrapper for that here:
https://github.com/cloudyr/roto.s3
Looks like it has a great API, for example for the argument layout you're aiming to use:
download_file(
  bucket = "is.rud.test",
  key = "mtcars.csv",
  filename = "/tmp/mtcars-again.csv",
  profile_name = "personal"
)
read_csv("/tmp/mtcars-again.csv")

Related

R: Importing file using rio and here packages in a nested function

I'm working on functions that take the character string argument GSE_expt. I have written 4 separate functions, each of which takes the argument GSE_expt and produces output that I can save as a variable in the R environment.
The code block below has 2 of those functions. I use the paste0 function with the variable GSE_expt to create a file name that the here and rio packages can use to import the file.
# Extracting metadata from 2 different sources and combining them into a single file
extract_metadata <- function(GSE_expt){
  GSE_expt <- deparse(substitute(GSE_expt)) # make sure it is a character string
  metadata_1 <- rnaseq_metadata_allsamples %>% # subset a larger metadata file
    as_tibble %>%
    dplyr::filter(GSE == GSE_expt)
  # metadata from ENA imported using rio and here packages
  metadata_2 <- import(here("metadata", "rnaseq", paste0(GSE_expt, ".txt"))) %>%
    as_tibble %>%
    select("run_accession", "library_layout", "library_strategy", "library_source",
           "read_count", "base_count", "sample_alias", "fastq_md5")
  metadata <- full_join(metadata_1, metadata_2, by = c("Run" = "run_accession"))
  return(metadata)
}
# Extracting coverage stats obtained from samtools
clean_extract_coverage <- function(GSE_expt){
  coverage <- read_tsv(file = here("results", "rnaseq", "2022-01-11", "coverage",
                                   paste0("coverage_stats_", deparse(substitute(GSE_expt)), "_percent.txt")),
                       col_names = FALSE)
  coverage <- data.frame("Run" = coverage$X1[c(TRUE, FALSE)],
                         "stats" = coverage$X1[c(FALSE, TRUE)])
  coverage <- separate(coverage, stats,
                       into = c("num_reads", "covered_bases", "coverage_percent"),
                       convert = TRUE)
  return(coverage)
}
The functions work fine on their own when I use GSE118008 as the value of the argument GSE_expt.
I am trying to create a nested/combined function so that I can run GSE118008 on both (or more) functions at the same time and save the output as a list.
When I ran a nested/combined function,
extract_coverage_metadata <- function(GSE_expt){
  coverage <- clean_extract_coverage(GSE_expt)
  metadata <- extract_metadata(GSE_expt)
  return(metadata)
}
extract_coverage_metadata(GSE118008)
This is the error message I got.
Error: 'results/rnaseq/2022-01-11/coverage/coverage_stats_GSE_expt_percent.txt' does not exist.
Rather than creating the filename
coverage_stats_GSE118008_percent.txt
(which it does fine in the individual function), the combined function instead builds the filename coverage_stats_GSE_expt_percent.txt.
Traceback
8. stop("'", path, "' does not exist", if (!is_absolute_path(path)) { paste0(" in current working directory ('", getwd(), "')") }, ".", call. = FALSE)
7. check_path(path)
6. (function (path, write = FALSE) { if (is.raw(path)) { return(rawConnection(path, "rb")) ...
5. vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, col_types = col_types, id = id, skip = skip, col_select = col_select, name_repair = .name_repair, na = na, quote = quote, trim_ws = trim_ws, escape_double = escape_double, escape_backslash = escape_backslash, ...
4. vroom::vroom(file, delim = "\t", col_names = col_names, col_types = col_types, col_select = { { col_select ...
3. read_tsv(file = here("results", "rnaseq", "2022-01-11", "coverage", paste0("coverage_stats_", deparse(substitute(GSE_expt)), "_percent.txt")), col_names = FALSE) at rnaseq_functions.R#30
2. clean_extract_coverage(GSE_expt)
1. extract_coverage_metadata(GSE118008)
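As far as I can tell, deparse(substitute(GSE_expt)) inside the inner function picks up the wrapper's argument name rather than the symbol I originally typed. Here is a minimal example of what I mean (the function names are only for illustration):
f <- function(x) deparse(substitute(x))
g <- function(x) f(x)
f(GSE118008)  # "GSE118008" - substitute() sees the symbol passed directly
g(GSE118008)  # "x"         - substitute() only sees the wrapper's argument name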
I would appreciate any recommendations on how to solve this.
Thanks in advance!
Husain

R - avoid new column when merging csv files

I have a script that merges all CSV files in a folder.
My problem is that a new column named "...20" is created with empty data. How can I avoid that?
Thanks for helping.
My script:
folderfiles <- list.files(path = "//myserver/Depots/",
                          pattern = "\\.csv$",
                          full.names = TRUE)

data_csv <- folderfiles %>%
  set_names() %>%
  map_dfr(.f = read_delim,
          delim = ";")
and the message it prints (not included here).

It's difficult to debug this without access to the specific files. However, you can try specifying the columns you want to read using the cols_only() function. For example, suppose you only want to read the mpg column. You can do that in the following manner:
library("fs")
library("readr")
library("tidyverse")
# Generating some sample files
temp_dir_files <- path_temp("cars")
dir_create(temp_dir_files)
for (i in 1:10) {
write_csv(mtcars, file = path(temp_dir_files, paste0("cars", i, ".csv")))
}
# Selected column import
# read_* can handle a vector of paths
read_csv(
file = dir_ls(temp_dir_files, glob = "*.csv"),
col_types = cols_only(
mpg = col_double()
),
id = "input_file"
)
The cols_only() specification passed to read_csv() forces it to skip the remaining columns and import only the columns whose names match.
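Alternatively, if the stray column really is empty (this often happens when each line ends with a trailing ;), you could read everything and drop the name-repaired columns afterwards. A sketch, reusing the folderfiles vector from the question:
library(tidyverse)
data_csv <- folderfiles %>%
  set_names() %>%
  map_dfr(read_delim, delim = ";") %>%
  select(!starts_with("..."))  # drops columns created by name repair, e.g. "...20"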

Convert Stata 16 files to Stata 12 files using R

I am using RStudio (running R 4.0.1) and Stata 12 for Windows and have got a large number of folders with Stata 16 .dta files (and other types of files not relevant to this question). I want to create an automated process of converting all Stata 16 .dta files into Stata 12 format (keeping all labels) to then analyze.
Ideally, I want to keep the names of the original folders and files but save the converted versions into a new location.
This is what I have got so far:
setwd("C:/FilesLocation")

#vector with name of files to be converted
all_files <- list.files(pattern = "*.dta", full.names = TRUE)

for (i in all_files){
  #Load file to be converted into STATA12 version
  data <- read_dta("filename.dta",
                   encoding = NULL,
                   col_select = NULL,
                   skip = 0,
                   n_max = Inf,
                   .name_repair = "unique")
  #Write as .dta
  write_dta(data, "c:/directory/filename.dta", version = 12, label = attr(data, "label"))
}
Not sure this is the best approach. I know the commands inside the loop work for a single file, but I haven't managed to automate them for all files.
Your code only needs some very minor modifications. I've indicated the changes (along with comments explaining them) in the snippet below.
library(haven)

mypath <- "C:/FilesLocation"
all_files <- list.files(path = mypath, pattern = "*.dta", full.names = TRUE)

for (i in 1:length(all_files)){
  #(Above) iterations need the length of the vector to be specified
  #Load file to be converted into STATA12 version
  data <- read_dta(all_files[i], #You want to read the ith element in all_files
                   encoding = NULL,
                   col_select = NULL,
                   skip = 0,
                   n_max = Inf,
                   .name_repair = "unique")
  #Add a _v12 to the filename to
  #specify that it is version 12 now
  new_fname <- paste0(unlist(strsplit(basename(all_files[i]), "\\."))[1],
                      "_v12.", unlist(strsplit(basename(all_files[i]), "\\."))[2])
  #Write as .dta
  #with this new filename
  write_dta(data, path = paste0(mypath, "/", new_fname),
            version = 12, label = attr(data, "label"))
}
I tried this out with some .dta files from here, and the script ran without throwing up errors. I haven't tested this on Windows, but in theory it should work fine.
Edit: here is a more complete solution with read_dta and write_dta wrapped into a single function dtavconv. This function also lets the user convert files to an arbitrary version number (the default is 12).
#----
#.dta file version conversion function
dtavconv <- function(mypath = NULL, myfile = NULL, myver = 12){
  #Function to convert .dta file versions
  #Default version that files are converted to is v12
  #Default directory is whatever is specified by getwd()
  if(is.null(mypath)) mypath <- getwd()

  #Main code block wrapped in a tryCatch()
  myres <- tryCatch(
    {
      #Load file to be converted into STATA12 version
      data <- haven::read_dta(paste0(mypath, "/", myfile),
                              encoding = NULL,
                              col_select = NULL,
                              skip = 0,
                              n_max = Inf,
                              .name_repair = "unique")
      #Add a _v12 to the filename to
      #specify that it is version 12 now
      new_fname <- paste0(unlist(strsplit(basename(myfile), "\\."))[1],
                          "_v", myver, ".", unlist(strsplit(basename(myfile), "\\."))[2])
      #Write as .dta
      #with this new filename
      haven::write_dta(data, path = paste0(mypath, "/", new_fname),
                       version = myver, label = attr(data, "label"))
      message("\nSuccessfully converted ", myfile, " to ", new_fname, "\n")
    },
    error = function(cond){
      #message("Unable to write file", myfile, " as ", new_fname)
      message("\n", cond, "\n")
      return(NA)
    }
  )
  return(myres)
}
#----
The function can then be run on as many files as desired by invoking it via lapply or a for loop, as the example below illustrates:
#----
#Example run
library(haven)
#Set your path here below
mypath <- paste0(getwd(), "/", "dta")
#Check to see if this directory exists
#if not, create it
if(!dir.exists(mypath)) dir.create(mypath)
list.files(mypath)
# character(0)
#----
#Downloading some valid example files
myurl <- c("http://www.principlesofeconometrics.com/stata/airline.dta",
           "http://www.principlesofeconometrics.com/stata/cola.dta")
lapply(myurl, function(x){ download.file(url = x, destfile = paste0(mypath, "/", basename(x))) })
#Also creating a negative test case
file.create(paste0(mypath, "/", "anegcase.dta"))
list.files(mypath)
# [1] "airline.dta" "anegcase.dta" "cola.dta"
#----
#Getting list of files in the directory
all_files <- list.files(path = mypath, pattern = "*.dta")
#Converting files using dtavconv via lapply
res <- lapply(all_files, dtavconv, mypath = mypath)
#
# Successfully converted airline.dta to airline_v12.dta
#
#
# Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip,
# name_repair = .name_repair): Failed to parse /my/path/
# /dta/anegcase.dta: Unable to read from file.
#
#
#
# Successfully converted cola.dta to cola_v12.dta
#
list.files(mypath)
# [1] "airline_v12.dta" "airline.dta" "anegcase.dta" "cola_v12.dta"
# "cola.dta"
#Example for converting to version 14
res <- lapply(all_files, dtavconv, mypath = mypath, myver = 14)
#
# Successfully converted airline.dta to airline_v14.dta
#
#
# Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip,
# name_repair = .name_repair): Failed to parse /my/path
# /dta/anegcase.dta: Unable to read from file.
#
#
#
# Successfully converted cola.dta to cola_v14.dta
#
list.files(mypath)
# [1] "airline_v12.dta" "airline_v14.dta" "airline.dta" "anegcase.dta"
# "cola_v12.dta" "cola_v14.dta" "cola.dta"
#----
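As a follow-up on keeping the original folder names while saving to a new location: if the .dta files sit in nested folders, a sketch along these lines should mirror the folder structure under a new root (the two paths below are only placeholders):
library(haven)
src_root  <- "C:/FilesLocation"   #placeholder: root containing the Stata 16 files
dest_root <- "C:/Converted"       #placeholder: new location for the Stata 12 copies
#Find all .dta files in all subfolders
all_files <- list.files(src_root, pattern = "\\.dta$", recursive = TRUE, full.names = TRUE)
for (f in all_files){
  #Path relative to the source root, e.g. "folderA/file1.dta"
  rel <- substring(f, nchar(src_root) + 2)
  out_file <- file.path(dest_root, rel)
  #Recreate the folder structure under the new root
  dir.create(dirname(out_file), recursive = TRUE, showWarnings = FALSE)
  data <- read_dta(f)
  write_dta(data, path = out_file, version = 12, label = attr(data, "label"))
}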

Import file from environment instead of read.table

I am using someone else's package. As you can see, the function has an ImportHistData argument. I want to pass the rainfall object from my environment instead of the file rainfall.txt. When I replace rainfall.txt with rainfall, I get this error:
Error in read.table(x, header = FALSE, fill = TRUE, na.strings = y) :
'file' must be a character string or connection
So, how can I supply the data from the environment rather than from a text file?
Original form of the function call
DisagSimul(TimeScale = 1/4,
           BLpar = list(lambda = l, phi = f, kappa = k, alpha = a, v = v, mx = mx, sx = NA),
           CellIntensityProp = list(Weibull = FALSE, iota = NA),
           RepetOpt = list(DistAllowed = 0.1, FacLevel1Rep = 20, MinLevel1Rep = 50,
                           TotalRepAllowed = 5000),
           NumOfSequences = 10,
           Statistics = list(print = TRUE, plot = FALSE),
           ExportSynthData = list(exp = TRUE, FileContent = c("AllDays"), file = "15min.txt"),
           ImportHistData = list("rainfall.txt", na.values = "NA", FileContent = c("AllDays"),
                                 DaysPerSeason = length(rainfall$Day)),
           PlotHyetographs = FALSE, RandSeed = 5)
Source of ImportHistData part in the function
ImportHistDataFun(mode = 1, x = ImportHistData$file,
                  y = ImportHistData$na.values, z = ImportHistData$FileContent[1],
                  w = TRUE, s = ImportHistData$DaysPerSeason, timescale = 1)
First, check the package documentation (?DisagSimul) to see whether the method allows a data frame in memory to be used for the ImportHistData argument instead of reading from an external .txt file.
If the function is set up to only read a file from disk and you do not want to save your rainfall data frame permanently as a file, consider using a tempfile that exists only in the R session or until you use unlink():
# INITIALIZE TEMP FILE
tf <- tempfile(pattern = "", fileext = ".txt")

# EXPORT rainfall TO FILE
write.table(rainfall, tf, row.names = FALSE)

...

# USE TEMPFILE IN METHOD
DisagSimul(...
           ImportHistData = list(tf, na.values = "NA", FileContent = c("AllDays"),
                                 DaysPerSeason = length(rainfall$Day)),
           ...)
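Once the simulation has run and the file is no longer needed, it can be removed as mentioned above:
# REMOVE TEMP FILE WHEN DONE
unlink(tf)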

Avoid repeating statements when importing data

I've written the following code to import data into R:
## specify where all the data files are stored
DataFolder <- "DataFolder"

## obtain the name of each file in DataFolder
files <- list.files(DataFolder)

## obtain name of each file
LocNames <- unique(sub("^([^.]*).*", "\\1", files)) # this removes the extension and keeps the unique names

for (i in 1:length(LocNames)){
  #
  car <- read.table(paste(DataFolder, paste(LocNames[i], ".car", sep = ""), sep = "/"),
                    header = TRUE, sep = "\t", colClasses = c(dateTime = "POSIXct"))
  car <- aggregate(car[colnames(car)[2:length(colnames(car))]],
                   list(dateTime = cut(car$dateTime, breaks = "hour")),
                   mean, na.rm = TRUE)
  #
  light <- read.table(paste(DataFolder, paste(LocNames[i], ".light", sep = ""), sep = "/"),
                      header = TRUE, sep = "\t", colClasses = c(dateTime = "POSIXct"))
  light <- aggregate(light[colnames(light)[2]],
                     list(dateTime = cut(light$dateTime, breaks = "hour")),
                     mean, na.rm = TRUE)
}
So, here I have a DataFolder where all of my files are stored. The files are named according to the location where the data was recorded, and the file extension gives the name of the variable measured. Here we have car sales and light as examples.
From here I would like to reduce the repetition inside the loop: instead of naming one variable after the other and repeating the same steps, I want to write only the variable name, e.g. car or light, and have the script return the same result as shown.
Please let me know if my intentions have not been clear.
Just use a function. Something to the effect of
## specify where all the data files are stored
DataFolder <- "DataFolder"

## obtain the name of each file in DataFolder
files <- list.files(DataFolder)

readMyFiles <- function(DataFolder, LocName, extension){
  # build "<DataFolder>/<LocName>.<extension>" and read it
  data <- read.table(paste(DataFolder, paste(LocName, ".", extension, sep = ""), sep = "/"),
                     header = TRUE, sep = "\t", colClasses = c(dateTime = "POSIXct"))
  # aggregate all measurement columns to hourly means
  data <- aggregate(data[colnames(data)[2:length(colnames(data))]],
                    list(dateTime = cut(data$dateTime, breaks = "hour")),
                    mean, na.rm = TRUE)
  data
}

## obtain name of each file
LocNames <- unique(sub("^([^.]*).*", "\\1", files)) # this removes the extension and keeps the unique names

for (i in 1:length(LocNames)){
  car   <- readMyFiles(DataFolder, LocNames[i], "car")   # extension without the dot,
  light <- readMyFiles(DataFolder, LocNames[i], "light") # since the function adds it
}
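If you want to keep the results for every location rather than overwriting car and light on each pass, one option (just a sketch building on the function above) is to collect them in a named list:
results <- lapply(LocNames, function(loc) {
  list(car   = readMyFiles(DataFolder, loc, "car"),
       light = readMyFiles(DataFolder, loc, "light"))
})
names(results) <- LocNames
# access e.g. results[[LocNames[1]]]$car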
