Stream in CSV content in R

I understand how to read a CSV file that is stored on disk, but I don't know how to stream in CSV content via CLI using R.
E.g., Reading CSV file from disk using a simple CLI.
library(optparse)
option_list <- list(
# Absolute filepath to CSV file.
make_option(c("-c","--csv"),type="character",default=NULL,
help="CSV filepath",metavar="character")
);
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
csv_filepath <- opt$csv
csv <- read.csv(csv_filepath)
How would I do this if I'm working with a data stream?

R always reads from connections. A connection can be a file, a URL, in-memory text, and so on.
So, if you want to read CSV-format data from content that is already in memory, use the text= parameter instead of a file name.
Like this:
my_stream = "name;age\nJulie;25\nJohn;26"
read.csv(text = my_stream, sep = ";", header = TRUE)
The output will be:
name age
1 Julie 25
2 John 26
You can place additional parameters to read.csv() normally, of course.
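If the data arrives on standard input instead (for example via `cat test.csv | Rscript script.R`), the same idea should apply with a connection in place of text=. The following is a minimal sketch under that assumption. The helper name read_csv_stream is hypothetical, and a textConnection stands in for standard input so that the snippet runs on its own.

```r
# Hypothetical helper: read CSV data from any connection.
# In a script run with Rscript, pass file("stdin") to read piped input,
# e.g.  cat test.csv | Rscript script.R
read_csv_stream <- function(con, ...) {
  read.csv(con, ...)
}

# Stand-in for standard input so the example is self-contained.
con <- textConnection("name;age\nJulie;25\nJohn;26")
df <- read_csv_stream(con, sep = ";", header = TRUE)
close(con)  # read.csv does not close a connection that was already open
print(df)
```

In a real script, `read_csv_stream(file("stdin"), sep = ";")` would replace the textConnection lines.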

R source and package optparse.
First, write an R source file "example.R", such as the following.
#!/usr/bin/env Rscript
#
# R source: example.R
# options: -c --csv
#
library(optparse)
option_list <- list(
# Absolute filepath to CSV file.
make_option(c("-c","--csv"),type="character",default=NULL,
help="CSV filepath",metavar="character")
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
csv_filepath <- opt$csv
csv <- read.csv(csv_filepath)
message(paste("\nfile read:", csv_filepath, "\n"))
str(csv)
Then make the file executable, so that the bash shell honors the #! shebang line and runs Rscript, passing it the file.
In this case, I will change the user's permissions only, not the group's.
bash$ chmod u+x example.R
The test.
I have tested the above script with this data.frame:
df1 <- data.frame(id=1:5, name=letters[1:5])
write.csv(df1, "test.csv", row.names=FALSE)
Then, on Ubuntu 20.04 LTS, I ran ./example.R, passing it the CSV filename in the csv argument. The command and its output were:
bash$ ./example.R --csv=test.csv
file read: test.csv
'data.frame': 5 obs. of 2 variables:
$ id : int 1 2 3 4 5
$ name: chr "a" "b" "c" "d" ...

Related

R script output values into another folder or directory

After running my R script in the terminal I get two output data files: a.dat and b.dat. My goal is to directly divert these output files into a new folder.
Is there any way to do something like this:
Rscript myscript.R > folder
Note: For writing the output file I simply use this:
write(t(result1), file = "a.dat", ncolumns = 5, append=TRUE)
I solved my problem by doing the following:
I created an output folder 'output'
I added the full path of the output in myscript.R as
write(t(result1), file = "home/Documents/output/a.dat", ncolumns = 5, append=TRUE)
Solved! :)
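Note that shell redirection such as `Rscript myscript.R > folder` only captures standard output, not files the script writes itself. The approach above can be sketched a little more robustly by creating the folder from within R and building the path with file.path. This is only an illustration: the folder location and the contents of result1 are stand-ins, not the original data.

```r
# Hypothetical output folder; a temporary location is used here so the
# sketch can run anywhere. Replace it with your own directory.
out_dir <- file.path(tempdir(), "output")

# Create the folder if it does not exist yet.
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for the script's real result.
result1 <- matrix(1:10, nrow = 2)

# Build the file path portably and write the data, as in the question.
out_file <- file.path(out_dir, "a.dat")
write(t(result1), file = out_file, ncolumns = 5, append = TRUE)
```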
You could simply use write.table to create two csv files, like this.
A minimal working example:
It uses an R script called "Rfile.r" in the directory "adir" in my "Dokumente" folder. The script reads two command-line arguments: a numeric, which is passed to the function, and a character string with the target output directory. (You could also pass file names, etc., of course.)
Rfile.r ::
# read the command-line arguments, specified later in the terminal:
# one numeric and one target directory
arg <- commandArgs(trailingOnly = TRUE)
n <- as.numeric(arg[1])
path <- as.character(arg[2])
## An arbitrary function to create the data for the two files
fun <- function(n) {
  data.a <- data.frame(rep("Some Data", n))
  data.b <- data.frame(rnorm(n))
  data <- list(data.a, data.b)
  return(data)
}
# create data using input arg[1], aka 'n'
data <- fun(n)
# now the important part: using write.table with arg[2], aka 'path'
write.table(data[[1]], file = file.path(path, "data_a.csv"))
write.table(data[[2]], file = file.path(path, "data_b.csv"))
## write terminal output message using cat()
cat(paste("Your input was:", arg[1], sep = "\t"),
    paste("Your target path was:", arg[2], sep = "\t"), sep = "\n")
Then run in a terminal:
$ Rscript ~/Dokumente/adir/Rfile.r 3 ~/Dokumente/bdir
This creates two files, "data_a.csv" and "data_b.csv", in the directory "bdir" (which must already exist); 3 is the numeric input for the function in Rfile.r.

How to use read.csv2.sql to read zip file without unzipping it?

I am trying to read a zip file without unzipping it in my directory while utilizing read.csv2.sql for specific row filtering.
Zip file can be downloaded here :
I have tried passing a file connection to read.csv2.sql, but it seems that it does not accept a file connection as the "file" parameter.
I have already installed the sqldf package on my machine.
This is my R code for the issue described:
### Name the download file
zipFile <- "Dataset.zip"
### Download it
download.file("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip",zipFile,mode="wb")
## Set up zip file directory
zip_dir <- paste0(workingDirectory,"/Dataset.zip")
### Establish link to "household_power_consumption.txt" inside zip file
data_file <- unz(zip_dir,"household_power_consumption.txt")
### Read file into loaded_df
loaded_df <- read.csv2.sql(data_file , sql="SELECT * FROM file WHERE Date='01/02/2007' OR Date='02/02/2007'",header=TRUE)
### Error Msg
### -Error in file(file) : invalid 'description' argument
This does not use read.csv2.sql but as there are only ~ 2 million records in the file it should be possible to just download it, read it in using read.csv2 and then subset it in R.
# download file creating zipfile
u <-"https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
zipfile <- sub(".*%2F", "", u)
download.file(u, zipfile, mode = "wb")
# extract fname from zipfile, read it into DF0 and subset it to DF
fname <- sub("\\.zip$", ".txt", zipfile)
DF0 <- read.csv2(unz(zipfile, fname))
DF0$Date <- as.Date(DF0$Date, format = "%d/%m/%Y")
DF <- subset(DF0, Date == '2007-02-01' | Date == '2007-02-02')
# can optionally free up memory used by DF0
# rm(DF0)

How to read .xlsm files in a folder, a folder that is present in different folders using R

I have a directory containing several folders, each of which contains a folder named "ABC". Each "ABC" folder holds '.xlsm' files. I want to use R code to read the '.xlsm' files in every "ABC" folder across the different folders in the directory.
Thank you for your help
If you already know the paths to each file, then simply use read_excel from the readxl package:
library(readxl)
mydata <- read_excel("ABC/myfile.xlsm")
If you first need to get the paths to each file, you can use a system command (I'm on Ubuntu 18.04) to find all of the paths and store them in a vector. You can then import them one at a time:
myshellcommand <- "find /path/to/top/directory -path '*/ABC/*' -name '*.xlsm' -type f"
mypaths <- system(command = myshellcommand, intern = TRUE)
Because of your directory requirements, one method for finding all of the files can be a double list.files:
ld <- list.files(pattern="^ABC$", include.dirs=TRUE, recursive=TRUE, full.names=TRUE)
lf <- list.files(ld, pattern="\\.xlsm$", ignore.case=TRUE, recursive=TRUE, full.names=TRUE)
To read them all into a list (good ref for dealing with a list-of-frames: http://stackoverflow.com/a/24376207/3358272):
lstdf <- sapply(lf, read_excel, simplify=FALSE)
This defaults to opening the first sheet in each workbook. Other options in readxl::read_excel that might be useful: sheet=, range=, skip=, n_max=.
Given a list of *.xlsm files in your working directory you can do the following:
list.files(
path = getwd(),
pattern = glob2rx(pattern = "*.xlsm"),
full.names = TRUE,
recursive = TRUE
) -> files_to_read
lst_dta <- lapply(
X = files_to_read,
FUN = function(x) {
cat("Reading:", x, fill = TRUE)
openxlsx::read.xlsx(xlsxFile = x)
}
)
Results
Given two files, one with columns A and B and the other with columns C and D, the generated list corresponds to:
> lst_dta
[[1]]
C D
1 3 4
[[2]]
A B
1 1 2
Notes
This will read all .xlsm files found in the directory tree starting from getwd().
openxlsx is efficient due to the use of Rcpp. If you are going to be handling a substantial amount of MS Excel files this package is worth exploring, IMHO.
Edit
As pointed out by @r2evans in the comments, you may want to read only the *.xlsm files that reside within an ABC folder, ignoring *.xlsm files outside it. You could filter your files vector in the following manner:
grep(pattern = "ABC", x = files_to_read, value = TRUE)
In the unlikely case that you have *.xlsm files with the string ABC in their names saved outside an ABC folder, you may get extra matches.
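One way to avoid those extra matches is to test the directory components of each path rather than the whole string. The following is a sketch only: the paths are made up for illustration and stand in for the files_to_read vector.

```r
# Hypothetical paths, standing in for the files_to_read vector.
files_to_read <- c(
  "/data/x/ABC/report.xlsm",
  "/data/y/ABC_summary.xlsm",
  "/data/z/ABC/sub/deep.xlsm",
  "/data/w/other/ABC.xlsm"
)

# Keep a path only if one of its directory components is exactly "ABC".
in_abc <- vapply(
  strsplit(dirname(files_to_read), "/", fixed = TRUE),
  function(parts) "ABC" %in% parts,
  logical(1)
)
abc_files <- files_to_read[in_abc]
print(abc_files)
```

Files that merely have ABC in their name, or that sit in a folder such as ABC_old, are dropped, while files nested deeper inside an ABC folder are kept.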

read_fwf not working while unzipping files

I want to read in several fixed width format txt files into R but I first need to unzip them.
Since they are very large files I want to use read_fwf from the readr package because it's very fast.
When I do:
read_fwf(unz(zipfileName, fileName), fwf_widths(colWidths, col_names = colNames))
I get this error: Error in isOpen(con) : invalid connection
However, when I do:
read.table(unz(zipfileName, fileName))
without specifying widths, it reads into R just fine. Any thoughts as to why this isn't working with read_fwf?
I am having trouble making a reproducible example. Here is what I got:
df <- data.frame(
rnorm(100),
rnorm(100)
)
write.table(df, "data.txt", row.names=F, col.names = F)
zip(zipfile = "data.zip", files = "data.txt")
colWidths <- rep(2, 100)
colNames <- c("thing1","thing2")
zipfileName <- "data.zip"
fileName <- "data.csv"
I also had trouble getting read_fwf to read zip files when passing an unz-ed file to it, but on reading the ?read_fwf page I see that zipped files are promised to be handled automatically. Your example file is not a valid fixed-width file, since neither column has constant positions, as is apparent from the output:
read_fwf(file="~/data.zip", fwf_widths(widths=rep(16,2) ,col_names = colNames) )
Warning: 1 parsing failure.
row col expected actual
3 thing2 16 chars 14
# A tibble: 100 x 2
thing1 thing2
<chr> <chr>
1 1.37170820802141 -0.58354018425322
2 0.03608988699566 7 -0.402708262870141
3 1.02963272114 -1 .0644333112294
4 0.73546166509663 8 0.607941664550652
5 -1.5285547658079 -0.319983522035755
6 -1.4673290956901 0.523579231857175
7 0.24946312418273 9 -0.574046655188405
8 0.58126541455159 5 -0.406516495600345
9 1.5074477698981 -0.496512994239183
10 -2.2999905645658 8 -0.662667854341041
# ... with 90 more rows
The error you were getting was from the unz function, because it expects a full path to a file with a zip extension (and apparently won't accept an implicit working directory location) as the "description" argument. Its second argument is the name of the compressed file inside the zip file. I think it returns a connection, but not of a type that read_fwf is able to process. Tracing the parsing by hand, I see that the errors both of us got came from this section of code in read_connection:
> readr:::read_connection
function (con)
{
stopifnot(is.connection(con))
if (!isOpen(con)) {
open(con, "rb")
on.exit(close(con), add = TRUE)
}
read_connection_(con)
}
<environment: namespace:readr>
You didn't give unz a valid "description" argument, and even if you had, the attempt to open the connection with open(con, "rb") fails because of the lack of standardization in arguments across the various file handling functions.
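For reference, here is one way a valid fixed-width test file could be produced, so that every column occupies constant positions. This is a sketch under assumptions: the width of 12 and the 6 decimal places are arbitrary choices, and base R's read.fwf is used for the round trip in place of readr::read_fwf to keep the snippet dependency-free.

```r
# Stand-in data (two numeric columns, as in the question).
set.seed(1)
df <- data.frame(thing1 = rnorm(5), thing2 = rnorm(5))

# Format every value to exactly 12 characters so the columns line up.
width <- 12
lines <- paste0(
  formatC(df$thing1, format = "f", digits = 6, width = width),
  formatC(df$thing2, format = "f", digits = 6, width = width)
)

fwf_file <- tempfile(fileext = ".txt")
writeLines(lines, fwf_file)

# Base R reader, used here instead of readr::read_fwf.
back <- read.fwf(fwf_file, widths = c(width, width),
                 col.names = c("thing1", "thing2"))
print(back)
```

With a file written this way, the widths passed to the reader match the file exactly, so no parsing failures of the kind shown above should occur.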

how to read multiple specific columns out of compressed .csv file in R

I need a fast way to read multiple specific columns from a .csv file compressed as .tar.gz into a variable in R.
My approach:
con <- textConnection(system(paste("zcat ", filename.tar.gz, " | cut -d ; -f 1,2,3", sep = "")))
var <- read.csv(con, sep = ";")
It seems that R does not understand the pipe, since zcat filename.tar.gz | cut -d ; -f 1,2,3 works in the console.
The error I'm getting in R:
[5] "cut.gz: No such file or directory"
[6] ";.gz: No such file or directory"
[7] "2.gz: No such file or directory"
1) pipe. Suppose a.tar.gz contains a csv file named a.csv with 8 columns, and we want to read the first 3 columns and ignore the rest. We can do that with colClasses (or, instead of colClasses, use a shell pipeline inside pipe as in your question):
read.csv(pipe("tar -xOzf a.tar.gz a.csv"), colClasses = rep(c(NA, "NULL"), c(3, 5)))
2) gsubfn To parameterize it, it could be written like this:
library(gsubfn)
Archive <- "a.tar.gz"
File <- "a.csv"
read.csv(fn$pipe("tar -xOzf $Archive $File"), colClasses = rep(c(NA, "NULL"), c(3, 5)))
3) fread The fread function in data.table can also be useful here. This uses Archive and File from (2). It has the advantage of not requiring knowledge of the number of columns. Also fread handles shell commands directly, can usually figure out whether there are headers and what the separator is and it tends to be fast.
library(data.table)
library(gsubfn)
fn$fread("tar -xOzf $Archive $File", select = 1:3)
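The colClasses idiom used in (1) and (2) can be tried without an archive. The sketch below uses made-up in-memory CSV text with 8 columns in place of the tar pipeline; only the column selection is illustrated.

```r
# Stand-in CSV text with 8 columns (hypothetical data).
csv_text <- paste(
  "a,b,c,d,e,f,g,h",
  "1,2,3,4,5,6,7,8",
  "9,10,11,12,13,14,15,16",
  sep = "\n"
)

# NA lets read.csv guess the column type; "NULL" drops the column.
# rep(c(NA, "NULL"), c(3, 5)) yields 3 NAs followed by 5 "NULL"s.
keep_first_three <- rep(c(NA, "NULL"), c(3, 5))
first3 <- read.csv(text = csv_text, colClasses = keep_first_three)
print(first3)
```

The vector must have one entry per column in the file, which is why this approach needs the total column count while fread's select= does not.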

Resources