Force `read_tsv` to decompress file - r

I'm wondering if there is a way to get readr::read_tsv to read block gzip files with a .bgz extension. I could rename the files to have .gz (which read_tsv automatically recognizes), which does work, but I don't want to do that every time I get new files.
Thanks!

You can pass a connection object rather than a file path. For example
read_tsv(gzfile("data.bgz"))
The gzfile() function accepts any file name, regardless of extension.
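Block gzip (bgzip) output is still valid gzip, so the connection decompresses it transparently. A minimal sketch, using a hypothetical file name variants.bgz:
library(readr)
# gzfile() opens a decompressing connection regardless of the extension,
# and read_tsv() reads from the connection like any other source
dataset <- read_tsv(gzfile("variants.bgz"))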

You can rename the file if it is a bgz via R:
library(fs)
library(stringr)
library(readr)
# Regular expression to find your dataset file named datasetname
# You'll need to change that to the actual name
tsv_file <- dir_ls(".", regexp = "datasetname.*\\.b?gz")
if (str_detect(tsv_file, "bgz$")) {
  new_name <- str_replace(tsv_file, "bgz$", "gz")
  file_move(tsv_file, new_name)
  tsv_file <- new_name  # point at the renamed file before reading
}
dataset <- read_tsv(tsv_file)
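Note that if dir_ls() matches more than one file, if() expects a single value; a vectorized sketch of the same idea, reusing tsv_file from above:
# rename every matched .bgz file in one go
bgz <- tsv_file[str_detect(tsv_file, "bgz$")]
file_move(bgz, str_replace(bgz, "bgz$", "gz"))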

Related

How to select specific files according to a spreadsheet criteria and then copy from directory to another directory in R?

I have a task that requires me to use a specific column in a CSV spreadsheet that stores the file names, for example:
File Name
CA-001
WV-001
ma-001
My task is to move some files from folder 'source' to folder 'target'.
And I'm using this CSV spreadsheet as a crosswalk to select any files whose names match what's in the 'File Name' column. Then I'm asking R to copy them from the source folder, which contains not only these files but also other files that are not in this list (e.g. CO-001, SC-001, ...). If it's helpful, all of the files are PDFs, so we don't need to worry about file type. I want only the files whose names match what's in the CSV spreadsheet. How can I do this?
I have some sample code below, but it didn't execute successfully.
source <- "C:/Users/53038/MovePDF/Test_From"
target <- "C:/Users/53038/MovePDF/Test_To"
all.files <- list.files(path = source)
csvfile <- read.csv('C:/Users/53038/MovePDF/Master.csv')
toCopy <- all.files[all.files %in% csvfile$Move]
file.copy(toCopy, target)
Thank you!
With the provided code, the selection of patterns you want to match will be in csvfile$File.Name.
I'm assuming the source directory is potentially very large. Instead of running slow regular-expression matches on substrings (when we already know the exact filenames), and/or getting a complete file listing (which is also slow), I will only check whether the exact wanted filenames exist before copying them:
source <- "C:/Users/53038/MovePDF/Test_From"
target <- "C:/Users/53038/MovePDF/Test_To"
csvfile <- read.csv('C:/Users/53038/MovePDF/Master.csv')
# add .pdf suffix
toCopy <- paste0(csvfile$File.Name,'.pdf')
# add source directory path
toCopy <- file.path(source, toCopy)
# optional: extract only the existing files from toCopy. You can skip this step if you're sure they exist and/or you don't mind receiving errors
toCopy <- toCopy[file.exists(toCopy)]
# make it so
file.copy(toCopy, target, overwrite = TRUE)
I would prefer to keep the .pdf extension in the filename at all times, including in the source CSV. Note that on case-sensitive filesystems (almost all Linux installations, rarely macOS or Windows) there would be an issue if the extension is .PDF, .Pdf, etc.
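If you can't control the extension case, one workaround (a sketch reusing the source, target, and csvfile objects from above) is to list the directory once and match the wanted names case-insensitively:
# list the actual files once, then compare lowercased names
all.files <- list.files(source)
wanted <- tolower(paste0(csvfile$File.Name, ".pdf"))
toCopy <- all.files[tolower(all.files) %in% wanted]
file.copy(file.path(source, toCopy), target, overwrite = TRUE)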

Read files with a specific extension from a folder in R

I want to read files with extension .output with the function read.table.
I used pattern=".output" but it's not correct.
Any suggestions?
As an example, here's how you could read in files with the extension ".output" and create a list of tables:
list.filenames <- list.files(pattern="\\.output$")
trialsdata <- lapply(list.filenames,read.table,sep="\t")
Or, if you just want to read them one at a time manually, include the extension in the filename argument:
read.table("ACF.output",sep=...)
So, finally, because I didn't find a solution (something was going wrong with my path), I made a text file listing all the .output files with ls *.output > data.txt.
After that, using:
files = read.table("./data.txt")
I get a data frame containing all my file names, and with
files[] <- lapply(files, as.character)
I convert the columns to character. Finally, with test = read.table(files[i,], header=F, row.names=1)
I can read the file stored in line i (i = line number).
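For what it's worth, a compact version of that workaround (assuming data.txt holds one .output filename per line, as produced by the ls command above) would be:
# read the file list produced by `ls *.output > data.txt`
files <- read.table("./data.txt", stringsAsFactors = FALSE)[[1]]
# read every listed file into a named list of tables
tables <- lapply(files, read.table, header = FALSE, row.names = 1)
names(tables) <- files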

I have downloaded a fastq file from SRA and I'm having problems reading it into my workspace. Is there another way to read fastq files into R?

fqsample <- readFastq("/cloud/project/sra_data.fastq.", pattern = "fastq")
Error: Input/Output
no input files found
dirPath: /cloud/project/sra_data.fastq.
pattern: fastq
It seems like your path or filename is wrong.
Is the . at the end of your filename (.fastq.) intentional or a typo?
Try it without the .:
fqsample <- readFastq("/cloud/project/sra_data.fastq", pattern = "fastq")
Edit
I have just tried the readFastq function myself on a single file. You don't need to specify pattern = "fastq".
To explain it in more detail, you could read your file in two ways:
readFastq("/cloud/project/", pattern = "sra_data.fastq")
Here the first argument is just the path and the second is the file name. If you only provide the path, it would read all files inside the directory.
or:
readFastq("/cloud/project/sra_data.fastq")
Hope this helps.
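One more note: readFastq() comes from the Bioconductor package ShortRead, so the package has to be loaded first:
# install once with BiocManager::install("ShortRead") if needed
library(ShortRead)
fqsample <- readFastq("/cloud/project/sra_data.fastq")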

How to file copy from a column of files in a data table in R

I am trying to create a new folder with all the files I need. I have a datatable called newlist with a column called fullpath which has the file path for each file.
I have tried the code below but the error message says "'from' path too long" so I don't think it recognises the values as separate file paths.
file.copy(from=newlist[,"fullpath"], to=destination, overwrite=TRUE, recursive=TRUE)
I think I need to specify the files to copy first using the list.files() function but I am unsure how to do this with a column of files in a datatable.
Try this:
lapply(newlist[,"fullpath"], function(x) file.copy(from=x, to=destination, overwrite=TRUE, recursive=TRUE))
Edit:
If your paths do not contain extensions, you can use this code instead:
lapply(newlist[,"fullpath"], function(x) {
  # Find the file in the given directory with the basename and add a wildcard extension
  f <- file.path(dirname(x), list.files(dirname(x), paste0(basename(x), ".*")))
  file.copy(from = f, to = destination, overwrite = TRUE, recursive = TRUE)
})
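Alternatively, since file.copy() is itself vectorized over from, you can skip the lapply entirely. A minimal sketch, assuming newlist is a data.table (so [[ extracts the column as a plain character vector):
file.copy(from = newlist[["fullpath"]], to = destination, overwrite = TRUE, recursive = TRUE)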

How to read a csv file or load an excel workbook by ignoring some characters in the file path?

I'm writing a loop script which involves reading a file from a workbook (using the package XLConnect). The challenge is that the file names contain characters (representing time) that I want to ignore.
For example, here are 3 paths to those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the time.
Is there a way to tell R to ignore some characters located in the file path? Here is how I was thinking of writing my script (the "???" is where I need help). I know a loop is probably not the most efficient way, but I'm open to suggestions, should you have any:
require(XLConnect)
path= "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
scheduleList = rep(list(matrix(1,1,1)),length(wd.seq))
for(i in 1:length(wd.seq)) {
  wb = loadWorkbook(file = paste0(path, wd.seq[i], "???", ".xlsx"))
  scheduleList[[i]] = readWorksheet(wb, sheet = '=SCHEDULE', header = TRUE)
}
Thanks for reading and suggestions, if any.
Mathieu
I don't know if this is helpful, but if you want to read all the files in a certain directory (which it seems to me is what you're after), you can read all the filenames into a list using the list.files() function, for example:
fileList <- list.files("G://User//Documents//daily_data//")
And then load the xlsx files looping through the list with a for loop
for(i in fileList) {
  loadWorkbook(file = i)
}
I haven't used the XLConnect functions before, so that exact code probably doesn't work, but the loop will iterate through all the files in that directory, and you can construct your loading call using the i variable for the filename (it won't be an absolute path, though, so you might need to use paste to add the first part of the filepath).
I realize there might be other files in the directory that are not Excel files; you could use grepl to select only files containing "Op_Schedule_":
fileListClean <- fileList[grepl("Op_Schedule_",fileList)]
or perhaps only selecting .xlsx files in the directory:
fileListClean <- fileList[grepl(".xlsx",fileList)]
Edit to fit your reply:
Since you need to fit it to a sequence, you can do it as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to pick only files starting with those prefixes:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: the last part I got from this SO answer: grep using a character vector with multiple patterns
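Putting the pieces together, a sketch of the full loop under the question's assumptions (the directory, the Op_Schedule_ prefix, and the '=SCHEDULE' sheet name are taken from the question as written):
require(XLConnect)
path <- "G://User//Documents//daily_data//"
fileList <- list.files(path)
# build the dated prefixes and keep only matching workbooks
wd.seq <- format(seq(as.Date("2014-01-01"), as.Date("2016-12-31"), "days"), format = "%Y%m%d")
wd.seq2 <- paste0("Op_Schedule_", wd.seq)
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
# load each workbook and read its schedule sheet
scheduleList <- lapply(fileListClean, function(f) {
  wb <- loadWorkbook(file.path(path, f))
  readWorksheet(wb, sheet = "=SCHEDULE", header = TRUE)
})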
