I have some .vcf files. I have selected those files from my directory and want to convert them to two other formats.
I am a bit confused about using if and else if here. I want it to work like this: if there isn't a .bgz file for the [i]th .vcf file, I want to compress it to a .bgz file, keeping the original file.
If there is already a .bgz file, but no .bgz.tbi file for the [i]th .bgz file, then I want to create a .bgz.tbi index from that .bgz file, keeping the original .bgz that I get from the .vcf file.
Can someone please help me finish this loop? It works for the if condition, but I don't know how to proceed from there.
path.file <- "/mypath/for/files/"
all.files <- list.files("/mypath/for/files")
all.files <- all.files[grepl(".vcf$", all.files)]

for (i in 1:length(all.files)){
  if(!exists(paste0(all.files[i], ".bgz"))){
    bgzip(paste0(path.file, all.files[i]), overwrite=FALSE)
  }else{(!exists(paste0(all.files[i], ".bgz", ".tbi"))){
    #if(!exists(paste0(all.files[i],".bgz",".tbi"))){
    indexTabix(paste0(paste0(path.file, all.files[i]), ".bgz"), format="vcf")
  }
}
Try this (not tested):
library(Rsamtools)   # provides bgzip() and indexTabix()

# get VCF files with their full path
all.files <- list.files("/mypath/for/files", pattern = "\\.vcf$",
                        full.names = TRUE)

for (i in all.files) {
  # build the output names up front, so we don't mess about with paste later
  file_bgz <- paste0(i, ".bgz")
  file_bgz_tbi <- paste0(i, ".bgz.tbi")

  # if the .bgz already exists don't zip, else zip
  # (note: file.exists() checks the file system; exists() only checks for R objects)
  if (!file.exists(file_bgz))
    bgzip(i, file_bgz)

  # if the .tbi already exists don't index, else tabix
  if (!file.exists(file_bgz_tbi))
    indexTabix(file_bgz, format = "vcf")
}
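If you want to double-check the result afterwards, here is a quick (untested) sanity check using the same all.files vector from above:
# every .vcf should now have a .bgz and a .bgz.tbi sitting next to it
all(file.exists(paste0(all.files, ".bgz")))
all(file.exists(paste0(all.files, ".bgz.tbi")))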
I found an old thread (How do you read a password protected excel file into r?) that recommended the following code to read in a password-protected file:
install.packages("excel.link")
library("excel.link")
dat <- xl.read.file("TestWorkbook.xlsx", password = "pass", write.res.password="pass")
dat
However, when I try to do this, my R session immediately crashes. I've tried removing the write.res.password argument, and that doesn't seem to be the issue. I have a hunch that excel.link might not work with the newest version of R, so if you know of any other way to do this I'd appreciate the advice.
EDIT: Using read.xlsx generates this error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
org.apache.poi.poifs.filesystem.OfficeXmlFileException:
The supplied data appears to be in the Office 2007+ XML.
You are calling the part of POI that deals with OLE2 Office Documents.
You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
You can remove the password from the Excel file without knowing it with the following function (an adapted version of the code available at https://www.r-bloggers.com/2018/05/remove-password-protection-from-excel-sheets-using-r/):
remove_Password_Protection_From_Excel_File <- function(dir, file, bool_XLSXM = FALSE)
{
  initial_Dir <- getwd()
  setwd(dir)

  # file name and path after removing protection
  if (bool_XLSXM == TRUE)
  {
    file_unlocked <- stringr::str_replace(basename(file), ".xlsm$", "_unlocked.xlsm")
  } else
  {
    file_unlocked <- stringr::str_replace(basename(file), ".xlsx$", "_unlocked.xlsx")
  }
  file_unlocked_path <- file.path(dir, file_unlocked)

  # create a temporary directory in the project folder
  # so we can see what is going on
  temp_dir <- "_tmp"

  # remove and recreate the _tmp folder in case it already exists
  unlink(temp_dir, recursive = TRUE)
  dir.create(temp_dir)

  # unzip the Excel file into the temp folder
  unzip(file, exdir = temp_dir)

  # get the full paths to the XML files for all worksheets
  worksheet_paths <- list.files(paste0(temp_dir, "/xl/worksheets"),
                                full.names = TRUE, pattern = "\\.xml$")

  # remove the XML node which contains the sheet protection
  # we might of course use e.g. xml2 to parse the XML file, but this simple approach will suffice here
  for (ws in worksheet_paths)
  {
    file_Content <- readLines(ws, encoding = "windows1")
    # the "sheetProtection" node contains the hashed password: "<sheetProtection SOME INFO />"
    # we simply remove the whole node
    out <- stringr::str_replace(file_Content, "<sheetProtection.*?/>", "")
    writeLines(out, ws)
  }

  # do the same for the workbook-level protection in xl/workbook.xml
  worksheet_Protection_Paths <- paste0(temp_dir, "/xl/workbook.xml")
  file_Content <- readLines(worksheet_Protection_Paths, encoding = "windows1")
  out <- stringr::str_replace(file_Content, "<workbookProtection.*?/>", "")
  writeLines(out, worksheet_Protection_Paths)

  # create a new zip, i.e. Excel file, containing the modified XML files
  old_wd <- setwd(temp_dir)
  files <- list.files(recursive = TRUE, full.names = FALSE, all.files = TRUE, no.. = TRUE)
  # as the Excel file is just a zip archive, we can write it directly with an .xlsx name
  zip::zip(file_unlocked_path, files = files) # utils::zip does not work for some reason
  setwd(old_wd)

  # clean up and remove the temporary directory
  unlink(temp_dir, recursive = TRUE)
  setwd(initial_Dir)
}
Once the password is removed, you can read the Excel file. This approach works for me.
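For example, a rough usage sketch (the directory and file names here are placeholders, and reading the result afterwards assumes you have the openxlsx package installed):
# strip the sheet/workbook protection; the output gets an "_unlocked" suffix
remove_Password_Protection_From_Excel_File(dir = "C:/my/data", file = "TestWorkbook.xlsx")

# read the unlocked copy
dat <- openxlsx::read.xlsx(file.path("C:/my/data", "TestWorkbook_unlocked.xlsx"))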
I have multiple similarly named but different folders, each containing similarly named but different csv files.
For example, I have three folders, each named "output" (in different parent directories), and each containing an "image.csv" and a "cells.csv".
How do I loop through each "output" folder, read each csv file in it, and apply a function to those files?
Here's what I tried :
Firstly, I list the folders named "output":
dirs<-list.dirs()
dirs<-dirs[grepl("output",dirs)]
Then I want to set up a function to join both csv files, something like below (the code is incomplete though, please help me correct it):
object_extraction <- function(x){
  image <- read.csv(image.csv, header=T, sep=",")
  cells <- read.csv(cells.csv, header=T, sep=",")
  object <- dplyr::inner_join(cells, image, by="ImageNumber")
  return(object)
}
Finally I want to loop the function above through the "output" folders
object <- list()
for (i in 1:length(dirs)){
  object[[i]] <- object_extraction(dirs[i])
}
Thank you
Make the path used to read the csv files dynamic in your function:
object_extraction <- function(x) {
  image <- read.csv(paste0(x, '/image.csv'), header = T, sep = ",")
  # header = T and sep = "," are the defaults in read.csv,
  # so this would work without specifying them as well
  cells <- read.csv(paste0(x, '/cells.csv'))
  object <- dplyr::inner_join(cells, image, by = "ImageNumber")
  return(object)
}
and then apply the function to each folder.
dirs <- list.dirs(recursive=FALSE)
dirs <- grep('output', dirs, value = TRUE)
result <- lapply(dirs, object_extraction)
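If you also want to keep track of which folder each result came from, one possible follow-up (the column name "source_dir" is just chosen for illustration):
# label each element with its folder path, then stack everything into one data frame
names(result) <- dirs
combined <- dplyr::bind_rows(result, .id = "source_dir")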
Two errors I can spot in your code:
You need to use the directory name from the dirs variable, e.g.:
object_extraction <- function(x) {
  image <- read.csv(file.path(x, "image.csv"), header = T, sep = ",")
  cells <- read.csv(file.path(x, "cells.csv"), header = T, sep = ",")
  object <- dplyr::inner_join(cells, image, by = "ImageNumber")
  return(object)
}
And the file names should be strings: "image.csv" and "cells.csv".
HTH
With the code below, I have imported all .txt files from the working directory.
temp = list.files(pattern = "*.txt")
for (i in 1:length(temp)) { assign(temp[i], read.delim(temp[i])) }
But all of them came in with the .txt extension still in their names.
How can I remove all .txt extensions from data names?
You can rename the variables in your for loop itself
for (i in 1:length(temp)) {assign(sub(".txt$", "", temp[i]), read.delim(temp[i]))}
Or, if you have already imported the variables, change their names afterwards:
vals <- ls(pattern = ".txt$")
for (i in vals) { assign(sub(".txt$", "", i), get(i)) }
and then clean up the old names
rm(list = vals)
On a side note, using assign is considered bad practice. Read about its potential dangers and side effects here.
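If you would rather avoid assign altogether, here is a minimal sketch of the list-based alternative (same files, just kept in one named list):
temp <- list.files(pattern = "\\.txt$")

# read every file into a single named list instead of creating many global objects
data_list <- lapply(temp, read.delim)
names(data_list) <- sub("\\.txt$", "", temp)

# an individual table is then available as e.g. data_list[["myfile"]]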
I have some data that I would like to write to a temporary CSV file in R.
Users have the option to specify a filename of their choice, which is stored in an environment (called 'envr') separate from .GlobalEnv.
if (!is.null(envr$filename)) {
  write.csv(df, file = paste(envr$filename, ".csv", sep = ""))
}
In order to do this successfully, I need to create a temporary file that is assigned to the filename chosen by the user.
if (!is.null(envr$filename)) {
  file.name <- get("filename", envir = envr)
  tempfile(fileext = ".csv")
  write.csv(df, file = file.name)
}
The above if statement, however, does not do the job, as no CSV file is saved in $TMPDIR.
How can I easily integrate tempfile() into the first if statement above without having to assign it to a variable name (file.name)?
You may concatenate the file name (obtained from the filename entry of the envr environment) with the session's temporary folder (using tempdir()), along with the .csv extension, as follows:
if (!is.null(envr$filename)) {
  write.csv(df, file = paste0(tempdir(), "/", get("filename", envir = envr), ".csv"))
}
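An equivalent, slightly tidier spelling builds the path with file.path() instead of pasting the "/" by hand:
if (!is.null(envr$filename)) {
  write.csv(df, file = file.path(tempdir(), paste0(envr$filename, ".csv")))
}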
Let me know if it answers your question or if you need any further help.
I have been able to create code in R to batch convert all my .txt files to .csv files using
setwd("~/Desktop/test/")
filelist = list.files(pattern = ".txt")
for (i in 1:length(filelist)){
input<-filelist[i]
output<-paste0(input, ".csv")
print(paste("Processing the file:", input))
data = read.delim(input, header = TRUE)
setwd("~/Desktop/test/done/")
write.table(data, file=output, sep=",", col.names=TRUE, row.names=FALSE)
setwd("~/Desktop/test/")
}
It works great, but the file still retains the .txt extension in its name.
For example, the original file is "sample1.txt" and the new file comes out as "sample1.txt.csv". The file works as a .csv, but the extra ".txt" in the file name annoys me. Does anyone know how to remove it from the name?
thanks
You could delete the unwanted .txt:
output <- paste0(gsub("\\.txt$", "", input), ".csv")
The backslash marks the dot as a literal dot (an unescaped dot has a different meaning in regular expressions). The backslash has to be doubled because backslashes themselves must be escaped inside R strings. The dollar sign represents the end of the string, so only a ".txt" at the end of the file name gets removed.
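As a side note, base R already has a helper for stripping extensions in the tools package, which could be used instead of the regular expression:
# "sample1.txt" -> "sample1", then add ".csv"
output <- paste0(tools::file_path_sans_ext(input), ".csv")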
write.table(data, file = paste0("~/Desktop/test/done/", sub("\\.txt$", "", filelist[i]), ".csv"),
            row.names = FALSE, col.names = TRUE, quote = FALSE, sep = ",")
An alternative approach:
setwd("~/Users/Rio/Documents/Data/")
FILES <- list.files( pattern = ".txt")
for (i in 1:length(FILES)) {
FILE=read.table(file=FILES[i],header=T,sep="\t")
write.table(FILE,file=paste0("~Users/Rio/Documents/Data/",sub(".txt","",FILES[i]),".csv"),row.names=F,quote=F,sep=",")
}
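If you would rather not rely on setwd() at all, here is a small untested variation with explicit paths (the folder names are placeholders):
in_dir  <- "~/Desktop/test"        # placeholder input folder
out_dir <- "~/Desktop/test/done"   # placeholder output folder

# list the .txt files with their full paths, so no setwd() is needed
txt_files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)

for (f in txt_files) {
  data <- read.delim(f, header = TRUE)
  csv_name <- paste0(tools::file_path_sans_ext(basename(f)), ".csv")
  write.csv(data, file = file.path(out_dir, csv_name), row.names = FALSE)
}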