How to download .gz files from FTP server in R?

How to download .gz files from FTP server in R? - r

I am trying to download all .gz files from this link:
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/
So far I tried this and I am not getting any results:
require(RCurl)
url= "ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/"
filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
filenames <- strsplit(filenames, "\r\n")
filenames = unlist(filenames)
I am getting this error:
Error in function (type, msg, asError = TRUE) :
Operation timed out after 300552 milliseconds with 0 out of 0 bytes received
Can someone please help with this?
Thanks
EDIT:
I tried to run with with filenames provided to me bellow so in my r script I have:
require(RCurl)
my_url <-"ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/"
my_filenames= c("bed_chr_11.bed.gz", ..."bed_chr_9.bed.gz.md5")
my_filenames <- strsplit(my_filenames, "\r\n")
my_filenames = unlist(my_filenames)
for(my_file in my_filenames){
download.file(paste0(my_url, my_file), destfile = file.path('/mydir', my_file))
}
And when I run the script I get these warnings:
trying URL 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/bed_chr_11.bed.gz'
Error in download.file(paste0(my_url, my_file), destfile = file.path("/mydir", :
cannot open URL 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/bed_chr_11.bed.gz'
In addition: Warning message:
In download.file(paste0(my_url, my_file), destfile = file.path("/mydir", :
URL 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/bed_chr_11.bed.gz': status was 'Timeout was reached'
Execution halted

The file names you're trying to access are
filenames <- c("bed_chr_11.bed.gz", "bed_chr_11.bed.gz.md5", "bed_chr_12.bed.gz",
"bed_chr_12.bed.gz.md5", "bed_chr_13.bed.gz", "bed_chr_13.bed.gz.md5",
"bed_chr_14.bed.gz", "bed_chr_14.bed.gz.md5", "bed_chr_15.bed.gz",
"bed_chr_15.bed.gz.md5", "bed_chr_16.bed.gz", "bed_chr_16.bed.gz.md5",
"bed_chr_17.bed.gz", "bed_chr_17.bed.gz.md5", "bed_chr_18.bed.gz",
"bed_chr_18.bed.gz.md5", "bed_chr_19.bed.gz", "bed_chr_19.bed.gz.md5",
"bed_chr_20.bed.gz", "bed_chr_20.bed.gz.md5", "bed_chr_21.bed.gz",
"bed_chr_21.bed.gz.md5", "bed_chr_22.bed.gz", "bed_chr_22.bed.gz.md5",
"bed_chr_AltOnly.bed.gz", "bed_chr_AltOnly.bed.gz.md5", "bed_chr_MT.bed.gz",
"bed_chr_MT.bed.gz.md5", "bed_chr_Multi.bed.gz", "bed_chr_Multi.bed.gz.md5",
"bed_chr_NotOn.bed.gz", "bed_chr_NotOn.bed.gz.md5", "bed_chr_PAR.bed.gz",
"bed_chr_PAR.bed.gz.md5", "bed_chr_Un.bed.gz", "bed_chr_Un.bed.gz.md5",
"bed_chr_X.bed.gz", "bed_chr_X.bed.gz.md5", "bed_chr_Y.bed.gz",
"bed_chr_Y.bed.gz.md5", "bed_chr_1.bed.gz", "bed_chr_1.bed.gz.md5",
"bed_chr_10.bed.gz", "bed_chr_10.bed.gz.md5", "bed_chr_2.bed.gz",
"bed_chr_2.bed.gz.md5", "bed_chr_3.bed.gz", "bed_chr_3.bed.gz.md5",
"bed_chr_4.bed.gz", "bed_chr_4.bed.gz.md5", "bed_chr_5.bed.gz",
"bed_chr_5.bed.gz.md5", "bed_chr_6.bed.gz", "bed_chr_6.bed.gz.md5",
"bed_chr_7.bed.gz", "bed_chr_7.bed.gz.md5", "bed_chr_8.bed.gz",
"bed_chr_8.bed.gz.md5", "bed_chr_9.bed.gz", "bed_chr_9.bed.gz.md5"
)
The files are big, so I didn't check that this whole loop, but this worked at least for the first file. Add this to the end of your code.
my_url <- 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/'
for(my_file in filenames){ # loop over the files
# download each file, saving in a directory that you need to create on your own computer
download.file(paste0(my_url, my_file), destfile = file.path('c:/users/josep/Documents/', my_file))
}

Related

Error in setwd() : cannot change working directory Execution halted

i am trying to pass external arguments to R when calling a Rscript with a command line in that i am getting an Error in setwd(args[8]) : cannot change working directory
Execution halted
Not sure why this error is occurring Please suggest some ideas to sort out
Any idea would be appreciated
Rscript /home/mike/learn-analysis/SL_input.R chr21 10542449 10542649 + Y Y 1 /mnt/prep/learn/CA-21-10542394-10542448.2.D.txt D /home/mike/learn-analysis/SL_table.R
#SL_table.R file
args <- commandArgs(trailingOnly = TRUE)
setwd(args[8]) # directory to write file to
source(args[11]) # can be updated to reflect location of source code
results<-SL_mutate(args[1],args[2],args[3],args[4],args[5],args[6],args[7]) # chrom, start, stop, strand, gene, transcript, exon number
results$type<-args[10]
results$transcript<-NULL
results$exon_num<-NULL
write.table(results,args[9],quote = F, row.names = F, sep="\t") # write to filename given in argument (in working directory)

Error in file(file, "rt") : invalid 'description' argument when running R script

I am trying to reproduce this protocol for DNA sequencing data analysis. It requires running this bash script that links to an R script. However, I am getting this error (see bottom) that I cant seem to solve.
#!/bin/bash
Project_dir=~/base
cd /${Project_dir}
SCRTP=~/scRepliseq-Pipeline
OUTNAME="bam/G1_F121_A1.adapter_filtered2"
genome_name="mm10"
bamfile=${OUTNAME}.${genome_name}.clean_srt_markdup.bam
rscript=${SCRTP}/util/Step3_R-Aneu-Fragment-bins.R
out_dir="Aneu_analysis"
Name=‘$bamfile’
Name=${name%.adapter_filtered2.${genome_name}.clean_srt_markdup.bam}
blacklist=~/blacklist/mm10-blacklist-v1_id.bed
genome_file=~/reference/UCSC_hg19_female.fa.fai
mkdir -p ${out_dir}
Rscript --vanilla $rscript ${bamfile} ${out_dir} ${name} ${blacklist} ${genome_file}
it links to this R script
args = commandArgs(TRUE)
bamfile=args[1]
out_dir=args[2]
name=args[3]
blacklist=args[4]
genome_file=args[5]
options(scipen=100)
##Extension of file name##
ext="_mapq10_blacklist_fragment.Rdata"
ext2="_mapq10_blacklist_bin.Rdata"
library(AneuFinder)
##loading black list and genome Info##
genome_tmp <- read.table(genome_file,sep="\t") #UCSC_mm9.woYwR.fa.fai
genome=data.frame(UCSC_seqlevel=genome_tmp$V1,UCSC_seqlength=genome_tmp$V2)
chromosomes=as.character(genome$UCSC_seqlevel)
##setup output directories##
out_dir_f=paste0(out_dir,"/fragment")
out_dir_b=paste0(out_dir,"/bins")
dir.create(out_dir,showWarnings = FALSE)
dir.create(out_dir_f,showWarnings = FALSE)
dir.create(out_dir_b,showWarnings = FALSE)
##save the fragment file (>10 MAPQ), filtering out the blacklist regions##
raw_reads=bam2GRanges(bamfile,remove.duplicate.reads = TRUE,min.mapq = 10,blacklist = blacklist)
save(raw_reads,file = paste0(out_dir_f,"/",name,ext))
##save the bin data file ##
bins_reads=binReads(raw_reads,
assembly=genome,
chromosomes=chromosomes,
binsizes=c(40000,80000,100000,200000,500000))
rpm=1000000/length(raw_reads)
bins_reads[["rpm"]]=rpm
save(bins_reads,file=paste(out_dir_b,"/",name,ext2,sep=""))
It shows this error:
Error in file(file, "rt") : invalid 'description' argument
Calls: read.table -> file
Execution halted

R: Cannot run certain function after cleaning temporary directory

I get the error:
Error in file(fn, "rb") : cannot open the connection
In addition: Warning message:
In file(fn, "rb") :
cannot open file 'C:\Users\***\AppData\Local\Temp\Rtmpwh6Zih\raster\r_tmp_2020-05-
13_170601_12152_33882.gri': No such file or directory
When I run the following code in RStudio (1.2.5042):
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
+ clamped <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE)
+ raster.binair[[i]] <- clamped
+ }
"aggregated.rasters" is a list of 96 rasters and when I separately run it, I get the correct list. I recently cleaned my temporary directory (accessed by tempdir()) and deleted the files in there. I suppose the part:
cannot open file 'C:\Users\***\AppData\Local\Temp\Rtmpwh6Zih\raster\r_tmp_2020-05-
13_170601_12152_33882.gri': No such file or directory
is referring to this. I don't know what I did wrong here. Can I get these files back or work around this error?

Files in the temp folder are deleted when an R session ends. So you should never count on them. You can run the code again, but if you want to permanently keep the results you need to write them elsewhere. Here are two options
Write many files
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
f <- paste0("raster_", i)
clamped <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE, filename=f)
raster.binair[[i]] <- clamped
}
Write a single file
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
raster.binair[[i]] <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE)
}
s <- stack(raster.binair)
s <- writeRaster(s, filename="mydata.tif")

Error accessing ftp with getURL in R

I want access files on the internet, but I get the following error message:
Error in function (type, msg, asError = TRUE) : Access denied: 530
In addition: Warning messages:
1: In strsplit(str, "\\\r\\\n") : input string 1 is invalid in this locale
This is my code from this post:
library(RCurl)
url<'ftp://ftp.address'
userpwd <- "user:password"
filenames <- getURL(url, userpwd = userpwd,
ftp.use.epsv=FALSE, dirlistonly = TRUE)
Any idea how to solve this?
Thanks a lot for your help!

Try:
library(RCurl)
filenames <- getURL(url="ftp://user:password#ftp.address",ftp.use.epsv=FALSE, dirlistonly = FALSE)

Error trying to read a PDF using readPDF from the tm package

(Windows 7 / R version 3.0.1)
Below the commands and the resulting error:
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp
\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory
How do I solve this issue?
EDIT I
(As suggested by Ben and described here)
I downloaded Xpdf copied the 32bit version to
C:\Program Files (x86)\xpdf32
and the 64bit version to
C:\Program Files\xpdf64
The environment variables pdfinfo and pdftotext are referring to the respective executables either 32bit (tested with R 32bit) or to 64bit (tested with R 64bit)
EDIT II
One very confusing observation is that starting from a fresh session (tm not loaded) the last command alone will produce the error:
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpKi5GnL
\pdfinfode8283c422f': No such file or directory
I don't understand this at all because the function variable is not defined by tm.readPDF yet. Below you'll find the function pdf refers to "naturally" and to what is returned by tm.readPDF:
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0674bd8c>
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0c3d7364>
Apparently there is no difference - then why use readPDF at all?
EDIT III
The pdf file is located here: C:\Users\Raffael\Documents
> getwd()
[1] "C:/Users/Raffael/Documents"
EDIT IV
First instruction in pdf() is a call to tm:::pdfinfo() - and there the error is caused within the first few lines:
> outfile <- tempfile("pdfinfo")
> on.exit(unlink(outfile))
> status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
+ stdout = outfile)
> tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
+ "Producer", "CreationDate", "ModDate", "Tagged", "Form",
+ "Pages", "Encrypted", "Page size", "File size", "Optimized",
+ "PDF version")
> re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
+ tags)), collapse = "|"))
> lines <- readLines(outfile, warn = FALSE)
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d419174450': No such file or direc
Apparently tempfile() simply doesn't create a file.
> outfile <- tempfile("pdfinfo")
> outfile
[1] "C:\\Users\\Raffael\\AppData\\Local\\Temp\\RtmpquRYX6\\pdfinfo8d437bd65d9"
The folder C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6 exists and holds some files but none is named pdfinfo8d437bd65d9.

Intersting, on my machine after a fresh start pdf is a function to convert an image to a PDF:
getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
package:grDevices
namespace:grDevices [etc.]
But back to the problem of reading in PDF files as text, fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdf2text using system as Tony Breyal describes here.
In your case it would be (note the two sets of quotes):
system(paste('"C:/Program Files/xpdf64/pdftotext.exe"',
'"C:/Users/Raffael/Documents/17214.pdf"'), wait=FALSE)
This could easily be extended with an *apply function or loop if you have many PDF files.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to download .gz files from FTP server in R? - r

Related

Error in setwd() : cannot change working directory Execution halted

Error in file(file, "rt") : invalid 'description' argument when running R script

R: Cannot run certain function after cleaning temporary directory

Error accessing ftp with getURL in R

Error trying to read a PDF using readPDF from the tm package

Categories

Resources