I am trying to get files from this FTP
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/
From there, I need only the files with the .dat extension; there are other files I am not interested in.
I want to avoid downloading them one at a time, so I thought of creating a vector with the names and looping over them.
How can I download only the files I want?
Thanks
EDIT:
I have tried doing the following
downloadURL <- "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/F010439.dat"
download.file(downloadURL, "F010439.dat") #this is a trial using one file
And after a few seconds I get the following error:
trying URL
'ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/F010439.dat'
Error in download.file(downloadURL, "F010439.dat") :
cannot open URL 'ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/F010439.dat'
In addition: Warning message:
In download.file(downloadURL, "F010439.dat") :
InternetOpenUrl failed: 'The FTP session was terminated.'
Use the curl library to extract the directory listing
> library(curl)
> url = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/"
> h = new_handle(dirlistonly=TRUE)
> con = curl(url, "r", h)
> tbl = read.table(con, stringsAsFactors=FALSE, fill=TRUE)
> close(con)
> head(tbl)
V1
1 12-0210_Druart_Uterus_J0N-Co_1a_ORBI856.raw.mzML
2 12-0210_Druart_Uterus_J0N-Co_2a_ORBI857.raw.mzML
3 12-0210_Druart_Uterus_J0N-Co_3a_ORBI858.raw.mzML
4 12-0210_Druart_Uterus_J10N-Co_1a_ORBI859.raw.mzML
5 12-0210_Druart_Uterus_J10N-Co_2a_ORBI860.raw.mzML
6 12-0210_Druart_Uterus_J10N-Co_3a_ORBI861.raw.mzML
Paste the relevant ones onto the url and use:
urls <- paste0(url, tbl[1:5,1])
fls = basename(urls)
curl_fetch_disk(urls[1], fls[1])
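For the original goal (only the .dat files), the listing can be filtered before downloading. A minimal sketch, where the small `files` vector stands in for `tbl$V1` so the filtering step runs without a network connection (two of the names are made up for the demo):

```r
# `files` stands in for the directory listing (tbl$V1 above)
files <- c("F010439.dat", "F010440.dat",
           "12-0210_Druart_Uterus_J0N-Co_1a_ORBI856.raw.mzML")

# keep only the names ending in ".dat"
dat_files <- files[grepl("\\.dat$", files)]
dat_files
#> [1] "F010439.dat" "F010440.dat"

# then fetch each one (needs the curl package and a live connection):
# library(curl)
# url <- "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/"
# for (f in dat_files) curl_fetch_disk(paste0(url, f), f)
```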
I am trying to read a zip file without unzipping it in my directory, using read.csv2.sql for specific row filtering.
The zip file can be downloaded here :
I have tried setting up a file connection to read.csv2.sql, but it seems that it does not accept a file connection as the "file" parameter.
I have already installed the sqldf package on my machine.
This is my R code for the issue described:
### Name the download file
zipFile <- "Dataset.zip"
### Download it
download.file("https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip",zipFile,mode="wb")
## Set up zip file directory
zip_dir <- paste0(workingDirectory,"/Dataset.zip")
### Establish link to "household_power_consumption.txt" inside zip file
data_file <- unz(zip_dir,"household_power_consumption.txt")
### Read file into loaded_df
loaded_df <- read.csv2.sql(data_file , sql="SELECT * FROM file WHERE Date='01/02/2007' OR Date='02/02/2007'",header=TRUE)
### Error Msg
### -Error in file(file) : invalid 'description' argument
This does not use read.csv2.sql, but since there are only ~2 million records in the file, it should be possible to just download it, read it in using read.csv2, and then subset it in R.
# download file creating zipfile
u <-"https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
zipfile <- sub(".*%2F", "", u)
download.file(u, zipfile)
# extract fname from zipfile, read it into DF0 and subset it to DF
fname <- sub(".zip", ".txt", zipfile)
DF0 <- read.csv2(unz(zipfile, fname))
DF0$Date <- as.Date(DF0$Date, format = "%d/%m/%Y")
DF <- subset(DF0, Date == '2007-02-01' | Date == '2007-02-02')
# can optionally free up memory used by DF0
# rm(DF0)
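If the SQL-style filtering is still wanted, note that read.csv2.sql() from sqldf takes a file path, not a connection, so the file would have to be extracted first. The base-R subsetting step above can be checked on a tiny stand-in file (the column names and date format mimic the real dataset, but the values are invented):

```r
# write a small ";"-separated stand-in file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Date;Global_active_power",
             "31/1/2007;1",
             "1/2/2007;2",
             "2/2/2007;3",
             "3/2/2007;4"), tmp)

DF0 <- read.csv2(tmp)                               # read.csv2 uses ";" as separator
DF0$Date <- as.Date(DF0$Date, format = "%d/%m/%Y")  # parse d/m/yyyy dates
DF <- subset(DF0, Date == "2007-02-01" | Date == "2007-02-02")
nrow(DF)
#> [1] 2
```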
I am trying to import CSV files from an FTP server into R.
It would be best to import the files into a data frame.
I want to import only specific files from the FTP server, not all of them.
My issues began when trying to import only one file:
url <- "ftp:servername.de/"
download.file(url, "testdata.csv")
I got this error message:
trying URL 'ftp://servername/'
Error in download.file(url, "testdata") :
cannot open URL 'ftp://servername.de/'
In addition: Warning message:
In download.file(url, "testdata.csv") :
URL 'ftp://servername/': status was 'Couldn't connect to server'
Another way I tried was:
library(RCurl)

url <- "ftp://servername.de/"
userpwd <- "a:n"
filenames <- getURL(url, userpwd = userpwd,
                    ftp.use.epsv = FALSE, dirlistonly = TRUE)
Here I do not understand how to import the files into an R object.
Additionally, it would be great to get a clue on how to handle this process with gzipped data (.gz) instead of plain CSV files.
Use the curl library to extract the directory listing
library(curl)
url = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/11/PXD000299/"
h = new_handle(dirlistonly=TRUE)
con = curl(url, "r", h)
tbl = read.table(con, stringsAsFactors=FALSE, fill=TRUE)
close(con)
head(tbl)
V1
1 12-0210_Druart_Uterus_J0N-Co_1a_ORBI856.raw.mzML
2 12-0210_Druart_Uterus_J0N-Co_2a_ORBI857.raw.mzML
3 12-0210_Druart_Uterus_J0N-Co_3a_ORBI858.raw.mzML
4 12-0210_Druart_Uterus_J10N-Co_1a_ORBI859.raw.mzML
5 12-0210_Druart_Uterus_J10N-Co_2a_ORBI860.raw.mzML
6 12-0210_Druart_Uterus_J10N-Co_3a_ORBI861.raw.mzML
Paste the relevant ones on to the url and use
urls <- paste0(url, tbl[1:5,1])
fls = basename(urls)
curl_fetch_disk(urls[1], fls[1])
Reference:
Downloading files from ftp with R
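To get a listed file into a data frame rather than onto disk, the same curl connection can be handed straight to read.csv. For the .gz question: read.csv reads gzipped files transparently, because R's file connections detect gzip compression on read. A sketch, where the remote lines assume a reachable server and valid credentials and so are left commented, while the local part demonstrates the .gz handling:

```r
# remote (assumptions: server reachable, "a:n" are valid credentials):
# library(curl)
# h <- new_handle(userpwd = "a:n")
# con <- curl("ftp://servername.de/testdata.csv", "r", h)
# df <- read.csv(con)

# local demo of the .gz case: write a small gzipped CSV, then read it back
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
writeLines(c("x,y", "1,2", "3,4"), con)
close(con)

df <- read.csv(tmp)  # the connection auto-detects gzip compression
df
#>   x y
#> 1 1 2
#> 2 3 4
```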
I am writing an R function that reads CSV files from a subdirectory in a ZIP file without first unzipping it, using read.csv() and unz().
The CSV files are named with leading 0 as in 00012.csv, 00013.csv etc.
The function has the following parameters: MyZipFile, ASubDir, VNum (a vector e.g. 1:42) which forms the filename.
What I want is to use the variable PathNfilename in unz().
# Incorporate the directory in the ZIP file while constructing the filename using stringr package
PathNfilename <- paste0("/", ASubDir, "/", str_pad(Vnum, 5, pad = "0"), ".csv", sep="")
What works is:
csvdata <- read.csv(unz(description = "MyZipFile.zip", filename = "ASubDirectory/00039.csv"), header=T, quote = "")
What I need is something along these lines of this:
csvdata <- read.csv(unz(description = "MyZipFile.zip", filename = PathNfilename), header=T, quote = "")
The error that I get is:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot locate file '/ASubDir/00039.csv' in zip file 'MyZipFile.zip'
I'd like to understand why I'm getting the error and how to resolve it. Is it a scoping issue?
Try building PathNfilename without the leading /:
library(stringr)

ASubDir <- "ASubDirectory"
Vnum <- 1:5
PathNfilename <- file.path(ASubDir,
                           paste0(str_pad(Vnum, 5, pad = "0"), ".csv"))
PathNfilename
#> [1] "ASubDirectory/00001.csv" "ASubDirectory/00002.csv"
#> [3] "ASubDirectory/00003.csv" "ASubDirectory/00004.csv"
#> [5] "ASubDirectory/00005.csv"
I am trying to download and read a zipped csv file from Kaggle within an R script. After researching other posts including post1 and post2 I have tried:
# Read data with temp file
url <- "https://www.kaggle.com/c/rossmann-store-sales/download/store.csv.zip"
tmp <- tempfile()
download.file(url, tmp, mode = "wb")
con <- unz(tmp, "store.csv.zip")
store <- read.table(con, sep = ",", header = TRUE)
unlink(tmp)
The read.table command throws an error:
Error in open.connection(file, "rt") : cannot open the connection
I have also tried:
# Download file, unzip, and read
url <- "https://www.kaggle.com/c/rossmann-store-sales/download/store.csv.zip"
download.file(url, destfile = "./SourceData/store.csv.zip", mode = "wb")
unzip("./SourceData/store.csv.zip")
Unzip throws the error:
error 1 in extracting from zip file
Bypassing the unzip command and reading directly from the zip file with readr:
store <- read_csv("SourceData/store.csv.zip")
throws the error:
zip file ... SourceData/store.csv.zip cannot be opened
I prefer to use the temp file, but at this point I'll use either approach if I can make it work.
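One thing worth checking before debugging the reading step: Kaggle download links usually require a logged-in session, so download.file often saves an HTML login page under the .zip name, which would explain all three errors above. A zip file starts with the bytes "PK", so a quick signature check tells the two apart (the fake "download" below is made up for the demo):

```r
# TRUE if the file really starts with the zip magic bytes "PK"
is_zip <- function(path) {
  identical(readBin(path, "raw", n = 2), charToRaw("PK"))
}

# simulate a "download" that is actually an HTML login page
tmp <- tempfile(fileext = ".zip")
writeLines("<html><body>Sign in to Kaggle</body></html>", tmp)
is_zip(tmp)
#> [1] FALSE
```

If the check fails, the problem is the download itself, not the reading. Note also that unz() expects the name of the file *inside* the archive (here presumably "store.csv"), not the zip's own name.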
I have script as below:
setwd("I:/prep/Coord/RData/test")
#load .csv files
a.files <- grep("^Whirr", dir(), value=TRUE) #pattern matching
b.files <- paste0("Files_", a.files)
for(i in length(a.files)){
a <- read.table(a.files[i], header=T, sep=",", row.names=1) #read files start with Whirr_
b <- read.table(b.files[i], header=T, sep=",", row.names=1) #read files start with Files_
a
b
cr <- as.matrix(a) %*% as.matrix(t(a))
cr
diag(cr) <- 0
cr
#write to file
write.csv(cr, paste0("CR_", a.files[i], ".csv"))
}
Basically, what I want to do is match each pair of files whose names share the same ending, do the calculation, and write the result to a file.
When I tried to print a.files and b.files, the output seemed OK to me. The output is as below:
> a.files <- grep("^Whirr", dir(), value=TRUE) #pattern matching
> b.files <- paste0("Files_", a.files, sep="")
Error: could not find function "paste0"
> a.files
[1] "Whirr_127.csv" "Whirr_128.csv"
> b.files
[1] "Files_ Whirr_127.csv" "Files_ Whirr_128.csv"
>
I tried to feed the script multiple files, but I got an error message as below:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'Files_ Whirr_128.csv': No such file or directory
So I tried to use file.choose, but that also doesn't work for me.
I'd appreciate help from the experts.
Change the line:
b.files <- paste0("Files_", a.files)
to:
b.files <- paste("Files_", a.files, sep="")
You are using a version of R that does not have paste0 (I see that code was given to you in an earlier answer). This means you were keeping an earlier version of b.files, perhaps one that had been constructed using paste.
One important lesson about this is that whenever you get an error message about a line, such as Error: could not find function "paste0", that means the line did not happen! You have to fix that error before you paste the code, or tell us about the error when you do; otherwise we assume the b.files <- paste0("Files_", a.files) line works.
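The space visible in "Files_ Whirr_127.csv" in the question's output is consistent with this: paste() with its default separator inserts a space, which is exactly why the files could not be found. A quick comparison:

```r
a.files <- c("Whirr_127.csv", "Whirr_128.csv")

paste("Files_", a.files)             # default sep is a space
#> [1] "Files_ Whirr_127.csv" "Files_ Whirr_128.csv"

paste("Files_", a.files, sep = "")   # no separator, equivalent to paste0()
#> [1] "Files_Whirr_127.csv" "Files_Whirr_128.csv"
```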