I wonder if anyone can help me. I am trying to download a file using scp and then unzip it (it is a set of 100 files, so they have to be zipped, which also rules out unz).
Basically first I run
x <- scp(host = "WM.net", path = "/wmdata.zip", user = "w", password = "wm")
It returns a raw object (of course this is a dummy address, so you would not get anything; I cannot provide a working site that anyone can scp from):
> class(x)
[1] "raw"
then I try to unzip it
b<-unzip(x)
Error in unzip(x) : invalid zip name argument
I tried to decompress it in memory but with no luck - the output is still raw, not a file list
z<-memDecompress(x, type = "unknown")
> class(z)
[1] "raw"
Where is my error? What am I doing wrong? I have a vague feeling I need to save x to disk as a zip file and then use unzip, but I have no idea how to save the raw compressed value.
EDIT: I tried as well saving as a binary file via
f<-file("file.bin",open="wb") #or f<-file("file.zip",open="wb")
writeBin(x, f)
b <- unzip(f) #or b <- unzip("file.bin") or b <- unzip("file.zip")
and it produced a file after the first line, but after the second line the file is still empty and the unzip procedure returns the same zip name error
> class(f)
[1] "file" "connection"
> f
A connection with
description "file.zip"
class "file"
mode "wb"
text "binary"
opened "opened"
can read "no"
can write "yes"
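The error you are getting is not unexpected at all, because unzip expects a file path as its first argument, and you are passing a raw R vector, which is a vector of bytes. Passing the open connection f fails for the same reason. You can first write the raw vector to a file, close the connection so the bytes are flushed to disk, and then pass the file's path to unzip. Something like this:
x <- scp(host = "WM.net", path = "/wmdata.zip", user = "w", password = "wm")
zip_path <- "path/to/your/file.zip"
f <- file(zip_path, "wb")
writeBin(x, f)
close(f) # flush the bytes to disk before unzipping
b <- unzip(zip_path) # pass the path, not the connection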
This is not tested, but I wanted to point out the issues with how you were using the various APIs.
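A minimal sketch of the write-to-disk step, assuming x is a raw vector holding the bytes of a zip archive. The helper name write_raw_to_file is made up for illustration, and the four stand-in bytes below are only the zip signature, not a valid archive, so the unzip call is left commented out.

```r
# Hypothetical helper: write a raw vector to disk and return the file path
write_raw_to_file <- function(x, path = tempfile(fileext = ".zip")) {
  stopifnot(is.raw(x))
  con <- file(path, open = "wb")
  on.exit(close(con))   # closing flushes the bytes to disk
  writeBin(x, con)
  invisible(path)
}

# Stand-in bytes, only to show that the round trip preserves the data
bytes <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
p <- write_raw_to_file(bytes)
round_trip_ok <- identical(readBin(p, what = "raw", n = file.size(p)), bytes)

# With a real archive in x (for example the result of scp()):
# zip_path <- write_raw_to_file(x)
# files <- unzip(zip_path, exdir = tempfile())
```

unzip returns the paths of the extracted files, so files can be passed straight to whatever reads them.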
I have been trying to work this out but I have not been able to do it...
I want to create a data frame with four columns: country-number-year-(content of the .txt file)
There is a .zip file in the following URL:
https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT
The file contains a folder with 49 folders in it, and each of them contains 150 .txt files, give or take.
I first tried to download the zip file with get_dataset, but it did not work:
if (!require("dataverse")) devtools::install_github("iqss/dataverse-client-r")
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu")
"Error in get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu") :
Not Found (HTTP 404)."
Then I tried
temp <- tempfile()
download.file("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT",temp)
UNGDC <-unzip(temp, "UNGDC+1970-2018.zip")
It worked up to a point... I downloaded the .zip file and then created UNGDC, but nothing happened, because it only contains the following information:
UNGDC
A connection with
description "/var/folders/nl/ss_qsy090l78_tyycy03x0yh0000gn/T//RtmpTc3lvX/fileab730f392b3:UNGDC+1970-2018.zip"
class "unz"
mode "r"
text "text"
opened "closed"
can read "yes"
can write "yes"
Here I don't know what to do... I have not found relevant information on how to proceed. Can someone please give me some hints, or point me to a website where I can learn how to do it?
Thanks for your attention and help!!!
How about this? I used the zip package to unzip, but possibly the base unzip might work as well.
library(zip)
dir.create(temp <- tempfile())
url<-'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT'
download.file(url, paste0(temp, '/PZUURT.zip'), mode = 'wb')
unzip(paste0(temp, '/PZUURT.zip'), exdir = temp)
Note in particular I had to set the mode = 'wb' as I'm on a Windows machine.
I then saw that the unzipped archive had a _MACOSX folder and a Converted sessions folder. Assuming I don't need the MACOSX stuff, I did the following to get just the files I'm interested in:
root_folder <- paste0(temp,'/Converted sessions/')
filelist <- list.files(path = root_folder, pattern = '\\.txt$', recursive = TRUE)
filenames <- basename(filelist)
'filelist' contains the path of each text file relative to root_folder, while 'filenames' has just each file name, which I'll then break up to get the country, the number and the year:
df <- data.frame(t(sapply(strsplit(filenames, '_'),
function(x) c(x[1], x[2], substr(x[3], 1, 4)))))
colnames(df) <- c('Country', 'Number', 'Year')
Finally, I can read the text from each of the files and stick it into the dataframe as a new Text field:
df$Text <- sapply(paste0(root_folder, filelist), function(x) readChar(x, file.info(x)$size))
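For illustration, here is the same name-splitting step on two made-up file names. The pattern Country_Number_Year.txt is an assumption inferred from the code above and has not been checked against the archive; vapply is used only so that each column has a guaranteed type.

```r
# Made-up file names following the assumed Country_Number_Year.txt pattern
filenames <- c("ALB_25_1970.txt", "USA_73_2018.txt")

# Drop the extension, then split on the underscore
parts <- strsplit(tools::file_path_sans_ext(filenames), "_", fixed = TRUE)

df <- data.frame(
  Country = vapply(parts, `[`, character(1), 1),
  Number  = as.integer(vapply(parts, `[`, character(1), 2)),
  Year    = as.integer(vapply(parts, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
```

Storing Number and Year as integers makes later filtering and sorting by year straightforward.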
Consider a tar.gz file of a directory containing a lot of individual files.
From within R I can easily extract the names of the individual files with this command:
fileList <- untar("my_tar_dir.tar.gz", list = TRUE)
Using only R, is it possible to directly read/load a single one of those files into R (i.e. without first unpacking and writing the file to the disk)?
It is possible, but I don't know of any clean implementation (it may exist). Below is some very basic R code that should work in many cases (one limitation: file names, including the full path inside the archive, must be shorter than 100 characters). In a way, it just re-implements "untar" in an extremely crude way, but so that it points to the desired file inside the gzipped archive.
The first problem is that you should only read a gzipped file from the start. Using "seek()" to re-position the file pointer to the desired file is, unfortunately, erratic in a gzipped file.
ParseTGZ <- function(archname) {
  # open tgz archive
  tf <- gzfile(archname, open = 'rb')
  on.exit(close(tf))
  fnames <- list()
  offset <- 0
  nfile <- 0
  while (TRUE) {
    # go to beginning of entry
    # never use "seek" to re-locate in a gzipped file!
    if (seek(tf) != offset) readBin(tf, what = "raw", n = offset - seek(tf))
    # read file name
    fName <- rawToChar(readBin(tf, what = "raw", n = 100))
    if (nchar(fName) == 0) break
    nfile <- nfile + 1
    fnames <- c(fnames, fName)
    attr(fnames[[nfile]], "offset") <- offset + 512
    # read size, first skip 24 bytes (file permissions etc)
    # again, we only use readBin, not seek()
    readBin(tf, what = "raw", n = 24)
    # file size is encoded as a length 12 octal string,
    # with the last character being '\0' (so 11 actual characters)
    sz <- readChar(tf, nchars = 11)
    # convert string to number of bytes
    sz <- sum(as.numeric(strsplit(sz, '')[[1]]) * 8^(10:0))
    attr(fnames[[nfile]], "size") <- sz
    # cat(sprintf('entry %s, %i bytes\n', fName, sz))
    # go to the next entry
    # don't forget entry header (=512)
    offset <- offset + 512 * (ceiling(sz / 512) + 1)
  }
  # return a named list of characters strings with attributes?
  names(fnames) <- fnames
  return(fnames)
}
This will give you the exact position and length of all files in the tar.gz archive.
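To make the size arithmetic concrete, here is a small worked example on a made-up size field. It repeats the manual octal conversion used above, compares it with base R's strtoi, and computes how many bytes the entry occupies in the archive. The value "00000001750" is purely illustrative.

```r
# Size field as it would appear in a tar header: 11 octal digits (made-up value)
sz_field <- "00000001750"

# Manual conversion, as in ParseTGZ
digits <- as.numeric(strsplit(sz_field, "")[[1]])
size_manual <- sum(digits * 8^(10:0))

# Base R shortcut; returns NA for values beyond the integer range (about 2 GB)
size_strtoi <- strtoi(sz_field, base = 8L)

# Bytes occupied in the archive: one 512-byte header plus the data
# padded up to a multiple of 512 bytes
entry_span <- 512 * (ceiling(size_manual / 512) + 1)
```

Here both conversions give 1000 bytes, and the entry spans 1536 bytes. The manual version keeps working for files too large for strtoi, which is one reason to prefer it in this parser.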
Now the next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your files fit into memory.
extractTGZ <- function(archfile, filename) {
  # this function returns a raw vector
  # containing the desired file
  fp <- ParseTGZ(archfile)
  offset <- attributes(fp[[filename]])$offset
  fsize <- attributes(fp[[filename]])$size
  gzf <- gzfile(archfile, open = "rb")
  on.exit(close(gzf))
  # jump to the byte position, don't use seek()
  # may be a bad idea on really large archives...
  readBin(gzf, what = "raw", n = offset)
  # now read the data into a raw vector
  result <- readBin(gzf, what = "raw", n = fsize)
  result
}
Now, finally:
ff <- rawConnection(extractTGZ("myarchive", "myfile"))
Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
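A short sketch of reading from such an in-memory connection. The bytes here are a stand-in for the raw vector an extraction function would return; the tiny CSV content is made up.

```r
# Stand-in for the extracted bytes; here a tiny CSV
bytes <- charToRaw("id,value\n1,10\n2,20\n")

ff <- rawConnection(bytes)   # in-memory connection, nothing is written to disk
lines <- readLines(ff)
close(ff)

# Parse the lines as CSV
dat <- read.csv(text = lines)
```

Going through readLines and then read.csv(text = ...) is a conservative choice made for this sketch, since not every reader function accepts a binary-mode connection directly.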
One can read in a csv within an archive using library(archive) as follows (this should be a lot more elegant than the currently accepted answer, this package also supports all major archive formats - 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' & 'xz' and it works on all platforms):
library(archive)
library(readr)
read_csv(archive_read("my_tar_dir.tar.gz", file = 1), col_types = cols())
Typical example:
path <- "C:/test/path" # great
path <- "C:\\test\\path" # also great
path <- "C:\test\path"
Error: '\p' is an unrecognized escape in character string starting ""C:\test\p"
(of course - \t is actually an escape character.)
Is there any mark that can be used to treat the string as verbatim? Or can it be coded?
It would be really useful when copy/pasting path names in Windows...
R 4.0.0 introduces raw strings:
dir <- r"(c:\Program files\R)"
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
https://blog.revolutionanalytics.com/2020/04/r-400-is-released.html
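A quick check of what the raw string holds, using the path from the question. This needs R 4.0.0 or later.

```r
# Requires R >= 4.0.0
path <- r"(C:\test\path)"

print(path)      # print() shows the backslashes escaped (doubled)
cat(path, "\n")  # cat() shows the string as it is stored

# Square brackets and braces are accepted as delimiters too
path2 <- r"[C:\test\path (old)]"

# The raw string is the same value as the manually escaped one
same <- identical(path, "C:\\test\\path")
```

The doubled backslashes in the print() output are only a display convention; the string itself contains single backslashes.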
You can use scan (but only in an interactive session, not in a sourced script).
For example:
path=scan(what="",allowEscapes=F,nlines=1)
C:\test\path
print(path)
Then select everything and run it (Ctrl+A, then Ctrl+Enter) to get the result.
However, this does not work inside a function or a sourced script:
{
path=scan(what="character",allowEscapes=F,nlines=1)
C:\test\path
print(path)
}
This throws an error.
Maybe readline() or scan(what = "character"); both work in the terminal, but not in a script or function:
1. readline():
> path <- readline()
C:\test\path #paste your path, ENTER
> path
[1] "C:\\test\\path"
2. scan(what = "character"):
> path = scan(what = "character")
1: C:\test\path #paste, ENTER
2: #ENTER
#Read 1 item
> path
[1] "C:\\test\\path"
EDIT:
Try this:
1. Define a function getWindowsPath():
> getWindowsPath <- function() #define function
{
return(scan(file = "clipboard", what = "character"))
}
2. Copy the Windows path using CTRL+C:
#CTRL+C: C:\test\path
> getWindowsPath()
#Read 1 item
[1] "C:\\test\\path"
If you are copying and pasting in Windows, you can set up a file connection to the clipboard. Then you can use scan to read from it, with allowEscapes turned off. However, Windows allows spaces in file paths, and scan splits its input on whitespace, so you have to wrap the result in paste0 with collapse set to a single space to put the pieces back together.
x = file(description = "clipboard")
y = paste0(scan(file = x, what = "character", allowEscapes = F), collapse = " ")
Unfortunately, this only works for the path currently in the clipboard, so if you are copying and pasting lots of paths into an R script, this is not a solution. A workaround in that situation would be to paste each path into a separate text file and save it. Then, in your main script, you could run the following:
y = paste0(scan(file = "path1.txt", what = "character", allowEscapes = F), collapse = " ")
You would probably need one saved file for each path.
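As a possible alternative for the saved-file approach, readLines returns each line unchanged, so spaces and backslashes survive without any pasting. A small sketch, using a temporary file in place of path1.txt; the path written to it is made up.

```r
# Write a path containing a space to a temporary file (stand-in for path1.txt)
tmp <- tempfile(fileext = ".txt")
writeLines("C:\\Program Files\\R", tmp)   # the file contains: C:\Program Files\R

# readLines() returns the whole line as one string, spaces included
y <- readLines(tmp, n = 1)

# On Windows, readLines("clipboard", n = 1) should behave the same way
# for a copied path (not tested here)
```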
In Linux we can use the file command to get the file type based on the content of the file (not the extension). Is there any similar function in R?
Old question but maybe relevant for people getting here via google: You can use dqmagic, a wrapper around libmagic for R, to determine the file type based on the file's content. Since file uses the same library, the results are the same, e.g.:
library(dqmagic)
file_type("DESCRIPTION")
#> [1] "ASCII text"
file_type("src/file.cpp")
#> [1] "C source, ASCII text"
vs.
$ file DESCRIPTION src/file.cpp
DESCRIPTION: ASCII text
src/file.cpp: C source, ASCII text
Disclaimer: I am the author of the package.
dqmagic is not on CRAN. Below is an R solution which uses Linux's "file" command (actually BSD's 'file' v5.35 dated October 2018, packaged in Ubuntu 19.04, according to the man page):
file_full_path <- "/home/user/Documents/an_RTF_document.doc"
# shQuote() protects paths that contain spaces
file_mime_type <- system2(command = "file",
  args = c("-b", "--mime-type", shQuote(file_full_path)), stdout = TRUE) # "text/rtf"
# Gives the list of potentially allowed extensions for this mime type:
file_possible_ext <- system2(command = "file",
  args = c("-b", "--extension", shQuote(file_full_path)),
  stdout = TRUE) # "???". "doc/dot" for MsWord files.
It could be necessary to check that the actual extension is known to be a valid extension for the given mime type (for instance, readtext::readtext() reads an RTF file but fails if it is saved as *.doc).
file.basename <- basename(file_full_path)
file.base_without_ext <-sub(pattern = "(.*)\\..*$",
replacement = "\\1", file.basename)
file.nchar_ext <- nchar(file.basename) -
nchar(file.base_without_ext)-1 # 3 or 4 (doc, docx, odt...)
file_ext <- substring(file.basename, nchar(file.basename) -
file.nchar_ext +1) # doc, rtf...
if (file_mime_type == "text/rtf"){
file_possible_ext <- "rtf"
} # in some (all?) cases, for an rtf mime-type,
#'file' outputs "???" as allowed extension
# Returns TRUE if the actual extension is known to
# be a valid extension for the given mime type:
length(grep(file_ext, file_possible_ext, ignore.case = TRUE)) > 0
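The extension handling can also be sketched with the tools package that ships with R. The slash-separated format of the allowed-extension string ("doc/dot") is taken from the comment in the code above and is otherwise an assumption; the value is hard-coded here so the sketch runs without calling file.

```r
file_full_path <- "/home/user/Documents/an_RTF_document.doc"

# Extension and base name without manual string arithmetic
file_ext  <- tools::file_ext(file_full_path)
file_base <- tools::file_path_sans_ext(basename(file_full_path))

# Assumed format of the `file --extension` output: slash-separated list
file_possible_ext <- "doc/dot"
allowed <- strsplit(file_possible_ext, "/", fixed = TRUE)[[1]]

# Exact, case-insensitive comparison instead of a regex search
ext_ok <- tolower(file_ext) %in% tolower(allowed)
```

An exact comparison avoids partial matches, for example an extension "do" matching inside "doc/dot".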
Given the following code:
x <- 1
save(x, file = "x")
file.remove("x")
The file.remove() command successfully removes the x file. However, it returns TRUE to the R console. How do I keep it from doing that?
I've tried things like file.remove("x", silent = TRUE), but it seems that whatever I add to the function is interpreted as a file name, since the above returns cannot remove file 'TRUE', reason 'No such file or directory'.
Try wrapping the call with invisible
x <- 1
save(x, file = "x")
invisible(file.remove("x"))
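Two related options, sketched on temporary files: assigning the result to a variable also keeps it off the console while letting you check it, and unlink returns its status invisibly.

```r
tmp <- tempfile()
invisible(file.create(tmp))

# Assigning the result suppresses printing and keeps the status for checking
removed <- file.remove(tmp)
if (!removed) warning("could not remove ", tmp)

# unlink() returns 0 on success, invisibly, so nothing is printed
tmp2 <- tempfile()
invisible(file.create(tmp2))
status <- unlink(tmp2)
```

Keeping the return value is useful in scripts where a failed deletion should not pass silently.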