How can I read a .dat file separated by "::" in R

I have a text file with "::" as the separator.
When I try to read this file as below:
tmp <- fread("file.dat", sep="::")
tmp <- read.table("file.dat", sep="::")
I get an error message: fread() says 'sep' must be 'auto' or a single character, and read.table() says invalid 'sep' value: must be one byte.
How can I read this file?

You could try
fread("cat file.dat | tr -s :", sep = ":")
fread() allows a system call in its first argument. This one uses tr -s, the "squeeze repeats" option, which replaces each run of repeated : characters with a single :.
With this call, fread() may even detect the separator automatically, so naming sep might not be needed at all.
Using the same concept, another way you could go (with an example file "x.txt") is to do
writeLines("a::b::c", "x.txt")
read.table(text = system("cat x.txt | tr -s :", intern = TRUE), sep = ":")
#   V1 V2 V3
# 1  a  b  c
I'm not sure how this translates to Windows-based systems.
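If you need a shell-free route (for instance on Windows without tr), a rough sketch is to collapse the two-character separator in R itself before parsing; like tr -s, this assumes no field legitimately contains a single ":":
library(data.table)
lines <- readLines("file.dat")                                          # read the raw lines
tmp <- fread(text = gsub("::", ":", lines, fixed = TRUE), sep = ":")    # squeeze "::" to ":" and parse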

Related

Reading a txt file separated by "¬" in R [duplicate]

I'm trying to read a large file into R which is separated by the "not" sign (¬). What I normally do is change this symbol into semicolons using Text Edit and save it as a csv file, but this file is too large and my computer keeps crashing when I try to do so. I have tried the following options:
my_data <- read.delim("myfile.txt", header = TRUE, stringsAsFactors = FALSE, quote = "", sep = "\t")
which results in a dataframe with a single row. This makes sense, I know, since my file is not separated by tabs but by the not sign. However, when I try to change sep to ¬ or \¬, I get the following message:
Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, :
invalid 'sep' value: must be one byte
I have also tried with
my_data <- read.csv2(file.choose("myfile.txt"))
and
my_data <- read.table("myfile.txt", sep="\¬", quote="", comment.char="")
getting similar results. I have searched for options similar to mine, but this kind of separator is not commonly used.
You can try to read in a piped translation of it.
Setup:
writeLines("a¬b¬c\n1¬2¬3\n", "quux.csv")
The work:
read.csv(pipe("tr '¬' ',' < quux.csv"))
# a b c
# 1 1 2 3
If commas don't work for you, this works equally well with other replacement chars:
read.table(pipe("tr '¬' '\t' < quux.csv"), header = TRUE)
# a b c
# 1 1 2 3
The tr utility is available on all Linux distributions, should be available on macOS, and is included in Rtools for Windows (as well as Git Bash, if you have that).
If there is an issue using pipe, you can always use the tr tool to create another file (replacing your text-editor step):
system2("tr", c("¬", ","), stdin="quux.csv", stdout="quux2.csv")
read.csv("quux2.csv")
# a b c
# 1 1 2 3
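If you prefer data.table, the same tr translation can presumably be handed to fread() through its cmd argument, again relying on tr being on the PATH:
library(data.table)
fread(cmd = "tr '¬' ',' < quux.csv")   # pipe the translated text straight into fread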

In R, get the text and corresponding filename of every file in a ZIP archive

I have a bunch of ZIP archives that each contain a bunch of text files. I want to read all the text into memory, one string per file, and with each file tagged with the corresponding filename, but without removing the original ZIP files or writing all the contents to disk. (If writing temporary files is a must, they should be deleted once we're done reading them, or if processing is interrupted.)
For example, suppose you create a simple ZIP like this:
$ echo 'contents1' > file1
$ echo 'contents2' > file2
$ zip files.zip file1 file2
Then calling myfunction("files.zip") should return the same thing as list(file1 = "contents1\n", file2 = "contents2\n").
I currently use the following function, which uses Info-ZIP unzip. It works fine, except that its code to detect the end of one file and the beginning of another might trigger on file contents instead.
library(stringr)
slurp.zip = function(path)
# Extracts each file in the zip file at `path` as a single
# string. The names of the resulting list are set to the inner
# file names.
{lines = system2("unzip", c("-c", path), stdout = T)
 is.sep = str_detect(lines, "^ (?: inflating|extracting): ")
 chunks = lapply(
     split(lines[!is.sep], cumsum(is.sep)[!is.sep])[-1],
     function(chunk) paste(chunk, collapse = "\n"))
 fnames = str_match(lines[is.sep], "^ (?: inflating|extracting): (.+) $")
 stopifnot(!anyNA(fnames))
 names(chunks) = fnames[,2]
 chunks}
We can use unzip(..., list = TRUE) to get the file names in the archive, without actually extracting them. Then we can use unz to create connections to the files, which can be read using e.g. readLines or scan:
slurp.zip = function(path) {
  sapply(unzip(path, list = TRUE)$Name, function(x)
    paste0(readLines(unz(path, x)), collapse = '\n'),
    simplify = FALSE, USE.NAMES = TRUE)
}
dput(slurp.zip('files.zip'))
# list(file1 = "contents1\n", file2 = "contents2\n")
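As an untested side note: readLines() splits on line endings, so if the exact bytes matter (for instance a trailing newline like the one in "contents1\n"), one variant is to read each entry as raw bytes, using the sizes that unzip(list = TRUE) reports:
slurp.zip.raw <- function(path) {
  info <- unzip(path, list = TRUE)               # names and sizes of the entries
  out <- lapply(seq_len(nrow(info)), function(i) {
    con <- unz(path, info$Name[i], open = "rb")  # binary connection into the archive
    on.exit(close(con))
    rawToChar(readBin(con, what = "raw", n = info$Length[i]))
  })
  names(out) <- info$Name
  out
}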

using grep as command line tool in r data.table fread() - incorrect result

Hello,
This is my first posting here. I use the excellent R data.table package. I need to import a file without comment lines, but I don't see any option in fread() to get rid of comment lines that are spread across the file, not only at the beginning. To simplify, the file test.txt consists of 4 lines; comment lines begin with "#":
#A
A AA
A A#A
#A
I import data with fread() and then get rid of comment lines with grep (^#); everything works.
There is also an option to use grep inside fread() as a command line call instead of a single file name. (For the record, I am working on Windows, thus I have grep.exe in my project folder.) grep works with simple regular expressions as expected when I call it from R:
> system("grep # test.txt")
#A
A A#A
#A
> system("grep ^# test.txt")
#A
#A
But it ignores the beginning-of-line anchor "^" when called as a system command inside the fread() function:
> fread("grep # test.txt", sep = "\t", header = FALSE, fill = TRUE)
V1 V2
1: #A
2: A A#A
3: #A
> fread("grep ^# test.txt", sep = "\t", header = FALSE, fill = TRUE)
V1 V2
1: #A
2: A A#A
3: #A
Thus, grep.exe as well as grep() in R work as expected, but grep.exe called from fread() ignores the beginning-of-line anchor (I didn't try other regexes). What is wrong here?
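A likely explanation, not confirmed in the post: system() on Windows runs the command directly, while fread() passes it through cmd.exe, and cmd.exe treats ^ as its own escape character and strips it before grep.exe ever sees the pattern. A hedged sketch of two ways to protect the caret:
library(data.table)
fread('grep "^#" test.txt', sep = "\t", header = FALSE, fill = TRUE)   # quote the pattern so cmd.exe leaves ^ alone
fread("grep ^^# test.txt", sep = "\t", header = FALSE, fill = TRUE)    # or double the caret, cmd.exe's escape character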

how to read multiple specific columns out of compressed .csv file in R

I need a fast way to read multiple specific columns from a .csv file compressed as .tar.gz into a variable in R.
My approach:
con <- textConnection(system(paste("zcat ", filename.tar.gz, " | cut -d ; -f 1,2,3", sep = "")))
var <- read.csv(con, sep = ";")
It seems that R does not understand the pipe command, since zcat filename.tar.gz | cut -d ; -f 1,2,3 works on the console.
The error I'm getting in R:
[5] "cut.gz: No such file or directory"
[6] ";.gz: No such file or directory"
[7] "2.gz: No such file or directory"
1) pipe If we have a csv file named a.csv inside a.tar.gz, and it has 8 columns and we want to read the first 3 columns and ignore the rest (or, in place of using colClasses, use a pipeline in pipe as in your question):
read.csv(pipe("tar -xOzf a.tar.gz a.csv"), colClasses = rep(c(NA, "NULL"), c(3, 5)))
2) gsubfn To parameterize it, it could be written like this:
library(gsubfn)
Archive <- "a.tar.gz"
File <- "a.csv"
read.csv(fn$pipe("tar -xOzf $Archive $File"), colClasses = rep(c(NA, "NULL"), c(3, 5)))
3) fread The fread function in data.table can also be useful here. This uses Archive and File from (2). It has the advantage of not requiring knowledge of the number of columns. Also, fread handles shell commands directly, can usually figure out whether there are headers and what the separator is, and tends to be fast.
library(data.table)
library(gsubfn)
fn$fread("tar -xOzf $Archive $File", select = 1:3)
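As a side note not from the original answer, recent data.table versions also accept the shell command through the cmd argument, which saves fread from having to guess that the string is a command rather than a file name:
fread(cmd = "tar -xOzf a.tar.gz a.csv", select = 1:3)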

How to read \" double-quote escaped values with read.table in R

I am having trouble reading a file containing lines like the one below in R.
"_:b5507F4C7x59005","Fabiana D\"atri"
Any idea? How can I make read.table understand that \" is an escaped quote?
Cheers,
Alexandre
It seems to me that read.table/read.csv cannot handle escaped quotes.
...But I think I have an (ugly) work-around inspired by @nullglob:
First, read the file WITHOUT a quote character.
(This won't handle embedded commas, as @Ben Bolker noted.)
Then go through the string columns and remove the quotes:
The test file looks like this (I added a non-string column for good measure):
13,"foo","Fab D\"atri","bar"
21,"foo2","Fab D\"atri2","bar2"
And here is the code:
# Generate test file
writeLines(c("13,\"foo\",\"Fab D\\\"atri\",\"bar\"",
"21,\"foo2\",\"Fab D\\\"atri2\",\"bar2\"" ), "foo.txt")
# Read ignoring quotes
tbl <- read.table("foo.txt", as.is=TRUE, quote='', sep=',', header=FALSE, row.names=NULL)
# Go through and cleanup
for (i in seq_len(NCOL(tbl))) {
  if (is.character(tbl[[i]])) {
    x <- tbl[[i]]
    x <- substr(x, 2, nchar(x)-1)     # Remove surrounding quotes
    tbl[[i]] <- gsub('\\\\"', '"', x) # Unescape quotes
  }
}
The output is then correct:
> tbl
V1 V2 V3 V4
1 13 foo Fab D"atri bar
2 21 foo2 Fab D"atri2 bar2
On Linux/Unix (or on Windows with cygwin or GnuWin32), you can use sed to convert the escaped double quotes \" to doubled double quotes "" which can be handled well by read.csv:
p <- pipe(paste0('sed \'s/\\\\"/""/g\' "', FILENAME, '"'))
d <- read.csv(p, ...)
rm(p)
Effectively, the following sed command is used to preprocess the CSV input:
sed 's/\\"/""/g' file.csv
I don't call this beautiful, but at least you don't have to leave the R environment...
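A small hedged illustration of the whole round trip (the file name is made up):
writeLines('13,"foo","Fab D\\"atri","bar"', "esc.csv")            # backslash-escaped quote in the data
d <- read.csv(pipe('sed \'s/\\\\"/""/g\' esc.csv'), header = FALSE)
d$V3   # expected to contain: Fab D"atri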
My apologies ahead of time that this isn't more detailed -- I'm right in the middle of a code crunch.
You might consider using the scan() function. I created a simple sample file "sample.csv," which consists of:
V1,V2
"_:b5507F4C7x59005","Fabiana D\"atri"
Two quick possibilities are (with output commented so you can copy-paste to the command line):
test <- scan("sample.csv", sep=",", what='character',allowEscapes=TRUE)
## Read 4 items
test
##[1] "V1" "V2" "_:b5507F4C7x59005"
##[4] "Fabiana D\\atri\n"
or
test <- scan("sample.csv", sep=",", what='character',comment.char="\\")
## Read 4 items
test
## [1] "V1" "V2" "_:b5507F4C7x59005"
## [4] "Fabiana D\\atri\n"
You'll probably need to play around with it a little more to get what you want. And I see that you've already mentioned writeLines, so you may have already tried this. Either way, good luck!
I was able to get your example to work by setting the quote argument:
> read.csv('test.csv',quote="'",head=FALSE)
V1 V2
1 "_:b5507F4C7x59005" "Fabiana D\\"atri"
2 "_:b5507F4C7x59005" "Fabiana D\\"atri"
read_delim from package readr can handle escaped and doubled double quotes, using the arguments escape_double and escape_backslash.
For example, if our file escapes quotes by doubling them:
"quote""","hello"
1,2
then we use
read_delim(file, delim=',') # default escape_backslash=FALSE, escape_double=TRUE
If our file escapes quotes with a backslash:
"quote\"","hello"
1,2
we use
read_delim(file, delim=',', escape_double=FALSE, escape_backslash=TRUE)
With newer versions of the readr package, readr::read_delim() is the correct answer.
data = read_delim(filename, delim = "\t", quote = "\"",
                  escape_backslash = TRUE, escape_double = FALSE,
                  # The columns depend on your data
                  col_names = c("timeStart", "posEnd", "added", "removed"),
                  col_types = "nncc")
This should be fine with read.csv(). Take a look at the help for ?read.csv - the option for specifying the quote is quote = "....". In this case, though, there may be a problem: it seems that read.csv() prefers to see matching quotes.
I tried the same with read.table("sample.txt", header = FALSE, as.is = TRUE), with your text in sample.txt, and it seems to work. When all else fails with read.csv(), I tend to back up to read.table() and specify the parameters carefully.
