I would like to know the recommended way of reading a data.table from an archived file (a zip archive in my case). One obvious option is to unzip it to a temporary file and then fread() it as usual. I don't want to bother with creating a new file, so instead I use read.table() on an unz() connection and then convert the result with data.table():
mydt <- data.table(read.table(unz(myzipfilename, myfilename)))
This works fine, but read.table() is slow for big files, and fread() can't read an unz() connection directly. I'm wondering if there is a better solution.
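For reference, the temporary-file route I'm trying to avoid would look roughly like this (a sketch, assuming the archive holds a single csv):

library(data.table)

# extract to a temporary directory, then fread the extracted csv
csv_path <- unzip(myzipfilename, files = myfilename, exdir = tempdir())
mydt <- fread(csv_path)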
Look at: Read Ziped CSV File with fread
To avoid temp files you can use unzip with -p, which extracts files to a pipe (stdout) with no messages.
You can use a statement like this with fread:
x = fread('unzip -p test/allRequests.csv.zip')
Or with gunzip
x = fread('gunzip -cq test/allRequests.csv.gz')
You can also use grep or other tools.
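For example, a sketch combining the two (the file name and pattern are illustrative; the cmd= argument is available in newer data.table versions):

library(data.table)

# filter lines on the fly before fread parses them
x = fread(cmd = 'unzip -p test/allRequests.csv.zip | grep -v "^#"')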
Related
I have two notebook computers, and both have RStudio. When I use the fread function from the data.table library I need to specify the file (a .csv in this case) to read, and I usually just copy the path with Control+C and paste it with Control+V into the first argument of the function. On one of these notebooks this works perfectly, but on the other it just doesn't paste anything. Do I need to activate a module or something in my RStudio?
library(data.table)
test <- fread("does not paste anything", sep=";", dec=",")
I am trying to make my current project reproducible, and so am creating a master document (eventually a .rmd file) that will be used to call and execute several other documents. This way, I and other investigators only need to open and run one file.
There are three layers to the current setup: master file, 2 read-in files, 2 databases. The master file calls the read-in files using source(), and the read-in files parse the .csv databases and apply labels.
The read-in files and the databases are generated automatically with the data management software I'm currently using (REDCap) each time I download the updated data.
However, the read-in files have a line of code that removes all of the objects in my environment. I would like to edit the read-in files directly from the master file so that I do not have to open the read-in files individually each time I run my report. Specifically, since all the read-in files are the same, I would like to remove line #2 in each.
I've tried searching Google, and tried file.edit(), but have been unable to find anything. Not even sure it is possible, but figured I would ask. Let me know if I can improve this question or if you need any additional code to answer it. Thanks!
Current relevant master code (edited for generality):
source("read-in1")
source("read-in2")
Current relevant read-in file code (same in each file, except for the database name):
#Clear existing data and graphics
rm(list=ls())
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data
data=read.csv('database.csv')
#Setting Labels
[read-in code truncated]
Additional details:
OS: Windows 7 Professional x86
R version: 3.1.3
R Studio version: 0.99.441
You might try readLines() and something like the following (which was simplified greatly by a suggestion from @Hong Ooi below):
eval(parse(text = readLines("read-in1.R")[-2]))
My original solution, which was much more pedantic:
f <- file("read-in1.R", open="r")
t <- readLines(f)
close(f)
for (l in t[-2]) { eval(parse(text=l)) }
The for() loop just parses and evaluates each line from the text file except for the second one (that's what the -2 index value does). If you're reading and writing longer files, then the following will be much faster than the second option, though still less preferable than @Hong Ooi's:
f <- file("read-in1.R", open="r")
t <- readLines(f)
close(f)
f <- file("out.R", open="w")
o <- writeLines(t[-2], f)
close(f)
source("out.R")
Sorry I'm so late in noticing this question, but you may want to investigate getting access to the REDCap API and using either the redcapAPI package or the REDCapR package. Both of those packages will allow you to export the data from REDCap directly into R without having to use the download scripts. redcapAPI will even apply all the formats and dates (REDCapR might do this now too; it was in the plan, but I haven't used it in a while).
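A rough sketch of what that looks like; the URL and token below are placeholders for your own institution's values:

library(redcapAPI)

# connect to your institution's REDCap API (placeholder credentials)
rcon <- redcapConnection(url = "https://redcap.example.edu/api/",
                         token = "YOUR_API_TOKEN")
records <- exportRecords(rcon)   # applies factors, labels and dates by default

# or, with REDCapR (again, placeholder credentials)
library(REDCapR)
records <- redcap_read(redcap_uri = "https://redcap.example.edu/api/",
                       token = "YOUR_API_TOKEN")$data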
You could try this. It just calls some shell commands: (1) renames the file, then (2) copies all lines not containing rm(list=ls()) to a new file with the same name as the original, then (3) removes the renamed original.
files_to_change <- c("read-in1.R", "read-in2.R")
for (f in files_to_change) {
  old <- paste0(f, ".old")
  system(paste("cmd.exe /c ren", f, old))
  system(paste("cmd.exe /c findstr /v rm(list=ls())", old, ">", f))
  system(paste("cmd.exe /c del", old))
}
After calling this loop you should have
#Clear existing data and graphics
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data
data=read.csv('database.csv')
#Setting Labels
in your read-in*.R files. You could put this in a batch script
@echo off
ren "%~f1" "%~nx1.old"
findstr /v "rm(list=ls())" "%~f1.old" > "%~f1"
del "%~f1.old"
say, "example.bat", and call that in the same way using system.
I have a large CSV file (8.1 GB) that I'm trying to wrangle into R. I created the CSV using Python's csvkit in2csv, converted from a .txt file, but somehow the conversion led to null characters showing up in the file. I'm now getting this error when importing:
Error in fread("file.csv", nrows = 100) :
embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0'
I am able to import small chunks just fine with read.csv, though; that's because it allows for UTF-16 encoding via the fileEncoding argument.
test <- read.csv("file.csv", nrows=100, fileEncoding="UTF-16LE")
I don't dare try to import an 8 GB file with read.csv, though.
So I then tried the solution offered here, in which you use sed s/\\0//g file.csv > file2.csv to pull the nulls out. The command performed just fine and populated a new 8GB CSV file, but I received a nearly-identical error:
Error in fread("file2.csv", nrows = 100) :
embedded nul in string: 'ÿþr\0e\0c\0d\0_\0z\0i\0p\0c\0,\0p\0o\0s\0t\0_\0z\0i
So, that didn't work. I'm stumped at this point. Considering the size of the file, I can't use read.csv on the whole thing, and I'm not sure how to get rid of the nulls in the original CSV. I'm not even sure how the file got encoded as UTF-16. Any suggestions or advice would be greatly appreciated at this point.
Edit: I'm on a Windows machine.
If you're on linux/mac, try this
file <- "file.csv"
tt <- tempfile() # or tempfile(tmpdir="/dev/shm")
system(paste0("tr < ", file, " -d '\\000' >", tt))
fread(tt)
A possible option would be to install a bash emulator on your machine from http://win-bash.sourceforge.net/ and remove the null characters using Linux tools, as described, for example, here: Identifying and removing null characters in UNIX, or here: 'Embedded nul in string' error when importing csv with fread
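A rough sketch of that approach from within R, assuming the emulator (or Rtools) puts tr on the PATH and using placeholder file names:

library(data.table)

# delete the embedded nul bytes, then read the cleaned copy
system("tr -d '\\000' < file.csv > file_nonull.csv")
test <- fread("file_nonull.csv", nrows = 100)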
I think the nonsensical characters happen because the file is compressed. This is what I found when trying to read vcf.gz files. fread does not seem to support reading compressed files. See e.g. https://github.com/Rdatatable/data.table/issues/717
readLines() and read.table() support compressed files, but they are slower.
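For example (a sketch with an illustrative file name):

# base R readers accept a gz connection directly, just more slowly
df <- read.table(gzfile("data.csv.gz"), header = TRUE, sep = ",")
lines <- readLines(gzfile("data.csv.gz"))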
There are very similar questions about this topic, but none deals with this precisely in R.
I have a csv.gz file and I would like to "unzip" the file and have it as an ordinary *.csv file. I suppose one would go about it by first reading the csv.gz file and later creating the csv file itself via the write.csv command.
Here is what I have tried, among other things:
gz.file <- read.csv(gzfile(file.choose()), as.is = TRUE)
gives:
head(gz.file)
farmNo.milk.energy.vet.cows
1 1;862533;117894;21186;121
2 2;605764;72049;43910;80
3 3;865658;158466;54583;95
4 4;662331;66783;45469;87
5 5;1003444;101714;81625;125
6 6;923512;252408;96807;135
The result claims to be a data.frame but doesn't behave like one; what am I missing here?
class(gz.file)
[1] "data.frame"
Once it is read into memory I would like to have it as a pure csv file, so would write.csv be the solution?
write.csv(gz.file, file="PATH")
In recent versions of data.table, the fast csv reader fread gained support for csv.gz files. It automatically detects whether it needs to decompress based on the filename, so there is not much new to learn. The following should work:
library(data.table)
dt = fread("data.csv.gz")
This feature requires an extra, fortunately lightweight, dependency, as you can read in the ?fread manual:
Compressed files ending .gz and .bz2 are supported if the R.utils package is installed.
To write a compressed file, use the compress argument: fwrite(compress="gzip").
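A minimal round-trip sketch (object and file names are illustrative):

library(data.table)

dt = data.table(a = 1:3, b = letters[1:3])
fwrite(dt, "data.csv.gz", compress = "gzip")  # write gzip-compressed csv
dt2 = fread("data.csv.gz")                    # needs R.utils to decompress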
The tidyverse, particularly the readr package, has transparent support for gzip-compressed files (and a few others):
library(readr)
read_csv("file.csv.gz") -> d
# write uncompressed data
d %>% write_csv("file.csv")
I have a big data.frame that I want to write into a compressed CSV file. Is there any way to write the data directly into a CSV.TAR.GZ compressed file instead of performing separate write.csv/gzip steps, in order to reduce disk access?
Thanks.
Use gzfile (or bzfile for bzip2 archiving, or xzfile for xz archiving).
write.csv(mtcars, file=gzfile("mtcars.csv.gz"))
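The same pattern works for the other compressors, for example (file names are illustrative):

write.csv(mtcars, file=bzfile("mtcars.csv.bz2"))
write.csv(mtcars, file=xzfile("mtcars.csv.xz"))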
PS. If you only have one data frame, surely you don't need tar.