Is there a way to get the number of lines in a file without importing it?
So far this is what I am doing:
myfiles <- list.files(pattern="*.dat")
myfilesContent <- lapply(myfiles, read.delim, header=F, quote="\"")
test <- vector("list", length(myfiles))
for (i in seq_along(myfiles)){
test[[i]] <- length(myfilesContent[[i]]$V1)
}
but this is too time-consuming since each file is quite big.
You can count the number of newline characters (\n; this also works for \r\n on Windows) in a file. This will give you the correct answer iff:
There is a newline character at the end of the last line (BTW, read.csv gives a warning if this doesn't hold)
The table does not contain a newline character in the data (e.g. within quotes)
It suffices to read the file in parts. Below I use a chunk (temporary buffer) size of 65536 bytes:
f <- file("filename.csv", open="rb")
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
nlines <- nlines + sum(chunk == as.raw(10L))
}
print(nlines)
close(f)
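If you need this for several files, the same logic wraps easily into a small helper of my own (identical chunked counting, just parameterized):
count_newlines <- function(path, chunk_size = 65536L) {
  f <- file(path, open = "rb")
  on.exit(close(f))
  nlines <- 0L
  # count LF bytes (0x0A) chunk by chunk, never loading the whole file
  while (length(chunk <- readBin(f, "raw", chunk_size)) > 0)
    nlines <- nlines + sum(chunk == as.raw(10L))
  nlines
}
sapply(myfiles, count_newlines)   # myfiles as in the question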
Benchmarks on a ca. 512 MB ASCII text file with 12,101,000 text lines, Linux:
readBin: ca. 2.4 s.
@luis_js's wc-based solution: 0.1 s.
read.delim: 39.6 s.
EDIT: reading the file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.
If you:
still want to avoid the system call that a system2("wc"… invocation will cause
are on BSD/Linux or OS X (I didn't test the following on Windows)
don't mind using a full filename path
are comfortable using the inline package
then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc in an inline R C function):
library(inline)
wc.code <- "
uintmax_t linect = 0;
uintmax_t tlinect = 0;
int fd, len;
u_char *p;
struct statfs fsb;
static off_t buf_size = SMALL_BUF_SIZE;
static u_char small_buf[SMALL_BUF_SIZE];
static u_char *buf = small_buf;
PROTECT(f = AS_CHARACTER(f));
if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {
if (fstatfs(fd, &fsb)) {
fsb.f_iosize = SMALL_BUF_SIZE;
}
if (fsb.f_iosize != buf_size) {
if (buf != small_buf) {
free(buf);
}
if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
buf = small_buf;
buf_size = SMALL_BUF_SIZE;
} else {
buf_size = fsb.f_iosize;
}
}
while ((len = read(fd, buf, buf_size))) {
if (len == -1) {
(void)close(fd);
break;
}
for (p = buf; len--; ++p)
if (*p == '\\n')
++linect;
}
tlinect += linect;
(void)close(fd);
}
SEXP result;
PROTECT(result = NEW_INTEGER(1));
INTEGER(result)[0] = tlinect;
UNPROTECT(2);
return(result);
";
setCMethod("wc",
signature(f="character"),
wc.code,
includes=c("#include <stdlib.h>",
"#include <stdio.h>",
"#include <sys/param.h>",
"#include <sys/mount.h>",
"#include <sys/stat.h>",
"#include <ctype.h>",
"#include <err.h>",
"#include <errno.h>",
"#include <fcntl.h>",
"#include <locale.h>",
"#include <stdint.h>",
"#include <string.h>",
"#include <unistd.h>",
"#include <wchar.h>",
"#include <wctype.h>",
"#define SMALL_BUF_SIZE (1024 * 8)"),
language="C",
convention=".Call")
wc("FULLPATHTOFILE")
It'd be better as a package since it actually has to compile the first time through. But it's here for reference if you really do need "speed". For a 189,955-line file I had lying around, I get (mean values from a bunch of runs):
user system elapsed
0.007 0.003 0.010
I found this easy way using the R.utils package:
library(R.utils)
sapply(myfiles,countLines)
Here is how it works: countLines reads the file in chunks and counts the line breaks without parsing the contents, much like the readBin approach above.
Maybe I am missing something, but usually I do it using length on top of readLines:
con <- file("some_file.format")
length(readLines(con))
close(con)
This at least has worked in many cases I've had. I think it's fairly fast, and it only creates a connection to the file without importing it.
If you are using Linux, this might work for you:
# total lines on a file through system call to wc, and filtering with awk
target_file <- "your_file_name_here"
total_records <- as.integer(system2("wc",
args = c("-l",
target_file,
" | awk '{print $1}'"),
stdout = TRUE))
In your case:
lapply(myfiles, function(x){
as.integer(system2("wc",
args = c("-l",
x,
" | awk '{print $1}'"),
stdout = TRUE))
}
)
Here is another way, with the CRAN package fpeek and its function peek_count_lines. This function is coded in C++ and is pretty fast.
library(fpeek)
sapply(filenames, peek_count_lines)
Related
I've written a function in Rcpp and compiled it with inline. Now I want to run it in parallel on different cores, but I'm getting a strange error. Here's a minimal example, where the function funCPP1 can be compiled and runs well by itself, but cannot be called by snow's clusterCall function. The function runs well as a single process, but gives the following error when run in parallel:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: NULL value passed as symbol address
And here is some code:
## Load and compile
library(inline)
library(Rcpp)
library(snow)
src1 <- '
Rcpp::NumericMatrix xbem(xbe);
int nrows = xbem.nrow();
Rcpp::NumericVector gv(g);
for (int i = 1; i < nrows; i++) {
xbem(i,_) = xbem(i-1,_) * gv[0] + xbem(i,_);
}
return xbem;
'
funCPP1 <- cxxfunction(signature(xbe = "numeric", g="numeric"),body = src1, plugin="Rcpp")
## Single process
A <- matrix(rnorm(400), 20,20)
funCPP1(A, 0.5)
## Parallel
cl <- makeCluster(2, type = "SOCK")
clusterExport(cl, 'funCPP1')
clusterCall(cl, funCPP1, A, 0.5)
Think it through -- what does inline do? It creates a C/C++ function for you, then compiles and links it into a dynamically-loadable shared library. Where does that one sit? In R's temp directory.
So you tried the right thing by shipping the R frontend calling that shared library to the other process (which has another temp directory!), but that does not get the dll / so file there.
Hence the advice is to create a local package, install it and have both snow processes load and call it.
(And as always: better quality answers may be had on the rcpp-devel list, which is read by more Rcpp contributors than SO is.)
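For reference, a minimal sketch of that package route (the package name myCxxPkg is my placeholder, and I assume the body of src1 has been saved to funCPP1.cpp with a // [[Rcpp::export]] wrapper):
library(Rcpp)
# one-time: generate a package skeleton around the C++ source
Rcpp.package.skeleton("myCxxPkg", cpp_files = "funCPP1.cpp")
# ... then R CMD build myCxxPkg and R CMD INSTALL the resulting tarball
library(myCxxPkg)                     # on the master
library(snow)
cl <- makeCluster(2, type = "SOCK")
clusterEvalQ(cl, library(myCxxPkg))   # each node loads its own compiled copy
clusterCall(cl, funCPP1, matrix(rnorm(400), 20, 20), 0.5)
stopCluster(cl)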
Old question, but I stumbled across it while looking through the top Rcpp tags, so maybe this answer will still be of use.
I think Dirk's answer is proper when the code you've written is fully debugged and does what you want, but it can be a hassle to write a new package for such a small piece of code as in the example. What you can do instead is export the code block plus a "helper" function that compiles the source code, and run the helper. That makes the CXX function available on every node; another helper function then calls it. For instance:
# snow must still be installed, but this functionality is now in "parallel", which ships with base R.
library(parallel)
# Keep your source as an object
src1 <- '
Rcpp::NumericMatrix xbem(xbe);
int nrows = xbem.nrow();
Rcpp::NumericVector gv(g);
for (int i = 1; i < nrows; i++) {
xbem(i,_) = xbem(i-1,_) * gv[0] + xbem(i,_);
}
return xbem;
'
# Save the signature
sig <- signature(xbe = "numeric", g="numeric")
# make a function that compiles the source, then assigns the compiled function
# to the global environment
c.inline <- function(name, sig, src){
library(Rcpp)
funCXX <- inline::cxxfunction(sig = sig, body = src, plugin="Rcpp")
assign(name, funCXX, envir=.GlobalEnv)
}
# and the function which retrieves and calls this newly-compiled function
c.namecall <- function(name,...){
funCXX <- get(name)
funCXX(...)
}
# Keep your example matrix
A <- matrix(rnorm(400), 20,20)
# What are we calling the compiled function?
fxname <- "TestCXX"
## Parallel
cl <- makeCluster(2, type = "PSOCK")
# Export all the pieces
clusterExport(cl, c("src1","c.inline","A","fxname"))
# Call the compiler function
clusterCall(cl, c.inline, name=fxname, sig=sig, src=src1)
# Notice how the function now named "TestCXX" is available in the environment
# of every node?
clusterCall(cl, ls, envir=.GlobalEnv)
# Call the function through our wrapper
clusterCall(cl, c.namecall, name=fxname, A, 0.5)
# Works with my testing
I've written a package ctools (shameless self-promotion) which wraps up a lot of the functionality that is in the parallel and Rhpc packages for cluster computing, both with PSOCK and MPI. I already have a function called "c.sourceCpp" which calls "Rcpp::sourceCpp" on every node in much the same way as above. I'm going to add in a "c.inlineCpp" which does the above now that I see the usefulness of it.
Edit:
In light of Coatless' comments, Rcpp::cppFunction() in fact negates the need for the c.inline helper here, though c.namecall is still needed.
src2 <- '
NumericMatrix TestCpp(NumericMatrix xbe, double g){
NumericMatrix xbem(xbe);
int nrows = xbem.nrow();
for (int i = 1; i < nrows; i++) {
xbem(i,_) = xbem(i-1,_) * g + xbem(i,_);
}
return xbem;
}
'
clusterCall(cl, Rcpp::cppFunction, code=src2, env=.GlobalEnv)
# Call the function through our wrapper
clusterCall(cl, c.namecall, name="TestCpp", A, 0.5)
I resolved it by sourcing, on each cluster node, an R file with the wanted C inline function:
clusterEvalQ(cl,
{
library(inline)
invisible(source("your_C_func.R"))
})
And your file your_C_func.R should contain the C function definition:
c_func <- cfunction(...)
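For concreteness, a toy your_C_func.R might look like this (the doubling function is my own stand-in; the real file holds whatever cfunction you need):
library(inline)
# trivial .Call-convention C function: multiply a numeric scalar by 2
c_func <- cfunction(c(x = "numeric"), "
  SEXP out;
  PROTECT(out = allocVector(REALSXP, 1));
  REAL(out)[0] = REAL(x)[0] * 2;
  UNPROTECT(1);
  return out;
", language = "C")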
Take this simple Python script for example:
#!/usr/bin/env python3
# /tmp/xxx.py
import time
for i in range(1000):
print(i)
time.sleep(1)
It continuously outputs numbers. I can call it from R like this:
system2("/tmp/xxx.py", stdin=?)
where stdin can be set to NULL, "", TRUE or a filename. But what I am looking for is a way to process these numbers in realtime. For example, whenever a number is printed from this python script, I want to multiply the number by Pi and then print it to the console. Is there a way to do this in R?
Not an expert, but I got something working.
First of all, I used the following executable Rscript /tmp/xxx.R instead of your Python script, as I found out Python was buffering its output (not printing one line at a time), which makes it hard to test:
#!/usr/bin/env Rscript
for (i in 1:5) {
cat(i, "\n")
Sys.sleep(1)
}
Then the R code:
system('mkfifo /tmp/xxx.fifo')
f <- fifo("/tmp/xxx.fifo", 'r')
p <- pipe('/tmp/xxx.R > /tmp/xxx.fifo; echo OVER > /tmp/xxx.fifo', 'w')
while(TRUE) {
line <- readLines(f, n = 1)
if (length(line) > 0) {
if (line == "OVER") break
cat(pi * as.numeric(line), "\n")
}
Sys.sleep(0.1)
}
close(f)
close(p)
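To tidy up afterwards, the fifo special file can be removed as well (my addition):
unlink("/tmp/xxx.fifo")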
Some of this was inspired by this answer: https://stackoverflow.com/a/5561188/1201032.
Hope it answers your question.
I can't get this to work. I want to replace all two-character occurrences in the first field of a csv file with the occurrence and an X appended, and whitespace removed. For example, "SA" and "SA " should both map to "SAX" in the new file. Below is what I tried with sed (from help through an earlier question):
system( paste("sed ","'" ,' s/^GG/GGX/g; s/^GG\\s/GGX/g; s/^GP/GPX/g;
s/^GP\\s/GPX/g; s/^FG/FGX/g; s/^FG\\s/FGX/g; s/^SA/SAX/g; s/^SA\\s/SAX/g;
s/^TP/TPX/g; s/^TP\\s/TPX/g ',"'",' ./data/concat_csv.2 >
./data/concatenated_csv.2 ',sep=''))
I tried using the sQuote() function, but this still doesn't help. The file has problems being handled by read.csv because there are errors within some fields, caused by too many or too few separators on certain lines.
I could try reading in and editing the file in pieces, but I don't know how to do that as a streaming process.
I really just want to edit the first field of the file using a system() call. The file is about 30GB.
Try the following on a file, like so:
echo "fi,second,third" | awk '{len = split($0,array,","); str = ""; for (i = 1; i <= len; ++i) if (i == 1) { m = split(array[i],array2,""); if (m == 2) {str = array[i]"X";} else {str = array[i]};} else str = str","array[i]; print str;}'
So you would call it from R using the following as input to the paste() call:
cat fileNameToBeRead | awk '{len = split($0,array,","); str = ""; for (i = 1; i <= len; ++i) if (i == 1) { m = split(array[i],array2,""); if (m == 2) {str = array[i]"X";} else {str = array[i]};} else str = str","array[i]; print str;}' > newFile
This code won't handle your whitespace requirement, though. Could you provide examples demonstrating the sort of functionality you're looking for?
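In the meantime, one way to drive that awk program from R without fighting sed's quoting (a sketch; the paths are from the question and I have not run this against a real 30GB file):
awk_prog <- '{len = split($0,array,","); str = ""; for (i = 1; i <= len; ++i) if (i == 1) { m = split(array[i],array2,""); if (m == 2) {str = array[i]"X";} else {str = array[i]};} else str = str","array[i]; print str;}'
system2("awk",
        args = c(shQuote(awk_prog), "./data/concat_csv.2"),
        stdout = "./data/concatenated_csv.2")   # redirect output to the new file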
I have (html-)texts and I want to change entities like &ouml; to real characters like ä, ü, ö, and so on, because otherwise the xml-package does not accept them.
So I wrote a little function which cycles through a replacement table (link1, link2) and replaces special character by special character by sp... the function looks like this (only looonger):
library(stringr)   # str_replace_all comes from stringr
html.charconv <- function(text){
replacer <- matrix(c(
"Á", "&Aacute;",
"á", "&aacute;",
"Â", "&Acirc;",
"â", "&acirc;",
"´", "&acute;"
)
,ncol=2,byrow=T)
for(i in 1:length(replacer[,1])){
text <- str_replace_all(text,replacer[i,2],replacer[i,1])
}
text
}
How might I speed this up? I thought about vectorization, but did not come up with any helpful solution, because each cycle's starting point is the result of the previous one.
You can get a significant speedup by constructing your function a bit differently and forgetting about the text tools. Basically you:
split the character string
match the characters you want and replace them by the new characters
paste everything together again
You can do that with the following function:
html.fastconv <- function(x,old,new){
xs <- strsplit(x,"&|;")
old <- gsub("&|;","",old)
xs <- lapply(xs,function(i){
id <- match(i,old,0L)
i[id!=0] <- new[id]
return(i)
})
sapply(xs,paste,collapse="")
}
This works as:
> sometext <- c("&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute problems et&acute; cetera",
+               "&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute problems et&acute; cetera")
> newchar <- c("Á","á","Â","â","´")
> oldchar <- c("&Aacute;","&aacute;","&Acirc;","&acirc;","&acute;")
> html.fastconv(sometext,oldchar,newchar)
[1] "Ádd somá leÂtterâ acute problems et´ cetera" "Ádd somá leÂtterâ acute problems et´ cetera"
For the record, some benchmarking:
require(rbenchmark)
benchmark(html.fastconv(sometext,oldchar,newchar),html.charconv(sometext),
columns=c("test","elapsed","relative"),
replications=1000)
test elapsed relative
2 html.charconv(sometext) 0.79 5.643
1 html.fastconv(sometext, oldchar, newchar) 0.14 1.000
Just for fun, here is a version based on Rcpp.
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
CharacterVector rcpp_conv(
CharacterVector text, CharacterVector old , CharacterVector new_){
int n = text.size() ;
int nr = old.size() ;
std::string buffer, current_old, current_new ;
size_t pos, current_size ;
CharacterVector res(n) ;
for( int i=0; i<n; i++){
buffer = text[i] ;
for( int j=0; j<nr; j++){
current_old = old[j] ;
current_size = current_old.size() ;
current_new = new_[j] ;
pos = 0 ;
pos = buffer.find( current_old ) ;
while( pos != std::string::npos ){
buffer.replace(
pos, current_size,
current_new
) ;
pos = buffer.find( current_old ) ;
}
}
res[i] = buffer ;
}
return res ;
}
For which I get quite a further performance gain:
> microbenchmark(
+ html.fastconv( sometext,oldchar,newchar),
+ html.fastconvJC(sometext, oldchar, newchar),
+ rcpp_conv( sometext, oldchar, newchar)
+ )
Unit: microseconds
                                          expr    min      lq   median      uq     max
1   html.fastconv(sometext, oldchar, newchar) 97.588 99.9845 101.4195 103.072 256.061
2 html.fastconvJC(sometext, oldchar, newchar) 19.945 23.3060  25.8110  28.134  40.647
3      rcpp_conv(sometext, oldchar, newchar)   4.047  5.1555   6.2340   9.275  25.763
Here is an implementation based on the Rcpp::String feature, available from Rcpp >= 0.10.2:
class StringConv{
public:
typedef String result_type ;
StringConv( CharacterVector old_, CharacterVector new__):
nr(old_.size()), old(old_), new_(new__){}
String operator()(String text) const {
for( int i=0; i<nr; i++){
text.replace_all( old[i], new_[i] ) ;
}
return text ;
}
private:
int nr ;
CharacterVector old ;
CharacterVector new_ ;
} ;
// [[Rcpp::export]]
CharacterVector test_sapply_string(
CharacterVector text, CharacterVector old , CharacterVector new_
){
CharacterVector res = sapply( text, StringConv( old, new_ ) ) ;
return res ;
}
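Assuming this snippet is saved in a .cpp file with the same #include <Rcpp.h> / using namespace Rcpp header as the first version, it would be compiled and used along these lines (the file name is my hypothetical):
library(Rcpp)
sourceCpp("string_conv.cpp")   # hypothetical file containing the snippet above
test_sapply_string(sometext, oldchar, newchar)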
I'm guessing that 36,000 file reads and writes are your bottleneck, and the way you code in R can't help much with that. Some things just take a while. Your function looks like it will work right; just let it run. There are a few small improvements you could make.
replacer <- matrix(c(
"Á", "&Aacute;",
"á", "&aacute;",
"Â", "&Acirc;",
"â", "&acirc;",
"´", "&acute;"
)
,ncol=2, byrow=T)
html.fastconvJC <- function(x,old,new){
n <- length(new)
s <- x #make a copy cause I'm scared of scoping in R :)
for (i in 1:n) s <- gsub(old[i], new[i], s, fixed = TRUE)
s
}
# borrowing the strings from Joris Meys
benchmark(html.fastconvJC(sometext, replacer[,2], replacer[,1]),
html.charconv(sometext), columns = c("test", "elapsed", "relative"),
replications=1000)
test elapsed relative
2 html.charconv(sometext) 0.727 17.31
1 html.fastconvJC(sometext, replacer[, 2], replacer[, 1]) 0.042 1.00
And they increased speed more than I expected. Note that a huge part of that speedup comes from fixed = TRUE; otherwise Joris Meys' answer comes in at about the same speed.
If this doesn't get you far in overall speed, you know your bottleneck is elsewhere, likely the file reads and writes. Unless you have solid state or RAID drives, running this in parallel isn't going to speed anything up and might just slow it down.
I would try with plyr:
input.data <- llply(input.files, html.charconv, .parallel=TRUE)
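Note that .parallel = TRUE needs a registered foreach backend first; a sketch using doMC (my assumed choice of backend, Unix-alikes only):
library(plyr)
library(doMC)           # assumption: doMC is installed
registerDoMC(cores = 2) # register before any llply(..., .parallel = TRUE) call
input.data <- llply(input.files, html.charconv, .parallel = TRUE)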
I'm working on diffing some ldif files where each section begins with "dn: leaf,branch3,branch2,branch1,root". I would like the dn (distinguished name) for each section to be displayed, and the Unix diff utility has a feature to do that: --show-function-line=RE, where RE is a regular expression. However, diff truncates the dn line in the output, which makes it harder to know the full path.
current command:
diff -U 0 --show-function-line="^dn\: .*" file1.ldif file2.ldif > deltas.txt
example output:
@@ -56 +56 @@ dn: administratorId=0,applicationName=pl
-previousLoginTime: 20120619180751Z
+previousLoginTime: 20120213173659Z
original dn:
dn: administratorId=0,applicationName=platform,nodeName=NODENAME
I would like the entire original line to be included in the output. Is there a way to do this?
I solved it by editing the source code and recompiling.
In src/context.c, in print_context_function (FILE *out, char const *function), I changed the line:
for (j = i; j < i + 40 && function[j] != '\n'; j++)
to
for (j = i; j < i + 100 && function[j] != '\n'; j++)
The "40" was limiting the number of characters output to 40, so I increased it to 100, which should be large enough for my needs. That check could probably be omitted entirely, and let it just check for function[j] != '\n', but I decided to leave it as is.