I can't get this to work. I want to replace every two-character occurrence in the first field of a CSV file with the occurrence plus an appended X, with whitespace removed. For example, both SA and SA followed by a space should map to SAX in the new file. Below is what I tried with sed (with help from an earlier question):
system( paste("sed ","'" ,' s/^GG/GGX/g; s/^GG\\s/GGX/g; s/^GP/GPX/g;
s/^GP\\s/GPX/g; s/^FG/FGX/g; s/^FG\\s/FGX/g; s/^SA/SAX/g; s/^SA\\s/SAX/g;
s/^TP/TPX/g; s/^TP\\s/TPX/g ',"'",' ./data/concat_csv.2 >
./data/concatenated_csv.2 ',sep=''))
I tried using the sQuote() function, but that doesn't help either. read.csv has problems handling the file because some lines contain too many separators and others too few.
I could try reading in and editing the file in pieces, but I don't know how to do that as a streaming process.
I really just want to edit the first field of the file using a system() call. The file is about 30GB.
Try the following on a sample line, like so:
echo "fi,second,third" | awk '{len = split($0,array,","); str = ""; for (i = 1; i <= len; ++i) if (i == 1) { m = split(array[i],array2,""); if (m == 2) {str = array[i]"X";} else {str = array[i]};} else str = str","array[i]; print str;}'
So you would call it from R using the following as input to the paste() call:
cat fileNameToBeRead | awk '{len = split($0,array,","); str = ""; for (i = 1; i <= len; ++i) if (i == 1) { m = split(array[i],array2,""); if (m == 2) {str = array[i]"X";} else {str = array[i]};} else str = str","array[i]; print str;}' > newFile
This code won't handle your whitespace requirement, though. Could you provide examples that demonstrate the sort of functionality you're looking for?
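The whitespace requirement can be covered with a single sed substitution. What follows is only a minimal sketch, assuming the file is comma-separated, that the codes to fix are exactly the five from the question (GG, GP, FG, SA, TP), and that a sed supporting -E (GNU or BSD) is available. fix_first_field is a hypothetical helper name:

```shell
# Hypothetical helper: append X to a listed two-letter code when it is the
# whole first field, dropping any whitespace around it. Assumes comma
# separators; first fields such as SAX or SAB are left untouched.
fix_first_field() {
  sed -E 's/^[[:space:]]*(GG|GP|FG|SA|TP)[[:space:]]*,/\1X,/'
}

# Small demonstration on three sample lines.
printf 'SA,1,2\nSA ,3,4\nSAX,5,6\n' | fix_first_field
```

The demonstration prints SAX,1,2 then SAX,3,4 then SAX,5,6. Because sed works line by line, the same substitution can be run over the 30GB file with input and output redirection, for example as the command string passed to system().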
In png(), the first argument is filename = "Rplot%03d.png", which causes files to be generated with ascending numbers. With ggsave, however, this doesn't work: the number always stays at the lowest value ("Rplot001.png") and this file is always overwritten.
Looking at the code of the grDevices functions (e.g. grDevices::png()), it appears that the automatic naming happens in functions that are called via .External().
Is there already an implementation of this file naming functionality in R such that it is accessible outside of the grDevices functions?
Edit:
Asked differently: is there a way to continue the automatic numbering after shutting off and restarting a device? For example, in this code, the two later files overwrite the earlier ones:
png(width = 100)
plot(1:10)
plot(1:10)
dev.off()
png(width = 1000)
plot(1:10)
plot(1:10)
dev.off()
You can write a function to do this. For example, how about simply adding a time stamp? Something like:
fname = function(basename = 'myfile', fileext = 'png'){
paste(basename, format(Sys.time(), " %b-%d-%Y %H-%M-%S."), fileext, sep="")
}
ggsave(fname())
Or, if you prefer sequential numbering, then something along the lines of
next_file = function(basename = 'myfile', fileext = 'png', filepath = '.'){
  old.fnames = grep(paste0(basename,' \\d+\\.', fileext,'$'),
                    list.files(filepath), value = TRUE)
  lastnum = gsub(paste0(basename,' (\\d+)\\.', fileext,'$'), '\\1', old.fnames)
  if (!length(lastnum)) {
    lastnum = 1
  } else {
    lastnum = sort(as.integer(lastnum), TRUE)[1] + 1L
  }
  return(paste0(basename, ' ', sprintf('%03i', lastnum), '.', fileext))
}
ggsave(next_file())
I have the following code to check whether a dataset exists and whether it contains data. The problem comes from the double quotes around the macro variable.
check_data_ready <- defmacro(tracking_sheet,table_df,table_name,
expr={if (exists(table_df) && is.data.frame(get(table_df)) && dim(table_df)==NULL) {tracking_sheet$DataReady[tracking_sheet$Table==table_name]<-'Ready'
} else {tracking_sheet$DataReady[tracking_sheet$Table==table_name]<-'Check!'}
})
check_data_ready(tracking_sheet,"df","Table")
The error message is:
Error in if (exists("df") && is.data.frame(get("df")) && dim("df") ==
So apparently dim("df") is not working; it should be dim(df). I tried to strip the double quotes with substr, taking characters 2 to 3, but it seems to count d as character 1 and f as character 2, so the double quotes are not counted as part of the string. I am lost here. How do I make the code work like this:
if (exists("df") && is.data.frame(get("df")) && dim(df) ==
I am trying to compare an input string with the strings present in a document, using strcmp. The strings are non-English. When the input string is in English, the output is correct, but for any Kannada (non-English) word the output is always the same. I am trying to write a program that checks whether the word is present in the database. Please advise on what the problem could be.
The calling function is as below:
str_kan = handles.InputBox.String;
res = strcmp('str_kan','s1.text')
if res == 1 then handles.InputBox.String = string( ' present')
abort
else
handles.InputBox.String = string( 'not present')
abort
end
The whole program is as below:
global s1
f=figure('figure_position',[400,50],'figure_size',[640,480],'auto_resize','on','background',[33],'figure_name','Graphic window number %d','dockable','off','infobar_visible','off','toolbar_visible','off','menubar_visible','off','default_axes','on','visible','off');
handles.dummy = 0;
handles.InputBox=uicontrol(f,'unit','normalized','BackgroundColor',[-1,-1,-1],'Enable','on','FontAngle','normal','FontName','Tunga','FontSize',[12],'FontUnits','points','FontWeight','normal','ForegroundColor',[-1,-1,-1],'HorizontalAlignment','left','ListboxTop',[],'Max',[1],'Min',[0],'Position',[0.0929487,0.6568182,0.4647436,0.1795455],'Relief','default','SliderStep',[0.01,0.1],'String','Enter a Kannada Word','Style','edit','Value',[0],'VerticalAlignment','middle','Visible','on','Tag','InputBox','Callback','')
handles.CheckDB=uicontrol(f,'unit','normalized','BackgroundColor',[-1,-1,-1],'Enable','on','FontAngle','normal','FontName','Tahoma','FontSize',[12],'FontUnits','points','FontWeight','normal','ForegroundColor',[-1,-1,-1],'HorizontalAlignment','center','ListboxTop',[],'Max',[1],'Min',[0],'Position',[0.1025641,0.4636364,0.4567308,0.1204545],'Relief','default','SliderStep',[0.01,0.1],'String','Check','Style','pushbutton','Value',[0],'VerticalAlignment','middle','Visible','on','Tag','CheckDB','Callback','CheckDB_callback(handles)')
f.visible = "on";
function CheckDB_callback(handles)
str_kan = handles.InputBox.String;
res = strcmp('str_kan','s1.text')
if res == 1 then handles.InputBox.String = string( ' present')
abort
else
handles.InputBox.String = string( 'not present')
abort
end
endfunction
Here is an example showing that, in Scilab, strcmp() does support UTF-8 extended characters:
--> strcmp(["aα" "aα" "aβ"], ["aα" "aβ" "aα"])
ans =
0. -1. 1.
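The problem in the originally posted code is the confusion between literal values such as "Hello" and variable names such as Hello = "my text", as already noted by PTRK. There, strcmp('str_kan','s1.text') compares the two quoted literals themselves, not the contents of the variables str_kan and s1.text.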
Is there a way to get the number of lines in a file without importing it?
So far, this is what I am doing:
myfiles <- list.files(pattern="*.dat")
myfilesContent <- lapply(myfiles, read.delim, header=F, quote="\"")
for (i in 1:length(myfiles)){
test[[i]] <- length(myfilesContent[[i]]$V1)
}
but this is too time-consuming, since each file is quite big.
You can count the number of newline characters (\n, will also work for \r\n on Windows) in a file. This will give you a correct answer iff:
There is a newline char at the end of last line (BTW, read.csv gives a warning if this doesn't hold)
The table does not contain a newline character in the data (e.g. within quotes)
It'll suffice to read the file in parts. Below I set a chunk (temporary buffer) size of 65536 bytes:
f <- file("filename.csv", open="rb")
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
nlines <- nlines + sum(chunk == as.raw(10L))
}
print(nlines)
close(f)
Benchmarks on a ca. 512 MB ASCII text file, 12101000 text lines, Linux:
readBin: ca. 2.4 s.
@luis_js's wc-based solution: 0.1 s.
read.delim: 39.6 s.
EDIT: reading a file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.
If you:
still want to avoid the system call that a system2("wc"… will cause
are on BSD/Linux or OS X (I didn't test the following on Windows)
don't mind using a full filename path
are comfortable using the inline package
then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc in an inline R C function):
library(inline)
wc.code <- "
uintmax_t linect = 0;
uintmax_t tlinect = 0;
int fd, len;
u_char *p;
struct statfs fsb;
static off_t buf_size = SMALL_BUF_SIZE;
static u_char small_buf[SMALL_BUF_SIZE];
static u_char *buf = small_buf;
PROTECT(f = AS_CHARACTER(f));
if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {
if (fstatfs(fd, &fsb)) {
fsb.f_iosize = SMALL_BUF_SIZE;
}
if (fsb.f_iosize != buf_size) {
if (buf != small_buf) {
free(buf);
}
if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
buf = small_buf;
buf_size = SMALL_BUF_SIZE;
} else {
buf_size = fsb.f_iosize;
}
}
while ((len = read(fd, buf, buf_size))) {
if (len == -1) {
(void)close(fd);
break;
}
for (p = buf; len--; ++p)
if (*p == '\\n')
++linect;
}
tlinect += linect;
(void)close(fd);
}
SEXP result;
PROTECT(result = NEW_INTEGER(1));
INTEGER(result)[0] = tlinect;
UNPROTECT(2);
return(result);
";
setCMethod("wc",
signature(f="character"),
wc.code,
includes=c("#include <stdlib.h>",
"#include <stdio.h>",
"#include <sys/param.h>",
"#include <sys/mount.h>",
"#include <sys/stat.h>",
"#include <ctype.h>",
"#include <err.h>",
"#include <errno.h>",
"#include <fcntl.h>",
"#include <locale.h>",
"#include <stdint.h>",
"#include <string.h>",
"#include <unistd.h>",
"#include <wchar.h>",
"#include <wctype.h>",
"#define SMALL_BUF_SIZE (1024 * 8)"),
language="C",
convention=".Call")
wc("FULLPATHTOFILE")
It'd be better as a package since it actually has to compile the first time through. But, it's here for reference if you really do need "speed". For a 189,955 line file I had lying around, I get (mean values from a bunch of runs):
user system elapsed
0.007 0.003 0.010
I found this easy way using the R.utils package:
library(R.utils)
sapply(myfiles,countLines)
here is how it works
Maybe I am missing something, but I usually do it by calling length on the result of readLines:
con <- file("some_file.format")
length(readLines(con))
This at least has worked with many cases I had. I think it's kinda fast and it does only create a connection to the file without importing it.
If you are using Linux, this might work for you:
# total lines on a file through system call to wc, and filtering with awk
target_file <- "your_file_name_here"
total_records <- as.integer(system2("wc",
args = c("-l",
target_file,
" | awk '{print $1}'"),
stdout = TRUE))
In your case:
#
lapply(myfiles, function(x){
as.integer(system2("wc",
args = c("-l",
x,
" | awk '{print $1}'"),
stdout = TRUE))
}
)
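The awk step is only needed because wc prints the file name after the count. A minimal sketch of an alternative, assuming a POSIX shell; count_lines is a hypothetical helper name. Reading the file from standard input makes wc print the count alone:

```shell
# Hypothetical helper: print the number of newline characters in a file.
count_lines() {
  # With input on stdin, wc -l prints no file name; tr strips the padding
  # (and the trailing newline) that wc emits around the number.
  wc -l < "$1" | tr -d '[:space:]'
}

# Demonstration on a temporary three-line file.
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"
count_lines "$tmp"
echo
rm -f "$tmp"
```

The demonstration prints 3. As with any approach based on wc -l, this counts newline characters, so a last line without a trailing newline is not counted.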
Here is another way, using the function peek_count_lines from the CRAN package fpeek. This function is coded in C++ and is pretty fast.
library(fpeek)
sapply(filenames, peek_count_lines)
I'm working on diffing some ldif files where each section begins with "dn: leaf,branch3,branch2,branch1,root". I would like the dn (distinguished name) for each section to be displayed, and the Unix diff utility has a feature to do that: --show-function-line=regular expression. However, the diff utility truncates the dn line in the output, which makes it harder to know the full path.
current command:
diff -U 0 --show-function-line="^dn\: .*" file1.ldif file2.ldif > deltas.txt
example output:
@@ -56 +56 @@ dn: administratorId=0,applicationName=pl
-previousLoginTime: 20120619180751Z
+previousLoginTime: 20120213173659Z
original dn:
dn: administratorId=0,applicationName=platform,nodeName=NODENAME
I would like the entire original line to be included in the output. Is there a way to do this?
Thanks,
Rusty
I solved it by editing the source code and recompiling.
In src/context.c, in print_context_function (FILE *out, char const *function),
I changed the line:
for (j = i; j < i + 40 && function[j] != '\n'; j++)
to
for (j = i; j < i + 100 && function[j] != '\n'; j++)
The "40" was limiting the number of characters output to 40, so I increased it to 100, which should be large enough for my needs. That check could probably be omitted entirely, and let it just check for function[j] != '\n', but I decided to leave it as is.
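If recompiling diff is not an option, the full line can also be restored by post-processing the output. This is a different technique from the patch above, and only a rough sketch: it assumes unified output with -U 0, hunk headers of the form @@ -N[,M] +N[,M] @@, and an old file small enough for awk to hold one string per line in memory. full_dn_diff is a hypothetical helper name:

```shell
# Hypothetical helper: unified diff whose hunk headers carry the full dn line.
full_dn_diff() {
  diff -U 0 "$1" "$2" | awk -v old="$1" '
    BEGIN {
      # Remember, for every line of the old file, the latest dn: line seen.
      while ((getline line < old) > 0) {
        n++
        if (line ~ /^dn: /) dn = line
        last[n] = dn
      }
    }
    /^@@ / {
      # $2 looks like -56 or -56,3: take the old-file start line number.
      split($2, part, ",")
      start = substr(part[1], 2) + 0
      print $1, $2, $3, $4, last[start]
      next
    }
    { print }
  '
}

# Usage: full_dn_diff file1.ldif file2.ldif > deltas.txt
```

Since no --show-function-line option is passed to diff here, the header text comes entirely from the lookup in the old file and is never truncated.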