I am using grep to count the number of occurrences of string "^mj" in a file graph.tcl.
The command I am using is simple:
grep "^mj " mjwork/run/graph.tcl | wc -l
It outputs 46625, but only after about 45 minutes. Can you suggest a faster approach?
The following line might make it faster:
$ awk '/^mj/{c++}END{print c}' file
This processes the file in a single pass and prints only the number of matching lines. In contrast, your original command has grep write every matching line to a pipe, which wc then has to read and count.
In the end, you could also just do:
$ grep -c '^mj' file
which just returns the number of matching lines. This is probably even faster than the awk version: by default awk splits every line into fields, which grep does not need to do.
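For comparison, the single-pass counting that awk and grep -c perform can be sketched in C. This is only an illustration under my own assumptions: count_prefix_lines is a hypothetical helper name, and it reads through stdio rather than doing anything clever with buffers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count the lines of fp that begin with prefix (hypothetical helper).
 * The stream is read once, character by character, and nothing is printed. */
long count_prefix_lines(FILE *fp, const char *prefix)
{
    assert(fp != NULL && prefix != NULL && prefix[0] != '\0');

    size_t plen = strlen(prefix);
    size_t matched = 0;   /* prefix bytes matched so far on this line */
    int live = 1;         /* can the current line still match? */
    long count = 0;
    int c;

    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {              /* a new line starts: reset the state */
            matched = 0;
            live = 1;
        } else if (live) {
            if (c == prefix[matched]) {
                if (++matched == plen) {
                    count++;          /* whole prefix seen at line start */
                    live = 0;         /* ignore the rest of the line */
                }
            } else {
                live = 0;             /* mismatch: skip to the next line */
            }
        }
    }
    return count;
}
```

Calling count_prefix_lines(fp, "mj ") on a stream opened with fopen() would give the same number as grep -c '^mj ' on that file.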
There are many possible reasons for the slowness: heavy load on the disk, a slow NFS mount, extremely long lines to parse, and so on. Without more information about the input file and the system you are running this on, it is hard to say which one applies.
Sounds like something is up with your machine. Do you have enough swap space? What does df -h show? As a test, try egrep or fgrep as alternatives to grep.
You could try this small C program that I just wrote.
#define _FILE_OFFSET_BITS 64
#include <string.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
const char needle[] = "mj";
int main(int argc, char * argv[]) {
int fd, res, count;
off_t i; // file offsets can exceed the range of int
struct stat st;
char * data;
if (argc != 2) {
fprintf(stderr, "Syntax: %s file\n", *argv);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
fprintf(stderr, "Couldn't open file \"%s\": %s\n", argv[1], strerror(errno));
return 1;
}
res = fstat(fd, &st);
if (res < 0) {
fprintf(stderr, "Failed at fstat: %s\n", strerror(errno));
return 1;
}
if (!S_ISREG(st.st_mode)) {
fprintf(stderr, "File \"%s\" is not a regular file.\n", argv[1]);
return 1;
}
data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
if (data == MAP_FAILED) { // mmap reports failure with MAP_FAILED, not NULL
fprintf(stderr, "mmap failed!: %s\n", strerror(errno));
return 1;
}
count = 0;
for (i = 0; i < st.st_size; i++) {
// look for string:
if (i + sizeof needle - 1 <= st.st_size
&& !memcmp(data + i, needle, sizeof needle - 1)) {
count++;
i += sizeof needle - 1;
}
while (i < st.st_size && data[i] != '\n') // check the bound before reading data[i]
i++;
}
printf("%d\n", count);
return 0;
}
Compile it with: gcc grepmj.c -o grepmj -O2
I am trying to filter a huge text file line by line, which pure R is not very good at. So I wrote a C function that will hopefully speed up the process. Below is a minimal working example, filter.c, for demonstration purposes only.
So far I have tried the .C interface without luck. Here is my attempt.
Build lfilter.so using gcc -shared -o lfilter.so -fPIC filter.c
dyn.load("lfilter.so")
.C("filter", as.character("I1.txt"), as.character("I1.out.txt"), as.character("filter.txt"))
R crashes at the third step. Unfortunately, I have to stay within R.
Any help or suggestions are welcome.
filter.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LL 256
int get_row(char *filename)
{
char line[LL];
int i = 0;
FILE *stream = fopen(filename, "r");
while (fgets(line, LL, stream))
{
i++;
}
fclose(stream);
return i;
}
void filter(char *R1_in,
char *R1_out,
char *filter)
{
char R1_line[LL];
FILE *R1_stream = fopen(R1_in, "r");
FILE *R1_out_stream = fopen(R1_out,"w");
/*****************loading filters*******************/
int nrows = get_row(filter);
FILE *filter_stream = fopen(filter, "r");
char **filter_list = (char **)malloc(nrows * sizeof(*filter_list));
for(int i = 0; i <nrows; i++)
{
filter_list[i] = malloc(LL * sizeof(char));
fgets(filter_list[i], LL, filter_stream);
}
fclose(filter_stream);
/*****************filtering*******************/
while (fgets(R1_line, LL, R1_stream))
{
// printf("%s", R1_line);
for(int i = 0; i<nrows; i++)
{
if(strcmp(R1_line, filter_list[i])==0)
{
fprintf(R1_out_stream, "%s", R1_line);
break;
}
}
}
printf("\n");
for(int i=0; i<nrows; i++)
{
free(filter_list[i]);
}
free(filter_list);
fclose(R1_stream);
fclose(R1_out_stream);
}
// int main()
// {
// char R1_in[] = "I1.txt";
// char R1_out[] = "I1.out.txt";
//
// char filters[] = "filter.txt";
//
// filter(R1_in, R1_out, filters);
// return 0;
// }
I1.txt
aa
baddf
ca
daa
filter.txt
ca
cb
Expected Output I1.out.txt
ca
I had never used R before. But, I was a bit intrigued. So, I installed R and did a little research.
Everything in R [using the .C interface] is passed to the C function as a pointer.
From: https://www.r-bloggers.com/2014/02/three-ways-to-call-cc-from-r/ we have:
Inside a running R session, the .C interface allows objects to be directly accessed in an R session’s active memory. Thus, to write a compatible C function, all arguments must be pointers. No matter the nature of your function’s return value, it too must be handled using pointers. The C function you will write is effectively a subroutine.
So, if we pass an integer, the C function argument must be:
int *
I took a guess that:
char *
Needed to be:
char **
And, then tested it with:
#include <stdio.h>
#define SHOW(_sym) \
show(#_sym,_sym)
static void
show(const char *sym,char **ptr)
{
char *str;
printf("%s: ptr=%p",sym,ptr);
str = *ptr;
printf(" str=%p",str);
printf(" '%s'\n",str);
}
void
filter(char **R1_in,char **R1_out,char **filt)
{
SHOW(R1_in);
SHOW(R1_out);
SHOW(filt);
}
Here is the output:
> dyn.load("filter.so");
> .C("filter",
+ as.character("abc"),
+ as.character("def"),
+ as.character("ghi"))
R1_in: ptr=0x55a9f8cb1798 str=0x55a9f9de9760 'abc'
R1_out: ptr=0x55a9f8cb1818 str=0x55a9f9de9728 'def'
filt: ptr=0x55a9f8cb1898 str=0x55a9f9de96f0 'ghi'
[[1]]
[1] "abc"
[[2]]
[1] "def"
[[3]]
[1] "ghi"
> q()
So, you want:
void
filter(char **R1_in, char **R1_out, char **filt)
{
FILE *R1_stream = fopen(*R1_in, "r");
// ...
}
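Putting the char ** signature together with the filtering logic from the question gives something like the sketch below. It is not the original poster's final code. The NULL checks, the growing filter array, and the stripping of trailing newlines (so that a last line without a newline still matches) are my own additions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LL 256

/* Called from R via .C(), so every argument arrives as a pointer:
 * a character vector becomes char **, and its first string is *arg. */
void filter(char **R1_in, char **R1_out, char **filt)
{
    FILE *in = fopen(*R1_in, "r");
    FILE *out = fopen(*R1_out, "w");
    FILE *fl = fopen(*filt, "r");
    char (*filters)[LL] = NULL;   /* growable array of filter lines */
    size_t n = 0, cap = 0;
    char line[LL];

    if (!in || !out || !fl)
        goto done;                /* never hand a NULL stream to fgets() */

    /* Load the filter lines in one pass, growing the array as needed. */
    while (fgets(line, LL, fl)) {
        if (n == cap) {
            size_t ncap = cap ? cap * 2 : 16;
            void *tmp = realloc(filters, ncap * sizeof *filters);
            if (!tmp)
                goto done;
            filters = tmp;
            cap = ncap;
        }
        line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline */
        strcpy(filters[n++], line);
    }

    /* Copy every input line that equals one of the filter lines. */
    while (fgets(line, LL, in)) {
        line[strcspn(line, "\n")] = '\0';
        for (size_t i = 0; i < n; i++) {
            if (strcmp(line, filters[i]) == 0) {
                fprintf(out, "%s\n", line);
                break;
            }
        }
    }

done:
    free(filters);
    if (in) fclose(in);
    if (out) fclose(out);
    if (fl) fclose(fl);
}
```

With the I1.txt and filter.txt files from the question, this should write a single line, ca, to I1.out.txt.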
I want to write a tree command for xv6. In case you don't know it, tree lists directories recursively on the terminal. This is probably easy for you, but my code so far is:
#include "types.h"
#include "stat.h"
#include "user.h"
#include "fcntl.h"
#include "fs.h"
#include "file.h"
int
main(int argc, char *argv[])
{
if(argc < 2){
printf(2, "Usage: tree [path]...\n");
exit();
}
tree(argv[1]);
int fd = open(argv[1],O_RDONLY);
if(fd<0)
return -1;
struct dirent dir;
while(read(fd,&dir,sizeof(dir))!=0){
printf(1,"|_ %d,%d",dir.name,dir.inum);
//struct stat *st;
struct inode ip;
ip= getinode(dir.inum);
if(ip.type==T_DIR){
int i;
for(i=0;i<NDIRECT;i++ ){
uint add=ip.addrs[i];
printf(1,"%d",add);
}
}
}
return 0;
}
It gives me numerous errors on the terminal, the first being file.h:17:20: error: field ‘lock’ has incomplete type
struct sleeplock lock; // protects everything below here
^~~~
I searched for sleeplock and there is nothing like that in my code. What is wrong with the code? Thank you for your help.
You cannot use kernel headers (like file.h) in user code. To use kernel functionality from your code, you must go through system calls.
To achieve what you want, you could start from the ls function and make it recursive.
Here is a quick example:
I added a parameter to the ls function that carries the recursion depth (used for indentation),
and the function calls itself on each directory entry except the first two, which are . and ..
void
ls(char *path, int decal)
{
char buf[512], *p;
int fd, i, skip = 2;
struct dirent de;
struct stat st;
if((fd = open(path, 0)) < 0){
printf(2, "tree: cannot open %s\n", path);
return;
}
if(fstat(fd, &st) < 0){
printf(2, "tree: cannot stat %s\n", path);
close(fd);
return;
}
switch(st.type){
case T_FILE:
for (i = 0; i < decal; i++)
printf(1, " ");
printf(1, "%s %d %d %d\n", fmtname(path), st.type, st.ino, st.size);
break;
case T_DIR:
if(strlen(path) + 1 + DIRSIZ + 1 > sizeof buf){
printf(1, "tree: path too long\n");
break;
}
strcpy(buf, path);
p = buf+strlen(buf);
*p++ = '/';
while(read(fd, &de, sizeof(de)) == sizeof(de)){
if(de.inum == 0)
continue;
memmove(p, de.name, DIRSIZ);
p[DIRSIZ] = 0;
if(stat(buf, &st) < 0){
printf(1, "tree: cannot stat %s\n", buf);
continue;
}
for (i = 0; i < decal; i++)
printf(1, " ");
printf(1, "%s %d %d %d\n", fmtname(buf), st.type, st.ino, st.size);
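if (skip)
skip--; // the first two entries are . and ..
else if (st.type == T_DIR)
ls(buf, decal+1); // recurse into directories only; files were already printed above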
}
break;
}
close(fd);
}
I have recently started programming in a UNIX environment. I need to write a program that creates an empty file whose name and size are given on the command line, like this:
gcc foo.c -o foo.o
./foo.o result.txt 1000
Here result.txt means the name of the newly created file, and 1000 means the size of the file in bytes.
I know that the lseek function moves the file offset, but whenever I run the program it creates a file with the given name whose size is 0.
Here is the code of my small program.
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <ctype.h>
#include <sys/types.h>
#include <sys/param.h>
#include <sys/stat.h>
int main(int argc, char **argv)
{
int fd;
char *file_name;
off_t bytes;
mode_t mode;
if (argc < 3)
{
perror("There is not enough command-line arguments.");
//return 1;
}
file_name = argv[1];
bytes = atoi(argv[2]);
mode = S_IWUSR | S_IWGRP | S_IWOTH;
if ((fd = creat(file_name, mode)) < 0)
{
perror("File creation error.");
//return 1;
}
if (lseek(fd, bytes, SEEK_SET) == -1)
{
perror("Lseek function error.");
//return 1;
}
close(fd);
return 0;
}
If you aren't allowed to use any other functions to assist in creating a "blank" text file, why not change your file mode on creat() then loop-and-write:
int fd = creat(file_name, 0666);
for (int i=0; i < bytes; i++) {
int wbytes = write(fd, " ", 1);
if (wbytes < 0) {
perror("write error");
return 1;
}
}
You'll want some additional checks here, but that is the general idea.
I don't know what's acceptable in your situation, but you could even just add a write() call after lseek():
// XXX edit to include write
if ((fd = creat(file_name, 0666)) < 0) {
perror("File creation error");
//return 1;
}
// XXX seek to bytes - 1
if (lseek(fd, bytes - 1, SEEK_SET) == -1) {
perror("lseek() error");
//return 1;
}
// add this call to write a single byte at the position set by lseek
if (write(fd, " ", 1) == -1) {
perror("write() error");
//return 1;
}
close(fd);
return 0;
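The reason the original program produced a 0-byte file is that lseek() only moves the file offset. The size changes once something is written at that offset. A minimal sketch of the seek-then-write idea with the error returns enabled follows. The names create_sized_file and file_size are hypothetical helpers, and using open() with mode 0644 instead of creat() is my assumption, not the poster's code.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) name so that it is exactly size bytes long.
 * Returns 0 on success and -1 on error, with errno set by the failing call. */
int create_sized_file(const char *name, off_t size)
{
    assert(name != NULL && size >= 0);

    int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Seek to the last byte and write one NUL there; the bytes before it
     * read back as zeros. A size of 0 needs no write at all. */
    if (size > 0 &&
        (lseek(fd, size - 1, SEEK_SET) == (off_t)-1 ||
         write(fd, "", 1) != 1)) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Report the size of name in bytes, or -1 if it cannot be examined. */
off_t file_size(const char *name)
{
    struct stat st;
    return stat(name, &st) == 0 ? st.st_size : (off_t)-1;
}
```

After create_sized_file("result.txt", 1000) succeeds, file_size("result.txt") reports 1000.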
I need to create three child processes, each of which reads a string from the command line arguments and writes the string to a single pipe. The parent would then read the strings from the pipe and display all three of them on the screen. I tried doing it for two processes to test and it is printing one of the strings twice as opposed to both of them.
#include <stdio.h>
#include <unistd.h>
int main (int argc, char *argv[]) {
char *character1 = argv[1];
char *character2 = argv[2];
char inbuf[100]; //creating an array with a max size of 100
int p[2]; // Pipe descriptor array
pid_t pid1; // defining pid1 of type pid_t
pid_t pid2; // defining pid2 of type pid_t
if (pipe(p) == -1) {
fprintf(stderr, "Pipe Failed"); // pipe fail
}
pid1 = fork(); // fork
if (pid1 < 0) {
fprintf(stderr, "Fork Failed"); // fork fail
}
else if (pid1 == 0){ // if child process 1
close(p[0]); // close the read end
write(p[1], character1, sizeof(&inbuf[0])); // write character 1 to the pipe
}
else { // if parent, create a second child process, child process 2
pid2 = fork();
if (pid2 < 0) {
fprintf(stderr, "Fork Failed"); // fork fail
}
if (pid2 = 0) { // if child process 2
close(p[0]); // close the read end
write(p[1], character2, sizeof(&inbuf[0])); // write character 2 to the pipe
}
else { // if parent process
close(p[1]); // close the write end
read(p[0], inbuf, sizeof(&inbuf[0])); // Read the pipe that both children write to
printf("%s\n", inbuf); // print
read(p[0], inbuf, sizeof(&inbuf[0])); // Read the pipe that both children write to
printf("%s\n", inbuf); // print
}
}
}
Your code doesn't keep looping until there's no more data to read. It does a single read. It also doesn't check the value returned by read(), but it should.
I've abstracted the fork() and write() (and error check) code into a function. This seems to work:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
static void child(int fd, const char *string)
{
pid_t pid = fork();
int len = strlen(string);
if (pid < 0)
{
fprintf(stderr, "%.5d: failed to fork (%d: %s)\n",
(int)getpid(), errno, strerror(errno));
exit(1);
}
else if (pid > 0)
return;
else if (write(fd, string, len) != len)
{
fprintf(stderr, "%.5d: failed to write on pipe %d (%d: %s)\n",
(int)getpid(), fd, errno, strerror(errno));
exit(1);
}
else
exit(0);
}
int main (int argc, char *argv[])
{
char inbuf[100]; //creating an array with a max size of 100
int p[2]; // Pipe descriptor array
if (argc != 4)
{
fprintf(stderr, "Usage: %s str1 str2 str3\n", argv[0]);
return 1;
}
if (pipe(p) == -1)
{
fprintf(stderr, "Pipe Failed"); // pipe fail
return 1;
}
for (int i = 0; i < 3; i++)
child(p[1], argv[i+1]);
int nbytes;
close(p[1]); // close the write end
while ((nbytes = read(p[0], inbuf, sizeof(inbuf))) > 0)
printf("%.*s\n", nbytes, inbuf); // print
return 0;
}
I ran the program several times, each time with the command line:
./p3 'message 1' 'the second message' 'a third message for the third process'
On one run, the output was:
the second messagemessage 1
a third message for the third process
On another, I got:
the second messagemessage 1a third message for the third process
And on another, I got:
message 1
the second messagea third message for the third process
(This is on a MacBook Pro with Intel Core i7, running Mac OS X 10.8.3, and using GCC 4.7.1.)
What would be the best way to do this in the C programming language?
find fileName
Look up the POSIX function nftw(). It is designed as a 'new file tree walk' function.
There's a related, though less directly useful, function scandir() which you might use. Its selection function could be used to invoke a recursive scan on sub-directories, for example, but nftw() is probably more appropriate.
You could call find from a forked child process and get back find's output from a pipe:
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#define BUFSIZE 1000
int main(void) {
int pfd[2], n;
char str[BUFSIZE + 1];
if (pipe(pfd) < 0) {
printf("Oups, pipe failed. Exiting\n");
exit(-1);
}
n = fork();
if (n < 0) {
printf("Oups, fork failed. Exiting\n");
exit(-2);
} else if (n == 0) {
close(pfd[0]);
dup2(pfd[1], 1);
close(pfd[1]);
execlp("find", "find", "filename", (char *) 0);
printf("Oups, execlp failed. Exiting\n"); /* This will be read by the parent. */
exit(-1); /* To avoid problem if execlp fails, especially if in a loop. */
} else {
close(pfd[1]);
while ((n = read(pfd[0], str, BUFSIZE)) > 0) {
str[n] = '\0';
printf("%s", str);
}
close(pfd[0]);
wait(&n); /* To avoid the zombie process. */
if (n != 0) {
printf("Oups, find or execlp failed.\n");
}
}
}
That's a complex topic. Have a look at the GNU libc documentation. Then try to scan the current directory using scandir. If that works, you can implement a recursive version, assuming you are talking about the UNIX find command and want to do recursive search for file names.