I am looking for a Unix command to get a single line from a big file (around 5 million records) by passing its line number. For example, to get the 10th line, I want to do something like
command file-name 10
Is there any such command available? We can do this by looping through each record, but that would be a time-consuming process.
This forum entry suggests:
sed -n '52p' (file)
for printing the 52nd line of a file.
There are a lot of other ways to do it, along with other related tricks.
If you want multiple lines to be printed,
sed -n -e 'Np' -e 'Mp'
where N and M are the line numbers of the lines to print (only those lines will be printed). See 10 Awesome Examples for Viewing Huge Log Files in Unix.
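For example, to print only lines 5 and 12 of a file (5 and 12 are just illustrative line numbers):
sed -n -e '5p' -e '12p' file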
command | sed -n '10p'
or
sed -n '10p' file
You could do something like:
head -n<lineno> <file> | tail -n1
That would give you the first <lineno> lines of the file, then print only the last line of that output (your line).
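For example, to fetch the 10th line this way:
head -n 10 file | tail -n 1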
Edit: It seems all the solutions here are pretty slow. However, by definition you'll have to iterate through all the records, since the operating system has no way to parse line-oriented files: files are byte-oriented. (In some sense, all these programs are going to do is count the number of \n or \r characters.) In lieu of a great answer, I'll also present the timings on my system for several of these commands!
[mjschultz@mawdryn ~]$ time sed -n '145430980p' br.txt
0b10010011111111010001101111010111
real 0m25.871s
user 0m17.315s
sys 0m2.360s
[mjschultz@mawdryn ~]$ time head -n 145430980 br.txt | tail -n1
0b10010011111111010001101111010111
real 0m41.112s
user 0m39.385s
sys 0m4.291s
[mjschultz@mawdryn ~]$ time awk 'NR==145430980{print;exit}' br.txt
0b10010011111111010001101111010111
real 2m8.835s
user 1m38.076s
sys 0m3.337s
So, on my system, it looks like the sed -n '<lineno>p' <file> solution is fastest!
You can use awk:
awk 'NR==10{print;exit}' file
Put an exit after printing the 10th line so that awk won't process the 5-million-record file any further.
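If you want the command file-name 10 interface from the original question, a minimal sketch of a wrapper script (the name getline.sh is hypothetical) built on the sed approach timed above could be:
#!/bin/sh
# Usage: getline.sh <file> <line-number>
# Print only the requested line, then quit so the rest of the file is not read.
file=$1
line=$2
sed -n "${line}p;${line}q" "$file"
You would then run it as ./getline.sh file-name 10.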
Related
I have a file in unix, say as below:
$ cat file
1.this is the test file.
The file have data related to recent survey.
The requirement is to cut the first two characters from the 1st line only and print the rest of the file's content as it is. I have tried the cut command for this:
$ cut -c 3- file
this is the test file.
e file have data related to recent survey.
but it removes the two characters from every line. Is there any way to apply the cut only to the first line, or any other command to get the required result? (Also, the first two characters to remove could be anything, not necessarily "1.")
Using shell utilities head, tail and cut:
head -n1 file | cut -c3-; tail -n+2 file
or a sed one-liner:
sed '1s/..//' file
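If you'd rather keep it to a single pass, an awk equivalent (a sketch, not from the original answers) would be:
awk 'NR==1 { $0 = substr($0, 3) } { print }' file
This strips the first two characters from line 1 only and prints every other line unchanged.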
I'm using the terminal on OS X 10.x. I have some data files of the format:
mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
mbh5.0_mrg4.54545454545_period0.00077271543854.params.dat
mbh5.0_mrg4.59090909091_period-0.000355232058085.params.dat
mbh5.0_mrg4.59090909091_period-0.000402015664015.params.dat
I know that there will be some files with similar numbers after mbh and mrg, but I won't know ahead of time what the numbers will be or how many similarly numbered ones there will be. My goal is to cat all the data from all the files with similar numbers after mbh and mrg into one data file. So from the above I would want to do something like...
cat mbh5.0_mrg4.54545454545*dat > mbh5.0_mrg4.54545454545.dat
cat mbh5.0_mrg4.5909090909*dat > mbh5.0_mrg4.5909090909.dat
I want to automate this process because there will be many such files.
What would be the best way to do this? I've been looking into sed, but I don't have a solution yet.
for file in *.params.dat; do
prefix=${file%_*}
cat "$file" >> "$prefix.dat"
done
This part, ${file%_*}, removes the last underscore and the following text from the end of $file and saves the result in the prefix variable. (Ref: http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion)
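For example, with one of the sample filenames from the question, the expansion and the resulting output file look like this (illustrative only):
file=mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
echo "${file%_*}"     # prints mbh5.0_mrg4.54545454545
So this file's contents get appended to mbh5.0_mrg4.54545454545.dat, as desired.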
It's not 100% clear to me what you're trying to achieve here, but if you want to aggregate files into one file per number after "mbh5.0_mrg4.", then you can do the following.
ls -l mbh5.0_mrg4* | awk '{print "cat " $9 " > mbh5.0_mrg4." substr($9,12,11) ".dat" }' | /bin/bash
The "ls -s" lists the file and the "awk" takes the 9th column from the result of the ls. With some string concatenation the result is passed to /bin/bash to be executed.
This is a linux bash script, so assuming you have /bind/bash, I'm not 100% famililar with OS X. This script also assumes that the number youre grouping on is always in the same place in the filename. I think you can change /bin/bash to almost any shell you have installed.
I just ran these two commands on a file having around 250 million records.
awk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
and
nawk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
The record length is 482. The first command gave the correct number of records in file2.txt, i.e. 60 million, but the nawk command gives only 4.2 million.
I am confused and would like to know if someone has come across an issue like this. How exactly is this simple command treated differently internally? Is there a buffer which can hold only up to a certain number of bytes when using nawk?
I would appreciate it if someone could shed some light on this.
My OS details are
SunOS <hostname> 5.10 Generic_147148-26 i86pc i386 i86pc
The difference probably lies in the buffer limit of nawk. One of the records (lines) in your input file has probably exceeded it.
This crucial line can be found in awk.h:
#define RECSIZE (8 * 1024) /* sets limit on records, fields, etc., etc. */
Your command can be reduced to just this:
awk 'substr($0,472,1)==9'
On Solaris (which you are on) when you run awk by default you are running old, broken awk (/usr/bin/awk) so I suspect that nawk is the one producing the correct result.
Run /usr/xpg4/bin/awk with the same script/arguments and see which of your other results its output agrees with.
Also, check whether your input file was created on Windows by running dos2unix on it and seeing if its size changes; if so, re-run your awk commands on the modified file. If it was created on Windows, it will have some control-Ms in there that could be causing chaos.
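For example (input name taken from the question; the output names are hypothetical), the two checks could look like:
/usr/xpg4/bin/awk '{if(substr($0,472,1)=="9") print $0}' < file1.txt > file2_xpg4.txt
wc -l file2_xpg4.txt                 # compare with the counts you got from awk and nawk
tr -d '\015' < file1.txt | wc -c     # if this is smaller than "wc -c < file1.txt", the file contains carriage returns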
I have a group of very large (a couple of GBs each) text files. I need to add two lines at the beginning of each of these files.
I tried using sed with the following commands:
sed -i '1iFirstLine'
sed -i '2iSecondLine'
The problem with sed is that it loops through the entire file even though it only has to add two lines at the beginning, and therefore it takes a lot of time.
Is there an alternate way to do this more efficiently, without reading the entire file?
You should try:
echo "FirstLine" > newfile.txt
echo "SecondLine" >> newfile.txt
cat oldfile.txt >> newfile.txt
mv newfile.txt oldfile.txt
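Since the question mentions a whole group of files, the same idea can be wrapped in a loop (a sketch; adjust the glob to your actual file names):
for f in *.txt; do
  { echo "FirstLine"; echo "SecondLine"; cat "$f"; } > "$f.new" && mv "$f.new" "$f"
done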
This works perfectly and is extremely fast too.
perl -pi -e '$.=0 if eof;print "first line\nsecond line\n" if ($.==1)' *.txt
Adding lines at the beginning of a file is not possible without rewriting it (contrary to appending at the end). You simply cannot "shift" file content, as no filesystem supports that. So you should do:
echo -e "line 1\nLine2" > tmp.txt
cat tmp.txt oldbigfile.txt > newbigfile.txt
rm oldbigfile.txt
mv newbigfile.txt oldbigfile.txt
Note that you need enough disk space to hold both files for a while.
I need to extract a set number of lines from a file given the start line number and end line number.
How could I quickly do this under Unix (it's actually Solaris, so the GNU flavour isn't available)?
Thx
To print lines 6-10:
sed -n '6,10p' file
If the file is huge, and the end line number is small compared to the number of lines, you can make it more efficient by quitting as soon as the last wanted line has been printed:
sed -n '6,10p;10q' file
From testing a file with a fairly large number of lines:
$ wc -l test.txt
368048 test.txt
$ du -k test.txt
24640 test.txt
$ time sed -n '6,10p;10q' test.txt >/dev/null
real 0m0.005s
user 0m0.001s
sys 0m0.003s
$ time sed -n '6,10p' test.txt >/dev/null
real 0m0.123s
user 0m0.092s
sys 0m0.030s
Or
head -n "$last" file | tail -n +"$first"
I wrote a Haskell program called splitter that does exactly this: have a read through my release blog post.
You can use the program as follows:
$ cat somefile | splitter 4,6-10,50-
That will get line four, lines six to ten, and lines fifty onwards. And that is all there is to it. You will need Haskell to install it. Just:
$ cabal install splitter
And you are done. I hope that you find this program useful.
You can do it with nawk as well:
#!/bin/sh
start=10
end=20
# print lines $start through $end, and quit as soon as we are past $end
nawk -vs="$start" -ve="$end" 'NR>e{exit}NR>=s' file