Move top 1000 lines from text file to a new file using Unix shell commands - unix

I wish to copy the top 1000 lines in a text file containing more than 50 million entries, to another new file, and also delete these lines from the original file.
Is there some way to do the same with a single shell command in Unix?

head -1000 input > output && sed -i '1,+999d' input
For example:
$ cat input
1
2
3
4
5
6
$ head -3 input > output && sed -i '1,+2d' input
$ cat input
4
5
6
$ cat output
1
2
3

head -1000 file.txt > first1000lines.txt
tail --lines=+1001 file.txt > restoffile.txt

Out of curiosity, I found a box with a GNU version of sed (v4.1.5) and tested the (uncached) performance of two approaches suggested so far, using an 11M line text file:
$ wc -l input
11771722 input
$ time head -1000 input > output; time tail -n +1000 input > input.tmp; time cp input.tmp input; time rm input.tmp
real 0m1.165s
user 0m0.030s
sys 0m1.130s
real 0m1.256s
user 0m0.062s
sys 0m1.162s
real 0m4.433s
user 0m0.033s
sys 0m1.282s
real 0m6.897s
user 0m0.000s
sys 0m0.159s
$ time head -1000 input > output && time sed -i '1,+999d' input
real 0m0.121s
user 0m0.000s
sys 0m0.121s
real 0m26.944s
user 0m0.227s
sys 0m26.624s
This is the Linux I was working with:
$ uname -a
Linux hostname 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
For this test, at least, it looks like sed is slower than the tail approach (27 sec vs ~14 sec).

This is a one-liner, although it chains four separate commands:
head -n 1000 file.txt > newfile.txt; tail -n +1001 file.txt > file.txt.tmp; cp file.txt.tmp file.txt; rm file.txt.tmp
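The head/tail approach can be wrapped in a small function. This is only a sketch: the name move_top_lines and the scratch files are made up for illustration. Note the offset of N+1 passed to tail, so that line N+1 is the first line kept, and that mv replaces the original file with a new one rather than editing it in place.

```shell
# Hypothetical helper: move the first N lines of SRC into DST.
move_top_lines() {
    n=$1
    src=$2
    dst=$3
    head -n "$n" "$src" > "$dst" &&
    tail -n +"$((n + 1))" "$src" > "$src.tmp" &&
    mv "$src.tmp" "$src"
}

# Demo on a six-line file in a scratch directory.
workdir=$(mktemp -d)
seq 6 > "$workdir/input"
move_top_lines 3 "$workdir/input" "$workdir/output"
cat "$workdir/output"   # the three lines that were moved
cat "$workdir/input"    # what is left in the original
```

Because the steps are joined with &&, the original file is only replaced if both head and tail succeeded.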

Perl approach (the first 1000 lines go to out via STDERR, the rest are written back to in):
perl -ne 'if ($. <= 1000) { print STDERR } else { print }' in > in.new 2> out && mv in.new in

Using pipe:
cat en-tl.100.en | head -10

Related

How to Remove Specific Code Lines using Unix

Could someone please advise how I could remove the first 4 lines and the last 2 lines of code in my 3 JavaScript files using a shell script?
I tried using this guide: UNIX - delete specific lines, but it only works for the first 4 lines. All 3 JavaScript files have a different number of lines of code.
set -vx
lines2del="(1,2,3,4)"
sedCmds=${lines2del//,/d;}
sedCmds=${sedCmds/(/}
sedCmds=${sedCmds/)/}
sedCmds=${sedCmds}d
sed -i "$sedCmds" file
Any inputs are highly appreciated. Thanks
This might work for you (GNU sed):
sed -i '1,4d;N;$d;P;D' file
This deletes the lines 1 to 4 and then prints all other lines except the last two which it also deletes.
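To see why this works, it may help to run the same script without -i on a throwaway ten-line input (seq 10 is just stand-in data). After lines 1 to 4 are deleted, N keeps a two-line window in the pattern space, P prints the first line of the window and D drops it, and $d discards the window once it contains the last line of input.

```shell
# 1,4d  delete lines 1-4
# N     append the next line, so the pattern space holds two lines
# $d    if the appended line is the last one, drop both lines
# P     otherwise print the first line of the window
# D     remove that first line and restart with the remainder
trimmed=$(seq 10 | sed '1,4d;N;$d;P;D')
printf '%s\n' "$trimmed"
```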
Add the following to your lines2del:
$(($(wc -l < file)-2)) # third last line
$(($(wc -l < file)-1)) # second last line
$(wc -l < file) # last line
$ seq 10 | tail -n +5 | head -n -2
5
6
7
8
$ seq 10 | awk '{p3=p2; p2=p1; p1=$0} NR>6{print p3}'
5
6
7
8
$ seq 10 | awk '{p[NR%6]=$0} NR>6{print p[(NR-2)%6]}'
5
6
7
8
$ seq 10 | awk -v b=4 -v a=2 'BEGIN{t=b+a} {p[NR%t]=$0} NR>t{print p[(NR-a)%t]}'
5
6
7
8
$ seq 10 | awk -v b=3 -v a=5 'BEGIN{t=b+a} {p[NR%t]=$0} NR>t{print p[(NR-a)%t]}'
4
5

Compare 2 files in unix file1(2M numbers/rows/lines) , file2(2,000,480 numbers/rows/lines)

How can I compare these two big files in Unix?
I've already tried using 'grep -Fxvf file1.txt file2.txt | wc -l' but the output is 2,000,480 and when switching file1 and file2 the output is 1,999,999.
How can I get the output of '480', because that's what I am expecting?
I've also tried using the diff/cmp commands but the output is too complicated.
I think you want the absolute difference between the line counts of the two files. You can get it easily with awk: read the line counts into an array and subtract the array values in the END block. In pure shell it gets more complex. Suppose you generate some test data (10 and 14 line files):
$ seq 1 10 > ten
$ seq 1 14 > fourteen
And then you do:
$ ( wc -l ten ; wc -l fourteen ) | awk '{ print $1}' | sort -rn | xargs -J % echo % - p | dc
The result:
4
But a much better way would be to just do it in 3 lines (get the line count for file1, then for file2, and then subtract).
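A minimal sketch of that three-line version, reusing the ten and fourteen test files from above (the scratch directory and variable names are only for illustration):

```shell
# Recreate the test data in a scratch directory.
workdir=$(mktemp -d)
seq 1 10 > "$workdir/ten"
seq 1 14 > "$workdir/fourteen"

# Count the lines of each file, then subtract.
lines1=$(wc -l < "$workdir/ten")
lines2=$(wc -l < "$workdir/fourteen")
diff_lines=$(( lines2 - lines1 ))

# Strip a leading minus sign to get the absolute value.
echo "${diff_lines#-}"
```

Reading the file through a redirect (wc -l < file) makes wc print only the number, without the file name. This compares line counts only, not the content of the lines.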

UNIX (AIX) Command Help - Sed & Awk

I'm running this on an AIX 6.1.
The intended purpose of this command is to display the following information in the following format:
GetUsedRAM:GetUsedSwap:CPU_0_System:CPU_0_User:…CPU_N_System:CPU_N_User
The command is composed of several sub commands:
echo `vmstat 1 2 | tr -s ' ' ':' | cut -d':' -f4,5,14-15 | tail -1 | sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/'``mpstat -a 1 1 | tr -s ' ' '|' | head -8 | tail -4 | cut -d'|' -f 25,27 | awk -F "|" '{printf "%.0f:%.0f:",$2,$1}' | sed '$s/.$//'| sed -e "s/ \{1,\}$//"| awk '{int a[10];split($1, a,":");printf("%d:%d:%d:%d:%d:%d:%d:%d",a[0],a[1],a[2],a[3],a[4],a[5],a[6],a[7])}'`
Which I'll reformat for clarity:
echo \
`vmstat 1 2 |
tr -s ' ' ':' |
cut -d':' -f4,5,14-15 |
tail -1 |
sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/' \
` \
`mpstat -a 1 1 |
tr -s ' ' '|' |
head -8 |
tail -4 |
cut -d'|' -f 25,27 |
awk -F "|" '{printf "%.0f:%.0f:",$2,$1}' |
sed '$s/.$//' |
sed -e "s/ \{1,\}$//" |
awk '{int a[10];split($1, a,":");printf("%d:%d:%d:%d:%d:%d:%d:%d",a[0],a[1],a[2],a[3],a[4],a[5],a[6],a[7])}' \
`
I understand all of the tr, cut, head tail, and (roughly) vmstat/mpstat commands. The first sed is where I get lost, I've tried running the command in smaller segments and not quite sure why it seems to work as a whole but not when I truncate the command before the next tr.
I'm also not so sure on the awk command although I understand the premise vaguely, as a function allowing formatted output.
Similarly, I have a vague understanding of sed being a command allowing certain strings/characters being replaced in some file.
I'm not able to make out what this specific implementation in the above case is.
Could anyone provide some clarity or direction as to exactly what is happening at each sed and awk step within the context of the entire command?
Thanks for your help.
Simplification
These two simpler commands will get the exact same output:
# GetUsedRAM:GetUsedSwap:CPU_0_System:CPU_0_User:…CPU_N_System:CPU_N_User
# Select fields 4,5 of last line, and format with :
comm1=`vmstat 1 2 |
awk '$4~/[0-9]/{avm=$4;fre=$5} END{printf "%s:%s",avm,fre}'
`
# Select fields 27 (sy) and 25 (us) for four cpu, print as decimal.
comm2=`mpstat -a 1 1 |
awk -v firstline=6 -v cpus=4 '
BEGIN{start=firstline-1; end=firstline+cpus;}
NR>start && NR<end {printf( ":%d:%d", $27,$25)}'
`
echo "${comm1}${comm2}"
Description.
Description of original commands
The whole command is the concatenation of two commands.
The first command:
The output of the vmstat is shown in this link.
The columns 4 and 5 are 'avm' and 'fre'. The output in columns 14 and 15
seems to be 'us' (user) and 'sy' (system). I say "seems" because no output
from the user is available to confirm.
The first command
`vmstat 1 2 | # Execute the command vmstat.
tr -s ' ' ':' | # convert all spaces to colon (:).
cut -d':' -f4,5,14-15 | # select fields 4,5,14,and 15
tail -1 | # select last line.
sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/' \ # See below.
`
The sed command captures, inside the first pair of escaped parentheses \( \),
a run of digits [0-9]* followed by a colon, twice. The second pair captures
the same again, without the last colon. Together they match the whole
string in two parts: « (dd:dd:)(dd:dd) » (d means digit).
Finally, it replaces the whole match with what the first pair of
parentheses captured, /\1/.
All this complexity just removes fields 14 and 15 that cut had selected.
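The effect of that substitution can be checked on a single line of input. The four numbers below are made-up stand-ins for the avm:fre:us:sy values that cut would produce; only the first two survive.

```shell
# Hypothetical sample of what reaches sed after tr, cut and tail.
sample='1234:5678:3:7'

# Group 1 matches "1234:5678:", group 2 matches "3:7";
# the whole match is replaced by group 1 alone.
trimmed=$(printf '%s\n' "$sample" |
    sed 's/\([0-9]*:[0-9]*:\)\([0-9]*:[0-9]*\)/\1/')
echo "$trimmed"
```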
A simpler command with exactly the same output is:
Select fields 4,5 of last line, and format with (:).
`vmstat 1 2 | awk '
$4~/[0-9]/{avm=$4;fre=$5} END{printf "%s:%s:",avm,fre}'
`
The second command:
The output of mpstat -A is similar to this one from Linux.
And also similar to this AIX mpstat -d output.
However, the exact output of mpstat -a (ALL) on AIX 6.1 could vary on the
computer used. Guided by the intended final output,
CPU_0_System:CPU_0_User:…CPU_N_System:CPU_N_User,
it seems that the columns to select are us (user) and sy (sys), the
percentage of CPU time spent in each mode, for every CPU in use,
of which there seem to be four on the computer measured.
The manual for AIX 6.1 mpstat is here.
It has a list of all the 40 columns that are presented when the option
-a ALL is used:
CPU min maj mpcs mpcr dev soft dec ph cs ics bound rq push
S3pull S3grd S0rd S1rd S2rd S3rd S4rd S5rd S3hrd S4hrd S5hrd
sysc us sy wa id pc %ec ilcs vlcs lcs %idon %bdon %istol %bstol %nsp
us and sy are listed as the fields 27 and 28, however the command presented
by the user selects fields number 25 and 27. Close but not the same. The
only way to confirm would be to receive the output of the command from the user.
For testing I will be using the output of mpstat 5 1 from here.
# mpstat 5 1
System configuration: lcpu=4 ent=1.0 mode=Uncapped
cpu min maj mpc int cs ics rq mig lpa sysc us sy wt id pc %ec lcs
0 4940 0 1 632 685 268 0 320 100 263924 42 55 0 4 0.57 35.1 277
1 990 0 3 1387 2234 805 0 684 100 130290 28 47 0 25 0.27 16.6 649
2 3943 0 2 531 663 223 0 389 100 276520 44 54 0 3 0.57 34.9 270
3 1298 0 2 1856 2742 846 0 752 100 82141 31 40 0 29 0.22 13.4 650
ALL 11171 0 8 4406 6324 2142 0 2145 100 752875 39 51 0 10 1.63 163.1 1846
The second command
`mpstat -a 1 1 | # execute command
tr -s ' ' '|' | # replace all spaces with (|).
head -8 | # select 8 first lines.
tail -4 | # select last four lines.
cut -d'|' -f 25,27 | # select fields 25 and 27
awk -F "|" '{printf "%.0f:%.0f:",$2,$1}' | # print the fields as integers.
sed '$s/.$//' | # on the last line ($), substitute the last character (.$) by nothing.
sed -e "s/ \{1,\}$//" | # remove trailing space(s).
awk '{
int a[10];
split($1, a,":");
printf("%d:%d:%d:%d:%d:%d:%d:%d",a[0],a[1],a[2],a[3],a[4],a[5],a[6],a[7])
}' \
`
About the int: in older versions of awk, calling a function without parentheses is equivalent to calling it on $0. So int is equivalent to int($0), whose result is neither printed nor used. The same happens to the value of a[10].
The split stores each colon-separated value of its input in a[i]. Then the values of a[i] are printed as decimal integers.
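To see how split fills the array, here is a small sketch with made-up sample values. One detail worth noting: awk's split numbers the elements from 1, so a[0] is never set and %d prints it as 0.

```shell
# Hypothetical colon-separated input, standing in for the mpstat fields.
result=$(echo '55:42:47:28' | awk '{
    n = split($1, a, ":")     # n is the number of pieces, stored in a[1]..a[n]
    printf "%d fields -> a[0]=%d a[1]=%d a[2]=%d a[3]=%d a[4]=%d\n",
           n, a[0], a[1], a[2], a[3], a[4]
}')
echo "$result"
```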
The equivalent, and way simpler is:
Command #2
`mpstat -a 1 1 |
awk -v firstline=6 -v cpus=4 '
BEGIN{start=firstline-1; end=firstline+cpus;}
NR>start && NR<end {printf( ":%d:%d", $27,$25)}'
`

Finding common elements from one file in a column of another file and output the entire row of the latter

I needed to extract all hits from one list (list.txt) which can be found in one of the columns of another (here in Data.txt) into a third (output.txt).
Data.txt (tab delimited)
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3
List.txt
Gee
Whiz
Lol
Ideally output.txt looks like
some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
So I tried a shell script
for ids in List.txt
do
grep $ids Data.txt >> output.txt
done
except I typed out everything (cut and paste actually) in List.txt in said script.
Unfortunately it gave me an output.txt including the last line, I assume as 'Whizz' contains 'Whiz'.
I also tried cat Data.txt | egrep -F "List.txt" and that resulted in grep: conflicting matchers specified -- I suppose that was too naive of me. The actual files: List.txt contains a sorted list of 985 words, Data.txt has 115576 rows with 17 columns.
Some help/guidance would be much appreciated thanks.
Try something like this:
for ids in $(cat List.txt)
do
grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done
But it has two drawbacks:
"Data.txt" is scanned multiple times
You can get one line multiple times.
If that is a problem, try this two-step version:
sed -e 's/.*/[TAB;]&[TAB;]/' List.txt > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt
Note:
A TAB character can be typed on the command line by pressing Ctrl-V followed by the Tab key, and with the Tab key alone in an editor. Check that your editor does not convert the tab into a series of spaces.
The UNIX tool for general text processing is "awk":
awk '
    NR==FNR { list[$0]; next }
    {
        for (word in list) {
            if ($0 ~ "[\t;]" word "[\t;]") {
                print
                next
            }
        }
    }
' List.txt Data.txt > output.txt
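As a quick check, the awk approach can be tried on the sample data from the question. This sketch assumes the data is tab-delimited with six columns and that the short rows have an empty second column; the scratch directory is only for illustration.

```shell
workdir=$(mktemp -d)
printf 'Gee\nWhiz\nLol\n' > "$workdir/List.txt"

# printf reuses the format for every group of six arguments (one row each).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    some_data more_data other_data here yet_more_data etc \
    A B 2 'Gee;Whiz;Hello' 13 12 \
    A B 2 'Gee;Whizz;Hi' 56 32 \
    E '' 4 'Btm;Lol' 16 2 \
    T '' 3 Whizz 13 3 > "$workdir/Data.txt"

# First file: remember each word. Second file: print a row as soon as
# one word appears delimited by a tab or semicolon on both sides.
matches=$(awk '
    NR == FNR { list[$0]; next }
    {
        for (word in list) {
            if ($0 ~ "[\t;]" word "[\t;]") { print; next }
        }
    }
' "$workdir/List.txt" "$workdir/Data.txt")
printf '%s\n' "$matches"
```

The row containing only Whizz is skipped because Whiz must be followed by a tab or semicolon. The header row matches no word either, so it is not part of this output.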

How can I get a range of line every nth interval using awk, sed, or other unix command?

I know how to get a range of lines by using awk and sed.
I also know how to print every nth line using awk and sed.
However, I don't know how to combine the two.
For example, I have a file with 1780000 lines.
For every 17800th line, I would like to print the 17800th line plus the two after it.
So if I have a file with 1780000 lines and it starts from 1 and ends at 1780000, this will print:
1
2
3
17800
17801
17802
35600
35601
35602
# ... and so on.
Does anyone know how to get a range of line every nth interval using awk, sed, or other unix command?
Using GNU sed:
sed -n '0~17800{N;N;p}' input
Meaning,
For every 17800th line: 0~17800
Read two lines: {N;N;
And print these out: p}
We can also add the first three lines:
sed -n -e '1,3p' -e '0~17800{N;N;p}' input
Using Awk, this would be simpler:
awk 'NR%17800<3 || NR==3 {print}' input
$ cat file
1
2
3
4
5
6
7
8
9
10
$ awk '!(NR%3)' file
3
6
9
$ awk -v intvl=3 -v delta=2 '!(NR%intvl){print "-----"; c=delta} c&&c--' file
-----
3
4
-----
6
7
-----
9
10
$ awk -v intvl=4 -v delta=2 '!(NR%intvl){print "-----"; c=delta} c&&c--' file
-----
4
5
-----
8
9
$ awk -v intvl=4 -v delta=3 '!(NR%intvl){print "-----"; c=delta} c&&c--' file
-----
4
5
6
-----
8
9
10
seq -f %.0f 1780000 | awk 'NR < 4 || NR % 17800 < 3' | head
output:
1
2
3
17800
17801
17802
35600
35601
35602
53400
Explanation
The NR < 4 condition covers the first 3 lines, because the stated requirement ("For every 17800th line, print 17800th line plus the two after that") doesn't match the output you gave.
I use head here only to shorten the output; remove it in your own use.
For GNU seq, you don't need -f %.0f.
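The same idea can be parameterized. In this sketch, step and count are made-up variable names, and the extra NR >= step test drops the first lines of the file so that only every step-th line and the count-1 lines after it are printed. Small numbers are used to keep the output short.

```shell
# Every 5th line plus the one after it, from a 20-line input.
picked=$(seq 20 | awk -v step=5 -v count=2 'NR >= step && NR % step < count')
printf '%s\n' "$picked"
```

For the file in the question, step=17800 and count=3 would select lines 17800 to 17802, 35600 to 35602, and so on.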

Resources