I am trying to count the number of occurrences of "1260005113" in the log file below, and then check whether the count is more than 1.
1260005722,1000239103,1260005113,1000235906,1260004267,1260004642,1260005113,1260003996,1000239447,1000233697
1260005113,1260004642,1260004267,1260005722,1260003996
1000120365,1260005113,1260005113,1260005722,1000239103,1000239447,1260003996,1260004267,1000235906,1000233697
1000213089,1000154578,1000053838,1770003314
1000228336,1260005113,1000223808,1000225189,1260003996,1260004642,1000228200,1260005722,1260004267
1000228200,1000223808,1260005113,1260005722,1260004267,1000225189,1260003996,1260005113,1000228336
1000120365,1000233697,1260004642,1260005113,1000239103,1260005722,1260003996,1260004267,1000235906,1000239447
1000235906,1260004642,1260004267,1260003996,1000233697,1260005722,1000239103,1260005113,1000120365,1000239447
1260005722,1000239447,1000233697,1260003996,1000239103,1260004642,1000120365,1260005113,1000235906,1260004267
1000213089,1000154578,1000053838,1770003314
1000120365,1260005113,1260004642,1000235906,1000239103,1000239447,1000233697,1260005722,1260003996,1260004267
1260004267,1000233697,1000239103,1000235906,1260005722,1000120365,1260005113,1260004642,1000239447,1260003996
1260005722,1260004267,1000120365,1260003996,1000239447,1000235906,1260005113,1260004642,1000233697,1000239103
1260004267,1000239103,1000120365,1000235906,1000233697,1260005113,1000239447,1260004642,1260003996,1260005722
1000228336,1260005722,1260004267,1000225189,1260005113,1260004642,1260003996,1000228200,1000223808
1000233697,1260005722,1000235906,1000239447,1000120365,1260004267,1000239103,1260003996,1260004642,1260005113
1000213089,1000154578,1000053838,1770003314
1000120365,1000239103,1260003996,1260005722,1000235906,1260004642,1000239447,1260005113,1260004267,1000233697
I have used awk -F '1260005113' '{print (NF?NF-1:0)}'
and it gives me the number of occurrences of '1260005113',
but I am unable to work out how to get only those lines in which '1260005113' occurs more than once.
So I only want to get lines 1, 3 and 6, which have more than one repetition of 1260005113.
Using the search string itself as the field separator means a line containing N occurrences splits into N+1 fields, so NF>2 selects lines with more than one occurrence:
$ awk -F '1260005113' 'NF>2' file
1260005722,1000239103,1260005113,1000235906,1260004267,1260004642,1260005113,1260003996,1000239447,1000233697
1000120365,1260005113,1260005113,1260005722,1000239103,1000239447,1260003996,1260004267,1000235906,1000233697
1000228200,1000223808,1260005113,1260005722,1260004267,1000225189,1260003996,1260005113,1000228336
Could you please try the following:
awk 'gsub("1260005113","&")>1' Input_file
The above will also match 126000511312, so for more robust code that matches only whole fields equal to 1260005113, try:
awk 'BEGIN{FS=","}{for(i=1;i<=NF;i++){if($i=="1260005113"){++count}};if(count>1){print};count=""}' Input_file
awk 'gsub(/1260005113/,"&")>1' file
gsub(/1260005113/,"&") replaces all 1260005113 with itself and returns the number of replacements took place, so basically it returns the number of occurences of 1260005113 in current line.
With sed (capture the first match, then check for a repeat with a backreference):
sed -rn '/(1260005113).*\1/p' logfile
I'm looking to grep some log files with a few surrounding lines, but then discard the junk lines from the matches. To make matters worse, the stupid code outputs the same exception twice, so I want to junk every other grep match. I don't know of a good way to skip every other grep match while also including surrounding lines, so I'm fine doing it all in one step.
So let's say we have the following results from grep:
InterestingContext1
lkjsdf
MatchExceptionText1
--
kjslkj
lskjlk
MatchExceptionText2
--
InterestingContext3
lkjsdf
MatchExceptionText3
--
kjslkj
lskjlk
MatchExceptionText4
--
Obviously the grep match is "MatchExceptionText" (simplified, of course). So I'd like to pipe this to something that removes lines 2, 5, 6, 7 and 8 and then repeats that pattern for each group, so the results look like this:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
The repeating is where things get tricky for me. I know sed can remove specific line numbers, but I don't know how to group the input into groups of 8 lines and repeat that cut across all the groups.
Any ideas? Thanks for your help.
awk can do modular arithmetic, so printing conditionally on the number of lines read mod 8 lets you repeat the pattern.
awk 'NR%8 ~ /[134]/' file
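For example, with seq standing in for the input: NR%8 cycles through 1..7 and 0, and the regex keeps relative lines 1, 3 and 4 of each 8-line group:
$ seq 16 | awk 'NR%8 ~ /[134]/'
1
3
4
9
11
12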
Sed can do it:
sed -n 'N;s/\n.*//;N;N;p;N;N;N;N' filename
EDIT:
Come to think of it, this is a little better:
sed -n 'p;n;n;N;p;n;n;n;n' filename
With GNU awk you can split the input at appropriate record separators and print the wanted output, e.g.:
awk 'NR%2 {print $1, $3}' RS='--\n' ORS='\n--\n' OFS='\n' infile
Output:
InterestingContext1
MatchExceptionText1
--
InterestingContext3
MatchExceptionText3
--
This might work for you (GNU sed):
sed -n 'p;n;n;p;n;p;n;n;n;n' file
sed -n "s/^.*\n//;x;s/^/²/
/^²\{1\}$/ b print
/^²\{3\}$/ b print
/^²\{4\}$/ b print
/^²\{7\}$/ b print
/^²\{8\}$/ b print
b cycle
: print
x;
# your treatment begin
p
# your treatment stop
x
: cycle
/^²\{8\}$/ s/.*//
x
" YourFile
This is mainly for reference, as a kind of "case" statement on the relative line number: just change the number in /^²\{YourLineNumber\}$/ to select a different relative line position.
Don't forget the last line number, which resets the cycle.
The first part takes the line and advances the relative line counter.
The second part is the case selection.
The third part is the treatment (here a print).
The last part resets the cycle counter when needed.
I have a very large file from which I need to delete a specific line (line number 941573).
I'm somewhat new to this environment, but I've been googling the problem to no avail.
I've tried using the sed command as follows, but it doesn't seem to be working:
sed -e '941572,941574d' filenameX > newfilenameY
I've also tried
sed -e '941573d' filenameX > newfilenameY
Yet the 'newfilenameY' file and the original file 'filenameX' both still contain the line that I'm trying to delete. It's a FASTQ file, though I don't see how that would make any difference. Like I said, I'm new to Unix, so maybe I've gotten the sed command wrong.
d deletes a line (or lines), so your second approach works.
$ sed '941573d' input > output
Long Example:
% for i in $(seq 1000000)
do
echo $i >> input
done
% wc -l input
1000000 input
% sed '941573d' input > output
% wc -l output
999999 output
% diff -u input output
--- input 2012-10-22 13:22:41.404395295 +0200
+++ output 2012-10-22 13:22:43.400395358 +0200
@@ -941570,7 +941570,6 @@
941570
941571
941572
-941573
941574
941575
941576
Short Example:
% cat input
foo
bar
baz
qux
% sed '3d' input > output
% cat output
foo
bar
qux
Here is how to remove one or more lines from a file.
Syntax:
sed '{[/]<n>|<string>|<regex>[/]}d' <fileName>
sed '{[/]<adr1>[,<adr2>][/]d}' <fileName>
/.../ = delimiters
n = line number
string = string found in the line
regex = regular expression corresponding to the searched pattern
addr = address of a line (number or pattern)
d = delete
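For example, a few sketches against a hypothetical file.txt:
$ sed '3d' file.txt          # delete line 3
$ sed '2,4d' file.txt        # delete lines 2 through 4
$ sed '/pattern/d' file.txt  # delete every line containing "pattern"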
I generated a test file with 1000000 lines and tried your sed -e '941573d' filenameX > newfilenameY and it worked fine on Linux.
Maybe we have some other misunderstanding. Line numbers count from one, not zero. If you counted from zero then you would find line 941572 was missing.
Did you try a diff filenameX newfilenameY? That would highlight any unexpected changes.
I don't know much about FASTQ format, but are you sure we are talking about text file line numbers, and not sequence numbers?
There is a general line length limit of 4096 bytes, do any of your lines exceed that? (That's unlikely, but I thought it worth the question).
Is there a way to delete duplicate lines in a file in Unix?
I can do it with sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that AWK passes every line of the file through. If a line isn't in the array, then seen[$0] evaluates to false. The ! is the logical NOT operator and inverts the false to true. AWK prints the lines where the expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
AWK evaluates everything except 0 and "" (the empty string) to true. If a duplicate line is placed in seen, then !seen[$0] evaluates to false and the line is not written to the output.
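For example, with some made-up input:
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c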
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
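For example, quick checks of both one-liners on made-up input:
$ printf '1\n2\n2\n3\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
1
2
3
$ printf '1\n2\n1\n3\n2\n' | sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
1
2
3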
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works except on recent versions of sed, when the input file ends with a blank line with no characters. On my Mac my CPU just spins.
This is an infinite loop if the last line is blank and doesn't have any characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P: if the content of the current pattern space is two duplicate strings separated by \n, the next line is the same as the current line, so by our core idea we do not print it; otherwise, the current line is the last appearance of its run of duplicate consecutive lines, and we use the P command to print the characters in the current pattern space up to and including the first \n.
D: we use the D command to delete the characters in the current pattern space up to and including the first \n, leaving the next line as the content of the pattern space. The D command also forces sed to jump back to its first command, $!N, instead of reading the next line from the file or standard input stream.
The second solution is easy to understand (from myself):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its first appearance, and use the : command and the t command to implement a loop.
Explanation:
Read a new line from the input stream or file and print it once.
Use the :loop command to set a label named loop.
Use N to read the next line into the pattern space.
Use s/^(.*)\n\1$/\1/ to delete the next line if it is the same as the current line; the s command performs the deletion.
If the s command succeeds, use the tloop command to force sed to jump back to the label named loop, repeating the test on the following lines until there are no more consecutive duplicates of the most recently printed line. Otherwise, use the D command to delete the first line in the pattern space (the copy of the most recently printed line); this forces sed to jump back to the first command, the p command, with the next new line as the content of the pattern space.
uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparisons, I am trimming all trailing spaces and tabs before comparison.
I think the $!N; needs curly braces, or else it just continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 on Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work; it failed at the character-set match.
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
    sed -E '
        $!{
            N;
            s/[ \t]+$//;
            /^(.*)\n\1$/!P;
            D;
        }
    ';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
    sed -n -E '
        s/[ \t]+$//;
        G;
        /^(\n){2,}/d;
        /^([^\n]+).*\n\1(\n|$)/d;
        h;
        P;
    ';
}
lastrepeat() {
    sed -n -E '
        s/[ \t]+$//;
        /^$/{
            H;
            d;
        };
        G;
        # delete previous repeated line if found
        s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
        # after searching for previous repeat, move tested last line to end
        s/^([^\n]+)(\n)(.*)/\3\2\1/;
        $!{
            h;
            d;
        };
        # squeeze blank lines to one
        s/(\n){3,}/\n\n/g;
        s/^\n//;
        p;
    ';
}
This can be achieved with AWK piped to uniq.
The line below will display the unique values (awk '1' just passes every line through; note that uniq only removes consecutive duplicates, so sort the input first if duplicates may be nonconsecutive):
awk '1' file_name | uniq
You can output these unique values to a new file:
awk '1' file_name | uniq > uniq_file_name
The new file uniq_file_name will contain the input with consecutive duplicate lines removed.
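For example, note the adjacent-only behavior of uniq (sample input made up):
$ printf 'a\na\nb\na\n' | uniq
a
b
a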
Use:
sort filename | uniq -c | awk '$1 < 2 {print $2}'
Note that this keeps only the lines that occur exactly once; any line that has duplicates is dropped entirely rather than reduced to a single copy (and lines containing spaces are truncated to their first field).
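For example, the line a, which occurs twice, disappears entirely:
$ printf 'a\nb\na\nc\n' | sort | uniq -c | awk '$1 < 2 {print $2}'
b
c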
Using sed or similar, how would you extract lines from a file? If I wanted lines 1, 5, 1010 and 20503 from a file, how would I get these 4 lines?
What if I have a fairly large number of lines I need to extract?
If I had a file with 100 lines, each representing a line number that I wanted to extract from another file, how would I do that?
Something like "sed -n '1p;5p;1010p;20503p'. Execute the command "man sed" for details.
For your second question, I'd transform the input file into a bunch of sed(1) commands to print the lines I wanted.
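For example, a minimal sketch of that transformation, borrowing the line_num_file and data_file names used in the answers below; the inner sed turns each line number N into the sed command Np, and the outer sed runs the resulting script:
$ sed -n "$(sed 's/$/p/' line_num_file)" data_file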
With awk it's as simple as:
awk 'NR==1 || NR==5 || NR==1010' "file"
@OP, you can do this more easily and efficiently with awk. For your first question:
awk 'NR~/^(1|5|1010|20503)$/{print}' file
For the second question:
awk 'FNR==NR{a[$1];next}(FNR in a){print}' file_with_linenr file
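For example, a small made-up test: a[$1] records each wanted line number from the first file, and FNR in a then selects those line numbers in the second.
$ printf '1\n3\n' > line_num_file
$ printf 'a\nb\nc\nd\n' > file
$ awk 'FNR==NR{a[$1];next}(FNR in a){print}' line_num_file file
a
c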
This ain't pretty and it could exceed command length limits under some circumstances*:
sed -n "$(while read a; do echo "${a}p;"; done < line_num_file)" data_file
Or its much slower but more attractive, and possibly more well-behaved, sibling:
while read a; do echo "${a}p;"; done < line_num_file | xargs -I{} sed -n \{\} data_file
A variation:
xargs -a line_num_file -I{} sed -n \{\}p\; data_file
You can speed up the xargs versions a little bit by adding the -P option with some large argument like, say, 83 or maybe 419 or even 1177, but 10 seems as good as any.
*xargs --show-limits </dev/null can be instructive
I'd investigate Perl, since it has the regexp facilities of sed plus the programming model around them, allowing you to read a file line by line, count the lines, and extract according to what you want (including from a file of line numbers).
my %wanted = map { $_ => 1 } (1, 5, 1010, 20503);  # a suitable list of line numbers
my $row = 1;
while (<STDIN>) {
    print if $wanted{$row};  # the current line is in $_
    $row++;
}
In Perl, where $. holds the current input line number:
perl -ne 'print if $. =~ m/^(1|5|1010|20503)$/' file