I have a very large file from which I need to delete a specific line (line number 941573).
I'm somewhat new to this environment, but I've been googling the problem to no avail.
I've tried using the sed command like this, but it doesn't seem to work:
sed -e '941572,941574d' filenameX > newfilenameY
I've also tried
sed -e '941573d' filenameX > newfilenameY
Yet the 'newfilenameY' file and the original file 'filenameX' both still contain the line I'm trying to delete. It's a FASTQ file, though I don't see how that would make any difference. Like I said, I'm new to Unix, so maybe I've gotten the sed command wrong.
The d command deletes a line (or lines), so your second approach should work:
$ sed '941573d' input > output
Long Example:
% for i in $(seq 1000000)
do
echo $i >> input
done
% wc -l input
1000000 input
% sed '941573d' input > output
% wc -l output
999999 output
% diff -u input output
--- input 2012-10-22 13:22:41.404395295 +0200
+++ output 2012-10-22 13:22:43.400395358 +0200
@@ -941570,7 +941570,6 @@
941570
941571
941572
-941573
941574
941575
941576
Short Example:
% cat input
foo
bar
baz
qux
% sed '3d' input > output
% cat output
foo
bar
qux
Here is how to remove one or more lines from a file.
Syntax:
sed '{[/]<n>|<string>|<regex>[/]}d' <fileName>
sed '{[/]<addr1>[,<addr2>][/]d}' <fileName>
/.../ = delimiters
n = line number
string = string found in the line
regex = regular expression matching the searched pattern
addr = address of a line (number or pattern)
d = delete
I generated a test file with 1000000 lines and tried your sed -e '941573d' filenameX > newfilenameY and it worked fine on Linux.
Maybe we have some other misunderstanding. Line numbers count from one, not zero. If you counted from zero then you would find line 941572 was missing.
Did you try a diff filenameX newfilenameY? That would highlight any unexpected changes.
I don't know much about FASTQ format, but are you sure we are talking about text file line numbers, and not sequence numbers?
There is a general line length limit of 4096 bytes; do any of your lines exceed that? (That's unlikely, but I thought it worth asking.)
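For completeness, here is an equivalent that avoids sed entirely, along with a quick sanity check. This is just a sketch using the file names and line number from the question:

```shell
# Print every line except line 941573 (awk numbers lines from 1, like sed).
awk 'NR != 941573' filenameX > newfilenameY

# Sanity check: the output should be exactly one line shorter.
wc -l filenameX newfilenameY
```

If the counts differ by exactly one, the deletion worked; diff will then show you which line was removed.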
Related
I have a file with multiple lines (no. of lines unknown)
DD0TRANSID000019021210504250003379433005533665506656000008587201902070168304000.0AK 0000L00000.00 N 01683016832019021220190212N0000.001683065570067.000000.00000.0000000000000NAcknowledgment
DD0TRANSID000019021210505110003379433005535567606656000008587201902085381804000.0FC 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NFirst Contact
DD0TRANSID000019021210510360003379433005535568006656000008587201902085381804000.0SR 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NStatus Report
The text TRANSID000 is in every line, from the 3rd to the 10th position.
I need to be able to replace it with TRAN000066, incrementing by 1 on each line.
66 is a variable I am getting from another file (say nextcounter) for storing the start of the counter. Once the program updates all the lines, I should be able to capture the last number and update the nextcounter file with it.
Output
DD0TRAN00066019021210504250003379433005533665506656000008587201902070168304000.0AK 0000L00000.00 N 01683016832019021220190212N0000.001683065570067.000000.00000.0000000000000NAcknowledgment
DD0TRAN00067019021210505110003379433005535567606656000008587201902085381804000.0FC 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NFirst Contact
DD0TRAN00068019021210510360003379433005535568006656000008587201902085381804000.0SR 0000L00000.00 N 53818538182019021220190212N0000.053818065570067.000000.00000.0000000000000NStatus Report
I have tried awk, sed, and perl, but none gives me the desired result.
Please suggest.
Simple loop:
s=66; while read l; do echo "$l" | sed "s/TRANSID000/TRAN$(printf '%06d' $s)/"; s=$((s+1)); done < inputFile > outputFile; echo $s > counterFile
Walter A's answer is almost perfect; it is only missing the required lines 3-10 limit.
So the improved answer is:
awk -v start=66 'NR > 2 && NR < 11{ sub(/TRANSID000/, "TRAN0000" start++); print }' inputfile
If you want to use sed, you either need a loop or something like this rather ugly one-liner:
sed '=' inputfile | sed -r '{N;s/(.*)\n(.*)(TRANSID000)(.*)/echo "\2TRAN0$((\1+65))\4"/e}'
It is much easier with awk:
awk -v start=66 '{ sub(/TRANSID000/, "TRAN0" start++); print }' inputfile
EDIT:
The OP asks to replace TRANSID with TRAN0; I showed this in the edited solution.
Looking at the example output, though, the additional 0 is not needed.
Another question is what happens when the counter goes above 99. Should one of the leading zeroes be dropped (with a construction like printf "%.4d"), or will the line length grow by 1?
DD0TRAN00099019...
DD0TRAN00100019...
# or
DD0TRAN00099019...
DD0TRAN000100019...
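One way to sidestep the width question is to let awk's sprintf pad the counter to a fixed width. A sketch (the width of 5 matches the TRAN00066 form in the question's expected output; adjust it to your record layout):

```shell
awk -v start=66 '{
    # Replace the first TRANSID000 on each line with TRAN followed by
    # a zero-padded, incrementing 5-digit counter.
    sub(/TRANSID000/, sprintf("TRAN%05d", start++))
    print
}' inputfile > outputfile
```

Note that TRANSID000 is 10 characters and TRAN%05d produces 9, so each line shrinks by one character, which is what the question's sample output shows. The final counter value can be written to the counter file afterwards with an END block.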
I have two files, both containing tens of thousands of lines. I'm currently taking a string (Z1234562) from file_one.txt and checking whether it's in file_two.txt. If it's found in file_two.txt, I return the line number the match was on -- in this case, line 1235. I have this working already.
file_one.txt
Line 1> [...]_Z1234562
file_two.txt
Line 1234> [...],Z1234561,[...]
Line 1235> [...],Z1234562,[...]
Line 1236> [...],Z1234563,[...]
However, I now want to append the string ,Yes to line 1235, so that in file_two.txt I have:
Line 1235> [...],Z1234562,[...],Yes
With the help of Glenn Jackman's answer to another question, I was able to figure out how to use the ed editor to add text before and after a specific line within a file. However, I haven't been able to figure out whether ed can add text to the end of a line within a file. Reading the documentation, I'm not sure it can. So far, based on this AIX site, this is what I have:
(echo '15022a'; echo 'Text to add.'; echo '.'; echo 'w'; echo 'q') | ed -s filename.txt
This appends the string Text to add. after line 15,022. I'm wondering if there is an insert equivalent to this append.
The reason I'm not using sed is that I'm on an AIX system and I can't get what this forum describes to work. However, I'm not sure whether the sed command in that forum only handles adding text before or after a line, rather than at the end of a line, which I already have working.
My next approach would be to remove the newline at the end of the line I want to append to, append, and then re-add the newline, but I don't want to reinvent the wheel before exhausting my options. Maybe I'm missing something. I want to do this as soon as possible, so any help would be appreciated. I'm starting not to like these AIX systems. Maybe awk can help me, but I'm less familiar with awk.
I wrote a small binary-search subroutine in Perl to find the line that I want to append to. I'm not sure whether sed, ed, grep, awk, etc. use binary search, which is why I'm not using ed's or sed's pattern searches. I want this to be as fast as possible, so I'm open to a different approach.
Here is a general recipe I have used for invoking ed from a script:
(echo '/pattern/s/$/ new text/'
echo w ) | ed filename
This invokes ed on filename, searches for a line containing pattern, and appends "new text" at the end of that line. Season to taste.
You said you were having trouble with sed, but for the record, here's the same sort of thing using sed:
sed '/pattern/s/$/ new text/' filename > filename.modified
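Since the question mentioned awk, the same edit by line number is a one-liner there too. A sketch (1235 stands for the line number found by your binary search; file_two.txt is the file from the question):

```shell
# Append ",Yes" to line 1235 only; every other line passes through unchanged.
awk -v n=1235 'NR == n { $0 = $0 ",Yes" } { print }' file_two.txt > file_two.modified
```

Like sed, awk streams the whole file, so it won't beat your binary search for locating the line, but the edit itself is simple and portable to AIX awk.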
You can use the j command
(.,.+1)j
Joins the addressed lines. The addressed lines are deleted from the buffer and replaced by a single line containing their joined text. The current address is set to the resultant line.
So you just have to modify your previous command:
cat << EOF | ed -s filename.txt
15022a
Text to add.
.
-1,.j
wq
EOF
First we create a test file foo:
$ cat > foo
foo
bar
If you already know the line number you want to edit, e.g. line 2 of the previous example, with sed:
$ sed '2s/$/foo/' foo
foo
barfoo
In awk, you'd use:
$ awk 'NR==2 {sub(/$/,"foo")} 1' foo
foo
barfoo
In perl:
$ perl -p -e 's/$/foo/ if $. == 2' foo
foo
barfoo
I have a file which has one particular string which never repeats, and all my data starts from this string. My requirement is to read all the data beneath this string (say [string-start]) and redirect it into another file.
@Krishna Kanth: This command may be helpful. Try it:
sed -e 's/^.*\(search-string\)/\1/' input-file > output-file
@Landys:
I tried the command below but got a parsing error:
$ sed -ne 'H;1h;${g;s/.*\START-OF-DATA//g;p}' < file.txt > file.out
sed: 0602-404 Function H;1h;${g;s/.*\START-OF-DATA//g;p} cannot be parsed.
Please suggest!!!
It's easy to achieve it with sed in one line.
sed -ne 'H;1h;${g;s/.*string-start//;p}' input.txt > output.txt
Here's the decomposition.
-n runs in quiet mode (nothing is printed unless explicitly requested); -e supplies the script.
h/H - copy/append the pattern space to the hold space; H first appends a \n to the hold space.
H;1h - copies all the text into the hold space; the address 1 matches the first line, where plain h is used so the hold space doesn't start with a stray \n.
s/.../.../ - replaces the text up to and including string-start with nothing, which deletes it.
p - prints the current pattern space.
${...} - runs the enclosed commands on the last line only.
For example, the input.txt is as follows.
abc
def
ghistring-startjkl
mno
pqr
The output.txt will be as follows.
jkl
mno
pqr
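An awk alternative that streams the file instead of accumulating it all in sed's hold space; a sketch, where string-start stands for the actual marker:

```shell
awk '
    f { print; next }                           # past the marker: print everything
    sub(/.*string-start/, "") { f = 1; print } # marker line: keep only the tail
' input.txt > output.txt
```

For very large files this avoids holding the whole input in memory, which the H-based sed solution must do.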
How can I replace all dots in a file with a space, but leave dots in numbers such as 1.23232 or 4.23232 untouched?
for example
Input:
abc.hello is with cdf.why with 1.9343 and 3.3232 points. What will
Output:
abc_hello is with cdf_why with 1.9343 and 3.3232 points_ What will
$ cat file
abc.hello is with cdf.why with 1.9343 and 3.3232 points. What will
this is 1.234.
here it is ...1.234... a number
.that was a number.
$ sed -e 's/a/aA/g' -e 's/\([[:digit:]]\)\.\([[:digit:]]\)/\1aB\2/g' -e 's/\./_/g' -e 's/aB/./g' -e 's/aA/a/g' file
abc_hello is with cdf_why with 1.9343 and 3.3232 points_ What will
this is 1.234_
here it is ___1.234___ a number
_that was a number_
Try any solution you're considering with that input file as it includes some edge cases (there may be more I haven't included in that file too).
The solution is basically to temporarily convert periods within numbers to some string that cannot exist anywhere else in the file so we can then convert any other periods to underscores and then undo that first temporary conversion.
So first we create a string that can't exist in the file by converting every a to the string aA, which means the string aB can no longer exist anywhere in the file. Then we convert all .s within numbers to aBs, then all remaining .s to _s, and finally unwind the temporary conversions so aBs return to .s and aAs return to as:
sed -e 's/a/aA/g' # a -> aA encode #1
-e 's/\([[:digit:]]\)\.\([[:digit:]]\)/\1aB\2/g' # 2.4 -> 2aB4 encode #2
-e 's/\./_/g' # . -> _ convert
-e 's/aB/./g' # 2aB4 -> 2.4 decode #2
-e 's/aA/a/g' # aA -> a decode #1
file
That approach of creating a temporary string that you KNOW can't exist in the file is a common alternative to picking a control character or trying to come up with some string you THINK is highly unlikely to exist in the file when you temporarily need a string that doesn't exist in the file.
I think this will do what you want:
sed 's/\([^0-9]\)\.\([^0-9]\)/\1_\2/g' filename
This will replace all dots that are not between two digits with an underscore (_) sign (you can exchange the underscore with a space character in the above command to get spaces in the output).
If you want to write the changes back into the file, use sed -i.
Edit:
To cover dots at the beginning or end of the line, or directly before or after a number, the expression becomes a bit uglier:
sed -r 's/(^|[^0-9])\.([^0-9]|$)/\1_\2/g;s/(^|[^0-9])\.([0-9])/\1_\2/g;s/([0-9])\.([^0-9]|$)/\1_\2/g'
or, in POSIX sed without -r:
sed 's/\(^\|[^0-9]\)\.\([^0-9]\|$\)/\1_\2/g;s/\(^\|[^0-9]\)\.\([0-9]\)/\1_\2/g;s/\([0-9]\)\.\([^0-9]\|$\)/\1_\2/g'
gawk
awk -v RS='[[:space:]]+' '!/^[[:digit:]]+\.[[:digit:]]+$/{gsub("\\.", "_")}; {printf "%s", $0RT}' file.txt
Since you tagged with vi, I guess you may have Vim too? It would be a very easy task for Vim:
:%s/\D\zs\.\ze\D/_/g
Is there a way to delete duplicate lines in a file in Unix?
I can do it with sort -u and uniq commands, but I want to use sed or awk.
Is that possible?
awk '!seen[$0]++' file.txt
seen is an associative array that Awk keys by each line of the file. If a line isn't in the array, seen[$0] evaluates to false. The ! is the logical NOT operator and inverts that false to true. Awk prints the lines where the whole expression evaluates to true.
The ++ increments seen so that seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
Awk evaluates everything except 0 and "" (the empty string) as true. If a duplicate line reaches seen, !seen[$0] evaluates to false and the line is not written to the output.
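A minimal run illustrating the idiom:

```shell
printf 'foo\nbar\nfoo\nbaz\n' | awk '!seen[$0]++'
# foo
# bar
# baz
```

The first occurrence of each line is printed and input order is preserved, unlike sort -u.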
From http://sed.sourceforge.net/sed1line.txt:
(Please don't ask me how this works ;-) )
# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
Perl one-liner similar to jonas's AWK solution:
perl -ne 'print if ! $x{$_}++' file
This variation removes trailing white space before comparing:
perl -lne 's/\s*$//; print if ! $x{$_}++' file
This variation edits the file in-place:
perl -i -ne 'print if ! $x{$_}++' file
This variation edits the file in-place, and makes a backup file.bak:
perl -i.bak -ne 'print if ! $x{$_}++' file
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq
The one-liner that Andre Miller posted works except on recent versions of sed when the input file ends with a blank line and no characters. On my Mac, my CPU just spins.
This is an infinite loop if the last line is blank and has no characters:
sed '$!N; /^\(.*\)\n\1$/!P; D'
It doesn't hang, but you lose the last line:
sed '$d;N; /^\(.*\)\n\1$/!P; D'
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
The first solution is also from http://sed.sourceforge.net/sed1line.txt
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.
Explanation:
$!N;: if the current line is not the last line, use the N command to read the next line into the pattern space.
/^(.*)\n\1$/!P: if the pattern space holds two identical strings separated by \n, the next line is the same as the current line, so per the core idea we must not print it; otherwise the current line is the last appearance of its run of duplicates, and the P command prints the pattern space up to and including the first \n.
D: the D command deletes the pattern space up to and including the first \n, leaving the next line as the new pattern space.
The D command then forces sed to jump back to its first command ($!N) without reading a new line from the file or standard input.
The second solution is easy to understand (my own):
$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5
The core idea is:
Print each run of duplicate consecutive lines only once, at its first appearance, and use the : command and t command to implement the loop.
Explanation:
p reads a new line from the input stream or file and prints it once.
:loop sets a label named loop.
N reads the next line into the pattern space.
s/^(.*)\n\1$/\1/ collapses the pattern space to a single copy when the newly read line equals the current line; the s command does the deleting.
If the s command succeeded, tloop jumps back to the label loop, repeating until the next line differs from the most recently printed line; otherwise, D deletes the remaining duplicate of the last printed line and forces sed to jump back to the first command (p), with the next new line as the pattern space.
uniq would be fooled by trailing spaces and tabs. To emulate how a human compares lines, I trim all trailing spaces and tabs before comparison.
I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.
I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work; it failed at the character-set match.
There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.
# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.
dedupe() {
sed -E '
$!{
N;
s/[ \t]+$//;
/^(.*)\n\1$/!P;
D;
}
';
}
# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one
norepeat() {
sed -n -E '
s/[ \t]+$//;
G;
/^(\n){2,}/d;
/^([^\n]+).*\n\1(\n|$)/d;
h;
P;
';
}
lastrepeat() {
sed -n -E '
s/[ \t]+$//;
/^$/{
H;
d;
};
G;
# delete previous repeated line if found
s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
# after searching for previous repeat, move tested last line to end
s/^([^\n]+)(\n)(.*)/\3\2\1/;
$!{
h;
d;
};
# squeeze blank lines to one
s/(\n){3,}/\n\n/g;
s/^\n//;
p;
';
}
This can be achieved using awk.
The line below will display unique values:
awk '{ print }' file_name | uniq
You can output these unique values to a new file:
awk '{ print }' file_name | uniq > uniq_file_name
The new file uniq_file_name will contain only unique values, without any duplicates. Note that uniq only removes adjacent duplicates, so the input must already be sorted or grouped.
Use:
sort filename | uniq -c | awk '$1 < 2 { sub(/^[[:space:]]*[0-9]+ /, ""); print }'
This keeps only the lines that occur exactly once; all copies of duplicated lines are removed. It is equivalent to sort filename | uniq -u.
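To make the behavioural differences between these approaches concrete, here is a small comparison sketch:

```shell
printf 'b\na\nb\nc\n' | awk '!seen[$0]++'   # b a c : first occurrence kept, input order preserved
printf 'b\na\nb\nc\n' | sort -u             # a b c : duplicates removed, but output is sorted
printf 'b\na\nb\nc\n' | sort | uniq -u      # a c   : only lines occurring exactly once survive
```

Pick the first when order matters, the second when sorted output is fine, and the third when you want repeated lines removed entirely rather than collapsed to one copy.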