I'm trying to build a diff viewer plugin for my text editor (Kakoune). I'm comparing two files and marking any lines that are different across two panes.
However the two views don't scroll simultaneously. My idea is to get a list of line numbers that correspond to each other, so I can place the cursor correctly when switching panes, or scroll the secondary window when the primary one scrolls, etc.
So: is there a way to get a list of corresponding line numbers from command-line diff?
I'm hoping for something along the lines of the following example: given files A and B, the output should tell me which line numbers (of the lines that didn't change) correspond.
File A      File B       Output
1: hello    1: hello     1:1
2: world    2: another   2:3
3: this     3: world     3:4
4: is       4: this      4:5
5: a        5: is
6: test     6: eof
The goal is that when I scroll to line 4 in file A, I'll know to scroll file B such that its line 5 is rendered at the same position.
It doesn't have to be GNU diff, but it should only use tools that are available on most/all Linux boxes.
I can get some of the way using GNU diff, but then I need a Python script to post-process its group-based output into line-based output.
#!/usr/bin/env python
import sys
import subprocess

file1, file2 = sys.argv[1:]
cmd = ["diff",
       "--changed-group-format=",
       "--unchanged-group-format=%df %dl %dF %dL,",
       file1,
       file2]
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
output = p.communicate()[0]
for item in output.split(",")[:-1]:
    start1, end1, start2, end2 = [int(s) for s in item.split()]
    n = end1 - start1 + 1
    assert n == end2 - start2 + 1  # unchanged group, so same length in each file
    for i in range(n):
        print("{}:{}".format(start1 + i, start2 + i))
gives:
$ ./compare.py fileA fileB
1:1
2:3
3:4
4:5
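If you would rather avoid the Python dependency, the same post-processing can be done with awk alone. This is a minimal sketch of the same group-format idea, still assuming GNU diff; fileA and fileB stand in for your two files:

diff --changed-group-format='' \
     --unchanged-group-format='%df %dl %dF %dL,' fileA fileB |
awk -v RS=',' 'NF == 4 {                 # one comma-separated record per unchanged group
    for (i = 0; i <= $2 - $1; i++)       # walk the group line by line
        print $1 + i ":" $3 + i          # corresponding line numbers in A and B
}'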
Related
This is actually a continued version of this question:
I have a file
1
2
PAT1
3 - first block
4
PAT2
5
6
PAT1
7 - second block
PAT2
8
9
PAT1
10 - third block
and I use awk '/PAT1/{flag=1; next} /PAT2/{flag=0} flag'
to extract the blocks of lines.
Extracting them works OK, but I'm trying to iterate over these blocks block by block and do some processing with each one (e.g. save it to a file, process it with other scripts, etc.).
How can I construct such a loop?
The problem is not very clear, but you can do something like this:
awk '/PAT1/ {
flag = 1
++n
s = ""
next
}
/PAT2/ {
flag = 0
printf "Processing record # %d =>\n%s", n, s
}
flag {
s = s $0 ORS
}' file
Processing record # 1 =>
3 - first block
4
Processing record # 2 =>
7 - second block
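If each block needs to be saved or handed to another script, a small variation of the same flag technique writes every block to its own file. This is only a sketch; the block_N.txt naming is made up for the illustration:

awk '/PAT1/ { flag = 1; n++; next }                 # start of a block: bump the counter
     /PAT2/ { flag = 0; close("block_" n ".txt") }  # end of a block: close its file
     flag   { print > ("block_" n ".txt") }         # inside a block: append to block_N.txt
' file

The resulting block_1.txt, block_2.txt, ... can then be looped over in the shell, e.g. for f in block_*.txt; do ...; done.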
This might work for you (GNU sed):
sed -ne '/PAT1/!b;:a;N;/PAT2/!ba;e echo process:' -e 's/.*/echo "&"|wc/pe;p' file
Gather up the lines between PAT1 and PAT2 and process the collection.
In the example above, the literal process: is printed.
The command to print the result of the wc command for the collection is built and printed.
The result of the evaluation of the above command is printed.
N.B. The position of the p flag in the substitution command is critical: if p comes before the e flag, the pattern space is printed before the evaluation; if p comes after the e flag, the printed pattern space is the post-evaluation one.
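If you only need each gathered block and not the command evaluation, the same gathering skeleton can be reduced to the sketch below (GNU sed assumed); it prints each PAT1..PAT2 block, marker lines included:

sed -ne '/PAT1/!b;:a;N;/PAT2/!ba;p' file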
I want to split a single line into multiple lines of 8 bytes each. I am using the fold command, but since the file contains special (multibyte) characters, fold does not work as hoped and breaks in the middle of a multibyte character.
File Content
あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc
Command Used
fold -b8 dummy_file.dat
Appreciate any help on this.
The problem here is that your text contains multi-byte characters, which the fold command will break whenever a split falls in the middle of one.
echo "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | fold -b8
あいbb
えお��
�cc髙��
�こさ�
��㈱㈱
ちつ��
�髙aabb
c
If you want to have 8 characters per line you can use the following sed command:
echo "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | sed 's/.\{8\}/&\n/g'
あいbbえおかc
c髙①こさし㈱㈱
ちつて髙aabb
c
which adds a line break after every 8 characters.
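A similar character-based split can also be done with grep -o; this is a sketch assuming GNU grep and a UTF-8 locale, so that . matches one character rather than one byte:

echo "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | grep -oE '.{1,8}'   # each match of up to 8 characters on its own line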
If you do not want 8 characters per line but instead want each line to be at most 8 bytes without breaking any character, you can use the following Python script:
import sys

def utf8len(s):
    return len(s.encode('utf-8'))              # byte length of s when UTF-8 encoded

entry = sys.stdin.buffer.read().decode('utf-8')  # read raw bytes and decode as UTF-8
tmp = ''
for c in entry:
    if utf8len(tmp) + utf8len(c) > 8:          # adding c would exceed 8 bytes: flush the line
        print(tmp)
        tmp = c
    elif utf8len(tmp) + utf8len(c) == 8:       # adding c fills the line exactly
        print(tmp + c)
        tmp = ''
    else:                                      # still room on the current line
        tmp += c
if tmp:                                        # flush whatever is left
    print(tmp)
output:
echo -n "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | python3 max8bytes.py
あいbb
えお
かcc髙
①こ
さし
㈱㈱
ちつ
て髙aa
bbc
Explanations:
You define a function that counts how many bytes a character occupies.
You read stdin character by character and avoid putting more than 8 bytes on the same line. If you want each line to be exactly 8 bytes rather than at most 8, you can pad the end of each line with spaces.
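To double-check that no output line exceeds 8 bytes, you can pipe the result through awk. This is just a verification sketch; LC_ALL=C makes length() count bytes rather than characters in GNU awk:

echo -n "あいbbえおかcc髙①こさし㈱㈱ちつて髙aabbc" | python3 max8bytes.py |
LC_ALL=C awk '{ printf "%d bytes: %s\n", length($0), $0 }'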
I have a long list of data organised as below (INPUT).
I want to split the data up so that I get an output as below (desired OUTPUT).
The code below first identifies all the lines containing ">gi" and saves their line numbers in an array called B.
Then, in a new file, it should replace those lines from array B with a shortened version of the text following the ">gi".
I figured the easiest way would be to split at "|"; however, this does not work (no separation happens with my code if I replace " " with "|").
My code is below. It does split nicely after the " " if I replace the "|" with " " in the INPUT; however, I get into trouble when I want to get the text between the [ ] brackets, which is NOT always there and is not always only two words:
B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'
Let me know if more explanation is needed!
INPUT:
>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV
desired OUTPUT:
>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (GNU awk):
awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output
In a more readable way:
/^>gi/ {                      # when the line starts with ">gi"
    a = 1                     # set flag "a" to 1
    # extract the optional part between brackets in the last field
    match($NF, "\\[([^]]*)]", b)
    print ">" $4 " " b[1]     # display the shortened header
    next                      # jump to the next record
}
a { print }                   # while "a" is set (inside a block), display the line
!$0 { a = 0 }                 # when the line is empty, set "a" to 0 to stop the display
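Since the bracketed organism name is not always present (as the question notes), a small refinement of the same idea only appends it when the match succeeds. This is a sketch, still assuming GNU awk for the three-argument match():

gawk -F'|' '/^>gi/ {
        a = 1
        out = ">" $4                      # accession from the 4th |-separated field
        if (match($NF, /\[([^]]*)]/, b))  # bracketed organism name, if any
            out = out " " b[1]
        print out
        next
    }
    a   { print }                         # sequence lines inside a block
    !$0 { a = 0 }                         # blank line ends the block
' input > output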
I have got a big file (around 80K lines).
My main goal is to find a pattern and print, for example, 10 lines before and 10 lines after it.
The pattern occurs multiple times across the file.
Using the grep command:
grep -i <my_pattern>* -B 10 -A 10 <my_file>
I get only some of the data; I think it must be something related to the buffer size...
I need a command (grep, sed, awk) that will handle all the matches
and will print 10 lines before and after the pattern.
Example :
my patterns hide here:
a
b
c
pattern_234
c
b
a
a
b
c
pattern_567
c
b
a
This happens multiple times across the file.
Running this command:
grep -i pattern_* -B 3 -A 3 <my_file>
gives the right output:
a
b
c
c
b
a
a
b
c
c
b
It works, but not every time; if I have 80 patterns, not all 80 are shown.
awk to the rescue
awk -v n=4 '                      # n = number of context lines before/after
{
    for (i = 1; i <= n; i++)      # keep a sliding window of the last n lines
        p[i] = p[i+1]
    p[n+1] = $0
}
/pattern/ {                       # if the pattern matched
    c = n + 1                     # counter: the match line plus the n lines after it
    for (i = 1; i <= n; i++)      # print the previously saved lines
        print p[i]
}
c-- > 0                           # print lines after the match until the counter runs out
' file
This will print 4 lines before and 4 lines after each match of pattern; change the value of n as needed.
If you need a non-symmetric amount of context before/after, use two variables:
awk -vb=2 -va=3 '{for(i=1;i<=b;i++) p[i]=p[i+1];p[b+1]=$0} /pattern/{c=a+1;for(i=1;i<=b;i++) print p[i]} c-->0'
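For example, run against the sample input above (saved as my_file, a name assumed for the illustration), the two-variable version should print something like:

$ awk -vb=2 -va=3 '{for(i=1;i<=b;i++) p[i]=p[i+1];p[b+1]=$0} /pattern/{c=a+1;for(i=1;i<=b;i++) print p[i]} c-->0' my_file
b
c
pattern_234
c
b
a
b
c
pattern_567
c
b
a

That is, 2 lines before each match, the match line itself, and 3 lines after it.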
Is there a better way to get the first character of a GNU make variable than
FIRST=$(shell echo $(VARIABLE) | head -c 1)
(which is not only unwieldy but also calls the external shell)?
This is pretty horrible, but at least it doesn't invoke shell:
# REMAINDER: the variable minus its first char
$(eval REMAINDER := $$$(VAR))
# FIRST: the variable minus that remainder, i.e. its first char
FIRST := $(subst $(REMAINDER),,$(VAR))
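A quick way to see the trick in action is to feed a throwaway makefile to GNU make on stdin. This is only an illustrative sketch; the hello value and the here-document are made up for the demonstration:

make -f - <<'EOF'
VAR := hello
$(eval REMAINDER := $$$(VAR))
FIRST := $(subst $(REMAINDER),,$(VAR))
$(info first char of '$(VAR)' is '$(FIRST)')
all: ;
EOF

This should print first char of 'hello' is 'h'.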
The GNU Make Standard Library provides a substr function
substr
Arguments: 1: A string
2: Start offset (first character is 1)
3: Ending offset (inclusive)
Returns: Returns a substring
I haven't tested it, but $(call substr,$(VARIABLE),1,1) should work.
Since I came across this in my own search and didn't find what I was looking for, here is what I ended up using to parse a hex number; it can be applied to any known set of characters.
letters := 0 1 2 3 4 5 6 7 8 9 a b c d e f
nextletter = $(strip $(foreach v,$(letters),$(word 2,$(filter $(1)$(v)%,$(2)) $v)))
then
INPUT := 40b3
firstletter := $(call nextletter,,$(INPUT))
secondletter := $(call nextletter,$(firstletter),$(INPUT))
thirdletter := $(call nextletter,$(firstletter)$(secondletter),$(INPUT))
etc.
It's ugly, but it's shell-agnostic.