How to plot data from a file starting at the line with a specific string - gnuplot

I am trying to execute a command similar to
plot "data.asc" every ::Q::Q+1500 using 2 with lines
but I have a problem with that "Q" number. It is not a known value but the number of the line containing some specific string. Let's say I have a line with the string "SET_10:" and my data to plot comes after that specific line. Is there some way to identify the number of the line with that specific string?

An easy way is to pass the data through GNU sed to print just the wanted lines:
plot "< sed -n <data.asc '/^SET_10:/,+1500{/^SET_10:/d;p}'" using 1:2 with lines
The -n suppresses automatic output, the a,+N address range says between which lines to run the {...} commands, and those commands delete the trigger line and p prints the others.
To make sure you have a compatible GNU sed, try the command on its own for a small number of lines, e.g. 5:
sed -n <data.asc '/^SET_10:/,+5{/^SET_10:/d;p}'
If this does not output the first 5 lines of your data, an alternative is to use awk, as it is too difficult in sed to count lines without this GNU-specific syntax. Test the (standard POSIX, not GNU-specific) awk equivalent:
awk <data.asc 'end!=0 && NR<=end{print} /^start/{end=NR+5}'
and if that is ok, use it in gnuplot as
plot "< awk <data.asc 'end!=0 && NR<=end{print} /^start/{end=NR+1500}'" using 1:2 with lines

Here's a version entirely within gnuplot, with no external commands needed. I tested this on gnuplot 5.0 patchlevel 3, using the following bash commands to create a simple dataset of 20 lines of which only 5 are to be plotted, starting from the line with "start" in column 1. You don't need to run these commands yourself.
for i in $(seq 1 20)
do let j=i%2
echo "$i $j"
done >data.asc
sed -i data.asc -e '5a\
start'
The actual gnuplot script uses a variable endlno, initially set to NaN (not-a-number), and a function f which takes 3 parameters: a boolean start saying whether column 1 matches the string, the current line number lno, and the current column-1 value val. If the line number is less than or equal to the ending line number (which means endlno is no longer NaN), f returns val; otherwise, if the start condition is true, the wanted ending line number is stored in endlno and NaN is returned. If we have not yet seen the start line, NaN is returned.
gnuplot -persist <<\!
endlno=NaN
f(start,lno,val) = ((lno<=endlno)?val:(start? (endlno=lno+5,NaN) : NaN))
plot "data.asc" using (f(stringcolumn(1)eq "start", $0, $1)):2 with lines
!
Since gnuplot does not plot points with NaN values, lines up to the start line are ignored, and so are lines after the wanted number of data lines.
In your case you need to change 5 to 1500 and "start" to "SET_10:".
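Adapted to the question (1500 lines and the marker SET_10:, assuming the marker sits in column 1 of its line), the script would presumably read:
endlno=NaN
f(start,lno,val) = ((lno<=endlno)?val:(start? (endlno=lno+1500,NaN) : NaN))
plot "data.asc" using (f(stringcolumn(1)eq "SET_10:", $0, $1)):2 with lines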

Related

Using sed or awk (or similar) incrementally or with a loop to do deletions in data file based on lines and position numbers given in another text file

I am looking to do deletions in a data file at specific positions in specific lines, based on a list in a separate text file, and have been struggling to get my head around it.
I'm working in cygwin, and have a (generally large) data file (data_file) to do the deletions in, and a tab-delimited text file (coords_file) listing the relevant line numbers in column 2 and the matching position numbers for each of those lines in column 3.
Effectively, I think I'm trying to do something similar to the following incomplete sed command, where coords_file$2 represents the line number taken from the 2nd column of coords_file and coords_file$3 represents the position in that line to delete from.
sed -r 's coords_file$2/(.{coords_file$3}).*/\1/' datafile
I'm wondering if there's a way to include a loop or iteration so that sed runs first using the values in the first row of coords_file to fill in the relevant line and position coordinates, and then runs again using the values from the second row, etc. for all the rows in coords_file? Or if there's another approach, e.g. using awk to achieve the same result?
e.g. for awk, I identified these coordinates based on string matches using this really handy awk command from Ed Morton's response to this question: line and string position of grep match.
awk 'NR==FNR{strings[$0]; next} {for (string in strings) if ( (idx = index($0,string)) > 0 ) print string, FNR, idx }' strings.txt data_file > coords_file.txt
I was thinking something similar could potentially work, doing an in-place deletion rather than just finding the lines, such as incorporating a simple find and replace like {if($0=="somehow_reference_coords_file_values_here"){$0=""}}. But it's a bit beyond me (I'm a coding novice, so I barely understand how that original command actually works, let alone how to modify it).
File examples
data_file
#vandelay.1
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.2
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.3
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
coords_file (tab-delimited)
(column 1 is just the string that was matched, column 2 is the line number it matched in, and column 3 is the position number of the match).
stringID 2 20
stringID 4 20
stringID 10 27
stringID 12 27
Desired result:
#vandelay.1
blablablablablablab
+
mehmehmehmehmehmehm
#vandelay.2
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.3
blablablablablablablablabl
+
mehmehmehmehmehmehmehmehme
Any guidance would be much appreciated thanks! (And as I mentioned, I'm very new to this coding scene, so apologies if some of that doesn't make sense or my question format's shonky (or if the question itself is rudimentary)).
Cheers.
(Incidentally, this has all been a massive work around to delete strings identified in the blablabla lines of data_file as well as the same positions 2 lines below (i.e. the mehmehmeh lines), since the mehmehmeh characters are quality scores that match the blablabla characters for each sample (each #vandelay.xx). i.e. Essentially this: sed -i 's/string.*//' datafile, but also running the same deletion 2 lines below every time it identifies the string. So if there's actually an easier script to do just that instead of all the stuff in the question above, please let me know!)
You can simply use a one-liner in awk to do that:
$ awk 'NR==FNR{a[$2]=$3;next} (FNR in a){$0=substr($0,1,a[FNR]-1)}1' coords_file data_file
#vandelay.1
blablablablablablab
+
mehmehmehmehmehmehm
#vandelay.2
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.3
blablablablablablablablabl
+
mehmehmehmehmehmehmehmehme
Brief explanation:
NR==FNR{a[$2]=$3;next}: build a map from line number to matching position in array a. This part of the expression only processes coords_file, because NR==FNR is true only while the first file is being read.
(FNR in a): once awk starts processing data_file, this selects any line whose number FNR is a key in array a.
$0=substr($0,1,a[FNR]-1): re-assign $0 to the truncated line, keeping only the characters before the stored position.
1: an always-true condition, so every (possibly modified) line is printed.
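Since awk does not edit files in place by default, a minimal sketch of applying the result (the output file name here is just an example; gawk 4.1+ also offers -i inplace):
awk 'NR==FNR{a[$2]=$3;next} (FNR in a){$0=substr($0,1,a[FNR]-1)}1' coords_file data_file > data_file.trimmed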

Plotting multiple sets of information from file with Gnuplot

I have a file that looks like this:
0 0.000000
1 0.357625
2 0.424783
3 0.413295
4 0.417723
5 0.343336
6 0.354370
7 0.349152
8 0.619159
9 0.871003
0.415044
The last line is the mean of the N entries listed right above it. What I want to do is to plot a chart that has each point listed and a line with the mean value. I know it involves replot in some way but I can't read the last value separately.
You can make two passes using the stats command to get the necessary data
stats datafile u 1 nooutput
stats datafile u ($0==(STATS_records-1)?$1:1/0) nooutput
The first pass of stats will summarize the data file. What we are actually interested in is the number of records in the file, which will be saved in the variable STATS_records.
The second pass will compute a column to analyze. If the line number (the value of $0) is equal to one less than the number of records (lines are numbered from 0, so this is the last line), then we use this value, otherwise we get an invalid value. This causes the stats command to only look at that last line. Now the value of the last line is stored in STATS_max (as well as STATS_min and several other variables).
Now we can create the plot using
plot datafile u 1:2, STATS_max
where we explicitly state columns 1 and 2 so that the first plot specification ignores that last line (actually, if we just do plot datafile it should default to this column selection and automatically ignore that last line, but this makes it explicit). This produces the data points together with a horizontal line at the stored mean value.
An alternative way is to use external programs to filter the data. For example, if we have the Linux command tail available, we could do [1]
ave = system("tail -1 datafile")
plot datafile u 1:2, ave+0
Here, ave will contain the last row of the file as a string. In the plot command we add 0 to it to force it to change to a number (otherwise gnuplot will think it is a filename).
Other external programs can be used to read that last line as well. For example, the following call to Python (using Windows-style shell quotes) does the same:
ave = system('python -c "print(open(\"datafile\",\"r\").readlines()[-1])"')
or the following using AWK (again with Windows style shell quotes) has the same result:
ave = system('awk "END{print}" datafile')
or even using Perl (again with Windows shell quotes):
ave = system('perl -lne "END{print $last} $last=$_" datafile')
[1] This use of tail relies on a command-line option that is now obsolete according to the GNU manuals. Using tail -n 1 datafile is the recommended form. However, the shorter form is less to type, and if forward compatibility is not needed (i.e. you are using this script once), there is no reason not to use it.
Gnuplot ignores lines with missing data (for example, the last line of your datafile has no column 2), so you can simply do the following:
stats datafile using 2 nooutput
plot datafile using 1:2, STATS_mean
There is no need for external tools or for stats (unless the value hasn't been calculated already, but in your example it has).
While plotting the data points you can assign the value of the first column to a variable, e.g. mean.
Since the last row doesn't contain a second column, no datapoint will be plotted, but this last value will be held in the variable mean.
If you replace reset session with reset and read the data from a file instead of a datablock, this will work with gnuplot 4.6.0 or even earlier versions.
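A sketch of that older-version variant, assuming the data sits in a file (datafile is just a placeholder name):
reset
plot "datafile" u (mean=$1):2 with points, mean with lines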
Minimal solution:
plot FILE u (mean=$1):2, mean
Script: (nicer plot and including data for copy & paste & run)
### plot values as points and last value from column 1 as line
reset session
$Data <<EOD
0 0.000000
1 0.357625
2 0.424783
3 0.413295
4 0.417723
5 0.343336
6 0.354370
7 0.349152
8 0.619159
9 0.871003
0.415044
EOD
set key top center
plot $Data u (mean=$1):2 w p pt 7 lc rgb "blue" ti "Data", \
mean w l lw 2 lc rgb "red"
### end of script

Unix: find all lines having timestamps in both time series?

I have time-series data where I would like to find all lines that match each other, though the values can differ (match only up to the first tab)! You can see in the vimdiff below that I would like to get rid of days that occur in only one of the time series.
I am looking for the simplest unix tool to do this!
Time series: here and here.
Simple example
Input
Left file Right File
------------------------ ------------------------
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
20-Apr-00 00:00 7 || 21-Apr-00 00:00 3
Output
Left file Right File
------------------------ ------------------------
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
Let's consider these sample input files:
$ cat file1
10-Apr-00 00:00 0
20-Apr-00 00:00 7
$ cat file2
10-Apr-00 00:00 7
21-Apr-00 00:00 3
To merge together those lines with the same date:
$ awk 'NR==FNR{a[$1]=$0;next;} {if ($1 in a) print a[$1]"\t||\t"$0;}' file1 file2
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
Explanation
NR==FNR{a[$1]=$0;next;}
NR is the number of lines read so far and FNR is the number of lines read so far from the current file. So, when NR==FNR, we are still reading the first file. If so, save this whole line, $0, in array a under the key of the first field, $1, which is the date. Then, skip the rest of the commands and jump to the next line.
if ($1 in a) print a[$1]"\t||\t"$0
If we get here, then we are reading the second file, file2. If the first field on this line, $1 is a date that we already saw in file1, in other words, if $1 in a, then print this line out together with the corresponding line from file1. The two lines are separated by tab-||-tab.
Alternative Output
If you just want to select lines from file2 whose dates are also in file1, then the code can be simplified:
$ awk 'NR==FNR{a[$1]++;next;} {if ($1 in a) print;}' file1 file2
10-Apr-00 00:00 7
Or, still simpler:
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2
10-Apr-00 00:00 7
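If you also need the filtered version of file1 (the question wants days present in only one series removed from both sides), you can presumably just swap the file arguments:
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file2 file1
10-Apr-00 00:00 0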
There is the relatively little-known Unix command join. It can join sorted files on a key column.
To use it in your context, we follow this strategy (left.txt and right.txt are your files):
add line numbers (to put everything in the original sequence in the last step)
nl left.txt > left_with_lns.txt
nl right.txt > right_with_lns.txt
sort both files on the date column
sort left_with_lns.txt -k 2 > sl.txt
sort right_with_lns.txt -k 2 > sr.txt
join the files using the date column (all times are 0:00). This would normally merge all columns of both files for each matching key, but we provide an output template so that the columns from the first file go to one result and the columns from the second file to another; only lines with a matching key will end up in the results fl.txt and fr.txt
join -j 2 -t $'\t' -o 1.1 1.2 1.3 1.4 sl.txt sr.txt > fl.txt
join -j 2 -t $'\t' -o 2.1 2.2 2.3 2.4 sl.txt sr.txt > fr.txt
sort both results on the line-number column and output the other columns
sort -n fl.txt | cut -f 2- > left_filtered.txt
sort -n fr.txt | cut -f 2- > right_filtered.txt
Tools used: cut, join, nl, sort.
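Collected into a single copy-and-paste block (a sketch assuming tab-separated left.txt and right.txt with the date in the first column):
nl left.txt > left_with_lns.txt
nl right.txt > right_with_lns.txt
sort left_with_lns.txt -k 2 > sl.txt
sort right_with_lns.txt -k 2 > sr.txt
join -j 2 -t $'\t' -o 1.1 1.2 1.3 1.4 sl.txt sr.txt > fl.txt
join -j 2 -t $'\t' -o 2.1 2.2 2.3 2.4 sl.txt sr.txt > fr.txt
sort -n fl.txt | cut -f 2- > left_filtered.txt
sort -n fr.txt | cut -f 2- > right_filtered.txt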
As requested by @Masi, I tried to work out a solution using sed.
My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.
sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2
With big input files, this is s-l-o-w; for each line from file2, sed has to process a number of patterns equal to the number of lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity were quadratic.
My second attempt merges and sorts the two files, then scans through all lines in search of pairs. This runs in linear time and consequently is a lot faster. Please note that this solution will ruin the original order of the file; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.
sed 's/^[^ \t]\+/&#1/' file1 > marked1
sed 's/^[^ \t]\+/&#2/' file2 > marked2
sort marked1 marked2 > sorted
sed '$d;N;/^\([^ \t]\+\)#1.*\n\1#2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)#2/\1/' filtered > out2
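On the sample file1 and file2 shown earlier, out2 should then contain just the line for the shared date:
10-Apr-00 00:00 7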
Explanation:
In the first command, s/^[^ \t]\+/&#1/ appends #1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
The second command does the same for file2; obviously with its own marker #2.
The sort command merges the two files, grouping equal dates together.
The third sed command returns all lines from file2 that have a date that also occurs in file1.
The fourth sed command removes the #2 marker from the output.
The third sed command in detail:
$d suppresses inappropriate printing of the last line
N reads and appends another line of input to the line already present in the pattern space
/^\([^ \t]\+\)#1.*\n\1#2/ matches two lines originating from different files but with the same date
{ starts a command group
s/\(.*\)\n\(.*\)/\2\n\1/ swaps the two lines in the pattern space
P prints the first line in the pattern space
} ends the command group
D deletes the first line from the pattern space
The bad news is that even the second approach is slower than the awk approach by @John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, which makes @John1024's solution blazingly fast. The downside of a dictionary is memory consumption. On huge input files, my solution should have the advantage.

Exclude data in gnuplot with a condition

I have a data file with 3 columns and I want to plot 2 of them. But I want to use the third with a condition to decide whether to exclude a line from the plot (for example, if $3 < 10 the data line isn't valid). I know there is set datafile missing, but this case is somewhat peculiar and I don't know how to do that. Any help is appreciated...
You can use conditional logic in the using expression in the plot command:
plot 'data.dat' u 1:($3 < 10 ? 1/0 : $2)
This command plots 1/0 (it skips that data point) if the value in the third column is < 10, and otherwise plots the value in the second column.
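In newer gnuplot versions (5.x), NaN presumably works the same way as 1/0 for skipping points, which some find more readable:
plot 'data.dat' u 1:($3 < 10 ? NaN : $2)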

How do I delete the current line based on a match in both the previous and current lines in sed?

Given a sorted file like so:
AAA 1 2 3
AAA 2 3 4
AAA 3 4 2
BBB 1 1 1
BBB 1 2 1
and a desired output of
AAA 1 2 3
BBB 1 1 1
what's the best way to achieve this with sed?
Basically, if the line starts with the same field as the previous line, how do I delete it? The rest of the data must be kept in the output.
I imagine there must be some way to do this either using the hold buffer, branching, or the test command.
Another way with awk:
awk '!($1 in a){print;a[$1]}' file
This prints a line only if its first field has not been seen before; the bare a[$1] merely creates the key, so later lines with the same first field are skipped.
This could be done with AWK:
$ gawk '{if (last != $1) print; last = $1}' in.txt
AAA 1 2 3
BBB 1 1 1
Maybe there's a simpler way with sed, but:
sed ':a;N;/\([[:alnum:]]*[[:space:]]\).*\n\1/{s/\n.*//;ta};P;D'
This produces the output
AAA 1 2 3
BBB 1 1 1
which differs from that in the question, but matches the description:
if the line starts with the same field as the previous line, how do I delete it?
This might work for you (GNU sed):
sed -r ':a;$!N;s/^((\S+\s).*)\n\2.*/\1/;ta;P;D' file
or maybe just:
sort -uk1,1 file
One way using awk (this idiom is not GNU-specific):
awk '!array[$1]++' file.txt
The expression is true only the first time a given first field is seen (the counter is still 0 before the post-increment), so only the first line for each key is printed.
Results:
AAA 1 2 3
BBB 1 1 1
Using sed:
#!/bin/sed -nf
P
: loop
s/\s.*//
N
/\([^\n][^\n]*\)\n\1/ b loop
D
Firstly, we must pass the -n flag to sed so it will only print what we tell it to.
We start off by printing the line with the "P" command, because the first line of each group will always be printed, and we will arrange for sed to reach this command again only when a new group starts.
Now we will do a loop. We define a loop with a starting label through the ":" command (in this case we name the label "loop"), and when necessary we jump back to this label with a "b" command (or a "t" test command). This loop is quite simple:
Remove everything but the first field (replace the first space character and everything that follows it with nothing)
Append the next line (a newline character will be included)
Check whether the new line starts with the field we isolated. We do this by using a capture. A capture is defined as a "sub-match" whose matched input is stored into a special "variable", named numerically following the order of the captures present. We specify captures using parentheses escaped with backslashes (starting with \( and ending with \)). In this case we match all characters that aren't a newline character (i.e. [^\n]) up to the end of the line. We do this by matching at least one non-newline character followed by an arbitrary sequence of them. This prevents matching an empty string before a newline. After the capture, we match a newline character followed by the result of the capture, using the special variable \1, which contains the input matched by that first capture. If this succeeds, we have a line that repeats the first field, so we jump back to the start of the loop with the "b" branch command.
When we exit the loop, we have found a line that has a different first field, so we must prepare the input line and jump back to the beginning of the script. This can be done with the "D" delete-first-line-and-restart-script command.
This can be shortened into a single line (notice that we have renamed the "loop" label to "a"; the -n flag is still required):
sed -n -e 'P;:a;s/\s.*//;N;/\([^\n][^\n]*\)\n\1/ba;D'
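Run on the sample file from the question (assuming GNU sed, with the input in a file named file), this should give:
$ sed -n -e 'P;:a;s/\s.*//;N;/\([^\n][^\n]*\)\n\1/ba;D' file
AAA 1 2 3
BBB 1 1 1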
