Large grouped data plotting - plot

I have a large amount of data to plot, and I'm trying to use gnuplot. The data is a sorted array of around 80000 elements. By simply using
plot "myData.txt" using 1:2 with linespoints linetype 1 pointtype 1
I get the output, but it takes time to render, and the points are often cluttered, with occasional gaps. To address the second problem, I thought of using a bar chart in which each entry corresponds to a bar. However, I'm not sure how to achieve this. I would like to have some space between consecutive bars, although I don't expect it to be visible. How would you suggest plotting this data?
........................
Given the large data volume, I guess it is best to group the points.
Note that my data looks like
1 11041.9
2 11041.9
3 9521.07
4 9521.07
5 9520.07
6 9519.07
7 9018.07
...
I would like to plot the data in groups of 3, i.e., the first vertical line should start at 9521.07, the minimum of points 1, 2 and 3, and end at 11041.9, their maximum. The second vertical line should consider the following 3 points (4, 5 and 6), start at 9519.07 and end at 9521.07, and so on.
Could this be achieved with gnuplot, given the data file as illustrated? If so, I would appreciate it if someone could post the set of commands I should use.

To reduce the number of points gnuplot actually draws, you can use the every keyword, e.g.
plot "myData.txt" every 100 using 1:2 with linespoints linetype 1 pointtype 1
will plot every 100th data point.
I am not sure if it's possible to do what you want (plotting vertical lines) elegantly within gnuplot, but here is my solution (assuming a UNIX-y environment). First make an awk script called sort.awk:
BEGIN { RS = "" }
{
    # x position of the bar: the middle row of the three-row group
    xval = $1 + 1
    # start from the first y value; looping up to NF also handles
    # a final record that has fewer than three lines
    ymin = ymax = $2
    for (i = 4; i <= NF; i += 2) {
        # "+ 0" forces a numeric comparison
        if ($i + 0 < ymin + 0) ymin = $i
        if ($i + 0 > ymax + 0) ymax = $i
    }
    # print the formatted line: x box_min whisker_min whisker_max box_max
    print xval " " ymin " " ymin " " ymax " " ymax
}
Now this gnuplot script will call it:
set terminal postscript enhanced color
set output 'plot.eps'
set boxwidth 3
set style fill solid
plot "<sed 'n;n;G;' myData.txt | awk -f sort.awk" with candlesticks title 'pretty data'
It's not pretty but it works. sed adds a blank line every 3 lines, and awk formats the output for the candlesticks style. You can also try embedding the awk script in the gnuplot script.
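As a hedged alternative to the sed-plus-awk pipeline above, the same grouping can be sketched in a single awk pass, without inserting blank lines first. The function name group_minmax is hypothetical; the group size of three and the x position (first x of the group plus one) are assumptions carried over from sort.awk.

```shell
# group_minmax FILE: print "x ymin ymin ymax ymax" for every group of three rows
group_minmax() {
    awk '
        # first row of a group: set the bar position and start min/max from its y value
        NR % 3 == 1 { x = $1 + 1; ymin = ymax = $2; open = 1 }
        # "+ 0" forces numeric comparison; the original text of the value is kept
        $2 + 0 < ymin + 0 { ymin = $2 }
        $2 + 0 > ymax + 0 { ymax = $2 }
        # third row of a group: emit the candlesticks columns
        NR % 3 == 0 { print x, ymin, ymin, ymax, ymax; open = 0 }
        # flush a final group that has fewer than three rows
        END { if (open) print x, ymin, ymin, ymax, ymax }
    ' "$1"
}
```

The output has the same five columns as the sort.awk version, so it can be saved to a file and plotted with the candlesticks style in the same way.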

You can do something like this (it will be easiest on Unix). You will need to insert a blank line after every third line; I don't see any way around that. If you're on Unix, the command
awk '{print} NR % 3 == 0 {print ""}' myfile
should do it (see "How do I insert a blank line every n lines using awk?").
Of course, you could (and probably should) pack that straight into your gnuplot file. Inside a double-quoted gnuplot string, the inner double quotes have to be escaped.
So, all said and done, you'd have something like this:
xval(x)=int(x-1)/3 #Return the x position on the plot (integer division: rows 1-3 give 0, rows 4-6 give 1, ...)
plot "< awk '{print} NR % 3 == 0 {print \"\"}' datafile" using (xval($1)):2 with lines

Related

awk to print incremental count of occurrences of unique values in each column

I would like to count incrementally, and print, the occurrences of each unique value in column 1, column 2, column 3, ... up to column NF, and also for the whole line ($0).
If a word appears only once in column 1, I would like to print "No" as the duplicate flag.
If a word appears more than once in column 1, I would like to print "Yes" as the duplicate flag.
I am looking for something like this:
awk -F"," '{OFS=","; if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 ",", "Yes/No", counter }'
For example, I am trying to check whether there is any duplicate information in field $1 (the fruit name).
In the Name field, "Apple" appears three times, "Orange" appears twice, and "Mango" appears once.
A word that appears only once is not a duplicate, so it gets "Name_Dup=No" and "Name_Counter=1" (i.e. Mango).
"Apple" appears 3 times, so it is a duplicate and is flagged "Yes": its first occurrence gets "Name_Dup=Yes" and "Name_Counter=1",
its second occurrence gets "Name_Dup=Yes" and "Name_Counter=2", and its third gets "Name_Dup=Yes" and "Name_Counter=3".
The same check is then needed for each column $2, $3, ... up to $NF, and for $0.
My actual input file is not sorted in any order. The number of fields varies, e.g. 10, 12 or 15 fields.
Input.csv
Name,Amount,Dept
Apple,10,eee
Orange,20,csc
Apple,30,mec
Mango,40,sss
Apple,10,eee
Orange,10,csc
Desired Output
Name,Amount,Dept,Name_Dup,Name_Counter,Amount_Dup,Amount_Counter,Dept_Dup,Dept_Counter,EntireLine_Dup,EntireLine_Counter
Apple,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,Yes,1,No,1,Yes,1,No,1
Apple,30,mec,Yes,2,No,1,No,1,No,1
Mango,40,sss,No,1,No,1,No,1,No,1
Apple,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
Orange,10,csc,Yes,2,Yes,3,Yes,2,No,1
The steps below are for reference.
Step#1 - Field $1 check and Output
Name,Name_Dup,Name_Counter
Apple,Yes,1
Orange,Yes,1
Apple,Yes,2
Mango,No,1
Apple,Yes,3
Orange,Yes,2
Step#2 - Field $2 check and Output
Amount,Amount_Dup,Amount_Counter
10,Yes,1
20,No,1
30,No,1
40,No,1
10,Yes,2
10,Yes,3
Step#3 - Field $3 check and Output
Dept,Dept_Dup,Dept_Counter
eee,Yes,1
csc,Yes,1
mec,No,1
sss,No,1
eee,Yes,2
csc,Yes,2
Step#4-Field $0 check, combination of $1 & $2 & $3 and Output
"Name,Amount,Dept",EntireLine_Dup,EntireLine_Counter
"Apple,10,eee",Yes,1
"Orange,20,csc",No,1
"Apple,30,mec",No,1
"Mango,40,sss",No,1
"Apple,10,eee",Yes,2
"Orange,10,csc",No,1
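awk solution:
As I understand it, the OP wants to show, for each line and each column, whether the column value occurs more than once, together with a running count of the occurrences of that value so far. Note that tst.awk uses arrays of arrays (v[col][val]), which require GNU awk 4.0 or later.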
$ cat tst.awk
BEGIN{ FS=OFS="," }
NR==1{
    header=$0
    n=split("Dup,Counter",h)
    for (i=1; i<=NF; i++)
        for (j=1; j<=n; j++) header=header OFS $i"_"h[j]
    printf("%s,EntireLine_Dup,EntireLine_Counter\n", header)
    next
}
{
    r[++lines]=$0
    for (col=1; col<=NF; col++) v[col][$col]++
    v[col][$0]++
}
END {
    for (l=1; l<=lines; l++){
        n=split(r[l], s)
        res=""
        for (c=1; c<=n; c++)
            res=res OFS output(v,c,s[c])
        res=res OFS output(v,c,r[l])
        print r[l] res
    }
}
function output(arr, col, val){
    return sprintf("%s,%s", (arr[col][val] > 1? "Yes" : "No"), ++count[col][val])
}
with input:
$ cat input.txt
Name,Amount,Dept,Nonsense
Apple,10,eee,eee
Orange,20,csc,eee
Apple,30,mec,eee
Mango,40,sss,eee
Apple,10,eee,eee
Orange,10,csc,eee
this gives (I've deleted the header line manually, because I couldn't get it to fit in the code sample):
$ awk -f tst.awk input.txt
# deleted header line
Apple,10,eee,eee,Yes,1,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,eee,Yes,1,No,1,Yes,1,Yes,2,No,1
Apple,30,mec,eee,Yes,2,No,1,No,1,Yes,3,No,1
Mango,40,sss,eee,No,1,No,1,No,1,Yes,4,No,1
Apple,10,eee,eee,Yes,3,Yes,2,Yes,2,Yes,5,Yes,2
Orange,10,csc,eee,Yes,2,Yes,3,Yes,2,Yes,6,No,1
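For awk implementations without arrays of arrays, the same idea could be sketched as a two-pass script that reads the file twice and uses ordinary comma-subscripted arrays. This is only a sketch under assumptions: the function name dupcount and the array names total and seen are hypothetical, and the input is assumed to be a regular file with a header row and no quoted commas.

```shell
# dupcount FILE: append <column>_Dup and <column>_Counter fields to every row
dupcount() {
    awk '
        BEGIN { FS = OFS = "," }
        # Pass 1: total occurrences per column value and per whole line (header skipped).
        NR == FNR {
            if (FNR > 1) {
                for (c = 1; c <= NF; c++) total[c, $c]++
                total["line", $0]++
            }
            next
        }
        # Pass 2, header row: append <name>_Dup and <name>_Counter for every column.
        FNR == 1 {
            out = $0
            for (c = 1; c <= NF; c++) out = out OFS $c "_Dup" OFS $c "_Counter"
            print out, "EntireLine_Dup", "EntireLine_Counter"
            next
        }
        # Pass 2, data rows: flag duplicates and number each occurrence so far.
        {
            out = $0
            for (c = 1; c <= NF; c++) {
                n = ++seen[c, $c]
                out = out OFS (total[c, $c] > 1 ? "Yes" : "No") OFS n
            }
            n = ++seen["line", $0]
            print out, (total["line", $0] > 1 ? "Yes" : "No"), n
        }
    ' "$1" "$1"
}
```

Because the output is produced while the file is read the second time, the rows keep their original order and no END block is needed.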
You have not shown what you have tried so far, so here is a hint about where I would start. Since awk is the tool to use, I would start with the shell command sort Input.csv and pipe its output to awk. While reading the input, populate an array with the lines as well as an associative array indexed by the first field.
In the END section, go over the array and check whether the first field occurs more than once. It takes a bit of time; however, this sounds like homework, not a production problem.

Appending whitespace to a variable in AWK script

I have an AWK script, which receives an input variable from another script.
The script checks the length of the input variable. If the length is 3, two spaces are added in front of the variable. If the length is 4, one space is added in front. I can compare the length, but I am not able to prepend the whitespace.
I tried the following in AWK script
if (length(input_variable) == 3) {
    input_variable = "  " input_variable
} else if (length(input_variable) == 4) {
    input_variable = " " input_variable
}
print input_variable
Output: No value is getting printed. Please help me
You should use printf:
awk '{printf "%5s", $1}'
This pads with spaces on the left to the desired width, so there is no need to reinvent it.
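Since the question passes the value in from another script, a minimal sketch of the same printf idea with a variable could look like the following. The function name pad5 is hypothetical, and passing the value with -v is an assumption about how the script receives its input.

```shell
# pad5 VALUE: print VALUE right-aligned in a field five characters wide
pad5() {
    # -v hands the shell value to awk; %5s pads on the left with spaces
    awk -v input_variable="$1" 'BEGIN { printf "%5s\n", input_variable }'
}
```

With this, pad5 abc prints the value preceded by two spaces and pad5 abcd prints it preceded by one, which covers both cases in the question without any length checks.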

Comparing consecutive rows within an file

I have two files with 1s and 0s in each column, where the field separator is ",":
1,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0
0,1,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
What I want to do is look at the file in pairs of rows, compare them, and output a 1 if they are exactly the same. In this example, rows 1 & 2 are different, so they don't get a 1; rows 3 & 4 are exactly the same, so they get a 1; rows 5 & 6 differ in one column, so they don't get a 1; and so on.
So the desired output could be something like :
1
1
1
Because here there are exactly 3 pairs (they are paired by the fact if they are consecutive) of rows that are exactly the same: rows 3&4, 7&8, and 9&10. The comparison should not reuse a row, so if you compare rows 1 & 2, you shouldn't then compare rows 2 & 3.
You can do this with awk like:
awk -F, '!(NR%2) {print $0==p} {p=$0}' data
0
1
0
1
1
Here, every line whose line number is evenly divisible by two prints a 0 if it doesn't match the previous line (stored in p), or a 1 if it does.
If you truly only want the 1s, which is throwing away any information about which pairs matched, you could:
awk -F, '!(NR%2)&&$0==p {print 1} {p=$0}' data
1
1
1
Alternatively, you could output matching pair line numbers like:
awk -F, '!(NR%2)&&$0==p {print NR-1 "," NR} {p=$0}' data
3,4
7,8
9,10
Or just the counts of all matched pairs:
awk -F, '!(NR%2)&&$0==p {c++} {p=$0} END{ print c+0 }' data
3
Another useful variant might be just to return the matching lines directly:
awk -F, '!(NR%2)&&$0==p {print} {p=$0}' data
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
I would use a shell script like this:
while IFS= read -r line
do
    if test "$prevline" = "$line"
    then
        echo 1
    fi
    prevline=$line
done
I'm not 100% sure about your requirement to "not reuse a row", but I think that could be achieved by changing the inner part of the loop to
    if test "$prevline" = "$line"
    then
        echo 1
        line="" # don't reuse a line
    fi
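If the pairs must be strictly non-overlapping (rows 1 & 2, then 3 & 4, and so on), one possible sketch reads two lines per loop iteration, so that no row can ever be compared twice. The function name equal_pairs is hypothetical, and a trailing unpaired row is assumed to be ignored.

```shell
# equal_pairs FILE: print 1 for each non-overlapping pair of identical rows
equal_pairs() {
    # each iteration consumes exactly two rows; a trailing odd row ends the loop
    while IFS= read -r first && IFS= read -r second
    do
        if [ "$first" = "$second" ]
        then
            echo 1
        fi
    done < "$1"
}
```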

how to delete the last line in file starting with string in unix

I have a file in the below format
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
MD 2 0 8 -1
I want to find the last line that starts with "MD 2" (not just "MD", as MD is embedded in other data) and delete that line.
So my output should be:
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
I have tried many regular expressions in sed, but none of them seems to work:
sed -e '/^MD *2/p' <file Name >
sed '/^(MD 2)/p' <file Name>
This might work for you (GNU sed):
sed '/^MD\s\+2/,${//{x;//p;d};H;$!d;x;s/^[^\n]*\n//}' file
This holds a window of lines in the hold space. When it encounters the required pattern, it prints out the current window and starts a new one. At the end of file it prints out all but the first line of the window (as it is the first line that is the required pattern to be deleted).
You can do this in 2 steps:
Find the line number of that line
Delete the line using sed
For example:
n=$(awk '/^MD *2/ { n=NR } END { print n }' filename)
sed "${n}d" filename
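One caveat with the two-step approach: if no line matches, n is empty and the sed command becomes a plain d, which deletes every line. A hedged sketch with a guard for that case follows; the function name delete_last_match is hypothetical, and the pattern is tightened to require at least one space after MD, which is an assumption about the intended match.

```shell
# delete_last_match FILE: print FILE without the last line that starts with "MD" + spaces + "2"
delete_last_match() {
    # remember the number of the last matching line; empty if there is none
    last=$(awk '/^MD +2/ { n = NR } END { print n }' "$1")
    if [ -n "$last" ]
    then
        sed "${last}d" "$1"
    else
        # no match: pass the file through unchanged
        cat "$1"
    fi
}
```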
If you are trying to match exactly 2 in the second column (and not strings that begin with 2), do two passes:
awk 'NR==FNR && $1 == "MD" && $2 == "2"{k=NR} NR!=FNR && FNR!=k' input input
Or, if you have access to tac and want to make 3 passes on the file:
tac input | awk '$1 == "MD" && $2 == "2" && !k{ k=1; next}1' | tac
To match when the second column does not exactly equal the string 2 but merely begins with a 2, replace $2 == "2" in the above with $2 ~ /^2/
Here is one way to do it.
awk '{a[NR]=$0} /^MD *2/ {f=NR} END {for (i=1;i<=NR;i++) if (f!=i) print a[i]}' file
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
Store all data in array a
Search and find last MD 2 and store record number in f
Then print array a, but only if record number is not equal to value in f

In Gnuplot, how to plot a function many times on the same plot in a for loop

In gnuplot, I'm trying to plot a function with 5 parameters, whose values are stored in an external file, 8 times on the same graph. I want to plot the vapor pressure of 8 species as a function of temperature; the vapor pressure is parametrized by 5 variables. I have tried using a do-for loop, but that only plots one species. How can I plot the function 8 times on the same plot using the 8 sets of parameters? The code below is based on this answer. As given, that answer writes 8 PNG files, but I would like a single one, so I modified the code in an attempt to get that.
parameters.txt
A B C D E
33.634 -3647.9 -8.6428 -9.69E-11 1.19E-06
19.419 -5869.9 -0.4428 -1.26E-02 5.22E-06
-15.077 -4870.2 14.501 -3.16E-02 1.35E-05
76.1 -5030 -25.078 9.76E-03 -2.58E-13
2.1667 -2631.8 4.035 -1.18E-02 6.10E-06
39.917 -4132 -10.78 1.97E-10 2.04E-06
29.89 -3953.5 -7.2253 2.11E-11 8.96E-07
99.109 -7533.3 -32.251 1.05E-02 1.23E-12
vapor.plt
reset
datafile = "parameters.txt"
set terminal pngcairo
set xrange [273.15:493.15]
set logscale y
set output "vapor.png"
do for [step=1:8] {
    # read parameters from file, where the first line is the header, thus the +1
    a=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $1}' " . datafile)
    b=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $2}' " . datafile)
    c=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $3}' " . datafile)
    d=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $4}' " . datafile)
    e=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $5}' " . datafile)
    # convert parameters to numeric format
    a=a+0.
    b=b+0.
    c=c+0.
    d=d+0.
    e=e+0.
    plot 10**(a + b/x + c*log10(x) + d*x + e*x**2) title ''
}
set output
To plot several functions in one graph, you have two options. The first is to use a single plot command and separate the functions with commas:
plot f(x), g(x), h(x)
This would plot all three functions in one graph. For your case you would first need to extract the parameters so that you have a1, a2, ..., a8 and so on. The advantage is that you could then have a key (legend) for the parameter sets.
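The first option could be sketched by letting awk generate the single plot command from the parameter file, one comma-separated curve per row. This is an illustration under assumptions, not part of the original answer: the function name make_plot_cmd and the titles "set 1", "set 2", ... are hypothetical, and the file is assumed to have one header line followed by five whitespace-separated columns.

```shell
# make_plot_cmd FILE: emit one gnuplot plot command with one curve per parameter row
make_plot_cmd() {
    awk '
        # skip the header line
        NR == 1 { next }
        {
            # the first curve starts the command; later curves are joined
            # with a comma and a gnuplot line continuation
            printf "%s10**(%s + (%s)/x + (%s)*log10(x) + (%s)*x + (%s)*x**2) title \"set %d\"",
                   (NR == 2 ? "plot " : ", \\\n     "), $1, $2, $3, $4, $5, NR - 1
        }
        # terminate the last line of the command
        END { print "" }
    ' "$1"
}
```

The generated text can be written to a file, for example with make_plot_cmd parameters.txt > curves.gp (curves.gp being a hypothetical name), and then run from the gnuplot script with load "curves.gp" after the set commands.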
The second option fits your existing script better. You need to put the plot calls in a multiplot:
reset
datafile = "parameters.txt"
set terminal pngcairo
set xrange [273.15:493.15]
set logscale y
set output "vapor.png"
set lmargin at screen 0.1
set rmargin at screen 0.9
set bmargin at screen 0.1
set tmargin at screen 0.9
set multiplot
do for [step=1:8] {
    # read parameters from file, where the first line is the header, thus the +1
    a=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $1}' " . datafile)
    b=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $2}' " . datafile)
    c=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $3}' " . datafile)
    d=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $4}' " . datafile)
    e=system("awk '{ if (NR == " . step . "+1) printf \"%f\", $5}' " . datafile)
    # convert parameters to numeric format
    a=a+0.
    b=b+0.
    c=c+0.
    d=d+0.
    e=e+0.
    plot 10**(a + b/x + c*log10(x) + d*x + e*x**2) lt step title ''
    if (step == 1) {
        unset border
        unset xtics
        unset ytics
    }
}
unset multiplot
set output
With multiplot, the border and tics would be redrawn for every plot, which looks ugly (bolder). To avoid this, I unset border, xtics and ytics after the first plot. To keep the same margins for all plots, I set fixed, absolute margins at the beginning. It would also be possible to keep the automatic margins computed for the first plot, but that is a bit lengthy (see the topic 'Gnuplot-defined variables' in the docs).
I also used a different linetype for every plot.

Resources