gnuplot computing stats over multiple columns - plot

I have a simple 9 column file. I want to compute certain statistics for each column and then plot them (using gnuplot).
1) This is how I compute statistics for every column excluding the first one.
stats 'data' every ::2 name "stats"
2) In the output screen I can see that the operation is successful. Note that the number of columns/records is 8
* FILE:
Records: 8
Out of range: 0
Invalid: 0
Blank: 0
Data Blocks: 1
* COLUMNS:
Mean: 6.5000 491742.6625
Std Dev: 2.2913 703.4865
Sum: 52.0000 3.93394e+06
Sum Sq.: 380.0000 1.93449e+12
Minimum: 3.0000 [0] 490312.0000 [2]
Maximum: 10.0000 [7] 492643.5000 [7]
Quartile: 4.5000 491329.5000
Median: 6.5000 491911.1500
Quartile: 8.5000 492252.2500
Linear Model: y = 121.8 x + 4.91e+05
Correlation: r = 0.3966
Sum xy: 2.558e+07
3) Now I can access statistics on the first 2 columns by appending _x and _y like this
print stats_median_x
print stats_median_y
My questions are:
How can I access statistics (let's say medians) for the remaining 6 columns?
How could I plot, let's say, a line over all medians against some X axis?
I know that I can simply add a python script to pre-compute all this, but I would prefer to avoid it if there is an easy way to do it using gnuplot itself.
Thanks!

Short answer(s)
"How can I access statistics of the other column?"
With stats 'data' using n you get the statistics of the nth column.
"How can I plot for example all medians?"
For example, set print together with a do for loop can create a data file that you can then use for the plot.
A working solution
set print "StatDat.dat"
do for [i=2:9] { # Here you will use i for the column.
stats 'data.dat' u i nooutput ;
print i, STATS_median, STATS_mean , STATS_stddev # ...
}
set print
plot "StatDat.dat" us 1:2 # or whatever column you want...
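To cross-check the numbers that end up in StatDat.dat, the same loop can be mirrored outside gnuplot. The following is only a minimal Python sketch, assuming a whitespace-separated file; the sample rows and the helper name column_stats are hypothetical. It uses the population standard deviation because that is what the Std Dev line in the question's output corresponds to.

```python
import statistics

# Hypothetical sample standing in for 'data.dat' (whitespace-separated columns).
SAMPLE = """\
1 10.0 5.0
2 12.0 7.0
3 11.0 6.0
4 15.0 9.0
"""

def column_stats(text, first_col=2):
    """Return (column, median, mean, population std dev) for each column from first_col on."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    ncols = len(rows[0])
    result = []
    for col in range(first_col, ncols + 1):      # 1-based, like gnuplot's 'using'
        values = [float(r[col - 1]) for r in rows]
        result.append((col,
                       statistics.median(values),
                       statistics.mean(values),
                       statistics.pstdev(values)))
    return result

# One output row per column, in the same order as the 'print' line in the gnuplot loop.
for row in column_stats(SAMPLE):
    print(*row)
```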
Some words more about it
Asking gnuplot itself for help with help stats reveals a lot of interesting things :-).
Syntax:
stats 'filename' [using N[:M]] [name 'prefix'] [[no]output]
This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See plot for details on the index, every, and using directives.
From the first sentence we can see that stats summarizes one or at most two columns at a time (a pity; maybe this will change in the future).
From the second sentence we learn that using follows the same syntax as in the plot command:
so stats 'data' using 3 gives you the statistics of the 3rd column as x,
and stats 'data' using 4:5 gives those of the 4th and 5th columns as x and y.
Notes about your interpretations
You said
This is how I compute statistics for every column excluding the first one.
stats 'data' every ::2 name "stats"
Not really: this gives the statistics of the first two columns, excluding the first two lines. The line counter used by every starts from 0, not from 1, so ::2 starts at the third line.
As a consequence, when we read
Records: 8
it means that 8 lines were used: your file had 10 (usable) lines, you specified every ::2 and so skipped the first two, which leaves 8 records for the statistics.
This also clarifies what help stats means by
STATS_records # total number of in-range data records
implying "used to compute this statistic".
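The counting can be illustrated outside gnuplot. This is a small Python sketch, assuming a hypothetical 10-line file whose first column runs from 1 to 10 (which is consistent with the minimum of 3 and maximum of 10 in the question's output); every_from is an illustrative helper, not a gnuplot function.

```python
# Ten hypothetical data lines; gnuplot numbers the points from 0.
lines = [f"{i + 1} {100 + i}" for i in range(10)]

def every_from(points, start):
    """Mimic 'every ::start': keep points whose 0-based index is >= start."""
    return [p for idx, p in enumerate(points) if idx >= start]

kept = every_from(lines, 2)
first_column = [float(p.split()[0]) for p in kept]

print(len(kept))                              # records that enter the statistics
print(min(first_column), max(first_column))   # range of the first column after skipping
```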
Tested on gnuplot 4.6 patchlevel 4
Working on gnuplot Version 5.0 patchlevel 1

Related

Problems with DataFrame to generate a table of 15000 rows x 37 columns in Julia

I'm trying to generate a table with 15000 rows and 16 columns, however Julia loses or omits some variables.
I have tried different ways to run DataFrame but I get the following results:
df = DataFrame(periods=15000, households=5000, giniY=giniY)
15,000 rows × 3 columns
However, when I run with the 16 variables I get the following result
df = DataFrame(periods=15000, households=5000, gamma=gamma, delta=delta,
betta=betta, alfa=alfa, miz=miz, roz=roz,
phi=phi, rok=rok, mie=mie, roe=roe,
roez=roez)
1×13 DataFrame. Omitted printing of 3 columns
Your second df variable has 13 columns (I have aligned the code in my edit 4 variables per line so that it is clearly visible). Julia omits printing all columns if they would not fit the screen (imagine what would happen if you had a data frame with 10 000 columns and always tried to print them all).
In the REPL, Julia omits printing columns that do not fit the screen unless you pass the allcols=true keyword argument to show, or pass show a custom IOContext that defines a non-standard output width. All this is explained in the show documentation for DataFrame.
In Jupyter Notebook a similar thing happens, but by default the width of the output is governed by the "COLUMNS" environment variable. How to set it is explained at the beginning of the DataFrames.jl manual.

Retrieve best number of clusters from NbClust

Many functions in R provide some sort of console output (such as NbClust() etc.) Is there any way of retrieving some of the output (e.g. a certain integer value) without having a look at the output? Any way of reading from the console?
Imagine the output would look like the following output from example code provided in the package manual:
[1] "Frey index : No clustering structure in this data set"
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 1 proposed 2 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 1 proposed 7 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
How would I retrieve the value 4 from the last line of the above output?
It is better to work with objects rather than with console output. Any "good" function should return structured output that can be accessed using the $ or @ operators; use str() to see the object's structure.
In your case, I think this should work:
length(unique(res$Best.partition))
Another option is:
max(unlist(res[4]))

Plotting multiple sets of information from file with Gnuplot

I have a file that looks like this:
0 0.000000
1 0.357625
2 0.424783
3 0.413295
4 0.417723
5 0.343336
6 0.354370
7 0.349152
8 0.619159
9 0.871003
0.415044
The last line is the mean of the N entries listed right above it. What I want to do is to plot a chart that has each point listed and a line with the mean value. I know it involves replot in some way but I can't read the last value separately.
You can make two passes using the stats command to get the necessary data
stats datafile u 1 nooutput
stats datafile u ($0==(STATS_records-1)?$1:1/0) nooutput
The first pass of stats will summarize the data file. What we are actually interested in is the number of records in the file, which will be saved in the variable STATS_records.
The second pass will compute a column to analyze. If the line number (the value of $0) is equal to one less than the number of records (lines are numbered from 0, so this is the last line), then we get this value, otherwise we get an invalid value. This causes the stats command to only look at this last line. Now the value of the last line is stored in STATS_max (or STATS_min and several other variables).
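The logic of the two passes can be sketched in Python. This is only an illustration of the selection rule, not of how gnuplot implements stats; the function name last_value and the shortened sample are hypothetical.

```python
# Shortened, hypothetical version of the data file: the last line holds the mean.
DATA = """\
0 0.000000
1 0.357625
2 0.424783
0.415044
"""

def last_value(text):
    """Two passes over column 1: count the records, then keep only the last one."""
    column1 = [float(line.split()[0]) for line in text.splitlines() if line.strip()]
    records = len(column1)                     # first pass: plays the role of STATS_records
    # Second pass: every line except index records-1 is treated as invalid (1/0).
    selected = [v for idx, v in enumerate(column1) if idx == records - 1]
    return max(selected)                       # plays the role of STATS_max

print(last_value(DATA))
```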
Now we can create the plot using
plot datafile u 1:2, STATS_max
where we explicitly state columns 1 and 2 to make the first plot specification ignore that last line (actually, if we just do plot datafile it should default to this column selection and automatically ignore that last line, but this makes certain).
An alternative way is to use external programs to filter the data. For example, if we have the linux command tail available, we could do [1]
ave = system("tail -1 datafile")
plot datafile u 1:2, ave+0
Here, ave will contain the last row of the file as a string. In the plot command we add 0 to it to force it to change to a number (otherwise gnuplot will think it is a filename).
Other external programs can be used to read that last line as well. For example, the following call to python3 (using Windows style shell quotes) does the same:
ave = system('python -c "print(open(\"datafile\",\"r\").readlines()[-1])"')
or the following using AWK (again with Windows style shell quotes) has the same result:
ave = system('awk "END{print}" datafile')
or even using Perl (again with Windows shell quotes):
ave = system('perl -lne "END{print $last} $last=$_" datafile')
[1] This use of tail uses a now obsolete (according to the GNU manuals) command line option. Using tail -n 1 datafile is the recommended way. However, this shorter way is less to type, and if forward compatibility is not needed (i.e. you are using this script once), there is no reason not to use it.
Gnuplot ignores those lines with missing data (for example, the last line of your datafile has no column 2). Then, you can simply do the following:
stats datafile using 2 nooutput
plot datafile using 1:2, STATS_mean
There is no need for using external tools or using stats (unless the value hasn't been calculated already, but in your example it has).
During plotting of the data points you can assign the value of the first column, e.g. to the variable mean.
Since the last row doesn't contain a second column, no data point will be plotted for it, but this last value will be held in the variable mean.
If you replace reset session with reset and read the data from a file instead of a datablock, this will work with gnuplot 4.6.0 or even earlier versions.
Minimal solution:
plot FILE u (mean=$1):2, mean
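Why the variable ends up holding the last line's value can be seen from a small Python sketch of the same idea. It is an illustration under assumptions, not gnuplot's internals; the function name scan and the shortened sample are hypothetical.

```python
# Shortened, hypothetical version of the data: the last line has only one column.
DATA = """\
0 0.000000
1 0.357625
2 0.424783
0.415044
"""

def scan(text):
    """Return (points, mean): two-column rows become points, 'mean' keeps the last column-1 value."""
    points = []
    mean = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        mean = float(fields[0])            # like (mean=$1): overwritten on every row
        if len(fields) >= 2:               # a row without column 2 yields no point
            points.append((mean, float(fields[1])))
    return points, mean

points, mean = scan(DATA)
print(len(points), mean)
```

The assignment runs for every row, including the incomplete last one, so after the scan the variable carries the stored mean while the point list contains only the complete rows.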
Script: (nicer plot and including data for copy & paste & run)
### plot values as points and last value from column 1 as line
reset session
$Data <<EOD
0 0.000000
1 0.357625
2 0.424783
3 0.413295
4 0.417723
5 0.343336
6 0.354370
7 0.349152
8 0.619159
9 0.871003
0.415044
EOD
set key top center
plot $Data u (mean=$1):2 w p pt 7 lc rgb "blue" ti "Data", \
mean w l lw 2 lc rgb "red"
### end of script

R spline function given a fixed space

So, I need to generate a spline function to feed into another program which only accepts a fixed spacing between consecutive points. I used the spline function in R with a given number of points to generate the spline; however, the floating-point cutoff makes the spacing between the points variable, for example:
spline(d$V1, d$V2, n=(max(d$V1)-min(d$V1))/0.0200)
> head(t.spl, 7)
x y
1 2.3000 -3.0204
2 2.3202 -3.0204
3 2.3404 -3.0204
4 2.3606 -3.0204
5 2.3807 -3.0204
6 2.4009 -3.0204
7 2.4211 -3.0204
So the spacing between the 1st and 2nd rows is 0.0202, while between the 4th and 5th it is 0.0201. Because of this, the other program that I am feeding this spline into doesn't accept it. Is there any way to make this work?
As an aside: please provide a reproducible example next time (I can't copy/paste your code in because I don't have d or t.spl)
I think you'll find that the different intervals (0.0202 vs 0.0201) are an artifact of the number of digits printed on the screen, not of the spline function.
It seems R is printing 4 digits after the decimal point for you for neatness, so it's doing the rounding only for the purposes of displaying the results to you.
You can see how many digits are displayed with options('digits')$digits, and adjust it with options(digits=new_number_of_digits) (see ?options for details).
For example:
options(digits=4)
pi
# 3.142
options(digits=10)
pi
# 3.141592654
In summary, when you feed the values in to your other program, make sure you print the values with enough decimal points that the other program accepts the intervals as being "equal".
If you are writing to a file, for example, just make sure you write enough digits out. If you are copy-pasting from the R console, make sure you adjust R to print out enough digits.
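The effect can be reproduced with a few lines of Python. This is a sketch under an assumption: the step 0.020186 is a hypothetical value picked so that the 4-decimal display resembles the table in the question; the real spacing in the asker's data is not known.

```python
# Hypothetical constant spacing; only the displayed digits differ below.
STEP = 0.020186
xs = [2.3 + i * STEP for i in range(7)]

def shown_gaps(values, decimals):
    """Gaps between consecutive values after rounding each value for display."""
    scale = 10 ** decimals
    shown = [round(v * scale) for v in values]   # integers, to keep float noise out
    return [b - a for a, b in zip(shown, shown[1:])]

print(shown_gaps(xs, 4))   # gaps in units of 1e-4: not all equal
print(shown_gaps(xs, 6))   # gaps in units of 1e-6: all equal
```

With four displayed decimals the gaps appear to alternate, although the underlying spacing is constant; with six decimals they agree.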
MathematicalCoffee is probably right. I'm just adding an alternative for the sake of wordiness.
myspline <- splinefun(d$V1, d$V2)
mydata.y <- myspline(desired_x_values,deriv=0)
Will guarantee the uniform x-spacings you desire.
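One way to build such a list of x values is sketched below in Python; the function name uniform_grid and the bounds are hypothetical. The point to keep in mind is that n equally spaced points span n - 1 intervals, so a spacing of 0.02 over a given range needs range/0.02 + 1 points.

```python
def uniform_grid(start, stop, step):
    """x values from start to stop with a fixed step; n points span n - 1 intervals."""
    n_intervals = round((stop - start) / step)
    # Compute each point from its index rather than by repeated addition,
    # so rounding errors do not accumulate along the grid.
    return [start + i * step for i in range(n_intervals + 1)]

grid = uniform_grid(2.3, 2.5, 0.02)
print(len(grid), grid[0], grid[-1])
```

A grid like this would play the role of desired_x_values in the R snippet above.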

how to manipulate data with gnuplot's plot with a number stored in the same file?

I'd like to plot a histogram data already created, stored in hist.dat as:
#hist1
100
1
9
10
30
30
10
9
1
Here the zeroth line is a comment, the first line contains the sum of the y values of the histogram, and the x values are 1, 2, ... (the line number). So without normalization, I could use
plot "hist.dat" every::1 using 0:1
and with normalization I could use
plot "hist.dat" every::1 using 0:($1/100)
The question is: how can I refer to the summed value (100)? I don't want to pre-read the file just to generate correct gnuplot code, so I don't want to write the value out explicitly. I already tried
plot "hist.dat" using 0:($1/(columnhead+0))
but columnhead cannot be called within using (it is a string, which is why I tried adding 0 to convert it to a number).
I don't want to modify the file or create a new one based on this one, I want to just use the appropriate gnuplot command. I would like to avoid neglecting the summated value and recalculating it again with gnuplot.
Solution: building on the correct answer given by andyras, a slightly improved method is
first(x) = ($0 == 0) ? (first = column(x), 1/0) : first
plot "hist.dat" using 0:($1/first(1))
This also works for histograms with multiple columns, for example if hist.dat were
#hist1 hist2
10000 8000
1000 50
9000 70
1000 1100
3000 4500
3000 1200
1000 700
9000 380
1000
How can I refer the summated value (100)? (without pre-reading the file)
Yes, using a gnuplot function:
first(x) = ($0 == 0) ? (first = $1, 1/0) : first
plot "hist.dat" using 0:($1/first($1))
If it is reading the first line, the function assigns the value from that line to the variable first and returns 1/0 (gnuplot treats it as missing data and won't extend the x range to include that point). Otherwise the function returns the value of first.
This way you don't even have to use every ::1.
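The behaviour of first() can be mirrored in a short Python sketch: the first data line is captured as the total and produces no point, and every later line is divided by it. The function name normalized is hypothetical, and the sample is the hist.dat content from the question.

```python
HIST = """\
#hist1
100
1
9
10
30
30
10
9
1
"""

def normalized(text):
    """Divide every value after the first data line by that first line (the stored total)."""
    values = [float(s) for s in text.splitlines()
              if s.strip() and not s.startswith("#")]
    total = values[0]        # captured when the line index ($0 in gnuplot) is 0
    # Start x at 1: index 0 is the total line, index 1 is the first bin.
    return [(x, v / total) for x, v in enumerate(values[1:], start=1)]

points = normalized(HIST)
print(points[0], points[-1])
```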
If you didn't mind rereading the file you could use the stats command to find out the largest value in the file.
