gnuplot: How to add a new colored curve every nth row while removing intermediate headers?

I've got a file which has this structure:
Header 1
Header 2
config X Y
0.0 -5 -2
0.0 0 1
0.0 5 4
Header2
Config X Y
1.0 -5 -1
1.0 0 0
1.0 5 5
Header2
Config X Y
2.0 -5 0
2.0 0 1
2.0 5 6
Using gnuplot, I would like to plot columns 2:3 (Y as a function of X) with a few conditions:
Get rid of the headers and any line that's not filled with numbers
On the same graph, plot a new function (with a new label and a new color) each time the config changes. In the aforementioned case, you'd end up with three plots (one for config=0.0, one for config=1.0 and one for config=2.0)
Is there a one-liner for this in Gnuplot?
I tried to use the "every" keyword
p 'filename.txt' every ::3 u 2:3 w p
but to no avail
Thank you

If you have a strict data structure and you insist on a one-liner, you could do the following with every (check help every).
However, you then need to know in advance that N=5 (here: 2 header lines and 3 data lines) and that you have 3 blocks. You also need to skip the first 3 lines (check help skip).
You could use stats to find out N automatically.
Personally, I would prefer a solution with more than one line, which would be robust against small changes in the data, just in case.
Script:
### separate data into subblocks with different colors
reset session
$Data <<EOD
Header 1
Header 2
config X Y
0.0 -5 -2
0.0 0 1
0.0 5 4
Header2
Config X Y
1.0 -5 -1
1.0 0 0
1.0 5 5
Header2
Config X Y
2.0 -5 0
2.0 0 1
2.0 5 6
EOD
plot for [i=0:2] N=5 $Data u 2:3 every ::i*N::(i+1)*N-1 skip 3 w lp pt 7 lc i ti sprintf("Config%d",i)
### end of script
Result:
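To make the index arithmetic of the one-liner explicit, here is a minimal Python sketch (Python is used purely for illustration and is not part of the gnuplot answer). It assumes, as described above, that the header lines inside a block count towards N and are then dropped because they are not numeric. The name select_block is hypothetical.

```python
def select_block(lines, i, n, skip):
    """Mimic `skip` plus `every ::i*n::(i+1)*n-1`: drop the first `skip`
    lines, take line indices i*n .. (i+1)*n-1, keep only numeric rows."""
    remaining = lines[skip:]
    chunk = remaining[i * n:(i + 1) * n]
    rows = []
    for line in chunk:
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        try:
            rows.append(tuple(float(f) for f in fields))
        except ValueError:
            continue  # header line: not all fields are numbers
    return rows


# The data from the question
lines = """Header 1
Header 2
config X Y
0.0 -5 -2
0.0 0 1
0.0 5 4
Header2
Config X Y
1.0 -5 -1
1.0 0 0
1.0 5 5
Header2
Config X Y
2.0 -5 0
2.0 0 1
2.0 5 6""".splitlines()

# N=5 (3 data lines + 2 header lines), 3 leading lines skipped, 3 blocks
blocks = [select_block(lines, i, n=5, skip=3) for i in range(3)]
for i, block in enumerate(blocks):
    print("Config%d" % i, [(x, y) for _, x, y in block])
```

If the number of header or data lines per block changes, N and the skip count have to be adjusted by hand, which is the fragility mentioned above.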
Addition:
Here is a more general (but maybe not too obvious) solution:
you don't need to know in advance how many blocks you have or how many data lines each block has
the blocks can have different numbers of data lines
What the script does:
during plotting, the script checks whether column 1 contains a valid number: valid(1) returns 1 if it is a valid number and 0 otherwise (check help valid).
the variable c1 is initialized to 0. During plotting, line by line, c0 is assigned the value of c1 and c1 gets the value of valid(1).
so, every time column 1 changes from text to numbers (i.e. c1>c0), b is incremented by 1 and used for setting the color (check help lc variable)
plot keyentries in a loop with the corresponding title.
So, it is a two-liner which also could be put into a single line.
Script:
### separate data into subblocks with different colors (more flexibility)
reset session
$Data <<EOD
Header 1
Header 2
Header 3
config X Y
0.0 -5 -2
0.0 0 1
0.0 5 4 # could be followed e.g. by an empty line
Header2
Config X Y
1.0 -5 -1
1.0 0 0
1.0 5 5
1.0 6 6 # 4 data entries
Header2
Config X Y
Some other text line added
2.0 -5 0
2.0 0 1
2.0 5 6
2.0 6 7
2.0 7 6.5 # 5 data entries
EOD
set key top left noautotitle
plot c1=b=0 $Data u (c0=c1,c1=valid(1),$2):($3):(c1>c0?b=b+1:b, b-1) w lp pt 7 lc var, \
     for [i=0:b-1] keyentry w lp pt 7 lc i ti sprintf("Config%d",i)
### end of script
Result:
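The block detection used in the more general script (start a new block whenever column 1 changes from text to a number) can be checked outside gnuplot. The following is a minimal Python sketch of that logic only, not a replacement for the gnuplot script; the names is_number and split_into_blocks are hypothetical, and trailing "#" comments are stripped by hand here.

```python
def is_number(token):
    """Rough stand-in for gnuplot's valid(): can the token be read as a number?"""
    try:
        float(token)
        return True
    except ValueError:
        return False


def split_into_blocks(lines):
    """Start a new block each time column 1 switches from text to a number."""
    blocks = []
    prev_valid = False  # plays the role of c0
    for line in lines:
        fields = line.split("#")[0].split()  # drop trailing comments
        cur_valid = bool(fields) and is_number(fields[0])  # plays the role of c1
        if cur_valid:
            if not prev_valid:  # text -> number transition (c1 > c0)
                blocks.append([])
            blocks[-1].append((float(fields[1]), float(fields[2])))
        prev_valid = cur_valid
    return blocks


# The data from the second script
lines = """Header 1
Header 2
Header 3
config X Y
0.0 -5 -2
0.0 0 1
0.0 5 4 # could be followed e.g. by an empty line
Header2
Config X Y
1.0 -5 -1
1.0 0 0
1.0 5 5
1.0 6 6 # 4 data entries
Header2
Config X Y
Some other text line added
2.0 -5 0
2.0 0 1
2.0 5 6
2.0 6 7
2.0 7 6.5 # 5 data entries""".splitlines()

blocks = split_into_blocks(lines)
for i, block in enumerate(blocks):
    print("Config%d: %d points" % (i, len(block)))
```

Because the block index is derived from the data itself, neither the number of blocks nor the number of lines per block has to be known beforehand.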

Related

Gnuplot bar chart with personalize interval on x-axis

I'm new to gnuplot and I would like to replicate this plot: https://images.app.goo.gl/DqygL2gfk3jZ7jsK6
I have a file.dat with continuous values between 0 and 100 and I would like to plot them subdivided into intervals (pident > 98, 90 < pident < 100, etc.), with the total occurrences on the y-axis.
I have looked everywhere for a way to do this, but I still cannot do it.
Thank you !
sample of the data, with the value and the counts:
33.18 5
43.296 1
33.19 1
27.168 5
71.429 11
30.698 9
47.934 1
43.299 3
30.699 3
37.092 2
24.492 2
24.493 2
24.494 7
47.938 1
24.497 1
37.097 8
37.099 2
33.824 7
51.111 15
59.025 2
62.553 2
62.554 2
57.867 2
33.826 2
62.555 1
33.827 5
62.556 2
33.828 1
59.028 1
46.429 11
51.117 1
75.158 2
27.621 1
27.623 1
27.624 2
37.5 113
37.6 2
32.313 8
27.626 3
37.7 3
32.314 1
67.797 3
27.628 2
32.316 2
37.9 1
61.044 1
43.81 5
32.317 8
32.318 2
43.82 4
32.319 2
43.83 2
37.551 3
61.048 1
48.993 6
29.43 2
This is the code I have tried so far (where I also calculate the mean):
#!/usr/bin/gnuplot -persist
set noytics
# Find the mean
mean= system("awk '{sum+=$1*$2; tot+=$2} END{print sum/tot}' hist.dat")
set arrow 1 from mean,0 to mean, graph 1 nohead ls 1 lc rgb "blue"
set label 1 sprintf(" Mean: %s", mean) at mean, screen 0.1
# Histogram
binwidth=10
bin(x,width)=width*floor(x/width)
plot 'hist.dat' using (bin($1,binwidth)):(1.0) smooth freq with boxes
This is the result:
The following script takes your data and sums up the second column within the defined bins.
If you have values equal to 100 in the first column, those values would fall into the bin 100-<110.
With Bin(x) = floor(x/BinWidth)*BinWidth + BinWidth*0.5, the bins are shifted by half a bin width so that each box spans the x-axis from the beginning of its bin to the end (rather than being centered at the beginning of the respective bin).
If you explicitly want xtic labels like in the example graph you've shown, i.e. 10-<20, 20-<30 etc., you would have to fiddle around with the xtic labels.
Edit: Forgot the mean value. There is no need for calling awk. Gnuplot can do this for you as well, check help stats.
Code:
### create histogram
reset session
$Data <<EOD
33.18 5
43.296 1
33.19 1
27.168 5
71.429 11
30.698 9
47.934 1
43.299 3
30.699 3
37.092 2
24.492 2
24.493 2
24.494 7
47.938 1
24.497 1
37.097 8
37.099 2
33.824 7
51.111 15
59.025 2
62.553 2
62.554 2
57.867 2
33.826 2
62.555 1
33.827 5
62.556 2
33.828 1
59.028 1
46.429 11
51.117 1
75.158 2
27.621 1
27.623 1
27.624 2
37.5 113
37.6 2
32.313 8
27.626 3
37.7 3
32.314 1
67.797 3
27.628 2
32.316 2
37.9 1
61.044 1
43.81 5
32.317 8
32.318 2
43.82 4
32.319 2
43.83 2
37.551 3
61.048 1
48.993 6
29.43 2
EOD
# Histogram
BinWidth = 10
Bin(x) = floor(x/BinWidth)*BinWidth + BinWidth*0.5
# Mean
stats $Data u ($1*$2):2 nooutput
mean = STATS_sum_x/STATS_sum_y
set arrow 1 from mean, graph 0 to mean, graph 1 nohead lw 2 lc rgb "red" front
set label 1 sprintf("Mean: %.1f", mean) at mean, graph 1 offset 1,-0.7
set xlabel "Identity / %"
set xrange [0:100]
set xtics 10 out
set ylabel "The number of blast hits"
set style fill solid 0.3
set boxwidth BinWidth
set key noautotitle
set grid x,y
plot $Data using (Bin($1)):2 smooth freq with boxes lc "blue"
### end of code
Result:
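For anyone who wants to verify the numbers independently of gnuplot, here is a minimal Python sketch of the same two computations: summing the counts per bin with the shifted Bin(x) formula, and the count-weighted mean. It only uses the first five rows of the sample data; the function names are hypothetical.

```python
import math
from collections import defaultdict

BIN_WIDTH = 10


def bin_center(x, width=BIN_WIDTH):
    """Same formula as Bin(x) in the gnuplot script: center of the bin containing x."""
    return math.floor(x / width) * width + width * 0.5


def weighted_histogram(pairs, width=BIN_WIDTH):
    """Sum the counts (second column) of all values falling into the same bin."""
    totals = defaultdict(int)
    for value, count in pairs:
        totals[bin_center(value, width)] += count
    return dict(sorted(totals.items()))


def weighted_mean(pairs):
    """Mean of the values weighted by their counts: sum(x*n) / sum(n)."""
    total_count = sum(count for _, count in pairs)
    return sum(value * count for value, count in pairs) / total_count


# First five rows of the sample data (value, count)
sample = [(33.18, 5), (43.296, 1), (33.19, 1), (27.168, 5), (71.429, 11)]

histogram = weighted_histogram(sample)
mean = weighted_mean(sample)
print(histogram)  # {25.0: 5, 35.0: 6, 45.0: 1, 75.0: 11}
print("Mean: %.1f" % mean)
```

Note that bin_center(100) gives 105.0, which corresponds to the remark above that values equal to 100 end up in the bin 100-<110.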

extract h2o random forest in format like rpart frame

The following code:
library(randomForest)
z.auto <- randomForest(Mileage ~ Weight,
data=car.test.frame,
ntree=1,
nodesize = 15)
tree <- getTree(z.auto,k=1,labelVar = T)
tree
Gives this as text output:
   left daughter right daughter split var split point status prediction
1              2              3    Weight      2567.5     -3   24.45000
2              0              0      <NA>         0.0     -1   30.66667
3              4              5    Weight      3087.5     -3   22.37778
4              6              7    Weight      2747.5     -3   24.00000
5              8              9    Weight      3637.5     -3   19.94444
6              0              0      <NA>         0.0     -1   25.20000
7             10             11    Weight      2770.0     -3   23.29412
8              0              0      <NA>         0.0     -1   21.18182
9              0              0      <NA>         0.0     -1   18.00000
10             0              0      <NA>         0.0     -1   22.50000
11             0              0      <NA>         0.0     -1   23.72727
From this data I can see the logic of an individual tree.
How do I get the much longer table, based on this, that describes all the trees in a random forest, from h2o?
I like 'h2o' because it cleanly uses all the cores, and goes at a pretty good clip on my system. It is a nice tool. It is, however, a library separate from 'r' so I am unsure how to access various parts of my data.
How do I get something like the above printed output, in the form of a csv file, from an h2o random forest?
H2O doesn't currently have a function to display a table like that, but you can export the random forest model to a POJO (a Java file) using the h2o.download_pojo() function and then inspect the trees (individual rules) manually.
H2O also accepts feature requests.

Imputation for longitudinal data using observation before and after missing data

I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional formatting based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or if you know of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold characters represent changes from the dataset above
The goal here is to find a way to get the mean of the value before (3) and after (0) the NA value for ID #1 (variable ss) so that the data look like this: 1,3,2,3,1.5,0,0
ID# 2 (variable ss) should look like this: 2,4,0,0,0,0,0
ID #3 (variable ss) should use a last observation carried forward approach, so it would need to look like this: 4,1,2,4,2,3,3
ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this: 2,1,0,NA,NA,0,0 (no change).
I use a package, smwrBase; the syntax for filling in only 1 missing value is below, but it doesn't account for id.
smwrBase::fillMissing(ss, max.fill=1)
The zoo package might be more standard, but it has the same issue.
zoo::na.approx(ss, maxgap=1)
Below is an approach that accounts for the variable id. Current interpolation approaches don't like to fill in the last value, so I added a manual if statement for that. It is a bit brute force, as there might be a tapply approach out there.
> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(id)) {
+ # interpolate for gaps
+ mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
+ # extension for gap as last value
+ if(is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
+ }
+ }
> mydat
   id time ss ss2
1   1    0  1 1.0
2   1    1  3 3.0
3   1    2  2 2.0
4   1    3  3 3.0
5   1    4 NA 1.5
6   1    5  0 0.0
7   1    6  0 0.0
8   2    0  2 2.0
9   2    1  4 4.0
10  2    2  0 0.0
11  2    3 NA 0.0
12  2    4  0 0.0
13  2    5  0 0.0
14  2    6  0 0.0
15  3    0  4 4.0
16  3    1  1 1.0
17  3    2  2 2.0
18  3    3  4 4.0
19  3    4  2 2.0
20  3    5  3 3.0
21  3    6 NA 3.0
22  4    0  2 2.0
23  4    1  1 1.0
24  4    2  0 0.0
25  4    3 NA  NA
26  4    4 NA  NA
27  4    5  0 0.0
28  4    6  0 0.0
The interpolated value for id=1 is 1.5 (avg of 3 and 0), for id=2 it is 0 (avg of 0 and 0), and for id=3 it is 3 (the preceding value, since there is no following value).
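The three rules from the question (mean of the neighbours for a single interior gap, last observation carried forward for a single trailing gap, no change for consecutive gaps) can also be written out explicitly. The following is a minimal sketch in Python rather than R, given only to illustrate the logic; it is not the zoo-based approach above. None stands in for NA, the function name fill_single_gaps is hypothetical, and the rows are assumed to be sorted by id.

```python
from itertools import groupby


def fill_single_gaps(values):
    """Fill isolated missing values (None) within one id.

    - interior gap with observed neighbours on both sides: mean of the two
    - gap in the last position with an observed predecessor: carry it forward
    - anything else (e.g. two consecutive gaps): left unchanged
    """
    filled = list(values)
    n = len(values)
    for k, v in enumerate(values):
        if v is not None:
            continue
        # Neighbours are read from the original values, so a filled value
        # is never used to fill an adjacent gap.
        prev_v = values[k - 1] if k > 0 else None
        next_v = values[k + 1] if k < n - 1 else None
        if prev_v is not None and next_v is not None:
            filled[k] = (prev_v + next_v) / 2
        elif k == n - 1 and prev_v is not None:
            filled[k] = prev_v
    return filled


# The fake dataset from the question
ids = [1] * 7 + [2] * 7 + [3] * 7 + [4] * 7
ss = [1, 3, 2, 3, None, 0, 0,
      2, 4, 0, None, 0, 0, 0,
      4, 1, 2, 4, 2, 3, None,
      2, 1, 0, None, None, 0, 0]

# groupby only groups consecutive equal ids, hence the sorting assumption
ss2 = []
for _, group in groupby(zip(ids, ss), key=lambda pair: pair[0]):
    ss2.extend(fill_single_gaps([value for _, value in group]))

print(ss2)
```

Handling each id separately is what prevents a value from one subject being used to fill a gap at the start or end of another.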

gnuplot: stacked newhistogram, starting with same fillstyle pattern, not appearing in shared key

I am trying to create multiple columnstacked histograms in gnuplot using the newhistogram command.
The data share the same structure so I want the boxes to share the same key like in this question, but with a pattern fillstyle (not only color).
My script looks like this:
set style data histogram
set style histogram columnstacked
set style fill pattern
set boxwidth 0.75
set xtics ("fff" 0, "ggg" 1, "hhh" 3, "iii" 4)
set grid y
set border 3 # remove top and right plot-border
set key outside title 'key' width -30
plot \
newhistogram "data A" fs pattern 2 lt 1, \
"plot_test_data.out" u 2, '' u 3, \
newhistogram "data B" fs pattern 2 lt 1, \
"plot_test_data_2.out" u 2:key(1), '' u 3
which produces the following output:
Could anybody explain why the key doesn't show the correct pattern? (I tried adding fs pattern to each using directive and removing the linetype, but nothing produces the desired output.)
I am using
G N U P L O T
Version 4.6
patchlevel 3
last modified 2013-04-12
Build System: Linux x86_64
with 2 test data sets
plot_test_data.out:
0 1 3 2 2
1 1 6 4 4
2 2 5 4 5
3 1 1 3 4
4 1 2 4 5
plot_test_data_2.out:
0 1 4 3 1
1 3 6 4 2
2 2 5 6 4
3 2 5 5 3
4 1 3 4 2
thank you!

Segmenting a data frame by row based on previous rows values

I have a data frame in R that contains 2 columns named x and y (co-ordinates). The data frame represents a journey with each line representing the position at the next point in time.
      x     y seconds
1   0.0   0.0       0
2  -5.8  -8.5       1
3 -11.6 -18.2       2
4 -16.9 -30.1       3
5 -22.8 -40.8       4
6 -29.0 -51.6       5
I need to break the journey up into segments where each segment starts once the distance from the start of the previous segment crosses a certain threshold (e.g. 200).
I have recently switched from using SAS to R, and this is the first time I've come across anything I can do easily in SAS but can't even think of the way to approach the problem in R.
I've posted the SAS code I would use below to do the same job. It creates a new column called segment.
%let cutoff=200;
data segments;
set journey;
retain segment distance x_start y_start;
if _n_=1 then do;
x_start=x;
y_start=y;
segment=1;
distance=0;
end;
distance + sqrt((x-x_start)**2+(y-y_start)**2);
if distance>&cutoff then do;
x_start=x;
y_start=y;
segment+1;
distance=0;
end;
keep x y seconds segment;
run;
Edit: Example output
If the cutoff were 200 then an example of required output would look something like...
      x     y seconds segment
1   0.0   0.0       0       1
2  40.0  30.0       1       1
3  80.0  60.0       2       1
4 120.0  90.0       3       1
5 160.0 120.0       4       2
6 120.0 150.0       5       2
7  80.0 180.0       6       2
8  40.0 210.0       7       2
9   0.0 240.0       8       3
If your data set is dd, something like
cutoff <- 200
origin <- dd[1,c("x","y")]
cur.seg <- 1
dd$segment <- NA
for (i in 1:nrow(dd)) {
dist <- sqrt(sum((dd[i,c("x","y")]-origin)^2))
if (dist>cutoff) {
cur.seg <- cur.seg+1
origin <- dd[i,c("x","y")]
}
dd$segment[i] <- cur.seg
}
should work. There are some refinements (it might be more efficient to compute distances of the current origin to all rows, then use which(dist>cutoff)[1] to jump to the first row that goes beyond the cutoff), and it would be interesting to try to come up with a completely vectorized solution, but this should be OK. How big is your data set?
