I have some data that I want to plot with gnuplot, but I have many y values for the same x value. Here is a sample so you can see what I mean:
0 0.650765 0.122225 0.013325
0 0.522575 0.001447 0.010718
0 0.576791 0.004277 0.104052
0 0.512327 0.002268 0.005430
0 0.530401 0.000000 0.036541
0 0.518333 0.001128 0.017270
20 0.512864 0.001111 0.005433
20 0.510357 0.005312 0.000000
20 0.526809 0.001089 0.033523
20 0.527076 0.000000 0.034215
20 0.507166 0.001131 0.000000
20 0.513868 0.001306 0.004344
40 0.531742 0.003295 0.0365
In this example, I have 6 values for each x value. So how can I draw the average and the confidence bar (interval)?
Thanks for the help.
To do this, you will need some kind of external processing. One possibility is to use gawk to calculate the required quantities and then feed this auxiliary output to Gnuplot to plot it. For example:
set terminal png enhanced
set output 'test.png'
fName = 'data.dat'
plotCmd(col_num)=sprintf('< gawk -f analyze.awk -v col_num=%d %s', col_num, fName)
set format y '%0.2f'
set xr [-5:45]
plot \
plotCmd(2) u 1:2:3:4 w yerrorbars pt 3 lc rgb 'dark-red' t 'column 2'
This assumes that the script analyze.awk resides in the directory from which Gnuplot is launched (otherwise, the path in the -f option of gawk has to be modified). The script analyze.awk itself reads:
# Print x, mean, min and max of the n values stored in data[1..n].
function analyze(x, data, n,    i, val, mean, val_min, val_max){
    mean = 0;
    for(i = 1; i <= n; i++){
        val = data[i] + 0;          # force a numeric (not string) comparison
        mean += (val - mean)/i;     # online update of the mean
        val_min = (i == 1 || val < val_min)?val:val_min;
        val_max = (i == 1 || val > val_max)?val:val_max;
    }
    if(n > 0){
        print x, mean, val_min, val_max;
    }
}
{
    curr = $1;
    if(NR > 1 && prev != curr){
        analyze(prev, data, n);
        delete data;
        n = 0;
    }
    prev = curr;
    # index by a counter so that repeated y values are not collapsed
    data[++n] = $(col_num);
}
END{
    analyze(curr, data, n);
}
It directly implements the online algorithm to calculate the mean and for each distinct value of x prints this mean as well as the min/max values.
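For readers who want to check the numbers outside awk, here is a minimal Python sketch of the same idea: an online mean update plus running min/max per x value. The function name summarize and the shortened sample rows are made up for illustration. Unlike the awk script, it groups by x regardless of whether equal x values are adjacent in the file.

```python
def summarize(rows, col):
    """Group rows by their first field and return (x, mean, min, max) tuples.

    `col` is a 1-based column index, mirroring awk's $(col_num).
    """
    groups = {}  # x -> [count, running mean, min, max]; dicts keep insertion order
    for row in rows:
        x, y = row[0], row[col - 1]
        state = groups.setdefault(x, [0, 0.0, y, y])
        state[0] += 1
        state[1] += (y - state[1]) / state[0]  # online mean update
        state[2] = min(state[2], y)
        state[3] = max(state[3], y)
    return [(x, s[1], s[2], s[3]) for x, s in groups.items()]


# A shortened version of the sample data (x and the second column only).
rows = [
    (0, 0.650765), (0, 0.522575), (0, 0.576791),
    (20, 0.512864), (20, 0.510357),
    (40, 0.531742),
]
summary = summarize(rows, 2)
for x, mean, low, high in summary:
    print(x, round(mean, 6), low, high)
```

Each printed line has the x:y:ylow:yhigh layout that yerrorbars expects.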
In the Gnuplot script, the column of interest is passed to the plotCmd function, which builds the command to execute; its output is then plotted with u 1:2:3:4 w yerrorbars. With four columns, yerrorbars reads x, y, ylow and yhigh, so the value itself (the mean) resides in the second column and the lower/upper ends of the bar (here the min/max of each group, standing in for the confidence interval) reside in the 3rd/4th columns.
In total, the two scripts above produce the picture below. The confidence interval on the last point is not visible since the example data in your question contain only one record for x=40, thus the min/max values coincide with the mean.
You can easily plot the average of the y columns of each row in this case:
plot "myfile.dat" using 1:(($2 + $3 + $4)/3)
If you want the average of only the second and fourth columns, for example, you can write (($2+$4)/2), and so on.
Related
I want to calculate the difference of two columns of a data frame containing times. Since the value from one column is not always the bigger/later one, I have to use a workaround with an if-clause:
counter = 1
while(counter <= nrow(data)){
if(data$time_end[counter] - data$time_begin[counter] < 0){
data$chargingDuration[counter] = 1-abs(data$time_end[counter]-data$time_begin[counter])
}
if(data$time_end[counter] - data$time_begin[counter] > 0){
data$chargingDuration[counter] = data$time_end[counter]-data$time_begin[counter]
}
counter = counter + 1
}
The output I get is a decimal value smaller than 1 (e.g. 0.53322, meaning roughly half a day). However, if I calculate the time difference manually for a single row in the console, I get my desired result, which looks like 02:12:03.
Thanks for the help guys :)
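As a rough illustration of the arithmetic involved (in Python rather than R, and assuming the times really are stored as fractions of a day, as the 0.53322 output suggests), the wrap-around and the conversion to HH:MM:SS can be sketched like this. The helper names duration_fraction and as_hms are hypothetical.

```python
def duration_fraction(begin, end):
    """Elapsed time as a fraction of a day, wrapping past midnight.

    The modulo gives end - begin when end >= begin and
    1 - abs(end - begin) otherwise, which is what the two if-branches do.
    """
    return (end - begin) % 1.0


def as_hms(day_fraction):
    """Format a fraction of a day as HH:MM:SS (rounded to whole seconds)."""
    total_seconds = round(day_fraction * 86400)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(as_hms(0.53322))                        # 12:47:50
print(as_hms(duration_fraction(0.9, 0.1)))    # 04:48:00
```

So the decimal value is not wrong, only unformatted: it is the same duration expressed in days instead of hours, minutes and seconds.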
My data file data.dat looks like this:
1e-23 1e-23 1e-27 2e-28
2e-22 4e-23 1e-23 4e-23
3e-21 1e-23 1e-24 6e-23
4e-32 1e-23 1e-25 8e-30
Using gnuplot, I use stats "data.dat" matrix to find the minimum and maximum values in the above array (data file); both values are shown as zero. I am guessing stats is reading exponentially low values as zero. Is there a way to fix this?
If you execute the command
stats 'data.dat' matrix
the output minimum/maximum reported is indeed:
Minimum: 0.0000 [ 0 3 ]
Maximum: 0.0000 [ 0 2 ]
COG: 0.1157 1.9057
This is due to the strategy Gnuplot uses for formatting the values in the stats output. The relevant function is:
static char*
fmt( char *buf, double val )
{
if ( isnan(val) )
sprintf( buf, "%11s", "undefined");
else if ( fabs(val) < 1e-14 ) //<-- HERE
sprintf( buf, "%11.4f", 0.0 );
else if ( fabs(log10(fabs(val))) < 6 )
sprintf( buf, "%11.4f", val );
else
sprintf( buf, "%11.5e", val );
return buf;
}
This means that any value whose absolute value is smaller than 1E-14 is displayed as zero, even though the stored value is not zero.
To get the raw values, you might use the STATS_min/STATS_max variables:
gnuplot> print STATS_min, STATS_max
4.00000009496891e-32 2.99999990479657e-21
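To see the effect of the branch marked HERE without rebuilding Gnuplot, the formatting logic quoted above can be mimicked in a few lines of Python. This is only a hypothetical port for illustration, not Gnuplot code:

```python
import math


def fmt(val):
    """Hypothetical Python port of the C formatting routine quoted above."""
    if math.isnan(val):
        return "%11s" % "undefined"
    if abs(val) < 1e-14:                   # tiny values are displayed as zero
        return "%11.4f" % 0.0
    if abs(math.log10(abs(val))) < 6:      # moderate magnitudes: fixed notation
        return "%11.4f" % val
    return "%11.5e" % val                  # everything else: exponent notation


for value in (4e-32, 3e-21, 1.5e-8, 0.1157):
    print(repr(value), "->", fmt(value))
```

Both extremes of the data file fall into the second branch, which is why the summary shows 0.0000 for them while STATS_min and STATS_max still hold the actual values.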
EDIT:
The stats 'filename' matrix seems to behave slightly differently than the "ordinary" stats executed per column:
fname = 'data.dat'
stats fname nooutput
N = STATS_columns
#specify which values to include in the minimum calculation
cond(val) = 1 #(abs(val) > 1E-30)
#process each column individually and determine the global minimum
#if there are no values, set the minimum to +inf
globalMin = real('inf')
do for [i=1:N] {
stats fname using (cond(column(i))?column(i):NaN) nooutput
#due to cond( ), there might not be any records, so one needs to check
if(STATS_records) {
globalMin = (STATS_min < globalMin)?STATS_min:globalMin
}
}
print globalMin
Given data with a number-of-days-to-event, and an outcome, like so:
data pretend ;
do subject = 1 to 1000 ;
fup_time = round(uniform(83386)*500, 1) ;
select(round(uniform(778523)*5, 1)) ;
when(1) outcome = 'cholys' ;
when(2) outcome = 'death' ;
when(3) outcome = 'tx end' ;
when(4) outcome = 'vascul' ;
otherwise outcome = 'reop' ;
end ;
output ;
end ;
label
fup_time = "The day on which ::outcome:: occurred"
outcome = "What the subject's last observed event was"
;
run ;
What's the easiest way to generate curves for each outcome that show what proportion of the sample was still being observed on day X, broken out by outcome?
I tried:
proc lifetest data = pretend plots = all notable ;
strata outcome ;
time fup_time*censored(1) ;
run ;
Where 'censored' is set to 1 whenever outcome is 'tx end' or 'death'. I quite like the product-limit survival curves that produces, except that the lines for death & tx end are completely flat, at y = 1.0.
I'm not actually looking to do any inference here at all--just want the pretty pictures. Is there an easy way?
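To pin down what such a curve would contain, here is a small Python sketch of one possible reading of the question: for each outcome, the fraction of that outcome's subjects whose fup_time is on or after day X. Both the per-outcome denominator and the "on or after" convention are assumptions, and the toy records are made up.

```python
from collections import defaultdict


def still_observed(records, days):
    """For each outcome, the fraction of its subjects with fup_time >= day."""
    by_outcome = defaultdict(list)
    for fup_time, outcome in records:
        by_outcome[outcome].append(fup_time)
    curves = {}
    for outcome, times in by_outcome.items():
        curves[outcome] = [sum(t >= d for t in times) / len(times) for d in days]
    return curves


# Toy (fup_time, outcome) pairs, not taken from the SAS data step.
records = [(10, "death"), (30, "death"),
           (20, "reop"), (40, "reop"), (60, "reop"), (80, "reop")]
curves = still_observed(records, days=[0, 25, 50])
for outcome, values in curves.items():
    print(outcome, values)
```

Because no subject is treated as censored here, every curve eventually drops to zero, which avoids the flat lines at y = 1.0.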
How can I generate or update variables without using a loop? mutate doesn't work here (at least I don't know how to get it to work for this problem) because I need to calculate stuff from multiple rows in another data set.
I'm supposed to replicate the regression results of an academic paper, and I'm trying to generate some variables required in the regression. The following is what I need.
I have 2 relevant data sets for this question: subset (containing geocoded residential property transactions) and sch_relocs (containing the dates of school relocation events as well as their locations).
I need to calculate the distance between each residential property and the nearest (relocated) school
If the closest school is one that relocated to the area near the residential property, the dummy variable new should be 1 (if the school relocated away from the area, then new should be 0)
If the relocated school moved only a small distance, and a house is within the overlapping portion of the respective 2km radii around the school locations, the dummy variable overlap should be 1, otherwise 0
If the distance to the nearest school is <= 2km, the dummy variable in_zone should be 1. If the distance is between 2km and 4km, these transactions are considered controls, and hence in_zone should be 0. If the distance is greater than 4km, I should drop the observations from the data
I have tried to do this using a for loop, but it's taking ages to run (it's still not done after running overnight), so I need a better way to do it. Here's my code (it's very messy; I think the explanation above is a lot easier to follow if you want to figure out what I'm trying to do).
for (i in 1:as.integer(tally(subset))) {
# dist to new sch locations
for (j in 1:as.integer(tally(sch_relocs))) {
dist = distHaversine(c(subset[i,]$longitude, subset[i,]$latitude),
c(sch_relocs[j,]$new_lon, sch_relocs[j,]$new_lat)) / 1000
if (dist < subset[i,]$min_dist_new) {
subset[i,]$min_dist_new = dist
subset[i,]$closest_new_sch = sch_relocs[j,]$school_name
subset[i,]$date_new_loc = sch_relocs[j,]$date_reloc
}
}
# dist to old sch locations
for (j in 1:as.integer(tally(sch_relocs))) {
dist = distHaversine(c(subset[i,]$longitude, subset[i,]$latitude),
c(sch_relocs[j,]$old_lon, sch_relocs[j,]$old_lat)) / 1000
if (dist < subset[i,]$min_dist_old) {
subset[i,]$min_dist_old = dist
subset[i,]$closest_old_sch = sch_relocs[j,]$school_name
subset[i,]$date_old_loc = sch_relocs[j,]$date_reloc
}
}
# generate dummy "new"
if (subset[i,]$min_dist_new < subset[i,]$min_dist_old) {
subset[i,]$new = 1
subset[i,]$date_move = subset[i,]$date_new_loc
}
else if (subset[i,]$min_dist_new >= subset[i,]$min_dist_old) {
subset[i,]$date_move = subset[i,]$date_old_loc
}
# find overlaps
if (subset[i,]$closest_old_sch == subset[i,]$closest_new_sch &
subset[i,]$min_dist_old <= 2 &
subset[i,]$min_dist_new <= 2) {
subset[i,]$overlap = 1
}
# find min dist
subset[i,]$min_dist = min(subset[i,]$min_dist_old, subset[i,]$min_dist_new)
# zoning
if (subset[i,]$min_dist <= 2) {
subset[i,]$in_zone = 1
}
else if (subset[i,]$min_dist <= 4) {
subset[i,]$in_zone = 0
}
else {
subset[i,]$in_zone = 2
}
}
Here's what the data sets look like (just the relevant variables):
subset data set with desired result (first 2 rows):
sch_relocs data set (full with only relevant columns)
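The rules described above can be restated compactly as one function per property. The sketch below is in Python rather than R and is not a vectorised replacement for the loop; it only makes the logic explicit. The dictionary keys, the classify helper, the toy coordinates and the Earth radius of 6378.137 km are all assumptions made for illustration.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2, radius_km=6378.137):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def classify(house, schools):
    """Return min_dist, new, overlap and in_zone for one property."""
    d_new, s_new = min(
        (haversine_km(house["lon"], house["lat"], s["new_lon"], s["new_lat"]), s["name"])
        for s in schools)
    d_old, s_old = min(
        (haversine_km(house["lon"], house["lat"], s["old_lon"], s["old_lat"]), s["name"])
        for s in schools)
    min_dist = min(d_new, d_old)
    return {
        "min_dist": min_dist,
        "new": int(d_new < d_old),
        "overlap": int(s_new == s_old and d_new <= 2 and d_old <= 2),
        # None marks observations farther than 4 km, which are to be dropped
        "in_zone": 1 if min_dist <= 2 else 0 if min_dist <= 4 else None,
    }


schools = [{"name": "A", "old_lon": 0.0, "old_lat": 0.0,
            "new_lon": 0.1, "new_lat": 0.0}]
near = classify({"lon": 0.1, "lat": 0.005}, schools)
control = classify({"lon": 0.1, "lat": 0.03}, schools)
far = classify({"lon": 0.05, "lat": 0.0}, schools)
print(near, control, far, sep="\n")
```

Computing both minima once per property and deriving the dummies from them means a row is never modified repeatedly inside a loop, and the same structure carries over to a distance-matrix approach.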
I'm trying to implement the following in R, but I'm new to R and my code doesn't work.
I have a matrix A, and I changed its coordinates.
I want to write two functions:
1) return the element of the matrix, given its coordinates
2) return the coordinates, given the number.
The pseudocode is right; the only problem is my syntax. Can somebody correct it?
f<- as.numeric(readline(prompt="Please enter 10 to get coordinate of number,and 20 to get the number > "));
if(p==10){
# give the number, given coordinates
i<- as.numeric(readline(prompt="Pleae enter i cordinate > "));
j<- as.numeric(readline(prompt="Pleae enter j cordinate > "));
if (i>0&j<0) return A[5+i,5+j]
if (i>0&j>0) return A[5+i,5+j]
if (i<0&j>0) return A[5+i,5-j]
if (i<0&j<0) return A[5+i,5-j]
}else if (p==20){
#give the cordinate, given number
coordinate <- which(A==number)
[i,j]<-A[2-coordinate[0],coordinate[1]-2]
}
}
Warning: what if i or j is equal to zero? Next, make a single variable that is the decimal representation of the binary pair (i>0, j>0). That is,
if(p==10){
x <- (i>0) + 2*(j>0) + 1
# x takes on values 1 through 4, because switch with a numeric argument needs a positive integer
switch(x,
return(A[5+i,5+j]), # i<=0, j<=0
return(A[5+i,5+j]), # i>0, j<=0
return(A[5+i,5+j]), # i<=0, j>0
return(A[5+i,5+j])) # i>0, j>0; change the +/- indices as desired
}else{
#etc.
And, finally, you should make this a function, not a collection of commands.
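Edit - I skipped this before, but R indexes from 1, so coordinate[0] does not give you the first element (it returns an empty vector). You need to fix a number of things in the line [i,j]<-A[2-coordinate[0],coordinate[1]-2]; for a start, [i,j] <- ... is not valid R assignment syntax.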
The syntax is as follows:
x <- 4
if (x == 1 | x == 2) print("YES")