Gnuplot Script for average over multiple files - graph

I have a series of measurements in several files. Every file looks like this:
1 151.973938 144.745789 152.21991 17:57:14
2 151.995697 144.755737 152.21991 17:57:14
3 152.015747 144.765076 152.21991 17:57:14
.
.
.
I'm looking for a way to compute the average of the same field across several files. At the end of the process I would like to have a graph of the averaged measurements.
Is that possible with gnuplot? I wasn't able to find a suitable option in gnuplot myself. If not, what other way to achieve this would you recommend?
Best regards, Juuro

You cannot do it all in gnuplot. Gnuplot is limited to processing columns from one file at a time, so you need some other utility to preprocess your data. Assuming the data is in the format you demonstrated (with spaces instead of semicolons), this script will take the average of the second, third and fourth columns and spit out a file with the first and fifth columns unchanged and the middle columns averaged. Run this bash script in a directory that contains only the .txt files you want to process.
#!/bin/bash
# number of input files (ls prints one line per file)
sum=$(ls -l *.txt | wc -l)
paste -d" " *.txt | nawk -v s="$sum" '{
    # each pasted line holds 5 columns per file; average columns 2-4 of each
    for (i = 0; i <= s-1; i++) {
        t1 = 2 + (i*5)
        temp1 = temp1 + $t1
        t2 = 3 + (i*5)
        temp2 = temp2 + $t2
        t3 = 4 + (i*5)
        temp3 = temp3 + $t3
    }
    print $1" "temp1/s" "temp2/s" "temp3/s" "$5
    temp1 = 0
    temp2 = 0
    temp3 = 0
}'
From inside gnuplot, you can run the script like this:
!myscript.sh > output.out
plot 'output.out' u 1:2 # or whatever
or like this:
plot '<myscript.sh' u 1:2
(code inspired by what I found here.)

I think it's not possible with gnuplot alone. I would first write a script that does the averaging and prints the results to stdout. Say this script is called average.py:
plot '<average.py FILE1 FILE2 FILE3' w l
The script average.py could look like this, for example:
#!/usr/bin/python
from numpy import loadtxt, mean, ones
import sys

# number of files:
nrfiles = len(sys.argv[1:])
# read the first file just to get the dimensions
data = loadtxt(str(sys.argv[1]))
rows = data.shape[0]
cols = data.shape[1]
# initialize array: one layer per file
all = ones((int(nrfiles), int(rows), int(cols)))
# load all files:
n = 0
for file in sys.argv[1:]:
    data = loadtxt(str(file))
    all[n, :, 0:cols] = data
    n = n + 1
# calculate the mean over the file axis:
mean_all = mean(all, axis=0)
# print to stdout
for i in range(rows):
    a = ''
    for j in range(cols):
        a = a + str('%010.5f ' % mean_all[i, j])
    print str(a)
The limitation of this script is that all files must have the same data structure.

Related

R: Simple Random Sample of Massive Dataframe

I have a massive (8GB) dataset, which I am simply unable to read into R using my existing setup. Attempting to use fread on the dataset crashes the R session immediately, and attempting to read in random lines from the underlying file was insufficient because: (1) I don't have a good way of knowing that total number of rows in the dataset; (2) my method was not a true "random sampling."
These attempts to get the number of rows have failed (they take as long as simply reading the data in):
length(count.fields("file.dat", sep = "|"))
read.csv.sql("file.dat", header = FALSE, sep = "|", sql = "select count(*) from file")
Is there any way via R or some other program to generate a random sample from a large underlying dataset?
Potential idea: is it possible, given a "sample" of the first several rows, to get a sense of the average amount of information contained on a per-row basis, and then back out how many rows there must be given the size of the dataset (8 GB)? This wouldn't be accurate, but it might give a ballpark figure that I could just undercut.
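A rough sketch of that estimate (my own illustration, not part of the original question): read a small chunk of the file, measure the average bytes per line, and divide the file size by it. It assumes the "|"-separated file.dat from above.
# estimate the total number of rows from the average bytes per line
con <- file("file.dat", "r")
sample.lines <- readLines(con, n = 1000)   # small sample from the top of the file
close(con)
bytes.per.line <- mean(nchar(sample.lines, type = "bytes")) + 1   # +1 for the newline
est.rows <- file.info("file.dat")$size / bytes.per.line
est.rows   # ballpark only; undercut this when sampling line numbers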
Here's one option, using the ability of fread to accept a shell command that preprocesses the file as its input. Using this option we can run a gawk script to extract the required lines. Note that you may need to install gawk if it is not already on your system; if you only have plain awk, you can use that instead.
First let's create a dummy file to test on:
library(data.table)
dt = data.table(1:1e6, sample(letters, 1e6, replace = TRUE))
write.csv(dt, 'test.csv', row.names = FALSE)
Now we can use the shell command wc to find how many lines there are in the file:
nl = read.table(pipe("wc -l test.csv"))[[1]]
Take a sample of line numbers and write them (in ascending order) to a temp file, which makes them easily accessible to gawk.
N = 20 # number of lines to sample
sample.lines = sort(sample(2:nl, N)) #start sample at line 2 to exclude header
cat(paste0(sample.lines, collapse = '\n'), file = "lines.txt")
Now we are ready to read in the sample using fread and gawk (based on this answer). You can also try some of the other gawk scripts in the linked question, which could possibly be more efficient on very large data.
dt.sample = fread("gawk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt test.csv")
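One small follow-up (my own note, not from the original answer): since the header line is excluded from the sampled line numbers, the result comes back with default column names (V1, V2, ...). They can be restored from the file's first row, for example:
setnames(dt.sample, names(fread("test.csv", nrows = 0)))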

How would you generate a .txt file

I have a script that uses sequences ("suites"), O(n): basically a linear algorithm to generate possible strings.
Roughly schematized, it goes like this:
X = result
x from 1 to x max
y = x max
It produces chains of possible alphanumeric character outcomes.
My wish is to capture the script's output (it will generate all possible strings) in one .txt file.
First, your script must output what you want, with something like:
echo $VARIABLE
Then, to save the output of the script:
./yourscript.sh > a.txt
To append to a file instead:
./yourscript.sh >> a.txt

Writing a loop to store many data variables in r

I want to begin by saying that I am not a programmer; I'm just trying to store data so it's easily readable to myself.
I just downloaded a large .nc file of weather data and I am trying to take data from the file and store it in .csv format so I can easily view it in Excel. The problem is that the data contains 53 variables with three 'dimensions': latitude, longitude, and time. I have written some code to take only a certain latitude and longitude but every timestamp, so I get one nice column for each variable. My problem is that I want the loop to store a column for every variable in a different (arbitrary) object in R, so that I just have to run it once and then write all the data to one .csv file with the write.csv function.
Here's the code I've written so far, where janweather is the .nc file.
while (j <= 53) {
  v1 <- janweather$var[[j]]
  varsize <- v1$varsize
  ndims <- v1$ndims
  nt <- varsize[ndims]  # Remember timelike dim is always the LAST dimension!
  j <- j + 1
  for (i in 1:nt) {
    # Initialize start and count to read one timestep of the variable.
    start <- rep(1, ndims)  # begin with start=(1,1,1,...,1)
    start[1] <- i
    start[2] <- i  # change to start=(i,i,1
    count <- varsize  # begin w/count=(nx,ny,nz,...), reads entire var
    count[1] <- 1
    count[2] <- 1
    data3 <- get.var.ncdf(janweather, v1, start = start, count = count)
  }
}
Here are the details of the nc file from print.ncdf(janweather):
[1] "file netcdf-atls04-20150304074032-33683-0853.nc has 3 dimensions:" [1] "longitude Size: 240" [1] "latitude Size: 121" [1] "time Size: 31" [1] "------------------------" [1] "file netcdf-atls04-20150304074032-33683-0853.nc has 53 variables:"
My main goal is to have all the variables stored under different names by the get.var.ncdf function. Right now it just keeps overwriting 'data3' until it reaches the last variable, so all I've accomplished is getting data3 written to the last variable. I'd like to think there is an easy solution to this, but I'm not exactly sure how to generate strings to store the variables under.
Again, I'm not a programmer so I'm sorry if anything I've said doesn't make any sense, I'm not very well versed in the lingo or anything.
Thanks for any and all help you guys bring!
If you're not a programmer and only want to get the variables in CSV format, you can use the NCO commands. With these commands you can do multiple operations on netCDF files.
With the command ncks you can output the data from a variable for a specific dimension slice.
ncks -H -v latitude janweather.nc
This command will list on the screen the values in the latitude variable.
ncks -s '%f ,' -H -v temperature janweather.nc
This command will list the values of the variable temperature, with the format specified by the -s argument (sprintf style).
So just redirect the output to a file and there you have the contents of a variable in a text file:
ncks -s '%f ,' -H -v temperature janweather.nc > temperature.csv
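If you would rather stay in R, here is a minimal, untested sketch of the naming idea from the question: instead of overwriting data3, store each variable in a named list keyed by the variable's name and write the whole list out once. It assumes the ncdf package and the janweather object from the question, and picks the first grid cell purely for illustration.
# collect one column per variable in a named list instead of overwriting data3
vars <- list()
for (j in 1:53) {
  v1 <- janweather$var[[j]]
  start <- rep(1, v1$ndims)   # first longitude/latitude index, first timestep
  count <- v1$varsize         # full extent ...
  count[1] <- 1               # ... except a single longitude
  count[2] <- 1               # ... and a single latitude
  vars[[v1$name]] <- as.vector(get.var.ncdf(janweather, v1, start = start, count = count))
}
# one column per variable, one row per timestep
write.csv(as.data.frame(vars), "janweather_point.csv", row.names = FALSE)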

read large csv files in R inside for loop

To speed it up I'm setting colClasses; my readfile function looks like the following:
readfile = function(name, save = 0, rand = 1)
{
  data = data.frame()
  tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
  classes <- sapply(tab5rows, class)
  data <- read.table(pipe(paste("cat", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",")
  if (save == 1)
  {
    out = paste(file, "Rdata", sep = ".")
    save(data, file = out)
  }
  else
  {
    data
  }
}
Contents of myscript.sh:
#!/bin/sh
awk -v prob="$1" 'BEGIN {srand()} {if(NR==1)print $0; else if(rand() < prob) print $0;}'
As an extension to this, I needed to read the file incrementally. Say, if the file had 10 lines at 10 am and 100 lines at 11 am, I needed those newly added 90 lines plus the header (without which I would not be able to implement further R processing). I made a change to the readfile function using the command:
data <- read.table(pipe(paste("(head -n1 && tail -n", skip, ") <", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",")
Here skip gives me the number of lines to be tailed (calculated by some other script; let's say I already have these). I call this function readfileIncrementally.
a, b, c, d are CSV files, each with 18 columns. Now I run this inside a for loop, say for i in a b c d.
The four files have different values of skip; let's say skip = 10,000 for a and 20,000 for b. If I run these individually (not in the for loop), it runs fine. But inside the loop it gives me the error "in scan, line n does not have 18 columns". Usually this happens when the skip value is greater than about 3000.
However, I cross-checked the number of columns using the command awk -F "," 'NF != 18' ./a.csv; the file surely has 18 columns.
It looks like a timing issue to me. Is there any way to give R the required amount of time before going to the next file, or is there anything I'm missing? When run individually it works fine (though it takes a few seconds).
data <- read.table(pipe(paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1, ") <", name, "| ./myscript.sh", rand)), header = TRUE, colClasses = classes, sep = ",")
worked for me. Basically, the last line was not getting written completely by the time R was reading the file, and hence it gave the error that line number n didn't have 18 columns. Making it read one line less works fine for me.
Apart from this I didn't find any R feature to overcome such scenarios.

Opening and reading multiple netcdf files with RnetCDF

Using R, I am trying to open all the netCDF files I have in a single folder (e.g. 20 files), read a single variable, and create a single data.frame combining the values from all files. I have been using RNetCDF to read netCDF files. For a single file, I read the variable with the following commands:
library('RNetCDF')
nc = open.nc('file.nc')
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,240))
where 414 & 315 are the longitude and latitude of the value I would like to extract and 240 is the number of timesteps.
I have found this thread which explains how to open multiple files. Following it, I have managed to open the files using:
filenames= list.files('/MY_FOLDER/',pattern='*.nc',full.names=TRUE)
ldf = lapply(filenames,open.nc)
but now I'm stuck. I tried
var1= lapply(ldf, var.get.nc(ldf,'LWdown',start=c(414,315,1),count=c(1,1,240)))
but it doesn't work.
The added complication is that every nc file has a different number of timesteps. So I have 2 questions:
1: How can I open all files, read the variable in each file and combine all values in a single data frame?
2: How can I set the last dimension in count to vary for all files?
Following @mdsummer's comment, I have tried a loop instead and have managed to do everything I needed:
# Declare data frame
df = NULL
# Open all files
files = list.files('MY_FOLDER/', pattern = '*.nc', full.names = TRUE)
# Loop over files
for (i in seq_along(files)) {
  nc = open.nc(files[i])
  # Read the whole nc file and get the length of the varying dimension
  # (here the 3rd dimension, specifically time)
  lw = var.get.nc(nc, 'LWdown')
  x = dim(lw)
  # Vary the time dimension for each file as required
  lw = var.get.nc(nc, 'LWdown', start = c(414, 315, 1), count = c(1, 1, x[3]))
  # Add the values from each file to a single data.frame
  rbind(df, data.frame(lw)) -> df
}
There may be a more elegant way but it works.
You're passing the additional function parameters wrong. You should use ... for that. Here's a simple example of how to pass na.rm to mean.
x.var <- 1:10
x.var[5] <- NA
x.var <- list(x.var)
x.var[[2]] <- 1:10
lapply(x.var, FUN = mean)
lapply(x.var, FUN = mean, na.rm = TRUE)
edit
For your specific example, this would be something along the lines of
var1 <- lapply(ldf, FUN = var.get.nc, variable = 'LWdown', start = c(414, 315, 1), count = c(1, 1, 240))
though this is untested.
I think this is much easier to do with CDO, as you can select the varying timestep easily using the date or time stamp, and pick out the desired nearest grid point. Here is an example bash script:
# I don't know what your time axis looks like;
# you may need to use a date with a time stamp too if your data is not e.g. daily.
# See the CDO manual for how to define dates.
date=20090101
lat=10
lon=50
files=`ls MY_FOLDER/*.nc`
for file in $files ; do
    # select the nearest grid point and the date slice desired:
    # %??? strips the .nc from the file name
    cdo seldate,$date -remapnn,lon=$lon/lat=$lat $file ${file%???}_${lat}_${lon}_${date}.nc
done
Then use an R script to read in the resulting files.
It is possible to merge all the new files with cdo, but you would need to be careful if the time stamp is the same. You could try cdo merge or cdo cat - that way you can read in a single file to R, rather than having to loop and open each file separately.
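For example (an untested sketch; merged.nc is just an illustrative name for the output of something like cdo cat MY_FOLDER/*_10_50_20090101.nc merged.nc), the whole concatenated series could then be read in one go with RNetCDF:
library(RNetCDF)
# read the single merged point file instead of looping over many files
nc <- open.nc('merged.nc')
lw <- var.get.nc(nc, 'LWdown')   # the remapped files are 1x1 in space, so this is the full time series
df <- data.frame(lw = lw)
close.nc(nc)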
