wav: inconsistent number of samples when using avconv

I need to produce a number of consistent music samples.
Given a 1-second-long wav file with a sample rate of 44100 Hz, I should be able to get an array of exactly 44100 samples. Unfortunately, this is not the case.
My approach is the following:
1) Produce output.wav, which is 1 second long with a sample rate of 44100 Hz:
avconv -i input.mp3 -ss 00:01:00 -t 00:00:01 -ar 44100 -ac 1 output.wav
2) Read the file and print the number of samples:
import scipy.io.wavfile

meta, song = scipy.io.wavfile.read(path + "/" + file)  # meta is the sample rate
assert meta == 44100
print(len(song))
For different choices of input.mp3 and starting position I get different numbers:
43776, 43776, 44928, 43776, 43776, 44928
My question is: why is that the case, and how can I change step 1 to produce consistent data samples?

avconv is not very precise: the fragment that is supposed to be one second long actually lasts 0.983991 seconds.
To solve the problem we can use sox instead:
sox input.mp3 -r 44100 -c 1 output.wav --show-progress trim 0 00:01
sox does not support mp3 by default, so I had to install:
sudo apt-get install libsox-fmt-mp3
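To double-check the result, you can count the samples in the trimmed file. A minimal sketch using soxi, which ships with sox (the expected 44100 assumes the 1-second, 44100 Hz settings above):
# print the number of samples in output.wav; should be exactly 44100
soxi -s output.wav
# or check a whole directory of clips
for f in *.wav; do printf '%s %s\n' "$f" "$(soxi -s "$f")"; done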

Related

Extracting dates that satisfied multiple conditions using NCO/CDO or bash

I have a netcdf file containing vorticity (per sec) and winds (m/s). I want to print the dates of gridpoints that satisfy the following conditions:
1) Vorticity > 1x10^-5 per sec and winds >= 5 m/s at a gridpoint.
2) The average of vorticity and winds at the four surrounding gridpoints (north, west, east, south) of the gridpoint found in (1) should also be > 1x10^-5 per sec and 5 m/s, respectively.
I am able to filter just the gridpoints that satisfy (1), using ncap2:
ncap2 -v -O -s 'where(vort > 1e-5 && winds >= 5) vort=vort; elsewhere vort=vort.get_miss();' input_test.nc output_test.nc
How do I get the dates? Also, how can I implement the second condition?
Here's a screenshot of the header of the netcdf file.
I'd appreciate any help on this.
This may be achieved by combining "cdo" and "nco".
The average value of the 4 surrounding grid points needed for the second condition can be calculated by combining the shiftx/shifty and ensmean operators of "cdo".
cdo selname,vr,wspd input_test.nc vars.nc
cdo -expr,'vr_mean=vr; wspd_mean=wspd' \
    -ensmean \
    -shiftx,1 vars.nc \
    -shiftx,-1 vars.nc \
    -shifty,1 vars.nc \
    -shifty,-1 vars.nc \
    vars_mean.nc
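If you want to sanity-check the averaged fields before moving on, cdo can print per-timestep statistics (an optional check, using the file name from the command above):
# min/mean/max of vr_mean and wspd_mean for each timestep
cdo infon vars_mean.nc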
You can then use the merge operator of "cdo" to combine the variables needed to check conditions 1) and 2) into a single NetCDF file, and use ncap2 to check the conditions as you have already tried.
In the example command below, the "for" loop of "ncap2" is used to scan over time. If at least one grid point satisfies both conditions 1) and 2) at a given time, the information for that time is printed.
cdo merge vars.nc vars_mean.nc vars_test.nc
ncap2 -s '*flag = (vr > 1e-5 && wspd >= 5) && (vr_mean > 1e-5 && wspd_mean >= 5); *nt=$time.size; for(*i=0;i<nt;i++) { if ( max(flag(i,:,:))==1 ) { print(time(i)); } }' vars_test.nc
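Note that print(time(i)) prints the raw time coordinate in the units of the time axis (e.g. "days since ..."). If you prefer calendar dates, one option (my own suggestion, not part of the steps above) is to list all timestamps with cdo and match them against the flagged times:
# print the date/time of every timestep in the file
cdo showtimestamp vars_test.nc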

How can I copy the values of a variable to another variable in a NetCDF but not the dimension?

I have 2 dimensions X1, X2
And 3 variables V1(X1), V2(X2), V3(X3)
I want to copy the values of V2 into V1, but keep the dimensions as they are.
If I do:
ncap2 -s "V2=V1*1" in.nc out.nc
the dimensions become V1(X2), V2(X2), V3(X3)
How can I retain the original dimension of V1?
That's an unusual request. One solution is to follow the step you already have with one more command to append the values you want back into the original variable. Here lon and ilev are both the same size, but with different underlying dimensions:
ncap2 -O -v -s 'lon=ilev' ~/in.nc ~/foo.nc # make lon a copy of ilev
ncks -A -C -v lon ~/foo.nc ~/in.nc # append lon back into itself
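To confirm that lon kept its original dimension after the append, a quick look at the header is enough (same file name as above):
# the variable declarations in the header show each variable's dimensions
ncdump -h ~/in.nc | grep -E '(lon|ilev)\('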

Can't concatenate netCDF files with ncrcat

I am looping over a model that outputs daily netcdf files. I have a 7-year time series of daily files that, ideally, I would like to append to a single file at the end of each loop, but it seems that, with the NCO tools, the best way to merge the data into one file is to concatenate. Each daily file is called test.t.nc and is renamed to the date of that day, e.g. 20070102.nc, except the first one, which I create with
ncks -O --mk_rec_dmn time test.t.nc 2007-01-01.nc
to make time the record dimension for concatenation. If I try to concatenate the first two files such as
ncrcat -O -h 2007-01-01.nc 2007-01-02.nc out.nc
I get the error message
ncrcat: symbol lookup error: /usr/local/lib/libudunits2.so.0: undefined symbol: XML_ParserCreate
I don't understand what this means and, looking at all the help online, ncrcat should be a straightforward process. Does anyone understand what's happening?
Just in case this helps, the ncdump -h for 20070101.nc is
netcdf \20070101 {
dimensions:
        time = UNLIMITED ; // (8 currently)
        y = 1 ;
        x = 1 ;
        tile = 9 ;
        soil = 4 ;
        nt = 2 ;
and 20070102.nc
netcdf \20070102 {
dimensions:
        x = 1 ;
        y = 1 ;
        tile = 9 ;
        soil = 4 ;
        time = UNLIMITED ; // (8 currently)
        nt = 2 ;
This is part of a bigger shell script and I don't have much flexibility over the naming of files - just in case this matters!
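For reference, once the library problem is sorted out, the per-file steps described above can be collected into a short loop. This is only a sketch: the 2007*.nc naming and the time record dimension come from the question, while the rec_ prefix and the output name are placeholders I made up:
# make time the record dimension in each daily file, then concatenate along time
for f in 2007*.nc; do
  ncks -O --mk_rec_dmn time "$f" "rec_$f"
done
ncrcat -O -h rec_2007*.nc all_days.nc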

specifying job arrays in LSF

My objective is to repeatedly run an R script, each time with a different set of parameters.
To do so, I have been using a bash script to pass the command-line parameters to the R script by looping through an input file, in which each line contains a different combination of 7 parameters.
The input file looks like this:
10 food 0.00005 0.002 1 OBSERVED 0
10 food 0.00005 0.002 1 OBSERVED 240
10 food 0.00005 0.002 1 OBSERVED 480
10 food 0.00005 0.002 1 OBSERVED 720
10 food 0.00005 0.002 1 OBSERVED 960
10 food 0.00005 0.002 1 OBSERVED 1200
The R script to which the command-line parameters are passed begins like this:
commandArgs(trailingOnly=FALSE)
A <- as.numeric (commandArgs()[as.numeric(length(commandArgs()) -6 )])
B <- commandArgs()[as.numeric(length(commandArgs()) -5 )]
C <- as.numeric (commandArgs()[as.numeric(length(commandArgs()) -4 )])
D <- as.numeric (commandArgs()[as.numeric(length(commandArgs()) -3 )])
E <- as.numeric (commandArgs()[as.numeric(length(commandArgs()) -2 )])
F <- commandArgs()[as.numeric(length(commandArgs()) -1 )]
G <- as.numeric (commandArgs()[as.numeric(length(commandArgs()) )])
The bash loop that reads these in and dispatches the R script is as follows:
#!/bin/bash
N=0
cat Input.txt | while read LINE ; do
  N=$((N+1))
  echo "R --no-save < /home/trichard/Script.R" "$LINE" | bsub -N -q priority -R "select[model==Xeon5450]"
done
However, the problem is that there are millions of lines in Input.txt, so this approach is way too slow (it prevents other LSF users from submitting their own jobs).
So, the question is, how to do the above using an LSF array?
The main trick is to extract the nth line from the input file. Assuming you're on a Unix-like system, you can use the "sed" command to do that. Here's an example:
N=$(wc -l < input.txt)
echo 'R --no-save -f Script.R --args $(sed "${LSB_JOBINDEX}q;d" input.txt)' |
bsub -J "R_Job[1-$N]" -N -q priority -R "select[model==Xeon5450]"
Correct argument quoting is a bit tricky and very important in this example.
Note that this uses the R "--args" option to avoid warning messages about unrecognized arguments. I'd also suggest using commandArgs(trailingOnly=TRUE) in the R script so you only see the arguments of interest.
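As a quick illustration of the sed idiom used above: "Nq;d" deletes every line except line N, where it prints and quits, so at run time each array element pulls out exactly its own parameter line:
# print only line 5 of input.txt; inside the array job, 5 is replaced by $LSB_JOBINDEX
sed "5q;d" input.txt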
Maybe you should consider putting it all into R and using a 'foreach' loop construct with a proper parallelization framework like 'doMPI' (or pure Rmpi if you are really motivated ;-)). That way the job management system on the cluster has full control and you are basically submitting one single job.
This is more of a hint than a solution to your specific problem.
Steve Weston's answer works well; thanks!
However, in the LSF system the maximum number of jobs within a single array is limited to ~1000. That means that when you have >1000 jobs, you need to submit multiple job arrays, like this:
#!/bin/bash
increment=1000
startvalue=1
stopvalue=$(wc -l < Col_Treat_BETA_MU_RAND_METHOD_part1.txt)
stopvalue=$(( ($increment*((stopvalue+999)/$increment))+$increment ))
end=$increment
for ((s=$startvalue,e=$end ; e<$stopvalue; s+=$increment,e+=$increment)); do
  echo $s "-" $e
  echo 'R --no-save -f script.R --args $(sed "${LSB_JOBINDEX}q;d" input.txt)' | bsub -J "R_Job[$s-$e]" -N -q normal
done
So this successfully submits all jobs instantaneously, without the original job-by-job loop that essentially blocks other users and annoys your sysadmin. Thanks again!
I am posting this as an answer as it exceeds the max length for a comment.

Unix: find all lines having timestamps in both time series?

I have time-series data in two files and would like to find all lines whose timestamps match (i.e. match up to the first tab), even though the values after the tab can differ. You can see in the vimdiff below that I would like to get rid of days that occur in only one of the two time series.
I am looking for the simplest unix tool to do this!
Time series here and here.
Simple example
Input
Left file                   Right file
------------------------    ------------------------
10-Apr-00 00:00 0        || 10-Apr-00 00:00 7
20-Apr-00 00:00 7        || 21-Apr-00 00:00 3
Output
Left file                   Right file
------------------------    ------------------------
10-Apr-00 00:00 0        || 10-Apr-00 00:00 7
Let's consider these sample input files:
$ cat file1
10-Apr-00 00:00 0
20-Apr-00 00:00 7
$ cat file2
10-Apr-00 00:00 7
21-Apr-00 00:00 3
To merge together those lines with the same date:
$ awk 'NR==FNR{a[$1]=$0;next;} {if ($1 in a) print a[$1]"\t||\t"$0;}' file1 file2
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
Explanation
NR==FNR{a[$1]=$0;next;}
NR is the number of lines read so far and FNR is the number of lines read so far from the current file. So, when NR==FNR, we are still reading the first file. If so, save this whole line, $0, in array a under the key of the first field, $1, which is the date. Then, skip the rest of the commands and jump to the next line.
if ($1 in a) print a[$1]"\t||\t"$0
If we get here, then we are reading the second file, file2. If the first field on this line, $1 is a date that we already saw in file1, in other words, if $1 in a, then print this line out together with the corresponding line from file1. The two lines are separated by tab-||-tab.
Alternative Output
If you just want to select lines from file2 whose dates are also in file1, then the code can be simplified:
$ awk 'NR==FNR{a[$1]++;next;} {if ($1 in a) print;}' file1 file2
10-Apr-00 00:00 7
Or, still simpler:
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2
10-Apr-00 00:00 7
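If you also need the matching lines from the other file (so that both time series end up filtered to the common dates), the same idiom works with the file arguments swapped:
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file2 file1
10-Apr-00 00:00 0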
There is the relatively unknown unix command join. It can join sorted files on a key column.
To use it in your context, we follow this strategy (left.txt and right.txt are your files):
add line numbers (to put everything in the original sequence in the last step)
nl left.txt > left_with_lns.txt
nl right.txt > right_with_lns.txt
sort both files on the date column
sort left_with_lns.txt -k 2 > sl.txt
sort right_with_lns.txt -k 2 > sr.txt
join the files using the date column (all times are 0:00). This merges all columns of both files with a corresponding key, but we provide an output template so that the columns from the first file are written to one output and the columns from the second file to another (only lines with a matching key end up in the results fl.txt and fr.txt):
join -j 2 -t $'\t' -o 1.1 1.2 1.3 1.4 sl.txt sr.txt > fl.txt
join -j 2 -t $'\t' -o 2.1 2.2 2.3 2.4 sl.txt sr.txt > fr.txt
sort both results on the line-number column and output the other columns:
sort -n fl.txt | cut -f 2- > left_filtered.txt
sort -n fr.txt | cut -f 2- > right_filtered.txt
Tools used: cut, join, nl, sort.
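Put together as one runnable sketch (same file and column names as in the steps above; the tab separator is an assumption based on the question):
#!/bin/bash
# 1) add line numbers so the original order can be restored at the end
nl left.txt  > left_with_lns.txt
nl right.txt > right_with_lns.txt
# 2) sort both files on the date column
sort left_with_lns.txt  -k 2 > sl.txt
sort right_with_lns.txt -k 2 > sr.txt
# 3) join on the date column, sending each file's columns to its own output
join -j 2 -t $'\t' -o 1.1 1.2 1.3 1.4 sl.txt sr.txt > fl.txt
join -j 2 -t $'\t' -o 2.1 2.2 2.3 2.4 sl.txt sr.txt > fr.txt
# 4) restore the original order and drop the helper line-number column
sort -n fl.txt | cut -f 2- > left_filtered.txt
sort -n fr.txt | cut -f 2- > right_filtered.txt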
As requested by @Masi, I tried to work out a solution using sed.
My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.
sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2
With big input files, this is s-l-o-w; for each line from file2, sed has to process a number of patterns equal to the number of lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity is quadratic.
My second attempt merges and sorts the two files, then scans through all lines in search of pairs. This runs in linear time and consequently is a lot faster. Please note that this solution will ruin the original order of the file; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.
sed 's/^[^ \t]\+/&#1/' file1 > marked1
sed 's/^[^ \t]\+/&#2/' file2 > marked2
sort marked1 marked2 > sorted
sed '$d;N;/^\([^ \t]\+\)#1.*\n\1#2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)#2/\1/' filtered > out2
Explanation:
In the first command, s/^[^ \t]\+/&#1/ appends #1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
The second command does the same for file2; obviously with its own marker #2.
The sort command merges the two files, grouping equal dates together.
The third sed command returns all lines from file2 that have a date that also occurs in file1.
The fourth sed command removes the #2 marker from the output.
The third sed command in detail:
$d suppresses inappropriate printing of the last line
N reads and appends another line of input to the line already present in the pattern space
/^\([^ \t]\+\)#1.*\n\1#2/ matches two lines originating from different files but with the same date
{ starts a command group
s/\(.*\)\n\(.*\)/\2\n\1/ swaps the two lines in the pattern space
P prints the first line in the pattern space
} ends the command group
D deletes the first line from the pattern space
The bad news is, even the second approach is slower than the awk approach by @John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, making @John1024's solution blazingly fast. The downside of a dictionary is memory consumption. On huge input files, my solution should have the advantage.
