NAudio splitting WAV file issue: every other audio file sounds distorted while the others are clear

My code takes a wav file, reads it into memory, finds its length, splits it into equal segments, and writes each segment to a new wav file. For some reason, the 2nd, 4th, 6th, 8th, 10th, etc. files sound distorted and the others sound perfect.

I suspect your sizeOfEachFile is an odd number. It must be an exact multiple of the BlockAlign of the WaveFormat, which will be 2 in your case for mono 16-bit audio. If a segment starts on an odd byte offset, every 16-bit sample after it is read one byte out of phase, which is exactly why every other output file sounds like noise.
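For illustration, a minimal sketch of the fix (the byte counts here are hypothetical; the same integer arithmetic applies in any language): round the segment size down to a whole number of blocks.

# block_align is 2 for mono 16-bit PCM; the other numbers are hypothetical
total_data_bytes <- 1234567
n_segments <- 10
block_align <- 2
size_of_each_file <- (total_data_bytes %/% n_segments) %/% block_align * block_align
size_of_each_file %% block_align  # 0, so every segment starts on a sample boundary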

Related

Vroom/fread won't read LARGE .csv file - cannot memory map it

I have a .csv file that is 112 GB, but neither vroom nor data.table::fread will open it. Even if I ask to read only 10 rows or just a couple of columns, it complains with a mapping error: cannot allocate memory.
df <- data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10)
System errno 22 unmapping file: Invalid argument
Error in data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10) :
Opened 112.3GB (120565605488 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.
read.csv on the other hand will read the ten rows happily.
Why won't vroom or fread read it using the usual altrep, even for 10 rows?
This matter has been discussed by the main author of the data.table package at https://github.com/Rdatatable/data.table/issues/3526. See the comment by Matt Dowle himself at https://github.com/Rdatatable/data.table/issues/3526#issuecomment-488364641. From what I understand, the gist of the matter is that to read even 10 lines from a huge csv file with fread, the entire file needs to be memory-mapped. So fread cannot be used on its own if your csv file is too big for your machine. Please correct me if I'm wrong.
Also, I haven't been able to use vroom with larger-than-RAM csv files. Any pointers towards this end will be appreciated.
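One workaround that avoids mapping the whole file is to let fread read from a shell command, so only the piped bytes ever reach R (but see the caution about leading zeros further down). A sketch, assuming a Unix-like head is available on the PATH:

df <- data.table::fread(cmd = "head -n 11 FINAL_data_Bus.csv")  # header plus 10 rows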
For me, the most convenient way to check out a huge (gzipped) csv file is by using a small command line tool csvtk from https://bioinf.shenwei.me/csvtk/
e.g., check dimensions with
csvtk dim BigFile.csv.gz
and inspect the first 100 rows with
csvtk head -n100 BigFile.csv.gz
get a better view of the above with
csvtk head -n100 BigFile.csv.gz | csvtk pretty | less -SN
Here I've used the less command, available on Windows via "Gnu On Windows" at https://github.com/bmatzelle/gow
A word of caution: many people suggest using the command
wc -l BigFile.csv
to count the lines in a big csv file. In most cases, the line count will equal the number of rows. But if the big csv file contains newline characters within a cell (to use a spreadsheet term), the line count will be larger than the number of rows. So it is advisable to use csvtk dim or csvtk nrow. Other csv command line tools like xsv and miller will also report correct results.
Another word of caution: the short command fread(cmd="head -n 10 BigFile.csv") is not advisable for previewing the top few lines if some columns contain significant leading zeros in the data, such as 0301 or 0542, since without a column specification fread will interpret them as integers and drop the leading zeros. For example, in some databases that I have to analyse, a leading zero in a particular column means that the record is a Revenue Receipt. So it is better to preview a big csv file with a command line tool like csvtk, miller, or xsv piped into less -SN, which shows the file "as is" without any potentially wrong interpretation.
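If you do preview with fread, forcing every column to character sidesteps the leading-zero loss. A sketch:

# Read everything as character so codes like 0301 keep their leading zeros
df <- data.table::fread(cmd = "head -n 10 BigFile.csv", colClasses = "character")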
PS1: Even spreadsheets like MS Excel and LibreOffice Calc lose leading zeroes in csv files by default. LibreOffice Calc actually shows leading zeroes in the preview window but loses them when you load the file! I've yet to find a spreadsheet that does not lose leading zeroes in csv files by default.
PS2: I've posted my approach to querying very large csv files at https://stackoverflow.com/a/68693819/8079808
EDIT:
vroom does have difficulty dealing with huge files, since it needs to store the index in memory as well as any data you read from the file. See the development thread: https://github.com/r-lib/vroom/issues/203

Write lines of text at a given position in a file in R

I would like to write a line in a text file at a given position (i), avoiding a sequential read.
There is the writeLines base function, but I don't know how to insert the text at a position (i) given as a parameter.
Thanks
Dave
This is fundamentally impossible, and not just in R: most (all common) filesystems do not support inserting or removing content in the middle of a file. The only supported operations are appending to (or truncating) the end, and R only supports appending, not truncation.
The way virtually all software solves this problem is by reading the file, modifying it, and writing it back to disk. If you want to get fancy because the file is very large (on the order of hundreds of MiB or more), you can stream-edit the file: read a part, edit that part, write it back to a new file. Rinse and repeat.
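For files that fit comfortably in memory, the read-modify-write approach is only a few lines of base R. A minimal sketch (insert_line_at is a hypothetical helper; writing back to the same path overwrites the original):

# Insert new_line so that it becomes line i of the file
insert_line_at <- function(path, i, new_line) {
  lines <- readLines(path)
  lines <- append(lines, new_line, after = i - 1)  # base::append inserts before old line i
  writeLines(lines, path)                          # rewrite the whole file
}

To overwrite line i instead of inserting, use lines[i] <- new_line before writing back.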
Technical aside: there is one exception to the above with low-level file operations, since files are stored as non-contiguous "blocks". But even if R supported this, it wouldn't help you, since it doesn't permit byte-level or line-level granularity: blocks are typically at least 4 KiB in size.

Native method in R to test if file is ascii

Is there a native method in R to test if a file on disk is an ASCII text file or a binary file? Similar to the file command in Linux, but a method that will work cross-platform?
The file.info() function can distinguish a file from a dir, but it doesn't seem to go beyond that.
If all you care about is whether the file is ASCII or binary...
Well, first up: definitions. All files are binary at some level:
is.binary <- function(file){
  if(system.type() != "quantum computer"){
    return(TRUE)
  }else{
    return(cat = alive & dead)  # Schrödinger's file
  }
}
ASCII is just an encoding system for characters. It is therefore impossible to tell if a file is ASCII or binary, because ASCII-ness is a matter of interpretation. If I save a file and decide that binary number 01001101 is Q and 01001110 is Z then you might decode this as ASCII but you'll get the wrong message. Luckily the Americans muscled in and said "Hey, everyone use ASCII to code their text! You get 128 characters and a parity bit! Woo! Go USA!". IBM tried to tell people to use EBCDIC but nobody listened. Which was A Good Thing.
So everyone was packing ASCII-coded text into their 8-bit bytes, and using the eighth bit for parity checking. But then people stopped doing parity checking because TCP/IP handled all that, which was also A Good Thing, and the eighth bit was expected to be zero. If not, there was trouble.
Because people (read "Microsoft") started abusing the eighth bit, and making up their own encoding schemes, and so unless you knew what encoding scheme the file was using, you were stuffed. And the file very rarely told you what encoding scheme it was. And now we have Unicode and even more encoding schemes. And that is a third Good Thing. But I digress.
Nowadays when people ask if a file is binary, what they are normally asking is "does any byte in this file have its highest bit set?". You can test that in R by reading the file from a raw connection as unsigned integers and checking whether any value exceeds 127. Something like:
is.binary <- function(filepath, n = 1000){
  f <- file(filepath, "rb", raw = TRUE)
  on.exit(close(f))  # always close the connection, even on error
  b <- readBin(f, "int", n, size = 1, signed = FALSE)
  any(b > 127)  # values 128..255 have the highest bit set
}
By default this tests at most the first 1000 bytes. I think the file command does something similar.
You may want to change the test to check for printable character codes, whitespace, line feeds, carriage returns, and other codes you'd consider plausible in your non-binary files...
Well, how would you do that? I guess you can't without reading (parts or all of) the file, which is why file extensions are used to signal content type.
I looked into that years ago, and as I recall, the file(1) app actually reads the first few header bytes of a file and compares them to what is stored in a lookup table. Sounds like a good candidate for an add-on package to me.
The example section of the manual for ?raw uses this:
isASCII <- function(txt) all(charToRaw(txt) <= as.raw(127))
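That tests a string already in memory; to apply the same idea to a file, compare its first bytes directly as raws. A small sketch (is_ascii_file is a hypothetical helper; the 1000-byte cap mirrors the is.binary example above):

is_ascii_file <- function(path, n = 1000) {
  all(readBin(path, "raw", n) <= as.raw(127))  # every byte must be 0x00..0x7F
}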

diff command says the two files are different but doesn't print the lines that differ

I'm using the diff command to compare two text files. They need to match literally.
So I use the diff:
diff binary.out binary.expected
(By the way, those files are NOT binary files. They are text files. I call them binary because that's the name of the project.)
and got
Binary files binary.out and binary.expected differ
When I use another diff tool, the smartest of all (AKA a human), there's really nothing different between the two files.
Does anyone happen to know what's going on here?
Thanks.
diff from diffutils says the following about text/binary:
diff determines whether a file is text or binary by checking the
first few bytes in the file; the exact number of bytes is system
dependent, but it is typically several thousand. If every byte in
that part of the file is non-null, diff considers the file to be
text; otherwise it considers the file to be binary.
Hence GNU diff has a quite open definition of what counts as text, and the --text option to force it to treat the file as text should seldom be needed.
Have you checked if binary.out or binary.expected contains null characters? What version is your diff program?
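You can check for null bytes without leaving R. A sketch (has_nul is a hypothetical helper; the 4096-byte window is an arbitrary stand-in for diff's "first few bytes"):

has_nul <- function(path, n = 4096) {
  any(readBin(path, "raw", n) == as.raw(0))  # TRUE if a NUL byte appears early on
}
has_nul("binary.out")
has_nul("binary.expected")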
Make sure to ignore white space in the diff options.
It may also see Unicode characters and interpret that as binary. See if your diff tool has an option to force text mode.

How can I convert MathType equation into MathML format?

I want to convert MathType equations saved in GIF format to MathML. First, I opened these GIF files and re-saved them with MathType 6.7. As a result, MathML text is embedded at the end of the GIF files. However, when I extracted the MathML text from these GIF files using a Perl script, I found some garbled characters in the MathML text, as in the following line:
<mn>xxx</mn>
In the above line, a garbled character  is inserted before the 'mn' tag. Is this a MathType bug? How can I work around this problem? I have uploaded my test GIF files. URL is: http://ubuntuone.com/p/1352/
Update:
I tried to paste the full block of MathML here, but the syntax formatting got mangled, so I pasted the MathML on GitHub: https://gist.github.com/1068723.
There is a garbled character in the seventh line of MathML text: "  ?#x00A0;".
The original GIF file which doesn't contain MathML text: http://ubuntuone.com/p/13Ba/
Perl script that extracts MathML from GIF image generated by MathType: https://gist.github.com/1068749
Thanks,
thinkhy
Thanks, thinkhy. It could be that you're extracting the data incorrectly (we haven't looked at your script yet). Only one of your GIFs had MathML: the one whose file name starts with 106R. In that one, if you just grab all the bytes from the first bit that looks like MathML until the end, you do periodically get odd bytes in there, mostly 255s, except the last one. (This, however, doesn't appear to be the junk character you're seeing.) The reason for the 255s is that the MathML is distributed over multiple records, each of which starts with a count of the bytes in the record. From the MathType SDK (free download; link below):
GIF Image Files
MathML text is embedded into a GIF file as an Application Extension Record, which consists of a 14-byte header (Application Extension Descriptor), followed by the MTEF data. The header contains:
Byte Introducer = 0x21;
Byte ExtensionLabel = 0xFF;
Byte BlockSize = 0x0B;
Byte ApplicationId[8] = "MathType";
Byte AuthenticationCode[3] = "003";
The data follows this header and is written as a series of blocks each containing 255 bytes or less. Each block starts with a single byte count followed by the data. The end is marked as a block with length 0.
The header is unique enough that the easiest way to extract the data might be to scan the file for the 14-byte header and then expect the MathML data blocks to follow. Properly decoding the GIF records isn't that hard either, but it obviously requires reading the GIF specification.
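A sketch of that scan in R (the layout follows the SDK description above; extract_mathml is a hypothetical helper and error handling is omitted):

extract_mathml <- function(path) {
  bytes <- readBin(path, "raw", file.info(path)$size)
  # 14-byte header: 0x21, 0xFF, 0x0B, "MathType", "003"
  hdr <- c(as.raw(c(0x21, 0xFF, 0x0B)), charToRaw("MathType"), charToRaw("003"))
  for (i in seq_len(length(bytes) - length(hdr) + 1)) {
    if (all(bytes[i:(i + length(hdr) - 1)] == hdr)) {
      pos <- i + length(hdr)
      out <- raw(0)
      repeat {
        len <- as.integer(bytes[pos])  # each sub-block starts with its byte count
        if (len == 0) break            # a zero-length block terminates the data
        out <- c(out, bytes[(pos + 1):(pos + len)])
        pos <- pos + len + 1
      }
      return(rawToChar(out))
    }
  }
  NULL  # header not found
}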
You may already be using the SDK, but you didn't say whether you were or not, so here's the link: http://www.dessci.com/en/reference/sdk/.
