How do I convert my 5GB 1 liner file to lines based on pattern? - unix

I have a 5GB 1 liner file with JSON data and each line starts from this pattern "{"created". I need to be able to use Unix commands on my Mac to convert this monster of a 1 liner into as many lines as it deserves. Any commands?
ASCII English text, with very long lines, with no line terminators

If you have enough memory you can open the file once with the TextWrangler application (the free BBEdit cousin) and use regular search/replace on the whole file. Use \r in replace to add a return. Will be very slow at opening the file, may even hang if low on memory, but in the end it may probably work. No scripting, no commands,.. etc.. I did this with big SQL files and sometimes it did the job.
You have to replace your line-start string with the same string with \n or \r or \r\n in front of it.

Unclear how it can be a “one liner” file but then each line starts with "{"created", but perhaps python -mjson.tool can help you get started:
cat your_source_file.json | python -mjson.tool > nicely_formatted_file.json
Piping raw JSON through ``python -mjson.tool` will cleanly format the JSON to be more human readable. More info here.

OS X ships with both flex and bison, you can use those to write a parser for your data.

You can use PHP as a shell command (if PHP is installed), just save a text file with name "myscript" and appropriate code (I cannot test code now, but the idea is as follows)
UNTESTED CODE
#!/usr/bin/php
<?php
$REPLACE_STRING='{"created'; // anything you like
// open input file with fopen() in read mode
$inFp=fopen('big_in_file.txt', "r");
// open output file with fopen() in write mode
$outFp=fopen('big_out_file.txt', "w+");
// while not end of file
while (!feof($inFp)) {
// read file chunks here with fread() in variable $chunk
$chunk = fread($inFp, 8192);
// do a $chunk=str_replace($REPLACE_STRING,"\r".$REPLACE_STRING; // to add returns
// (or use \r\n for windows end of lines)
$chunk=str_replace($REPLACE_STRING,"\r".$REPLACE_STRING,$chunk);
// problem: if chunk contains half the string at the end
// easily solved if $REPLACE_STRING is a one char like '{'
// otherwise test for fist char { in the end of $chunk
// remove final part and save it in a var for nest iteration
// write $chunk to output file
fwrite($outFp, $chunk);
// End while
}
?>
After you save it you must make it executable whith sudo chmod a+x ./myscript
and then launch it as ./myscript in terminal
After this, the myscript file is a full unix command

Related

Fortran90: Scripting of Standard In not working as expected

Working with Fortran90 in Unix...
I have a programme which needs to read in the input parameters from a file "input-deck.par". This filename is currently hard-coded but I want to run a number of runs using different input-deck files (input-deck01.par, input-deck02.par, input-deck03.par etc.) so I've set-up the code to do a simple "read(*,*) inpfile" to allow the user to input the name of this file directly on run-time with a view to scripting this later.
This works fine interactively. If I execute the programme it asks for the file name, you type it in and the filename is accepted, the file is opened and the programme picks up the parameters from that file.
The issue lies in scripting this. I've tried scripting using the "<" pipe command so:
myprog.e < input-deck01.par
But I get an error saying:
Fortran runtime error: Cannot open file '----------------': No such file or directory
If I print the filename right after the input line, it prints that the filename is '----------------' (I initialise the variable as having 16 characters hence the 16 hyphens I think)
It seems like the "<" pipe is not passing the keyboard input in correctly. I've tried with spaces and quotes around the filename (various combinations) but the errors are the same.
Can anyone help?
(Please be gentle- this is my first post on SO and Fortran is not my primary language....)
Fortran has the ability to read the command line arguments. A simple example is
program foo
implicit none
character(len=80) name
logical available
integer fd
if (command_argument_count() == 1) then
call get_command_argument(1, name)
else
call usage
end if
inquire(file=name, exist=available)
if (.not. available) then
call usage
end if
open(newunit=fd, file=name, status='old')
! Read file
contains
subroutine usage
write(*,'(A)') 'Usage: foo filename'
write(*,'(A)') ' filename --> file containing input info'
stop
end subroutine usage
end program foo
Instead of piping the file into the executable you simply do
% foo input.txt

Viewing MS Word .docx files in Midnight Commander

I want to be able to quickly view (with F3) the content of Word doc/docx files in Midnight Commander. MC's extensions file calls /usr/lib/mc/ext.d/doc.sh, which contains wv, antiword, catdoc, and word2x as helper programs. On my system (debian), the first three are available, but none of them are able to deal with the newer docx format.
The obvious solution is to use LibreOffice:
libreoffice --headless --convert-to "txt:Text (encoded):UTF8" filename.docx
This works well, but how do I tell MC to use it and display the result of the conversion? If I put this in ~/.config/mc/mc.ext, replacing the lines
View=%view{ascii} /usr/lib/mc/ext.d/doc.sh view msdoc
with
View=libreoffice --headless --convert-to "txt:Text (encoded):UTF8" "${MC_EXT_FILENAME}"
then I end up with a filename.txt file in the current directory, and nothing is displayed. What I want to happen is for mc to do the conversion when I press F3 and discard it when I quit the viewer. (I guess the converted file would be written to /tmp/ and removed on quit.)
Bonus: it would be nice if the displayed file would be word-wrapped, I suppose that could be done by using the wrap command?
Can I do this without having to modify /usr/lib/mc/ext.d/doc.sh, in my ~/.config/mc/mc.ext?
I use docx2txt:
View=%view{ascii} docx2txt %f -
Also you don't need such a long conversion string in libreoffice.
libreoffice --cat %f
is enough.

QMake - Erratic behaviour When using echo System Command

Using QMake, I read some boiler plate code, make modifications and write the modified code to a file.
However, I get very strange results. I have simplified the problem down to the following:
# Open boiler plate file
interfaceBoilerPlateCode = $$cat($$boilerPlateFile, blob)
# Make sure we read the right content
message("Content read: $$interfaceBoilerPlateCode")
# Write the read text into a file
output = $$system(echo $$interfaceBoilerPlateCode >> $$targetFile) # Doesnt work
output = $$system(echo "Howde" >> $$targetFile) # This works
The file being read is a plain text file containing only the string "Howde".
The contents of the file get read correctly.
However, when I try and write the contents of the file to another target file, I get no output (literally: no errors/warnings but no new file generated). However, if I use echo with just a string defined in the code itself (as in the last line of snippet above), a new file gets generated with the string "Howde" inside it.
What is going on? What am I doing wrong that the penultimate line does not generate a new file?
Use write_file. Instead of:
$$system(echo $$content >> $$file_path)
use
write_file($$file_path, $$content)

reading from dynamically set filenames in R

I am writing a CGI script in Perl with a section of embedded R script which produces a graph. The original data filename is unknown as it has been uploaded by the CGI script and is stored in a Perl variable called $filename.
My question is that I now would like to open that file in R using read.table(). I am using Statistics::R and so I have tried:
my $R = Statistics::R->new();
$R->set('filename',$filename);
my $out1 = $R->run(
q`rm(list=ls())`,
# Fetch data
q`setwd("/var/www/uploads")`,
q`peakdata<-read.table(filename, sep="",col.names=c("mz","intensity","ionsscore","matched","query","index","hit"))`,
q`attach(peakdata)` ...etc
I can get this to work ONLY if I change $filename into something static and known like 'data.txt' before trying to open the file in read.table - is there a way for me to open a file with a variable for a name?
Thank you in advance.
One possible way to do this is by doing a little more work in Perl.
This is untested code, to give you some ideas:
my $filename = 'fileNameIGotFromSomewhere.txt'
my $source_dir = '/var/www/uploads';
my $file = "$source_dir/$fielname";
# make sure we can read it
unless ( -r $file ) {
die 'can read that data file: $!";
}
Then instead of $R->set, you could interpolate the file name into the R program. Where you've used the single-quote operator, use the double-quote operator instead:
So instead of:
q`peakdata<-read.table(filename, sep="",col.names= .... )`
Use:
qq`peakdata<-read.table($filename, sep="",col.names= .... )`
Now this looks like it would be inviting problems similar to SQL/Code Injections, so that's why I put int the logic to insure that the file exists and is readable. You might be able to think other checks to add to safeguard your use of user-supplied info.

unix: can i write to the same file in parallel without missing entries?

I wrote a script that executes commands in parallel. I let them all write an entry to the same log file. It does not matter if the order is wrong or entries are interleaved, but i noticed that some entries are missing. I should probably lock the file before writing, however, is it true that if multiple processes try to write to a file simultaneously, it will result in missing entries?
Yes, if different processes independently open and write to the same file, it may result in overlapping writes and missing data. This happens because each process will get its own file pointer, that advances only by local writes.
Instead of locking, a better option might be to open the log file once in an ancestor of all worker processes, have it inherited across fork(), and used by them for logging. This means that there will be a single shared file pointer, that advances when any of the processes writes a new entry.
In a script you should use ">> file" (double greater than) to append output to that file. The interpreter will open the destination in "append" mode. If your program also wants to append, follow the directives below:
Open a text file in "append" mode ("a+") and give preference to printing only full lines (don't do multiple 'print' followed by a final 'println', but print the entire line with a single 'println').
The fopen documentation states this:
DESCRIPTION
The fopen() function opens the file whose pathname is the
string pointed to by filename, and associates a stream with
it.
The argument mode points to a string beginning with one of
the following sequences:
r or rb Open file for reading.
w or wb Truncate to zero length or create file
for writing.
a or ab Append; open or create file for writing
at end-of-file.
r+ or rb+ or r+b Open file for update (reading and writ-
ing).
w+ or wb+ or w+b Truncate to zero length or create file
for update.
a+ or ab+ or a+b Append; open or create file for update,
writing at end-of-file.
The character b has no effect, but is allowed for ISO C
standard conformance (see standards(5)). Opening a file with
read mode (r as the first character in the mode argument)
fails if the file does not exist or cannot be read.
Opening a file with append mode (a as the first character in
the mode argument) causes all subsequent writes to the file
to be forced to the then current end-of-file, regardless of
intervening calls to fseek(3C). If two separate processes
open the same file for append, each process may write freely
to the file without fear of destroying output being written
by the other. The output from the two processes will be
intermixed in the file in the order in which it is written.
It is because of this intermixing that you want to give preference to
using only 'println' (or its equivalent).

Resources