Does the --slurp option load the entire input in memory before processing it or has it been optimized somehow in order to avoid that?
The answer is essentially "yes": commands such as "jq --slurp . FILE ..." store the parsed input as an array in memory. This will often require more memory than the size of the input itself -- consider, for example, that JSON objects are stored as hash tables.
With jq 1.5 there are often better alternatives than "slurping" the input. Most notably, perhaps, the inputs filter works very nicely with reduce and foreach. (If you do use inputs then don't forget you will probably want to invoke jq with the "-n" option.)
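For instance, a minimal sketch of that streaming style (the "count" field and the idea of summing it are purely illustrative; it assumes each top-level input is an object with a numeric "count"):

jq -n 'reduce inputs as $obj (0; . + $obj.count)' FILE

Here inputs yields one parsed value at a time, so only the running total is kept in memory, and -n stops jq from consuming the first input before the reduce starts.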
Related
Gensim uses text streaming to minimize memory requirements. This comes at the cost of performance due to endless disk IO. Is there a trick to copy the complete file from disk (one disk IO) into a temporary in-memory file on the fly?
I'd like to keep the code as is (no recoding into list structures), but this slowness is not a great way to debug functionality.
Expected result: much faster code
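Something along these lines, for instance (just a sketch of the kind of in-memory copy I have in mind, using io.StringIO; the processing part is left out):

import codecs
import io

# read the file from disk once ...
with codecs.open('review_text_all.txt', encoding='utf_8') as f:
    in_memory_file = io.StringIO(f.read())

# ... then iterate over the in-memory copy as often as needed without disk IO
for review in in_memory_file:
    pass  # process the review here
in_memory_file.seek(0)  # rewind before the next pass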
Some more background on the question
The original code is at https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb. The example code is taken from the phrase modelling section
I'm calculating the unigrams. All reviews are at
review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')
all unigrams should go to
unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')
The crucial routines are
def punct_space(token):
    return token.is_punct or token.is_space
def line_review(filename):
    # generator function to read in reviews from the file
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
def lemmatized_sentence_corpus(filename):
    # generator function to use spaCy to parse reviews, lemmatize the text, and yield sentences
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])
The unigrams are calculated as
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(review_txt_filepath):
        f.write(sentence + '\n')
Doing this for 5000 lines requires some patience: 1h30m ;-)
I'm not that familiar with iterables, but do I understand it correctly that I first have to read the actual file (on disc) into a variable "list_of_data" and process that:
with codecs.open(review_txt_filepath, 'r', encoding='utf_8') as f:
    list_of_data = f.read()

with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(list_of_data):
        f.write(sentence + '\n')
So the strategy is
1. read all data into a list in memory
2. process the data
3. write the results to disc
4. delete the list from memory by setting list_of_data = ()
A problem with this is obviously that line_review is doing the file reading.
Most gensim interfaces actually take iterable sequences. Examples which emphasize streaming-from-disk just happen to use iterables that read each item as needed, but you could use an in-memory list instead.
Essentially, if you do have enough RAM to have your whole dataset in memory, just use the IO-reading iterable to read things once into a list. Then, feed that list to the gensim class where it expects any iterable sequence.
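For instance, a minimal sketch (Word2Vec is just an example downstream class, and it reuses the lemmatized_sentence_corpus generator from the question; adapt to whatever corpus class you actually use):

from gensim.models import Word2Vec

# one pass over the file: tokenize each sentence once and keep it in RAM
sentences = [sentence.split() for sentence
             in lemmatized_sentence_corpus(review_txt_filepath)]

# any gensim class expecting an iterable of token lists will accept this
# in-memory list, and can iterate over it repeatedly without further disk IO
model = Word2Vec(sentences, workers=4)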
This shouldn't involve any "recoding into a list structure" – but it is using the Python list type to hold things in memory. It's the most natural way to do it, and likely the most efficient, especially in algorithms which do multiple passes over tokenized text.
(The less-idiomatic approach of, say, loading the entire file into a raw byte array and then repeatedly reading it, file-style, to produce the individual items the algorithm needs is clunkier. It may similarly save on repeated IO cost, but will likely waste effort on repeated re-parsing/tokenizing of items that get processed more than once. You'll want to keep each item as a Python object in memory, if you have the memory, and that requires putting them in a list.)
To be more specific in answering, you'd need to provide more details in the question, like which specific algorithms/corpus-reading-styles you're using, ideally with example code.
I'd like to fancy up my embedding of Julia in a MATLAB mex function by hooking up Julia's STDIN, STDOUT, and STDERR to the MATLAB terminal. The documentation for redirect_std[in|out|err] says that the stream that I pass in as the argument needs to be a TTY or a Pipe (or a TcpSocket, which wouldn't seem to apply).
I know how I will define the right callbacks for each stream (basically, wrappers around calls to MATLAB's input and fprintf), but I'm not sure how to construct the required stream.
Pipe was renamed PipeEndpoint in https://github.com/JuliaLang/julia/pull/12739, but the corresponding documentation was not updated and PipeEndpoint is now considered internal. Even so, creating the pipe up front is still doable:
pipe = Pipe()
Base.link_pipe(pipe)
redirect_stdout(pipe.in)
@async while !eof(pipe)
    data = readavailable(pipe)
    # Pass data to whatever function handles display here
end
Furthermore, the no-argument version of these functions already creates a pipe object, so the recommended way to do this would be:
(rd, wr) = redirect_stdout()
@async while !eof(rd)
    data = readavailable(rd)
    # Pass data to whatever function handles display here
end
Nevertheless, all of this is less clear than it could be, so I have created a pull request to clean up this API: https://github.com/JuliaLang/julia/pull/18253. Once that pull request is merged, the link_pipe call will become unnecessary and pipe can be passed directly into redirect_stdout. Further, the return value from the no-argument version will become a regular Pipe.
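Based on that PR's description (a sketch of the intended post-merge usage, not an API that is released yet), the first example would then reduce to:

pipe = Pipe()
redirect_stdout(pipe)
@async while !eof(pipe)
    data = readavailable(pipe)
    # Pass data to whatever function handles display here
end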
Using Rscript inside a bash script
I am passing the content of text files as arguments to Rscript:
"$SCRIPTS/myscript.R" "$filecontent"
I get the following when the file has roughly more than 4000 rows:
/usr/bin/Rscript: Argument list too long
Any way I can increase the accepted argument length so I can pass large files?
What @MrFlick said is correct - you should change the way the arguments are passed to your script. However, if you still want to try to do it your way, then I recommend reading the following article:
"Argument list too long": Beyond Arguments and Limitations
The "Argument list too long" error, which occurs anytime a user feeds
too many arguments to a single command, leaves the user to fend for
oneself, since all regular system commands (ls *, cp *, rm *, etc...)
are subject to the same limitation. This article will focus on
identifying four different workaround solutions to this problem, each
method using varying degrees of complexity to solve different
potential problems.
Also, this Unix&Linux thread can help:
“Argument list too long”: How do I deal with it, without changing my command?
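If you do change how the arguments are passed, a minimal sketch of the usual workaround (assuming myscript.R can be adapted to take a file path via commandArgs(), or to read stdin with readLines(file("stdin")); $file here stands for the path to the text file):

# pass the path rather than the file's content:
"$SCRIPTS/myscript.R" "$file"

# ... or stream the content on stdin:
"$SCRIPTS/myscript.R" < "$file"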
What is the correct / idiomatic way of passing a hash to a function?
I have sort of hit upon this, but am not sure how clean it is or if there are any pitfalls.
typeset -A hash
hash=(a sometext b moretext)

foo() {
    typeset -A mhash
    mhash=( ${(Pkv)1} )
}

foo hash
The P flag interprets the result (in this case $1) as holding a parameter name. Since this resulted in getting only the values and not the keys, I bolted on the "kv" to get both keys and values.
Is this the correct way, or is there another way? BTW, since I am passing an array and a hash in my actual program, I don't want to use "$*" or "$#".
I tried a little and I'm not sure there is another way than using $# on the function.
Re: Array as parameter - Zsh mailing list
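For reference, a minimal sketch of that positional-argument approach (it flattens the hash into key/value pairs with the (kv) flag and rebuilds it from "$@" inside the function):

foo() {
    typeset -A mhash
    mhash=( "$@" )    # rebuild the hash from key value key value ...
    print -r -- "a is ${mhash[a]}"
}

typeset -A hash
hash=(a sometext b moretext)
foo "${(@kv)hash}"    # expand to key/value word pairs, preserving empty values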
Possible answers in these questions (bash-oriented):
How to pass an associative array as argument to a function in Bash?
Passing arrays as parameters in bash
Passing array to function of shell script
In fact, when you start needing to use an array, or even worse an associative array, in a shell script, maybe it's time to switch to a more powerful scripting language, like Perl or Python.
If you don't do it for yourself, do it for the you of 6 months from now / for your successors.
My Taglist for a C source file:
macro
    MIN_LEN
    MAX_ITERATIONS
typedef
    cell
    source_cell
variable
    len_given
Taglist elements (domain):
A = {MIN_LEN, MAX_ITERATIONS, cell, source_cell, len_given}
Code snippets (codomain):
B = {"code_MIN_LEN", "code_MAX_ITERATIONS", ..., "code_len_given"}
Goal: to have bijection between the sets A and B.
Example: I want to be able to remove any element, such as MIN_LEN, from both A and B by removing its counterpart in either A or B.
Question: Is there a way to guarantee a bijection between A and B, so that a change in either A or B results in a change in the other set?
I strongly doubt you can do that. The taglist plugin uses ctags to collect the symbols in your code and display them in a lateral split. The lateral split contains readonly information (if you try to work on that window, vim tells you that modifiable is off for that buffer).
What you want to achieve would imply quite complex parsing of the source code you are modifying. Even a simple task like automatic renaming (assuming you modify a function name entry in the taglist buffer and all the instances in your source are updated) requires pretty complex parsing, which is beyond the features of ctags or taglist itself. Deleting and keeping everything in sync with a bijective relationship is even more complex. Suppose you have a printf line where you use a macro you want to remove. What should happen to that line? Should the whole line disappear, or just the macro (in which case the line will probably become syntactically incorrect)?
taglist is a nice plugin for browsing your code, but it's unsuited for automatic refactoring (which is what you want to achieve).
Edit: as for the computational complexity, well, the worst-case scenario is that you have to scan the whole document at every keystroke, looking for new occurrences of labels that could be integrated, so in this sense you could say it's O(n) at each keystroke. This is of course overkill and the worst way to implement it. I am not aware of the computational complexity of syntax highlighting in vim (which would be useful for extracting tags as well, via proper tokenization), but I would estimate it to be very low, and very limited in the amount of parsed data (you are unlikely to have large constructs to parse to extract the token and understand its context). In any case, this is not how taglist works. Taglist runs ctags at every vim invocation; it does not parse the document live while you type. This is, however, done by Eclipse, XCode and KDevelop, for example, which also provide tools for automatic or semiautomatic refactoring, and can eventually integrate vim as an editor. If you need these features, you are definitely using the wrong tool.