Text streaming in Gensim - inputstream

Gensim uses text streaming to minimize memory requirements, but this comes at the cost of performance due to endless disk IO. Is there a trick to copy the complete file from disk on the fly (one disk IO) into a temporary in-memory file?
I would like to keep the code as is (no recoding into list structures), but this is not a great way of debugging functionality.
Expected result: much faster code.
Some more background on the question
The original code is at https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb; the example code is taken from the phrase modelling section.
I'm calculating the unigrams. All reviews are at
review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')
and all unigrams should go to
unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')
The crucial routines are
def punct_space(token):
    return token.is_punct or token.is_space

def line_review(filename):
    # generator function to read in reviews from the file
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')

def lemmatized_sentence_corpus(filename):
    # generator function to use spaCy to parse reviews, lemmatize the text, and yield sentences
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])
The unigrams are calculated as
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(review_txt_filepath):
        f.write(sentence + '\n')
Doing this for 5000 lines requires some patience, 1h30m ;-)
I'm not that familiar with iterables, but do I understand it correctly that I first have to read the actual file (on disc) into a variable "list_of_data" and process that?
with codecs.open(review_txt_filepath, 'r', encoding='utf_8') as f:
    list_of_data = f.read()
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(list_of_data):
        f.write(sentence + '\n')
So the strategy is
1. read all data into a list in memory
2. process the data
3. write the results to disc
4. delete the list from memory by setting list_of_data = ()
A problem with this is obviously that line_review still expects a filename and does the file reading itself.

Most gensim interfaces actually take iterable sequences. Examples which emphasize streaming-from-disk just happen to use iterables that read each item as needed, but you could use an in-memory list instead.
Essentially, if you do have enough RAM to have your whole dataset in memory, just use the IO-reading iterable to read things once into a list. Then, feed that list to the gensim class where it expects any iterable sequence.
This shouldn't involve any "recoding into a list structure" – but it is using the Python list type to hold things in memory. It's the most natural way to do it, and likely the most efficient, especially in algorithms which do multiple passes over tokenized text.
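For example, a minimal sketch of that pattern, reusing unigram_sentences_filepath from the question (Word2Vec here is just a stand-in for whichever gensim class you end up feeding):
import codecs
from gensim.models import Word2Vec

# read the corpus from disk exactly once, keeping each tokenized sentence in RAM
with codecs.open(unigram_sentences_filepath, encoding='utf_8') as f:
    sentences = [line.split() for line in f]

# gensim accepts any iterable of token lists, so the in-memory list works;
# multiple training passes now re-read RAM instead of the disk
model = Word2Vec(sentences)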
(The less idiomatic approach of, say, loading the entire file into a raw byte array and then performing repeated file-style reads over it to produce the individual items the algorithm needs is clunkier. It may similarly save on repeated IO cost, but it will likely waste effort on repeated re-parsing/tokenizing of items that are processed repeatedly. If you have the memory, you'll want to keep each item as a Python object, and that requires putting them in a list.)
To be more specific in answering, you'd need to provide more details in the question, like which specific algorithms/corpus-reading-styles you're using, ideally with example code.

Performing dead code elimination / slicing from original source code in Frama-C

EDIT: The original question had unnecessary details
I have a source file on which I run value analysis in Frama-C; some of the code is highlighted as dead code in the normalized window, not in the original source code.
Can I obtain a slice of the original code that removes the dead code?
Short answer: there's nothing in the current Frama-C version that will let you do that directly. Moreover, if your original code contains macros, Frama-C will not even see the real original code, as it relies on an external preprocessor (e.g. cpp) to do macro expansion.
Longer answer: Each statement in the normalized (aka CIL) Abstract Syntax Tree (AST, the internal representation of C code within Frama-C) contains information about the location (start point and end point) of the original statement it stems from, and this information is also available in the original AST (aka Cabs). It might thus be possible for someone with a good knowledge of Frama-C's inner workings (e.g. a reader of the developer's manual) to build a correspondence between both, and to use that to detect dead statements in Cabs. Going even further, one could bypass Cabs and identify zones in the original text of the program which are dead code. Note however that this would be a tedious and quite error-prone task, notably because a single original statement can be expanded into several normalized ones.
Given your clarifications, I stand by @Virgile's answer; but for people interested in performing some simplistic dead code elimination within Frama-C, the script below, gifted by a colleague who has no SO account, could be helpful.
(* remove_dead_code.ml *)
let main () =
  !Db.Value.compute ();
  Slicing.Api.Project.reset_slicing ();
  let selection = ref Slicing.Api.Select.empty_selects in
  let o = object (self)
    inherit Visitor.frama_c_inplace
    method !vstmt_aux stmt =
      if Db.Value.is_reachable_stmt stmt then
        selection :=
          Slicing.Api.Select.select_stmt ~spare:true
            !selection
            stmt
            (Extlib.the self#current_kf);
      Cil.DoChildren
  end in
  Visitor.visitFramacFileSameGlobals o (Ast.get ());
  Slicing.Api.Request.add_persistent_selection !selection;
  Slicing.Api.Request.apply_all_internal ();
  Slicing.Api.Slice.remove_uncalled ();
  ignore (Slicing.Api.Project.extract "no-dead")

let () = Db.Main.extend main
Usage:
frama-c -load-script remove_dead_code.ml file.c -then-last -print -ocode output.c
Note that this script does not work in all cases and could have further improvements (e.g. to handle initializers), but for some quick-and-dirty hacking, it can still be helpful.

How to simply read binary data sheets which contains string column in Julia?

I am trying to write a number of tabular data sheets to a binary file and read them back. The data are of Integer, Float64 and ASCIIString types. I write them without difficulty, using lpad to make the ASCIIString columns all the same length. Now I am facing the reading operation; I want to read each table of data with a single call to the read function, e.g.:
read(myfile,Tuple{[UInt16;[Float64 for i=1:10];UInt8]...}, dim) # => works
EDIT: I do not use the above line of code in my real solution, because I found that
sizeof(Tuple{Float64,Int32})!=sizeof(Float64)+sizeof(Int32)
But how can I include ASCIIString fields in my Tuple type? Check this simplified example:
file=open("./testfile.txt","w");
ts1="5char";
ts2="7 chars";
write(file,ts1,ts2);
close(file);
file=open("./testfile.txt","r");
data=read(file,typeof(ts1)); # => Error
close(file);
Julia is right, because typeof(ts1)==ASCIIString and ASCIIString is a variable-length array, so Julia doesn't know how many bytes must be read.
What kind of type must I use there instead? Is there a type that represents ConstantLengthString<length>, Bytes<length> or Chars<length>? Does any better solution exist?
EDIT
I should add a more complete sample that includes my latest progress. My latest solution is to read some part of the data into a buffer (one row or more), allocate memory for one row of data, then reinterpret the bytes and copy the resulting values from the buffer into an out location:
# convert array of bits and copy them to out
function reinterpretarray!{ty}(out::Vector{ty}, buffer::Vector{UInt8}, pos::Int)
    count=length(out)
    out[1:count]=reinterpret(ty,buffer[pos:count*sizeof(ty)+pos-1])
    return count*sizeof(ty)+pos
end
file=open("./testfile.binary","w");
#generate test data
infloat=ones(20);
instr=b"MyData";
inint=Int32[12];
#write tuple
write(file,([infloat...],instr,inint)...);
close(file);
file=open("./testfile.binary","r");
#read data into a buffer
buffer=readbytes(file,sizeof(infloat)+sizeof(instr)+sizeof(inint));
close(file);
#allocate memory
outfloat=zeros(20)
outstr=b"123456"
outint=Int32[1]
outdata=(outfloat,outstr,outint)
#copy and convert
pos=1
for elm in outdata
    pos=reinterpretarray!(elm, buffer, pos)
end
assert(outdata==(infloat,instr,inint))
But my experiments in the C language tell me that a better, more convenient and faster solution must exist. I would like to do it using C-style pointers and references; I don't like copying data from one location to another.
Thanks
You can use Array{UInt8} as an alternative type for ASCIIString, since that is the type of its underlying data.
ts1="5chars"
print(ts1.data) #Array{UInt8}
someotherarray=ts1.data[:] #copies as new array
someotherstring=ASCIIString(someotherarray)
assert(someotherstring == ts1)
Do mind that I'm getting UInt8 on an x86_64 system, which might not be your case. You should use Array{eltype(ts1.data)} for safety reasons.
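As a rough sketch of how this plays out for reading, reusing the testfile from the earlier snippet: since the lpad scheme makes every string column a known, fixed number of bytes, you can read exactly that many bytes and convert them back (the field widths of 5 and 7 are taken from ts1 and ts2 above, not a general rule):
file=open("./testfile.txt","r");
s1=ASCIIString(readbytes(file,5));  # "5char" was written as exactly 5 bytes
s2=ASCIIString(readbytes(file,7));  # "7 chars" was written as exactly 7 bytes
close(file);
assert(s1==ts1 && s2==ts2)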

Good practice on how to store the result of a function for later use in R

I have a situation where I have written an R function, ComplexResult, that computes a computationally expensive result which two other separate functions, LaterFuncA and LaterFuncB, will later use.
I want to store the result of ComplexResult somewhere so that both LaterFuncA and LaterFuncB can use it, and it does not need to be recalculated. The result of ComplexResult is a large matrix that only needs to be calculated once and is then re-used later on.
R is my first foray into the world of functional programming, so I am interested to understand what is considered good practice. My first line of thinking is as follows:
# run ComplexResult and get the result
cmplx.res <- ComplexResult(arg1, arg2)
# store the result in the global environment.
# NB this would not be run from a function
assign("CachedComplexResult", cmplx.res, envir = .GlobalEnv)
Is this at all the right thing to do? The only other approach I can think of is having a large "wrapper" function, e.g.:
MyWrapperFunction <- function(arg1, arg2) {
  cmplx.res <- ComplexResult(arg1, arg2)
  res.a <- LaterFuncA(cmplx.res)
  res.b <- LaterFuncB(cmplx.res)
  # do more stuff here ...
}
Thoughts? Am I heading at all in the right direction with either of the above? Or is there an Option C which is more cunning? :)
The general answer is that you should serialize/deserialize your big object for further use. The R way to do this is with saveRDS/readRDS:
## save a single object to file
saveRDS(cmplx.res, "cmplx.res.rds")
## restore it under a different name
cmplx2.res <- readRDS("cmplx.res.rds")
This assigns to the global environment:
CachedComplexResult <- ComplexResult(arg1, arg2)
To store I would use:
write.table(CachedComplexResult, file = "complex_res.txt")
And then to use it directly:
LaterFuncA(read.table("complex_res.txt"))
Your approach works for saving to local memory; other answers have explained saving to global memory or a file. Here are some thoughts on why you would do one or the other.
Save to file: this is the slowest, so only do it if your process is volatile and you expect it to crash hard and need to pick up the pieces where it left off, or if you just need to save the state once in a while and speed/performance is not a concern.
Save to global: if you need access from multiple spots in a large R program.
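If you want to combine the two (compute once, keep the result on disk for later runs, and still call it naturally before LaterFuncA/LaterFuncB), a small caching wrapper along these lines is one option; GetComplexResult and the cache file name are illustrative, not part of the original code:
GetComplexResult <- function(arg1, arg2, cache.file = "cmplx.res.rds") {
  if (file.exists(cache.file)) {
    return(readRDS(cache.file))      # reuse the previously stored matrix
  }
  res <- ComplexResult(arg1, arg2)   # expensive computation, done only once
  saveRDS(res, cache.file)
  res
}

cmplx.res <- GetComplexResult(arg1, arg2)
res.a <- LaterFuncA(cmplx.res)
res.b <- LaterFuncB(cmplx.res)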

/usr/bin/Rscript: Argument list too long

Using Rscript inside a bash script
I am passing the content of text files as arguments to Rscript:
"$SCRIPTS/myscript.R" "$filecontent"
I get the following when the file has roughly more than 4000 rows:
/usr/bin/Rscript: Argument list too long
Is there any way I can increase the accepted argument length so that I can pass large files?
What @MrFlick said is correct - you should change the way the arguments are passed to your script. However, if you still want to try to do it your way, then I recommend reading the following article:
"Argument list too long": Beyond Arguments and Limitations
The "Argument list too long" error, which occurs anytime a user feeds
too many arguments to a single command, leaves the user to fend for
oneself, since all regular system commands (ls *, cp *, rm *, etc...)
are subject to the same limitation. This article will focus on
identifying four different workaround solutions to this problem, each
method using varying degrees of complexity to solve different
potential problems.
Also, this Unix&Linux thread can help:
“Argument list too long”: How do I deal with it, without changing my command?
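For completeness, a minimal sketch of the change @MrFlick suggests, with illustrative names: pass the file path rather than its content, and let the R script read the file itself:
# in the bash script: pass the path, not the expanded file content
"$SCRIPTS/myscript.R" "$file"

# in myscript.R: pick up the path and read the file
args <- commandArgs(trailingOnly = TRUE)
filecontent <- readLines(args[1])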

VIM: Bijection between VIM Taglist elements and code snippets?

My Taglist in a C code:
macro
|| MIN_LEN
|| MAX_ITERATIONS
||- typedef
|| cell
|| source_cell
||- variable
|| len_given
Taglist elements (domain):
A = {MIN_LEN, MAX_ITERATIONS, cell, source_cell, len_given}
Code snippets (codomain):
B = {"code_MIN_LEN", "code_MAX_ITERATIONS", ..., "code_len_given"}
Goal: to have bijection between the sets A and B.
Example: I want to be able to remove any element in A, such as MIN_LEN, from both A and B by removing either its element in A or its element in B.
Question: Is there a way to guarantee a bijection between A and B so that a change either in A or in B results in a corresponding change in the other set?
I strongly doubt you can do that. The taglist plugin uses ctags to collect the symbols in your code and display them in a lateral split. The lateral split contains readonly information (if you try to work on that window, vim tells you that modifiable is off for that buffer).
What you want to achieve would imply quite complex parsing of the source code you are modifying. Even a simple task like automatic renaming (assuming you modify a function name entry in the taglist buffer and all the instances in your source are updated) requires pretty complex parsing, which is beyond the features of ctags or taglist itself. Deleting and keeping everything in sync with a bijective relationship is even more complex. Suppose you have a printf line where you use a macro you want to remove. What should happen to that line? Should the whole line disappear, or just the macro (in which case the line will probably be syntactically incorrect)?
taglist is a nice plugin for browsing your code, but it's unsuited for automatic refactoring (which is what you want to achieve).
Edit: as for the computational complexity, the worst-case scenario is that you have to scout the whole document at every keystroke, looking for new occurrences of labels that could be integrated, so in this sense you could say it's O(n) at each keystroke. This is of course overkill and the worst method to implement it. I am not aware of the computational complexity of the syntax highlighting in vim (which would be useful for extracting tags as well, via proper tokenization), but I would estimate it to be very low, and very limited in the amount of parsed data (you are unlikely to have large constructs to parse to extract the token and understand its context). In any case, this is not how taglist works. Taglist runs ctags at every vim invocation; it does not parse the document live while you type. This is, however, done by Eclipse, XCode and KDevelop for example, which also provide tools for automatic or semi-automatic refactoring, and can eventually integrate vim as an editor. If you need these features, you are definitely using the wrong tool.

Resources