Kaggle / Google Jupyter notebook printing on new lines - jupyter-notebook

When using Google Colab or Kaggle with HuggingFace transformers for text generation, the printed output usually contains text with line-end codes, like:

Alice read the book.\n And was waiting for more books \n\n Then she skipped reading

How can Python's print command be used in Kaggle to actually print those new lines, instead of printing one large text blob?
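
For reference, the literal \n usually appears because the notebook echoes a value's repr rather than the string itself. A minimal Python sketch (the sample sentence stands in for real model output):

text = "Alice read the book.\n And was waiting for more books \n\n Then she skipped reading"
text           # bare cell output shows the repr: 'Alice read the book.\n And was ...'
print(text)    # print() writes the string itself, so the newlines render as line breaks

# If the text really contains a backslash followed by "n" (two characters),
# convert the escape sequence first:
print(text.replace("\\n", "\n"))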

Related

How to convert a Jupyter notebook LaTeX markdown file to Word?

I've spent a long time making my first LaTeX-typeset document in Jupyter notebooks, but just now I've realised that I want to get it into a Word file so that I can send it to my professor to mark. However, I can't find a way to get it into Word without ruining all my LaTeX (or being forced to go through and click 'Insert Equation' in Word for every single symbol). Could someone help?!
A little convoluted, but you could first convert your Jupyter notebook to a LaTeX .tex file (see e.g. https://nbconvert.readthedocs.io/en/latest/usage.html#convert-latex) and then convert that to Word using some other software (a quick Google search turns up e.g. https://products.aspose.app/pdf/conversion/tex-to-docx).
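
If you'd rather script the first step, nbconvert also has a Python API; a minimal sketch ("notebook.ipynb" is a placeholder filename), equivalent to running jupyter nbconvert --to latex notebook.ipynb on the command line:

from nbconvert import LatexExporter

# Render the notebook as LaTeX source
body, resources = LatexExporter().from_filename("notebook.ipynb")
with open("notebook.tex", "w") as f:
    f.write(body)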

How to increase the number of lines you can read in R's View()

In an RMarkdown file, you can see the sample text below in red. After running my code, it prints in the console and in the RStudio Notebook Output (using View()) up to 10,000 lines.
However, the total text is 20,000 lines long. I can't find help online indicating how to increase the number of lines you can view in R, and I need to access all of it. Can anyone help? Basically, I want to view all of the text.
Note that my code took hours to run, RStudio crashed when the code finished executing, and the information I need is saved in red. Hence, I can't re-run the code without the same problem occurring.

How to deal with a UTF-8-encoded text file imported into R for LDA topic modeling

I am a newbie in R.
I was trying to import a text file into R to do an LDA topic-modeling analysis. The file is about movie tags, and there are some strange characters inside, such as "ã¯â¼å". I searched online and it seems these strange characters are UTF-8? Not sure if that is correct.
My question is how to deal with these characters: is there any way to transform them into normal characters? When I do further analysis on the text material, it returns the error "invalid multibyte string 129" ("ã¯â¼å" is on line 129 of the text file).
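
Those characters look like mojibake, i.e. UTF-8 bytes that were decoded with a single-byte encoding such as Latin-1. Whether they can be repaired depends on how badly the file was mangled; a Python sketch of the round-trip idea (using "Ã©" as the demo string, since the exact characters in the question may already be lossy):

# "é" stored as UTF-8 (bytes 0xC3 0xA9) but decoded as Latin-1 shows up as "Ã©".
garbled = "Ã©"
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # é

# The cleaner fix is to read the file with the right encoding up front,
# e.g. open(path, encoding="utf-8") in Python, or the encoding arguments
# of readLines()/read.table() in R.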

Detect if PDF contains searchable text

I want to read the text of 16,000 PDF files into an R dataframe. Most of the PDFs are no problem; pdf_text("link") gives the right output. The problem is that some PDFs have no copyable text. With those I get output like the following (sample):
"0123145456789 20ÿ\016\017ÿ\020\021\022ÿ\023\024\020\024\025\022ÿ\026\023ÿ\027\026\016\022\020\025\030ÿ\031\026\032\033\034\030ÿÿ\035\026\036\022\025\016\026\025ÿ\035 \037\025\033\022\032ÿ !\021\032\026\024\023\r\n ()*+ÿ-./.0012ÿ34ÿ5656ÿ7M9^.C=19ÿO=CDÿ-_`MQÿ-19GEHFC=19ÿ\.GW:CG7\r\n ()5+ÿ8.9:2ÿ;4ÿ5656ÿ7N10=#=:GÿJ1>ÿCD:ÿa>:.Cÿa01V.0ÿ<DECA1O9ÿ.9Aÿ\:K19A7\r\n ())+ÿ<:=0:>2ÿ5656ÿ7-1/=AÿY#191H=#Gbÿ̀:CC:Aÿ.9Aÿ>:.0cC=H:ÿF.F:>G7\r\n"
The obvious solutions are:
a. Do not read these texts at all
b. Read the texts via OCR
But since the first option excludes many texts and the second is very time-consuming, I would like to recognize the problematic texts beforehand. Is there an easy way to do this in R?
text_raw <- pdf_text("link")
# One possible heuristic (an assumption, not a proven test): garbled
# extractions are dominated by non-printable control characters
chars <- strsplit(paste(text_raw, collapse = ""), "")[[1]]
nonsense_share <- mean(!grepl("[[:print:][:space:]]", chars))
if (nonsense_share > 0.1) {      # threshold would need tuning
  text <- pdf_ocr_text("link")   # note: pdf_ocr_text() takes the PDF path, not the text
} else {
  text <- text_raw
}

Importing the contents of a word document into R

I am new to R and have worked for a while as follows: I have the code written in a Word document, then I copy and paste the code from the document into R to run it. This works fine, but when the code is long (hundreds of pages) it takes R a significant amount of time to start running it. This seems like a rather ineffective working procedure, and I am sure there are other ways to run R code.
One that comes to mind is to import the content of the Word document into R, which I am unsure how to do. I have tried read.table, but it does not work. I have looked on the internet for how to import it, but most explanations are about data tables and the like, or about internet files in the form of data tables. I have tried saving the document as .csv, but Word does not offer csv; I have tried Rich Text Format and the XML package, but again the packages' instructions are for importing tables and similar. I am wondering if there is an effective way for R to import a Word document as it is.
Thank you
It's hard to say what the easiest solution would be without examining the Word document. Assuming it contains only code and nothing else, it should be pretty easy to convert it all to plain text from within Word. You can do that by going to File -> Save As and choosing 'Plain Text' under 'Save as type'.
Then change the filename extension from .txt to .R, download a proper text editor (I can recommend RStudio for R), and open your code in it. You will then be able to run the code from inside the editor without using copy/paste.
No, read.table won't do it.
Microsoft Word has its own format, which includes a lot of metadata over and above the text you enter into it. You'll need a reader/parser that understands the Word format.
A Java developer would use a library like Apache POI to read it and parse it into word tokens and n-grams.
Look for Natural Language Processing tools, like this CRAN task view:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
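
For what it's worth, format-aware parsers exist outside Java too; a Python sketch using the python-docx package ("code.docx" and "code.R" are placeholder filenames), shown only to illustrate the idea:

from docx import Document  # pip install python-docx

# A .docx file is a zip archive of XML; python-docx parses it and
# exposes the document's paragraphs as plain strings.
doc = Document("code.docx")
lines = [p.text for p in doc.paragraphs]

# Write the recovered text back out as an R script
with open("code.R", "w") as f:
    f.write("\n".join(lines))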
