Looping Through Each Word In Word Document Using Docx Library

Looping Through Each Word In Word Document Using Docx Library - novacode-docx

I am trying to make a small program to apply autocorrect changes to an exiting document. I am using the docX library. My question is, how do you iterate (or loop) through each word in the document, using the docX library, to check if it needs to be corrected or not (I have already inserted all auto correct entries in a list<T>).

try this...
DocX document = DocX.Load( <document path> );
foreach(Novacode.Paragraph item in document.Paragraphs) {
// use this if you need whole text of a paragraph
string paraText = item.Text;
// use this if you need word by word
foreach(var data in item.MagicText) {
string word = data.text;
}
}

Related

Iterate faster over a large collection of files (objects) inside an Observable List (JavaFX 8)

I have an excel file that contains all the filenames of the Images. The path of these images are stored in an Observable Collection via <File> class which came from the folder that contains all of the images. My goal is to create a hyperlink of these filenames by matching it through the pool of image file collection.
I would like to ask if how can I iterate faster through a large collection of file classes in order to get their paths easily.
For example:
Image name from Excel :
ABC_0001
The Full path from the collection must be:
C:\Users\admin\Desktop\Images\ABC_0001.jpg
In order to get their full path, I perform the iteration through Stream.
My procedures:
Extract data using Apache POI.
Stream through the Image Collection by converting each data into
their base filenames vs extracted data.
Get the result and store the fullpath on the object via
getAbsolutePath().
Code:
//storage during iteration
ObservableList<DetailedData> dataCollection = FXCollections.observableArrayList()
//Image collection containing over 13k Images listed via commons-io
ObservableList<File> IMAGE_COLLECTION = FXCollections.observableArrayList(FileUtils.listFiles(browsedFOLDER, new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true));
//Sheet data
Sheet sheet1 = wb.getsheetAt(0);
for (Row row: sheet1)
{
DetailedData data = new DetailedData();
//extracted data from excel
String FILENAME = row.getCell(0,Row.MissingCellPolicy.CREATE_NULL_AS_BLANK).getStringCellValue();
//to be filled up based on stream result.
String IMAGE_SOURCE = null;
//stream code with the help of commons-io
File IMAGE = IMAGE_COLLECTION.stream().filter(e -> FilenameUtils.getBaseName(e.getName()).toLowerCase().equals(FILENAME.toLowerCase())).findFirst().orElse(null);
if (IMAGE != null)
IMAGE_SOURCE = IMAGE.getAbsolutePath();
data.setFileName(FILENAME);
data.setFullPath(IMAGE_SOURCE);
dataCollection.add(data);
}
Result:
Excel rows = 9,400
Image Files = 13,000
Iteration Time = 120,000ms
Are the results should appear normal or it can become faster?
I tried using parallelStream() and the results went faster but it consumes higher CPU usage.

This code should speed your code up a lot, but there are a few questions about your code.
ObservableList<DetailedData> dataCollection = FXCollections.observableArrayList() Why are you using ObservableList? Why is this a list of DetailedData and not File. Given that detailed data has setFileName and setFullPath. File already has these.
ObservableList<File> IMAGE_COLLECTION = FXCollections.observableArrayList(FileUtils.listFiles(browsedFOLDER, new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true)); Why ObservableList?
These two are small things, but I am curious.
So what I think you should do is use a Map. Your code should look something like the code below.
//storage during iteration
List<DetailedData> dataCollection = new ArrayList();
//Image collection containing over 13k Images listed via commons-io
List<File> IMAGE_COLLECTION = new ArrayList(FileUtils.listFiles(new File("C:\\Users\\blj0011\\Pictures"), new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true));
//Use this to map file name to file
Map<String, File> map = new HashMap();
//Use this to add data to the map
IMAGE_COLLECTION.forEach((file) -> {map.put(file.getName().substring(0, file.getName().lastIndexOf(".")).toLowerCase(), file);});
for (Row row: sheet1)
{
//extracted data from excel
String FILENAME = row.getCell(0,Row.MissingCellPolicy.CREATE_NULL_AS_BLANK).getStringCellValue();
//If the map contains the file name, create `DetailedData` object. Then set data. Then add object to datacollection list.
if (map.containsKey(FILENAME.toLowerCase()))
{
DetailedData data = new DetailedData();
data.setFileName(FILENAME);
data.setFullPath(map.get(FILENAME.toLowerCase()).getAbsolutePath());
dataCollection.add(data);
}
}
Comments in the code
I still believe this could be cleaned up a little more if you used List<File> dataCollection = new ArrayList()

If you really want to speed up your search, you should try not to do things repeatedly which could just be done once. For example you could use two loops. The first to prepare your search and the second to actually do the search. Inside your filter you call FilenameUtils.getBaseName and two time a conversion to lower case. It would be better to do these things only once in the first loop and store the resulting Strings in a list. In the second loop you then do the search on this list.
I am also wondering why you use ObservableLists here. A simple List would do as well.

I've tested another approach in this slow iteration.
It seems that the cause is declaring the Stream repeatedly inside the foreach.
I tried using Baeldung's solution <Supplier> and declared it outside the loop together with parallelStream()
Sample Code:
Supplier<Stream<File>> streamSupplier = () -> imageCollection.parallelStream();
for (Row row : sheet)
{
File IMAGE = streamSupplier.get().filter(e -> FilenameUtils.getBaseName(e.getName()).toLowerCase().equals(FILENAME.toLowerCase())).findFirst().orElse(null);
if (IMAGE != null)
IMAGE_SOURCE = IMAGE.getAbsolutePath();
}
Result went 45000ms
Please correct me if my approach was not right.

DM Script to import a 2D image in text (CSV) format

Using the built-in "Import Data..." functionality we can import a properly formatted text file (like CSV and/or tab-delimited) as an image. It is rather straight forward to write a script to do so. However, my scripting approach is not efficient - which requires me to loop through each raw (use the "StreamReadTextLine" function) so it takes a while to get a 512x512 image imported.
Is there a better way or an "undocumented" script function that I can tap in?

DigitalMicrograph offers an import functionality via the File/Import Data... menu entry, which will give you this dialog:
The functionality evoked by this dialog can also be accessed by script commands, with the command
BasicImage ImageImportTextData( String img_name, ScriptObject stream, Number data_type_enum, ScriptObject img_size, Boolean lines_are_rows, Boolean size_by_counting )
As with the dialog, one has to pre-specify a few things.
The data type of the image.
This is a number. You can find out which number belongs to which image data type by, f.e., creating an image outputting its data type:
image img := Realimage( "", 4, 100 )
Result("\n" + img.ImageGetDataType() )
The file stream object
This object describes where the data is stored. The F1 help-documention explains how one creates a file-stream from an existing file, but essentially you need to specify a path to the file, then open the file for reading (which gives you a handle), and then using the fileHandle to create the stream object.
string path = "C:\\test.txt"
number fRef = OpenFileForReading( path )
object fStream = NewStreamFromFileReference( fRef, 1 )
The image size object
This is a specific script object you need to allocate. It wraps image size information. In case of auto-detecting the size from the text, you don't need to specify the actual size, but you still need the object.
object imgSizeObj = Alloc("ImageData_ImageDataSize")
imgSizeObj.SetNumDimensions(2) // Not needed for counting!
imgSizeObj.SetDimensionSize(0,10) // Not used for counting
imgSizeObj.SetDimensionSize(1,10) // Not used for counting
Boolean checks
Like with the checkboxes in the UI, you spefic two conditions:
Lines are Rows
Get Size By Counting
Note, that the "counting" flag is only used if "Lines are Rows" is also true. Same as with the dialog.
The following script improrts a text file with couting:
image ImportTextByCounting( string path, number DataType )
{
number fRef = OpenFileForReading( path )
object fStream = NewStreamFromFileReference( fRef, 1 )
number bLinesAreRows = 1
number bSizeByCount = 1
bSizeByCount *= bLinesAreRows // Only valid together!
object imgSizeObj = Alloc("ImageData_ImageDataSize")
image img := ImageImportTextData( "Imag Name ", fStream, DataType, imgSizeObj, bLinesAreRows, bSizeByCount )
return img
}
string path = "C:\\test.txt"
number kREAL4_DATA = 2
image img := ImportTextByCounting( path, kREAL4_DATA )
img.ShowImage()

How to copy and paste indd file content into an another indd file?

I'm newbie in indesign scripting stuffs.So I apologise as I couldn't post my trials.
Objective:
I have an indd document which will have figure caption,label etc. I need to copy content(figure which is editable) from other indd file to this document where related figure label exists.
For example:
sample.indd
Some text
Fig.1.1 caption
some text
I need to copy the content of figure1.indd and paste into the sample.indd document where Fig.1.1 string exists and so on. Now I'm doing it manually. But am supposed to automate this.
So, I need some hint how to acheive it using extendscript?
I have found something like below to do this, but I don't have any clue to develop it further and also am not sure whether this approach is correct to get my result. Pls help me
myDocument=app.open(File("file.indd"),false); //opening a file to get the content without showing.
myDocument.pages.item(0).textFrames.item(0).contents="some text";
//here I could set the content but I don't knw how to get the content
// ?????? Then I have to paste the content into active document.

I found the script for my requirement.
var myDoc = File("sample.indd");//Destination File
var myFigDoc = File("fig.indd");//Figure File
app.open(File(myFigDoc));
app.activeDocument.pageItems.everyItem().select();
app.copy();
app.open(File(myDoc));
app.findGrepPreferences = app.changeGrepPreferences = null;
app.findGrepPreferences.findWhat = "FIG. 1.1 ";//Figure caption text
//app.findGrepPreferences.appliedParagraphStyle = "FigureCaption";//Figure Caption Style
myFinds = app.activeDocument.findGrep();
for(var i=0;i<myFinds.length;i++){
myFinds[i].insertionPoints[0].contents="\r";
myFinds[i].insertionPoints[0].select();
app.paste();
}
app.findGrepPreferences = app.changeGrepPreferences = null;

If acceptable for you, you can place an indesign file as link (place…). So a script could try to catch the "fig…" strings and do the importation.
Have a look at scripts that use finGrep() and place() command.

Show all text of a docx in a stringBuilder with docx4j

i need to put all text of a docx in a stringBuilder, also with tab and hyphen.
i've tried the use of org.docx4j.TextUtils, but in the resultant string doesn't seen tab.
String inputfilepath = System.getProperty("user.home") + "test.docx";
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
org.docx4j.wml.Document wmlDocumentEl = (org.docx4j.wml.Document)documentPart.getJaxbElement();
Writer out = new OutputStreamWriter(System.out);
extractText(wmlDocumentEl, out);
out.close();

As per my answer at http://www.docx4java.org/forums/docx-java-f6/is-it-possible-to-extract-all-text-also-tab-and-hyphen-t1996.html#p6933?sid=b0d58fec2ba349d0f3f49cf66411397c
The problem with tab and hyphen, as I guess you know, is that they aren't represented in the docx as normal characters.
Tab is w:tab
A hyphen might be a hyphen character, or it might be displayed (without being actually in the docx), or it might be:
http://webapp.docx4java.org/OnlineDemo/ecma376/WordML/noBreakHyphen.html
or http://webapp.docx4java.org/OnlineDemo/ecma376/WordML/softHyphen.html
Replicating Word's hyphenation behaviour would be a challenge.
But for the others, there are three approaches which occur to me:
generalising your traverse approach (are you using TraversalUtil.getChildrenImpl?)
doing it in XSLT (you can do this in docx4j, but XSLT is probably slower, and a mix of technologies)
marshal the main document part to a string, do suitable string replacements, then unmarshal, then use TextUtils
For (3), assuming MainDocumentPart mdp, to get it as a String:
String stringContent = mdp.getXML();
Then to inject the modified content:
mdp.setContents((Document)XmlUtils.unmarshalString(stringContent) );

iMacro - Setting Variable + SaveAs CSV

I am looking for help with 2 parts of my iMacro Script...
Part1 - Variable
I am clicking on the follwoing line of a page in order to access the page I need to extract from.
1st Link
TAG POS=**8** TYPE=A FORM=NAME:xxyy ATTR=HREF:https://aaa.aaaa.com/en/administration/xxxx.jsp?reqID=h*
2nd Link
TAG POS=**9** TYPE=A FORM=NAME:xxyy ATTR=HREF:https://aaa.aaaa.com/en/administration/xxxx.jsp?reqID=h*
The tag pos is the variable, how can I get this so that when running on loop, the macro will select the next value on the screen (ie choose 8,9,10)? Some screens have 100 plus links to be clicked on.
Part 2 - Save CSV file
I have the saveas line in my file. But how can I make it so that there is only 1 csv file created (even if macro is runn 50 times)? Also, is there a way to format the CSV file from the iMacros so that each new run starts on another row (currently, all data extracts to row 1 across many columns.)
Thank you in advance,
Adam

This will do what you asked. It will loop the macro and each time set the new position number in the macro.
1)
var macro;
macro ="CODE:";
macro +="TAG POS={{number}} TYPE=A FORM=NAME:xxyy ATTR=HREF:https://aaa.aaaa.com/en/administration/xxxx.jsp?reqID=h*"+"\n";
for(var i=1;i<100;i++)
{
iimSet("number",i)
iimPlay(macro)
}
For the solution of part two you will need JavaScript scripting. First part is declaring macro and the second part is initiating the macro and the third part is the function which saves the extracted text into a file. Each time you run it will save in the new line.
2)
var macroExtractSomething;
macroExtractSomething ="CODE:";
macroExtractSomething +="TAG POS=1 TYPE=DIV ATTR=CLASS:some_class_of_some_div EXTRACT=TXT"+"\n";
iimPlay(macroExtractSomething)
var extracted_text=iimGetLastExtract();
WriteFile("C:\\some_folder\\some_file.csv",extracted)
//This function writes string into a file. It will also create file on that location
function WriteFile(path,string)
{
//import FileUtils.jsm
Components.utils.import("resource://gre/modules/FileUtils.jsm");
//declare file
var file = new FileUtils.File(path);
//declare file path
file.initWithPath(path);
//if it exists move on if not create it
if (!file.exists())
{
file.create(file.NORMAL_FILE_TYPE, 0666);
}
var charset = 'EUC-JP';
var fileStream = Components.classes['#mozilla.org/network/file-output-stream;1']
.createInstance(Components.interfaces.nsIFileOutputStream);
fileStream.init(file, 18, 0x200, false);
var converterStream = Components
.classes['#mozilla.org/intl/converter-output-stream;1']
.createInstance(Components.interfaces.nsIConverterOutputStream);
converterStream.init(fileStream, charset, string.length,
Components.interfaces.nsIConverterInputStream.DEFAULT_REPLACEMENT_CHARACTER);
//write file to location
converterStream.writeString("\r\n"+string);
converterStream.close();
fileStream.close();
}

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Looping Through Each Word In Word Document Using Docx Library - novacode-docx

try this... DocX document = DocX.Load( <document path> ); foreach(Novacode.Paragraph item in document.Paragraphs) { // use this if you need whole text of a paragraph string paraText = item.Text; // use this if you need word by word foreach(var data in item.MagicText) { string word = data.text; } }

Related

Iterate faster over a large collection of files (objects) inside an Observable List (JavaFX 8)

DM Script to import a 2D image in text (CSV) format

How to copy and paste indd file content into an another indd file?

Show all text of a docx in a stringBuilder with docx4j

iMacro - Setting Variable + SaveAs CSV

Categories

Resources