My intention is to index an arbitrary directory containing PDF files (among other file types) with keywords stored in a list. I have a traditional solution, and I have heard that graph-based solutions using e.g. SimpleGraph could be more elegant/efficient and independent of directory structures.
What would a graph-based solution (e.g. SimpleGraph) look like?
Traditional solution
// https://stackoverflow.com/a/14051951/1497139
List<File> pdfFiles = this.explorePath(TestPDFFiles.RFC_DIRECTORY, "pdf");
List<PDFFile> pdfs = this.getPdfsFromFileList(pdfFiles);
…
for (PDFFile pdf : pdfs) {
  // https://stackoverflow.com/a/9560307/1497139
  if (org.apache.commons.lang3.StringUtils.containsIgnoreCase(pdf.getText(), keyWord)) {
    foundList.add(pdf.file.getName()); // here we access by structure (early binding)
                                       // - in the graph solution by name (late binding)
  }
}
Basically, with SimpleGraph you'd use a combination of the modules:
FileSystem
PDFSystem
With the FileSystem module you collect a graph of the files in the directory and filter it to include only files with the extension "pdf". Then you analyze the PDFs with the PDFSystem module to get the page/text structure. There is already a test case for this in the simplegraph-bundle module that shows how it works with some RFC PDFs as input.
TestPDFFiles.java
I have now added the indexing test, see below. The core functionality is taken from the old test that searched for a single keyword, now with the keyword supplied as a parameter:
List<Object> founds = pdfSystem.g().V().hasLabel("page")
.has("text", RegexPredicate.regex(".*" + keyWord + ".*")).in("pages")
.dedup().values("name").toList();
This is a Gremlin query that does most of the work, searching a whole tree of PDF files with just one call. I consider this more elegant, since you do not have to care about the structure of the input (tree/graph/filesystem/database, etc.).
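The getIndex helper used in the test below is essentially a loop around this query. Here is a minimal sketch of what it can look like (the map layout is inferred from the test; the actual implementation in TestPDFFiles may differ):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// hypothetical sketch of the getIndex helper: run the single-keyword
// query once per keyword and collect the matching file names
public Map<String, List<String>> getIndex(PdfSystem pdfSystem, String... keyWords) {
  Map<String, List<String>> index = new LinkedHashMap<>();
  for (String keyWord : keyWords) {
    List<Object> founds = pdfSystem.g().V().hasLabel("page")
        .has("text", RegexPredicate.regex(".*" + keyWord + ".*")).in("pages")
        .dedup().values("name").toList();
    List<String> fileNames = new ArrayList<>();
    for (Object found : founds) {
      fileNames.add(found.toString());
    }
    index.put(keyWord, fileNames);
  }
  return index;
}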
JUnit Testcase
@Test
/**
 * test for https://github.com/BITPlan/com.bitplan.simplegraph/issues/12
 */
public void testPDFIndexing() throws Exception {
  FileSystem fs = getFileSystem(RFC_DIRECTORY);
  int limit = Integer.MAX_VALUE;
  PdfSystem pdfSystem = getPdfSystemForFileSystem(fs, limit);
  Map<String, List<String>> index = this.getIndex(pdfSystem, "ARPA",
      "proposal", "plan");
  // debug=true;
  if (debug) {
    for (Entry<String, List<String>> indexEntry : index.entrySet()) {
      List<String> fileNameList = indexEntry.getValue();
      System.out.println(String.format("%15s=%3d %s", indexEntry.getKey(),
          fileNameList.size(), fileNameList));
    }
  }
  assertEquals(14, index.get("ARPA").size());
  assertEquals(9, index.get("plan").size());
  assertEquals(8, index.get("proposal").size());
}
Related
I have written code to extract tables and name-value pairs from PDFs using Amazon Textract. I followed this example:
https://docs.aws.amazon.com/textract/latest/dg/async-analyzing-with-sqs.html
which was written against the AWS SDK for Java version 1.1.
I have refactored it for version 2.
This is an async process that only applies to multi-page documents. When I get the results back, they are pretty accurate for the first page, but the subsequent pages are mostly empty rows. The documents I parse are scanned, so the quality is not great. However, if I take a JPG of an individual page and use the single-page operation, i.e. AnalyzeDocumentRequest, each page comes out fine. The Amazon Textract Try-it service also renders the pages correctly.
So the error must be in my code, but I can't see where.
As you can see, it all happens here:
GetDocumentAnalysisRequest documentAnalysisRequest = GetDocumentAnalysisRequest.builder().jobId(jobId)
.maxResults(maxResults).nextToken(paginationToken).build();
response = textractClient.getDocumentAnalysis(documentAnalysisRequest);
and I can't really intervene there.
The most likely place I could have made a mistake is the util file that gathers the page and table blocks, i.e. here:
PageModel pageModel = tableUtil.getTableResults(blocks);
But that works perfectly for the first page, and I could also see in the response object above that the number of blocks returned is much smaller.
Here is the full code:
private DocumentModel getDocumentAnalysisResults(String jobId) throws Exception {
    int maxResults = 1000;
    String paginationToken = null;
    GetDocumentAnalysisResponse response = null;
    Boolean finished = false;
    int pageCount = 0;
    DocumentModel documentModel = new DocumentModel();
    // loops until pagination token is null
    while (finished == false) {
        GetDocumentAnalysisRequest documentAnalysisRequest = GetDocumentAnalysisRequest.builder().jobId(jobId)
                .maxResults(maxResults).nextToken(paginationToken).build();
        response = textractClient.getDocumentAnalysis(documentAnalysisRequest);
        // Show blocks, confidence and detection times
        List<Block> blocks = response.blocks();
        PageModel pageModel = tableUtil.getTableResults(blocks);
        pageModel.setPageNumber(pageCount++);
        Map<String, String> keyValues = formUtil.getFormResults(blocks);
        pageModel.setKeyValues(keyValues);
        documentModel.getPages().add(pageModel);
        paginationToken = response.nextToken();
        if (paginationToken == null)
            finished = true;
    }
    return documentModel;
}
Has anyone else encountered this issue?
Many thanks
If the response has a NextToken, then you need to call Textract again and pass in the NextToken to get the next batch of Blocks.
I am not sure how to do this in Java, but here is the Python example from the AWS samples repo:
https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing/blob/master/src/jobresultsproc.py
For my solution, I did a simple check: if response['NextToken'] is present, call the method again and concatenate response['Blocks'] onto my current list.
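In Java, a rough equivalent with the v2 SDK would be along these lines. This is an untested sketch that reuses the textractClient field from the question's code; the key point is that each GetDocumentAnalysis response is a batch of blocks, not one document page, so building a PageModel per response drops content from the later pages:

import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.textract.model.Block;
import software.amazon.awssdk.services.textract.model.GetDocumentAnalysisRequest;
import software.amazon.awssdk.services.textract.model.GetDocumentAnalysisResponse;

// hypothetical helper: collect ALL blocks across NextToken batches first,
// and only parse them into pages afterwards
private List<Block> getAllBlocks(String jobId) {
    List<Block> allBlocks = new ArrayList<>();
    String paginationToken = null;
    do {
        GetDocumentAnalysisRequest.Builder request = GetDocumentAnalysisRequest.builder()
                .jobId(jobId)
                .maxResults(1000);
        if (paginationToken != null) {
            request = request.nextToken(paginationToken);
        }
        GetDocumentAnalysisResponse response = textractClient.getDocumentAnalysis(request.build());
        allBlocks.addAll(response.blocks());    // concatenate this batch
        paginationToken = response.nextToken(); // null after the last batch
    } while (paginationToken != null);
    return allBlocks;
}

The combined list can then be split into per-page groups (each Block carries its page number) before calling tableUtil.getTableResults and formUtil.getFormResults once per document page.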
I'm trying to run a ScanContent processor on Apache NiFi. I can get the processor to run when scanning a text file against a .txt dictionary file with the search terms contained in it (delimited by newline characters), but I cannot get it to run when scanning a file with the processor's binary dictionary type.
I am unsure whether I am simply using the wrong format for the binary dictionary file, or whether it needs to be encoded differently. I couldn't find any example dictionaries online that were of any use (most material relates to ScanAttribute instead).
The format of my dictionary file is:
(inside a .txt file)
32 00001001001000010000100001000000\n
The requirements according to the documentation are that each dictionary term is written as a 4-byte integer (the term's length), followed by the binary search term itself.
Does anyone have any experience of using this processor with a binary dictionary that might be able to help specify the format?
A binary dictionary file would typically be generated as the output of another program. There is an example in the ScanContent unit tests for how to accomplish this in Java:
@Test
public void testBinaryScan() throws IOException {
    // Create dictionary file.
    final String[] terms = new String[]{"hello", "good-bye"};
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (final DataOutputStream dictionaryOut = new DataOutputStream(baos)) {
        for (final String term : terms) {
            final byte[] termBytes = term.getBytes("UTF-8");
            dictionaryOut.writeInt(termBytes.length);
            dictionaryOut.write(termBytes);
        }
        final byte[] termBytes = baos.toByteArray();
        final Path dictionaryPath = Paths.get("target/dictionary");
        Files.write(dictionaryPath, termBytes, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        ...
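Note that DataOutputStream.writeInt writes the length as a raw 4-byte big-endian integer, immediately followed by the raw bytes of the term. The dictionary is therefore a true binary file; a text file containing the digits of the length and a bit string, as in the question, will not match the expected format.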
I understand that the return value from one lambda is fed into the arguments of the next. However, what if multiple pieces of data need to be passed, or the return type of one lambda is already fixed by the program structure?
Here is my working code, where both of these are the case: it opens a file picker and then reads the picked file's contents as text, while remembering which file they came from:
create_task(picker->PickSingleFileAsync())
.then([this](StorageFile^ file)
{
if (file == nullptr) cancel_current_task();
m_OpenFilename = file->Name;
return FileIO::ReadTextAsync(file);
})
.then([this](String^ fileContents)
{
//do something with the filename and file contents
});
Note that in order to make this work, I needed to add a class variable to store the filename between the asynchronous tasks. This strikes me as bad for a number of reasons:
It is ugly having a class variable for the internal use of a single method.
Is this thread-safe? If someone goes nuts opening file pickers and selecting files, could these asynchronous tasks clobber each other when accessing m_OpenFilename?
This is only a trivial example with one variable, but let's say I also want to keep track of the path of the file, its file attributes, and a number of other characteristics. The class looks uglier and uglier as the number of class variables increases.
My first approach was to have a variable local in scope to the function and to pass it into each lambda by adding it to their capture lists as [this, OpenFilename]. However, this failed: by the time the lambda executed, the local OpenFilename had already been destroyed, resulting in an access violation when it was accessed.
In my example, how can I pass the metadata of the file along to the results of ReadTextAsync so that I can have access to both the file and its contents at the same time?
The easiest way is to just continue building a chain of nested continuations:
auto picker = ref new FileOpenPicker();
picker->FileTypeFilter->Append(L".txt");
picker->SuggestedStartLocation = PickerLocationId::Desktop;
auto task = create_task(picker->PickSingleFileAsync()).then(
[](StorageFile^ file)
{
auto name = file->Name;
auto task = create_task(file->OpenReadAsync()).then(
[name](IRandomAccessStreamWithContentType^ iras)
{
OutputDebugString(name->Data());
});
});
If you don't want to do that (for whatever reason), another option is to use a shared_ptr to hold the value. In this case I'm going to hold on to the name and the creation date in a helper file_info type:
struct file_info
{
Platform::String^ name;
Windows::Foundation::DateTime created;
};
auto picker = ref new FileOpenPicker();
picker->FileTypeFilter->Append(L".txt");
picker->SuggestedStartLocation = PickerLocationId::Desktop;
auto info = std::make_shared<file_info>();
auto task = create_task(picker->PickSingleFileAsync()).then(
[info](StorageFile^ file)
{
info->name = file->Name;
info->created = file->DateCreated;
return create_task(file->OpenReadAsync());
}).then(
[info](IRandomAccessStreamWithContentType^ iras)
{
OutputDebugString(info->name->Data());
OutputDebugString(L"\n");
wchar_t datetime[100];
_i64tow_s(info->created.UniversalTime, datetime, 100, 10);
OutputDebugString(datetime);
});
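Because the shared_ptr is captured by value, each run of this code gets its own file_info instance that stays alive for as long as the continuation chain references it. This also sidesteps the thread-safety concern about the m_OpenFilename class member: concurrent picker invocations no longer share any state.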
In an ASP.NET MVC 4 project, I'd like to reference a versioned script file like this:
// Just some pseudo-code:
<script src="@Latest("~/Scripts/jquery-{0}.min.js")"></script>
// Resolves to the currently referenced script file
<script src="/Scripts/jquery-1.10.2.min.js"></script>
so that when a new script version is installed via NuGet, the reference is updated automatically. I know of the bundling-and-minification feature, but it's just too much. I just want the little part that resolves the wildcards. My files are already minified, and I don't want the bundles.
Do you have some smart ideas how to solve this?
Even though it's a little overkill to use bundling in MVC, I think that will be your best bet. It's already been done and proven, so why spend more time writing proprietary code?
That being said, if you want a simple sample of what you can do, then you can try the following.
public static class Util
{
    private const string _scriptFolder = "Scripts";

    public static string GetScripts(string expression)
    {
        var path = HttpRuntime.AppDomainAppPath;
        var files = Directory.GetFiles(path + _scriptFolder).Select(x => Path.GetFileName(x)).ToList();
        string script = string.Empty;
        expression = expression.Replace(".", @"\.").Replace("{0}", "(\\d+\\.?)+");
        Regex r = new Regex(expression, RegexOptions.IgnoreCase);
        foreach (var f in files)
        {
            Match m = r.Match(f);
            while (m.Success)
            {
                script = m.Captures[0].ToString();
                m = m.NextMatch();
            }
        }
        return script;
    }
}
This will return the last match in your Scripts directory, or an empty string if nothing matched.
Using this call
@Html.Raw(MvcApplication1.Util.GetScripts("jquery-{0}.min.js"))
Will get you this result if 1.8.2 is the last file that matched your string.
jquery-1.8.2.min.js
Hope this will help you get started.
Mode of publishing - static
I'm trying to publish images, but whenever I publish them, their TCM URI is appended to their file name (e.g. if the image name is example and its TCM URI is tcm:1-115, the published filename becomes example_tcm1-115).
I have written the following code:
public void Transform(Engine engine, Package package)
{
    Filter MMCompFilter = new Filter();
    MMCompFilter.Conditions["ItemType"] = Tridion.ContentManager.ItemType.Component;
    Folder folder = engine.GetObject("tcm:1-1-2") as Folder;
    foreach (Component MMcomp in folder.GetItems(MMCompFilter))
    {
        Binary binary = engine.PublishingContext.RenderedItem.AddBinary(MMcomp);
        String binaryurl = binary.Url;
        // reverse the URL so the "_tcm..." suffix can be stripped off the end
        char[] array = binaryurl.ToCharArray();
        Array.Reverse(array);
        string obj = new string(array);
        string final = newImagepath(obj);
        // reverse it back to normal order
        char[] array2 = final.ToCharArray();
        Array.Reverse(array2);
        string obj2 = new string(array2);
        package.PushItem("Image", package.CreateHtmlItem(obj2));
    }
}

public string newImagepath(string filePath)
{
    int formatIndex = filePath.IndexOf(".");
    string format = filePath.Substring(0, formatIndex);
    int finalPath = filePath.IndexOf("_");
    string newPath = filePath.Substring(finalPath + 1);
    return (format + "." + newPath);
}
I want to publish images without the TCM URI appended. Please suggest how this can be done.
Chris Summers wrote a very nice article on this very topic http://www.urbancherry.net/blogengine/post/2010/02/09/Unique-binary-filenames-for-SDL-Tridion-Multimedia-Components.aspx
It is basically a very simple thing to fix, but it can have huge consequences which you should be aware of!
You can only publish a binary with a certain file name to a single location once (and a binary can only be published to a single location on the presentation server, unless you publish it as a variant). However, in the CMS it is very easy to create Multimedia Components with the same binary file name in different folders; if these get published to the same location, they will conflict. That is why, by default, SDL Tridion appends the TCM URI to the filename to make it unique.
Simplest is always best.
In your TBB, just push the individual images to the package:
package.PushItem(package.CreateMultimediaItem(component.Id));
Then use the "PublishBinariesInPackage" TBB to publish these images to your presentation server.
You can use the RenderedItem.AddBinary method for this goal. Some of the overloaded versions of the method allow you to publish an image as a stream and pass any file name. For example:
public Binary AddBinary(
Stream content,
string filename,
string variantId,
string mimeType
)
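With such an overload you can read the Multimedia Component's content yourself and supply the plain file name, instead of letting the AddBinary(Component) overload generate the tcm-suffixed name (keeping in mind the uniqueness caveat described above).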