Apache Tika: preserve bullet lists and styles and customize output - xhtml

I'm running some tests with Apache Tika. The goal is to turn complex Word documents (a few pages of text, tables, images, bullet lists with many levels of indentation) into XHTML, preserving as much information and styling as possible.
I found this out-of-the-box example on the official site. It does its job, but with many limitations:
1. Bulleted and numbered lists are not output correctly. <p class="list_Paragraph">· first element of the list</p> is generated instead of <ul><li>first element of the list</li>..., and indentation levels are lost if there are nested lists.
2. Text colors, font sizes, alignment and many other styles are not output at all.
3. Is it possible to generate a specific output for a specific tag/style? (e.g. heading3 turned into <smallHeading> instead of <h3>)
4. Images are not extracted.
Point 4 probably requires an extractor to be implemented (from what I found in other posts), but is it possible to achieve the first 3 points above? Is this a matter of a few settings or of extending the example parser/handler, or does everything have to be implemented from scratch? Suggestions?
Thanks a lot.
public String parseToHTML() throws IOException, SAXException, TikaException {
    // Serializes the SAX events emitted by the parser as XHTML text
    ContentHandler handler = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}

Related

AE Extendscript layered source file

I'm working on a script that does find/replace for missing items in your project. Unfortunately, I'm running into trouble detecting and then replacing layered image sources (psd, ai, etc.).
1) I see no way of detecting whether an AVItem is a layer within a layered image other than parsing item.name, which is unreliable because a user can always rename items in the project panel.
2) Once I do know that it is part of a layered image, I cannot figure out how to re-link it to the correct image without replacing the layer with the merged image. item.replace(new_path) will replace that item with the whole image, not the layer within the image. For example:
var item = app.project.item(3); //assuming this is the 'layer' we want to replace
item.replace(new_path);
So is there a secret property somewhere which will reliably tell me if an item is a part of a layered image, and if so is there a way to relink it without replacing the layer with the entire merged image?
EDIT
Here's a function to guess if a layer is part of a layered image. It's not bullet-proof but it should work as long as the user does not rename the item:
function isSourceLayered (av_item) {
    // check if there is a "/"
    if (av_item.name.indexOf("/") != -1) {
        // check if it is in a "layers" folder
        if (av_item.parentFolder.name.indexOf("Layers") != -1) {
            return true;
        }
    }
    return false;
}

I just asked the same question on the Adobe ExtendScript forum. Unless there are undocumented features (and I spent a bit of time looking with the ExtendScript Toolkit's data browser), the fileSource object doesn't seem to have any attributes or methods to do this.
There is a kind of workaround: you can import the file using ImportOptions.importAs(ImportAsType.COMP). This will import a comp, and you can loop through the layers matching the name, get the source of that layer and use that as your new source. But as you say, it doesn't work if the source has been renamed.
I've written this into a function; it's available on GitHub.
Edit: I forgot that I changed the way that function works. It doesn't re-import layer sources because of this problem; it just uses the Duplicate menu command.

VUEJS: Is it possible to process/modify data retrieved through v-for before displaying?

Hello, I am extremely new to web application development and JavaScript in general. I have only given myself crash courses from Udemy videos over the past 4 months.
For my web application, I am retrieving data from a database through my server-side backend with axios. I have a logRepository array where the retrieved data is pushed.
data() {
  return {
    logRepository: [],
  }
},
created() {
  axios.get('/myrestapiroute', {
    headers: {
      'Authorization': `Bearer ${this.$store.state.token}`
    },
    params: {
      userId: this.userId
    }
  })
  .then(res => {
    const data = res.data.logs
    this.dataCheck = data
    // Copy each log into logRepository, keeping its key as the id
    for (let key in data) {
      const log = data[key]
      log.id = key
      this.logRepository.push(log)
    }
  })
}
In my template, I use v-for to loop through all the retrieved data elements:
<div ..... v-for="(log,index) in logRepository" :key="index">
With the v-for in place, here is one example of how I display my data in a paragraph tag. The format is due to how the data was structured.
<p style="text-align: justify;">
  {{ log.logId.jobPerformed }}
</p>
The problem arises when I try to apply styling to the text. As you can see above, I would like to use text-align: justify. I also want to keep the whitespace exactly as the user entered it by using white-space: pre-wrap.
(https://i.imgur.com/dwaJHT9.png)
But the problem is that these two styles do not work well together. As can be seen in the picture below, if I use justify on its own, it behaves normally (except that whitespace is lost). However, if I combine text-align: justify and white-space: pre-wrap, the text ends up justified with spacing, but aligned in a weird way.
(https://i.imgur.com/DQSfOya.png)
Short entries begin with a weird indentation when they should start at the left edge of the column. The issue appears to be more than just whitespace at the start, as I have tried .trim() as suggested by a contributor.
(https://i.imgur.com/uwysk9X.png)
I tried tweaking the CSS with text-align-last, text-align-start, direction: ltr, pre tags, etc., but it just does not work properly. Suggestions from other SO pages recommended processing the data with a string replace of all \n to br before displaying.
Is it possible to process each individual item obtained from v-for before displaying it, or does it have to be done on the array using a computed property?
Since my data is fetched from a database, I am confused about how to pre-process it, since my array size will be dynamic and differ for each user.
What would be the best way to pre-process the data before displaying it in such a case?
The image below shows what my array of objects (logRepository) looks like. The format is largely due to the MongoDB schema.
(https://i.imgur.com/7SilcF7.png)
======= Solution =======
I modified the object variables in my .then block and performed a string replace of all \n characters with <br> tags.
https://i.imgur.com/EtLX2tg.png
With that, my display no longer requires the "white-space: pre-wrap" styling. However, since I was previously using string interpolation to display my data, the <br> tags were treated as plain text.
https://i.imgur.com/zUbNZbI.png
I had to change the tags to use the v-html binding so that the data is rendered as HTML. The difference can be seen in the last picture.
https://i.imgur.com/sCTsCV4.png
Thanks for helping me with this, since I am very new to JavaScript.

There are a number of patterns you could use here to pre-process your data:
1. Process your data in the then function you have created.
2. Wrap the data in functions within the {{ ... }} blocks in the template.
3. Create a component that is used as the element the v-for loop iterates over and that displays a computed value.
4. Use a computed value as the source array of the v-for directive in the main template. This is often done using a processing function mapped over the source data, as in logRepository.map(log => log.logId.jobPerformed.trim())
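
As a rough sketch of the computed-value pattern, the mapping step can live in a plain function that a computed property returns. The names escapeHtml, toDisplayLogs and jobPerformedHtml are made up for illustration; only the log shape (log.logId.jobPerformed) comes from the question. The HTML escaping is an extra assumption on my part, added because the result is meant to be rendered with v-html.

```javascript
// Hypothetical helper: escape HTML so user input is safer to render with v-html.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: build a display copy of each log without mutating the source array.
function toDisplayLogs(logRepository) {
  return logRepository.map(log => ({
    id: log.id,
    // Trim, escape, then turn line breaks into <br> tags.
    jobPerformedHtml: escapeHtml(log.logId.jobPerformed.trim()).replace(/\r?\n/g, '<br>')
  }));
}

// Example with made-up data shaped like the question's logRepository.
const sample = [{ id: 'a1', logId: { jobPerformed: '  line one\nline <two>  ' } }];
console.log(toDisplayLogs(sample)[0].jobPerformedHtml);
// prints: line one<br>line &lt;two&gt;
```

In a component this could be returned from a computed property (for example, one that calls toDisplayLogs(this.logRepository)), with the v-for iterating over that property and v-html bound to jobPerformedHtml. Because the array is recomputed whenever logRepository changes, its size can differ per user without extra handling.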

Print different size pages - Determine if printer is pdf

Problem:
My project... printing a sequence of pages... created based on certain templates and database info...
The sequence of pages to be printed can be, in certain situations, of different sizes.
I have been trying to print to a real printer, producing multiple pages with
if (m_printer->newPage()) { ... }
and on a physical printer, if I try to change the page size, it either doesn't work or puts the printer in an error state.
So there is not much choice, it seems, but to make each page a separate job. There are minor disadvantages, possibly on a network. Oh well.
When printing to PDF or any other type of file, though, it makes a huge difference whether the sequence is contained in a single multi-page document or ends up as hundreds of one-page documents.
So, I found this question: "Is it possible to make a pdf with different page size in Qt?"
It seems to be exactly what I need if I print to a PDF, while for a real printer I will make each page a separate job.
The only problem:
How can I tell whether I am creating a PDF file or sending a job to a real printer?
I looked in QPrinter and QPrinterInfo, and I did not see anything that can help.
PDF printing is probably enabled because of Adobe Acrobat.
I am currently implementing this on Windows.
Edit: why getting the outputFormat (Naidu's answer below) doesn't work:
qprinter.cpp:
void QPrinterPrivate::initEngines(QPrinter::OutputFormat format, const QPrinterInfo &printer)
{
    ..
    // Only set NativeFormat if we have a valid plugin and printer to use
    if (format == QPrinter::NativeFormat) { //////// which of course has to be, we have to support any printer
        ps = QPlatformPrinterSupportPlugin::get();
        QPrinterInfo printerToUse = findValidPrinter(printer);
        if (ps && !printerToUse.isNull()) { //////// both valid since the PDF writer is valid
            outputFormat = QPrinter::NativeFormat;
            printerName = printerToUse.printerName();
        }
    }
    ...
}
I would like to have something to check, other than the fact that "pdf" may be contained in the name. If needed, I am willing to use the awful DEVMODE, I just don't know what to look for.

Use the public function
QPrinter::outputFormat()
It returns a value of the enum QPrinter::OutputFormat.
Check whether that value is QPrinter::PdfFormat.
http://doc.qt.io/qt-5/qprinter.html#OutputFormat-enum

using the chrome console to select out data

I'm looking to pull out all of the companies from this page (https://angel.co/finder#AL_claimed=true&AL_LocationTag=1849&render_tags=1) in plain text. I saw someone use the Chrome Developer Tools console to do this and was wondering if anyone could point me in the right direction?
TL;DR: How do I use the Chrome console to select and pull out some data from a URL?

Note: since jQuery is available on this page, I'll just go ahead and use it.
First of all, we need to select the elements we want, e.g. the names of the companies. These are kept in the list with ID startups_content, inside elements with class items, in a field with class name. Therefore, a selector for them can look like this:
$('#startups_content .items .name a')
As a result, we get a bunch of HTMLElements. Since we want plain text, we need to extract it from these HTMLElements by doing:
.map(function(idx, item){ return $(item).text(); }).toArray()
This gives us an array of company names. However, let's make a single plain-text list out of it:
.join('\n')
Connecting all the steps above, we get:
$('#startups_content .items .name a').map(function(idx, item){ return $(item).text(); }).toArray().join('\n');
which should be executed in the DevTools console.
If you need some other data, e.g. company URLs, just follow the same steps as described above, making the appropriate changes.
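
On a page where jQuery is not loaded, the same select, extract and join steps can be sketched with the standard DOM API. This is only a sketch: extractText is a hypothetical helper name, and the selector is the one from above, which assumes the page markup is still the same.

```javascript
// Hypothetical helper: collect the trimmed text of every element matching
// `selector` under `root` (normally `document`) and join it with newlines.
function extractText(root, selector) {
  return Array.from(root.querySelectorAll(selector), el => el.textContent.trim())
    .join('\n');
}

// In the DevTools console you would call:
//   extractText(document, '#startups_content .items .name a')
// In Chrome, wrapping the call in copy(...) puts the result on the clipboard.
```

Passing the root in as a parameter keeps the helper usable on a sub-tree (for example a single list element) as well as on the whole document.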

Nested REST Routing

Simple situation: I have a server with thousands of pictures on it. I want to create a restful business layer which will allow me to add tags (categories) to each picture. That's simple. I also want to get lists of pictures that match a single tag. That's simple too. But now I also want to create a method that accepts a list of tags and which will return only pictures that match all these tags. That's a bit more complex, but I can still do that.
The problem is this, however. Say, my rest service is at pictures.example.com, I want to be able to make the following calls:
pictures.example.com/Image/{ID} - Should return a specific image
pictures.example.com/Images - Should return a list of image IDs.
pictures.example.com/Images/{TAG} - Should return a list of image IDs with this tag.
pictures.example.com/Images/{TAG}/{TAG} - Should return a list of image IDs with these tags.
pictures.example.com/Images/{TAG}/{TAG}/{TAG} - Should return a list of image IDs with these tags.
pictures.example.com/Images/{TAG}/{TAG}/{TAG}/{TAG}/{TAG} - Should return a list of image IDs with these tags.
etcetera...
So, how do I set up a RESTful web service project that will allow me to nest tags like this and still be able to read them all, without any limit on the number of tags (although the URL length would be a limit)? I might want to have up to 30 tags in a selection, and I don't want to set up 30 different routes to get it to work. I want one route that could technically allow unlimited tags.
Yes, I know there are other ways to send such a list back and forth, even better ones, but I want to know if this is possible and if it's easy to create. So the URL cannot be different from the examples above.
It must be simple, I think. I just can't come up with a good solution...

The URL structure you choose should be based on whatever is easy to implement with your web framework. I would expect something like
http://pictures.example.com/images?tags=tag1,tag2,tag3,tag4
to be much easier to handle on the server, and I can see no advantage to the path segment approach that you are having trouble with.

I assume you can figure out how to actually write the SQL or filesystem query to filter by multiple tags. In CherryPy, for example, hooking that up to a URL is as simple as:
import cherrypy

class Images:
    # /images: list every image URL
    @cherrypy.tools.json_out()
    def index(self):
        return [cherrypy.url("/images/" + x.id)
                for x in mylib.images()]
    index.exposed = True

    # /images/{TAG}/{TAG}/...: every extra path segment arrives in *tags
    @cherrypy.tools.json_out()
    def default(self, *tags):
        return [cherrypy.url("/images/" + x.id)
                for x in mylib.images(*tags)]
    default.exposed = True
...where the *tags argument is a tuple of all the /{TAG} path segments the client sends. Other web frameworks will have similar options.
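
Framework aside, the catch-all idea comes down to splitting everything after /Images into tags and keeping only the images that carry every one of them. A minimal sketch of that logic, where tagsFromPath, imagesWithAllTags and the in-memory images array are all hypothetical stand-ins for a real router and picture store:

```javascript
// Hypothetical helper: turn "/Images/cat/black/" into ["cat", "black"].
function tagsFromPath(pathname) {
  const segments = pathname.split('/').filter(Boolean);
  // The first segment is the collection name ("Images"); the rest are tags.
  return segments.slice(1).map(segment => decodeURIComponent(segment));
}

// Hypothetical helper: return the IDs of images that carry every requested tag.
function imagesWithAllTags(images, tags) {
  return images
    .filter(image => tags.every(tag => image.tags.includes(tag)))
    .map(image => image.id);
}

// Made-up sample data standing in for the real picture store.
const images = [
  { id: 1, tags: ['cat', 'black'] },
  { id: 2, tags: ['cat'] },
  { id: 3, tags: ['dog', 'black'] }
];

console.log(imagesWithAllTags(images, tagsFromPath('/Images/cat/black'))); // [ 1 ]
console.log(imagesWithAllTags(images, tagsFromPath('/Images')));           // [ 1, 2, 3 ]
```

With no tags the filter matches everything, so /Images and /Images/{TAG}/... can share one handler, which is the same effect the *tags tuple gives in the CherryPy version.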
