Store unwellformed XHTML in Basex DB

I have to store XHTML that is not well-formed, e.g. (<img href"tes.jpg">), where an element has neither a closing tag nor a self-closing slash.
I am using the REST service to store the data in the database. The code is below:
let $DocPath := 'D:\sample\'
for $files in file:children($DocPath)
let $filename := fn:tokenize($files,'\\')[fn:last()]
return
(
db:replace('test',('\content\'||$filename),fn:doc($files),map{'skipcorrupt' : true()}),
admin:write-log(fn:concat('Content ingested', ' ' , $filename))
)
I used the "skipcorrupt" option, but it did not work, and I get this error while ingesting content:
(Line 369): The element type "img" must be terminated by the matching end-tag "</img>".]
Is there an option that I need to configure in "web.xml" or the ".basex" file? Please suggest.
Thanks
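
One possible direction, offered as a sketch rather than a confirmed fix: fn:doc parses each file as XML before db:replace is called, which would explain why "skipcorrupt" never gets a chance to apply. BaseX's HTML Module provides html:parse, which converts HTML to XML when the TagSoup library is on the classpath. That is an assumption about the installation; without TagSoup the input is still parsed as XML and the same error would appear. The database name and paths are the ones from the code above.

```xquery
(: Sketch: convert each file with the HTML parser instead of fn:doc.
   Assumes TagSoup is available to BaseX. :)
let $DocPath := 'D:\sample\'
for $file in file:children($DocPath)
let $filename := file:name($file)
return (
  (: html:parse accepts a string or binary item and returns a document node :)
  db:replace('test', '\content\' || $filename, html:parse(file:read-binary($file))),
  admin:write-log('Content ingested ' || $filename)
)
```

Reading the file as binary leaves encoding detection to the parser. The stored documents are the repaired XML, not the original markup.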

Related

Xquery delete leaves empty lines in xml document - how to remove them? (eXist-db)

In XQuery 3.1 (eXist 4.7) I have an operation that deletes nodes from a stored XML document at /db/apps/myapp/data/list_bibliography.xml that looks like this:
<listBibl xmlns="http://www.tei-c.org/ns/1.0" xml:id="bibliography">
<tei:biblStruct xmlns:tei="http://www.tei-c.org/ns/1.0" type="book" xml:id="Z-BF2WLW8Y">
<tei:monogr>
<tei:title level="m">footitle1</tei:title>
<tei:author>
<tei:name>author name</tei:name>
</tei:author>
<tei:imprint>
<tei:publisher>some city</tei:publisher>
<tei:date>2019</tei:date>
</tei:imprint>
</tei:monogr>
</tei:biblStruct>
<tei:biblStruct xmlns:tei="http://www.tei-c.org/ns/1.0" type="book" xml:id="Z-4KF7YNP3">
<tei:monogr>
<tei:title level="m">footitle2</tei:title>
<tei:author>
<tei:name>author name</tei:name>
</tei:author>
<tei:imprint>
<tei:publisher>some other city</tei:publisher>
<tei:date>2018</tei:date>
</tei:imprint>
</tei:monogr>
</tei:biblStruct>
</listBibl>
The following function:
declare function local:delete-bibl()
{
let $bibdoc := doc("/db/apps/myapp/data/list_bibliography.xml")
for $bib in $bibdoc//tei:biblStruct[@xml:id = "Z-BF2WLW8Y"]
return update delete $bib
};
leaves the file with whitespace like this:
<listBibl xmlns="http://www.tei-c.org/ns/1.0" xml:id="bibliography">
<tei:biblStruct xmlns:tei="http://www.tei-c.org/ns/1.0" type="book" xml:id="Z-4KF7YNP3">
<tei:monogr>
<tei:title level="m">footitle2</tei:title>
<tei:author>
<tei:name>author name</tei:name>
</tei:author>
<tei:imprint>
<tei:publisher>some other city</tei:publisher>
<tei:date>2018</tei:date>
</tei:imprint>
</tei:monogr>
</tei:biblStruct>
</listBibl>
Is there some sort of configuration or function that can collapse the white space left by delete?
I tried using return update replace $bib with "" instead, but that throws an error because the replacement must be a node.
Many thanks.

There is no configuration option for collapsing the whitespace left by eXist's XQuery Update delete operations.
To work around the error you received when replacing $bib with an empty string, instead replace it with a text node:
update replace $bib with text { "" }

XSLT-style mini transformation in Xquery?

At the moment in Xquery 3.1 (in eXist 4.7) I receive XML fragments that look like the following (from eXist's Lucene full text search):
let $text :=
<tei:text>
<tei:front>
<tei:div>
<tei:listBibl>
<tei:bibl>There is some</tei:bibl>
<tei:bibl>text in certain elements</tei:bibl>
</tei:listBibl>
</tei:div>
<tei:div>
<tei:listBibl>
<tei:bibl>which are subject <exist:match>to</exist:match> a Lucene search</tei:bibl>
<tei:bibl></tei:bibl>
</tei:listBibl>
</tei:div>
</tei:front>
<tei:body>
<tei:p>and often produces</tei:p>
<tei:p>a hit.</tei:p>
</tei:body>
</tei:text>
Currently I have Xquery send this fragment to an XSLT stylesheet in order to transform it into HTML like this:
<td>...elements which are subject <span class="search-hit">to</span> a Lucene search and often p...
Where the stylesheet's job is to return 30 characters of text before and after <exist:match/> and put the content of <exist:match/> into a span. There is only one <exist:match/> per transformation.
This all works fine. However, it has occurred to me that this is a very small job: effectively a single transformation of one element, with the rest being a sort of string-join. I therefore wonder whether it can be done efficiently in XQuery.
In trying to do this, I can't seem to find a way to handle the string content up to the <exist:match/> and then the string content after <exist:match/>. My idea, in pseudocode, is to output a result like:
let $textbefore := some function to get the text before <exist:match/>
let $textafter := some function to get the text after <exist:match/>
return <td>...{$textbefore}
<span class="search-hit">
{$text//exist:match/text()}
</span> {$textafter}...</td>
Is this even worth doing in Xquery vs the current Xquery -> XSLT pipeline I have?
Many thanks.

I think it can be done as
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare namespace tei = "http://example.com/tei";
declare namespace exist = "http://example.com/exist";
declare option output:method 'html';
let $text :=
<tei:text>
<tei:front>
<tei:div>
<tei:listBibl>
<tei:bibl>There is some</tei:bibl>
<tei:bibl>text in certain elements</tei:bibl>
</tei:listBibl>
</tei:div>
<tei:div>
<tei:listBibl>
<tei:bibl>which are subject <exist:match>to</exist:match> a Lucene search</tei:bibl>
<tei:bibl></tei:bibl>
</tei:listBibl>
</tei:div>
</tei:front>
<tei:body>
<tei:p>and often produces</tei:p>
<tei:p>a hit.</tei:p>
</tei:body>
</tei:text>
,
$match := $text//exist:match,
$text-before-all := normalize-space(string-join($match/preceding::text(), ' ')),
$text-before := substring($text-before-all, string-length($text-before-all) - 29),
$text-after := substring(normalize-space(string-join($match/following::text(), ' ')), 1, 30)
return
<td>...{$text-before}
<span class="search-hit">
{$match/text()}
</span> {$text-after}...</td>
which is not really much of a query in XQuery either but just some XPath selection plus some possibly expensive string joining and extraction on the preceding and following axis.

How to add a value to the existing element value and return it as a new value

This is the xml file.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<AtcoCode> System-Start-Date= 2018-05-16T12:35:48.6929328-04:00, " ", System-End-Date = 9999-12-31, " ", 150042010003</AtcoCode>
<NaptanCode>esxatgjd</NaptanCode>
<PlateCode>
</PlateCode>
<CleardownCode>
</CleardownCode>
<CommonName>Upper Park</CommonName>
<CommonNameLang>
</CommonNameLang>
<ShortCommonName>
</ShortCommonName>
<ShortCommonNameLang>
</ShortCommonNameLang>
<Landmark>Upper Park</Landmark>
<LandmarkLang>
</LandmarkLang>
<Street>High Road</Street>
<StreetLang>
</StreetLang>
<Crossing>
</Crossing>
<CrossingLang>
</CrossingLang>
<Indicator>adj</Indicator>
<IndicatorLang>
</IndicatorLang>
<Bearing>NE</Bearing>
<NptgLocalityCode>E0046286</NptgLocalityCode>
<LocalityName>Loughton</LocalityName>
<ParentLocalityName>
</ParentLocalityName>
<GrandParentLocalityName>
</GrandParentLocalityName>
<Town>Loughton</Town>
<TownLang>
</TownLang>
<Suburb>
</Suburb>
<SuburbLang>
</SuburbLang>
<LocalityCentre>1</LocalityCentre>
<GridType>U</GridType>
<Easting>541906</Easting>
<Northing>195737</Northing>
<Co-ordinates>51.64255,0.04944</Co-ordinates>
<StopType>BCT</StopType>
<BusStopType>MKD</BusStopType>
<TimingStatus>OTH</TimingStatus>
<DefaultWaitTime>
</DefaultWaitTime>
<Notes>
</Notes>
<NotesLang>
</NotesLang>
<AdministrativeAreaCode>080</AdministrativeAreaCode>
<CreationDateTime>2006-11-06T00:00:00</CreationDateTime>
<ModificationDateTime>2010-01-16T07:58:02</ModificationDateTime>
<RevisionNumber>5</RevisionNumber>
<Modification>rev</Modification>
<Status>act</Status>
</root>
How to achieve this?
Question: Create the path range index for the status element and fetch all the documents that have status del.
After fetching all the documents, create a new element called currentreservationnumber under the RevisionNumber element.
The value of currentrevisionnumber will be RevisionNumber + 1.

I think the warning about sequential numbers is related to system-wide unique numbers/ids (like Oracle sequence), so not a worry in this case?
If you only ever have one RevisionNumber, and you can find it without a path index, you can maybe get by with element-value query on the RevisionNumber since it's already indexed.
Given that you get the document somehow, it could be as simple as:
let $doc := fn:doc ('/foo.xml')
let $rev-node := $doc/root/RevisionNumber
return xdmp:node-insert-after ($rev-node, <currentreservationnumber>{$rev-node + 1}</currentreservationnumber>)
though remember to consider locking if you are doing a big query/update. And you might need to switch to node-replace if there is already a currentreservationnumber.
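
A minimal sketch of that insert-or-replace choice, assuming the same placeholder URI '/foo.xml' and at most one existing currentreservationnumber per document. The variable names $new and $existing are only illustrative.

```xquery
xquery version "1.0-ml";

(: Sketch: add currentreservationnumber after RevisionNumber,
   or replace it when the element is already present. :)
let $doc := fn:doc('/foo.xml')
let $rev-node := $doc/root/RevisionNumber
(: The new value is RevisionNumber + 1 :)
let $new := <currentreservationnumber>{xs:integer($rev-node) + 1}</currentreservationnumber>
let $existing := $doc/root/currentreservationnumber
return
  if (fn:exists($existing))
  then xdmp:node-replace($existing, $new)
  else xdmp:node-insert-after($rev-node, $new)
```

Casting with xs:integer makes the arithmetic explicit and fails early if RevisionNumber is empty or not numeric.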

How to match space in MarkLogic using CTS functions?

I need to find the elements that have a space " " in their attribute values.
For example:
<unit href="http:xxxx/unit/2 ">
Suppose the href attribute above has a trailing space.
I have done this using a FLWOR query, but I need it done using CTS functions. Please suggest.
For the FLWOR query I have tried this:
let $x := (
for $d in doc()
order by $d//id
return
for $attribute in data($d//@href)
return
if (fn:contains($attribute," ")) then
<td>{(concat( "id = " , $d//id) ,", data =", $attribute)}</td>
else ()
)
return <tr>{$x}</tr>
This is working fine.
For CTS I have tried
let $query :=
cts:element-attribute-value-query(xs:QName("methodology"),
xs:QName("href"),
xs:string(" "),
"wildcarded")
let $search := cts:search(doc(), $query)
return fn:count($search)

Your query is looking for " " to be the entirety of the value of the attribute. If you want to look for attributes that contain a space, then you need to use wildcards. However, since there is no indexing of whitespace except for exact value queries (which are by definition not wildcarded), you are not going to get a lot of index support for that query, so you'll need to run this as a filtered search (which you have in your code above) with a lot of false positives.
You may be better off creating a string range index on the attribute and doing value-match on that.
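
A sketch of the range-index approach, assuming an element-attribute string range index has already been configured on unit/@href. The element name is taken from the example in the question, and the index and the query are assumed to share the default collation.

```xquery
xquery version "1.0-ml";

(: Assumes a string range index on unit/@href; adjust the names to your data. :)
let $element := xs:QName("unit")
let $attribute := xs:QName("href")
(: Lexicon lookup: every indexed href value that contains a space :)
let $values := cts:element-attribute-value-match($element, $attribute, "* *")
return
  if (fn:empty($values))
  then 0
  else
    (: Count the documents that carry any of those values :)
    fn:count(
      cts:search(fn:doc(),
        cts:element-attribute-range-query($element, $attribute, "=", $values)))
```

The value-match call reads the values straight from the index, so the pattern is tested against the lexicon rather than against every document.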

SQLite: isolating the file extension from a path

I need to isolate the file extension from a path in SQLite. I've read the post here (SQLite: How to select part of string?), which gets 99% there.
However, the solution:
select distinct replace(column_name, rtrim(column_name, replace(column_name, '.', '' ) ), '') from table_name;
fails if a file has no extension (i.e. no '.' in the filename), in which case it should return an empty string. Is there any way to trap this, please?
Note that the filename in this context is the part after the final '\'. The search should not look for '.' characters in the full path, as it currently does.
I think it should be possible to do it using further nested rtrims and replaces.

Thanks. Yes, you can do it like this:
1) create a scalar function called "extension" in QtScript in SQLiteStudio
2) The code is as follows:
if ( arguments[0].substring(arguments[0].lastIndexOf('\u005C')).lastIndexOf('.') == -1 )
{
return ("");
}
else
{
return arguments[0].substring(arguments[0].lastIndexOf('.'));
}
3) Then, in the SQL query editor you can use
select distinct extension(PATH) from DATA
... to itemise the distinct file extensions from the column called PATH in the table called DATA.
Note that the PATH field must contain a backslash ('\') in this implementation - i.e. it must be a full path.
