Do not include repeated data in facets with MarkLogic - xquery

I'm doing a search with facets using the new search:search API, but I have the following problem:
My source:
File #1
<root>
<location>
<university>
<name>Yale</name>
<country>USA</country>
</university>
</location>
<location>
<university>
<name>MIT</name>
<country>USA</country>
</university>
</location>
<location>
<university>
<name>Santander</name>
<country>Spain</country>
</university>
</location>
</root>
File #2
<root>
<location>
<university>
<name>MIT</name>
<country>USA</country>
</university>
</location>
</root>
I need to know the number of distinct universities in each country, but the facets return either the number of files that contain a country at least once, or the number of location occurrences across all files, counting repeated universities. So for the sample data above I get the following with the two options:
First Option (using frequency-order)
USA - 2 (Number of Files with at least one location with USA)
SPAIN - 1
Second Option (Using item-frequency)
USA - 3
SPAIN - 1
When the result should be this:
USA - 2 (because across the two files there are only two distinct universities in the USA)
SPAIN - 1
How can I do this?

I think you need the item-frequency option instead of the default fragment-frequency option. You add it to a constraint as a so-called facet option. More details and examples can be found on CMC: http://community.marklogic.com/pubs/5.0/apidocs/SearchAPI.html#search:search
-- edit --
I think I didn't read your question thoroughly enough. The search library focuses on search results, and the facet counts on fragments. The easiest way to improve the counts is to define the location element as a fragment root. However, I don't think that really returns the numbers you are looking for. The country facet only counts country occurrences, not the universities within countries. You can't achieve that with the search library, but it isn't difficult to do it yourself:
for $country in cts:element-values(xs:QName('country'))
let $universities := cts:element-values(xs:QName('name'), (), (),
    cts:element-value-query(xs:QName('country'), $country))
return fn:concat($country, ' - ', fn:count($universities))
Note: untested code, but it at least shows the essential steps. It needs element range indexes on name and country (the values live in the name element, not university, so that is the element to index and query). It also requires that different countries never occur within the same fragment, so you need to add location as a fragment root in the ML admin interface.
HTH!

Try cts:element-value-co-occurrences with name and country
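A hedged sketch of that approach, untested like the code above, and assuming element range indexes exist on both name and country (cts:element-value-co-occurrences requires them). Each distinct (name, country) pair comes back once, so counting pairs per country gives the distinct-university counts:

```xquery
(: distinct (name, country) pairs pulled from the range indexes :)
let $pairs := cts:element-value-co-occurrences(
                xs:QName("name"), xs:QName("country"))
for $country in fn:distinct-values($pairs/cts:value[2])
return fn:concat($country, " - ",
                 fn:count($pairs[cts:value[2] eq $country]))
```

With the sample data this should yield USA - 2 and Spain - 1, since Yale/USA, MIT/USA and Santander/Spain are the only distinct pairs.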

R list.files: some regexes only return a single file

I'm puzzled by the behaviour of regexes in the list.files command. I have a folder with ~500 files, most names start with "new_" and end with ".txt". There are some other files on the same folder, e.g. README, _cabs.txt.
I'd like to get only the new_*.txt files. I've tried different ways to call list.files with different results. Here they are:
#1 This returns ALL files including README and others
list.files(path="correctpath/")
#2 This returns ALL files including _cabs.txt, which I do not want.
list.files(path="correctpath/",pattern="txt")
#3 This returns ALL files I want, but...
list.files(path="correctpath/",pattern="new_")
#4 This returns just one of the new_*.txt files.
list.files(path="correctpath/",pattern="new*\\.txt")
#5 This returns an empty list.
list.files(path="correctpath/",pattern="new_*\\.txt")
So I have one solution that works, but I would like to understand what's going on with approaches 4 and 5.
Thanks in advance,
Rafael
list.files(path="correctpath/",pattern="new_.*\\.txt")
* means "zero or more times", and it applies only to the single preceding item. So in #4, "new*\\.txt" matches "ne", then zero or more "w", then ".txt", which is why it only found one oddly named file. In #5, "new_*\\.txt" matches "new", then zero or more "_", then ".txt", which matched nothing because your files have other characters between the "_" and the ".txt". If you want to match any character zero or more times, you need a period before the star (.*), because a period matches any character except newline. The pattern "new_.*\\.txt" should work.
Good R regex reference.
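Python's regex engine treats these patterns the same way, so the behaviour is easy to reproduce outside R (the file names below are made up for illustration):

```python
import re

files = ["new_a.txt", "new_b.txt", "ne.txt", "README", "_cabs.txt"]

def match(pattern):
    # list.files() keeps names where the pattern matches anywhere,
    # like grepl(); re.search() does the same
    return [f for f in files if re.search(pattern, f)]

print(match(r"new*\.txt"))    # "ne" + zero or more "w" + ".txt" -> ['ne.txt']
print(match(r"new_*\.txt"))   # "new" + zero or more "_" + ".txt" -> []
print(match(r"new_.*\.txt"))  # "new_" + anything + ".txt" -> ['new_a.txt', 'new_b.txt']
```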

BI Publisher conditional field masking

I have the following code on a field in a PeopleSoft BI Publisher RTF template, where it masks all but the last 4 digits of the Bank Account number.
<?xdofx:lpad('',length(Bank_Account__)-4,'*')?>
<?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,4))?>
The problem is that sometimes the total Bank Account number length is less than 4 digits, and when this happens it causes a negative array error in the lpad function.
Can I wrap some kind of conditional IF statement around this that checks the length of the bank account number: if it is 5 or more digits long, mask all but the last 4 digits; otherwise (for shorter Bank Account numbers) just mask the first 2 digits. What would this look like?
Thanks in advance!
EDIT:
I should add that the existing code above is already wrapped in the following IF statement:
<?if#inlines:Bank_Account__!=''?>
So the entire statement is:
<?if#inlines:Bank_Account__!=''?>
<?xdofx:lpad('',length(Bank_Account__)-4,'*')?>
<?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,4))?>
<?end if?>
I would just like to add in the conditional logic to check the bank account length and subsequently perform either of the above masking.
EDIT 2:
Here is my setup with your suggested changes, but I don't think I have the logic nested right, and the syntax may also be an issue.
Edit 3:
Here is the modified code, and the resulting error message:
The if statements can be nested, but since BIP does not have an else clause, the second if condition has to check for the negative case.
Maybe this might work:
<?if#inlines:Bank_Account__!=''?>
<?if#inlines:string-length(Bank_Account__)>4?>
<?xdofx:lpad('',length(Bank_Account__)-4,'*')?><?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,4))?>
<?end if?>
<?if#inlines:string-length(Bank_Account__)<=4?>
<?xdofx:lpad('','2','*')?><?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,string-length(Bank_Account__)-2))?>
<?end if?>
<?end if?>
Update: Here is a screenshot of what I got:
Here is the xml snippet I used.
<?xml version="1.0"?>
<root>
<record>
<Bank_Account__>123456</Bank_Account__>
</record>
<record>
<Bank_Account__>12345</Bank_Account__>
</record>
<record>
<Bank_Account__>1234</Bank_Account__>
</record>
<record>
<Bank_Account__>123</Bank_Account__>
</record>
<record>
<Bank_Account__>12</Bank_Account__>
</record>
</root>
Download working files from here
There are some more functions available for other ways to implement this requirement.

Talend TRestClient: geocoding and combination of both flows (rows) afterwards

I am currently working on a small Talend job, which imports CSV data, gets the address field and sends the address to Google Maps API for geocoding. Afterwards, I need to combine both the input and geocoding data.
My problem is that combining the initial data row with the geocoding result does not seem possible; after passing through tRestClient, all references to the input data seem gone.
Here's my non-final data flow:
Subjob 1: CSVInput --> THashMapOutput
|
|
Subjob 2: THashInput --> tRestClient --> tExtractJSONFields --> tMap --> tBufferOutput
| (Lookup)
|
tHashInput
|
|
Subjob 3: tBufferInput --> tFileOutputDelimited
Here, the last tMap does not have a foreign key, i.e. a reference to the input row. Therefore the join creates the cross product of all combinations of input and geocoded rows.
Is there a way to combine both input and geocoding results? Can we configure tRestClient to forward inputs as well?
(Combining the two resulting CSV files afterwards fails for the same reason: the missing identifier.)
OK, the answer was quite easy:
Assume the first link in subjob 2 is called row2.
Then you can open the second tMap component.
Remove the lookup shown above.
Add the references to row2 within the tMap, e.g. row2.URL, row2.Name.
Et voilà: now you get each row as a combination of the geocoded result and the original data.

Using Marklogic Xquery data population

I have data in the following form:
<Status>Active Leave Terminated</Status>
<date>05/06/2014 09/10/2014 01/10/2015</date>
I want to get the data in this form:
<status>Active</status>
<date>05/06/2014</date>
<status>Leave</status>
<date>09/10/2014</date>
<status>Terminated</status>
<date>01/10/2015</date>
Please help me with the query to retrieve the data as specified above.
Well, you have a string and want to split it at the whitespace. That's what tokenize() is for, and \s matches a whitespace character. To get the corresponding date you can track the current position in the for loop using at. Together it looks something like this (note that I assume the input data is the current context item):
let $dates := tokenize(date, "\s+")
for $status at $pos in tokenize(Status, "\s+")
return (
<status>{$status}</status>,
<date>{$dates[$pos]}</date>
)
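The split-and-pair logic itself is language-agnostic; here is a quick Python sketch of the same idea for anyone who wants to check it outside MarkLogic (plain strings stand in for the element values):

```python
status = "Active Leave Terminated"
dates = "05/06/2014 09/10/2014 01/10/2015"

# split both strings on whitespace and pair entries by position,
# mirroring tokenize(..., "\s+") plus the "at $pos" counter above
pairs = list(zip(status.split(), dates.split()))
for s, d in pairs:
    print(f"<status>{s}</status>")
    print(f"<date>{d}</date>")
```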
You did not indicate whether your data is on the file system or already loaded into MarkLogic. It's also not clear if this is something you need to do once on a small set of data or on an on-going basis with a lot of data.
If it's on the file system, you can transform it as it is being loaded. For instance, MarkLogic Content Pump can apply a transformation during load.
If you have already loaded the content and you want to transform it in place, you can use Corb2.
If you have a small amount of data, then you can just loop across it using Query Console.
Regardless of how you apply the transformation code, dirkk's answer shows how you need to change it. If you are updating content already in your database, you'll xdmp:node-delete() the original Status and date elements and xdmp:node-insert-child() the new ones.
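For the in-database case, the update could be sketched like this (untested, the document URI is made up, and it assumes one Status/date pair per document):

```xquery
let $root := fn:doc("/example.xml")/element()
let $dates := fn:tokenize($root/date, "\s+")
return (
  for $status at $pos in fn:tokenize($root/Status, "\s+")
  return (
    xdmp:node-insert-child($root, <status>{$status}</status>),
    xdmp:node-insert-child($root, <date>{$dates[$pos]}</date>)
  ),
  xdmp:node-delete($root/Status),
  xdmp:node-delete($root/date)
)
```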

Is there a way to output line separator values in sqlite text output

I've got text like this in one of my sqlite table columns:
Mantas are found in temperate, subtropical and tropical waters. Both species are pelagic; M. birostris migrates across open oceans, singly or in groups, while M. alfredi tends to be resident and coastal. They are filter feeders and eat large quantities of zooplankton, which they swallow with their open mouths as they swim. Gestation lasts over a year, producing live pups.
Mantas may visit cleaning stations for the removal of parasites. Like whales, they breach, for unknown reasons.
The last two lines are broken from the previous one by either a \r or \n. I want to be able to see the actual value of \r or \n in the shell output of the column. Any ideas?
There doesn't seem to be any way to directly do this in the SQLite shell, but you can use .output <file> to output the result to a file, then use a text or hex editor to see what the line endings are.
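As a sketch of that workflow (the file name and sample text are made up): after ".output dump.txt" and running the query, the exported bytes can be inspected with any hex viewer, or for instance with Python's repr(), which shows control characters literally:

```python
sample = "They are filter feeders.\r\nLike whales, they breach."

# simulate the exported file (newline="" keeps \r\n bytes untouched)
with open("dump.txt", "w", newline="") as f:
    f.write(sample)

raw = open("dump.txt", "rb").read()
print(repr(raw))  # \r and \n show up literally in the repr
```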
