I am copying parts of the Simple Semantic Search sample application at https://github.com/vespa-engine/sample-apps/tree/master/simple-semantic-search to get started with dense vector search.
I have indexed our website, dividing every page into paragraph-sized docs. Some docs consist only of a person's name (a single <div> on the website).
With many queries these very short docs get ranked on top, although there is no apparent similarity. Querying for "teacher" gives the results below. Why do "Kelly Tracey" and "Luke Hanley" have such high similarity?
Doc                        Relevance score
Professor Jake Dalton      0.4810788561826608
Kelly Tracey               0.4618036348887372
Prof. Sarah Jacoby         0.4605411864409834
Luke Hanley                0.45709536853590715
Dr. Elizabeth McDougal     0.4570338357051837
Casey Kemp                 0.4508383490617062
I removed the bm25 part of the ranker for testing
rank-profile simple_semantic inherits default {
    inputs {
        query(e) tensor<float>(x[384])
    }
    first-phase {
        expression: closeness(field, myEmbedding)
    }
}
Query
params = {
    "yql": "select * from kvp_semantic_2 where {targetHits: 100}nearestNeighbor(myEmbedding, e)",
    "input.query(e)": 'embed({"teacher"})',
    "ranking.profile": "simple_semantic",
    "hits": 10
}
The component in services.xml is straight from the sample app
<component id="bert" class="ai.vespa.embedding.BertBaseEmbedder" bundle="model-integration">
    <config name="embedding.bert-base-embedder">
        <transformerModel path="model/minilm-l6-v2.onnx"/>
        <tokenizerVocab path="model/bert-base-uncased.txt"/>
    </config>
</component>
The same happens with many other queries, like "biography", but not with some, like "translator".
The model here is just 90 MB. I don't think you can expect it to contain information about which individual humans are teachers or the like.
When you query for "teacher", the six docs you retrieve are all names of humans, and at least two of them are even professors. I think that's pretty good.
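For what it's worth, rather than dropping bm25 entirely, you can combine the two signals in one profile, as the sample app's hybrid profile does. A sketch along those lines (the `content` field name is an assumption here; substitute whatever indexed text field your schema actually has):

```
rank-profile hybrid inherits default {
    inputs {
        query(e) tensor<float>(x[384])
    }
    first-phase {
        expression: closeness(field, myEmbedding) + bm25(content)
    }
}
```

The lexical bm25 term tends to push down docs that match only in the embedding space, which may help with the short name-only docs.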
I am using the newsanchor package in R to try to extract entire article content via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a dataframe full of info on (at most) 100 articles. However, these do not contain the entire article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation, but I am not really sure how.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, the Response object section says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is restricted to only 260 characters. However, test$url has the link to the source article, which you can use to scrape the entire content; but since the articles are aggregated from various sources, I don't think there is one automated way to do this.
I am new to MarkLogic. I need to get the total count of books from the following XML. Can anyone suggest a solution?
<bk:bookstore xmlns:bk="http://www.bookstore.org">
<bk:book category='Computer'>
<bk:author>Gambardella, Matthew</bk:author>
<bk:title>XML Developer's Guide</bk:title>
<bk:price>44.95</bk:price>
<bk:publish_year>1995</bk:publish_year>
<bk:description>An in-depth look at creating applications with XML.
</bk:description>
</bk:book>
<bk:book category='Fantasy'>
<bk:author>Ralls, Kim</bk:author>
<bk:title>Midnight Rain</bk:title>
<bk:price>5.95</bk:price>
<bk:publish_year>2000</bk:publish_year>
<bk:description>A former architect battles corporate zombies, an evil
sorceress, and her own childhood to become queen of the world.
</bk:description>
</bk:book>
<bk:book category='Comic'>
<bk:author>Robert M. Overstreet</bk:author>
<bk:title>The Overstreet Indian Arrowheads Identification </bk:title>
<bk:price>2000</bk:price>
<bk:publish_year>1991</bk:publish_year>
<bk:description>A leading expert and dedicated collector, Robert M.
Overstreet has been writing The Official Overstreet Identification and
Price
Guide to Indian Arrowheads for more than 21 years</bk:description>
</bk:book>
<bk:book category='Comic'>
<bk:author>Randall Fuller</bk:author>
<bk:title>The Book That Changed America</bk:title>
<bk:price>1000</bk:price>
<bk:publish_year>2017</bk:publish_year>
<bk:description>The New York Times Book Review Throughout its history
America has been torn in two by debates over ideals and beliefs.
</bk:description>
</bk:book>
</bk:bookstore>
Can anyone find the solution for this, as I am new to this?
I'd suggest using cts:count-aggregate in combination with cts:element-reference. This requires you to have an element range index on book.
cts:count-aggregate(cts:element-reference(fn:QName("http://www.bookstore.org", "book")))
If performance isn't too critical and your document count isn't too large, you could also count with fn:count.
declare namespace bk="http://www.bookstore.org";
fn:count(//bk:book)
Try this:
declare namespace bk="http://www.bookstore.org";
let $book_xml :=
    <bk:bookstore xmlns:bk="http://www.bookstore.org">
        <bk:book>
        ........
        </bk:book>
        ........
    </bk:bookstore>
return fn:count($book_xml//bk:book)
Hope that helps!
What can you do with the Labeled Property Graph model used by companies such as Neo4j that you cannot do with RDF*/SPARQL* as used by BlazeGraph?
What you can do with the store depends, of course, on the stores. And the capabilities of Neo4j and Blazegraph are very different.
Now, if your question is what can I express with RDF-star compared to what the property graph model can express, then this is a valid question, and the answer is they are pretty much equivalent in terms of expressivity.
The property graph uses a higher level of abstraction: its primitives are nodes and relationships, both of which can have internal structure (attributes describing them). RDF, by contrast, breaks every bit of information into triples/statements.
A simple graph describing me posting an answer in StackOverflow can be expressed in Cypher like this:
CREATE
(:Person { uri: 'http://me.com/jb', name: 'JB'})-[:POST { date:'20/06/2021'}]->(:Answer { uri: 'http://stackoverflow.com/answer/1234', text: 'blah blah blah...'})
...and exactly the same information in SPARQL + RDF-star would look like this:
INSERT DATA
{
<http://me.com/jb> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
<http://me.com/jb> <http://schema.org/name> "JB" .
<http://stackoverflow.com/answer/1234> <http://schema.org/text> "blah blah blah..." .
<http://me.com/jb> <http://schema.org/post> <http://stackoverflow.com/answer/1234> .
<http://stackoverflow.com/answer/1234> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Answer> .
<<<http://me.com/jb> <http://schema.org/post> <http://stackoverflow.com/answer/1234>>>
<http://schema.org/dateCreated> "20/06/2021" .
}
So you can express the same, but one is arguably more "data exchange oriented" (RDF), and the other is more "data manipulation / developer oriented".
I am busy comparing SQL Server 2008 R2 and MarkLogic 8 with a simple Person entity.
My dataset is 1 million records/documents for both. Note: both databases are on the same machine (localhost).
The following SQLServer Query is ready in a flash:
set statistics time on
select top 10 FirstName + ' ' + LastName, count(FirstName + ' ' + LastName)
from [Person]
group by FirstName + ' ' + LastName
order by count(FirstName + ' ' + LastName) desc
set statistics time off
Result is:
Richard Petter 421
Mark Petter 404
Erik Petter 400
Arjan Petter 239
Erik Wind 237
Jordi Petter 235
Richard Hilbrink 234
Mark Dominee 234
Richard De Boer 233
Erik Bakker 233
SQL Server Execution Times:
CPU time = 717 ms, elapsed time = 198 ms.
The XQuery on MarkLogic 8 however is much slower:
declare namespace qm="http://marklogic.com/xdmp/query-meters";
(
let $m := map:map()
let $build :=
for $person in collection('/collections/Persons')/Person
let $pname := $person/concat(FirstName/text(), ' ', LastName/text())
return map:put(
$m, $pname, sum((
map:get($m, $pname), 1)))
for $pname in map:keys($m)
order by map:get($m, $pname) descending
return
concat($pname, ' => ', map:get($m, $pname))
)[1 to 10]
,
xdmp:query-meters()/qm:elapsed-time
Result is:
Richard Petter => 421
Mark Petter => 404
Erik Petter => 400
Arjan Petter => 239
Erik Wind => 237
Jordi Petter => 235
Mark Dominee => 234
Richard Hilbrink => 234
Erik Bakker => 233
Richard De Boer => 233
elapsed-time:PT42.797S
198 ms vs. 42 s is, in my opinion, too much of a difference.
The XQuery uses a map to do the group by according to this guide: https://blakeley.com/blogofile/archives/560/
I have two questions:
1. Is the XQuery tunable in any way for better performance?
2. Is XQuery 3.0 with group by already usable on MarkLogic 8?
Thanks for the help!
As @wst said, the challenge with your current implementation is that it loads all the documents in order to pull out the first and last names, adds them up one by one, and then reports the top ten. Instead of doing that, you'll want to make use of indexes.
Let's assume that you have string range indexes set up on FirstName and LastName. In that case, you could run this:
xquery version "1.0-ml";
for $co in
cts:element-value-co-occurrences(
xs:QName("FirstName"),
xs:QName("LastName"),
("frequency-order", "limit=10"))
return
$co/cts:value[1] || ' ' || $co/cts:value[2] || ' => ' || cts:frequency($co)
This uses indexes to find First and Last names in the same document. cts:frequency indicates how often that co-occurrence happens. This is all index driven, so it will be very fast.
First, yes, there are many ways to tune queries in MarkLogic. One obvious way is using range indexes; however, I would highly recommend reading their documentation on the subject first:
https://docs.marklogic.com/guide/performance
For a higher-level look at the database architecture, there is a whitepaper called Inside MarkLogic Server that explains the design extensively:
https://developer.marklogic.com/inside-marklogic
Regarding group by, maybe someone from MarkLogic would like to comment officially, but as I understand it their position is that it's not possible to build a universally high-performing group by, so they choose not to implement it. This puts the responsibility on the developer to understand best practices for writing fast queries.
In your example specifically, it's highly unlikely that Mike Blakeley's map-based group by pattern is the issue. There are several different ways to profile queries in ML, and any of those should lead you to any hot spots. My guess would be that IO overhead to get the Person data is the problem. One common way to solve this is to configure range indexes for FirstName and LastName, and use cts:value-tuples to query them simultaneously from range indexes, which will avoid going to disk for every document not in the cache.
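A minimal sketch of that cts:value-tuples approach, assuming element range indexes on FirstName and LastName (the same indexes David C's answer assumes):

```
xquery version "1.0-ml";

(: Sketch only: requires element range indexes on FirstName and LastName.
   Pulls the 10 most frequent (FirstName, LastName) tuples straight from
   the range indexes, without touching the documents. :)
for $tuple in cts:value-tuples(
    (cts:element-reference(xs:QName("FirstName")),
     cts:element-reference(xs:QName("LastName"))),
    ("frequency-order", "limit=10"))
return
    string-join(json:array-values($tuple), " ")
    || " => " || cts:frequency($tuple)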
Both answers are relevant. But from what you are asking for, David C's answer is closest to what you appear to want. However, this answer assumes that you need to find the combination of first-name and last-name.
If your documents had an identifying, unique field (think primary key), then:
1. Put a range index on that field.
2. Use cts:element-values with the options (fragment-frequency, frequency-order, eager, concurrent, limit=10).
3. The count comes from cts:frequency.
4. Then use the resulting 10 IDs to grab the name information.
If the unique ID field can be an integer, then the speeds are many times faster for the initial search as well.
And it's also possible to use my steps 1–2 to get the list of IDs essentially instantly, and then use them as a limiter on David C's answer, via an element-range-query on the IDs in the 'query' option. This stops you from having to build the name yourself, as in my option, and may speed up David C's approach.
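A sketch of steps 1–2 above (here "PersonId" is a hypothetical unique element; it assumes an element range index configured on it):

```
xquery version "1.0-ml";

(: Hypothetical sketch: "PersonId" is an assumed unique element with a
   range index on it. Returns the 10 most frequent IDs and their counts,
   entirely from the range index. :)
for $id in cts:element-values(
    xs:QName("PersonId"), (),
    ("fragment-frequency", "frequency-order", "eager", "concurrent", "limit=10"))
return concat($id, " => ", cts:frequency($id))
```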
Moral of the story: without first tuning your database for performance (indexes) and using specific queries (range queries, co-occurrence, etc.), the results have no chance of making sense.
Moral of the story, part 2: there are lots of approaches, all relevant and viable, with subtle differences, all depending on your specific data.
The doc is listed below
https://docs.marklogic.com/cts:element-values
I'm the founder of a nonprofit news org that is a member of INN (The Institute for Nonprofit News). Some of us members have our wordpress news sites hosted for free through INN and therefore, cannot just install any plugins we want. INN must approve that first.
So the only way I have been able to use Highcharts is through this tool from Builtvisible that is a Highcharts code generator: https://builtvisible.com/highcharts-generator/
I've made a column chart for a newborn screening story: the top 10 hospitals with the highest percentage of late samples. When I hover over each bar I want the tooltip to display 73%, instead of 73 like it does right now.
Any help would be greatly appreciated. I'm not very knowledgeable about coding and have tried a few things, even contacting the person who built this code generator, but I can't figure out how to make it work. Another column chart has the same issue, except that I need it to separate thousands with a comma instead of a space, so 1,000 reads like that instead of 1 000.
Here is my code from the generator where I need it to display percentages:
<div id="chart_container"></div><script src="//code.highcharts.com/adapters/standalone-framework.js"></script><script src="//code.highcharts.com/highcharts.js"></script><script>new Highcharts.Chart({"chart":{"backgroundColor":"#fefefe","renderTo":"chart_container","type":"column"},"title":{"text":"Top-10 facilities for lateness"},"colors":["#476974","#3ca1c1","#4ccbf4","#96dff6","#c9e8f6"],"legend":{"enabled":true,"margin":30},"xAxis":{"tickWidth":0,"labels":{"y":20},"categories":["Facilities with the highest overall percentage of late samples"]},"yAxis":{"tickWidth":0,"labels":{"y":-5},"title":{"text":"Percentage (%)"}},"series":[{"name":"NE MT HLTH SERVICES WOLF POINT","data":[73]},{"name":"CLARK FORK VALLEY HOSPITAL","data":[71]},{"name":"MARCUS DALY HOSPITAL","data":[68]},{"name":"ST JOSEPH HOSPITAL","data":[65]},{"name":"SIDNEY HEALTH CENTER","data":[64]},{"name":"THE BIRTH CENTER, MISSOULA","data":[63]},{"name":"GLENDIVE MEDICAL CENTER","data":[59]},{"name":"FRANCES MAHON DEAC HOSPITAL","data":[57]},{"name":"ST LUKE HOSPITAL","data":[55]},{"name":"CABINET PEAKS MEDICAL CENTER","data":[53]}]});</script>
Here is the other column chart where I would like the thousands to be separated by a comma when you hover over each column and it brings up the tool tip. This is also code from the Builtvisible Highcharts code generator:
<div id="chart_container2"></div>
<script src="//code.highcharts.com/adapters/standalone-framework.js"></script><script src="//code.highcharts.com/highcharts.js"></script><script>// <![CDATA[
new Highcharts.Chart({"chart":{"backgroundColor":"#fefefe","renderTo":"chart_container2","type":"column"},"title":{"text":"Wisconsin Lab: On-Time Performance"},"colors":["#6AA121","#2069A1","#A12069","#96dff6","#c9e8f6"],"legend":{"enabled":true,"margin":30},"xAxis":{"tickWidth":0,"labels":{"y":20},"categories":["2011","2012","2013","2014"]},"yAxis":{"tickWidth":0,"labels":{"y":-5},"title":{"text":"Number of samples on-time vs. late"}},"series":[{"name":"On-time","data":[7006,5492,5589,7069]},{"name":"Late","data":[3857,4979,5189,4105]},{"name":"Very Late","data":[701,1075,1074,776]}]});
// ]]></script>
This "tool" just generates regular Highcharts code; there is nothing special about its output.
For the first chart you can use valueSuffix to add a percentage sign after the value. Code:
plotOptions: {
column: {
tooltip: {
valueSuffix: '%'
}
}
}
For the second chart you can use the global thousandsSep language option to get comma separators. Code:
Highcharts.setOptions({
lang: {
thousandsSep: ','
}
});
See this JSFiddle demonstration of both code segments.