Suppose I have a JSON:
[
{
"title": "Title1",
"reference": [
"123"
]
},
{
"title": "Title2",
"reference": [
"234",
"345"
]
}
]
I'd like to modify each element of the reference array so that the reference appears twice. I'd like to achieve:
[
{
"title": "Title1",
"reference": [
"123 is 123"
]
},
{
"title": "Title2",
"reference": [
"234 is 234",
"345 is 345"
]
}
]
I've tried:
jq '.[] | .reference = [("\(.reference[]) is \(.reference[])")]'
but this fails where the array has more than one item:
{
"title": "Title1",
"reference": [
"123 is 123"
]
}
{
"title": "Title2",
"reference": [
"234 is 234",
"345 is 234",
"234 is 345",
"345 is 345"
]
}
How can I modify the above jq to achieve the desired result?
Thanks in advance!
map(.reference |= map(. + " is " + .))
This changes each element of .reference to "<element> is <element>":
[
{
"title": "Title1",
"reference": [
"123 is 123"
]
},
{
"title": "Title2",
"reference": [
"234 is 234",
"345 is 345"
]
}
]
This should work just fine:
jq '.[].reference[] |= "\(.) is \(.)"'
It replaces every item of the reference arrays with a string that contains the item twice, separated by the word "is".
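For readers who think in Python, the difference between the attempt in the question and the two answers can be sketched roughly as below. This is only an illustration of the logic, not something jq does internally: using .reference[] twice in one string iterates the array twice independently, so every combination is produced, while map and |= visit each item exactly once. The function names are made up for this example.

```python
from itertools import product

data = [
    {"title": "Title1", "reference": ["123"]},
    {"title": "Title2", "reference": ["234", "345"]},
]

def cross_product_refs(refs):
    # Mirrors "\(.reference[]) is \(.reference[])": two independent iterations.
    # The first interpolation varies fastest, matching the output in the question.
    return [f"{a} is {b}" for b, a in product(refs, refs)]

def doubled_refs(refs):
    # Mirrors map(. + " is " + .): each item is combined only with itself.
    return [f"{r} is {r}" for r in refs]

# Equivalent of map(.reference |= map(. + " is " + .)) on the whole document.
fixed = [{**item, "reference": doubled_refs(item["reference"])} for item in data]
```

Here cross_product_refs reproduces the four unwanted strings for Title2, and fixed holds the desired result.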
What I need is simply to extract id and sales from sorted_employees and the full name for that employee from employees, like this:
{ "Full name": "John Doe", "ID": "employee1", "Sales": "26" }
{ "Full name": "Sam Jones", "ID": "employee2", "Sales": "119" }
What would be the easiest way to combine these two arrays (employees and sorted_employees)? I have problems merging the results. I have tried to cast the results into an array, and I have tried using only jq filters, but neither gives me what I need.
employees_id=$( echo "${json_data}" | jq -r '.employees[] | .id' )
sorted_employees_id=$( echo "${json_data}" | jq -r '.sorted_employees[] | .id' )
sorted_employees_sales=$( echo "${json_data}" | jq '.sorted_employees[] | .sales' )
{
"employees": [ {
"started_at": "2018-05-01 12.00",
"id": "employee1",
"facebook": "https://fb/john_doe",
"full name": "John Doe"
}, {
"started_at": "2017-05-01 12.00",
"id": "employee2",
"facebook": "https://fb/sam_jones",
"full name": "Sam Jones"
}, {
"started_at": "2016-05-01 12.00",
"id": "employee3",
"facebook": "https://fb/jane_roe",
"full name": "Jane Roe"
}],
"sorted_employees": [{
"id": "employee1",
"sales": 26
}, {
"id": "employee2",
"sales": 119
}, {
"id": "employee3",
"sales": 84
}]
}
You could combine the two arrays, group by the common id field, and form the desired output object:
jq '.employees + .sorted_employees | group_by(.id) |
map({"Full name": .[0]."full name", ID: .[0].id, "Sales": .[1].sales})'
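If it helps to see the same join outside of jq, here is a minimal Python sketch of the idea. It assumes every id in sorted_employees also appears in employees (otherwise the name comes back as None), and the function name join_sales is just something I picked for the example.

```python
def join_sales(data):
    # Index the employees by id so each sales row can be matched directly.
    names = {e["id"]: e["full name"] for e in data["employees"]}
    return [
        {"Full name": names.get(row["id"]), "ID": row["id"], "Sales": row["sales"]}
        for row in data["sorted_employees"]
    ]

sample = {
    "employees": [
        {"id": "employee1", "full name": "John Doe"},
        {"id": "employee2", "full name": "Sam Jones"},
        {"id": "employee3", "full name": "Jane Roe"},
    ],
    "sorted_employees": [
        {"id": "employee1", "sales": 26},
        {"id": "employee2", "sales": 119},
        {"id": "employee3", "sales": 84},
    ],
}

for row in join_sales(sample):
    print(row)
```

One difference from the group_by approach: this keeps the order of sorted_employees, while group_by returns the groups sorted by id.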
My goal is to add Weaviate support to the pyLoDStorage project.
Specifically, I'd like to use the sample data from:
https://github.com/WolfgangFahl/pyLoDStorage/blob/master/lodstorage/sample.py
which has as examples:
a few records of Persons from the Royal family
a city list with a few thousand entries
an artificial list of records with as many records as you wish
All data is tabular. Some basic python types like:
str
bool
int
float
date
datetime
need to be supported.
I created the project http://wiki.bitplan.com/index.php/DgraphAndWeaviateTest and a script to run Weaviate via docker compose. There is a Python unit test which used to work with the Weaviate Python client 0.4.1.
I am trying to use the information from https://www.semi.technology/documentation/weaviate/current/how-tos/how-to-create-a-schema.html to refactor this unit test but don't know how to do it.
What needs to be done to get the CRUD tests running as e.g. in the other three tests:
https://github.com/WolfgangFahl/pyLoDStorage/tree/master/tests
for
JSON
SPARQL
SQL
I am especially interested in the "round-trip" handling of a list of dicts (aka "Table") with the standard data types above. So I'd like to create a list of dicts and then:
derive the schema automatically by looking at some sample records
check if the schema already exists and, if so, delete it
create the schema
check if the data already exists and, if so, delete it
add the data and store it
optionally store the schema for further reference
restore the data with or without using the schema information
check that the restored data (list of dicts) is the same as the original data
'''
Created on 2020-07-24
#author: wf
'''
import unittest
import weaviate
import time
#import getpass

class TestWeaviate(unittest.TestCase):
    # https://www.semi.technology/documentation/weaviate/current/client-libs/python.html

    def setUp(self):
        self.port=8153
        self.host="localhost"
        #if getpass.getuser()=="wf":
        #    self.host="zeus"
        #    self.port=8080
        pass

    def getClient(self):
        self.client=weaviate.Client("http://%s:%d" % (self.host,self.port))
        return self.client

    def tearDown(self):
        pass

    def testRunning(self):
        '''
        make sure weaviate is running
        '''
        w=self.getClient()
        self.assertTrue(w.is_live())
        self.assertTrue(w.is_ready())

    def testWeaviateSchema(self):
        ''' see https://www.semi.technology/documentation/weaviate/current/client-libs/python.html '''
        w = self.getClient()
        #contains_schema = w.schema.contains()
        try:
            w.create_schema("https://raw.githubusercontent.com/semi-technologies/weaviate-python-client/master/documentation/getting_started/people_schema.json")
        except:
            pass
        entries=[
            [ {"name": "John von Neumann"}, "Person", "b36268d4-a6b5-5274-985f-45f13ce0c642"],
            [ {"name": "Alan Turing"}, "Person", "1c9cd584-88fe-5010-83d0-017cb3fcb446"],
            [ {"name": "Legends"}, "Group", "2db436b5-0557-5016-9c5f-531412adf9c6" ]
        ]
        for entry in entries:
            dict,type,uid=entry
            try:
                w.create(dict,type,uid)
            except weaviate.exceptions.ThingAlreadyExistsException as taee:
                print ("%s already created" % dict['name'])
        pass

    def testPersons(self):
        return
        w = self.getClient()
        schema = {
            "actions": {"classes": [],"type": "action"},
            "things": {"classes": [{
                "class": "Person",
                "description": "A person such as humans or personality known through culture",
                "properties": [
                    {
                        "cardinality": "atMostOne",
                        "dataType": ["text"],
                        "description": "The name of this person",
                        "name": "name"
                    }
                ]}],
                "type": "thing"
            }
        }
        w.create_schema(schema)
        w.create_thing({"name": "Andrew S. Tanenbaum"}, "Person")
        w.create_thing({"name": "Alan Turing"}, "Person")
        w.create_thing({"name": "John von Neumann"}, "Person")
        w.create_thing({"name": "Tim Berners-Lee"}, "Person")

    def testEventSchema(self):
        '''
        https://stackoverflow.com/a/63077495/1497139
        '''
        return
        schema = {
            "things": {
                "type": "thing",
                "classes": [
                    {
                        "class": "Event",
                        "description": "event",
                        "properties": [
                            {
                                "name": "acronym",
                                "description": "acronym",
                                "dataType": [
                                    "text"
                                ]
                            },
                            {
                                "name": "inCity",
                                "description": "city reference",
                                "dataType": [
                                    "City"
                                ],
                                "cardinality": "many"
                            }
                        ]
                    },
                    {
                        "class": "City",
                        "description": "city",
                        "properties": [
                            {
                                "name": "name",
                                "description": "name",
                                "dataType": [
                                    "text"
                                ]
                            },
                            {
                                "name": "hasEvent",
                                "description": "event references",
                                "dataType": [
                                    "Event"
                                ],
                                "cardinality": "many"
                            }
                        ]
                    }
                ]
            }
        }
        client = self.getClient()
        if not client.contains_schema():
            client.create_schema(schema)
        event = {"acronym": "example"}
        client.create(event, "Event", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde")
        city = {"name": "Amsterdam"}
        client.create(city, "City", "c60505f9-8271-4eec-b998-81d016648d85")
        time.sleep(2.0)
        client.add_reference("c60505f9-8271-4eec-b998-81d016648d85", "hasEvent", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde")

if __name__ == "__main__":
    #import sys;sys.argv = ['', 'Test.testName']
    unittest.main()
The unit test for the connection, schema and data objects you show above works like this with the Python client v1.x (see the inline comments for what's changed):
import unittest
import weaviate
import time
#import getpass

class TestWeaviate(unittest.TestCase):
    # https://www.semi.technology/documentation/weaviate/current/client-libs/python.html

    def setUp(self):
        self.port=8153
        self.host="localhost"
        #if getpass.getuser()=="wf":
        #    self.host="zeus"
        #    self.port=8080
        pass

    def getClient(self):
        self.client=weaviate.Client("http://%s:%d" % (self.host,self.port))
        return self.client

    def tearDown(self):
        pass

    def testRunning(self):
        '''
        make sure weaviate is running
        '''
        w=self.getClient()
        self.assertTrue(w.is_live())
        self.assertTrue(w.is_ready())

    def testWeaviateSchema(self):
        ''' see https://www.semi.technology/documentation/weaviate/current/client-libs/python.html '''
        w = self.getClient()
        #contains_schema = w.schema.contains()
        try:
            # instead of w.create_schema, see https://www.semi.technology/documentation/weaviate/current/how-tos/how-to-create-a-schema.html#creating-your-first-schema-with-the-python-client
            w.schema.create("https://raw.githubusercontent.com/semi-technologies/weaviate-python-client/master/documentation/getting_started/people_schema.json")
        except:
            pass
        entries=[
            [ {"name": "John von Neumann"}, "Person", "b36268d4-a6b5-5274-985f-45f13ce0c642"],
            [ {"name": "Alan Turing"}, "Person", "1c9cd584-88fe-5010-83d0-017cb3fcb446"],
            [ {"name": "Legends"}, "Group", "2db436b5-0557-5016-9c5f-531412adf9c6" ]
        ]
        for entry in entries:
            dict,type,uid=entry
            try:
                # instead of w.create(dict,type,uid), see https://www.semi.technology/documentation/weaviate/current/restful-api-references/semantic-kind.html#example-request-1
                w.data_object.create(dict,type,uid)
            except weaviate.exceptions.ThingAlreadyExistsException as taee:
                print ("%s already created" % dict['name'])
        pass

    def testPersons(self):
        return
        w = self.getClient()
        schema = {
            "actions": {"classes": [],"type": "action"},
            "things": {"classes": [{
                "class": "Person",
                "description": "A person such as humans or personality known through culture",
                "properties": [
                    {
                        "cardinality": "atMostOne",
                        "dataType": ["text"],
                        "description": "The name of this person",
                        "name": "name"
                    }
                ]}],
                "type": "thing"
            }
        }
        w.schema.create(schema) # instead of w.create_schema(schema)
        # instead of w.create_thing({"name": "Andrew S. Tanenbaum"}, "Person")
        w.data_object.create({"name": "Andrew S. Tanenbaum"}, "Person")
        w.data_object.create({"name": "Alan Turing"}, "Person")
        w.data_object.create({"name": "John von Neumann"}, "Person")
        w.data_object.create({"name": "Tim Berners-Lee"}, "Person")

    def testEventSchema(self):
        '''
        https://stackoverflow.com/a/63077495/1497139
        '''
        return
        schema = {
            "things": {
                "type": "thing",
                "classes": [
                    {
                        "class": "Event",
                        "description": "event",
                        "properties": [
                            {
                                "name": "acronym",
                                "description": "acronym",
                                "dataType": [
                                    "text"
                                ]
                            },
                            {
                                "name": "inCity",
                                "description": "city reference",
                                "dataType": [
                                    "City"
                                ],
                                "cardinality": "many"
                            }
                        ]
                    },
                    {
                        "class": "City",
                        "description": "city",
                        "properties": [
                            {
                                "name": "name",
                                "description": "name",
                                "dataType": [
                                    "text"
                                ]
                            },
                            {
                                "name": "hasEvent",
                                "description": "event references",
                                "dataType": [
                                    "Event"
                                ],
                                "cardinality": "many"
                            }
                        ]
                    }
                ]
            }
        }
        client = self.getClient()
        if not client.contains_schema():
            client.schema.create(schema) # instead of client.create_schema(schema)
        event = {"acronym": "example"}
        # instead of client.create(event, "Event", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde")
        client.data_object.create(event, "Event", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde")
        city = {"name": "Amsterdam"}
        client.data_object.create(city, "City", "c60505f9-8271-4eec-b998-81d016648d85")
        time.sleep(2.0)
        # instead of client.add_reference("c60505f9-8271-4eec-b998-81d016648d85", "hasEvent", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde"), see https://www.semi.technology/documentation/weaviate/current/restful-api-references/semantic-kind.html#add-a-cross-reference
        client.data_object.reference.add("c60505f9-8271-4eec-b998-81d016648d85", "hasEvent", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde")

if __name__ == "__main__":
    #import sys;sys.argv = ['', 'Test.testName']
    unittest.main()
There's no support yet for automatically deriving a schema from a list of dicts (or other formats). As you mention, this could be a good convenience feature, so we will add it to Weaviate's feature suggestions!
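Until something like that exists, the derivation step from the question could be sketched by hand along these lines. This is not part of the Weaviate client. derive_class is a hypothetical helper, and the mapping from Python types to the dataType names "text", "int", "number", "boolean" and "date" is my assumption about which Weaviate types fit best, so please check it against the schema documentation for your version. The result is a single class definition in the same shape as the entries of "classes" in the schemas above.

```python
from datetime import date, datetime

# Assumed mapping from Python types to Weaviate dataType names.
# bool has to be checked before int because bool is a subclass of int.
TYPE_MAP = [
    (bool, "boolean"),
    (int, "int"),
    (float, "number"),
    (str, "text"),
    (date, "date"),  # datetime is a subclass of date, so it matches here too
]

def derive_class(class_name, records, sample_size=10):
    """Derive a class definition from the first sample_size records (hypothetical helper)."""
    properties = {}
    for record in records[:sample_size]:
        for key, value in record.items():
            if value is None or key in properties:
                continue  # no type information, or the type is already known
            for py_type, weaviate_type in TYPE_MAP:
                if isinstance(value, py_type):
                    properties[key] = weaviate_type
                    break
    return {
        "class": class_name,
        "description": class_name,
        "properties": [
            {"name": name, "description": name, "dataType": [data_type]}
            for name, data_type in properties.items()
        ],
    }

sample_records = [
    {"name": "Elizabeth", "born": date(1926, 4, 21), "age": 94, "ofAge": True,
     "height": 1.63, "lastUpdate": datetime(2020, 7, 24, 12, 0)},
]
person_class = derive_class("Person", sample_records)
```

Values whose type is not in the mapping are simply skipped here; a real implementation would probably want to report them instead.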
A new version of Weaviate is now available (v1.2.1 is the latest release at the time of writing). With this version a lot of things were removed and even more were added. One of the major breaking changes is that actions and things were removed and objects were introduced instead. All the changes and features of Weaviate v1.2 can be used with the weaviate-client Python library v2.3.
Most of the current weaviate-client functionality is explained and demonstrated in this article.
Here are the same unit tests for Weaviate v1.2.1, written using weaviate-client v2.3.1:
import unittest
import weaviate
import time
#import getpass

person_schema = {
    "classes": [
        {
            "class": "Person",
            "description": "A person such as humans or personality known through culture",
            "properties": [
                {
                    "name": "name",
                    "description": "The name of this person",
                    "dataType": ["text"]
                }
            ]
        },
        {
            "class": "Group",
            "description": "A set of persons who are associated with each other over some common properties",
            "properties": [
                {
                    "name": "name",
                    "description": "The name under which this group is known",
                    "dataType": ["text"]
                },
                {
                    "name": "members",
                    "description": "The persons that are part of this group",
                    "dataType": ["Person"]
                }
            ]
        }
    ]
}

class TestWeaviate(unittest.TestCase):
    # NEW link to the page
    # https://www.semi.technology/developers/weaviate/current/client-libraries/python.html

    def setUp(self):
        self.port=8080
        self.host="localhost"
        #if getpass.getuser()=="wf":
        #    self.host="zeus"
        #    self.port=8080
        pass

    def getClient(self):
        self.client=weaviate.Client("http://%s:%d" % (self.host,self.port))
        return self.client

    def tearDown(self):
        pass

    def testRunning(self):
        '''
        make sure weaviate is running
        '''
        w=self.getClient()
        self.assertTrue(w.is_live())
        self.assertTrue(w.is_ready())

    def testWeaviateSchema(self):
        # NEW link to the page
        # https://www.semi.technology/developers/weaviate/current/client-libraries/python.html
        w = self.getClient()
        #contains_schema = w.schema.contains()
        # it is a good idea to check if Weaviate has a schema already when testing, otherwise it will result in an error
        # this way you know for sure that your current schema is known to weaviate.
        if w.schema.contains():
            # delete the existing schema, (removes all the data objects too)
            w.schema.delete_all()
        # instead of w.create_schema(person_schema)
        w.schema.create(person_schema)
        entries=[
            [ {"name": "John von Neumann"}, "Person", "b36268d4-a6b5-5274-985f-45f13ce0c642"],
            [ {"name": "Alan Turing"}, "Person", "1c9cd584-88fe-5010-83d0-017cb3fcb446"],
            [ {"name": "Legends"}, "Group", "2db436b5-0557-5016-9c5f-531412adf9c6" ]
        ]
        for entry in entries:
            dict,type,uid=entry
            try:
                # instead of w.create(dict,type,uid), see https://www.semi.technology/developers/weaviate/current/restful-api-references/objects.html#create-a-data-object
                w.data_object.create(dict,type,uid)
            # ObjectAlreadyExistsException is the correct exception starting weaviate-client 2.0.0
            except weaviate.exceptions.ObjectAlreadyExistsException as taee:
                print ("%s already created" % dict['name'])
        pass

    def testPersons(self):
        return
        w = self.getClient()
        schema = {
            #"actions": {"classes": [],"type": "action"}, `actions` and `things` were removed in weaviate v1.0 and removed in weaviate-client v2.0
            # Now there is only `objects`
            "classes": [
                {
                    "class": "Person",
                    "description": "A person such as humans or personality known through culture",
                    "properties": [
                        {
                            #"cardinality": "atMostOne", were removed in weaviate v1.0 and weaviate-client v2.0
                            "dataType": ["text"],
                            "description": "The name of this person",
                            "name": "name"
                        }
                    ]
                }
            ]
        }
        # instead of w.create_schema(schema)
        w.schema.create(schema)
        # instead of w.create_thing({"name": "Andrew S. Tanenbaum"}, "Person")
        w.data_object.create({"name": "Andrew S. Tanenbaum"}, "Person")
        w.data_object.create({"name": "Alan Turing"}, "Person")
        w.data_object.create({"name": "John von Neumann"}, "Person")
        w.data_object.create({"name": "Tim Berners-Lee"}, "Person")

    def testEventSchema(self):
        '''
        https://stackoverflow.com/a/63077495/1497139
        '''
        return
        schema = {
            # "things": { , were removed in weaviate v1.0 and weaviate-client v2.0
            # "type": "thing", was removed in weaviate v1.0 and weaviate-client v2.0
            "classes": [
                {
                    "class": "Event",
                    "description": "event",
                    "properties": [
                        {
                            "name": "acronym",
                            "description": "acronym",
                            "dataType": [
                                "text"
                            ]
                        },
                        {
                            "name": "inCity",
                            "description": "city reference",
                            "dataType": [
                                "City"
                            ],
                            # "cardinality": "many", were removed in weaviate v1.0 and weaviate-client v2.0
                        }
                    ]
                },
                {
                    "class": "City",
                    "description": "city",
                    "properties": [
                        {
                            "name": "name",
                            "description": "name",
                            "dataType": [
                                "text"
                            ]
                        },
                        {
                            "name": "hasEvent",
                            "description": "event references",
                            "dataType": [
                                "Event"
                            ],
                            # "cardinality": "many", were removed in weaviate v1.0 and weaviate-client v2.0
                        }
                    ]
                }
            ]
        }
        client = self.getClient()
        # this test is going to fail if you are using the same Weaviate instance
        # We already created a schema in the test above so the new schema is not going to be created
        # and will result in an error.
        # we can delete the schema and create a new one.
        # instead of client.contains_schema()
        if client.schema.contains():
            # delete the existing schema, (removes all the data objects too)
            client.schema.delete_all()
        # instead of client.create_schema(schema)
        client.schema.create(schema)
        event = {"acronym": "example"}
        # instead of client.create(...)
        client.data_object.create(event, "Event", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde")
        city = {"name": "Amsterdam"}
        client.data_object.create(city, "City", "c60505f9-8271-4eec-b998-81d016648d85")
        time.sleep(2.0)
        # instead of client.add_reference(...), see https://www.semi.technology/developers/weaviate/current/restful-api-references/objects.html#cross-references
        client.data_object.reference.add("c60505f9-8271-4eec-b998-81d016648d85", "hasEvent", "2a8d56b7-2dd5-4e68-aa40-53c9196aecde")

if __name__ == "__main__":
    #import sys;sys.argv = ['', 'Test.testName']
    unittest.main()
I'm walking through a huge JSONL file (100G, 100M rows) line by line extracting two key values from the data. Ideally, I want this written to a file with two columns. I'm a real beginner here.
Here is an example of the JSON on each row of the file referenced on my C drive:
https://api.unpaywall.org/v2/10.6118/jmm.2017.23.2.135?email=YOUR_EMAIL
or:
{
"best_oa_location": {
"evidence": "open (via page says license)",
"host_type": "publisher",
"is_best": true,
"license": "cc-by-nc",
"pmh_id": null,
"updated": "2018-02-14T11:18:21.978814",
"url": "FAKEURL",
"url_for_landing_page": "URL2",
"url_for_pdf": "URL4",
"version": "publishedVersion"
},
"data_standard": 2,
"doi": "10.6118/jmm.2017.23.2.135",
"doi_url": "URL5",
"genre": "journal-article",
"is_oa": true,
"journal_is_in_doaj": false,
"journal_is_oa": false,
"journal_issns": "2288-6478,2288-6761",
"journal_name": "Journal of Menopausal Medicine",
"oa_locations": [
{
"evidence": "open (via page says license)",
"host_type": "publisher",
"is_best": true,
"license": "cc-by-nc",
"pmh_id": null,
"updated": "2018-02-14T11:18:21.978814",
"url": "URL6",
"url_for_landing_page": "URL7",
"url_for_pdf": "URL8",
"version": "publishedVersion"
},
{
"evidence": "oa repository (via OAI-PMH doi match)",
"host_type": "repository",
"is_best": false,
"license": "cc-by-nc",
"pmh_id": "oai:pubmedcentral.nih.gov:5606912",
"updated": "2017-10-21T18:12:39.724143",
"url": "URL9",
"url_for_landing_page": "URL11",
"url_for_pdf": "URL12",
"version": "publishedVersion"
},
{
"evidence": "oa repository (via pmcid lookup)",
"host_type": "repository",
"is_best": false,
"license": null,
"pmh_id": null,
"updated": "2018-10-11T01:49:34.280389",
"url": "URL13",
"url_for_landing_page": "URL14",
"url_for_pdf": null,
"version": "publishedVersion"
}
],
"published_date": "2017-01-01",
"publisher": "The Korean Society of Menopause (KAMJE)",
"title": "A Case of Granular Cell Tumor of the Clitoris in a Postmenopausal Woman",
"updated": "2018-06-20T20:31:37.509896",
"year": 2017,
"z_authors": [
{
"affiliation": [
{
"name": "Department of Obstetrics and Gynecology, Soonchunhyang University Cheonan Hospital, University of Soonchunhyang College of Medicine, Cheonan, Korea."
}
],
"family": "Min",
"given": "Ji-Won"
},
{
"affiliation": [
{
"name": "Department of Obstetrics and Gynecology, Soonchunhyang University Cheonan Hospital, University of Soonchunhyang College of Medicine, Cheonan, Korea."
}
],
"family": "Kim",
"given": "Yun-Sook"
}
]
}
Here's the code I wrote:
library (magrittr)
library (jqr)
con = file("C:/users/ME/desktop/miniunpaywall.jsonl", "r");
while ( length(line <- readLines(con, n = -1)) > 0) {
write.table( line %>% jq ('.doi,.best_oa_location.license'), file='test.txt', quote=FALSE, row.names=FALSE);}
What results from this is a line of text for each row of JSON that looks like this:
"10.1016/j.ijcard.2018.10.014,CC-BY"
This is effectively:
"[DOI],[LICENSE]"
I want ideally to have the output be:
[DOI] tab [LICENSE]
I believe my problem is that I'm writing the values as a string into a single column when I say:
write.table( line %>% jq ('.doi,.best_oa_location.license')
I haven't figured out a way to remove the quotes I'm getting around each line in my file, or how I could separate the two values with a tab. I feel I'm pretty close. Help!
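Not an answer in R, but in case another language is an option, here is a minimal Python sketch of the same job. It reads the file one line at a time, so the whole 100G never has to sit in memory, and it writes the two values separated by a tab without quotes. The function name jsonl_to_tsv is made up for this example, and writing an empty string when doi or license is missing or null is my assumption about what you would want in that case.

```python
import json

def jsonl_to_tsv(src_path, dst_path):
    """Write one 'doi<TAB>license' line per JSON record; returns the number of records."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # iterating the file object reads one line at a time
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            doi = record.get("doi") or ""
            # best_oa_location can be null, so fall back to an empty dict
            best = record.get("best_oa_location") or {}
            license_name = best.get("license") or ""
            dst.write(f"{doi}\t{license_name}\n")
            count += 1
    return count
```

With the paths from the question this would be called as jsonl_to_tsv("C:/users/ME/desktop/miniunpaywall.jsonl", "test.txt").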
For an input file that looks like this:
{
"employees": [
{
"number": "101",
"tags": [
{
"value": "yes",
"key": "management"
},
{
"value": "joe",
"key": "login"
},
{
"value": "joe blogs",
"key": "name"
}
]
},
{
"number": "102",
"tags": [
{
"value": "no",
"key": "management"
},
{
"value": "jane",
"key": "login"
},
{
"value": "jane doe",
"key": "name"
}
]
},
{
"number": "103",
"tags": [
{
"value": "no",
"key": "management"
},
{
"value": "john",
"key": "login"
},
{
"value": "john doe",
"key": "name"
}
]
}
]
}
... I'd like to get details for all non-management employees so that the desired output looks like this:
{
"number": "102",
"name": "jane doe",
"login": "jane"
}
{
"number": "103",
"name": "john doe",
"login": "john"
}
I can't figure out how to limit results based on a key without selecting that key (in this case "management").
The following is a slightly more succinct solution:
.employees[]
| .tags |= from_entries
| select(.tags.management == "no")
| {number, "name": .tags.name, "login": .tags.login}
Using from_entries, this worked for me:
$ jq '.employees[] | {number: .number, tags: .tags | from_entries} | select(.tags.management=="no") | {number: .number, name: .tags.name, login: .tags.login}' input
... and the output is:
{
"number": "102",
"name": "jane doe",
"login": "jane"
}
{
"number": "103",
"name": "john doe",
"login": "john"
}
There may be a better way to achieve what I wanted, so I'll leave the question open for a while if someone wants to offer a better solution.
Here is another solution which uses from_entries:
.employees[]
| {number} + (.tags | from_entries)
| if .management == "no" then {number, name, login} else empty end
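For comparison, the idea behind all three filters (turn the key/value tag list into a lookup, filter on it, then pick the fields you want) looks roughly like this in Python. This is only a sketch of the logic; non_management is a name invented for the example, and the sample data is a shortened copy of the input above.

```python
def non_management(data):
    result = []
    for employee in data["employees"]:
        # Same role as jq's from_entries: [{"key": k, "value": v}, ...] -> {k: v}
        tags = {tag["key"]: tag["value"] for tag in employee["tags"]}
        if tags.get("management") == "no":
            # "management" is used for filtering but not copied to the output.
            result.append({
                "number": employee["number"],
                "name": tags.get("name"),
                "login": tags.get("login"),
            })
    return result

sample = {
    "employees": [
        {"number": "101", "tags": [
            {"value": "yes", "key": "management"},
            {"value": "joe", "key": "login"},
            {"value": "joe blogs", "key": "name"}]},
        {"number": "102", "tags": [
            {"value": "no", "key": "management"},
            {"value": "jane", "key": "login"},
            {"value": "jane doe", "key": "name"}]},
    ]
}

for row in non_management(sample):
    print(row)
```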