Is there any way to continue to improve the import speed for NebulaGraph Exchange?

When using NebulaGraph Exchange, I wanted to improve the import performance and adjusted the batch parameter, but the import speed is still not fast enough. Is there any way to improve it further? My tags configuration is as follows:
tags: [
  {
    name: player
    type: {
      source: json
      sink: client
    }
    path: "hdfs://192.168.*.*:9000/data/vertex_player.json"
    fields: [age, name]
    nebula.fields: [age, name]
    vertex: {
      field: id
    }
    batch: 256
    partition: 32
  }
]

You can try adjusting the following parameters:
batch: The number of data records included in each nGQL statement sent to the NebulaGraph service.
partition: The number of Spark data partitions, i.e. the number of concurrent import tasks.
nebula.rate: A token is taken from a token bucket before each request is sent to NebulaGraph. It has two sub-parameters:
limit: The size of the token bucket.
timeout: The timeout period for obtaining a token.
These four values (batch, partition, limit, and timeout) can be adjusted according to the machine's performance. If the leader of the Storage service changes during the import, lower them to reduce the import speed.
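For reference, here is a minimal sketch of where these settings live in the Exchange configuration file (the values are placeholders to be tuned for your environment): batch and partition sit in each tag block as shown above, while the rate settings sit under the nebula block.

nebula: {
  ...
  rate: {
    # The size of the token bucket.
    limit: 1024
    # The timeout period for obtaining a token.
    timeout: 1000
  }
}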

Related

JSON API Standard

I'm not sure which option is better when using the JSON:API standard for communication between backend and frontend. I only need one attribute from the author association, "username", and everything else should be hidden from the user who fetches this resource.
Case a)
data: [
  {
    id: "100",
    type: "resource1",
    attributes: {…},
    relationships: {author: {data: {id: "10", type: "author"}}}
  }
],
included: [
  {
    id: "10",
    type: "author",
    attributes: {username: "name"},
    relationships: {resources1: {data: [{id: "100", type: "resource1"}]}}
  }
]
Case b)
data: [
  {
    id: "100",
    type: "resource1",
    attributes: {authorName: "name", …},
    relationships: {author: {data: {id: "10", type: "author"}}}
  }
],
included: []
Case a) looks more semantic, but it sends much more information in the payload.
Case b) makes it faster to get the one thing I want from the author (the single attribute "username", exposed as the extra attribute "authorName"), and the frontend doesn't have to deal with associations at all.
Any thoughts on which is better practice, and why?
Strictly speaking, both case a and case b are valid per the JSON:API specification.
In case a, username is an attribute of the author resource. In case b, authorName is an attribute of resource1. The author resource may have a username attribute in case b as well; in that case you have duplicated state.
I would recommend only using duplicated state if you have very good reasons. Duplicated state increases complexity, both on the server side and on the client side. Keeping both attributes in sync comes at a high cost. For example, after a successful update request that changes the username of the author resource, you need to tell the client that resource1 changed as well, and the client needs to parse that response and update its local cache.
There are some cases in which duplicating state pays off. Calculated values that would otherwise require a client to fetch many resources are a typical example. For instance, you may decide to introduce an averageRating attribute on a product resource because, without it, a client would need to fetch all related ratings of a product just to calculate it.
Trying to reduce payload size is almost never a good reason to accept the increased complexity. Once you take compression and packet sizes at the network level into account, the raw payload size often doesn't make a big difference.
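If the main concern with case a) is how much of the author resource gets serialized, note that JSON:API also defines sparse fieldsets: a client can ask the server to return only specific fields of a type via the fields[TYPE] query parameter. A sketch using the names from the example above (the URL path is made up):

GET /resources1?include=author&fields[author]=username

included: [
  {
    id: "10",
    type: "author",
    attributes: {username: "name"}
  }
]

That keeps username where it belongs (on the author resource) while trimming the payload, without introducing duplicated state.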

Does Firestore compress docs' data before sending it to the client?

Does Firestore use any kind of compression to send/receive docs?
For example:
someDoc: {
  obj001: {
    rating: "VERY_BAD" | "BAD" | "GOOD" | "VERY_GOOD" // ONE OF THESE VALUES
  },
  obj002: {...},
  // ... 500 OBJECTS IN THIS DOC
  obj500: {...},
}
I'm intentionally storing the rating property as a string, and that will add up as the document grows. Does Firestore apply any kind of compression to those repetitive strings, or will it send the full stringified version over the network without any compression?
I know that the max size for a document is 1MB.
No data compression is performed on the document data by the Firestore clients or servers.
If you'd like to see what actually is sent over the wire, I recommend creating a simple web app and checking out the network tab in your browser's developer console.
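If the worry is mainly the 1 MB document limit rather than bandwidth, a rough estimate helps. A minimal sketch, assuming Firestore's documented storage-size rules (a string value costs its UTF-8 byte length plus 1, and so does each field name), with the field names from the example above:

# Rough estimate of what the rating strings contribute to the 1 MB document limit.
def string_field_size(name, value):
    # field name bytes + 1, plus string value bytes + 1
    return (len(name.encode("utf-8")) + 1) + (len(value.encode("utf-8")) + 1)

ratings = ["VERY_BAD", "BAD", "GOOD", "VERY_GOOD"]
worst_case = max(string_field_size("rating", r) for r in ratings)  # "VERY_GOOD" -> 17 bytes

print(worst_case)        # 17 bytes per object for the rating field alone
print(worst_case * 500)  # 8500 bytes across 500 objects, far below the 1 MB limit

Even in the worst case the ratings themselves stay small; the surrounding map keys (obj001 ... obj500) and the rest of each object are what add up.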

Maximum write rate to a document on Firestore

I am using Firestore to figure out, in real-time, each user's share of the cost of an item. Example:
/tickets/100/ticket-item/1:
{
  name: 'Red Dead Redemption'
  price: '5000'
  payers (array of maps): [
    {
      name: 'John',
      share: '1666'
    },
    {
      name: 'Jane',
      share: '1667'
    },
    {
      name: 'Jack',
      share: '1667'
    }
  ]
}
Given that the max write rate to a document is 1/second, will the write always fail if two users add themselves to the same ticket item doc at the exact same time?
I know that this can be mitigated to an extent by using transactions, but a transaction will only re-execute a finite number of times. Let's say it re-executes up to 5 times. If 6 users write to same ticket item doc at the exact same time, can I expect one of these writes to fail?
I would appreciate any and all advice regarding how to handle this.
will the write always fail if two users add themselves to the same ticket item doc at the exact same time?
Yes, it will. So if you're sure you'll have situations in which two or more users try to write or update data in a single document at the exact same time, I recommend being careful about this limitation, because you might start to see some of those write operations fail.
I know that this can be mitigated to an extent by using transactions
It's a good idea but please be aware that transactions will fail when the client is offline.
If 6 users write to same ticket item doc at the exact same time, can I expect one of these writes to fail?
As the docs state, a transaction will only re-execute a finite number of times. But please also note what happens in case of a transaction failure:
A failed transaction returns an error and does not write anything to the database.
So all you have to do is take some action in case of a transaction failure, as sketched below.
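For illustration, a minimal sketch of the transaction approach with the Python client (google-cloud-firestore), using the document path from the example above; the payer values are made up. By default the client retries a contended transaction several times before raising, and that exception is the failure you then handle:

from google.cloud import firestore

db = firestore.Client()
ticket_item_ref = db.document("tickets/100/ticket-item/1")

@firestore.transactional
def add_payer(transaction, doc_ref, payer):
    # Read, modify, and write inside the transaction so concurrent updates are retried.
    snapshot = doc_ref.get(transaction=transaction)
    payers = (snapshot.to_dict() or {}).get("payers", [])
    payers.append(payer)
    transaction.update(doc_ref, {"payers": payers})

try:
    add_payer(db.transaction(), ticket_item_ref, {"name": "Jill", "share": "1250"})
except Exception:
    # A failed transaction writes nothing, so report the error to the user or retry later.
    raise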
I'm researching the same problem.
One possible solution: move "payers" into a separate collection, with a ticket_id field on each payer document (see the sketch below).
That way each payer is written to its own document, so the per-document write limit is no longer a problem.
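A minimal sketch of that idea with the Python client; the collection name ticket-item-payers and the ticket_id/item_id fields are made up for illustration. Each payer gets its own document, so concurrent users never write to the same document:

from google.cloud import firestore

db = firestore.Client()

def add_payer(ticket_id, item_id, name, share):
    # One document per payer: concurrent users write different documents,
    # so the 1 write/second/document limit on the ticket item no longer applies.
    db.collection("ticket-item-payers").add({
        "ticket_id": ticket_id,
        "item_id": item_id,
        "name": name,
        "share": share,
    })

def get_payers(ticket_id, item_id):
    query = (db.collection("ticket-item-payers")
               .where("ticket_id", "==", ticket_id)
               .where("item_id", "==", item_id))
    return [doc.to_dict() for doc in query.stream()]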

Graphite Derivative shows no data

I'm using Graphite/Grafana to record the sizes of all collections in a MongoDB instance. I wrote a simple (WIP) Python script to do so:
#!/usr/bin/python
from pymongo import MongoClient
import socket
import time

statsd_ip = '127.0.0.1'
statsd_port = 8125

# create a udp socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

client = MongoClient(host='12.34.56.78', port=12345)
db = client.my_DB

# get collection list each runtime
collections = db.collection_names()

sizes = {}

# main
while (1):
    # get collection size per name
    for collection in collections:
        sizes[collection] = db.command('collstats', collection)['size']
    # write to statsd
    for size in sizes:
        MESSAGE = "collection_%s:%d|c" % (size, sizes[size])
        sock.sendto(MESSAGE, (statsd_ip, statsd_port))
    time.sleep(60)
This properly shows all of my collection sizes in Grafana. However, I want to get a rate of change on these sizes, so I built the following Graphite query in Grafana:
derivative(statsd.myHost.collection_myCollection)
And the graph shows up totally blank. Any ideas?
FOLLOW-UP: When selecting a time range greater than 24h, all data similarly disappears from the graph. Can't for the life of me figure out that one.
Update: This was due to the fact that my collectd was configured to send samples every second, while the statsd plugin for collectd was only receiving data every 60 seconds, so I ended up with None for most data points.
I discovered this by checking the raw data in Graphite by appending &format=raw to the end of a graphite-api query in a browser, which gives you the value of each data point as a comma-separated list.
The temporary fix was to wrap the Graphite query in keepLastValue(60). This, however, creates a stair-step graph, because each run of None values (60 of them) takes on the last valid value within 60 steps, and graphing the derivative of that produces a widely spaced sawtooth graph.
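For reference, with the series from the query above, the temporary fix looks like this:

derivative(keepLastValue(statsd.myHost.collection_myCollection, 60))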
To fix this properly, I will probably adjust the flush interval on collectd, or switch to a standalone statsd instance and configure it as needed from there.

What is the best method to measure site visits and page views in real time?

I currently use Adobe Omniture SiteCatalyst, Google Analytics, and New Relic. All three offer visit and page view metrics. SiteCatalyst has no API that I'm aware of, and their data is often hours behind. Google Analytics and New Relic both offer realtime APIs, but I find that the metrics offered differ wildly across vendors.
What's the best method (API) for measuring realtime visits (page views, unique visitors, etc.)?
Ultimately, I intend to use this data to present realtime conversion rates to my business customers.
Adobe SiteCatalyst does have a realtime API that you can use. It works in a similar way to regular SiteCatalyst reports.
Here is a Python example request:
import requests
import json
import hashlib
import base64
import time

sc_user = "yourUsername"             # Web Services API username (placeholder)
sc_key = "yourSharedSecret"          # Web Services API shared secret (placeholder)
your_report_suite = "ReportSuiteId"  # The name of the report suite
what_you_are_looking = "someValue"   # Value of the prop to look for in the realtime stream

def generateHeader():
    # Generates the X-WSSE authentication header for the request
    nonce = str(time.time())
    base64nonce = base64.b64encode(nonce.encode()).decode()
    created_date = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.localtime())
    sha_object = hashlib.sha1((nonce + created_date + sc_key).encode())
    password_64 = base64.b64encode(sha_object.digest()).decode()
    return 'UsernameToken Username="%s", PasswordDigest="%s", Nonce="%s", Created="%s"' % (
        sc_user, password_64, base64nonce, created_date)

def getRealTimeUsers():
    url = 'https://api.omniture.com/admin/1.3/rest/?method='
    method = 'Report.GetRealTimeReport'
    report_url = url + method
    headers = {'X-WSSE': generateHeader()}
    payload = {
        "reportDescription": {
            "reportSuiteID": your_report_suite,
            "metrics": [
                {"id": "instances"}
            ],
            "elements": [
                {
                    "id": "prop4",
                    "search": {
                        "type": "string",
                        "keywords": what_you_are_looking
                    }
                }
            ]
        }
    }
    response = requests.post(url=report_url, headers=headers, data=json.dumps(payload))
    data = response.json().get('report').get('data')
    return data
Note: Realtime reporting requires that the realtime feature is turned on in your report suite. Also the realtime reports are limited in their dimensionality. There is not a whole lot of documentation on the particular requests required but there is this: https://marketing.adobe.com/developer/documentation/sitecatalyst-reporting/c-real-time
Also I highly recommend experimentation by using the api explorer: https://marketing.adobe.com/developer/api-explorer#Report.GetRealTimeReport
What kind of delay is acceptable? What about accuracy and detail? Script-based systems like Google Analytics require JavaScript to be enabled and provide plenty of detail about the visitor's demographic and technical profile, while raw webserver logfiles give you details about every single request (which is better for technical insight, since you get details on requested images, hotlinking, referrers, and other files).
Personally, I'd just use Google Analytics because I'm familiar with it, and because their CDN servers mean my site won't load slowly. Otherwise I'd run typical logfile-analysis software on my raw webserver logs, though depending on the software that analysis can take time to generate a report.
