How can I handle a lot of data with timestamps in ArangoDB? - bigdata

I am new to handling large amounts of data.
Every 100 ms I currently write 4 JSON blocks to a collection in my ArangoDB.
The content of the JSON looks something like this:
{
"maintenence": {
"holder_1": 1,
"holder_2": 0,
"holder_3": 0,
"holder_4": 0,
"holder_5": 0,
"holder_6": 0
},
"error": 274,
"pos": {
"left": [
21.45, // changing every 100ms
38.36, // changing every 100ms
10.53 // changing every 100ms
],
"center": [
0.25, // changing every 100ms
0, // changing every 100ms
2.42 // changing every 100ms
],
"right": [
0, // changing every 100ms
0, // changing every 100ms
0 // changing every 100ms
]
},
"sub": [
{
"type": 23,
"name": "plate 01",
"sensors": [
{
"type": 45,
"name": "sensor 01",
"state": {
"open": 1,
"close": 0,
"middle": 0
}
},
{
"type": 34,
"name": "sensor 02",
"state": {
"on": 1
}
}
]
}
],
"timestamp": "2018-02-18 01:56:08.423",
"device": "12227225"
}
Every block comes from a different device.
In only 2 days there are ~6 million datasets in the collection.
If I want to get data to draw a line graph of "device 1, position left[0]"
with:
FOR d IN device
  FILTER d.timestamp >= "2018-02-18 04:30:00.000" && d.timestamp <= "2018-02-18 04:35:00.000"
  RETURN d.pos.left[0]
it takes a very long time to search through these ~6 million datasets.
My question is: is this normal, and can only more machine power fix the problem, or is my way of handling this data wrong?
I think ~6 million datasets is not BIG DATA, but if I already fail with this, how will I handle it once I add 50 more devices and collect data not for 2 but for 30 days?

Converting the timestamps to Unix timestamps (numbers) helps a lot.
I added a skiplist index over timestamp & device.
Now, with 13 million datasets, my query runs in 920 ms.
Thank you!
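For reference, a minimal sketch of that setup, assuming the collection is called device and timestamp is stored as a Unix timestamp in milliseconds (both assumptions taken from the query above):
// arangosh: skiplist index over device & timestamp
db.device.ensureIndex({ type: "skiplist", fields: ["device", "timestamp"] });
// AQL: filter on the numeric timestamp so the index can be used
FOR d IN device
  FILTER d.device == "12227225"
    AND d.timestamp >= DATE_TIMESTAMP("2018-02-18T04:30:00.000Z")
    AND d.timestamp <= DATE_TIMESTAMP("2018-02-18T04:35:00.000Z")
  RETURN d.pos.left[0]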

Related

Grafana: a line graph without date-time on x-axis

This is probably more complicated than it sounds, at least with Grafana.
I have an experiment where, for every location (1-100), a value changes over time. I want to show this with a line graph (or a bar graph), where the x-axis corresponds to the locations (1-100) and the y-axis corresponds to the average value for that location over the time interval set in Grafana in the upper right corner. The data comes from a database. Please suggest which type of graph (panel) I should choose in Grafana to achieve this. I can only see two kinds: those with time on the x-axis and those of type histogram, but neither seems applicable.
It seems Grafana's built-in panels only support
time series, which means the x-axis must be of type 'time'.
bilibala-echarts-panel
I used the third-party panel bilibala-echarts-panel
https://grafana.com/grafana/plugins/bilibala-echarts-panel/
to achieve the goal: an x-axis that does not use time values.
It lets you use a custom callback function to handle the data and render it.
setting
In the Grafana query settings, use "format as time series".
// a time series needs 3 columns:
time   // a time or a number (that can be parsed as time)
metric // string: the series name
value  // a number (otherwise you get a value error)
Assign your x value to the time column (see the sketch below).
// it may render as 1970-01 in the table view, but in echarts we can read it back as a number
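A hedged sketch of such a query (the table and column names my_table, loc, val, ts are hypothetical):
SELECT
  loc         AS time,   -- the x value (a number), stored in the time column
  'avg value' AS metric, -- the series name
  avg(val)    AS value   -- the y value
FROM my_table
WHERE $__timeFilter(ts)
GROUP BY loc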
In the echarts panel option:
 data.series holds the Grafana data
 // convert/adapt it to echarts data here.
Maybe "format as table" gives more flexible data,
 which you can then parse in the callback JS.
summary
Grafana and echarts follow different models;
 you have to understand both of them
 and do the conversion in JS.
// bilibala-echarts-panel uses echarts v4 as of 2022-07
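A purely hypothetical sketch of that conversion step (the exact shape of data.series depends on the panel version, so treat the field names here as assumptions):
// assumes old-style series with datapoints: [[value, time], ...],
// where the "time" slot actually carries the location number
var points = data.series[0].datapoints.map(function (dp) {
  return [dp[1], dp[0]];   // [location, value]
});
return {
  xAxis: { type: 'value' },
  yAxis: { type: 'value' },
  series: [{ type: 'line', data: points }]
};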
We have implemented the graph using the natel-plotly-panel plugin:
"panels": [
{
"pconfig": {
"traces": [
{
"mapping": {
"color": "64",
"size": null,
"text": "metriccat",
"x": "loccat",
"y": "valuecat",
"z": null
},
"name": "My value",
"show": {
"line": true,
"lines": true,
"markers": false
}
},
...
]
},
"pluginVersion": "7.5.5",
"targets": [
{
"format": "table",
"group": [],
"metricColumn": "none",
"rawQuery": true,
"rawSql": "SELECT\n avg(column_with_values) AS valuecat,\n loc loccat,\n avg(column_with_values) as metriccat\nFROM ... \nWHERE\n $__timeFilter(timestamp)\nGROUP BY loc\n\n",
"select": [
[
{
"params": [
"value"
],
"type": "column"
}
]
],
"timeColumn": "time",
"where": [
{
"name": "$__timeFilter",
"params": [],
"type": "macro"
}
]
},
...
],
"title": "Average Values Along the Locations",
"type": "natel-plotly-panel",
"version": 1
}
]

Range Behavior on Isolines

When retrieving reverse isolines based on time with a list of ranges, does anyone know the behavior?
For example, if the range is 50,100,150,200,250,300,350,400, the polygon for 50 is much different than if the range is 50,100,150.
Based on the two parameter sets below, the result is extremely different for the 30-second reverse isoline range. The two calls occurred at the same time.
For https://isoline.route.ls.hereapi.com/routing/7.2/calculateisoline.json?apiKey=xxxx&mode=balanced;car;traffic:default;motorway:-3&rangeType=time&destination=geo!43.805388,-79.525348&range=30,1800
The polygon is:
"isoline": [{
"range": 30,
"component": [{
"id": 0,
"shape": ["43.8079834,-79.5238495",
"43.8066101,-79.5204163",
"43.8038635,-79.5204163",
"43.8024902,-79.5245361",
"43.8038635,-79.528656",
"43.8066101,-79.5293427",
"43.8079834,-79.5272827",
"43.8079834,-79.5238495"]
}]
}]
For https://isoline.route.ls.hereapi.com/routing/7.2/calculateisoline.json?apiKey=xxxx&mode=balanced;car;traffic:default;motorway:-3&rangeType=time&destination=geo!43.805388,-79.525348&range=30
The polygon is:
"isoline": [{
"range": 30,
"component": [{
"id": 0,
"shape": ["43.8059235,-79.5258236",
"43.8059235,-79.5245361",
"43.8057518,-79.5240211",
"43.8054085,-79.5240211",
"43.8050652,-79.5250511",
"43.8047218,-79.5253944",
"43.8047218,-79.5257378",
"43.8054085,-79.5264244",
"43.8057518,-79.5265102",
"43.8059235,-79.5262527",
"43.8059235,-79.5258236"]
}]
}]
The behaviour is the same as for a single range; multiple ranges just allow calculating many isolines with the same start or destination in one request.
Check this link for your reference.
https://developer.here.com/documentation/isoline-routing-api/dev_guide/topics/use-cases/multi-range-isoline.html
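For illustration, a single multi-range request (API key and range values are placeholders) looks like this; the response then contains one { "range": ..., "component": [...] } entry in the isoline array per requested range:
curl "https://isoline.route.ls.hereapi.com/routing/7.2/calculateisoline.json?apiKey=YOUR_KEY&mode=balanced;car;traffic:default&rangeType=time&destination=geo!43.805388,-79.525348&range=30,60,120"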

HERE Maps: Routing - Calculate Matrix response shows failed status for some start/destination combinations in a few countries

I am making a REST call to HERE Routing's Calculate Matrix with multiple starts and destinations, but I get a proper response only for the direct one-to-one start/destination pairs and status: failed for the other combinations (i.e. results only for the principal diagonal). I am facing the issue only in a few countries (here it is India), while the samples on the website (Europe) work.
REST GET call: https://matrix.route.ls.hereapi.com/routing/7.2/calculatematrix.json?apiKey=<API_KEY>&mode=balanced;car;traffic:disabled&summaryAttributes=distance,traveltime&start0=17.251160,78.437737&destination0=16.506174,80.648018&start1=13.069166,80.191391&destination1=12.971599,77.594566
Response: {
"response": {
"metaInfo": {
"timestamp": "2020-02-04T12:36:09Z",
"mapVersion": "8.30.105.150",
"moduleVersion": "7.2.202005-6333",
"interfaceVersion": "2.6.75",
"availableMapVersion": [
"8.30.105.150"
]
},
"matrixEntry": [
{
"startIndex": 0,
"destinationIndex": 0,
"summary": {
"distance": 286827,
"travelTime": 24236,
"costFactor": 24029
}
},
{
"startIndex": 0,
"destinationIndex": 1,
"status": "failed"
},
{
"startIndex": 1,
"destinationIndex": 0,
"status": "failed"
},
{
"startIndex": 1,
"destinationIndex": 1,
"summary": {
"distance": 339029,
"travelTime": 26924,
"costFactor": 26845
}
}
]
}
}
The reason behind the observed behaviour is that the road network in
India is quite dense, and in some areas the algorithm is not able to
find an optimal route within a reasonable time limit.
We suggest trying out our Large Scale Matrix Service. It supports two use cases:
Matrix Routing calculations with live traffic information for matrices up to 10000x10000 size in a limited size region (up to 400km in diameter).
Matrix Routing calculations without live traffic information for matrices up to 10000x10000 size without region limitations for fixed sets of parameters (profiles).
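A rough sketch of what such a request can look like (v8-style endpoint; the profile name and regionDefinition values below are assumptions, so check HERE's current documentation):
curl -X POST "https://matrix.router.hereapi.com/v8/matrix?apiKey=YOUR_KEY&async=false" \
  -H "Content-Type: application/json" \
  -d '{
        "origins":      [ { "lat": 17.251160, "lng": 78.437737 },
                          { "lat": 13.069166, "lng": 80.191391 } ],
        "destinations": [ { "lat": 16.506174, "lng": 80.648018 },
                          { "lat": 12.971599, "lng": 77.594566 } ],
        "regionDefinition": { "type": "world" },
        "profile": "carFast",
        "matrixAttributes": [ "distances", "travelTimes" ]
      }'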

Does Smartsheet support any cURL call to get the total row count in a sheet?

I am using curl with REST to access Smartsheets from my C# application running on Windows CE. My application is supposed to dump some data to a Smartsheet periodically.
Before I write to a sheet, I would like to know the total row count in the sheet so that I don't exceed 5000 rows per sheet.
I am looking for an API that would return just the row count, given the sheet ID.
Currently I am using the API below, which returns the entire sheet data:
curl https://api.smartsheet.com/2.0/sheets/{sheetId}
With data of up to 5000 rows per sheet, it takes very long to fetch and format the response below to determine the available rows:
{
"id": 4583173393803140,
"name": "sheet 1",
"version": 6,
"totalRowCount": 240,
"accessLevel": "OWNER",
"effectiveAttachmentOptions": [
"EVERNOTE",
"GOOGLE_DRIVE",
"EGNYTE",
"FILE",
"ONEDRIVE",
"DROPBOX",
"BOX_COM"
],
"readOnly": true,
"ganttEnabled": true,
"dependenciesEnabled": true,
"resourceManagementEnabled": true,
"cellImageUploadEnabled": true,
"userSettings": {
"criticalPathEnabled": false,
"displaySummaryTasks": true
},
"userPermissions": {
"summaryPermissions": "ADMIN"
},
"workspace": {
"id": 825898975642500,
"name": "New Workspace"
},
"projectSettings": {
"workingDays": [
"MONDAY",
"TUESDAY",
"WEDNESDAY"
],
"nonWorkingDays": [],
"lengthOfDay": 8
},
"hasSummaryFields": false,
"permalink": "https://app.smartsheet.com/b/home?lx=pWNSDH9itjBXxBzFmyf-5w",
"createdAt": "2018-09-24T20:27:57Z",
"modifiedAt": "2018-09-26T20:45:08Z",
"columns": [
{
"id": 4583173393803140,
"version": 0,
"index": 0,
"primary": true,
"title": "Primary Column",
"type": "TEXT_NUMBER",
"validation": false
},
{
"id": 2331373580117892,
"version": 0,
"index": 1,
"options": [
"new",
"in progress",
"completed"
],
"title": "status",
"type": "PICKLIST",
"validation": true
}
],
"rows": Array[4962]....
}
Any help will be greatly appreciated.
There isn't a request that specifically returns the number of rows on a sheet. But with any GET /sheets/{sheetId} operation, the resulting Sheet object will have a top-level totalRowCount attribute on it. So you don't have to GET the sheet and count the objects in the rows array; instead, you can look at the totalRowCount attribute to know how many rows are currently on the sheet.
If you are concerned about pulling down all of the sheet data, you can use paging to keep from getting all of the data returned. Doing a GET /sheets/{sheetId}?pageSize=1 will give you the Sheet object with only the first row of data, which helps keep the payload small. The totalRowCount attribute will still be present in the response.
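For example (sheet ID and access token are placeholders):
curl -s "https://api.smartsheet.com/2.0/sheets/{sheetId}?pageSize=1" \
     -H "Authorization: Bearer YOUR_ACCESS_TOKEN"
The response still contains "totalRowCount" (240 in the example above), while the rows array holds only a single row.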

Cosmos DocumentDb: Inefficient ORDER BY

I'm doing some early trials on Cosmos, and populated a collection with a set of DTOs. While some simple WHERE queries seem to return quite quickly, others are horribly inefficient. A simple COUNT(1) FROM c took several seconds and used over 10K request units. Even worse, a little experiment with ordering was also very discouraging. Here's my query:
SELECT TOP 20 c.Foo, c.Location from c
ORDER BY c.Location.Position.Latitude DESC
My collection (if the count was correct; I got super weird results running it while populating the DB, but that's another issue) contains about 300K DTOs. The above query ran for about 30 seconds (I currently have the DB configured for 4K RU/s) and ate 87453.439 RUs with 6 round trips. Obviously, that's a no-go.
According to the documentation, the numeric Latitude property should be indexed, so I'm not sure whether it's me screwing up here, or whether reality didn't really catch up with the marketing ;)
Any idea on why this doesn't perform properly? Thanks for your advice!
Here's a document as returned:
{
"Id": "y-139",
"Location": {
"Position": {
"Latitude": 47.3796977,
"Longitude": 8.523499
},
"Name": "Restaurant Eichhörnli",
"Details": "Nietengasse 16, 8004 Zürich, Switzerland"
},
"TimeWindow": {
"ReferenceTime": "2017-07-01T15:00:00",
"ReferenceTimeUtc": "2017-07-01T15:00:00+02:00",
"Direction": 0,
"Minutes": 45
}
}
The DB/collection I use is just the default one that can be created for the ToDo application from within the Azure portal. This apparently created the following indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*",
"indexes": [
{
"kind": "Range",
"dataType": "Number",
"precision": -1
},
{
"kind": "Hash",
"dataType": "String",
"precision": 3
}
]
}
],
"excludedPaths": []
}
Update as of Dec 2017:
I revisited my unchanged database and ran the same query again. This time, it's fast and instead of > 87000 RUs, it eats up around 6 RUs. Bottom line: It appears there was something very, very wrong with Cosmos DB, but whatever it was, it seems to be gone.
