I am accessing bulk data in Elasticsearch through R. For analytics purposes I need to query data over a relatively long duration (say a month). The data for a month is approximately 4.5 million rows, and R runs out of memory.
Sample data is below (for 1 day):
dt <- as.Date("2015-09-01", "%Y-%m-%d")
frmdt <- strftime(dt,"%Y-%m-%d")
todt <- as.Date(dt+1)
todt <- strftime(todt,"%Y-%m-%d")
connect(es_base="http://xx.yy.zzz.kk")
start_date <- as.integer(as.POSIXct(frmdt))*1000
end_date <- as.integer(as.POSIXct(todt))*1000
query <- sprintf('{"query":{"range":{"time":{"gte":"%s","lte":"%s"}}}}',start_date,end_date)
s_list <- elastic::Search(index = "organised_2015_09",type = "PROPERTY_SEARCH", body=query ,
fields = c("trackId", "time"), size=1000000)$hits$hits
length(s_list)
[1] 144612
This result for 1 day has 144k records and is 222 MB. Sample list item below:
> s_list[[1]]
$`_index`
[1] "organised_2015_09"
$`_type`
[1] "PROPERTY_SEARCH"
$`_id`
[1] "1441122918941"
$`_version`
[1] 1
$`_score`
[1] 1
$fields
$fields$time
$fields$time[[1]]
[1] 1441122918941
$fields$trackId
$fields$trackId[[1]]
[1] "fd4b4ce88101e58623ba9e6e31971d1f"
Actually, a summary count of the number of items by "trackId" and "time" (summarized per day) would suffice for the analytics purpose. Hence I tried to transform this into a count query with aggregations, and constructed the query below:
query < -'{"size" : 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"time": {
"gte": 1441045800000,
"lte": 1443551400000
}
}
}
}
},
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "time",
"interval": "day",
"time_zone": "+05:30"
},
"aggs": {
"group_by_state": {
"terms": {
"field": "trackId",
"size": 0
}
}
}
}
}
}'
response <- elastic::Search(index="organised_recent",type="PROPERTY_SEARCH",body=query, search_type="count")
However, I did not gain in speed or response size. I think I am missing something, but I am not sure what.
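For reference, with "size": 0 the hits themselves are suppressed, so the per-day counts should be read out of $aggregations rather than $hits; a minimal sketch, assuming the parsed response keeps the same nested-list shape as the hits above:
# minimal sketch: turn the date_histogram + terms buckets into a data frame
buckets <- response$aggregations$articles_over_time$buckets
daily <- do.call(rbind, lapply(buckets, function(b) {
  data.frame(
    day     = b$key_as_string,
    trackId = sapply(b$group_by_state$buckets, `[[`, "key"),
    count   = sapply(b$group_by_state$buckets, `[[`, "doc_count")
  )
}))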
I have a deserialized object that I want to dynamically loop through to return the related results. The response payload looks like this:
{"RatingResponse":
{"Success":"true",
"Message":"",
"QuoteID":"57451",
"LoadNum":"57451",
"Rates":
{"Rate":
[
{"SCAC":"test1",
"CarrierName":"TEST1",
"TransitTime":"1",
"ServiceLevel":"D",
"TotalCost":"1,031.82",
"ThirdPartyCharge":"1,031.82",
"Accessorials":
{"Accessorial":
[
{"Code":"400",
"Cost":"1,655.55",
"Description":"Freight"
},
{"Code":"DSC",
"Cost":"-952.77",
"Description":"Discount"
},
{"Code":"FUE",
"Cost":"329.04",
"Description":"Fuel Surcharge"
}
]
},
"QuoteNumber":""
},
{"SCAC":"test2",
"CarrierName":"TEST2",
"TransitTime":"1",
"ServiceLevel":"D",
"TotalCost":"1,031.82",
"ThirdPartyCharge":"1,031.82",
"Accessorials":
{"Accessorial":
[
{"Code":"400",
"Cost":"1,655.55",
"Description":"Freight"
},
{"Code":"DSC",
"Cost":"-952.77",
"Description":"Discount"
},
{"Code":"FUE",
"Cost":"329.04",
"Description":"Fuel Surcharge"
}
]
},
"QuoteNumber":""
}
]
},
"AverageTotalCost":"1,031.82"
}
}
I have parsed the response data so that there is less information to work with, especially since I only need the Accessorial Costs. The parsed response looks like:
[
{
"SCAC": "test1",
"CarrierName": "TEST1",
"TransitTime": "1",
"ServiceLevel": "D",
"TotalCost": "1,031.82",
"ThirdPartyCharge": "1,031.82",
"Accessorials": {
"Accessorial": [
{
"Code": "400",
"Cost": "1,655.55",
"Description": "Freight"
},
{
"Code": "DSC",
"Cost": "-952.77",
"Description": "Discount"
},
{
"Code": "FUE",
"Cost": "329.04",
"Description": "Fuel Surcharge"
}
]
},
"QuoteNumber": ""
},
{
"SCAC": "test2",
"CarrierName": "TEST2",
"TransitTime": "1",
"ServiceLevel": "D",
"TotalCost": "1,031.82",
"ThirdPartyCharge": "1,031.82",
"Accessorials": {
"Accessorial": [
{
"Code": "400",
"Cost": "1,655.55",
"Description": "Freight"
},
{
"Code": "DSC",
"Cost": "-952.77",
"Description": "Discount"
},
{
"Code": "FUE",
"Cost": "329.04",
"Description": "Fuel Surcharge"
}
]
},
"QuoteNumber": ""
}
]
The problem I am facing is that I will never know how many Rate items will come back in the response data, nor will I know the exact amount of Accessorial Costs. I'm hoping to capture the Rate child node counts and the Accessorial child node counts per Rate. Here's what I have so far.
Root rootObject = Newtonsoft.Json.JsonConvert.DeserializeObject<Root>(responseFromServer);
//rate stores the parsed response data
JArray rate = (JArray)JObject.Parse(responseFromServer)["RatingResponse"]["Rates"]["Rate"];
var rate2 = rate.ToString();
//this for loop works as expected. it grabs the number of Rate nodes (in this example, 2)
for (int i = 0; i < rate.Count(); i++)
{
dynamic test2 = rate[i];
//this is where I'm struggling
dynamic em = (JArray)JObject.Parse(test2)["Accessorials"]["Accessorial"].Count();
for (int j = 0; j < em; j++)
{
string test3 = test2.Accessorials.Accessorial[j].Cost;
System.IO.File.AppendAllText(logPath, Environment.NewLine + test3 + Environment.NewLine);
}
}
I apologize in advance for the bad formatting and odd variable names - I'm obviously still testing the functionality, so I've been using random variables.
Where I'm struggling (as noted above) is getting to the Accessorial node to count how many items are in its array. I was thinking I could parse the first array (starting with the SCAC data) and drill down to the Accessorial node, but I'm not having any luck.
Any help is GREATLY appreciated, especially since I am new to this type of code and have spent the majority of the day trying to resolve this.
You can try this:
var rates = (JArray)JObject.Parse(json)["RatingResponse"]["Rates"]["Rate"];
var costs = rates.Select(r => new
{
    CarrierName = (string)r["CarrierName"],
    // sum the accessorial costs, skipping the discount line
    // (note the inner lambda needs its own parameter name)
    Costs = ((JArray)r["Accessorials"]["Accessorial"])
        .Where(a => (string)a["Description"] != "Discount")
        .Select(a => (double)a["Cost"]).Sum()
}).ToList();
Result:
[
{
"CarrierName": "TEST1",
"Costs": 1984.59
},
{
"CarrierName": "TEST2",
"Costs": 1984.59
}
]
We are trying to migrate DynamoDB tables from a prod account to a stage account.
In the source account, we use the "Export" feature of DDB to put the compressed .json.gz files into a destination S3 bucket.
We have written a Glue script which reads the exported .json.gz files and writes them to a DDB table.
We are making the code generic, so we should be able to migrate any DDB table from prod to stage.
As part of that process, while testing, we ran into issues when trying to write NUMBER SET data to the target DDB table.
Following is the sample snippet which raises a ValidationException when trying to insert into DDB:
from decimal import Decimal
def number_set(datavalue):
# datavalue will be ['0', '1']
set_of_values = set()
for value in datavalue:
set_of_values.add(Decimal(value))
return set_of_values
When running the code, we get the following ValidationException:
An error occurred while calling o82.pyWriteDynamicFrame. Supplied AttributeValue is empty, must contain exactly one of the supported datatypes (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UKEU70T0BLIKN0K2OL4RU56TGVVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
However, if we use int(value) instead of Decimal(value), no ValidationException is thrown and the job succeeds.
I suspect that write_dynamic_frame_from_options tries to infer the schema from the values each element contains: if the element holds int values the datatype is inferred as "NS", but if the element holds only Decimal values it is unable to infer the datatype.
The Glue job we have written is:
from decimal import Decimal
from awsglue.transforms import Map

# glue_context is assumed to be an initialized GlueContext
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": [file_path]
    },
    format="json",
    transformation_ctx="dyf",
    recurse=True,
)
def number_set(datavalue):
list_of_values = []
for value in datavalue:
list_of_values.append(Decimal(value))
print("list of values ")
print(list_of_values)
return set(list_of_values)
def parse_list(datavalue):
list_of_values = []
for object in datavalue:
list_of_values.append(generic_conversion(object))
return list_of_values
def generic_conversion(value_dict):
for datatype,datavalue in value_dict.items():
if datatype == 'N':
value = Decimal(datavalue)
elif datatype == 'S':
value = datavalue
elif datatype == 'NS':
value = number_set(datavalue)
elif datatype == 'BOOL':
value = datavalue
elif datatype == 'M':
value = construct_map(datavalue)
elif datatype == 'B':
value = datavalue.encode('ascii')
elif datatype == 'L':
value = parse_list(datavalue)
return value
def construct_map(row_dict):
ddb_row = {}
for key,value_dict in row_dict.items():
# value is a dict with key as N or S
# if N then use Decimal type
ddb_row[key] = generic_conversion(value_dict)
return ddb_row
def map_function(rec):
row_dict = rec["Item"]
return construct_map(row_dict)
mapped_dyF = Map.apply(frame = dyf, f = map_function, transformation_ctx = "mapped_dyF")
datasink2 = glue_context.write_dynamic_frame_from_options(
frame=mapped_dyF,
connection_type="dynamodb",
connection_options={
"dynamodb.region": "us-east-1",
"dynamodb.output.tableName": destination_table,
"dynamodb.throughput.write.percent": "0.5"
},
transformation_ctx = "datasink2"
)
Can anyone help us figure out how to unblock this situation?
The record that we are trying to insert:
{
"region": {
"S": "to_delete"
},
"date": {
"N": "20210916"
},
"number_set": {
"NS": [
"0",
"1"
]
},
"test": {
"BOOL": false
},
"map": {
"M": {
"test": {
"S": "value"
},
"test2": {
"S": "value"
},
"nestedmap": {
"M": {
"key": {
"S": "value"
},
"nestedmap1": {
"M": {
"key1": {
"N": "0"
}
}
}
}
}
}
},
"binary": {
"B": "QUFBY2Q="
},
"list": {
"L": [
{
"S": "abc"
},
{
"S": "def"
},
{
"N": "123"
},
{
"M": {
"key2": {
"S": "value2"
},
"nestedmaplist": {
"M": {
"key3": {
"S": "value3"
}
}
}
}
}
]
}
}
I am trying to create a JSON array of objects using jsonlite in R.
The goal is a JSON like this:
{
"top":[
{
"master1": {
"item1": "value1"
}
},
{
"master2": {
"item2": "value2"
}
}
]
}
I tried lists of lists and data frames with a list column, but can't get the desired output.
Apart from that, converting the above with fromJSON/toJSON results in a different format:
library(jsonlite)
txt <- '{
"top":[
{
"master1": {
"item1": "value1"
}
},
{
"master2": {
"item2": "value2"
}
}]
}'
toJSON(fromJSON(txt), pretty = T)
# Output
{
"top": [
{
"master1": {
"item1": "value1"
},
"master2": {}
},
{
"master1": {},
"master2": {
"item2": "value2"
}
}
]
}
Do I need to set a parameter for this to work?
By default, the fromJSON call converts your input to a data frame, and therefore adds NA values, which result in the empty entries in your output JSON:
$top
item1 item2
1 value1 <NA>
2 <NA> value2
You need to add simplifyDataFrame = FALSE to the fromJSON call to prevent it from creating a data frame.
toJSON(fromJSON(txt, simplifyDataFrame = FALSE), pretty = T)
gives
{
"top": [
{
"master1": {
"item1": ["value1"]
}
},
{
"master2": {
"item2": ["value2"]
}
}
]
}
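If you want to build the desired structure directly from R (and avoid the single-element arrays shown above), an unnamed outer list becomes a JSON array and named inner lists become objects; a minimal sketch using auto_unbox:
library(jsonlite)

# unnamed outer list -> JSON array; named inner lists -> JSON objects;
# auto_unbox = TRUE drops the length-1 arrays around scalar values
payload <- list(top = list(
  list(master1 = list(item1 = "value1")),
  list(master2 = list(item2 = "value2"))
))
toJSON(payload, auto_unbox = TRUE, pretty = TRUE)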
I'm trying to get some data from a JSON file using R, but it does not work when the data is nested under brackets and keys. I'm getting a lot of data back; the problem is actually getting the value of the "released" parameter. Example:
{
"index": [
{
"id": "a979eb2b85d6c13086b29a21bdc421b2673379a4",
"date": "2019-03-22T01:20:01-0300",
"status": "OK",
"sensor": [
{
"id": "15",
"number": 127,
"callback": {
"released": true #it is not possible to return this data
}
}
]
},
{
"id": "db2890f501a3a49ed74aeb065168e057c3fd51d2",
"date": "2019-03-25T01:20:01-0300",
"status": "NOK",
"sensor": [
{
"id": "15",
"number": 149,
"callback": {
"released": false #it is not possible to return this data
}
}
]
}
]
}
Here is the code:
library(jsonlite)
data <- fromJSON("Desktop/json/file.json")
pagination <- list()
for(i in 0:10){
pagination[[i+1]] <- data$index$sensor$callback
}
data_org <- rbind_pages(pagination)
nrow(data_org)
length <- nrow(data_org)
data_org[1:length, c("released")]
The output was:
nrow(data_org)
# [1] 0
data_org[1:length, c("released")]
# NULL
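For what it's worth, with jsonlite's default simplification data$index is a data frame whose sensor column is a list of data frames, so data$index$sensor$callback is NULL; a minimal sketch of reaching the values under that assumption:
library(jsonlite)

data <- fromJSON("Desktop/json/file.json")
# sensor is a list of data frames (one per index entry), so the
# callback columns have to be collected element by element
released <- sapply(data$index$sensor, function(s) s$callback$released)
released
# expected, given the sample above: TRUE FALSE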
I am trying to insert forecasted values from a forecasting model, along with their timestamps, into MongoDB from R.
The following code converts an R data frame into JSON and then BSON. However, when the result is inserted into MongoDB, the timestamp is not recognized as a date object.
mongo1 <-mongo.create(host = "localhost:27017",db = "test",username = "test",password = "test")
rev<-data.frame(ts=c("2017-01-06 05:30:00","2017-01-06 05:31:00","2017-01-06 05:32:00","2017-01-06 05:33:00","2017-01-06 05:34:00"),value=c(10,20,30,40,50))
rev$ts<-as.POSIXct(strptime(rev$ts,format = "%Y-%m-%d %H:%M:%S",tz=""))
revno<-"Revision1"
mylist <- list()
mylist[[ revno ]] <- rev
mylist["lastRevision"]<-revno
StartTime<-"2017-01-06 05:30:00"
site<-"Site1"
id <- mongo.bson.buffer.create()
mongo.bson.buffer.append(id, "site",site)
mongo.bson.buffer.append(id, "ts",as.POSIXct(strptime(StartTime,format = "%Y-%m-%d %H:%M:%S",tz="")) )
s <- mongo.bson.from.buffer(id)
rev.json<-toJSON(mylist,POSIXt=c("mongo"))
rev.bson<-mongo.bson.from.JSON(rev.json)
actPower <- mongo.bson.buffer.create()
mongo.bson.buffer.append(actPower, "_id",s)
mongo.bson.buffer.append(actPower,"activePower",rev.bson)
x <- mongo.bson.from.buffer(actPower)
x
mongo.insert(mongo1,'solarpulse.forecast',x)
Actual Output:
{
"_id" : {
"site" : "site1",
"ts" : ISODate("2017-01-06T18:30:00Z")
},
"activePower" : {
"Revision1" : [
{
"ts" : 1483660800000,
"value" : 10
},
{
"ts" : 1483660860000,
"value" : 20
},
{
"ts" : 1483660920000,
"value" : 30
},
{
"ts" : 1483660980000,
"value" : 40
},
{
"ts" : 1483661040000,
"value" : 50
}
],
"lastRevision" : [
"Revision1"
]
}
}
Expected Output format:
"_id" : {
"site" : "test",
"ts" : ISODate("2016-12-18T18:30:00Z")
}
"Revision1": [{
"ts": ISODate("2016-12-19T07:30:00Z"),
"value": 31
}, {
"ts": ISODate("2016-12-19T07:45:00Z"),
"value": 52
}, {
"ts": ISODate("2016-12-19T08:00:00Z"),
"value": 53
}, {
"ts": ISODate("2016-12-19T08:15:00Z"),
"value": 30
}, {
"ts": ISODate("2016-12-19T08:30:00Z"),
"value": 43
}, {
"ts": ISODate("2016-12-19T08:45:00Z"),
"value": 31
}, {
"ts": ISODate("2016-12-19T09:00:00Z"),
"value": 16
}, {
"ts": ISODate("2016-12-19T09:15:00Z"),
"value": 39
}, {
"ts": ISODate("2016-12-19T09:30:00Z"),
"value": 17
}, {
"ts": ISODate("2016-12-19T09:45:00Z"),
"value": 45
}, {
"ts": ISODate("2016-12-19T10:00:00Z"),
"value": 60
}, {
"ts": ISODate("2016-12-19T10:15:00Z"),
"value": 39
}, {
"ts": ISODate("2016-12-19T10:30:00Z"),
"value": 46
}, {
"ts": ISODate("2016-12-19T10:45:00Z"),
"value": 57
}, {
"ts": ISODate("2016-12-19T11:00:00Z"),
"value": 29
}, {
"ts": ISODate("2016-12-19T11:15:00Z"),
"value": 7
}]
You can use library(mongolite) to insert dates correctly. However, I've only managed to get it to do so using data.frames; it fails to insert dates correctly from lists or JSON strings.
Here is a working example using a data.frame to insert the data.
library(mongolite)
m <- mongo(collection = "test_dates", db = "test", url = "mongodb://localhost")
# m$drop()
df <- data.frame(id = c("site1","site2"),
ts = c(Sys.time(), Sys.time()))
m$insert(df)
#Complete! Processed total of 2 rows.
#$nInserted
#[1] 2
#
#$nMatched
#[1] 0
#
#$nRemoved
#[1] 0
#
#$nUpserted
#[1] 0
#
#$writeErrors
#list()
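(If you want to sanity-check the stored type, reading the collection back should give POSIXct values for ts; a quick sketch:)
# ts coming back as POSIXct confirms mongo stored a real date
str(m$find('{}'))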
A potential (but less than ideal) solution could be to coerce your list to a data.frame and then insert that.
rev<-data.frame(ts=c("2017-01-06 05:30:00","2017-01-06 05:31:00","2017-
01-06 05:32:00","2017-01-06 05:33:00","2017-01-06 05:34:00"),value=c(10,20,30,40,50))
rev$ts<-as.POSIXct(strptime(rev$ts,format = "%Y-%m-%d %H:%M:%S",tz=""))
revno<-"Revision1"
mylist <- list()
mylist[[ revno ]] <- rev
mylist["lastRevision"]<-revno
m$insert(data.frame(mylist))
Or alternatively, insert your list as-is and then write a function inside mongo to convert the ts values to ISODate() directly.