Batch loading in DynamoDB - amazon-dynamodb

Hi, currently I am loading each row individually into DynamoDB, which is very slow.
I have a huge amount of data that I want to load into DynamoDB through the Java API, but it takes a very long time. For example, loading 1 million records took me 2 days.
Is batch loading possible in DynamoDB? I can't find any information about bulk or batch loading.
Appreciate any help here.

I know it's an old post, but we came up short exploring how to optimize this, so we embarked on a scientific discovery :)
http://tech.equinox.com/driving-miss-dynamodb/
Long and short of it: Hive on EMR is an excellent option (I know, it's "old skool").
Using and tuning these parameters does the trick (see the blog for details):
SET dynamodb.throughput.write.percent = x;
SET mapred.reduce.tasks = x;
SET hive.exec.reducers.bytes.per.reducer = x;
SET tez.grouping.split-count = x;
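For completeness: DynamoDB also has a native BatchWriteItem operation that writes up to 25 items per request, so you don't need EMR just to stop writing row by row. Below is a minimal, untested sketch using boto3 (the Java SDK exposes the same operation via batchWriteItem / DynamoDBMapper.batchSave); the table name 'mytable' and the rows iterable are placeholders:
import boto3

dynamodb = boto3.resource('dynamodb')   # assumes credentials and region are already configured
table = dynamodb.Table('mytable')       # placeholder table name

# batch_writer() buffers puts into 25-item BatchWriteItem calls and
# automatically retries any unprocessed items.
with table.batch_writer() as batch:
    for row in rows:                    # 'rows' stands in for your source data (dicts keyed by attribute name)
        batch.put_item(Item=row)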

Related

AzureML Dataset.File.from_files creation extremely slow even with 4 files

I have a few thousand video files in my Blob Storage, which I have set as a datastore.
This blob storage receives new files every night, and I need to split the data and register each split as a new version of an AzureML Dataset.
This is how I do the data split, simply getting the blob paths and splitting them:
container_client = ContainerClient.from_connection_string(AZ_CONN_STR,'keymoments-clips')
blobs = container_client.list_blobs('soccer')
blobs = map(lambda x: Path(x['name']), blobs)
train_set, test_set = get_train_test(blobs, 0.75, 3, class_subset={'goal', 'hitWoodwork', 'penalty', 'redCard', 'contentiousRefereeDecision'})
valid_set, test_set = split_data(test_set, 0.5, 3)
train_set, test_set and valid_set are just n×2 numpy arrays containing the blob storage path and the class.
Here is when I try to create a new version of my Dataset:
datastore = Datastore.get(workspace, 'clips_datastore')
dataset_train = Dataset.File.from_files([(datastore, b) for b, _ in train_set[:4]], validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
How is it possible that the Dataset creation seems to hang indefinitely, even with only 4 paths?
I saw in the docs that providing a list of Tuple[datastore, path] is perfectly fine. Do you know why this happens?
Thanks
Do you have your Azure Machine Learning Workspace and your Azure Storage Account in different Azure Regions? If that's true, latency may be a contributing factor with validate=True.
Another possibility may be slowness in the way datastore paths are resolved. This is an area where improvements are being worked on.
As an experiment, could you try creating the dataset using a url instead of datastore? Let us know if that makes a difference to performance, and whether it can unblock your current issue in the short term.
Something like this:
dataset_train = Dataset.File.from_files(path="https://bloburl/**/*.mp4?accesstoken", validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
I'd be interested to see what happens if you run the dataset creation code twice in the same notebook/script. Is it faster the second time? I ask because it might be an issue with the .NET Core runtime startup (which would only happen the first time you run the code).
EDIT 9/16/20
While it doesn't seem to make sense that .NET Core is invoked when no data is moving, I suspect it is the validate=True parameter that requires all the data to be inspected (which can be computationally expensive). I'd be interested to see what happens if that param is False.
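For what it's worth, here is a minimal sketch of that experiment, reusing the same datastore and train_set tuples from the question; only validate is changed:
# Same call as in the question, but with the upfront file-existence check disabled.
dataset_train = Dataset.File.from_files(
    [(datastore, b) for b, _ in train_set[:4]],
    validate=False,
    partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)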

saveWorkbook Execution time

I'm sorry if this was already answered, but I couldn't find it.
I'm using the XLConnect package to add new entries to a spreadsheet, but the execution time of saveWorkbook is increasing and delaying all other tasks that depend on the updated spreadsheet.
The workflow is the following:
1. Query a SQL db for new entries (load the result using read.table);
2. Load the out-of-date spreadsheet and save each sheet as an entry of a list;
3. Add entries to the appropriate sheets/list elements;
4. Color lines, using setCellStyle, according to a series of parameters (example in the code below);
5. saveWorkbook
cs_completo = getOrCreateCellStyle(wb, name = "Cs_Completo")
setFillPattern(cs_completo, fill = XLC$FILL.SOLID_FOREGROUND)
setFillForegroundColor(cs_completo, color = XLC$COLOR.LIGHT_GREEN)
for (status in c("Conferido", "Impresso", "Entregue", "Envelopado")) {
  if (sum(grepl(status, dados$NomeStatusExame)) > 0) {
    index = which(grepl(status, dados$NomeStatusExame)) + 1
    lapply(1:length(desired_tabs), function(x) setCellStyle(wb, sheet = sheet, row = index, col = x, cellstyle = cs_completo))
  }
}
Steps 1 through 4 complete in under three minutes (some sheets have as many as 2,000 lines).
Step 5 takes at least 30 minutes!
Is there a way to speed up the saveWorkbook writing process?
I don't know why, but saving the workbook to a new file takes much less time (under a minute) than overwriting the existing one!

Creating graph in Titan from data in CSV - example wiki-Vote gives error

I am new to Titan. I loaded Titan and successfully ran the GraphOfTheGods example, including the queries given. Next I went on to try bulk loading a CSV file to create a graph, following the steps in Powers of Ten - Part I: http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/
I am getting an error when loading wiki-Vote.txt:
gremlin> g = TitanFactory.open("/tmp/1m")
Backend shorthand unknown: /tmp/1m
I tried:
g = TitanFactory.open('conf/titan-berkeleydb-es.properties')
but get an error in the next step in load-1m.groovy
==>titangraph[berkeleyje:/titan-0.5.4-hadoop2/conf/../db/berkeley]
No signature of method: groovy.lang.MissingMethodException.makeKey() is applicable for argument types: () values: []
Possible solutions: every(), any()
Any hints on what to do next? I am using Groovy for the first time. What kind of Groovy expertise is needed for working with Gremlin?
That blog post is meant for Titan 0.4.x. The API shifted when Titan went to 0.5.x. The same principles discussed in the posts generally apply to data loading but the syntax is different in places. The intention is to update those posts in some form when Titan 1.0 comes out with full support of TinkerPop3. Until then, you will need to convert those code examples to the revised API.
For example, an easy way to create a berkeleydb database is with:
g = TitanFactory.build()
    .set("storage.backend", "berkeleyje")
    .set("storage.directory", "/tmp/1m")
    .open();
Please see the docs here. Then most of the schema creation code (which is the biggest change) is now described here and here.
After much experimenting today, I finally figured it out. A lot of changes were needed:
Use makePropertyKey() instead of makeKey(), and makeEdgeLabel() instead of makeLabel()
Use cardinality(Cardinality.SINGLE) instead of unique()
Building the index is quite a bit more complicated. Use the management system instead of the graph both to make the keys and labels, as well as build the index (see https://groups.google.com/forum/#!topic/aureliusgraphs/lGA3Ye4RI5E)
For posterity, here's the modified script that should work (as of 0.5.4):
g = TitanFactory.build().set("storage.backend", "berkeleyje").set("storage.directory", "/tmp/1m").open()
m = g.getManagementSystem()
k = m.makePropertyKey('userId').dataType(String.class).cardinality(Cardinality.SINGLE).make()
m.buildIndex('byId', Vertex.class).addKey(k).buildCompositeIndex()
m.makeEdgeLabel('votesFor').make()
m.commit()
getOrCreate = { id ->
    def p = g.V('userId', id)
    if (p.hasNext()) {
        p.next()
    } else {
        g.addVertex([userId:id])
    }
}
new File('wiki-Vote.txt').eachLine {
    if (!it.startsWith("#")) {
        (fromVertex, toVertex) = it.split('\t').collect(getOrCreate)
        fromVertex.addEdge('votesFor', toVertex)
    }
}
g.commit()

Titan Graph Queries taking too long to execute

I have a problem with the execution speed of Titan queries.
To be more specific:
I created a properties file for my graph using BerkeleyJE, which looks like this:
storage.backend=berkeleyje
storage.directory=/finalGraph_script/graph
Afterwards, I opened gremlin.bat to open my graph.
I set up all the necessary index keys for my nodes:
m = g.getManagementSystem();
username = m.makePropertyKey('username').dataType(String.class).make()
m.buildIndex('byUsername',Vertex.class).addKey(username).unique().buildCompositeIndex()
m.commit()
g.commit()
(all other keys are created the same way...)
I imported a CSV file containing about 100,000 lines; each line produces at least 2 nodes and some edges. All this is done via batch loading.
That works without a problem.
Then I execute a groupBy query which looks like this:
m = g.V.has("imageLink").groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()
With this query I want, for every node with the property key "imageLink", the number of different "species". "Species" are also nodes, and can be reached by going back along the edge "is_on_image" and then following the edge "is_species".
Well, this also works like a charm for my current nodes. The query takes about 2 minutes on my local PC.
But now to the problem.
My whole dataset is a csv with 10 million entries. The structure is the same as above, and each line is also creating at least 2 nodes and some edges.
With my local PC I can't even import this set; it causes a memory exception after 3 days of loading.
So I tried the same on a server with much more RAM and memory. There the import works and takes about 1 day, but the groupBy fails after about 3 days.
I actually don't know if the groupBy itself fails, or just the connection to the server after that long a time.
So my first Question:
In my opinion, about 15 million nodes shouldn't be that big a deal for a graph database, should it?
Second Question:
Is it normal that it takes so long? Or is there any way to speed it up using indices? I configured the indices as listed above :(
I don't know which exact information you need for helping me, but please just tell me what you need in addition to that.
Thanks a lot!
Best regards,
Ricardo
EDIT 1: The way I'm loading the CSV into the graph:
I'm using this code; I deleted some unnecessary properties, which are also set as a property for some nodes and loaded the same way.
bg = new BatchGraph(g, VertexIDType.STRING, 10000)
new File("annotation_nodes_wNothing.csv").eachLine({ final String line ->def (annotationId,species,username,imageLink) = line.split('\t')*.trim();def userVertex = bg.getVertex(username) ?: bg.addVertex(username);def imageVertex = bg.getVertex(imageLink) ?: bg.addVertex(imageLink);def speciesVertex = bg.getVertex(species) ?: bg.addVertex(species);def annotationVertex = bg.getVertex(annotationId) ?: bg.addVertex(annotationId);userVertex.setProperty("username",username);imageVertex.setProperty("imageLink", imageLink);speciesVertex.setProperty("species",species);annotationVertex.setProperty("annotationId", annotationId);def classifies = bg.addEdge(null, userVertex, annotationVertex, "classifies");def is_on_image = bg.addEdge(null, annotationVertex, imageVertex, "is_on_image");def is_species = bg.addEdge(null, annotationVertex, speciesVertex, "is_species");})
bg.commit()
g.commit()

QSqlTableModel.insertRecord() is very slow

I am using PyQt to insert records into a MySQL database. The code basically looks like:
self.table = QSqlTableModel()
self.table.setTable('mytable')
while True:
    rec = self.table.record()
    values = getValueDictionary()
    for k, v in values.items():
        rec.setValue(k, QVariant(v))
    self.table.insertRecord(-1, rec)
The table currently has ~ 50,000 rows in it.
I have timed each line and found that the insertRecord function is taking ~5 seconds to execute, which is unacceptably slow. Everything else is fast.
For comparison, I also made a version of the code that uses
query = QSqlQuery()
query.prepare("INSERT INTO mytable (f1, f2, ...) VALUES (:f1, :f2, ...)")
query.bindValue(":f1", blah)
query.exec_()
In this case, the whole thing takes only ~ 20 milliseconds, so the delay is not in the database connection as far as I can tell.
I'd really prefer to use the QtSql stuff instead of the awkward MySQL commands. Any ideas on how to add a bunch of rows to a MySQL database with QtSql instead of raw commands, and with reasonable speed?
Thanks,
G
Things to try:
- set your edit strategy to QSqlTableModel.OnManualSubmit
- mass insert rows with insertRows
and see if it helps...
You should use begin before and commit after the loop, or turn off the autocommit feature in MySQL.
This will usually give you a performance increase of 50% or more.
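Putting the two suggestions together, here is a rough, untested sketch: cache all inserts in the model (OnManualSubmit) and push them to MySQL in one submitAll() inside a single transaction. getValueDictionaries() is a hypothetical stand-in for however you collect the rows.
from PyQt4.QtCore import QVariant
from PyQt4.QtSql import QSqlDatabase, QSqlTableModel

db = QSqlDatabase.database()                      # the already-open default connection
model = QSqlTableModel()
model.setTable('mytable')
model.setEditStrategy(QSqlTableModel.OnManualSubmit)

db.transaction()                                  # one transaction instead of one per row
for values in getValueDictionaries():             # hypothetical source of row dicts
    rec = model.record()
    for k, v in values.items():
        rec.setValue(k, QVariant(v))
    model.insertRecord(-1, rec)                   # cached, not yet written
if model.submitAll():                             # all INSERTs are sent in one batch
    db.commit()
else:
    db.rollback()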