saveWorkbook Execution time - r

I'm sorry if this was already answered but I couldn't find.
I'm using the XLConnect package to add new entries to a spreadsheet, but the execution time of saveWorkbook is increasing and delaying all other tasks that depend on the updated spreadsheet.
The work flow is the following:
Query a SQL db for new entries (Load the result using read.table);
Load out-of-date spreadsheet and save each sheets as a entry of a
list;
Add entries to appropriate sheets/list element;
Color lines, using setCellStyel, according to a series of
parameters (example in code bellow);
saveWorkbook
cs_completo=getOrCreateCellStyle(wb, name = "Cs_Completo")
setFillPattern(cs_completo, fill = XLC$FILL.SOLID_FOREGROUND)
setFillForegroundColor(cs_completo, color = XLC$COLOR.LIGHT_GREEN)
for(status in c("Conferido","Impresso","Entregue","Envelopado")){
if(sum(grepl(status,dados$NomeStatusExame))>0){
index=which(grepl(status,dados$NomeStatusExame))+1
lapply(1:length(desired_tabs),function(x) setCellStyle(wb, sheet = sheet, row=index, col=x,cellstyle = cs_completo))}
}
}
Steps 1 through 4 are complete under 3 three minutes (some sheets have as much as 2000 lines).
Step 5 takes at least 30 minutes!
Is there a way to speed up the saveWorkbook writing process?

I don't know why but saving the workbook to a new file take much less time (under a minute) than overwrite the existing one!

Related

Datastage Sequence job- how to process each file at a time if those files are in 7 different folders

DataStage - There are 7 folders in a path and in each folder there are 2 files . for eg : the 2 files are in the folllowing format- filename = test_s1_YYYYMMDD.txt, test_s1_YYYYMMDD.done. The path for these files are user/test/test_s1/
user/test/test_s2/
...
...
..
user/test/test_s7/------here s1,s2...s7 represents the different folders
In these folders the 2 above mentioned files are present , so how can i process each file in a sequence job?
First you need a job to process a file and the filename needs to be a parameter of that job.
For the Sequence level you need two levels - the inner one for the two files within each folder and a outer one for the different directories.
For the inner one you can choose to build a loop with to iterations or simply add the processing job twice to the sequence (which will reduce complexity in case it will always be two files).
The outer Sequence is a loop where you could parameterize the path in a way that the loop counter could be used to generate your 1-7 flexible path addon.
Check out more details on loops here
You can use the loop counter (stage_label.$Counter) to parameterize your job.
Depending on what you want to do with the files, it is an important decision how to process your files. Starting a job (or more) in a sequence for each file can lead to heavy overhead for just starting the jobs. Try loading all files at once in a parallel job using the sequenial file stage.
In the Sequential File Stage, set the appropriate Format. You can also set everything to none to just put each row in one column and process that in a later job. This will make the reading very flexible and forgiving. If your files are all the same structure, define your columns as needed.
To select the files, use File Patterns. In the Options of the Sequential File Stage, choose to have a File Name Column so you can process the filenames in a later job. You might also want to add a Row Number Column.
This method works pretty fast.

Titan Graph Queries taking too long to execute

I have a problem with the executing speed of Titan queries.
To be more specific:
I created a property file for my graph using BerkeleyJe which is looking like this:
storage.backend=berkeleyje
storage.directory=/finalGraph_script/graph
Afterwards, i opened the Gremlin.bat to open my Graph.
I set up all the neccessary Index Keys for my nodes:
m = g.getManagementSystem();
username = m.makePropertyKey('username').dataType(String.class).make()
m.buildIndex('byUsername',Vertex.class).addKey(username).unique().buildCompositeIndex()
m.commit()
g.commit()
(all other keys are created the same way...)
I imported a csv file containing about 100 000 lines, each line is producing at least 2 nodes and some edges. All this is done via Batchloading.
That works without a Problem.
Then i execute a groupBy query which is looking like that:
m = g.V.has("imageLink").groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()
With this query i want for every node with the property key "imageLink" the number of the different "species". "Species" are also nodes, and can be called by going back the edge "is_on_image" and following the edge "is_species".
Well this is also working like a charm, for my recent nodes. This query is taking about 2 minutes on my local PC.
But now to the problem.
My whole dataset is a csv with 10 million entries. The structure is the same as above, and each line is also creating at least 2 nodes and some edges.
With my local PC i cant even import this set, causing an Memory Exception after 3 days of loading.
So I tried the same on a server with much more RAM and memory. There the Import works, and takes about 1 day. But the groupBy failes after about 3 days.
I actually dont know if the groupBy itself fails, or just the Connection to the Server after that long time.
So my first Question:
In my opinion about 15 million nodes shouldn't be that big deal for a graph database, should it?
Second Question:
Is it normal that it takes so long? Or is there anyway to speed it up using indices? I configured the indices as listet above :(
I don't know which exact information you need for helping me, but please just tell me what you need in addition to that.
Thanks a lot!
Best regards,
Ricardo
EDIT 1: The way im loading the CSV in the Graph:
I'm using this code, i deleted some unneccassry properties, which are also set an property for some nodes, loaded the same way.
bg = new BatchGraph(g, VertexIDType.STRING, 10000)
new File("annotation_nodes_wNothing.csv").eachLine({ final String line ->def (annotationId,species,username,imageLink) = line.split('\t')*.trim();def userVertex = bg.getVertex(username) ?: bg.addVertex(username);def imageVertex = bg.getVertex(imageLink) ?: bg.addVertex(imageLink);def speciesVertex = bg.getVertex(species) ?: bg.addVertex(species);def annotationVertex = bg.getVertex(annotationId) ?: bg.addVertex(annotationId);userVertex.setProperty("username",username);imageVertex.setProperty("imageLink", imageLink);speciesVertex.setProperty("species",species);annotationVertex.setProperty("annotationId", annotationId);def classifies = bg.addEdge(null, userVertex, annotationVertex, "classifies");def is_on_image = bg.addEdge(null, annotationVertex, imageVertex, "is_on_image");def is_species = bg.addEdge(null, annotationVertex, speciesVertex, "is_species");})
bg.commit()
g.commit()

Batch loading in DynamoDB

Hi Currently I am loading each row row in Dynamodb which is veryslow.
I have a huge data which i want to load to DynamoDb by JAVA API.
But this takes huge time .For example to load 1 million data it took me 2 days to load to Dynamo.
Is Batch load possible in DynamoDb.I am not finding any information about bulkload or batch load.
Appreciate any help here.
i know it's an old post, but we came up short exploring how to optimize this so embarked on a scientific discovery :)
http://tech.equinox.com/driving-miss-dynamodb/
long and short of it it Hive on EMR is an excellent option (i know, it's "old skool")..
using and tuning these parameters do the trick (see blog for details):
SET dynamodb.throughput.write.percent = x;
SET mapred.reduce.tasks = x;
SET hive.exec.reducers.bytes.per.reducer = x;
SET tez.grouping.split-count = x;

phpExcel Read in chunks so slow and memory errors

I'm trying to read a big excel file of about 20mb to import into mysql.
I've searched across internet and found the "Chunks reading" solution, however is not working... or is SO slowly for me, and I'm not sure why.
This is what im doing:
// .....
// into MyReadFilter class.. this is the most important function:
public function readCell($column, $row, $worksheetName = '') {
// Only read the rows and columns that were configured
if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
if (in_array($column,$this->_columns)) {
return true;
}
}
return false;
}
// .....
$filter = new MyReadFilter(1, 22000);
$chunkSize = 10;
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$objReader->setReadFilter($filter);
$objReader->setReadDataOnly(false); //not sure if this should be true
for ($startRow = 2; $startRow <= 65536; $startRow += $chunkSize) {
echo "Reading";
$filterSubset->setRows($startRow, $chunkSize);
$objPHPExcel = $objReader->load($inputFileName); // this line takes like 40 seconds... for 10 rows?
echo "chunk done! ";
}
However, inside the for, the $objReader->load() is taking like 40 seconds, and in fact, after 2 loops I got a memory error.
If I unset the $objReader inside the for I can make it run about 20 times inside the for... (although it take like 10 minutes) and.. memory error.
I'm wondering why the load function seems to read all the file if im using a filter, also the filter strategy seems to parse all rows and return false for all rows that are not being required... is not posible to abort reading or really read just the required ones?
I've tried a couple of FilterClass and code snippets but got same results...
If You're using a filter, then the Reader is still reading the whole file but only populating the PHPExcel object cells that are defined by the filter; and the Reader still needs to read the whole file each pass of the filtering process, which is what makes it slower.
The Reader needs to read the whole file because of the structure of the raw spreadsheet files. Cell data is not stored with cell formatting, and cell content may also be stored separately. The Reader needs to pull all this together. You can't simply abort the reader when the filter condition is met, because the reader has no way of knowing that it has been completed... if you have a filter that is limiting the load to cells A1:C3, then you can't abort after B3 has been read because you don't know if cell B2 comes after that in the file, or there may be comments associated with cell A1 further on in the file. Until the whole file has been loaded and parsed, you can't start to filter.
The main memory usage in PHPExcel is the PHPExcel object, and specifically the cells (typically about 1k/cell on 32-bit PHP).... the main solution provided to reduce memory here is cell caching. This can (using SQLite caching) reduce cell memory usage to 0k/cell, though at a cost in speed.
The Reader doesn't use much more memory than the size of the Excel file (decompressed) itself, so is normally far less of a memory problem; but this is being addressed (for XML-based spreadsheet formats) by switching from SimpleXML to XMLReader. But it is dependent on the format of the file being loaded; xls format files are very different to xlsx files (xlsx will benefit from this, xls won't) and also dependent on the developers being able to find the time to do this - but it is on the roadmap for the coming year, and work has already started.

PHPExcel - Value from a Cell referencing to another Cell Did Not Obtained Properly

I'm having this problem when I tried to extract information from excel files. Here's my situation, I have 34 Excel files which I received from my various users.
I'm using PHP version 5 to extract from the Excel files. My script will loop for every files, and looping again according to sheet name, and lastly looping again according to cell addresses.
The problem arised when the users had entered into a cell for e.g. =+A1 which means the users referencing the cell value to another cell due to it has the same value with cell A1.
When I checked in mysql (as I saved those for future use) I found from the record for a particular cell is identical with another record obtained from the same cell but in different excel file. What I meant is that, as my php script will loop from one file to another file, the first time PHPExcel read for e.g cell C3 which has some value USD3,000.00 the next files the PHPExcel may go to the same cell C3 but this time the C3 cell contain a formula that referencing to cell A1 ("=+A1" formula)which has value USD5,000.00.
PHP script suppose to record in mysql for USD5,000.00 but it didn't. I suspect that the PHPExcel script did not clear the variable at first round. I've tried unset($objPHPExcel) and destroy the variable but it still happening.
My coding is simple as follows:
if(file_exists($inputFileName))
{
$inputFileType = PHPExcel_IOFactory::identify($inputFileName);
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$objReader->setReadDataOnly(true);
$objPHPExcel = $objReader->load($inputFileName);
//to obtain date from FILE and store in DB for future comparison
$validating_date_reporting = $objPHPExcel->getSheet(0)->getCell('C10')->getValue();
$validating_date_reporting = PHPExcel_Style_NumberFormat::toFormattedString($validating_date_reporting,"YYYY-MMM-DD");
$validating_date_reporting = date('Y-m-d',strtotime($validating_date_reporting));
//first entry
$entry = mysql_query('INSERT INTO `'.$table.'`(`broker_code`, `date`, `date_from_submission`) VALUES("'.$broker_code.'","'.$reporting_date.'","'.$reporting_date.'")') or die(mysql_error());
foreach($cells_array as $caRef=>$sName)
{
foreach($sName as $sNameRef=>$cells)
{
$wksht_page = array_search($caRef, $sheetNameArray);
$cell_column = $wksht_page.'_'.$cells;
echo $inputFileName.' '.$caRef.' '.$cell_column.'<br>';
$value = $objPHPExcel->setActiveSheetIndexByName($caRef)->getCell($cells)->getCalculatedValue();
echo $value.'<br>';
if($value)
{
$isdPortal->LoginDB($db_periodic_submission);
$record = mysql_query('UPDATE `'.$table.'` SET `'.$cell_column.'` = "'.$value.'" WHERE broker_code = "'.$broker_code.'" AND date_from_submission = "'.$validating_date_reporting.'"') or die(mysql_error());
}
}
}
}
I really hope that you can help me out here..
thank you in advance.
PHPExcel holds a calculation cache as well, and this is not cleared when you unset a workbook: it has to be cleared manually using:
PHPExcel_Calculation::flushInstance();
or
PHPExcel_Calculation::getInstance()->clearCalculationCache();
You can also disable calculation caching completely (although this may slow things down if you have a lot of formulae that reference cells containing other formulae) using:
PHPExcel_Calculation::getInstance()->setCalculationCacheEnabled(FALSE);
before you start processing your files
This is because currently PHPExcel uses a singleton for the calculation engine. It is in the roadmap to switch to using a multiton pattern later this year, which will effectively maintain a separate cache for each workbook, alleviating this problem.
EDIT
Note that simply unsetting $objPHPExcel does not work. You need to detach the worksheets before unsetting $objPHPExcel.
$objPHPExcel->disconnectWorksheets();
unset($objPHPExcel);
as described in section 4.3 of the Developer Documentation. And this is the point where you should also add the PHPExcel_Calculation::flushInstance();

Resources