PHPExcel read in chunks is very slow and gives memory errors - phpexcel

I'm trying to read a big Excel file of about 20 MB to import into MySQL.
I've searched across the internet and found the "chunked reading" solution; however, it is not working... or it is SO slow for me, and I'm not sure why.
This is what I'm doing:
// .....
// inside the MyReadFilter class... this is the most important method:
public function readCell($column, $row, $worksheetName = '') {
    // Only read the rows and columns that were configured
    if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
        if (in_array($column, $this->_columns)) {
            return true;
        }
    }
    return false;
}
// .....
// .....
$filter = new MyReadFilter(1, 22000);
$chunkSize = 10;
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$objReader->setReadFilter($filter);
$objReader->setReadDataOnly(false); // not sure if this should be true
for ($startRow = 2; $startRow <= 65536; $startRow += $chunkSize) {
    echo "Reading";
    $filter->setRows($startRow, $chunkSize);
    $objPHPExcel = $objReader->load($inputFileName); // this line takes like 40 seconds... for 10 rows?
    echo "chunk done! ";
}
However, inside the loop, $objReader->load() takes about 40 seconds, and in fact after 2 iterations I get a memory error.
If I unset $objReader inside the loop I can make it run for about 20 iterations (although it takes about 10 minutes)... and then a memory error.
I'm wondering why the load function seems to read the whole file even though I'm using a filter; the filter strategy also seems to parse all rows and return false for every row that is not required... is it not possible to abort reading, or to really read just the required rows?
I've tried a couple of filter classes and code snippets but got the same results...

If you're using a filter, the Reader still reads the whole file but only populates the PHPExcel object with the cells defined by the filter; and the Reader still needs to read the whole file on each pass of the filtering process, which is what makes it slow.
The Reader needs to read the whole file because of the structure of the raw spreadsheet files. Cell data is not stored with cell formatting, and cell content may also be stored separately. The Reader needs to pull all this together. You can't simply abort the reader when the filter condition is met, because the reader has no way of knowing that it has been completed... if you have a filter that is limiting the load to cells A1:C3, then you can't abort after B3 has been read because you don't know if cell B2 comes after that in the file, or there may be comments associated with cell A1 further on in the file. Until the whole file has been loaded and parsed, you can't start to filter.
The main memory usage in PHPExcel is the PHPExcel object, and specifically the cells (typically about 1k/cell on 32-bit PHP).... the main solution provided to reduce memory here is cell caching. This can (using SQLite caching) reduce cell memory usage to 0k/cell, though at a cost in speed.
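As a rough sketch only (the SQLite3 backend assumes the php sqlite3 extension is available; other backends such as cache_to_phpTemp exist as well), cell caching is enabled via PHPExcel_Settings before the first workbook is loaded:
// Sketch: enable SQLite-backed cell caching before any load() call
$cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_to_sqlite3;
if (!PHPExcel_Settings::setCacheStorageMethod($cacheMethod)) {
    die('SQLite3 cell caching is not available' . PHP_EOL);
}

// $inputFileType, $filter and $inputFileName as in the question
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$objReader->setReadFilter($filter);
$objPHPExcel = $objReader->load($inputFileName);  // cells are now held in the SQLite cache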
The Reader itself doesn't use much more memory than the size of the (decompressed) Excel file, so it is normally far less of a memory problem; but this is being addressed (for XML-based spreadsheet formats) by switching from SimpleXML to XMLReader. It is dependent on the format of the file being loaded; xls format files are very different from xlsx files (xlsx will benefit from this, xls won't), and it is also dependent on the developers being able to find the time to do it - but it is on the roadmap for the coming year, and work has already started.
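Finally, whatever caching method you use, it helps to dispose of each chunk's workbook before loading the next one, so the old cells can actually be garbage collected. The following is only a minimal sketch built around the question's loop (the MyReadFilter class and its setRows() method are assumed from the question, and this does not avoid the full re-read on every load()):
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$objReader->setReadDataOnly(true);       // values only, if formatting isn't needed
$filter = new MyReadFilter(1, 22000);
$objReader->setReadFilter($filter);
$chunkSize = 1000;                       // larger chunks mean fewer full re-reads of the file

for ($startRow = 2; $startRow <= 22000; $startRow += $chunkSize) {
    $filter->setRows($startRow, $chunkSize);           // restrict the filter to this chunk
    $objPHPExcel = $objReader->load($inputFileName);   // still parses the whole file each pass

    // ... copy this chunk's rows into MySQL here ...

    $objPHPExcel->disconnectWorksheets();  // break the internal cell/worksheet references
    unset($objPHPExcel);                   // so the previous chunk can be garbage collected
}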

Related

Rascal MPL get line count of file without loading its contents

Is there a more efficient way than
int fileSize = size(readFileLines(fileLoc));
to get the total number of lines in a file? I presume this code has to read the entire file first, which could become costly for huge files.
I have looked into IO and Loc to see whether some of this info might be saved in conjunction with the file.
This is the way, unless you'd like to call wc -l via util::ShellExec 😁
Apart from streaming the file and saving some memory, counting lines is always linear in the size of the file, so you won't win much time.

How to delete a row in a csv file with powershell in R?

Good morning,
I'm new to PowerShell and I'd like to ask if somebody can help me.
I have a big csv file of around 3.5 GB and my goal is to load it with fread (a data.table function) into the R environment, but this function throws an error.
> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)
The error is:
Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line
I tried different ways to solve the problem (I put fill=TRUE in my code, but it doesn't work), but I couldn't manage it.
After some research I found this kind of solution (still to be run from R):
>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")
The point of using PowerShell from R is to delete a specific row and then load the file with fread.
My question is:
How can I delete a specific row in a csv with PowerShell, but without specifying the length of the matrix?
Thank you in advance for every type of help.
Francesco
I'd hazard a guess that the invalid row's location is not known. In such a case, it might be sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit from manipulation, it can be done before reading it into R.
A file as large as 3.5 GiB is a bit on the large side to read into memory as such. Sure, it can be done in the days of 64-bit systems, but for simple row processing it's unwieldy. A scalable solution uses .Net methods and a row-by-row approach.
To process a file on a row-by-row basis, use .Net methods for efficient row reading. A StringBuilder is created to store rows that contain valid data; the others are discarded. The StringBuilder is flushed to disk every so often. Even in the days of SSDs, a write operation for each row is relatively slow compared to writing in bulk, say, 10,000 rows at a time.
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
$colonCount = 30
while($null -ne ($line = $reader.ReadLine())) {
    # Split the line on semicolons
    $elements = $line -split ';'
    # If there were $colonCount elements, add the line to the builder
    if($elements.Count -eq $colonCount) {
        # If $line's contents need modifications, do it here
        # before adding it into the builder
        [void]$sb.AppendLine($line)
        ++$i
    }
    # Write builder contents into the file every now and then
    if($i -ge $MaxRows) {
        Add-Content "MyCleanCsvFile.csv" $sb.ToString()
        [void]$sb.Clear()
        $i = 0
    }
}
# Flush the builder after the loop if there's data left
if($sb.Length -gt 0) {
    Add-Content "MyCleanCsvFile.csv" $sb.ToString()
}
$reader.Close()  # Release the file handle
This is easily done in PowerShell: read the csv into a generic list, remove the line, and write it back:
Add-Type -AssemblyName System.Collections
[System.Collections.Generic.List[string]]$csvList = @()
$csvFile = 'C:\test\myfile.csv'
$csvList = [System.IO.File]::ReadLines( $csvFile )
$lineToDelete = 2
[void]$csvList.RemoveAt( $lineToDelete - 1 )
[System.IO.File]::WriteAllLines( $csvFile, $csvList ) | Out-Null
vonPryz's helpful answer offers the best solution, given the size of your input file.
The following works too, but will be slow - in general, due to the overhead of using a pipeline, but also because Get-Content itself is slow, since it decorates each line it reads with additional properties (see the green-lighted, but not yet implemented GitHub suggestion #7537):
# Exclude line number 458945 (0-based index 458944)
Get-Content C:/a/b/c/file.csv | Select-Object -SkipIndex 458944 > output.csv
The beneficial flip side of using the pipeline is that it acts as a memory throttle, so the above command can be used to process arbitrarily large files (though it may take a long time).

saveWorkbook Execution time

I'm sorry if this was already answered, but I couldn't find it.
I'm using the XLConnect package to add new entries to a spreadsheet, but the execution time of saveWorkbook is increasing and delaying all other tasks that depend on the updated spreadsheet.
The workflow is the following:
1. Query a SQL db for new entries (load the result using read.table);
2. Load the out-of-date spreadsheet and save each sheet as an entry of a list;
3. Add entries to the appropriate sheets/list elements;
4. Color lines, using setCellStyle, according to a series of parameters (example in the code below);
5. saveWorkbook
cs_completo = getOrCreateCellStyle(wb, name = "Cs_Completo")
setFillPattern(cs_completo, fill = XLC$FILL.SOLID_FOREGROUND)
setFillForegroundColor(cs_completo, color = XLC$COLOR.LIGHT_GREEN)
for(status in c("Conferido", "Impresso", "Entregue", "Envelopado")){
    if(sum(grepl(status, dados$NomeStatusExame)) > 0){
        index = which(grepl(status, dados$NomeStatusExame)) + 1
        lapply(1:length(desired_tabs), function(x) setCellStyle(wb, sheet = sheet, row = index, col = x, cellstyle = cs_completo))
    }
}
Steps 1 through 4 complete in under three minutes (some sheets have as many as 2000 lines).
Step 5 takes at least 30 minutes!
Is there a way to speed up the saveWorkbook writing process?
I don't know why, but saving the workbook to a new file takes much less time (under a minute) than overwriting the existing one!

U-SQL Ignore Empty Files

I receive a daily dump of files from a data provider. On occasion we receive empty files (20 bytes). Is there any way to automatically avoid processing or skip these files?
I have tried:
USING Extractors.Csv(skipFirstNRows:1, silent:true);
But I seem to get a vertex failure related to what I believe are the empty files.
We recently added a FILE.LENGTH property as a computed virtual column that you can use to filter out files of a certain size.
For example, the following should only operate on files that are larger than 20 bytes:
@data =
    EXTRACT
        // ... columns to extract
        , file_sz = FILE.LENGTH()
    FROM "/mydata/{*}"
    USING Extractors.Csv();

@res =
    SELECT *
    FROM @data
    WHERE file_sz > 20;

PHPExcel - Value from a cell referencing another cell is not obtained properly

I'm having a problem when I try to extract information from Excel files. Here's my situation: I have 34 Excel files which I received from my various users.
I'm using PHP version 5 to extract data from the Excel files. My script loops over every file, then loops again over sheet names, and lastly loops again over cell addresses.
The problem arises when a user has entered into a cell something like =+A1, which means the user references another cell because it has the same value as cell A1.
When I checked in MySQL (as I saved those values for future use) I found that the record for a particular cell is identical to another record obtained from the same cell but in a different Excel file. What I mean is that, as my PHP script loops from one file to the next, the first time PHPExcel reads e.g. cell C3 it holds some value, USD3,000.00; in the next file PHPExcel may go to the same cell C3, but this time C3 contains a formula referencing cell A1 (an "=+A1" formula) whose value is USD5,000.00.
The PHP script is supposed to record USD5,000.00 in MySQL, but it didn't. I suspect that PHPExcel did not clear the variable from the first round. I've tried unset($objPHPExcel) to destroy the variable, but it still happens.
My code is as simple as follows:
if(file_exists($inputFileName))
{
    $inputFileType = PHPExcel_IOFactory::identify($inputFileName);
    $objReader = PHPExcel_IOFactory::createReader($inputFileType);
    $objReader->setReadDataOnly(true);
    $objPHPExcel = $objReader->load($inputFileName);
    //to obtain date from FILE and store in DB for future comparison
    $validating_date_reporting = $objPHPExcel->getSheet(0)->getCell('C10')->getValue();
    $validating_date_reporting = PHPExcel_Style_NumberFormat::toFormattedString($validating_date_reporting,"YYYY-MMM-DD");
    $validating_date_reporting = date('Y-m-d',strtotime($validating_date_reporting));
    //first entry
    $entry = mysql_query('INSERT INTO `'.$table.'`(`broker_code`, `date`, `date_from_submission`) VALUES("'.$broker_code.'","'.$reporting_date.'","'.$reporting_date.'")') or die(mysql_error());
    foreach($cells_array as $caRef=>$sName)
    {
        foreach($sName as $sNameRef=>$cells)
        {
            $wksht_page = array_search($caRef, $sheetNameArray);
            $cell_column = $wksht_page.'_'.$cells;
            echo $inputFileName.' '.$caRef.' '.$cell_column.'<br>';
            $value = $objPHPExcel->setActiveSheetIndexByName($caRef)->getCell($cells)->getCalculatedValue();
            echo $value.'<br>';
            if($value)
            {
                $isdPortal->LoginDB($db_periodic_submission);
                $record = mysql_query('UPDATE `'.$table.'` SET `'.$cell_column.'` = "'.$value.'" WHERE broker_code = "'.$broker_code.'" AND date_from_submission = "'.$validating_date_reporting.'"') or die(mysql_error());
            }
        }
    }
}
I really hope that you can help me out here.
Thank you in advance.
PHPExcel holds a calculation cache as well, and this is not cleared when you unset a workbook: it has to be cleared manually using:
PHPExcel_Calculation::flushInstance();
or
PHPExcel_Calculation::getInstance()->clearCalculationCache();
You can also disable calculation caching completely (although this may slow things down if you have a lot of formulae that reference cells containing other formulae) using:
PHPExcel_Calculation::getInstance()->setCalculationCacheEnabled(FALSE);
before you start processing your files.
This is because currently PHPExcel uses a singleton for the calculation engine. It is in the roadmap to switch to using a multiton pattern later this year, which will effectively maintain a separate cache for each workbook, alleviating this problem.
EDIT
Note that simply unsetting $objPHPExcel does not work. You need to detach the worksheets before unsetting $objPHPExcel.
$objPHPExcel->disconnectWorksheets();
unset($objPHPExcel);
as described in section 4.3 of the Developer Documentation. This is also the point where you should add the PHPExcel_Calculation::flushInstance(); call.
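Putting those pieces together, a per-file cleanup might look something like the sketch below (the $excelFiles array and the loop structure are only illustrative, not your exact code):
foreach ($excelFiles as $inputFileName) {
    $inputFileType = PHPExcel_IOFactory::identify($inputFileName);
    $objReader = PHPExcel_IOFactory::createReader($inputFileType);
    $objReader->setReadDataOnly(true);
    $objPHPExcel = $objReader->load($inputFileName);

    // ... read cells with getCalculatedValue() and write them to MySQL ...

    // Clean up before moving to the next file, so no cells or cached
    // formula results leak into the next workbook's calculations
    $objPHPExcel->disconnectWorksheets();
    unset($objPHPExcel);
    PHPExcel_Calculation::flushInstance();
}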

Resources