As per the business functionality, we need to read multiple Excel files (in both .xls and .xlsx formats) at different locations in a multi-threaded environment. Each thread is responsible for reading one file. To test the performance, we created two file sets in both .xls and .xlsx formats. One file set has just 20 rows of data, while the other contains 300,000 rows. We are able to successfully read both files in .xls format and load the data into the table. Even for the 20-row .xlsx file, our source code works fine.
But when the execution flow starts reading the larger .xlsx file, the application server is terminated abruptly. When I started tracing down the issue, I found myself facing a strange problem while creating the XSSFWorkbook instance. Refer to the code snippet below:
// FILE is the java.io.File handle of the workbook being read.
// (Uses org.apache.poi.openxml4j.opc.OPCPackage, org.apache.poi.xssf.usermodel.XSSFWorkbook
//  and org.apache.poi.xssf.streaming.SXSSFWorkbook.)
OPCPackage opcPackage = OPCPackage.open(FILE);
System.out.println("Created OPCPackage instance.");
XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
System.out.println("Created XSSFWorkbook instance.");
// Wrap the XSSF workbook in a streaming workbook, keeping 1000 rows in memory at a time.
SXSSFWorkbook sxssfWorkbook = new SXSSFWorkbook(workbook, 1000);
System.out.println("Created SXSSFWorkbook instance.");
Output
Process XLSX file EXCEL_300K.xlsx start.
Process XLSX file EXCEL.xlsx start.
Created OPCPackage instance.
Created OPCPackage instance.
Created XSSFWorkbook instance.
Created SXSSFWorkbook instance.
Process XLSX file EXCEL.xlsx end.
For the larger file set, the execution hangs at
XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
causing a heap space issue. Please help me fix this issue.
Thanks in advance.
Thanks,
Sankar.
After trying a lot of solutions, I found out that processing XLSX files requires a huge amount of memory. Still, using the POI 3.12 library has multiple advantages:
It processes Excel files faster.
It has more APIs to handle Excel files, such as closing a workbook and opening an Excel file using a File instance.
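For illustration, here is a minimal sketch (the wrapper class is a placeholder, not part of the original code) of the two POI 3.12 APIs mentioned above: opening the workbook from a File instance rather than an InputStream, and closing the workbook when done.

import java.io.File;
import java.io.IOException;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxOpenSketch {
    public static void main(String[] args) throws IOException, InvalidFormatException {
        // Opening from a File lets POI read the package from disk instead of
        // buffering the whole stream in memory, which lowers the heap footprint.
        OPCPackage pkg = OPCPackage.open(new File("EXCEL_300K.xlsx"));
        XSSFWorkbook workbook = new XSSFWorkbook(pkg);
        try {
            System.out.println("Sheets: " + workbook.getNumberOfSheets());
        } finally {
            // close() releases the underlying package resources.
            workbook.close();
        }
    }
}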
Related
I have a question about Auto Loader writeStream.
I have the below use case:
A few days ago, I uploaded 2 CSV files into the Databricks file system, then read them and wrote them to a table with Auto Loader.
Today, I found that the files uploaded earlier contained wrong (fake) data, so I deleted those old CSV files and uploaded 2 new, correct CSV files.
Then I read and wrote the new files with Auto Loader streaming.
I found that the stream could read the data from the new files successfully, but failed to write it to the table via writeStream.
I then tried deleting the checkpoint folder with all its subfolders and files, re-created the checkpoint folder, and ran the read and write stream again; this time the data was written to the table successfully.
Questions:
Since Auto Loader has detected the new files, why can't it write them to the table successfully until I delete the checkpoint folder and create a new one?
Auto Loader works best when new files are ingested into a directory; overwriting files might give unexpected results. I haven't worked with the option cloudFiles.allowOverwrites set to true yet, but it might help you (see the documentation below).
On the question about readStream detecting the overwritten files while writeStream does not: this is because of the checkpoint. The checkpoint is always linked to the writeStream operation. If you do
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
      .load("filepath"))
display(df)
then you will always see the data of all the files in the directory. If you use writeStream, you need to add .option("checkpointLocation", "<path_to_checkpoint>"). This checkpoint remembers that the (overwritten) files have already been processed, which is why the overwritten files will only be processed again after you delete the checkpoint.
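For illustration, a minimal sketch of the full read and write with an explicit checkpoint; the paths and target table name are placeholders, and cloudFiles.allowOverwrites is the opt-in option mentioned above, not something I have verified myself:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
      # opt-in: allow already-processed files to be picked up again after an overwrite
      .option("cloudFiles.allowOverwrites", "true")
      .load("filepath"))

(df.writeStream
   # the checkpoint tracks which files have already been written to the table
   .option("checkpointLocation", "<path_to_checkpoint>")
   .toTable("<target_table>"))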
Here is some more documentation about the topic:
https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/faq#does-auto-loader-process-the-file-again-when-the-file-gets-appended-or-overwritten
cloudFiles.allowOverwrites https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options
I have a question about uploading a 10 MB source data file.
I tried multiple ways to upload it: its original version, a zipped version, and a txt version.
However, every time I click the uploaded data source file, I see the following error message:
out of memory.
I need your advice on how to resolve this.
EDIT: This is only happening on a PC! There is no such issue when reading the data into R on a Mac.
I have to read files that are being updated on a day-to-day basis and stored on a shared drive.
If I have an R session open in which one of these files has been read, the file cannot be saved with new edits until I close my R session.
Is there a way to read a file into R while still allowing others to save new edits through Excel?
Edit: I am using the xlsx package and its read.xlsx function.
I have an application where I need to update the source data periodically. The source data file is a CSV file normally stored in the project directory and read with read.csv. The CSV file is changed every day with the updates; the name of the file does not change, just a few cases are added.
I need the application to re-read the source CSV file with some periodicity (e.g. once per day). I can do it with the reactiveFileReader function, and it works when I run the application from RStudio, but not after I deploy the application to the web with shinyapps.io.
Can this even be done when I am not using my own server but shinyapps.io?
I have one .xlsx file containing 20 sheets; the file size is approximately 500 KB.
When I created the .xlsx file, I did not use any caching method, so my worksheets were created using 'cache_in_memory'.
I am running out of memory now (my server has approximately 500 MB of RAM).
Can I cache the worksheets' cells to disk when memory is not available?
I read in the documentation that after creating the worksheet you can't change the caching method.
Please help me. I want to use the disk when memory is not available to the PHP script. Please tell me, is that possible?
Caching isn't a feature of the Excel workbook, but of PHPExcel. Just because you created a workbook once without cell caching doesn't mean you can't enable it when you read that workbook again.
You need to enable cell caching before either loading the workbook or instantiating a new workbook within your script.
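For illustration, a minimal sketch (the file name and cache directory are placeholders) of enabling a disk-backed cell cache before loading an existing workbook with PHPExcel:

<?php
require_once 'Classes/PHPExcel.php';

// Select a disk-backed cache method BEFORE loading or creating any workbook.
$cacheMethod   = PHPExcel_CachedObjectStorageFactory::cache_to_discISAM;
$cacheSettings = array('dir' => '/tmp/phpexcel_cache');   // must be writable by the PHP process
PHPExcel_Settings::setCacheStorageMethod($cacheMethod, $cacheSettings);

// Now load the existing 20-sheet workbook; its cells are cached to disk, not RAM.
$objPHPExcel = PHPExcel_IOFactory::load('workbook_with_20_sheets.xlsx');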