I have an Excel file that contains the filenames of a set of images. The paths of these images are stored in an ObservableList of File objects, built from the folder that contains all of the images. My goal is to create a hyperlink for each filename by matching it against the pool of image files.
How can I iterate faster through a large collection of File objects in order to get their paths?
For example:
Image name from Excel:
ABC_0001
The full path from the collection must be:
C:\Users\admin\Desktop\Images\ABC_0001.jpg
To get the full paths, I iterate using a Stream.
My procedure:
Extract the data using Apache POI.
Stream through the image collection, comparing each file's base filename against the extracted data.
Take the match and store its full path on the object via getAbsolutePath().
Code:
//storage during iteration
ObservableList<DetailedData> dataCollection = FXCollections.observableArrayList();
//Image collection containing over 13k Images listed via commons-io
ObservableList<File> IMAGE_COLLECTION = FXCollections.observableArrayList(FileUtils.listFiles(browsedFOLDER, new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true));
//Sheet data
Sheet sheet1 = wb.getSheetAt(0);
for (Row row: sheet1)
{
DetailedData data = new DetailedData();
//extracted data from excel
String FILENAME = row.getCell(0,Row.MissingCellPolicy.CREATE_NULL_AS_BLANK).getStringCellValue();
//to be filled up based on stream result.
String IMAGE_SOURCE = null;
//stream code with the help of commons-io
File IMAGE = IMAGE_COLLECTION.stream().filter(e -> FilenameUtils.getBaseName(e.getName()).toLowerCase().equals(FILENAME.toLowerCase())).findFirst().orElse(null);
if (IMAGE != null)
IMAGE_SOURCE = IMAGE.getAbsolutePath();
data.setFileName(FILENAME);
data.setFullPath(IMAGE_SOURCE);
dataCollection.add(data);
}
Result:
Excel rows = 9,400
Image files = 13,000
Iteration time = 120,000 ms
Do these results look normal, or can this be made faster?
I tried using parallelStream() and the results were faster, but it consumes much more CPU.
The code below should speed your code up a lot, but first a few questions about your code.
ObservableList<DetailedData> dataCollection = FXCollections.observableArrayList(); Why are you using ObservableList? And why is this a list of DetailedData and not File, given that DetailedData only has setFileName and setFullPath? File already has these.
ObservableList<File> IMAGE_COLLECTION = FXCollections.observableArrayList(FileUtils.listFiles(browsedFOLDER, new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true)); Again, why ObservableList?
These two are small things, but I am curious.
So what I think you should do is use a Map, which turns each filename lookup into a constant-time hash lookup instead of a linear scan over 13k files. Your code should look something like the code below.
//storage during iteration
List<DetailedData> dataCollection = new ArrayList<>();
//Image collection containing over 13k Images listed via commons-io
List<File> IMAGE_COLLECTION = new ArrayList<>(FileUtils.listFiles(new File("C:\\Users\\blj0011\\Pictures"), new String[]{"JPG", "JPEG", "TIF", "TIFF", "jpg", "jpeg", "tif", "tiff"}, true));
//Use this to map file name to file
Map<String, File> map = new HashMap<>();
//Use this to add data to the map
IMAGE_COLLECTION.forEach(file -> map.put(file.getName().substring(0, file.getName().lastIndexOf('.')).toLowerCase(), file));
for (Row row: sheet1)
{
//extracted data from excel
String FILENAME = row.getCell(0,Row.MissingCellPolicy.CREATE_NULL_AS_BLANK).getStringCellValue();
//If the map contains the file name, create a DetailedData object, set its fields, then add it to the dataCollection list.
if (map.containsKey(FILENAME.toLowerCase()))
{
DetailedData data = new DetailedData();
data.setFileName(FILENAME);
data.setFullPath(map.get(FILENAME.toLowerCase()).getAbsolutePath());
dataCollection.add(data);
}
}
See the comments in the code.
I still believe this could be cleaned up a little more if you used List<File> dataCollection = new ArrayList<>().
If you really want to speed up your search, you should try not to do repeatedly what could be done just once. For example, you could use two loops: the first to prepare your search and the second to actually do the search. Inside your filter you call FilenameUtils.getBaseName and do two conversions to lower case for every row. It would be better to do these things only once, in the first loop, and store the resulting Strings in a list; in the second loop you then do the search on this list, as in the sketch below.
I am also wondering why you use ObservableLists here. A simple List would do as well.
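A minimal sketch of that two-pass idea, reusing the names from the question (illustrative only):
//First pass: compute each file's lower-cased base name once.
List<String> baseNames = new ArrayList<>();
for (File file : IMAGE_COLLECTION)
{
    baseNames.add(FilenameUtils.getBaseName(file.getName()).toLowerCase());
}
//Second pass: search the pre-computed list for each Excel row.
for (Row row : sheet1)
{
    String FILENAME = row.getCell(0, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK).getStringCellValue();
    int index = baseNames.indexOf(FILENAME.toLowerCase());
    File IMAGE = (index >= 0) ? IMAGE_COLLECTION.get(index) : null;
    //...build the DetailedData entry as before...
}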
I've tested another approach to this slow iteration.
It seems that the cause is declaring the Stream repeatedly inside the for-each.
I tried Baeldung's Supplier solution and declared it outside the loop, together with parallelStream().
Sample Code:
Supplier<Stream<File>> streamSupplier = () -> imageCollection.parallelStream();
for (Row row : sheet)
{
File IMAGE = streamSupplier.get().filter(e -> FilenameUtils.getBaseName(e.getName()).toLowerCase().equals(FILENAME.toLowerCase())).findFirst().orElse(null);
if (IMAGE != null)
IMAGE_SOURCE = IMAGE.getAbsolutePath();
}
The result went down to 45,000 ms.
Please correct me if my approach is not right.
I have a large Oracle query result and want to upload it using HTTP POST.
But with memory constraints, I cannot read all rows into memory at once.
So I read a few rows at a time, but I can't find a way to start a chunked upload in R.
If it were C#, it would go something like this:
var req = (HttpWebRequest)WebRequest.Create("http://myserver/upload");
req.SendChunked = true;
req.Method = "POST";
using (var s = req.GetRequestStream())
{
    while (queryResult.hasRow())
    {
        byte[] buffer = queryResult.readRow();
        s.Write(buffer, 0, buffer.Length);
    }
}
var response = req.GetResponse();
Is there anything equivalent in R?
There is a requirement to upload an Excel file on an .aspx page, read its data, and store it in a database. In the Excel file, the user can format a specific word or sentence in any cell. We want to preserve that formatting in the form of HTML tags; that is, we want to read the data with its formatting as HTML. How can this be achieved?
Probably the best way would be to use the Excel interop assemblies under the Microsoft.Office.Interop.Excel namespace. The code would be something like this:
Excel.Application excel = new Excel.Application();
excel.Workbooks.Open(fileName);
Excel.Worksheet activeWorksheet = (Excel.Worksheet)excel.ActiveSheet;
for (int i = 1; i < 100; i++){
for (int j = 1; j < 100; j++){
Excel.Range currentCell = (Excel.Range)activeWorksheet.Cells[i, j];
// formatting
var fontFamily = currentCell.Font.Name;
var italics = currentCell.Font.Italic;
var color = currentCell.Font.Color;
}
}
This opens an Excel file and loops through the first 99 rows and columns.
But this could be too intensive, since it would open Excel for each document - not sure what kind of performance is required. There are other libraries available that offer simple reading and writing of Excel files, but I'm not sure if they offer reading the formats and things like that. You can find some more info about those tools here: Import and export excel. I just checked, and it seems EPPlus supports cell styling, so that might be an alternative; see the sketch below.
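A minimal sketch with EPPlus (the file path and cell address are illustrative). EPPlus exposes per-run formatting through a cell's RichText collection, which maps naturally onto HTML tags:
using System.IO;
using System.Linq;
using System.Text;
using OfficeOpenXml;

var package = new ExcelPackage(new FileInfo(@"C:\temp\input.xlsx")); // hypothetical path
ExcelWorksheet sheet = package.Workbook.Worksheets.First();
var cell = sheet.Cells[1, 1];
var html = new StringBuilder();
if (cell.IsRichText){
    // Each rich-text run carries its own font settings.
    foreach (var run in cell.RichText){
        string text = run.Text; // real code should HTML-encode this
        if (run.Bold) text = "<b>" + text + "</b>";
        if (run.Italic) text = "<i>" + text + "</i>";
        html.Append(text);
    }
}
else{
    html.Append(cell.Text);
}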
I'm using a technique from another Stack Overflow question to write a CSV file to the Response output for a User to Open/Save. The file looks good in Notepad, but when I open it in Excel the accented characters are garbage. I assumed this was something to do with the character encoding, so I tried manually setting it to UTF-8 (the default for StreamWriter). Here is the code:
// This fills a list to enumerate - each record is one CSV line
List<FullRegistrationInfo> fullUsers = GetFullUserRegistrations();
context.Response.Clear();
context.Response.AddHeader("content-disposition",
"attachment; filename=registros.csv");
context.Response.ContentType = "text/csv";
context.Response.Charset = "utf-8";
using (StreamWriter writer = new StreamWriter(context.Response.OutputStream))
{
for (int i = 0; i < fullUsers.Count(); i++)
{
// Get the record to process
FullRegistrationInfo record = fullUsers[i];
// If it's the first record then write header
if (i == 0)
writer.WriteLine(Encoding.UTF8.GetString(
Encoding.UTF8.GetPreamble()) +
"User, First Name, Surname");
writer.WriteLine(record.User + "," +
record.FirstName + "," +
record.Surname);
}
}
context.Response.End();
Any ideas as to what else I would need to do to correctly encode the file so Excel can view the accented characters?
You may have to write a UTF-8 indicator called a byte-order mark (BOM) at the beginning of the output to notify Excel about the UTF-8-ness. Silly Excel.
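One way to do that, sketched against the code from the question: construct the StreamWriter with a UTF8Encoding that emits the BOM, instead of pasting the preamble into the header line:
using (StreamWriter writer = new StreamWriter(context.Response.OutputStream,
                                              new UTF8Encoding(true))) // true = emit the BOM
{
    writer.WriteLine("User, First Name, Surname");
    // ...write the data rows as before...
}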
My input array is 140 bytes; outArray is 512 bytes... not what I wanted. Also, I don't know if I am encrypting properly. Is the code below correct? How do I fix this so outArray is the real size and not padded with many trailing zeros?
var compress = new SevenZipCompressor();
compress.CompressionLevel = CompressionLevel.Ultra;
compress.CompressionMethod = CompressionMethod.Lzma;
compress.ZipEncryptionMethod = ZipEncryptionMethod.Aes256;
var sIn = new MemoryStream(inArray);
var sOut = new MemoryStream();
compress.CompressStream(sIn, sOut, "a");
byte[] outArray = sOut.GetBuffer();
You are getting the whole MemoryStream buffer; you need to use ToArray():
byte[] outArray = sOut.ToArray();
This will remove the trailing zeros, but you may still get an array bigger than the input. There is overhead with compression/encryption, which is probably bigger than 140 bytes.
Many compression algorithms (I'm unfamiliar with the specific details for 7-zip) generate output with a minimum size. 7-zip performs best on large input data sets, and 140 bytes is not "large". You might do better with something like gzip or lzo; the sketch below shows a gzip round trip for comparison. What other compression algorithms have you tried?
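For comparison, a minimal sketch using .NET's built-in GZipStream (note that gzip also adds a fixed header, so tiny inputs can still come out larger, and gzip on its own does not encrypt):
using System.IO;
using System.IO.Compression;

static byte[] Compress(byte[] inArray)
{
    using (var output = new MemoryStream())
    {
        // leaveOpen: true so the MemoryStream is still readable after the gzip stream closes
        using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
        {
            gzip.Write(inArray, 0, inArray.Length);
        }
        return output.ToArray(); // ToArray(), not GetBuffer()
    }
}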