i-net PDFC unable to recognize characters

I was using PDFC to compare two files with a ConsoleResultHandle.
Both files were essentially identical; I had created one by copy-pasting the other.
After comparing them, PDFC reported:
DEBUG - Unsupport CMap format: 6
I checked the differences folder (where the differences are written out), and in the PNG files it generates all of the characters show up as boxes (unsupported characters), as the debug message above suggests.
Did anyone else encounter the same problem?

This problem can occur if there are issues parsing the PDF file. The best way to proceed, as gamma mentions in the comment, is to contact our support team at pdfc#inetsoftware.de with the PDFs in question - we usually answer within 24 hours and will see if we can fix the problem.

Related

Multibyte character reading problem in IronPdf

I am trying IronPDF. I want to read PDF metadata with IronPDF and insert it into a database. However, some "ı" characters in the metadata are not read by IronPDF; spaces are left in their place. Here is my code sample:
var md = PdfDocument.FromFile("___PATH OF PDF FILE___");
var article_title = md.MetaData.Title;
When I copy and paste the string into Notepad++, the "ı" characters are simply missing, and the application view shows the same result.
Is there a way to solve this problem, or is this a bug in IronPDF? If everything goes well I am of course thinking of buying it; but if it fails on the first try, I will move on to iTextSharp.
EDIT: First of all, my apologies; Windows took me by surprise. I struggled all day to get a new system up, and unfortunately Visual Studio etc. are still not installed. I have added one of the files I had problems with below, and the IronPDF version shows as 2019.7.0.0.
PDF file: https://yadi.sk/d/HwP9JWRWTzMlSA
First of all, since you hadn't provided us with a sample PDF to work with, I googled for some Turkish PDF documents whose metadata contains Turkish characters. This is the file that I came up with: link
As you can see, the Author metadata field of that file contains the Turkish character ı.
Then I created a dotnet fiddle in order to test this file using IronPDF (with the latest available version, since you hadn't specified one):
sample using IronPDF
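The fiddle boils down to roughly the following sketch ("sample.pdf" is a placeholder for the downloaded file, and I am assuming a MetaData.Author property, mirroring the MetaData.Title call in the question):
using System;
using IronPdf;
// "sample.pdf" stands in for the downloaded test document.
var doc = PdfDocument.FromFile("sample.pdf");
// Read the Author metadata field, the one containing the Turkish "ı".
Console.WriteLine(doc.MetaData.Author);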
The output from this sample is ElifCakroglu, which shows the exact same symptom when copied to Notepad++: the ı is missing.
Playing with the encodings did not help resolve this issue, so I created another dotnet fiddle to test your alternative solution, iTextSharp: sample using iTextSharp
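That fiddle does roughly the following, as a sketch using the classic iTextSharp 5.x PdfReader and its Info metadata dictionary (the file name is again a placeholder):
using System;
using iTextSharp.text.pdf;
// Open the same sample PDF and read the Author entry from the
// document information dictionary.
var reader = new PdfReader("sample.pdf");
string author;
reader.Info.TryGetValue("Author", out author);
Console.WriteLine(author);
reader.Close();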
This time everything worked as it should: ElifCakıroglu
Note: I also tried creating a Word 2016 document and saving it as a PDF, then using that file with the above samples; for some reason neither of them worked (the file was not accepted as a valid PDF). After that I tried an online PDF document validator, but the file was fine. Then I used an online converter to change the PDF version with the default settings and used the output PDF with both samples, and the surprising thing is that both of them worked correctly.
My conclusion is that iTextSharp works consistently with both documents that have Turkish characters in their metadata, while IronPDF works correctly only 50% of the time.
I believe that this issue is resolved and can be tested in the 2020.9 release branch of IronPdf.
https://www.nuget.org/packages/IronPdf/

Generating Excel file with XLConnect - "Removed Feature: Format from /xl/styles.xml part (Styles)"

I am using XLConnect in R for daily report generation. I have a program that runs automatically at a specific time each day to append the most recent date's data to an Excel file (Excel 2007). The program does this task fine. But sometimes when I open the Excel file it says: "Excel found unreadable content. Do you want to recover the contents of this workbook?"
The tricky part of this issue is that I can't reproduce it to find the exact root cause; it arises at random, and when I run the program again it works fine. Can somebody help me identify the root cause?

Track down the exact place of a character that cannot be encoded in an R script file

This is more of a tip question that can save a lot of time in many cases. I have a script.R file which I try to save, and I get this error:
Not all of the characters in ~/folder/script.R could be encoded using ASCII. To save using a different encoding, choose "File | Save with Encoding..." from the main menu.
I have been working on this file for months, and today I was editing my code like crazy when I got this error for the first time, so I obviously inserted a character that cannot be encoded at some point today.
My question is: can I track down this specific character and find exactly where it is in the document?
There are about 1000 lines in my code, so searching it manually is almost impossible.
Use tools::showNonASCIIfile() to spot the non-ASCII characters.
Let me suggest two slight improvements to this.
Process:
Save your file using a different encoding (e.g. UTF-8).
Set a variable f to the path of that file, something like f <- "yourpath\\yourfile.R".
Then use tools::showNonASCIIfile(f) to display the offending characters.
Something to check:
I have a Markdown file which I render to a Word document (not important here).
Some of the packages I load at initialisation mask previously loaded functions, and I have found that the resulting warning messages sometimes contain non-ASCII characters; this seems to be what caused the message for me. Some fault dumped all of that output at the end of the file, and I had to delete it anyway!
So check whether the characters are coming back from warnings!
Cheers
Expanding on the accepted answer with this answer to another question: to check for offending characters in the script currently open in RStudio, you can use this:
tools::showNonASCIIfile(rstudioapi::getSourceEditorContext()$path)

Igor NetCDF loading error

I am trying to load .nc files into Igor using the following line:
Execute/Q "Load_NetCDF/i/q/t/z/s"
I have Load_NetCDF installed and have used it a lot; it definitely works, and it works for similar files. I think the difference is that these files contain a couple of multi-dimensional waves. Using Load_NetCDF in this way seems to produce some odd-looking results which do not match the content when I look at the same file another way (i.e. looking at the variables individually in MATLAB's ncbrowser).
I am seeing a couple of errors on the Igor command line and have confirmed that they occur on the Load_NetCDF line of my code shown above. Here are the error messages I get:
I've been hunting around for help on the Load_NetCDF external function, but without success. Does anyone know the cause of this problem, or a good line of attack for debugging it?
Are you using the XOP from this page to load the netCDF data?
It states that it does not support 2D waves. I don't know of any other XOP for loading netCDF data.
The promised error messages in your post are not visible.
Which netCDF format are these files: classic or the new format? The new format is based on HDF5 and, according to this post, can be read by the HDF5 browser in Igor.

Converting .pdf files to Excel (.xls)

A friend of mine doing an internship asked me two hours ago if I could help him avoid converting 462 PDF files to .xls manually using free online software.
I thought of a shell script using unoconv, but I couldn't work out how to use it properly, and I am not sure unoconv can solve this problem, since it mainly converts files to PDF, not the reverse.
Conversion from PDF to any other structured format is not always possible and not generally recommended.
Having said that, this does look like a one-off job and there's a fair few of them (462).
It's worth pursuing if you can reliably extract text from most of them and it's reasonably structured. It's a matter of trying to get regular text output, across a sample of the PDFs, that you can reliably parse into a table structure.
There are plenty of tools around that do either direct or OCR-based text extraction; just google around.
One I like is pstotext from the Ghostscript suite; the -bboxes option gives me the coordinates of each word and leaves it up to me to reassemble the structure. Despite its name, it does work on PDF input. The downside is that it can be a bit flaky and works on some PDFs but not others.
If you get that far, you'd then most likely need to write a shell script or program to convert the extracted text to a CSV, along the lines of the sketch below. You can then either open the CSV directly in a spreadsheet or look for tools to convert it into XLS.
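As a rough sketch of that text-to-CSV step (assuming the extracted text already has one record per line with columns separated by runs of whitespace; both file names are placeholders), a small C# program could look like this:
using System;
using System.IO;
using System.Linq;
// Read the raw text dump produced by the extraction step.
var lines = File.ReadLines("extracted.txt");
using (var csv = new StreamWriter("output.csv"))
{
    foreach (var line in lines)
    {
        // Split on runs of whitespace; quote each field so embedded
        // commas or quotes do not break the CSV.
        var fields = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                         .Select(f => "\"" + f.Replace("\"", "\"\"") + "\"");
        csv.WriteLine(string.Join(",", fields));
    }
}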
PS: If he hasn't already, get the intern to ask whether there's any possible way of getting at the original data that was used to create the PDFs. It will save a lot of time and effort and lead to a far more accurate result.
Update: An alternative to pstotext is the renderpdf.pl command included in the Perl CAM::PDF module. It is more robust, but only reports each piece of text's (x, y) position, not bounding boxes.
Other responses on a linked question suggest Tabula, too.
https://github.com/tabulapdf/tabula
I tried it and it works very well.
