Multibyte characters reading problem in IronPdf - multibyte-characters

I am trying IronPDF. I want to insert PDF metadata to database which I read with IronPDF. However, some "ı" characters in the metadata are not read with IronPDF. Spaces are left in place of these characters. Here is my code sample:
var md = PdfDocument.FromFile("___PATH OF PDF FILE___");
var article_title = md.MetaData.Title;
When I copy paste string to Notepad++ it gives a result like this:
And here is the screenshot of application view:
Is there a way to solve this problem or is this a bug of IronPDF? If everything goes well, of course, I think of buying. But of course, if it fails on the first try, continue to iTextSharp.
EDIT: First of all, I apologize for Windows, which made me surprised. I struggled to get a new system up all day and unfortunately it's still visual studio etc. not to be installed. I added one of the files I had problems with in the below and the IronPDF version appears as 2019.7.0.0.
PDF file: https://yadi.sk/d/HwP9JWRWTzMlSA

First of all, since you haven't provided us with a sample PDF to work with; I've google some Turkish PDF documents having metadata with Turkish characters. This is the file that I came up with: link
As you can see above the Author metadata field has ı Turkish character.
Then I created a dotnet fiddle in order to test this file using IronPDF (with the latest available version - since you haven't specified any):
sample using IronPDF
The output from this sample is ElifCakroglu which is showing the exact same symptom when copied to Notepad++:
Playing with the encodings did not help resolving this issue. So I created another dotnet fiddle to test your alternative solution which was iTextSharp: sample using iTextSharp
This time everything was working as it should be: ElifCakıroglu
Note: I've also tried creating a Word 2016 document and saving it as a PDF then using that file with the above samples and both of them did not work (not accepting as a valid PDF) for some reason. After that I tried and online PDF document validator, but the file was fine. Then I used an online converter to change the PDF version with the default settings and used the output PDF with both samples and the surprising thing is that both of them worked correctly.
My conclusion is that iTextSharp is working consistently with both documents having metadata with Turkish characters present, while IronPDF works correctly 50% of the time.

I believe that this issue is resolved and can be tested in the 2020.9 release branch of IronPdf.
https://www.nuget.org/packages/IronPdf/

Related

"The system cannot find the file specified whenever I knit"

Whenever I knit to a PDF in RStudio, the error "The system cannot find the file specified."
Here is the code that I'm using:
##importing data
library(readr)
Quiz1data_2 <- read_csv("C:/Users/erinp/Downloads/Quiz1data-2.csv")
I have restarted RStudio multiple times and I have copied and pasted the exact link that my file is saved to and it's still not working.
What am I not seeing or what am I not thinking?
Some suggestions/questions:
Without knitting, when you run the line reading in the csv, does it work?
Also, are you sure that the error is referring to the data csv? Could it be referring to the (I'm assuming) markdown file you are writing your code in? Have you moved that file since you started working in it?
Are you able to knit other documents to pdf? You need MiKTeX on a windows machine. Does knitting to html work?
I've found R to be a little tricky reading in files. I usually use the base function like this:go to the environment tab>import dataset>from text(base)> (select the file you want, hit open)>(select settings so that the dataframe preview looks right>import. Code that does this will run in the console, and I copy it into my markdown file so that every time it knits, it replicates that successful process.
I figured it out. It's because I forgot to assign a value to an object so it wouldn't knit into a PDF. I made a comment earlier, but it's small so I thought I would add this to the answer section.

Corrupted Rmarkdown script: How can I get the Cyrillic characters back?

I was working with a script with lots of Cyrillic characters (throughout chunks and out of them) for weeks. One day I have opened a new Rmarkdown script where I wrote English, while the other document is still in my R session. Afterwards, I have returned to the Cyrillic document and everything written turns to something like this 8 иÑлÑ 1995 --> ÐлаÑÑÑ - наÑодÑ
The question is: Where is the source of problem? And, how can the corrupted script turn to its original form (with the Cyrillic characters)?
UPDATE!!
I have tried reopeining the Rstudio scrip with encoding CP1251, CP1252, windows1251 and UTF8, but it does not work. Certaintly the weird symbols change to another weird symbols. The problem is that I have saved the document with the default encoding CP1251 and windows1251) at the very begining.
Solution:
If working with cyrillic and lating characters, be sure you save the Rstudio script with UTF-8 encoding always, when you computer is windows (I do not know mac). If you close the script and open it again, re-open the file with UTF8 encoding.
Assuming you're using RStudio: Open your *.Rmd file and then try to reopen it "with encoding". Therefore simply use the File-Menu as shown below.
Select "Show all encodings" and choose your specific encoding, I suggest windows-1251 for cyrillic encoding:
Note: Apparently the issue can also occur while at the one time opening the *.Rmd file as "standalone" and at the other time from within an R Project.
Hope that would help.

Generating Excel file with XLConnect-Removed Feature: Format from /xl/styles.xml part (Styles)

I am using XLConnect in R for the purpose of daily report generation. I have a program that runs automatically at specific time to append the data for most recent date daily into an excel file (Excel 2007). The program works fine to do this task. But, sometimes when i open the excel file it says that "Excel found unreadable content. do you want to recover the content of this workbook?"
The best part of this issue is that i can't reproduce this issue again to know the exact root cause for the problem. It arises in a random manner. Because, when i try to run the program again it works fine. Can somebody help me to identify the root cause?

Using GhostPCL to converting PCL with images to PDF

I'm currently attempting to convert some PCL files into PDF using GhostPCL (PCL6).
For the most part this works. However, there is an odd problem with some of the conversion. For some reason, PCL6 is not converting some logos where are at the top of our documents. The logo is of the format:
^[(25XABCDEFGHIJKLMNOPQ^[(3#^M
^[(25X^[&a+1.49RRSTUVWXYZ[\]^_`ab^[(3#^M
^[(25X^[&a+1.49Rcdefghijklmnopqrs^M
when viewing the PCL file in vim. When printing the file as a PCL file, the image prints out correctly, but when converting to pdf, the following takes it's place:
ABCDEFGHIJKLMNOPQ
RSTUVWXYZ[\]^_`ab
cdefghijklmnopqrs
I recognize that the format is meant to be matched against some sort of embedded image or font, but it has been really difficult trying to find useful documentation on PCL (so I can actually figure out what these characters mean) or the conversion process.
Can anyone offer some insight on how to approach the conversion? We will need these images/logos in the converted documents since they often contain disclaimer information as part of the image.
EDIT1: I've also attempted converting to postscript and printing then and the same behavior occurs.
EDIT2: When rendering the PCL file in a viewer, the same text shows up instead of the image. But when printing, the logo does show up. Strange...
EDIT3: To clarify, sending the PCL file to a printer directly does not seem to cause the problem (i.e, the logo does print correctly). It's only when I attempt to convert it to another file format that the problem occurs.
What happens when you try rendering the PCL input with Ghostscript ? Eg to the display device. If it doesn't render its not going to end up in a PDF either.
Have you tried printing the file to a PCL printer ?
If it works to a PCL printer, but not when rendering you can open a bug against ghostpcl. If it renders but does not end up in the PDF then you can open a bug against ghostspcl with the 'pdf writer' component.
Its possible that the logo is shown using a rasterop, this is a part of the PCL imaging model which has no counterpart in PDF and so cannot be reproduced. The result of using a rasterop with the PDF device is variable, sometimes it will do what you expect, often it will not.

Importing the contents of a word document into R

I am new to R and have worked for a while as follows. I have the code writen in a word document, then I copy and paste the document with the code into R as to have the code run which works fine, however when the code is long (hundred pages) it takes a significant amount of time in R to start making the code run. This seems rather not a very effective working procedure and I am sure there are other forms to compile the R code.
On another hand one of then that comes to my mind is to import the content of word into R which I am unsure how to do. Have tried with read.table but it does not work, have look on internet as to how to import data, however most explanations are all for data tables etc or internet files in the form of data tables and similar. I have tried saving the document into csv. however word does not include csv have tried with Rich text format and XML package but again the instructions from the packages are for importing tables and similars. I am wondering if there is an effective way for R to import a word document as is in the word document.
Thank you
It's hard to say what the easiest solution would be, without examining the word document. Assuming it only contains code and nothing else, it should be pretty easy to convert it all to plain text from within Word. You can do that by going to File -> Save As, and use 'plain text' under 'Save as type'.
Then edit the filename extension to .R from .txt, download a proper text editor (I can recommend RStudio for R), and open your code in it. Then you will be able to run the code from inside the editor without using copy / paste.
No, read table won't do it.
Microsoft Word has its own format, which includes a lot of meta data over and above the text you enter into it. You'll need a reader/parser that understands the Word format.
A Java developer would use a library like Apache POI to read and parse it into word tokens and n-grams.
Look for Natural Language Processing tools, like this R module:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Resources