We found a problem with some content in "example.xlsx" - using ClosedXML library - asp.net

I'm using ClosedXML library to generate a simple Excel file with 2 worksheets.
I keep getting an error message whenever I try to open the file, saying:
"We found a problem with some content in "example.xlsx". Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes."
If I click Yes, it displays the data as expected and I don't see any problems with it. Also, if I generate only one worksheet, this error does not appear.
My stored procedure returns two result sets (screenshot "Workbook data" not reproduced here). The first result set is populated in sheet1 and the second in sheet2, which works as expected.
Here is the method I am using. It takes the two result sets and populates them in two different worksheets:
    [HttpPost]
    [ValidateAntiForgeryToken]
    public ActionResult POAReport(POAReportVM model)
    {
        POAReportVM poaReportVM = reportService.GetPOAReport(model);
        using (var workbook = new XLWorkbook())
        {
            IXLWorksheet worksheet1 = workbook.Worksheets.Add("ProductOrderAccuracy");
            worksheet1.Cell("A1").Value = "DATE";
            worksheet1.Cell("B1").Value = "ORDER";
            worksheet1.Cell("C1").Value = "";
            var countsheet1 = 2;
            for (int i = 0; i < poaReportVM.productOrderAccuracyList.Count; i++)
            {
                worksheet1.Cell(countsheet1, 1).Value = poaReportVM.productOrderAccuracyList[i].CompletedDate.ToString();
                worksheet1.Cell(countsheet1, 2).Value = poaReportVM.productOrderAccuracyList[i].WebOrderID.ToString();
                worksheet1.Cell(countsheet1, 3).Value = poaReportVM.productOrderAccuracyList[i].CompletedIndicator;
                countsheet1++;
            }
            IXLWorksheet worksheet2 = workbook.Worksheets.Add("Summary");
            worksheet2.Cell("A1").Value = "Total Orders Sampled";
            worksheet2.Cell("B1").Value = "Passed";
            worksheet2.Cell("C1").Value = "% Passed";
            worksheet2.Cell(2, 1).Value = poaReportVM.summaryVM.TotalOrdersSampled.ToString();
            worksheet2.Cell(2, 2).Value = poaReportVM.summaryVM.Passed.ToString();
            worksheet2.Cell(2, 3).Value = poaReportVM.summaryVM.PassedPercentage.ToString();
            //save file to memory stream and return it as byte array
            using (var ms = new MemoryStream())
            {
                workbook.SaveAs(ms);
                ms.Position = 0;
                var content = ms.ToArray();
                return File(content, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
            }
        }
    }

I had a similar problem. For me, the cause was the one mentioned in this answer:
I had this issue when I was using EPPlus to customise an existing template. For me the issue was in the template itself as it contained invalid references to lookup tables. I found this in Formula -> Name Manager.
You may find other solutions there.
Apologies as my reputation is too low to add a comment.

Related

Adding @ to formula after =

When I add the formula FORECAST.ETS, it adds an @ after the equals sign, like this: =@FORECAST.ETS. Why is this happening?
The code snippet is:
    ws.cell(column=1, row=2, value="=FORECAST.ETS(...)")
When I open it with Excel (latest Office 365 version), it shows as =@FORECAST.ETS(..)
I have hit the same issue, though not with Python and openpyxl but with .NET Core C# and EPPlus. What follows is a workaround based on my findings. It is not ideal, but I suspect it will work with openpyxl too.
Re-creating the problem
I have written a simplified C# console app that firstly creates a new XLSX (foo.xlsx), writes out some data and my formula, and then outputs the cell with the formula and the value to the Console. It then saves and closes the XLSX, and reopens it and again outputs the formula cell and its value. The code is as follows:
    using OfficeOpenXml;
    using System;
    using System.IO;
    namespace TestFormula
    {
        class Program
        {
            static void Main(string[] args)
            {
                Console.WriteLine("Test starts");
                ExcelPackage.LicenseContext = LicenseContext.NonCommercial;
                if (File.Exists($".\\foo.xlsx"))
                {
                    File.Delete($".\\foo.xlsx");
                }
                using (var ep = new ExcelPackage(new FileInfo($".\\foo.xlsx")))
                {
                    ExcelWorkbook wb = ep.Workbook;
                    ExcelWorksheet wsTest = null;
                    wsTest = wb.Worksheets.Add("Test");
                    // Add some look up data...
                    for (int row = 1; row <= 5; row++)
                    {
                        wsTest.Cells[row, 1].Value = row;
                        wsTest.Cells[row, 2].Value = $"Name {row}";
                    }
                    // Write the formula under test (same formula as in the Solution section)
                    wsTest.Cells[1, 4].Formula = $"=XLOOKUP($A3,$A:$A,$B:$B)";
                    Console.WriteLine($"Add: formula=\"{wsTest.Cells[1, 4].Formula}\"");
                    Console.WriteLine($"Add: value=\"{wsTest.Cells[1, 4].Value}\"");
                    ep.Save();
                    ep.Dispose();
                }
                using (var ep = new ExcelPackage(new FileInfo($".\\foo.xlsx")))
                {
                    ExcelWorkbook wb = ep.Workbook;
                    ExcelWorksheet wsTest = null;
                    wsTest = wb.Worksheets["Test"];
                    Console.WriteLine($"Open: formula=\"{wsTest.Cells[1, 4].Formula}\"");
                    Console.WriteLine($"Open: value=\"{wsTest.Cells[1, 4].Value}\"");
                    ep.Dispose();
                }
                Console.WriteLine("Test ends");
            }
        }
    }
The output from the above looks like this...
Note that the formula after closing and re-opening the XLSX with EPPlus reads just as it was written.
However, if I open the file with Excel I can see that an @ has been inserted after the = sign.
If I then double click on the formula cell, I get an Excel error message...
I answered "no" to this question because I wanted to continue to experiment with what was happening behind the scenes.
After double clicking the formula cell to edit it, when I now hit ENTER with the @ in the formula, it works. At this point I save the XLSX with the change made.
If I now delete some of my code and just run...
    using OfficeOpenXml;
    using System;
    using System.IO;
    namespace TestFormula
    {
        class Program
        {
            static void Main(string[] args)
            {
                Console.WriteLine("Test starts");
                ExcelPackage.LicenseContext = LicenseContext.NonCommercial;
                using (var ep = new ExcelPackage(new FileInfo($".\\foo.xlsx")))
                {
                    ExcelWorkbook wb = ep.Workbook;
                    ExcelWorksheet wsTest = null;
                    wsTest = wb.Worksheets["Test"];
                    Console.WriteLine($"Open: formula=\"{wsTest.Cells[1, 4].Formula}\"");
                    Console.WriteLine($"Open: value=\"{wsTest.Cells[1, 4].Value}\"");
                    ep.Dispose();
                }
                Console.WriteLine("Test ends");
            }
        }
    }
I get the following output...
What's particularly interesting about the output is that the formula has been modified by Excel and has been prefixed with _xlfn.SINGLE.
Research
It is worth declaring here that I am running Microsoft 365 and always have patches and updates applied automatically as soon as they become available, so my version of Excel is the latest one.
Googling for _xlfn.SINGLE provides a number of hits (see References below), and from these I have concluded the following:
In Aug 2019 Microsoft released an update that introduced a new formula keyword called XLOOKUP, intended to replace VLOOKUP and HLOOKUP. As such, the XLSX file format was updated to allow for this new feature. The second reference below mentions that other formulas were introduced around Sep 2018.
I'm guessing that the EPPlus library (and probably openpyxl) has not updated its file format handling to account for these new or changed features.
When Excel opens an older file version and detects a more recent formula keyword (i.e. a keyword that was not available in the earlier file version), it does not automatically resolve the formula. Instead it throws the error I mentioned above and then resolves the problem by prefixing the new formula keyword with _xlfn.SINGLE.
Solution
This is a dirty, short-term fix until the EPPlus/openpyxl libraries catch up. In my case, I simply replace this line in code...
wsTest.Cells[1, 4].Formula = $"=XLOOKUP($A3,$A:$A,$B:$B)";
... with ...
wsTest.Cells[1, 4].Formula = $"_xlfn.SINGLE(_xlfn.XLOOKUP($A3,$A:$A,$B:$B))";
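For openpyxl users, the same idea can be applied to the formula string before it is written to the cell. The following is a minimal sketch in Python and not part of either library: the helper name add_xlfn_prefix and the short list of function names are assumptions made for illustration, and the regular expression does not skip function names that happen to appear inside string literals.

```python
import re

# Assumption: a short, hand-picked list of newer function names.
# Extend it with whatever your own workbook uses.
NEWER_FUNCTIONS = ("FORECAST.ETS", "XLOOKUP", "XMATCH", "TEXTJOIN", "MAXIFS", "MINIFS")

_NAMES = "|".join(re.escape(name) for name in NEWER_FUNCTIONS)
# Match a listed name that is followed by "(" and not already preceded
# by a word character or a dot (so "_xlfn.XLOOKUP(" is left alone).
_CALL = re.compile(r"(?<![\w.])(" + _NAMES + r")(?=\()", re.IGNORECASE)


def add_xlfn_prefix(formula, wrap_single=False):
    """Return the formula with "_xlfn." added in front of listed functions.

    wrap_single=True additionally wraps the whole expression in
    _xlfn.SINGLE(...), mirroring the workaround shown above.
    The result always starts with "=".
    """
    body = formula[1:] if formula.startswith("=") else formula
    body = _CALL.sub(lambda m: "_xlfn." + m.group(1).upper(), body)
    if wrap_single:
        body = "_xlfn.SINGLE(" + body + ")"
    return "=" + body


print(add_xlfn_prefix("=XLOOKUP($A3,$A:$A,$B:$B)"))
print(add_xlfn_prefix("XLOOKUP($A3,$A:$A,$B:$B)", wrap_single=True))
```

With openpyxl, the returned string would be passed as the cell value in the ws.cell(...) call from the question. Whether Excel still displays an @ afterwards depends on the formula, so treat this as a starting point rather than a guaranteed fix.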
References
Issue: An _xlfn. prefix is displayed in front of a formula by Microsoft
XLOOKUP XMATCH FILTER RANDARRAY SEQUENCE SORT SORTBY UNIQUE CONCAT IFS MAXIFS MINIFS SWITCH TEXTJOIN by Andreas Killer

How to create multi items/records or save item/record array at one time in client script file

I want to create multiple records at the same time using client script. This is what I'm doing:
    var ceateDatasource = app.datasources.Reservation.modes.create;
    var newItem = ceateDatasource.item;
    newItem.User = user; //'eric'
    newItem.Description = description; //'000'
    newItem.Location_Lab_fk = lab.value.Id; //'T'
    newItem.Area_fk = area.value.Id; //'L'
    newItem.Equipment_fk = equipment.value.Id; //'S'
    for (var i = 0; i < 3; i++) {
        newItem.Start_Date = startDate;
        newItem.Start_Hours = '03';
        newItem.Start_Minutes = '00';
        newItem.End_Date = startDate;
        newItem.End_Hours = '23';
        newItem.End_Minutes = '30';
        // Create the new item
        ceateDatasource.createItem();
    }
But the result I'm getting is this one:
The three records are created, but only the first one has data. The other two records have empty values in their fields. How can I achieve this?
Thanks.
Update(2019-3-27):
I was able to make it work by putting everything inside the for loop block. However, I have another question.
Is there any method like the below sample code?
    var recordData = [Data1, Data2, Data3];
    var ceateDatasource;
    var newItem = new Array(recordData.length);
    for (var i = 0; i < recordData.length; i++) {
        ceateDatasource = app.datasources.Reservation.modes.create;
        newItem[i] = ceateDatasource.item;
        newItem[i].User = recordData[i].user;
        newItem[i].Description = recordData[i].Description;
        newItem[i].Location_Lab_fk = recordData[i].Location_Lab_fk;
        newItem[i].Area_fk = recordData[i].Area_fk;
        newItem[i].Equipment_fk = recordData[i].Equipment_fk;
        newItem[i].Start_Date = recordData[i].Start_Date;
        newItem[i].Start_Hours = recordData[i].Start_Hours;
        newItem[i].Start_Minutes = recordData[i].Start_Minutes;
        newItem[i].End_Date = recordData[i].End_Date;
        newItem[i].End_Hours = recordData[i].End_Hours;
        newItem[i].End_Minutes = recordData[i].End_Minutes;
    }
    // Create the new item
    ceateDatasource.createItem();
The idea is to first prepare an array 'newItem' and then call 'ceateDatasource.createItem()' only once to save all new records (or items).
I tried this method, but it only saves the last record, 'newItem[3]'.
I would need to write a callback function in 'ceateDatasource.createItem()', but Google App Maker always shows the warning "Don't make functions within a loop". So, is there any way to call 'createItem()' once to save several records? Or is there some function like 'array.push' that can be used?
Thanks.
As per AppMaker's official documentation:
A create datasource is a datasource used to create items in a particular data source. Its item property is always populated by a draft item which can be bound to or set programmatically.
What you are trying to do is create three items off the same draft item. That's why you see the result you get. If you want to create multiple items, you need a fresh draft item for each one, so all you need to do is put all your code inside the for loop.
    for (var i = 0; i < 3; i++) {
        var ceateDatasource = app.datasources.Reservation.modes.create;
        var newItem = ceateDatasource.item;
        newItem.User = user; //'eric'
        newItem.Description = description; //'000'
        newItem.Location_Lab_fk = lab.value.Id; //'T'
        newItem.Area_fk = area.value.Id; //'L'
        newItem.Equipment_fk = equipment.value.Id; //'S'
        newItem.Start_Date = startDate;
        newItem.Start_Hours = '03';
        newItem.Start_Minutes = '00';
        newItem.End_Date = startDate;
        newItem.End_Hours = '23';
        newItem.End_Minutes = '30';
        // Create the new item
        ceateDatasource.createItem();
    }
If you want to save several records at the same time using client script, then what you are looking for is the Manual Save Mode. So all you have to do is go to your model's datasource and click on the checkbox "Manual Save Mode".
Then use the same code as above. The only difference is that in order to persist the changes to the server, you need to explicitly save changes. So all you have to do is add the following after the for loop block:
    app.datasources.Reservation.saveChanges(function() {
        //TODO: Callback handler
    });

Extracting text with iText does not work: encoding problem or encrypted text?

I have a PDF file with the following security properties: printing: allowed; document assembly: NOT allowed; content copying: allowed; content copying for accessibility: allowed; page extraction: NOT allowed.
I am trying to get the text with the sample code from the documentation, as follows:
    pdftext.Text = null;
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(filename);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        text.Append(System.Environment.NewLine);
        text.Append("\n Page Number:" + page);
        text.Append(System.Environment.NewLine);
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        text.Append(currentText);
        progressBar1.Value++;
    }
    pdftext.Text += text.ToString();
    pdfReader.Close();
but the output text consists of lines like "??? ? ???????\n?? ??? ? ".
It seems that the file is encrypted or that we have an encoding problem...
Note the results of the following lines:
    var f = pdfReader.IsOpenedWithFullPermissions; // -> FALSE
    var f1 = pdfReader.IsEncrypted(); // -> FALSE
    var f2 = pdfReader.ComputeUserPassword(); // -> NULL
    var f3 = pdfReader.Is128Key(); // -> FALSE
    var f4 = pdfReader.HasUsageRights(); // -> FALSE
f, f1, f3 and f4 return FALSE, so it seems that the document is not encrypted,
...so I don't know whether this is an encoding problem or a question related to encrypted strings...
Can someone help me?
Thanks in advance.
G.G.
Whenever you have trouble extracting text from a document using standard code, the first thing to do is to try to copy & paste the text from it using Adobe Acrobat Reader. Adobe Reader's copy & paste implements text extraction according to the recommendations of the PDF specification. If this fails, it usually means that the information required for text extraction is either missing from the document or broken (by accident or by design). To extract the text, one then either needs to customize the code to the specific PDF or resort to OCR.
In the case of the document at hand, Adobe Reader copy & paste results in garbage too, just like extraction with iText. Thus, there is something fishy in the document.
Inspecting the document one finds that the fonts contain ToUnicode mappings like this:
/CIDInit /ProcSet
findresource begin 12 dict begin begincmap /CIDSystemInfo<</Registry(Adobe)
/Ordering(Identity)
/Supplement 0
>>
def
/CMapName/F18 def
1 begincodespacerange <0000> <FFFF> endcodespacerange
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
<23> <23> <E0FA>
<24> <24> <E0F7>
<25> <25> <E0A3>
<26> <26> <E084>
<27> <27> <E097>
<28> <28> <E098>
<29> <29> <E09A>
<2A> <2A> <E08A>
<2B> <2B> <E099>
<2C> <2C> <E0A5>
<2D> <2D> <E086>
<2E> <2E> <E094>
<2F> <2F> <E0DE>
<30> <30> <E0A6>
<31> <31> <E096>
<32> <32> <E088>
<33> <33> <E082>
<34> <34> <E04C>
<35> <35> <E0A4>
<36> <36> <E0F6>
<37> <37> <E0F2>
<38> <38> <E0D8>
<39> <39> <E0AA>
<3A> <3A> <E06C>
<3B> <3B> <E087>
<3C> <3C> <E095>
<3D> <3D> <E0C4>
<3E> <3E> <E07E>
<3F> <3F> <E055>
<40> <40> <E089>
<41> <41> <E085>
<42> <42> <E083>
<43> <43> <E070>
<44> <44> <E0E6>
<45> <45> <E080>
<46> <46> <E0C8>
<47> <47> <E0F4>
<48> <48> <E062>
<49> <49> <E0F3>
<4A> <4A> <E04E>
<4B> <4B> <E05E>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
I.e., if you are not into this, the fonts claim that all their glyphs (with the exception of the space glyph at 0x20) represent characters U+E0xx from the Unicode private use area. As the name of that area indicates, there is no common meaning of characters with these values.
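To check a ToUnicode map like the one above programmatically, the bfrange entries can be parsed and each target tested against the private use area (U+E000 to U+F8FF). The following Python sketch is only an illustration: parse_bfrange and is_private_use are hypothetical helper names, and only the simple three-value form of a bfrange entry is handled (the array form is ignored).

```python
import re

# Text between "beginbfrange" and "endbfrange"
_BFRANGE_SECTION = re.compile(r"beginbfrange(.*?)endbfrange", re.DOTALL)
# One entry of the form <first code> <last code> <first Unicode value>
_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfrange(cmap_text):
    """Return {character code: Unicode code point} for simple bfrange entries."""
    mapping = {}
    for section in _BFRANGE_SECTION.findall(cmap_text):
        for lo, hi, dst in _ENTRY.findall(section):
            start, end, target = int(lo, 16), int(hi, 16), int(dst, 16)
            # Consecutive codes map to consecutive Unicode values
            for offset in range(end - start + 1):
                mapping[start + offset] = target + offset
    return mapping


def is_private_use(code_point):
    """True for code points in the BMP private use area."""
    return 0xE000 <= code_point <= 0xF8FF


sample = """1 begincodespacerange <0000> <FFFF> endcodespacerange
3 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
endbfrange"""

cmap = parse_bfrange(sample)
meaningful = sorted(code for code, uni in cmap.items() if not is_private_use(uni))
print([hex(code) for code in meaningful])
```

For the three sample entries, only character code 0x20 (the space) maps to a code point outside the private use area.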
Thus, text extraction according to the PDF specification will return strings of characters with undefined meaning, which is what you observed with iText and what I saw in Adobe Reader.
Sometimes in such a situation one can still enforce proper text extraction by ignoring the ToUnicode map and using either the font Encoding or information inside the embedded font program.
Unfortunately it turns out that here the Encoding effectively contains the same information as does the ToUnicode map, e.g. for the same font as above
/Differences [ 32 /space /uniE0F9 /uniE0F1 /uniE0FA /uniE0F7 /uniE0A3 /uniE084 /uniE097 /uniE098
/uniE09A /uniE08A /uniE099 /uniE0A5 /uniE086 /uniE094 /uniE0DE /uniE0A6 /uniE096
/uniE088 /uniE082 /uniE04C /uniE0A4 /uniE0F6 /uniE0F2 /uniE0D8 /uniE0AA /uniE06C
/uniE087 /uniE095 /uniE0C4 /uniE07E /uniE055 /uniE089 /uniE085 /uniE083 /uniE070
/uniE0E6 /uniE080 /uniE0C8 /uniE0F4 /uniE062 /uniE0F3 /uniE04E /uniE05E ]
and the fonts turn out to be Type3 fonts, i.e. there is no embedded font program but each glyph is defined as an individual PDF canvas without further character information.
Thus, nothing to gain here either.
Actually these small PDF canvasses contain inlined bitmap graphics of the respective glyph which also is the cause of the poor graphical quality of the document (if you don't see that immediately, simply zoom in a bit and you'll see the ragged outlines of the glyphs).
By the way, such a construct usually means that the producer of the PDF explicitly wants to prevent text extraction.
If you happen to have to extract text from many such documents, you can try and determine a mapping from their U+E0xx characters to actually sensible Unicode characters and apply that mapping to your extracted text.
If all those fonts in all those documents happen to use the same U+E0xx codepoints for the same actual characters, you'll be able to do text extraction from those documents after investing a certain amount of initial work.
Otherwise do try OCR.
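Once such a mapping has been determined by hand, applying it to the extracted text is straightforward. The following Python sketch shows the idea. The function name remap_private_use is hypothetical, and the two mapping entries are placeholders only: the real pairs have to be worked out by looking at the glyphs in the document.

```python
# Placeholder mapping for illustration only. The actual characters behind
# U+E0F9 and U+E0F1 are unknown here and must be determined from the glyphs.
PUA_TO_CHAR = {
    0xE0F9: "A",
    0xE0F1: "B",
}


def remap_private_use(text, mapping, unknown="\uFFFD"):
    """Replace private-use characters using mapping; keep all other characters.

    Private-use characters without a mapping entry become `unknown`
    (U+FFFD by default), so gaps in the mapping stay visible.
    """
    result = []
    for ch in text:
        code_point = ord(ch)
        if 0xE000 <= code_point <= 0xF8FF:
            result.append(mapping.get(code_point, unknown))
        else:
            result.append(ch)
    return "".join(result)


extracted = "\ue0f9\ue0f1 \ue000"
print(remap_private_use(extracted, PUA_TO_CHAR))
```

Keeping unmapped characters visible as U+FFFD makes it easy to see which glyphs still need an entry when the mapping is tried on further documents.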
The following code adds pages to a document which map the ToUnicode values to the characters shown:
    void AddFontsTo(PdfReader reader, PdfStamper stamper)
    {
        int documentPages = reader.NumberOfPages;
        for (int page = 1; page <= documentPages; page++)
        {
            // ignore inherited resources for now
            PdfDictionary pageResources = reader.GetPageResources(page);
            if (pageResources == null)
                continue;
            PdfDictionary pageFonts = pageResources.GetAsDict(PdfName.FONT);
            if (pageFonts == null || pageFonts.Size == 0)
                continue;
            List<BaseFont> fonts = new List<BaseFont>();
            List<string> fontNames = new List<string>();
            HashSet<char> chars = new HashSet<char>();
            foreach (PdfName key in pageFonts.Keys)
            {
                PdfIndirectReference fontReference = pageFonts.GetAsIndirectObject(key);
                if (fontReference == null)
                    continue;
                DocumentFont font = (DocumentFont) BaseFont.CreateFont((PRIndirectReference)fontReference);
                if (font == null)
                    continue;
                PdfObject toUni = PdfReader.GetPdfObjectRelease(font.FontDictionary.Get(PdfName.TOUNICODE));
                CMapToUnicode toUnicodeCmap = null;
                if (toUni is PRStream)
                {
                    try
                    {
                        byte[] touni = PdfReader.GetStreamBytes((PRStream)toUni);
                        CidLocationFromByte lb = new CidLocationFromByte(touni);
                        toUnicodeCmap = new CMapToUnicode();
                        CMapParserEx.ParseCid("", toUnicodeCmap, lb);
                    }
                    catch
                    {
                        toUnicodeCmap = null;
                    }
                }
                if (toUnicodeCmap == null)
                    continue;
                ICollection<int> mapValues = toUnicodeCmap.CreateDirectMapping().Values;
                if (mapValues.Count == 0)
                    continue;
                fonts.Add(font);
                fontNames.Add(key.ToString());
                foreach (int value in mapValues)
                    chars.Add((char)value);
            }
            if (fonts.Count == 0 || chars.Count == 0)
                continue;
            Rectangle size = (fonts.Count > 10) ? PageSize.A4.Rotate() : PageSize.A4;
            PdfPTable table = new PdfPTable(fonts.Count + 1);
            table.AddCell("Page " + page);
            foreach (String name in fontNames)
            {
                table.AddCell(name);
            }
            table.HeaderRows = 1;
            float[] widths = new float[fonts.Count + 1];
            widths[0] = 2;
            for (int i = 1; i <= fonts.Count; i++)
                widths[i] = 1;
            table.SetWidths(widths);
            table.WidthPercentage = 100;
            List<char> charList = new List<char>(chars);
            charList.Sort();
            foreach (char character in charList)
            {
                table.AddCell(((int)character).ToString("X4"));
                foreach (BaseFont font in fonts)
                {
                    table.AddCell(new PdfPCell(new Phrase(character.ToString(), new Font(font))));
                }
            }
            stamper.InsertPage(reader.NumberOfPages + 1, size);
            ColumnText columnText = new ColumnText(stamper.GetUnderContent(reader.NumberOfPages));
            columnText.AddElement(table);
            columnText.SetSimpleColumn(size);
            while ((ColumnText.NO_MORE_TEXT & columnText.Go(false)) == 0)
            {
                stamper.InsertPage(reader.NumberOfPages + 1, size);
                columnText.Canvas = stamper.GetUnderContent(reader.NumberOfPages);
                columnText.SetSimpleColumn(size);
            }
        }
    }
I applied it to your document like this:
    string input = @"4700198773.pdf";
    string output = @"4700198773-fonts.pdf";
    using (PdfReader reader = new PdfReader(input))
    using (FileStream stream = new FileStream(output, FileMode.Create, FileAccess.Write))
    using (PdfStamper stamper = new PdfStamper(reader, stream))
    {
        AddFontsTo(reader, stamper);
    }
The additional pages look like this:
Now you have to compare the outputs for the different fonts and pages of this document with each other and with those of a representative selection of files. If you find a good enough pattern, you can try this replacement approach.

How to update a PDF file?

I am required to replace a word in a PDF document with a new word, which the user selects from a drop-down list, in ASP.NET. I am using iTextSharp, but the new PDF that is created is all distorted because I am not able to extract the formatting/styling information of the PDF while extracting the text. Also, is there a way to read a PDF line by line? Please help.
    protected void Page_Load(object sender, EventArgs e)
    {
        String s = DropDownList1.SelectedValue;
        Response.Write(s);
        ListFieldNames(s);
    }
    private void CreatePDF(string text)
    {
        string outFileName = @"z:\TEMP\PDF\Test_abc.pdf";
        Document doc = new Document();
        doc.SetMargins(30f, 30f, 30f, 30f);
        PdfWriter.GetInstance(doc, new FileStream(outFileName, FileMode.Create));
        doc.Open();
        BaseFont bfTimes = BaseFont.CreateFont(BaseFont.COURIER, BaseFont.CP1252, false);
        Font times = new Font(bfTimes, 12, Font.BOLDITALIC);
        //Chunk ch = new Chunk(text,times);
        Paragraph para = new Paragraph(text,times);
        //para.SpacingAfter = 9f;
        para.Alignment = Element.ALIGN_CENTER;
        //para.IndentationLeft = 100;
        doc.Add(para);
        //doc.Add(new Paragraph(text,times));
        doc.Close();
        Response.Redirect(@"z:\TEMP\PDF\Test_abc.pdf",false);
    }
    private void ListFieldNames(string s)
    {
        ArrayList arrCheck = new ArrayList();
        try
        {
            string pdfTemplate = @"z:\TEMP\PDF\abc.pdf";
            //string dest = @"z:\TEMP\PDF\Test_abc.pdf";
            PdfReader pdfReader = new PdfReader(pdfTemplate);
            string pdfText = string.Empty;
            string extracttext = "";
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
                PdfReader reader = new PdfReader((string)pdfTemplate);
                extracttext = PdfTextExtractor.GetTextFromPage(reader, page, its);
                extracttext = Encoding.Unicode.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.Unicode, Encoding.Default.GetBytes(extracttext)));
                pdfText = pdfText + extracttext;
                pdfText = pdfText.Replace("[xyz]", s);
                pdfReader.Close();
            }
            CreatePDF(pdfText);
        }
        catch (Exception ex)
        {
        }
        finally
        {
        }
    }
You are making one wrong assumption after the other.
You assume that the concept of "lines" exists in PDF. This is wrong. In Text State, different snippets of text are drawn on the page at absolute positions. For every "show text" operator, iText will return a TextRenderInfo object with the portion of text that was drawn and its coordinates. One line can consist of multiple text snippets. A text snippet may contain whitespace or may even be empty.
You assume that all text in a PDF keeps its natural reading order. This should be true for PDF/UA (UA stands for Universal Accessibility), but it's certainly not true for most PDFs you can find in the wild. That's why iText provides location-based text extraction (see p521 of iText in Action, Second Edition). As explained on p516, the text "Hello World" can be stored in the PDF as "ld", "Wor", "llo", "He". The LocationTextExtractionStrategy will order all the text snippets, reconstructing words if necessary. For instance: it will concatenate "He" and "llo" to "Hello", because there's not sufficient space between the "He" snippet and the "llo" snippet. However, for reasons unknown (probably ignorance), you're using the SimpleTextExtractionStrategy which doesn't order the text based on its location.
You are completely ignoring all the Graphics State operators, as well as the Text State operators that define the font, etc...
You assume that PDF is a Word processing format. This is wrong on many levels, as is your code. Please read the intro of chapter 6 of my book.
All these wrong assumptions almost make me want to vote down your question. At the risk of being voted down myself for this answer, I must tell you that you shouldn't try to "do the same". You're asking something that is very complex, and in many cases even impossible!
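To illustrate why location matters, here is a much simplified Python sketch of location-based ordering, using the "Hello World" example from above. It is not iText's LocationTextExtractionStrategy: the tuple layout, the function name rebuild_lines and both thresholds are assumptions chosen for the example, and real documents need far more care (rotated text, fonts, varying baselines).

```python
def rebuild_lines(snippets, line_tolerance=2.0, gap_threshold=1.0):
    """Rebuild text lines from positioned snippets.

    snippets: iterable of (x_start, x_end, y, text) tuples.
    Snippets whose y differs by at most line_tolerance from the current
    line are joined to it; a space is inserted only when the horizontal
    gap to the previous snippet exceeds gap_threshold.
    """
    # PDF y coordinates grow upward, so sort top to bottom, then left to right.
    ordered = sorted(snippets, key=lambda s: (-s[2], s[0]))
    lines = []
    current_y = None
    previous_end = None
    for x_start, x_end, y, text in ordered:
        if current_y is None or abs(y - current_y) > line_tolerance:
            lines.append(text)          # start a new line
            current_y = y
        else:
            separator = " " if x_start - previous_end > gap_threshold else ""
            lines[-1] += separator + text
        previous_end = x_end
    return lines


# "Hello World" stored out of order as "ld", "Wor", "llo", "He"
stored = [
    (52, 62, 700, "ld"),
    (36, 52, 700, "Wor"),
    (10, 26, 700, "llo"),
    (0, 10, 700, "He"),
    (0, 20, 680, "Next"),
]
print(rebuild_lines(stored))
```

Reading the snippets in stored order would give "ldWorlloHe"; only the coordinates make it possible to recover the words, and none of the font or graphics state survives this step.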

bulk insertion in MS SQL from a text file

I have a text file that contains around 21 lakh (2.1 million) entries, and I want to insert all of them into a table. Initially I wrote a C# function that reads the file line by line and inserts each line into the table, but it takes too much time. Please suggest an efficient way to insert this bulk data. The file uses TAB (4 spaces) as the delimiter.
The text file also contains some duplicate entries, and I don't want to insert those.
Load all of your data into a DataTable object and then use SqlBulkCopy to bulk insert them:
    DataTable dtData = new DataTable("Data");
    // load your data here
    using (SqlConnection dbConn = new SqlConnection("db conn string"))
    {
        dbConn.Open();
        using (SqlTransaction dbTrans = dbConn.BeginTransaction())
        {
            try
            {
                using (SqlBulkCopy dbBulkCopy = new SqlBulkCopy(dbConn, SqlBulkCopyOptions.Default, dbTrans))
                {
                    dbBulkCopy.DestinationTableName = "intended SQL table name";
                    dbBulkCopy.WriteToServer(dtData);
                }
                dbTrans.Commit();
            }
            catch
            {
                dbTrans.Rollback();
                throw;
            }
        }
        dbConn.Close();
    }
I've included the example to wrap this into a SqlTransaction so there will be a full rollback if there's a failure along the way. To get you started, here's a good CodeProject article on loading the delimited data into a DataSet object.
Sanitizing the data before loading
OK, here's how I think your data looks:
CC_FIPS FULL_NAME_ND
AN Xixerella
AN Vila
AN Sornas
AN Soldeu
AN Sispony
... (cut down for brevity)
In this instance you want to create your DataTable like this:
DataTable dtData = new DataTable("Data");
dtData.Columns.Add("CC_FIPS");
dtData.Columns.Add("FULL_NAME_ND");
Then you want to iterate over each row (assuming the rows of your tab-delimited data are separated by carriage returns) and check whether the row already exists in the DataTable using the .Select method. If there is a match (I'm checking BOTH values; it's up to you whether you want to do something else), don't add the row, thereby preventing duplicates.
    using (FileStream fs = new FileStream("path to your file", FileMode.Open, FileAccess.Read))
    {
        int rowIndex = 0;
        using (StreamReader sr = new StreamReader(fs))
        {
            string line = string.Empty;
            while (!sr.EndOfStream)
            {
                line = sr.ReadLine();
                // use a row index to skip the header row as you don't want to insert CC_FIPS and FULL_NAME_ND
                if (rowIndex > 0)
                {
                    // split the tab-delimited line into an array of fields
                    string[] parts = line.Split('\t');
                    // now check whether this data has already been added to the datatable
                    DataRow[] rows = dtData.Select("CC_FIPS = '" + parts[0] + "' and FULL_NAME_ND = '" + parts[1] + "'");
                    if (rows.Length == 0)
                    {
                        // if there're no rows, then the data doesn't exist so add it
                        DataRow nr = dtData.NewRow();
                        nr["CC_FIPS"] = parts[0];
                        nr["FULL_NAME_ND"] = parts[1];
                        dtData.Rows.Add(nr);
                    }
                }
                rowIndex++;
            }
        }
    }
At the end of this you should have a sanitized DataTable that you can bulk insert. Please note that this code isn't tested, but it's a best guess as to how you should do it. There are many ways this can be done, and probably a lot better than this method (specifically LINQ) - but it's a starting point.
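One of those other ways is to track the key values already seen in a hash set instead of querying the DataTable for every row, which can become slow with millions of rows. The sketch below shows the idea in Python for brevity; in C# a HashSet would play the same role. The function name unique_rows and its parameters are hypothetical, and the sample data reuses the column layout assumed above.

```python
def unique_rows(lines, key_columns=(0, 1), has_header=True):
    """Yield each tab-delimited row once, judged by the given key columns.

    lines: iterable of text lines (for example an open file object).
    Rows with too few fields to build the key are skipped.
    """
    seen = set()
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue                      # skip the header row
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) <= max(key_columns):
            continue                      # malformed or empty line
        key = tuple(parts[i] for i in key_columns)
        if key not in seen:               # set lookup, no scan of earlier rows
            seen.add(key)
            yield parts


sample_lines = [
    "CC_FIPS\tFULL_NAME_ND\n",
    "AN\tXixerella\n",
    "AN\tVila\n",
    "AN\tXixerella\n",
    "AN\tO'Brien\n",
]
rows = list(unique_rows(sample_lines))
print(rows)
```

Because the key is compared as a plain tuple of strings, values containing an apostrophe need no escaping, unlike a filter expression built by string concatenation.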

Resources