Decode special characters when parsing html mvc - asp.net

In my MVC web application I am trying to parse an HTML document. It mostly works, but special characters such as æ, å, and ø are not decoded correctly.
Here is my code:
var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://cricketforbundet.no/index.php/en/klubber"));
var root = html.DocumentNode;
var p = root.Descendants("table").FirstOrDefault().Descendants("tr").Skip(1).FirstOrDefault().ChildNodes.Where(i=>i.Name == "td").FirstOrDefault().InnerText;
In p I get the name with the ø garbled, whereas I should get Bjørvika Cricket Klubb.
Any thoughts? I am using HtmlAgilityPack to parse HTML in ASP.NET.

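You have to use Load instead of LoadHtml and make sure to pass UTF-8 as the encoding. LoadHtml only sees the string that DownloadString has already decoded, possibly with the wrong encoding: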
WebClient webClient = new WebClient();
HtmlDocument html = new HtmlDocument();
html.Load(webClient.OpenRead("http://cricketforbundet.no/index.php/en/klubber"), Encoding.UTF8);
var root = html.DocumentNode;
var p = root.Descendants("table").FirstOrDefault().Descendants("tr").Skip(1).FirstOrDefault().ChildNodes.Where(i => i.Name == "td").FirstOrDefault().InnerText;
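To see why the encoding matters, here is a minimal sketch in Python (used only for illustration, not the C# of the question). It assumes the server sends UTF-8 and that the bytes were decoded as a single-byte encoding; Latin-1 is an assumed stand-in for whatever the client actually used.

```python
# The club name from the question, as UTF-8 bytes on the wire.
name = "Bjørvika Cricket Klubb"
raw = name.encode("utf-8")

# Decoding with a single-byte encoding (assumed: Latin-1) turns the
# two-byte UTF-8 sequence for "ø" into two unrelated characters.
wrong = raw.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

Passing the encoding explicitly to Load plays the role of the second decode here: the parser reads the raw stream itself instead of a string that was already decoded.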

Related

How to download the content of iframe using html unit where the ajax is used?

I want to get the content of an iframe. The iframe contains a table, and some of the information in it is loaded through an AJAX call. How can I get that content? Thanks in advance.
For privacy reasons I can only give a placeholder URL here, sorry.
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
// create a new WebClient object that emulates a browser
URL url= new URL("http://www.yahoo.com/");
HtmlPage page=(HtmlPage) webClient.getPage(url);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
java.util.logging.Logger.getLogger("org.apache.http").setLevel(java.util.logging.Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.SEVERE);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setRedirectEnabled(true);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
final HtmlPage startPage = webClient.getPage("http://www.yahoo.com/");
String text = startPage.asText();
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
List<HtmlInlineFrame> anchors4 = (List<HtmlInlineFrame>) startPage.<HtmlInlineFrame>getByXPath("//iframe[@id='contentFrame']");
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
System.out.println(anchors4.get(0).getTextContent());
List<HtmlDivision> anchors2 = (List<HtmlDivision>) startPage.<HtmlDivision>getByXPath("//div[@class='login_button']");
anchors2.get(0).click();
List<HtmlInlineFrame> anchors3 = (List<HtmlInlineFrame>) startPage.<HtmlInlineFrame>getByXPath("//iframe[@id='contentFrame']");
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
System.out.println("this" + anchors3.get(0).asXml());
List<HtmlTable> htmlTable = (List<HtmlTable>) startPage.<HtmlTable>getByXPath("//table[@class='Result']");
Try using FrameWindow (for this, your iframe needs to have a name attribute):
// ...
startPage = anchors2.get(0).click();
HtmlPage framePage = (HtmlPage) startPage.getFrameByName("your_frame_name").getEnclosedPage();
System.out.println("The content that you need: " + framePage.asXml());

scraping html without htmlagilitypack

Due to system limitations, I am not allowed to use HtmlAgilityPack, as I don't have the rights to reference the library. So I can only use native ASP.NET code to parse the page.
For example, I want to scrape this page https://sg.linkedin.com/job/google/jobs/ to get the list of Google jobs (just an example; I am not really planning to get this list, but my own company's). I can see the jobs in the page markup; how can I extract each job's description and name?
My current code is:
System.Net.WebClient client = new System.Net.WebClient();
try{
System.IO.Stream myStream = client.OpenRead("https://sg.linkedin.com/job/google/jobs/");
System.IO.StreamReader sr = new System.IO.StreamReader(myStream);
string htmlContent = sr.ReadToEnd();
//do not know how to carry on
}catch(Exception e){
Response.Write(e.Message);
}
How can I carry on?
You can fetch that page and use a regular expression to isolate the useful parts. If you are lucky, the isolated fragment may be valid XML:
var html = new WebClient().DownloadString("https://sg.linkedin.com/job/google/jobs/");
var jobs = new XmlDocument();
jobs.LoadXml(Regex.Replace(Regex.Match(html,
@"<ul class=""jobs"">[\s\S]*?</ul>").Value,
@"itemscope | itemprop="".*?""", "")); // clean invalid attributes
foreach (XmlElement job in jobs.SelectNodes("//li[@class='job']"))
{
Console.WriteLine(job.SelectSingleNode(".//a[@class='company']").InnerText);
Console.WriteLine(job.SelectSingleNode(".//h2/a").InnerText);
Console.WriteLine(job.SelectSingleNode(".//p[@class='abstract']").InnerText);
Console.WriteLine();
}
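The same isolate, clean, then parse idea can be tried offline. The sketch below is in Python purely for illustration, and the sample markup is hypothetical (the real page's structure may differ); it only strips the bare itemscope attribute, which is what makes the fragment invalid XML in this sample.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample markup, not the real page.
html = """<html><body>
<ul class="jobs">
<li class="job" itemscope itemprop="job">
<h2><a href="/job/1">Software Engineer</a></h2>
<a class="company" href="/c/1">Example Corp</a>
<p class="abstract">Build things.</p>
</li>
</ul>
</body></html>"""

# 1. Isolate the list with a non-greedy match.
fragment = re.search(r'<ul class="jobs">[\s\S]*?</ul>', html).group(0)

# 2. Remove the value-less attribute, which XML does not allow.
fragment = re.sub(r'\sitemscope(?=[\s>])', '', fragment)

# 3. Parse the cleaned fragment as XML and pull out the fields.
root = ET.fromstring(fragment)
jobs = [
    (
        li.find(".//a[@class='company']").text,
        li.find(".//h2/a").text,
        li.find(".//p[@class='abstract']").text,
    )
    for li in root.iterfind(".//li[@class='job']")
]
print(jobs)
```

This only works while the fragment happens to be well-formed; an unclosed tag or an unescaped & anywhere inside the list makes the XML parser fail.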

iTextSharp for reading PDF files is not working on Mono

I'm using iTextSharp.dll to read the contents of a PDF file. On a Windows server it is working correctly, but not on the Mono platform.
Mono error:
Server Error in '/' Application
Object reference not set to an instance of an object
I'm using this code:
PdfReader reader = new PdfReader(filename);
StringBuilder text = new StringBuilder();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
reader.Close();
}
It's OK now, the problem was in the path.
When I read the text, I have a problem with special characters (Slovak: š, č, ť, ž, ý, á, é, í, ...). After reading, they come out as "?", for example často => ?asto
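One possible cause, not confirmed anywhere in this thread, is the round trip through Encoding.Default in the code above: if the platform's default code page has no "č", the character is replaced during encoding. A minimal Python sketch of that effect, where Latin-1 is an assumed stand-in for such a code page:

```python
word = "často"  # example word from the question

# Assumption: the default code page cannot represent "č".
# Latin-1 stands in for it; unencodable characters become "?".
lossy_bytes = word.encode("latin-1", errors="replace")
round_tripped = lossy_bytes.decode("latin-1")

# UTF-8 can represent every character, so the same round trip is lossless.
safe = word.encode("utf-8").decode("utf-8")

print(round_tripped)
print(safe)
```

If that is what happens here, dropping the conversion line and appending the extracted string directly would be worth trying.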

Generating HTML Page on the fly

I'm writing a simple HTML page creator that generates HTML code from customized settings. Now I want to add a "Demo" button that generates an HTML page on the fly so the user can see the end result.
Is there any way to generate it in an online application?
Thanks
Actually, you don't need to use the server. You can use javascript: urls within Flash to achieve what you want, like so:
var request:URLRequest = new URLRequest("javascript:var w=window.open('', 'FlashGeneratedHTML', 'width=400, height=400'); w.document.write('<html><head></head><body>hello</body></html>');" );
navigateToURL(request, "_self");
All you need to do is replace the HTML code in the document.write() part of the JavaScript code with your own code.
You could do something like this:
var url:String = "http://servlet.url";
var request:URLRequest = new URLRequest(url);
request.method = URLRequestMethod.POST;
var variables:URLVariables = new URLVariables();
variables.html = source.of.your.html;
request.data = variables;
navigateToURL(request, "_blank");
So you basically navigate to a servlet on your server, sending it the HTML that you created in your Flex app as a POST parameter, and open the response in a new window or tab. The servlet should send the received HTML back, which gives the end user a preview of the generated page.

How do I prevent Flash's URLRequest from escaping the url?

I load some XML from a servlet from my Flex application like this:
_loader = new URLLoader();
_loader.load(new URLRequest(_servletURL+"?do=load&id="+_id));
As you can imagine _servletURL is something like http://foo.bar/path/to/servlet
In some cases, this URL contains accented characters (long story). I pass the unescaped string to URLRequest, but it seems that Flash escapes it and calls the escaped URL, which is invalid. Any ideas?
My friend Luis figured it out:
You should use encodeURI, which does the UTF-8 URL encoding:
http://livedocs.adobe.com/flex/3/langref/package.html#encodeURI()
but not unescape, because it unescapes to ASCII; see
http://livedocs.adobe.com/flex/3/langref/package.html#unescape()
I think that is where we are getting a %E9 in the URL instead of the expected %C3%A9.
http://www.w3schools.com/TAGS/ref_urlencode.asp
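The difference between %E9 and %C3%A9 comes down to which byte encoding is applied before percent-escaping. A small Python sketch (illustration only, not ActionScript) that reproduces both forms for "é":

```python
from urllib.parse import quote

ch = "é"

# Percent-encode the UTF-8 bytes (two bytes: C3 A9).
utf8_form = quote(ch)

# Percent-encode the Latin-1 byte instead (one byte: E9).
latin1_form = quote(ch, encoding="latin-1")

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9
```

A server that expects UTF-8 will not decode the single-byte %E9 form to "é", which matches the invalid URL described in the question.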
I'm not sure if this will be any different, but this is a cleaner way of achieving the same URLRequest:
var request:URLRequest = new URLRequest(_servletURL);
request.method = URLRequestMethod.GET;
var reqData:Object = new Object();
reqData.do = "load";
reqData.id = _id;
request.data = reqData;
_loader = new URLLoader(request);
From the livedocs: http://livedocs.adobe.com/flex/3/langref/flash/net/URLRequest.html
Creates a URLRequest object. If System.useCodePage is true, the request is encoded using the system code page, rather than Unicode. If System.useCodePage is false, the request is encoded using Unicode, rather than the system code page.
This page has more information: http://livedocs.adobe.com/flex/3/html/help.html?content=18_Client_System_Environment_3.html
but basically you just need to add this to a function that will be run before the URLRequest (I would probably put it in a creationComplete event)
System.useCodePage = false;
