scraping html without htmlagilitypack

scraping html without htmlagilitypack - asp.net

Due to the limitation of the system, i am not allowed to use htmlagilitypack as i dont have the rights to refer the library. So i can only use native asp.net programming language to parse page.
e.g. i want to scrap this page https://sg.linkedin.com/job/google/jobs/ to get the list of google jobs ( just an example, i am not really planning to get this list but my own company's) , i see they are under how can i extra these jobs description and name.
My current codes are
System.Net.WebClient client = new System.Net.WebClient();
try{
System.IO.Stream myStream = client.OpenRead("https://sg.linkedin.com/job/google/jobs/");
System.IO.StreamReader sr = new System.IO.StreamReader(myStream);
string htmlContent = sr.ReadToEnd();
//do not know how to carry on
}catch(Exception e){
Response.Write(e.Message);
}
how can i carry on?

You can fetch that page and use a regular expression to isolate the useful parts. If you get real lucky, you may have a valid XML file:
var html = new WebClient().DownloadString("https://sg.linkedin.com/job/google/jobs/");
var jobs = new XmlDocument();
jobs.LoadXml(Regex.Replace(Regex.Match(html,
#"<ul class=""jobs"">[\s\S]*?</ul>").Value,
#"itemscope | itemprop="".*?""", "")); // clean invalid attributes
foreach (XmlElement job in jobs.SelectNodes("//li[#class='job']"))
{
Console.WriteLine(job.SelectSingleNode(".//a[#class='company']").InnerText);
Console.WriteLine(job.SelectSingleNode(".//h2/a").InnerText);
Console.WriteLine(job.SelectSingleNode(".//p[#class='abstract']").InnerText);
Console.WriteLine();
}

Related

XmlDocument difficulties returning specific nodes/elements

first post. I hope it meets with the rules of asking questions.
I'm in a bit of bother with an xml document (its an API returned Xml). Now it uses a multitude of internet (http based) security measures which I have worked thru and I am now able to return the the top tier of nodes that are not nested.
however there are a few nodes which are nested under these and I need to return some of these values.
I'm set on using XMLDocument to do this, and I'm not interested in using XPath.
I should also note that I'm using the .Net 4.5 environment.
Example XML
<?xml version="1.0" encoding="utf-8"?>
<results>
<Info xmlns="http://xmlns.namespace">
<title>This Title</title>
<ref>
<SetId>317</SetId>
</ref>
<source>
<name>file.xxx</name>
<type>thisType</type>
<hash>cc7b99599c1bebfc4b8f12e47aba3f76</hash>
<pers>65.97602</pers>
<time>02:20:02.8527777</time>
</source>
....... Continuation which is same as above
Ok so above is the Xml that gets returned from the API, now, I can return title node no problem. What I would also like to return is any of the node values in the Element, for example the pers node value. But I only want to return one (as there are many in the existing xml further down)
Please note that there is an xmlns in the Info node which may not be allowing me to return the values.
So here is my code
using (var response = (HttpWebResponse) request.GetResponse())
{
//Get the response stream
using (Stream stream = response.GetResponseStream())
{
if (stream != null)
{
var xDoc = new XmlDocument();
var nsm = new XmlNamespaceManager(xDoc.NameTable);
nsm.AddNamespace("ns", XmlNamespace);
//Read the response stream
using (XmlReader xmlReader = XmlReader.Create(stream))
{
// This is straight forward, we just need to read the XML document and return the bits we need.
xDoc.Load(xmlReader);
XmlElement root = xDoc.DocumentElement;
var cNodes = root.SelectNodes("/results/ns:Info", nsm);
//Create a new instance of Info so that we can store any data found in the Info Properties.
var info = new Info();
// Now we have a collection of Info objects
foreach (XmlNode node in cNodes)
{
// Do some parsing or other relevant filtering here
var title = node["title"];
if (title != null)
{
info.Title = title.InnerText;
_logger.Info("This is the title returned ############# {0}", info.Title);
}
//This is the bit that is killing me as i can't return the any values in the of the sub nodes
XmlNodeList sourceNodes = node.SelectNodes("source");
foreach (XmlNode sn in sourceNodes)
{
XmlNode source = sn.SelectSingleNode("source");
{
var pers = root["pers"];
if (pers != null) info.pers = pers.InnerText;
_logger.Info("############FPS = {0}", info.pers);
}
}
}
}
Thanks in advance for any help

So I finally figured it out.
Here is the code that gets the subnodes. Basically I wasn't using my namespace identifier or my namespace for returning subnodes within the "Source" node.
For anybody else in this situation,
When you declare your name space there are to parts to it, a namespace identifier which is anything you want it to be in my case I chose "ns" and then the actual namespace in the XML file which is prefixed by xmlns and will contain something like for example: "http://xmlns.mynamespace".
So when searching subnodes inside the top level you need to declare these namespaces for the main node of the subnode you want to get.
// get the <source> subnode using the namespace to returns all <source> values
var source = node.SelectSingleNode("ns:source", nsm);
if (source != null)
{
info.SourceType = source["type"].InnerText;
info.Pers = source["pers"].InnerText;
_logger.Info("This SourceNode is {0}", info.SourceType);
_logger.Info("This PersNode is {0}", info.FramesPerSecond);
}
I hope this helps somebody else that's chasing their tails as I have.
Thanks

Safely read write a .txt file from asp.net

I am using a .txt file to log exceptions thrown from various methods in my asp.net (4.0) project. I have a page which reads texts from that file on every 10 minutes. If there are Read and Write attempts at the same time, will it throw any exception? If you have any better technique to handle such problem, please let me know. Currently, i'm using the following code-
Writing to the file
using (StreamWriter Writer = new StreamWriter(LogFilePath, true))
{
Writer.WriteLine(ErrorMsg);
}
Reading from the file
using (FileStream fs=File.OpenRead(LogFilePath))
{
using (StreamReader reader = new StreamReader(fs))
{
string line;
while ((line = reader.ReadLine()) != null)
{
Response.Write(line + "</br>");
}
}
}
Is these approaches are safe?
Thank you.

As people already suggested, the simplest way is to use external libraries, which handle locking of the file.
However, if you still want to use your own code to do that, make sure you're synchronizing access to the file, using lock:
lock(lockObj)
{
using (StreamWriter Writer = new StreamWriter(LogFilePath, true))
{
Writer.WriteLine(ErrorMsg);
}
}
where lockObj is
static object lockObj = new object();

Spring MVC 3.0 Jasper-Reports 4 Directing HTML reports in browser

I am working with Spring MVC 3 and JasperReports. I've created some great PDF and Xls reports without a problem. What I would like to do is display the created HTML report on screen for the user as a preview of the report they are getting, wrapped in the website template. Is there a way to do this?
I haven't found any tutorials/articles on this subject, I did find a book on JasperReports 3.5 for Java Developers that kind a addressed this. (I'm a noob on this so bear with me.) My understanding of this is that I have to redirect the input stream to the browser. I figure that there must be an easier way! And a way to strip the HTML report header and footer from it.
Any help would be appreciated!

Instead of using another framework to solve my problem. I solved it like this:
#RequestMapping(value = "/report", method = RequestMethod.POST)
public String htmlReport(#RequestParam(value = "beginDate") Date begin,
#RequestParam(value = "endDate", required = false) Date end,
ModelMap map) {
try {
// Setup my data connection
OracleDataSource ds = new OracleDataSource();
ds.setURL("jdbc:oracle:thin:user/password#10.10.10.10:1521:tst3");
Connection conn = ds.getConnection();
// Get the jasper report object located in package org.dphhs.tarts.reports
// Load it
InputStream reportStream = this.getClass().getResourceAsStream("reports/tartsCostAllocation.jasper");
JasperReport jasperReport = (JasperReport) JRLoader.loadObject(reportStream);
// Populate report with data
JasperPrint jasperPrint =
JasperFillManager.fillReport(jasperReport, new HashMap(), conn);
// Create report exporter to be in Html
JRExporter exporter = new JRHtmlExporter();
// Create string buffer to store completed report
StringBuffer sb = new StringBuffer();
// Setup report, no header, no footer, no images for layout
exporter.setParameter(JRHtmlExporterParameter.HTML_HEADER, "");
exporter.setParameter(JRHtmlExporterParameter.HTML_FOOTER, "");
exporter.setParameter(JRHtmlExporterParameter.IS_USING_IMAGES_TO_ALIGN, Boolean.FALSE);
// When report is exported send to string buffer
exporter.setParameter(JRExporterParameter.OUTPUT_STRING_BUFFER, sb);
exporter.setParameter(JRExporterParameter.JASPER_PRINT, jasperPrint);
// Export the report, store to sb
exporter.exportReport();
// Use Jsoup to clean the report table html to output to browser
Whitelist allowedHtml = new Whitelist();
allowedHtml.addTags("table", "tr", "td", "span");
allowedHtml.addTags("table", "style", "cellpadding", "cellspacing", "border", "bgcolor");
allowedHtml.addAttributes("tr", "valign");
allowedHtml.addAttributes("td", "colspan", "style");
allowedHtml.addAttributes("span", "style");
String html = Jsoup.clean(sb.toString(), allowedHtml);
// Add report to map
map.addAttribute("report", html);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return "costallocation/report";
}

NVelocity not finding the template

I'm having some difficulty with using NVelocity in an ASP.NET MVC application. I'm using it as a way of generating emails.
As far as I can make out the details I'm passing are all correct, but it fails to load the template.
Here is the code:
private const string defaultTemplatePath = "Views\\EmailTemplates\\";
...
velocityEngine = new VelocityEngine();
basePath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, defaultTemplatePath);
ExtendedProperties properties = new ExtendedProperties();
properties.Add(RuntimeConstants.RESOURCE_LOADER, "file");
properties.Add(RuntimeConstants.FILE_RESOURCE_LOADER_PATH, basePath);
velocityEngine.Init(properties);
The basePath is the correct directory, I've pasted the value into explorer to ensure it is correct.
if (!velocityEngine.TemplateExists(name))
throw new InvalidOperationException(string.Format("Could not find a template named '{0}'", name));
Template result = velocityEngine.GetTemplate(name);
'name' above is a valid filename in the folder defined as basePath above. However, TemplateExists returns false. If I comment that conditional out and let it fail on the GetTemplate method call the stack trace looks like this:
at NVelocity.Runtime.Resource.ResourceManagerImpl.LoadResource(String resourceName, ResourceType resourceType, String encoding)
at NVelocity.Runtime.Resource.ResourceManagerImpl.GetResource(String resourceName, ResourceType resourceType, String encoding)
at NVelocity.Runtime.RuntimeInstance.GetTemplate(String name, String encoding)
at NVelocity.Runtime.RuntimeInstance.GetTemplate(String name)
at NVelocity.App.VelocityEngine.GetTemplate(String name)
...
I'm now at a bit of an impasse. I feel that the answer is blindingly obvious, but I just can't seem to see it at the moment.

Have you considered using Castle's NVelocityTemplateEngine?
Download from the "TemplateEngine Component 1.1 - September 29th, 2009" section and reference the following assemblies:
using Castle.Components.Common.TemplateEngine.NVelocityTemplateEngine;
using Castle.Components.Common.TemplateEngine;
Then you can simply call:
using (var writer = new StringWriter())
{
_templateEngine.Process(data, string.Empty, writer, _templateContents);
return writer.ToString();
}
Where:
_templateEngine is your NVelocityTemplateEngine
data is your Dictionary of information (I'm using a Dictionary to enable me to access objects by a key ($objectKeyName) in my template.
_templateContents is the actual template string itself.
I hope this is of help to you!
Just to add, you'll want to put that into a static method returning a string of course!

Had this issue recently - NVelocity needs to be initialised with the location of the template files. In this case mergeValues is an anonymous type so in my template I can just refer to $Values.SomeItem:
private string Merge(Object mergeValues)
{
var velocity = new VelocityEngine();
var props = new ExtendedProperties();
props.AddProperty("file.resource.loader.path", #"D:\Path\To\Templates");
velocity.Init(props);
var template = velocity.GetTemplate("MailTemplate.vm");
var context = new VelocityContext();
context.Put("Values", mergeValues);
using (var writer = new StringWriter())
{
template.Merge(context, writer);
return writer.ToString();
}
}

Try setting the file.resource.loader.path
http://weblogs.asp.net/george_v_reilly/archive/2007/03/06/img-srchttpwwwcodegenerationnetlogosnveloc.aspx

Okay - So I'm managed to get something working but it is a bit of a hack and isn't anywhere near a solution that I want, but it got something working.
Basically, I manually load in the template into a string then pass that string to the velocityEngine.Evaluate() method which writes the result into the the given StringWriter. The side effect of this is that the #parse instructions in the template don't work because it still cannot find the files.
using (StringWriter writer = new StringWriter())
{
velocityEngine.Evaluate(context, writer, templateName, template);
return writer.ToString();
}
In the code above templateName is irrelevant as it isn't used. template is the string that contains the entire template that has been pre-loaded from disk.
I'd still appreciate any better solutions as I really don't like this.

The tests are the ultimate authority:
http://fisheye2.atlassian.com/browse/castleproject/NVelocity/trunk/src/NVelocity.Tests/Test/ParserTest.cs?r=6005#l122
Or you could use the TemplateEngine component which is a thin wrapper around NVelocity that makes things easier.

Fill a word document in asp.net?

I am working on Asp.Net project which needs to fill in a word document. My client provides a word template with last name, firstname, birth date,etc... . I have all those information in the sql database, and the client want the users of the application be able to download the word document with filled in information from the database.
What's the best way to archive this? Basically, I need identify those "fillable spot" in word document, fill those information in when the application user clicks on the download button.

If you can use Office 2007 the way to go is to use the Open XML API to format the documents:
http://support.microsoft.com/kb/257757. The reason you have to go that route is that you can't really use Word Automation in a server environment. (you CAN, but it's a huge pain to get working properly, and can EASILY break).
If you can't go the 2007 route, I've actually had pretty good success with just opening up a word template as a stream and finding and replacing the tokens and serving that to the user. This has actually worked surprisingly well in my experience and it's REALLY simple to implement.

I'm not sure about some of the ASP.Net aspects, but I am working on something similar and you might want to look into using an RTF instead. You can use pattern replacement in the RTF. For example you can add a tag like {USER_FIRST_NAME} in the RTF document. When the user clicks the download button, your application can take the information from the database and replace every instance of {USER_FIRST_NAME} with the data from the database. I am currently doing this with PHP and it works great. Word will open the RTF without a problem so that is another reason I chose this method.

I have used Aspose.Words for .NET. It's a little on the pricey side, but it works extremely well and the API is fairly intuitive for something that is potentially very complex.
If you want to pre-design your documents (or allow others to do that for you), anyone can put fields into the document. Aspose can open the document, find and fill the fields, and save a new filled-out copy for download.

Aspose works okay, but again: it's pricey.
Definitely avoid Office Automation in web apps as much as possible. It just doesn't scale well.
My preferred solution for this kind of problem is xml: specifically here I recommend WordProcessingML. You create an Xml document according to the schema, put a .doc extension on it, and MS Word will open it as if it were native in any version as far back as Office XP. This supports most Word features, and this way you can safely reduce the problem to replacing tokens in a text stream.
Be careful googling for more information on this: there's a lot of confusion between this and new Xml-based format for Office 2007. They're not the same thing.

This code works for WordMl text boxes and checkboxes. It's index based, so just pass in an array of strings for all textboxes and an array of bool's for all checkboxes.
public void FillInFields(
Stream sourceStream,
Stream destinationStream,
bool[] pageCheckboxFields,
string[] pageTextFields
) {
StreamUtil.Copy(sourceStream, destinationStream);
sourceStream.Close();
destinationStream.Seek(0, SeekOrigin.Begin);
Package package = Package.Open(destinationStream, FileMode.Open, FileAccess.ReadWrite);
Uri uri = new Uri("/word/document.xml", UriKind.Relative);
PackagePart packagePart = package.GetPart(uri);
Stream documentPart = packagePart.GetStream(FileMode.Open, FileAccess.ReadWrite);
XmlReader xmlReader = XmlReader.Create(documentPart);
XDocument xdocument = XDocument.Load(xmlReader);
List<XElement> textBookmarksList = xdocument
.Descendants(w + "fldChar")
.Where(e => (e.AttributeOrDefault(w + "fldCharType") ?? "") == "separate")
.ToList();
var textBookmarks = textBookmarksList.Select(e => new WordMlTextField(w, e, textBookmarksList.IndexOf(e)));
List<XElement> checkboxBookmarksList = xdocument
.Descendants(w + "checkBox")
.ToList();
IEnumerable<WordMlCheckboxField> checkboxBookmarks = checkboxBookmarksList
.Select(e => new WordMlCheckboxField(w, e, checkboxBookmarksList.IndexOf(e)));
for (int i = 0; i < pageTextFields.Length; i++) {
string value = pageTextFields[i];
if (!String.IsNullOrEmpty(value))
SetWordMlElement(textBookmarks, i, value);
}
for (int i = 0; i < pageCheckboxFields.Length; i++) {
bool value = pageCheckboxFields[i];
SetWordMlElement(checkboxBookmarks, i, value);
}
PackagePart newPart = packagePart;
StreamWriter streamWriter = new StreamWriter(newPart.GetStream(FileMode.Create, FileAccess.Write));
XmlWriter xmlWriter = XmlWriter.Create(streamWriter);
if (xmlWriter == null) throw new Exception("Could not open an XmlWriter to 4311Blank-1.docx.");
xdocument.Save(xmlWriter);
xmlWriter.Close();
streamWriter.Close();
package.Flush();
destinationStream.Seek(0, SeekOrigin.Begin);
}
private class WordMlTextField {
public int? Index { get; set; }
public XElement TextElement { get; set; }
public WordMlTextField(XNamespace ns, XObject element, int index) {
Index = index;
XElement parent = element.Parent;
if (parent == null) throw new NicException("fldChar must have a parent.");
if (parent.Name != ns + "r") {
log.Warn("Expected parent of fldChar to be a run for fldChar at position '" + Index + "'");
return;
}
var nextSibling = parent.ElementsAfterSelf().First();
if (nextSibling.Name != ns + "r") {
log.Warn("Expected a 'r' element after the parent of fldChar at position = " + Index);
return;
}
var text = nextSibling.Element(ns + "t");
if (text == null) {
log.Warn("Expected a 't' element inside the 'r' element after the parent of fldChar at position = " + Index);
}
TextElement = text;
}
}
private class WordMlCheckboxField {
public int? Index { get; set; }
public XElement CheckedElement { get; set; }
public readonly XNamespace _ns;
public WordMlCheckboxField(XNamespace ns, XContainer checkBoxElement, int index) {
_ns = ns;
Index = index;
XElement checkedElement = checkBoxElement.Elements(ns + "checked").FirstOrDefault();
if (checkedElement == null) {
checkedElement = new XElement(ns + "checked", new XAttribute(ns + "val", "0"));
checkBoxElement.Add(checkedElement);
}
CheckedElement = checkedElement;
}
public static void Copy(Stream readStream, Stream writeStream) {
const int Length = 256;
Byte[] buffer = new Byte[Length];
int bytesRead = readStream.Read(buffer, 0, Length);
// write the required bytes
while (bytesRead > 0) {
writeStream.Write(buffer, 0, bytesRead);
bytesRead = readStream.Read(buffer, 0, Length);
}
readStream.Flush();
writeStream.Flush();
}

In general you are going to want to avoid doing Office automation on a sever, and Microsoft has even stated that it is a bad idea as well. However, the technique that I generally use is the Office Open XML that was noted by aquinas. It does take a bit of time to learn your way around the format, but it is well worth it once you do as you don't have to worry about some of the issues involved with Office automation (e.g. processes hanging).
Awhile back I answered a similar question to this that you might find useful, you can find it here.

If you need to do this in DOC files (as opposed to DOCX), then the OpenXML SDK won't help you.
Also, just want to add another +1 about the danger of automating the Office apps on servers. You will run into problems with scale - I guarantee it.
To add another reference to a third-party tool that can be used to solve your problem:
http://www.officewriter.com
OfficeWriter lets you control docs with a full API, or a template-based approach (like what your requirement is) that basically lets you open, bind, and save DOC and DOCX in scenarios like this with little code.

Could you not use Microsofts own InterOp Framework to utilise Word Functionality
See Here

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

scraping html without htmlagilitypack - asp.net

Related

XmlDocument difficulties returning specific nodes/elements

Safely read write a .txt file from asp.net

Spring MVC 3.0 Jasper-Reports 4 Directing HTML reports in browser

NVelocity not finding the template

Fill a word document in asp.net?

Categories

Resources