crawler4j - I can't get the title

In short: I can't get the title of this URL: http://www.namlihipermarketleri.com.tr/default.asp?git=9&urun=10277 (the link is broken as of 18-11-2015).
In my WebCrawler implementation:
@Override
public void visit(Page page) {
    System.out.println(page.getWebURL().getURL()); // this prints the URL
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        System.out.println(htmlParseData.getTitle()); // This line prints an empty line!
    }
}
Note: Title itself contains some commas “,”.
Can you suggest a solution?
Is this a bug?
Thanks in advance.

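The problem was probably that there were four title tags in the HTML document.
I worked around it by parsing the raw HTML with Jsoup (http://jsoup.org/) and taking the first title element:
// requires: import org.jsoup.Jsoup; import org.jsoup.nodes.Document;
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
Document htmlDocument = Jsoup.parse(html);
// get(0) throws IndexOutOfBoundsException if the page has no title element at all
String title = htmlDocument.getElementsByTag("title").get(0).text();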
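If adding Jsoup is not an option, the same "take the first title" idea can be sketched with the Java standard library alone. This is only a rough fallback, not what crawler4j or Jsoup do internally: it uses a regular expression rather than a real HTML parser, and the class and method names (FirstTitle, firstTitle) are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the text of the first <title> element only,
// ignoring any further <title> tags in the document.
final class FirstTitle {
    // Case-insensitive, and DOTALL so a title spanning several lines still matches.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static Optional<String> firstTitle(String html) {
        if (html == null) {
            return Optional.empty();
        }
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty(); // no title element at all
        }
        // Collapse runs of whitespace, roughly as a browser would display it.
        return Optional.of(m.group(1).replaceAll("\\s+", " ").trim());
    }
}
```

Inside visit(Page), this could be called as FirstTitle.firstTitle(htmlParseData.getHtml()).orElse(""). Commas in the title are returned unchanged.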

Related

How to make a picture file downloadable?

I have an ASP.NET MVC3 application and I want to link to an image file (png, jpeg, gif, etc.) so that when the user clicks the link, the file is downloaded instead of being displayed in the browser. Is there any way to do this?
Write your link something like this:
@Html.ActionLink(
    "Download Image", // text to show
    "Download", // action name
    "DownloadManager", // controller name; drop this argument and the trailing null to use the current controller
    new { filename = "my-image", fileext = "jpeg" }, // file name and extension
    null // HTML attributes (needed so the overload that takes a controller name is chosen)
)
and the action method is here:
public FilePathResult Download(string filename, string fileext) {
    var basePath = Server.MapPath("~/Contents/Images/");
    var fullPath = System.IO.Path.Combine(
        basePath, string.Concat(filename.Trim(), '.', fileext.Trim()));
    var contentType = GetContentType(fileext);
    // The file name to use in the file-download dialog box that is displayed in the browser.
    var downloadName = "one-name-for-client-file." + fileext;
    return File(fullPath, contentType, downloadName);
}
private string GetContentType(string fileext) {
    switch (fileext) {
        case "jpg":
        case "jpe":
        case "jpeg": return "image/jpeg";
        case "png": return "image/png"; // registered MIME type; "image/x-png" is a legacy alias
        case "gif": return "image/gif";
        default: throw new NotSupportedException();
    }
}
UPDATE:
In fact, when a file is sent to the browser this way, the following header is added to the HTTP response:
Content-Disposition: attachment; filename=file-client-name.ext
where file-client-name.ext is the name and extension the file should be saved under on the client system. For example, if you want to do this in plain ASP.NET (non-MVC), you can create an HttpHandler, write the file stream to the Response, and add the header above:
Response.Headers.Add("Content-Disposition", "attachment; filename=" + "file-client-name.ext");
That's all, enjoy :D
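The header value above is just a string, so building it can be shown in isolation. The sketch below is in Java rather than C#, purely as an illustration of the format. The class and method names are hypothetical, and replacing quotes, backslashes and line breaks with underscores is a simplifying assumption; non-ASCII file names need the filename* form from RFC 6266, which is not handled here.

```java
// Hypothetical helper that builds a Content-Disposition header value
// asking the browser to save the response under the given name.
final class ContentDisposition {
    static String attachment(String fileName) {
        // Replace characters that would break the quoted string or inject a new header line.
        String safe = fileName.replaceAll("[\\r\\n\"\\\\]", "_");
        return "attachment; filename=\"" + safe + "\"";
    }
}
```

For example, ContentDisposition.attachment("my-image.jpeg") yields attachment; filename="my-image.jpeg", which is the value to send with the Content-Disposition header.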
Well technically your browser is downloading it.
I don't think you can directly link to an image, and have the browser prompt to download.
You could try something where instead of linking directly to the image, you link to a page, which serves up the image in a zip file perhaps - which of course would prompt the download to occur.
Yes, you can.
Now, you'll need to customize this to suit your needs, but I created a FileController that returns files by an identifier (you can just as easily return them by name).
public class FileController : Controller
{
    public ActionResult Download(string name)
    {
        // check the existence of the filename, and load it into memory
        byte[] data = SomeFunctionToReadTheFile(name);
        FileContentResult result = new FileContentResult(data, "image/jpeg"); // or whatever it is
        // Setting a download name adds a Content-Disposition: attachment header,
        // so the browser offers to save the file instead of displaying it.
        result.FileDownloadName = name;
        return result;
    }
}
Now, how you read that file or where you get it from is up to you. I then created a route like this:
routes.MapRoute(null, "files/{name}", new { controller = "File", action = "Download"});
My database has a map of identifiers to files (it's actually more complex than this, but I am omitting that logic for brevity), so I can write URLs like:
"~/files/somefile"
And the relevant file is downloaded.
I don't think this is possible but a simple message saying right click to save image would suffice I think.

How to pass HTML fragment (as a delegate?) to a declarative Razor Helper?

I've been writing some declarative Razor Helpers (using the @helper syntax) for use with Umbraco 4.7, which now supports the Razor view engine (though I would imagine this applies equally to WebMatrix or ASP.NET MVC). They all work fine. However, I would like to make them a bit more flexible so that I can pass in an HTML fragment that gets 'wrapped' around the output (but only when there is output). For instance, I have a helper (much simplified here) that generates an HTML link from some parameters:
@helper HtmlLink(string url, string text = null, string title = null,
    string cssClass = null, bool newWindow = false)
{
    if (!String.IsNullOrEmpty(url))
    {
        System.Web.Mvc.TagBuilder linkTag = new System.Web.Mvc.TagBuilder("a");
        linkTag.Attributes.Add("href", url);
        linkTag.SetInnerText(text ?? url);
        if (!String.IsNullOrEmpty(title))
        {
            linkTag.Attributes.Add("title", title);
        }
        if (!String.IsNullOrEmpty(cssClass))
        {
            linkTag.Attributes.Add("class", cssClass);
        }
        if (newWindow)
        {
            linkTag.Attributes.Add("rel", "external");
        }
        @Html.Raw(linkTag.ToString())
    }
}
Calling @LinkHelper.HtmlLink("http://www.google.com/", "Google") would generate the HTML output <a href="http://www.google.com/">Google</a>.
What would be nice, though, is if I could optionally pass in an HTML fragment that would be wrapped around the generated hyperlink HTML so long as the URL has a value. I'd basically like to be able to do something like this:
@LinkHelper.HtmlLink("http://www.google.com/", "Google", @<li>@link</li>)
and get the output
<li><a href="http://www.google.com/">Google</a></li>
or @LinkHelper.HtmlLink("", "", @<li>@link</li>)
and get no output at all.
I read about Templated Razor Delegates on Phil Haack's blog but cannot get the hang of how they can be used in this context, if it is possible at all. I get the feeling I'm missing something or barking up the wrong tree.
In case anyone else is looking for this: I put together the following, which works. It handles empty strings and a null delegate (based on my not-quite-exhaustive testing below).
The key, as Jakub says, is to use the magic @item parameter.
@helper HtmlLink(string url, string text = null,
    Func<IHtmlString, HelperResult> formatterFunction = null,
    string title = null, string cssClass = null, bool newWindow = false)
{
    if (!String.IsNullOrEmpty(url))
    {
        System.Web.Mvc.TagBuilder linkTag = new System.Web.Mvc.TagBuilder("a");
        linkTag.Attributes.Add("href", url);
        linkTag.SetInnerText(text ?? url);
        if (!String.IsNullOrEmpty(title))
        {
            linkTag.Attributes.Add("title", title);
        }
        if (!String.IsNullOrEmpty(cssClass))
        {
            linkTag.Attributes.Add("class", cssClass);
        }
        if (newWindow)
        {
            linkTag.Attributes.Add("rel", "external");
        }
        // This is the part using the delegate
        if (formatterFunction == null)
        {
            @Html.Raw(linkTag.ToString())
        }
        else
        {
            @formatterFunction(Html.Raw(linkTag.ToString()))
        }
    }
}
@HtmlLink("http://www.google.com", "Google")
@HtmlLink("http://www.google.com", "Google", @<b>@item</b>)
@HtmlLink("http://www.google.com", "Google", @<text><i>@item</i><br/></text>) @* <br/> fails otherwise *@
@HtmlLink("http://www.google.com", "Google", @<b>@item</b>)
@HtmlLink("", "", @<b>@item</b>)
I think the problem is with @link. Templated Razor delegates take the data using a 'magic' parameter, @item. Try replacing @link with @item in your template.
Also, post the code that executes the template - your HtmlLink method that takes Func<dynamic, object>.

Screen scrape to email with full url for images and css

I am screen scraping a web page and sending it as an HTML email.
What is the easiest/best way to manipulate the HTML so that all image and CSS references use full http addresses?
The current method is similar to the following (typed from memory), and it is very error-prone:
string html = rawHtml.Replace("=\"", "=\"" + Request["SERVER_NAME"]);
.
.
Here is the current function we use to screen scrape using GET
public static string WebGet(string address)
{
    string result = "";
    using (WebClient client = new WebClient())
    {
        using (StreamReader reader = new StreamReader(client.OpenRead(address)))
        {
            string s = reader.ReadToEnd();
            result = s;
        }
    }
    return result;
}
It sounds like what you need is an HTML parser. Once you parse the html string with the parser, you can execute commands that easily manipulate the DOM, and thus you could find all img elements, check their src and append the Request["SERVER_NAME"] if you need to.
I don't code in ASP, but I found this:
http://htmlagilitypack.codeplex.com/
And here is a useful article I found explaining how to use it:
https://web.archive.org/web/20211020001935/https://www.4guysfromrolla.com/articles/011211-1.aspx
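Whichever parser finds the img and link elements, the step that turns each src or href into a full address is ordinary URL resolution against the address of the scraped page. The sketch below shows that step only, in Java with java.net.URI, as an illustration rather than as part of the Html Agility Pack approach. The class and method names and the example.com addresses are hypothetical.

```java
import java.net.URI;

// Hypothetical helper: resolves a reference found in the scraped HTML
// (relative, root-relative or already absolute) against the page's own URL.
final class AbsoluteUrls {
    static String toAbsolute(String pageUrl, String reference) {
        // URI.resolve follows the standard relative-reference rules:
        // an absolute reference is returned unchanged.
        return URI.create(pageUrl).resolve(reference).toString();
    }
}
```

For a page at http://example.com/shop/page.html, the reference images/logo.png resolves to http://example.com/shop/images/logo.png, while /css/site.css resolves to http://example.com/css/site.css. Blindly prefixing the server name, as in the Replace call from the question, gets the first case wrong.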

NVelocity -- #parse with embedded resources

I'm generating emails based off embedded NVelocity templates and would like to do something with dynamically included sections. So my embedded resources are something like this:
DigestMail.vm
_Document.vm
_ActionItem.vm
_Event.vm
My email routine will get a list of objects and will pass each of these along with the proper view to DigestMail.vm:
public struct ItemAndView
{
    public string View;
    public object Item;
}
private void GenerateWeeklyEmail(INewItems[] newestItems)
{
    IList<ItemAndView> itemAndViews = new List<ItemAndView>();
    foreach (var item in newestItems)
    {
        itemAndViews.Add(new ItemAndView
        {
            View = string.Format("MyAssembly.MailTemplates._{0}.vm", item.GetType().Name),
            Item = item
        });
    }
    var context = new Dictionary<string, object>();
    context["Recipient"] = _user;
    context["Items"] = itemAndViews;
    string mailBody = _templater.Merge("MyAssembly.MailTemplates.DigestMail.vm", context);
}
And in my DigestMail.vm template I've got something like this:
#foreach($Item in $Items)
====================================================================
#parse($Item.viewname)
#end
But it's unable to #parse when given the path to an embedded resource like this. Is there any way I can tell it to parse each of these embedded templates?
Hey Jake, is .viewname a property? I'm not seeing you setting it in your code, how about you use the following:
#foreach($Item in $Items)
====================================================================
$Item.viewname
#end
I don't know why you're parsing $Item.viewname rather than just using the above. I'm suggesting this because I've never needed to parse anything myself!
Please refer to this post where we've discussed the generation of templates.
Hope this helps!

Is there a better way to get ClientID's into external JS files?

I know this has been asked before, but I've found a different way to get references to controls in external JS files but I'm not sure how this would go down in terms of overall speed.
My code is
public static void GenerateClientIDs(Page page, params WebControl[] controls) {
    StringBuilder script = new StringBuilder();
    script.AppendLine("<script type=\"text/javascript\">");
    foreach (WebControl c in controls) {
        script.AppendLine(String.Format("var {0} = '#{1}';", c.ID, c.ClientID));
    }
    script.AppendLine("</script>");
    if (!page.ClientScript.IsClientScriptBlockRegistered("Vars")) {
        page.ClientScript.RegisterClientScriptBlock(page.GetType(), "Vars", script.ToString());
    }
}
This way I can reference the IDs of the controls on the aspx page in my JS files.
Can anyone see any drawbacks to doing things this way? I've only started using external JS files. Before everything was written into the UserControl itself.
Well, the method can only be used once in each page, so if you are calling it from a user control that means that you can never put two of those user controls on the same page.
You could store the control references in a list until the PreRender event, then put them all in a script tag in the page head. That way you can call the method more than once, and all client IDs are put in the same script tag.
Something like:
private const string _key = "ClientIDs";
public static void GenerateClientIDs(params WebControl[] controls) {
    Page page = HttpContext.Current.Handler as Page;
    List<WebControl> items = HttpContext.Current.Items[_key] as List<WebControl>;
    if (items == null) {
        // First call in this request: hook PreRender exactly once
        page.PreRender += RenderClientIDs;
        items = new List<WebControl>();
    }
    items.AddRange(controls);
    HttpContext.Current.Items[_key] = items;
}
// The handler must match the EventHandler signature to be attached to PreRender
private static void RenderClientIDs(object sender, EventArgs e) {
    Page page = HttpContext.Current.Handler as Page;
    List<WebControl> items = HttpContext.Current.Items[_key] as List<WebControl>;
    StringBuilder script = new StringBuilder();
    script.AppendLine("<script type=\"text/javascript\">");
    foreach (WebControl c in items) {
        script.AppendLine(String.Format("var {0} = '#{1}';", c.ID, c.ClientID));
    }
    script.AppendLine("</script>");
    // Page.Header requires <head runat="server"> in the page markup
    page.Header.Controls.Add(new LiteralControl(script.ToString()));
}
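Stripped of the ASP.NET plumbing, the pattern above is "collect ID pairs from any number of callers, then emit a single script block once". A language-neutral sketch of that pattern, written in Java for illustration only, might look like this. ClientIdRegistry and its methods are hypothetical names, and the client ID in the usage note is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry: callers register (server ID, client ID) pairs as often
// as they like; the script block is rendered once, at the end of the request.
final class ClientIdRegistry {
    // LinkedHashMap keeps registration order and drops duplicate server IDs.
    private final Map<String, String> ids = new LinkedHashMap<>();

    void register(String serverId, String clientId) {
        ids.put(serverId, clientId);
    }

    String renderScript() {
        StringBuilder script = new StringBuilder("<script type=\"text/javascript\">\n");
        for (Map.Entry<String, String> e : ids.entrySet()) {
            // Same output shape as the C# code: var <serverId> = '#<clientId>';
            script.append("var ").append(e.getKey())
                  .append(" = '#").append(e.getValue()).append("';\n");
        }
        return script.append("</script>").toString();
    }
}
```

Registering txtUserName with the client ID ctl00_main_txtUserName and then calling renderScript() produces one script block containing var txtUserName = '#ctl00_main_txtUserName';, no matter how many controls registered their IDs.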
Check this out: http://weblogs.asp.net/joewrobel/archive/2008/02/19/clientid-problem-in-external-javascript-files-solved.aspx
Looks like it takes care of the dirty work for you (something like Guffa's answer). It generates a JSON object (example) containing server IDs and client IDs, so you can do something like this in your JavaScript:
var val = PageControls.txtUserName.value;

Resources