How to build a .NET web scraper for news articles about people

I am looking to create a simple web service that crawls pages on specific websites and looks for a person's name. Does anybody know of any examples of this, or can anyone help me get started?
Edit: I should mention I want to do this with Visual Studio C#. I will only be looking at English news sites that I specify.

Here is a simple function that returns true if a Web page contains a person's name:
    string response;
    using (System.Net.WebClient wc = new System.Net.WebClient())
    {
        // Download the raw HTML of the page as a string.
        response = wc.DownloadString(url);
    }
    return response.Contains("John Doe");
For finding the links within the page, check out this question: Parse HTML links using C#
You can collect the distinct URLs throughout the site and run the code above for each URL you find.
Also, type this into Google to see what they find. site:www.somesite.com "John Doe"
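As a rough sketch of the "collect distinct URLs" step in this answer, something like the following could pull same-site links out of a downloaded page. LinkCollector and SameSiteLinks are hypothetical names, and the regular expression is a deliberately crude stand-in for a real HTML parser, so treat it as an illustration under those assumptions rather than a finished crawler.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkCollector
{
    // Crude href matcher for illustration only; it stops at a quote or a '#' fragment.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*[\"']([^\"'#]+)", RegexOptions.IgnoreCase);

    // Returns the distinct absolute http(s) URLs in html that share pageUrl's host.
    public static HashSet<string> SameSiteLinks(string html, Uri pageUrl)
    {
        var links = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in Href.Matches(html))
        {
            Uri absolute;
            // Resolve relative links against the page they were found on.
            if (Uri.TryCreate(pageUrl, match.Groups[1].Value, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                && absolute.Host == pageUrl.Host)
            {
                links.Add(absolute.AbsoluteUri);
            }
        }
        return links;
    }
}
```

Each URL returned could then be queued, downloaded with the WebClient code above, and checked for the name, with a visited set to avoid fetching the same page twice.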

Using C#, your best option for a crawler and parser (the two parts of your solution) would be the Html Agility Pack, which can be found on CodePlex.
Refer to this answer for an example usage scenario: How to use HTML Agility pack

Related

DotNetOpenAuth and Quickbooks

I need to develop a .NET 3.5 application that imports data from QuickBooks, and I decided to use DNOA to OAuthorize with them. I downloaded the latest available version (4.1.something), took a look around, then created a QuickBooksConsumer following the example of GoogleConsumer. However, there is a problem I cannot seem to solve.
The url of the QuickBooks REST services looks like this:
https://services.intuit.com/sb/{0}/v2/{1}
where:
{0} is the name of the object to get the records of (like, "invoice", or "payment");
{1} is the realmId, i.e. the id of the Company the data is required for
The problem is that I don't see how to do PrepareAuthorizedRequest with such variable urls. The function is not virtual, so I cannot override it in my QuickBooksConsumer.cs. I'm stuck.
Can you please show me how to do that?
Thanks in advance!
Authorizing requests to dynamically created URLs should be no problem at all. Just wrap any URL in a MessageReceivingEndpoint and send it through ConsumerBase.PrepareAuthorizedRequest and you're good to go.
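For the "dynamically created URL" part, a minimal sketch using only the base class library might look like this. QuickBooksUrls and Build are hypothetical helper names; the template string is the one from the question. The resulting Uri is what you would then wrap in a MessageReceivingEndpoint as described above.

```csharp
using System;

public static class QuickBooksUrls
{
    // URL template from the question: {0} = object name, {1} = realmId.
    private const string Template = "https://services.intuit.com/sb/{0}/v2/{1}";

    // Fills in the template, escaping both values so they are safe as path segments.
    public static Uri Build(string objectName, string realmId)
    {
        return new Uri(string.Format(
            Template,
            Uri.EscapeDataString(objectName),
            Uri.EscapeDataString(realmId)));
    }
}
```

For example, QuickBooksUrls.Build("invoice", realmId) would give the endpoint URL for invoices of one company, built fresh for each request.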

Best practice for DB & file search with Lucene.NET in an ASP.NET web application

I have a site where I need to develop site search functionality. The data may reside in a database table or in an .aspx page as static text. I searched Google and found that Lucene.NET may be appropriate for site search, but I have never used Lucene.NET, so I don't know how to create a Lucene.NET index file. I want to develop two utilities in my site:
1) One to create and update the index file by reading data from the database table and the physical .aspx files.
2) One that searches for single or multiple keywords against the index file.
I found a code snippet which I do not understand:
    string indexFileLocation = @"C:\Index";
    string stopWordsLocation = @"C:\Stopwords.txt";
    var directory = FSDirectory.Open(new DirectoryInfo(indexFileLocation));
    Analyzer analyzer = new StandardAnalyzer(
        Lucene.Net.Util.Version.LUCENE_29, new FileInfo(stopWordsLocation));
What is Lucene.Net.Util.Version.LUCENE_29? What is stopWordsLocation, and how does the data need to be stored in Stopwords.txt?
I have no idea how to develop the two utilities above, so please guide me on how to search my DB as well as my .aspx files with Lucene.NET. I would be glad if someone could discuss this here with a bit of sample code. Thanks.
Lucene.Net.Util.Version.LUCENE_29 just indicates the Lucene version you are using; in new code you should always use the most up-to-date value. It is there for backward compatibility, in case you upgrade to a Lucene version that changes the StandardAnalyzer but you don't want to re-index all your data.
The stopWordsLocation is the location of a file containing your stop words, the words you don't want to index.
For example: it, he, she, the, or, and, etc.
It's a regular text file with one stop word per line, each line ending with a line break.
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/all/org/apache/lucene/analysis/WordlistLoader.html#getWordSet(java.io.Reader)
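To make the file format concrete, here is a small sketch that reads a stop-word list the way the answer describes it (one word per line). StopWords and Load are hypothetical names, and this is a plain base-class-library approximation for illustration, not Lucene's own loader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StopWords
{
    // Reads one stop word per line, trimming whitespace and skipping blank lines.
    public static HashSet<string> Load(TextReader reader)
    {
        var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string word = line.Trim();
            if (word.Length > 0)
            {
                words.Add(word);
            }
        }
        return words;
    }
}
```

Calling StopWords.Load(File.OpenText(stopWordsLocation)) on a file containing "it", "he", "she" and so on, each on its own line, would yield a set with one entry per word.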

Tridion 2009 - Publish another Component from a Component Template

First, the overall description:
There are two Component Templates, NewsArticle and NewsList. NewsArticle is a Dreamweaver Template, and is used to display the content of a news article. NewsList is an xml file that contains aggregated information about all of the news articles.
Currently, a content author must publish the news article, and then re-publish the newslist to regenerate the xml.
Problem:
I have been tasked with having the publish of a news article also regenerate and publish the newslist. Through C#, I am able to retrieve the content of the newslist component, generate the updated xml from the news article, and merge it into the xml from the newslist. I am running into trouble getting the newslist to publish.
I have limited access to documentation, but from what I do have, I believe using the static PublishEngine.Publish method will allow me to do what I need. I believe the first parameter (items) is just a list that contains my updated newslist, and the second parameter is a new PublishInstruction with the RenderInstruction.RenderMode set to Publish. I am a little lost on what the publicationTargets should be.
Am I on the right track? If so, any help with the Publish method call is appreciated, and if not, any suggestions?
Like Quirijn suggested, a broker query is the cleanest approach.
If a broker isn't available (i.e. a static publishing model only), I usually generate the newslist XML from a TBB that adds the XML as a binary, rather than kicking off publishing of another component or page. You can do this by calling this method in your C# TBB:
engine.PublishingContext.RenderedItem.AddBinary(
Stream yourXmlContentConvertedToMemoryStream,
string filename,
StructureGroup location,
string variantId,
string mimeType)
Make the variantId unique per the newslist XML file that you create, so that different components can overwrite/update the same file.
Better yet, do this in a Page Template rather than Component Template so that the news list is generated once per page, rather than per component (if you have multiple articles per page).
You are on the right track here with the PublishEngine.Publish() method:
    PublishEngine.Publish(
        new IdentifiableObject[] { linkedComponent },
        engine.PublishingContext.PublishInstruction,
        new List<PublicationTarget>() { engine.PublishingContext.PublicationTarget });
You can just reuse the PublishInstruction and Target from the current context of your template. This sample shows a Component, but it should work in a page too.
One thing to keep in mind is that this is not possible in SDL Tridion 2011 SP1, as the publish action is not allowed out of the box due to security restrictions. I have an article about this here http://www.tridiondeveloper.com/the-story-of-sdl-tridion-2011-custom-resolver-and-the-allowwriteoperationsintemplates-attribute

.NET frameworks for formatting e-mail messages?

Are there any open source/free frameworks available that take some of the pain out of building HTML e-mails in C#?
I maintain a number of standalone ASP.NET web forms whose main function is to send an e-mail. Most of these are in plain text format right now, because doing a nice HTML presentation is just too tedious.
I'd also be interested in other approaches to tackling this same problem.
EDIT: To be clear, I'm interested in taking plain text form input (name, address, phone number) and dropping it into an HTML e-mail template. That way the recipient would see a nicely formatted message instead of the primitive text output we're currently giving them.
EDIT 2: As I'm thinking more about this and about the answers the question has generated so far, I'm getting a clearer picture of what I'm looking for. Ideally I'd like a new class that would allow me to go:
HtmlMessage body = new HtmlMessage();
body.Header(imageLink);
body.Title("Some Text That Will Display as a Header");
body.Rows.Add("First Name", FirstName.Text);
The HtmlMessage class would build out a table, drop the images in place and add a new row for each field that I add. It doesn't seem like it would be that hard to write, so if there's nothing out there, maybe I'll go that route.
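A minimal sketch of such a class, assuming the API shape from the snippet above, could look like the following. Everything here is hypothetical (including the HtmlMessageRows helper), the markup is intentionally bare, and it assumes System.Net.WebUtility is available (.NET 4.0 or later); on older frameworks HttpUtility.HtmlEncode from System.Web would be the usual substitute.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical row list that supports Rows.Add("label", "value") and keeps insertion order.
public class HtmlMessageRows : List<KeyValuePair<string, string>>
{
    public void Add(string label, string value)
    {
        Add(new KeyValuePair<string, string>(label, value));
    }
}

// Hypothetical HtmlMessage class shaped after the usage in the question.
public class HtmlMessage
{
    private string headerImageUrl;
    private string title;
    private readonly HtmlMessageRows rows = new HtmlMessageRows();

    public HtmlMessageRows Rows { get { return rows; } }

    public void Header(string imageUrl) { headerImageUrl = imageUrl; }

    public void Title(string text) { title = text; }

    // Renders the message; all user-supplied text is HTML-encoded.
    public override string ToString()
    {
        var html = new StringBuilder("<html><body>");
        if (headerImageUrl != null)
            html.AppendFormat("<img src=\"{0}\" alt=\"\" />", WebUtility.HtmlEncode(headerImageUrl));
        if (title != null)
            html.AppendFormat("<h1>{0}</h1>", WebUtility.HtmlEncode(title));
        html.Append("<table>");
        foreach (var row in rows)
            html.AppendFormat("<tr><th>{0}</th><td>{1}</td></tr>",
                WebUtility.HtmlEncode(row.Key), WebUtility.HtmlEncode(row.Value));
        html.Append("</table></body></html>");
        return html.ToString();
    }
}
```

The result of body.ToString() could be assigned to MailMessage.Body with IsBodyHtml set to true.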
Andrew Davey created Postal which lets you do templated emails using any of the ASP.NET MVC view engines. Here's a video where he talks about how to use it.
His examples:
    public class HomeController : Controller {
        public ActionResult Index() {
            dynamic email = new Email("Example");
            email.To = "webninja@example.com";
            email.FunnyLink = DB.GetRandomLolcatLink();
            email.Send();
            return View();
        }
    }
And the template using Razor:
    To: @ViewBag.To
    From: lolcats@website.com
    Subject: Important Message

    Hello, You wanted important web links right? Check out this:
    @ViewBag.FunnyLink
    <3
The C# port of StringTemplate worked well for me. I highly recommend it. The template file can have a number of named tokens like this:
...
<b>
Your information to login is as follows:<br />
Username: $username$<br />
Password: $password$<br />
</b>
...
...and you can load this template and populate it like this:
notificationTemplate.SetAttribute("username", Username);
notificationTemplate.SetAttribute("password", Password);
At the end, you get the ToString() of the template and assign it to the MailMessage.Body property.
I recently implemented what you're describing using MarkdownSharp. It was pretty much painless.
It's the same framework (minus a few tweaks) that StackOverflow uses to take plain-text-formatted posts and make them look like nice HTML.
Another option would be to use something like TinyMCE to give your users a WYSIWYG HTML editor. This would give them more power over the look and feel of their emails, but it might just overcomplicate things.
Bear in mind that there are also some security issues with user-generated HTML. Regardless of which strategy you use, you need to make sure you sanitize the user's input so they can't include scary things like script tags in their input.
Edit
Sorry, I didn't realize you were looking for an email templating solution. The simplest solution I've come up with is to enable text "macros" in user-generated content emails. So, for example, the user could input:
Dear {RecipientFirstName},
Thank you for your interest in {ClientCompanyName}. The position you applied for has the following minimum requirements:
- B.S. or greater in Computer Science or related field
- ...
And then we'd do some simple parsing to break this down to:
Dear {0},
Thank you for your interest in {1}. The position you applied for has the following minimum requirements:
- B.S. or greater in Computer Science or related field
- ...
... and ...
0 = "RecipientFirstName"
1 = "ClientCompanyName"
...
We store these two components in our database, and whenever we're ready to create a new instance from this template, we evaluate the values of the given property names, and use a standard format string call to generate the actual content.
string.Format(s, macroCodes.Select(c => EvaluateMacroCode(c, obj)).ToArray());
Then I use MarkdownSharp, along with some HTML sanitizing methods, to produce a nicely-formatted HTML email message:
Dear John,
Thank you for your interest in Microsoft. The position you applied for has the following minimum requirements:
B.S. or greater in Computer Science or related field
...
I'd be curious to know if there's something better out there, but I haven't found anything yet.
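The macro step in this answer could be sketched roughly as follows. MacroTemplate, Parse and Render are hypothetical names, and the evaluate callback stands in for the EvaluateMacroCode call shown above; the sketch also assumes the template contains no literal braces other than the macros, since string.Format would reject those.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MacroTemplate
{
    // Rewrites "{Name}" macros as positional "{0}", "{1}", ... and
    // returns the macro names in the order of their assigned indexes.
    public static string Parse(string template, out List<string> macroCodes)
    {
        var codes = new List<string>();
        string format = Regex.Replace(template, @"\{(\w+)\}", match =>
        {
            string code = match.Groups[1].Value;
            int index = codes.IndexOf(code);
            if (index < 0)
            {
                // First time this macro is seen: give it the next index.
                codes.Add(code);
                index = codes.Count - 1;
            }
            return "{" + index + "}";
        });
        macroCodes = codes;
        return format;
    }

    // Evaluates each macro name and substitutes the values into the format string.
    public static string Render(string format, IList<string> macroCodes, Func<string, string> evaluate)
    {
        var values = new object[macroCodes.Count];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = evaluate(macroCodes[i]);
        }
        return string.Format(format, values);
    }
}
```

The format string and the list of macro names are the two pieces you would store; Render is then called once per recipient with a lookup that resolves each name against the current object.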

how to get the first tweet from my twitter and display it on my webpage?

Can somebody please help me get the first tweet from my Twitter account and display it on my website?
I am using ASP.NET 2.0.
Thanks
There are many ways to do this depending on your requirements. Personally, I have accomplished this using the Twitterizer library.
var feed = TwitterTimeline.UserTimeline(new UserTimelineOptions() {
CacheOutput = true,
CacheTimespan = TimeSpan.FromMinutes(1),
ScreenName = "twitter_username",
Count = 1
});
var firstPost = feed.FirstOrDefault();
The advantage of this approach is that you are not required to have a Twitter API key in order to pull the data. It doesn't get much simpler than this!
Update
After you clarified .NET 2.0 I found this .NET 2.0 Twitter API wrapper. Haven't used it myself, but it may be worth a look. Yedda Twitter C# Library
If that library/wrapper won't do, JSON.NET also has a .NET 2.0 compatible binary.
Just a caveat: JSON.NET would be the most involved option, simply because the existing wrappers are specialized for Twitter, whereas JSON.NET is just a general JSON parser (Twitterizer even uses it).
In my opinion, possibly the easiest solution of all is to use jQuery to call the API for you. Obviously, that would happen on the client, so it would not be as "ideal" as the alternatives. Still, it's the most "no-fuss" solution. Here's a blog post on calling the Twitter API with jQuery.
Take a look at TweetSharp (works on .NET 2.0)
This example doesn't show your tweet, but you get the idea of how easy it is.
using TweetSharp;
TwitterService service = new TwitterService();
IEnumerable<TwitterStatus> tweets = service.ListTweetsOnPublicTimeline();
foreach (var tweet in tweets)
{
Console.WriteLine("{0} says '{1}'", tweet.User.ScreenName, tweet.Text);
}
