I'm fairly new to web development and have never done any screen scraping or web crawling, but yesterday a friend asked me whether I could grab some data from this website. The site is neither mine nor his, but the data is publicly available, even for download.
The problem is that the data is only available as one file per date or company, rather than one file covering multiple dates or companies. Getting it involves a lot of tedious clicking through the calendar, so he thought it would be nice if I could create an app that grabs all the data with one click and outputs it in a single file or something similar.
The website uses an ASPX WebForm with __doPostBack to retrieve the data for different dates. Even the links to download the data in XSL aren't the usual "href=…" links; I assume they are references to some ASP script…
To be honest, the only thing I tried was PHP cURL, which didn't work. Since it was my first time using cURL, I don't know whether it failed because this is not possible with cURL or just because I don't know how to work with it.
I am only somewhat proficient in PHP and JavaScript, and not in ASP, though I wouldn't mind learning something new.
So my question is:
Is it at all possible to grab the data from a website like this? And if it is, would you be so kind as to give me some hints on how to approach this kind of problem?
The website, again, is here: http://extranet.net4gas.cz/capacity_ee.aspx
Thanks
C# has a nice WebClient class to do the job:
// Create web client.
WebClient client = new WebClient();
// Download string.
string value = client.DownloadString("http://www.microsoft.com/");
Once you have the page HTML in a string, you can use regular expressions to scrape the content you are looking for.
Here is a very basic regular expression as a hint:
Regex regex = new Regex(@"\d+");
Match match = regex.Match("hello here 10 values");
if (match.Success)
{
Console.WriteLine(match.Value);
}
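A plain download is not enough for a page driven by __doPostBack, because that JavaScript function fills the hidden form fields __EVENTTARGET and __EVENTARGUMENT and then submits the form by POST. Below is a minimal Python sketch of how the same form body could be built by hand. The sample HTML, the control name ctl00$Calendar1 and the argument 3650 are hypothetical placeholders, not values taken from the real site; the real ones have to be read from the page source.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Return the URL-encoded form body that a __doPostBack call would submit."""
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)  # __VIEWSTATE etc. must be echoed back
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Hypothetical page fragment; a real page carries much longer values.
SAMPLE = """
<form method="post" action="capacity_ee.aspx">
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
</form>
"""

body = build_postback(SAMPLE, "ctl00$Calendar1", "3650")
print(body)
```

The resulting string would be sent as the body of a POST request to the same page, with the content type application/x-www-form-urlencoded, and the response parsed again for the next date.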
Marosko, as you said, the data on the website is open to the public, so you can certainly scrape it. The goal now is to cut down the manual clicking through dates. I personally don't know much about how cURL would work here, but I am sure it would involve a lot of coding. I would rather suggest automating the entire process with an automation tool. Try Automation Anywhere; I bought it a few months back for some data extraction and it worked very well. It is automated, and you can check out its screen-scraping capabilities. It's my favorite :)
Charles
I am currently trying to code a simple ASP.NET URL shortener that lets me customise the shortened URL. I am also not allowed to use open source, which means I cannot use any of the URL shortening services. I am required to develop one on my own.
But this is the first time I am doing this, so I have no idea how to start (excluding the UI).
I understand that such questions have already been asked, but I've read through the posts and couldn't understand what they were about. I've also tried to google for a solution, without success.
I would really appreciate any help given to me.
P.S. I am fairly new to programming and not strong in any programming language.
You would need:
A system to store pairs of shortened URLs and their full version.
A page which takes the shortened URL parameter (eg. short.aspx?q=SHORTENED), looks it up in your data store, and redirects to the full URL.
Some interface to edit your data store, add new URLs, etcetera.
That should be it really. If this is too difficult, it might be smarter to start on a basic programming course first.
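The three pieces above can be sketched in a few lines. This is a language-neutral illustration written in Python, not ASP.NET code; the class name UrlStore, the function handle_request and the in-memory dictionary are assumptions made for the example, and a real site would use a database table instead.

```python
class UrlStore:
    """Stores pairs of custom short aliases and their full URLs."""

    def __init__(self):
        self._urls = {}  # stand-in for a database table

    def add(self, alias, full_url):
        # Custom aliases must be unique, otherwise one would overwrite another.
        if alias in self._urls:
            raise ValueError("alias already taken: " + alias)
        self._urls[alias] = full_url

    def resolve(self, alias):
        return self._urls.get(alias)


def handle_request(store, alias):
    """What the redirect page does: look up the alias, then redirect or fail."""
    target = store.resolve(alias)
    if target is None:
        return 404, {}
    return 302, {"Location": target}


store = UrlStore()
store.add("SHORTENED", "http://www.example.com/a/very/long/path")
print(handle_request(store, "SHORTENED"))
print(handle_request(store, "missing"))
```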
I need to make an application that accesses a URL (like http://google.com), returns the time spent loading all elements (images, CSS, JS...) and compares these results with the previous results.
This application needs to be a desktop app. I will save the information in a text or XML file and use this file to compare with previous results.
I have searched for a similar application, but found nothing...
There are some Firefox plugins that list these elements, like YSlow or Firebug, but they are not what I need.
So I'm totally lost and I don't know how to start this work.
Is it possible to build this application? What language is best for this type of application?
Thanks!
This is a very subjective question, so unless you elaborate on your requirements, you may not get any useful answers.
Some things you would need to answer are: how many URLs do you want to check, where do you want to store the results (database, files, etc.), and does it need to run on the desktop or on a server?
Personally, I like the statistics that cURL gives you (DNS time, connect time, receive time, etc.), so you could write something in PHP, but I stress that this is personal preference and may not suit your situation.
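Whatever tool produces the timings, the "compare with previous results" part is simple to sketch. The Python example below is only an illustration under stated assumptions: time_call measures a single callable rather than every image, CSS and JS file on a page, and the JSON file name timings.json and the function names are made up for the example.

```python
import json
import tempfile
import time
from pathlib import Path


def time_call(fetch):
    """Return how many seconds the given callable takes to run."""
    start = time.perf_counter()
    fetch()
    return time.perf_counter() - start


def compare_with_previous(path, url, elapsed):
    """Store the new timing and return the change versus the last run.

    Returns None when there is no earlier result for this URL.
    """
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else {}
    previous = history.get(url)
    history[url] = elapsed
    path.write_text(json.dumps(history, indent=2))
    return None if previous is None else elapsed - previous


# Demo with fixed numbers instead of real downloads.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "timings.json"
    first = compare_with_previous(results, "http://google.com", 1.50)
    second = compare_with_previous(results, "http://google.com", 1.25)

print(first, second)
```

In real use, the value passed as elapsed would come from time_call wrapped around the actual download, or from the statistics reported by a tool such as cURL.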
I'm curious about website scraping (i.e. how it's done etc..), specifically that I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering undergraduate (4th year); however, we don't really cover any web programming, so my understanding of JavaScript, RESTful APIs and all things web is pretty limited, as we mainly focus on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. Here are some docs on how you might download a URL in Python and parse XML in Python.
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
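As a rough illustration of the download-and-parse step, here is a minimal sketch that uses only the Python standard library. The sample feed, its URLs and the function names are invented for the example; the real feed's structure should be checked before relying on the title and link elements.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First song</title><link>http://example.com/1</link></item>
    <item><title>Second song</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def fetch_feed(url):
    """Download a feed; call this sparingly to keep the load on the site low."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def parse_items(feed_xml):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


items = parse_items(SAMPLE_FEED)
print(items)
```

A scheduled job that calls fetch_feed once every so often, rather than a tight loop, keeps the script within the limits described above.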
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe the most important thing to analyze is what kind of information you want to extract. If you want to extract entire websites like Google does, your best option is probably to look at tools like Nutch from Apache.org or the Flaptor solution http://ww.hounder.org. If you need to extract particular areas from unstructured documents (websites, docs, PDFs), you can probably extend Nutch plugins to fit your particular needs. nutch.apache.org
On the other hand, if you need to extract particular text or clipped areas of a website, with rules based on the DOM of the page, you should probably look at tools like mozenda.com. With those tools you can set up extraction rules to scrape particular information from a website. Keep in mind that any change to a webpage can break your robot.
Finally, if you are planning to develop a website using information sources, you could purchase information from companies such as spinn3r.com, where they sell particular niches of information ready to be consumed. You will be able to save lots of money on infrastructure.
Hope it helps!
sebastian.
Python has the feedparser module, located at feedparser.org that actually handles RSS in its various flavours and ATOM in its various flavours. No reason to reinvent the wheel.
A client wants to "Web-enable" a spreadsheet calculation -- the user specifies the values of certain cells and is then shown the resulting values in other cells.
(They do NOT want to show the user a "spreadsheet-like" interface. This is not a UI question.)
They have a huge spreadsheet with lots of calculations over many, many sheets. But, in the end, only two things matter -- (1) you put numbers in a couple cells on one sheet, and (2) you get corresponding numbers off a couple cells in another sheet. The rest of it is a black box.
I want to present a UI to the user to enter the numbers they want, then I'd like to programmatically open the Excel file, set the numbers, tell it to re-calc, and read the result out.
Is this possible/advisable? Is there a commercial component that makes this easier? Are there pitfalls I'm not considering?
(I know I can use Office Automation to do this, but I know it's not recommended to do that server-side, since it tries to run in the context of a user, etc.)
A lot of people are saying I need to recreate the formulas in code. However, this would be staggeringly complex.
It is possible, but not advisable (and officially unsupported).
You can interact with Excel through COM or the .NET Primary Interop Assemblies, but this is meant to be a client-side process.
On the server side, no display or desktop is available, and any unexpected dialog box (for example) will make your web app hang, so your app will be flaky.
Also, attaching an Excel process to each request isn't exactly a low-resource approach.
Working out the black box and re-implementing it in a proper programming language is clearly the better (as in "more reliable and faster") option.
Related reading: KB257757: Considerations for server-side Automation of Office
You definitely don't want to be using interop on the server side, it's bad enough using it as a kludge on the client side.
I can see two options:
Figure out the spreadsheet logic. This may benefit you in the long term by making the business logic a known quantity, and in the short term you may find that there are actually bugs in the spreadsheet (I have encountered tons of monster spreadsheets used for years that turn out to have simple bugs in them - everyone just assumed the answers must be right)
Evaluate SpreadSheetGear.NET, which is basically a replacement for interop that does it all without Excel (it replicates a huge chunk of Excel's non-visual logic and IO in .NET)
Although this is certainly possible using ASP.NET, it's very inadvisable. It's un-scalable and prone to concurrency errors.
Your best bet is to analyze the spreadsheet calculations and duplicate them. Now, granted, your business is not going to like the time it takes to do this, but it will (presumably) give them a more usable system.
Alternatively, you can simply serve up the spreadsheet to users from your website, in which case you do almost nothing.
Edit: If your stakeholders really insist on using Excel server-side, I suggest you take a good hard look at Excel Services as @John Saunders suggests. It may not get you everything you want, but it'll get you quite a bit, and should solve some of the issues you'll end up with trying to do it server-side with ASP.NET.
That's not to say that it's a panacea; your mileage will certainly vary. And SharePoint isn't exactly cheap to buy or maintain. In fact, short-term costs could easily be dwarfed by long-term costs if you go the SharePoint route, but it might be the best option to fit a requirement.
I still suggest you push back in favor of coding all of your logic in a separate .NET module. That way you can use it both server-side and client-side. Excel can easily pass calculations to a COM object, and you can very easily publish your .NET library as COM objects. In the end, you'd have a much more maintainable and usable architecture.
Neglecting the discussion whether it makes sense to manipulate an excel sheet on the server-side, one way to perform this would probably look like adopting the
Microsoft.Office.Interop.Excel.dll
Using this library, you can tell Excel to open a Spreadsheet, change and read the contents from .NET. I have used the library in a WinForm application, and I guess that it can also be used from ASP.NET.
Still, consider the concurrency problems already mentioned... However, if the sheet is accessed infrequently, why not...
The simplest way to do this might be to:
Upload the Excel workbook to Google Docs -- this is very clean, in my experience
Use the Google Spreadsheets Data API to update the data and return the numbers.
Here's a link to get you started on this, if you want to go that direction:
http://code.google.com/apis/spreadsheets/overview.html
Let me be more adamant than others have been: do not use Excel server-side. It is intended to be used as a desktop application, meaning it is not intended to be used from random different threads, possibly multiple threads at a time. You're better off writing your own spreadsheet than trying to use Excel (or any other Office desktop product) from a server.
This is one of the reasons that Excel Services exists. A quick search on MSDN turned up this link: http://blogs.msdn.com/excel/archive/category/11361.aspx. That's a category list, so contains a list of blog posts on the subject. See also Microsoft.Office.Excel.Server.WebServices Namespace.
It sounds like you're saying that the user has the spreadsheet open on their local system, and you want a web site to manipulate that local spreadsheet?
If that's the case, you can't really do that. Even Office automation won't help, unless you want to require them to upload the sheet to the server and download a new altered version.
What you can do is create a web service to do the calculations and add some VBA or VSTO code to the Excel sheet to talk to that service.
How should I store (and present) the text on a website intended for worldwide use, with several languages? The content is mostly in the form of 500+ word articles, although I will need to translate tiny snippets of text on each page too (such as "print this article" or "back to menu").
I know there are several CMS packages that handle multiple languages, but I have to integrate with our existing ASP systems too, so I am ignoring such solutions.
One concern I have is that Google should be able to find the pages, even for foreign users. I am less concerned about issues with processing dates and currencies.
I worry that, left to my own devices, I will invent a way of doing this which works at first but eventually leads to disaster! I want to know what professional solutions you have actually used on real projects, not untried ideas! Thanks very much.
I looked at RESX files, but felt they were unsuitable for all but the most trivial translation solutions (I will elaborate if anyone wants to know).
Google will help me with translating the text, but not storing/presenting it.
Has anyone worked on a multi-language project that relied on their own code for presentation?
Any thoughts on serving up content in the following ways, and which is best?
http://www.website.com/text/view.asp?id=12345&lang=fr
http://www.website.com/text/12345/bonjour_mes_amis.htm
http://fr.website.com/text/12345
(These are not real URLs; I was just showing examples.)
Firstly, put all code for all languages under one domain; it will help your Google rank.
We have a fully multi-lingual system, with localisations stored in a database but cached with the web application.
Wherever we want a localisation to appear we use:
<%$ Resources: LanguageProvider, Path/To/Localisation %>
Then in our web.config:
<globalization resourceProviderFactoryType="FactoryClassName, AssemblyName"/>
FactoryClassName then implements ResourceProviderFactory to provide the actual dynamic functionality. Localisations are stored in the DB with a string key "Path/To/Localisation"
It is important to cache the localised values - you don't want to have lots of DB lookups on each page, and we cache thousands of localised strings with no performance issues.
Use the user's current browser localisation to choose what language to serve up.
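The caching idea is independent of ASP.NET, so here is a small Python sketch of it. Everything in it is an assumption made for illustration: the LocalisationCache class, the load_from_db stand-in for the real database query, the sample keys and the fallback to English when a translation is missing.

```python
class LocalisationCache:
    """Loads every localised string once, then serves lookups from memory."""

    def __init__(self, load_all):
        # load_all is a hypothetical callable standing in for the DB query;
        # it returns a dict mapping (language, key) -> text.
        self._load_all = load_all
        self._strings = None

    def get(self, language, key, fallback_language="en"):
        if self._strings is None:  # first lookup: hit the data store once
            self._strings = self._load_all()
        text = self._strings.get((language, key))
        if text is None:  # missing translation: fall back (an assumption)
            text = self._strings.get((fallback_language, key), key)
        return text


db_hits = 0


def load_from_db():
    """Stand-in for a real database query; counts how often it runs."""
    global db_hits
    db_hits += 1
    return {
        ("en", "Article/Print"): "Print this article",
        ("fr", "Article/Print"): "Imprimer cet article",
        ("en", "Article/Back"): "Back to menu",
    }


cache = LocalisationCache(load_from_db)
print(cache.get("fr", "Article/Print"))
print(cache.get("fr", "Article/Back"))
print(db_hits)
```

However many strings a page asks for, the data store is queried only once; a real implementation would also need a way to refresh the cache when translations are edited.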
You might want to check GNU Gettext project out - at least something to start with.
Edited to add info about projects:
I've worked on several multilingual projects using Gettext on different platforms, including C++/MFC and J2EE/JSP, and it all worked fine. However, you need to write or find your own code to display the localized data, of course.
If you are using .NET, I would recommend going with one or more resource files (.resx). There is plenty of documentation on this on MSDN.
As with most general programming questions, it depends on your needs.
For static text, I would use RESX files. For me, as a .NET programmer, they are easy to use and the .NET Framework has good support for them.
For any dynamic text, I tend to store such information in the database, especially if the site maintainer is going to be a non-developer. In the past I've used two approaches, adding a language column and creating different entries for the different languages or creating a separate table to store the language specific text.
The table for the first approach might look something like this:
Article Id | Language Id | Language Specific Article Text | Created By | Created Date
This works for situations where you can create different entries for a given article and you don't need to keep any data associated with these different entries in sync (such as an Updated timestamp).
The other approach is to have two separate tables, one for non-language specific text (id, created date, created user, updated date, etc) and another table containing the language specific text. So the tables might look something like this:
First Table: Article Id | Created By | Created Date | Updated By | Updated Date
Second Table: Article Id | Language Id | Language Specific Article Text
For me, the question comes down to updating the non-language dependent data. If you are updating that data then I would lean towards the second approach, otherwise I would go with the first approach as I view that as simpler (can't forget the KISS principle).
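For illustration, the second approach could look like this in SQLite, driven from Python. The table and column names, the sample rows and the article_body helper are assumptions chosen for the example, not a schema from a real project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (            -- language-independent data
    article_id   INTEGER PRIMARY KEY,
    created_by   TEXT,
    created_date TEXT,
    updated_by   TEXT,
    updated_date TEXT
);
CREATE TABLE article_text (       -- one row per article and language
    article_id  INTEGER NOT NULL REFERENCES article(article_id),
    language_id TEXT NOT NULL,
    body        TEXT NOT NULL,
    PRIMARY KEY (article_id, language_id)
);
""")
conn.execute("INSERT INTO article VALUES (1, 'alice', '2009-01-01', NULL, NULL)")
conn.executemany(
    "INSERT INTO article_text VALUES (?, ?, ?)",
    [(1, "en", "Hello friends"), (1, "fr", "Bonjour mes amis")],
)


def article_body(conn, article_id, language_id):
    """Return the text of one article in one language, or None if missing."""
    row = conn.execute(
        "SELECT body FROM article_text WHERE article_id = ? AND language_id = ?",
        (article_id, language_id),
    ).fetchone()
    return row[0] if row else None


print(article_body(conn, 1, "fr"))
```

Updating the updated_date column touches a single row in the first table, which is the synchronisation benefit described above.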
If you're just worried about the article content being translated, and do not need a fully integrated option, I have used google translation in the past and it works great on a smaller scale.
Wonderful question.
I solved this problem for the website I made (link in my profile) with a homemade Python 3 script that translates the general template on the fly and inserts a specific content page from a language requested (or guessed by Apache from Accept-Language).
It was fun since I got to learn Python and write my own mini-library for creating content pages. One downside was that our hosting didn't have Python 3, but I made my script generate static HTML (the original one was examining User-agent) and then upload it to server. That works so far and making a new language version of the site is now a breeze :)
The biggest downside of this method is that it is time-consuming to write things from scratch. So if you want, drop me a line and I'll help you use my script :)
As for the URL format, I use site.com/content/example.fr, since this allows Apache to perform language negotiation when somebody asks for /content/example and their browser says it prefers French. When you do this, Apache also adds .html or whatever as a bonus.
So when a request is for example and I have files
example.fr
example.en
example.vi
Apache will automatically serve example.vi to a person with a Vietnamese-configured browser or example.en to a person with a German-configured browser. Pretty useful.
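To make the idea concrete, here is a simplified Python sketch of choosing a variant from the Accept-Language header. It is not Apache's actual negotiation algorithm; the function names are made up, and the fallback to a default language (English here) is an assumption that mirrors the German example above.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        lang, _, params = part.partition(";")
        quality = 1.0  # a tag without q= counts as fully preferred
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs.append((lang.strip().lower(), quality))
    prefs.sort(key=lambda pref: pref[1], reverse=True)  # stable sort
    return [lang for lang, quality in prefs if quality > 0]


def choose_variant(header, available, default):
    """Pick the first available language the browser asks for."""
    for lang in parse_accept_language(header):
        if lang in available:
            return lang
        primary = lang.split("-")[0]  # "fr-CA" also matches "fr"
        if primary in available:
            return primary
    return default  # assumed fallback when nothing matches


available = ["fr", "en", "vi"]
print(choose_variant("vi,en;q=0.5", available, "en"))
print(choose_variant("de-DE,de;q=0.9", available, "en"))
```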