DirectShow advice for range of functionality, or is there a better alternative (.NET)?

I've been working with DirectShow in VB.NET for the past 3-4 weeks. I'm creating an application to keep tags on a video, and eventually I want to extract the tagged parts of the video to a new file. In a video that is 2 hours long I might want to extract, say, 50 10-15 second "clips" up to 15 times (event tagging). This will be for a free application.
I've found it brilliant (and easy) to render, seek, play clips, etc. on XP through Win7 with no issues. I've "discovered" the joys of GraphEdit, creating graphs, the issues with COM in VB.NET, GMFBridge, etc.
Now I need some advice. Am I using the right technology? DirectShow seems to be very resistant to the idea of "open video", "seek to clip", "write clip to file", repeat for all clips, close file. I can sort of do this already if I visibly render the video, but I would need to do it as a background task, faster than real-time render speed.
Things that seem to be missing are:
- an example of anyone doing anything similar (export multiple clips to a single file)
- no easily available 64-bit compressors (lots of 32-bit stuff around)
- all the references and examples I do find are VERY old
- VB.NET is not the first "port of call" for DirectShow developers
So, the question is, should I be using something else?
If not, has anyone done anything similar before? I'm not looking for their code; I just want some guidelines, as it takes ages to figure things out in DirectShow and VB.NET using only trial and error (and Google).
I've looked at AForge (no sound), FFmpeg (command-line toolset), Media Foundation (reluctant to throw away XP), and a variety of commercial helper libraries, but I'm not really getting any further.
Apologies for the length but I wanted readers to understand the background.
All help appreciated.

To output clips to a single file, Microsoft created DirectShow Editing Services. Sometimes it works, sometimes not. We use it in our software to create videos from clips, much as you describe. With a little work you can also add effects to the video.
It is also possible to use AviSynth, a scripting system and frameserver for DirectShow.
As far as I know, you can also create a video from multiple clips with Media Foundation, but I have never tried this.
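For the FFmpeg route mentioned in the question, one possible approach is to drive the command-line tool from a script: cut each tagged range into its own file, then join the pieces with FFmpeg's concat demuxer. The sketch below only builds the argument lists. It is written in Python for illustration; the same arguments could be passed to a process launched from VB.NET. The file names and helper names are hypothetical. Stream copy (`-c copy`) avoids re-encoding and so runs much faster than real time, but it cuts on keyframes, so clip boundaries may be slightly off.

```python
from typing import List, Tuple

Clip = Tuple[float, float]  # (start_seconds, end_seconds) of one tagged range


def cut_command(source: str, clip: Clip, out_path: str) -> List[str]:
    """Arguments to copy one clip out of `source` without re-encoding."""
    start, end = clip
    if end <= start:
        raise ValueError("clip end must be after clip start")
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",        # seek to the clip start (input option)
        "-i", source,
        "-t", f"{end - start:.3f}",   # keep this many seconds
        "-c", "copy",                 # stream copy: fast, but cuts on keyframes
        out_path,
    ]


def concat_list(clip_paths: List[str]) -> str:
    """Contents of the text file read by FFmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in clip_paths)


def concat_command(list_path: str, out_path: str) -> List[str]:
    """Arguments to join the clips named in `list_path` into one file."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-c", "copy",
        out_path,
    ]


def plan(source: str, clips: List[Clip], final_path: str) -> List[List[str]]:
    """All commands needed: one cut per clip, then one concat."""
    parts = [f"clip_{i:03d}.mp4" for i in range(len(clips))]
    commands = [cut_command(source, c, p) for c, p in zip(clips, parts)]
    commands.append(concat_command("clips.txt", final_path))
    return commands
```

To use it, write the output of `concat_list` for the same part names to `clips.txt`, then run each command in order (for example with `subprocess.run`). Paths containing single quotes would need escaping in the list file.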

Related

formatting html for printing with page numbers

Here’s the basic question…
I have a long HTML document (a contract with 100+ pages) that ultimately needs to be a PDF document with headers and footers (page numbers). What is the best tool/language for making this happen?
Here’s the back story…
I work at a satellite office for a low-tech construction company that issues contracts to subcontractors, and because I am the only one who is able to unjam the printer, I have become the de facto IT person in the company. In the past, to make a contract, someone has had to go through an MS Word document (the boilerplate contract) and type in the necessary information to produce a contract.
About a year ago, I got so frustrated with that methodology that I created a MS Access Database where a user could add information using Access forms and then a mail merge with MS Word to populate a contract. This has been a HUGE improvement plus we have been able to start tracking money a lot more easily using the other database features. The database is stored on a shared computer in the satellite office. However, this system only works IF the individual users have MS Access and MS Word installed on their individual machines and only if they are physically connected to our local network.
With the success of this system at the satellite office, I am now attempting to create a web-based version of this tool that everyone in the company can use that only relies on standard software on individual machines and can be accessible anywhere.
I have converted a computer into a server for development purposes using XAMPP, created a SQL database, created HTML forms, and am using PHP to run queries. Over the past few months, I have crash coursed my way through myriad languages including CSS, and have finally gotten everything to the point that the system will create an HTML version of the contract with everything populated. Now I just need to format it for printing (ideally to a virtual PDF printer) with headers and footers (page numbers). This should be the easiest part, right?
CSS with @media print rules would, on the surface, appear to be the best way to make this happen, because CSS3 uses constructs like “@top-left” and “content: counter(page)” to do everything that I want; however, after investing a lot of time setting everything up, it appears that only Firefox kind of supports this, and IE and Chrome absolutely do not.
Headers and footers overlap body content, and I can’t get the pagination to work at all. Apparently these are common frustrations.
In my hunting, I ran across a program called Prince that would seem to do what I want (and quite a bit more), but the price tag on that is way more than I am willing to pay.
I can’t believe that what I want to do is a new or unique thing. I suspect I am just not searching for the right keywords. Is there a better tool/technique out there for converting HTML to a printer-friendly format without spending a ton of money?
I feel your pain. But the only solution I've found that really works is to use a PDF library to write the formatted text to a PDF directly from PHP (or Python or another language, but you mentioned PHP and I've done that). I've used R&OS quite a bit:
http://pdf-php.sourceforge.net/
It may take a little while to get up to speed, but you can do pretty much anything with it, including easily create nicely formatted tables, flowing text and embedded images. The catch is that, with the exception of a few tags like <b></b> and <i></i> you don't get to use any HTML or CSS - essentially you write two output routines, one for HTML and one for PDF.

Lotus Smartsuite to "something newer"

I shall try and keep my scenario as brief as possible and to the point.
The office I’m currently working for uses Lotus Smartsuite on Windows 98 / XP, using lots of Lotus Script to tie together Lotus 123 and Lotus Word Pro documents. They also make heavy use of the Lotus Object Linking functions. I shall describe its behaviour below:
You can fill rows and columns in a 123 Spreadsheet with data galore, style it and format it any way you like and define it as a range (nothing unique here). However, you can then copy that range and paste it as a link in a Lotus Word Pro document. This link is then categorised by its range name, so expanding the range back in the 123 file causes the table in the Word Pro Document to expand. This link also carries with it all the formatting and styling of the cells in the 123 Spreadsheet. As I imagine you are now aware, this link is completely live, you can double click anywhere in the object and it opens up the 123 file for editing, and all changes go backward and forward between the two documents. Most of the data retrieved from testing equipment is stored in these 123 spreadsheets and then parts of that are linked into a final Lotus Word Pro report document sent to the customer.
Note: Just to be clear, this is NOT the same as a DDE link in Open Office, which seems to allow for copying of a non-defined range of cells to be imported into a document where all formatting is lost and editing back and forth is not straight forward. It also behaves differently to an OLE object, which seems to only import the entire Spreadsheet rather than a small subsection of it.
However, in recent years, supporting this older software (Lotus) has become more difficult, especially when sending documents to customers (Lotus Word Pro files are generally unsupported by more modern office tools), and technical support for Lotus Smartsuite seems to be practically non-existent these days. Also, with development being done in a scripting language that mainstream IT technicians no longer practise, ongoing development and support seems futile. Once the guys who wrote it move on to other things, we will be left with spaghetti script in a language nobody can help us with.
So, we have this goal of "modernising" our IT system by the end of the year. Linux is becoming a very viable option too (No doubt Debian or a derivative), but Open Office doesn't seem to have the linking capability mentioned above. The reason this linking is so important is because the veterans of the office are so used to working this way - storing data in the spreadsheet, linking back to it later in their Word Pro documents, etc. I think they are more than keen to keep this practice going and we have found no equivalent of it in modern office tools (as was requested of me). I can see, as a software engineer (fluent in many languages), how this practice is not the safest or best way of using and storing data (databases spring to mind), but I was wondering if someone could give me a few other good reasons as to why this is bad practice in the work place (I was always in the belief that you should keep your data away from your reporting and formatting, the two should never be entwined - this looks like spreadsheet hell to me) ... or why this is a good thing to keep doing!?
So, for those of you still with me, I guess what I am asking is:
1. Is this practice of storing data, formatting it in spreadsheets, and importing it directly back and forth between word documents good or bad, and what can be done about it? I'll need to make a case either way.
2. Are there ANY modern alternatives to this linking method (regardless of whether it is good or bad practice) out there for Linux or Windows? This link MUST carry formatting as well as dynamic range sizes (DDE links don't seem to be the answer).
3. What would your solution be if you had to start from scratch? Store everything in databases and use SQL to simply ask for the data you need in your word documents? How would you do this? What software would you use?
Any help with this scenario would be more than helpful, or if you know anywhere I should go to ask for advice, that would be appreciated too.
Thank-you for reading!
My suggestion is to first take a step back. What is the benefit to the way things are done now? Is it just a habit that is tough to break? Is there any reason the documents and spreadsheets need to be maintained and linked the way they are, or is it just a requirement because 'that's how it was done before'?
If you can remove that requirement, you have a lot more options and you're building a system that's easier to understand and maintain.
Regarding question 1, I believe there's nothing wrong with storing data in spreadsheets, especially if the end-users need to create and maintain them and development staff is limited. Some questions are whether that data needs to be secured, is related between spreadsheets, is duplicated across the company, or should be shared in a better way across the company. If any of those are true then a centralized database would make more sense. Personally I'd want any valuable data safely stored in a database where it can be managed, access to it can be controlled, it can be easily backed-up, etc.
Regarding question 2, you can do the same thing in Microsoft Office. You can either link the documents, so that the data stays in the source Excel document but appears in the Word document, or you can embed the Excel spreadsheet within the Word document.
You might want to look at Microsoft Access for storing the data and generating reports. Or you could build an application using a relational database back-end and reporting front-end. The possibilities are wide-open. It really depends on where the expertise lies within the company.
If it were me I'd probably use a SQL Express back-end (it's free) and a custom ASP.NET MVC application for generating the reports, but that's just where my expertise lies.
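As a rough illustration of the "data in a database, report generated from a query" idea, here is a minimal sketch. It uses Python and the standard-library `sqlite3` module rather than SQL Express and ASP.NET, purely to keep the example self-contained. The table name, column names, and sample readings are all hypothetical.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding hypothetical test readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (job TEXT, sample TEXT, value REAL, unit TEXT)"
    )
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        [
            ("J-100", "A", 12.5, "mm"),
            ("J-100", "B", 13.1, "mm"),
            ("J-200", "A", 9.8, "mm"),
        ],
    )
    return conn


def report_rows(conn: sqlite3.Connection, job: str) -> list:
    """Fetch only the rows one report needs; the data stays in one place."""
    cur = conn.execute(
        "SELECT sample, value, unit FROM readings WHERE job = ? ORDER BY sample",
        (job,),
    )
    return cur.fetchall()


def render_table(rows: list) -> str:
    """Format rows as plain text; formatting lives here, not with the data."""
    lines = [f"{'Sample':<8}{'Value':>8}  Unit"]
    for sample, value, unit in rows:
        lines.append(f"{sample:<8}{value:>8.1f}  {unit}")
    return "\n".join(lines)
```

The point of the split is that `report_rows` decides what data a report contains and `render_table` decides how it looks, so either can change without touching the stored readings.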

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done, etc.); specifically, I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering undergraduate (4th year); however, we don't really cover any web programming, so my understanding of JavaScript, RESTful APIs, and all things web is pretty limited, as we're mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. The Python documentation covers both how to download a URL and how to parse XML.
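A minimal sketch of that download-and-parse step, using only the Python standard library. The feed address is not given here, so `fetch` takes whatever URL you pass in; the function names are illustrative, and the parser assumes a plain RSS 2.0 layout with `<item>` elements containing `<title>` and `<link>`.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw bytes at `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_rss_items(xml_bytes: bytes) -> list:
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items
```

In use, `parse_rss_items(fetch(feed_url))` gives a list of pairs you can loop over. Keeping the download and the parsing separate also lets you test the parser on a saved copy of the feed without hitting the site.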
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
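One simple way to enforce a minimum gap between requests is a small throttle object that sleeps when it is called too soon. This is a sketch, not a complete politeness policy; the class name and the interval are assumptions, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Allow an action at most once every `min_interval` seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep if needed so calls are spaced out; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Call `wait()` immediately before each download; with `Throttle(600)` the script would fetch at most once every ten minutes however the surrounding loop is written.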
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe the most important thing to analyze is what kind of information you want to extract. If you want to extract entire websites like Google does, your best option is probably to look at tools like Nutch from Apache.org or Flaptor's solution, http://ww.hounder.org. If you need to extract particular areas of unstructured documents (websites, docs, PDFs), you can probably extend Nutch plugins to fit your particular needs: nutch.apache.org
On the other hand, if you need to extract particular text or clipped areas of a website, with rules set using the DOM of the page, you probably want to look at tools like mozenda.com. With those tools you can set up extraction rules to scrape particular information from a website. Keep in mind that any change to a webpage can break your robot.
Finally, if you are planning to develop a website using information sources, you could purchase information from companies such as spinn3r.com, where they sell particular niches of information ready to be consumed. You could save a lot of money on infrastructure.
Hope it helps!
sebastian.
Python has the feedparser module, located at feedparser.org that actually handles RSS in its various flavours and ATOM in its various flavours. No reason to reinvent the wheel.

How do you use Excel server-side?

A client wants to "Web-enable" a spreadsheet calculation: the user specifies the values of certain cells, and is then shown the resulting values in other cells.
(They do NOT want to show the user a "spreadsheet-like" interface. This is not a UI question.)
They have a huge spreadsheet with lots of calculations over many, many sheets. But, in the end, only two things matter -- (1) you put numbers in a couple cells on one sheet, and (2) you get corresponding numbers off a couple cells in another sheet. The rest of it is a black box.
I want to present a UI to the user to enter the numbers they want, then I'd like to programmatically open the Excel file, set the numbers, tell it to recalculate, and read the result out.
Is this possible/advisable? Is there a commercial component that makes this easier? Are there pitfalls I'm not considering?
(I know I can use Office Automation to do this, but I know it's not recommended to do that server-side, since it tries to run in the context of a user, etc.)
A lot of people are saying I need to recreate the formulas in code. However, this would be staggeringly complex.
It is possible, but not advisable (and officially unsupported).
You can interact with Excel through COM or the .NET Primary Interop Assemblies, but this is meant to be a client-side process.
On the server side, no display or desktop is available, and any unexpected dialog box (for example) will make your web app hang, so your app will be flaky.
Also, attaching an Excel process to each request isn't exactly a low-resource approach.
Working out the black box and re-implementing it in a proper programming language is clearly the better (as in "more reliable and faster") option.
Related reading: KB257757: Considerations for server-side Automation of Office
You definitely don't want to be using interop on the server side, it's bad enough using it as a kludge on the client side.
I can see two options:
- Figure out the spreadsheet logic. This may benefit you in the long term by making the business logic a known quantity, and in the short term you may find that there are actually bugs in the spreadsheet (I have encountered tons of monster spreadsheets used for years that turn out to have simple bugs in them; everyone just assumed the answers must be right).
- Evaluate SpreadSheetGear.NET, which is basically a replacement for interop that does it all without Excel (it replicates a huge chunk of Excel's non-visual logic and IO in .NET).
Although this is certainly possible using ASP.NET, it's very inadvisable. It's un-scalable and prone to concurrency errors.
Your best bet is to analyze the spreadsheet calculations and duplicate them. Now, granted, your business is not going to like the time it takes to do this, but it will (presumably) give them a more usable system.
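One way to make the "duplicate the calculations" route less risky is to treat the workbook as a function from a few input cells to a few output cells, reimplement that function in code, and compare it against input/output pairs recorded from the real spreadsheet. The sketch below is in Python for illustration only. The formula, the names, and the recorded cases are entirely hypothetical stand-ins for whatever the actual workbook does.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Inputs:
    quantity: float    # hypothetical input cell
    unit_price: float  # hypothetical input cell


@dataclass(frozen=True)
class Outputs:
    subtotal: float  # hypothetical output cell
    total: float     # hypothetical output cell


TAX_RATE = 0.2  # hypothetical constant buried somewhere in the workbook


def calculate(inputs: Inputs) -> Outputs:
    """Reimplementation of the (hypothetical) black-box calculation."""
    subtotal = inputs.quantity * inputs.unit_price
    return Outputs(subtotal=subtotal, total=subtotal * (1 + TAX_RATE))


def mismatches(cases, tolerance=1e-9):
    """Return the recorded cases where the code disagrees with the spreadsheet."""
    failed = []
    for inputs, expected in cases:
        actual = calculate(inputs)
        same = (
            math.isclose(actual.subtotal, expected.subtotal, abs_tol=tolerance)
            and math.isclose(actual.total, expected.total, abs_tol=tolerance)
        )
        if not same:
            failed.append((inputs, expected, actual))
    return failed
```

Each recorded case is a pair of values typed into the spreadsheet and the values read back out. An empty result from `mismatches` is evidence, not proof, that the reimplementation matches; a non-empty one points at either a coding error or a bug in the spreadsheet itself.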
Alternatively, you can simply serve up the spreadsheet to users from your website, in which case you do almost nothing.
Edit: If your stakeholders really insist on using Excel server-side, I suggest you take a good hard look at Excel Services as @John Saunders suggests. It may not get you everything you want, but it'll get you quite a bit, and should solve some of the issues you'll end up with trying to do it server-side with ASP.NET.
That's not to say that it's a panacea; your mileage will certainly vary. And SharePoint isn't exactly cheap to buy or maintain. In fact, short-term costs could easily be dwarfed by long-term costs if you go the SharePoint route--but it might be the best option to fit a requirement.
I still suggest you push back in favor of coding all of your logic in a separate .NET module. That way you can use it both server-side and client-side. Excel can easily pass calculations to a COM object, and you can very easily publish your .NET library as COM objects. In the end, you'd have a much more maintainable and usable architecture.
Setting aside the discussion of whether it makes sense to manipulate an Excel sheet on the server side, one way to do this would be to use
Microsoft.Office.Interop.Excel.dll
Using this library, you can tell Excel to open a Spreadsheet, change and read the contents from .NET. I have used the library in a WinForm application, and I guess that it can also be used from ASP.NET.
Still, consider the concurrency problems already mentioned. However, if the sheet is accessed infrequently, why not?
The simplest way to do this might be to:
- Upload the Excel workbook to Google Docs -- this is very clean, in my experience.
- Use the Google Spreadsheets Data API to update the data and return the numbers.
Here's a link to get you started on this, if you want to go that direction:
http://code.google.com/apis/spreadsheets/overview.html
Let me be more adamant than others have been: do not use Excel server-side. It is intended to be used as a desktop application, meaning it is not intended to be used from random different threads, possibly multiple threads at a time. You're better off writing your own spreadsheet than trying to use Excel (or any other Office desktop product) from a server.
This is one of the reasons that Excel Services exists. A quick search on MSDN turned up this link: http://blogs.msdn.com/excel/archive/category/11361.aspx. That's a category list, so contains a list of blog posts on the subject. See also Microsoft.Office.Excel.Server.WebServices Namespace.
It sounds like you're saying that the user has the spreadsheet open on their local system, and you want a web site to manipulate that local spreadsheet?
If that's the case, you can't really do that. Even Office automation won't help, unless you want to require them to upload the sheet to the server and download a new altered version.
What you can do is create a web service to do the calculations and add some VBA or VSTO code to the Excel sheet to talk to that service.

Is Wiki Content Portable?

I'm thinking of starting a wiki, probably on a low cost LAMP hosting account. I'd like the option of exporting my content later in case I want to run it on IIS/ASP.NET down the line. I know in the weblog world, there's an open standard called BlogML which will let you export your blog content to an XML based format on one site and import it into another. Is there something similar with wikis?
The correct answer is ... "it depends".
It depends on which wiki you're using or planning to use. I've used various wikis over the years. MoinMoin was OK; it uses files rather than a database, and Ubuntu seems to like it. MediaWiki everyone knows about, and JAMWiki is a Java clone(ish) of MediaWiki that aims to be markup-compatible with it. Both use databases, and you can generally connect whichever database you want; JAMWiki is pre-configured to use an internal HSQLDB instance.
I recently converted about 80 pages from a MoinMoin wiki into JAMWiki pages, and this was probably 90% handled by a tiny Perl script I found somewhere (I'll provide a link if I can find it again). The other 10% was unfortunately a by-hand experience (the pages were of the utmost importance, being recipes for the missus) ;-)
I also recently set up a MediaWiki instance for work, and that took all of about 8 minutes. So that'd be my choice.
To answer your question I don't believe that there's such a standard as WikiML as Till called it.
As strange as it sounds, I've investigated screen scraping a wiki for a co-worker to help him port it to another wiki engine. It turned out that screen scraping would have been easier, quicker and more efficient to write to move this particular file based wiki to another one or a CMS.
Given the context in which you wrote the question, I would bite the bullet now and pay the little extra for a Windows hosted account and put ScrewTurn Wiki on it. You've got the option of a file-based or SQL Server-based back end for it, but because one of your requirements is low cost, I'm guessing that you would use file-based storage now for a cheaper hosted account; you can always upscale the back end to SQL Server later.
I haven't heard of WikiML.
I think your biggest obstacle is gonna be converting one wiki markup to another. For example, some wikis use markdown (which is what Stack Overflow uses), others use another markup syntax (e.g. BBCode, ...), etc.. The bottom line is - assuming the contents are databased it's not impossible to export and parse it to make it "fit" in another system. It might just be a pain in the ass.
And if the contents are not databased, it's gonna be a royal pain in the ass. :D
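To give a feel for what converting one wiki markup to another involves, here is a minimal sketch that rewrites a small subset of MediaWiki markup (headings, bold, italic, external links) as Markdown using regular expressions. It is illustrative only: the function name is made up, and real wiki content (tables, templates, nested lists, internal links) needs far more than this, which is where the pain comes from.

```python
import re


def mediawiki_to_markdown(text: str) -> str:
    """Convert a small subset of MediaWiki markup to Markdown."""

    def heading(match):
        level = len(match.group(1))
        return "#" * level + " " + match.group(2).strip()

    # == Heading == -> ## Heading (same number of markers on both sides)
    text = re.sub(
        r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", heading, text, flags=re.MULTILINE
    )
    # '''bold''' must be handled before ''italic''
    text = re.sub(r"'''(.+?)'''", r"**\1**", text)
    text = re.sub(r"''(.+?)''", r"*\1*", text)
    # [http://example.com label] -> [label](http://example.com)
    text = re.sub(r"\[(https?://\S+) ([^\]]+)\]", r"[\2](\1)", text)
    return text
```

A converter like this is run over each exported page in turn; whatever it cannot handle is the part left to fix by hand.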
Another solution would be to stay with the same system. I am not sure what the reason is for changing the technology later on. It's not like a growing project requires IIS/ASP.NET all of a sudden. (It might just be the other way around.) But for example, if you could stick with PHP for a while, you could also run that on IIS.
