Application for graphing lots of web-related data

I know this isn't strictly programming related, but I hope for some feedback that helps me out of this misery.
We actually have lots of different data from our web applications, dating back years.
For example, we have:
Apache logfiles
Daily statistics files from our tracking software (CSV)
Another set of daily statistics from nation-wide advertising rankings (CSV)
... and I can probably produce new data from other sources, too.
Some of the data records start in 2005, some in 2006, etc. However, at some point in time we have data from all of them.
What I'm drea^H^H^H^Hsearching for is an application that understands all the data, lets me load it, compare individual data sets and timelines (graphically), compare different data sets within the same time span, and lets me filter (especially the Apache logfiles); and of course all of this should be interactive.
Just the BZ2 compressed Apache logfiles are already 21GB in total, growing weekly.
I've had no real success with things like AWStats, Nihu Web Log Analyzer or similar tools. They can only produce static reports, but I need to interactively query the information, apply filters, overlay other data sets, etc.
I've also tried data mining tools such as RapidMiner in the hope that they could help, but I didn't really succeed in using them (i.e. they're over my head).
Just to be clear: it can be a commercial application, but I have yet to find something that is really useful.
Somehow I get the impression that I'm searching for something which does not exist, or that I have the wrong approach. Any hints are very welcome.
Update:
In the end it was a mixture of the following things:
I wrote bash and PHP scripts to parse and manage parsing of the log files, including lots of filtering capabilities.
I generated plain old CSV files to read into Excel. I'm lucky to be using Excel 2007, and its graphing capabilities, albeit still working on a fixed set of data, helped a lot.
I used Amazon EC2 to run the scripts and send me the CSV via email. I had to crawl through around 200GB of data, and thus used one of the large instances to parallelize the parsing. I had to run numerous parsing attempts to get the data right; the overall processing took about 45 minutes. I don't know what I could have done without Amazon EC2. It was worth every buck I paid for it.
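The core of the PHP parsing looked roughly like the sketch below (heavily simplified; the regex, filter, and file names are made-up placeholders rather than the actual production script, and the bzip2 stream wrapper requires PHP's bz2 extension):

```php
<?php
// Stream a bzip2-compressed Apache access log line by line and
// aggregate filtered hits per day into a CSV that Excel can read directly.
$in  = fopen('compress.bzip2://access_log-2008-06.bz2', 'r');
$out = fopen('hits_per_day.csv', 'w');

// Combined log format: host ident user [date] "request" status bytes "referer" "agent"
$pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)/';

$hitsPerDay = array();
while (($line = fgets($in)) !== false) {
    if (!preg_match($pattern, $line, $m)) {
        continue;                            // skip malformed lines
    }
    list(, $host, $date, $request, $status) = $m;

    // Example filter: only successful GETs below a particular URL prefix.
    if ($status !== '200' || strpos($request, 'GET /shop/') !== 0) {
        continue;
    }

    $day = substr($date, 0, 11);             // e.g. "12/Jun/2008"
    $hitsPerDay[$day] = isset($hitsPerDay[$day]) ? $hitsPerDay[$day] + 1 : 1;
}

fputcsv($out, array('day', 'hits'));
foreach ($hitsPerDay as $day => $hits) {
    fputcsv($out, array($day, $hits));
}
fclose($in);
fclose($out);
```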

Splunk is a product for this sort of thing.
I have not used it myself, though.
http://www.splunk.com/

The open source data mining and web mining software RapidMiner can import both Apache web server log files and CSV files, and it can also import and export Excel sheets. Rapid-I offers a lot of training courses for RapidMiner, some also on web mining and web usage mining.

In the interest of full disclosure, I've not used any commercial tools for what you're describing.
Have you looked at LogParser? It might be more manual than what you're looking for, but it will allow you to query many different structured formats.
As for the graphical aspect of it, there are some basic charting capabilities built in, but you're likely to get much more mileage by piping the LogParser output into a tabular/delimited format and loading it into Excel. From there you can chart/graph just about anything.
As for cross-joining different data sources, you can always pump all the data into a database, where you'll have a richer language for querying it.
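As a rough illustration of that last point, here is a minimal sketch (file, table and column names are made up; it assumes each daily export shares a date column, and uses SQLite via PDO just to keep the example self-contained):

```php
<?php
// Load two daily CSV exports into SQLite so they can be joined by date.
$db = new PDO('sqlite:webstats.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS apache_hits (day TEXT PRIMARY KEY, hits INTEGER)');
$db->exec('CREATE TABLE IF NOT EXISTS ad_ranking  (day TEXT PRIMARY KEY, rank INTEGER)');

// Generic loader for a two-column (day, value) CSV file with a header row.
function loadCsv(PDO $db, $file, $table) {
    $fh = fopen($file, 'r');
    fgetcsv($fh);                                           // skip the header row
    $stmt = $db->prepare("INSERT OR REPLACE INTO $table VALUES (?, ?)");
    while (($row = fgetcsv($fh)) !== false) {
        $stmt->execute($row);
    }
    fclose($fh);
}

loadCsv($db, 'hits_per_day.csv', 'apache_hits');
loadCsv($db, 'ad_ranking.csv',   'ad_ranking');

// Both sources can now be compared over the same time span.
$rows = $db->query('SELECT a.day, a.hits, r.rank
                      FROM apache_hits a
                      JOIN ad_ranking  r ON r.day = a.day
                     ORDER BY a.day')->fetchAll(PDO::FETCH_ASSOC);
```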

What you're looking for is a "data mining framework", i.e. something which will happily eat gigabytes of somewhat random data and then let you slice and dice it in as-yet-unknown ways to find the gold nuggets buried deep inside the static.
Some links:
CloudBase: "CloudBase is a high-performance data warehouse system built on top of Map-Reduce architecture. It enables business analysts using ANSI SQL to directly query large-scale log files arising in web site, telecommunications or IT operations."
RapidMiner: "RapidMiner aleady is a full data mining and business intelligence engine which also covers many related aspects ranging from ETL (Extract, Transform & Load) over Analysis to Reporting."

Related

Searching for text mining/extraction software with Intuitive, Modern UI

I'm researching different products for my organization. We are looking for a solution that will replace our current text mining software - DataWatch Monarch. We need some type of software that will be able to extract only relevant data from PDF reports and prepare it to be stored in a database.
DataWatch is causing a bottleneck for our organization due to its learning curve and limitations. I started trying to do this just by programming in R; however, we need a more streamlined approach.
If you know of any easy-to-use, highly effective text miners or report-text-extraction software, please share. I will be looking into Scribe Software, SiMX, RapidMiner, and some others.
RapidMiner can extract info from PDFs no problem using the Text Processing extension. Start with the Read Document operator and go from there.
Storing in a database is also straightforward - set up your database connection in the "Manage Database Connections" menu and then use the "Write Database" operator.

Converting Excel math to SQL in VBNet, ASP.NET web application

I am trying to automate a process that is currently done mainly with Excel files. These files have been used for a while and are customized just how the users like them. I am turning this into a data-driven VB.NET application and am now at the task of configuring all the computed columns to do the equations the users' Excel spreadsheets currently do.
The main ones I need but can't find information on are STANDARDIZE, PERCENTRANK and STDEVA (at least for computed columns; I have seen STDEVA used in SELECT queries).
Excuse me if there is documentation on this I can refer to; I searched Google and Stack Overflow and wasn't able to find anything. If you could point me to any documentation like this that might exist, that would be a huge help!
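For reference, the math behind those three functions is simple enough to port by hand: STANDARDIZE(x, mean, sd) is just (x - mean) / sd, PERCENTRANK is essentially the count of values strictly below x divided by (n - 1), and STDEVA is an ordinary sample standard deviation except that text cells count as 0 and TRUE/FALSE as 1/0. A minimal sketch of those definitions (written in plain PHP purely to pin down the formulas; the SQL translation would follow the same arithmetic):

```php
<?php
// STANDARDIZE(x, mean, sd): the z-score of x.
function standardize($x, $mean, $sd) {
    return ($x - $mean) / $sd;
}

// STDEVA(range): sample standard deviation where text counts as 0
// and TRUE/FALSE as 1/0 (plain STDEV would ignore such cells instead).
function stdeva(array $values) {
    $nums = array();
    foreach ($values as $v) {
        if (is_bool($v))        { $nums[] = $v ? 1.0 : 0.0; }
        elseif (is_numeric($v)) { $nums[] = (float) $v; }
        else                    { $nums[] = 0.0; }   // text counts as 0
    }
    $n    = count($nums);
    $mean = array_sum($nums) / $n;
    $ss   = 0.0;
    foreach ($nums as $v) {
        $ss += ($v - $mean) * ($v - $mean);
    }
    return sqrt($ss / ($n - 1));                     // sample (n - 1) denominator
}

// PERCENTRANK(range, x): values strictly below x over (n - 1).
// Excel additionally interpolates when x falls between data points (omitted here).
function percentrank(array $values, $x) {
    $below = 0;
    foreach ($values as $v) {
        if ($v < $x) { $below++; }
    }
    return $below / (count($values) - 1);
}
```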

data collection for statistics: from web to a database

I'm a statistician by trade and I'd like some recommendations on how to set up a website that can collect data into a database. For personal use, I use Google Forms to collect data, and everything gets populated into a spreadsheet. However, this may not be appropriate in a more professional setting, especially when we have multiple pages/forms. I imagine two uses:
A website where I can send a link to others so they can fill it out, similar to Google Forms.
A website where only authorized users can log in to fill out data. Think of a setting where patients are followed periodically in a research study. It'd be cool to have the clinician enter the data directly into the database as he/she fills out the forms, as opposed to having another data analyst transcribe the written forms into the database.
The obvious solution would be to hire a web developer. However, I like doing things myself when they are manageable. I imagine a web developer would have to know HTML, PHP, and databases (e.g., MySQL or PostgreSQL). My experience with these is limited to setting up a WordPress blog on my Linux server. My experience with HTML is also limited, as I use Emacs org-mode to generate it from plain text. I hope to hear about solutions with a minimal learning curve. My preference of course would be free open source software and Linux-based, but I'd like to hear all available solutions (our data manager is a Windows user).
I recently read a post on Linux Journal that mentions REDCap, but it seems you have to get institutional permission to use it.
I also tagged "R" on this post as I'd like to hear what R users are doing about data collection. I'll ultimately analyze the data with R, but all data analysis begins with the scientific question and data collection.
Thanks!
UPDATE 10/4/2010: Thanks everyone for the responses so far. It appears most of the third-party solutions proposed so far have data housed in a database hosted by the vendor. I'd like to house all data in our SQL Server. That is, data entry from the web enters the database in real time, ready for data analysis.
Maybe the limesurvey.org project is of interest ...
It sounds to me like you've got yourself a med study. There is a plethora of concerns that come to mind just from what you've described you want to do, not the least of which is privacy. Where is it going to be hosted? Have you received consent from the patients to collect and transmit their information electronically? What data are you storing, if any, that could be combined to reveal their identity?
Personally, I steer clear of DIY online data collection tools. I pay a firm, like Ipsos or Research Now/E-Rewards, to program and manage data collection using questionnaires that I have designed. The reason is, knowing how to design research and analyze data is one thing, but if you've been trained in statistics, I can safely argue that you "don't know shit" about data collection. Sure, you may know a bunch about sampling theory, but when it comes to getting data in, it's best to leave it to the pros.
There are a number of "industrial quality" online data collection tools available.
Confirmit (Pretty much the gold standard for online data collection)
DASH (Smaller following, but incredibly flexible)
There are also purely web based solutions, some of which are free (not that I would recommend using them)
QuestionPro
SurveyMonkey
Zoomerang
Although, unless you're doing a study with over 50 patients, I would just recommend getting the Physicians or their assistants to fill out Excel sheets and send them to your co.
Also, it's unlikely that you'll need to set up a username/password system. What you want is referred to as an "open link", where respondents click a link and enter information; identifier info can be added by the respondent. You don't need a password because people can only INPUT information, not read it.
Most of the systems I mentioned above work on the idea of emailing a respondent (a clinician) a link to a web-based survey, which could easily be adapted to your specific needs and act as a reminder to the clinician to fill out the form.
If your question types are simple, I'm sure you could hire a programmer to put together a website that has the forms you need behind an authorized front end. PHP/MySQL would likely do the trick. But I would double-check the privacy laws in your jurisdiction surrounding medical research before going ahead.
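A minimal sketch of what that could look like (the DSN, credentials, and column names below are placeholders; a real version would sit behind authentication, input validation, and HTTPS before touching any patient data):

```php
<?php
// form_submit.php: receive one follow-up form and store it in MySQL.
// Placeholder credentials and schema; add authentication and validation
// before using anything like this for real patient data.
$db = new PDO('mysql:host=localhost;dbname=study', 'study_user', 'secret');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $stmt = $db->prepare(
        'INSERT INTO followup (subject_id, visit_date, systolic_bp, notes)
         VALUES (?, ?, ?, ?)'
    );
    $stmt->execute(array(
        $_POST['subject_id'],
        $_POST['visit_date'],
        $_POST['systolic_bp'],
        $_POST['notes'],
    ));
    echo 'Saved.';
    exit;
}
?>
<form method="post">
  Subject ID:  <input name="subject_id"><br>
  Visit date:  <input name="visit_date" placeholder="YYYY-MM-DD"><br>
  Systolic BP: <input name="systolic_bp"><br>
  Notes:       <textarea name="notes"></textarea><br>
  <button type="submit">Save</button>
</form>
```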
I have conducted medical research using an online form (actually two of them). My questions were quite discrete and particular to the disease I was researching.
Previously in a related project, I had created two or three page questionnaires which were printed and then subjects and surgeons filled out the forms and our research coordinator would enter them into our database. It was a lot of work with lots of room for error. I did not like it. Online forms were much better.
I used SurveyGizmo and was happy with it. I looked at lots of options about two years ago; Google Forms did not exist at that time. I went with SurveyGizmo primarily because they had a statement (attestation) that they were compliant with HIPAA. I could not ensure security such as SSL connections with the other websites. However, in order to get that capability (HTTPS connections) I had to buy the enterprise level, even though for every other capability I could have used the free service. Also, SurveyGizmo offered a 50% reduction for non-profits, which our research institute qualified for.
SurveyGizmo was easy to design in and put into production without having to program myself. It was easy to download the data in CSV format and read it straight into R, although I had some weird issues that I needed help with. I had to use the "old" format for export so that it came as a straightforward CSV. Furthermore, the CSV file had the odd feature of the first TWO rows being header rows, but I solved that problem with the help of Stack Overflow.
SurveyGizmo has fantastic logic and piping that enabled me to ask only relevant questions and thereby not waste the time of my respondents; even more importantly, there were no irrelevant questions to confuse respondents.
Finally, I was able to use SurveyGizmo in such a way that I could also track our (research staff) fulfillment and logistics. For instance we got notification when there were new potential subjects who were interested in participating. We were able to note FedEx tracking numbers along with the records of the corresponding subjects.
Basically it worked well.
The safest platform for collecting confidential survey data is Confirmit. There is a learning curve involved here: you will be coding in VisualSQL, which is only used in Confirmit. The survey responses export to CSV files, from which you can analyze your results in R.
If you are collecting any confidential data, or data where respondents need unique access links so they can only see their own version of the survey, you will want to use Confirmit. The data is housed in Confirmit's data center, but their data is much more secure than other vendors (i.e., a third party will not be able to hack into your survey and see an individual's responses, or intercept the data that is being sent from your respondent to Confirmit).

Excel Upload to database table

I'm looking for the best solution to allow our users to upload XLS spreadsheets so that they can be used to populate tables in our data warehouse (DW).
Our users are heavy Business Objects (BO) users, and BO lets you export to XLS. When they have data in a spreadsheet that needs to be loaded into the DW, they need a process to upload the data in the XLS to the DW's database. As a result, we end up with many of these "interfaces", when I think what we really need is a programmatic automated feed. Using Excel as a data source for inter-system feeds, in my gut, just seems like a bad idea to me.
Question #1: I'd like to see if you agree and why or why not.
OK, there is no swimming against that tide, so I now take as a given that XLS uploads are here to stay for us. Now I need to find the best solution. First, I'll explain what we do now and then what I don't like about it:
Via web pages, we provide empty XLS files (no rows) with a defined set of columns. Each file is intended to be used to update a different target destination table. Each spreadsheet has an "Upload" button. Pushing the Upload button results in the macro in the spreadsheet serializing the contents of the file to CSV and FTPing the data to a server folder. Periodically, a scheduler fires off an Informatica ETL job that uses the CSV file as input and loads the data into a custom XLS-specific staging table and then, if the records pass edits, into the appropriate target table. Any errors encountered are logged to an error table. For each XLS file uploaded, the data ends up in a separate staging and error table that is specific to the file.
Some of the things I don't like about our process are:
1) The macro code in the XLS is too exposed (it includes passwords, for example), can be tampered with, and there are issues ensuring that the users are using the latest XLS templates.
2) Business rule edits are placed in the ETL program, where they should probably be, but because we would like to catch errors ASAP, i.e., in the spreadsheet, edits are also added to the macro code. This results in duplication of business edits. I want these rules in one place and centrally controlled. IMHO, putting any macro code in the XLS introduces a maintenance issue, even calls to stored procedures (some of which we have) or calls to web services (we haven't yet tried to call .NET web services from XLS macros).
3) Every XLS file upload template has its own process with a distinct set of staging and error tables and a custom screen for reporting errors encountered. It seems like we need a more generalized, re-usable solution.
Besides often getting data exported to XLS from BO, the users also like Excel because it is easier to edit a large number of records and less clunky than editing individual records via a web interface.
This is the general direction that I am thinking:
First, I want the users to have the editing ease of Excel, but without embedded macros in the spreadsheet. I experimented with Farpoint's grid with Excel compatibility...
http://www.fpoint.com/netproducts/spreadweb/tour/excel.aspx
...and I found that it was quite easy to give a user the ability to open an XLS file that resides on their PC, have it open in the browser, and easily access the data from server-side .NET web code. Excel isn't running locally in their browser, but the functionality of Excel is reproduced, presumably through a lot of client-side scripting that I expect would be a real pain to duplicate myself. You can even cut and paste from a local spreadsheet into the web spreadsheet. This sounds great; my biggest problem is cost. Our company is near death and won't allow us to purchase any new software.
Next, I want to identify the common components across all spreadsheet upload processing and come up with generic processing code. For example, I imagine a table which defines each of our spreadsheets and their formats, including the column names and data type definitions, perhaps in terms of their destination columns instead of hard-coding them. Based on this template definition, I can generate XLS templates for download. I can also perform simple generic edits to ensure that the data entered matches the table definition. And one common web page can be used to present the data, report data type mismatch errors, and allow the user to correct them. I would also define a common "staging" table for storing the data, perhaps with columns such as submission #, row num, column name and value. No more "custom everything" is the goal.
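To make that idea concrete, here is a rough sketch of the generic pieces (table and column names are placeholders, and it uses PHP/PDO with SQLite just to keep the prototype self-contained; the production version would presumably live in .NET and/or Informatica instead):

```php
<?php
// Prototype of the "no custom everything" idea: one template-definition table
// drives validation, and one generic staging table holds every upload as
// (submission, row, column name, value) tuples. All names are placeholders.
$db = new PDO('sqlite:uploads.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS template_column (
               template_name TEXT, column_name TEXT, data_type TEXT, position INTEGER)');
$db->exec('CREATE TABLE IF NOT EXISTS staging_value (
               submission_id INTEGER, row_num INTEGER, column_name TEXT, value TEXT)');
$db->exec('CREATE TABLE IF NOT EXISTS staging_error (
               submission_id INTEGER, row_num INTEGER, column_name TEXT, message TEXT)');

// Generic loader: works for any upload template because nothing is hard-coded.
function stageCsv(PDO $db, $submissionId, $templateName, $csvFile) {
    $cols = $db->prepare('SELECT column_name, data_type FROM template_column
                           WHERE template_name = ? ORDER BY position');
    $cols->execute(array($templateName));
    $defs = $cols->fetchAll(PDO::FETCH_ASSOC);

    $ins = $db->prepare('INSERT INTO staging_value VALUES (?, ?, ?, ?)');
    $err = $db->prepare('INSERT INTO staging_error VALUES (?, ?, ?, ?)');

    $fh = fopen($csvFile, 'r');
    fgetcsv($fh);                                    // skip the header row
    for ($rowNum = 1; ($row = fgetcsv($fh)) !== false; $rowNum++) {
        foreach ($defs as $i => $def) {
            $value = isset($row[$i]) ? $row[$i] : null;
            // Simple generic edit: numeric columns must contain numbers.
            if ($def['data_type'] === 'numeric' && !is_numeric($value)) {
                $err->execute(array($submissionId, $rowNum, $def['column_name'],
                                    "expected a number, got '$value'"));
                continue;
            }
            $ins->execute(array($submissionId, $rowNum, $def['column_name'], $value));
        }
    }
    fclose($fh);
}
```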
Next, I need to decide where to put the business rules. My dept's mgt firmly believes that all loading of data should be done by Informatica ETL batch processes, and therefore the rules/edits belong "in Informatica". I have zero experience with Informatica tools; I am more of a .NET guy. I am therefore unsure how these rules are implemented, but I suspect that they are not reusable in the sense that a .NET web page could use them to validate a particular record. You see, in some cases, when the user is not performing a bulk upload, they do have the ability to edit a specific record, and I would like the same edits that were applied by the ETL bulk insert process to be applied to an individual update attempt on a single record via a web page. Is the solution to write a single web service or stored procedure that can be called from the web page doing an update of a single record, and also called thousands of times, once for each record in a bulk upload? The latter sounds inefficient.
Your thoughts on anything above would be very much welcomed.
From a cost perspective, the efforts you'll need to go through to re-create spreadsheet functionality on the web will exceed the cost of Farpoint or other controls. Even if you made $20 an hour, do you think you could complete a working product in under 2 weeks? I think you have the facts on your side when you discussed maintenance issues if you allow ETL functionality to exist in Excel - you have twice the amount of work to maintain the transformation rules. I think you need to convince management that in order to create a maintainable, robust solution you need some flexible utilities.
Farpoint is a good choice. There is also SpreadsheetGear, a .NET engine that interprets Excel macros and can run on a web server. It has a Win32 control that allows you to create a WinForms solution with very Excel-like interface functionality. Last time I checked there was no web control for the product. It does an excellent job of providing Excel capability for processing large amounts of data.
Good luck. I think you will find a good solution, since you seem to have a good grasp of the pros and cons of all the different potential solutions.

How do you use Excel server-side?

A client wants to "Web-enable" a spreadsheet calculation: the user specifies the values of certain cells, and is then shown the resulting values in other cells.
(They do NOT want to show the user a "spreadsheet-like" interface. This is not a UI question.)
They have a huge spreadsheet with lots of calculations over many, many sheets. But, in the end, only two things matter -- (1) you put numbers in a couple cells on one sheet, and (2) you get corresponding numbers off a couple cells in another sheet. The rest of it is a black box.
I want to present a UI to the user to enter the numbers they want, then I'd like to programmatically open the Excel file, set the numbers, tell it to re-calculate, and read the result out.
Is this possible/advisable? Is there a commercial component that makes this easier? Are there pitfalls I'm not considering?
(I know I can use Office Automation to do this, but I know it's not recommended to do that server-side, since it tries to run in the context of a user, etc.)
A lot of people are saying I need to recreate the formulas in code. However, this would be staggeringly complex.
It is possible, but not advisable (and officially unsupported).
You can interact with Excel through COM or the .NET Primary Interop Assemblies, but this is meant to be a client-side process.
On the server side, no display or desktop is available, and any unexpected dialog boxes (for example) will make your web app hang; your app will behave flakily.
Also, attaching an Excel process to each request isn't exactly a low-resource approach.
Working out the black box and re-implementing it in a proper programming language is clearly the better (as in "more reliable and faster") option.
Related reading: KB257757: Considerations for server-side Automation of Office
You definitely don't want to be using interop on the server side, it's bad enough using it as a kludge on the client side.
I can see two options:
Figure out the spreadsheet logic. This may benefit you in the long term by making the business logic a known quantity, and in the short term you may find that there are actually bugs in the spreadsheet (I have encountered tons of monster spreadsheets used for years that turn out to have simple bugs in them - everyone just assumed the answers must be right)
Evaluate SpreadSheetGear.NET, which is basically a replacement for interop that does it all without Excel (it replicates a huge chunk of Excel's non-visual logic and IO in .NET)
Although this is certainly possible using ASP.NET, it's very inadvisable. It's unscalable and prone to concurrency errors.
Your best bet is to analyze the spreadsheet calculations and duplicate them. Now, granted, your business is not going to like the time it takes to do this, but it will (presumably) give them a more usable system.
Alternatively, you can simply serve up the spreadsheet to users from your website, in which case you do almost nothing.
Edit: If your stakeholders really insist on using Excel server-side, I suggest you take a good hard look at Excel Services, as John Saunders suggests. It may not get you everything you want, but it'll get you quite a bit, and should solve some of the issues you'll end up with trying to do it server-side with ASP.NET.
That's not to say that it's a panacea; your mileage will certainly vary. And SharePoint isn't exactly cheap to buy or maintain. In fact, short-term costs could easily be dwarfed by long-term costs if you go the SharePoint route, but it might be the best option to fit the requirement.
I still suggest you push back in favor of coding all of your logic in a separate .NET module. That way you can use it both server-side and client-side. Excel can easily pass calculations to a COM object, and you can very easily publish your .NET library as COM objects. In the end, you'd have a much more maintainable and usable architecture.
Setting aside the discussion of whether it makes sense to manipulate an Excel sheet on the server side, one way to do this would be to use
Microsoft.Office.Interop.Excel.dll
Using this library, you can tell Excel to open a spreadsheet, then change and read its contents from .NET. I have used the library in a WinForms application, and I guess it can also be used from ASP.NET.
Still, consider the concurrency problems already mentioned... However, if the sheet is accessed infrequently, why not...
The simplest way to do this might be to:
Upload the Excel workbook to Google Docs -- this is very clean, in my experience
Use the Google Spreadsheets Data API to update the data and return the numbers.
Here's a link to get you started on this, if you want to go that direction:
http://code.google.com/apis/spreadsheets/overview.html
Let me be more adamant than others have been: do not use Excel server-side. It is intended to be used as a desktop application, meaning it is not intended to be used from random threads, possibly multiple threads at a time. You're better off writing your own spreadsheet than trying to use Excel (or any other Office desktop product) from a server.
This is one of the reasons that Excel Services exists. A quick search on MSDN turned up this link: http://blogs.msdn.com/excel/archive/category/11361.aspx. That's a category list, so contains a list of blog posts on the subject. See also Microsoft.Office.Excel.Server.WebServices Namespace.
It sounds like you're saying that the user has the spreadsheet open on their local system, and you want a web site to manipulate that local spreadsheet?
If that's the case, you can't really do that. Even Office automation won't help, unless you want to require them to upload the sheet to the server and download a new altered version.
What you can do is create a web service to do the calculations and add some VBA or VSTO code to the Excel sheet to talk to that service.
