Searching for text mining/extraction software with Intuitive, Modern UI - bigdata

I'm researching different products for my organization. We are looking for a solution that will replace our current text mining software - DataWatch Monarch. We need some type of software that will be able to extract only relevant data from PDF reports and prepare it to be stored in a database.
DataWatch is causing a bottleneck for our organization due to its learning curve and limitations. I started trying to do this myself in R; however, we need a more streamlined approach.
If you know of any easy-to-use, highly effective text miners or report-text-extraction software, please share. I will be looking into Scribe Software, SiMX, RapidMiner, and some others.

RapidMiner can extract info from PDFs no problem using the Text Processing extension. Start with the Read Document operator and go from there.
Storing in a database is also straightforward - set up your database connection in the "Manage Database Connections" menu and then use the "Write Database" operator.
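If you do end up scripting part of this instead, the shape of the pipeline is the same whatever the tool: extract the text from each PDF, keep only the fields you care about, and write rows to a database. Below is a minimal sketch in Python, assuming the pdfminer.six library; the "Total:" field, file names and table layout are hypothetical stand-ins for whatever your real reports contain.

    import re
    import sqlite3
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    def extract_total(pdf_path):
        """Pull a single hypothetical 'Total: 123.45' figure out of a PDF report."""
        text = extract_text(pdf_path)
        match = re.search(r"Total:\s*([\d.]+)", text)
        return float(match.group(1)) if match else None

    # Store the extracted values in a small SQLite table (stand-in for your real database).
    conn = sqlite3.connect("reports.db")
    conn.execute("CREATE TABLE IF NOT EXISTS report_totals (filename TEXT, total REAL)")

    for pdf in ["report_q1.pdf", "report_q2.pdf"]:  # hypothetical report files
        conn.execute("INSERT INTO report_totals VALUES (?, ?)", (pdf, extract_total(pdf)))

    conn.commit()
    conn.close()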

Related

R shiny or Rmarkdown user comments like in Word or Google Docs

I want to create product/project documentation in R that is going to be reviewed and discussed by a group of reviewers. There are plenty of examples of how to create book-like documents using Rmarkdown (e.g. https://bookdown.org/) or interactive data visualizations using R-shiny. However, I could not find any solution for user comments similar to LibreOffice Writer, MS Word, or Google Docs. I could also imagine having a split-pane where one side is dedicated to the content presentation (e.g. text, graphs, code), while the other side is left for comments.
I am aware that such a solution requires a server-side solution for storing comments.
Any hints on existing solutions, workarounds, and implementations are welcome.
If I understood correctly, your question isn't very R-specific. R is just code, R files are just text, and they don't allow comments (besides the raw hashtag comments) or reviews. Your question is more about version control environments that allow reviews on code. The most widely used version control system is git, and git has an integrated panel in RStudio.
Git allows you to split your development into branches, which are the different ideas you and your coworkers can work on independently. Once an idea is finalised, after a series of modifications known as commits, you ask for it to be merged into the main branch. That request is a "pull request".
That is where the platforms built on git, like GitHub or GitLab, provide review systems. Basically, when a branch is done, you ask "is this OK?". Your reviewer can see the changes, try them out, and tell you whether they are actually OK.
The other advantage of these platforms is continuous integration, that is, writing tests (in R with testthat) that are run automatically on certain events, like "on each merge to master". It is meant to ensure that the software keeps moving forward: if a merge breaks an earlier test, you'll know it.
For beginners, in order to avoid getting lost in bash commands, GitHub Desktop is a fine GUI on top of git.
Note: you can also rely on the usethis package, which has a few helper functions like use_git, use_gitlab_ci, and use_github_action. They are not specific to reviews and comments, but they help with GitLab and GitHub integration.

Lotus Smartsuite to "something newer"

I shall try and keep my scenario as brief as possible and to the point.
The office I’m currently working for uses Lotus Smartsuite on Windows 98 / XP, using lots of Lotus Script to tie together Lotus 123 and Lotus Word Pro documents. They also make heavy use of the Lotus Object Linking functions. I shall describe its behaviour below:
You can fill rows and columns in a 123 Spreadsheet with data galore, style it and format it any way you like and define it as a range (nothing unique here). However, you can then copy that range and paste it as a link in a Lotus Word Pro document. This link is then categorised by its range name, so expanding the range back in the 123 file causes the table in the Word Pro Document to expand. This link also carries with it all the formatting and styling of the cells in the 123 Spreadsheet. As I imagine you are now aware, this link is completely live, you can double click anywhere in the object and it opens up the 123 file for editing, and all changes go backward and forward between the two documents. Most of the data retrieved from testing equipment is stored in these 123 spreadsheets and then parts of that are linked into a final Lotus Word Pro report document sent to the customer.
Note: Just to be clear, this is NOT the same as a DDE link in Open Office, which seems to allow for copying of a non-defined range of cells to be imported into a document where all formatting is lost and editing back and forth is not straight forward. It also behaves differently to an OLE object, which seems to only import the entire Spreadsheet rather than a small subsection of it.
However, in recent years, supporting this older software (Lotus) has become more difficult, especially with regard to sending customers documents (Lotus Word Pro files are generally unsupported by more modern office tools), and technical support for Lotus Smartsuite seems to be practically non-existent these days. Also, with development depending on a scripting language no longer practised by mainstream IT technicians, ongoing development and support seems futile. Once the guys who wrote it move on to other things, we will be left with spaghetti script in a language nobody can help us with.
So, we have this goal of "modernising" our IT system by the end of the year. Linux is becoming a very viable option too (No doubt Debian or a derivative), but Open Office doesn't seem to have the linking capability mentioned above. The reason this linking is so important is because the veterans of the office are so used to working this way - storing data in the spreadsheet, linking back to it later in their Word Pro documents, etc. I think they are more than keen to keep this practice going and we have found no equivalent of it in modern office tools (as was requested of me). I can see, as a software engineer (fluent in many languages), how this practice is not the safest or best way of using and storing data (databases spring to mind), but I was wondering if someone could give me a few other good reasons as to why this is bad practice in the work place (I was always in the belief that you should keep your data away from your reporting and formatting, the two should never be entwined - this looks like spreadsheet hell to me) ... or why this is a good thing to keep doing!?
So, for those of you still with me, I guess what I am asking is:
1. Is this practice of storing data, formatting it in spreadsheets and importing it directly back and forth between word documents good or bad, and what can be done about it? I guess I'll need to prove my case either way.
2. Are there ANY modern alternatives to this linking method (regardless of whether it is good or bad practice) out there for Linux or Windows? The link MUST carry formatting as well as dynamic range sizes (DDE links don't seem to be the answer).
3. What would your solution be if you had to start from scratch? Store everything in databases and use SQL to simply ask for the data you need in your word documents? How would you do this? What software would you use?
Any help with this scenario would be more than helpful, or if you know anywhere I should go to ask for advice, that would be appreciated too.
Thank-you for reading!
My suggestion is to first take a step back. What is the benefit to the way things are done now? Is it just a habit that is tough to break? Is there any reason the documents and spreadsheets need to be maintained and linked the way they are, or is it just a requirement because 'that's how it was done before'?
If you can remove that requirement, you have a lot more options and you're building a system that's easier to understand and maintain.
Regarding question 1, I believe there's nothing wrong with storing data in spreadsheets, especially if the end-users need to create and maintain them and development staff is limited. Some questions are whether that data needs to be secured, is related between spreadsheets, is duplicated across the company, or should be shared in a better way across the company. If any of those are true then a centralized database would make more sense. Personally I'd want any valuable data safely stored in a database where it can be managed, access to it can be controlled, it can be easily backed-up, etc.
Regarding question 2, you can do the same thing in Microsoft Office. You can either link the documents, so that the data stays in the source Excel doc but appears in the Word doc, or you can embed the Excel spreadsheet within the Word doc.
You might want to look at Microsoft Access for storing the data and generating reports. Or you could build an application using a relational database back-end and reporting front-end. The possibilities are wide-open. It really depends on where the expertise lies within the company.
If it were me I'd probably use a SQL Express back-end (it's free) and a custom ASP.NET MVC application for generating the reports, but that's just where my expertise lies.
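To make the "database back-end, reporting front-end" idea a bit more concrete, here is a minimal sketch, using SQLite and plain HTML purely as stand-ins for whatever database and report format you actually choose; the table and column names are hypothetical. The point is only that the data lives in the database and the report is generated from a query, so formatting never gets entangled with the data.

    import sqlite3

    # Hypothetical measurement data kept in a database, separate from any report formatting.
    conn = sqlite3.connect("test_results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (sample TEXT, reading REAL)")
    conn.executemany("INSERT INTO results VALUES (?, ?)",
                     [("A-001", 12.4), ("A-002", 13.1)])
    conn.commit()

    # The report is built from a query, so presentation lives entirely in this step.
    rows = conn.execute("SELECT sample, reading FROM results ORDER BY sample").fetchall()
    with open("customer_report.html", "w") as report:
        report.write("<h1>Test Report</h1><table border='1'>")
        report.write("<tr><th>Sample</th><th>Reading</th></tr>")
        for sample, reading in rows:
            report.write(f"<tr><td>{sample}</td><td>{reading:.1f}</td></tr>")
        report.write("</table>")
    conn.close()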

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done, etc.); specifically, I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering undergraduate (4th year), however we don't really cover any web programming, so my understanding of JavaScript / RESTful APIs / all things web is pretty limited, as we're mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. Here are some docs on how you might download a URL in Python and parse XML in Python.
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
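As a rough illustration of that approach, here is a minimal sketch using only the Python standard library. The feed URL is a placeholder, and the tag names assume a plain RSS 2.0 layout, so check the real feed before relying on them; the sleep at the end is just to make the point about polling infrequently.

    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.com/feed.rss"  # placeholder; use the site's real feed URL

    def fetch_latest():
        """Download the RSS feed and return (title, link) pairs for each item."""
        with urllib.request.urlopen(FEED_URL) as response:
            tree = ET.parse(response)
        return [(item.findtext("title"), item.findtext("link"))
                for item in tree.iter("item")]

    if __name__ == "__main__":
        while True:
            for title, link in fetch_latest():
                print(title, "-", link)
            time.sleep(60 * 30)  # poll infrequently so you don't put load on the site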
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe the most important thing you must analyze is what kind of information you want to extract. If you want to crawl entire websites like Google does, your best option is probably to look at tools like Nutch from Apache.org (nutch.apache.org) or the Flaptor solution http://ww.hounder.org. If you need to extract particular areas from unstructured documents - websites, docs, PDFs - you can probably extend Nutch plugins to fit your particular needs.
On the other hand, if you need to extract particular text or clip areas of a website by setting rules against the DOM of the page, what you need to check is more related to tools like mozenda.com. With those tools you will be able to set up extraction rules in order to scrape particular information from a website. Take into consideration that any change to the webpage can break your robot.
Finally, if you are planning to develop a website using outside information sources, you could purchase information from companies such as spinn3r.com, where they sell particular niches of information ready to be consumed. You will be able to save lots of money on infrastructure.
Hope it helps!
sebastian.
Python has the feedparser module, located at feedparser.org, which handles RSS and Atom in their various flavours. No reason to reinvent the wheel.
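For example, a minimal sketch with feedparser (the feed URL is a placeholder):

    import feedparser  # pip install feedparser

    feed = feedparser.parse("https://example.com/feed.rss")  # placeholder URL
    for entry in feed.entries:
        print(entry.title, "-", entry.link)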

Generating keywords from a pdf automatically

My application allows user to upload pdf files and store them on the webserver for later viewing. I store the name of the file, location, size, upload date, user name etc in an SQL server database.
I'd like to be able to programmatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the SQL database as well so that subsequent users can do keyword searches...
Suggestions on how to approach this task? Does this type of routine already exist?
EDIT: Just to clarify my requirements, I wouldn't be concerned with doing OCR. I don't know the insides of PDFs, but I understand that if a PDF was generated by an app, such as Word -> PDF print, the text of the document is searchable... so really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write it to the database.
It has been mentioned that this does not work for scanned PDF documents, but that is only half the truth. On the one hand, there are lots of scanned PDFs which have text embedded as well, because that is what some scanner drivers do (Canon CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
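A minimal sketch of those steps in Python (the asker's stack is ASP.NET, so this is only to illustrate the idea), assuming pdfminer.six for the text extraction and a tiny hard-coded stopword list as a stand-in for a proper one; the file names are hypothetical.

    import re
    import sqlite3
    from collections import Counter
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}  # stand-in list

    def keywords(pdf_path, top_n=20):
        """Extract text, drop stopwords, and return the most frequent remaining words."""
        words = re.findall(r"[a-z]+", extract_text(pdf_path).lower())
        counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
        return [word for word, _ in counts.most_common(top_n)]

    # Store the keywords alongside the file name so later users can search on them.
    conn = sqlite3.connect("uploads.db")
    conn.execute("CREATE TABLE IF NOT EXISTS keywords (filename TEXT, keyword TEXT)")
    for kw in keywords("uploaded_file.pdf"):  # hypothetical uploaded file
        conn.execute("INSERT INTO keywords VALUES (?, ?)", ("uploaded_file.pdf", kw))
    conn.commit()
    conn.close()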
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you are talking about: users upload files and other people can view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to define one PDF. If you say:
3 to 10 - I would check methods of text categorization such as a Bayesian classifier or k-NN (methods that group similar PDF files together). I know that similar algorithms are used to filter spam. But it is a system that needs input: for example, if you add keywords to 100 PDFs, the system will learn the patterns. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force -> filter common words -> get the most frequent words for a specific document.
I would explore the first option (a sketch follows after the links below). You should definitely check out methods such as "text categorization", "auto tagging", "text mining", and "automatic keyword extraction".
Some links :
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
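A minimal sketch of the first option, assuming scikit-learn and a toy set of hand-labelled training texts; the labels and texts are hypothetical, and in practice you would train on the documents you have already tagged.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training set: text already extracted from tagged PDFs, with one label each.
    train_texts = ["invoice amount due payment total",
                   "experiment results measurement sample reading",
                   "contract agreement party liability terms"]
    train_labels = ["finance", "lab-report", "legal"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Predict a label (a coarse "keyword") for a newly uploaded document's text.
    new_text = ["total payment due for sample measurement"]
    print(model.predict(new_text))  # e.g. ['finance']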
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.

Application for graphing lots of web related data

I know this isn't programming-related, but I hope for some feedback that helps me out of this misery.
We actually have lots of different data from our web applications, dating back years.
For example, we have:
Apache logfiles
Daily statistics files from our tracking software (CSV)
Another set of daily statistics from nation-wide advertisement rankings (CSV)
.. and I can probably produce new data from other sources, too.
Some of the data records start in 2005, some in 2006, etc. However, at some point in time we have data from all of them.
What I'm drea^H^H^H^Hsearching for is an application to help me understand all the data: it should let me load it, compare individual data sets and timelines (graphically), compare different data sets within the same time span, and allow me to filter (especially the Apache logfiles); and of course all this should be interactive.
Just the BZ2 compressed Apache logfiles are already 21GB in total, growing weekly.
I've had no real success with things like awstats, Nihu Web Log Analyzer or similar tools. They can only produce static information, but I would need to interactively query the information, apply filters, overlay other data, etc.
I've also tried data mining tools, e.g. RapidMiner, in the hope they could help me, but didn't really succeed in using them (i.e. they're over my head).
Just to make it clear: it can be a commercial application. But I have yet to find something which is really useful.
Somehow I get the impression I'm searching for something which does not exist, or that I have the wrong approach. Any hints are very welcome.
Update:
In the end it was a mixture of the following things:
wrote bash and PHP scripts to parse and manage parsing of the log files, including lots of filtering capabilities (a sketch of this kind of parsing script follows below)
generated plain old CSV files to read into Excel. I'm lucky to be using Excel 2007, and its graphical capabilities, albeit still limited to a fixed set of data, helped a lot
used Amazon EC2 to run the scripts and send me the CSV via email. I had to crawl through around 200GB of data and thus used one of the large instances to parallelize the parsing. I had to execute numerous parsing attempts to get the data right; the overall processing duration was 45 minutes. I don't know what I could have done without Amazon EC2. It was worth every buck I paid for it.
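For anyone heading down the same road, here is a minimal sketch of that kind of parsing script, in Python rather than bash/PHP, assuming the Apache common/combined log format; the file names and the status-code filter are just placeholders.

    import bz2
    import csv
    import re

    # Matches the leading fields of the Apache common/combined log format;
    # adjust if your LogFormat differs.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    )

    with bz2.open("access_log.bz2", "rt", errors="replace") as logfile, \
         open("filtered.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ip", "time", "request", "status", "size"])
        for line in logfile:
            m = LOG_PATTERN.match(line)
            if m and m.group("status") == "200":  # example filter: successful requests only
                writer.writerow([m.group("ip"), m.group("time"),
                                 m.group("request"), m.group("status"), m.group("size")])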
Splunk is a product for this sort of thing.
I have not used it myself, though.
http://www.splunk.com/
The open source data mining and web mining software RapidMiner can import both Apache web server log files and CSV files, and it can also import and export Excel sheets. Rapid-I offers a lot of training courses for RapidMiner, some also on web mining and web usage mining.
In the interest of full disclosure, I've not used any commercial tools for what you're describing.
Have you looked at LogParser? It might be more manual than what you're looking for, but it will allow you to query many different structured formats.
As for the graphical aspect of it, there are some basic charting capabilities built in, but you're likely to get much more mileage by piping the LogParser output into a tabular/delimited format and loading it into Excel. From there you can chart/graph just about anything.
As for cross-joining different data sources, you can always pump all the data into a database, where you'll have a richer language for querying it.
What you're looking for is a "data mining framework", i.e. something which will happily eat gigabytes of somewhat random data and then let you slice and dice it in as-yet-unknown ways to find the gold nuggets buried deep inside the static.
Some links:
CloudBase: "CloudBase is a high-performance data warehouse system built on top of Map-Reduce architecture. It enables business analysts using ANSI SQL to directly query large-scale log files arising in web site, telecommunications or IT operations."
RapidMiner: "RapidMiner already is a full data mining and business intelligence engine which also covers many related aspects ranging from ETL (Extract, Transform & Load) over Analysis to Reporting."
