Is a web-scraping software a "wrapper"? - web-scraping

I found that on Google, the definition of a wrapper is "any entity that encapsulates (wraps around) another item." also, "Wrappers are used for two primary purposes: to convert data to a compatible format or to hide the complexity of the underlying entity using abstraction."
I also know that web-scraping methods are used for extracting data from websites. Does this mean that a web-scraping software is a wrapper? Or have I completely misunderstood the definition of a wrapper?

Wrapper in data mining is a program that extracts content of a particular information source and translates it into a relational form.
Extracting data out of a website can be called web-scraping, however a wrapper will go ahead and perform one more step at "Translating the data pulled into a relational form"
We can say that a wrapper is something that a web-scraping tool might use in order to get more insight about crawled data

Related

How to write Custom HTML code for a Rally/CA report

What would be the HTML code to "filter out" a handful of specific user stories?
Your question is highly unspecific. The only way to get stories is to programatically access the API via a language like Javascript, Java, C#, C++, etc., etc.
You can embed javascript into your html page and get the code to fetch stories with a filter passed in on the access. To see how to structure a query, you could turn on the developer tools in your browser and have a look at the network accesses that the browser does when fetching stories into a custom list app on a page. Using the custom list, you could refine your query to what you want first.
You could always build a custom app for a specific use case, but if you're looking for data and having trouble finding it, there are ways to do so with a combination of custom lists, Rally's own query language, and creative use of advanced filters. It's also possible to massage your data in way that makes Rally's native reporting a bit easier to use.
This is just an example but, if I'm looking to get information on the quarterly progress of my team who don't use start/end date or releases/milestones, there's not a lot available from an app/report standpoint that's already built. However, if I coach my team on keeping a few simple data elements neat and tidy, and utilize the custom report views to make that data useful, it can be pretty quick and easy to implement.
I have my teams keep a few basic fields up to date: Title, Owner, Project, Tags, Refined Estimate (all at a feature level), and most importantly - keeping a parent/child relationship between most work.
Now I can build a report that filters by a certain tag, that can also be filtered by team, and also has the ability to show additional valuable data that can be unearthed because your house is tidy. In this case, you can now display a column that will total all child objects under a certain feature, and display that next to 'Planned' estimate, which will give you the ability to also export and show a planned vs. actual to help your teams estimate more accurately.
It's a round-about way of saying there are a lot of possibilities with the tool if you can use your resources. Building custom apps means you also have to maintain them or pay someone with the knowledge to do so.

MS Word template with loops, tables and charts

For our SaaS (LAMP) product reporting we are currently using JasperReports. We find it too cumbersome to develop reports with and the output in Word unworkable. Moreover, a couple of customers request to be able to develop simple reports themselves (to be used as mail merge). We would therefore like to develop templates right in Word. The idea is to have an application/webservice that would receive the Word template and JSON data from the LAMP application and return the filled-in report. The report has to support:
Loops inside content (repeating a document section several times while filling in array data)
Filling in tables (populating rows from array)
Filling in chart data in pre-created charts (from array)
This is the functionality we are using in JasperReports right now. Are there existing solutions to this? I've found quite a lot that can substitute simple variables, but no info about the the above three points. Will it be a lot of effort to write one from scratch? I would prefer a Windows OpenXML-based solution rather than a Linux PHPOffice-based one as I presume the former would handle the text split by spell-checker and language tags (though I'm not sure).
Windward and Docmosis are both commercial products that support the features you've listed and they are intended to be added to your application to provide reporting capabilities. Neither is are not OpenXML based. They can use Word documents as templates and perform the data merge into different output formats. Please note I work for Docmosis.
Aspose Words is another tool and it can populate a template but most of the power is through code rather than controls/directives in the template. Given your OpenXML thoughts, perhaps this is more what you are looking for.
More tools are recommended here in StackExchange.
I hope that helps.
ReportBox is a Web based reporting solution that can be used by any software application to generate documents and reports in Microsoft Word/ Excel/ PowerPoint/ HTML(DocX/Xlsx/PPTx/HTML) using OpenXML.
The process starts by building a Microsoft Word/ Excel/ PowerPoint/ HTML document as a template and uploading to ReportBox portal. Your application either sends data to ReportBox or ReportBox can pull data from your application database, which is then merged with the template to produce the finished report. Please note that I work for GreenThoughts.

How to feed Word 2010 (.docx) documents/templates with data from MySQL database?

What would be the best approach to replace placeholders in a .docx document (Word 2010) with data coming from a MySQL database?
Can I just open the file using a server side language and do a string replace per each placeholder?
Is there any existing tool/library available?
Thanks
Disclosure: I work for Invantive.
Using Invantive Composition (http://www.invantive.com/products/invantive-composition) you can fill Word documents (letters, legal pleadings, insurancy policies) with data from a database (IBM DB2, Oracle, MySQL, Teradata and SQL Server) and then fully change the contents at will manually. It is intended for real Microsoft Word end-users (both the guys that make the template and the ones that use it) that access the databases through a central webservice and models with queries. Invantive Composition allows nested repeating groups of data and lay-out. Integrates into Microsoft Word using click once.
In the past, I personally have also been using JasperReports (http://community.jaspersoft.com/project/jasperreports-library) to generate letters using the RTF output target of JasperReports. It is free and works fine as long as you do not want to edit the output more than a few words and have Java/SQL development skills. Just as Invantive Composition it works fine for large numbers of different reports.
As long as you can control the environment completely, you can also consider using RTF as intermediate language (not for end-users, only real developers). Save document as RTF, replace parts of the text you need to be replacable, write a webservice that accepts the parameter and dumps back the resulting RTF. Takes some time to generate more complex tables (tables are obviously something invented by the human race after the RTF specification was written :-) This approach only works with very limited number of templates and when you have sufficient developer time available to get it up and running and stabilized.
As an independent reviewer, I have also seen cases where XML templates were used, but the results were not as good as with JasperReports.
**Disclosure: I lead the docx4j project **
There are heaps of existing tools/libraries available!
Yes, you can just do a string replace, but that is a brittle approach, since Word may have split the string across runs.
You can use MERGEFIELDs, or content control data binding.
docx4j supports all three approaches, but content control data binding is the most powerful.
ContentControlsMergeXML
MERGEFIELDs
VariableReplace
One thing to consider especially is "repeats". If you want say a row of a table in Word, for each matching row in your MySQL table, then you need a way to make this happen.
docx4j does this with a "repeat" content control around the table row; whichever solution you choose, I'd make sure up front that it can handle repeats.
If you want to use PHP the most complete available solution is PHPDocX.
You may check in the tutorial how to substitute placeholder variables by data coming from any data source (like a MySQL DB).
In particular, you may populate table rows with an indefinite number of entries and you may delete whole blocks of the Word document depending on the data fed to the application or build dynamical Word charts.
You may check the available DEMO for a simple but quite illustrative example (its inner workings are explained in the tutorial section).
You can use open Open XML SDK and replace your placeholders like this.
Disclosure: I lead the docxgenjs project
I think you shouldn't have to code everything by yourself, that's why I created a Mustache-like templating engine for docx
Demo:
http://javascript-ninja.fr/docxgenjs/examples/demo.html
Repo
https://github.com/edi9999/docxgenjs
It is JS-based and works client and server side.
Yes, you can use server side language to do it.
Check on apache POI.
http://poi.apache.org
Hello I read the above esp the comments and Ivantive looks impressive - but the solution I needed was much simpler. Use Selection.Range.InsertDatabase in Word to fetch records from an access database or excel spreadsheet or even just another word document. With the access solution you can choose the layout of the records to fetch and have it fetch just particular recordds based on a field (eg ID). Google the words above and it'll take you to MS guidance and an example VB script. Worked well in just a few mins. Now looking for VB script that asks the person what ID they want from the dbase and we're done.
it uses docx templates that have merge fields with java objects (the objects have the information you load from mysql or any other source). The xdoc report is an project for java language, the home page of the project is https://code.google.com/p/xdocreport/.
*Disclosure: I create the templ4docx project *
Hello
You can use templ4docx java library, which is on maven central repository, so you can just add it to your maven dependencies:
<dependency>
<groupId>pl.jsolve</groupId>
<artifactId>templ4docx</artifactId>
<version>2.0.0</version>
</dependency>
Example usage:
Docx docx = new Docx("E:\\template.docx");
Variables variables = new Variables();
variables.addTextVariable(new TextVariable("${firstName}", "John"));
variables.addTextVariable(new TextVariable("${lastName}", "Sky"));
docx.fillTemplate(variables);
docx.save("E:\\filledTemplate.docx");
More details you can find here: http://jsolve.github.io/java/templ4docx/

Convert nested dictionary/xml to flat file for sqlite

I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask...
(Btw, much of this is new to me- not all, just most.)
Problem: trying to convert a bio/python nested dictionary (or xml) of pubmed citation data into a flat (normalized) structure eg, sqlite. Citation data was fetched from pubmed using biopython and was parsed into a dictionary, but can also retrieve as xml if needed.
Not all citations will have all fields/keys and not all fields/keys will have the same number of items (authors, mesh terms, refs, etc...) and understand that this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first remove/normalize all unique fields (those that have 1 per paper eg, title, abstract, date, citation, etc..., but say not affiliation as that would be linked to first author). Papers with no abstract could be filled as null?
Then move on to, say, authors and create a separate table again using PMID as the fk and then do same for the various other fields/keys/items in separate tables eg, mesh headings, EC numbers, ref, etc...
Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done/needs to be done (obviously leaving the PMID)?
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated- and I do understand that you can't fit a nested structure into a flat space- just looking for the least boneheaded way of going about this and hopefully one that will allow me to make sure that everything was properly captured.
Many thanks,
chris
A quick question -- if you already have the data in XML, why are you normalizing it into a SQL format? Why not just use the raw XML? Berkeley DB XML is a library (like SQLite) that links into your application. There is no separate server to install or maintain. The library allows you to store and query XML data using XPath or XQuery. It's very fast, has a small footprint. is transactional, recoverable and highly reliable. It has HA features as well, if that is required.
Keeping the data in XML should simplify the whole data import process and still allow you to query the semi-structured data.

How to build flexible web forms in ASP.NET

I'm using asp.net and I need to build an application where we can easily create forms without recreating the database, and preferably without changing the create/read/update/delete queries. The goal is to allow customers to create their own forms with dropdowns, textboxes, checkboxes, even many-to-one relationship to another simple form (that's stretching it). The user does not have to be able to create the forms themselves, but I don't want to be adding tables, fields, queries, web page, etc. each time a new form is requested/modified.
2 questions:
1) How do I structure a flexible database to do this (in SQL Server)? I can think of two ways: a) Create a table for each datatype (int, varchar(x), smalldatetime, bit, etc). This would be very difficult to create the adequate queries. b) Create the form table with lots of extra fields and various datatypes in case the user needs 5 integers or 5 date fields. This seems the easiest, but is probably pretty inefficient use of space.
2) How do I build the forms? I thought about creating an xml sheet that had the validations, data type, control to display, etc. as a list. Then I would parse through the xml to build the form on the fly. Probably using css to do the layout (that would have to be manual, which is ok).
Is there a better/best way? Is there something out there that I could look at to get ideas? Any help is much appreciated.
This sounds like a potential candidate for an InfoPath solution. At first blush, it will do most/all of what you are asking.
This article gives a brief overview of creating an InfoPath form that is based on a SQL data source.
http://office.microsoft.com/en-us/infopath-help/design-a-form-template-based-on-a-microsoft-sql-server-database-HP010086639.aspx
I have built a completely custom solution like you are describing, and if I ever did it again I would probably opt for either 1) a third-party product or 2) less functionality. You can spend 90% of your time working on 10% of the feature set.
EDIT: reread your questions and here is additional feedback.
1 - Flexible data structure: A couple things to keep in mind are performance and the ability to write reports against the data. The more generic the data structure, the harder these will be to achieve (again, speaking from experience).
Somewhat contrary to both performance and report-readiness, Microsoft SharePoint uses XML fragments/documents in generic tables for maximum flexibility. I can't argue with the features of SharePoint, so this does get the job done and greatly simplifies the data structure. XML will perform well when done correctly, but most people will find it more difficult to write queries against XML. Even though XML is a "first class citizen" to SQL Server, it may or may not perform as well as an optimized table structure.
2 - Forms: I have implemented custom forms using XML transformed by XSLT. XML is often a good choice for storing form structure; XSLT is a monster unto itself, but it is very powerful. For what it's worth, InfoPath stores its form structure as XML.
I've also implemented dynamic forms using custom controls in .Net. This is a very object-oriented approach, but (depending on the complexity of the UI) can require a significant amount of code.
Lastly (again using SharePoint as an example), Microsoft implemented horrendously complicated XML list/form definitions in SharePoint 2007. The complexity defeats many of the benefits. In other words, if you go the XML route, make your structures clean and simple or you will have a maintenance nightmare on your hands.
EDIT #2: In reference to Scott's question below, here's a high-level data structure that will avoid duplicated data and doesn't rely on XML for the majority of the form definition.
Caveat: I just put this design together in SQL Management Studio...I only spent 10 minutes on it; developing a flexible form system is not a trivial task, so this is an over-simplification. It does not address the storage of user-entered data, just the definition of the form.
The tables:
Form - top-level form table which contains (as you would guess) the collection of fields that comprise the form.
Field - generic fields that could be reused across forms. For example, you don't want 50 different "Last Name" fields for 50 different forms. Note the "DataTypeId" column. You could store any type you wanted in this column, like "number, "free text", even a value that indicates the user should pick from a list.
FormField - allows a form to contain 0-many fields in its definition. You could even extend this table to indicate that the user can ADD as many of this field as they need.
Constraint - basically a lookup table that defines a constraint type (maybe it's max length, max occurrences, required, etc.)
FormFieldConstraint - relates a constraint to a particular instance of a form field. This table combines a specific form with a specific field with a specific constraint. Note the metadata column; this potentially would be a good use for XML to store the specifics of the constraint.
Essentially, I suggest building a normalized database with few or no null values and no duplicated data. A structure as I've described would get you on the path to that goal.
I think if you need truly dynamic forms saved into a database, you'd have to create a sort of "dictionary" data table.
For example...
UserForms
---------
FormID
FieldName
FieldValue
FormID relates back to the parent form (so you can aggregate all of the fields for one form. FieldName is the name of the text field entered from. FieldValue is the value entered/selected for that field.
Obviously this isn't a perfect solution. You could have issues typing your data dynamically, but I leave the implementation of that up to you.
Anyways, hopefully this gives you somewhere to start thinking about how you'd like to accomplish things. Good luck!
P.S. I've found using webforms with .NET to be a total pain when doing dynamic form generation. In the instances I had to do it, I ditched it almost entirely and used native HTML elements. Then rewired my form by using the necessary values from Request. Probably not a perfect solution either, but it worked the best for me.
We created a forms system like the one you're describing with a schema very similar to the one at the end of Tim's post. It has been pretty complicated, and we really had to wrestle with the standard WebForms controls like the DetailsView and GridView to make them be able to perform CRUD operations on groups of answers. They're used to binding straight to properties on an object, and we're forcing them to look up a field ID in a dictionary first. They don't like it. You may want to consider using MVC.
One tricky question is how to store the answers. We ended up using a table that's keyed on FieldId, InstanceId (for example, if 10 people are filling out your form, there are 10 instances), and RowNumber (because we support multi-row forms for things like employment history). Instead of doing it this way, I would recommend making your AnswerRow a first-class concept with its own table tied to an Instance, and having the answers be linked to the AnswerRow and Field.
In order to support different value types, we had to create multiple answer fields on our answer table (AnswerChar, AnswerDate, AnswerInt, AnswerDecimal). Each of these maps to one or more Field Types (Date, Time, Number, etc.). Each Field Type has its own logic to represent the value in the UI, and put the value into the appropriate property on our answer objects.
All in all, it's been a lot of work. It's worth it for us, since this is the crux of the product, but you will definitely want to keep the cost in mind before embarking on a project like this.

Resources