Scalable solution for extracting semantic data from websites?

Scalable solution for extracting semantic data from websites? - web-scraping

Let's say I have a (rather large) number of websites on my disk, scraped or fetched from e.g. Common Crawl. I have no prior knowledge about the HTML structure, assume that each page is structured differently (which is usually the case). From each of them I want to extract some kinds of semantic information (known in advance) like articles/posts with metadata (date, author, tags, comments, etc.).
One straightforward way to go would be to write a simple parser for each of the websites, given good quality parsing libraries out there it should be easy enough. But this approach obviously does not scale. Is there a more clever solution to this problem? How would I proceed and what is actually the difficulty of this task?
You can include paid services, if anything like this exists. If you're aware of any better way of getting this kind of data (on a specific topic; instead of manual scraping / Common Crawl) please also include it.

Related

What is a good localization strategy for knitr reports?

I would like to create aggregation reports based on ggplot2 and knitr. Unfortunately I want to do it in four languages, namely English, German, French, Italian. So far labels for plots and figures are basically coming from data itself, i.e. they are generated from data.frame headers or factor levels.
Given that I have more than 100 categorical variables with different levels I wonder what an efficient translation strategy might be. There's .po files and Portable Object editors for other languages and even for R and its messages itself. Given that the number of languages might be increased it becomes more likely that other persons need to be involved for translation. Obviously these persons are typically no R users and might not even like text editors.
Has anybody faced the same problem and has developed some good strategy or experience to share? Could you imagine something xliff like?
EDIT: I have seen this thread in the meantime but I believe gettext does only work for packages. I wonder if the domain in this post is really valid.

No direct experience myself, but wondering if a google spreadsheet might be provide a good workflow for collaborating with translators e.g. then get into R using RCurl solutions such as answer here:
read.csv fails to read a CSV file from google docs
I used this technique to analyse survey data.

Automatic documentation of datasets

I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.
Does such a tool exist?
If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:
A markup language should be used (YAML?)
All sub-directories should be scanned
To facilitate (2), a standard extension for a dataset descriptor should be used
Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.

This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.
Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) There are a bunch of resources that you'll find, such as the surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.
That's a conceptual start, now to address practical issues:
I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.
Welcome to everyone's nightmare. :)
Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.
No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.
It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results are a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)
Does such a tool exist?
Yes. What are such tools? Mon dieu... it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.
For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.
For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...
1. A markup language should be used (YAML?)
Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
2. All sub-directories should be scanned
Done: listDirectory()
3. To facilitate (2), a standard extension for a dataset descriptor should be used
Trivial: ".json". ;-) Or ".SecretSauce" works, too.
4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
That's where others' suggestions of roxygen and roxygen2 can apply.
6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
Hmm, I'm stumped here. :)
(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.
Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptors matching variable names (and matrices) to descriptors. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.

Data Visualisation

UPDATE: I had posted this on UI.stackexchange also for views on different kinds od visualisation. I am posting this here for finding out the programming techniques and tools required to do so.
Let us have the following three sets of information
Now I want to combine all of this data and show it all together. Telling it like a story. Giving inter-relations. Showing similarities in terms, concepts etc. to get the following (Note that in the diagram below, the colored relations may not be exact, they are merely indicative of a node of information)
Situation: I need to tell somebody the relation between two or more important things through the commonness of concepts, keywords, behaviours in those things.
One way that I figured out would be to use circles for concepts.
So that all concepts connected to thing A would be connected to it and all concept related to B would be connected to it. And the common concepts would be connected to both. That way 2 things can be easily compared.
Problem: To build such a graph/visualisation manually would be cumbersome. Especially to add, arrange, update and manipulate.
Question: Is there a good way to do it. Also, Is there a tool available for doing this?
I hope this make the question much more clear. :)

Where does this data (the concepts, keywords, and relations between them etc...) come from? If it's in a database somewhere you could write soem code to generate a graphiz file then open it in a graphiz visualizer. There might be some tools out there that allow interactive editing of a graphiz graph, it looks like WebDot may and there are probably others.

How to display the hierarchical data on User Interface

You're talking about Venn diagrams. I think there should be plenty of online and offline tools that can help making these.
graphiz has been mentioned already, although that would be used more to show a flow of a system, or a treeview.
When you're talking about software development and want to display a design through diagrams, a complete diagram solution already exist as UML. And there are plenty of UMT tools that can help here. A commercial version is Altova UModel, which has some very nice features. You could probably use Use Cases as the most logical diagram type.
Also see Wikipedia for more info about use case diagrams. Reconsidering the image you've added, I do tend to consider it to be a usecase. Since UML is based on XML, it should be possible to transform your data through a stylesheet to UML, then use a random UML tool to display the diagrams.To convert your data to XML, well... If it's in Excel then exporting it to XML should not be too difficult.
Why is your sample image an Use Case? Well, you have actors (Pinguin, Koala, Tulips) and you have actions. (well, kind of actions: Cause for concern, some kind of animal, linked to movie, bites your nose off...) And finally, there are associations between the actors and the actions connecting them all in some way. Thus Data--(export)->XML--(Styleheet)->UML--(UML tool)->Diagram.

D3: Data-Driven Documents JS library

Abstraction or not?

The other day i stumbled onto a rather old usenet post by Linus Torwalds. It is the infamous "You are full of bull****" post when he defends his choice of using plain C for Git over something more modern.
In particular this post made me think about the enormous amount of abstraction layers that accumulate one over the other where I work. Mine is a Windows .Net environment. I must say that I like C# and the .Net environment, it really makes most things easy.
Now, I come from a very different background made of Unix technologies like C and a plethora or scripting languages; to me, also, OOP is just one, and not always the best, programming paradigm.. I often struggle (in a working kind of way, of course!) with my colleagues (one in particular), because they appear to be of the "any problem can be solved with an additional level of abstraction" church, while I'm more of the "keeping it simple" school. I think that there is a very different mental approach to the problems that maybe comes from the exposure to different cultures.
As a very simple example, for the first project I did here I needed some configuration for an application. I made a 10 rows class to load and parse a txt file to be located in the program's root dir containing colon separated key / value pairs, one per row. It worked.
In the end, to standardize the approach to the configuration problem, we now have a library to be located on every machine running each configured program that calls a service that, at startup, loads up an xml that contains the references to other xmls, one per application, that contain the configurations themselves.
Now, it is extensible and made up of fancy reusable abstractions, providers and all, but I still think that, if we one day really happen to reuse part of it, with the time taken to make it up, we can make the needed code from start or copy / past the old code and modify it.
What are your thoughts about it? Can you point out some interesting reference dealing with the problem?
Thanks

Abstraction makes it easier to construct software and understand how it is put together, but it complicates fully understanding certain issues around performance and security, because the abstraction layers introduce certain kinds of complexity.
Torvalds' position is not absurd, but he is an extremist.

Simple answer: programming languages provide data structures and ways to combine them. Use these directly at first, do not abstract. If you find you have representation invariants to maintain that are at a high risk of being broken due to a large number of usage sites possibly outside your control, then consider abstraction.
To implement this, first provide functions and convert the call sites to use them without hiding the representation. Hide the data representation only when you're satisfied your functional representation is sufficient. Make sure at this time to document the invariant being protected.
An "extreme programming" version of this: do not abstract until you have test cases that break your program. If you think the invariant can be breached, write the case that breaks it first.

Here's a similar question: https://stackoverflow.com/questions/1992279/abstraction-in-todays-languages-excited-or-sad.
I agree with #Steve Emmerson - 'Coders at Work' would give you some excellent perspective on this issue.

Best way to incorporate spell checkers with a build process

I try to externalize all strings (and other constants) used in any application I write, for many reasons that are probably second-nature to most stack-overflowers, but one thing I would like to have is the ability to automate spell checking of any user-visible strings. This poses a couple problems:
Not all strings are user-visible, and it's non-trivial to spearate them, and keep that separation in place (but it is possible)
Most, if not all, string externalization methods I've used involve significant text that will not pass a spell checker such as aspell/ispell (eg: theStrName="some string." and comments)
Many spellcheckers (once again, aspell/ispell) don't handle many words out of the box (generally technical terms, proper nouns, or just 'new' terminology, like metadata).
How do you incorporate something like this into your build procedures/test suites? It is not feasible to have someone manually spell check all the strings in an application each time they are changed -- and there is no chance that they will all be spelled correctly the first time.

We do it manually, if errors aren't picked up during testing then they're picked up by the QA team, or during localization by the translators, or during localization QA. Then we lodge a bug.
Most of our developers are not native English speakers, so it's not an uncommon problem for us. The number that slip through the cracks is so small that this is a satisfactory solution for us.
Nothing over a few hundred lines is ever 100% bug-free (well... maybe the odd piece of embedded code), just think of spelling mistakes as bugs and don't waste too much time on it.
As soon as your application matures, over 90% of strings won't change between releases and it would be a reasonably trivial exercise to compare two versions of your resources, figure out what'ts new (check them first), what's changed/updated (check next) and what hasn't changed (no need to check these)
So think of it more like I need to check ALL of these manually the first time, and I'm only going to have to check 10% of them next time. Now ask yourself if you still really need to automate spell checking.

I can think of two ways to approach this semi-automatically:
Have the compiler help you differentiate between strings used in the UI and strings used elsewhere. Overload different variants of the string datatype depending on it's purpose, and overload the output methods to only accept that type - that way you can create a fake UI that just outputs the UI strings, and do the spell checking on that.
If this is doable of course depends on the platform and the overall architecture of the application.
Another approach could be to simply update the spell checkers database with all the strings that appear in the code - comments, xpaths, table names, you name it - and regard them as perfectly cromulent. This will of course reduce the precision of the spell checking.

First thing, regarding string externalization - GNU GetText (if used properly) creates string files that are contain almost no text other then the actual content of the strings (there are some headers but its easy to cause a spell checker to ignore them).
Second thing, what I would do is to run the spell checker in a continuous integration environment and have the errors fed externally, probably through a web interface but email will also work. Developers can then review the errors and either fix them in the code or use some easy interface to let the spell check know that a misspelling should be ignored (a web interface can integrate both the error view and the spell checker interface).

If you're using java and are storing your localized strings in resource bundles then you could check the Bundle.properties files and validate the bundle strings. You could also add a special comment annotation that your processor could use to determine if an entry should be skipped.
This method will allow you to give a hint as to the locale and provide a way of checking multiple languages within the one build process.
I can't answer how you would perform the actual spell checking itself, though I think what I've presented will guid you as for the method of performing the spell checking.

Use aspell. It's a programme, it's available for unixoids and cygwin, it can be run over lots of kinds of source code. Use it.

First point, please don't put it into you build process. I would be a vengeful coder if I (meaning my computer) had to spell check all the content on the site every time I tried to debug or build a new feature. I don't even think this kind of operation belongs as a unit test (you're testing a human interface, not a computerised one).
Second point, don't write a script. You're going to have so many false positives fall through the cracks that people will stop reading the reports and you are no better off than when you started.
Third point, this is probably most easily solved by having humans do it: QA team, copy writers, beta testers, translators, etc. All the big sites with internationalised content that I've built had the same process: we took the copy from the copy writers, sent it to the translating service/agency, put it into the persistence layer, and deployed it. Testers (QA, developers, PMs, designers, etc.) would find spelling or grammatical mistakes and lodge bug reports. There is just too much red tape and pairs of eyes for that many spelling/grammar errors to slip through.
Fourth point, there will always be spelling and grammar mistakes on your page. Even major newspaper web sites haven't gotten around this and they have whole office buildings filled with editors.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex