Adjacency list of links in a website - web-scraping

I have a problem to solve. I have two or three web sites with a lot of connections between them, and I have to find the shortest way between two URLs.
To begin with, though, I need something to map these sites and build an adjacency list I can use. Every row should contain one page and all the connections that start from there.
I need software that can do such a scan, or else I should write that kind of software myself.
It shouldn't be too difficult with PHP cURL, for example. :)
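For instance, a minimal PHP cURL sketch (the starting URL is hypothetical) that fetches one page and prints one adjacency-list row, i.e. the page followed by the links that start from it; a full crawl is just repeating this for every URL discovered while remembering the visited ones:

    <?php
    // Minimal sketch: fetch a page with cURL, pull out its <a href> targets,
    // and print one adjacency-list row ("page -> its outgoing links").
    function outgoingLinks(string $url): array
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($ch);
        curl_close($ch);
        if ($html === false) {
            return [];
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html);                  // tolerate real-world markup

        $links = [];
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if ($href !== '') {
                $links[] = $href;
            }
        }
        return array_values(array_unique($links));
    }

    $page = 'http://example.com/';               // hypothetical starting page
    echo $page . ' -> ' . implode(' ', outgoingLinks($page)) . "\n";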

Maybe you should model your pages as graph nodes and run the calculation on that graph? At a big news website that detects named entities in news articles (http://topbitcoinnews.com) we use the same approach and store our data in Neo4j (neo4j.org).
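A graph database is one option; for two or three sites, a plain breadth-first search over the adjacency list already gives a shortest path. A minimal PHP sketch, assuming $graph maps each URL to the URLs it links to:

    <?php
    // Breadth-first search: the first time we reach the goal, the path is shortest.
    function shortestPath(array $graph, string $start, string $goal): ?array
    {
        $queue   = [[$start]];
        $visited = [$start => true];

        while ($queue) {
            $path = array_shift($queue);
            $node = end($path);
            if ($node === $goal) {
                return $path;
            }
            foreach ($graph[$node] ?? [] as $next) {
                if (!isset($visited[$next])) {
                    $visited[$next] = true;
                    $queue[] = array_merge($path, [$next]);
                }
            }
        }
        return null;                             // no route between the two URLs
    }

    // Hypothetical adjacency list:
    $graph = [
        'a.example/' => ['a.example/about', 'b.example/'],
        'b.example/' => ['b.example/contact'],
    ];
    print_r(shortestPath($graph, 'a.example/', 'b.example/contact'));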

Related

How to choose the right design solution for an application?

I am currently designing a new enterprise application. Right now we have a lot of different proprietary solutions, and we want to create a single new one to replace them all.
Briefly, it is a kind of data distribution system. We have a lot of clients who need a lot of different data.
What I want:
1) A common REST API service
2) Some synchronous (async?) environment to send tasks and get data back. As shown in the image below, I am thinking of using the Spring Kafka request/reply template. It would help me scale the application in the future.
3) Different types of calculators for every kind of data
I have searched a lot for the best way to do the second point, but I didn't find any ready-made solutions or advice. Is Kafka a good fit here? Maybe someone could give me advice about best practice in such situations.
Please send me links to articles or anything else, because it will be a big application and I want to build it right from the beginning.

Making a tree of Wikipedia links

I am trying to use the Wikipedia API to get all links on all pages. Currently I'm using
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=alllinks&prop=links&pllimit=max&plnamespace=0
but this does not seem to start at the first article and end at the last. How can I get this to generate all pages and all their links?
The English Wikipedia has approximately 1.05 billion internal links. Considering the list=alllinks module has a limit of 500 links per request, it's not realistic to get all links from the API.
Instead, you can download Wikipedia's database dumps and use those. Specifically, you want the pagelinks dump, containing information about the links themselves, and very likely also the page dump, for mapping page ids to page titles.
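If you still want to page through the API for a smaller slice of the data, the key is to keep resending the values from the continue block the API returns until it stops returning one. A minimal PHP sketch using the query from the question (the output handling is illustrative only):

    <?php
    // Follow the API's continuation tokens instead of expecting one request
    // to cover everything. For the full link graph this is not practical,
    // as explained above; use the dumps for that.
    $base   = 'https://en.wikipedia.org/w/api.php';
    $params = [
        'format'      => 'json',
        'action'      => 'query',
        'generator'   => 'alllinks',
        'prop'        => 'links',
        'pllimit'     => 'max',
        'plnamespace' => 0,
    ];

    $continue = ['continue' => ''];              // opt in to the modern continuation format
    do {
        $query = array_merge($params, $continue);
        $json  = file_get_contents($base . '?' . http_build_query($query));
        $data  = json_decode($json, true);

        foreach ($data['query']['pages'] ?? [] as $page) {
            foreach ($page['links'] ?? [] as $link) {
                echo $page['title'] . ' -> ' . $link['title'] . "\n";
            }
        }

        $continue = $data['continue'] ?? null;   // null once the batch is exhausted
    } while ($continue !== null);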
I know this is an old question, but in case anyone else is searching and finds this, I highly recommend looking at Wikicrush to extract the link graph for all of Wikipedia. It produces a relatively compact representation that can be used to very quickly traverse links.

scraping a website - is this even possible?

I'd like to extract the Rates values for properties in Northern Ireland from the LPS website http://lpsni.gov.uk/vListDCV/search.asp?submit=form
I'm a reasonable PHP programmer, but I haven't a clue how I'd go about doing this. Can someone point me in the direction of what I need to find out in order to do it?
Is it even possible to do what I want?
Yes, it is very do-able.
Pointers: ignore trying to go in through the form; all the data can be reached via static links from http://lpsni.gov.uk/vListDCV/districts.asp. Since all the properties are fixed, it becomes merely a case of scraping each layer for links and building loops within loops (e.g. councils -> wards -> streets, etc.) until you eventually get down to the meat and pull it out using cURL, or even just file_get_contents, and strip off the bits you don't want with regex. Store it in a database for later use.
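A minimal sketch of that layered approach, assuming file_get_contents and DOMDocument are enough for these pages (the URL joining and the levels are guesses; adjust them to the real markup):

    <?php
    // Pull a listing page, collect the links it contains, then repeat one level down.
    function getLinks(string $url): array
    {
        $html = file_get_contents($url);         // or cURL if you need headers/cookies
        if ($html === false) {
            return [];
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html);                  // suppress warnings from messy markup

        $links = [];
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if ($href !== '') {
                $links[] = ['text' => trim($a->textContent), 'href' => $href];
            }
        }
        return $links;
    }

    // Level 1: districts -> level 2: wards -> level 3: streets, and so on.
    foreach (getLinks('http://lpsni.gov.uk/vListDCV/districts.asp') as $district) {
        // Naive relative-URL join; check what the real hrefs look like.
        foreach (getLinks('http://lpsni.gov.uk/vListDCV/' . $district['href']) as $ward) {
            // ...keep drilling down, then store the rates values in your database.
        }
    }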
Scraping data with PHP is lengthy and code-driven: you write the script and extract the data from the site yourself. I would suggest automating the process, though.

I want to create an RSS feed that is customizable

I want to create a dropdown of RSS feeds so users can pick and choose the feeds they want, and a custom feed would be created. Is this possible using straight-up HTML and JavaScript, or do I need a server technology? There are 7 separate feeds, so the possible combinations are 7! - far too many for me to individually code into if statements and separate feeds. Is there a program that will generate the possible feeds for me automatically after I update one of them? Then I could just upload the updated XML files.
Right. So I set up my XML files; say I have one for birthdays, one for deaths, and one for mid-life crises. That is three XML files with three separate links for RSS feeds. Now what I want is for people to be able to check off the ones they wish to subscribe to rather than hitting each one separately. So I would have a form with three checkboxes and a submit button. I could do this with JavaScript by having 6 separate XML feeds, one for each possible combination. But if I have 4 feeds then I need to set up 24 feeds, and 5 would be 120 possible feed combinations.
So the question becomes: is there some software or library that will handle this computation for me and crank out RSS mixes/blends, similar to what some RSS mixing software seems to do? The problem with the services and software I have seen is that they provide blending for people subscribing to feeds, but not for providers. I can see in my head how easily this could be done programmatically, even though it would spit out a lot of XML and HTML/JavaScript.
I guess another way to go about it would be for people to sign up for multiple feeds simultaneously, but I'm not sure whether that can be done.
If I'm making no sense, I apologize. I have never seen this done, so it might not be possible. For now I'll just go with a page containing a bunch of RSS links.
Thanks for everyone's responses. I appreciate it.
Just because there are 7 options doesn't mean you need to write 7! if statements. You only need to check whether each of the options is set, and output something appropriate.
So, yes, you need to do this server side. And it's not at all difficult.
Where are you stuck, specifically? Your question is missing a few details.
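For what it's worth, a minimal server-side sketch in PHP, assuming a GET form whose checkbox names match the keys below and hypothetical feed file names; it simply copies the items of whichever feeds were ticked into one output feed:

    <?php
    // One checkbox per feed; merge the items of whichever feeds were selected.
    $available = [
        'birthdays' => 'birthdays.xml',
        'deaths'    => 'deaths.xml',
        'crises'    => 'midlife-crises.xml',
    ];

    header('Content-Type: application/rss+xml; charset=utf-8');
    echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    echo '<rss version="2.0"><channel><title>Custom feed</title>' . "\n";

    foreach ($available as $key => $file) {
        if (empty($_GET[$key])) {
            continue;                            // checkbox not ticked, skip this feed
        }
        $rss = simplexml_load_file($file);
        if ($rss === false) {
            continue;
        }
        foreach ($rss->channel->item as $item) {
            echo $item->asXML() . "\n";          // copy the <item> through unchanged
        }
    }

    echo '</channel></rss>';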

What is the best way to store site configuration data?

I have a question about storing site configuration data.
We have a platform for web applications. The idea is that different clients can have their data hosted and displayed on their own site which sits on top of this platform. Each site has a configuration which determines which panels relevant to the client appear on which pages.
The system was originally designed to keep all the configuration data for each site in a database. When the site is loaded, all the configuration data is loaded into a SiteConfiguration object, and the client's panels are generated based on the content of this object. This works, but I find it very difficult to work with when applying change requests or adding new sites, because there is so much data to sift through and it's difficult to maintain a mental model of the site and its configuration.
Recently I've been tasked with developing a subset of some of the sites to be generated as PDF documents for printing. I decided to take a different approach to how I would define the configuration in that instead of storing configuration data in the database, I wrote XML files to contain the data. I find it much easier to work with because instead of reading meaningless rows of data which are related to other meaningless rows of data, I have meaningful documents with semantic, readable information with the relationships defined by visually understandable element nesting.
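For illustration, a hypothetical fragment of such a file (the element names are invented; the readability comes from the nesting):

    <!-- Hypothetical site configuration fragment; element names are illustrative -->
    <site name="client-a">
      <page path="/home">
        <panel type="news" position="main"/>
        <panel type="contact-details" position="sidebar"/>
      </page>
      <page path="/reports">
        <panel type="pdf-export" position="main"/>
      </page>
    </site>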
So now, with these two approaches to storing site configuration data, I'd like to get the opinions of people more experienced in dealing with this issue. What is the best way of storing site configuration data? Is there a better way than the two I outlined here?
note: StackOverflow is telling me the question appears to be subjective and is likely to be closed. I'm not trying to be subjective. I'd like to know how best to approach this issue next time and if people with industry experience on this could provide some input.
If the information is needed for per-client configuration, it is probably best kept in a database with an admin tool written for it, so that non-technical people can also manage it. It's also easier that way when you need versioning/history; XML isn't always the best for that, and XML is harder to maintain in the end (for non-technical people).
Do you read the XML from disk every time (a performance hit), or do you keep it cached in memory? Whichever solution you choose, caching makes a big difference for performance in the end.
Grz, Kris.
You're using ASP.NET, so what's wrong with web.config for your basic settings (if it's a per-project deployment), and then, as you've said, custom XML or database configuration settings for anything more complicated (or if you have multiple users/clients on the same deployment)?
I'd only use custom XML documents for something like a "site layout document" where things won't change that often and you're going to have lots of semi-meaningless data (e.g. 23553123). And layout should be handled by CSS as much as possible anyway.
For our team XML is a good choice (app.config, web.config, or a custom configuration file, depending on the case), but sometimes it is better to design a configuration API so that configuration is done in code. For example, modern IoC containers have in-code configuration APIs with fluent interfaces. This approach pays off when you need to configure many similar entities or want good human readability. But it doesn't work if non-programmers need to make the configuration.
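Purely as an illustration of that "configuration in code" idea, a minimal fluent-interface sketch (class and method names are invented; shown in PHP for brevity, although .NET IoC containers read much the same way):

    <?php
    // A tiny fluent builder: each call returns $this so calls can be chained.
    class SiteConfig
    {
        private $panels = [];
        private $current;

        public function panel($name)
        {
            $this->current = $name;
            $this->panels[$name] = [];
            return $this;                        // chaining happens here
        }

        public function onPage($page)
        {
            $this->panels[$this->current][] = $page;
            return $this;
        }

        public function toArray()
        {
            return $this->panels;
        }
    }

    $config = (new SiteConfig())
        ->panel('news')->onPage('home')->onPage('about')
        ->panel('contact-form')->onPage('contact');

    print_r($config->toArray());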

Resources