I am trying to find or build a web scraper that can go through and find every state/national park in the US, along with their GPS coordinates and land areas. I have looked into frameworks like Scrapy, and I see there are sites specifically for Wikipedia data, such as http://wiki.dbpedia.org/About. Is there any specific advantage to either of these, and would either work better for loading the information into an online database?
Let's suppose you want to parse pages like this Wikipedia page. The following code, which uses Html Agility Pack, should work.
// Requires Html Agility Pack (HtmlDocument) and System.Web (HttpUtility).
var doc = new HtmlDocument();
doc = .. //Load the document here. See doc.Load(..), doc.LoadHtml(..), etc.

//Get all the rows from the sortable table (skipping the header row)
var rows = doc.DocumentNode.SelectNodes("//table[contains(@class, 'sortable')]//tr").Skip(1);

foreach (var row in rows) {
    //First cell: the park name is the text of its link
    var name = HttpUtility.HtmlDecode(row.SelectSingleNode("./*[1]/a[@href and @title]").InnerText);
    //The coordinates live in a span with class 'geo-dec'
    var loc = HttpUtility.HtmlDecode(row.SelectSingleNode(".//span[@class='geo-dec']").InnerText);
    //Fifth cell: the area; skip the first child node and concatenate the rest
    var areaNodes = row.SelectSingleNode("./*[5]").ChildNodes.Skip(1);
    string area = "";
    foreach (var a in areaNodes) {
        area += HttpUtility.HtmlDecode(a.InnerText);
    }
    Console.WriteLine("{0,-30} {1,-20} {2,-10}", name, loc, area);
}
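To load the document directly from the web, a minimal sketch (HtmlWeb is also part of Html Agility Pack; the URL is my assumption of the list page being parsed, based on the output below):

var url = "https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States";
var doc = new HtmlWeb().Load(url);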
I tested it, and it produces the following output:
Acadia 44.35°N 68.21°W 47,389.67 acres (191.8 km2)
American Samoa 14.25°S 170.68°W 9,000.00 acres (36.4 km2)
Arches 38.68°N 109.57°W 76,518.98 acres (309.7 km2)
Badlands 43.75°N 102.50°W 242,755.94 acres (982.4 km2)
Big Bend 29.25°N 103.25°W 801,163.21 acres (3,242.2 km2)
Biscayne 25.65°N 80.08°W 172,924.07 acres (699.8 km2)
Black Canyon of the Gunnison 38.57°N 107.72°W 32,950.03 acres (133.3 km2)
Bryce Canyon 37.57°N 112.18°W 35,835.08 acres (145.0 km2)
Canyonlands 38.2°N 109.93°W 337,597.83 acres (1,366.2 km2)
Capitol Reef 38.20°N 111.17°W 241,904.26 acres (979.0 km2)
Carlsbad Caverns 32.17°N 104.44°W 46,766.45 acres (189.3 km2)
Channel Islands 34.01°N 119.42°W 249,561.00 acres (1,009.9 km2)
Congaree 33.78°N 80.78°W 26,545.86 acres (107.4 km2)
Crater Lake 42.94°N 122.1°W 183,224.05 acres (741.5 km2)
Cuyahoga Valley 41.24°N 81.55°W 32,860.73 acres (133.0 km2)
Death Valley 36.24°N 116.82°W 3,372,401.96 acres (13,647.6 km2)
Denali 63.33°N 150.50°W 4,740,911.72 acres (19,185.8 km2)
Dry Tortugas 24.63°N 82.87°W 64,701.22 acres (261.8 km2)
Everglades 25.32°N 80.93°W 1,508,537.90 acres (6,104.8 km2)
Gates of the Arctic 67.78°N 153.30°W 7,523,897.74 acres (30,448.1 km2)
Glacier 48.80°N 114.00°W 1,013,572.41 acres (4,101.8 km2)
(...)
I think that's a start. If some page fails, check whether its layout differs, etc.
Of course, you will also have to find a way of obtaining all the links you want to parse.
One important thing: Do you know if it is permitted to scrape Wikipedia? I have no idea, but you should check before doing it... ;)
Though the question is a little old, another alternative available right now is to avoid scraping altogether and get the raw data directly from protectedplanet.net - it contains data from the World Database on Protected Areas and the UN List of Protected Areas. (Disclosure: I worked for UNEP-WCMC, the organisation that produces and maintains the database and the website.)
It's free for non-commercial use, but you'll need to register to download. For example, this page lets you download 22,600 protected areas in the USA as KMZ, CSV and SHP (containing lat/lng, boundaries, IUCN category and a bunch of other metadata).
I would not consider this the best approach.
My idea would be to go to the API from openstreetmap.org (or any other geo-based API that you can query) and ask it for the data you want. National parks are likely to be found pretty easily. You can get the names from a source like Wikipedia and then ask any of the geo APIs to give you the information you want.
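For instance, a rough sketch of asking OSM's Overpass API for U.S. national parks (boundary=national_park is a standard OSM tag; treat the query and endpoint as a starting point, not a definitive recipe):

// using System.Net.Http; using System.Text;
// Ask the public Overpass API for relations tagged boundary=national_park
// inside the United States; "out center" returns one center point per relation.
var query = @"[out:json][timeout:60];
area[""ISO3166-1""=""US""][admin_level=2]->.us;
relation[""boundary""=""national_park""](area.us);
out center;";

using var client = new HttpClient();
var content = new StringContent("data=" + Uri.EscapeDataString(query),
    Encoding.UTF8, "application/x-www-form-urlencoded");
var response = await client.PostAsync("https://overpass-api.de/api/interpreter", content);
Console.WriteLine(await response.Content.ReadAsStringAsync());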
BTW, what's wrong with Wikipedia's List of National Parks?
I am using the newsanchor package in R to try to extract entire article contents via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a data frame full of info on (at most) 100 articles. These, however, do not contain the entire article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5,173 characters? I have tried to read the documentation, but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, in the Response object section it says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is restricted to only 260 characters. However, test$url has the link to the source article, which you can use to scrape the entire content; but since the articles are aggregated from various sources, I don't think there is one automated way to do this.
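As a rough sketch of that route (assuming, optimistically, that an article's body sits in <p> tags; real sites will each need their own selectors):

library(rvest)

# Fetch a page and concatenate its paragraph text; NA on failure.
get_article_text <- function(url) {
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)
  paste(html_text(html_nodes(page, "p")), collapse = "\n")
}

# Apply to every article URL returned by get_everything()
full_texts <- vapply(test$url, get_article_text, character(1))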
When using Sabre APIs, is there any reliable indicator available in a Sabre TravelItineraryReadRS (or GetReservation) or other API that indicates whether a flight is international or domestic?
I want to avoid adding complexity and having to maintain a separate list of airport codes and countries if possible, and instead just use an indicator from a response.
I've checked <FlightSegment> in <PTC_FareBreakdown> but nothing seems to indicate internationality:
<tir39:FlightSegment ConnectionInd="O" DepartureDateTime="02-24T13:00" FlightNumber="123" ResBookDesigCode="E" SegmentNumber="1" Status="SS">
<tir39:BaggageAllowance Number="01P"/>
<tir39:FareBasis Code="AFB112"/>
<tir39:MarketingAirline Code="VA" FlightNumber="123"/>
<tir39:OriginLocation LocationCode="BNE"/>
<tir39:ValidityDates>
<tir39:NotValidAfter>2019-02-24</tir39:NotValidAfter>
<tir39:NotValidBefore>2019-02-24</tir39:NotValidBefore>
</tir39:ValidityDates>
</tir39:FlightSegment>
and also checked in <ReservationItems><Item>, e.g.:
<tir39:Item RPH="1">
<tir39:FlightSegment AirMilesFlown="0466" ArrivalDateTime="05-18T14:40" DayOfWeekInd="6" DepartureDateTime="2019-05-18T13:05" SegmentBookedDate="2018-12-21T11:20:00" ElapsedTime="01.35" eTicket="true" FlightNumber="0529" NumberInParty="01" ResBookDesigCode="E" SegmentNumber="0001" SmokingAllowed="false" SpecialMeal="false" Status="HK" StopQuantity="00" IsPast="false" CodeShare="false" Id="123">
<tir39:DestinationLocation LocationCode="SYD" Terminal="TERMINAL 3 DOMESTIC" TerminalCode="3"/>
<tir39:Equipment AirEquipType="21B"/>
<tir39:MarketingAirline Code="QF" FlightNumber="0529">
<tir39:Banner>MARKETED BY QANTAS AIRWAYS</tir39:Banner>
</tir39:MarketingAirline>
<tir39:Meal Code="L"/>
<tir39:OperatingAirline Code="QF" FlightNumber="0529" ResBookDesigCode="E">
<tir39:Banner>OPERATED BY QANTAS AIRWAYS</tir39:Banner>
</tir39:OperatingAirline>
<tir39:OperatingAirlinePricing Code="QF"/>
<tir39:DisclosureCarrier Code="QF" DOT="false">
<tir39:Banner>QANTAS AIRWAYS</tir39:Banner>
</tir39:DisclosureCarrier>
<tir39:OriginLocation LocationCode="BNE" Terminal="DOMESTIC" TerminalCode="D"/>
<tir39:UpdatedArrivalTime>05-18T14:40</tir39:UpdatedArrivalTime>
<tir39:UpdatedDepartureTime>05-18T13:05</tir39:UpdatedDepartureTime>
</tir39:FlightSegment>
</tir39:Item>
and although these have origin/destination airports, neither indicates whether the flight is international, and the terminal name is not reliable as an indicator.
<PriceQuotePlus> has a DomesticIntlInd attribute that initially looked useful:
<tir39:PriceQuotePlus DomesticIntlInd="I" PricingStatus="S" VerifyFareCalc="false" ItineraryChanged="false" ...>
but PriceQuotePlus, and therefore DomesticIntlInd, does not seem to be present in all circumstances; e.g. I have TravelItineraryReadRS responses that contain no PriceQuotePlus element but still contain ReservationItems/Item/FlightSegment elements that I need to identify as international or domestic.
Not only that: as an example, I have a reservation where DomesticIntlInd is set to "I" even though it has no international flight (it has only one flight, and that flight is domestic, BNE-SYD).
Any other thoughts on where I might find a reliable international flight indicator or is this functionality simply not available?
Sabre does expose a City Pairs API that includes country codes for each airport, which you could use to infer whether a flight started and ended in the same country.
They also expose this as a list that you could build into your own data table, but the API would probably be more future-proof.
The current file can be found here, but I don't know if that link will work forever.
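Once you have a country code per airport (from the City Pairs data or your own table), the check itself is simple. A sketch with an illustrative, hard-coded lookup (not real reference data):

// Illustrative only; in practice, populate this from the City Pairs API/file.
var airportCountry = new Dictionary<string, string> {
    ["BNE"] = "AU",
    ["SYD"] = "AU",
    ["LAX"] = "US",
};

bool IsInternational(string origin, string destination) =>
    airportCountry[origin] != airportCountry[destination];

Console.WriteLine(IsInternational("BNE", "SYD")); // False (domestic)
Console.WriteLine(IsInternational("BNE", "LAX")); // True (international)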
Is there any API to get geocode for airport code?
For example: if I need to calculate the travel time from home (say it's Malibu) to LAX (Los Angeles Intl. Airport), ideally I would follow the steps below:
Get my home address geo location(via geocoder)
Get LAX geo location(via geocoder)
Use the above as source and destination in "calculateroute".
However, when I use "LAX" in the geocoder, it gives some place in Switzerland (CHE). If I append the country (USA), it lists some other place in Georgia:
https://geocoder.api.here.com/6.2/geocode.json?app_id=MY-APP-ID&app_code=MY-APP-CODE&gen=9&searchtext=LAX
https://geocoder.api.here.com/6.2/geocode.json?app_id=MY-APP-ID&app_code=MY-APP-CODE&gen=9&searchtext=LAX,USA
Is there any alternate way to do this, or is the only way to maintain a map of IATA airport codes with their geo coordinates and use it directly in calculateroute?
To get the geocode of an airport, use landmark geocoding (categoryids=4581). From the parameter documentation:
categoryids
xs:integer
Limit landmark results to one or more categories. Examples:
Highway exits: 116
Airports: 4581
Tourist attractions: 7999
Example:
http://geocoder.api.here.com/6.2/search.json?categoryids=4581&gen=8&jsonattributes=1&language=en-US&maxresults=20&searchtext=LAX&app_id={YOUR_APP_ID}&app_code={YOUR_APP_CODE}
Read more at developer.here.com/documentation/geocoder/topics/resource-search.html
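A minimal sketch of calling that endpoint from C# (YOUR_APP_ID and YOUR_APP_CODE are placeholders; if I recall the 6.2 response shape correctly, the coordinates come back nested under response.view[].result[].location.displayPosition, which you can then feed to calculateroute):

// using System.Net.Http;
using var client = new HttpClient();
var url = "https://geocoder.api.here.com/6.2/search.json" +
          "?categoryids=4581&gen=8&jsonattributes=1&language=en-US" +
          "&maxresults=1&searchtext=LAX" +
          "&app_id=YOUR_APP_ID&app_code=YOUR_APP_CODE";
var json = await client.GetStringAsync(url);
Console.WriteLine(json); // inspect, then parse out displayPosition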
I am new to MarkLogic. I need to get the total count of books from the following XML. Can anyone make a suggestion?
<bk:bookstore xmlns:bk="http://www.bookstore.org">
<bk:book category='Computer'>
<bk:author>Gambardella, Matthew</bk:author>
<bk:title>XML Developer's Guide</bk:title>
<bk:price>44.95</bk:price>
<bk:publish_year>1995</bk:publish_year>
<bk:description>An in-depth look at creating applications with XML.
</bk:description>
</bk:book>
<bk:book category='Fantasy'>
<bk:author>Ralls, Kim</bk:author>
<bk:title>Midnight Rain</bk:title>
<bk:price>5.95</bk:price>
<bk:publish_year>2000</bk:publish_year>
<bk:description>A former architect battles corporate zombies, an evil
sorceress, and her own childhood to become queen of the world.
</bk:description>
</bk:book>
<bk:book category='Comic'>
<bk:author>Robert M. Overstreet</bk:author>
<bk:title>The Overstreet Indian Arrowheads Identification </bk:title>
<bk:price>2000</bk:price>
<bk:publish_year>1991</bk:publish_year>
<bk:description>A leading expert and dedicated collector, Robert M.
Overstreet has been writing The Official Overstreet Identification and
Price
Guide to Indian Arrowheads for more than 21 years</bk:description>
</bk:book>
<bk:book category='Comic'>
<bk:author>Randall Fuller</bk:author>
<bk:title>The Book That Changed America</bk:title>
<bk:price>1000</bk:price>
<bk:publish_year>2017</bk:publish_year>
<bk:description>The New York Times Book Review Throughout its history
America has been torn in two by debates over ideals and beliefs.
</bk:description>
</bk:book>
</bk:bookstore>
Can anyone find a solution for this, as I am new to this?
I'd suggest using cts:count-aggregate in combination with cts:element-reference. This requires you to have an element range index on book.
cts:count-aggregate(cts:element-reference(fn:QName("http://www.bookstore.org", "book")))
If performance isn't too critical and your document count isn't too large, you could also count with fn:count.
declare namespace bk="http://www.bookstore.org";
fn:count(//bk:book)
Try this:
declare namespace bk="http://www.bookstore.org";
let $book_xml :=
  <bk:bookstore xmlns:bk="http://www.bookstore.org">
    <bk:book category='Computer'>
      ........
    </bk:book>
    ........
  </bk:bookstore>
return fn:count($book_xml//bk:book)
Hope that helps!
I am doing text mining on tweets. I have collected random tweets from different accounts about some topic and transformed them into a data frame. I was able to find the most frequent tweeters among those tweets (by using the "screenName" column)... like these tweets:
[1] "ISCSP_ORG: #cybercrime NetSafe publishes guide to phishing:
Auckland, Monday 04 June 2013 – Most New Zealanders will have...
http://t.co/dFLyOO0Djf"
[1] "ISCSP_ORG: #cybercrime Business Briefs: MILL CREEK — H.M. Jackson
High School DECA chapter members earned the organizatio...
http://t.co/auqL6mP7AQ"
[1] "BNDarticles: How do you protect your #smallbiz from #cybercrime?
Here are the top 3 new ways they get in & how to stop them.
http://t.co/DME9q30mcu"
[1] "TweetMoNowNa: RT #jamescollinss: #senatormbishop It's the same
problem I've been having in my fight against #cybercrime. \"Vested
Interests\" - Tell me if …"
[1] "jamescollinss: #senatormbishop It's the same problem I've been
having in my fight against #cybercrime. \"Vested Interests\" - Tell me
if you work out a way!"
There are different tweeters that have sent many tweets in the collected dataset.
Now I want to collect/group the related tweets by their corresponding tweeter/user.
Is there any way to do it using R? Any suggestions? Your help would be very much appreciated.
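For what it's worth, a minimal base-R sketch of the grouping step, assuming your data frame is called tweets_df and has screenName and text columns (adjust the names to your data):

# One character vector of tweet texts per screen name
grouped <- split(tweets_df$text, tweets_df$screenName)

# All tweets from a single user:
grouped[["jamescollinss"]]

# Tweet counts per user, most frequent first:
sort(table(tweets_df$screenName), decreasing = TRUE)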