Rvest html_nodes span once and XPath - r

I would like to collect parish.name from masstimes.org
I used SelectorGadget. The CSS selector 'span', converted to XPath, is //span.
The data I would like to collect is here: <span once="parish.name" class="">Saint John Chrysostom [Ruthenian]</span>
I'm not sure what the html_nodes command should look like.
Thanks,

JavaScript is required to display the results, so you won't scrape anything with rvest. You should use RSelenium. As an alternative, you can download the JSON loaded in the background to fetch the data.
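If you do go the RSelenium route, here is a minimal sketch. It assumes a local Firefox driver can be started and that the rendered page contains the <span once="parish.name"> markup shown in the question; you would still need to drive the site's search UI to reach a results page.
library(RSelenium)
library(rvest)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
remDr$navigate("https://masstimes.org/")
Sys.sleep(5)  # crude wait for the JavaScript to render
page <- read_html(remDr$getPageSource()[[1]])
html_text(html_nodes(page, "span[once='parish.name']"))
remDr$close()
driver$server$stop()
The rest of this answer follows the JSON route.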
First, you need to obtain the lat and long of the city you're looking for. The website uses the ArcGIS API to get them. For example, for New York the GET url is:
https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&SingleLine=New%20York,%20NY,%20USA
Output :
{"spatialReference":{"wkid":4326,"latestWkid":4326},"candidates":[{"address":"New York","location":{"x":-74.007139999999936,"y":40.714550000000031},"score":100,"attributes":{},"extent":{"xmin":-74.257139999999936,"ymin":40.464550000000031,"xmax":-73.757139999999936,"ymax":40.964550000000031}}]}
From this output, lat is 40.715 (rounded to 3 digits) and long is -74.007. Use GET() and content(..., as = "text") from the httr package to load the file into R, and str_extract_all() from stringr to extract these two values.
Example: I'm looking for Paris, France. We modify the URL (putting Paris and FRA into it) to get the JSON, then store its content. We extract lat and long, then construct the request URL with paste0().
data=GET("https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&SingleLine=Paris,%20FRA")
parse=content(data,as="text")
lat = round(as.numeric(str_extract_all(parse,"\\d+\\.\\d+")[[1]][2]),digits = 3)
long = round(as.numeric(str_extract_all(parse,"\\d+\\.\\d+")[[1]][1]),digits = 3)
paste0("https://apiv4.updateparishdata.org/Churchs/?lat=",lat,"&long=",long,"&pg=1")
Output :
https://apiv4.updateparishdata.org/Churchs/?lat=48.857&long=2.341&pg=1
You can also manually lookup these values with your favorite search engine.
Once you get them, you can construct your request url. Like the following one :
https://apiv4.updateparishdata.org/Churchs/?lat=40.091&long=-82.95&pg=1
where lat and long are the values you've found and pg is the page number (30 results per page).
Load the JSON into R with jsonlite (a data frame is created) and extract the column of interest:
library(jsonlite)
mydata <- fromJSON("https://apiv4.updateparishdata.org/Churchs/?lat=40.091&long=-82.95&pg=1")
mydata$name
Output :
[1] "Saint John Chrysostom [Ruthenian]" "Saint Elizabeth"
[3] "Saint Anthony" "St. Matthias"
[5] "Saint Paul" "Holy Resurrection [Melkite]"
[7] "St. Michael" "St. James the Less"
[9] "Our Lady of Peace" "Immaculate Conception"
[11] "Saints Augustine and Gabriel" "Saint Peter"
[13] "Holy Name" "Saint Timothy"
[15] "Ohio Dominican University" "St. Thomas More Newman Center"
[17] "Church of the Resurrection" "Saint Andrew Roman Catholic Church "
[19] "Saint Matthew" "Saint Joan of Arc Catholic Church"
[21] "St. Thomas the Apostle" "St. Agatha"
[23] "Sacred Heart Church" "Saint Dominic"
[25] "Saint John the Baptist " "Saint Francis of Assisi"
[27] "Saint Patrick" "Holy Spirit"
[29] "Saint Brendan" "Saint Christopher"

Related

USArrests data.frame in R - which state (row) presents the smallest and the largest crime rate (column)

I am using the USArrests data.frame in R and I need to see for each crime (Murder, Assault and Rape) which state presents the smallest and the largest crime rate.
I guess I have to calculate the max and min for each crime and I have done that.
which(USArrests$Murder == min(USArrests$Murder))
[1] 34
The problem is that I cannot retrieve the state name in row 34, only the whole row:
USArrests[34,]
             Murder Assault UrbanPop Rape
North Dakota    0.8      45       44  7.3
I am just starting using R so can anyone help me please?
I would usually suggest taking a different approach to a problem like this, but for ease I'm going to offer the following solution, and I may come back later with a better-thought-out way.
You can use the attributes() function to see particular 'attributes' of a dataframe.
Eg:
attributes(USArrests)
will give you the following output.
$names
[1] "Murder" "Assault" "UrbanPop" "Rape"
$class
[1] "data.frame"
$row.names
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "Florida" "Georgia" "Hawaii" "Idaho"
[13] "Illinois" "Indiana" "Iowa" "Kansas" "Kentucky" "Louisiana"
[19] "Maine" "Maryland" "Massachusetts" "Michigan" "Minnesota" "Mississippi"
[25] "Missouri" "Montana" "Nebraska" "Nevada" "New Hampshire" "New Jersey"
[31] "New Mexico" "New York" "North Carolina" "North Dakota" "Ohio" "Oklahoma"
[37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina" "South Dakota" "Tennessee"
[43] "Texas" "Utah" "Vermont" "Virginia" "Washington" "West Virginia"
[49] "Wisconsin" "Wyoming"
So now we know the data frame is composed of 'names' (the column names), 'row.names' (the state names), and that its 'class' is data.frame. As a newcomer to R, it is important to note that in the results above, the bracketed index is only given for the first item on each line. This will make more sense in the last step.
Using this knowledge we can use attributes to find just the states by doing the following:
attributes(USArrests)$row.names
To find the 34th state in the list, which you have identified as North Dakota, we can simply give the row id for that state, as per below.
attributes(USArrests)$row.names[34]
Which will give you....
[1] "North Dakota"
Again, this is probably not the most elegant way of doing this, but it will work for your scenario.
Hope this helps and happy coding.
EDIT
As I mentioned, there's usually a more elegant, performant and efficient way of doing things. Here is another such way of achieving your goal.
row.names(USArrests)[which.min(USArrests$Murder)]
You'll probably be able to see instantly what is happening here, but essentially, we're asking for the row name associated with the lowest value for the Murder charge. Again this gives...
[1] "North Dakota"
You can now apply this logic to find the states with the max & min crime rates for each offence. Eg, for max Assaults
row.names(USArrests)[which.max(USArrests$Assault)]
Giving...
[1] "North Carolina"
It appears that the State name is stored as a rowname. You can access the rownames of a dataframe using the rownames function.
To find the element which has the lowest value in the vector-column, you can use the which.min function.
We have indeed:
> USArrests[which.min(USArrests$Murder), "Murder"]
[1] 0.8
Hence, your command becomes:
> rownames(USArrests)[which.min(USArrests$Murder)]
[1] "North Dakota"

Use textConnection and scan to switch pasted character data to a vector

I want to use textConnection and scan in R to convert a pasted character dataset into a character vector for use as row.names.
My little example is as follows:
x = textConnection('
Arcadia
Bryce Canyon
Cuyahoga Valley
Everglades
Grand Canyon
Grand Teton
Great Smoky
Hot Springs
Olympic
Mount Rainier
Rocky Mountain
Shenandoah
Yellowstone
Yosemite
Zion
')
scan(x,character(0))
Each line of the dataset represents a place, so I expect a character vector of length 15.
However, scan(x,character(0)) gives
Read 23 items
[1] "Arcadia" "Bryce" "Canyon" "Cuyahoga" "Valley"
[6] "Everglades" "Grand" "Canyon" "Grand" "Teton"
[11] "Great" "Smoky" "Hot" "Springs" "Olympic"
[16] "Mount" "Rainier" "Rocky" "Mountain" "Shenandoah"
[21] "Yellowstone" "Yosemite" "Zion"
I then tried scan(x,character(0),seq='\n'), but it also didn't work! Any help?
The parameter is called sep (not seq!). We need to set it to "\n" so that scan treats only line breaks, and not every whitespace character, as the delimiter.
From ?scan:
sep: by default, scan expects to read 'white-space' delimited input fields. Alternatively, sep can be used to specify a character which delimits fields. A field is always delimited by an end-of-line marker unless it is quoted. If specified this should be the empty character string (the default) or NULL or a character string containing just one single-byte character.
x = textConnection('
Arcadia
Bryce Canyon
Cuyahoga Valley
Everglades
Grand Canyon
Grand Teton
Great Smoky
Hot Springs
Olympic
Mount Rainier
Rocky Mountain
Shenandoah
Yellowstone
Yosemite
Zion
')
scan(x,character(0), sep="\n")
Returns:
Read 15 items
[1] "Arcadia" "Bryce Canyon" "Cuyahoga Valley" "Everglades"
[5] "Grand Canyon" "Grand Teton" "Great Smoky" "Hot Springs"
[9] "Olympic" "Mount Rainier" "Rocky Mountain" "Shenandoah"
[13] "Yellowstone" "Yosemite" "Zion"

googleway, the radar argument is now deprecated

I'm attempting to use the googleway package in R to list retirement villages within a specified radius of a location. I get "The radar argument is now deprecated" as a message, and null results as a consequence.
library(googleway)
a <- google_places(location = c(-36.796578,174.768836),search_string = "Retirement Village",radius=10000, key = "key")
a$results$name
I would expect this to give me retirement villages within a 10 km radius; instead I get:
> library(googleway)
> a <- google_places(location = c(-36.796578,174.768836),search_string = "Retirement Village",radius=10000, key = "key")
The radar argument is now deprecated
> a$results$name
NULL
There's nothing wrong with the code you've written, and that 'message' you get is not an error, it's just a message, but it probably should be removed; I've opened an issue to remove it.
a <- google_places(
location = c(-36.796578,174.768836)
, search_string = "Retirement Village"
, radius = 10000
, key = "key"
)
place_name( a )
# [1] "Fairview Lifestyle Village" "Eastcliffe Retirement Village"
# [3] "Meadowbank Retirement Village" "The Poynton - Metlifecare Retirement Village"
# [5] "The Orchards - Metlifecare Retirement Village" "Bupa Hugh Green Care Home & Retirement Village"
# [7] "Bert Sutcliffe Retirement Village" "Grace Joel Retirement Village"
# [9] "Bupa Remuera Retirement Village and Care Home" "7 Saint Vincent - Metlifecare Retirement Village"
# [11] "Remuera Gardens Retirement Village" "William Sanders Retirement Village"
# [13] "Puriri Park Retirement Village" "Selwyn Village"
# [15] "Aria Bay Retirement Village" "Highgrove Village & Patrick Ferry House"
# [17] "Settlers Albany Retirement Village" "Knightsbridge Village"
# [19] "Remuera Rise" "Northbridge Residential Village"
Are you sure the API key you're using is enabled on the Places API?
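You can check this directly: the status field is returned by the Places API itself, and reads "REQUEST_DENIED" rather than "OK" when the key or API is not set up:
a <- google_places(location = c(-36.796578, 174.768836),
                   search_string = "Retirement Village",
                   radius = 10000, key = "key")
a$status  # "OK" when the request succeeded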

How to fetch headlines from google news using rvest R?

I want to fetch headlines from google news using rvest in R. I have done this so far
library(rvest)
url=read_html("https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president")
selector_name<-"r"
fnames<-html_nodes(x = url, css = selector_name) %>%
html_text()
but the result is
> fnames
character(0)
This is the inspect-element view of a headline:
<h3 class="r">Obama Addresses Racial Tensions at Celebration of African ...</h3>
How can I fetch the headlines from google news?
I think you are just missing a dot for the class name:
> headlines = read_html("https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president") %>%
html_nodes(".r") %>%
html_text()
> headlines
[1] "Iranian President: No American President Can Renegotiate the Now ..."
[2] "US: President Barack Obama vetoes 9/11 bill"
[3] "President Obama Wants Donald Trump to Visit New African ..."
[4] "President Obama: Discrimination Should Concern 'All Americans ..."
[5] "Conrad Black: The Middle East watches, and waits, for the next ..."
[6] "Putin's close friend: Donald Trump will be next US president"
[7] "US election 2016 polls and odds: Latest Donald Trump and Hillary ..."
[8] "US election: Ted Cruz endorses Donald Trump for president"
[9] "Obama – I'm proud of my 'African record' as US president"
[10] "Almost 6000 Americans Have Already Voted for President"
Well, you could do it this way:
library(rvest)
# the same search URL as in the question
link <- "https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president"
reviews <- link %>%
  read_html() %>%
  html_nodes(".g") %>%
  html_text()
Check via inspect element where the text (the headline) is present; in this case it is the class g. Then read the text within each node.
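If you also want each headline's link, a sketch along the same lines; note the h3.r a selector is an assumption based on the markup shown in the question:
library(rvest)
page <- read_html("https://www.google.com/search?hl=en&tbm=nws&authuser=0&q=american+president")
anchors <- html_nodes(page, "h3.r a")  # assumed: each <h3 class="r"> wraps an <a>
data.frame(headline = html_text(anchors),
           url = html_attr(anchors, "href"),
           stringsAsFactors = FALSE)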

With R and XML, can an XPath 1.0 expression eliminate duplicates in the content returned?

When I extract content from the following URL, using XPath 1.0, the cities that are returned contain duplicates, starting with Birmingham. (The complete set of values returned is more than 140, so I have truncated it.) Is there a way with the XPath expression to avoid the duplicates?
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xpathSApply(doc, "//div[#class = 'mm-location-usa']//a[position() < 12]", xmlValue, trim = TRUE)
[1] "Birmingham" "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno"
[7] "Irvine" "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego" "Birmingham"
[13] "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno" "Irvine"
[19] "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego"
Is there an XPath expression or work around along the lines of [not-duplicate()]?
Also, various [position() < X] permutations don't produce only the cities and only one instance of each. In fact, it's hard to figure out how positions are counted.
I would appreciate any guidance or finding out that the best I can do is limit the number of duplicates returned.
BTW, "XPath result with duplicates" is not the same problem, nor are the questions that pertain to duplicate nodes, e.g., "How do I identify duplicate nodes in XPath 1.0 using an XPathNavigator to evaluate?"
There is a function for this, distinct-values(), but unfortunately it is only available in XPath 2.0; in R, you are limited to XPath 1.0.
What you can do is
//div[@class = 'mm-location-usa']//a[position() < 12 and not(normalize-space(.) = normalize-space(following::a))]
What it does, in plain English:
Look for div elements, but only if their class attribute value equals "mm-location-usa". Look for descendant a elements of those div elements, but only if the a element's position is less than 12 and if the normalized text content of that a element does not equal the normalized text content of any a element that follows.
But it is a computationally intensive approach and not the most elegant one. I recommend you take jlhoward's solution.
Can't you just do it this way??
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xPath <- "//div[#class = 'mm-location-usa']//a[position() < 12]"
unique(xpathSApply(doc, xPath, xmlValue, trim = TRUE))
# [1] "Birmingham" "Mobile" "Anchorage"
# [4] "Phoenix" "Fayetteville" "Fresno"
# [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
# [10] "Sacramento" "San Diego"
Or, you can just create an XPath to process the li tags in the first div (since they are duplicate divs):
xpathSApply(doc, "//div[#id='lmblocks-mega-menu---locations'][1]/
div[#class='mm-location-usa']/
ul/
li[#class='mm-list-item']", xmlValue, trim = TRUE)
## [1] "Birmingham" "Mobile" "Anchorage"
## [4] "Phoenix" "Fayetteville" "Fresno"
## [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
## [10] "Sacramento" "San Diego" "San Francisco"
## [13] "San Jose" "Santa Maria" "Walnut Creek"
## [16] "Denver" "New Haven" "Washington, DC"
## [19] "Miami" "Orlando" "Atlanta"
## [22] "Chicago" "Indianapolis" "Overland Park"
## [25] "Lexington" "Boston" "Detroit"
## [28] "Minneapolis" "Kansas City" "St. Louis"
## [31] "Las Vegas" "Reno" "Newark"
## [34] "Albuquerque" "Long Island" "New York"
## [37] "Rochester" "Charlotte" "Cleveland"
## [40] "Columbus" "Portland" "Philadelphia"
## [43] "Pittsburgh" "San Juan" "Providence"
## [46] "Columbia" "Memphis" "Nashville"
## [49] "Dallas" "Houston" "Tysons Corner"
## [52] "Seattle" "Morgantown" "Milwaukee"
I made an assumption here that you're going after US locations.
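For reference, the same unique() approach with the newer xml2 package, as a short sketch:
library(xml2)
doc <- read_html("http://www.littler.com/locations")
nodes <- xml_find_all(doc, "//div[@class = 'mm-location-usa']//a[position() < 12]")
unique(trimws(xml_text(nodes)))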
