Are there any website content monitoring packages in R?

I know there are free website content monitoring programs that send email alerts when the content of a website changes, but is there a package (or any way to hard-code this) in R that can do the same? It would be helpful to integrate this into one workflow.

R is a general purpose programming language so you can do anything with it.
The core idiom for what you're trying to do is (a minimal end-to-end sketch follows the list):
1. Identify the target site
2. Pull content & content metadata
3. Cache both (you need to figure this out: RDBMS tables? NoSQL tables? Files?)
4. Let n time periods pass (you need to figure this out: cron? launchd? Amazon Lambda?)
5. Pull content & content metadata again
6. Compare against the cached versions (NOTE: this works best if you know the structure of the target site, rather than relying on an overly generic framework)
7. If the difference is "significant", notify via whatever means you want (you need to figure this out: email? SMS? Twitter?)
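A minimal end-to-end sketch of that loop, assuming an RDS file as the local cache and a hypothetical notify() stand-in for whatever alerting mechanism you pick (all names here are illustrative):
library(httr)

target_url <- "https://www.whitehouse.gov/briefings-statements/"
cache_file <- "site_cache.rds"                      # hypothetical cache location

notify <- function(msg) message(msg)                # swap in email/SMS/Twitter later

res      <- httr::GET(target_url)
new_text <- httr::content(res, as = "text")

if (file.exists(cache_file)) {
  old_text <- readRDS(cache_file)
  if (!identical(old_text, new_text)) {             # "significant" is up to you
    notify("Page content changed since the last check")
  }
}

saveRDS(new_text, cache_file)                       # re-cache for the next scheduled run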
For content, you may not be aware that httr::GET() returns a rich, complex data object full of metadata. I did not do a str(res) below to encourage you to do so on your own.
library(httr)
library(rvest)
library(splashr)
library(hgr) # devtools::install_github("hrbrmstr/hgr")
library(tlsh) # devtools::install_github("hrbrmstr/tlsh")
target_url <- "https://www.whitehouse.gov/briefings-statements/"
# Get it like a browser would
httr::GET(
url = target_url,
httr::user_agent(splashr::ua_macos_safari)
) -> res
Cache page size and use a substantial difference to signal notification
(page_size <- res$headers['content-length'])
## $`content-length`
## [1] "12783"
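A short sketch of the comparison step, assuming old_size holds the previously cached value and treating a change of more than ~10% as "substantial" (the threshold is yours to pick):
old_size <- 12783                                   # previously cached content-length
new_size <- as.numeric(page_size[["content-length"]])

if (abs(new_size - old_size) / old_size > 0.10) {
  message("Page size changed substantially; time to notify")
}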
Calculate & cache a locality-sensitive hash value; use tlsh_simple_diff() to see whether there are "substantial" hash changes and use that as a signal to notify:
doc_text <- httr::content(res, as = "text")
(doc_hash <- tlsh_simple_hash(doc_text))
## [1] "563386E33C44683E060B739261ADF20CB2D38563EE151C88A3F95169999FF97A1F385D"
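And a hedged sketch of the hash comparison, assuming tlsh_simple_diff() takes the two hash strings and returns a numeric distance (larger = more different); check the package docs for the exact scale before picking a threshold:
old_hash <- doc_hash                                  # pretend this was cached on the previous run
new_hash <- tlsh_simple_hash(httr::content(res, as = "text"))

dist <- tlsh_simple_diff(old_hash, new_hash)          # assumed signature: (hash1, hash2) -> distance
if (dist > 30) {                                      # illustrative threshold only
  message("Locality-sensitive hash moved by ", dist, "; content likely changed")
}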
This site uses structured <div>s, so cache them and use more/fewer/different ones to signal notification:
doc <- httr::content(res)
news_items <- html_nodes(doc, "div.briefing-statement__content")
(total_news_items <- length(news_items))
## [1] 10
(headlines <- gsub("[[:space:]]+", " ", html_text(news_items, trim=TRUE)))
## [1] "News Clips CNBC: “Job Openings Hit Record 7.136 Million in August” Economy & Jobs Oct 16, 2018"
## [2] "Fact Sheets Congressional Democrats Want to Take Away Your Doctor, Outlaw Your Private Insurance, and Put Bureaucrats In Charge of Your Healthcare Healthcare Oct 16, 2018"
## [3] "Remarks Remarks by President Trump in Briefing on Hurricane Michael Land & Agriculture Oct 15, 2018"
## [4] "Remarks Remarks by President Trump and Governor Scott at FEMA Aid Distribution Center | Lynn Haven, FL Land & Agriculture Oct 15, 2018"
## [5] "Remarks Remarks by President Trump During Tour of Lynn Haven Community | Lynn Haven, FL Land & Agriculture Oct 15, 2018"
## [6] "Remarks Remarks by President Trump and Governor Scott Upon Arrival in Florida Land & Agriculture Oct 15, 2018"
## [7] "Remarks Remarks by President Trump Before Marine One Departure Foreign Policy Oct 15, 2018"
## [8] "Statements & Releases White House Appoints 2018-2019 Class of White House Fellows Oct 15, 2018"
## [9] "Statements & Releases President Donald J. Trump Approves Georgia Disaster Declaration Land & Agriculture Oct 14, 2018"
## [10] "Statements & Releases President Donald J. Trump Amends Florida Disaster Declaration Land & Agriculture Oct 14, 2018"
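Caching those headlines and diffing them on the next run is then a one-liner; a minimal sketch, assuming old_headlines was saved from the previous pull:
old_headlines <- headlines                            # pretend these were cached last time
new_headlines <- gsub("[[:space:]]+", " ",
                      html_text(html_nodes(doc, "div.briefing-statement__content"), trim = TRUE))

added   <- setdiff(new_headlines, old_headlines)      # items that appeared since last run
removed <- setdiff(old_headlines, new_headlines)      # items that dropped off the page

if (length(added) > 0) message(length(added), " new item(s): ", paste(added, collapse = " | "))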
Use a "readability" tool to turn the contents into plaintext, then cache & compare it with one of the many text-diff/string-diff R packages:
content_meta <- hgr::just_the_facts(target_url)
str(content_meta)
## List of 11
## $ title : chr "Briefings & Statements"
## $ content : chr "<p class=\"body-overflow\"> <header class=\"header\"> </header>\n<main id=\"main-content\"> <div class=\"page-r"| __truncated__
## $ lead_image_url: chr "https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png"
## $ next_page_url : chr "https://www.whitehouse.gov/briefings-statements/page/2"
## $ url : chr "https://www.whitehouse.gov/briefings-statements/"
## $ domain : chr "www.whitehouse.gov"
## $ excerpt : chr "Get official White House briefings, statements, and remarks from President Donald J. Trump and members of his Administration."
## $ word_count : int 22
## $ direction : chr "ltr"
## $ total_pages : int 2
## $ pages_rendered: int 2
## - attr(*, "row.names")= int 1
## - attr(*, "class")= chr "hgr"
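A sketch of the compare step for that plaintext, using base R's adist() (Levenshtein distance) as a stand-in for whichever diff package you prefer:
old_text <- content_meta$content                      # pretend this was cached previously
new_text <- hgr::just_the_facts(target_url)$content

# normalised edit distance between the two renderings
d <- adist(old_text, new_text)[1, 1] / max(nchar(old_text), nchar(new_text))
if (d > 0.05) message("Readable content changed by ~", round(100 * d), "%")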
Unfortunately, you asked a general purpose computing-ish question and, as such, it is likely to get closed.

Related

R splitting cells with complex pattern strings

I am web-scraping a website (https://pubmed.ncbi.nlm.nih.gov/) to build up a dataset out of it.
> str(result)
$ title : chr [1:4007]
$ authors : chr [1:4007]
$ cite : chr [1:4007]
$ PMID : chr [1:4007]
$ synopsis: chr [1:4007]
$ link : chr [1:4007]
$ abstract: chr [1:4007]
title authors cite PMID synop…¹ link abstr…
1 Food insecurity in baccalaureate nursing stude… Cocker… J Pr… 3386… METHOD… http… "Backg…
2 Household food insecurity and educational outc… Masa R… Publ… 3271… We mea… http… "Objec…
3 Food Insecurity and Food Label Comprehension a… Mansou… Nutr… 3437… Multiv… http… "Food …
4 The Household Food Security and Feeding Patter… Omachi… Nutr… 3623… Childr… http… "Child…
5 Food insecurity: Its prevalence and relationsh… Turnbu… J Hu… 3373… BACKGR… http… "Backg…
6 Cross-sectional Analysis of Food Insecurity an… Estrel… West… 3535… METHOD… http… "Intro…
Among the various variables I am creating, there is one that is giving me trouble (result$cite): it includes various pieces of information that I have to split up into different columns in order to get a clear overview of the data I need. Here is an example of some rows to show the difficulty I am facing. I have searched for similar issues but can't find anything fitting this.
1. Public Health. 2021 Sep;198:332-339. doi: 10.1016/j.puhe.2021.07.032. Epub 2021 Sep 9. PMID: 34509858
2. Public Health Nutr. 2021 Apr;24(6):1469-1477. doi: 10.1017/S1368980021000550. Epub 2021 Feb 9. PMID: 33557975
3. Clin Nutr ESPEN. 2022 Dec;52:229-239. doi: 10.1016/j.clnesp.2022.11.005. Epub 2022 Nov 10. PMID: 36513458
Given this, I would like to split result$cite into multiple columns in order to attain what follows:
Review               Date of publication   Page              doi       Reference              Epub          PMID
Public Health.       2021 Sep;             198:332-339.      10.1016   j.puhe.2021.07.032.    2021 Sep 9.   34509858
Public Health Nutr.  2021 Apr;             24(6):1469-1477.  10.1017   S1368980021000550.     2021 Feb 9.   33557975
Clin Nutr ESPEN.     2022 Dec;             52:229-239.       10.1016   j.clnesp.2022.11.005.  2022 Nov 10.  36513458
The main problem for me is that the strings are not regular, hence I can't find a pattern to split the cells into different columns. Any idea?
Does this work for you? (Since the OP did not provide data in a reproducible format, I've created a toy column id beside the cite column):
library(tidyverse)

result %>%
  extract(cite,
          into = c("Review", "Date of publication", "Page", "doi", "Reference", "Epub", "PMID"),
          regex = "\\d+\\. ([^.]+)\\. ([^;]+);([^.]+)\\. doi:([^/]+)/(\\S+)\\. Epub ([^.]+)\\. PMID: (\\d+)")
id Review Date of publication Page doi Reference Epub
1 1 Public Health 2021 Sep 198:332-339 10.1016 j.puhe.2021.07.032 2021 Sep 9
2 2 Public Health Nutr 2021 Apr 24(6):1469-1477 10.1017 S1368980021000550 2021 Feb 9
3 3 Clin Nutr ESPEN 2022 Dec 52:229-239 10.1016 j.clnesp.2022.11.005 2022 Nov 10
PMID
1 34509858
2 33557975
3 36513458
(NB: if there aren't leading numerics in cite, just remove \\d+\\. from the regex, including the whitespace at the end!)
The way extract works may look difficult to parse but is actually quite simple. Essentially, you do two things: (i) look at the strings and figure out how they are patterned, i.e., what rules they follow; (ii) in the regex argument, put everything you want to extract into distinct capture groups ((...)) and leave everything else outside of the capture groups.
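A toy illustration of that idea (invented data, nothing to do with the citations): the literal separators stay outside the capture groups, and each group becomes a column.
library(tidyr)

toy <- data.frame(x = c("apple: 3 (fresh)", "pear: 12 (ripe)"))

extract(toy, x,
        into  = c("fruit", "count", "state"),
        regex = "(\\w+): (\\d+) \\((\\w+)\\)")
#>   fruit count state
#> 1 apple     3 fresh
#> 2  pear    12  ripe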
Data (update #2):
result <- data.frame(
  id = 1:3,
  cite = c("1. Public Health. 2021 Sep;198:332-339. doi: 10.1016/j.puhe.2021.07.032. Epub 2021 Sep 9. PMID: 34509858",
           "2. Public Health Nutr. 2021 Apr;24(6):1469-1477. doi: 10.1017/S1368980021000550. Epub 2021 Feb 9. PMID: 33557975",
           "3. Clin Nutr ESPEN. 2022 Dec;52:229-239. doi: 10.1016/j.clnesp.2022.11.005. Epub 2022 Nov 10. PMID: 36513458")
)

Optional pattern part in regex lookbehind

In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part since . can also match 'of the United States'. Any ideas how to exclude it?
I guess more generally speaking, the question is how to include an optional 'element' into a lookbehind (which seems not to be possible since ? makes it a non-fixed length input).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
however the result is the same.
In this case, I would prefer using the perl engine, which is available in base R, rather than the ICU-library engine that stringr/stringi uses: with Perl-compatible regexes you can use \\K to discard everything matched up to that point, which sidesteps the fixed-width restriction on lookbehinds.
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
You can do this with str_match_all and group capture:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
  .[[1]] %>%
  .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"
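The (?:...) part matches the optional "of the United States" phrase without capturing it, so only the second group (the text up to and including the date) comes back from str_match_all().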

R to web scrape - using rvest - timeout error

library(rvest)
jobbank <- read_html("https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD=")
Error in open.connection(x, "rb") :
Timeout was reached: Connection timed out after 10015 milliseconds
jobbank %>%
html_node(".lmiBox") %>%
html_text()
Error in eval(lhs, parent, parent) : object 'jobbank' not found
I'm trying to find keywords in the news section of these websites, but it keeps showing me these two error messages.
Seems to be working fine on my side.
library(rvest)
#> Loading required package: xml2
library(stringr)

jobbank <- read_html("https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD=")

jobbank %>%
  html_node(".lmiBox") %>%
  html_text() %>%
  str_split("(\r\\n+\\s+)|(\\n\\s+)")
#> [[1]]
#> [1] ""
#> [2] "Week of Jan 14 - Jan 18, 2019Lowe's Canada is looking to hire about 2,650 full-time, part-time and seasonal staff at its stores in Ontario. The company will hold a National Hiring Day on February 23."
#> [3] "The Ministry of Innovation, Science, and Economic Development announced $5M in funding to support automotive innovation at APAG Elektronik Corp. and Service Mold + Aerospace Inc. in Windsor, creating 160 jobs"
#> [4] "A $1M investment by the provincial government into Kenora's Downtown Revitalization Project for a plaza and infrastructure upgrades will create 75 new jobs"
#> [5] "Redfin Corp., an American real estate brokerage, is expanding into Canada and hiring in Toronto"
#> [6] "The construction of townhomes at Walkerville Stones in Windsor is expected to begin this spring "
#> [7] "The Ontario Emerging Jobs Institute (OEJI) at the Nav Centre in Cornwall opened. The OEJI provides skills training in areas with worker shortages."
#> [8] "The Chartwell Meadowbrook Retirement Residence in Lively broke ground on their expansion project, which includes 41 new suites and 14 town homes"
#> [9] "Lambton College created an Information Technology and Communication Research Centre using a five-year, $2M grant from the Natural Sciences and Engineering Research Council of Canada. They hope to use part of the funding to employ students."
#> [10] "SnapCab, a workspace pod manufacturer in Kingston, has grown from 20 to 25 employees with more hiring expected to occur in 2019"
#> [11] "Niagara Pallet & Recyclers Ltd., a manufacturer of pallets and shipping materials in Smithville, is hiring general labour workers, AZ and DZ drivers, production staff, forklift drivers and saw operators"
#> [12] "A1 Demolition will begin demolition of the former Maliboo Club in Simcoe. The plan is to rebuild the structure with residential and commercial space."
#> [13] "MidiCi: The Neapolitan Pizza Co., Sweet Jesus, La Carnita and The Pie Commission will be among several restaurants opening in the 34,000-sq.-ft. Food District in Mississauga this spring "
#> [14] "Menkes Developments Ltd., in partnership with TD Greystone Asset Management, will renovate the former Canada Permanent Trust Building in Toronto. Work on the 270,000-sq.-ft. space is expected to take between 12 and 18 months."
#> [15] "Westmount Signs & Printing in Waterloo is hiring experienced installers after doubling the size of its workforce to 24 employees in the last year and a half"
#> [16] "Microbrewery, Heral Haus Brewing Co. opened in Stratford at the end of December"
#> [17] "Demolition is expected to start this month on Windsor's old City Hall and is expected to be complete by August"
#> [18] "Urban Planet, a clothing store, will open as early as February 2019 at the Cornwall Square mall in Cornwall"
#> [19] "The federal government committed $3.5M towards the construction of a new art gallery in Thunder Bay, bringing total government funding for the project to $27.5M"
#> [20] "The Rec Room, a 44,000-sq.-ft. entertainment complex by Cineplex Entertainment LP, is scheduled to open in Mississauga in March "
#> [21] "Yang Teashop opened a second location in Toronto with plans to open two more locations in the Greater Toronto Area"
#> [22] "Spacecraft Brewery opened in Sudbury"
#> [23] "The Town of Lakeshore will be accepting applications for 11 summer student positions until March 1"
#> [24] "Virtual reality arcade Cntrl V opened in Lindsay"
#> [25] "A new restaurant, Presqu'ile Café and Burger, opened in Brighton"
#> [26] "Beauty brand Morphe LLC opened a store in Mississauga"
#> [27] "Footwear retailer Brown Shoe Company of Canada Ltd. Inc. will open an outlet store in Halton Hills in April"
#> [28] "The Westdale Theatre in Hamilton is scheduled to reopen in February "
#> [29] "Early ON/Family Grouping will open a child care centre in Monkton"
#> [30] "The De Novo addiction treatment centre opened in Huntsville "
#> [31] "French Revolution Bakery & Crêperie opened in Dundas"
#> [32] "A Williams Fresh Cafe is slated to open in Stoney Creek, one of three new locations opening this year in southwestern Ontario"
#> [33] "Monigram Coffee Midtown cafe will open in Kitchener this winter "
#> [34] "My Roti Place opened a fourth restaurant in Toronto"
#> [35] "A Gangster Cheese restaurant opened in Whitby"
#> [36] "A Copper Branch restaurant opened in Mississauga "
#> [37] "Hallmark Canada will exit about 20 company-owned stores across Canada in 2019 by either transitioning them to independent ownership or closing them. The loacations of the affected stores have not been identified."
#> [38] "Lush Cosmetics at the Intercity Shopping Centre in Thunder Bay will close at the end of January"
#> [39] ""
Created on 2019-01-28 by the reprex package (v0.2.1)
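If the timeout keeps happening on your end (slow connection, proxy, or the site rate-limiting you), a hedged sketch of fetching with a longer timeout and a few retries via httr, then handing the text to rvest:
library(httr)
library(rvest)

url <- "https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD="

res <- RETRY("GET", url, times = 3, timeout(60))   # up to 3 attempts, 60 s each
jobbank <- read_html(content(res, as = "text"))

jobbank %>%
  html_node(".lmiBox") %>%
  html_text()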

Getting {xml_nodeset (0)} when using html_nodes from rvest package in R

I am trying to scrape headlines off a few news websites using the html_nodes() function and SelectorGadget, but find that some do not work, giving the result "{xml_nodeset (0)}". For example, the code below gives that result:
url_cnn = 'https://edition.cnn.com/'
webpage_cnn = read_html(url_cnn)
headlines_html_cnn = html_nodes(webpage_cnn,'.cd__headline-text')
headlines_html_cnn
I got the ".cd__headline-text" selector from SelectorGadget.
Other websites work such as:
url_cnbc = 'https://www.cnbc.com/world/?region=world'
webpage_cnbc = read_html(url_cnbc)
headlines_html_cnbc = html_nodes(webpage_cnbc,'.headline')
headlines_html_cnbc
This gives a full set of headlines. Any ideas why some websites return the "{xml_nodeset (0)}" result?
Please, please, please stop using Selector Gadget. I know Hadley swears by it but he's 100% wrong. What you see with Selector Gadget is what's been created in the DOM after javascript has been executed and other resources have been loaded asynchronously. Please use "View Source". That's what you get when you use read_html().
Having said that, I'm impressed CNN is as generous as they are (you def can scrape this page) and the content is most certainly on that page, just not rendered (which is likely even better):
Now, that's JavaScript, not JSON, so we'll need some help from the V8 package:
library(rvest)
library(V8)

ctx <- v8()

# get the page source
pg <- read_html("https://edition.cnn.com/")

# find the node with the data in a <script> tag
html_node(pg, xpath = ".//script[contains(., 'var CNN = CNN || {};CNN.isWebview')]") %>%
  html_text() %>%  # get the plaintext
  ctx$eval()       # send it to V8 to execute it

cnn <- ctx$get("CNN") # get the object the script above just created
After exploring the cnn object:
str(cnn[["contentModel"]][["siblings"]][["articleList"]], 1)
## 'data.frame': 55 obs. of 7 variables:
## $ uri : chr "/2018/11/16/politics/cia-assessment-khashoggi-assassination-saudi-arabia/index.html" "/2018/11/16/politics/hunt-crown-prince-saudi-un-resolution/index.html" "/2018/11/15/politics/us-khashoggi-sanctions/index.html" "/2018/11/15/middleeast/jamal-khashoggi-saudi-prosecutor-death-penalty-intl/index.html" ...
## $ headline : chr "<strong>CIA determines Saudi Crown Prince personally ordered journalist's death, senior US official says</strong>" "Saudi crown prince's 'fit' over UN resolution" "US issues sanctions on 17 Saudis over Khashoggi murder" "Saudi prosecutor seeks death penalty for Khashoggi killers" ...
## $ thumbnail : chr "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" ...
## $ duration : chr "" "" "" "" ...
## $ description: chr "The CIA has determined that Saudi Crown Prince Mohammed bin Salman personally ordered the killing of journalist"| __truncated__ "Multiple sources tell CNN that a much-anticipated United Nations Security Council resolution calling for a cess"| __truncated__ "The Trump administration on Thursday imposed penalties on 17 individuals over their alleged roles in the <a hre"| __truncated__ "Saudi prosecutors said Thursday they would seek the death penalty for five people allegedly involved in the mur"| __truncated__ ...
## $ layout : chr "" "" "" "" ...
## $ iconType : chr NA NA NA NA ...
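With that object in hand, pulling the headlines out is just data-frame indexing; a short usage sketch based on the str() output above (the first headline has its <strong> markup stripped):
articles <- cnn[["contentModel"]][["siblings"]][["articleList"]]

# strip the occasional inline markup from the headline field
headlines <- gsub("<[^>]+>", "", articles$headline)

head(headlines, 3)
## [1] "CIA determines Saudi Crown Prince personally ordered journalist's death, senior US official says"
## [2] "Saudi crown prince's 'fit' over UN resolution"
## [3] "US issues sanctions on 17 Saudis over Khashoggi murder"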

Select by Date or Sort by date on GoogleNewsSource R

I am using the R package tm.plugin.webmining. Using the function GoogleNewsSource() I would like to query the news sorted by date and also from a specific date. Is there any parameter to query the news of a specific date?
library(tm)
library(tm.plugin.webmining)
searchTerm <- "Data Mining"
corpusGoog <- WebCorpus(GoogleNewsSource(params = list(hl = "en", q = searchTerm,
                                                       ie = "utf-8", num = 10, output = "rss")))
headers <- meta(corpusGoog,tag="datetimestamp")
If you're looking for a data frame-like structure, this is how you'd go about creating it (note: not all fields are extracted from the corpus):
library(dplyr)
make_row <- function(elem) {
  data.frame(timestamp = elem[[2]]$datetimestamp,
             heading = elem[[2]]$heading,
             description = elem[[2]]$description,
             content = elem$content,
             stringsAsFactors = FALSE)
}
dat <- bind_rows(lapply(corpusGoog, make_row))
str(dat)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 4 variables:
## $ timestamp : POSIXct, format: "2015-02-03 13:08:16" "2015-01-11 23:37:45" ...
## $ heading : chr "A guide to data mining with Hadoop - Information Age" "Barack Obama to seek limits on student data mining - Politico" "Is data mining riddled with risk or a natural hazard of the internet? - INTHEBLACK" "Why an obscure British data-mining company is worth $3 billion - Quartz" ...
## $ description: chr "Information AgeA guide to data mining with HadoopInformation AgeWith the advent of the Internet of Things and the transition fr"| __truncated__ "PoliticoBarack Obama to seek limits on student data miningPoliticoPresident Barack Obama on Monday is expected to call for toug"| __truncated__ "INTHEBLACKIs data mining riddled with risk or a natural hazard of the internet?INTHEBLACKData mining is now viewed as a serious"| __truncated__ "QuartzWhy an obscure British data-mining company is worth $3 billionQuartzTesco, the troubled British retail group, is starting"| __truncated__ ...
## $ content : chr "A guide to data mining with Hadoop\nHow businesses can realise and capitalise on the opportunities that Hadoop offers\nPosted b"| __truncated__ "By Stephanie Simon\n1/11/15 6:32 PM EST\nPresident Barack Obama on Monday is expected to call for tough legislation to protect "| __truncated__ "By Adam Courtenay\nData mining is now viewed as a serious security threat, but with all the hype, s"| __truncated__ "How We Buy\nJanuary 12, 2015\nTesco, the troubled British retail group, is starting over. After an accounting scandal , a serie"| __truncated__ ...
Then, you can do anything you want. For example:
dat %>%
  arrange(timestamp) %>%
  select(heading) %>%
  head
## Source: local data frame [6 x 1]
##
## heading
## 1 The potential of fighting corruption through data mining - Transparency International (pre
## 2 Barack Obama to seek limits on student data mining - Politico
## 3 Why an obscure British data-mining company is worth $3 billion - Quartz
## 4 Parks and Rec Recap: Treat Yo Self to Some Data Mining - Indianapolis Monthly
## 5 Fraud and data mining in Vancouver…just Outside the Lines - Vancouver Sun (blog)
## 6 'Parks and Rec' Data-Mining Episode Was Eerily True To Life - MediaPost Communications
If you want/need something else, you need to be clearer in your question.
I was looking at the Google query string and noticed that it passes startdate and enddate tags in the query if you click dates on the right-hand side of the page.
You can use the same tag names, and your results will be confined to the period between the start and end dates.
GoogleFinanceSource(query, params = list(hl = "en", q = query, ie = "utf-8",
                                         start = 0, num = 25, output = "rss",
                                         startdate = '2015-10-26', enddate = '2015-10-28'))
