R: splitting cells with complex pattern strings

I am web-scraping a website (https://pubmed.ncbi.nlm.nih.gov/) to build up a dataset out of it.
> str(result)
$ title   : chr [1:4007]
$ authors : chr [1:4007]
$ cite    : chr [1:4007]
$ PMID    : chr [1:4007]
$ synopsis: chr [1:4007]
$ link    : chr [1:4007]
$ abstract: chr [1:4007]
title authors cite PMID synop…¹ link abstr…
1 Food insecurity in baccalaureate nursing stude… Cocker… J Pr… 3386… METHOD… http… "Backg…
2 Household food insecurity and educational outc… Masa R… Publ… 3271… We mea… http… "Objec…
3 Food Insecurity and Food Label Comprehension a… Mansou… Nutr… 3437… Multiv… http… "Food …
4 The Household Food Security and Feeding Patter… Omachi… Nutr… 3623… Childr… http… "Child…
5 Food insecurity: Its prevalence and relationsh… Turnbu… J Hu… 3373… BACKGR… http… "Backg…
6 Cross-sectional Analysis of Food Insecurity an… Estrel… West… 3535… METHOD… http… "Intro…
Among the various variables I am creating, there is one that is giving me trouble (result$cite): it includes various pieces of information that I have to split up into different columns in order to get a clear overview of the data I need. Here is an example of some rows to show the difficulty I am facing. I have searched for similar issues but can't find anything fitting this.
1. Public Health. 2021 Sep;198:332-339. doi: 10.1016/j.puhe.2021.07.032. Epub 2021 Sep 9. PMID: 34509858
2. Public Health Nutr. 2021 Apr;24(6):1469-1477. doi: 10.1017/S1368980021000550. Epub 2021 Feb 9. PMID: 33557975
3. Clin Nutr ESPEN. 2022 Dec;52:229-239. doi: 10.1016/j.clnesp.2022.11.005. Epub 2022 Nov 10. PMID: 36513458
Given this, I would like to split result$cite into multiple columns in order to attain what follows:
Review               Date of publication  Page              doi      Reference             Epub          PMID
Public Health.       2021 Sep;            198:332-339.      10.1016  j.puhe.2021.07.032.   2021 Sep 9.   34509858
Public Health Nutr.  2021 Apr;            24(6):1469-1477.  10.1017  S1368980021000550.    2021 Feb 9.   33557975
Clin Nutr ESPEN.     2022 Dec;            52:229-239.       10.1016  j.clnesp.2022.11.005. 2022 Nov 10.  36513458
The main problem for me is that the strings are not regular, hence I can't find a pattern to split the cells into different columns. Any ideas?

Does this work for you? (Since the OP will not provide data in a reproducible format, I've created a toy column id beside the cite column):
library(tidyverse)

result %>%
  extract(cite,
          into = c("Review", "Date of publication", "Page", "doi", "Reference", "Epub", "PMID"),
          regex = "\\d+\\. ([^.]+)\\. ([^;]+);([^.]+)\\. doi: ([^/]+)/(\\S+)\\. Epub ([^.]+)\\. PMID: (\\d+)")
  id Review             Date of publication Page            doi     Reference            Epub        PMID
1  1 Public Health      2021 Sep            198:332-339     10.1016 j.puhe.2021.07.032   2021 Sep 9  34509858
2  2 Public Health Nutr 2021 Apr            24(6):1469-1477 10.1017 S1368980021000550    2021 Feb 9  33557975
3  3 Clin Nutr ESPEN    2022 Dec            52:229-239      10.1016 j.clnesp.2022.11.005 2022 Nov 10 36513458
(NB: if there aren't leading numerics in cite, just remove this part from the regex: \\d+\\.  — with the whitespace at the end!)
The way extract() works may look hard to parse but is actually quite simple. Essentially, you do two things: (i) look at the strings and figure out how they are patterned, i.e., what rules they follow; (ii) in the regex argument, put everything you want to extract into distinct capture groups ((...)), and leave everything else outside the capture groups.
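For instance, here is a toy illustration (the PMID string is made up): str_match() from stringr returns the full match in the first column and each capture group in its own column, while text outside the groups is matched but discarded:
str_match("PMID: 34509858", "PMID: (\\d+)")
##      [,1]             [,2]
## [1,] "PMID: 34509858" "34509858"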
Data (update #2):
result <- data.frame(
  id = 1:3,
  cite = c("1. Public Health. 2021 Sep;198:332-339. doi: 10.1016/j.puhe.2021.07.032. Epub 2021 Sep 9. PMID: 34509858",
           "2. Public Health Nutr. 2021 Apr;24(6):1469-1477. doi: 10.1017/S1368980021000550. Epub 2021 Feb 9. PMID: 33557975",
           "3. Clin Nutr ESPEN. 2022 Dec;52:229-239. doi: 10.1016/j.clnesp.2022.11.005. Epub 2022 Nov 10. PMID: 36513458")
)


Optional pattern part in regex lookbehind

In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part since . can also match 'of the United States'. Any ideas how to exclude it?
I guess, more generally speaking, the question is how to include an optional element in a lookbehind (which seems not to be possible, since ? makes it a variable-length pattern).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
however the result is the same.
In this case, I would prefer the PCRE engine available in base R (perl = TRUE) over the ICU engine that stringr/stringi uses, because PCRE supports \K, which resets the start of the reported match and therefore lets the "lookbehind" part be of any length:
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
You can do this with str_match_all and group capture:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
  .[[1]] %>%
  .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"

Are there any website content monitoring packages in R?

I know there are free website content monitoring programs that send email alerts when the content of a website is changed, but is there a package (or any way to hard code) in R which can do this? It would be helpful to integrate this in one work flow.
R is a general-purpose programming language, so you can do anything with it.
The core idiom for what you are trying to do is (a minimal skeleton of the loop follows the list):
Identify the target site
Pull content & content metadata
Cache ^^ (you need to figure this out: RDBMS tables? NoSQL tables? Files?)
Let n time-periods pass (you need to figure this out: cron? launchd? Amazon Lambda?)
Pull content & content metadata again
Compare ^^ against the cached versions (NOTE: this works best if you know the structure of the target site vs. using an overly generic framework)
If the difference is "significant", notify via whatever means you want (you need to figure this out: email? SMS? Twitter?)
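Here is that skeleton — a minimal sketch, assuming the server sends a Content-Length header; the URL, cache file name, and 500-byte threshold are all illustrative placeholders:
library(httr)

cache_file <- "page_cache.rds"                       # hypothetical cache location (the "cache" step)
res <- httr::GET("https://example.com/page")         # illustrative target
new_size <- as.numeric(res$headers[["content-length"]])

if (file.exists(cache_file)) {
  old_size <- readRDS(cache_file)
  # "significant" is up to you; 500 bytes is an arbitrary placeholder
  if (abs(new_size - old_size) > 500) {
    message("Page changed substantially; notify here (email/SMS/Twitter)")
  }
}
saveRDS(new_size, cache_file)                        # refresh the cache for the next run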
For content, you may not be aware that httr::GET() returns a rich, complex data object full of metadata. I did not do a str(res) below to encourage you to do so on your own.
library(httr)
library(rvest)
library(splashr)
library(hgr) # devtools::install_github("hrbrmstr/hgr")
library(tlsh) # devtools::install_github("hrbrmstr/tlsh")
target_url <- "https://www.whitehouse.gov/briefings-statements/"
Get it like a browser would:
httr::GET(
  url = target_url,
  httr::user_agent(splashr::ua_macos_safari)
) -> res
Cache the page size and use a substantial difference to signal notification:
(page_size <- res$headers['content-length'])
## $`content-length`
## [1] "12783"
Calculate & cache local sensitify hash value use tlsh_simple_diff() to see if there are "substantial" hash changes and use that as a signal to notify:
doc_text <- httr::content(res, as = "text")
(doc_hash <- tlsh_simple_hash(doc_text))
## [1] "563386E33C44683E060B739261ADF20CB2D38563EE151C88A3F95169999FF97A1F385D"
This site uses structured <div>s, so cache them and use more/fewer/different ones to signal notification:
doc <- httr::content(res)
news_items <- html_nodes(doc, "div.briefing-statement__content")
(total_news_items <- length(news_items))
## [1] 10
(headlines <- gsub("[[:space:]]+", " ", html_text(news_items, trim=TRUE)))
## [1] "News Clips CNBC: “Job Openings Hit Record 7.136 Million in August” Economy & Jobs Oct 16, 2018"
## [2] "Fact Sheets Congressional Democrats Want to Take Away Your Doctor, Outlaw Your Private Insurance, and Put Bureaucrats In Charge of Your Healthcare Healthcare Oct 16, 2018"
## [3] "Remarks Remarks by President Trump in Briefing on Hurricane Michael Land & Agriculture Oct 15, 2018"
## [4] "Remarks Remarks by President Trump and Governor Scott at FEMA Aid Distribution Center | Lynn Haven, FL Land & Agriculture Oct 15, 2018"
## [5] "Remarks Remarks by President Trump During Tour of Lynn Haven Community | Lynn Haven, FL Land & Agriculture Oct 15, 2018"
## [6] "Remarks Remarks by President Trump and Governor Scott Upon Arrival in Florida Land & Agriculture Oct 15, 2018"
## [7] "Remarks Remarks by President Trump Before Marine One Departure Foreign Policy Oct 15, 2018"
## [8] "Statements & Releases White House Appoints 2018-2019 Class of White House Fellows Oct 15, 2018"
## [9] "Statements & Releases President Donald J. Trump Approves Georgia Disaster Declaration Land & Agriculture Oct 14, 2018"
## [10] "Statements & Releases President Donald J. Trump Amends Florida Disaster Declaration Land & Agriculture Oct 14, 2018"
Use a "readability" tool to turn the contents into plaintext cache & compare with one of the many "text diff/string diff" R packages:
content_meta <- hgr::just_the_facts(target_url)
str(content_meta)
## List of 11
## $ title : chr "Briefings & Statements"
## $ content : chr "<p class=\"body-overflow\"> <header class=\"header\"> </header>\n<main id=\"main-content\"> <div class=\"page-r"| __truncated__
## $ lead_image_url: chr "https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png"
## $ next_page_url : chr "https://www.whitehouse.gov/briefings-statements/page/2"
## $ url : chr "https://www.whitehouse.gov/briefings-statements/"
## $ domain : chr "www.whitehouse.gov"
## $ excerpt : chr "Get official White House briefings, statements, and remarks from President Donald J. Trump and members of his Administration."
## $ word_count : int 22
## $ direction : chr "ltr"
## $ total_pages : int 2
## $ pages_rendered: int 2
## - attr(*, "row.names")= int 1
## - attr(*, "class")= chr "hgr"
Unfortunately, you asked a general purpose computing-ish question and, as such, it is likely to get closed.

How to remove extra | (pipe) separator from rows when loading | (pipe)-separated text into R

I am reading text from a file in which the text is separated by | (pipes).
The text table looks like this (tweet id|date and time|tweet):
545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
I am reading this information using the following code:
nyt <- read.table(file = ".../nytimeshealth.txt",
                  sep = "|",
                  header = F,
                  quote = "",
                  fill = T,
                  stringsAsFactors = F,
                  numerals = "no.loss",
                  encoding = "UTF-8",
                  na.strings = "NA")
Now, while most of the rows in the original file have 3 columns, each separated by a '|', a few of the rows have an additional '|' separator. That is to say, they have four columns, because some of the tweets themselves contain a | symbol.
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
I know that using the fill=T option in the read.table function above allows me to read rows of unequal length (blank fields are implicitly added in the empty cells).
So, the row above becomes
71 545074589374881792 Wed Dec 17 04:34:43 +0000 2014 National Briefing
72 New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
However, now column 3 of row 71 has incomplete information, and columns 2 and 3 of row 72 are empty while column 1 does not contain the tweet ID but a part of the tweet. Is there any way I can avoid this? I would like to remove the extra | separator wherever it appears, so that I do not lose any information.
Is this possible while reading the text file into R? Or is it something I will have to take care of before I start loading the text? What would be my best course of action?
I created a text file called text.txt with the 3 lines you provided as examples of your data (the 2 easy lines without any | in the tweet, as well as the one which has a | inside the tweet).
Here is the content of this file:
545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
Code
library(tidyverse)

readLines("text.txt", encoding = "UTF-8") %>%
  map(str_split_fixed, "\\|", 3) %>%
  map_df(as_tibble)
Result
# A tibble: 3 x 3
V1 V2
<chr> <chr>
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
<chr>
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Say…
3 National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to …
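As a side note, str_split_fixed() is itself vectorized over a character vector, so the map() step can be dropped entirely — a sketch with illustrative column names (the names are my own, not from the file):
m <- str_split_fixed(readLines("text.txt", encoding = "UTF-8"), "\\|", 3)  # n = 3 caps the split; extra | stays in column 3
colnames(m) <- c("id", "created_at", "tweet")                              # illustrative names
as_tibble(m)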
Here is another solution, to get back to your comment and to reuse your initial code. This solution only works if each tweet contains at most one |, and at least one tweet must contain a | (otherwise the V4 column is never created). If some tweets have more than one |, it will break and you will have to edit it. So the other answer, which works regardless of the structure of your tweets, is better IMO.
I am still using my text.txt file:
df <- read.table(file = "text.txt",
                 sep = "|",
                 header = F,
                 quote = "",
                 fill = T,
                 stringsAsFactors = F,
                 numerals = "no.loss",
                 encoding = "UTF-8",
                 na.strings = "NA")

df %>%
  mutate(V3 = paste0(V3, V4)) %>%
  select(-V4)
Result
V1 V2
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
3 National Briefing New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
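If you would rather keep the | that was inside the tweet instead of silently dropping it, a small tweak works — a sketch, assuming V4 is "" for the rows that had no extra separator (which is how fill = T pads character columns):
df %>%
  mutate(V3 = ifelse(V4 == "", V3, paste(V3, V4, sep = "|"))) %>%  # re-join with the original separator
  select(-V4)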

Extract date from texts in corpus R

I have a corpus object from which I want to extract data so I can add them as docvars.
The object looks like this
v1 <- c("(SE22-y -7 A go q ,, Document of The World Bank FOR OFFICIAL USE ONLY il I ( >I8.( )]i 1 t'f-l±E C 4'( | Report No. 9529-LSO l il .rt N ,- / . t ,!I . 1. 'i 1( T v f) (: AR.) STAFF APPRAISAL REPORT KINGDOM OF LESOTHO EDUCATION SECTOR DEVELOPMENT PROJECT JUNE 19, 1991 Population and Human Resources Division Southern Africa Department This document has a restricted distribution and may be used by reipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization.",
"Document of The World Bank Report No. 13611-PAK STAFF APPRAISAL REPORT PAKISTAN POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population and Human Resources Division Country Department I South Asia Region",
"I Toward an Environmental Strategy for Asia A Summary of a World Bank Discussion Paper Carter Brandon Ramesh Ramankutty The World Bank Washliington, D.C. (C 1993 The International Bank for Reconstruction and Development / THiE WORLD BANK 1818 H Street, N.W. Washington, D.C. 20433 All rights reserved Manufactured in the United States of America First printing November 1993",
"Report No. PID9188 Project Name East Timor-TP-Emergency School (#) Readiness Project Region East Asia and Pacific Region Sector Other Education Project ID TPPE70268 Borrower(s) EAST TIMOR Implementing Agency Address UNTAET (UN TRANSITIONAL ADMINISTRATION FOR EAST TIMOR) Contact Person: Cecilio Adorna, UNTAET, Dili, East Timor Fax: 61-8 89 422198 Environment Category C Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000 Projected Board Date June 20, 2000",
"Page 1 CONFORMED COPY CREDIT NUMBER 2447-CHA (Reform, Institutional Support and Preinvestment Project) between PEOPLE'S REPUBLIC OF CHINA and INTERNATIONAL DEVELOPMENT ASSOCIATION Dated December 30, 1992")
c1 <- corpus(v1)
The first thing I want to do is extract the first occurring date. Mostly it occurs as "Month Year" (December 1990) or "Month Day, Year" (JUNE 19, 1991), or with a typo (FREBRUARY 10, 1995), in which case the month could be discarded.
My code is a combination of
Extract date text from string
&
Extract Dates in any format from Text in R:
lapply(c1$documents$texts, function(x) anydate(str_extract_all(c1$documents$texts, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")))
and get the error:
Error in anytime_cpp(x = x, tz = tz, asUTC = asUTC, asDate = TRUE, useR = useR, : Unsupported Type
However, I do not know how to supply the date format. Furthermore, I don't really get how to write the correct regular expressions.
https://www.regular-expressions.info/dates.html & https://www.regular-expressions.info/rlanguage.html
other questions on this subject are:
Extract date from text
Need to extract date from a text file of strings in R
http://r.789695.n4.nabble.com/Regexp-extract-first-occurrence-of-date-in-string-td997254.html
Extract date from given string in r
str_extract_all(texts(c1),
                "(\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Nov(?:ember)?|Oct(?:ober)?|Dec(?:ember)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|(\\b(?:JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUN(?:E)?|JUL(?:Y)?|AUG(?:UST)?|SEP(?:TEMBER)?|NOV(?:EMBER)?|OCT(?:OBER)?|DEC(?:EMBER)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2},\\s+\\d{4})|(\\b(JAN(UARY)?|FEB(RUARY)?|MAR(CH)?|APR(IL)?|MAY|JUN(E)?|JUL(Y)?|AUG(UST)?|SEP(TEMBER)?|OCT(OBER)?|NOV(EMBER)?|DEC(EMBER)?)\\s+\\d{1,2},\\s+\\d{4})",
                simplify = TRUE)[, 1]
This gives the first occurrence in either format, e.g. JUNE 19, 1991 or December 1990.
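As an alternative to spelling out every case variant, the month alternation can be built from base R's month.name constant and matched case-insensitively — a sketch (it assumes quanteda's texts() accessor, as above, and does not constrain the year range the way the long regex does):
months <- paste(month.name, collapse = "|")                # "January|February|...|December"
pat <- sprintf("\\b(?:%s)\\b(?: \\d{1,2},)? \\d{4}", months)  # optional "Day," part
str_extract(texts(c1), regex(pat, ignore_case = TRUE))     # first match per document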

Select by Date or Sort by date on GoogleNewsSource R

I am using the R package tm.plugin.webmining. Using the function GoogleNewsSource() I would like to query the news sorted by date and also from a specific date. Is there any parameter to query the news of a specific date?
library(tm)
library(tm.plugin.webmining)
searchTerm <- "Data Mining"
corpusGoog <- WebCorpus(GoogleNewsSource(params = list(hl = "en", q = searchTerm,
                                                       ie = "utf-8", num = 10, output = "rss")))
headers <- meta(corpusGoog,tag="datetimestamp")
If you're looking for a data frame-like structure, this is how you'd go about creating it (note: not all fields are extracted from the corpus):
library(dplyr)
make_row <- function(elem) {
  data.frame(timestamp = elem[[2]]$datetimestamp,
             heading = elem[[2]]$heading,
             description = elem[[2]]$description,
             content = elem$content,
             stringsAsFactors = FALSE)
}
dat <- bind_rows(lapply(corpusGoog, make_row))
str(dat)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 4 variables:
## $ timestamp : POSIXct, format: "2015-02-03 13:08:16" "2015-01-11 23:37:45" ...
## $ heading : chr "A guide to data mining with Hadoop - Information Age" "Barack Obama to seek limits on student data mining - Politico" "Is data mining riddled with risk or a natural hazard of the internet? - INTHEBLACK" "Why an obscure British data-mining company is worth $3 billion - Quartz" ...
## $ description: chr "Information AgeA guide to data mining with HadoopInformation AgeWith the advent of the Internet of Things and the transition fr"| __truncated__ "PoliticoBarack Obama to seek limits on student data miningPoliticoPresident Barack Obama on Monday is expected to call for toug"| __truncated__ "INTHEBLACKIs data mining riddled with risk or a natural hazard of the internet?INTHEBLACKData mining is now viewed as a serious"| __truncated__ "QuartzWhy an obscure British data-mining company is worth $3 billionQuartzTesco, the troubled British retail group, is starting"| __truncated__ ...
## $ content : chr "A guide to data mining with Hadoop\nHow businesses can realise and capitalise on the opportunities that Hadoop offers\nPosted b"| __truncated__ "By Stephanie Simon\n1/11/15 6:32 PM EST\nPresident Barack Obama on Monday is expected to call for tough legislation to protect "| __truncated__ "By Adam Courtenay\nData mining is now viewed as a serious security threat, but with all the hype, s"| __truncated__ "How We Buy\nJanuary 12, 2015\nTesco, the troubled British retail group, is starting over. After an accounting scandal , a serie"| __truncated__ ...
Then, you can do anything you want. For example:
dat %>%
  arrange(timestamp) %>%
  select(heading) %>%
  head
## Source: local data frame [6 x 1]
##
## heading
## 1 The potential of fighting corruption through data mining - Transparency International (pre
## 2 Barack Obama to seek limits on student data mining - Politico
## 3 Why an obscure British data-mining company is worth $3 billion - Quartz
## 4 Parks and Rec Recap: Treat Yo Self to Some Data Mining - Indianapolis Monthly
## 5 Fraud and data mining in Vancouver…just Outside the Lines - Vancouver Sun (blog)
## 6 'Parks and Rec' Data-Mining Episode Was Eerily True To Life - MediaPost Communications
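And to answer the "from a specific date" part on the R side, the parsed timestamps can simply be filtered (the cutoff date is illustrative):
dat %>%
  filter(as.Date(timestamp) >= as.Date("2015-01-12")) %>%  # keep items on/after the cutoff
  arrange(timestamp)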
If you want/need something else, you need to be clearer in your question.
I was looking at the Google query string and noticed they pass startdate and enddate tags in the query if you click dates on the right-hand side of the page.
You can use the same tag names and your results will be confined to within the start and end dates.
GoogleNewsSource(query, params = list(hl = "en", q = query, ie = "utf-8",
                                      start = 0, num = 25, output = "rss",
                                      startdate = '2015-10-26', enddate = '2015-10-28'))
