Retrieve citations of a journal paper using R

Using R, I want to obtain the list of articles that cite a scientific journal paper.
The only information I have is the title of the article, e.g. "Protein measurement with the folin phenol reagent".
Is anyone able to help me by producing a reproducible example that I can use?
Here is what I tried so far.
The R package fulltext seems useful, because it allows you to retrieve a list of IDs linked to an article. For instance, I can get the article's DOI:
library(fulltext)
res1 <- ft_search(query = "Protein measurement with the folin phenol reagent", from = "crossref")
res1 <- ft_links(res1)
res1$crossref$ids
In the same way, I can get the Scopus ID by setting from = "scopus" in fulltext::ft_search (and supplying a Scopus API key).
Using the DOI, I can obtain the article's citation count with the rcrossref package:
rcrossref::cr_citation_count(res1$crossref$ids[1])
Similarly, I can use the R package rscopus if I want to use the Scopus ID rather than the DOI.
Unfortunately, this is not sufficient for me, as I need the list of articles citing the paper, not just their number.
I have seen many people online using the scholar package, but if I understand correctly, that requires the article's authors to have a Google Scholar ID, and I would have to find a way to retrieve this ID. So it doesn't look like a viable solution.
Does anyone have any idea how to solve this problem?

Once you have the DOI, you can use the OpenCitations API to fetch data about publications that cite the article. Access the API with the rjson package via https://opencitations.net/index/coci/api/v1/citations/{DOI}. The field citing contains the DOIs of all publications that cite the article. You can then use CrossRef's API to fetch further metadata about the citing papers, such as titles, journals, publication dates and authors (via https://api.crossref.org/works/{DOI}).
Here is an example of OpenCitations' API with three citations (as of January 2021).
Here is possible code (using the same example as above):
opcit <- "https://opencitations.net/index/coci/api/v1/citations/10.1177/1369148118786043"
result <- rjson::fromJSON(file = opcit)
citing <- lapply(result, function(x){
  x[['citing']]
})
# a vector with three DOIs, each of which cite the paper
citing <- unlist(citing)
Now we have the vector citing with three DOIs. You can then use rcrossref to find out basic information about the citing papers, such as:
paper <- rcrossref::cr_works(citing[1])
# find out the title of that paper
paper[["data"]][["title"]]
# output: "Exchange diplomacy: theory, policy and practice in the Fulbright program"
Since you have a vector of DOIs in citing, you could also use this approach:
citingdata <- rcrossref::cr_cn(citing)
The output of citingdata should lead to the metadata of the three citing papers, structured like in these two examples:
[[1]]
[1] "@article{Wong_2020,\n\tdoi = {10.1017/s1752971920000196},\n\turl = {https://doi.org/10.1017%2Fs1752971920000196},\n\tyear = 2020,\n\tmonth = {jun},\n\tpublisher = {Cambridge University Press ({CUP})},\n\tpages = {1--31},\n\tauthor = {Seanon S. Wong},\n\ttitle = {One-upmanship and putdowns: the aggressive use of interaction rituals in face-to-face diplomacy},\n\tjournal = {International Theory}\n}"
[[2]]
[1] "@article{Aalberts_2020,\n\tdoi = {10.1080/21624887.2020.1792734},\n\turl = {https://doi.org/10.1080%2F21624887.2020.1792734},\n\tyear = 2020,\n\tmonth = {aug},\n\tpublisher = {Informa {UK} Limited},\n\tvolume = {8},\n\tnumber = {3},\n\tpages = {240--264},\n\tauthor = {Tanja Aalberts and Xymena Kurowska and Anna Leander and Maria Mälksoo and Charlotte Heath-Kelly and Luisa Lobato and Ted Svensson},\n\ttitle = {Rituals of world politics: on (visual) practices disordering things},\n\tjournal = {Critical Studies on Security}\n}"
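If you prefer the metadata of all citing papers in a single table rather than as BibTeX strings, rcrossref::cr_works() also accepts a vector of DOIs. A minimal sketch, assuming the citing vector from above (the exact column names depend on what CrossRef returns):
# fetch CrossRef metadata for every citing DOI at once
citing_meta <- rcrossref::cr_works(dois = citing)$data
# keep a few fields of interest, e.g. DOI, title, journal and publication date
citing_meta[, c("doi", "title", "container.title", "issued")]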

Related

Split a string into multiple sentences with R and POS tagging

I don't know if this is the right place, but if possible, could you help me split a text into several sentences using R?
I have a database that contains descriptions of the activities that employees perform. I would like to split this text into several sentences and then extract the verb-noun pair from each sentence.
I can do this line by line, but as there are many lines it would take forever, so I would like to know how to do this for the entire column.
You guys can see the database in: https://docs.google.com/spreadsheets/d/1NiMj37q8_hJhuNFCiQcjO6UBvI9_-OM4/edit?usp=sharing&ouid=115543599430411372875&rtpof=true&sd=true
I can do it one line at a time with the following code, but I would like to do it for the entire description column:
library(udpipe)
docs <- "Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers."
docs <- setNames(docs, "doc1")
anno <- udpipe(docs, object = "english", udpipe_model_repo = "bnosac/udpipe.models.ud")
anno <- cbind_dependencies(anno, type = "parent")
subset(anno, upos_parent %in% c("NOUN", "VERB") & upos %in% c("NOUN", "VERB"),
       select = c("doc_id", "paragraph_id", "sentence_id", "token", "token_parent", "dep_rel", "upos", "upos_parent"))
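A minimal sketch of how the same call could be applied to the whole column, assuming the descriptions live in a data frame df with a column Description (both names are hypothetical): udpipe() accepts a named character vector, so a single call annotates every row.
library(udpipe)
# one document per row; the names become doc_id in the annotation
docs <- setNames(df$Description, paste0("doc", seq_len(nrow(df))))
anno <- udpipe(docs, object = "english", udpipe_model_repo = "bnosac/udpipe.models.ud")
anno <- cbind_dependencies(anno, type = "parent")
pairs <- subset(anno,
                upos_parent %in% c("NOUN", "VERB") & upos %in% c("NOUN", "VERB"),
                select = c("doc_id", "paragraph_id", "sentence_id", "token",
                           "token_parent", "dep_rel", "upos", "upos_parent"))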

Create multiple rmarkdown reports with one dataset

I would like to create several pdf files in rmarkdown.
This is a sample of my data:
mydata <- data.frame(First = c("John", "Hui", "Jared","Jenner"), Second = c("Smith", "Chang", "Jzu","King"), Sport = c("Football","Ballet","Ballet","Football"), Age = c("12", "13", "12","13"), Submission = c("Microbes may be the friends of future colonists living off the land on the moon, Mars or elsewhere in the solar system and aiming to establish self-sufficient homes.
Space colonists, like people on Earth, will need what are known as rare earth elements, which are critical to modern technologies. These 17 elements, with daunting names like yttrium, lanthanum, neodymium and gadolinium, are sparsely distributed in the Earth’s crust. Without the rare earths, we wouldn’t have certain lasers, metallic alloys and powerful magnets that are used in cellphones and electric cars.", "But mining them on Earth today is an arduous process. It requires crushing tons of ore and then extracting smidgens of these metals using chemicals that leave behind rivers of toxic waste water.
Experiments conducted aboard the International Space Station show that a potentially cleaner, more efficient method could work on other worlds: let bacteria do the messy work of separating rare earth elements from rock.", "“The idea is the biology is essentially catalyzing a reaction that would occur very slowly without the biology,” said Charles S. Cockell, a professor of astrobiology at the University of Edinburgh.
On Earth, such biomining techniques are already used to produce 10 to 20 percent of the world’s copper and also at some gold mines; scientists have identified microbes that help leach rare earth elements out of rocks.", "Blank"))
With help from the community, I was able to arrive at a cool rmarkdown solution that would create a single html file, with all the data I want.
This is saved as Essay to Word.Rmd
```{r echo = FALSE}
# using data from above
# mydata <- mydata
# Define template (using column names from data.frame)
template <- "**First:** `r First`   **Second:** `r Second` <br>
**Age:** `r Age`
**Submission** <br>
`r Submission`"
# Now process the template for each row of the data.frame
src <- lapply(1:nrow(mydata), function(i) {
knitr::knit_child(text=template, envir=mydata[i, ], quiet=TRUE)
})
```
# Print result to document
`r knitr::knit_child(text=unlist(src))`
This creates a single file.
I would like to create a single html (or preferably PDF file) for each "sport" listed in the data. So I would have all the submissions for students who do "Ballet" in one file, and a separate file with all the submissions of students who play football.
I have been looking at a few different solutions, and I found this to be the most helpful:
R Knitr PDF: Is there a posssibility to automatically save PDF reports (generated from .Rmd) through a loop?
Following suit, I created a separate R script to loop through and subset the data by sport:
for (sport in unique(mydata$Sport)){
  subgroup <- mydata[mydata$Sport == sport,]
  render("Essay to Word.Rmd", output_file = paste0('report.', sport, '.html'))
}
Unfortunately, this creates a separate file with ALL the students, not just those who belong to that sport.
Any idea what might be going on with this code above?
Is it possible to directly create these files as PDF docs instead of html? I know I can click on each file to save them as pdf after the fact, but I will have 40 different sports files to work with.
Is it possible to add a thin line between each "submission" essay within a file?
Any help would be great, thank you!!!
This could be achieved via a parametrized report like so:
Add parameters for the data and e.g. the type of sport to your Rmd.
Inside the loop, pass your subgroup dataset to render via the params argument.
You can add horizontal lines via ***.
If you want PDF output, use output_format = "pdf_document". Additionally, to get the document to render I had to switch the LaTeX engine via output_options.
Rmd:
---
params:
  data: null
  sport: null
---
```{r echo = FALSE}
# using data from above
data <- params$data
# Define template (using column names from data.frame)
template <- "
***
**First:** `r First`   **Second:** `r Second` <br>
**Age:** `r Age`
**Submission** <br>
`r Submission`"
# Now process the template for each row of the data.frame
src <- lapply(1:nrow(data), function(i) {
knitr::knit_child(text=template, envir=data[i, ], quiet=TRUE)
})
```
# Print result to document. Sport: `r params$sport`
`r knitr::knit_child(text=unlist(src))`
R Script:
mydata <- data.frame(First = c("John", "Hui", "Jared","Jenner"),
Second = c("Smith", "Chang", "Jzu","King"),
Sport = c("Football","Ballet","Ballet","Football"),
Age = c("12", "13", "12","13"),
Submission = c("Microbes may be the friends of future colonists living off the land on the moon, Mars or elsewhere in the solar system and aiming to establish self-sufficient homes.
Space colonists, like people on Earth, will need what are known as rare earth elements, which are critical to modern technologies. These 17 elements, with daunting names like yttrium, lanthanum, neodymium and gadolinium, are sparsely distributed in the Earth’s crust. Without the rare earths, we wouldn’t have certain lasers, metallic alloys and powerful magnets that are used in cellphones and electric cars.", "But mining them on Earth today is an arduous process. It requires crushing tons of ore and then extracting smidgens of these metals using chemicals that leave behind rivers of toxic waste water.
Experiments conducted aboard the International Space Station show that a potentially cleaner, more efficient method could work on other worlds: let bacteria do the messy work of separating rare earth elements from rock.", "“The idea is the biology is essentially catalyzing a reaction that would occur very slowly without the biology,” said Charles S. Cockell, a professor of astrobiology at the University of Edinburgh.
On Earth, such biomining techniques are already used to produce 10 to 20 percent of the world’s copper and also at some gold mines; scientists have identified microbes that help leach rare earth elements out of rocks.", "Blank"))
for (sport in unique(mydata$Sport)){
  subgroup <- mydata[mydata$Sport == sport,]
  rmarkdown::render("test.Rmd", output_format = "html_document", output_file = paste0('report.', sport, '.html'), params = list(data = subgroup, sport = sport))
  rmarkdown::render("test.Rmd", output_format = "pdf_document", output_options = list(latex_engine = "xelatex"), output_file = paste0('report.', sport, '.pdf'), params = list(data = subgroup, sport = sport))
}
In order to directly create a pdf from your rmd-file, you could use the following function in a separate R script where your data is loaded, and then use map from the purrr package to iterate over the data (in the rmd-file the output must be set to pdf_document):
library(tidyverse)
library(rmarkdown)

get_report <- function(sport){
  # keep only the rows for this sport; render() picks up `mydata` from this environment
  mydata <- mydata %>%
    filter(Sport == sport)
  render("test.rmd", output_file = paste0('report_', sport, '.pdf'))
}
map(unique(mydata$Sport), get_report)
Hope that is what you are looking for?

Extract list based on string with tabulizer package

I am extracting the quarterly income statement with the tabulizer package and converting it to tabular form.
library(tabulizer)
# 2017 Q3 Report
telia_url = "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2017/q3/telia-company-q3-2017-en"
telialists = extract_tables(telia_url)
teliatest1 = as.data.frame(telialists[22])
# 2009 Q3 Report
telia_url2009 = "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2009/q3/teliasonera-q3-2009-report-en.pdf"
telialists2009 = extract_tables(telia_url2009)
teliatest2 = as.data.frame(telialists2009[9])
I am interested only in the Condensed Consolidated Statements of Comprehensive Income table. This string is exact or very similar in all historical reports.
Above, for the 2017 report, list element #22 was the correct table. However, since the 2009 report had a different layout, #9 was the correct one for that particular report.
What would be a clever solution to make this function dynamic, depending on where the string (or substring) of "Condensed Consolidated Statements of Comprehensive Income" is located?
Perhaps using the tm package to find the relative position?
Thanks
You could use pdftools to find the page you're interested in.
For instance a function like this one should do the job:
get_table <- function(url) {
  txt <- pdftools::pdf_text(url)
  p <- grep("condensed consolidated statements.{0,10}comprehensive income",
            txt,
            ignore.case = TRUE)[1]
  L <- tabulizer::extract_tables(url, pages = p)
  i <- which.max(lengths(L))
  data.frame(L[[i]])
}
The first step is to read all the pages into the character vector txt. Then grep finds the first page that looks like the one you want (I inserted .{0,10} to allow a maximum of ten characters, such as spaces or newlines, in the middle of the title).
Using tabulizer, you can then extract the list L of all tables located on this page, which should be much faster than extracting all the tables in the document, as you did. Your table is probably the biggest one on that page, hence the which.max.
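For example, with the two report URLs from the question, the same function should pick out the comprehensive income table despite the different layouts:
telia_url <- "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2017/q3/telia-company-q3-2017-en"
telia_url2009 <- "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2009/q3/teliasonera-q3-2009-report-en.pdf"
head(get_table(telia_url))
head(get_table(telia_url2009))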

Entrez and RISmed library for pubmed data mining

I'm using the RISmed library to query for my gene or protein of interest. The output basically consists of PubMed IDs, but most of the time it also includes non-specific hits that are not of interest to me. As I can only see the PubMed IDs, I have to manually take the returned IDs and search for them in NCBI to see whether a paper is of interest or not.
Question: Is there a way to return the abstract of the paper, or a summary of sorts, along with its PubMed ID, that can be implemented in R?
If anyone can help, it would be really great.
Using the example from the manual, we need the EUtilsGet function.
library(RISmed)
search_topic <- 'copd'
search_query <- EUtilsSummary(search_topic, retmax = 10,
                              mindate = 2012, maxdate = 2012)
summary(search_query)
# see the ids of our returned query
QueryId(search_query)
# get actual data from PubMed
records <- EUtilsGet(search_query)
class(records)
# store it
pubmed_data <- data.frame('Title' = ArticleTitle(records),
                          'Abstract' = AbstractText(records))
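If you also want the PubMed IDs alongside each title and abstract (as asked above), RISmed provides a PMID() accessor for the same records object; a small sketch:
pubmed_data <- data.frame('PMID' = PMID(records),
                          'Title' = ArticleTitle(records),
                          'Abstract' = AbstractText(records),
                          stringsAsFactors = FALSE)
head(pubmed_data)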

How can I get all available data series names from quantmod and Quandl?

I am interested in a large amount of free data from FRED, OECD, BIS, the World Bank, etc. All data are macro-economic in nature. I am not really interested in stock price data.
I would essentially like to construct a CSV table with all symbols available for me in quantmod and Quandl. I am almost certain this table would be useful for others as well.
Symbol, Title, Units, Frequency
X X X X
Y Y Y Y
I found a similar question which has no answer.
How can i see all available data series from quantmod package?
Is there a way to do this other than manually searching FRED, OECD and Quandl country by country and variable by variable?
Thank you.
Doe, on Quandl you can query the API for a dataset search, as described in this link:
https://www.quandl.com/docs/api#dataset-search
If you're using the Quandl package, you could use the Quandl.search function to build your query as specified in that link. Here's the same query using Quandl.search:
Quandl.search(query = "crude oil", page = 1, source = "DOE", silent = TRUE)
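If the goal is the CSV table sketched in the question, one option is to query the same dataset-search endpoint directly and flatten the JSON response. A rough sketch, assuming the v3 search endpoint and field names from the Quandl API docs linked above (both may have changed since Quandl became Nasdaq Data Link, and an API key may be required):
library(jsonlite)
# search the DOE database for "crude oil" datasets (endpoint and field names per the legacy Quandl docs)
res <- fromJSON("https://www.quandl.com/api/v3/datasets.json?query=crude+oil&database_code=DOE&per_page=100")
datasets <- res$datasets
# assemble something close to the Symbol / Title / Frequency table from the question
symbols <- data.frame(Symbol = paste(datasets$database_code, datasets$dataset_code, sep = "/"),
                      Title = datasets$name,
                      Frequency = datasets$frequency)
write.csv(symbols, "quandl_doe_symbols.csv", row.names = FALSE)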
