Retrieve references from a paper using DOI - R

How can we retrieve all references from a paper using its DOI and save them into a data frame?
Using rscopus:
library(rscopus)
library(dplyr)
auth_token_header("please_add")
akey <- "please_add"
object_retrieval("10.1109/ISCSLP.2014.6936630", ref = "doi")
Is this the correct option?

I'm not sure this is actually indexed by Scopus.
library(rscopus)
x = abstract_retrieval("10.1109/ISCSLP.2014.6936630", identifier = "doi")
#> HTTP specified is:https://api.elsevier.com/content/abstract/doi/10.1109/ISCSLP.2014.6936630
x$content
#> $`service-error`
#> $`service-error`$status
#> $`service-error`$status$statusCode
#> [1] "RESOURCE_NOT_FOUND"
#>
#> $`service-error`$status$statusText
#> [1] "The resource specified cannot be found."
process_author_name(last_name = "Lin", first_name = "Xuee")
#> HTTP specified is (without API key): https://api.elsevier.com/content/search/author?query=AUTHFIRST%28Xuee%29%2BAND%2BAUTHLAST%28Lin%29&count=200&start=0
#> Error in process_author_name(last_name = "Lin", first_name = "Xuee"): No author name found
Created on 2019-01-28 by the reprex package (v0.2.1)

It should be the correct option, but Scopus distinguishes between (academic) subscribers and non-subscribers, and many API views are disabled by default for non-subscribers. You can check which views your key can access here: https://dev.elsevier.com/api_key_settings.html
The table on that page shows that the Citations Overview API is disabled by default and not available to non-subscribers.
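If your key does have access, the reference list is exposed through the Abstract Retrieval API's REF view. A minimal sketch using httr directly (the exact JSON field names below are my assumption about the REF view's layout, not verified against a live key):
library(httr)
library(jsonlite)

doi <- "10.1109/ISCSLP.2014.6936630"
res <- GET(
  paste0("https://api.elsevier.com/content/abstract/doi/", doi),
  query = list(view = "REF"),
  add_headers("X-ELS-APIKey" = "please_add", Accept = "application/json")
)
parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
# Assumed location of the reference entries in the parsed JSON;
# fromJSON() typically simplifies such a list into a data frame
refs <- parsed$`abstracts-retrieval-response`$references$reference
Note that for this particular DOI the RESOURCE_NOT_FOUND error above suggests the paper may simply not be indexed, in which case no view will return references.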

Related

download.file with wildcard matching in R

I'm trying to download all the files that match a pattern from a url directory in R using download.file, but I can't get it working for even a single file. The url is:
https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/
and the pattern match is all files like: AIS_2019_*_18.zip
Here is what I've tried for a single file case:
download.file('https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_04_18.zip',
              destfile = "AIS_2019_04_18.zip",
              method = "wget", extra = c("-r", "-np", "-L", "--max-redirect=0"))
but I always get the error 'wget' call had nonzero exit status.
I've also tried setting method = "internal" and mode = "w", but then I get 'scheme not supported in url'.
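For true wildcard matching against the server's directory listing, one option is to scrape the index page and filter its links. A minimal sketch with rvest (untested; it assumes the index page lists each zip file as an <a> element, as Apache-style listings do):
library(rvest)

index_url <- "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/"
links <- read_html(index_url) %>%
  html_elements("a") %>%
  html_attr("href")

# Keep only links matching the AIS_2019_*_18.zip pattern
wanted <- grep("AIS_2019_\\d{2}_18\\.zip$", links, value = TRUE)
urls <- paste0(index_url, basename(wanted))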
Here's a way to generate all the links; you can then loop through them with a for loop.
library(glue)
library(stringr)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
# Setup
month_dates <- glue("2019-{str_pad(1:12, width = 2, pad = '0')}-01")
days_in_months <- days_in_month(as.Date(month_dates))

# Get the day and month combinations, both zero-padded to match the
# AIS_2019_MM_DD.zip naming used on the server
months <- rep(str_pad(1:12, width = 2, pad = "0"), days_in_months)
days <- unlist(mapply(function(x) str_pad(1:x, width = 2, pad = "0"),
                      days_in_months))

base_url <- "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019"

# Put everything together
all_files <- glue("{base_url}/AIS_2019_{months}_{days}.zip")

# See results
head(all_files)
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_01.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_02.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_03.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_04.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_05.zip
#> https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/AIS_2019_01_06.zip

# Check that the number of files matches the days in a year
length(all_files)
#> [1] 365
Created on 2021-08-04 by the reprex package (v2.0.0)
Once you have those created, you can do something like:
# Untested: download each file in turn. mode = "wb" keeps the zips intact
# on Windows; the wget recursion flags aren't needed for direct file URLs.
for (file in all_files) {
  download.file(file, destfile = basename(file), mode = "wb")
}

Reading PDF portfolio in R

Is it possible to read/convert PDF portfolios in R?
I usually use pdftools, however, I get an error:
library(pdftools)
#> Using poppler version 0.73.0
link <- c("http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf")
pdftools::pdf_convert(link, dpi = 600)
#> Converting page 1 to K190072_1.png...
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> PDF error: Non conformant codestream TPsot==TNsot.<0a>
#> done!
#> [1] "K190072_1.png"
Created on 2021-05-06 by the reprex package (v1.0.0)
The K190072_1.png I end up with is only an image of the portfolio's cover page.
I am interested in the document K190072.510kSummary.Final_Sent001.pdf inside this PDF portfolio.
I found a way to do it in Python (Reading a PDF Portfolio in Python?) but I would really like to do it in R.
Thank you for your help.
There seems to be an issue with pdf_convert when it is handed one-page raw PDF data (it tries to use basename(pdf) in that case), so I have edited the function so that it also works with the second attached PDF file.
If you only need the first file, you could run this with the original pdf_convert function, but it will give an error with the second file.
If you are interested in rendering raster graphics from the attached files, this worked for me:
library(pdftools)
#> Using poppler version 21.02.0
link <- c("http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf")
pdf_convert <- function(pdf, format = "png", pages = NULL, filenames = NULL,
                        dpi = 72, antialias = TRUE, opw = "", upw = "",
                        verbose = TRUE) {
  config <- poppler_config()
  if (!config$can_render || !length(config$supported_image_formats))
    stop("Your version of libpoppler does not support rendering")
  format <- match.arg(format, poppler_config()$supported_image_formats)
  if (is.null(pages))
    pages <- seq_len(pdf_info(pdf, opw = opw, upw = upw)$pages)
  if (!is.numeric(pages) || !length(pages))
    stop("Argument 'pages' must be a one-indexed vector of page numbers")
  if (length(filenames) < 2 & !is.raw(pdf)) { # added !is.raw(pdf)
    input <- sub(".pdf", "", basename(pdf), fixed = TRUE)
    filenames <- if (length(filenames)) {
      sprintf(filenames, pages, format)
    } else {
      sprintf("%s_%d.%s", input, pages, format)
    }
  }
  if (length(filenames) != length(pages))
    stop("Length of 'filenames' must be one or equal to 'pages'")
  antialiasing <- isTRUE(antialias) || isTRUE(antialias == "draw")
  text_antialiasing <- isTRUE(antialias) || isTRUE(antialias == "text")
  pdftools:::poppler_convert(pdftools:::loadfile(pdf), format, pages,
                             filenames, dpi, opw, upw, antialiasing,
                             text_antialiasing, verbose)
}
lapply(pdf_attachments(link), function(x)
  pdf_convert(x$data,
              filenames = paste0(tools::file_path_sans_ext(x$name), "-",
                                 seq_along(pdf_data(x$data)), ".png")))
#> Converting page 1 to K190072.510kSummary.Final_Sent001-1.png... done!
#> Converting page 2 to K190072.510kSummary.Final_Sent001-2.png... done!
#> Converting page 3 to K190072.510kSummary.Final_Sent001-3.png... done!
#> Converting page 4 to K190072.510kSummary.Final_Sent001-4.png... done!
#> Converting page 5 to K190072.510kSummary.Final_Sent001-5.png... done!
#> Converting page 1 to K190072.IFU.FINAL_Sent001-1.png... done!
#> Converting page 1 to K190072.Letter.SE.FINAL_Sent001-1.png... done!
#> Converting page 2 to K190072.Letter.SE.FINAL_Sent001-2.png... done!
#> [[1]]
#> [1] "K190072.510kSummary.Final_Sent001-1.png"
#> [2] "K190072.510kSummary.Final_Sent001-2.png"
#> [3] "K190072.510kSummary.Final_Sent001-3.png"
#> [4] "K190072.510kSummary.Final_Sent001-4.png"
#> [5] "K190072.510kSummary.Final_Sent001-5.png"
#>
#> [[2]]
#> [1] "K190072.IFU.FINAL_Sent001-1.png"
#>
#> [[3]]
#> [1] "K190072.Letter.SE.FINAL_Sent001-1.png"
#> [2] "K190072.Letter.SE.FINAL_Sent001-2.png"
Created on 2021-05-05 by the reprex package (v2.0.0)
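If you only need the attached PDFs as files on disk (for example to run pdf_text on them), a simpler route that avoids patching pdf_convert is to write each attachment's raw data out first; a short, untested sketch:
library(pdftools)

link <- "http://www.accessdata.fda.gov/cdrh_docs/pdf19/K190072.pdf"

# Each attachment carries its file name ($name) and raw bytes ($data)
for (att in pdf_attachments(link)) {
  writeBin(att$data, att$name)
}

# The stock pdftools functions then work on the file paths directly
txt <- pdf_text("K190072.510kSummary.Final_Sent001.pdf")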

Error despite purrr's 'otherwise' - Why is purrr/possibly's 'otherwise' not triggered?

I am scraping content from websites. For this I iterate over links. If an error occurs, purrr's possibly adverb should keep the process going and insert "missing" (or NA_character_) as the result.
The code below works as intended when the linked site does not exist, i.e. the output is "missing".
However, if the linked site exists but the element I am trying to extract from it does not, the function throws an error despite a value having been defined for 'otherwise'.
To me this is surprising, since the documentation states that
'possibly: wrapped function uses a default value (otherwise) whenever an error occurs.'
Any idea why this is happening? I understand that I could modify the function accordingly (e.g. check the length of the returned object), but I do not understand why the 'otherwise' value was not used.
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.4
#> Warning: package 'tidyr' was built under R version 4.0.4
#> Warning: package 'dplyr' was built under R version 4.0.4
library(rvest)
#> Warning: package 'rvest' was built under R version 4.0.4
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
# possibly with wrong links when scraping site ----------------------------
#see https://github.com/tidyverse/purrr/issues/409
sample_data <- tibble::tibble(
  link = c(
    # link ok, selected item exists
    "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll",
    # link not ok
    "https://www.wrong-url.foobar",
    # link ok, selected item does not exist on site
    "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
  )
)

fn_get_link_to_records <- function(link_to_overview_sessions) {
  print(link_to_overview_sessions)
  link_to_overview_sessions %>%
    rvest::read_html() %>%
    rvest::html_elements("a") %>%
    rvest::html_attr("href") %>%
    enframe(name = NULL, value = "link_to_text") %>%
    filter(str_detect(link_to_text,
                      regex("\\/NRSITZ_\\d+\\/fnameorig_\\d+\\.html$"))) %>%
    mutate(link_to_text = glue::glue("https://www.parlament.gv.at/{link_to_text}")) %>%
    pull()
}
sample_data %>%
  mutate(link_to_text = map_chr(link,
                                possibly(fn_get_link_to_records,
                                         otherwise = NA_character_)))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 3 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = NA_character_))`.
sample_data %>%
  mutate(link_to_text = map_chr(link,
                                possibly(fn_get_link_to_records,
                                         otherwise = "missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 3 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = "missing"))`.
Created on 2021-03-28 by the reprex package (v1.0.0)
UPDATE: I added the output below to make the unexpected result (last chunk) clearer.
sample_data[1:2, ] %>%
  mutate(link_to_text = map_chr(link,
                                possibly(fn_get_link_to_records,
                                         otherwise = "missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00068/index.shtml#tab-Sten.Protokoll"
#> [1] "https://www.wrong-url.foobar"
#> # A tibble: 2 x 2
#> link link_to_text
#> <chr> <chr>
#> 1 https://www.parlament.gv.at/PAKT/VHG~ https://www.parlament.gv.at//PAKT/VHG/X~
#> 2 https://www.wrong-url.foobar missing
sample_data[3, ] %>%
  mutate(link_to_text = map_chr(link,
                                possibly(fn_get_link_to_records,
                                         otherwise = "missing")))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> Error: Problem with `mutate()` input `link_to_text`.
#> x Result 1 must be a single string, not a vector of class `glue/character` and of length 0
#> i Input `link_to_text` is `map_chr(link, possibly(fn_get_link_to_records, otherwise = "missing"))`.
Created on 2021-03-29 by the reprex package (v1.0.0)
The error is coming from map_chr, but you have possibly wrapped around the fn_get_link_to_records function. If you run fn_get_link_to_records(sample_data$link[3]) you'll see the URL gets printed, a zero-length result is returned, and no error is generated. possibly only steps in on errors, and map_chr cannot coerce this empty output to a single character value, hence the error. If you use map instead of map_chr you'll see it works.
sample_data[3, ] %>%
  mutate(link_to_text = map(link, fn_get_link_to_records))
#> [1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Protokoll"
#> # A tibble: 1 x 2
#>   link                                                                                      link_to_text
#>   <chr>                                                                                     <list>
#> 1 https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00094/index.shtml#tab-Sten.Pro… <glue [0]>
but link_to_text is empty. The solution, as you already know, is to check the length of the output inside fn_get_link_to_records and either return NA or raise an error for such cases, which possibly will then handle.
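One way to apply that suggestion is a thin wrapper that turns the empty result into a real error, which possibly() can then catch:
# Sketch: signal an error on a zero-length result so that possibly()
# substitutes `otherwise` for it
fn_get_link_to_records_safe <- function(link) {
  result <- fn_get_link_to_records(link)
  if (length(result) == 0) stop("no matching record link found")
  result
}

sample_data %>%
  mutate(link_to_text = map_chr(link,
                                possibly(fn_get_link_to_records_safe,
                                         otherwise = "missing")))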

Replacing the attribute value of an htmltools::tag

Say I have the following tag:
library(htmltools)
t = div(name = 'oldname')
I can overwrite the 'name' attribute of this tag using t$attribs$name = 'newname', but I would prefer to use htmltools getters/setters. Does the package have a function that facilitates this?
Looking through the package manual, the only function that allows manipulating tag attributes is tagAppendAttributes, which only appends the new attribute value to the original:
t = tagAppendAttributes(t, name = 'newname')
t
#<div name="oldname newname"></div>
Does the absence of a helper function that overwrites the value of an attribute mean that tag attributes are not meant to be overwritten?
You're probably overthinking this. Look at the code for tagAppendAttributes:
tagAppendAttributes
#> function (tag, ...)
#> {
#> tag$attribs <- c(tag$attribs, list(...))
#> tag
#> }
All it does is take whatever you pass and write directly to tag$attribs. If you unclass your object you'll see it's just a list really:
unclass(t)
#> $name
#> [1] "div"
#>
#> $attribs
#> $attribs$name
#> [1] "oldname"
#>
#>
#> $children
#> list()
I can see why writing directly to an object's data member rather than using a setter might not feel right if you come from an object-oriented programming background, but this is clearly a "public" data member in an informal S3 class. Setting it directly is no more likely to break anything than any other implementation.
If you really want to I suppose you could define a setter:
tagSetAttributes <- function(tag, ...) {
  tag$attribs <- list(...)
  tag
}
tagSetAttributes(t, name = "new name")
#> <div name="new name"></div>
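If you want a setter that overwrites only the attributes you name while keeping any others, a small variation on the above (my own sketch, not an htmltools function) would be:
tagReplaceAttributes <- function(tag, ...) {
  new <- list(...)
  tag$attribs[names(new)] <- new  # overwrite named entries, keep the rest
  tag
}

t2 <- div(name = "oldname", class = "box")
tagReplaceAttributes(t2, name = "newname")
#> <div name="newname" class="box"></div>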

Can I filter out certain rows/records when retrieving data from Salesforce using the RForcecom function "rforcecom.retrieve"?

Thanks for helping me with my first Stack Overflow question. I am trying to retrieve all the data from several fields in an Object called "Applied Questionnaire"; however, I do not want to retrieve any records that have the name "Training Site".
Currently, this is my code, which works:
quarterly_site_scores = rforcecom.retrieve(session, "AppliedQuestionnaire__c",
                                           c("Site__c", "Site_Name__c", "Total_Score__c")) %>%
  rename(site_id = Site__c, site_name = Site_Name__c)

quarterly_site_scores = quarterly_site_scores[!(quarterly_site_scores$site_name == "TRAINING PARK SITE" |
                                                  quarterly_site_scores$status != "Completed"), ]
However, I'm wondering if there's a more elegant, streamlined solution. Can I filter at the same time as I retrieve? Or is there a better way to filter here?
(I've simplified the code here; I'm actually pulling in about ten fields and filtering on five or six criteria, just in this one example.)
Thank you.
Adding what the OP discovered as an answer: use the salesforcer package, which returns the SOQL result set as a tbl_df.
library(salesforcer)
library(tidyverse)
sf_auth(username, password, security_token)

# list all object names in a Salesforce org
ped_objects <- sf_list_objects() %>% .$sobjects %>% map_chr(~pluck(., "name"))

# list all the fields on a particular object
fields <- sf_describe_object_fields('AppliedQuestionnaireBundle2__c')

# write a query to retrieve certain records from that object
site_scores_soql <- "SELECT Site__c,
                            Site_Name__c,
                            Total_Score__c
                     FROM AppliedQuestionnaireBundle2__c
                     WHERE Site_Name__c != 'GENERIC SITE'
                       AND Site_Name__c != 'TRAINING PARK SITE'
                       AND Status__c = 'Completed'"

# run the query
quarterly_site_scores <- sf_query(site_scores_soql)
quarterly_site_scores
#> # A tibble: 3 x 3
#> Site__c Site_Name__c Total_Score__c
#> <chr> <chr> <dbl>
#> 1 A Site Name1 78
#> 2 B Site Name2 52
#> 3 C Site Name3 83
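For completeness: the same filter-at-retrieval idea also works without switching packages, since RForcecom exposes SOQL through rforcecom.query. A sketch, assuming the session object and field names from the question:
soql <- "SELECT Site__c, Site_Name__c, Total_Score__c
         FROM AppliedQuestionnaire__c
         WHERE Site_Name__c != 'TRAINING PARK SITE'
           AND Status__c = 'Completed'"
quarterly_site_scores <- rforcecom.query(session, soql)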
