I need help extracting information from a PDF file in R
(for example https://arxiv.org/pdf/1701.07008.pdf).
I'm using pdftools, but sometimes pdf_info() doesn't work, and in that case I can't manage to extract the information automatically with pdf_text().
NB: tabulizer didn't work on my PC.
Here is the processing I'm doing (sorry, you need to save the PDF and run it with your own path):
# title, key, auth, dom and metadata are vectors accumulated across files in a loop
info <- pdf_info(paste0(path_folder, "/", pdf_path))
title <- c(title, info$keys$Title)
key <- c(key, info$keys$Keywords)
auth <- c(auth, info$keys$Author)
dom <- c(dom, info$keys$Subject)
metadata <- c(metadata, info$metadata)
I would like to get the title and the abstract most of the time.
We will need to make some assumptions about the structure of the PDF we wish to scrape. The code below makes the following assumptions:
The title and abstract are on page 1 (a fair assumption?)
The title is set at height 15
The abstract lies between the first occurrence of the word "Abstract" and the first occurrence of the word "Introduction"
library(tidyverse)
library(pdftools)

data <- pdf_data("~/Desktop/scrape.pdf")

# Get the first page
page_1 <- data[[1]]

# Get the title; here we assume it is of height 15
title <- page_1 %>%
  filter(height == 15) %>%
  .$text %>%
  paste0(collapse = " ")

# Get the abstract: the words between "Abstract." and the heading
# "1 Introduction" (the -2 drops the section number and the heading itself)
abstract_start <- which(page_1$text == "Abstract.")[1]
introduction_start <- which(page_1$text == "Introduction")[1]
abstract <- page_1$text[abstract_start:(introduction_start - 2)] %>%
  paste0(collapse = " ")
You can, of course, work off of this and impose stricter constraints for your scraper.
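For instance, instead of hard-coding the title height, you could treat the largest text on page 1 as the title. A minimal sketch (the biggest-font heuristic is my own assumption, not something pdftools guarantees):

library(pdftools)
library(dplyr)

data <- pdf_data("~/Desktop/scrape.pdf")
page_1 <- data[[1]]

# Assume the title is whatever is set in the largest font on page 1
title <- page_1 %>%
  filter(height == max(height)) %>%
  pull(text) %>%
  paste0(collapse = " ")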
I am working in Quarto to tabulate some results from a qualitative data analysis and present them in a {DT} or {gt} table.
I have a placeholder string in the table I'm receiving from another data source, but I cannot seem to replace that placeholder with one or more line breaks to make the entries easier to read in the resulting DT or gt table.
Thanks for your collective help!
library(tidyverse)
library(DT)
library(gt)
df_text <- tibble::tibble(
  index = c("C-1", "C-2"),
  finding = c("A finding on a single line.",
              "A finding with a return in the middle.<return>Second line is usually some additional piece of context or a quote.")
) %>%
  dplyr::mutate(finding = stringr::str_replace_all(finding, "<return>", "\\\n\\\n"))

DT::datatable(df_text)
gt::gt(df_text)
For gt you need:
gt::gt(df_text) |>
  tab_style(style = cell_text(whitespace = "pre"),
            locations = cells_body())
For DT, you could modify the column holding the text so that it is HTML and then tell DT to respect your HTML:
df_text <- tibble::tibble(
  index = c("C-1", "C-2"),
  finding = c("A finding on a single line.",
              "A finding with a return in the middle.<return>Second line is usually some additional piece of context or a quote.")
) %>%
  dplyr::mutate(finding = paste0("<HTML>",
                                 stringr::str_replace_all(finding, "<return>", "</br></br>"),
                                 "</HTML>"))

DT::datatable(df_text, escape = FALSE)
Based on @Nir Graham's answer, I wrote a function. My first draft's type = "dt" branch wasn't working because, without an else, the function fell through to the second if and returned NULL; chaining the branches with else fixes it (Nir's recommendation to dplyr::mutate() inline works either way).
fx_add_returns <- function(x, type = c("dt", "gt")) {
  type <- match.arg(type)
  if (type == "dt") {
    paste0("<HTML>",
           stringr::str_replace_all(x, "<return>", "</br></br>"),
           "</HTML>")
  } else if (type == "gt") {
    stringr::str_replace_all(x, "<return>", "\\\n\\\n")
  }
}
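A quick usage sketch (assuming the df_text tibble from above, built before any placeholder replacement):

# DT: convert the placeholder to HTML line breaks and disable escaping
df_text %>%
  dplyr::mutate(finding = fx_add_returns(finding, type = "dt")) %>%
  DT::datatable(escape = FALSE)

# gt: convert to newlines and tell gt to preserve whitespace
df_text %>%
  dplyr::mutate(finding = fx_add_returns(finding, type = "gt")) %>%
  gt::gt() %>%
  gt::tab_style(style = gt::cell_text(whitespace = "pre"),
                locations = gt::cells_body())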
I'd like to format references to academic papers in different citation styles with R.
With the rcrossref package, I can easily create citations for articles based on their DOIs in the style I specify. However, not all papers have a DOI, so I'm looking for an easy way to get text citations in different styles based on the article info from a BibTeX entry or some other type of input.
Using rcrossref:
The package supports 2209 different styles (length(rcrossref::get_styles())).
For example, you can get text citations to some highly cited papers (DOIs from this source: https://doi.org/10.1038/514550a) in different styles as follows:
library(rcrossref)

# some DOIs of interest
dois <- c("10.1038/514550a", "10.1038/227680a0", "10.1016/0003-2697(76)90527-3",
          "10.1073/Pnas.74.12.5463", "10.1016/0003-2697(87)90021-2",
          "10.1107/S0108767307043930")

# APA CV style
cr_cn(dois = dois, format = "text", style = "apa-cv")
# same with Chicago style
cr_cn(dois = dois, format = "text", style = "chicago-note-bibliography")
# same with Vancouver style
cr_cn(dois = dois, format = "text", style = "vancouver")
Now, say I have an entry without a DOI, e.g. in BibTeX format, like:
@article{PMID:14907713,
  Title = {Protein measurement with the Folin phenol reagent},
  Author = {LOWRY, OH and ROSEBROUGH, NJ and FARR, AL and RANDALL, RJ},
  Number = {1},
  Volume = {193},
  Month = {November},
  Year = {1951},
  Journal = {The Journal of biological chemistry},
  ISSN = {0021-9258},
  Pages = {265--275},
  URL = {http://www.jbc.org/content/193/1/265.long}
}
and I'd like to format this entry too, e.g. in APA CV, Chicago and Vancouver styles, and get the result as text. How can I do that? I haven't found a function for it. Is there any way currently available for this task?
Thank you!
So it doesn't look like rcrossref supports this, because everything happens on Crossref's API server and there doesn't appear to be a way to pass a raw BibTeX entry that doesn't have a DOI.
However, it does appear that pandoc, which is usually installed with RStudio and is used by rmarkdown, has support for citation formatting. I did a bit of reverse engineering to see whether it would be possible to produce just the citation for a given entry. Here's the function I've created:
citation <- function(bib, csl="chicago-author-date.csl", toformat="plain",
                     cslrepo="https://raw.githubusercontent.com/citation-style-language/styles/master") {
  if (!file.exists(bib)) {
    message("Assuming input is literal bibtex entry")
    tmpbib <- tempfile(fileext = ".bib")
    on.exit(unlink(tmpbib), add=TRUE)
    if (!validUTF8(bib)) {
      bib <- iconv(bib, to="UTF-8")
    }
    writeLines(bib, tmpbib)
    bib <- tmpbib
  }
  if (tools::file_ext(csl) != "csl") {
    warning("CSL file name should end in '.csl'")
  }
  if (!file.exists(csl)) {
    cslurl <- file.path(cslrepo, csl)
    message(paste("Downloading CSL from", cslurl))
    cslresp <- httr::GET(cslurl, httr::write_disk(csl))
    if (httr::http_error(cslresp)) {
      stop(paste("Could not download CSL.", "Code:", httr::status_code(cslresp)))
    }
  }
  # Write a tiny markdown doc whose only job is to emit the full bibliography
  tmpcit <- tempfile(fileext = ".md")
  on.exit(unlink(tmpcit), add=TRUE)
  writeLines(c("---", "nocite: '@*'", "---"), tmpcit)
  rmarkdown::find_pandoc()
  command <- paste(shQuote(rmarkdown:::pandoc()),
                   "--filter", "pandoc-citeproc",
                   "--to", shQuote(toformat),
                   "--csl", shQuote(csl),
                   "--bibliography", shQuote(bib),
                   shQuote(tmpcit))
  rmarkdown:::with_pandoc_safe_environment({
    result <- system(command, intern = TRUE)
    Encoding(result) <- "UTF-8"
  })
  result
}
You can pass in your reference, and it will convert it using a standard "CSL" file. These CSL files are what control the formatting. There is a giant repo of CSL files for different formats at https://github.com/citation-style-language/styles (the cslrepo default above points at its raw content). You can specify a CSL file, and if it doesn't exist locally, this function will automatically download it from the repo.
You can either pass in a "raw" citation
test <- "@article{PMID:14907713, Title = {Protein measurement with the Folin phenol reagent}, Author = {LOWRY, OH and ROSEBROUGH, NJ and FARR, AL and RANDALL, RJ}, Number = {1}, Volume = {193}, Month = {November}, Year = {1951}, Journal = {The Journal of biological chemistry}, ISSN = {0021-9258}, Pages = {265-275}, URL = {http://www.jbc.org/content/193/1/265.long} } "
citation(test)
Or if the data was in a file, you could use the file name
writeLines(test, "test.bib")
citation("test.bib")
And if you want to use a different CSL, you can just set the name of the CSL file in the csl= parameter:
citation("test.bib", csl="apa-cv.csl")
citation("test.bib", csl="chicago-note-bibliography.csl")
citation("test.bib", csl="vancouver.csl")
I have some paragraphs, and for each paragraph I have different keywords. For example:
I am a student. I like machine learning...
Here my keywords are student and machine learning. I want to give them different colors, such as red for student and yellow for machine learning. So the result should be the sentence with each keyword highlighted in its color.
Can I use R to do this, and how?
Also, I know Python can somehow do this. For example:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')  # any English pipeline works here
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)
Here, the result is the sentence rendered with the recognized entities highlighted.
But this looks like it is just for named entities. In my case the keywords are extracted by myself, so it might be different.
As mentioned in the comments, I created a small package for this purpose some time ago. It is still pretty experimental and can currently only be used in R Markdown; used interactively, it will instead open a browser window (the Viewer pane in RStudio) to display the text.
# devtools::install_github("JBGruber/highlightr")
library(highlightr)
text <- "I am a student. I like machine learning..."
df <- data.frame(
feature = c("student", "machine learning"),
bg_colour = c("red", "yellow"),
stringsAsFactors = FALSE
)
dict <- as_dict(df)
highlight(text, dict)
Or inside an R Markdown document:
---
output: html_document
---
```{r , results='asis'}
library(highlightr)
text <- "I am a student. I like machine learning..."
df <- data.frame(
feature = c("student", "machine learning"),
bg_colour = c("red", "yellow"),
stringsAsFactors = FALSE
)
dict <- as_dict(df)
highlight(text, dict)
```
The package is built on some very straightforward manipulation of the HTML output:
# bg_colour (excerpt from the package internals: text[i] is the current text
# element and case_insensitive is an argument of highlight())
for (j in seq_along(dict$feature)) {
  text[i] <- stringi::stri_replace_all_fixed(
    str = text[i],
    pattern = dict$feature[j],
    replacement = paste0("<span style='background-color: ",
                         dict$bg_colour[j], "'>",
                         dict$feature[j], "</span>"),
    opts_fixed = stringi::stri_opts_fixed(case_insensitive = case_insensitive)
  )
}
All I do here is add <span style='background-color: yellow'> before a word that should be highlighted and </span> after it. When I have time I will do the same for LaTeX output, and maybe more. The reason for using stringi for a simple replacement job is that the replacement can be done case-insensitively while the pattern is still treated as a fixed string rather than a regex.
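To illustrate the mechanism outside the package, here is a minimal stand-alone sketch of the same idea using plain stringi (no highlightr):

library(stringi)

text <- "I am a student. I like machine learning..."

# Wrap each keyword in a <span> carrying its background colour
text <- stri_replace_all_fixed(text, "student",
                               "<span style='background-color: red'>student</span>")
text <- stri_replace_all_fixed(text, "machine learning",
                               "<span style='background-color: yellow'>machine learning</span>")

text
# "I am a <span style='background-color: red'>student</span>. I like <span style='background-color: yellow'>machine learning</span>..."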
I am curious how to access additional attributes of a graph that are associated with its edges. To follow along, here is a minimal example:
library("igraph")
library("SocialMediaLab")
myapikey =''
myapisecret =''
myaccesstoken = ''
myaccesstokensecret = ''
tweets <- Authenticate("twitter",
apiKey = myapikey,
apiSecret = myapisecret,
accessToken = myaccesstoken,
accessTokenSecret = myaccesstokensecret) %>%
Collect(searchTerm="#trump", numTweets = 100,writeToFile=FALSE,verbose=TRUE)
g_twitter_actor <- tweets %>% Create("Actor", writeToFile=FALSE)
c <- igraph::components(g_twitter_actor, mode = 'weak')
subCluster <- induced.subgraph(g_twitter_actor, V(g_twitter_actor)[which(c$membership == which.max(c$csize))])
The initial tweets data frame contains the following columns:
colnames(tweets)
[1] "text" "favorited" "favoriteCount" "replyToSN" "created_at" "truncated" "replyToSID" "id"
[9] "replyToUID" "statusSource" "screen_name" "retweetCount" "isRetweet" "retweeted" "longitude" "latitude"
[17] "from_user" "reply_to" "users_mentioned" "retweet_from" "hashtags_used"
How can I access the text property for the subgraph in order to perform text analysis?
E(subCluster)$text does not work
E(subCluster)$text does not work because the values from tweets$text are not added to the graph when it is made, so you have to do that manually. It's a bit of a pain, but doable. It requires some subsetting of the tweets data frame and matching based on user names.
First, notice that the edge types are in a particular order: retweets, then mentions, then replies. The same text from a particular user can apply to all three of these, so I think it makes sense to add the text serially.
> unique(E(g_twitter_actor)$edgeType)
[1] "Retweet" "Mention" "Reply"
Using dplyr and reshape2 makes this easier.
library(reshape2)
library(dplyr)

# Make a data frame each for retweets, mentions and replies
rts <- tweets %>% filter(!is.na(retweet_from))
ms <- tweets %>% filter(users_mentioned != "character(0)")
rpls <- tweets %>% filter(!is.na(reply_to))
Since users_mentioned can contain a list of individuals, we have to unlist it. But we want to associate the users mentioned with the user who mentioned them.
# Name each element in the users_mentioned list after the user who mentioned them
names(ms$users_mentioned) <- ms$screen_name
# Melting creates a data frame pairing each mentioned user with the mentioning user
ms <- melt(ms$users_mentioned)
# Add the text (column 1 of tweets is "text"; L1 holds the mentioning user's name)
ms$text <- tweets[match(ms$L1, tweets$screen_name), 1]
Now add each of these to the network as an edge attribute by matching the edge type.
E(g_twitter_actor)$text[E(g_twitter_actor)$edgeType %in% "Retweet"] <- rts$text
E(g_twitter_actor)$text[E(g_twitter_actor)$edgeType %in% "Mention"] <- ms$text
E(g_twitter_actor)$text[E(g_twitter_actor)$edgeType %in% "Reply"] <- rpls$text
Now you can subset and get the edge value for text.
subCluster <- induced.subgraph(g_twitter_actor,
V(g_twitter_actor)[which(c$membership == which.max(c$csize))])
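With the attribute added before subsetting, the original query now works on the subgraph:

# The text edge attribute survives induced.subgraph()
head(E(subCluster)$text)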
I'm trying to save HTML code chunks from two different pages as rows in a CSV file.
Take two links
Use a loop to visit the links and select two HTML code chunks using rvest
Print them using sapply
Want to write the output as a row in a CSV file (I need help with this)
I can see the HTML chunks in the console but can't save them in a CSV. I want to save the HTML code rather than the values. I used IMDb just for code-replication purposes.
library(rvest)

movielinks <- c("http://www.imdb.com/movies-coming-soon/?ref_=inth_cs",
                "http://www.imdb.com/movies-in-theaters/?ref_=nv_tp_inth_1")
moviesheet <- NULL

for (mov in 1:length(movielinks)) {
  # print(mov)
  pageurl <- paste0(movielinks[mov])
  # print(pageurl)
  movieurl <- html(pageurl)
  movie_name <- movieurl %>%
    html_nodes("h4 a") # find all links
  strings <- paste(sapply(movie_name, function(x) { print(x) }))
  moviesheet <- rbind(moviesheet, strings)
}
write.csv(moviesheet, "moviesheet.csv")
The final outcome should be something like this:
Product  Price  HtmlCode
Soap     20     <a href="/title/tt3691740/?ref_=cs_ov_tt" title="The BFG (2016)" itemprop="url"> The BFG (2016)</a>