I'm trying to extract a table from a PDF with the R tabulizer package. The functions run fine, but they don't capture all of the data in the table.
Below is my code:
library(tabulizer)
library(tidyverse)
library(abjutils)
D_path = "https://github.com/financebr/files/raw/master/Compacto09-08-2019.pdf"
out <- extract_tables(D_path,encoding = 'UTF-8')
arrumar_nomes <- function(x) {
x %>%
tolower() %>%
str_trim() %>%
str_replace_all('[[:space:]]+', '_') %>%
str_replace_all('%', 'p') %>%
str_replace_all('r\\$', '') %>%
abjutils::rm_accent()
}
tab_tidy <- out %>%
map(as_tibble) %>%
bind_rows() %>%
set_names(arrumar_nomes(.[1,])) %>%
slice(-1) %>%
mutate_all(funs(str_replace_all(., '[[:space:]]+', ' '))) %>%
mutate_all(str_trim)
Comparing the PDF table (D_path) with the tab_tidy data, you can see that some information is missing. The first column of each table, which contains merged cells, is not picked up by extract_tables(). The rows containing the "Boi Gordo" and "Boi Magro" information are also missed.
The rest comes through in perfect condition. Do you know why, and how to solve it? The existing questions on this forum about this don't have many answers.
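For reference, a minimal sketch of the extract_tables() arguments that are usually worth experimenting with in cases like this. method, guess, area and columns are documented arguments of tabulizer::extract_tables(); whether they recover the merged first column for this particular PDF is untested, so treat this as a sketch rather than a fix:
library(tabulizer)

# Force the lattice algorithm, which keys off ruling lines and often handles
# merged/spanning cells differently from the default heuristics.
out_lattice <- extract_tables(D_path, encoding = 'UTF-8', method = "lattice")

# Or turn off guessing and supply the table area by hand; locate_areas() lets
# you draw the region interactively. The page number below is a placeholder,
# not a value taken from this PDF.
# area_coords <- locate_areas(D_path, pages = 1)
# out_manual  <- extract_tables(D_path, encoding = 'UTF-8', pages = 1,
#                               guess = FALSE, area = area_coords)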
I'm trying to find a way to copy-paste the title and the abstract from a PubMed page.
I started using
browseURL("https://pubmed.ncbi.nlm.nih.gov/19592249") ## final numbers are the PMID
but now I can't find a way to get the title and the abstract as plain text. I have to do this for multiple PMIDs, so I need to automate it. It would also be fine to simply copy everything on the page and then keep only what I need.
Is it possible to do that? Thanks!
I suppose what you're trying to do is scrape PubMed for articles of interest?
Here's one way to do this using the rvest package:
#Required libraries.
library(magrittr)
library(rvest)
#Function.
getpubmed <- function(url) {
  dat <- rvest::read_html(url)
  pid <- dat %>% html_elements(xpath = '//*[@title="PubMed ID"]') %>% html_text2() %>% unique()
  ptitle <- dat %>% html_elements(xpath = '//*[@class="heading-title"]') %>% html_text2() %>% unique()
  pabs <- dat %>% html_elements(xpath = '//*[@id="enc-abstract"]') %>% html_text2()
  return(data.frame(pubmed_id = pid, title = ptitle, abs = pabs, stringsAsFactors = FALSE))
}
#Test run.
urls <- c("https://pubmed.ncbi.nlm.nih.gov/19592249", "https://pubmed.ncbi.nlm.nih.gov/22281223/")
df <- do.call("rbind", lapply(urls, getpubmed))
The code should be fairly self-explanatory. (I've not added the contents of df here for brevity.) The function getpubmed does no error-handling or anything of that sort, but it is a start. By supplying a vector of URLs to the do.call("rbind", lapply(urls, getpubmed)) construct, you can get back a data.frame consisting of the PubMed ID, title, and abstract as columns.
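Since the goal was to get the title and abstract into text files, here is a small follow-up sketch; the file naming scheme is arbitrary and just for illustration:
# Write one .txt file per article, containing the title followed by the abstract.
for (k in seq_len(nrow(df))) {
  writeLines(c(df$title[k], "", df$abs[k]),
             con = paste0("pubmed_", df$pubmed_id[k], ".txt"))
}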
Another option would be to explore the easyPubMed package.
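I haven't tested it against these exact PMIDs, but the basic easyPubMed workflow looks roughly like this (get_pubmed_ids() and fetch_pubmed_data() are the package's core functions; treat the details, especially the query string and format, as a sketch):
library(easyPubMed)

# Query by PMID and pull the record back as plain-text abstract.
ids <- get_pubmed_ids("19592249[PMID]")
txt <- fetch_pubmed_data(ids, format = "abstract")
writeLines(txt, "pubmed_19592249_easyPubMed.txt")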
I would also use a function and rvest. However, I would pass the PMID in as the function argument, use html_node() since only a single node needs to be matched, and use faster CSS selectors. The string cleaning is done with the stringr package:
library(rvest)
library(stringr)
library(dplyr)
get_abstract <- function(pid) {
  page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))
  df <- tibble(
    title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
    abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
  )
  return(df)
}
get_abstract('19592249')
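To run this over many PMIDs, one option is a purrr sketch like the following (the PMIDs are just the two used above):
library(purrr)

pids <- c('19592249', '22281223')
abstracts <- map_dfr(pids, get_abstract)   # one row per PMID: title + abstract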
Error in unnest_tokens.data.frame(., entity, text, token = tokenize_scispacy_entities, :
Expected output of tokenizing function to be a list of length 100
unnest_tokens() works well on a sample of a few observations but fails on the entire dataset.
https://github.com/dgrtwo/cord19
Reproducible example:
library(dplyr)
library(cord19)
library(tidyverse)
library(tidytext)
library(spacyr)
# Install the model from here: https://github.com/allenai/scispacy
spacy_initialize("en_core_sci_sm")
tokenize_scispacy_entities <- function(text) {
  spacy_extract_entity(text) %>%
    group_by(doc_id) %>%
    nest() %>%
    pull(data) %>%
    map("text") %>%
    map(str_to_lower)
}
paragraph_entities <- cord19_paragraphs %>%
  select(paper_id, text) %>%
  sample_n(10) %>%
  unnest_tokens(entity, text, token = tokenize_scispacy_entities)
I faced the same problem. I don't know the reason why, but after I filter out empty abstracts and very short abstract strings, everything seems to work just fine.
abstract_entities <- article_data %>%
  filter(nchar(abstract) > 30) %>%
  select(paper_id, title, abstract) %>%
  sample_n(1000) %>%
  unnest_tokens(entity, abstract, token = tokenize_scispacy_entities)
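Applied to the paragraph-level example from the question, the same idea would look something like this (the 30-character threshold is arbitrary; the guess is that spacy_extract_entity() returns nothing for empty or very short texts, which breaks the length check inside unnest_tokens()):
# Drop empty and very short paragraphs before entity tokenization.
paragraph_entities <- cord19_paragraphs %>%
  filter(nchar(text) > 30) %>%
  select(paper_id, text) %>%
  sample_n(10) %>%
  unnest_tokens(entity, text, token = tokenize_scispacy_entities)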
I am trying to replicate an analysis using tidytext in R, except using a loop. The specific example comes from Julia Silge and David Robinson's Text Mining with R: A Tidy Approach. The context for it can be found here: https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join.
In the text, they give an example of how to do sentiment analysis using the NRC lexicon, which has eight different sentiments, including joy, anger, and anticipation. I'm not doing an analysis for a specific book like the example, so I commented out that line, and it still works:
nrc_list <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

wordcount_joy <- wordcount %>%
  # filter(book == "Emma") %>%
  inner_join(nrc_list) %>%
  count(word, sort = TRUE)
As I said before, this works. I now want to modify it to loop over all eight emotions, and save the results in a dataframe labeled with the emotion. How I tried to modify it:
emotion <- c('anger', 'disgust', 'joy', 'surprise', 'anticip', 'fear', 'sadness', 'trust')

for (i in emotion) {
  nrc_list <- get_sentiments("nrc") %>%
    filter(sentiment == "i")
  wcount[[i]] <- wordcount %>%
    inner_join(nrc_list) %>%
    count(word, sort = TRUE)
}
I get an "Error: object 'wcount' not found" message when I do this. I have googled this and it seems like the answers to this question is to use wcount[[i]] but clearly something is off when I tried adapting it. Do you have any suggestions?
The code below will do the trick. Note that you refer to wordcount in your loop while the example uses tidy_books. The code follows the steps from the tidytextmining link you refer to.
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# note: the NRC lexicon labels this sentiment "anticipation", not "anticip"
emotion <- c('anger', 'disgust', 'joy', 'surprise', 'anticipation', 'fear', 'sadness', 'trust')

# initialize list with the length of the emotion vector
wcount <- vector("list", length(emotion))
# name the list entries
names(wcount) <- emotion

# run loop
for (i in emotion) {
  nrc_list <- get_sentiments("nrc") %>%
    filter(sentiment == i)
  wcount[[i]] <- tidy_books %>%
    inner_join(nrc_list) %>%
    count(word, sort = TRUE)
}
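As an aside, the loop can also be avoided entirely by joining the full lexicon once and counting within each sentiment; a sketch of that approach, using the same tidy_books as above:
# Join the complete NRC lexicon in one go, count word frequencies per
# sentiment, then split into a named list comparable to wcount.
nrc_counts <- tidy_books %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(sentiment, word, sort = TRUE)

wcount_alt <- split(nrc_counts[, c("word", "n")], nrc_counts$sentiment)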
I would like to scrape only the candidate names from these tables, along with the votes reported in the third column (after the image and candidate name).
This is as far as I've gotten.
library(rvest)
ndp_leadership <- url('https://en.wikipedia.org/wiki/New_Democratic_Party_leadership_elections')
results <- read_html(ndp_leadership, 'table')
results <- html_nodes(results, 'table')

out <- results %>%
  html_nodes(xpath = "//*[contains(., 'Candidate')]//tr/td")
out
Although this doesn't really use XPath, here's one way to do it:
library(purrr)  # for map()

results <- read_html(ndp_leadership) %>%
  html_nodes(".wikitable") %>%
  html_table(fill = TRUE) %>%
  map(~ .[, 2]) %>%
  unlist() %>%
  setdiff(., c("Candidate", "Total"))
I am working on a project that requires me to go through various pages of links, and within these links find the XML file and parse it. I am having trouble extracting the XML files. There are two XML files within each link and I am interested in the bigger one. How can I extract the XML file and find the one with the maximum size? I tried using the grep function but it keeps giving me an error.
library(rvest)

sotu <- data.frame()

for (i in seq(1, 501, 100)) {
  securl <- paste0("https://www.sec.gov/cgi-bin/srch-edgar?text=abs-ee&start=",
                   i, "&count=100&first=2016")
  main.page <- read_html(securl)
  urls <- main.page %>%
    html_nodes("div td:nth-child(2) a") %>%
    html_attr("href")
  baseurl <- "https://www.sec.gov"
  fulllink <- paste(baseurl, urls, sep = "")
  names <- main.page %>%
    html_nodes("div td:nth-child(2) a") %>%
    html_text()
  date <- main.page %>%
    html_nodes("td:nth-child(5)") %>%
    html_text()
  result <- data.frame(urls = fulllink, companyname = names, FilingDate = date,
                       stringsAsFactors = FALSE)
  sotu <- rbind(sotu, result)
}
for (i in seq(nrow(sotu)))
{
getXML <- read_html(sotu$urls[1]) %>%
grep("xml", getXML, ignore.case=FALSE )
}
Everything works until I try to loop over every link and find the XML file; then I keep getting an error. Is grep not the right function for this?
With some help from dplyr we can do:
sotu %>%
  rowwise() %>%
  do({
    read_html(.$urls) %>%
      html_table() %>%
      as.data.frame() %>%
      filter(grepl('.*\\.xml', Document)) %>%
      filter(Size == max(Size))
  })
or, as the type is always 'EX-102' at least in the example:
sotu %>%
  rowwise() %>%
  do({
    read_html(.$urls) %>%
      html_table() %>%
      as.data.frame() %>%
      filter(Type == 'EX-102')
  })
This also gets rid of the for loop, which is rarely a good idea in R.
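If you prefer to avoid rowwise() + do() (which is superseded in recent dplyr), a rough equivalent with purrr might look like this; it assumes, as above, that each filing page has a single table with Document and Size columns:
library(purrr)

xml_docs <- map_dfr(sotu$urls, function(u) {
  read_html(u) %>%
    html_table() %>%
    pluck(1) %>%                              # first (and assumed only) table on the page
    filter(grepl('.*\\.xml', Document)) %>%   # keep the .xml documents
    filter(Size == max(Size))                 # keep the larger one
})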