Using pdftools in R to extract specific table after a string - r

I have a couple of PDFs and want to extract the shareholders table from each. How can I specify that only the table appearing after the string 'TWENTY LARGEST SHAREHOLDERS' is extracted?
I tried the following, but was not quite sure about the function part.
library("pdftools")
library("tidyverse")
url <- c("https://www.computershare.com/News/Annual%20Report%202019.pdf?2")
raw_text <- map(url, pdf_text)
clean_table <- function(table){
table <- str_split(table, "\n", simplify = TRUE)
table_start <- stringr::str_which(table, "TWENTY LARGEST SHAREHOLDERS")
table <- table[1, (table_start +1 ):(table_end - 1)]
table <- str_replace_all(table, "\\s{2,}", "|")
text_con <- textConnection(table)
data_table <- read.csv(text_con, sep = "|")
colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
}
shares <- map_df(raw_text, clean_table)

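Try this. Besides fixing some minor issues, the main change is that I split every page into lines, concatenate them, and then cut out the lines between the table heading and the next line containing "total". BTW: the heading appears in the PDF as "Twenty Largest Shareholders", not "TWENTY LARGEST SHAREHOLDERS", so the code matches it case-insensitively.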
library(pdftools)
library(tidyverse)
# download pdf
url <- c("https://www.computershare.com/News/Annual%20Report%202019.pdf?2")
raw_text <- map(url, pdf_text)
clean_table1 <- function(raw) {
# Split the single pages
raw <- map(raw, ~ str_split(.x, "\\n") %>% unlist())
# Concatenate the lines of all split pages into one character vector
raw <- reduce(raw, c)
table_start <- stringr::str_which(tolower(raw), "twenty largest shareholders")
table_end <- stringr::str_which(tolower(raw), "total")
table_end <- table_end[min(which(table_end > table_start))]
table <- raw[(table_start + 3 ):(table_end - 1)]
table <- str_replace_all(table, "\\s{2,}", "|")
text_con <- textConnection(table)
data_table <- read.csv(text_con, sep = "|")
colnames(data_table) <- c("Name", "Number of Shares", "Percentage")
data_table
}
shares <- map_df(raw_text, clean_table1)
head(shares)
#> Name Number of Shares
#> 1 J P Morgan Nominees Australia Pty Limited 109,500,852
#> 2 Citicorp Nominees Pty Limited 57,714,777
#> 3 Mr Chris Morris 32,231,000
#> 4 National Nominees Limited 19,355,892
#> 5 Welas Pty Ltd 18,950,000
#> 6 BNP Paribas Nominees Pty Ltd <Agency Lending DRP A/C> 11,520,882
#> Percentage
#> 1 20.17
#> 2 10.63
#> 3 5.94
#> 4 3.56
#> 5 3.49
#> 6 2.12
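The core of the approach above (find the heading line, find the next "total" line, split what lies between on runs of two or more spaces) can be tried without downloading anything. This is only a sketch on made-up lines standing in for pdf_text() output; the company names, the figures, and the offset of 2 lines after the heading are assumptions for this toy layout, not values from the report.

```r
# Toy stand-in for one page of pdf_text() output, already split into lines.
# All names and numbers here are invented for illustration.
page <- c(
  "Some heading",
  "Twenty Largest Shareholders",
  "Name                           Number of Shares    Percentage",
  "Alpha Nominees Pty Limited     1,000,000           10.50",
  "Beta Holdings Ltd              500,000             5.25",
  "Total                          1,500,000           15.75"
)

# Locate the heading case-insensitively, then the first "total" line after it
start <- grep("twenty largest shareholders", tolower(page), fixed = TRUE)[1]
ends  <- grep("^\\s*total", tolower(page))
end   <- ends[ends > start][1]

# Skip the heading and the column-header line (offset assumed for this layout)
rows <- page[(start + 2):(end - 1)]

# Columns are separated by two or more spaces; single spaces stay inside names
fields <- strsplit(trimws(rows), "\\s{2,}")
tab <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(tab) <- c("Name", "Shares", "Percentage")

# Turn "1,000,000" style strings into numbers
tab$Shares     <- as.numeric(gsub(",", "", tab$Shares))
tab$Percentage <- as.numeric(tab$Percentage)
print(tab)
```

With a real PDF the offset after the heading and the end marker will likely need adjusting per document, which is why the answer uses + 3 for this particular report.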

Related

Matrix to Dataframe conversion fails in R

I have created a function that calls an api and parses for the id, label, description, and score of each annotation. But I can't seem to get the dataframe to display properly.
Here's the code:
get_wikidata_links <- function(input_text, minimum_score) {
#
# Function which takes a character vector of length 1 as input (i.e. all text
# needs to be combined into a single character) as well as a minimum certainty
# score, and returns a tibble with key information and links to Wikidata
#
# Input
# - input_text: Text input (character)
# - minimum_score: Minimum score that every returned entity needs to have
# (numeric)
#
# Output
# - top_wikidata_links: Table with the first four columns being 'id', 'label',
# 'description', 'score' (tibble)
#
base_url <- "https://opentapioca.org/api/annotate"
r <- GET(base_url, query = list(query = input_text))
data = content(r)$annotations
framed = list()
vec = list()
dummy = 0
for (i in 1:length(data)) {
data1 = data[[i]]$tags
for (j in 1:length(data1)) {
data2 = data1[[j]]
if (data2$score>minimum_score) {
vec[1] <- data2$id
vec[2] <- data2$label
vec[3] <- data2$desc
vec[4] <- data2$score
dummy <- dummy + 1
framed[[dummy]] <- vec
}
}
}
data_matrix <- do.call("rbind", framed)
top_wikidata_links <- as.data.frame(data_matrix, stringsAsFactors = FALSE)
colnames(top_wikidata_links) <- c("ID", "Label", "Description", "Score")
return(top_wikidata_links)
}
Now I test this function with a couple phrases:
# Test 1
text_example_1 <- c("Karl Popper worked at the LSE.")
get_wikidata_links(text_example_1, -0.5)
#
# Hint: The output should be a tibble similar to the one outlined below
#
# | id | label | description | score |
# | "Q81244" | "Karl Popper" | "Austrian-British philosopher of science" | 2.4568285 |
# | "Q174570" | "London School of Economics and Political Science" | "university in Westminster, UK" | "1.4685043" |
# | "Q171240" | "London Stock Exchange" | "stock exchange in the City of London" | "-0.4124461" |
# Test 2
text_example_2 <- c("Claude Shannon studied at the University of Michigan and at MIT.")
get_wikidata_links(text_example_2, 0)
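For some reason the matrix data_matrix looks fine:
[screenshot of the matrix output; image not preserved]
But the data frame conversion does not display properly:
[screenshot of the data frame output; image not preserved]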
I guess it's a bit easier to manage this through some hoisting and unnesting. Inspired by https://tidyr.tidyverse.org/articles/rectangle.html :
library(httr)
library(tidyr)
library(dplyr)
get_wikidata_links <- function(input_text, minimum_score) {
base_url <- "https://opentapioca.org/api/annotate"
r <- GET(base_url, query = list(query = input_text))
tibble(link = content(r)$annotations) %>%
hoist(link, tags = "tags") %>%
unnest_longer(tags) %>%
hoist(tags, ID = "id", Label = "label", Description = "desc", Score = "score") %>%
select(ID:Score) %>%
filter(Score >= minimum_score)
}
text_example_1 <- c("Karl Popper worked at the LSE.")
get_wikidata_links(text_example_1, -0.5)
#> # A tibble: 3 × 4
#> ID Label Description Score
#> <chr> <chr> <chr> <dbl>
#> 1 Q81244 Karl Popper Austrian-Brit… 2.46
#> 2 Q174570 London School of Economics and Political Science university in… 1.47
#> 3 Q171240 London Stock Exchange stock exchang… -0.412
text_example_2 <- c("Claude Shannon studied at the University of Michigan and at MIT.")
get_wikidata_links(text_example_2, 0)
#> # A tibble: 3 × 4
#> ID Label Description Score
#> <chr> <chr> <chr> <dbl>
#> 1 Q92760 Claude Shannon American mathematician an… 1.96
#> 2 Q230492 University of Michigan public research universit… 1.29
#> 3 Q49108 Massachusetts Institute of Technology research university in Ca… 0.902
Created on 2023-01-19 with reprex v2.0.2
The service itself seems a bit unstable or overloaded.
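As for why the original conversion misbehaves: rbind-ing a list of lists produces a list-matrix, and as.data.frame() then keeps every column as a list instead of a character or numeric vector. A small base R sketch with made-up records (the IDs, labels, and scores below are invented, not API output) shows the effect and one possible way around it:

```r
# Two fake "annotations", each stored as a list like vec in the question
framed <- list(
  list("Q1", "Alpha", "first item", 1.5),
  list("Q2", "Beta", "second item", -0.2)
)

# rbind on lists gives a matrix whose cells are list elements
data_matrix <- do.call("rbind", framed)
df_bad <- as.data.frame(data_matrix, stringsAsFactors = FALSE)
print(sapply(df_bad, class))  # every column is a list column

# One alternative: build a one-row data frame per record, then bind the rows
rows <- lapply(framed, function(v) {
  data.frame(ID = v[[1]], Label = v[[2]], Description = v[[3]],
             Score = v[[4]], stringsAsFactors = FALSE)
})
df_ok <- do.call(rbind, rows)
print(sapply(df_ok, class))   # plain character and numeric columns
```

This keeps the loop-based structure of the question; the hoist()/unnest_longer() version above avoids building the intermediate list altogether.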

Combine fuzzy and exact merge in R

Here is my sample data:
a <- data.frame(name = c('Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana',
'Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana'),
date=c(1991,1991,1991,1991,1991,1991,
1992,1992,1992,1992,1992,1992),
price = c(10, 13, 2, 1, 15, 1,
11,15,3,2,14,4))
b <- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
'Ace Co.', 'Bayes INC.', 'asd'),
date=c(1991,1991,1991,1992,1992,1992),
qty = c(9, 99, 10,10,105,15))
I am left joining a to b by date and name; date must match exactly while name is fuzzy. I have tried stringdist_join, but it only accommodates fuzzy merging.
The expected output is as follows:
c<- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
'Ace Co.', 'Bayes INC.', 'asd'),
date=c(1991,1991,1991,1992,1992,1992),
qty = c(9, 99, 10,10,105,15),
price = c(10, 13, 2,11,15,3))
I'd like to manipulate it under dplyr.
Using distance matrix to merge fuzzy strings
Main principle
Get the distance matrix between the unique terms of your two vectors. Then check which threshold leads to the best results (I think this has to be human supervised).
Then use this new correspondence table to merge your data frames. Finally, you can change names (e.g. adding "inc.") more easily because you have "standardized" names.
With utils::adist()
I think stringdist is better because you can choose the method, but here is a base example as a suggestion on how to use this concept of distance to get the expected output.
# 1st create a matrix with the Standard Levenshtein distance between the name fields of both sources (or other method from stringdist)
dist_name_matrix <- adist(unique(a$name), unique(b$name), partial = TRUE, ignore.case = TRUE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)
# Let's convert this matrix to a long data frame so the distances are easier to inspect; you will need to check them yourself
library(dplyr)
library(tidyr)
dist_df <- dist_name_matrix %>%
as.data.frame() %>%
tibble::rownames_to_column(., "a_name") %>%
pivot_longer(cols = 2:last_col(), names_to = "b_name", values_to = "dist") %>%
filter(dist < 2) # you might need to adapt this to your needs
# Now this can be used to merge your data i.e
a %>%
left_join(., dist_df, by = c("name" = "a_name")) %>%
right_join(., b, by = c("b_name" = "name", "date" = "date")) %>%
# added just to match your expected output
filter(!is.na(name)) %>%
select(b_name, date, qty, price)
Output:
b_name date qty price
1 Ace Co. 1991 9 10
2 Bayes INC. 1991 99 13
3 asd 1991 10 2
4 Ace Co. 1992 10 11
5 Bayes INC. 1992 105 15
6 asd 1992 15 3
Same process can be used with stringdist:
library(stringdist)
dist_name_matrix <- stringdistmatrix(unique(a$name), unique(b$name), method = "jw", useBytes = FALSE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)
Then just adapt the threshold after a human check, e.g. filter(dist < 0.2).
agrep solution
The following function is almost surely not as general as it is supposed to be. But here it goes.
funMerge <- function(X, Y, col, col_approx, sep = "."){
other_cols.x <- setdiff(names(X), c(col, col_approx))
other_cols.y <- setdiff(names(Y), c(col, col_approx))
sp.x <- split(X, X[[col]])
sp.y <- split(Y, Y[[col]])
common_names <- intersect(names(sp.x), names(sp.y))
res <- sapply(common_names, function(sp.name){
x <- sp.x[[sp.name]]
y <- sp.y[[sp.name]]
k <- sapply(x[[col_approx]], agrep, y[[col_approx]])
k <- k[sapply(k, length) > 0]
k <- unlist(k)
# i indexes the matching rows of x; k indexes the matched rows of y
i <- match(names(k), x[[col_approx]])
df_other.x <- x[i, other_cols.x, drop = FALSE]
df_other.y <- y[k, other_cols.y, drop = FALSE]
df_tmp <- data.frame(
x[i, col],
names(k),
y[k, col_approx]
)
names(df_tmp) <- c(col, col_approx, paste(col_approx, "y", sep = sep))
cbind(df_tmp, df_other.x, df_other.y)
}, simplify = FALSE)
res <- do.call(rbind, res)
row.names(res) <- NULL
res
}
funMerge(a, b, col = "date", col_approx = "name")
# date name name.y price qty
#1 1991 Ace Co Ace Co. 10 9
#2 1991 Bayes Bayes Inc. 13 99
#3 1991 asd asdf 2 10
#4 1992 Ace Co Ace CO. 11 10
#5 1992 Bayes Bayes INC. 15 105
#6 1992 asd aasdf 3 15
stringdist solution
The following function uses package stringdist to compute the Jaro-Winkler pairwise distances between the columns that need to be matched approximately.
From help('stringdist-metrics'), my emphasis.
The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typo's, the Jaro-Winkler distance may be of use. If you are comparing names that were written down after hearing them, a phonetic distance may be a better choice.
A more efficient algorithm would be to first split the data sets by the exact match column and then apply the method of funMerge2.
library(stringdist)
funMerge2 <- function(X, Y, col, col_approx, method = "jw", threshold = 0.2){
x <- X[[col_approx]]
y <- Y[[col_approx]]
d <- stringdistmatrix(x, y, method = method, useBytes = FALSE)
w <- which(d < threshold, arr.ind = TRUE)
Z1 <- X[w[, "row"], ]
Z2 <- Y[w[, "col"], ]
res <- cbind(Z1, Z2)
common_cols <- grep(col, names(res))
res <- res[apply(res[, common_cols], 1, function(x) x[1] == x[2]), ]
row.names(res) <- NULL
res
}
funMerge2(a, b, col = "date", col_approx = "name")
# name date price name date qty
#1 Ace Co 1991 10 Ace Co. 1991 9
#2 Bayes 1991 13 Bayes Inc. 1991 99
#3 asd 1991 2 asdf 1991 10
#4 Ace Co 1992 11 Ace CO. 1992 10
#5 Bayes 1992 15 Bayes INC. 1992 105
#6 asd 1992 3 aasdf 1992 15
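The suggestion to first split by the exact-match column and only then compare names could look roughly like this in base R. It is a sketch, not a tuned implementation: fuzzy_by_group and max_dist are names made up here, and it swaps in utils::adist() (as in the first answer) instead of stringdist so it runs without extra packages.

```r
a <- data.frame(name = rep(c("Ace CO", "Bayes", "aasd", "Apple", "Orange", "Banana"), 2),
                date = rep(c(1991, 1992), each = 6),
                price = c(10, 13, 2, 1, 15, 1, 11, 15, 3, 2, 14, 4),
                stringsAsFactors = FALSE)
b <- data.frame(name = rep(c("Ace Co.", "Bayes INC.", "asd"), 2),
                date = rep(c(1991, 1992), each = 3),
                qty = c(9, 99, 10, 10, 105, 15),
                stringsAsFactors = FALSE)

# Hypothetical helper: exact match on `col`, fuzzy match on `col_approx`
fuzzy_by_group <- function(X, Y, col, col_approx, max_dist = 1) {
  keys <- intersect(unique(X[[col]]), unique(Y[[col]]))
  pieces <- lapply(keys, function(key) {
    x <- X[X[[col]] == key, , drop = FALSE]
    y <- Y[Y[[col]] == key, , drop = FALSE]
    # Distances are only computed within one group, not across all rows
    d <- adist(x[[col_approx]], y[[col_approx]],
               partial = TRUE, ignore.case = TRUE)
    w <- which(d <= max_dist, arr.ind = TRUE)
    y_part <- y[w[, "col"], setdiff(names(y), col), drop = FALSE]
    names(y_part)[names(y_part) == col_approx] <- paste0(col_approx, ".y")
    cbind(x[w[, "row"], , drop = FALSE], y_part)
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

res <- fuzzy_by_group(a, b, col = "date", col_approx = "name")
print(res)
```

Because each distance matrix only covers one date, the work grows with the group sizes rather than with the full cross product of both tables. The threshold still needs a human check, as noted above.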

How to put values inside a column based on other column values in R

I am working with R to scrape and clean data for my work as a journalist. I managed to get the table from the HTML, read it as a data frame, and rename the columns. Now I am trying to create a new column whose value depends on the values of another column.
This new column should hold values such as "Avante", "DEM", "MDB", "Patriota", "PCdoB" and so on, i.e. the party of each deputy. Avante, for example, has three deputies: "Adalberto Cavalcanti", "Cabo Sabino" and "Silvio Costa". The names of the deputies always come below a row that contains only the party's name.
url <- "http://www.camara.leg.br/internet/votacao/mostraVotacao.asp?ideVotacao=8559&numLegislatura=55&codCasa=1&numSessaoLegislativa=4&indTipoSessaoLegislativa=O&numSessao=225&indTipoSessao=E&tipo=partido"
library(xml2)
library(rvest)
file <- read_html(url)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[3], fill = TRUE, header = T)
head(table1)
table1_df <- as.data.frame(table1)
colnames(table1_df) <- c("deputado", "uf", "voto")
This is what I have right now:
[screenshot of the current table; image not preserved]
This is what I want:
[screenshot of the desired table; image not preserved]
Here's a solution that uses only base R:
url <- "http://www.camara.leg.br/internet/votacao/mostraVotacao.asp?ideVotacao=8559&numLegislatura=55&codCasa=1&numSessaoLegislativa=4&indTipoSessaoLegislativa=O&numSessao=225&indTipoSessao=E&tipo=partido"
library(xml2)
library(rvest)
file <- read_html(url)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[3], fill = TRUE, header = T)
head(table1)
table1_df <- as.data.frame(table1)
colnames(table1_df) <- c("deputado", "uf", "voto")
# create the new column for later
table1_df$new_column <- NA
# identify rows with the Total PARTY: NUM rows
idx <- grep("Total.*: \\d+", table1_df$deputado)
# Loop over these and assign the values
for (i in seq_along(idx)){
# Extract the number of deputados
n <- as.numeric(sub("^.*: ", "", table1_df$deputado[idx[i]]))
# Extract the party
partido <- sub("Total ", "", table1_df$deputado[idx[i]])
partido <- sub(": .*", "", partido)
# Assign the values
table1_df$new_column[(idx[i] - n):(idx[i] - 1)] <- partido
}
# Remove the unnecessary lines
table1_df <- table1_df[-grep("Total .*:.*", table1_df$deputado), ]
table1_df <- table1_df[-which(table1_df$deputado == table1_df$uf), ]
Here is another option using zoo and dplyr.
1) Get the names of the parties.
parties <- sub(pattern = "Total\\s(.+):\\s\\d+",
replacement = "\\1",
x = table1_df$deputado[grepl("Total", x = table1_df$deputado)])
2) Add the parties as a new column and carry the last observation forward, as there are many NAs in parties[match(table1_df$deputado, parties)].
table1_df$new_col <- zoo::na.locf(parties[match(table1_df$deputado, parties)])
3) Delete unneeded rows.
library(dplyr)
table1_df <- table1_df %>%
group_by(new_col) %>%
slice(2:(n()-1))
table1_df
# A tibble: 324 x 4
# Groups: new_col [24]
# deputado uf voto new_col
# <chr> <chr> <chr> <chr>
# 1 Adalberto Cavalcanti PE Não Avante
# 2 Cabo Sabino CE Abstenção Avante
# 3 Silvio Costa PE Sim Avante
# 4 Alan Rick AC Sim DEM
# 5 Alberto Fraga DF Não DEM
# 6 Alexandre Leite SP Sim DEM
# 7 Arthur Oliveira Maia BA Sim DEM
# 8 Carlos Melles MG Sim DEM
# 9 Efraim Filho PB Não DEM
#10 Eli Corrêa Filho SP Sim DEM
# ... with 314 more rows
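The "carry the party name forward" idea can also be sketched in base R with cumsum(), without zoo. The toy table below is an assumption about the scraped layout (a party row repeated across all columns, then deputies, then a "Total ..." row); its rows are a made-up subset, not the real vote data.

```r
# Toy table mimicking the assumed layout of the scraped page
toy <- data.frame(
  deputado = c("Avante", "Silvio Costa", "Total Avante: 1",
               "DEM", "Alan Rick", "Alexandre Leite", "Total DEM: 2"),
  uf       = c("Avante", "PE", "Total Avante: 1",
               "DEM", "AC", "SP", "Total DEM: 2"),
  voto     = c("Avante", "Sim", "Total Avante: 1",
               "DEM", "Sim", "Sim", "Total DEM: 2"),
  stringsAsFactors = FALSE
)

is_total  <- grepl("^Total", toy$deputado)
# A party header row repeats the same text in every column
is_header <- toy$deputado == toy$uf & !is_total

# cumsum() numbers the blocks: every row gets the index of the last header seen.
# This assumes the table starts with a header row.
block <- cumsum(is_header)
toy$partido <- toy$deputado[is_header][block]

# Keep only the deputy rows
result <- toy[!is_header & !is_total, ]
rownames(result) <- NULL
print(result)
```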

R data splitting unicodes

I have some data that I want to split into columns:
price_list <- c("Vegetables", " Garlic Desi<U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC> 140 per kg ",
" Fresh-bean<U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646> — per kg ",
"Fruits",
" Apple Kala Kolu Irani<U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+0648><U+0644><U+0648> <U+0627><U+06CC><U+0631><U+0627><U+0646><U+06CC> 168 per kg ",
" Apple golden 115 per kg ",
" Banana (I)<U+06A9><U+06CC><U+0644><U+0627> <U+0627><U+0646><U+0688><U+06CC><U+0646> 182 per dozen ",
"Others",
" Chicken<U+0645><U+0631><U+063A><U+06CC> <U+0634><U+06CC><U+0648><U+0631> 170 per kg ",
" Egg<U+0627><U+0646><U+0688><U+06D2> <U+0634><U+06CC><U+0648><U+0631> 95 per dozen "
)
I tried the following, but the Unicode escapes cause problems:
library(stringr)
regexp <- "[[:digit:]]+"
rprice <- str_extract(df$price_list, regexp)
df$price <- data.frame(rprice)
The desired output looks like this:
Name Unicode Price Quantity
Vegetables
Fresh-bean فراشبین NA kg
Fruits
Apple golden NA 115 kg
Others
Egg انڈے شیور NA dozen
This forum is really helpful and has saved me hundreds and thousands of hours, thanks.
url <- "https://ictadministration.gov.pk/services/price-list/"
Complete code:
library(rvest)
scraping_wiki <- read_html("https://ictadministration.gov.pk/services/price-list/")
library(magrittr)
price_date <- scraping_wiki %>%
html_nodes(".tm-article-content > ol:nth-child(1) > div:nth-child(1)") %>%
html_text()%>%
strsplit(split = "\n") %>%
unlist() %>%
.[. != ""]
price_date <- gsub(":", "", price_date)
price_list <- scraping_wiki %>%
html_nodes(".xl-tbl") %>%
html_text() %>%
strsplit(split = "\n") %>%
unlist() %>%
.[. != ""]
Wow, messy. This gets you close:
library(dplyr)
library(stringr)
unis <- price_list %>% str_extract(pattern = "<[[:print:]]*>")
words <- price_list %>% str_extract(pattern = "[A-Z a-z<]*") %>% gsub("<U", "", x = .)
price <- price_list %>% str_extract(pattern = "[0-9]* per") %>% gsub("per", "", x = .)
quant <- price_list %>% str_extract(pattern = "per [a-z]*")
df <- tibble(Name = words, Unicode = unis, Price = price, Quantity = quant)
Result:
> head(df)
# A tibble: 6 x 4
Name Unicode Price Quantity
<chr> <chr> <chr> <chr>
1 Vegetables NA NA NA
2 " Garlic Desi" <U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC> "140~ per kg
3 " Fresh" <U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646> " " per kg
4 Fruits NA NA NA
5 " Apple Kala Kolu Irani" <U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+~ "168~ per kg
6 " Apple golden " NA "115~ per kg
I'm not a regex genius, so I'm sure there must be a cleaner way.
Here's a functional approach. It's always good to learn to find a work around with functions.
Following are the steps:
1. Clean the price_list and keep the name, number and quantity.
2. Write functions that do that.
3. Apply functions on the new data frame.
library(data.table)
library(stringr)
# clean text
clean_list <- lapply(price_list, function(i) gsub("<[^>]+>", "",i))
clean_list <- lapply(clean_list, function(i) gsub('per','',i))
clean_list <- lapply(clean_list, str_trim)
# convert list to data frame
df <- data.table(do.call('rbind', clean_list))
colnames(df) <- 'text'
# helper functions
get_number <- function(j)
{
p1 <- unlist(strsplit(j, ' '))
p2 <- grepl('\\d+',p1)
if(sum(as.integer(p2)) ==1) return (grep('\\d+',p1,value = T))
else return (0)
}
get_quantity <- function(j)
{
p1 <- unlist(strsplit(j, ' '))
p2 <- grepl('kg|dozen',p1)
if(sum(as.integer(p2)) ==1) return (grep('kg|dozen',p1,value = T))
else return (NA)
}
# apply functions and get output
df[,Name := sapply(text, function(i) unlist(strsplit(i, ' '))[1])]
df[,Price := sapply(text, get_number)]
df[,Quantity := sapply(text, get_quantity)]
df[,Unicode := sapply(price_list, function(x) str_extract(string = x, pattern = '<[[:print:]]*>'))]
head(df)
text Name Price Quantity Unicode
1 Vegetables Vegetables 0 NA NA
2 Garlic Desi 140 kg Garlic Desi 140 kg <U+062A><U+06BE><U+0648><U+0645> <U+062F><U+06CC><U+0633><U+06CC>
3 Fresh-bean — kg Fresh-bean 0 kg <U+0641><U+0631><U+0627><U+0634><U+0628><U+06CC><U+0646>
4 Fruits Fruits 0 NA NA
5 Apple Kala Kolu Irani 168 kg Apple Kala Kolu Irani 168 kg <U+0633><U+06CC><U+0628> <U+06A9><U+0627><U+0644><U+0627> <U+06A9><U+0648><U+0644><…
6 Apple golden 115 kg Apple golden 115 kg NA
>
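Both answers boil down to pulling four pieces out of each line with regular expressions. A compact base R sketch of that idea follows; extract_first is a helper name made up here, the sample lines are shortened versions of the ones in the question, and the Name rule assumes item names contain no digits.

```r
price_list <- c("Vegetables",
                " Garlic Desi<U+062A><U+06BE> <U+062F><U+06CC> 140 per kg ",
                " Fresh-bean<U+0641><U+0631> \u2014 per kg ",
                " Apple golden 115 per kg ",
                " Egg<U+0627><U+0646> 95 per dozen ")
x <- trimws(price_list)

# Hypothetical helper: first regex match per element, NA where nothing matches
extract_first <- function(pattern, text) {
  m <- regexpr(pattern, text)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Name: everything before the first "<" or the first digit
name <- sub("\\s*(<|[0-9]).*$", "", x)
# Unicode: from the first "<U+" to the last ">" (greedy on purpose)
unicode <- extract_first("<U\\+.*>", x)
# Price: digits directly followed by " per"; hex digits inside <U+....> do not qualify
price <- as.numeric(sub("\\s+per$", "", extract_first("[0-9]+\\s+per", x)))
# Quantity: the unit after "per"
quantity <- sub("^per\\s+", "", extract_first("per\\s+[a-z]+", x))

parsed <- data.frame(Name = name, Unicode = unicode, Price = price,
                     Quantity = quantity, stringsAsFactors = FALSE)
print(parsed)
```

Category rows such as "Vegetables" end up with NA in every column except Name, and a missing price (the dash in the Fresh-bean line) becomes NA rather than 0.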

Maximum occurrence of any set of words in text in R

Given a set of lines, I have to find the most frequently occurring words (not necessarily single words; they can also be sets of words).
say, I have a text like,
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
I want output to be,
john beck - 3
chemical engineer - 2
Is there any function or package which does this?
Try this:
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
library(tau)
library(tm)
tokens <- MC_tokenizer(string)
tokens <- tokens[tokens != ""]
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")
## if you want only bi-grams:
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# john beck 3
# chemical engineer 2
## if you want uni-, bi- and tri-grams:
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x) )), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
# Freq
# beck 3
# john 3
# john beck 3
# chemical 2
# engineer 2
# is 2
# chemical engineer 2
Could also try this, using the quanteda package:
require(quanteda)
mydfm <- dfm(string, ngrams = 1:2, concatenator = "_", stem = TRUE, verbose = FALSE)
topfeatures(mydfm)
## beck john john_beck chemic chemical_engin engin is
## 3 3 3 2 2 2 2
## an an_chem are
## 1 1 1
You lose the stems, but this counts "john beck" three times instead of just two (since without stemming, "john beck's" will be a separate type).
It's simpler though!
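If installing packages is not an option, the bigram count can be approximated in base R. This is a rough sketch: the "stemming" here is just stripping a possessive 's and a trailing s from longer words, which is an assumption that happens to work for this sentence and is no substitute for a real stemmer.

```r
string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

# Split on anything that is not a letter or apostrophe, then lowercase
words <- tolower(unlist(strsplit(string, "[^A-Za-z']+")))
words <- words[words != ""]

# Crude normalisation: drop possessive 's, then a plural s on words longer than 3 letters
words <- sub("'s$", "", words)
words <- ifelse(nchar(words) > 3, sub("s$", "", words), words)

# Pair every word with its successor to form bigrams and count them
bigrams <- paste(head(words, -1), tail(words, -1))
tab <- sort(table(bigrams), decreasing = TRUE)
top <- tab[tab > 1]
print(top)
```

Like the tau example, this ignores sentence boundaries, so a bigram can span the end of one sentence and the start of the next.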
