Web scraping to extract only table values from a webpage - R

I want to extract only the table values from the following link:
url <- "https://www.ds-norden.com/drycargo/fleetlist/"
I tried the following code, but I am not getting my desired output:
library(rvest)
url <- "https://www.scorpiobulkers.com/our-fleet/"
webpage <- read_html(url)
rank_data_html <- html_node(webpage, ".col-main")
rank_data <- html_text(rank_data_html)
head(rank_data)
From this code I get the whole text of the webpage. I just want the fleet list, which is in the table on the webpage, and to store it as a data frame in R.

library(rvest)
url <- "https://www.scorpiobulkers.com/our-fleet/"
webpage <- read_html(url)
rank_data <-
  webpage %>%
  html_node("table") %>%
  html_table()
head(rank_data)
#> Vessel Name Year Built (1) Yard (2) Vessel Type
#> 1 NA SBI Bravo 2015 Nacks Ultramax
#> 2 NA SBI Athena 2015 Chengxi Ultramax
#> 3 NA SBI Antares 2015 Nacks Ultramax
#> 4 NA SBI Cronos 2015 Imabari Ultramax
#> 5 NA SBI Leo 2015 Dacks Ultramax
#> 6 NA SBI Echo 2015 Imabari Ultramax
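Note that html_node("table") returns only the first <table> on the page, which is what we want here. In rvest 1.0+, html_node()/html_nodes() are superseded by html_element()/html_elements(); a minimal sketch of the same extraction in the newer spelling:
library(rvest)
url <- "https://www.scorpiobulkers.com/our-fleet/"
fleet <- read_html(url) %>%
  html_element("table") %>%   # first <table> element on the page
  html_table()                # parse it into a data frame (tibble)
head(fleet)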

Related

Read table from PDF with partially filled column using pdftools

I've written a function in R using pdftools to read a table from a PDF. The function gets the job done, but unfortunately the table contains a column for notes that is only partially filled. As a result, the data in the resulting table is shifted by one column in the rows that contain a note.
Here's the table.
And here's the code:
# load libraries (str_split_fixed() comes from stringr)
library(pdftools)
library(stringr)
# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
# read data through pdftools
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
                       table_number,
                       number_columns,
                       column_names,
                       first_row,
                       last_row) {
  data <- list_of_tables[table_number]
  data <- trimws(data)
  data <- strsplit(data, "\n")
  data <- data[[1]]
  data <- data[min(grep(first_row, data)):
                 max(grep(last_row, data))]
  data <- str_split_fixed(data, " {2,}", number_columns)
  data <- data.frame(data)
  names(data) <- column_names
  return(data)
}
names <- c("", "6m 30-9-2022", "6m 30-9-2021", "12m 30-3-2022")
output <- scrape_pdf(data, 3, 5, names, "Care fees", "Basic and diluted")
And the output.
6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1 Care fees 210,187 194,603 398,206
2 Management fees 59,746 50,959 105,552
3 Interest received 364 42 41
4 Other income 3,942 2,260 4,998
5 Total revenue 274,239 247,864 508,797
6
7 Fair-value movement of
8 investment properties 3 261,346 285,143 745,885
9 Total income 535,585 533,007 1,254,682
10
11 Operating expenses (265,148) (225,380) (466,238)
12 Depreciation and
13 amortisation expenses (22,996) (17,854) (35,698)
14 Finance costs (19,355) (15,250) (30,664)
15 Impairment loss 2 (10,784) - -
16 Total expenses (318,283) (258,484) (532,600)
17
18 Profit before income tax 217,302 274,523 722,082
19 Income tax (expense) / credit (23,316) 6,944 (29,209)
20 Profit for the period 193,986 281,467 692,873
21
22 Earnings per share
23 Basic and diluted (cents per share) 38.8 56.3 138.6
How can I best circumvent this issue?
Many thanks in advance!
While readr::read_fwf() is designed for fixed-width files, it performs pretty well on text from pdftools too once header/footer rows are removed, even when it has to guess the column widths; those widths can also be specified explicitly.
library(pdftools)
library(dplyr, warn.conflicts = F)

url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)

scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str) {
  lines <- unlist(strsplit(pdf_text_item, "\n"))
  # remove 0-length lines
  lines <- lines[nchar(lines) > 0]
  lines <- lines[min(grep(first_row_str, lines)):
                   max(grep(last_row_str, lines))]
  # paste lines back into a single string for read_fwf()
  paste(lines, collapse = "\n") %>%
    readr::read_fwf() %>%
    # re-connect strings in first column if values were split between rows
    mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
    filter(!is.na(X3))
}

output <- scrape_pdf(data[3], "Care fees", "Basic and diluted")
Result:
output %>%
  mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
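If read_fwf() guesses the column widths wrong for a particular statement, they can be pinned down with fwf_widths(). A sketch with made-up widths and column names (both are assumptions to tune against the actual pdftools text, not values taken from the PDF above):
library(readr)
# hypothetical widths (characters per column) and names -- adjust to your layout
col_spec <- fwf_widths(c(45, 4, 12, 12, 12),
                       col_names = c("item", "note", "v1", "v2", "v3"))
# `lines` as built inside scrape_pdf() above; I() marks the string as literal data
read_fwf(I(paste(lines, collapse = "\n")), col_positions = col_spec)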

Problem scraping a webpage with R and rvest

I used the code below to extract a table from a webpage:
library(rvest)
library(dplyr)
#Link to site and then getting html code.
link <- "https://www.stats.gov.sa/en/915"
page <- read_html(link)
#extract table from html
files <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
However, I get results that differ from those on the webpage. The results are shown below:
A tibble: 1 × 4
Name Report Period Periodicity Download
1 please wait ... please wait ... please wait ... please wait ...
I was wondering whether there is a way to get the table in the form I view it in the web browser, without using RSelenium, since RSelenium doesn't seem to work with RStudio online.
The solution could be RSelenium. Below is a simple example:
library(RSelenium)
library(rvest)
library(dplyr)

# Your URL
URL <- "https://www.stats.gov.sa/en/915"

# Open the browser via RSelenium
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = F)
remDr <- rD[["client"]]

# Open the page in the browser
remDr$navigate(URL)

# Get the table that you see
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()
[[1]]
# A tibble: 13 x 4
Name `Report Period` Periodicity Download
<chr> <int> <chr> <lgl>
1 Ar-Riyad Region 2017 Annual NA
2 Makkah Al-Mokarramah Region 2017 Annual NA
3 Al-Madinah Al-Monawarah Region 2017 Annual NA
4 Al-Qaseem Region 2017 Annual NA
5 Eastern Region 2017 Annual NA
6 Aseer Region 2017 Annual NA
7 Tabouk Region 2017 Annual NA
8 Hail Region 2017 Annual NA
9 Northern Borders Region 2017 Annual NA
10 Jazan Region 2017 Annual NA
11 Najran Region 2017 Annual NA
12 Al-Baha Region 2017 Annual NA
13 Al-Jouf Region 2017 Annual NA
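When you're finished, close the browser and stop the Selenium server so the port is released:
# clean up the RSelenium session
remDr$close()
rD$server$stop()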

What's wrong: rvest's error 'in open.connection(x, "rb")' and readHTMLTable()'s "XML contents does not seem to be XML"? [duplicate]

I am trying to extract all the tables from this page using R; for html_node I passed "table". The output in the console is weird: the data is available in the webpage, but in the R console it shows NA. Please suggest where I made a mistake.
library(xml2)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
page <- read_html(url) #Creates an html document from URL
table <- html_table(page, fill = TRUE) #Parses tables into data frames
table
part of Output:
X4 X5 X6
1 Direct premiums written (1) Market share (2) 1
2 Market share (2) <NA> NA
3 10.6% <NA> NA
4 6.0 <NA> NA
5 5.4 <NA> NA
6 5.4 <NA> NA
7 5.2 <NA> NA
8 4.5 <NA> NA
9 3.3 <NA> NA
10 3.2 <NA> NA
11 3.0 <NA> NA
12 2.2 <NA> NA
X7 X8 X9 X10
1 State Farm Mutual Automobile Insurance $51,063,111 10.6% 2
2 <NA> <NA> <NA> NA
3 <NA> <NA> <NA> NA
4 <NA> <NA> <NA> NA
5 <NA> <NA> <NA> NA
6 <NA> <NA> <NA> NA
7 <NA> <NA> <NA> NA
8 <NA> <NA> <NA> NA
9 <NA> <NA> <NA> NA
10 <NA> <NA> <NA> NA
11 <NA> <NA> <NA> NA
12 <NA> <NA> <NA> NA
This will get all of the tables into a single data frame:
library(tidyverse)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
df <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = T) %>%
  lapply(., function(x) setNames(x, c("Rank", "Company", "Direct_premiums_written",
                                      "Market_share")))

tables <- data.frame()
for (i in seq(2, 18, 2)) {
  temp <- df[[i]]
  tables <- bind_rows(tables, temp)
}
You can then subset this however you want. For example, let's extract the information from the third table, which represents 2009:
table_2009 <- tables[21:30, ] %>%
  mutate(Year = 2009)
To add all the years at once:
years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>%
  mutate(Year = rep(years, each = 10))
Hope this helps.
There are a couple of issues with these tables.
First, I think you'll get better results if you specify the class of table. In this case, .tablesorter.
Second, you'll note that in some tables the second column header is Group, in other cases it is Group/company. This is what causes the NA. So you need to rename the columns to be consistent for all tables.
You can get a list of tables with renamed column headers like this:
tables <- page %>%
  html_nodes("table.tablesorter") %>%
  html_table() %>%
  lapply(., function(x) setNames(x, c("rank", "group_company",
                                      "direct_premiums_written", "market_share")))
Looking at the web page we see that the tables are for years 2017, 2008 to 2011 and 2013 to 2016. So we could add these years as names to the list then bind the tables together with a column for year:
library(dplyr)
tables <- setNames(tables, c(2017, 2008:2011, 2013:2016)) %>%
  bind_rows(.id = "Year")
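With the year bound in as a column, any single year is a one-line filter. Note that bind_rows(.id = "Year") creates Year as a character column:
# rows for the 2009 table only
tables_2009 <- tables %>%
  filter(Year == "2009")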
There are multiple items in the list that you have named table. (Not good practice: there's a base function by that name, so I've used tbl below.)
str(tbl)
List of 18
$ :'data.frame': 12 obs. of 45 variables:
..$ X1 : chr [1:12] "Rank\nGroup/company\nDirect premiums written (1)\nMarket share (2)\n1\nState Farm Mutual Automobile Insurance\n"| __truncated__ "Rank" "1" "2" ...
..$ X2 : chr [1:12] "Rank" "Group/company" "State Farm Mutual Automobile Insurance" "Berkshire Hathaway Inc." ...
..$ X3 : chr [1:12] "Group/company" "Direct premiums written (1)" "$64,892,583" "38,408,251" ...
snipped rest of long output
Perhaps you only want the last one?
tbl[[18]]
Rank Group/company
1 1 State Farm Mutual Automobile Insurance
2 2 Berkshire Hathaway Inc.
3 3 Liberty Mutual
4 4 Allstate Corp.
5 5 Progressive Corp.
6 6 Travelers Companies Inc.
7 7 Chubb Ltd.
8 8 Nationwide Mutual Group
9 9 Farmers Insurance Group of Companies (3)
10 10 USAA Insurance Group
Direct premiums written (1) Market share (2)
1 $62,189,311 10.2%
2 33,300,439 5.4
3 32,217,215 5.3
4 30,875,771 5.0
5 23,951,690 3.9
6 23,918,048 3.9
7 20,786,847 3.4
8 19,756,093 3.2
9 19,677,601 3.2
10 18,273,675 3.0
Nope; going back to the page, it's clear you want the first one, but its structure appears to have been misinterpreted: the data has been arranged "wide", with everything residing in the first row, so some columns display correctly and the rest of the data is messed up. Just take columns 2:4:
tbl[[1]][, c('X2', 'X3', 'X4')]
X2 X3
1 Rank Group/company
2 Group/company Direct premiums written (1)
3 State Farm Mutual Automobile Insurance $64,892,583
4 Berkshire Hathaway Inc. 38,408,251
5 Liberty Mutual 33,831,726
6 Allstate Corp. 31,501,664
7 Progressive Corp. 27,862,882
8 Travelers Companies Inc. 24,875,076
9 Chubb Ltd. 21,266,737
10 USAA Insurance Group 20,151,368
11 Farmers Insurance Group of Companies (3) 19,855,517
12 Nationwide Mutual Group 19,218,907
X4
1 Direct premiums written (1)
2 Market share (2)
3 10.1%
4 6.0
5 5.3
6 4.9
7 4.3
8 3.9
9 3.3
10 3.1
11 3.1
12 3.0
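A possible further cleanup of that wide first table: drop the two header rows and assign readable names (the names below are mine, not taken from the page):
tbl1 <- tbl[[1]][-(1:2), c("X2", "X3", "X4")]
names(tbl1) <- c("group_company", "direct_premiums_written", "market_share")  # assumed labels
rownames(tbl1) <- NULL
head(tbl1)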

World Bank API query

I want to get data using the World Bank's API. For this purpose I use the following query.
wb_data <- httr::GET("http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO?format=json") %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  data.frame()
It works pretty well. However, when I try to specify more than one indicator, it doesn't work:
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?format=json
Note that if I change the format to XML and also add source=2 (because the data come from the same database, the World Development Indicators), the query works:
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?source=2&format=xml
However, if I want to get data from different databases (e.g. WDI and Doing Business), it again doesn't work.
So, my first question is: how can I get data for multiple indicators from different databases in one query? According to the World Bank API tutorial, I can include about 60 indicators.
My second question is: how can I specify the number of rows per page? As far as I know, I can add something like &per_page=100 to get 100 rows as output. Should I calculate the number of rows myself, or can I use something like &per_page=9999999 to get all the data in one request?
P.S. I don't want to use any libraries (such as wb or wbstats); I want to do it myself and learn something new.
Here's an answer to your question. To use multiple indicators and return JSON, you need to provide both the source ID and the format type, as mentioned in the World Bank API tutorial. You can get the total number of records from one of the returned JSON metadata fields, called "total". You can then pass that value as the per_page parameter in a second GET request to return all of the records on a single page.
library(magrittr)
library(httr)
library(jsonlite)

# set up the target url - you need BOTH the source ID and the format parameters
target_url <- "http://api.worldbank.org/v2/country/chn;ago/indicator/AG.AGR.TRAC.NO;SP.POP.TOTL?source=2&format=json"

# look at the metadata returned for the target url
httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # the metadata is the first item in the returned list of JSON
  extract2(1)
#> $page
#> [1] 1
#>
#> $pages
#> [1] 5
#>
#> $per_page
#> [1] 50
#>
#> $total
#> [1] 240
#>
#> $sourceid
#> NULL
#>
#> $lastupdated
#> [1] "2019-12-20"
# get the total number of records for the target url query
wb_data_totalpagenumber <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the first item in the returned list of JSON
  extract2(1) %>%
  # get the total number of records, a named element called "total"
  extract2("total")

# get all the data on one page by setting per_page to the total record count
wb_data <- httr::GET(paste0(target_url, "&per_page=", wb_data_totalpagenumber)) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the data, which is the second item in the returned list of JSON
  extract2(2) %>%
  data.frame()
# look at the data
dim(wb_data)
#> [1] 240 11
head(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 1 AGO 2019 NA 0 AG.AGR.TRAC.NO
#> 2 AGO 2018 NA 0 AG.AGR.TRAC.NO
#> 3 AGO 2017 NA 0 AG.AGR.TRAC.NO
#> 4 AGO 2016 NA 0 AG.AGR.TRAC.NO
#> 5 AGO 2015 NA 0 AG.AGR.TRAC.NO
#> 6 AGO 2014 NA 0 AG.AGR.TRAC.NO
#> indicator.value country.id country.value
#> 1 Agricultural machinery, tractors AO Angola
#> 2 Agricultural machinery, tractors AO Angola
#> 3 Agricultural machinery, tractors AO Angola
#> 4 Agricultural machinery, tractors AO Angola
#> 5 Agricultural machinery, tractors AO Angola
#> 6 Agricultural machinery, tractors AO Angola
tail(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 235 CHN 1965 715185000 <NA> 0 SP.POP.TOTL
#> 236 CHN 1964 698355000 <NA> 0 SP.POP.TOTL
#> 237 CHN 1963 682335000 <NA> 0 SP.POP.TOTL
#> 238 CHN 1962 665770000 <NA> 0 SP.POP.TOTL
#> 239 CHN 1961 660330000 <NA> 0 SP.POP.TOTL
#> 240 CHN 1960 667070000 <NA> 0 SP.POP.TOTL
#> indicator.value country.id country.value
#> 235 Population, total CN China
#> 236 Population, total CN China
#> 237 Population, total CN China
#> 238 Population, total CN China
#> 239 Population, total CN China
#> 240 Population, total CN China
Created on 2020-01-30 by the reprex package (v0.3.0)
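As an alternative to one large per_page request, you can also walk the pages with the API's page parameter and row-bind the results; a sketch reusing target_url from above:
# fetch one page of results as a data frame
get_page <- function(page_number) {
  httr::GET(paste0(target_url, "&page=", page_number)) %>%
    content("text", encoding = "UTF-8") %>%
    fromJSON(flatten = T) %>%
    # the data frame is the second item of the returned JSON
    extract2(2)
}

# read the page count from the metadata, then fetch and bind all pages
n_pages <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  extract2(1) %>%
  extract2("pages")
wb_data_paged <- do.call(rbind, lapply(seq_len(n_pages), get_page))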

