R - Importing and formatting mutiple tables from various urls

R - Importing and formatting mutiple tables from various urls - r

I am a newbie in R so probably there is already some answer to the following question but I haven't find a solution matching the issue I am facing.
I am trying to get tables from a number of webpages. They shold be around 5200.
I have imported one in order to format it but I need to automatize the process to get them all.
Here's the url:
http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C
I have tried to find out a way to get all the tables by doing:
url <- paste0("http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=", ., sep="" )
but I receive an error message according which
, .,
cannot be read.
In any case, I neither get how to automatize the process of formatting once I would get it.
Any hint?

Here's how you would do it for one product:
url <- "http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C"
h <- read_html(url)
tab <- html_table(h, fill=TRUE) %>%
as_tibble(.name_repair = "universal")
tab
# # A tibble: 37 x 1
# ...1$Componente $Unidades $`Valor por 100… $`Desvio padrão` $`Valor Mínimo` $`Valor Máximo` $`Número de dad…
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Energia kJ 578 - - - -
# 2 Energia kcal 136 - - - -
# 3 Umidade g 65,5 - - - -
# 4 Carboidrato to… g 33,3 - - - -
# 5 Carboidrato di… g 32,5 - - - -
# 6 Proteína g 0,60 - - - -
# 7 Lipídios g 0,26 - - - -
# 8 Fibra alimentar g 0,84 - - - -
# 9 Álcool g 0,00 - - - -
# 10 Cinzas g 0,39 - - - -
# # … with 27 more rows, and 2 more variables: $Referências <chr>, $`Tipo de dados` <chr>
If you wanted to scrape all the codes and get all of the tables, you could do that with the following. First, we can set up a loop to scrape all of the links. By investigating the source, you would find, as you did, that all of the product codes have "cod_produto" in the href attribute. You could use an xpath selector to keep only those a tags containing that string. You're basically looping over every page until you get to one that doesn't have any links. This gives you 5203 links.
library(glue)
all_links <- NULL
links <- "init"
i <- 1
while(length(links) > 0){
url <- glue("http://www.tbca.net.br/base-dados/composicao_alimentos.php?pagina={i}&atuald=3")
h <- read_html(url)
links <- h %>% html_nodes(xpath = "//a[contains(#href,'cod_produto')]") %>% html_attr("href") %>% unique()
all_links <- c(all_links, links)
i <- i+1
}
EDIT
Next, we can follow each link and pull the table out of it, storing the table in the list called tabs. In answer to the question about how to get the name of the product in the data, there are two easy things to do. The first is to make the table into a data frame and then make a variable (I called it code) in the data frame that has the code name. The second is to set the list names to be the product code. The answer below has been edited to do both things.
all_links <- unique(all_links)
tabs <- vector(mode="list", length=length(all_links))
for(i in 1:length(all_links)){
url <- glue("http://www.tbca.net.br/base-dados/{all_links[i]}")
code <- gsub(".*=(.*)$", "\\1", url)
h <- read_html(url)
tmp <- html_table(h, fill=TRUE)[[1]]
tmp <- as.data.frame(tmp)
tmp$code <- code
tabs[[i]] <- tmp
names(tabs)[i] <- code
}

Related

Rvest : Extracting clickable content

I am trying to extract the table in the link below
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
I want the whole table to be extracted and I am using the following code
html_page <- read_html(curl(curl))
tab <- html_page %>% html_table(., fill = TRUE)
I get the table in tab[[1]], however, if you notice that website it has a clickable section within the table that has additional data. That part is missing from the extracted table. Will appreciate any help on how the whole table can be extracted.

I'm not sure what you're getting. However, when I pulled from this website I see that there are multiple tabs but I pulled all of the data.
Here is the bottom of the table, when you show all.
Here are the results, when I query for the last line of this website data.
library(rvest)
library(tidyverse)
hx = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
htp <- read_html(hx) %>% html_table(., fill = T)
tbOne = htp[[1]][, 1:10] # just the data
tbOne %>% filter(`State Name` == "Uttar Pradesh",
`District Name` == "Badaun",
`Market Name` == "Wazirganj")
# # A tibble: 1 × 10
# `State Name` `District Name` `Market Name` Variety Group `Arrivals (Tonnes)`
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Uttar Pradesh Badaun Wazirganj Dara Cereals 3.50
# # … with 4 more variables: `Min Price (Rs./Quintal)` <chr>,
# # `Max Price (Rs./Quintal)` <chr>, `Modal Price (Rs./Quintal)` <chr>,
# # `Reported Date` <chr>
Update
When I pressed the 2, nothing happened (and I did try repeatedly). However, I needed to be really patient and I wasn't. Sorry about that.
The URL has the query in it, so the URL can be used to get all of the data. You could do this by adding the states you're missing, or you could do this for every state. For example, page one ends on Utter Pradesh, but we don't know if this is all of Utter Pradesh. That might make more sense when you see what I did.
Using rvest, I collected all of the states' names from the form. Then I put these name-value pairs into a data frame.
# collect form values for State
ht <- read_html(hx) %>% html_form()
df1 <- as.data.frame(ht[[1]][["fields"]][["ctl00$ddlState"]][["options"]]) %>%
rownames_to_column("State")
names(df1)[2] <- "Abb"
To only look at the states that were not included in page one, you could just query the states after Utter Pradesh like this.
which(df1$State == "Uttar Pradesh", arr.ind = T)
# [1] 35
# split the URL
urone = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State="
urtwo = "&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=West+Bengal&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
# collect remaining states' data
df2 <- map(36:nrow(df1),
function(x){
# assemble URL
y = toString(df1$Abb[x])
urall = paste0(urone, y, urtwo)
# get table
tabs <- read_html(urall) %>% html_table(., fill = T)
tabs
})
length(df2)
# [1] 2
length(df2[[1]]) # state 36 is empty
length(df2[[2]]) # state 37 is not
# add the new data to the original data
df3 <- df2[[2]][[1]]
tbOne <- rbind(tbOne, df3) # one data frame of tabled data
If you wanted to make sure that you had all the data for each state, you could expand this. Although, using map for that much data may be slow. So I used the function mclapply from the package parallel. In this code, I used 15 cores. You may need to change this depending on your computer's processor. Using 15 made this take less than a second.
# skip row 1, that's "select" or all
df4 <- mclapply(2:nrow(df1), mc.cores = getOption("mc.cores", 15L),
function(x){
# assemble URL
y = toString(df1$Abb[x])
urall = paste0(urone, y, urtwo)
# get table
tabs <- read_html(urall) %>% html_table(., fill = T)
tabs
})
length(df4)
# [1] 36
# create storage using first state with data
df5 <- df4[[7]][[1]]
map(8:36,
function(x){
y = length(df4[[x]])
if(y > 0){
df5 <<- rbind(df5, df4[[x]][[1]])
}
})
Now you have a data frame, df5 that started as each state queried separately.
I didn't look at how the data was different. However, my tbOne data frame has 577 observations. My df5 data frame has 584.

Webscraping using R - Table content

New to webscraping. I am trying to scrape specific data from websites.
For eg. https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100
I need to scrape the distance the ship has travelled in 2020 and 2021.
shipws <- read_html(shipsite)
The above code gets me the site. shipsite is the url.
Now, I tried using,
a <- shipws %>%
html_nodes( css = "_1hFrZ") %>%
html_attr()
But it returns a empty. _1hFrZ was the td class in the website. It returns empty when I use html_text() too.
a <- shipsite %>%
html() %>%
html_nodes(xpath='//*[#id="tbc1"]/div[1]/div[1]/table') %>%
html_table()
Few tutorials asked me to do it above way and that turned up with errors that html() function does not exist. If I remove html()
Would love to know where I am going wrong. Thank you.

We can just get all the tables from website by,
df = 'https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100' %>%
read_html() %>% html_table()
The table of interest is,
df[[2]]
# A tibble: 4 x 2
X1 X2
<chr> <int>
1 Travelled distance (nm) 98985
2 Port Calls 54
3 Average / Max Speed (kn) NA
4 Min / Max Draught (m) NA

scraping wikimedia category trees

I want to use R to scrape the links contained within a wikimedia category tree and the structure of the tree from here. The code below can open up all the collapsible bullet points
library(RSelenium)
rD <- rsDriver(check = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://commons.wikimedia.org/wiki/Category:Sports")
n <- 1
# n <- 10 # takes a long time to expand all bullet points
for(i in 1:n){
b <- remDr$findElements(using = "css selector", "[title='expand']")
for(i in 1:length(b)){
b[[i]]$clickElement()
}
}
... but i am struggling to build a data base that would look like...
I can get the bullet href and names using the code below, but i am struggling to find a way to indicate which level each bullet points refers too (i.e. how deep in the category tree each bullet point is)? I am thinking there might be a clever xpath method to count how many CategoryTreeChildren deep each bullet is but that is reaching well beyond my capabilities.
# for testing I manually expand the bullets for the first couple of branches
# (fully for Bulgaria women badminton, basketball) and the last possible
# branch rather than let the for loop run and run through multiple cycles.
library(tidyverse)
library(rvest)
s <- remDr$getPageSource()
d <- read_html(s[[1]]) %>%
html_nodes("div#mw-subcategories") %>%
html_nodes("div.CategoryTreeItem") %>%
html_nodes("a") %>%
map(xml_attrs) %>%
map_df(~as.list(.)) %>%
as_tibble()
# > d
# # A tibble: 135 x 2
# href title
# <chr> <chr>
# 1 /wiki/Category:Categories_by_sport Category:Categories by sport
# 2 /wiki/Category:Categories_by_sport_by_c~ Category:Categories by sport by co~
# 3 /wiki/Category:Categories_of_Bulgaria_b~ Category:Categories of Bulgaria by~
# 4 /wiki/Category:Female_sportspeople_from~ Category:Female sportspeople from ~
# 5 /wiki/Category:Female_badminton_players~ Category:Female badminton players ~
# 6 /wiki/Category:Maria_Delcheva Category:Maria Delcheva
# 7 /wiki/Category:Petya_Nedelcheva Category:Petya Nedelcheva
# 8 /wiki/Category:Gabriela_Stoeva Category:Gabriela Stoeva
# 9 /wiki/Category:Stefani_Stoeva Category:Stefani Stoeva
# 10 /wiki/Category:Women%27s_basketball_pla~ Category:Women's basketball player~
I have also played around with the WikipediR package - it says in the package description that it can be used to retrieve elements of category trees but i cannot find an example of how to implement it.

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R : http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (All the pages)
I'm new to programming. And everywhere I've looked, tables are mostly identified with IDs or Divs or Class. On this page there's none. Data is stored in Table format. How should I scrape it?
This is what I did :
library(rvest)
webpage <- read_html("http://zipnet.in/index.php
page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[9:10] %>%
html_table(fill = TRUE)
colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
"Police Station", "Status", "Mobile Type(GSM/CDMA)",
"FIR/DD/GD Dat")

You can scrape the table data by targeting the css id of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have #AutoNumber15 css id while the third (in the middle) has the #AutoNumber16 css id.
I put a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))
# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
# default link
link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
# build link
link <- paste0(link, page_id)
# get tables data
wp <- read_html(link)
wp %>%
html_nodes("#AutoNumber16, #AutoNumber15") %>%
html_table(fill = TRUE) %>%
bind_rows()
}
# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))
# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the sever is not overloaded with requests.
map_df(iter_page, ~ {
pb$tick()$print()
df <- get_page(.x)
Sys.sleep(sample(10, 1) * 0.1)
as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>

parse multiple XML files based on a vector and rbind in a dataframe

With some effort and help from the stackers, I have been able to parse a webpage and save it as a dataframe. I want to repeat the same operation on multiple xml files and rbind the list. Here is what I tried and did successfully:
library(XML)
xml.url <- "http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml"
doc <- xmlParse(xml.url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
Above code works well, now when I try to apply a function to do the same for multiple xml files :
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
xml_url_test = as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml",
ERS_ID))
XML_parser <- function(XML_url){
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
return(x_t)
}
major_test <- sapply(xml_url_test, XML_parser)
It works, but gives me a long list that is not in the right data frame format as I generated for the single XML file.
Finally I would like to also add a column to the final dataframe that has the ERS number from the ERS_ID vector
Something like x_t$ERSid <- ERS_ID in the function
Can someone point out what am I missing in the function as well as any better ways to do the task?
Thanks!

Your main issue is using sapply over lapply() where the latter returns a list and former attempts to simplify to a vector or matrix, here being a matrix.
major_test <- lapply(xml_url_test, XML_parser)
Of course, sapply is a wrapper for lapply and can also return a list: sapply(..., simplify=FALSE):
major_test <- sapply(xml_url_test, XML_parser, simplify=FALSE)
However, a few other items came up:
At beginning, you are not concatenating your ERS_ID to the url stem with sprintf's %s operator. So right now, the same urls are repeating.
At end, you are not binding your list of data frames into a compiled final single dataframe.
Add new ERS column inside your defined function, passing in ERS_ID vector. And while creating column, also remove the ERS prefix with gsub.
R code (adjusted)
XML_parser <- function(eid) {
XML_url <- as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", eid))
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
x_t$ERSid <- gsub("ERS", "", eid) # ADD COL, REMOVE ERS
x_t <- x_t[,c(ncol(x_t),2:ncol(x_t)-1)] # MOVE NEW COL TO FIRST
return(x_t)
}
major_test <- lapply(ERS_ID, XML_parser)
# major_test <- sapply(ERS_ID, XML_parser, simplify=FALSE)
# BIND DATA FRAMES TOGETHER
finaldf <- do.call(rbind, major_test)
# RESET ROW NAMES
row.names(finaldf) <- seq(nrow(finaldf))

Using xml2 and the tidyverse you can do something like this:
require(xml2)
require(purrr)
require(tidyr)
urls <- rep("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml", 2)
identifier <- LETTERS[seq_along(urls)] # Take a unique identifier per url here
parse_attribute <- function(x){
out <- data.frame(tag = xml_text(xml_find_all(x, "./TAG")),
value = xml_text(xml_find_all(x, "./VALUE")), stringsAsFactors = FALSE)
spread(out, tag, value)
}
doc <- map(urls, read_xml)
out <- doc %>%
map(xml_find_all, "//SAMPLE_ATTRIBUTE") %>%
set_names(identifier) %>%
map_df(parse_attribute, .id="url")
Which gives you a 2x36 data.frame. To parse the column type i would suggest using readr::type_convert(out)
Out looks as follows:
url age body product body site body-mass index chimera check collection date
1 A 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
2 B 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
disease status ENA-BASE-COUNT ENA-CHECKLIST ENA-FIRST-PUBLIC ENA-LAST-UPDATE ENA-SPOT-COUNT
1 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
2 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
environment (biome) environment (feature) environment (material) experimental factor
1 organism-associated habitat organism-associated habitat mucus microbiome
2 organism-associated habitat organism-associated habitat mucus microbiome
gastrointestinal tract disorder geographic location (country and/or sea,region) geographic location (latitude)
1 Ulcerative Colitis India 72.82807
2 Ulcerative Colitis India 72.82807
geographic location (longitude) host subject id human gut environmental package investigation type
1 18.94084 1 human-gut metagenome
2 18.94084 1 human-gut metagenome
medication multiplex identifiers pcr primers phenotype project name
1 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
2 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
sample collection device or method sequence quality check sequencing method sequencing template sex target gene
1 biopsy software pyrosequencing DNA male 16S rRNA
2 biopsy software pyrosequencing DNA male 16S rRNA
target subfragment
1 V1V2
2 V1V2

purrr is really helpful here, as you can iterate over a vector of URLs or a list of XML files with map, or within nested elements with at_depth, and simplify the results with the *_df forms and flatten.
library(tidyverse)
library(xml2)
# be kind, don't call this more times than you need to
x <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762") %>%
sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", .) %>%
map(read_xml) # read each URL into a list item
df <- x %>% map(xml_find_all, '//SAMPLE_ATTRIBUTE') %>% # for each item select nodes
at_depth(2, as_list) %>% # convert each (nested) attribute to list
map_df(map_df, flatten) # flatten items, collect pages to df, then all to one df
df
## # A tibble: 175 × 3
## TAG VALUE UNITS
## <chr> <chr> <chr>
## 1 investigation type metagenome <NA>
## 2 project name BMRP <NA>
## 3 experimental factor microbiome <NA>
## 4 target gene 16S rRNA <NA>
## 5 target subfragment V1V2 <NA>
## 6 pcr primers 27F-338R <NA>
## 7 multiplex identifiers TGATACGTCT <NA>
## 8 sequencing method pyrosequencing <NA>
## 9 sequence quality check software <NA>
## 10 chimera check ChimeraSlayer; Usearch 4.1 database <NA>
## # ... with 165 more rows

You can retrieve multiple IDs with a single REST url using a comma-separated list or range like ERS445758-ERS445762 and avoid multiple queries to the ENA.
This code gets all 5 samples into a node set and then applies functions using a leading dot in the xpath string so its relative to that node.
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
url <- paste0( "http://www.ebi.ac.uk/ena/data/view/", paste(ERS_ID, collapse=","), "&display=xml")
doc <- xmlParse(url)
samples <- getNodeSet( doc, "//SAMPLE")
## check the first node
samples[[1]]
## get the sample attribute node set and apply xmlToDataFrame to that
x <- lapply( lapply(samples, getNodeSet, ".//SAMPLE_ATTRIBUTE"), xmlToDataFrame)
# labels for bind_rows
names(x) <- sapply(samples, xpathSApply, ".//PRIMARY_ID", xmlValue)
library(dplyr)
y <- bind_rows(x, .id="sample")
z <- subset(y, TAG %in% c("age","sex","body site","body-mass index") , 1:3)
sample TAG VALUE
15 ERS445758 age 28
16 ERS445758 sex male
17 ERS445758 body site Sigmoid colon
19 ERS445758 body-mass index 16.9550173
50 ERS445759 age 58
51 ERS445759 sex male
...
library(tidyr)
z %>% spread( TAG, VALUE)
sample age body site body-mass index sex
1 ERS445758 28 Sigmoid colon 16.9550173 male
2 ERS445759 58 Sigmoid colon 23.22543185 male
3 ERS445760 26 Sigmoid colon 20.76124567 female
4 ERS445761 30 Sigmoid colon 0 male
5 ERS445762 36 Sigmoid colon 0 male

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R - Importing and formatting mutiple tables from various urls - r

Related

Rvest : Extracting clickable content

Webscraping using R - Table content

scraping wikimedia category trees

How to scrape multiple tables that are without IDs or Class using R

parse multiple XML files based on a vector and rbind in a dataframe

Categories

Resources