Replace Missing Values with NA in Web Scraping with R

I am trying web scraping with R (rvest) for the first time. I am trying to replace missing values with NA, but it doesn't seem to work at all. Can you check the code below and help me?
library(rvest)
library(purrr)

link <- "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=1&ref_=adv_nxt"
page <- read_html(link)

movies <- data.frame(
  name = page %>% html_nodes(".lister-item-header a") %>% html_text(),
  year = page %>% html_nodes(".text-muted.unbold") %>% html_text(),
  certificate = page %>% html_nodes(".certificate") %>% html_text(),
  runtime = page %>% html_nodes(".runtime") %>% html_text(),
  genre = page %>% html_nodes(".genre") %>% html_text(),
  imdb_rating = page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text(),
  director = page %>% html_nodes(".text-muted+ p a:nth-child(1)") %>% html_text(),
  number_of_votes = page %>% html_nodes(".sort-num_votes-visible span:nth-child(2)") %>% html_text(),
  gross = page %>% html_nodes(".ghost~ .text-muted+ span") %>% html_text()
)
The certificate and gross values are missing for certain movies. I tried the following methods to replace the missing values with NA:
certificate = page %>% html_nodes(".certificate") %>%
  html_text() %>% gsub("\\s+", " ", .)

gross = page %>% html_nodes(".ghost~ .text-muted+ span") %>%
  html_text() %>% replace(!nzchar(.), NA)

certificate = page %>% html_nodes(".certificate") %>%
  html_text(trim = TRUE) %>% {if (length(.) == 0) NA else .}
None of them works: each command executes without error, but the missing values are not replaced with NA and I still get fewer entries than there are movies.
Without replacing the missing values, I cannot build the movies data frame, because I get this error:
Error in data.frame(name = page %>% html_nodes(".lister-item-header a") %>% :
  arguments imply differing number of rows: 50, 49, 37
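For context, data.frame() throws this whenever its columns come out with incompatible lengths; a minimal reproduction with made-up vectors, unrelated to the scraping:
data.frame(a = 1:3, b = 1:2)
#> Error in data.frame(a = 1:3, b = 1:2) :
#>   arguments imply differing number of rows: 3, 2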

I recommend narrowing your web scraping focus to a specific parent element, such as the cards shown in the image, and then iterating through those elements to extract the specific child elements of interest. This approach will make the process more efficient and targeted. NA will be returned if no element is found in certain cards.
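The key behavior, sketched below on a made-up HTML fragment, is that html_element() (singular) returns exactly one result per element searched, yielding NA when nothing matches, whereas html_elements() (plural) silently drops the misses:
library(rvest)
card <- minimal_html("<div class='card'><span class='a'>x</span></div>") %>%
  html_element(".card")
card %>% html_element(".a") %>% html_text2()            # "x"
card %>% html_element(".certificate") %>% html_text2()  # NA: no match in this card
Applied to your page: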
library(tidyverse)
library(rvest)

movies <-
  "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=1&ref_=adv_nxt" %>%
  read_html()

movies %>%
  html_elements(".lister-item-content") %>% # the cards
  map_dfr(~ tibble( # iterate through the cards and grab the elements:
    title = .x %>%
      html_element(".lister-item-header a") %>%
      html_text2(),
    year = .x %>%
      html_element(".text-muted.unbold") %>%
      html_text2(),
    certificate = .x %>%
      html_element(".certificate") %>%
      html_text2(),
    runtime = .x %>%
      html_element(".runtime") %>%
      html_text2(),
    genre = .x %>%
      html_element(".genre") %>%
      html_text2(),
    rating = .x %>%
      html_element(".ratings-imdb-rating strong") %>%
      html_text2(),
    director = .x %>%
      html_element(".text-muted+ p a:nth-child(1)") %>%
      html_text2(),
    votes = .x %>%
      html_element(".sort-num_votes-visible span:nth-child(2)") %>%
      html_text2(),
    gross = .x %>%
      html_element(".ghost~ .text-muted+ span") %>%
      html_text2()
  ))
Results
# A tibble: 50 × 9
title year certi…¹ runtime genre rating direc…² votes gross
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "The Dark Knight" (200… 15 152 min Acti… 9.0 Christ… 2,66… $534…
2 "Ringenes herre: Atter en kong… (200… 12 201 min Acti… 9.0 Peter … 1,85… $377…
3 "Inception" (201… 15 148 min Acti… 8.8 Christ… 2,36… $292…
4 "Ringenes herre: Ringens brors… (200… 12 178 min Acti… 8.8 Peter … 1,88… $315…
5 "Ringenes herre: To t\u00e5rn" (200… 12 179 min Acti… 8.8 Peter … 1,67… $342…
6 "The Matrix" (199… 15 136 min Acti… 8.7 Lana W… 1,92… $171…
7 "Star Wars: Episode V - Imperi… (198… 9 124 min Acti… 8.7 Irvin … 1,29… $290…
8 "Soorarai Pottru" (202… NA 153 min Acti… 8.7 Sudha … 117,… NA
9 "Stjernekrigen" (197… 11 121 min Acti… 8.6 George… 1,37… $322…
10 "Terminator 2 - Dommens dag" (199… 15 137 min Acti… 8.6 James … 1,10… $204…
# … with 40 more rows, and abbreviated variable names ¹​certificate, ²​director
# ℹ Use `print(n = ...)` to see more rows

Related

Select the correct html element with rvest

A while back a Stack Overflow user helped me build this script. I have edited it to add more attributes, but I ran into problems when trying to add the authors.
The author name sits next to the target and href attributes; that is the part I am stuck on.
library(tidyverse)
library(rvest)

startTime <- Sys.time()

get_cg <- function(pages) {
  cat("Scraping page", pages, "\n")
  page <-
    str_c("https://cgspace.cgiar.org/discover?scope=10568%2F106146&query=cassava&submit=&rpp=10&page=", pages) %>%
    read_html()
  tibble(
    title = page %>%
      html_elements(".ds-artifact-item") %>%
      html_element(".description-info") %>%
      html_text2(), # works fine
    fecha = page %>%
      html_elements(".ds-artifact-item") %>%
      html_element(".date") %>%
      html_text2(), # works fine
    Type = page %>%
      html_elements(".ds-artifact-item") %>%
      html_element(".artifact-type") %>%
      html_text2(), # works fine
    Autor = page %>%
      html_elements(".ds-artifact-item") %>%
      html_element(".description-info") %>%
      html_attr("href"), # does not return the authors
    link = page %>%
      html_elements(".ds-artifact-item") %>%
      html_element(".description-info") %>%
      html_attr("href") %>% # works fine
      str_c("https://cgspace.cgiar.org", .)
  )
}

df <- map_dfr(1, get_cg)
endTime <- Sys.time()
print(endTime - startTime)
I have tried other selectors, but I only get NA.
This should get you a collapsed list of authors for each item, separated by ;, basically the same as presented on the page:
library(tidyverse, warn.conflicts = F)
library(rvest, warn.conflicts = F)

startTime <- Sys.time()

get_cg <- function(pages) {
  cat("Scraping page", pages, "\n")
  page <-
    str_c("https://cgspace.cgiar.org/discover?scope=10568%2F106146&query=cassava&submit=&rpp=10&page=", pages) %>%
    read_html()
  html_elements(page, "div.artifact-description > div.artifact-description") %>%
    map_df(~ list(
      title = html_element(.x, ".description-info") %>% html_text2(),
      fecha = html_element(.x, ".date") %>% html_text2(),
      Type = html_element(.x, ".artifact-type") %>% html_text2(),
      # Autor_links = html_elements(.x, ".description-info > span > a") %>% html_attr("href") %>% paste(collapse = ";"),
      Autor = html_element(.x, "span.description-info") %>% html_text2(),
      link = html_element(.x, "a.description-info") %>% html_attr("href") %>% str_c("https://cgspace.cgiar.org", .)
    ))
}

df <- map_dfr(1, get_cg)
#> Scraping page 1
endTime <- Sys.time()
print(endTime - startTime)
#> Time difference of 0.989037 secs
Result:
df
#> # A tibble: 10 × 5
#> title fecha Type Autor link
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Global Climate Regions for Cassava 2020… Type… Hyma… http…
#> 2 Performance of the CSM–MANIHOT–Cassava model for sim… 2021… Type… Phon… http…
#> 3 Adoption of cassava improved modern varieties in the… 2020 Type… Laba… http…
#> 4 First report of Sri Lankan cassava mosaic virus and … 2021… Type… Chit… http…
#> 5 Surveillance and diagnostics dataset on Sri Lankan c… 2020 Type… Siri… http…
#> 6 Socieconomic and soil conservation practices for cas… 2022… Type… Ibar… http…
#> 7 The transformation and outcome of traditional cassav… 2020 Type… Dou,… http…
#> 8 Cassava Annual Report 2019 2020 Type… Inte… http…
#> 9 Cassava Annual Report 2020 2021… Type… Bece… http…
#> 10 Adoption of cassava improved modern varieties in the… 2020 Type… Flor… http…
glimpse(df)
#> Rows: 10
#> Columns: 5
#> $ title <chr> "Global Climate Regions for Cassava", "Performance of the CSM–MA…
#> $ fecha <chr> "2020-08-03", "2021-05-01", "2020", "2021-04-23", "2020", "2022-…
#> $ Type <chr> "Type:Dataset", "Type:Journal Article", "Type:Dataset", "Type:Jo…
#> $ Autor <chr> "Hyman, Glenn G.", "Phoncharoen, Phanupong; Banterng, Poramate; …
#> $ link <chr> "https://cgspace.cgiar.org/handle/10568/109500", "https://cgspac…
Created on 2022-12-03 with reprex v2.0.2
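A note on the original bug: html_attr() returns NA whenever the matched node lacks the requested attribute, which is why the Autor column came back empty. A minimal sketch on a made-up fragment:
library(rvest)
node <- minimal_html("<span class='description-info'>Doe, Jane</span>") %>%
  html_element("span")
html_attr(node, "href") # NA: a bare <span> carries no href attribute
html_text2(node)        # "Doe, Jane": the author name lives in the text node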

map_df -- Argument 1 must be a data frame or a named atomic vector

I am an infectious diseases physician and have set myself the challenge of creating a data frame of the UK's cumulative published monkeypox cases, so I can graph it as a running tally or a choropleth map, as there is no nice dashboard for this at present.
All the data is published as HTML web pages rather than as a nice CSV, so I am trying to scrape it all with the rvest package.
Data is only published intermittently (about twice per week) with the cumulative totals for each of the four home nations in the UK.
I have working code to pull data from each of the separate web pages, and testing it on the first two pages in my mpx_gov_uk_pages list works well, giving a small example tibble:
library(tidyverse)
library(lubridate)
library(rvest)
library(janitor)

# load in overview page url which has links to each date of published cases
mpx_gov_uk_overview_page <- c("https://www.gov.uk/government/publications/monkeypox-outbreak-epidemiological-overview")

# extract urls for each date page
mpx_gov_uk_pages <- mpx_gov_uk_overview_page %>%
  read_html() %>%
  html_nodes(".govuk-link") %>%
  html_attr("href") %>%
  str_subset("\\d{1,2}-[a-z]+-\\d{4}") %>%
  paste0("https://www.gov.uk", .) %>%
  as.character()

# make table for home nations for each date
table1 <- mpx_gov_uk_pages[1] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[1], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))

table2 <- mpx_gov_uk_pages[2] %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() %>%
  rename(area = starts_with(c("uk", "devolved")),
         cases = matches(c("total", "confirmed_cases"))) %>%
  separate(cases, c("cases", NA), sep = "\\s\\(") %>%
  mutate(date = dmy(str_extract(mpx_gov_uk_pages[2], "\\d{1,2}-[a-z]+-\\d{4}")),
         cases = as.numeric(gsub(",", "", cases))) %>%
  select(date, area, cases) %>%
  filter(!area %in% c("Total"))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [4].
# Combine tables
bind_rows(table1, table2)
#> # A tibble: 8 × 3
#> date area cases
#> <date> <chr> <dbl>
#> 1 2022-08-02 England 2638
#> 2 2022-08-02 Northern Ireland 24
#> 3 2022-08-02 Scotland 65
#> 4 2022-08-02 Wales 32
#> 5 2022-07-29 England 2436
#> 6 2022-07-29 Northern Ireland 19
#> 7 2022-07-29 Scotland 61
#> 8 2022-07-29 Wales 30
I want to automate this by creating a generic function and passing the list of URLs to purrr::map_df, as there will be an ever-growing number of pages (there are already 13):
pull_first_table <- function(x) {
  x %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    janitor::clean_names() %>%
    rename(area = starts_with(c("uk", "devolved")),
           cases = matches(c("total", "confirmed_cases"))) %>%
    separate(cases, c("cases", NA), sep = "\\s\\(") %>%
    mutate(date = dmy(str_extract({{x}}, "\\d{1,2}-[a-z]+-\\d{4}")),
           cases = as.numeric(gsub(",", "", cases))) %>%
    select(date, area, cases) %>%
    filter(!area %in% c("Total"))
}

summary_table <- map_df(mpx_gov_uk_pages, ~ pull_first_table)
Error in `dplyr::bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_error()` to see where the error occurred.
The generic function works fine when I supply it with a single element, e.g. mpx_gov_uk_pages[2], but I cannot get map_df to work properly even though the web scraping itself produces tibbles.
All help and pointers greatly welcomed.
We just need the function itself, not a lambda expression: ~ pull_first_table builds a lambda that returns the function object instead of calling it, so map_df receives a list of functions rather than data frames.
map_dfr(mpx_gov_uk_pages, pull_first_table)
Output:
# A tibble: 52 × 3
date area cases
<date> <chr> <dbl>
1 2022-08-02 England 2638
2 2022-08-02 Northern Ireland 24
3 2022-08-02 Scotland 65
4 2022-08-02 Wales 32
5 2022-07-29 England 2436
6 2022-07-29 Northern Ireland 19
7 2022-07-29 Scotland 61
8 2022-07-29 Wales 30
9 2022-07-26 England 2325
10 2022-07-26 Northern Ireland 18
# … with 42 more rows
If we do want a lambda expression, it has to actually call the function:
map_dfr(mpx_gov_uk_pages, ~ pull_first_table(.x))
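A minimal sketch of the difference, using a made-up function:
square <- function(x) x^2
purrr::map(1:2, ~ square)     # list of two function objects: map_df() fails on this
purrr::map(1:2, square)       # calls the function: 1, 4
purrr::map(1:2, ~ square(.x)) # lambda that actually calls it: 1, 4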

Sort data frame rows in the right columns

I am currently in the data cleaning process. My data has a six-digit number of rows. I cannot come up with a solution to get the values into the right columns. Can you give me a hint, please?
Thanks in advance
df <- data.frame(
  price = c("['380€']", "3hr 15 min", "4hr", "3hr 55min", "2h", "20€"),
  airlines = c("['Icelandir']", "€1,142", "16€", "17€", "19€", "Iberia"),
  duration = c("['3h']", "Turkish airlines", "KLM", "easyJet", "2 hr 1min", "Finnair"),
  depart = c("LGW", "AMS", "NUE", "ZRH", "LHR", "VAR")
)
My desired output is

price        airline           duration          price_right  airline_right     duration_right  depart
['380€']     ['Icelandair']    ['3h']            ['380€']     ['Icelandair']    ['3h']          LGW
3 hr 15 min  €1,142            Turkish airlines  €1,142       Turkish airlines  3 hr 15 min     AMS
4hr          €16               KLM               €16          KLM               4hr             NUE
3hr 55min    €17               easyJet           €17          easyJet           3hr 55min       ZRH
2h           €19               2hr 1min          €19          Iberia            2h              LHR
2hr min      "Iberia"          Finnair           €20          Finnair           2hr 1min        VAR
For this example we could do something like this:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(everything()) %>%
  arrange(value) %>%
  group_by(group = as.integer(gl(n(), 3, n()))) %>%
  mutate(id = row_number()) %>%
  mutate(name = case_when(id == 1 ~ "price",
                          id == 2 ~ "duration",
                          id == 3 ~ "airlines",
                          TRUE ~ NA_character_)) %>%
  ungroup() %>%
  select(-group, -id) %>%
  group_by(name) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-id)
price duration airlines
<chr> <chr> <chr>
1 ['380€'] ['3h'] ['Icelandir']
2 €1,142 3hr 15 min Turkish airlines
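The non-obvious step is gl(n(), 3, n()): gl(n, k, length) generates factor levels in blocks of k, so after sorting, each consecutive block of three values is treated as one flight's price/duration/airline triple. A quick sketch of what gl() produces:
as.integer(gl(6, 3, 18))
#> [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6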

how to scrape text from an HTML body

I've never scraped. Would it be straightforward to scrape only the text in the main, big gray box from the link below (starting with the header SRUS43 KMSR 271039 and ending with .END)? My end goal is to end up with three tidy columns of data from all that text: the five-digit codes, the values in inches, and the basin elevation descriptions, so any pointers on processing the text format are welcome, too.
https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6
Thank you for any help.
Reading in the text is fairly easy (see @DiceBoyT's answer). Cleaning up the format into three columns is a bit more involved. The code below could use some clean-up (especially the regex), but it gets the job done:
library(tidyverse)
library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>%
  html_node(".notes") %>%
  html_text()

df <- tibble(txt = read_lines(text))

df %>%
  mutate(
    row = row_number(),
    with_code = str_extract(txt, "^[A-Za-z0-9]{5}\\s+\\d+(\\.)?\\d"),
    wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% str_extract("[:digit:]+\\.?[:digit:]"),
    basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% str_sub(start = 2)
  ) %>%
  separate(with_code, c("code", "val"), sep = "\\s+") %>%
  mutate(
    combined_val = case_when(
      !is.na(val) ~ val,
      !is.na(wo_code) ~ wo_code,
      TRUE ~ NA_character_
    ) %>% as.numeric()
  ) %>%
  filter(!is.na(combined_val)) %>%
  mutate(
    code = zoo::na.locf(code),
    basin_desc = zoo::na.locf(basin_desc)
  ) %>%
  select(
    code, combined_val, basin_desc
  )
#> # A tibble: 643 x 3
#> code combined_val basin_desc
#> <chr> <dbl> <chr>
#> 1 ACSC1 0 San Antonio Ck - Sunol
#> 2 ADLC1 0 Arroyo De La Laguna
#> 3 ADOC1 0 Santa Ana R - Prado Dam
#> 4 AHOC1 0 Arroyo Honda nr San Jose
#> 5 AKYC1 41 SF American nr Kyburz
#> 6 AKYC1 3.2 SF American nr Kyburz
#> 7 AKYC1 42.2 SF American nr Kyburz
#> 8 ALQC1 0 Alamo Canal nr Pleasanton
#> 9 ALRC1 0 Alamitos Ck - Almaden Res
#> 10 ANDC1 0 Coyote Ck - Anderson Res
#> # ... with 633 more rows
Created on 2019-03-27 by the reprex package (v0.2.1)
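One piece worth spelling out: zoo::na.locf() ("last observation carried forward") fills each NA with the most recent non-NA value, which is how every measurement row above inherits the station code and basin description that precede it. A quick sketch:
zoo::na.locf(c("AKYC1", NA, NA, "ALQC1", NA))
#> [1] "AKYC1" "AKYC1" "AKYC1" "ALQC1" "ALQC1"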
This is pretty straightforward to scrape with rvest:
library(rvest)
text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>%
html_node(".notes") %>%
html_text()

Replacing missing value when web scraping (rvest)

I'm trying to write a script that goes through a list of players provided by the website Transfermarkt and gathers some information about them. I've created the script below, but hit a problem with 1 of the 29 players in the list. Because one page is arranged differently than the others, the code outputs a list of only 28 players, since it can't find the information on that page.
I understand why the code I've written doesn't find anything on the given page and thus gives me a list of 28, but I don't know how to rewrite the code to achieve what I want:
the script should simply insert a "-" if it does not find anything (in this case, a nationality) for the node on a particular page, and return a full list of 29 players with all the other info in it.
The player page in question is this one, and while the other pages have the node used in the code for nationality, here it's ".dataValue span".
I'm still quite new to R and it might be an easy fix, but at the moment I can't figure it out. Any help or advice is appreciated.
URL <- "http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1"
WS <- read_html(URL)
Team <- WS %>% html_nodes(".spielprofil_tooltip") %>% html_attr("href") %>% as.character()
Team <- paste0("http://www.transfermarkt.de", Team)
Catcher <- data.frame(Name = character(), Nat = character(), Vertrag = character())

for (i in Team) {
  WS1 <- read_html(i)
  Name <- WS1 %>% html_nodes("h1") %>% html_text() %>% as.character()
  Nat <- WS1 %>% html_nodes(".hide-for-small+ p .dataValue span") %>% html_text() %>% as.character()
  Vertrag <- WS1 %>% html_nodes(".dataValue:nth-child(9)") %>% html_text() %>% as.character()
  if (length(Nat) > 0) {
    temp <- data.frame(Name, Nat, Vertrag)
    Catcher <- rbind(Catcher, temp)
  }
  cat("*")
}

num_Rows <- nrow(Catcher)
odd_indexes <- seq(1, num_Rows, 2)
Catcher <- data.frame(Catcher[odd_indexes, ])
It's honestly easier to scrape the whole table, just in case things move around. I find purrr a helpful complement to rvest, letting you iterate over URLs and node lists and easily coerce the results to data frames:
library(rvest)
library(purrr)

# build dynamically if you like
urls <- c(boateng = 'http://www.transfermarkt.de/jerome-boateng/profil/spieler/26485',
          friedl = 'http://www.transfermarkt.de/marco-friedl/profil/spieler/156990')

# scrape once, parse iteratively
html <- urls %>% map(read_html)

df <- html %>%
  map(html_nodes, '.dataDaten p') %>%
  map_df(map_df,
         ~ list(
           variable = .x %>% html_node('.dataItem') %>% html_text(trim = TRUE),
           value = .x %>% html_node('.dataValue') %>% html_text(trim = TRUE) %>% gsub('\\s+', ' ', .)
         ),
         .id = 'player')

df
#> # A tibble: 17 × 3
#> player variable value
#> <chr> <chr> <chr>
#> 1 boateng Geb./Alter: 03.09.1988 (28)
#> 2 boateng Geburtsort: Berlin
#> 3 boateng Nationalität: Deutschland
#> 4 boateng Größe: 1,92 m
#> 5 boateng Position: Innenverteidiger
#> 6 boateng Vertrag bis: 30.06.2021
#> 7 boateng Berater: SAM SPORTS
#> 8 boateng Nationalspieler: Deutschland
#> 9 boateng Länderspiele/Tore: 67/1
#> 10 friedl Geb./Alter: 16.03.1998 (19)
#> 11 friedl Nationalität: Österreich
#> 12 friedl Größe: 1,87 m
#> 13 friedl Position: Linker Verteidiger
#> 14 friedl Vertrag bis: 30.06.2021
#> 15 friedl Berater: acta7
#> 16 friedl Akt. Nationalspieler: Österreich U19
#> 17 friedl Länderspiele/Tore: 6/0
Alternatively, that particular piece of data appears in three places on those pages, so if one is inconsistent, there is a chance the others are better. Or grab it from the table for the whole team: the countries are not printed, but they are in the title attribute of the flag images, which can be read with html_attr:
html <- read_html('http://www.transfermarkt.de/fc-bayern-munchen/leistungsdaten/verein/27/reldata/%262016/plus/1')

team <- html %>%
  html_nodes('tr.odd, tr.even') %>%
  map_df(~ list(player = .x %>% html_node('a.spielprofil_tooltip') %>% html_text(),
                nationality = .x %>% html_nodes('img.flaggenrahmen') %>% html_attr('title') %>% toString()))

team
#> # A tibble: 29 × 2
#> player nationality
#> <chr> <chr>
#> 1 Manuel Neuer Deutschland
#> 2 Sven Ulreich Deutschland
#> 3 Tom Starke Deutschland
#> 4 Jérôme Boateng Deutschland
#> 5 David Alaba Österreich
#> 6 Mats Hummels Deutschland
#> 7 Javi Martínez Spanien
#> 8 Juan Bernat Spanien
#> 9 Philipp Lahm Deutschland
#> 10 Rafinha Brasilien, Deutschland
#> # ... with 19 more rows
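The attribute trick generalizes; a minimal sketch on a made-up fragment, which also shows how toString() collapses dual nationalities like Rafinha's into a single cell:
library(rvest)
flags <- minimal_html('<img class="flaggenrahmen" title="Brasilien"/><img class="flaggenrahmen" title="Deutschland"/>') %>%
  html_elements("img.flaggenrahmen")
html_attr(flags, "title") %>% toString()
#> [1] "Brasilien, Deutschland"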
