How to save multiple matches in one column? rvest, R and stringr - r

This question is a follow-up to a previous Stack Overflow problem.
I have these two example HTML files: url1.html; url2.html
The url3.html file is another example with more IPC entries.
In url2.html there is no (51) information, while in url1.html there is.
I'm using this code in R:
library(rvest)
library(tidyverse)
library(stringr)
x <- data.frame(
  URL = c(1:2),
  page = c(paste(readLines("url1.html"), collapse = "\n"),
           paste(readLines("url2.html"), collapse = "\n"))
)
for (i in 1:nrow(x)) {
  html <- x$page[i] %>% unclass() %>% unlist()
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    html_nodes(xpath = '//div[@id="classificacao0"]') %>%
    html_text(trim = T) %>%
    str_replace_all(., "[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> tmp
  if (length(tmp) == 0) tmp <- "ND"
  x$ipc_0[i] <- tmp %>% str_replace_all(., "\\s+", " ") %>% str_replace_all(., " \\)", "\\)")
}
for (i in 1:nrow(x)) {
  html <- x$page[i] %>% unclass() %>% unlist()
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    html_nodes(xpath = '//div[@id="classificacao1"]') %>%
    html_text(trim = T) %>%
    str_replace_all(., "[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> tmp
  if (length(tmp) == 0) tmp <- "ND"
  x$ipc_1[i] <- tmp %>% str_replace_all(., "\\s+", " ") %>% str_replace_all(., " \\)", "\\)")
}
Result: partially correct
Desired result: create a new data frame with the following structure.

| URL | IPC |
|-----|-----|
| 1 | B62B 1/16 (1968.09)... |
| 1 | B62B 1/00 (1968.09)... |
| 2 | ND |
Problem: there are URLs that have the code (51) and others that do not. When the code (51) is present, the page can contain "n" ids with the structure xpath='//div[@id="classificacao0"]'; the classification id can range from 0 to "n". How can I optimize this code to capture the necessary information without writing a separate for (variable in vector) loop for each "n"?
Any idea how to solve this problem?

You can use a CSS attribute = value selector list with the ^ (starts with) operator to capture/exclude elements whose id starts with a specific value.
Convert your current extraction code into a function which accepts (in this case) a URL as its argument. Extend the regex to remove the other characters not shown in your desired output.
Have that function return a tibble of the URL and the IPCs found; wrap the whole thing in a map_dfr() call to generate a single data frame result.
library(rvest)
library(tidyverse)
urls <- sprintf("https://prequest.websiteseguro.com/example/url%i.html", 1:3)
get_ipc <- function(url) {
  ipc <- read_html(url, encoding = "ISO-8859-1") %>%
    html_elements("div[id^=classificacao]:not([id^=classificacaoc]) .normal > b") %>%
    html_text(trim = T) %>%
    str_replace_all(., "[\\n\\r\\t]+|\\(|\\s{2,}|\\)", "") %>%
    stringr::str_trim()
  if (length(ipc) == 0) ipc <- "ND"
  return(tibble(url = url, ipc))
}
df <- purrr::map_dfr(urls, get_ipc)
print(df)
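To see what that attribute selector is doing, here is a minimal, self-contained sketch against a made-up HTML fragment (the div/span structure here is an assumption for illustration, not the real page markup):
library(rvest)
# Hypothetical fragment: two classification divs plus one id that should be excluded
doc <- minimal_html('
  <div id="classificacao0"><span class="normal"><b>B62B 1/16 (1968.09)</b></span></div>
  <div id="classificacao1"><span class="normal"><b>B62B 1/00 (1968.09)</b></span></div>
  <div id="classificacaocompleta"><span class="normal"><b>excluded</b></span></div>
')
doc %>%
  html_elements("div[id^=classificacao]:not([id^=classificacaoc]) .normal > b") %>%
  html_text(trim = TRUE)
#> [1] "B62B 1/16 (1968.09)" "B62B 1/00 (1968.09)"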

Related

Using separate items in a list as input for the pipe operator in R

I have written a script which uses a list of URLs as input and then scrapes certain information from the websites. I have done this with a for loop, but the processing time is already very long, and I expect the list to get bigger over time, so I wanted to re-code my script in a more efficient way. My idea was to eliminate the for loop and use pipe operators to reduce the processing time. My original (working) code is as follows:
imo <- c()
mmsi <- c()
for (i in 1:nrow(data)) {
  url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc", data$NAME[i])
  page <- read_html(url)
  CSSextract1 <- html_nodes(page, '.td_imo')
  CSSextract2 <- html_nodes(page, '.td_mmsi')
  imos <- html_text(CSSextract1)[2]
  imo[i] <- imos
  mmsis <- html_text(CSSextract2)[2]
  mmsi[i] <- mmsis
}
data$IMO <- gsub("[\r \n \t]", "", imo)
data$MMSI <- gsub("[\r \n \t]", "", mmsi)
data$NAME <- gsub("\\+", " ", data$NAME)
I have re-written the code, trying to eliminate the for loop, as follows:
CSSex1 <- function(page){
  CSSextract <- html_nodes(page, '.td_imo')
  return(CSSextract)
}
data$url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc", data$NAME)
data$mmsi <- data$url %>% read_html() %>% CSSex1() %>% html_text()[2]
However it gives me the error:
Error: `x` must be a string of length 1
I assume that, the way I coded it, the list (data$url) as a whole is now taken as the input, so my question is:
Is it possible, and if yes how, to take each element from data$url as an input without using a (for) loop?
You may wish to set url up as a column of a data frame (data) and try:
mmsi_func <- function(x) {
  z <- x %>%
    read_html() %>%
    CSSex1() %>%
    html_text()
  z[2]
}
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = mmsi_func(url))
or something along those lines. I am not sure what the expected output is supposed to look like, but if it is a list rather than a vector, you can use this minor adjustment for a list column in the dataframe:
mmsi_func <- function(x) {
  z <- x %>%
    read_html() %>%
    CSSex1() %>%
    html_text()
  z[2]
}
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = list(mmsi_func(url)))
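Another hedged sketch, assuming CSSex1() and data$url are defined as above: purrr's map functions also remove the explicit for loop without needing rowwise(), because each URL is passed to read_html() one at a time.
library(purrr)
data$mmsi <- map_chr(data$url, mmsi_func)    # character vector result
# data$mmsi <- map(data$url, mmsi_func)      # or keep the results as a list column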

How to let failed steps in R loops output NA or a null/blank so that there won't be any row-number issue with the data.frame() function

I'm STUCK at R looping. I want to use some scraped HTML to extract several variables. I'd love to see failed steps within an iteration output NA or null (blank) in that column, so that the row numbers remain the same as in the original for further manipulation. However, with or without tryCatch(), sometimes values repeat in the output dataset, resulting in redundant observations, and there are errors saying "arguments imply differing number of rows" (see 1st picture). I'm confused. Can anyone help me? Thank you very much!
#Bring In Libraries
library(rvest)
library(dplyr)
library(plyr)
library(stringr)
library(readr)
library(tidyr)
#Create a trim function to clean white space
trim <- function(x) {
  gsub('(^[[:space:]]+|[[:space:]]+$)', '', x)
}
extract_data <- function(x, y){
  trim(sapply(strsplit(sapply(strsplit(x, y), '[[', 2), '\n'), '[[', 2))
}
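# A hypothetical illustration of what extract_data() returns (made-up text, not the real page):
# value <- "Specifications\nFood Form\n Dry Food \nLifestage\n Adult "
# extract_data(value, 'Food Form')   # "Dry Food" -- the trimmed line after the label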
#Find the number of dog food list webpages to scrape
home <- read_html('https://www.chewy.com/b/food-332')
number <- home %>%
  html_nodes('.results-pagination ul li:nth-child(9) a') %>%
  html_text()
#Create a blank table
all_links <- data.frame()
#### First Grab the html for every dog food ####
for (i in 1:as.numeric(number)) {
  #Read the html of each dog food list webpage
  url <- read_html(paste0('https://www.chewy.com/b/food_c332_p', i))
  #Build Container for link
  for (j in 1:41) { #The biggest [j] in the CSS selector is 41 in page 1
    tryCatch({
      #This is the link to grab info for each dog food later
      link <- url %>%
        html_nodes(paste0('article:nth-child(', j, ') a')) %>%
        html_attr('href') %>% nth(1) %>%
        {paste0('https://www.chewy.com', .)}
      brand <- url %>%
        html_nodes(paste0('article:nth-child(', j, ') a section div.ga-eec__brand')) %>%
        html_text()
      name <- url %>%
        html_nodes(paste0('article:nth-child(', j, ') a section div.ga-eec__name')) %>%
        html_text() %>%
        {sapply(strsplit(., ','), '[[', 1)} %>%
        {gsub('^[[:alpha:]]/d ', '', .)} %>% #Clean title with irregular prefix
        str_remove(brand) %>% trim()
      links <- {data.frame(html = link, Name = name, Brand = brand)} #%>%
        #dplyr::rename(html=1)
      print(paste0('Finished page ', i, ', item ', j))
      all_links <- rbind(all_links, links)
    }, error = function(e){cat(conditionMessage(e))})
  }
}
Other times the loop skips the whole iteration where any value fails to extract and jumps directly to the next iteration, resulting in fewer observations in the output dataset compared to the original dataset (see 2nd picture).
#Create a blank table
stats <- data.frame()
for (i in 1:nrow(clean_links)) {
  tryCatch({
    link <- read_html(path[i])
    #Data to scrape for each cleaned html
    brand <- link %>%
      html_nodes('#product-subtitle a span') %>%
      html_text() %>% trim()
    name <- link %>%
      html_nodes('#product-title h1') %>%
      html_text() %>%
      str_remove(brand) %>% trim() %>%
      {gsub('^[[:alpha:]]/d ', '', .)} #Clean title with irregular prefix
    price <- link %>%
      html_nodes('.ga-eec__price') %>%
      html_text() %>%
      {gsub('\n', '', .)} %>% trim()
    size <- link %>%
      html_nodes('.ga-eec__variant') %>%
      html_text() %>% trim()
    value <- link %>%
      html_nodes('.cw-tabs__content--right') %>%
      html_text() %>% nth(1) %>%
      {gsub('\n[[:space:]]+', '\n', .)}
    food_form <- extract_data(value, 'Food Form')
    manufacturer <- extract_data(value, 'Brand')
    life_stage <- extract_data(value, 'Lifestage')
    breed_size <- extract_data(value, 'Breed')
    special_diet <- extract_data(value, 'Special Diet')
    nutro <- link %>% html_nodes('#Nutritional-Info section.cw-tabs__content--right') %>%
      html_text() %>%
      {as.numeric(unlist(regmatches(., gregexpr('[[:digit:]]+\\.*[[:digit:]]*', .))))}
    protein <- nutro[1]
    fat <- nutro[2]
    review_content <- link %>% html_nodes('.ugc-list_stars') %>%
      html_text() %>% trim() %>% parse_number()
    review_num <- review_content[1]
    rating <- review_content[2]
    recommend <- link %>%
      html_nodes('.ugc-list__recap__recommend p:nth-child(1) span') %>%
      html_text() %>% parse_number() %>% paste0('%')
    #Create a table for the data
    info <- data.frame(Food_Form = food_form, Manufacturer = manufacturer, Brand = brand, Product_Name = name, Price = price, Size = size,
                       Life_Stage = life_stage, Breed_Size = breed_size, Special_Diet = special_diet,
                       Protein = protein, Fat = fat, Review_Num = review_num,
                       Recommend_percent = recommend,
                       Rating = rating, html = path[i])
    #Bind the two datasets
    stats <- rbind(stats, info)
    print(paste0('Finished with: link', i))
  }, error = function(e){cat(conditionMessage(e))})
}
Some, though not all, of those statements are at risk of throwing an error and taking the entire row of the data.frame down with them.
I'd take away the tryCatch block and instead identify the statements at risk.
library(bettertrace) ## good for seeing what statements actually trigger errors
## a little helper
to.NA <- function(x) {
  if (inherits(x, "try-error") || is.null(x)) {
    return(NA)
  } else {
    return(x)
  }
}
## and later in your block:
## [...]
fat <- try( nutro[2] ) %>% to.NA
food_form <- try( extract_data(value,'Food Form') ) %>% to.NA
## etc.
}
In essence I'm just running the try/catch on each statement instead of the entire block, and making it more livable with a helper function.
You could throw try() into the function as well, but it's also OK to have it there on each line to better show what's actually happening.
You can then keep the code you have to add these to the data.frame, but you will now have NAs instead, which should not mess up your structure.
(Also notice how to.NA() does not mess with your data when data extraction is successful.)
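A quick illustration of how the to.NA() helper behaves, using made-up values rather than the scraped ones:
bad <- try(log("a"), silent = TRUE)  # produces a "try-error" object
to.NA(bad)     # NA
to.NA(NULL)    # NA
to.NA("25 %")  # "25 %" -- successful extractions pass through unchanged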

How do I scrape data with R when the same class is used repeatedly?

I'm trying to scrape contact info from a website that lists relevant organizations I wish to contact. However, the info I need is repeatedly placed under the same class along with lots of irrelevant info.
My initial idea was to grab each piece of info separately (name, phone, email, website, etc.) and create a table (because that's how the tutorials do it with their perfect example sites).
Unfortunately, everything other than the company name uses the same class (.summaryRecordType). Grabbing everything under that class isn't too bad as the only parts I don't want are "area served". However, I'm not sure how to separate each piece of info and anchor it to the proper company name in a table.
My guess now is that I need to use the wider class (.summaryTitlePrivatePractice) to get company names and contact info while keeping them linked to make a table. But, that makes everything into one solid paragraph of text.
I'd like to get this all into a table that has separate columns for the company names, addresses, phone, email, and website. I don't need any other info; but, if it's easier to leave it in under its own column, that's fine, I just won't use it.
I'm brand new to this and not sure where to go from here. If this would be easier in Python, feel free to give a solution in that language. I'm only using R because I am mildly familiar with it for data visualization. Code I've tried below:
#Loading relevant packages
library(xml2)
library(rvest)
#library(stringr)
#library(dbplyr)
#Website
ementalhealth <- 'https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229'
#Getting the Company Names
CompName <- ementalhealth
CleanCompName <- CompName %>%
  read_html() %>%
  html_nodes("div.emhTip") %>%
  html_text()
#Cleaning the Company Names
CleanCompName <- gsub("\n","", CleanCompName)
CleanCompName <- gsub("\t","", CleanCompName)
head(CleanCompName)
#Getting the Contact Data
CompSum <- ementalhealth
CleanCompSum <- CompSum %>%
  read_html() %>%
  html_nodes(".summaryRecordType") %>%
  html_text()
#Cleaning the Contact Data
CleanCompSum <- gsub("\n","", CleanCompSum)
CleanCompSum <- gsub("\t","", CleanCompSum)
head(CleanCompSum, 50)
#Issue where some companies don't have all the info, or have extra info (Area Served)... and I'm not sure how to link them in a table
#What if I grab everything altogether?
CompCombined <- ementalhealth
CleanCompCombined <- CompCombined %>%
  read_html() %>%
  html_nodes(".summaryTitlePrivatePractice") %>%
  html_text()
#Cleaning the Contact Data
CleanCompCombined <- gsub("\n","", CleanCompCombined)
CleanCompCombined <- gsub("\t","", CleanCompCombined)
head(CleanCompCombined, 50)
#Now everything is one big paragraph
How difficult this is depends on what info you want from the page. I am working on the assumption that you want a dataframe/tibble giving details from PointOfCare, e.g. 1. Hospitals, through ServiceType, e.g. Publicly Funded / Free Services, all the way down to the actual listing details of each service.
There are two immediate problems to overcome if going for all of the above info:
The DOM is pretty flat, i.e. the PointOfCare info is at the same level of the DOM as ServiceType, and the start of the service listings is only one level deeper. This means there is no nice logical way to use an HTML parser, select parent nodes and then process children, and still get the desired PointOfCare and ServiceType mapped to each service listing.
There are differing numbers of child nodes holding a given service's info (those with className summaryRecordType) within each listing, ranging between 3 and 5.
① To deal with the first problem I decided to convert the retrieved HTML to a string and split that string into chunks to process. I retrieve the PointOfCare labels and use those to generate the initial blocks, settings_blocks:
all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)
At this point I have e.g. 1. Hospitals in the first block, 2. Inpatient services in the second block and so on.
I then further split each of those chunks by the ServiceTypes:
service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses")
Annoyingly, I had to hardcode \r\n in the latter string, rather than retrieve it from the relevant node html itself, as it was not present otherwise (and therefore no match was found for the split).
So, 1. Hospitals, when processed, would have only a sub-chunk for Publicly Funded / Free Services, whereas 2. Inpatient services would end up split in two: Publicly Funded / Free Services and Private Practice Professionals and Commercial Businesses. This all happens in a loop over settings_blocks.
for (i in seq_along(settings_blocks)) {
  r <- r + 1
  point_of_care <- points_of_delivery[[i]]
  splits <- split_points(settings_blocks[[i]])
  nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))
There are a couple of sections with no listings e.g. 3.3 Drop-in centres; in those cases I generate a record as follows:
record <- list(
  point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
  NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
)
A full record has the following info fields:
list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages)
Once at the lowest level block, nodes_html[[j]], provided there are tables (each table is a listing) I retrieve the info for all the fields of interest:
records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))
② Now, we still have the issue of differing amounts of info in each listing table. However, it turns out one can map what info is present according to how many child nodes with className summaryRecordType are present. The mapping is as follows:
| Child nodes count | BusinessName | ServiceDescription | Address | Tel | Website | AreaServed |
|-------------------|--------------|--------------------|---------|-----|---------|------------|
| 3 | | | 1 | 2 | | 3 |
| 4 | 1 | | 2 | 3 | | 4 |
| 5 | | 1 | 2 | 3 | 4 | 5 |
From column 2 onwards, the number indicates which node holds the info indicated by the column header. As I loop over each bottom-level chunk I have a helper function that applies this mapping when retrieving the listing info:
record_from_table <- function(table, point_of_care, service_type) {
  info_lines <- table %>% html_nodes(".summaryRecordType")
  Title <- table %>%
    html_node("a") %>%
    html_text() %>%
    trimws()
  Url <- table %>%
    html_node("a") %>%
    html_attr("href") %>%
    url_absolute(link)
  if (length(info_lines) == 3) {
    BusinessName <- NA_character_
    ServiceDescription <- NA_character_
    Address <- info_lines[1] %>%
      html_text() %>%
      trimws() # etc.........
I pass in PointOfCare and ServiceType so they are mapped to the record level. By the end of
for (i in seq_along(settings_blocks)) {...}
I have a list of records/listings. I then do some tidying of the records. I return tibbles, so I can later use map_dfr to generate my final dataframe structure:
records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)
With the final dataframe structure in place and populated I set about tidying up some other things I noticed:
① During my final_blocks function the encoding of UTF-8 input strings was getting garbled.
For example, the following correctly UTF-8 encoded string (on Windows OS):
Autisme-Asperger-Québec (AAQc)
Ended up as:
Autisme-Asperger-QuÃ©bec (AAQc)
A colleague pointed out that it was actually tidy_html() at fault, and that this was particular to Windows OS (it ran fine on Linux) due to the default encoding on Windows. The mangling is called Mojibake. He pointed me to the following links for further reading:
https://en.wikipedia.org/wiki/Mojibake
https://www.weblogism.com/item/270/why-does-e-become-a
To quote only a small part of the latter link:
The reason lies in the UTF-8 representation. Characters below or equal to 127 (0x7F) are represented with 1 byte only, and this is equivalent to the ASCII value. Characters below or equal to 2047 are written on two bytes of the form 110yyyyy 10xxxxxx where the scalar representation of the character is: 0000000000yyyyyxxxxxx
“é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary
representation is: 00000000 11101001. “é” is therefore between 127 and
2047 (233), so it will be coded on 2 bytes. Therefore its UTF-8
representation is 11000011 10101001.
Now let’s imagine that this “é” sits in a document that’s believed to
be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters
are coded on 8 bits, so the 2-byte character “é” will become 2
1-byte-long latin-1 characters. The first character is 11000011, i.e.
C3, which, when checking the table, corresponds to “Ã” (U+00C3); the
second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).
The colleague pointed out that I could fix this by converting it from UTF-8 to latin1 twice, because the UTF-8 characters had been encoded in UTF-8 again:
iconv(iconv(<mangled_string>, from = "UTF-8", to = "latin1"), "UTF-8", "latin1")
I had introduced tidy_html to ensure sliced text ended up being parsable.
② I chose not to try and fix the mangled strings as per the description above. Instead, as my final dataframe provided the skeleton for where all my data resided, I simply went back to the original HTML document, parsed out the info again (in UTF-8) and mapped it onto my dataframe. This had the added benefit of preserving spacing between certain words and line breaks.
titles <- page %>%
  html_nodes(".emhTip a:nth-of-type(1)") %>%
  html_text()
descriptions <- page %>%
  html_nodes(".emhTip + .summaryRecordType") %>%
  html_text() %>%
  trimws()
mixed_nodes <- page %>%
  html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
  html_text() %>%
  trimws()
r <- r1 <- 0
# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
  if (!is.na(listings$Title[i])) {
    r <- r + 1
    listings$Title[i] <- titles[r]
    if (!is.na(listings$BusinessName[i])) {
      listings$BusinessName[i] <- mixed_nodes[r]
    }
  }
  if (!is.na(listings$ServiceDescription[i])) {
    r1 <- r1 + 1
    listings$ServiceDescription[i] <- descriptions[r1]
  }
}
Last, but not least, I noticed that some service descriptions had a "...more" in the listing, where an additional XHR request would be required to gather the full description. In case you wanted to obtain the full descriptions, I decided to provide a helper function that retrieves these:
expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()
listings$ServiceDescription <- expanded_descriptions
Now, that did slow the run-time as I needed to add some delays in to ensure connections were opened and closed properly.
The full code is below, including a couple of attributions where I borrowed a few lines from other SO contributors.
R:
library(stringr)
library(rvest)
library(htmltidy)
library(tidyverse)
point_of_delivery <- function(node) {
pod <- node %>%
html_node(".classyHeading") %>%
html_text() %>%
str_split("\n") %>%
unlist() %>%
tail(1) %>%
trimws() %>%
str_replace("\xa0", " ")
return(pod)
}
delivery_matches <- function(node) {
dm <- node %>%
html_node(".classyHeading") %>%
html_text() %>%
str_split("\n") %>%
unlist() %>%
tail(1)
return(dm)
}
get_blocks <- function(a_list) {
results <- vector("list", length(a_list))
for (i in seq_along(a_list)) {
start_pos <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i])))[, 1]
if (i == length(a_list)) {
block <- substring(all_text, start_pos, nchar(all_text)) %>% tidy_html()
} else {
next_start <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i + 1])))[, 1]
block <- substring(all_text, start_pos, next_start) %>% tidy_html()
}
results[[i]] <- block
}
return(results)
}
split_points <- function(node) {
res <- map(service_types, ~ str_locate_all(node %>% toString(), .)) %>% unlist()
if (length(res) == 0) {
return(c(NA_integer_))
} else {
return(res[seq(1, length(res), 2)]) # https://stackoverflow.com/a/34100009/6241235 #stas g
}
}
final_blocks <- function(splits, block) {
results <- vector("list", length(splits))
if (length(splits) == 1) {
res <- ifelse(is.na(splits), splits, block %>% tidy_html())
} else {
for (i in seq_along(splits)) {
start_pos <- splits[i]
if (i == length(splits)) {
res <- substring(block, start_pos, nchar(block)) %>% tidy_html()
} else {
next_start <- splits[i + 1]
res <- substring(block, start_pos, next_start) %>% tidy_html()
}
results[i] <- res
}
return(results)
}
}
record_from_table <- function(table, point_of_care, service_type) {
info_lines <- table %>% html_nodes(".summaryRecordType")
Title <- table %>%
html_node("a") %>%
html_text() %>%
trimws()
Url <- table %>%
html_node("a") %>%
html_attr("href") %>%
url_absolute(link)
if (length(info_lines) == 3) {
BusinessName <- NA_character_
ServiceDescription <- NA_character_
Address <- info_lines[1] %>%
html_text() %>%
trimws()
Tel <- info_lines[2] %>%
html_text() %>%
trimws()
Website <- NA_character_
AreaServed <- info_lines[3] %>%
html_text() %>%
trimws()
} else if (length(info_lines) == 4) {
BusinessName <- info_lines[1] %>%
html_text() %>%
trimws()
ServiceDescription <- NA_character_
Address <- info_lines[2] %>%
html_text() %>%
trimws()
Tel <- info_lines[3] %>%
html_text() %>%
trimws()
Website <- NA_character_
AreaServed <- info_lines[4] %>%
html_text() %>%
trimws()
} else {
BusinessName <- NA_character_
ServiceDescription <- info_lines[1] %>%
html_text() %>%
trimws()
Address <- info_lines[2] %>%
html_text() %>%
trimws()
Tel <- info_lines[3] %>%
html_text() %>%
trimws()
Website <- info_lines[4] %>%
html_text() %>%
trimws()
AreaServed <- info_lines[5] %>%
html_text() %>%
trimws()
}
Ages <- get_age(table)
return(list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages))
}
get_age <- function(table) {
tryCatch(table %>% html_node(".summaryTitlePrivatePractice + td") %>%
html_text() %>% str_replace("Add to Info Cart", "") %>% trimws(), error = function(e) {
return(NA_character_)
})
}
clean_record <- function(a_record) {
a_record[[7]] <- str_replace(a_record[[7]], " Map", "")
a_record[[10]] <- str_replace(a_record[[10]], "Area[s]? Served: ", "")
a_record <- set_names(a_record, c("PointOfCare", "ServiceType", "Title", "Url", "BusinessName", "ServiceDescription", "Address", "Tel", "Website", "AreaServed", "Ages"))
return(a_record %>% as_tibble())
}
full_description <- function(current_description, current_url) {
if (grepl(" \\.\\.\\.", current_description)) {
content <- read_html(current_url, encoding = "UTF-8") %>%
html_node(".recordSummary") %>%
html_text() %>%
trimws()
CatchupPause(.1)
} else {
content <- gsub("\\s+more", "", current_description) %>% trimws()
}
return(content)
}
CatchupPause <- function(Secs) { # https://stackoverflow.com/a/52758758 #nm200
Sys.sleep(Secs) # pause to let connection work
closeAllConnections()
gc()
}
link <- "https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229&recordType=1&sortBy=0"
page <- read_html(link, encoding = "UTF-8")
all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)
service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses") # annoying have to hardcode as \r\n not present in node output
records <- vector("list", 1000) # > max expected num entries when lists unnested
r <- 0
# Generate all records for the final tibble
for (i in seq_along(settings_blocks)) {
r <- r + 1
point_of_care <- points_of_delivery[[i]]
splits <- split_points(settings_blocks[[i]])
nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))
if (is.na(nodes_html)[1]) {
record <- list(
point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
)
records[[r]] <- list(record)
} else {
for (j in seq_along(nodes_html)) {
service_type <- if_else(str_detect(nodes_html[[j]], service_types[1]), service_types[1], service_types[2])
tables <- nodes_html[[j]] %>%
read_html() %>%
html_nodes(".condensedViewTable")
records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))
r <- r + 1
}
}
}
records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)
#
## Partly due to default Windows encoding, and lack of UTF-8 support in R, causing Mojibake via earlier tidy_html(), we grab the properly encoded info
## to overwrite the mangled text |text lacking spaces
titles <- page %>%
html_nodes(".emhTip a:nth-of-type(1)") %>%
html_text()
descriptions <- page %>%
html_nodes(".emhTip + .summaryRecordType") %>%
html_text() %>%
trimws()
mixed_nodes <- page %>%
html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
html_text() %>%
trimws()
r <- r1 <- 0
# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
if (!is.na(listings$Title[i])) {
r <- r + 1
listings$Title[i] <- titles[r]
if (!is.na(listings$BusinessName[i])) {
listings$BusinessName[i] <- mixed_nodes[r]
}
}
if (!is.na(listings$ServiceDescription[i])) {
r1 <- r1 + 1
listings$ServiceDescription[i] <- descriptions[r1]
}
}
# descriptions_to_expand <- dplyr::filter(listings, grepl(" \\.\\.\\.", ServiceDescription))
expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()
listings$ServiceDescription <- expanded_descriptions
write.csv(listings, "~/data.csv", na = "")
Some example rows of output are shown in the screenshot accompanying the original answer.
Although QHarr's answer is flawless, I thought I would follow up with what I actually did (before getting their answer) for anyone viewing this question in the future who, like me, wouldn't have the knowledge to modify QHarr's response to fit their own situation.
Because I was having issues grabbing the relevant data separately from the site, I just grabbed everything together from .summaryTitlePrivatePractice and used the \n\r\t's that came with it as markers to split the data. Everything in the included code (below) should be self-explanatory, but I'll explain the cleaning section. There were random strings of \t\r\n in between each piece I needed, and I could see that \n never repeated more than 4 times by checking the output at that stage. So, I converted all of that to a single \n for each instance and used that as my marker for splitting the data.
This solution wasn't ideal as not every entry (row) had the same amount of data, so I had to manually go through and move data around to align the columns. This was OK in this case because most rows had the same number of columns and there were only 150 rows, but it wouldn't work for larger data sets. I also used some functions in Excel to make bulk edits where applicable, which sped up the process. R code below:
library(xml2)
library(rvest)
library(tidyverse)
#Website
ementalhealth <- 'https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229'
#What if I grab everything altogether?
CompCombined <- ementalhealth
CleanCompCombined <- CompCombined %>%
  read_html() %>%
  html_nodes(".summaryTitlePrivatePractice") %>%
  html_text()
#Cleaning the Everything Data
CleanCompCombined <- gsub("\t","", CleanCompCombined)
CleanCompCombined <- gsub("\r","", CleanCompCombined)
CleanCompCombined <- gsub("\n\n\n\n","\n", CleanCompCombined)
CleanCompCombined <- gsub("\n\n\n","\n", CleanCompCombined)
CleanCompCombined <- gsub("\n\n","\n", CleanCompCombined)
#Now everything is one big paragraph
#removing duplicates
dedupedsplit <- unique(CleanCompCombined)
#splits
col1 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 1)
col2 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 2)
col3 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 3)
col4 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 4)
col5 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 5)
col6 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 6)
col7 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 7)
col8 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 8)
col9 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 9)
col10 <- sapply(strsplit(as.character(dedupedsplit),'\n'), "[", 10)
#tablecraft
blahtable <- data.frame(col1 = col1,
                        col2 = col2,
                        col3 = col3,
                        col4 = col4,
                        col5 = col5,
                        col6 = col6,
                        col7 = col7,
                        col8 = col8,
                        col9 = col9,
                        col10 = col10)
#output
blahtable
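As a hedged, more compact alternative to the repeated sapply()/strsplit() calls above (assuming dedupedsplit is built as in the code), tidyr::separate() can produce the same ten columns in one step:
library(tidyverse)
blahtable <- tibble(raw = dedupedsplit) %>%
  separate(raw, into = paste0("col", 1:10), sep = "\n",
           fill = "right", extra = "merge")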

How to pass multiple values in an rvest submission form

This is a follow-up to a prior thread. The code works fantastically for a single value, but when I try to pass more than one value I get an error based on the length of the function.
Error in vapply(elements, encode, character(1)) :
values must be length 1,
but FUN(X[[1]]) result is length 3
Here is a sample of the code. In most instances I have been able just to name an object and scrape that way.
library(httr)
library(rvest)
library(dplyr)
b <- c('48127','48180','49504')
POST(
  url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
  body = list(zipcode = b),
  encode = "form"
) -> res
I was wondering if a loop to insert the values into the form would be the right way to go? However, my loop-writing skills are still in development and I am unsure of where to place it; in addition, when I call the loop it doesn't print line by line, it just returns null results.
#d isn't listed in the above code as it returns null
d <- for(i in 1:3){nrow(b)}
Here is an approach to send multiple POST requests
library(httr)
library(rvest)
b <- c('48127','48180','49504')
For each element in b perform a function that will send the appropriate POST request
res <- lapply(b, function(x){
  res <- POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = x),
    encode = "form"
  )
  res <- read_html(content(res, as="raw"))
})
Now for each element of the list res you should do the parsing steps explained by hrbrmstr: How can I Scrape a CGI-Bin with rvest and R?
library(tidyverse)
I will use hrbrmstr's code since he is king and it is already clear to you. The only thing we are doing here is performing it on each element of the res list.
res_list = lapply(res, function(x){
  rows <- html_nodes(x, "table[width='300'] > tr > td")
  ret <- data_frame(
    record = !is.na(html_attr(rows, "bgcolor")),
    text = html_text(rows, trim=TRUE)
  ) %>%
    mutate(record = cumsum(record)) %>%
    filter(text != "") %>%
    group_by(record) %>%
    summarise(x = paste0(text, collapse="|")) %>%
    separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep="\\|", extra="merge")
  return(ret)
})
or using map from purrr
res %>%
  map(function(x){
    rows <- html_nodes(x, "table[width='300'] > tr > td")
    data_frame(
      record = !is.na(html_attr(rows, "bgcolor")),
      text = html_text(rows, trim=TRUE)
    ) %>%
      mutate(record = cumsum(record)) %>%
      filter(text != "") %>%
      group_by(record) %>%
      summarise(x = paste0(text, collapse="|")) %>%
      separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"),
               sep="\\|", extra="merge") -> ret
    return(ret)
  })
If you would like this in a data frame:
res_df <- data.frame(do.call(rbind, res_list), #rbinds list elements
                     b = rep(b, times = unlist(lapply(res_list, nrow)))) #repeats each element of b according to the number of rows it returned
You can put the values inside the post as below,
b<-c('48127','48180','49504')
for(i in 1:length(b)) {
  POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = b[i]),
    encode = "form"
  ) -> res
  # YOUR CODES HERE (for getting content of the page etc.)
}
But since the "res" value will be different for every zipcode value, you need to put the rest of the code inside the area I commented. Otherwise you get only the last value.
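For example, a hedged sketch of collecting every response in a list instead of overwriting res on each pass (using the same POST call as above):
results <- vector("list", length(b))
for (i in seq_along(b)) {
  res <- POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = b[i]),
    encode = "form"
  )
  results[[i]] <- read_html(content(res, as = "raw"))  # parse and keep each page
}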

R: Trouble appending rows to a dataframe from web scraping in R [duplicate]

This question already has answers here:
Error in if/while (condition) {: missing Value where TRUE/FALSE needed
(4 answers)
Closed 5 years ago.
I have a dataframe with 7 rows and 1 column, which contains links to a website. I'm trying to extract data from those various links and store it in a data frame, but I am not able to append it. I'm also checking that if a link has no records (which I check through an html attribute of that link) the code skips that link and proceeds to the next one. I'm also trying to fetch data for multiple pages of a link.
This is reproducible data
text1="http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom="
text3="&proptype="
text4="Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment"
text5="&cityName=Thane&BudgetMin="
text6="&BudgetMax="
bhk=c("1","2","3","4","5",">5")
budg_min=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
budg_max=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
eg <- expand.grid(bhk = bhk, budg_min = budg_min, budg_max = budg_max)
eg <- eg[as.integer(eg$budg_min) <= as.integer(eg$budg_max),]
uuu <- sprintf("%s%s%s%s%s%s%s%s", text1,eg[,1],text3,text4,text5,eg[,2],text6,eg[,3])
uuu_df1=data.frame(x=uuu[1:7])
dput(uuu_df1)
I have 3 solutions for this but none seems to be working fine.
Solution #1
urlList <- llply(uuu_df1[,1], function(url){
  this_pg <- read_html(url)
  results_count <- this_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()
  if(results_count > 0){
    cards <- this_pg %>%
      xml_find_all('//div[@class="SRCard"]')
    df <- ldply(cards, .fun=function(x){
      y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                      excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                      locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                      society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
      return(y)
    })
  } else {
    df <- NULL
  }
  return(df)
}, .progress = 'text')
names(urlList) <- uuu_df1[,1]
a=bind_rows(urlList)
The above code gives me the error: Error in if (results_count > 0) { : missing value where TRUE/FALSE needed
Solution #2
urlList <- lapply(uuu_df1[,1], function(url){
  UrlPage <- html(as.character(url))
  ImgNode <- UrlPage %>% html_node("div.noResultHead")
  u <- paste("No", word(string = as(ImgNode, "character"), start=4, end=5), sep=" ")
  cat(".")
  pg <- read_html(url)
  if(u != "No Results Found!") {
    df <- data.frame(wine=html_text(html_nodes(pg, ".agentNameh")),
                     excerpt=html_text(html_nodes(pg, ".postedOn")),
                     locality=html_text(html_nodes(pg,".localityFirst")),
                     society=html_text(html_nodes(pg,'.labValu .stop-propagation:nth-child(1)')),
                     stringsAsFactors=FALSE)
  } else {
    # ASSIGN EMPTY DATAFRAME (FOR CONSISTENT STRUCTURE)
    df <- data.frame(wine=character(), excerpt=character(), locality=character(), society=character())
  }
  # RETURN NAMED LIST
  return(list(UrlPage=UrlPage, ImgNode=ImgNode, u=u, df=df))
})
# ROW BIND ONLY DATAFRAME ELEMENT FROM LIST
wines <- map_df(urlList, function(u) u$df)
The above code gives an empty dataframe.
Solution #3
uuu_df1=data.frame(x=uuu[1:7])
wines=data.frame()
url_test=c()
UrlPage_test=c()
u=c()
ImgNode=c()
pg=c()
for(i in 1:dim(uuu_df1)[1]) {
  url_test[i] = as.character(uuu_df1[i,])
  UrlPage_test[i] <- html(url_test[i])
  ImgNode[i] <- UrlPage_test[i] %>% html_node("div.noResultHead")
  u[i] = ImgNode[i]
  u[i] = as(u[i], "character")
  u[i] = paste("No", word(string = u, start = 4, end = 5), sep = " ")
  if(u[i] == "No Results Found!") next
  {
    map_df(1:5, function(i) # here 1:5 is number of webpages of a website
    {
      # simple but effective progress indicator
      cat(".")
      pg[i] <- read_html(sprintf(url_test[i], i))
      data.frame(wine=html_text(html_nodes(pg[i], ".agentNameh")),
                 excerpt=html_text(html_nodes(pg[i], ".postedOn")),
                 locality=html_text(html_nodes(pg[i],".localityFirst")),
                 society=html_text(html_nodes(pg[i],'.labValu .stop-propagation:nth-child(1)')),
                 stringsAsFactors=FALSE)
    }) -> wines
  }
}
The above code also gives an error:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "list"
In addition: Warning messages:
1: 'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
2: In UrlPage_test[i] <- html(url_test[i]) :
number of items to replace is not a multiple of replacement length
Any suggestions on how the code can be corrected so that my requirement is met? Thanks in advance.
Solution #1
That "missing value where TRUE/FALSE needed" message is printed when you do something like this:
if (NA > 0) {
  # do something
}
So replace your if condition
if(results_count > 0)
with
if(!is.na(results_count) & (results_count > 0))
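A minimal illustration of why the NA guard is needed (the values here are made up):
results_count <- NA_integer_                       # e.g. the resultCount span was not found
# if (results_count > 0) { ... }                   # Error: missing value where TRUE/FALSE needed
if (!is.na(results_count) & (results_count > 0)) {
  message("results found")
} else {
  message("no results")                            # this branch runs safely
}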
