I am trying to webscrape the following website:
http://www.healthgrades.com/hospital-directory/california-ca-san-mateo/affiliated-physicians-HGSTED418D46050070
I am using R to scrape the site. In particular, I am trying to copy all of the doctors' names and specialties from this website. The main issue is that the URL does not change when I press the arrow/next button, so I cannot use any basic pagination techniques to scrape this page. How can I solve this problem? It would also be nice to have all of the collected data in one data matrix/spreadsheet.
dum <- "http://www.healthgrades.com/hospital-directory/california-ca-san-mateo/affiliated-physicians-HGSTED418D46050070"
library(XML)
ddum <- htmlParse(dum)
noofpages <- xpathSApply(ddum,'//*/span[#class="paginationItem active"]/following-sibling::*[1]',xmlValue)[1]
noofpages <- (as.numeric(gsub(' of ','',noofpages))-1)%/%5+1
doctors <- c(); dspec <- c()
for(i in 1:noofpages){
if(i>1){
ddum <- htmlParse(paste0(dum,"?pagenumber=",i,'#'))
}
doctors <- c(doctors, xpathSApply(ddum,'//*/a[#class="providerSearchResultSelectAction"]',xmlValue))
dspec <- c(dspec, xpathSApply(ddum,'//*/div[#class="listingHeaderLeftColumn"]/p',xmlValue))
}
paste(doctors,dspec,sep=',')
# [1] "Dr. Julia Adamian, MD,Internal Medicine"
# [2] "Dr. Eric R. Adler, MD,Internal Medicine"
# [3] "Dr. Ramzi S. Alami, MD,General Surgery"
# [4] "Dr. Jason L. Anderson, MD,Internal Medicine"
# [5] "Dr. Karl A. Anderson, MD,Urology"
# [6] "Dr. Christine E. Angeles, MD,Geriatric Medicine, Pulmonology"
It looks like they're using the query parameter ?pagenumber=x. You can probably iterate over x to get your data.
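A minimal sketch of that idea, reusing dum and htmlParse from the code above (the bound of six pages is an assumption):
pages <- lapply(1:6, function(i) htmlParse(paste0(dum, "?pagenumber=", i)))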
On a side note, I'm not sure which browser you are using, but Chrome has a handy feature where you can right-click on a button and select Inspect Element.
Related
I am new to R and web scraping. For practice I am trying to scrape book titles from a fake website with multiple pages ('http://books.toscrape.com/catalogue/page-1.html') and then calculate certain metrics based on the titles. There are 20 books on each page and 50 pages. I have managed to scrape and calculate metrics for the first 20 books, but I want to calculate the metrics for the full 1000 books on the website.
The current output looks like this:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
I want this to be 1000 books long instead of 20, so that I can use the same code to calculate the metrics for 1000 books instead of 20.
Code:
library(rvest)
url <- 'http://books.toscrape.com/catalogue/page-1.html'
url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles
What would be the best way to scrape every book from the website and make the list 1000 book titles long instead of 20? Thanks in advance.
Generate the 50 URLs, then iterate over them, e.g. with purrr::map:
library(rvest)
urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
titles <- purrr::map(
  urls,
  . %>%
    read_html() %>%
    html_nodes('h3 a') %>%
    html_attr('title')
)
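Note that purrr::map returns a list of 50 character vectors, one per page; to get the single 1000-element vector you asked for, flatten it:
titles <- unlist(titles)
length(titles)  # 1000, if every page yielded 20 titles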
something like this perhaps?
library(tidyverse)
library(rvest)
library(data.table)
# Vector of URLs to scrape (all 50 pages, for the full 1000 books)
url <- paste0("http://books.toscrape.com/catalogue/page-", 1:50, ".html")
# Scrape each page into a one-column data.table
L <- lapply(url, function(x) {
  print(paste0("scraping: ", x, " ... "))
  data.table(titles = read_html(x) %>%
               html_nodes('h3 a') %>%
               html_attr('title'))
})
# Bind list to single data.table
data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
This page shows six sections listing people between <h3> tags.
How can I use XPath to select these six sections separately (using rvest), perhaps into a nested list? My goal is to later lapply through these six sections to fetch the people's names and affiliations (separated by section).
The HTML isn't well structured, i.e. not all of the text is located within specific tags. An example:
<h3>Editor-in-Chief</h3>
Claudio Ronco – <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark – <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi – <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
I access the site with the following code:
journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- rvest::html_session(journal_url,
httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))
webpage <- rvest::html_nodes(webpage, css = '#editorialboard')
I tried various XPaths to extract the six sections with html_nodes into a nested list of six lists, but none of them work properly:
# this gives me a list of 190 (instead of 6) elements, leaving out the text between <i> and </i>
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 190 (instead of 6) elements, leaving out text that are not between tags
webpage <- rvest::html_nodes(webpage, xpath = '//*[preceding-sibling::h3 and following-sibling::h3]')
# error "VECTOR_ELT() can only be applied to a 'list', not a 'logical'"
webpage <- rvest::html_nodes(webpage, xpath = '//* and text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 274 (instead of 6) elements
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3]')
Are you OK with an ugly solution that does not use XPath? I don't think you can get a nested list from the structure of this website, but I am not very experienced with XPath.
I first got the headings, divided the raw text using the heading names and then, within each group, divided the members using '\n' as a separator.
library(rvest)
journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- read_html(journal_url) %>% html_node(css = '#editorialboard')
# get h3 headings
headings <- webpage %>% html_nodes('h3') %>% html_text()
# get raw text
raw.text <- webpage %>% html_text()
# split raw text on h3 headings and put in a list
list.members <- list()
raw.text.2 <- raw.text
for (h in headings) {
  # split on the heading
  b <- strsplit(raw.text.2, h, fixed = TRUE)
  # the text before the heading holds the previous group's members; split on \n
  c <- strsplit(b[[1]][1], '\n', fixed = TRUE)
  # clean empty elements from the vector
  c <- list(c[[1]][c[[1]] != ""])
  # add the vector of members to the list
  list.members <- c(list.members, c)
  # keep the text after the heading for the next iteration
  raw.text.2 <- b[[1]][2]
}
# remove the first element of the main list (text before the first heading)
list.members <- list.members[2:length(list.members)]
# add the final segment of raw.text to the list
c <- strsplit(raw.text.2, '\n', fixed = TRUE)
c <- list(c[[1]][c[[1]] != ""])
list.members <- c(list.members, c)
# add names to list
names(list.members) <- headings
You then get a list of the groups, and each element of the list is a vector with one string per member (keeping all the info):
> list.members$`Editor-in-Chief`
[1] "Claudio Ronco – St. Bartolo Hospital, Vicenza, Italy"
> list.members$`Clinical Engineering`
[1] "William R. Clark – Purdue University, West Lafayette, IN, USA"
[2] "Hideyuki Kawanashi – Tsuchiya General Hospital, Hiroshima, Japan"
[3] "Tadayuki Kawasaki – Mobara Clinic, Mobara City, Japan"
[4] "Jeongchul Kim – Wake Forest School of Medicine, Winston-Salem, NC, USA"
[5] "Anna Lorenzin – International Renal Research Institute of Vicenza, Vicenza, Italy"
[6] "Ikuto Masakane – Honcho Yabuki Clinic, Yamagata City, Japan"
[7] "Michio Mineshima – Tokyo Women's Medical University, Tokyo, Japan"
[8] "Tomotaka Naramura – Kurashiki University of Science and the Arts, Kurashiki, Japan"
[9] "Mauro Neri – International Renal Research Institute of Vicenza, Vicenza, Italy"
[10] "Masanori Shibata – Koujukai Rehabilitation Hospital, Nagoya City, Japan"
[11] "Yoshihiro Tange – Kyushu University of Health and Welfare, Nobeoka-City, Japan"
[12] "Yoshiaki Takemoto – Osaka City University, Osaka City, Japan"
I used the rvest package in R to scrape some web data, but I am having a lot of trouble getting it into a usable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me perform the correct operations to get it looking like this (as a data frame), and provide the code? I would really appreciate it:
Opponent Venue
Philadelphia TD Garden
Toronto Air Canada Centre
I do not need any of the other information.
Let me know if there are any issues :)
# put your data in here ("" are placeholders for the stat lines you don't need)
input <- c("v. Philadelphia", "TD GardenRegular Season",
           "", "", "",
           "# Toronto", "Air Canada Centre Regular Season",
           "", "", "")
index <- 1:length(input)
# raw table format: in each group of 5, the 1st element is the opponent, the 2nd the venue
out_raw <- data.frame(Opponent = input[index %% 5 == 1],
                      Venue = input[index %% 5 == 2])
# using the stringi package
install.packages("stringi")
library(stringi)
# copy and clean up
out_clean <- out_raw
# keep only the text after the last space, dropping the "v. " / "# " prefix
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
# drop the "Regular Season" suffix and trim leftover whitespace
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean
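For the two games shown, that should print something like:
#       Opponent             Venue
# 1 Philadelphia         TD Garden
# 2      Toronto Air Canada Centre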
Recently I have started to learn R and am trying to explore more by automating processes. Below is sample data; I'm trying to create a new column by finding and replacing particular text within the label column (colname: Designations).
Since I'm doing this work with loads of new data, I would like to automate it using R programming rather than Excel formulas.
Dataset:
strings<-c("Zonal Manager","Department Manager","Network Manager","Head of Sales","Account Manager","Alliance Manager","Additional Manager","Senior Vice President","General manager","Senior Analyst", "Solution Architect","AGM")
R code I used:
t<-data.frame(strings,stringsAsFactors = FALSE)
colnames(t)[1]<-"Designations"
y<-sub(".*Manager*","Manager",strings,ignore.case = TRUE)
Challenge:
With this, every designation containing "Manager" got changed to Manager, but I also need to replace the other designations with their main themes.
I tried ifelse statements, grep, grepl, str functions, sub, etc., but I didn't get what I'm looking for.
I can't split on the first/second/last word (as a delimiter), since the main themes appear at varying positions, e.g. Chief Information Officer, Commercial Finance Manager, or AGM.
Excel Work:
I have already coded 300 main themes as...
Manager (for all GM, Asst. Manager, Sales Manager, etc.)
Architect (Solution Arch, Sr. Arch, etc.)
Director (Senior Director, Director, Asst. Director, etc.)
Senior Analyst
Analyst
Head (for Head of Sales)
What I'm looking for:
I need to create a new column and replace the text with the relevant main themes, as I did in Excel, using R.
I'm OK with taking the main themes I have already coded in Excel and matching them in R (like VLOOKUP in Excel).
Expected result:
A new column holding the main theme for each designation (e.g. Zonal Manager becomes Manager, Solution Architect becomes Architect).
Thanks in advance for your help!
Yes, exactly the same thing I'm expecting. Thanks!! But when I tried the same methodology after uploading the new dataset (Excel file) and with
df %>%
mutate(theme=gsub(".*(Manager|Lead|Director|Head|Administrator|Executive|Executive|VP|President|Consultant|CFO|CTO|CEO|CMO|CDO|CIO|COO|Cheif Executive Officer|Chief Technological Officer|Chief Digital Officer|Chief Financial Officer|Chief Marketing Officer|Chief Digital Officer|Chief Information Officer,Chief Operations Officer)).*","\\1",Designations,ignore.case = TRUE))
it didn't work. Should I correct something else?
data:
strings<-c("Zonal Manager","Department Manager","Network Manager","Head of Sales","Account Manager",
"Alliance Manager","Additional Manager","Senior Vice President","General manager","Senior Analyst", "Solution Architect","AGM")
You need to prepare a good lookup table (complete it so that it covers all your themes):
lu_table <- data.frame(new = c("Manager", "Architect", "Director"),
                       old = c("Manager|GM", "Architect|Arch", "Director"),
                       stringsAsFactors = FALSE)
Then you can let mapply do the job:
mapply(function(new, old) {
  ans <- strings
  ans[grepl(old, ans)] <- new  # replace everything matching the old pattern with the new theme
  strings <<- ans              # write the result back to `strings` in the global environment
  return(NULL)
}, new = lu_table$new, old = lu_table$old)
now look at strings:
> strings
[1] "Manager" "Manager" "Manager" "Head of Sales" "Manager" "Manager"
[7] "Manager" "Senior Vice President" "General manager" "Senior Analyst" "Architect" "Manager"
Please note: this solution uses <<- to modify strings in the enclosing environment, so it might not be the best possible solution, but it works in this case.
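If you'd rather avoid <<-, a plain loop over the lookup table gives the same result; a minimal sketch, run on the original strings:
themes <- strings
for (i in seq_len(nrow(lu_table))) {
  themes[grepl(lu_table$old[i], themes)] <- lu_table$new[i]
}
themes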
Do you mean something like this?
library(dplyr)
strings <-
c(
"Zonal Manager",
"Department Manager",
"Network Manager",
"Head of Sales",
"Account Manager",
"Alliance Manager",
"Additional Manager",
"Senior Vice President",
"General manager",
"Senior Analyst",
"Solution Architect",
"AGM"
)
df = data.frame(Designations = strings)
df %>%
mutate(
theme = gsub(
".*(manager|head|analyst|architect|agm|director|president).*",
"\\1",
Designations,
ignore.case = TRUE
)
)
#> Designations theme
#> 1 Zonal Manager Manager
#> 2 Department Manager Manager
#> 3 Network Manager Manager
#> 4 Head of Sales Head
#> 5 Account Manager Manager
#> 6 Alliance Manager Manager
#> 7 Additional Manager Manager
#> 8 Senior Vice President President
#> 9 General manager manager
#> 10 Senior Analyst Analyst
#> 11 Solution Architect Architect
#> 12 AGM AGM
Created on 2018-10-04 by the reprex package (v0.2.1)
I have a regular expression that matches my data using grepl, but I can't figure out how to extract the sub-expressions inside it into new columns.
This returns the whole test string in foo, without any of the sub-expressions:
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)
In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.
rows$isMatch <- grepl(entryPattern, rows$text)
What I'm hoping to do is add the sub-expressions as new columns in the rows dataframe (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.
It seems that regmatches won't do what I want. Instead, I need the stringr package, as suggested by @Kent Johnson.
library(stringr)
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]
Which outputs:
[1] "101"
[2] "POULET Laure"
[3] "FRA"
[4] "1992"
[5] "25-29"
[6] "E. M. S. Bron Natation"
[7] "26.00"