Structure character data into data frame

I used the rvest package in R to scrape some web data, but I am having a lot of trouble getting it into a usable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me perform the correct operations to get it into a data frame that looks like this, and provide the code? I would really appreciate it:
Opponent        Venue
Philadelphia    TD Garden
Toronto         Air Canada Centre
I do not need any of the other information.

Let me know if there are any issues :)
# put your data in here
input <- c("v. Philadelphia", "TD GardenRegular Season",
           "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving",
           "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons",
           "100.7 - 83.4",
           "@ Toronto", "Air Canada Centre Regular Season",
           "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford",
           "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet",
           "115.6 - 103.3")
index <- 1:length(input)
# raw table format: every 5th element starting at 1 is the opponent,
# every 5th element starting at 2 is the venue
out_raw <- data.frame(Opponent = input[index %% 5 == 1],
                      Venue = input[index %% 5 == 2],
                      stringsAsFactors = FALSE)
# using the stringi package
install.packages("stringi")
library(stringi)
# copy and clean up: keep the text after the last space, drop "Regular Season"
out_clean <- out_raw
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean
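If the cleanup worked, that last line should print something like:
      Opponent             Venue
1 Philadelphia         TD Garden
2      Toronto Air Canada Centre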

Related

Is there a way to scrape through multiple pages on a website in R

I am new to R and web scraping. For practice I am trying to scrape book titles from a fake website that has multiple pages ('http://books.toscrape.com/catalogue/page-1.html'), and then calculate certain metrics based on the book titles. There are 20 books on each page and 50 pages. I have managed to scrape and calculate metrics for the first 20 books; however, I want to calculate the metrics for the full 1000 books on the website.
The current output looks like this:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
I want this to be 1000 books long instead of 20; this will allow me to use the same code to calculate the metrics, but for 1000 books instead of 20.
Code:
library(rvest)
url <- 'http://books.toscrape.com/catalogue/page-1.html'
url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles
What would be the best way to scrape every book from the website and make the list 1000 book titles long instead of 20? Thanks in advance.
Generate the 50 URLs, then iterate on them, e.g. with purrr::map
library(rvest)
urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
titles <- purrr::map(
  urls,
  . %>%
    read_html() %>%
    html_nodes('h3 a') %>%
    html_attr('title')
)
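Note that map() returns a list with one character vector of 20 titles per page. If you want a single vector of 1000 titles, so your existing metric code works unchanged, flatten it:
titles <- unlist(titles)
length(titles)  # should be 1000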
something like this perhaps?
library(tidyverse)
library(rvest)
library(data.table)
# Vector with URLs to scrape (all 50 pages)
url <- paste0("http://books.toscrape.com/catalogue/page-", 1:50, ".html")
# Scrape to a list of single-column data.tables, one per page
L <- lapply(url, function(x) {
  print(paste0("scraping: ", x, " ... "))
  data.table(titles = read_html(x) %>%
               html_nodes('h3 a') %>%
               html_attr('title'))
})
# Bind list to single data.table
data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
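The last line prints the bound table without keeping it. If you want to hold on to the result, and get back a plain character vector like titles in your original code, something along these lines should work:
books <- data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
titles <- books$titles
length(titles)  # 20 titles per page over 50 pages, so 1000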

Use textConnection and scan to switch pasted character data to a vector

I want to use textConnection and scan in R to convert a pasted character dataset into a character vector to use as row.names.
My little example is as follows:
x = textConnection('
Arcadia
Bryce Canyon
Cuyahoga Valley
Everglades
Grand Canyon
Grand Teton
Great Smoky
Hot Springs
Olympic
Mount Rainier
Rocky Mountain
Shenandoah
Yellowstone
Yosemite
Zion
')
scan(x,character(0))
Each line of the dataset represents a place, so I expect a character vector of length 15.
However, scan(x,character(0)) gives
Read 23 items
[1] "Arcadia" "Bryce" "Canyon" "Cuyahoga" "Valley"
[6] "Everglades" "Grand" "Canyon" "Grand" "Teton"
[11] "Great" "Smoky" "Hot" "Springs" "Olympic"
[16] "Mount" "Rainier" "Rocky" "Mountain" "Shenandoah"
[21] "Yellowstone" "Yosemite" "Zion"
I then tried scan(x,character(0),seq='\n'), but it also didn't work! Any help?
Since the fields themselves contain spaces, we should specify the parameter sep (and not seq!) if we want scan to use something other than whitespace as the delimiter.
From ?scan:
sep: by default, scan expects to read ‘white-space’ delimited input fields. Alternatively, sep can be used to specify a character which delimits fields. A field is always delimited by an end-of-line marker unless it is quoted. If specified this should be the empty character string (the default) or NULL or a character string containing just one single-byte character.
x = textConnection('
Arcadia
Bryce Canyon
Cuyahoga Valley
Everglades
Grand Canyon
Grand Teton
Great Smoky
Hot Springs
Olympic
Mount Rainier
Rocky Mountain
Shenandoah
Yellowstone
Yosemite
Zion
')
scan(x,character(0), sep="\n")
Returns:
Read 15 items
[1] "Arcadia" "Bryce Canyon" "Cuyahoga Valley" "Everglades"
[5] "Grand Canyon" "Grand Teton" "Great Smoky" "Hot Springs"
[9] "Olympic" "Mount Rainier" "Rocky Mountain" "Shenandoah"
[13] "Yellowstone" "Yosemite" "Zion"

Extracting Sub-expressions from a Dataframe of Strings Using Regular Expressions

I have a regular expression that is able to match my data, using grepl, but I can't figure out how to extract the sub-expressions inside it to new columns.
This is returning the test string as foo, without any of the sub-expressions:
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+)\\s+(\\d*\\:?\\d+\\.\\d+)"
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
m <- regexpr(entryPattern, test)
foo <- regmatches(test, m)
In my real use case, I'm acting on lots of strings similar to test. I'm able to find the correctly formatted ones, so I think the pattern is correct.
rows$isMatch <- grepl(entryPattern, rows$text)
What I'm hoping to do is add the sub-expressions as new columns in the rows data frame (i.e. rows$rank, rows$name, rows$country, etc.). Thanks in advance for any advice.
It seems that regmatches with regexpr won't do what I want (it returns only the whole match, not the sub-expressions). Instead, I need the stringr package, as suggested by @kent-johnson.
library(stringr)
test <- "101 POULET Laure FRA 1992 25-29 E. M. S. Bron Natation 26.00"
entryPattern <- "(\\d+)\\s+([[:lower:][:blank:]-]*[A-Z][[:alpha:][:blank:]-]+[A-Z]\\s[[:alpha:][:blank:]]+?)\\s+([A-Z]{3})\\s+(\\d{4})\\s+(\\d\\d\\-\\d\\d)\\s+([[:print:][:blank:]]+?)\\s+(\\d*\\:?\\d+\\.\\d+)"
str_match(test, entryPattern)[1,2:8]
Which outputs:
[1] "101"
[2] "POULET Laure"
[3] "FRA"
[4] "1992"
[5] "25-29"
[6] "E. M. S. Bron Natation"
[7] "26.00"

pattern matching with sub(), unable to catch and replace first occurrence

The following are the results I expect:
> title = "La La Land (2016/I)"
[1] "(2016"  # result
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
[1] "(2013"  # result
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
[1] "(2015"  # result
==================================================================
The following are what I got by applying the code sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1"):
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(1500-1800) (#1.1)" #result. However, I expected it to be "(2013)"
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2016/I)" #result as I expect
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1]"(2017)" # result. However, I expect it to be "(2015)"
The following are what I got by applying the code sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1"):
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "La La Land (2016/I)" #result. However, I expect it to be "(2016)"
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2013)" #result as I expect
I checked the description of sub; it says "sub performs replacement of the first match." In this case, the first match should be "(2013)".
In short, I am trying to write a sub() command that returns the first occurrence of a year in a string.
I guess there is something wrong with my code but I couldn't find it; I would appreciate it if anyone could help me.
==================================================================
In fact, my ultimate goal is to extract the year of every movie. However, I don't know how to do it in one step. Therefore, I decided to first find the year in "(dddd" form, then use the code sub(pattern="\\((\\d{4}).*", a, replacement="\\1") to extract the bare number of the year.
for example:
> a= "(2015"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
> a= "(2015)"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
=================updated 05/29/2017 22:51PM=======================
The str_extract in akrun's answer works well with my dataset.
However, the sub() approach doesn't work for all of my data. The following is what I did; my code fails on some of the 500 records. I would really appreciate it if anyone could point out the mistakes in my code. I really cannot figure it out myself. Thank you very much.
> t1
[1] "Man Who Fell to Earth (Remix) (2010) (TV)"
> t2
[1] "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
> title = c(t1,t2)
> x=gsub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
> x
[1] "(2010) (TV)" "(2008)"
> sub(pattern="\\((.*)\\).*", x, replacement="\\1")
[1] "2010) (TV" "2008"
However, my goal is to get 2010 and 2008. My code works with t2 but fails with t1.
We can match 0 or more characters that are not a ( ([^(]*) from the start (^) of the string, followed by a ( and four digits (\\([0-9]{4}), which we capture as a group ((...)), followed by other characters (.*), and replace with the backreference (\\1) of the captured group:
sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
#[1] "(2016" "(2013" "(2015"
If we need to remove the (, then capture only the numbers that follow the \\( as a group:
sub("^[^(]*\\(([0-9]{4}).*", "\\1", title)
#[1] "2016" "2013" "2015"
Or with str_extract, we use a regex lookaround to extract the 4-digit number that follows the (:
library(stringr)
str_extract(title, "(?<=\\()[0-9]{4}")
#[1] "2016" "2013" "2015"
Or with regmatches/regexpr
regmatches(title, regexpr("(?<=\\()([0-9]{4})", title, perl = TRUE))
#[1] "2016" "2013" "2015"
data
title <- c("La La Land (2016/I)",
"_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_",
"dfajfj(2015)asdfjuwer f(2017)fa.erewr6")

Webscraping Issues Involving Clicking (Using R)

I am trying to webscrape the following website:
http://www.healthgrades.com/hospital-directory/california-ca-san-mateo/affiliated-physicians-HGSTED418D46050070
I am using R to scrape the website. In particular, I am trying to copy all of the doctors' names and specialties. However, the main issue is that the URL does not change when I press the arrow/next button, so I cannot use any basic techniques to scrape this page. How can I solve this problem? It would be nice to have all of the data I am collecting in one data matrix/spreadsheet.
dum <- "http://www.healthgrades.com/hospital-directory/california-ca-san-mateo/affiliated-physicians-HGSTED418D46050070"
library(XML)
ddum <- htmlParse(dum)
noofpages <- xpathSApply(ddum,'//*/span[#class="paginationItem active"]/following-sibling::*[1]',xmlValue)[1]
noofpages <- (as.numeric(gsub(' of ','',noofpages))-1)%/%5+1
doctors <- c(); dspec <- c()
for(i in 1:noofpages){
if(i>1){
ddum <- htmlParse(paste0(dum,"?pagenumber=",i,'#'))
}
doctors <- c(doctors, xpathSApply(ddum,'//*/a[#class="providerSearchResultSelectAction"]',xmlValue))
dspec <- c(dspec, xpathSApply(ddum,'//*/div[#class="listingHeaderLeftColumn"]/p',xmlValue))
}
paste(doctors,dspec,sep=',')
# [1] "Dr. Julia Adamian, MD,Internal Medicine"
# [2] "Dr. Eric R. Adler, MD,Internal Medicine"
# [3] "Dr. Ramzi S. Alami, MD,General Surgery"
# [4] "Dr. Jason L. Anderson, MD,Internal Medicine"
# [5] "Dr. Karl A. Anderson, MD,Urology"
# [6] "Dr. Christine E. Angeles, MD,Geriatric Medicine, Pulmonology"
It looks like they're using the variable
?pagenumber=x
You can probably iterate over x to get your data.
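For what it's worth, a minimal rvest sketch of that idea, assuming the ?pagenumber parameter and the class names from the answer above still hold (the page count is hard-coded here for illustration):
library(rvest)
dum <- "http://www.healthgrades.com/hospital-directory/california-ca-san-mateo/affiliated-physicians-HGSTED418D46050070"
pages <- paste0(dum, "?pagenumber=", 1:5)  # assumed page count
doctors <- unlist(lapply(pages, function(u) {
  read_html(u) %>%
    html_nodes("a.providerSearchResultSelectAction") %>%
    html_text()
}))
doctors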
On a side note: I'm not sure which browser you are using, but Chrome has a handy feature where you can right-click on a button and select Inspect Element.
