Splitting data from one column into different columns in R

"I have carried out scraping in R but facing the problem in splitting the data into different columns. I am not able to write the code for column 8:10 (last line of the code). Below is the code"
library(xml2)
library(rvest)
library(stringr)
library(tidyr)
reddit_wbpg <- read_html("https://www.tripadvisor.in/Hotel_Review-g304551-d3583700-Reviews-or10-Lemon_Tree_Premier_Delhi_Airport-New_Delhi_National_Capital_Territory_of_Delhi.html")
title <- reddit_wbpg %>%
  html_node("title") %>%
  html_text()
reviews <- reddit_wbpg %>%
  html_nodes("q.location-review-review-list-parts-ExpandableReview__reviewText--gOmRC") %>%
  html_text()
user_data1 <- reddit_wbpg %>%
  html_nodes("div.social-member-event-MemberEventOnObjectBlock__event_type--3njyv") %>%
  html_text()
user_data2 <- reddit_wbpg %>%
  html_nodes("div.social-member-MemberHeaderStats__event_info--30wFs") %>%
  html_text()
review_title <- reddit_wbpg %>%
  html_nodes("div.location-review-review-list-parts-ReviewTitle__reviewTitle--2GO9Z") %>%
  html_text()
scraping_data <- data.frame(page_title = title, review_title = review_title, reviews = reviews, user_data1 = user_data1, user_data2 = user_data2)
scraping_data <- cbind(scraping_data, "a", "a", "a", "a", "a")
colnames(scraping_data)[6:10] <- c("user_name", "date", "location", "contribution", "helpful_votes")
scraping_data[, 6:7] <- str_split_fixed(scraping_data$user_data1, " wrote a review", 2)
scraping_data[, 8] <- str_extract(scraping_data$user_data2, "^.+?(?=[0-9]+ [hc])")
scraping_data[, 9] <- str_extract(scraping_data$user_data2, "[0-9]+(?= contributions)")
scraping_data[, 10] <- str_extract(scraping_data$user_data2, "[0-9]+(?= helpful votes)")
The output can be seen in the image attached to the original question (not reproduced here).

Here's one approach with str_extract using both positive and negative lookahead:
scraping_data[,8] <- str_extract(scraping_data$user_data2,"^(?![0-9]+ (con|hel)).+?(?=[0-9]+ (con|hel)|$)")
scraping_data[,9] <- str_extract(scraping_data$user_data2,"[0-9]+(?= contribution)")
scraping_data[,10] <- str_extract(scraping_data$user_data2,"[0-9]+(?= helpful vote)")
scraping_data
# user_data1 user_data2 user_name date location contribution helpful_votes
#1 mohd saqibsaqib wrote a review Mar 2020 2 contributions2 helpful votes a a <NA> 2 2
#2 hitesh k wrote a review Mar 2020 4 contributions1 helpful vote a a <NA> 4 1
#3 Basant wrote a review Mar 2020 2 contributions a a <NA> 2 <NA>
#4 RagP65 wrote a review Mar 2020 New Delhi, India9 contributions4 helpful votes a a New Delhi, India 9 4
#5 Mbosma wrote a review Mar 2020 2 contributions a a <NA> 2 <NA>
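To see what the leading negative lookahead buys you, here is a minimal sketch on two made-up strings (the sample values are assumptions, not scraped data):
library(stringr)
x <- c("New Delhi, India9 contributions4 helpful votes",
       "2 contributions2 helpful votes")
# The original pattern "^.+?(?=[0-9]+ [hc])" would wrongly capture "2 contributions"
# from the second string; the leading (?![0-9]+ (con|hel)) rejects strings that
# start with a count, so only a genuine location is captured.
str_extract(x, "^(?![0-9]+ (con|hel)).+?(?=[0-9]+ (con|hel)|$)")
# [1] "New Delhi, India" NA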

Related

Is there a function similar to read_html() that can be used on data table or data frame types in R?

I'm attempting to web-scrape footballdb.com to get data related to NFL player injuries for a model I am creating, from links such as this: https://www.footballdb.com/transactions/injuries.html?yr=2016&wk=1&type=reg, which will then be output in a data table. Along with data on each individual player's injury (i.e. their name, injury, and status throughout the week leading up to the game), I also want to include the season and week of the injury for each player. I started by using nested for loops to generate the URL for each webpage in question, along with the season and week corresponding to each webpage, which were stored in a data table with columns: link, season, and week.
I then tried to use the functions map_df(), read_html(), and html_nodes() to extract the information I wanted from each webpage, but I run into errors because read_html() does not work on objects of the data table or data frame class. I then tried different types of indexing and the $ operator, with no luck either. Is there any way I can modify the code I have produced thus far to extract the information I want from a data table? Below is what I have written thus far:
library(purrr)
library(rvest)
library(data.table)
#Remove file if file already exists
if (file.exists("./project/volume/data/interim/injuryreports.csv")) {
  file.remove("./project/volume/data/interim/injuryreports.csv")
}
#Declare variables and empty data tables
path1 <- ("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons <- c("2016", "2017", "2020")
weeks <- 1:17
result <- data.table()
temp <- NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for (s in 1:length(seasons)) {
  for (w in 1:length(weeks)) {
    temp$link <- paste0(path1, seasons[s], "&wk=", as.character(w), "&type=reg")
    temp$season <- as.numeric(seasons[s])
    temp$week <- weeks[w]
    result <- rbind(result, temp)
  }
}
#Get rid of any potential empty values from result
result <- compact(result)
###Errors Below####
DT <- map_df(result, function(x){
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
})
#####End of Errors###
#Write out injury data table
fwrite(DT, "./project/volume/data/interim/injuryreports.csv")
The issue is that your input data frame result is a data.table. When you pass it to map_df, it loops over the columns(!) of the data.table, not the rows.
One approach to make your code work is to split result by link and loop over the resulting list.
Note: For the reprex I only loop over the first two elements of the list. Additionally, I have pulled your function out of the map statement, which made debugging easier.
library(purrr)
library(rvest)
library(data.table)
#Declare variables and empty data tables
path1 <- ("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons <- c("2016", "2017", "2020")
weeks <- 1:17
result <- data.table()
temp <- NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for (s in 1:length(seasons)) {
  for (w in 1:length(weeks)) {
    temp$link <- paste0(path1, seasons[s], "&wk=", as.character(w), "&type=reg")
    temp$season <- as.numeric(seasons[s])
    temp$week <- weeks[w]
    result <- rbind(result, temp)
  }
}
#Get rid of any potential empty values from result
result <- compact(result)
result <- split(result, result$link)
get_table <- function(x) {
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
}
DT <- map_df(result[1:2], get_table)
DT
DT
#> Season Week Player Injury Wed Thu Fri
#> 1: 2016 1 Justin Bethel Foot Limited Limited Limited
#> 2: 2016 1 Lamar Louis Knee DNP Limited Limited
#> 3: 2016 1 Kareem Martin Knee DNP DNP DNP
#> 4: 2016 1 Alex Okafor Biceps Full Full Full
#> 5: 2016 1 Frostee Rucker Neck Limited Limited Full
#> ---
#> 437: 2016 10 Will Blackmon Thumb Limited Limited Limited
#> 438: 2016 10 Duke Ihenacho Concussion Full Full Full
#> 439: 2016 10 DeSean Jackson Shoulder DNP DNP DNP
#> 440: 2016 10 Morgan Moses Ankle Limited Limited Limited
#> 441: 2016 10 Brandon Scherff Shoulder Full Full Full
#> GameStatus
#> 1: (09/09) Questionable vs NE
#> 2: (09/09) Questionable vs NE
#> 3: (09/09) Out vs NE
#> 4: --
#> 5: --
#> ---
#> 437: (11/11) Questionable vs Min
#> 438: (11/11) Questionable vs Min
#> 439: (11/11) Doubtful vs Min
#> 440: (11/11) Questionable vs Min
#> 441: --
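Purely as an aside, and a sketch of my own rather than part of the original answer: purrr::pmap_dfr can iterate row-wise over result before the split(result, result$link) step, passing the columns as named arguments (it relies on dplyr being installed for the row binding):
get_row <- function(link, season, week) {
  page <- read_html(link)
  data.table(
    Season = season,
    Week = week,
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text()
    # ...remaining columns as in get_table()
  )
}
DT2 <- pmap_dfr(result, get_row)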

Cannot identify html node for scraping in rvest

I'm trying to grab links from a page for subsequent analysis and can only grab about half of them, which may be due to filtering. The links I'm after are highlighted in a screenshot attached to the original question (not reproduced here).
My approach is as follows, which is not ideal because I believe I may be losing some links in the filter() call.
library(rvest)
library(tidyverse)
#initiate session
session <- html_session("https://www.backlisted.fm/episodes")
#collect links for all episodes from the index page:
session %>%
  read_html() %>%
  html_nodes(".underline-body-links a") %>%
  html_attr("href") %>%
  tibble(link_temp = .) %>%
  filter(str_detect(link_temp, pattern = "episodes/")) %>%
  distinct()
#css:
#.underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt a
#result:
link_temp
<chr>
1 /episodes/116-mfk-fisher-how-to-cook-a-wolf
2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women
3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody
4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona
5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man
7 /episodes/114-william-golding-the-inheritors
8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia
9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king
# … with 43 more rows
I've been reading multiple documents but I can't target that one type of href. Any help will be much appreciated. Thank you.
Try this
library(rvest)
library(tidyverse)
session <- html_session("https://www.backlisted.fm/index")
raw_html <- read_html(session)
node <- raw_html %>% html_nodes(css = "li p a")
link <- node %>% html_attr("href")
title <- node %>% html_text()
tibble(title, link)
# A tibble: 117 x 2
# title link
# <chr> <chr>
# 1 "A Month in the Country" https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country
# 2 " - J.L. Carr (with Lissa Evans)" #
# 3 "Good Morning, Midnight - Jean Rhys" https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight
# 4 "It Had to Be You - David Nobbs" https://www.backlisted.fm/episodes/3-david-nobbs-1
# 5 "The Blessing - Nancy Mitford" https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing
# 6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
# 7 "Passing - Nella Larsen" https://www.backlisted.fm/episodes/6-nella-larsen-passing
# 8 "The Great Fire - Shirley Hazzard" https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire
# 9 "Lolly Willowes - Sylvia Townsend Warner" https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
# 10 "The Information - Martin Amis" https://www.backlisted.fm/episodes/9-martin-amis-the-information
# … with 107 more rows
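If you would rather stay on the original /episodes page, the mixed relative and absolute hrefs from the first attempt can be normalised with xml2::url_absolute before deduplicating; a sketch, assuming the same session as above:
library(xml2)
session %>%
  read_html() %>%
  html_nodes(".underline-body-links a") %>%
  html_attr("href") %>%
  url_absolute(base = "https://www.backlisted.fm") %>%
  unique()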

Scrape Data through rvest

I am looking to get the article names by category from https://www.inquirer.net/article-index?d=2020-6-13
I've attempted to read the article names by doing:
library('rvest')
year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")
pg <- read_html(url)
test <- pg %>%
  html_nodes("#index-wrap") %>%
  html_text()
This returns only one string with all the article names, and it's very messy.
I ultimately would like to have a dataframe that looks like below:
Date Category Article Name
2020-06-13 News ‘We can never let our guard down’ vs terrorism – Cayetano
2020-06-13 News PNP spox says mañanita remark did not intend to put Sinas in bad light
2020-06-13 News After stranded mom’s death, Pasay LGU helps over 400 stranded individuals
2020-06-13 World 4 dead after tanker truck explodes on highway in China
etc.
2020-06-13 Lifestyle Book: Melania Trump delayed 2017 move to DC to get new prenup
Does anyone know what I may be missing? Very new to this, thanks!
This is maybe the closest you can get:
library(rvest)
#> Loading required package: xml2
library(tibble)
year <- 2020
month <- 06
day <- 13
url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)
div <- read_html(url) %>% html_node(xpath = '//*[@id = "index-wrap"]')
links <- html_nodes(div, xpath = '//a[@rel = "bookmark"]')
post_date <- html_nodes(div, xpath = '//span[@class = "index-postdate"]') %>%
  html_text()
test <- tibble(date = post_date,
               text = html_text(links),
               link = html_attr(links, "href"))
test
#> # A tibble: 261 x 3
#> date text link
#> <chr> <chr> <chr>
#> 1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#> 2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#> 3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#> 4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#> 5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#> 6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#> 7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#> 8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#> 9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows
Created on 2020-06-14 by the reprex package (v0.3.0)
You forgot read_html(); then use that in the dplyr statement:
library('rvest')
year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")
#added page
page <- read_html(url)
test <- page %>%
  #changed xpath
  html_node(xpath = '//*[@id = "index-wrap"]') %>%
  html_text()
test
Update: I suck at dplyr, but this is what I have before I go to bed.
library('rvest')
year <- 2020
month <- 06
day <- 13
url <- paste('http://www.inquirer.net/article-index?d=', year, '-', month, '-',day, sep = "")
#added page
page <- read_html(url)
titles <- page %>%
  html_nodes(xpath = '//*[@id = "index-wrap"]/h4') %>%
  html_text()
sections <- page %>%
  html_nodes(xpath = '//*[@id = "index-wrap"]/ul')
stories <- sections %>%
  html_nodes(xpath = './/li/a') %>% # './/' keeps the search inside each section
  html_text()
stories
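To get from there to the Date/Category/Article Name layout in the question, one sketch (my own assumption, untested against the live page) is to count the stories under each category <ul> and repeat the matching <h4> heading accordingly:
# Assumes titles and sections line up one-to-one on the page.
n_per_section <- sapply(sections, function(s) length(html_nodes(s, xpath = './/li/a')))
articles <- data.frame(
  Date = paste(year, month, day, sep = "-"),
  Category = rep(titles, times = n_per_section),
  Article.Name = stories,
  stringsAsFactors = FALSE
)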

Add synonyms from qdap to a preexisting dataframe in R

I have created the following dataframe DF in R:
Sl NO Word
1 get
2 Free
3 Joshi
4 Hello
5 New
I have used this code to get synonyms, but the result comes back as a list:
library(qdap)
synonyms(DF$Word)
I am getting a list of synonymous words for these. I want the synonyms for each word in the dataframe appended row-wise as separate columns:
DF<-
Sl NO Word Syn1 Syn2
1 get obtain receive
2 Free independent NA
3 Joshi NA NA
4 Hello Greeting NA
5 New Unused Fresh
Is there an elegant way to obtain this? Are there other dictionaries that can be used for this?
One approach could be to use mapply and pass one word at a time to qdap::synonyms. The result from synonyms can be collapsed into a single column using paste0 with collapse = "|". Now the data is ready.
Use tidyr::separate to split that column into Syn1, Syn2, etc.
Note: synonyms is called with the two arguments return.list = FALSE and multiwords = FALSE.
The code below caps the output at a maximum of 10 synonyms, but the solution can be evolved to handle the number dynamically (see the sketch after the sample data).
library(tidyverse)
library(qdap)
df %>%
  mutate(Synonyms =
    mapply(function(x) paste0(
      head(synonyms(x, return.list = FALSE, multiwords = FALSE), 10), collapse = "|"),
      tolower(.$Word))) %>%
  separate(Synonyms, paste("Syn", 1:10), sep = "\\|", extra = "drop")
Result:
# SlNO Word Syn 1 Syn 2 Syn 3 Syn 4 Syn 5 Syn 6 Syn 7 Syn 8 Syn 9 Syn 10
# 1 1 get achieve acquire attain bag bring earn fetch gain glean inherit
# 2 2 Free buckshee complimentary gratis gratuitous unpaid footloose independent liberated loose uncommitted
# 3 3 Joshi <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 4 4 Hello <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 5 New advanced all-singing all-dancing contemporary current different fresh ground-breaking happening latest
Data
df <- read.table(text =
"SlNO Word
1 get
2 Free
3 Joshi
4 Hello
5 New",
header = TRUE, stringsAsFactors = FALSE)
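To lift the hard-coded limit of 10, one possible sketch (my own, not part of the original answer) is to build the collapsed column first and size the separate() call from the widest row; the cap of 25 here is arbitrary, purely to keep the output printable:
df2 <- df %>%
  mutate(Synonyms =
    mapply(function(x) paste0(
      head(synonyms(x, return.list = FALSE, multiwords = FALSE), 25), collapse = "|"),
      tolower(.$Word)))
# Count separators in the widest row, then generate exactly that many column names.
n_max <- max(str_count(df2$Synonyms, fixed("|"))) + 1
df2 %>% separate(Synonyms, paste0("Syn", seq_len(n_max)), sep = "\\|", fill = "right")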
Here is another approach with splitstackshape::cSplit.
library(tidyverse)
library(qdap)
library(splitstackshape)
DF <- df # the sample data frame shown above
DF <- DF %>% mutate_at(vars(Word), tolower)
syns <- synonyms_frame(synonyms(tolower(DF$Word))) %>%
  mutate_at(vars(x), funs(str_remove(x, "\\..*"))) %>%
  mutate_at(vars(y), funs(str_extract(y, '[:alpha:]+'))) %>%
  group_by(x) %>%
  summarise(Syn = toString(y)) %>%
  rename(Word = x) %>%
  cSplit('Syn')
left_join(DF, syns)
I am not sure exactly how you would like to add all the synonyms of a word, because when you run synonyms("get") it gives 75 definitions of get, and I feel the desired layout would not be of much help if you added the values of all 75 definitions in a single row.
So in the solution below I have selected only the very first definition.
library(qdap)
library(dplyr)
library(splitstackshape)
df %>%
  rowwise() %>%
  mutate(synonym_of_word = paste(synonyms(tolower(word))[[1]], collapse = ",")) %>%
  cSplit("synonym_of_word", ",")
Sample data:
df <- structure(list(sl_no = 1:5, word = c("get", "Free", "Joshi",
"Hello", "New")), .Names = c("sl_no", "word"), class = "data.frame", row.names = c(NA,
-5L))

Creating a dataframe from a scraped character vector

I am trying to create a dataframe that has the columns First name, Last name, Party, State, and Member ID. Here is my code:
library('rvest')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
candidate_list <- html_text(candidate_nodes)
My main issue is getting the member IDs. An example ID is A000009. When I use the gsub function I lose the leading A in this example. The A is from this candidate's last name (Abercrombie), but I do not know how to add the A back into the member ID. Of course if there's a better way I am open to any suggestions.
Since you've got an HTML table, use html_table to extract it to a data.frame. You'll need fill = TRUE, because the table has extra empty rows inserted between each entry, which you can easily drop afterwards with tidyr::drop_na.
library(tidyverse)
library(rvest)
page <- 'https://www.congress.gov/help/field-values/member-bioguide-ids' %>%
  read_html()
members <- page %>%
  html_node('table') %>%
  html_table(fill = TRUE) %>%
  set_names('member', 'bioguide') %>%
  drop_na(member) %>% # remove empty rows inserted in the table
  tbl_df() # for printing
members
#> # A tibble: 2,243 x 2
#> member bioguide
#> * <chr> <chr>
#> 1 Abdnor, James (Republican - South Dakota) A000009
#> 2 Abercrombie, Neil (Democratic - Hawaii) A000014
#> 3 Abourezk, James (Democratic - South Dakota) A000017
#> 4 Abraham, Ralph Lee (Republican - Louisiana) A000374
#> 5 Abraham, Spencer (Republican - Michigan) A000355
#> 6 Abzug, Bella S. (Democratic - New York) A000018
#> 7 Acevedo-Vila, Anibal (Democratic - Puerto Rico) A000359
#> 8 Ackerman, Gary L. (Democratic - New York) A000022
#> 9 Adams, Alma S. (Democratic - North Carolina) A000370
#> 10 Adams, Brock (Democratic - Washington) A000031
#> # ... with 2,233 more rows
The member column could be further extracted, if you like.
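For instance, a sketch with tidyr::extract, assuming the "Last, First (Party - State)" pattern holds for every row:
members %>%
  extract(member,
          into = c("last_name", "first_name", "party", "state"),
          regex = "^([^,]+),\\s*([^(]+?)\\s*\\((\\w+) - ([^)]+)\\)")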
There are also many other useful sources for this data, some of which correlate it with other useful variables. This one is well-structured and updated regularly.
Give this a try. I have updated this to include separating out the different fields.
library('rvest')
library('dplyr')
library('tidyr')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
df.candidates <- as.data.frame(html_table(candidate_nodes, header = TRUE, fill = TRUE), stringsAsFactors = FALSE)
df.candidates <- df.candidates[!is.na(df.candidates$Member),]
df.candidates <- df.candidates %>%
  # regexpr()/regmatches() returns the first match per row, vectorised over Member
  mutate(Party.State = gsub("[()]", "", regmatches(Member, regexpr("\\(.*?\\)", Member)))) %>%
  separate(Party.State, into = c("Party", "State"), sep = " - ") %>%
  mutate(Full.name = trimws(regmatches(Member, regexpr("^[^(]+", Member)))) %>%
  separate(Full.name, into = c("Last.Name", "First.Name", "Suffix"), sep = ",", fill = "right") %>%
  select(First.Name, Last.Name, Suffix, Party, State, Member.ID)
This is a bit hackish, but if you want to extract the variables using regex here are a few pointers.
candidate_list <- unlist(candidate_list)
ID <- regmatches(candidate_list,
                 gregexpr("[a-zA-Z]{1}[0-9]{6}", candidate_list))
party_state <- regmatches(candidate_list,
                          gregexpr("(?<=\\()[^)]+(?=\\))", candidate_list, perl = TRUE))
names_etc <- strsplit(candidate_list, "[a-zA-Z]{1}[0-9]{6}")
names <- sapply(names_etc, function(x) sub(" \\([^)]*\\)", "", x))
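Assembling those pieces into the requested data frame might then look like this (a sketch; it assumes ID, party_state, and names end up the same length, and in practice the first strsplit fragment may contain header text that needs dropping):
ps <- strsplit(unlist(party_state), " - ")
nm <- strsplit(trimws(unlist(names)), ",\\s*")
candidates <- data.frame(
  First.Name = sapply(nm, function(x) x[2]),
  Last.Name = sapply(nm, function(x) x[1]),
  Party = sapply(ps, function(x) x[1]),
  State = sapply(ps, function(x) x[2]),
  Member.ID = unlist(ID),
  stringsAsFactors = FALSE
)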
