rvest scraping data from a button style tag? - r

Following up on my earlier questions about scraping data off sites. I'm trying to pull a few columns of data from SeatGeek, and I'm having trouble with the pricing and link data specifically. The following code runs, but the prices and links it returns are wrong: "$65" repeats on every row even though each button shows a different price, and the links come back as NA. Any ideas? Appreciate the help!
# ticket scraper
library(rvest)
tix_link = paste("https://seatgeek.com/new-york-knicks-tickets#events")
tix_info = tix_link %>% read_html() %>%
html_nodes(".event-listing-title span")
link_date = read_html(tix_link)
link_date = html_nodes(link_date, ".event-listing-date")
link_time = read_html(tix_link)
link_time = html_nodes(link_time, ".event-listing-time")
link_price = read_html(tix_link)
link_price = html_node(link_price, ".event-listing-button")
link_info = read_html(tix_link)
link_info = html_node(link_info, "span")
#convert to data frame
ticket_deals = data.frame(deals = html_text(tix_info),
date = html_text(link_date),
time = html_text(link_time),
price = html_text(link_price),
correpsonding_link = html_attr(link_info,"href"))
head(ticket_deals)
deals date
1 Dallas Mavericks at New York Knicks \n Nov 14
2 Detroit Pistons at New York Knicks \n Nov 16
3 Atlanta Hawks at New York Knicks \n Nov 20
4 Portland Trail Blazers at New York Knicks \n Nov 22
5 Charlotte Hornets at New York Knicks \n Nov 25
6 Oklahoma City Thunder at New York Knicks \n Nov 28
time price
1 \n Mon 7:30 PM \n From $65
2 \n Wed 7:30 PM \n From $65
3 \n Sun 12:00 PM \n From $65
4 \n Tue 7:30 PM \n From $65
5 \n Fri 7:30 PM \n From $65
6 \n Mon 7:30 PM \n From $65
correpsonding_link
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
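A likely cause, sketched under assumptions: html_node() (singular) returns only the first match, while html_nodes() returns every match. A single "From $65" is then recycled down the whole column by data.frame(). Likewise, the first "span" on a page usually has no href attribute, which would explain the NA links. The markup below is hypothetical and only stands in for the real listing page, whose structure is not shown in the question; if the live prices are filled in by JavaScript, read_html() will not see them at all.

```r
library(rvest)

# Hypothetical markup; the class name is borrowed from the question,
# everything else (tags, hrefs, prices) is made up for illustration.
page <- read_html('
  <ul>
    <li><a class="event-listing-button" href="/e/1">From $65</a></li>
    <li><a class="event-listing-button" href="/e/2">From $80</a></li>
    <li><a class="event-listing-button" href="/e/3">From $120</a></li>
  </ul>')

first_only  <- html_node(page, ".event-listing-button")   # first match only
all_buttons <- html_nodes(page, ".event-listing-button")  # every match

# One value: this is what gets recycled into every row of a data frame.
html_text(first_only)

# One value per button, with the href read from the same elements.
prices <- html_text(all_buttons)
links  <- html_attr(all_buttons, "href")
data.frame(price = prices, link = links)
```

Reading the text and the href from the same set of nodes keeps the two columns aligned row by row.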

R: move everything after a word to a new column and then only keep the last four digits in the new column

My data frame has a column called "State" and contains the state name, HB/HF number, and the date the law went into effect. I want the state column to only contain the state name and the second column to contain just the year. How would I do this?
library(dplyr) # provides the %>% pipe used below
Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz
# delete rows if col 2 has a blank value.
mintz = mintz[mintz$Substitution.Requirements != "", ]
# removes entire row if column 1 has the word State
mintz=mintz[mintz$State != "State", ]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
# delete PR
mintz = mintz[-34,]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
I'm almost certain I'll need to use strsplit(gsub()), but I'm not sure how to do this since there's no specific pattern.
EDIT
I still need help keeping only the state name in column 1.
As for moving the year to a new column, I found the code below. It works, but I don't know why it works. From my understanding, \d means that \d is the actual character it's searching for, the "." means to search for one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so the result contained characters instead. Isn't \d only supposed to be for digits? Would someone care to explain?
mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)
One way could be:
For demonstration purposes, select only the State column.
Then use str_extract with the pattern \\d{4}$ to pull out the four digits at the end of the string; this gives the Year column.
Finally, collapse the built-in state.name vector (a character vector, not a function) into a single alternation pattern, use it with str_extract to keep only the state name, and drop the rows that contain NA.
library(dplyr)
library(stringr)
mintz %>%
select(State) %>%
mutate(Year = str_extract(State, '\\d{4}$'), .after=State,
State = str_extract(State, paste(state.name, collapse='|'))
) %>%
na.omit()
State Year
2 Arizona 2016
3 California 2016
7 Connecticut 2018
12 Florida 2013
13 Georgia 2015
16 Hawaii 2016
21 Illinois 2016
24 Indiana 2014
28 Iowa 2017
32 Kansas 2017
33 Kentucky 2016
34 Louisiana 2015
39 Maryland 2017
42 Michigan 2018
46 Missouri 2016
47 Montana 2017
50 Nebraska 2018
51 Nevada 2018
54 New Hampshire 2018
55 New Jersey 2016
59 New York 2017
62 North Carolina 2015
63 North Dakota 2013
66 Ohio 2017
67 Oregon 2016
70 Pennsylvania 2016
74 Rhode Island 2016
75 South Carolina 2017
78 South Dakota 2019
79 Tennessee 2015
82 Texas 2015
85 Utah 2015
88 Vermont 2018
89 Virginia 2013
92 Washington 2015
93 West Virginia 2018
96 Wisconsin 2019
97 Wyoming 2018
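On the question of why the sub() call works: in an R string, '\\d' is the regex \d, which matches any single digit rather than a literal "d", and {4} asks for exactly four of them. The "." matches any one character and ".*" any run of characters. The parentheses capture the four digits, and '\\1' in the replacement refers back to that captured group, so the whole string is replaced by just the year. When a string has no run of four digits, nothing matches and sub() returns the input unchanged, which would explain the Minnesota row. A minimal base R sketch, using made-up strings (the bill numbers are hypothetical and not taken from the CSV):

```r
# Hypothetical values; the real State column may be formatted differently.
x <- c("Arizona HB 2310 (2016)", "Minnesota HF 1")

# The first .* is greedy, so the capture group lands on the LAST run of
# four digits. The second string has no such run and comes back untouched.
sub_result <- sub(".*(\\d{4}).*", "\\1", x)

# To get NA instead of the original text when there is no year
# (similar to what str_extract does), test for a match first.
year <- ifelse(grepl("\\d{4}", x), sub(".*(\\d{4}).*", "\\1", x), NA)
```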

Signalling where a new row should start based on arbitrary characters when converting webscraped output to a tibble

I'm scraping a television script and then trying to clean it up. This is what I have so far:
library(tidyverse)
library(rvest)
s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm')
s1_e1 <- s1_e1 %>%
html_nodes("p") %>%
html_text()
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\([^\\)]+\\)", replacement = "")
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\[[^\\]]+\\]", replacement = "")
s1_e1 <- str_squish(s1_e1)
s1_e1 <- s1_e1 %>%
as_tibble() %>%
filter(value!="") %>%
mutate(season = "27",
episode_num = "1",
airdate_orig = str_sub(.$value[1], -12),
episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>%
slice(-1)
Which gives me the below:
# A tibble: 38 x 5
value season episode_num airdate_orig episode_name
<chr> <chr> <chr> <chr> <chr>
1 ROSE: Bye! JACKIE: See you later! 27 1 26 Mar, 2005 Rose
2 TANNOY: This is a customer announcement… 27 1 26 Mar, 2005 Rose
3 ROSE: You pulled his arm off. DOCTOR: Y… 27 1 26 Mar, 2005 Rose
4 ROSE: That's just not funny. That's sic… 27 1 26 Mar, 2005 Rose
5 TAXI DRIVER: Watch it! 27 1 26 Mar, 2005 Rose
6 TELEVISION: The whole of Central London… 27 1 26 Mar, 2005 Rose
7 JACKIE: There's no point in getting up,… 27 1 26 Mar, 2005 Rose
8 JACKIE: There's Finch's. You could try … 27 1 26 Mar, 2005 Rose
9 ROSE: It's about last night. He's part … 27 1 26 Mar, 2005 Rose
10 ROSE: Don't mind the mess. Do you want … 27 1 26 Mar, 2005 Rose
# … with 28 more rows
I would like each row to be a new character's speech. As you can see, thankfully the script capitalizes who is speaking and then has a colon and a space before new speech, i.e. ROSE: or TANNOY: . Is there a way to indicate to R that I want each row of the tibble to begin with this capitalized text followed by a colon and to continue in that row until there is another capitalized word followed by a colon?
For example, the first row would start with ROSE: Bye! and the second row would start with JACKIE: See you later!, the third TANNOY: This is a customer announcement… until it reached another capitalized word followed by a colon, and so on.
Additionally, if anyone has any suggestions for how I can integrate the stringr functions into the dplyr chunk let me know. I can make a separate post about this if that's best, but I kept getting errors when attempting to do that (the above is functional though).
Many thanks in advance!
You could split on any whitespace that is followed by a capitalized word and a colon, using a lookahead pattern:
library(tidyverse)
s1_e1 %>%
mutate(value=str_split(value, "\\s(?=[A-Z]+:)")) %>%
unnest(value)
returns
# A tibble: 322 x 5
value season episode_num airdate_orig episode_name
<chr> <chr> <chr> <chr> <chr>
1 ROSE: Bye! 27 1 26 Mar, 2005 Rose
2 JACKIE: See you later! 27 1 26 Mar, 2005 Rose
3 TANNOY: This is a customer announcement. The store will be closi~ 27 1 26 Mar, 2005 Rose
4 GUARD: Oi! 27 1 26 Mar, 2005 Rose
5 ROSE: Wilson? Wilson, I've got the lottery money. Wilson, are yo~ 27 1 26 Mar, 2005 Rose
6 ROSE: I can't hang about 'cos they're closing the shop. Wilson! ~ 27 1 26 Mar, 2005 Rose
7 ROSE: Hello? Hello, Wilson, it's Rose. Hello? Wilson? 27 1 26 Mar, 2005 Rose
8 ROSE: Wilson? Wilson! 27 1 26 Mar, 2005 Rose
9 ROSE: You're kidding me. 27 1 26 Mar, 2005 Rose
10 ROSE: Is that someone mucking about? Who is it? 27 1 26 Mar, 2005 Rose
Simplified workflow
You can indeed put all your operations into one pipe:
s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm') %>%
html_nodes("p") %>%
html_text() %>%
tibble(value = .) %>%
mutate(value = str_squish(str_replace_all(value, "(\\s*\\([^\\)]+\\)|\\s*\\[[^\\]]+\\])", ""))) %>%
filter(value!="") %>%
mutate(season = "27",
episode_num = "1",
airdate_orig = str_sub(.$value[1], -12),
episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>%
slice(-1) %>%
mutate(value=str_split(value, "\\s(?=[A-Z]+:)")) %>%
unnest(value)
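One caveat with the lookahead pattern, shown as a sketch rather than a tested fix for the whole script: [A-Z]+: only covers single-word speaker labels, so a label such as TAXI DRIVER: would be split between its two words. The variant below allows spaces inside the label and uses a lookbehind to avoid splitting after a capital letter. The sample line and the helper name split_on are made up for illustration, and the variant assumes a speech never ends in a capital letter directly before the next label.

```r
# Hypothetical line with a two-word speaker label (not copied from the page).
x <- "ROSE: Bye! TAXI DRIVER: Watch it! JACKIE: See you later!"

# Illustrative helper: turn each matching whitespace into a newline,
# then split on the newlines.
split_on <- function(text, pattern) {
  marked <- gsub(pattern, "\n", text, perl = TRUE)
  strsplit(marked, "\n", fixed = TRUE)[[1]]
}

# Pattern from the answer: splits inside "TAXI DRIVER:".
original <- split_on(x, "\\s(?=[A-Z]+:)")

# Variant: label may contain spaces; no split right after a capital letter.
variant <- split_on(x, "(?<![A-Z])\\s(?=[A-Z][A-Z ]*:)")
```

The same variant pattern can be passed to str_split in the pipe above, since stringr also supports lookbehind.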

put the resulting values from for loop into a table in r [duplicate]

This question already has an answer here:
Using Reshape from wide to long in R [closed]
(1 answer)
Closed 2 years ago.
I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names.
teams<-c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
"Royal Challengers Bangalore","Kolkata Knight Riders","Delhi Daredevils",
"Kings XI Punjab", "Deccan Chargers","Rajasthan Royals", "Chennai Super Kings",
"Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
print(j)
ipl_table %>%
filter(season==2019 & (team1==j | team2 ==j)) %>%
summarise(match_count=n())->kl
print(kl)
match_played<-data.frame(Teams=teams,Match_count=kl)
}
The match count for the last team (i.e. Gujarat Lions) is 0, and the table is filled with 0 for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me with this error, as I'm very new to R.
Filter for the particular season, reshape the data into long format, and then count the number of matches per team.
library(dplyr)
matches %>%
filter(season == 2019) %>%
tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
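As for why the original loop fills every row with 0: kl is overwritten on each iteration, and match_played is rebuilt each time from the full teams vector plus that single latest count, so only the last team's value survives. A minimal sketch of the same idea that keeps one count per team, using a toy stand-in for ipl_table (the rows below are invented for illustration):

```r
# Toy stand-in for the asker's ipl_table (hypothetical rows).
ipl_table <- data.frame(
  season = c(2019, 2019, 2019, 2018),
  team1  = c("Mumbai Indians", "Delhi Capitals", "Mumbai Indians", "Gujarat Lions"),
  team2  = c("Delhi Capitals", "Chennai Super Kings", "Chennai Super Kings", "Mumbai Indians"),
  stringsAsFactors = FALSE
)
teams <- c("Mumbai Indians", "Delhi Capitals", "Chennai Super Kings", "Gujarat Lions")

season_2019 <- ipl_table[ipl_table$season == 2019, ]

# One count per team, collected into a vector instead of being overwritten.
match_count <- vapply(
  teams,
  function(tm) sum(season_2019$team1 == tm | season_2019$team2 == tm),
  integer(1)
)

match_played <- data.frame(Teams = teams, Match_count = unname(match_count))
match_played
```

Note that the teams vector in the question lists Gujarat Lions twice, the second time with a leading space, which would never match a team name in the data.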

Replace For Loop to fill column depending on other column value

I have a two-column dataframe (HOME & AWAY) called 'gamelist' with sports games. The HOME column also includes some dates with the corresponding games listed below.
HOME AWAY
15 Oct 2019 Pre-season
Phoenix Suns Denver Nuggets
Utah Jazz Sacramento Kings
Dallas Mavericks Oklahoma City Thunder
Memphis Grizzlies Charlotte Hornets
14 Oct 2019 Pre-season
Miami Heat Atlanta Hawks
13 Oct 2019 Pre-season
Orlando Magic Philadelphia 76ers
Toronto Raptors Chicago Bulls
Washington Wizards Milwaukee Bucks
I want to create a new column with the dates for each game. Coming from an Excel VBA background, I've used a for loop, which gives the intended result, but I was wondering if there is a more efficient approach in R, and I'm sure there is.
This is the code I've used:
gamelist<-add_column(gamelist,SDATE="",.before = 1)
for(i in 1:nrow(gamelist)){
if(str_count(gamelist[[i,3]],"\\d")==6){
gamelist[i,2]<-gamelist[i,3]
}else{
gamelist[i,2]<-gamelist[i-1,2]
}
}
Which gives me this as intended
SDATE HOME AWAY
15 Oct 2019 15 Oct 2019 Pre-season
15 Oct 2019 Phoenix Suns Denver Nuggets
15 Oct 2019 Utah Jazz Sacramento Kings
15 Oct 2019 Dallas Mavericks Oklahoma City Thunder
15 Oct 2019 Memphis Grizzlies Charlotte Hornets
14 Oct 2019 14 Oct 2019 Pre-season
14 Oct 2019 Miami Heat Atlanta Hawks
13 Oct 2019 13 Oct 2019 Pre-season
13 Oct 2019 Orlando Magic Philadelphia 76ers
13 Oct 2019 Toronto Raptors Chicago Bulls
13 Oct 2019 Washington Wizards Milwaukee Bucks
My apologies for the dataframe formatting, couldn't figure out how to reproduce one properly here.
Thanks for your help
We could use str_extract to get only the dates, so that rows without a match return NA, and then use fill to replace each NA with the previous non-NA value.
library(dplyr)
library(tidyr)
library(stringr)
gamelist %>%
mutate(SDATE = str_extract(HOME, "^\\d+ [A-Za-z]+ \\d{4}")) %>%
fill(SDATE)
# HOME AWAY SDATE
#1 15 Oct 2019 Pre-season 15 Oct 2019
#2 Phoenix Suns Denver Nuggets 15 Oct 2019
#3 Utah Jazz Sacramento Kings 15 Oct 2019
#4 Dallas Mavericks Oklahoma City Thunder 15 Oct 2019
#5 Memphis Grizzlies Charlotte Hornets 15 Oct 2019
#6 14 Oct 2019 Pre-season 14 Oct 2019
#7 Miami Heat Atlanta Hawks 14 Oct 2019
#8 13 Oct 2019 Pre-season 13 Oct 2019
#9 Orlando Magic Philadelphia 76ers 13 Oct 2019
#10 Toronto Raptors Chicago Bulls 13 Oct 2019
#11 Washington Wizards Milwaukee Bucks 13 Oct 2019
If we need the SDATE column first, we can use select
gamelist %>%
mutate(SDATE = str_extract(HOME, "^\\d+ [A-Za-z]+ \\d{4}")) %>%
fill(SDATE) %>%
select(SDATE, everything())
Or use add_column from tibble with either .after or .before
library(tibble)
gamelist %>%
add_column(SDATE = str_extract(.$HOME, "^\\d+ [A-Za-z]+ \\d{4}"),
.before = 1 ) %>%
fill(SDATE)
data
gamelist <- structure(list(HOME = c("15 Oct 2019", "Phoenix Suns", "Utah Jazz",
"Dallas Mavericks", "Memphis Grizzlies", "14 Oct 2019", "Miami Heat",
"13 Oct 2019", "Orlando Magic", "Toronto Raptors", "Washington Wizards"
), AWAY = c("Pre-season", "Denver Nuggets", "Sacramento Kings",
"Oklahoma City Thunder", "Charlotte Hornets", "Pre-season", "Atlanta Hawks",
"Pre-season", "Philadelphia 76ers", "Chicago Bulls", "Milwaukee Bucks"
)), class = "data.frame", row.names = c(NA, -11L))
If the date is always in the HOME column when the AWAY column is "Pre-season" (or some other predictable condition), then you could do something like:
# data
gamelist <- data.frame(
stringsAsFactors = FALSE,
HOME = c("15-Oct-19","Phoenix Suns",
"Utah Jazz","Dallas Mavericks","Memphis Grizzlies",
"14-Oct-19","Miami Heat","13-Oct-19","Orlando Magic",
"Toronto Raptors","Washington Wizards"),
AWAY = c("Pre-season","Denver Nuggets",
"Sacramento Kings","Oklahoma City Thunder",
"Charlotte Hornets","Pre-season","Atlanta Hawks","Pre-season",
"Philadelphia 76ers","Chicago Bulls","Milwaukee Bucks")
)
# create blank column to fill in
gamelist$date <- NA
# fill cases where there's a date
gamelist$date[gamelist$AWAY=="Pre-season"] <- gamelist$HOME[gamelist$AWAY=="Pre-season"]
# use zoo::na.locf() to fill in missing values
gamelist$date <- zoo::na.locf(gamelist$date)
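The same carry-forward idea can also be sketched in base R without extra packages. This assumes the first row is always a date row; the data below is a shortened copy of the example above.

```r
# Shortened copy of the example data.
gamelist <- data.frame(
  HOME = c("15 Oct 2019", "Phoenix Suns", "Utah Jazz", "14 Oct 2019", "Miami Heat"),
  AWAY = c("Pre-season", "Denver Nuggets", "Sacramento Kings", "Pre-season", "Atlanta Hawks"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose HOME value looks like "15 Oct 2019".
is_date <- grepl("^\\d+ [A-Za-z]+ \\d{4}$", gamelist$HOME)

# cumsum(is_date) numbers each date block 1, 2, ...; indexing the date rows
# by that block number repeats each date over the rows beneath it.
gamelist$SDATE <- gamelist$HOME[is_date][cumsum(is_date)]
gamelist
```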

how to retrieve text from span & p tag in r

I have following link
url = "https://timesofindia.indiatimes.com/topic/Adani"
From the above URL, I want to extract the headline, the paragraph below it, and the date into 3 different columns.
I am able to extract only one news headline and paragraph with the following code:
results_headline <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="c_topic_list1_1"]/div[1]/ul/li[4]/div/a/span[1]')
results_para <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="c_topic_list1_1"]/div[1]/ul/li[4]/div/a/p')
I want to extract all the headlines, paragraphs and dates on that page.
How can I do it in R?
Once again, you can simply use CSS selectors to extract the content.
url2 = "https://timesofindia.indiatimes.com/topic/Adani"
titles <- url2 %>% read_html() %>% html_nodes("div > a > span.title") %>% html_text()
dates <- url2 %>% read_html() %>% html_nodes("div > a > span.meta") %>% html_text()
desc <- url2 %>% read_html() %>% html_nodes("div > a > p") %>% html_text()
data.frame(titles,dates,desc)
output:
> data.frame(titles,dates,desc)
titles dates
1 \nDRI drops Adani Group overvaluation case\n Oct 28
2 \nAdani Enterprises to demerge renewable energy biz\n Oct 7
3 \nAdani Enterprises' Q2 PAT falls 6% to Rs 59 cr\n Nov 13
4 \nAdani firm close to finalising RInfra power acquisition deal\n Nov 12
5 \nAdani group shares surge up to 9%\n Aug 28
6 \nAdani Transmission acquires RInfra WRSSS assets for Rs 1k cr\n Nov 1
7 \nVedanta, Adani may bid for Bunder diamond project in MP\n Oct 27
8 \nAdani Power coercing land from farmers: M K Stalin\n Oct 31
9 \nAdani Transmission acquires 2 SPVs from RVPN\n Aug 6
desc
1 Additional director general, DRI (adjudication), K V S Singh, has dropped all charges and summarily closed all proceedings in a speaking order.
2 New Delhi, Oct 7 () Adani Enterprises today announced plans to demerge its renewable energy business into associate company Adani Green Energy Ltd as part of simplifying overall business structure.
3 New Delhi, Nov 13 () Adani Enterprises, the flagship firm of Adani group, today said its profit after tax fell by 6.34 per cent to Rs 59 crore in the July-September quarter of 2017-18 compared to Rs 63 crore in the same quarter a year ago.
4 New Delhi, Nov 12 () Adani Transmission is likely to clinch a deal of Rs 13,000-14,000 crore with Reliance Infrastructure to acquire the latter's Mumbai power business much before the January 2018 deadline to mark its foray into power distribution business.
5 New Delhi, Aug 28 () Shares of Adani group of companies surged up to 9 per cent today as the mining giant will start work on its 16.5 billion dollar Carmichael coal project in Australia in October and is expected to ship the first consignment in March 2020. The stock jumped 9.
6 New Delhi, Nov 1 () Adani Transmission today said it has completed acquisition of operational transmission assets of WRSS Schemes of Reliance Infra for Rs 1,000 crore. In effect, its power-wheeling network crossed the 8,500 circuit km mark.
7 New Delhi, Oct 27 () Metals and mining major Vedanta Ltd and the Adani Group may bid for the Bunder diamond project in Madhya Pradesh from which global giant Rio Tinto exited this year, according to sources. "Vedanta may bid for the Bunder project," said a source on the condition of anonymity.
8
9
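A possible refinement, sketched under assumptions: the page can be read once instead of three times, and selecting one container per news item before picking its parts keeps the columns aligned even when an item lacks a paragraph (html_node() on a set of nodes returns one result per node, with NA where nothing matches). The markup below is hypothetical and only mimics the selectors used above; the live page may be structured differently.

```r
library(rvest)

# Hypothetical markup; the second item deliberately has no <p>.
page <- read_html('
  <ul>
    <li><div><a href="/a1"><span class="title"> Headline one </span>
      <span class="meta">Oct 28</span><p>First summary.</p></a></div></li>
    <li><div><a href="/a2"><span class="title"> Headline two </span>
      <span class="meta">Oct 7</span></a></div></li>
  </ul>')

# One node per news item, then one title, date and paragraph per item.
items <- html_nodes(page, "li")
news <- data.frame(
  title = html_text(html_node(items, "span.title"), trim = TRUE),
  date  = html_text(html_node(items, "span.meta"), trim = TRUE),
  desc  = html_text(html_node(items, "p"), trim = TRUE),
  stringsAsFactors = FALSE
)
news
```

Passing trim = TRUE also removes the leading and trailing "\n" seen in the titles above.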
