Web Scraping Using Multiple Variables in Link - r

I am trying to efficiently scrape weekly tournament data from pgatour.com and place the results in one encompassing table. Below is an example link that I will use:
https://www.pgatour.com/stats/stat.02568.y2019.eon.t041.html
In the example link, 02568 is one of many stat_ids and t041 is one of many tournament_ids. I want the scrape to cover every combination of stat_id and tournament_id.
Currently, my lapply is cycling through both ids at the same time, so I am only getting 3 of the possible 9 combinations. Is there a way to change my lapply call to cycle through both ids in the desired manner?
library(rvest)
library(dplyr)
library(stringr)
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")
url_g <- c(paste('https://www.pgatour.com/stats/stat.', stat_id, '.y2019.eon.', tournament_id,'.html', sep =""))
test_table_pga4 <- lapply(url_g, function(i){
  page2 <- read_html(i)
  test_table_pga5 <- page2 %>% html_nodes("#statsTable") %>% html_table() %>% .[[1]] %>%
    mutate(tournament = i)
})
# rbind.fill() comes from plyr, which is not attached above
test_golf7 <- as_tibble(plyr::rbind.fill(test_table_pga4))

Use expand.grid() to create unique combinations of stat_id and tournament_id and then mutate a new column with those links.
library(tidyverse)
library(magrittr)  # provides %$%, which library(tidyverse) does not attach
library(janitor)
library(rvest)
df <- expand.grid(
  tournament_id = c("t041", "t054", "t464"),
  stat_id = c("02568", "02567", "02564")
) %>%
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y2019.eon.',
      tournament_id,
      '.html'
    )
  ) %>%
  as_tibble()
# Function to get the table
get_info <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(rank_this_week = rank_this_week %>%
             as.character,
           tournament = tournament) %>%
    relocate(tournament)
}
# Retrieve the tables and bind them
df %$%
  map2_dfr(links, tournament_id, get_info)
# A tibble: 648 × 9
tournament rank_this_week player_name rounds average total_sg_app
<fct> <chr> <chr> <int> <dbl> <dbl>
1 t041 1 Corey Conners 4 2.89 11.6
2 t041 2 Matt Kuchar 4 2.16 8.62
3 t041 3 Byeong Hun An 4 1.90 7.60
4 t041 4 Charley Hoffman 4 1.72 6.88
5 t041 5 Ryan Moore 4 1.43 5.73
6 t041 6 Brian Stuard 4 1.42 5.69
7 t041 7 Danny Lee 4 1.30 5.18
8 t041 8 Cameron Tringale 4 1.22 4.88
9 t041 9 Si Woo Kim 4 1.22 4.87
10 t041 10 Scottie Scheffler 4 1.16 4.62
# … with 638 more rows, and 3 more variables: measured_rounds <int>,
# total_sg_ott <dbl>, total_sg_putting <dbl>
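To see why the original lapply only produced 3 links: paste() works element-wise on vectors of equal length, pairing the first stat_id with the first tournament_id and so on, whereas expand.grid() builds every pairing first. A minimal sketch in base R that needs no network access; the helper name build_url is made up for illustration and is not part of any package:

```r
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")

# Hypothetical helper: builds one stats URL per (stat, tournament) pair
build_url <- function(stat, tournament) {
  paste0("https://www.pgatour.com/stats/stat.", stat, ".y2019.eon.", tournament, ".html")
}

# Element-wise pairing: 1st with 1st, 2nd with 2nd, 3rd with 3rd
paired <- build_url(stat_id, tournament_id)

# All combinations: expand.grid() varies its first argument fastest
grid <- expand.grid(tournament_id = tournament_id, stat_id = stat_id,
                    stringsAsFactors = FALSE)
crossed <- build_url(grid$stat_id, grid$tournament_id)

length(paired)   # 3
length(crossed)  # 9
```

Passing stringsAsFactors = FALSE keeps the ids as character; with the default, expand.grid() returns factors, which is why tournament appears as <fct> in the output above.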

Related

Cannot calculate percentage?

I'm trying to calculate the percentage of a column, but my code did not work.
Here is my dataset: https://www.kaggle.com/datasets/ngoduyha/real-estate-sale-us
Here is my function:
sale <- read_csv("re_sale.csv")
sale %>%
  filter(!is.na(`Property Type`)) %>%
  group_by(`Property Type`) %>%
  summarize(sale_vol = n(), percent = sale_vol/sum(sale_vol)*100)
And it resulted like this:
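Any help would be greatly appreciated!
Inside the grouped summarize(), sum(sale_vol) is evaluated per group, so each percentage is that group's count divided by itself, i.e. 100. Compute the counts first, ungroup, and then compute the percentages in a separate mutate().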
suppressPackageStartupMessages(
  library(tidyverse)
)
sale %>%
  filter(!is.na(`Property Type`)) %>%
  group_by(`Property Type`) %>%
  summarize(sale_vol = n()) %>%
  ungroup() %>%
  mutate(percent = sale_vol/sum(sale_vol)*100)
## A tibble: 12 × 3
# `Property Type` sale_vol percent
# <chr> <int> <dbl>
# 1 "" 382446 38.4
# 2 "Apartments" 486 0.0487
# 3 "Commercial" 1981 0.199
# 4 "Condo" 105420 10.6
# 5 "Four Family" 2150 0.216
# 6 "Industrial" 228 0.0229
# 7 "Public Utility" 5 0.000501
# 8 "Residential" 60728 6.09
# 9 "Single Family" 401612 40.3
#10 "Three Family" 12586 1.26
#11 "Two Family" 26408 2.65
#12 "Vacant Land" 3163 0.317
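The difference is easier to see on a small made-up data set. Inside a grouped summarize(), sum(sale_vol) only sees the current group's single count, so each percentage comes out as 100; once the counts sit in an ungrouped table, sum() covers all rows. A minimal sketch, assuming only dplyr; the toy data frame and its values are invented for illustration:

```r
library(dplyr)

# Invented data: three condos and one single-family sale
toy <- data.frame(type = c("Condo", "Condo", "Condo", "Single Family"))

# Percentage computed inside the grouped summarize(): always 100
wrong <- toy %>%
  group_by(type) %>%
  summarize(sale_vol = n(), percent = sale_vol / sum(sale_vol) * 100)

# count() returns an ungrouped table here, so sum() spans all types
right <- toy %>%
  count(type, name = "sale_vol") %>%
  mutate(percent = sale_vol / sum(sale_vol) * 100)

wrong$percent  # 100 100
right$percent  # 75 25
```

count() is used here as shorthand for group_by() followed by summarize(n = n()); the approach is otherwise the same as the answer above.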

how to scrape text from an icon - R

I'm trying to scrape all the data from this website. There are icons over some of the competitors' names indicating that the person was disqualified for being a 'no-show'.
I would like to create a data frame with all the competitors while also specifying who was disqualified, but I'm running into two issues:
(1) trying to add the disclaimer next to the person's name produces the error cannot coerce class ‘"xml_nodeset"’ to a data.frame.
(2) trying to extract the text from just the icon (and not the competitor names) produces a blank data frame.
library(rvest); library(tidyverse)
html = read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')
dq = data.frame(winner = html %>%
                  html_nodes('.match-card__competitor--red') %>%
                  html_text(trim = TRUE),
                opponent = html %>%
                  html_nodes('hr+ .match-card__competitor'),
                dq = html %>%
                  html_nodes('.match-card__disqualification') %>%
                  html_text())
This approach generally works only on tabular data, where you can be sure that the number of matches for each of those selectors is constant and the order is fixed. In your example you have:
127 matches for .match-card__competitor--red
127 matches for hr+ .match-card__competitor
14 matches for .match-card__disqualification (you get no results for this because the text is stored in the title attribute, so you should use html_attr("title") instead of html_text())
Basically, you are trying to combine columns of different lengths into the same data frame. Even if it worked, you would just add DSQ to the first 14 matches.
As you'd probably want to keep information about matches, participants, results and disqualifications instead of just having a list of participants, I'd suggest working with a list of match cards, i.e. extracting all required information from a single card without breaking relations, and then moving to the next card.
My purrr is far from perfect, but perhaps something like this:
library(rvest)
library(magrittr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
# helpers -----------------------------------------------------------------
# to keep matches with details (when/where) in header
is_valid_match <- function(element){
  return(length(html_elements(element, ".bracket-match-header")) > 0)
}
# detect winner
is_winner <- function(element){
  return(length(html_elements(element, ".match-competitor--loser")) < 1 )
}
# extract data from competitor sections
comp_details <- function(comp_card, prefix="_"){
  l = lst()
  l[paste(prefix, "n", sep = "")] <- comp_card %>% html_element(".match-card__competitor-n") %>% html_text()
  l[paste(prefix, "name", sep = "")] <- comp_card %>% html_element(".match-card__competitor-name") %>% html_text()
  l[paste(prefix, "club", sep = "")] <- comp_card %>% html_element(".match-card__club-name") %>% html_text()
  l[paste(prefix, "dq", sep = "")] <- comp_card %>% html_element(".match-card__disqualification") %>% html_attr("title")
  l[paste(prefix, "won", sep = "")] <- comp_card %>% html_element(".match-competitor--loser") %>% length() == 0
  return(l)
}
# scrape & process --------------------------------------------------------
html <- read_html('https://web.archive.org/web/20220913034642/https://www.bjjcompsystem.com/tournaments/1869/categories/2053162')
html %>%
  # collect all match cards
  html_elements("div.tournament-category__match") %>%
  keep(is_valid_match) %>%
  # apply anonymous function to every item in the list of match cards
  map(function(match_card){
    match_id <- match_card %>% html_element(".tournament-category__match-card") %>% html_attr("id")
    where <- match_card %>% html_element(".bracket-match-header__where") %>% html_text()
    when <- match_card %>% html_element(".bracket-match-header__when") %>% html_text()
    competitors <- html_nodes(match_card, ".match-card__competitor")
    # extract competitor data
    comp01 <- competitors[[1]] %>% comp_details(prefix = "comp01_")
    comp02 <- competitors[[2]] %>% comp_details(prefix = "comp02_")
    winner_idx <- competitors %>% detect_index(is_winner)
    # lst for creating a named list
    l <- lst(match_id, where, when, winner_idx)
    # combine all items and comp lists into single list
    l <- c(l, comp01, comp02)
    return(l)
  }) %>%
  # each resulting list item into single-row tibble
  map(as_tibble) %>%
  # reduce list of tibbles into single tibble
  reduce(bind_rows)
Result:
#> # A tibble: 65 × 14
#> match_id where when winne…¹ comp0…² comp0…³ comp0…⁴ comp0…⁵ comp0…⁶ comp0…⁷
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lgl> <chr>
#> 1 match-1-1 FIGH… Sat … 2 58 Christ… Rodrig… <NA> FALSE 66
#> 2 match-1-9 FIGH… Sat … 2 6 Melvin… GF Team Disqua… FALSE 66
#> 3 match-1-… FIGH… Sat … 2 47 Eric R… Atos J… <NA> FALSE 66
#> 4 match-1-… FIGH… Sat … 1 47 Eric R… Atos J… <NA> TRUE 10
#> 5 match-1-… FIGH… Sat … 2 42 Ivan M… CheckM… <NA> FALSE 66
#> 6 match-1-… FIGH… Sat … 2 18 Joel S… Gracie… <NA> FALSE 47
#> 7 match-1-… FIGH… Sat … 1 42 Ivan M… CheckM… <NA> TRUE 26
#> 8 match-1-… FIGH… Sat … 2 34 Matthe… Super … <NA> FALSE 18
#> 9 match-2-9 FIGH… Sat … 1 62 Bryan … Team J… <NA> TRUE 4
#> 10 match-2-… FIGH… Sat … 2 22 Steffe… Six Bl… <NA> FALSE 30
#> # … with 55 more rows, 4 more variables: comp02_name <chr>, comp02_club <chr>,
#> # comp02_dq <chr>, comp02_won <lgl>, and abbreviated variable names
#> # ¹​winner_idx, ²​comp01_n, ³​comp01_name, ⁴​comp01_club, ⁵​comp01_dq,
#> # ⁶​comp01_won, ⁷​comp02_n
Created on 2022-09-19 with reprex v2.0.2
Also note that not all matches have a winner and both participants can be disqualified (screenshot), so splitting them to winners & opponents might not be optimal.
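The core idea (query each card separately, and read the icon's title attribute) can be tried offline on a tiny made-up fragment. A minimal sketch, assuming rvest 1.0 or later for minimal_html(); the markup below is invented and only loosely modelled on the class names used above:

```r
library(rvest)

# Hypothetical markup: two competitor cards, only the first has a DQ icon
page <- minimal_html('
  <div class="match-card__competitor">
    <span class="match-card__competitor-name">Alice</span>
    <i class="match-card__disqualification" title="Disqualified by no show"></i>
  </div>
  <div class="match-card__competitor">
    <span class="match-card__competitor-name">Bob</span>
  </div>
')

cards <- html_elements(page, ".match-card__competitor")

# html_element() (singular) returns exactly one result per card,
# and a missing icon becomes NA, so the columns stay aligned
competitors <- data.frame(
  name = cards %>% html_element(".match-card__competitor-name") %>% html_text(trim = TRUE),
  dq = cards %>% html_element(".match-card__disqualification") %>% html_attr("title"),
  stringsAsFactors = FALSE
)
competitors
```

Because every column is derived from the same list of cards, the columns always have the same length, which is what the page-wide selectors in the question could not guarantee.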

R: how to average every 7th row

I want to take the average of each column (except the date) after every seven rows. I tried the approach below, but I was getting incorrect values. This method also seems really long. Is there a way to shorten it?
bankamerica = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/bankamerica.csv')
library(tidyverse)
GroupLabels <- 0:(nrow(bankamerica) - 1)%/% 7
bankamerica$Group <- GroupLabels
Avgs <- bankamerica %>%
  group_by(bankamerica$Group) %>%
  summarize(Avg = mean(bankamerica$tr))
EDITED: Just realized this code provides the incorrect values
I think you're on the right path.
bankamerica %>%
  mutate(group = cumsum(row_number() %% 7 == 1)) %>%
  group_by(group) %>%
  summarise(caldate = first(caldate), across(-caldate, mean)) %>%
  select(-group)
## A tibble: 144 × 3
# caldate tr var
# <chr> <dbl> <dbl>
# 1 1/2/01 28.9 -50.6
# 2 1/11/01 23.6 -45.4
# 3 1/23/01 20.9 -45
# 4 2/1/01 17.4 -48
# 5 2/12/01 14.4 -48
# 6 2/21/01 17 -48.9
# 7 3/2/01 19.1 -56
# 8 3/13/01 19.4 -56.9
# 9 3/22/01 23.3 -55.7
#10 4/2/01 7.71 -58.3
Note that this averages every 7 rows, not every 7 days, because there are missing days in the data.
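As for why the code in the question returned incorrect values: inside summarize(), bankamerica$tr refers to the whole column of the original data frame and ignores the grouping, so every group gets the overall mean. Referring to the bare column name fixes that. A small sketch on made-up data (14 rows, so two groups of 7); toy and its values are invented for illustration:

```r
library(dplyr)

toy <- data.frame(tr = 1:14)
toy$Group <- (seq_len(nrow(toy)) - 1) %/% 7   # 0 for rows 1-7, 1 for rows 8-14

# toy$tr bypasses the grouping: both groups get mean(1:14)
wrong <- toy %>%
  group_by(Group) %>%
  summarize(Avg = mean(toy$tr))

# the bare column name is evaluated within each group
right <- toy %>%
  group_by(Group) %>%
  summarize(Avg = mean(tr))

wrong$Avg  # 7.5 7.5
right$Avg  # 4 11
```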

Is there any function that give the changes between columns?

I have a df that looks like this.
head(dfhigh)
rownames 2015Y 2016Y 2017Y 2018Y 2019Y 2020Y 2021Y
1 Australia 29583.7403 48397.383 45220.323 68461.941 39218.044 20140.351 29773.188
2 Austria* 1294.5092 -8400.973 14926.164 5511.625 2912.795 -14962.963 5855.014
3 Belgium* -24013.3111 68177.596 -3057.153 27119.084 -9208.553 13881.481 22955.298
4 Canada 43852.7732 36061.859 22764.156 37653.521 50141.784 23174.006 59693.992
5 Chile* 20507.8407 12249.294 6128.716 7735.778 12499.238 8385.907 15251.538
6 Czech Republic 465.2137 9814.496 9517.948 11010.423 10108.914 9410.576 5805.084
I want to calculate the changes between years, so instead of the values, the table has the percentage of change (obviously deleting 2015Y).
Try this, using (current - previous) / previous * 100:
lst <- list()
nm <- names(dfhigh)[-1]
for(i in 1:(length(nm) - 1)){
  lst[[i]] <- (dfhigh[[nm[i+1]]] - dfhigh[[nm[i]]]) / dfhigh[[nm[i]]] * 100
}
ans <- do.call(cbind , lst)
colnames(ans) <- paste("ch_of" , nm[-1])
ans
You can change the formula to calculate the percentage however you want.
You could also use a tidyverse solution.
library(tidyverse)
dfhigh %>%
  pivot_longer(!rownames) %>%
  group_by(rownames) %>%
  mutate(value = 100*value/lag(value)-100) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)
# # A tibble: 6 × 8
# rownames `2015Y` `2016Y` `2017Y` `2018Y` `2019Y` `2020Y` `2021Y`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Australia NA 63.6 -6.56 51.4 -42.7 -48.6 47.8
# 2 Austria* NA -749. -278. -63.1 -47.2 -614. -139.
# 3 Belgium* NA -384. -104. -987. -134. -251. 65.4
# 4 Canada NA -17.8 -36.9 65.4 33.2 -53.8 158.
# 5 Chile* NA -40.3 -50.0 26.2 61.6 -32.9 81.9
# 6 CzechRepublic NA 2010. -3.02 15.7 -8.19 -6.91 -38.3
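If the year columns are already in chronological order, the same (current - previous) / previous * 100 calculation can also be written without a loop by shifting a numeric matrix one column. A base R sketch under that assumption; the toy data frame stands in for dfhigh and its values are invented:

```r
toy <- data.frame(
  rownames = c("A", "B"),
  `2015Y` = c(100, 50),
  `2016Y` = c(150, 25),
  `2017Y` = c(75, 50),
  check.names = FALSE   # keep the column names that start with a digit
)

m <- as.matrix(toy[-1])              # numeric matrix of the year columns
rownames(m) <- toy$rownames

prev <- m[, -ncol(m), drop = FALSE]  # every year except the last
curr <- m[, -1, drop = FALSE]        # every year except the first

# Percentage change, labelled by the later year of each pair
pct_change <- (curr - prev) / prev * 100
pct_change
```

The result has one column fewer than the input, so the first year (2015Y in the question) drops out instead of appearing as a column of NA.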

How to create a cumulative variable that groups by PERMNO and arranges by date in R

I have a dataframe with variables from COMPUSTAT containing data on various accounting items, including SG&A expenses from different companies.
I want to create a new variable in the dataframe which accumulates the SG&A expenses for each company in chronological order. I use PERMNO codes as the unique ID for each company.
I have tried this code; however, it does not seem to work:
crsp.comp2$cxsgaq <- crsp.comp2 %>%
  group_by(permno) %>%
  arrange(date) %>%
  mutate_at(vars(xsgaq), cumsum(xsgaq))
(xsgaq is the COMPUSTAT variable for SG&A expenses)
Thank you very much for your help
Your example code is attempting to write the entire data frame crsp.comp2 into a single column, crsp.comp2$cxsgaq.
Also, mutate_at() expects a function (such as cumsum) as its second argument rather than a call like cumsum(xsgaq). In your situation, use the standard mutate() function and assign the cxsgaq variable there.
crsp.comp2 <- crsp.comp2 %>%
  group_by(permno) %>%
  arrange(date) %>%
  mutate(cxsgaq = cumsum(xsgaq))
Reproducible example with iris dataset:
library(tidyverse)
iris %>%
  group_by(Species) %>%
  arrange(Sepal.Length) %>%
  mutate(C.Sepal.Width = cumsum(Sepal.Width))
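For a version that is closer to the firm/date setting and easy to check by hand, here is a small sketch on invented data; the column names follow the question, but the values and the data frame toy are made up:

```r
library(dplyr)

# Invented data: two firms, two quarters each, deliberately out of order
toy <- data.frame(
  permno = c(1, 2, 1, 2),
  date = as.Date(c("2020-06-30", "2020-03-31", "2020-03-31", "2020-06-30")),
  xsgaq = c(10, 5, 20, 7)
)

result <- toy %>%
  arrange(permno, date) %>%        # chronological order within each firm
  group_by(permno) %>%
  mutate(cxsgaq = cumsum(xsgaq)) %>%
  ungroup()

result$cxsgaq  # 20 30 5 12
```

Note that cumsum() returns NA from the first missing value onward, so if xsgaq contains NAs you may want to decide how to treat them before accumulating.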
Building on the answer from @m-viking, if using the WRDS PostgreSQL server, you would simply use window_order (from dbplyr) in place of arrange. (I use the Compustat firm identifier gvkey in place of permno so that this code works, but the idea is the same.)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer", sslmode='allow')
fundq <- tbl(pg, sql("SELECT * FROM comp.fundq"))
comp2 <-
  fundq %>%
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D")
comp2 <-
  comp2 %>%
  group_by(gvkey) %>%
  dbplyr::window_order(datadate) %>%
  mutate(cxsgaq = cumsum(xsgaq))
comp2 %>%
  filter(!is.na(xsgaq)) %>%
  select(gvkey, datadate, xsgaq, cxsgaq)
#> # Source: lazy query [?? x 4]
#> # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
#> # Groups: gvkey
#> # Ordered by: datadate
#> gvkey datadate xsgaq cxsgaq
#> <chr> <date> <dbl> <dbl>
#> 1 001000 1966-12-31 0.679 0.679
#> 2 001000 1967-12-31 1.02 1.70
#> 3 001000 1968-12-31 5.86 7.55
#> 4 001000 1969-12-31 7.18 14.7
#> 5 001000 1970-12-31 8.25 23.0
#> 6 001000 1971-12-31 7.96 30.9
#> 7 001000 1972-12-31 7.55 38.5
#> 8 001000 1973-12-31 8.53 47.0
#> 9 001000 1974-12-31 8.86 55.9
#> 10 001000 1975-12-31 9.59 65.5
#> # … with more rows
Created on 2021-04-05 by the reprex package (v1.0.0)
