Evening all, I'm having a few issues at the moment scraping data from multiple web pages.
library(RCurl)
library(XML)
tables <- readHTMLTable(getURL("https://www.basketball-reference.com/leagues/NBA_2018_games.html"))
for (i in c("october", "november", "december", "january")) {
readHTMLTable(getURL(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",i,".html")))
regular <- tables[["schedule"]]
write.csv(regular, file = paste0("./", i, i, ".csv"))
}
I'm having an issue where it doesn't appear to loop through the months and just saves 4 files, all containing the October data.
Any help appreciated.
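For what it's worth, the likely culprit is that the result of readHTMLTable() inside the loop is never assigned, so tables still holds the page fetched before the loop and the same October schedule gets written four times. A minimal sketch of the fix, keeping the same RCurl/XML workflow:
library(RCurl)
library(XML)

for (i in c("october", "november", "december", "january")) {
  # assign the downloaded tables on each iteration
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", i, ".html")
  tables <- readHTMLTable(getURL(url))
  regular <- tables[["schedule"]]
  write.csv(regular, file = paste0("./", i, ".csv"))
}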
This is not the most elegant way, but it works well. Hope it helps.
Code for the web scraping:
rm(list = ls())
if(!require("rvest")){install.packages("rvest");library("rvest")}
for (i in c("october", "november", "december", "january")) {
nba_url <- read_html(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",i,".html"))
#Left part of the table
left <- nba_url %>%
html_nodes(".left") %>% # left-aligned cells of the table
html_text()
left <- left[-length(left)] # drop the last element
left <- left[-(1:4)] # drop the first four elements
#Assign specific values
Date <- left[seq(1, length(left), 4)]
Visitor <- left[seq(2, length(left), 4)]
Home <- left[seq(3, length(left), 4)]
#Right part of the table
right <- nba_url %>%
html_nodes(".right") %>% # right-aligned cells of the table
html_text()
right <- right[-length(right)] # drop the last element
right <- right[-(1:2)] # drop the first two elements
#Assign specific values
Start <- right[seq(1, length(right), 3)]
PTS1 <- right[seq(2, length(right), 3)]
PTS2 <- right[seq(3, length(right), 3)]
nba_data <- data.frame(Date, Start, Visitor, PTS1, Home, PTS2)
write.csv(nba_data, file = paste0("./", i, ".csv"))
}
This is a solution using the tidyverse to scrape this website. But first we check the robots.txt file of the site to get a sense of the rate limit for requests. See the post Analyzing “Crawl-Delay” Settings in Common Crawl robots.txt Data with R for further info.
library(spiderbar)
library(robotstxt)
rt <- robxp(get_robotstxt("https://www.basketball-reference.com"))
crawl_delays(rt)
#> agent crawl_delay
#> 1 * 3
#> 2 ahrefsbot -1
#> 3 twitterbot -1
#> 4 slysearch -1
#> 5 ground-control -1
#> 6 groundcontrol -1
#> 7 matrix -1
#> 8 hal9000 -1
#> 9 carmine -1
#> 10 the-matrix -1
#> 11 skynet -1
We are interested in the `*` value: we have to wait a minimum of 3 seconds between requests. We will use 5 seconds.
We use the tidyverse ecosystem to build the urls and iterate through them to get a table with all the data.
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
month_sub <- c("october", "november", "december", "january")
urls <- map_chr(month_sub, ~ paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", .,".html"))
urls
#> [1] "https://www.basketball-reference.com/leagues/NBA_2018_games-october.html"
#> [2] "https://www.basketball-reference.com/leagues/NBA_2018_games-november.html"
#> [3] "https://www.basketball-reference.com/leagues/NBA_2018_games-december.html"
#> [4] "https://www.basketball-reference.com/leagues/NBA_2018_games-january.html"
pb <- progress_estimated(length(urls))
map(urls, ~{
url <- .
pb$tick()$print()
Sys.sleep(5) # pause 5 seconds between requests
tables <- read_html(url) %>%
# we select the table part by its table id tag
html_nodes("#schedule") %>%
# we extract the table
html_table() %>%
# we get a one-element list, so we flatten it to get a tibble
flatten_df()
}) -> tables
# we get a list of tables, one per month
str(tables, 1)
#> List of 4
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 104 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 213 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 227 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 216 obs. of 8 variables:
# we can get all the data in one table by binding rows.
# As we saw on the website, there are 2 empty columns with no names,
# so we take care of them with repair_names before row binding
res <- tables %>%
map_df(tibble::repair_names)
res
#> # A tibble: 760 x 8
#> Date `Start (ET)` `Visitor/Neutral` PTS
#> <chr> <chr> <chr> <int>
#> 1 Tue, Oct 17, 2017 8:01 pm Boston Celtics 102
#> 2 Tue, Oct 17, 2017 10:30 pm Houston Rockets 121
#> 3 Wed, Oct 18, 2017 7:30 pm Milwaukee Bucks 100
#> 4 Wed, Oct 18, 2017 8:30 pm Atlanta Hawks 111
#> 5 Wed, Oct 18, 2017 7:00 pm Charlotte Hornets 102
#> 6 Wed, Oct 18, 2017 7:00 pm Brooklyn Nets 140
#> 7 Wed, Oct 18, 2017 8:00 pm New Orleans Pelicans 103
#> 8 Wed, Oct 18, 2017 7:00 pm Miami Heat 116
#> 9 Wed, Oct 18, 2017 10:00 pm Portland Trail Blazers 76
#> 10 Wed, Oct 18, 2017 10:00 pm Houston Rockets 100
#> # ... with 750 more rows, and 4 more variables: `Home/Neutral` <chr>,
#> # V1 <chr>, V2 <chr>, Notes <lgl>
I tried to rename columns, which is actually a very straightforward operation, but I am still getting errors. I tried two methods and neither of them works. Can anyone explain what needs to be done to rename columns without getting these strange errors? I tried several SO posts but none of them really worked.
library(pacman)
#> Warning: package 'pacman' was built under R version 4.2.1
p_load(dplyr, readr)
data = read_csv("https://raw.githubusercontent.com/srk7774/data/master/august_october_2020.csv",
col_names = TRUE)
#> Rows: 16 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): X.1
#> dbl (2): Total Agree - August 2020, Total Agree - October 2020
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
column_recodes <- c(X.1 = "country",
august = "Total Agree - August 2020",
october = "`Total Agree - October 2020",
`Another non-existent column name` = "bar")
data %>% rename_with(~recode(., !!!column_recodes))
#> # A tibble: 16 × 3
#> country `Total Agree - August 2020` `Total Agree - October 2020`
#> <chr> <dbl> <dbl>
#> 1 Total 77 73
#> 2 India 87 87
#> 3 China 97 85
#> 4 South Korea 84 83
#> 5 Brazil 88 81
#> 6 Australia 88 79
#> 7 United Kingdom 85 79
#> 8 Mexico 75 78
#> 9 Canada 76 76
#> 10 Germany 67 69
#> 11 Japan 75 69
#> 12 South Africa 64 68
#> 13 Italy 67 65
#> 14 Spain 72 64
#> 15 United States 67 64
#> 16 France 59 54
data %>%
rename(country = X.1,
august = Total.Agree...August.2020,
october = Total.Agree...October.2020)
#> Error in `chr_as_locations()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `Total.Agree...August.2020` doesn't exist.
Created on 2022-10-24 by the reprex package (v2.0.1)
Add backticks when using names with spaces:
data %>%
rename(country = X.1,
august = `Total Agree - August 2020`,
october = `Total Agree - October 2020`)
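A related trick, assuming dplyr >= 1.0: rename() accepts a named lookup vector through any_of(), which silently skips entries whose old names don't exist. Note the vector must be written as c(new = "old"), the reverse of the one in the question:
library(dplyr)

lookup <- c(country = "X.1",
            august  = "Total Agree - August 2020",
            october = "Total Agree - October 2020",
            bar     = "Another non-existent column name") # silently skipped

data %>% rename(any_of(lookup))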
I can download in the browser a file from this website
https://www.cmegroup.com/ftp/pub/settle/comex_future.csv
However, when I try the following
url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
dest <- "C:\\COMEXfut.csv"
download.file(url, dest)
I get the following error message
Error in download.file(url, dest) :
cannot open URL 'https://www.cmegroup.com/ftp/pub/settle/comex_future.csv'
In addition: Warning message:
In download.file(url, dest) :
InternetOpenUrl failed: 'The operation timed out'
even after setting:
options(timeout = max(600, getOption("timeout")))
Any idea why this is happening? Thanks!
The problem here is that the site from which you are downloading needs a couple of additional headers. The easiest way to supply them is using the httr package
library(httr)
url <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
UA <- paste('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0)',
'Gecko/20100101 Firefox/98.0')
res <- GET(url, add_headers(`User-Agent` = UA, Connection = 'keep-alive'))
This should download in less than a second.
If you want to save the file you can do
writeBin(res$content, 'myfile.csv')
Or if you just want to read the data straight into R without even saving it, you can do:
content(res)
#> Rows: 527 Columns: 20
#> -- Column specification ----------------------------------------------------------------
#> Delimiter: ","
#> chr (10): PRODUCT SYMBOL, CONTRACT MONTH, CONTRACT DAY, CONTRACT, PRODUCT DESCRIPTIO...
#> dbl (10): CONTRACT YEAR, OPEN, HIGH, LOW, LAST, SETTLE, EST. VOL, PRIOR SETTLE, PRIO...
#>
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 527 x 20
#> `PRODUCT SYMBOL` `CONTRACT MONTH` `CONTRACT YEAR` `CONTRACT DAY` CONTRACT
#> <chr> <chr> <dbl> <chr> <chr>
#> 1 0GC 07 2022 NA 0GCN22
#> 2 4GC 03 2022 NA 4GCH22
#> 3 4GC 05 2022 NA 4GCK22
#> 4 4GC 06 2022 NA 4GCM22
#> 5 4GC 08 2022 NA 4GCQ22
#> 6 4GC 10 2022 NA 4GCV22
#> 7 4GC 12 2022 NA 4GCZ22
#> 8 4GC 02 2023 NA 4GCG23
#> 9 4GC 04 2023 NA 4GCJ23
#> 10 4GC 06 2023 NA 4GCM23
#> # ... with 517 more rows, and 15 more variables: PRODUCT DESCRIPTION <chr>, OPEN <dbl>,
#> # HIGH <dbl>, HIGH AB INDICATOR <chr>, LOW <dbl>, LOW AB INDICATOR <chr>, LAST <dbl>,
#> # LAST AB INDICATOR <chr>, SETTLE <dbl>, PT CHG <chr>, EST. VOL <dbl>,
#> # PRIOR SETTLE <dbl>, PRIOR VOL <dbl>, PRIOR INT <dbl>, TRADEDATE <chr>
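As a side note, if you would rather stay with download.file(), R (>= 3.6.0) can send the same headers itself; a minimal sketch using the libcurl method:
url  <- "https://www.cmegroup.com/ftp/pub/settle/comex_future.csv"
dest <- "C:\\COMEXfut.csv"
UA   <- paste('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0)',
              'Gecko/20100101 Firefox/98.0')

# the headers argument is honoured by the "libcurl" and "wininet" methods
download.file(url, dest, method = "libcurl",
              headers = c(`User-Agent` = UA, Connection = 'keep-alive'))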
Let's say I have school enrollment data stored in this format, with start date and end date fields:
unique_name   enrollment_start   enrollment_end
Amy           1, Jan, 2017       30, Sep, 2018
Franklin      1, Jan, 2017       19, Feb, 2017
Franklin      5, Jun, 2017       4, Feb, 2018
Franklin      21, Oct, 2018      9, Mar, 2019
Samir         1, Jun, 2017       4, Feb, 2017
Samir         5, Apr, 2017       12, Sep, 2018
...           ...                ...
And I want to produce aggregated counts of enrollment by month like this:
month       enrollment_count
Jan, 2017   25
Feb, 2017   31
Mar, 2017   19
Apr, 2017   34
May, 2017   29
Jun, 2017   32
...         ...
Is there an easy way to accomplish this with dplyr?
The only way I can think of is to loop over a list of all months from month_min to month_max and count the number of rows with start or stop dates that fall inside each month. Hoping for easier code.
I think this can be done pretty elegantly with the clock and ivs packages. You seem to want monthly counts, so you can use the year-month type from clock. And ivs is a package dedicated to working with intervals of data, which is exactly what you have here. Here we assume that if your enrollment start/end fell in a month, then you should be considered active in that month.
library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
enrollments <- tribble(
~unique_name, ~enrollment_start, ~enrollment_end,
"Amy", "1, Jan, 2017", "30, Sep, 2018",
"Franklin", "1, Jan, 2017", "19, Feb, 2017",
"Franklin", "5, Jun, 2017", "4, Feb, 2018",
"Franklin", "21, Oct, 2018", "9, Mar, 2019",
"Samir", "1, Jan, 2017", "4, Feb, 2017",
"Samir", "5, Apr, 2017", "12, Sep, 2018"
)
# Parse these into "day" precision year-month-day objects, then restrict
# them to just "month" precision because that is all we need
enrollments <- enrollments %>%
mutate(
start = enrollment_start %>%
year_month_day_parse(format = "%d, %b, %Y") %>%
calendar_narrow("month"),
end = enrollment_end %>%
year_month_day_parse(format = "%d, %b, %Y") %>%
calendar_narrow("month") %>%
add_months(1),
.keep = "unused"
)
enrollments
#> # A tibble: 6 × 3
#> unique_name start end
#> <chr> <ymd<month>> <ymd<month>>
#> 1 Amy 2017-01 2018-10
#> 2 Franklin 2017-01 2017-03
#> 3 Franklin 2017-06 2018-03
#> 4 Franklin 2018-10 2019-04
#> 5 Samir 2017-01 2017-03
#> 6 Samir 2017-04 2018-10
# Create an interval vector, note that these are half-open intervals.
# The month on the RHS is not included, which is why we added 1 to `end` above.
enrollments <- enrollments %>%
mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#> unique_name active
#> <chr> <iv<ymd<month>>>
#> 1 Amy [2017-01, 2018-10)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir [2017-01, 2017-03)
#> 6 Samir [2017-04, 2018-10)
# We'll generate a sequence of months that will be part of the final result
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#> month
#> <ymd<month>>
#> 1 2017-01
#> 2 2017-02
#> 3 2017-03
#> 4 2017-04
#> 5 2017-05
#> 6 2017-06
#> 7 2017-07
#> 8 2017-08
#> 9 2017-09
#> 10 2017-10
#> # … with 17 more rows
# To actually compute the counts, use `iv_count_between()`, which counts up all
# instances where `month[i]` is between any interval in `enrollments$active`
months %>%
mutate(count = iv_count_between(month, enrollments$active))
#> # A tibble: 27 × 2
#> month count
#> <ymd<month>> <int>
#> 1 2017-01 3
#> 2 2017-02 3
#> 3 2017-03 1
#> 4 2017-04 2
#> 5 2017-05 2
#> 6 2017-06 3
#> 7 2017-07 3
#> 8 2017-08 3
#> 9 2017-09 3
#> 10 2017-10 3
#> # … with 17 more rows
Created on 2022-04-05 by the reprex package (v2.0.1)
Create a list column containing the sequence of months between each set of dates, then unnest and count.
Notes:
I use lubridate::floor_date() to round enrollment_start to the first day of the month. Otherwise, seq() may skip months if enrollment_start is on the 29th of the month or later.
The fifth row of your example data has enrollment_start later than enrollment_end -- I assumed this was an error and removed it.
library(tidyverse)
library(lubridate)
enrollments %>%
mutate(
across(c(enrollment_start, enrollment_end), dmy), # convert to date
month = map2(
floor_date(enrollment_start, unit = "month"), # round to 1st day
enrollment_end,
~ seq(.x, .y, by = "month")
)
) %>%
unnest_longer(month) %>%
count(month, name = "enrollment_count")
#> # A tibble: 27 x 2
#> month enrollment_count
#> <date> <int>
#> 1 2017-01-01 2
#> 2 2017-02-01 2
#> 3 2017-03-01 1
#> 4 2017-04-01 2
#> 5 2017-05-01 2
#> 6 2017-06-01 3
#> 7 2017-07-01 3
#> 8 2017-08-01 3
#> 9 2017-09-01 3
#> 10 2017-10-01 3
#> # ... with 17 more rows
Created on 2022-03-25 by the reprex package (v2.0.1)
Here's my take on this with dplyr and tidyr.
Pivot the data, creating multiple rows per student, and format your dates.
Group on student and generate the missing months using complete().
Group on the generated periods and count.
data_long <- data %>%
pivot_longer(cols = c('enrollment_start', 'enrollment_end')) %>%
mutate(value = as.Date(value, format = "%d, %b, %Y")) %>%
mutate(value = lubridate::floor_date(value, 'month')) %>%
# unique_name name             value
# <chr>       <chr>            <date>
# 1 Amy       enrollment_start 2017-01-01
# 2 Amy       enrollment_end   2018-09-01
# 3 Franklin  enrollment_start 2017-01-01
# 4 Franklin  enrollment_end   2017-02-01
# ..etc.
group_by(unique_name) %>%
complete(value = seq.Date(min(value), max(value), by = "month")) %>%
arrange(unique_name, value)
enrollment_count <- data_long %>%
group_by(value) %>%
count()
Edit: I forgot to floor the dates in order to properly aggregate per period at the end. Added floor_date from lubridate to do this.
I am very new to coding and just started doing some R graphics, and now I am kinda lost with my data analysis and need some light! I am practicing some analyses, and I have a very long dataset with 19 countries x 12 months x 22 products, with a profit for every month. Kinda like this:
Country Month Product Profit
Brazil Jan A 50
Brazil fev A 80
Brazil mar A 15
Austria Jan A 35
Austria fev A 80
Austria mar A 47
France Jan A 21
France fev A 66
France mar A 15
[...]
France Dez C 40 etc...
I was thinking of doing one graph showing the profits through the year and another for every country, so I could see the top and bottom 2 countries. I wanted to have something like:
All Countries Jan 106 or Brazil 2021 145
All Countries Fev 146 Austria 2021 162
All Countries Mar 77 France 2021 112
but the sum function can't help with character types, and as I have a long list, I don't know how to sum only part of the column.
Sorry if it got confusing.
The package dplyr has quite a natural syntax for this:
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(
Country = rep(c(rep("Brazil", 3L), rep("Austria", 3L), rep("France", 3L)), 3L),
Profit = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
Month = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"), 3L),
Year = sort(rep(c(2019, 2020, 2021), 9L))
)
df %>%
group_by(Country, Month) %>%
summarize(sum = sum(Profit))
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: Country [3]
#> Country Month sum
#> <chr> <chr> <dbl>
#> 1 Austria Apr 105
#> 2 Austria Jun 141
#> 3 Austria May 240
#> 4 Brazil Feb 240
#> 5 Brazil Jan 150
#> 6 Brazil Mar 45
#> 7 France Aug 198
#> 8 France Jul 63
#> 9 France Sep 45
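If you also want the "All Countries" totals per month from your first example, group by Month alone; likewise Country plus Year gives the per-country yearly sums (same df as above):
# total profit per month, across all countries
df %>%
  group_by(Month) %>%
  summarize(Profit = sum(Profit))

# total profit per country per year
df %>%
  group_by(Country, Year) %>%
  summarize(Profit = sum(Profit))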
Using base R, you can try something along these lines.
# sum of profit per month
out1 <- tapply(df$Profit, df$Month, sum)
# sum of profit per year per country
out2 <- data.frame(
profit = sapply(split(df, f = ~ df$Country + df$Year), function(x) sum(x$Profit))
)
out2$Country <- gsub('\\.[0-9]*', '', rownames(out2))
out2$Year <- gsub('[a-zA-Z]*\\.', '', rownames(out2))
rownames(out2) <- NULL
Output
> out1
Apr Aug Feb Jan Jul Jun Mar May Sep
105 198 240 150 63 141 45 240 45
> head(out2)
profit Country Year
1 162 Austria 2019
2 145 Brazil 2019
3 102 France 2019
4 162 Austria 2020
5 145 Brazil 2020
6 102 France 2020
Data
# sample data
df <- data.frame(
Country = rep(c(rep('Brazil',3L),rep('Austria',3L),rep('France',3L)), 3L),
Profit = rep(c(50,80,15,35,80,47,21,66,15), 3L),
Month = rep(c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep'),3L),
Year = sort(rep(c(2019,2020,2021), 9L))
)
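aggregate() is another base R route that returns a tidy data frame directly, with no row-name gymnastics (same df as above):
# profit per month across all countries
aggregate(Profit ~ Month, data = df, FUN = sum)

# profit per country per year
aggregate(Profit ~ Country + Year, data = df, FUN = sum)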
library(XML)
library(dplyr)
library(rvest)
presid <- read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_Peru") %>% # read the html page
html_nodes("table") %>% # extract nodes which contain a table
.[3] %>% # select the node which contains the relevant table
html_table(header = NA,
trim = T) # extract the table
t3 <- presid[[1]] # flatten data
t4 <- t3[unique(t3$N), ] # eliminate duplicates
t5 <- subset(t4, !is.na(President))
I need to read this table and filter the data in a way that does not lose a lot of information.
The loss of rows is significant: the data goes from 98 rows in t3, to 72 in t4, and to 63 in t5, when in reality I only need to reduce it from 98 rows to 84, which can be filtered through column N.
I have tried these expressions, but with no result:
strsplit(as.character(t3$N), split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)
and also:
grep("[[:numeric:]]{2, }",N,value=T)
The rows of column N that I need to filter out are those with a decimal part: 0.5, 2.5, 6.5, 6.6, and others ending in .5; in total there are 14 rows that I must remove, so my data frame would be reduced from 98 to 84 rows.
I could filter by date, but I have not found much material on that to help me.
Thanks
Since the data from the website has duplicate column names, we can use janitor::clean_names() to get clean column names and then keep only those rows that have whole numbers in the n column.
library(rvest)
library(dplyr)
read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_Peru") %>%
html_nodes("table") %>%
.[3] %>%
html_table(header = NA,trim = T) %>%
.[[1]] %>%
janitor::clean_names() %>%
filter(grepl('^\\d+$', n)) -> result
result
# A tibble: 85 x 10
# n president president_2 president_3 term_of_office term_of_office_2 title
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 "" "" José de la R… 28 February 18… 23 June 1823 President of …
# 2 2 "" "" José Bernard… 16 August 1823 18 November 1823 President of …
# 3 2 "" "" José Bernard… 18 November 18… 10 February 1824 Constitutiona…
# 4 3 "" "" José de La M… 10 June 1827 7 June 1829 Constitutiona…
# 5 4 "" "" Agustín Gama… 7 June 1829 19 December 1829 Antonio Gutié…
# 6 4 "" "" Agustín Gama… 1 September 18… 19 December 1829 Provisional P…
# 7 4 "" "" Agustín Gama… 19 December 18… 19 December 1833 Constitutiona…
# 8 5 "" "" Luis José de… 21 December 18… 21 December 1833 Provisional P…
# 9 6 "" "" Felipe Salav… 25 February 18… 7 February 1836 Supreme Head …
#10 7 "" "" Agustín Gama… 20 January 183… 15 August 1839 Provisional P…
# … with 75 more rows, and 3 more variables: form_of_entry <chr>, vice_president <chr>,
# vice_president_2 <chr>