Edit: It looks like this is a known issue with the "cascade" method. Results that come back as NA after the first attempt are typed as logical, so converting that column to doubles fails when subsequent methods return lat/lons.
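A minimal illustration of the clash (my own sketch, not the package internals): a lat column holding only NA is typed <logical>, and row-assigning a double into it triggers the same lossy-cast error that appears in the backtrace below.
library(tibble)
tb <- tibble(lat = c(NA, NA))  # plain NA with no other values is <logical>
try(tb[1, "lat"] <- 36.2)      # errors: can't convert <double> to <logical>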
Data: I have a list of addresses that I need to geocode. I'm using lapply() to split-apply-combine, which works, but very slowly. My attempt to split further, apply, and combine returns errors about dim names and sizes that are confusing to me.
# example data
library(dplyr)
library(tidygeocoder)
url <- "https://www.briandunning.com/sample-data/us-500.zip"
download.file(url = url, destfile = basename(url))
adds <- readr::read_csv(basename(url)) %>%
  select(address, city, county, state, zip) %>%
  mutate(date = seq.Date(as.Date('2015-01-01'), to = Sys.Date(), length.out = 500)) %>%
  mutate(year = lubridate::year(date)) %>%
  # to keep it small
  sample_n(20)
This works: split addresses by year, apply the tidygeocoder function to return lat/lons, and recombine.
adds_by_year <- adds %>% split(.$year)
geo_list <- lapply(adds_by_year, function(x) {
  geo <- geocode(.tbl = x,
                 street = address,
                 city = city,
                 county = county,
                 state = state,
                 postalcode = zip,
                 # cascade method uses all options (census, osm, etc.)
                 # takes longer but may be more accurate
                 method = "cascade", timeout = 500) %>%
    filter(!is.na(lat))
  return(geo)
})
out <- bind_rows(geo_list)
Splitting by year-month instead does not:
adds <- adds %>%
  mutate(yrmn = zoo::as.yearmon(date))
adds_by_yrm <- adds %>% split(.$yrmn)
geo_list <- lapply(adds_by_yrm, function(x) {
  geo <- geocode(.tbl = x,
                 street = address,
                 city = city,
                 county = county,
                 state = state,
                 postalcode = zip,
                 # cascade method uses all options (census, osm, etc.)
                 # takes longer but may be more accurate
                 method = "cascade", timeout = 500) %>%
    filter(!is.na(lat))
  return(geo)
})
out <- bind_rows(geo_list)
Returns this error:
Error: Assigned data `retry_results` must be compatible with existing data.
ℹ Error occurred for column `lat`.
x Can't convert from <double> to <logical> due to loss of precision.
* Locations: 1.
Run `rlang::last_error()` to see where the error occurred.
I did some searching and found this, but the proposed solution (wrapping x in as.data.frame()) resulted in the same error.
Any insight is appreciated. I've looked into using purrr, but I'm not sure I grok it completely.
Here is the full backtrace, which I'm not familiar enough with to parse completely:
Backtrace:
█
1. ├─base::lapply(...)
2. │ └─global::FUN(X[[i]], ...)
3. │ └─tidygeocoder::geocode(...)
4. │ ├─base::do.call(geo, geo_args)
5. │ └─(function (address = NULL, street = NULL, city = NULL, county = NULL, ...
6. │ ├─base::do.call(geo_cascade, all_args[!names(all_args) %in% c("method")])
7. │ └─(function (..., cascade_order = c("census", "osm")) ...
8. │ ├─base::`[<-`(...)
9. │ └─tibble:::`[<-.tbl_df`(...)
10. │ └─tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
11. │ └─tibble:::tbl_subassign_row(x, i, value, value_arg)
12. │ ├─base::withCallingHandlers(...)
13. │ └─vctrs::`vec_slice<-`(`*tmp*`, i, value = value[[j]])
14. │ └─(function () ...
15. │ └─vctrs:::vec_cast.logical.double(...)
16. │ └─vctrs::maybe_lossy_cast(out, x, to, lossy, x_arg = x_arg, to_arg = to_arg)
17. │ ├─base::withRestarts(...)
18. │ │ └─base:::withOneRestart(expr, restarts[[1L]])
19. │ │ └─base:::doWithOneRestart(return(expr), restart)
20. │ └─vctrs:::stop_lossy_cast(...)
21. │ └─vctrs:::stop_vctrs(...)
22. │ └─rlang::abort(message, class = c(class, "vctrs_error"), ...)
23. │ └─rlang:::signal_abort(cnd)
24. │ └─base::signalCondition(cnd)
25. └─(function (cnd) ...
It is working with dplyr 1.0.6:
dplyr::bind_rows(geo_list)
# A tibble: 8 x 11
address city county state zip date year yrmn lat long geo_method
<chr> <chr> <chr> <chr> <chr> <date> <dbl> <yearmon> <dbl> <dbl> <chr>
1 134 Lewis Rd Nashville Davidson TN 37211 2016-11-06 2016 Nov 2016 36.2 -86.8 osm
2 6651 Municipal Rd Houma Terrebonne LA 70360 2017-02-03 2017 Feb 2017 29.6 -90.7 osm
3 189 Village Park Rd Crestview Okaloosa FL 32536 2017-08-25 2017 Aug 2017 30.8 -86.6 osm
4 9122 Carpenter Ave New Haven New Haven CT 06511 2018-01-14 2018 Jan 2018 41.5 -72.8 osm
5 5221 Bear Valley Rd Nashville Davidson TN 37211 2018-09-17 2018 Sep 2018 36.1 -86.8 osm
6 28 S 7th St #2824 Englewood Bergen NJ 07631 2020-03-31 2020 Mar 2020 40.9 -74.0 census
7 5 E Truman Rd Abilene Taylor TX 79602 2021-02-25 2021 Feb 2021 32.5 -99.7 osm
8 9 Front St Washington District of Columbia DC 20001 2021-05-16 2021 May 2021 38.9 -77.0 osm
I noticed that some list elements have 0 rows. We could remove those 0-row elements and then use bind_rows:
library(purrr)
library(dplyr)
geo_list %>%
  keep(~ NROW(.x) > 0) %>%
  bind_rows()
# A tibble: 8 x 11
address city county state zip date year yrmn lat long geo_method
<chr> <chr> <chr> <chr> <chr> <date> <dbl> <yearmon> <dbl> <dbl> <chr>
1 134 Lewis Rd Nashville Davidson TN 37211 2016-11-06 2016 Nov 2016 36.2 -86.8 osm
2 6651 Municipal Rd Houma Terrebonne LA 70360 2017-02-03 2017 Feb 2017 29.6 -90.7 osm
3 189 Village Park Rd Crestview Okaloosa FL 32536 2017-08-25 2017 Aug 2017 30.8 -86.6 osm
4 9122 Carpenter Ave New Haven New Haven CT 06511 2018-01-14 2018 Jan 2018 41.5 -72.8 osm
5 5221 Bear Valley Rd Nashville Davidson TN 37211 2018-09-17 2018 Sep 2018 36.1 -86.8 osm
6 28 S 7th St #2824 Englewood Bergen NJ 07631 2020-03-31 2020 Mar 2020 40.9 -74.0 census
7 5 E Truman Rd Abilene Taylor TX 79602 2021-02-25 2021 Feb 2021 32.5 -99.7 osm
8 9 Front St Washington District of Columbia DC 20001 2021-05-16 2021 May 2021 38.9 -77.0 osm
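The whole pipeline can also be written end to end with purrr (a sketch of the same logic as the lapply() version above):
library(purrr)
library(dplyr)
library(tidygeocoder)
out <- adds_by_year %>%
  map(~ geocode(.tbl = .x,
                street = address, city = city, county = county,
                state = state, postalcode = zip,
                method = "cascade", timeout = 500) %>%
        filter(!is.na(lat))) %>%
  keep(~ NROW(.x) > 0) %>%  # drop empty results before binding
  bind_rows()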
SOLVED:
update dplyr (thanks to akrun)
update tidygeocoder: it turns out the issue was bind_rows()-ing numeric results with NA results, which was dealt with in a newer release that I didn't have yet. Posting my code here because there are several useful flags in the geocode() function for debugging:
adds_by_yrm <- adds %>% split(.$yrmn)
geo_list <- lapply(adds_by_yrm, function(x) {
  geo <- geocode(.tbl = as.data.frame(x),
                 street = address,
                 city = city,
                 county = county,
                 state = state,
                 postalcode = zip,
                 # cascade method uses all options (census, osm, etc.)
                 # takes longer but may be more accurate
                 method = "cascade",
                 cascade_order = c("census", "osm"),
                 timeout = 500,
                 unique_only = TRUE,
                 verbose = TRUE) %>%
    filter(!is.na(lat))
  return(geo)
})
out <- geo_list %>%
  purrr::keep(~ NROW(.x) > 0) %>%
  bind_rows()
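As a quick check that the dplyr upgrade resolves the type clash: with a current dplyr, an all-NA (logical) lat column is promoted to double when bound to numeric results.
library(dplyr)
library(tibble)
bind_rows(tibble(lat = NA), tibble(lat = 36.2))
# a 2 x 1 tibble; lat is <dbl>, with NA in the first row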
Related
I have a data frame that is supposed to show the winners of a tournament and their opponents. Currently the loser is in every other row. So, row 1 is the winner, row 2 is the loser, row 3 is the winner, row 4 is the loser, and so on.
I want the winner and their opponent to be next to each other so that it's easier to see who competed against whom. The tricky part is keeping the gym, name, and competitor number for each person together in the same row.
How do I move every other row to a new column so that the winner and their opponent are in the same row?
y = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/y.csv')
# FAILED ATTEMPT
library(data.table)
z <- dcast(setDT(y)[, grp := gl(.N, 2, .N)], grp ~ rowid(grp),
           value.var = setdiff(names(y), 'grp'))[, grp := NULL][]
[Screenshots omitted: one showed what my df currently looks like, the other something similar to what I want it to look like. Note that the two photos show different data sets.]
Using dplyr you could do:
library(dplyr)
read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/y.csv') %>%
  group_by(fight, date) %>%
  summarise(division = first(division),
            competitor_1 = first(competitor),
            name_1 = first(name),
            competitor_2 = last(competitor),
            name_2 = last(name))
#> `summarise()` has grouped output by 'fight'. You can override using the
#> `.groups` argument.
#> # A tibble: 61 x 7
#> # Groups: fight [26]
#> fight date division competitor_1 name_1 compe~1 name_2
#> <chr> <chr> <chr> <int> <chr> <int> <chr>
#> 1 BYE BYE Master 2 1 Rafael M~ 1 Rafae~
#> 2 FIGHT 19 Thu 09/01 at 12:14 PM Master 2 2 Piter Fr~ 63 Alan ~
#> 3 FIGHT 20 Thu 09/01 at 01:01 PM Master 2 16 Marques ~ 55 Diego~
#> 4 FIGHT 20 Thu 09/01 at 12:13 PM Master 2 28 Kenned D~ 44 Verge~
#> 5 FIGHT 22 Thu 09/01 at 12:27 PM Master 2 4 Marcus V~ 52 Kian ~
#> 6 FIGHT 23 Thu 09/01 at 12:33 PM Master 2 30 Adam Col~ 46 Steph~
#> 7 FIGHT 23 Thu 09/01 at 12:54 PM Master 2 31 Namrod B~ 47 Stefa~
#> 8 FIGHT 23 Thu 09/01 at 12:58 PM Master 2 13 David Ch~ 53 Joshu~
#> 9 FIGHT 24 Thu 09/01 at 01:08 PM Master 2 3 Sandro G~ 56 Carlo~
#> 10 FIGHT 24 Thu 09/01 at 12:35 PM Master 2 8 Rafael R~ 60 Andre~
#> # ... with 51 more rows, and abbreviated variable name 1: competitor_2
Created on 2022-09-16 with reprex v2.0.2
There are some problems with your dataset, e.g. for "FIGHT 22" there are four entries (from your description I expected two entries).
division gender belt weight fight date competitor name gym
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 4 Marcus V. C. Antelante Ares BJJ
2 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 62 Andrew E. Ganthier Renzo Gracie Academy
3 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 11 Jimmy Dang Khoa Tat CheckMat
4 Master 2 Male BLACK Middle FIGHT 22 Thu 09/01 at 12:27 PM 52 Kian Takumi Kadota Brasa CTA
The same problem exists for fights 26 and 35. Assuming these are corrected, and assuming odd rows contain winners and even rows contain losers, the following code should work (using tidyverse):
library(dplyr)
library(tidyr)
y %>%
  select(-X) %>%   # drop the CSV's row-index column so each pair collapses to one row
  mutate(outcome = if_else(row_number() %% 2 == 1, "winner", "loser")) %>%
  pivot_wider(names_from = outcome, values_from = c(competitor, name, gym))
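If fight and date don't uniquely identify a pair (see the duplicated "FIGHT 22" entries above), an explicit pair index can be added first. A sketch, still assuming winner/loser rows alternate; pair is a hypothetical helper column:
y %>%
  select(-X) %>%
  mutate(pair = (row_number() + 1) %/% 2,   # 1, 1, 2, 2, ...
         outcome = if_else(row_number() %% 2 == 1, "winner", "loser")) %>%
  pivot_wider(names_from = outcome, values_from = c(competitor, name, gym))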
This will get you close to what you want. It adds a winner column: an odd row index is a winner and an even row index is a loser. I removed the BYE week rows for aesthetics. Then we group by date and fight, keep the desired data from the combined rows, and expand the summarised columns into the winner/loser information.
library(dplyr)
y %>%
  mutate(winner = ifelse((X %% 2) == 0, 'loser', 'winner')) %>%
  filter(date != 'BYE') %>%
  group_by(date, fight) %>%
  summarise(division = first(division),
            belt = first(belt),
            weight = first(weight),
            gender = first(gender),
            winner.rank = first(competitor),
            winner = first(name),
            winner.gym = first(gym),
            opp.rank = last(competitor),
            opponent = last(name),
            opponent.gym = last(gym))
I have scraped multiple tables from a basketball site using a for loop.
library(rvest)

years <- 2016:2021
final_table <- NULL
for (i in seq_along(years)) {
  url <- paste0("https://www.basketball-reference.com/friv/free_agents.cgi?year=", years[i])
  past_free_agency_page <- read_html(url)
  past_free_agency_webtable <- html_nodes(past_free_agency_page, "table")
  past_free_agency_table <- html_table(past_free_agency_webtable, header = TRUE)[[1]]
  final_table <- rbind(final_table, past_free_agency_table)
}
This retrieves everything correctly, but I am trying to combine all of these tables as they are created. Note that it is six tables in total (years 2016-2021).
There is one error that I am getting: I try to combine the tables with rbind() at the end of the loop, but it does not work; it says "the names do not match". I do not know of a clever way to fix this issue because I am new to working with loops, and I have tried turning the scraped table into a df with no success.
My next issue has to do with how the tables are combined. In the website links, one can see that each table has headers within it that repeat the master header exactly. The code treats these as ordinary rows, so they appear as instances within each of the tables. I want them to be ignored.
The last issue has to do with making each of these rows unique: I want the respective year of each table to be a column of its own. For example, for the year 2016, I want the table to have a column that says 2016. I have tried something inside the loop, such as past_free_agency_table[,1] <- c(years[i]). I want to do this because some of these tables have the same players, and I want to be able to identify uniquely which table is which.
Sort of a loop, but in the purrr way.
library(tidyverse)
library(rvest)
get_df <- function(year) {
  "https://www.basketball-reference.com/friv/free_agents.cgi?year=" %>%
    paste0(., year) %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    mutate(years = year) %>%
    select(Rk, years, everything())
}
df <- map_dfr(2016:2020, get_df)
# A tibble: 1,161 × 16
Rk years Player Pos Age Type OTm `2015-16 Stats` WS NTm
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2016 Kevin Du… F-G 33-2… UFA OKC 28.2 Pts, 8.2 … 14.5 GSW
2 2 2016 LeBron J… F-G 37-1… UFA CLE 25.3 Pts, 7.4 … 13.6 CLE
3 3 2016 Hassan W… C 33-0… UFA MIA 14.2 Pts, 11.8… 10.3 MIA
4 4 2016 DeMar De… G-F 32-3… UFA TOR 23.5 Pts, 4.5 … 9.9 TOR
5 5 2016 Al Horfo… C-F 36-0… UFA ATL 15.2 Pts, 7.3 … 9.4 BOS
6 6 2016 Marvin W… F 36-0… UFA CHO 11.7 Pts, 6.4 … 7.8 CHA
7 7 2016 Andre Dr… C 28-3… RFA DET 16.2 Pts, 14.8… 7.4 DET
8 8 2016 Pau Gasol C-F 41-3… UFA CHI 16.5 Pts, 11.0… 7.1 SAS
9 9 2016 Dirk Now… F 44-0… UFA DAL 18.3 Pts, 6.5 … 6.8 DAL
10 10 2016 Dwight H… C 36-1… UFA HOU 13.7 Pts, 11.8… 6.6 ATL
# … with 1,151 more rows, and 6 more variables: Terms <chr>, Notes <chr>,
# `2016-17 Stats` <chr>, `2017-18 Stats` <chr>, `2018-19 Stats` <chr>,
# `2019-20 Stats` <chr>
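Two notes on the remaining issues. map_dfr() binds with bind_rows() semantics, which is why the original rbind() error goes away: columns are matched by name and gaps are filled with NA (the season-specific stats columns, e.g. `2015-16 Stats`, exist only in some tables), whereas rbind() demands identical names. And the repeated in-table headers come through as ordinary rows repeating the column names, so they can be dropped inside get_df(); a sketch, assuming those rows show up with Rk equal to "Rk":
get_df <- function(year) {
  "https://www.basketball-reference.com/friv/free_agents.cgi?year=" %>%
    paste0(., year) %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    filter(Rk != "Rk") %>%   # drop the repeated in-table header rows
    mutate(years = year) %>%
    select(Rk, years, everything())
}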
I have the following code:
#load req'd libraries
library(plyr)
library(dplyr)
library(fitzRoy)
#get the raw data from the fitzRoy package (season selection use :)
player <- fetch_player_stats(season = 2020:2021, source = "fryzigg")
#select the req'd cols
player <- player %>%
  select(venue_name, match_date, match_round, player_id, player_first_name, player_last_name,
         kicks, handballs, disposals)
#change the match_date to date format
player$match_date <- as.Date(player$match_date, format = "%Y-%m-%d")
#add a column for the year (season)
player$season <- format(as.Date(player$match_date, format="%Y-%m-%d"),"%Y")
#change format for match_round
player$match_round <- as.numeric(player$match_round)
#add opponent (note: this needs player_team, match_home_team and match_away_team,
#which must also be kept in the select above)
player$opponent <- ifelse(player$player_team == player$match_home_team, player$match_away_team,
                          ifelse(player$player_team == player$match_away_team, player$match_home_team, player$match_away_team))
#sort
player <- player %>%
  arrange(player_id, season, match_round)
head(player)
This gives me a data frame like so:
# A tibble: 6 x 10
venue_name match_date match_round player_id player_first_name player_last_name kicks handballs disposals season
<chr> <date> <dbl> <int> <chr> <chr> <int> <int> <int> <chr>
1 GIANTS Stadium 2020-03-21 1 11170 Gary Ablett 16 8 24 2020
2 GMHBA Stadium 2020-06-12 2 11170 Gary Ablett 9 12 21 2020
3 GMHBA Stadium 2020-06-20 3 11170 Gary Ablett 8 6 14 2020
4 MCG 2020-06-28 4 11170 Gary Ablett 3 8 11 2020
5 GMHBA Stadium 2020-07-04 5 11170 Gary Ablett 6 8 14 2020
6 SCG 2020-07-09 6 11170 Gary Ablett 11 3 14 2020
I am trying to add several new columns:
A season average of disposals by player that is cumulative by round. For example, using the table above, the new column would look like:
season_average_disposals
24
22.5
19.7
17.5
16.8
16.3
When the season changes, say from 2020 to 2021, this would reset, and the first entry would be the total disposals for round 1 of that season.
Similar to the above, a season average of disposals by player by venue that is cumulative based on each round.
Similar to the above, a season average of disposals by player by venue and opponent that is cumulative based on each round.
A career average that is cumulative based on season and round. So this would not reset when the season changes, it would just keep calculating.
I tried using this:
player <- player %>%
  transform(season_average_disposals = ifelse(lag(season) == season, mean(disposals), disposals))
But this does not give me the required results.
For 1)
library(dplyr)
player %>%
  group_split(season, player_id) %>%
  purrr::map_dfr(~ .x %>%
    mutate(season_average_disposals = cummean(disposals)))
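For 2) to 4), the same cummean() idea extends with different grouping variables. A sketch, assuming the opponent column from the question has been created; the new column names are hypothetical:
library(dplyr)
player %>%
  arrange(player_id, season, match_round) %>%
  group_by(player_id, season) %>%
  mutate(season_average_disposals = cummean(disposals)) %>%
  group_by(player_id, season, venue_name) %>%
  mutate(season_venue_average_disposals = cummean(disposals)) %>%
  group_by(player_id, season, venue_name, opponent) %>%
  mutate(season_venue_opponent_average_disposals = cummean(disposals)) %>%
  group_by(player_id) %>%
  mutate(career_average_disposals = cummean(disposals)) %>%  # never resets across seasons
  ungroup()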
I need to download weather data from NASA's POWER (Prediction Of Worldwide Energy Resource). The nasapower package was developed for retrieving these data with R. I need to download data for many locations (lat/long coordinates). To do this, I tried a simple loop with three locations as a reproducible example.
library(nasapower)
data1 <- read.csv(text = "
location,long,lat
loc1, -56.547, -14.2427
loc2, -57.547, -15.2427
loc3, -58.547, -16.2427")
all.weather <- data.frame()
for (i in seq_len(nrow(data1))) {
  weather.data <- get_power(community = "AG",
                            lonlat = c(data1$long[i], data1$lat[i]),
                            dates = c("2015-01-01", "2015-01-10"),
                            temporal_average = "DAILY",
                            pars = c("T2M_MAX"))
  all.weather <- rbind(all.weather, weather.data)
}
This works perfectly. The problem is that I am trying to mimic this using purrr::map, since I want a tidyverse alternative. This is what I did, but it does not work:
library(dplyr)
library(purrr)
all.weather <- data1 %>%
  group_by(location) %>%
  map(get_power(community = "AG",
                lonlat = c(long, lat),
                dates = c("2015-01-01", "2015-01-10"),
                temporal_average = "DAILY",
                site_elevation = NULL,
                pars = c("T2M_MAX")))
I got the following error:
Error in isFALSE(length(lonlat != 2)) : object 'long' not found
Any hint on how to run this using purrr?
To make your code work, use purrr::pmap instead of map, like so:
map is for one-argument functions, map2 is for two-argument functions, and pmap is the most general, allowing functions with any number of arguments.
pmap will loop over the rows of your df. As your df has 3 columns, 3 arguments are passed to the function, even though the first argument, location, is not used. To make this work and to make use of the column names, you have to specify the function and the argument names via function(location, long, lat).
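A toy illustration of the distinction (made-up vectors):
library(purrr)
map(1:3, ~ .x + 1)                                       # one argument per call
map2(1:3, 4:6, ~ .x + .y)                                # two arguments per call
pmap(list(1:3, 4:6, 7:9), function(a, b, c) a + b + c)   # any number of arguments
Applied to the question's data: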
library(nasapower)
data1 <- read.csv(text = "
location,long,lat
loc1, -56.547, -14.2427
loc2, -57.547, -15.2427
loc3, -58.547, -16.2427")
library(dplyr)
library(purrr)
all.weather <- data1 %>%
  pmap(function(location, long, lat) get_power(community = "AG",
                                               lonlat = c(long, lat),
                                               dates = c("2015-01-01", "2015-01-10"),
                                               temporal_average = "DAILY",
                                               site_elevation = NULL,
                                               pars = c("T2M_MAX"))) %>%
  # Name list with locations
  setNames(data1$location) %>%
  # Add location names as identifiers
  bind_rows(.id = "location")
head(all.weather)
#> NASA/POWER SRB/FLASHFlux/MERRA2/GEOS 5.12.4 (FP-IT) 0.5 x 0.5 Degree Daily Averaged Data
#> Dates (month/day/year): 01/01/2015 through 01/10/2015
#> Location: Latitude -14.2427 Longitude -56.547
#> Elevation from MERRA-2: Average for 1/2x1/2 degree lat/lon region = 379.25 meters Site = na
#> Climate zone: na (reference Briggs et al: http://www.energycodes.gov)
#> Value for missing model data cannot be computed or out of model availability range: NA
#>
#> Parameters:
#> T2M_MAX MERRA2 1/2x1/2 Maximum Temperature at 2 Meters (C)
#>
#> # A tibble: 6 x 9
#> location LON LAT YEAR MM DD DOY YYYYMMDD T2M_MAX
#> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <date> <dbl>
#> 1 loc1 -56.5 -14.2 2015 1 1 1 2015-01-01 29.9
#> 2 loc1 -56.5 -14.2 2015 1 2 2 2015-01-02 30.1
#> 3 loc1 -56.5 -14.2 2015 1 3 3 2015-01-03 27.3
#> 4 loc1 -56.5 -14.2 2015 1 4 4 2015-01-04 28.7
#> 5 loc1 -56.5 -14.2 2015 1 5 5 2015-01-05 30
#> 6 loc1 -56.5 -14.2 2015 1 6 6 2015-01-06 28.7
I am trying to use rvest to extract some information. What I have is a list of links, and I would like to bind the rows of the data collected together.
What I currently have is the following:
EDIT: here are the links without the weekend data
links <- c("https://finance.yahoo.com/calendar/ipo?day=2018-03-05", "https://finance.yahoo.com/calendar/ipo?day=2018-03-06",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-07", "https://finance.yahoo.com/calendar/ipo?day=2018-03-08",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-09", "https://finance.yahoo.com/calendar/ipo?day=2018-03-12",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-13", "https://finance.yahoo.com/calendar/ipo?day=2018-03-14",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-15", "https://finance.yahoo.com/calendar/ipo?day=2018-03-16",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-19", "https://finance.yahoo.com/calendar/ipo?day=2018-03-20",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-21", "https://finance.yahoo.com/calendar/ipo?day=2018-03-22",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-23", "https://finance.yahoo.com/calendar/ipo?day=2018-03-26",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-27", "https://finance.yahoo.com/calendar/ipo?day=2018-03-28",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-29", "https://finance.yahoo.com/calendar/ipo?day=2018-03-30",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-02", "https://finance.yahoo.com/calendar/ipo?day=2018-04-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-04", "https://finance.yahoo.com/calendar/ipo?day=2018-04-05",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-06", "https://finance.yahoo.com/calendar/ipo?day=2018-04-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-10", "https://finance.yahoo.com/calendar/ipo?day=2018-04-11",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-12", "https://finance.yahoo.com/calendar/ipo?day=2018-04-13",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-16", "https://finance.yahoo.com/calendar/ipo?day=2018-04-17",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-18", "https://finance.yahoo.com/calendar/ipo?day=2018-04-19",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-20", "https://finance.yahoo.com/calendar/ipo?day=2018-04-23",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-24", "https://finance.yahoo.com/calendar/ipo?day=2018-04-25",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-26", "https://finance.yahoo.com/calendar/ipo?day=2018-04-27",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-30", "https://finance.yahoo.com/calendar/ipo?day=2018-05-01",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-02", "https://finance.yahoo.com/calendar/ipo?day=2018-05-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-04", "https://finance.yahoo.com/calendar/ipo?day=2018-05-07",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-08", "https://finance.yahoo.com/calendar/ipo?day=2018-05-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-10")
Code:
library(rvest)
library(dplyr)
library(magrittr)
x <- links %>%
  read_html() %>%
  html_table() %>%
  extract2(1) %>%
  bind_rows() %>%
  as_tibble()
This gives the following error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=68].
I am able to get the code working for one link; however, when I try to get it working for all the links I run into errors. For example, this code works:
x <- "https://finance.yahoo.com/calendar/ipo?day=2018-05-08" %>%
read_html() %>%
html_table() %>%
extract2(1) %>%
bind_rows() %>%
as_tibble
EDIT:
from = "2016-03-04"
to = "2018-05-10"
s <- seq(as.Date(from), as.Date(to), "days")
library(chron)
s <- s[!is.weekend(s)]
links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
library(naniar)
IPOs <- links[1:400] %>%
  map_dfr(~ read_html(.x) %>%
    html_table() %>%
    extract2(1) %>%
    naniar::replace_with_na_all(condition = ~ .x == "-") %>%
    type.convert(as.is = TRUE))
It looks like you want to loop through the URLs. For each one, you want to read it, parse it into a data frame, and extract the first data frame from the resulting list. So the read_html() through extract2() steps should be done within the loop.
One option is to use a purrr::map_dfr() loop, since it looks like you want to bind everything into a single tibble in the end.
Nominally that could look like:
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
links %>%
  map_dfr(~ read_html(.x) %>%
    html_table() %>%
    extract2(1))
However, it turns out that you have missing values represented by hyphens (-). Some of the tables have these and some don't. When they are present, R reads your integer columns as character; when they are absent, those columns are read as integer. This causes problems when binding everything together.
I did not see an argument in read_html() to deal with these directly (I was looking for the equivalent of na.strings in read.table() or na in readr::read_csv()). My work-around was to convert the hyphens to NA using replace_with_na_all() from package naniar (see the vignette here), and then convert all columns to the appropriate type with type.convert().
All of this was done within the map_dfr() loop.
Here is an example with just the first two URLs in links.
links[1:2] %>%
  map_dfr(~ read_html(.x) %>%
    html_table() %>%
    extract2(1) %>%
    naniar::replace_with_na_all(condition = ~ .x == "-") %>%
    type.convert(as.is = TRUE))
# A tibble: 15 x 9
Symbol Company Exchange Date `Price Range` Price Currency Shares Actions
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr>
1 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 49969000 Priced
2 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 11745600 Priced
3 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 6857200 Priced
4 0000 Vcredit Hldg Ltd HKSE Jun 12, 2018 NA NA HKD NA Expected
5 6571.JP QB Net Holdings Co Ltd Japan OTC Mar 14, 2018 21.11 - 21.11 NA Y 9785900 Expected
6 1621.HK Vico Intl Hldg Ltd HKSE Mar 05, 2018 NA 0.35 HKD 175000000 Priced
7 PZM.AX Piston Mach Ltd ASX Mar 05, 2018 0.32 - 0.32 NA AU 50000000 Expected
8 "" Agp Ltd Karachi Mar 05, 2018 0.76 - 0.76 80 PKR 8750000 Priced
9 GRC.L GRC International Group PLC LSE Mar 05, 2018 0.98 - 0.98 0.7 GBP 8414286 Priced
10 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 3175413 Priced
11 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 7935698 Priced
12 GCI.AX Gryphon Capital Income Tr ASX May 23, 2018 1.57 - 1.57 2 AUD 87650000 Priced
13 GCI.AX Gryphon Capital Income Tr ASX May 04, 2018 1.57 - 1.57 NA AUD 50000000 Expected
14 STRL.L Stirling Inds Plc LSE Mar 06, 2018 1.40 - 1.40 1 GBP 8881002 Priced
15 541006.BO Angel Fibers Ltd BSE Mar 06, 2018 NA 27 INR 6408000 Priced