I am trying to use the rvest package to extract some information. What I have is a list of links, and I would like to bind the rows of the data collected together.
What I currently have is the following.
EDIT: here are the links without the weekend data:
links <- c("https://finance.yahoo.com/calendar/ipo?day=2018-03-05", "https://finance.yahoo.com/calendar/ipo?day=2018-03-06",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-07", "https://finance.yahoo.com/calendar/ipo?day=2018-03-08",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-09", "https://finance.yahoo.com/calendar/ipo?day=2018-03-12",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-13", "https://finance.yahoo.com/calendar/ipo?day=2018-03-14",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-15", "https://finance.yahoo.com/calendar/ipo?day=2018-03-16",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-19", "https://finance.yahoo.com/calendar/ipo?day=2018-03-20",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-21", "https://finance.yahoo.com/calendar/ipo?day=2018-03-22",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-23", "https://finance.yahoo.com/calendar/ipo?day=2018-03-26",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-27", "https://finance.yahoo.com/calendar/ipo?day=2018-03-28",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-29", "https://finance.yahoo.com/calendar/ipo?day=2018-03-30",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-02", "https://finance.yahoo.com/calendar/ipo?day=2018-04-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-04", "https://finance.yahoo.com/calendar/ipo?day=2018-04-05",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-06", "https://finance.yahoo.com/calendar/ipo?day=2018-04-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-10", "https://finance.yahoo.com/calendar/ipo?day=2018-04-11",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-12", "https://finance.yahoo.com/calendar/ipo?day=2018-04-13",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-16", "https://finance.yahoo.com/calendar/ipo?day=2018-04-17",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-18", "https://finance.yahoo.com/calendar/ipo?day=2018-04-19",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-20", "https://finance.yahoo.com/calendar/ipo?day=2018-04-23",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-24", "https://finance.yahoo.com/calendar/ipo?day=2018-04-25",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-26", "https://finance.yahoo.com/calendar/ipo?day=2018-04-27",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-30", "https://finance.yahoo.com/calendar/ipo?day=2018-05-01",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-02", "https://finance.yahoo.com/calendar/ipo?day=2018-05-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-04", "https://finance.yahoo.com/calendar/ipo?day=2018-05-07",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-08", "https://finance.yahoo.com/calendar/ipo?day=2018-05-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-10")
Code:
library(rvest)
library(dplyr)
library(magrittr)
x <- links %>%
  read_html() %>%
  html_table() %>%
  extract2(1) %>%
  bind_rows() %>%
  as_tibble()
This gives the following error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=68].
I am able to get the code working for a single link; however, when I try to run it for all the links, I run into errors. For example, this code works:
x <- "https://finance.yahoo.com/calendar/ipo?day=2018-05-08" %>%
  read_html() %>%
  html_table() %>%
  extract2(1) %>%
  bind_rows() %>%
  as_tibble()
EDIT:
from = "2016-03-04"
to = "2018-05-10"
s <- seq(as.Date(from), as.Date(to), "days")
library(chron)
s <- s[!is.weekend(s)]
links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)
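The chron weekend filter above can also be done in base R, if you'd rather avoid the extra dependency. This sketch uses format()'s %u code, which gives the ISO weekday number (Monday = 1, Sunday = 7) and, unlike weekdays(), is locale-independent:

```r
from <- "2016-03-04"
to   <- "2018-05-10"
s <- seq(as.Date(from), as.Date(to), by = "day")

# Drop Saturdays (6) and Sundays (7)
s <- s[!(format(s, "%u") %in% c("6", "7"))]

links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)
```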
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
library(naniar)
IPOs <- links[1:400] %>%
  map_dfr(~ read_html(.x) %>%
            html_table() %>%
            extract2(1) %>%
            naniar::replace_with_na_all(condition = ~.x == "-") %>%
            type.convert(as.is = TRUE))
It looks like you want to loop through the URLs. For each one, you want to read it, parse it into a list of data frames, and extract the first data frame from that list. So the read_html() through extract2() steps should be done within the loop.
One option is to use a purrr::map_dfr() loop, since it looks like you want to bind things into a single tibble in the end.
Nominally that could look like:
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
links %>%
  map_dfr(~ read_html(.x) %>%
            html_table() %>%
            extract2(1))
However, it turns out that you have missing values that are represented by hyphens (-). Some of the tables have these and some don't. When they are present, R reads your integer columns as character; when they are absent, those columns are read as integer. This causes problems when binding everything together.
I did not see an argument in read_html() to deal with these directly (I was looking for the equivalent of na.strings in read.table() or na in readr::read_csv()). My work-around was to convert the hyphens to NA using function replace_with_na_all() from package naniar (see the vignette here). Then I converted all columns to the appropriate type with type.convert().
All of this was done within the map_dfr() loop.
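As a toy offline sketch of just the cleaning steps (a hypothetical two-row data frame standing in for one scraped table):

```r
library(naniar)    # replace_with_na_all()
library(magrittr)  # %>%

# Hypothetical scrape result: hyphens standing in for missing values,
# so both columns come through as character
df <- data.frame(Price  = c("20", "-"),
                 Shares = c("49969000", "-"),
                 stringsAsFactors = FALSE)

df %>%
  replace_with_na_all(condition = ~.x == "-") %>%  # "-" becomes NA
  type.convert(as.is = TRUE)                       # columns become numeric/integer
```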
Here is an example with just the first two URLs in links.
links[1:2] %>%
  map_dfr(~ read_html(.x) %>%
            html_table() %>%
            extract2(1) %>%
            naniar::replace_with_na_all(condition = ~.x == "-") %>%
            type.convert(as.is = TRUE))
# A tibble: 15 x 9
Symbol Company Exchange Date `Price Range` Price Currency Shares Actions
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr>
1 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 49969000 Priced
2 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 11745600 Priced
3 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 6857200 Priced
4 0000 Vcredit Hldg Ltd HKSE Jun 12, 2018 NA NA HKD NA Expected
5 6571.JP QB Net Holdings Co Ltd Japan OTC Mar 14, 2018 21.11 - 21.11 NA Y 9785900 Expected
6 1621.HK Vico Intl Hldg Ltd HKSE Mar 05, 2018 NA 0.35 HKD 175000000 Priced
7 PZM.AX Piston Mach Ltd ASX Mar 05, 2018 0.32 - 0.32 NA AU 50000000 Expected
8 "" Agp Ltd Karachi Mar 05, 2018 0.76 - 0.76 80 PKR 8750000 Priced
9 GRC.L GRC International Group PLC LSE Mar 05, 2018 0.98 - 0.98 0.7 GBP 8414286 Priced
10 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 3175413 Priced
11 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 7935698 Priced
12 GCI.AX Gryphon Capital Income Tr ASX May 23, 2018 1.57 - 1.57 2 AUD 87650000 Priced
13 GCI.AX Gryphon Capital Income Tr ASX May 04, 2018 1.57 - 1.57 NA AUD 50000000 Expected
14 STRL.L Stirling Inds Plc LSE Mar 06, 2018 1.40 - 1.40 1 GBP 8881002 Priced
15 541006.BO Angel Fibers Ltd BSE Mar 06, 2018 NA 27 INR 6408000 Priced
Related
Working with some Bureau of Labor Statistics data (https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm). I have scraped the table from this URL and am trying to get it into a tidy format. Here is an example of the data:
Commodity jan feb mar
Nonmetallic mineral products
2020 257.2 258.1 258.5
2021 262.6 263.4 264.4
Concrete ingredients
2020 316.0 316.9 317.8
2021 328.4 328.4 328.4
Construction gravel
2020 359.2 360.7 362.1
2021 375.0 374.7 374.1
How can I get the 2020 and 2021 rows into a "Year" column, and get jan, feb, mar, etc. into a "Month" column like below?
Commodity Month Year Value
Nonmetallic mineral products jan 2020 257.2
Nonmetallic mineral products feb 2020 258.1
Concrete ingredients jan 2020 316.0
Concrete ingredients jan 2021 328.4
We could use tidyverse to transform the data into the required format.
Create a grouping column 'grp' by doing the cumulative sum (cumsum) of a logical vector i.e. presence of letters in the 'Commodity' column
Use mutate to create the 'Year' column by replacing the first element with NA, and modify 'Commodity' by updating it with the first value
Remove the first row with slice
ungroup and reshape the data into long format with pivot_longer
library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
  group_by(grp = cumsum(str_detect(Commodity, "[A-Za-z]"))) %>%
  mutate(Year = replace(Commodity, 1, NA),
         Commodity = first(Commodity)) %>%
  slice(-1) %>%
  ungroup %>%
  select(-grp) %>%
  pivot_longer(cols = jan:mar, names_to = 'Month')
-output
# A tibble: 18 x 4
Commodity Year Month value
<chr> <chr> <chr> <dbl>
1 Nonmetallic mineral products 2020 jan 257.
2 Nonmetallic mineral products 2020 feb 258.
3 Nonmetallic mineral products 2020 mar 258.
4 Nonmetallic mineral products 2021 jan 263.
5 Nonmetallic mineral products 2021 feb 263.
6 Nonmetallic mineral products 2021 mar 264.
7 Concrete ingredients 2020 jan 316
8 Concrete ingredients 2020 feb 317.
9 Concrete ingredients 2020 mar 318.
10 Concrete ingredients 2021 jan 328.
11 Concrete ingredients 2021 feb 328.
12 Concrete ingredients 2021 mar 328.
13 Construction gravel 2020 jan 359.
14 Construction gravel 2020 feb 361.
15 Construction gravel 2020 mar 362.
16 Construction gravel 2021 jan 375
17 Construction gravel 2021 feb 375.
18 Construction gravel 2021 mar 374.
data
df1 <- structure(list(Commodity = c("Nonmetallic mineral products",
"2020", "2021", "Concrete ingredients", "2020", "2021", "Construction gravel",
"2020", "2021"), jan = c(NA, 257.2, 262.6, NA, 316, 328.4, NA,
359.2, 375), feb = c(NA, 258.1, 263.4, NA, 316.9, 328.4, NA,
360.7, 374.7), mar = c(NA, 258.5, 264.4, NA, 317.8, 328.4, NA,
362.1, 374.1)), class = "data.frame", row.names = c(NA, -9L))
Another strategy would be to transform the data while scraping:
library(rvest)
library(dplyr)
library(stringr)
library(tidyr)
"https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm" %>%
read_html() %>%
html_table() %>%
first() %>%
set_names(.[1, ]) %>%
tail(-1) %>%
split(ifelse(str_detect(.$Commodity, "\\d{4}"),
"data", "commodities")) %>%
with(data %>%
select(-`Historical data`) %>%
mutate(Year = as.integer(Commodity),
Commodity = commodities$Commodity %>%
head(nrow(commodities) - 1) %>%
rep(times = rep(2, length(.))),
across(-c(Commodity, Year), readr::parse_number),
.before = 1)) %>%
pivot_longer(-c(Year, Commodity)) %>%
transmute(Commodity, Year, Month = name, Value = value)
Result:
# A tibble: 754 x 4
Commodity Year Month Value
<chr> <int> <chr> <dbl>
1 Nonmetallic mineral products 2020 Jan 257.
2 Nonmetallic mineral products 2020 Feb 258.
3 Nonmetallic mineral products 2020 Mar 258.
4 Nonmetallic mineral products 2020 Apr 258.
5 Nonmetallic mineral products 2020 May 257.
6 Nonmetallic mineral products 2020 Jun 257.
7 Nonmetallic mineral products 2020 Jul 257.
8 Nonmetallic mineral products 2020 Aug 257
9 Nonmetallic mineral products 2020 Sep 258.
10 Nonmetallic mineral products 2020 Oct 257.
# … with 744 more rows
Again, doing the transformation while scraping: you could also use :nth-child() range CSS selectors to isolate nodes and then combine them, repeating the retrieved elements to the necessary lengths. CSS selectors are a quick way of filtering data.
Some explanation:
This gathers the commodities list and extends it for the number of years * number of months:
Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
rep(each = length(months)) %>% rep(times = 2)
This joins all the Year 2 values (2021) under all the Year 1 (2020) values:
Value = vctrs::vec_c(
  page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
  page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)
)
This stacks the required number of repeats of Year 2 under those of Year 1
Year = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
) %>% as.integer()
R:
library(rvest)
library(magrittr)
page <- read_html("https://www.bls.gov/regions/mid-atlantic/data/producerpriceindexconcrete_us_table.htm")
months <- page %>%
html_nodes("table#ro3ppiconcrete tr:nth-child(2) > th:nth-child(n+2):nth-child(-n+13)") %>%
html_text()
df <- data.frame(
Commodity = page %>% html_nodes("table#ro3ppiconcrete > tbody > tr:nth-child(3n+1)") %>% html_text(trim = T) %>%
rep(each = length(months)) %>% rep(times = 2),
Year = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n+2) th") %>% html_text(trim = T) %>% rep(times = 2),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) th") %>% html_text(trim = T) %>% rep(times = 2)
) %>% as.integer(),
Month = months,
Value = vctrs::vec_c(
page %>% html_nodes("table#ro3ppiconcrete tbody > tr:nth-child(3n-1) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T),
page %>% html_nodes("table#ro3ppiconcrete tbody tr:nth-child(3n) td:nth-child(n+3):nth-child(-n+14)") %>% html_text(trim = T)
)
)
# if you wish to remove the provisional flag from Value and have as numeric
df$Value <- gsub('(p)', '', df$Value, fixed = T) %>% as.double()
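A quick sketch of what that fixed-string gsub does to a value carrying the provisional flag (the input values here are made up):

```r
x <- c("257.2", "258.1 (p)")  # "(p)" marks provisional values

# fixed = TRUE treats "(p)" as a literal string rather than a regex group
as.double(gsub("(p)", "", x, fixed = TRUE))
#> [1] 257.2 258.1
```

Note that as.double() tolerates the leftover trailing space, so no extra trimws() is needed here.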
:nth-child()
The :nth-child() CSS pseudo-class matches elements based on their
position in a group of siblings.
Functional notation
<An+B> Represents elements in a list whose indices match those found
in a custom pattern of numbers, defined by An+B, where:
A is an integer step size,
B is an integer offset,
n is all nonnegative integers, starting from 0.
It can be read as the An+Bth element of a list.
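A self-contained sketch of the An+B pattern, using a toy table (not the real BLS one) so it runs offline; rvest's minimal_html() builds a page from a string:

```r
library(rvest)  # minimal_html(), html_nodes(), html_text()

# Toy table mimicking the BLS layout: a commodity row followed by two year rows
page <- minimal_html("
  <table>
    <tr><td>Nonmetallic mineral products</td></tr>
    <tr><td>2020</td></tr>
    <tr><td>2021</td></tr>
    <tr><td>Concrete ingredients</td></tr>
    <tr><td>2020</td></tr>
    <tr><td>2021</td></tr>
  </table>")

# 3n+1 selects sibling rows 1, 4, 7, ... -- the commodity rows
page %>% html_nodes("tr:nth-child(3n+1)") %>% html_text()
```

which should return just "Nonmetallic mineral products" and "Concrete ingredients".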
I'd like to get the information in three tables from a website. I tried to apply the code below, but the table is in a confusing format.
url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>% read_html() %>% html_table(fill = TRUE)
Obs.: tidyverse and rvest have been used
You need to do some cleaning of the table.
library(rvest)
library(dplyr)
url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>%
  read_html %>%
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  .[complete.cases(.), ] %>%
  mutate_all(~gsub('\n|\\s{2,}', '', .))
# W/L Fighter Str Td Sub Pass
#1 loss Tom AaronMatt Ricehouse 00 00 00 00
#2 win Tom AaronEric Steenberg 00 00 00 00
# Event Method Round Time
#1 Strikeforce - Henderson vs. BabaluDec. 04, 2010 U-DEC 3 5:00
#2 Strikeforce - Heavy ArtilleryMay. 15, 2010 SUBGuillotine Choke 1 0:56
The table you're working with is tricky because there are table cells (<td> elements in HTML) that span two rows in order to repeat information. When html_table strips information out, those individual rows get concatenated and you get long strings of blank spaces and newlines.
library(dplyr)
library(rvest)
ufc <- read_html("http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9") %>%
html_table(fill = TRUE) %>%
.[[1]] %>%
filter(!is.na(Fighter)) # could instead use janitor::remove_empty or rowSums for number of NAs
ufc$Fighter[1]
#> [1] "Tom Aaron\n \n \n\n \n \n Matt Ricehouse"
With some regex, you can turn those blanks into delimiters to split the cells. Information that applies to two rows (such as time) gets repeated. Originally I did this with mutate_all, but realized Event shouldn't be split; for that column, instead just remove the extra spaces. Adjust as needed for other columns.
ufc %>%
mutate_at(vars(Fighter:Pass), stringr::str_replace_all, "\\s{2,}", "|") %>%
mutate_all(stringr::str_replace_all, "\\s{2,}", " ") %>%
tidyr::separate_rows(everything(), sep = "\\|")
#> W/L Fighter Str Td Sub Pass
#> 1 loss Tom Aaron 0 0 0 0
#> 2 loss Matt Ricehouse 0 0 0 0
#> 3 win Tom Aaron 0 0 0 0
#> 4 win Eric Steenberg 0 0 0 0
#> Event Method Round
#> 1 Strikeforce - Henderson vs. Babalu Dec. 04, 2010 U-DEC 3
#> 2 Strikeforce - Henderson vs. Babalu Dec. 04, 2010 U-DEC 3
#> 3 Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke 1
#> 4 Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke 1
#> Time
#> 1 5:00
#> 2 5:00
#> 3 0:56
#> 4 0:56
I have a dataframe that looks something like this (I have a lot more years and variables):
Name State2014 State2015 State2016 Tuition2014 Tuition2015 Tuition2016 StateGrants2014
Jared CA CA MA 22430 23060 40650 5000
Beth CA CA CA 36400 37050 37180 4200
Steven MA MA MA 18010 18250 18720 NA
Lary MA CA MA 24080 30800 24600 6600
Tom MA OR OR 40450 15800 16040 NA
Alfred OR OR OR 23570 23680 23750 3500
Cathy OR OR OR 32070 32070 33040 4700
My objective (in this example) is to get the mean tuition for each state, and the sum of state grants for each state. My thought was to subset the data by year:
State2014 Tuition2014 StateGrants2014
CA 22430 5000
CA 36400 4200
MA 18010 NA
MA 24080 6600
MA 40450 NA
OR 23570 3500
OR 32070 4700
State2015 Tuition2015
CA 23060
CA 37050
MA 18250
CA 30800
OR 15800
OR 23680
OR 32070
State2016 Tuition2016
MA 40650
CA 37180
MA 18720
MA 24600
OR 16040
OR 23750
OR 33040
Then I would group_by state and summarize (and save each as a separate df) to get the following:
State2014 Tuition2014 StateGrants2014
CA 29415 9200
MA 27513 6600
OR 27820 6600
State2015 Tuition2015
CA 30303
MA 18250
OR 23850
State2016 Tuition2016
CA 37180
MA 27990
OR 24277
Then I would merge the by state. Here is my code:
years = c(2014,2015,2016)
for (i in seq_along(years)) {
#grab the variables from a certain year and save as a new df.
df_year <- df[, grep(paste(years[[i]],"$",sep=""), colnames(df))]
#Take off the year from each variable name (to make it easier to summarize)
names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)
df_year <- df_year %>%
group_by(State) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE),
#this part of the code does not work. In this example, I only want to have this part if the year is 2016.
if (years[[i]]=='2016')
{Stategrant = mean(Stategrant, na.rm = TRUE)})
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
I have about 50 years of data, and a good amount of variables, so I wanted to use a loop. So my question is, how do i add a conditional statement (summarize certain variables conditioned on the year) in the group_by()/summarize() function? Thanks!
*Edit: I realize that I could take the if{} out of the function, and do something like:
if (years[[i]]==2016){
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE),
Stategrant = mean(Stategrant, na.rm = TRUE))
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
else{
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE))
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
}
but there are just so many combinations of variables, that using a for loop would not be very efficient or useful.
This is so much easier with tidy data, so let me show you how to tidy up your data. See http://r4ds.had.co.nz/tidy-data.html.
library(tidyr)
library(dplyr)
df <- gather(df, key, value, -Name) %>%
# separate years from the variables
separate(key, c("var", "year"), sep = -5) %>%
# the above line splits up e.g. State2014 into State and 2014.
# It does so by splitting at the fifth element from the end of the
# entry. Please check that this works for your other variables
# in case your naming conventions are inconsistent.
spread(var, value) %>%
# turn numbers back to numeric
mutate_at(.cols = c("Tuition", "StateGrants"), as.numeric) %>%
gather(var, val, -Name, -year, -State) %>%
# group by the variables of interest. Note that `var` here
# refers to Tuition and StateGrants. If you have more variables,
# they will be included here as well. If you want to exclude more
# variables from being included here in `var`, add more "-colName"
# entries in the `gather` statement above
group_by(year, State, var) %>%
# summarize:
summarise(mean_values = mean(val))
This gives you:
Source: local data frame [18 x 4]
Groups: year, State [?]
year State var mean_values
<chr> <chr> <chr> <dbl>
1 2014 CA StateGrants 4600.00
2 2014 CA Tuition 29415.00
3 2014 MA StateGrants NA
4 2014 MA Tuition 27513.33
5 2014 OR StateGrants 4100.00
6 2014 OR Tuition 27820.00
7 2015 CA StateGrants NA
8 2015 CA Tuition 30303.33
9 2015 MA StateGrants NA
10 2015 MA Tuition 18250.00
11 2015 OR StateGrants NA
12 2015 OR Tuition 23850.00
13 2016 CA StateGrants NA
14 2016 CA Tuition 37180.00
15 2016 MA StateGrants NA
16 2016 MA Tuition 27990.00
17 2016 OR StateGrants NA
18 2016 OR Tuition 24276.67
If you don't like the shape of this, you can e.g. add an %>% spread(var, mean_values) behind the summarise statement to have the means for Tuition and StateGrants in different columns.
If you want to compute different functions for Tuition and Grants (e.g. mean of Tuition and sum for grants, you could do the following:
df <- gather(df, key, value, -Name) %>%
separate(key, c("var", "year"), sep = -5) %>%
spread(var, value) %>%
mutate_at(.cols = c("Tuition", "StateGrants"), as.numeric) %>%
group_by(year, State) %>%
summarise(Grant_Sum = sum(StateGrants, na.rm=T), Tuition_Mean = mean(Tuition) )
This gives you:
Source: local data frame [9 x 4]
Groups: year [?]
year State Grant_Sum Tuition_Mean
<chr> <chr> <dbl> <dbl>
1 2014 CA 9200 29415.00
2 2014 MA 6600 27513.33
3 2014 OR 8200 27820.00
4 2015 CA 0 30303.33
5 2015 MA 0 18250.00
6 2015 OR 0 23850.00
7 2016 CA 0 37180.00
8 2016 MA 0 27990.00
9 2016 OR 0 24276.67
Note that I used sum here, with na.rm = T, which returns 0 if all elements are NAs. Make sure this makes sense in your use case.
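The all-NA edge case in one line; note that mean behaves differently, which matters if a country has no grant data at all:

```r
sum(c(NA, NA), na.rm = TRUE)   # the empty sum is 0
#> [1] 0
mean(c(NA, NA), na.rm = TRUE)  # the empty mean is NaN
#> [1] NaN
```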
Also, just to mention it, to get your individual data.frames that you asked for, you can use filter(year == 2014) etc, as in df_2014 <- filter(df, year == 2014).
I am trying to get a table from this
website :
http://www.oddsportal.com/american-football/usa/nfl-2012-2013/results/
I actually want to get the table in the middle of the page.
I tried different ways but in vain.
library("rvest")
library(dplyr)
url1 <- "http://www.oddsportal.com/american-football/usa/nfl-2012-2013/results/"
table <- url1 %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="tournamentTable"]') %>%
  html_table(fill = T)
This does not work because, I believe, the table is not defined as a table in the HTML.
I also tried to grab the rows separately by using:
df <- mps1 %>%
html_nodes(css = "tr.odd.deactivate,tr.center.nob-border")
but it obtains nothing.
Any idea how I can do it?
Thanks
Based on previous questions by people trying to scrape from this site, this table is probably dynamically generated. As far as I know, the only way to deal with pages like this is to use RSelenium - which basically automates a browser.
After a lot of trial and error, the following code seems to work (using Chrome on Windows 10)...
library(RSelenium)
library(rvest)
library(dplyr)
url <- "http://www.oddsportal.com/american-football/usa/nfl-2012-2013/results/"
rD <- rsDriver(port=4444L,browser="chrome")
remDr <- rD$client
remDr$navigate(url)
page <- remDr$getPageSource()
remDr$close() #you can leave open if you are doing several of these: close at the end
table <- page[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//table[@id="tournamentTable"]') %>% # specify the table, as there is a div with the same id
  html_table(fill = T)
table <- table[[1]]
head(table)
American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013
1 03 Feb 2013 - Play Offs 03 Feb 2013 - Play Offs 03 Feb 2013 - Play Offs 03 Feb 2013 - Play Offs 1.00 2.00
2 NA NA
3 23:30 San Francisco 49ers - Baltimore Ravens San Francisco 49ers - Baltimore Ravens 31:34 1.49 2.71
4 28 Jan 2013 - All Stars 28 Jan 2013 - All Stars 28 Jan 2013 - All Stars 28 Jan 2013 - All Stars 1.00 2.00
5 NA NA
6 00:00 NFC - AFC NFC - AFC 62:35 2.03 1.83
American Football» USA»NFL 2012/2013
1 B's
2
3 9
4 B's
5
6 9
The odds are coming out as decimal numbers, unfortunately, but hopefully you can work with that.
I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things, I've given up. Can anyone help with this code?
data %>%
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise_all(.funs = c(Sum = 'sum'), na.rm = TRUE) %>%
  mutate(percent_inc = 100*((`2014_Sum` - `2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're getting the error 'duplicate identifiers for rows' most likely because of spread. spread wants to make N columns out of your N unique values, but it needs to know in which unique row to place each value. If you have duplicate value-combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread gets confused about which row it should place the data in. The quick fix is to add a unique row identifier before spreading: data %>% mutate(row = row_number()) %>% spread(...).
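A minimal sketch of that fix, with a hypothetical two-row duplicate (spread on the raw frame would raise the error above):

```r
library(dplyr)
library(tidyr)

# Hypothetical duplicate: the same CountryName/Year combination twice
df <- data.frame(CountryName = c("UK", "UK"),
                 Year        = c(2014, 2014),
                 Orders      = c(13, 6))

# spread(df, Year, Orders) would fail with "duplicate identifiers for rows"
df %>%
  mutate(row = row_number()) %>%  # makes each row unique
  spread(Year, Orders)
```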
Error 2: You're getting the error 'sum not meaningful for factors' most likely because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)).
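Because 2014 and 2014_Sum are not syntactic R names, they need backticks; a toy sketch with made-up numbers:

```r
library(dplyr)

# check.names = FALSE keeps the year columns named "2014"/"2015"
df <- data.frame(`2014` = c(13, 2), `2015` = c(9, 1), check.names = FALSE)

# Backticks let summarise refer to the non-syntactic column names
df %>% summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`))
#>   2014_Sum 2015_Sum
#> 1       15       10
```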