I'm trying to scrape a table from the following website:
http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats
The table is entitled "Miscellaneous Stats" and the problem is there are multiple tables on this webpage and I don't know if I'm identifying the correct one. I have attempted the following code but all it creates is a blank data frame:
library(rvest)
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
tmisc <- adv %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="div_misc_stats"]') %>%
    html_table()
tmisc <- data.frame(tmisc)
I have a feeling I'm missing something trivial but I haven't found this through all my google searches. Any help is much appreciated.
Since the table you want is hidden in a comment until revealed by JavaScript, you either need to use RSelenium to run the JavaScript (which is kind of a pain), or parse the comments (which is still a pain, but slightly less so).
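You can confirm the comment issue before committing to either route (a quick diagnostic sketch; the counts are what I'd expect from the page structure, not guaranteed):

library(rvest)
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
h <- read_html(adv)
length(html_nodes(h, xpath = '//*[@id="div_misc_stats"]'))  # 0 -- the div lives inside a comment
length(html_nodes(h, xpath = '//comment()'))                # many -- one of these holds the table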
library(rvest)
library(readr) # for type_convert
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
h <- adv %>% read_html() # be kind; don't rescrape unless necessary
df <- h %>%
    html_nodes(xpath = '//comment()') %>%               # select comments
    html_text() %>%                                     # extract comment text
    paste(collapse = '') %>%                            # collapse to a single string
    read_html() %>%                                     # re-read as HTML
    html_node('table#misc_stats') %>%                   # select desired node
    html_table() %>%                                    # parse node to a table
    { setNames(.[-1, ], paste0(names(.), .[1, ])) } %>% # paste the two-row header into names, drop header row
    type_convert()                                      # fix column types
df[1:6, 1:14]
## Rk Team Age PW PL MOV SOS SRS ORtg DRtg Pace FTr 3PAr TS%
## 2 1 Golden State Warriors* 27.4 65 17 10.76 -0.38 10.38 114.5 103.8 99.3 0.250 0.362 0.593
## 3 2 San Antonio Spurs* 30.3 67 15 10.63 -0.36 10.28 110.3 99.0 93.8 0.246 0.223 0.564
## 4 3 Oklahoma City Thunder* 25.8 59 23 7.28 -0.19 7.09 113.1 105.6 96.7 0.292 0.275 0.565
## 5 4 Cleveland Cavaliers* 28.1 57 25 6.00 -0.55 5.45 110.9 104.5 93.3 0.259 0.352 0.558
## 6 5 Los Angeles Clippers* 29.7 53 29 4.28 -0.15 4.13 108.3 103.8 95.8 0.318 0.324 0.556
## 7 6 Toronto Raptors* 26.3 53 29 4.50 -0.42 4.08 110.0 105.2 92.9 0.328 0.287 0.552
Here is another messy solution. Read the page, save it, reread it, remove the comment markers, and then process the page:
library(rvest)
library(xml2) # for write_xml

gameUrl <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
gameHtml <- gameUrl %>% read_html()
#gameHtml %>% html_nodes("tbody")

# Only save and work with the body
body <- html_node(gameHtml, "body")
write_xml(body, "nba.xml")

# Find and remove the comment markers
lines <- readLines("nba.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "nba2.xml")

# Read the file back in and process normally
body <- read_html("nba2.xml")

# Table 10 was found by looking at all of the tables and picking the one of interest
tableofinterest <- html_nodes(body, "tbody")[10]
rows <- html_nodes(tableofinterest, "tr")
tableOfResults <- t(sapply(rows, function(x) html_text(html_nodes(x, "td"))))

# Find the titles from the first record's attributes
titles <- html_attrs(html_nodes(rows[1], "td"))
dfnames <- unlist(titles)[seq(2, 2 * length(titles), by = 2)]

# The final results are stored in data frame "df"
df <- as.data.frame(tableOfResults)
names(df) <- dfnames
This code works but could be simplified. It is based on a similar solution I posted here: How to get table using rvest()
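One possible simplification (an untested sketch along the same lines): strip the comment markers in memory instead of round-tripping through nba.xml and nba2.xml on disk.

library(rvest)

gameUrl <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
raw <- paste(readLines(gameUrl, warn = FALSE), collapse = "\n")
# Drop the comment markers and reparse, so the hidden tables become visible
page <- read_html(gsub("<!--|-->", "", raw))
stats <- html_table(html_node(page, "table#misc_stats"))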
I have a list of formulas which I want to use to create new variables with mutate. For each formula stored in my list, I want to automatically generate one new variable. This is my code:
library("dplyr")
library("purrr")
library("formula.tools")
t <- 10 # just some constant which needs to be included (and found within my pipe)

ut <- list( # my list with the formulas as elements
  v1 = V.1 ~ A * B * t,
  v2 = V.2 ~ A + B)

data <- tibble(A = rnorm(10), B = runif(10)) %>% ## the dataset
  mutate(!!lhs(ut[["v1"]]) := !!rhs(ut[["v1"]]),
         !!lhs(ut[["v2"]]) := !!rhs(ut[["v2"]]))
This works fine. However, I do not want to write this out for each element of the list. I want mutate to take each element of the list and apply its formula, i.e. I need some kind of loop. I tried across, but across requires existing variables.
I tried to wrap it into a function and use map, but this didn't work:
by_formula <- function(equation){
  !!lhs(equation) := !!rhs(equation)
}

data <- tibble(A = rnorm(10), B = runif(10)) %>%
  mutate(map(ut, by_formula))
I appreciate any hints on how to do this so that I do not need to worry about the length of the list. This should be part of a function where the length of the list depends on the user input.
Here is one way:
library(dplyr)
library(purrr)
library(formula.tools)
by_formula <- function(equation){
# //! cur_data_all may get deprecated in favor of pick
# pick(everything()) %>%
cur_data_all() %>%
transmute(!!lhs(equation) := !!rhs(equation) )
}
tibble(A = rnorm(10), B = runif(10)) %>%
  mutate(map_dfc(ut, by_formula))
Output:
# A tibble: 10 × 4
A B V.1 V.2
<dbl> <dbl> <dbl> <dbl>
1 1.73 0.0770 1.33 1.80
2 -1.46 0.894 -13.0 -0.562
3 -0.620 0.804 -4.99 0.184
4 0.834 0.524 4.37 1.36
5 -0.980 0.00581 -0.0569 -0.974
6 -0.361 0.316 -1.14 -0.0444
7 1.73 0.833 14.4 2.57
8 1.71 0.512 8.74 2.22
9 0.233 0.944 2.20 1.18
10 -0.832 0.474 -3.94 -0.358
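A variant worth mentioning (a sketch built from the same pieces as the question's code): purrr::reduce() folds the formula list into the data one mutate() at a time, so a later formula can even reference a variable created by an earlier one.

library(dplyr)
library(purrr)
library(formula.tools)

t <- 10
ut <- list(v1 = V.1 ~ A * B * t,
           v2 = V.2 ~ A + B)

# Apply one mutate() per formula, carrying the growing tibble along
data <- reduce(ut,
               function(d, eq) mutate(d, !!lhs(eq) := !!rhs(eq)),
               .init = tibble(A = rnorm(10), B = runif(10)))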
This is a follow-up to a previous question: Read functions as text and use for plotting
The output of the mapped function...
data %>%
  bind_cols(
    map(.x = models, .f = text_model) %>%
      set_names(models) %>%
      bind_rows(.id = "model")
  )
...generates a data frame with the results of each function written to a separate column (with the function included in the column headers).
However, it would be best to have the output from each function appended such that all results are included in the same column, with a separate column to keep track of which function ("model001", "model002", ..., "model500") generated the results.
How can the code from the previous question (Read functions as text and use for plotting) be adjusted to write the results in this manner?
Edit: Someone suggested Read functions as text and use for plotting as an answer, but this post is a follow-up to that one asking how the output can be written to a single column (rather than a separate column for each function).
Given the other answer, we can pivot the data:
data %>%
  bind_cols(
    map(.x = models, .f = text_model) %>%
      set_names(models_names) %>%
      bind_rows(.id = "model")
  ) %>%
  pivot_longer(cols = model1:model2, names_to = "model")
# A tibble: 200 x 6
A B C D model value
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 0.833 0.538 0.647 1.65 model1 22.9
2 0.833 0.538 0.647 1.65 model2 57.9
3 2.07 1.20 -0.748 -2.04 model1 35.3
4 2.07 1.20 -0.748 -2.04 model2 70.3
5 0.880 -0.199 1.08 1.04 model1 29.2
6 0.880 -0.199 1.08 1.04 model2 64.2
7 0.252 0.400 1.45 -0.0645 model1 15.6
8 0.252 0.400 1.45 -0.0645 model2 50.6
9 0.746 0.0943 -1.00 1.44 model1 20.4
10 0.746 0.0943 -1.00 1.44 model2 55.4
# ... with 190 more rows
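For the full use case with columns model001 through model500 (assuming they all share the "model" prefix), selecting by prefix avoids hard-coding the range:

data %>%
  bind_cols(
    map(.x = models, .f = text_model) %>%
      set_names(models_names) %>%
      bind_rows(.id = "model")
  ) %>%
  pivot_longer(cols = starts_with("model"), names_to = "model")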
I am fairly new to Stack Overflow but did not find this in the search engine. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information, and I am looking to run some code for the whole file that will only give me the lines for "Resp2" and "LAI" for all 16 of the entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines. Then you can use regex to extract the desired rows. However, I feel that my approach is rather clumsy; I bet there are more elegant ways (maybe also with Unix command-line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
resp2 = resp2[!is.na(resp2)],
lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
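For what it's worth, a shorter base-R take on the same idea (a sketch, assuming the same testfile.txt): grep the two keys directly and split each matching line on whitespace.

lines <- readLines("testfile.txt")
hits <- grep("^(Resp2|LAI)\\b", lines, value = TRUE)
# First field is the key, second field is the value
parts <- do.call(rbind, strsplit(trimws(hits), "\\s+"))
data.frame(name = parts[, 1], value = as.numeric(parts[, 2]))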
A solution based on the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
enframe() %>%
filter(str_detect(value, 'Resp2|LAI')) %>%
transmute(value = str_squish(value)) %>%
separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36
I'm using the split() and lapply() functions to run Mann-Kendall trend tests in bulk. In the code below, split() separates the results (ConcLow) by Analyte (water quality parameter). Then lapply() runs MannKendall() and summary() for each. The output goes to the console (example shown below the code), but I'd like it to go into an Excel or csv document so I can work with it. Ideally the Excel document would have the analyte (TOC, for example) in the first column, the tau value in the second column, and the p-value in the third column. The next tab or the following columns would then display results from the summary function. Any assistance you can provide is greatly appreciated! I'm quite new to R.
library(Kendall) # for MannKendall

mk.analyte <- split(BarkTop$ConcLow, BarkTop$Analyte)
lapply(mk.analyte, MannKendall)
lapply(mk.analyte, summary)
Output for each analyte looks like this (abbreviated here, but it's a long list):
$TOC
tau = 0.0108, 2-sided pvalue =0.8081
$TOC
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.378 2.054 2.255 2.434 2.600 4.530
Data look like this:
Date Location Analyte ConcLow Units
5/8/2000 Barker Res. Hardness 3.34 mg/L (as CaCO3)
11/24/2000 Barker Res. Hardness 9.47 mg/L (as CaCO3)
6/12/2001 Barker Res. Hardness 1.4 mg/L (as CaCO3)
12/29/2001 Barker Res. Hardness 21.9 mg/L (as CaCO3)
7/17/2002 Barker Res. Fe (diss 81 ug/L
2/2/2003 Barker Res. Fe (diss 90 ug/L
8/21/2003 Barker Res. Fe (diss 0.08 ug/L
3/8/2004 Barker Res. Fe (diss 15.748 ug/L
9/24/2004 Barker Res. TSS 6.2 mg/L
4/12/2005 Barker Res. TSS 8 mg/L
10/29/2005 Barker Res. TSS 10 mg/L
In my opinion, I would use the tidyverse, as it is easier to read.
Short way:
library(tidyverse)

# Sample data
set.seed(42)
df <- data.frame(
  Location = replicate(1000, sample(letters[1:15], 1)),
  Analyte = replicate(1000, sample(c("Hardness", "TSS", "Fe"), 1)),
  ConcLow = runif(1000, 1, 30))

# Solution
df %>%
  nest(-Location, -Analyte) %>%
  mutate(
    mannKendall = purrr::map(data, function(x) {
      broom::tidy(Kendall::MannKendall(x$ConcLow))}),
    sumData = purrr::map(data, function(x) {
      broom::tidy(summary(x$ConcLow))})) %>%
  select(-data) %>%
  unnest(mannKendall, sumData) %>%
  write_excel_csv(path = "mydata.xls")
# What the table looks like:
# A tibble: 45 x 13
Location Analyte statistic p.value kendall_score denominator var_kendall_sco~ minimum q1 median
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 n Fe 0.264 0.0907 61 231. 1258. 1.38 14.4 20.6
2 o Hardne~ 0.0870 0.568 24 276. 1625. 2.02 9.52 18.3
3 e Fe -0.108 0.499 -25 231. 1258. 1.14 9.24 15.9
4 m TSS -0.00654 1 -1 153 697 2.19 5.89 10.4
5 j TSS -0.158 0.363 -27 171. 817 1.20 6.44 12.8
6 h Hardne~ 0.0909 0.466 48 528 4165. 4.28 11.1 19.4
7 l TSS -0.0526 0.780 -9 171. 817 5.39 12.5 21.1
8 c Fe -0.0736 0.652 -17 231. 1258. 1.63 5.87 10.6
9 j Hardne~ 0.415 0.0143 71 171. 817 4.50 11.7 15.4
10 k Fe -0.146 0.342 -37 253. 1434. 2.68 12.3 15.4
# ... with 35 more rows, and 3 more variables: mean <dbl>, q3 <dbl>, maximum <dbl>
Long way
It's a bit backwards, but you can do something like the below.
Please note that I used a subset of the mtcars dataset for my solution.
require(tidyverse)
df <- mtcars %>%
select(cyl, disp)
wilx <- df %>%
split(.$cyl) %>%
map(function(x) {broom::tidy(wilcox.test(x$disp, paired = FALSE,
exact = FALSE))})
sumData <- df %>%
split(.$cyl) %>%
map(function(x) {summary(x$disp)})
for (i in seq_along(wilx)) {
  write_excel_csv(as.data.frame(wilx[i]), path = paste0(getwd(), "/wilx", i, ".xls"))
  write_excel_csv(as.data.frame(unlist(sumData[i])), path = paste0(getwd(), "/sumData", i, ".xls"))
}
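If you would rather end up with a single file per result type (a small variant of the above, not tested): bind the per-group results together first, keyed by the split value.

# One combined file instead of one file per group
wilx_df <- bind_rows(wilx, .id = "cyl")
write_excel_csv(wilx_df, path = paste0(getwd(), "/wilx_all.xls"))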
I want to scrape the Sector Weightings Table from the following link:
http://portfolios.morningstar.com/fund/summary?t=SPY&region=usa&culture=en-US&ownerCountry=USA
The table I want is table 6 in the website's source code. I have the following script written in R:
library(rvest)
turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
turlr = read_html(turl)
df6 <- html_table(html_nodes(turlr, 'table')[[6]], fill = TRUE)
However, when I run the last line of the script, I get the following error message:
Error in out[j + k, ] : subscript out of bounds
Since the required table is designed in a different way, rvest is not able to format it into a proper table. But using the XML package you can do it quite easily.
library(XML)
library(dplyr)
# Read the required table
turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
temp_table <- readHTMLTable(turl)[[6]]
# Process the table into a readable format
final_table <- temp_table %>%
select(V2, V3, V4, V5) %>%
na.omit() %>%
`colnames<-` (c("","% Stocks","Benchmark","Category Avg")) %>%
`rownames<-` (seq_len(nrow(.)))
final_table
Output is:
% Stocks Benchmark Category Avg
1 Cyclical
2 Basic Materials 2.79 3.16 3.22
3 Consumer Cyclical 11.06 11.42 11.15
4 Financial Services 16.39 16.50 17.22
5 Real Estate 2.24 3.18 2.00
6 Sensitive
7 Communication Services 3.56 3.37 3.50
8 Energy 5.83 5.79 5.79
9 Industrials 10.37 10.89 11.70
10 Technology 22.16 21.41 19.72
11 Defensive
12 Consumer Defensive 8.20 7.60 8.56
13 Healthcare 14.24 13.57 14.57
14 Utilities 3.15 3.11 2.59
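If you prefer to stay within rvest, a workaround sketch (untested, and dependent on the current page markup) is to pull the rows cell by cell instead of relying on html_table()'s fill logic:

library(rvest)

turlr <- read_html('http://portfolios.morningstar.com/fund/summary?t=SPY')
rows <- html_nodes(turlr, xpath = '(//table)[6]//tr')
# Each row becomes a character vector of its cell texts; ragged rows stay ragged
cells <- lapply(rows, function(r) html_text(html_nodes(r, 'td')))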
Hope it helps!