Scraping table using html_table in R

I want to scrape the Sector Weightings Table from the following link:
http://portfolios.morningstar.com/fund/summary?t=SPY&region=usa&culture=en-US&ownerCountry=USA
The table I want is table 6 in the website's source code. I have the following script written in R:
library(rvest)
turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
turlr = read_html(turl)
df6<-html_table(html_nodes(turlr, 'table')[[6]], fill = TRUE)
However, when I run the last line of the script I get the following error message:
Error in out[j + k, ] : subscript out of bounds

The required table is structured in an unusual way, so rvest is not able to parse it into a proper table. But using the XML package you can do it quite easily.
library(XML)
library(dplyr)
#read required table
turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
temp_table <- readHTMLTable(turl)[[6]]
#process table to readable format
final_table <- temp_table %>%
  select(V2, V3, V4, V5) %>%
  na.omit() %>%
  `colnames<-`(c("", "% Stocks", "Benchmark", "Category Avg")) %>%
  `rownames<-`(seq_len(nrow(.)))
final_table
Output is:
% Stocks Benchmark Category Avg
1 Cyclical
2 Basic Materials 2.79 3.16 3.22
3 Consumer Cyclical 11.06 11.42 11.15
4 Financial Services 16.39 16.50 17.22
5 Real Estate 2.24 3.18 2.00
6 Sensitive
7 Communication Services 3.56 3.37 3.50
8 Energy 5.83 5.79 5.79
9 Industrials 10.37 10.89 11.70
10 Technology 22.16 21.41 19.72
11 Defensive
12 Consumer Defensive 8.20 7.60 8.56
13 Healthcare 14.24 13.57 14.57
14 Utilities 3.15 3.11 2.59
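If you also need the three numeric columns as actual numbers (readHTMLTable returns them as character), one extra step will do it; note that the group header rows ("Cyclical", "Sensitive", "Defensive") have empty cells in those columns and will coerce to NA:
#convert all but the first column to numeric; empty cells become NA
final_table[-1] <- lapply(final_table[-1], as.numeric)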
Hope it helps!

Related

Grouping Frame Values

I have a dataset of ingredients for cookies. I'm trying to answer which group (A, B, C, etc) of cookies has the most sugar in them. The dataset is structured as follows:
group id mois prot fat chocolate sugar carb cal
1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
4 B 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
5 B 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
7 C 14082 31.21 20.97 41.34 4.71 1.58 1.77 4.63
8 C 14097 28.76 21.41 41.60 5.28 1.75 2.95 4.72
etc....
How can I plot the mean of each grouping to show that one of them has a higher average of sugar than the others? Or at the least, how can I print off the results of the grouped averages of sugar to defend my argument that one has more sugar than the other?
After saving your text to CSV and loading this file into R, it's pretty easy to obtain the mean sugar quantity per group, which I'm assuming is what you need.
You first group your data by the variable group and then summarize it using the mean function.
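For example, the loading step might look like this (a minimal sketch; the file name cookies.csv is just an assumption about where you saved the text):
#read the saved data; adjust the file name/path to wherever you saved it
df <- read.csv("cookies.csv")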
library(dplyr)
(cookies = df %>%
  group_by(group) %>%
  summarize(meanSugar = mean(sugar)))
group meanSugar
<chr> <dbl>
1 A 1.71
2 B 1.62
3 C 1.66
As you can see, group A has a slightly higher mean sugar content than the others, based on your data.
If you want to go a step further and actually plot this data, you can do that:
library(ggplot2)
cookies %>%
  ggplot(aes(x = meanSugar, y = reorder(group, meanSugar), fill = group, label = meanSugar)) +
  geom_col() +
  labs(y = "Cookie groups", x = "Mean Sugar") +
  geom_label(stat = "identity", hjust = +1.2, color = "white") +
  theme(legend.position = "none")
If you have any questions on some of these steps, let me know!
Note: please try to provide your data in a reproducible format next time, so it's easy to reproduce what you need and give you a quick answer :)

How can I use dplyr to turn one column into 3 based on the characters in the original column?

Hopefully this makes sense. I have one column in my dataset that has multiple entries of one of three size categories (read in as characters): "(0,1.88]", "(1.88,4]", and "(4,10]". I would like to combine all of my entries together by plot (another column in the dataset), totaling the response for each size category in its own column.
Ideally, I'm trying to take data which has multiple responses in each Plot and end up with one total response for each plot, divided by size category. I'm hoping to get something like this:
Plot Total Response for (0,1.88] Total Response for (1.88,4] Total Response for (4,10]
Here is the head of my data. Not all of it is needed, only Plot, ounces, and tuber.diam. tuber.diam has the entries grouped into size categories.
head(newChippers)
Plot ounces Height Shape Area plot variety rate block width length tuber.oz.bin tuber.diam
1 2422 1.31 1.22 26122 3237 242 Lamoka 3 4 1.65 1.70 (0,4] (0,1.88]
2 2422 2.76 1.56 27853 5740 242 Lamoka 3 4 2.20 2.24 (0,4] (1.88,4]
3 2422 1.62 1.31 24125 3721 242 Lamoka 3 4 1.53 1.95 (0,4] (0,1.88]
4 2422 3.37 1.70 27147 6498 242 Lamoka 3 4 2.17 2.48 (0,4] (1.88,4]
5 2422 3.19 1.70 27683 6126 242 Lamoka 3 4 2.22 2.34 (0,4] (1.88,4]
6 2422 2.83 1.53 27356 6009 242 Lamoka 3 4 2.00 2.53 (0,4] (1.88,4]
Here is what I currently have for making the new dataset:
YieldSizeProfileDiameter <- newChippers %>%
  group_by(Plot) %>%
  summarize(totalOz = sum(Weight),
            Diameter.0.1.88 = (tuber.diam("(0,1.88]")),
            Diameter.1.88.4 = (tuber.diam(" (1.88,4]")),
            Diameter.4.10 = (tuber.diam(" (4,10]")))
I get the following error code:
Error in x[[n]] : object of type 'closure' is not subsettable
Any help would be very much appreciated! Again, I'm very sorry if I've explained it poorly or made it too complicated. If any additional information is needed, I can try to provide it. Thank you!
I have revised your code. I assume your variable Weight is the same as the variable ounces, as there is no Weight variable in your data newChippers. I use Weight here as in your code:
library(dplyr)
library(tidyr) # for pivot_wider()

YieldSizeProfileDiameter <- newChippers %>%
  group_by(Plot, tuber.diam) %>%
  summarize(totalOz = sum(Weight)) %>%
  pivot_wider(names_from = tuber.diam, values_from = totalOz)
YieldSizeProfileDiameter
I have not tested the code on my side as I do not have the data.
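For reference, here is a minimal runnable sketch with made-up numbers (the values are hypothetical; only the column names come from your data) that shows the reshaping:
library(dplyr)
library(tidyr)
#toy data standing in for newChippers
toy <- tibble(
  Plot = c(2422, 2422, 2422, 2423, 2423),
  ounces = c(1.31, 2.76, 1.62, 3.37, 3.19),
  tuber.diam = c("(0,1.88]", "(1.88,4]", "(0,1.88]", "(1.88,4]", "(4,10]")
)
toy %>%
  group_by(Plot, tuber.diam) %>%
  summarize(totalOz = sum(ounces), .groups = "drop") %>%
  pivot_wider(names_from = tuber.diam, values_from = totalOz)
#result: one row per Plot, one column per size category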

Rolling 182-day average

I have a dataset with multiple sites and sampling years, with a score for every day of the year. For example, SiteA has 40 years of data with a value for every day, and the sampling year defined as Sampling.Year. To make it confusing, our sampling year runs July-June, so it takes the form 2016-2017.
For example:
SiteName Sampling.Year Date Score
A 2015-2016 1
A 2015-2016 5
A 2015-2016 2
A 2016-2017 3
A 2016-2017 12
A 2016-2017 6
B 2015-2016 9
B 2015-2016 2
B 2015-2016 1
B 2016-2017 4
B 2016-2017 1
B 2016-2017 7
I want to apply a rolling 182-day average across this data to find the maximum (182-day average) score for each site/Sampling.Year combination. The outcome would be, e.g.:
Site Sampling.Year MaxAve StartDate
A 2016-2017 7.5 01/10/2016
A 2017-2018 6.0 12/12/2017
B 2016-2017 2.3 13/11/2016
B 2017-2018 4.2 09/09/2017
I have saved a sample dataset here:
Sample data.
I want to use loop code along the lines of the function below (because I am a novice and I'm not sure of a better way), but it's the grouping of sites and years that I'm finding tricky. I would ideally like the moving average to be exported as a new data frame with a start and end date (or at least a start date) for each window, so we can check it against weather conditions at the time.
Moving_Average_Function <- function(arr, n = 182){
  res = arr
  for(i in n:length(arr)){
    res[i] = mean(arr[(i - n + 1):i])
  }
  res
}
Thanks in advance
If you are willing to use external libraries, you could use group_by() from the dplyr package, and the roll_mean() function from the RcppRoll package. RcppRoll has a set of fast flexible functions for calculating moving averages.
I would also tend to convert your DATE column to a date class so it arranges smoothly.
library(dplyr) # I would typically use library(tidyverse) to load both dplyr and tidyr (among other related packages)
library(tidyr)
library(lubridate)
library(RcppRoll)

my_data <- data.table::fread("DailyScore.csv") # easy way to load a data frame from file
my_data2 <- my_data %>%
  mutate(DATE = dmy(DATE)) %>% # converting to Date format
  pivot_longer(H1:T2,
               names_to = "Sensor",
               values_to = "data") %>% # convert column names to data
  group_by(STATION, Sensor) %>% # so you don't average across sites
  arrange(STATION, DATE) %>% # to be sure you are in order for the rolling mean
  # (the STATION argument isn't necessary, but helps for display)
  mutate(Mean_182 = roll_meanr(data, 182)) %>% # new column with your rolling mean
  pivot_wider(names_from = Sensor, values_from = c(data, Mean_182)) # converts back to original "wide" format
my_data2[180:195,]
# # A tibble: 16 x 14
# # Groups: STATION [1]
# STATION SITENAME Sampling.Year DATE data_H1 data_I1 data_H2 data_P2 data_T2 Mean_182_H1
# <chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Site A Site A 1979-1980 1980-06-28 2.85 1.06e- 9 2.10 0.762 2.85 NA
# 2 Site A Site A 1979-1980 1980-06-29 2.79 1.62e-12 2.06 0.744 2.79 NA
# 3 Site A Site A 1979-1980 1980-06-30 2.75 1.00e-11 2.04 0.732 2.75 2.70
# 4 Site A Site A 1980-1981 1980-07-01 2.72 1.00e-11 2.01 0.724 2.72 2.71
# 5 Site A Site A 1980-1981 1980-07-02 2.70 1.00e-11 2.00 0.720 2.70 2.73
# 6 Site A Site A 1980-1981 1980-07-03 2.68 1.00e-11 1.98 0.718 2.68 2.74
# 7 Site A Site A 1980-1981 1980-07-04 2.67 1.00e-11 1.97 0.719 2.67 2.75
# 8 Site A Site A 1980-1981 1980-07-05 2.65 1.11e- 9 1.95 0.708 2.65 2.76
# 9 Site A Site A 1980-1981 1980-07-06 2.62 2.77e-10 1.93 0.703 2.62 2.76
# 10 Site A Site A 1980-1981 1980-07-07 2.60 3.18e-12 1.92 0.700 2.60 2.77
# 11 Site A Site A 1980-1981 1980-07-08 2.59 1.00e-11 1.90 0.701 2.59 2.79
# 12 Site A Site A 1980-1981 1980-07-09 2.59 1.00e-11 1.89 0.706 2.59 2.80
# 13 Site A Site A 1980-1981 1980-07-10 2.59 1.00e-11 1.89 0.713 2.59 2.81
# 14 Site A Site A 1980-1981 1980-07-11 2.59 1.00e-11 1.88 0.722 2.59 2.82
# 15 Site A Site A 1980-1981 1980-07-12 2.60 1.00e-11 1.88 0.731 2.60 2.83
# 16 Site A Site A 1980-1981 1980-07-13 2.60 1.00e-11 1.87 0.741 2.60 2.84
# # ... with 4 more variables: Mean_182_I1 <dbl>, Mean_182_H2 <dbl>, Mean_182_P2 <dbl>, Mean_182_T2 <dbl>
A couple of things to keep in mind, which will affect how you set this up.
In general, rolling averages return NA when they don't yet have a complete window. So with a 182-day average, you'll get a series of 181 NAs before your first complete average.
You'll also want to figure out how you want to handle the roll-over parts: with such a long averaging period, if you don't want to mix sampling years, you'll have about half of each year without data.
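For instance, a toy illustration of that right-aligned NA behavior (values here are just 1:5, not from your data):
RcppRoll::roll_meanr(1:5, n = 3)
## [1] NA NA  2  3  4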
Using loops would be very inefficient for such operations. Instead, you can use dedicated functions that let you perform this by group, with zoo::rollmean to get the rolling mean.
library(dplyr)
DailyScore %>%
  group_by(SITENAME, Sampling.Year) %>%
  summarise(max_average = max(zoo::rollmean(Score, 182)))
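If you also want the start date of the best window (as in your desired output), here is a hedged extension along the same lines; it assumes each group is sorted by a Date column and has at least 182 rows:
library(dplyr)
library(zoo)
DailyScore %>%
  group_by(SITENAME, Sampling.Year) %>%
  arrange(Date, .by_group = TRUE) %>%
  summarise(
    max_average = max(rollmean(Score, 182)),
    # rollmean() drops incomplete windows, so window i starts at row i
    StartDate = Date[which.max(rollmean(Score, 182))],
    .groups = "drop"
  )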

rvest empty list for table via xpath

I need help scraping data from the website below. I copied the approach used here https://msperlin.github.io/pafdR/importingInternet.html#accessing-data-from-web-pages-webscraping to get the table of rates, but I get only an empty list. Can someone please help me?
library(rvest)
# set url and xpath
my.url <- 'https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield'
my.xpath <- '//*[#id="t-content-main-content"]/div/table/tbody/tr/td/div/table'
# get nodes from html
out.nodes <- html_nodes(read_html(my.url),
                        xpath = my.xpath)
# get table from nodes (each element in
# list is a table)
df <- html_table(out.nodes)
df
It's usually better not to use extremely precise XPath statements, since the structure of pages can change, and sometimes what looks correct in the browser source or Developer Tools may not be (browsers modify the HTML as they read it in).
Using said Developer Tools (in this case via Firefox, but Chrome should work fine too), an Inspect Element near the table shows it is a plain table element with class t-chart.
We can just stick with CSS selector syntax and do:
library(rvest)
pg <- read_html("https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield")
html_node(pg, "table.t-chart") %>%
  html_table()
## Date 1 mo 3 mo 6 mo 1 yr 2 yr 3 yr 5 yr 7 yr 10 yr 20 yr 30 yr
## 1 10/01/18 2.13 2.23 2.40 2.60 2.82 2.90 2.96 3.04 3.09 3.18 3.24
## 2 10/02/18 2.14 2.23 2.41 2.61 2.82 2.88 2.94 3.01 3.05 3.14 3.20
## 3 10/03/18 2.15 2.23 2.41 2.62 2.85 2.94 3.02 3.10 3.15 3.24 3.30
## 4 10/04/18 2.16 2.22 2.42 2.63 2.87 2.97 3.05 3.14 3.19 3.29 3.35
## 5 10/05/18 2.15 2.23 2.41 2.64 2.88 2.99 3.07 3.18 3.23 3.34 3.40
In this case CSS selectors are easier (that's not always the case), but you can use an XPath query as well:
html_node(pg, xpath = ".//table[@class='t-chart']")

web scraping data table with r rvest

I'm trying to scrape a table from the following website:
http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats
The table is entitled "Miscellaneous Stats" and the problem is that there are multiple tables on this webpage, so I don't know if I'm identifying the correct one. I have attempted the following code, but all it creates is a blank data frame:
library(rvest)
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
tmisc <- adv %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="div_misc_stats"]') %>%
  html_table()
tmisc <- data.frame(tmisc)
I have a feeling I'm missing something trivial but I haven't found this through all my google searches. Any help is much appreciated.
Since the table you want is hidden in a comment until revealed by JavaScript, you either need to use RSelenium to run the JavaScript (which is kind of a pain), or parse the comments (which is still a pain, but slightly less so).
library(rvest)
library(readr) # for type_convert
adv <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
h <- adv %>% read_html() # be kind; don't rescrape unless necessary
df <- h %>%
  html_nodes(xpath = '//comment()') %>%  # select comment nodes
  html_text() %>%                        # extract comment text
  paste(collapse = '') %>%               # collapse to a single string
  read_html() %>%                        # reread as HTML
  html_node('table#misc_stats') %>%      # select desired node
  html_table() %>%                       # parse node to table
  { setNames(.[-1, ], paste0(names(.), .[1, ])) } %>% # extract names from first row
  type_convert()                         # fix column types
df[1:6, 1:14]
## Rk Team Age PW PL MOV SOS SRS ORtg DRtg Pace FTr 3PAr TS%
## 2 1 Golden State Warriors* 27.4 65 17 10.76 -0.38 10.38 114.5 103.8 99.3 0.250 0.362 0.593
## 3 2 San Antonio Spurs* 30.3 67 15 10.63 -0.36 10.28 110.3 99.0 93.8 0.246 0.223 0.564
## 4 3 Oklahoma City Thunder* 25.8 59 23 7.28 -0.19 7.09 113.1 105.6 96.7 0.292 0.275 0.565
## 5 4 Cleveland Cavaliers* 28.1 57 25 6.00 -0.55 5.45 110.9 104.5 93.3 0.259 0.352 0.558
## 6 5 Los Angeles Clippers* 29.7 53 29 4.28 -0.15 4.13 108.3 103.8 95.8 0.318 0.324 0.556
## 7 6 Toronto Raptors* 26.3 53 29 4.50 -0.42 4.08 110.0 105.2 92.9 0.328 0.287 0.552
Here is another, messier solution: read the page, save it, reread it, remove the comment markers, and then process the page.
library(rvest)
library(xml2) # for write_xml()

gameUrl <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
gameHtml <- gameUrl %>% read_html()
#gameHtml %>% html_nodes("tbody")

#Only save and work with the body
body <- html_node(gameHtml, "body")
write_xml(body, "nba.xml")

#Find and remove comments
lines <- readLines("nba.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "nba2.xml")

#Read the file back in and process normally
body <- read_html("nba2.xml")

#Table 10 was found by looking at all of the tables and picking the one of interest
tableofinterest <- html_nodes(body, "tbody")[10]
rows <- html_nodes(tableofinterest, "tr")
tableOfResults <- t(sapply(rows, function(x) {html_text(html_nodes(x, "td"))}))

#Find titles from the first record's attributes
titles <- html_attrs(html_nodes(rows[1], "td"))
dfnames <- unlist(titles)[seq(2, 2*length(titles), by = 2)]

#Final results are stored in data frame "df"
df <- as.data.frame(tableOfResults)
names(df) <- dfnames
This code works but should be simplified! This was based on a similar solution which I posted here: How to get table using rvest()
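A somewhat tidier variant of the same idea (a sketch combining both answers above, not tested against the live page) strips the comment markers in memory instead of round-tripping through files; you may still need the header cleanup from the first answer, since the table has two header rows:
library(rvest)
gameUrl <- "http://www.basketball-reference.com/leagues/NBA_2016.html?lid=header_seasons#all_misc_stats"
#read the raw HTML as text and remove the comment markers so the
#hidden tables become visible to the parser
raw <- paste(readLines(gameUrl), collapse = "\n")
cleaned <- gsub("<!--|-->", "", raw)
df <- read_html(cleaned) %>%
  html_node("table#misc_stats") %>%
  html_table()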
