rvest empty list for table via xpath - r

I need help scraping data from the website below. I copied the approach shown here https://msperlin.github.io/pafdR/importingInternet.html#accessing-data-from-web-pages-webscraping to get the table of rates, but I only get back an empty list. Can someone please help me?
library(rvest)

# set the url and xpath
my.url <- 'https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield'
my.xpath <- '//*[@id="t-content-main-content"]/div/table/tbody/tr/td/div/table'

# get nodes from the html
out.nodes <- html_nodes(read_html(my.url),
                        xpath = my.xpath)

# get tables from the nodes (each element in the list is a table)
df <- html_table(out.nodes)
df

It's usually better not to use extremely precise XPath expressions, since the structure of pages can change, and what looks correct in the browser source or Developer Tools may not match what the server actually sends (browsers modify HTML when they read it in; a classic example is the tbody elements your XPath relies on, which browsers often insert even when they are absent from the raw HTML).
Using said Developer Tools (in this case via Firefox, but Chrome should work fine too), an Inspect Element near the table shows that the table carries the class t-chart.
We can just stick with CSS selector syntax and do:
library(rvest)
pg <- read_html("https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield")
html_node(pg, "table.t-chart") %>%
  html_table()
## Date 1 mo 3 mo 6 mo 1 yr 2 yr 3 yr 5 yr 7 yr 10 yr 20 yr 30 yr
## 1 10/01/18 2.13 2.23 2.40 2.60 2.82 2.90 2.96 3.04 3.09 3.18 3.24
## 2 10/02/18 2.14 2.23 2.41 2.61 2.82 2.88 2.94 3.01 3.05 3.14 3.20
## 3 10/03/18 2.15 2.23 2.41 2.62 2.85 2.94 3.02 3.10 3.15 3.24 3.30
## 4 10/04/18 2.16 2.22 2.42 2.63 2.87 2.97 3.05 3.14 3.19 3.29 3.35
## 5 10/05/18 2.15 2.23 2.41 2.64 2.88 2.99 3.07 3.18 3.23 3.34 3.40
In this case CSS selectors are easier (that's not always the case), but an equivalent XPath query works too:
html_node(pg, xpath = ".//table[@class='t-chart']")
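Note that the scraped Date column comes back as character. A minimal sketch of converting it, assuming the mm/dd/yy format shown in the output above:
library(rvest)
library(dplyr)

pg <- read_html("https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield")

rates <- html_node(pg, "table.t-chart") %>%
  html_table() %>%
  # "10/01/18" -> 2018-10-01; the format string is an assumption
  # based on the printed output above
  mutate(Date = as.Date(Date, format = "%m/%d/%y"))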

Related

Grouping Frame Values

I have a dataset of ingredients for cookies. I'm trying to answer which group (A, B, C, etc) of cookies has the most sugar in them. The dataset is structured as follows:
group id mois prot fat chocolate sugar carb cal
1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
4 B 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
5 B 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
7 C 14082 31.21 20.97 41.34 4.71 1.58 1.77 4.63
8 C 14097 28.76 21.41 41.60 5.28 1.75 2.95 4.72
etc....
How can I plot the mean of each group to show that one of them has a higher average sugar content than the others? Or at least, how can I print the grouped averages of sugar to support my argument that one group has more sugar than the others?
After saving your text to a CSV file and loading it into R, it's pretty easy to obtain the mean sugar quantity per group, which I'm assuming is what you need.
You first group your data by the group variable and then summarize it with the mean() function.
library(dplyr)

(cookies <- df %>%
    group_by(group) %>%
    summarize(meanSugar = mean(sugar)))
group meanSugar
<chr> <dbl>
1 A 1.71
2 B 1.62
3 C 1.66
As you can see, group A has a slightly higher sugar content than the others, based on your data.
If you want to go a step further and plot this data, you can do that:
library(ggplot2)

cookies %>%
  ggplot(aes(x = meanSugar, y = reorder(group, meanSugar),
             fill = group, label = meanSugar)) +
  geom_col() +
  labs(y = "Cookie groups", x = "Mean sugar") +
  geom_label(hjust = 1.2, color = "white") +
  theme(legend.position = "none")
If you have any questions about any of these steps, let me know!
Note: please try to provide your data in a more easily reproducible format next time, so it's easier to reproduce what you need and give you a quick answer :)

Rolling 182-day average

I have a dataset with multiple sites and sampling years, with a score for every day of the year. For example, Site A has 40 years of data with a value for every day, and the sampling year is defined as Sampling.Year. To make it confusing, our sampling year runs July to June, so it takes the form 2016-2017.
For example:
SiteName Sampling.Year Date Score
A 2015-2016 1
A 2015-2016 5
A 2015-2016 2
A 2016-2017 3
A 2016-2017 12
A 2016-2017 6
B 2015-2016 9
B 2015-2016 2
B 2015-2016 1
B 2016-2017 4
B 2016-2017 1
B 2016-2017 7
I want to apply a rolling 182-day average across this data to find the maximum (182-day average) score for each site/Sampling.Year combination. The outcome would be, e.g.:
Site Sampling.Year MaxAve StartDate
A 2016-2017 7.5 01/10/2016
A 2017-2018 6.0 12/12/2017
B 2016-2017 2.3 13/11/2016
B 2017-2018 4.2 09/09/2017
I have saved a sample dataset here:
Sample data.
I want to use a loop (because I am a novice and I'm not sure of a better way) along the lines of the function below, but it's the grouping by site and year that I'm finding tricky. Ideally I'd like the moving average exported as a new data frame with the start and end date (or at least the start date) of each window, so we can check it against weather conditions at the time.
Moving_Average_Function <- function(arr, n = 182) {
  res <- arr
  for (i in n:length(arr)) {
    res[i] <- mean(arr[(i - n + 1):i])
  }
  res
}
Thanks in advance
If you are willing to use external libraries, you can use group_by() from the dplyr package and the roll_meanr() function from the RcppRoll package. RcppRoll has a set of fast, flexible functions for calculating moving averages.
I would also convert your DATE column to the Date class so it sorts properly.
library(dplyr)      # library(tidyverse) would load both dplyr and tidyr (among other related packages)
library(tidyr)
library(lubridate)
library(RcppRoll)

my_data <- data.table::fread("DailyScore.csv")   # easy way to load a data frame from file

my_data2 <- my_data %>%
  mutate(DATE = dmy(DATE)) %>%            # convert to Date class
  pivot_longer(H1:T2,
               names_to = "Sensor",
               values_to = "data") %>%    # convert column names to data
  group_by(STATION, Sensor) %>%           # so you don't average across sites
  arrange(STATION, DATE) %>%              # make sure rows are in order for the rolling mean;
                                          # the STATION argument isn't necessary, but helps for display
  mutate(Mean_182 = roll_meanr(data, 182)) %>%       # new column with the rolling mean
  pivot_wider(names_from = Sensor,
              values_from = c(data, Mean_182))       # back to the original "wide" format

my_data2[180:195, ]
# # A tibble: 16 x 14
# # Groups: STATION [1]
# STATION SITENAME Sampling.Year DATE data_H1 data_I1 data_H2 data_P2 data_T2 Mean_182_H1
# <chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Site A Site A 1979-1980 1980-06-28 2.85 1.06e- 9 2.10 0.762 2.85 NA
# 2 Site A Site A 1979-1980 1980-06-29 2.79 1.62e-12 2.06 0.744 2.79 NA
# 3 Site A Site A 1979-1980 1980-06-30 2.75 1.00e-11 2.04 0.732 2.75 2.70
# 4 Site A Site A 1980-1981 1980-07-01 2.72 1.00e-11 2.01 0.724 2.72 2.71
# 5 Site A Site A 1980-1981 1980-07-02 2.70 1.00e-11 2.00 0.720 2.70 2.73
# 6 Site A Site A 1980-1981 1980-07-03 2.68 1.00e-11 1.98 0.718 2.68 2.74
# 7 Site A Site A 1980-1981 1980-07-04 2.67 1.00e-11 1.97 0.719 2.67 2.75
# 8 Site A Site A 1980-1981 1980-07-05 2.65 1.11e- 9 1.95 0.708 2.65 2.76
# 9 Site A Site A 1980-1981 1980-07-06 2.62 2.77e-10 1.93 0.703 2.62 2.76
# 10 Site A Site A 1980-1981 1980-07-07 2.60 3.18e-12 1.92 0.700 2.60 2.77
# 11 Site A Site A 1980-1981 1980-07-08 2.59 1.00e-11 1.90 0.701 2.59 2.79
# 12 Site A Site A 1980-1981 1980-07-09 2.59 1.00e-11 1.89 0.706 2.59 2.80
# 13 Site A Site A 1980-1981 1980-07-10 2.59 1.00e-11 1.89 0.713 2.59 2.81
# 14 Site A Site A 1980-1981 1980-07-11 2.59 1.00e-11 1.88 0.722 2.59 2.82
# 15 Site A Site A 1980-1981 1980-07-12 2.60 1.00e-11 1.88 0.731 2.60 2.83
# 16 Site A Site A 1980-1981 1980-07-13 2.60 1.00e-11 1.87 0.741 2.60 2.84
# # ... with 4 more variables: Mean_182_I1 <dbl>, Mean_182_H2 <dbl>, Mean_182_P2 <dbl>, Mean_182_T2 <dbl>
A couple of things to keep in mind, which will affect how you set this up:
In general, rolling averages return NA when they don't yet have a complete window. So with a 182-day average, you'll get a series of 181 NAs before your first complete average.
You'll also want to decide how to handle the rollover between sampling years: with such a long averaging window, if you don't want to mix sampling years, you'll be left with about half of each year without a rolling value.
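A quick way to see this NA padding, using a toy vector and a window of 3 instead of 182 (a sketch with made-up data):
library(RcppRoll)

x <- 1:5
roll_meanr(x, n = 3)   # right-aligned rolling mean, same length as x
# [1] NA NA  2  3  4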
Using loops would be very inefficient for such operations. You can instead use dedicated functions that let you do this by group, with zoo::rollmean for the rolling mean.
library(dplyr)

DailyScore %>%
  group_by(SITENAME, Sampling.Year) %>%
  summarise(max_average = max(zoo::rollmean(Score, 182)))
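The question also asked for the start date of the best window. Since element i of zoo::rollmean(x, k) is the mean of x[i:(i + k - 1)], which.max() of the rolled vector indexes the first day of the best window. A sketch, assuming a Date column of consecutive daily dates within each site/year, as in the sample data:
library(dplyr)

DailyScore %>%
  group_by(SITENAME, Sampling.Year) %>%
  arrange(Date, .by_group = TRUE) %>%   # windows must be in date order
  summarise(
    max_average = max(zoo::rollmean(Score, 182)),
    # element i of the rolled vector covers rows i..(i + 181),
    # so this picks out the first day of the maximal window
    StartDate   = Date[which.max(zoo::rollmean(Score, 182))]
  )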

Scraping table using html_table in R

I want to scrape the Sector Weightings Table from the following link:
http://portfolios.morningstar.com/fund/summary?t=SPY&region=usa&culture=en-US&ownerCountry=USA
The table I want is the sixth table in the website's source code. I have the following script written in R:
library(rvest)

turl <- 'http://portfolios.morningstar.com/fund/summary?t=SPY'
turlr <- read_html(turl)

df6 <- html_table(html_nodes(turlr, 'table')[[6]], fill = TRUE)
However, when I run the last line of the script I get the following error message:
Error in out[j + k, ] : subscript out of bounds
Since the required table is designed in a different way, rvest is not able to parse it into a proper table. But using the XML package you can do it quite easily.
library(XML)
library(dplyr)

# read in all tables and keep the sixth one
turl <- 'http://portfolios.morningstar.com/fund/summary?t=SPY'
temp_table <- readHTMLTable(turl)[[6]]

# process the table into a readable format
final_table <- temp_table %>%
  select(V2, V3, V4, V5) %>%
  na.omit() %>%
  `colnames<-`(c("", "% Stocks", "Benchmark", "Category Avg")) %>%
  `rownames<-`(seq_len(nrow(.)))

final_table
Output is:
% Stocks Benchmark Category Avg
1 Cyclical
2 Basic Materials 2.79 3.16 3.22
3 Consumer Cyclical 11.06 11.42 11.15
4 Financial Services 16.39 16.50 17.22
5 Real Estate 2.24 3.18 2.00
6 Sensitive
7 Communication Services 3.56 3.37 3.50
8 Energy 5.83 5.79 5.79
9 Industrials 10.37 10.89 11.70
10 Technology 22.16 21.41 19.72
11 Defensive
12 Consumer Defensive 8.20 7.60 8.56
13 Healthcare 14.24 13.57 14.57
14 Utilities 3.15 3.11 2.59
Hope it helps!

remove duplicate two criterion interval R

I am working on cleaning and processing data with R. I would like to remove duplicates from a matrix; see the example below.
I would like to remove duplicates according to two criteria, and if possible using an interval: if both RT ± 0.1 and m.z ± 0.001 match an earlier row in the table, remove the extra row.
RT m.z
1 2.02 326.1988
2 2.03 326.1989
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7 2.04 301.2852
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929
I would like an output like this:
RT m.z
1 2.02 326.1988
2
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929
Any help would be much appreciated. Thanks in advance.
This is a way to do it with dplyr. Not sure if it's the most efficient way.
df <- read.table(textConnection("RT m.z
1 2.02 326.1988
2 2.03 326.1989
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7 2.04 301.2852
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929"))
Now with the same data you provided.
library(dplyr)

# Calculate the difference in RT and m.z between consecutive rows
# and take absolute differences, which we filter on further down the chain
df %>%
  mutate(rtdiff = abs(lag(RT) - RT),
         mzdiff = abs(lag(m.z) - m.z)) %>%
  # Replace the NAs in the first row with large values
  # so that filter() does not have to deal with NAs
  mutate(rtdiff = replace(rtdiff, is.na(rtdiff), 999),
         mzdiff = replace(mzdiff, is.na(mzdiff), 999)) %>%
  # Remove the rows that don't meet your condition
  filter(!(rtdiff < 0.02 & mzdiff < 0.0002)) %>%
  # Keep only the columns you need
  select(RT, m.z)
giving us:
RT m.z
1 2.02 326.1988
2 2.06 326.1990
3 2.03 331.1533
4 2.03 375.1785
5 2.03 301.2852
6 2.06 301.2852
7 2.07 357.2609
8 2.07 308.0327
9 2.08 218.2221
10 2.08 312.3617
11 2.10 473.3453
12 2.15 388.3929
Hi, it seems I have intercalated values between my replicates, so I propose a small change to Maiasaura's code:
for (i in 1:100) {
  reduced.list.pre.filtering <- reduced.list.pre.filtering %>%
    mutate(rtdiff = abs(lag(RT..min., i) - RT..min.),
           mzdiff = abs(lag(Max..m.z, i) - Max..m.z)) %>%
    mutate(rtdiff = replace(rtdiff, is.na(rtdiff), 999),
           mzdiff = replace(mzdiff, is.na(mzdiff), 999)) %>%
    filter(!(rtdiff < setRT & mzdiff < setmz)) %>%
    select(RT..min., Max..m.z)
}
That way each row is checked against the 100 rows that precede it. I hope it helps somebody else. Do not hesitate if you have a better solution.
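If the tolerances happen to line up with decimal places, a loop-free alternative is to bin both columns by rounding and keep the first row per bin. This is only an approximation of a true interval test (two values less than 0.1 apart can still land in different bins), so treat it as a sketch using the column names from the original question:
library(dplyr)

df %>%
  mutate(RT_bin  = round(RT, 1),        # roughly the +/- 0.1 RT tolerance
         m.z_bin = round(m.z, 3)) %>%   # roughly the +/- 0.001 m.z tolerance
  distinct(RT_bin, m.z_bin, .keep_all = TRUE) %>%   # keep first row per bin
  select(RT, m.z)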

Using the ddply command on a subset of data

I'm having some issues using the ddply command from the plyr package. I created a data frame which looks like this one:
u v intensity season
24986 -1.97 -0.35 2.0 1
24987 -1.29 -1.53 2.0 1
24988 -0.94 -0.34 1.0 1
24989 -1.03 2.82 3.0 1
24990 1.37 3.76 4.0 1
24991 1.93 2.30 3.0 2
24992 3.83 -3.21 5.0 2
24993 0.52 -2.95 3.0 2
24994 3.06 -2.57 4.0 2
24995 2.57 -3.06 4.0 2
24996 0.34 -0.94 1.0 2
24997 0.87 4.92 5.0 3
24998 0.69 3.94 4.0 3
24999 4.60 3.86 6.0 3
I tried to use the function cumsum on the u and v values, but I don't get what I want. When I select a subset of my data, corresponding to a season, for example :
x <- cumsum(mydata$u[56297:56704]*10.8)
y <- cumsum(mydata$v[56297:56704]*10.8)
...this works perfectly. The thing is that I have a huge dataset (67,208 rows) with 92 seasons, and I'd like to make this function work on subsets of the data. So I tried this:
new <- ddply(mydata, .(mydata$season), summarize, x = c(0, cumsum(mydata$u * 10.8)))
...and the result looks like this:
24986 1 NA
24987 1 NA
24988 1 NA
I found some questions related to this one on Stack Overflow and other websites, but none of them helped me deal with my problem. If someone has an idea, you're welcome ;)
Don't use your data frame's name inside the plyr call; just reference the column name as though it were already defined:
ddply(mydata, .(season), summarise, x = c(0, cumsum(u * 10.8)))
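For reference, a rough dplyr equivalent of the same per-season cumulative sums (a sketch; unlike the c(0, ...) in the ddply call, mutate() keeps exactly one row per original observation):
library(dplyr)

mydata %>%
  group_by(season) %>%                 # one cumulative trace per season
  mutate(x = cumsum(u * 10.8),
         y = cumsum(v * 10.8)) %>%
  ungroup()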
