Webscraping Returns xml_nodeset 0 in R - r

I am trying to scrape the table at the bottom of this page: https://web.tmxmoney.com/earnings.php?qm_symbol=DOL
I have tried countless CSS selectors and XPath expressions but still get {xml_nodeset(0)}. I am looking for an intuitive answer rather than just the code.
Here are a few of my attempts:
library(httr)
library(rvest)
library(dplyr)
tbl = read_html('https://web.tmxmoney.com/earnings.php?qm_symbol=DOL')%>%
html_nodes("table")%>%.[2]%>%html_table(fill = T)#no luck
tbl = read_html('https://web.tmxmoney.com/earnings.php?qm_symbol=DOL')%>%
html_nodes(xpath = '//*[@id="DataTables_Table_0"]')%>%html_table(fill = T)#node set(0)
I have tried countless others, using SelectorGadget and inspecting the page source.

I didn't check the complete Terms of Service, so please be aware that scraping might not be legal.
The following should do the trick:
library(rvest)
library(data.table)
library(httr)
library(XML)
library(RSelenium)
mybrowser <- rsDriver(browser = 'firefox')
link <- "https://web.tmxmoney.com/earnings.php?qm_symbol=DOL"
mybrowser$client$navigate(link)
mybrowser$client$findElement(using = 'css selector', "#DataTables_Table_0")$getElementText()
html.table <- mybrowser$client$findElement(using = 'css selector', "#DataTables_Table_0")
webElem5txt <- html.table$getElementAttribute("outerHTML")[[1]]
df.table <- read_html(webElem5txt) %>% html_table() %>% data.frame(.)
mybrowser$server$stop()
# Excerpt of the data:
> df.table
Var.1 Quarter.End X..EPS.Actual X..EPS.Estimate X..Estimates X..Surprise X..Surprise.1 Date
1 NA 2019-07-31 (Q2 2020) 0.45 0.47 3 -0.02 -4.26% 2019-09-12
2 NA 2019-04-30 (Q1 2020) 0.33 0.33 4 0.00 0.00% 2019-06-13
3 NA 2019-01-31 (Q4 2019) 0.54 0.55 4 -0.01 -1.82% 2019-03-28
4 NA 2018-10-31 (Q3 2019) 0.41 0.42 3 -0.01 -2.38% 2018-12-06
5 NA 2018-07-31 (Q2 2019) 0.43 0.44 3 -0.01 -2.27% 2018-09-13
6 NA 2018-04-30 (Q1 2019) 0.31 0.31 3 0.00 0.00% 2018-06-07
7 NA 2018-01-31 (Q4 2018) 0.48 0.47 4 0.01 2.13% 2018-03-29
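As for the intuition the question asks for: read_html() only parses the HTML text the server sends and never runs JavaScript. The id DataTables_Table_0 suggests the table is built in the browser by a script, which would explain the empty node set and why a real browser (RSelenium) helps. The following is a minimal sketch of that effect; the page source in it is made up for illustration and is not the real markup of the site.

```r
library(rvest)

# Hypothetical page source: the table is NOT part of the HTML itself;
# in a real browser a script would create it after the page loads.
static_source <- '
<html><body>
  <div id="earnings"></div>
  <script>/* JavaScript would insert the table here */</script>
</body></html>'

page <- read_html(static_source)

# rvest parses the text as-is and does not execute the script,
# so the selector finds nothing
found <- html_nodes(page, "#DataTables_Table_0")
length(found)  # 0, i.e. an empty node set
```

If the same check on the downloaded page also returns 0 while the browser's inspector shows the table, the content is most likely rendered client-side.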

lm function/coefficients, for specific different time windows/events in one data frame R

I am conducting an event study with the market model: AR(i,t) = R(i,t) - (alpha(i) + beta(i)*R(m,t)). I am struggling to calculate the alpha (intercept) and beta (slope) estimators because of data format and filtering. This is what my data looks like at the moment:
Date ISIN R STOXX_Return Rating_Change Rating
9 2016-10-01 CH00 0.0175315633 -0.0003749766 0.00 A
10 2016-11-01 CH00 -0.0733760706 -0.0220566972 0.00 A
11 2016-12-01 CH00 -0.0107454123 0.0182991778 0.00 A
12 2017-01-01 CH00 0.0457420548 0.0641541456 1.90 A
...
21 2017-10-01 CH00 0.0250723834 0.0374169332 0.00 A
22 2017-11-01 CH00 -0.0780495570 0.0179348620 0.00 A
23 2017-12-01 CH00 0.0688209502 -0.0312226700 0.00 A
24 2018-01-01 CH00 -0.0064684781 0.0144049186 -0.90 A
..
74 2017-01-01 GB00 0.0409336446 0.0641541456 0.00 B+
75 2017-02-01 GB00 0.0056671717 0.0006470779 0.00 B+
76 2017-03-01 GB00 0.0028145957 0.0364348490 0.00 B+
77 2017-04-01 GB00 0.0366417787 0.0144673074 3.66 B+
...
There is an "event" if Rating_Change is non-zero (lines 12, 24, 77).
What I need is to run a regression with the lm() function for only the pre-event windows (for instance lines 9:11, 21:23, 74:76, i.e. -3:-1 relative to the event).
However, there are several events per ISIN
meaning I have to group by ISIN and by the event (non-zero rating change) and
then do the regression with lm() (R ~ STOXX_Return) for each pre-event-window and each ISIN and
save the values in columns next to the event and pre-event-window.
I did not manage to do it in a conditional for-loop or with magrittr/dplyr (or anything google-able :-) ). Nothing really worked out; I just do not know how to manage the "double filter" for ISIN and event, followed by a regression and the output of the coefficients.
Does anyone have an approach to solve this?
Thank you very much in advance for any support, it is much appreciated!
Addition after response
I tried the following way:
PRE-WINDOW
filter_lmco_pre_tmp1 <- within(data, {
event_pre_window <- if_else(lag(Rating_Change!=0),1,0)
event_pre_window <- ave(event_pre_window, lag(ISIN), FUN=cumsum)
})
Date ISIN R STOXX_Return Rating_Change Rating event_pre_window
10 2016-11-01 CH00 -0.0733761 -0.0220567 0.00 A NA
11 2016-12-01 CH00 -0.0107454 0.0182992 0.00 A 0
12 2017-01-01 CH00 0.0457421 0.0641541 1.90 A 0
13 2017-02-01 CH00 0.0208479 0.0006471 0.00 A 1
14 2017-03-01 CH00 0.0351640 0.0364348 0.00 A 1
22 2017-11-01 CH00 -0.0780496 0.0179349 0.00 A 1
23 2017-12-01 CH00 0.0688210 -0.0312227 0.00 A 1
24 2018-01-01 CH00 -0.0064685 0.0144049 -0.90 A 1
POST-WINDOW
filter_lmco_post_tmp1 <- within(data, {
event_post_window <- if_else(Rating_Change !=0,1,0)
event_post_window <- ave(event_post_window, ISIN, FUN=cumsum)
})
Date ISIN R STOXX_Return Rating_Change Rating event_post_window
10 2016-11-01 CH00 -0.0733761 -0.0220567 0.00 A 0
11 2016-12-01 CH00 -0.0107454 0.0182992 0.00 A 0
12 2017-01-01 CH00 0.0457421 0.0641541 1.90 A 1
13 2017-02-01 CH00 0.0208479 0.0006471 0.00 A 1
14 2017-03-01 CH00 0.0351640 0.0364348 0.00 A 1
22 2017-11-01 CH00 -0.0780496 0.0179349 0.00 A 1
23 2017-12-01 CH00 0.0688210 -0.0312227 0.00 A 1
24 2018-01-01 CH00 -0.0064685 0.0144049 -0.90 A 2
25 2018-02-01 CH00 -0.0997418 0.0119439 0.00 A 2
You can see that if there is an event (line 12 or 24), the pre- and post-event IDs are not the same: the first event ID of the pre-window starts at 0 and that of the post-window at 1. This is because I lagged the pre-events. However, if I do not lag, the actual event is not included in the pre-event window. So is there a way to get an "ID" for both the pre- and post-window so that they can be matched afterwards?
Consider assigning columns for the pre-windows with cumsum, ifelse and ave. Then call by (or split) to create a list of data frames for each ISIN and event pre-window. Finally, use tail to retrieve last 3 rows and pass into your modeling function. All handled in base R!
# CREATE EVENT PRE-WINDOW COLUMN
my_df <- within(my_df, {
event_pre_window <- ifelse(Rating_Change != 0, 1, 0)
event_pre_window <- ave(event_pre_window, ISIN, FUN=cumsum)
event_pre_window <- ifelse(Rating_Change != 0, event_pre_window-1, event_pre_window)
})
# DEFINE FUNCTION TO PROCESS SINGLE DATAFRAME
my_lm_model <- function(df) {
# ... code to run lm and return results on each pre-window
}
# SPLIT DF BY ISIN AND PRE-WINDOWS
# CALL ABOVE FUNCTION ON EACH TAILED SUBSET
pre_windows_lm_results_list <- by(
my_df,
my_df[c("ISIN", "event_pre_window")],
function(sub) my_lm_model(tail(sub, 3))
)
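The body of my_lm_model() is left open above. One possible way to fill it in is sketched below, assuming the column names R, STOXX_Return and ISIN from the question; the returned column names (alpha, beta, n_obs) are made up for this illustration.

```r
# Hypothetical body for my_lm_model(): fit the market model on one pre-window
my_lm_model <- function(df) {
  fit <- lm(R ~ STOXX_Return, data = df)
  data.frame(
    ISIN  = df$ISIN[1],
    alpha = unname(coef(fit)[1]),  # intercept
    beta  = unname(coef(fit)[2]),  # slope on the market return
    n_obs = nrow(df)               # number of rows used in the fit
  )
}

# Toy pre-window built from rows 9 to 11 of the question's data
toy <- data.frame(
  ISIN = "CH00",
  R = c(0.0175315633, -0.0733760706, -0.0107454123),
  STOXX_Return = c(-0.0003749766, -0.0220566972, 0.0182991778)
)
res <- my_lm_model(toy)
res
```

Because each call returns a one-row data frame, the list produced by by() can usually be stacked with do.call(rbind, pre_windows_lm_results_list) and then merged back onto the event rows by ISIN.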

How to find a value by date from a list with xts

I have a list of xts objects:
list
$`XX`
return
2018-01-31 2.16
2018-02-28 2.06
2018-03-31 2.12
2018-04-30 2.41
2018-05-31 2.07
$`YY`
return
2018-01-31 1.12
2018-02-28 0.06
2018-03-31 3.12
$`ZZ`
return
2018-01-31 3.15
2018-02-28 1.03
2018-03-31 0.11
2018-04-30 1.42
2018-05-31 2.04
I need to make a matrix like this
m_2018_05_31
return
[1,] 2.07
[2,] NA
[3,] 2.04
I used this and got an error because there is no value for that date in YY:
m_2018_05_31 <- matrix(1:3)
for(t in 1:3) {
m_2018_05_31[t,]<-list[[t]]$return["2018-05-31"]
}
Here is another option leveraging the merge capability of xts:
d <- "2018-05-31"
do.call(rbind, lapply(list(X, Y), function(x) merge(x, as.Date(d), fill=NA)[d]))
output:
x
2018-05-31 NA
2018-05-31 2.07
data:
X=as.xts(read.zoo(text="
2018-01-31 2.16
2018-02-28 2.06"))
Y=as.xts(read.zoo(text="
2018-03-31 2.12
2018-04-30 2.41
2018-05-31 2.07"))
If you have a list of xts objects called data, you can use the coredata and index functions.
library(zoo)
sapply(data, function(x) {
inds <- index(x) == as.Date('2018-05-31')
if(any(inds)) coredata(x)[inds] else NA
})
# XX YY ZZ
#2.07 NA 2.04
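To get the one-column matrix the question asks for, the same lookup can be wrapped in a small helper. This is only a sketch: the list below is a shortened toy version of the question's data, and value_on() is a hypothetical helper name. The original loop presumably fails because selecting a missing date yields an empty object, which cannot be assigned into a matrix cell, so the helper returns NA explicitly in that case.

```r
library(xts)

# Toy list of xts objects mirroring the question (YY has no May value)
data <- list(
  XX = xts(c(2.41, 2.07), order.by = as.Date(c("2018-04-30", "2018-05-31"))),
  YY = xts(c(0.06, 3.12), order.by = as.Date(c("2018-02-28", "2018-03-31"))),
  ZZ = xts(c(1.42, 2.04), order.by = as.Date(c("2018-04-30", "2018-05-31")))
)

target <- as.Date("2018-05-31")

# Return the value on the target date, or NA when the date is absent
value_on <- function(x, d) {
  hit <- index(x) == d
  if (any(hit)) as.numeric(coredata(x)[hit][1]) else NA_real_
}

vals <- vapply(data, value_on, numeric(1), d = target)

# One row per list element, one column named "return"
m_2018_05_31 <- matrix(vals, ncol = 1, dimnames = list(names(data), "return"))
m_2018_05_31
```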

How to combine 2 rows in 1

I'm trying to reshape a data frame but I'm totally lost in how to proceed:
> test
# Time Entry Order Size Price S / L T / P Profit Balance
1 1 2017-01-11 00:00:00 buy 1 0.16 1.05403 1.0449 1.07838 NA NA
2 3 2017-01-24 16:00:00 s/l 1 0.16 1.04490 1.0449 1.07838 -97.28 9902.72
As you can see, we have 2 (or more) records for one order ID. What I want to do is combine those 2 rows into one by adding several new columns: Exit (where the "s/l" entry of the second observation should go) and Exit Price (where the Price value of the second entry should go), and by replacing the NA values in the Profit and Balance columns of the first entry with the data from the second one.
By the way, the original name of the Entry column is "Type" but I already changed that, so that's why it doesn't make that much sense of having the exit reason of the trade on a column called "Entry". So far I've only thought of extracting the data on several vectors and then just do a mutate on the first entry and dropping the second one, but I'm quite sure there's a better way of doing that. Also, that stone-age approach would be useless when applied to the whole data frame.
If possible, I'd like to stick to the tidyverse library to do this just for ease of replication. Thank you in advance for your suggestions!
I ended up sorting it out! My solution was to split the data frame in 2, reshape each half as needed, and then full joining them. Here's the initial data frame:
> head(backtest_table, n = 10)
# Time Type Order Size Price S / L T / P Profit Balance
1 1 2017.01.11 00:00 buy 1 0.16 1.05403 1.04490 1.07838 NA NA
2 2 2017.01.19 00:00 buy 2 0.16 1.05376 1.04480 1.07764 NA NA
3 3 2017.01.24 16:00 s/l 1 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
4 4 2017.01.24 16:00 s/l 2 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
5 5 2017.02.09 00:00 buy 3 0.15 1.05218 1.04265 1.07758 NA NA
6 6 2017.03.03 16:00 t/p 3 0.15 1.07758 1.04265 1.07758 251.75 10058.99
7 7 2017.03.29 00:00 buy 4 0.15 1.08826 1.07859 1.11405 NA NA
8 8 2017.04.04 00:00 close 4 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
9 9 2017.04.04 00:00 sell 5 0.15 1.08416 1.09421 1.05737 NA NA
10 10 2017.04.07 00:00 sell 6 0.15 1.08250 1.09199 1.05718 NA NA
Here's the code I used to modify everything:
# Re-format data
library(dplyr)     # filter(), full_join(), %>%
library(lubridate) # ymd_hm()
# Separate entries and exits
entries <- backtest_table %>% filter(Type %in% c("buy", "sell"))
exits <- backtest_table %>% filter(!Type %in% c("buy", "sell"))
# Reshape entries and exits
# Entries
entries <- entries[-c(1, 9, 10)]
colnames(entries) <- c("Entry time", "Entry type", "Order", "Entry volume",
"Entry price", "Entry SL", "Entry TP")
entries$`Entry time` <- entries$`Entry time` %>% ymd_hm()
entries$`Entry type` <- as.factor(entries$`Entry type`)
# Exits
exits <- exits[-1]
colnames(exits) <- c("Exit time", "Exit type", "Order", "Exit volume",
"Exit price", "Exit SL", "Exit TP", "Profit", "Balance")
exits$`Exit time` <- exits$`Exit time` %>% ymd_hm()
exits$`Exit type` <- as.factor(exits$`Exit type`)
# Join re-shaped data
test <- full_join(entries, exits, by = c("Order"))
And here's the output of that:
> head(test, n = 10)
Entry time Entry type Order Entry volume Entry price Entry SL Entry TP Exit time
1 2017-01-11 buy 1 0.16 1.05403 1.04490 1.07838 2017-01-24 16:00:00
2 2017-01-19 buy 2 0.16 1.05376 1.04480 1.07764 2017-01-24 16:00:00
3 2017-02-09 buy 3 0.15 1.05218 1.04265 1.07758 2017-03-03 16:00:00
4 2017-03-29 buy 4 0.15 1.08826 1.07859 1.11405 2017-04-04 00:00:00
5 2017-04-04 sell 5 0.15 1.08416 1.09421 1.05737 2017-05-26 10:00:00
6 2017-04-07 sell 6 0.15 1.08250 1.09199 1.05718 2017-05-01 09:20:00
7 2017-04-19 sell 7 0.15 1.07334 1.08309 1.04733 2017-04-25 10:00:00
8 2017-05-05 sell 8 0.14 1.07769 1.08773 1.05093 2017-05-29 14:00:00
9 2017-05-24 sell 9 0.14 1.06673 1.07749 1.03803 2017-06-22 18:00:00
10 2017-06-14 sell 10 0.14 1.04362 1.05439 1.01489 2017-06-15 06:40:00
Exit type Exit volume Exit price Exit SL Exit TP Profit Balance
1 s/l 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
2 s/l 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
3 t/p 0.15 1.07758 1.04265 1.07758 251.75 10058.99
4 close 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
5 t/p 0.15 1.05737 1.09421 1.05737 265.58 10091.18
6 s/l 0.15 1.09199 1.09199 1.05718 -94.79 9825.60
7 s/l 0.15 1.08309 1.08309 1.04733 -97.36 9920.39
8 t/p 0.14 1.05093 1.08773 1.05093 247.61 10338.79
9 t/p 0.14 1.03803 1.07749 1.03803 265.59 10504.05
10 s/l 0.14 1.05439 1.05439 1.01489 -100.33 10238.46
And that combined the observations where a trade was added that showed NAs on the last columns with the observations where that trade was closed, populating the last columns with the actual result and new account balance!
If someone has suggestions on how to improve the system please let me know!
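One possible simplification, offered as a sketch rather than a tested replacement for the code above: let the join do the renaming through its suffix argument instead of setting the column names by hand. The toy data frame below only keeps a few of the original columns, and the _entry/_exit suffixes are chosen for illustration.

```r
library(dplyr)

# Toy version of the backtest table with one order (entry row + exit row)
trades <- data.frame(
  Time    = c("2017.01.11 00:00", "2017.01.24 16:00"),
  Type    = c("buy", "s/l"),
  Order   = c(1, 1),
  Price   = c(1.05403, 1.04490),
  Profit  = c(NA, -97.28),
  Balance = c(NA, 9902.72)
)

# Entries carry no result yet, so drop their empty Profit/Balance columns
entries <- trades %>%
  filter(Type %in% c("buy", "sell")) %>%
  select(-Profit, -Balance)
exits <- trades %>%
  filter(!Type %in% c("buy", "sell"))

# Columns present in both halves get the suffixes; Profit and Balance
# exist only in exits and keep their names
combined <- left_join(entries, exits, by = "Order",
                      suffix = c("_entry", "_exit"))
combined
```

A left join keeps trades that are still open (no exit row yet) with NA in the exit columns.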

How to use filter function to pick out specific dates?

I cannot filter my data set for a specific date.
padf_12 <- read.table("Paddus_2012_sheet2.csv", head = TRUE, sep=";" )
tibble(padf_12)
padf_12 <- padf_12 %>%
mutate_at(vars(Block, Treatment), factor) %>%
mutate("Date"=dmy(Date, tz = "UTC"))
I have tried different ways to filter the data e.g.
padf_12 <- padf_12 %>%
filter_at(padf_12, vars(Date == "2012-08-14"))
and
padf_12 <- padf_12 %>%
filter(padf_12, Treatment == "2012-08-14")
Because of the errors, I also tried converting my dates to POSIXct:
padf_12 <- padf_12 %>%
mutate(as.POSIXct(padf_12$Date, tz="", format="%Y-%m-%d"))
My data looks like this:
Sample Date Time Plot Ch..Vol..L. Plot..old. iButton Treatment Block X2.methylbutyl.acetat X3.hexenyl.acetate
1 31-K1 20120522 2012-05-22 14:01:00 C1 13 K1 2198C9 C 1 0.00 0.02
2 32-K1 20120613 2012-06-13 10:19:00 C1 13 K1 2198C9 C 1 0.00 0.00
3 33-K1 20120626 2012-06-26 12:19:00 C1 13 K1 21980 C 1 0.00 0.00
4 34-K1 20120715 2012-07-15 12:15:00 C1 13 K1 2198CD C 1 0.00 0.02
5 35-K1 20120814 2012-08-14 C1 13 K1 C 1 0.00 2.34
6 36-K2 20120522 2012-05-22 15:12:00 C2 13 K2 2198C9 C 2 0.01 0.04
And here's a link to the full data set:
https://www.dropbox.com/s/m4qfrdagqxvdxnh/Filtering%20problem.R?dl=0
Any help is much appreciated.
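Removing the data object from the function call, as suggested by TinglTanglBob, did the trick! The pipe already passes padf_12 as the first argument, so it must not be repeated inside filter().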
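For reference, a minimal sketch of a working filter call. The toy data frame only borrows a few values from the sample rows in the question, and it assumes the Date column has already been parsed to class Date (the question's own code produces a date-time instead).

```r
library(dplyr)

# Toy stand-in for padf_12 with an already parsed Date column
padf_12 <- data.frame(
  Sample    = c("31-K1", "32-K1", "35-K1"),
  Date      = as.Date(c("2012-05-22", "2012-06-13", "2012-08-14")),
  Treatment = factor(c("C", "C", "C"))
)

# The data frame comes in through the pipe, so filter() only gets the condition
aug_14 <- padf_12 %>%
  filter(Date == as.Date("2012-08-14"))
aug_14
```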

Store data according to starting date

I am working with multiple csv files stored in a folder (2488 files). Each csv contains monthly data from a gauging station in a years * months layout. I want to store all these csv files in one big data frame in which the columns are the IDs of the different gauging stations and the rows are the dates (%Y-%m-%d).
For this purpose, I have listed all the files with list.files:
a <- list.files(pattern="./*.csv",full.names=TRUE)
And create a dummy data frame with the final dimensions:
gst <- as.data.frame(matrix(NA, nrow = 1860, ncol = 2488)) # 1860 times - 2488 stations
Each csv file starts on a different date. The earliest date found is January 1863, so I created the data frame with 1860 rows from that starting date to June 2017.
I create a sequence of dates to name the rows of gst:
s <- paste0(1863,"-01-01")
e <- paste0(2017,"-12-31")
ss<- chron(s, format='y-m-d')
ee<- chron(e, format='y-m-d')
dates <- seq.dates(ss,ee,by='months')
In the following loop, I read each csv file. First I change the initial data frame format: years * months + total column.
# Initial format
Jan Feb Mar Apr
1993 NA 0.05 0.05 0.06
1994 0.18 0.15 0.1 0.19
1995 0.22 0.23 0.26 0.11
1996 0.14 0.11 0.1 0.08
1997 0.12 0.16 0.07 0.05
1998 0.12 0.07 0.12 0.18
1999 0.07 0.32 0.14 0.15
2000 0.13 0.22 0.15 0.1
2001 0.18 0.09 0.5 0.26
to a single column data frame (kk.df) with data stored as:
Date Value
93-01-01 NA
93-02-01 0.05
93-03-01 0.05
93-04-01 0.06
93-05-01 0.05
93-06-01 0.05
93-07-01 0.03
93-08-01 0.03
93-09-01 0.05
93-10-01 0.09
93-11-01 0.04
93-12-01 0.10
This is the loop I am working with:
for (i in 1:length(a)){
kk <- read.csv(a[i])
colnames(kk) <- c(seq(1,12,1),'total') # 12 (months) columns and a total column
kk.ts <- ts(as.vector(t(as.matrix(kk))),
start=as.numeric(c(rownames(kk)[1],1)), end= as.numeric(c(rownames(kk)[dim(kk)[1]],12)),frequency=12)
kk.df <- as.data.frame(kk.ts)
colnames(kk.df) <- a[[i]]
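first <- paste0(start(kk.ts)[1],"-01-01") # first year of this station; avoids overwriting the file list a
last <- paste0(end(kk.ts)[1],"-12-31") # last year of this station
ac<- chron(first, format='y-m-d')
bc<- chron(last, format='y-m-d')
times <- seq.dates(ac,bc, by="months")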
rownames(kk.df) <- times
gst[i,] <- kk.df
}
My question is: since I am storing one column per gauging station (2488 stations) and each station starts at a different year-month, how can I specify, when storing station i, the row at which its data must start?
If i = 1 and the first record is in 1993-01-01, I want that column to start at the row of gst that corresponds to 1993-01-01, and so on for the rest of the stations.
Thank you so much.
Maybe you could incorporate a join in your for loop for that:
library(dplyr)
df = data.frame(date = seq(Sys.Date(),Sys.Date()+3,by=1))
station1 = data.frame(date = seq(Sys.Date()+2,Sys.Date()+3,by=1),data = c(1,2))
station2 = data.frame(date = seq(Sys.Date()+1,Sys.Date()+2,by=1),data = c(2,3))
df = df %>% left_join(station1) %>% rename(station1=data)
df = df %>% left_join(station2) %>% rename(station2=data)
Input:
> df
date
1 2017-07-17
2 2017-07-18
3 2017-07-19
4 2017-07-20
> station1
date data
1 2017-07-19 1
2 2017-07-20 2
> station2
date data
1 2017-07-18 2
2 2017-07-19 3
Output:
> df
date station1 station2
1 2017-07-17 NA NA
2 2017-07-18 NA 2
3 2017-07-19 1 3
4 2017-07-20 2 NA
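The same alignment can be done in a loop without a join. The sketch below uses base R match() in place of left_join(), which is a different technique from the answer above; the station names and values are invented, and the calendar is shortened to six months.

```r
# Full monthly calendar, one row per month (shortened for the example)
gst <- data.frame(
  date = seq(as.Date("1993-01-01"), as.Date("1993-06-01"), by = "month")
)

# Hypothetical stations whose records start in different months
stations <- list(
  st_A = data.frame(
    date  = seq(as.Date("1993-01-01"), by = "month", length.out = 3),
    value = c(NA, 0.05, 0.05)
  ),
  st_B = data.frame(
    date  = seq(as.Date("1993-04-01"), by = "month", length.out = 2),
    value = c(0.18, 0.15)
  )
)

for (id in names(stations)) {
  st <- stations[[id]]
  # match() looks up each calendar date in the station's dates, so the
  # series lands on its own rows and every other row stays NA
  gst[[id]] <- st$value[match(gst$date, st$date)]
}
gst
```

In the real loop, stations would be replaced by the data frame read from each csv, with its dates converted to the same class as gst$date.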
