Scraping Financial Tables From a Web Page with R, rvest, RCurl

I'm trying to parse financial tables from a web page. This is how far I got, but I am not able to arrange the result into a list or data.frame:
library(rvest)
link <- "http://www.marketwatch.com/investing/stock/garan/financials/balance-sheet/quarter"
read <- read_html(link)
prs <- html_nodes(read, ".financials")  # the financials table node
irre <- html_text(prs)                  # flatten it to one text blob
re <- strsplit(irre, split = "\r\n")    # split on line breaks
re is something like this:
[27] "Assets"
[28] ""
[29] " "
[30] " "
[31] " All values TRY millions."
[32] " 31-Dec-201431-Mar-201530-Jun-201530-Sep-201531-Dec-2015"
[33] " 5-qtr trend"
[34] " "
[35] " "
[36] " "
[37] " "
[38] " Total Cash & Due from Banks"
[39] " 27.26B26.27B26.7B34.51B27.9B"
[40] " "
[41] " "
bla bla...
How can I turn this list into a data.frame laid out like the table on that page?

Try
library(XML)
theurl <- "http://www.marketwatch.com/investing/stock/garan/financials/balance-sheet/quarter"
re <- readHTMLTable(theurl)
The result is a list containing two data frames.
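From there you can inspect the list and extract each table (a minimal sketch; the name balance_sheet is illustrative, and which element holds which table depends on the page layout):
str(re, max.level = 1)    # shows the two data frames
balance_sheet <- re[[1]]  # first table on the page
head(balance_sheet)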

How to read a matrix in R with set size

I have a matrix, saved as a file (no extension) looking like this:
Peter Westons NH 54 RTcoef level B matrix from L70 Covstats.
2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01
7.41276375E-01
2.27966995E+00 2.31885465E+00 1.53558372E+00 4.87789344E-01 2.90254400E-01
2.56963125E-01
1.68120147E+00 1.53558372E+00 1.26129096E+00 8.18048022E-01 5.66120186E-01
3.23866166E-01
9.88238464E-01 4.87789344E-01 8.18048022E-01 1.38558423E+00 1.21272607E+00
7.20283781E-01
8.38279026E-01 2.90254400E-01 5.66120186E-01 1.21272607E+00 1.65314082E+00
1.35926028E+00
7.41276375E-01 2.56963125E-01 3.23866166E-01 7.20283781E-01 1.35926028E+00
1.74777330E+00
How do I go about reading this in as a fixed 6*6 matrix, skipping the header line? I don't see any option for the number of columns in read.matrix, and when I tried the scan() -> matrix() route I couldn't read the file in because the skip parameter in scan() didn't seem to work for me. I feel there must be a simple way to do this.
My original file is larger: in the same structure, each matrix row spans 17 full lines of 5 elements plus 1 line with a single element. Here is an example of what needs to end up in one row:
[1] " 2.61949322E+00 2.27966995E+00 1.68120147E+00 9.88238464E-01 8.38279026E-01"
[2] " 7.41276375E-01 5.23588785E-01 1.09559244E-01 -9.58430529E-02 -3.24544839E-02"
[3] " 1.96694874E-02 3.39249911E-02 1.54438478E-02 2.38380549E-03 9.59475077E-03"
[4] " 8.02748175E-03 1.63922615E-02 4.51778592E-04 -1.32080759E-02 -2.06313988E-02"
[5] " -1.56037533E-02 -3.35496588E-03 -4.22450803E-03 -3.17468525E-03 3.23012615E-03"
[6] " -8.68914773E-03 -5.94151619E-03 2.34059840E-04 -2.76737270E-03 -4.90334584E-03"
[7] " 1.53812087E-04 5.69891977E-03 5.33816835E-03 3.32982333E-03 -2.62856968E-03"
[8] " -5.15188677E-03 -4.47782553E-03 -5.49510247E-03 -3.71780229E-03 9.80192203E-04"
[9] " 4.18101180E-03 5.47513662E-03 4.14679058E-03 -2.81461574E-03 -4.67580613E-03"
[10] " 3.41841523E-04 4.07771227E-03 7.06154094E-03 6.61650765E-03 5.97925136E-03"
[11] " 3.92987162E-03 1.72895946E-03 -3.47249017E-03 9.90977857E-03 -2.36066909E-31"
[12] " -8.62803933E-32 -1.32472387E-31 -1.02360189E-32 -5.11800943E-33 -4.16409844E-33"
[13] " -5.11800943E-33 -2.52126889E-32 -2.52126889E-32 -4.16409844E-33 -4.16409844E-33"
[14] " -5.11800943E-33 -5.11800943E-33 -4.16409844E-33 -2.52126889E-32 -2.52126889E-32"
[15] " -2.52126889E-32 -1.58614773E-33 -1.58614773E-33 -2.55900472E-33 -1.26063444E-32"
[16] " -7.93073863E-34 -1.04102461E-33 -3.19875590E-34 -3.19875590E-34 -3.19875590E-34"
[17] " -2.60256152E-34 -1.30128076E-34 0.00000000E+00 1.78501287E-02 -1.14423068E-11"
[18] " 3.00625863E-02"
So the full matrix should be 86*86.
Thanks a bunch
Try this option:
Read the file with readLines, dropping the header line ([-1]).
Pair up consecutive lines, split the values on whitespace, and build one row of 6 values from each pair.
Combine the rows into one matrix with do.call(rbind, ..).
rows <- readLines('filename')[-1]
result <- do.call(rbind,
  tapply(rows, ceiling(seq_along(rows) / 2), function(x)
    strsplit(paste0(trimws(x), collapse = ' '), '\\s+')[[1]]))
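Note that result is a character matrix; wrap the values in as.numeric (or set mode(result) <- "numeric") if you need numbers, and for the full 86*86 file group every 18 lines (ceiling(seq_along(rows)/18)) rather than every 2. Alternatively, scan()'s skip argument should handle the header directly; a minimal sketch:
vals <- scan("filename", skip = 1)         # skip the header line, read all numbers
m <- matrix(vals, ncol = 6, byrow = TRUE)  # use ncol = 86 for the full file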

How can I write text to a logfile when a certain event has occurred within a loop in R?

I'm trying to get error rates for different parameter settings for a random forest (classification).
Since I use a loop and this takes considerable time, I would like to know how much time has passed up to a certain point. For this I would like to have a result saved to a logfile each time a certain event has occurred. The code looks like this:
library(randomForest)
ntree <- 1:1000
mtry <- 1:30
oob_NP <- data.frame()  # collects one row of results per fitted forest
set.seed(123)
for (j in mtry) {
  for (i in ntree) {
    rf1 <- randomForest(mymodel, mtry = j, ntree = i)  # mymodel is defined elsewhere
    result <- data.frame(mtry = j, ntree = i,
                         OOB = rf1[["err.rate"]][nrow(rf1[["err.rate"]]), "OOB"])
    oob_NP <- rbind(oob_NP, result)
  }
}
I would like to get a line in the log file for every hundredth model, i.e. show me the error rate result for
mtry=1, ntree=100
mtry=1, ntree=200
.
.
.
mtry=30, ntree=1000
Does anyone have an idea how to integrate this into the code?
This can be solved with sprintf to produce the log text lines and cat to write them to a connection.
logfile <- "Tacatico.log"
ntree <- 1:10
mtry <- 1:3
logfile_con <- file(logfile, open = "wt")
for (j in mtry) {
  for (i in ntree) {
    logtext <- sprintf("mtry=%d ntree=%d", j, i)
    cat(logtext, '\n', file = logfile_con)
  }
}
close(logfile_con)
Check what was written to the log file.
readLines(logfile)
# [1] "mtry=1 ntree=1 " "mtry=1 ntree=2 " "mtry=1 ntree=3 "
# [4] "mtry=1 ntree=4 " "mtry=1 ntree=5 " "mtry=1 ntree=6 "
# [7] "mtry=1 ntree=7 " "mtry=1 ntree=8 " "mtry=1 ntree=9 "
#[10] "mtry=1 ntree=10 " "mtry=2 ntree=1 " "mtry=2 ntree=2 "
#[13] "mtry=2 ntree=3 " "mtry=2 ntree=4 " "mtry=2 ntree=5 "
#[16] "mtry=2 ntree=6 " "mtry=2 ntree=7 " "mtry=2 ntree=8 "
#[19] "mtry=2 ntree=9 " "mtry=2 ntree=10 " "mtry=3 ntree=1 "
#[22] "mtry=3 ntree=2 " "mtry=3 ntree=3 " "mtry=3 ntree=4 "
#[25] "mtry=3 ntree=5 " "mtry=3 ntree=6 " "mtry=3 ntree=7 "
#[28] "mtry=3 ntree=8 " "mtry=3 ntree=9 " "mtry=3 ntree=10 "
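To get what the question asks for, one line per hundredth model along with the elapsed time, the write can simply be wrapped in a condition. A sketch, assuming mymodel and oob_NP are set up as in the question:
start_time <- Sys.time()
logfile_con <- file("rf_progress.log", open = "wt")
for (j in mtry) {
  for (i in ntree) {
    rf1 <- randomForest(mymodel, mtry = j, ntree = i)
    oob <- rf1[["err.rate"]][nrow(rf1[["err.rate"]]), "OOB"]
    oob_NP <- rbind(oob_NP, data.frame(mtry = j, ntree = i, OOB = oob))
    if (i %% 100 == 0) {  # log only every hundredth model
      elapsed <- as.numeric(difftime(Sys.time(), start_time, units = "mins"))
      cat(sprintf("mtry=%d ntree=%d OOB=%.4f elapsed=%.1f min\n",
                  j, i, oob, elapsed), file = logfile_con)
      flush(logfile_con)  # push the line to disk right away
    }
  }
}
close(logfile_con)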

R H2o object not found H2OKeyNotFoundArgumentException

R Version: R version 3.5.1 (2018-07-02)
H2O cluster version: 3.20.0.2
The dataset used here is available on Kaggle (Home credit risk). Prior to using h2o AutoML, the necessary treatment of missing values and selection of relevant categorical variables had already been carried out. Can you help me figure out the underlying cause of this error?
Thanks
Code:
h2o.init()
h2o.no_progress()
# y_train_processed_tbl is the target variable
# x_train_processed_tbl is the remaining data post dealing with Missing
# values
data_h2o <- as.h2o(bind_cols(y_train_processed_tbl, x_train_processed_tbl))
splits_h2o <- h2o.splitFrame(data_h2o, ratios = c(0.7, 0.15), seed = 1234)
train_h2o <- splits_h2o[[1]]
valid_h2o <- splits_h2o[[2]]
test_h2o <- splits_h2o[[3]]
y <- "TARGET"
x <- setdiff(names(train_h2o), y)
automl_models_h2o <- h2o.automl(x = x, y = y,
                                training_frame = train_h2o,
                                validation_frame = valid_h2o,
                                leaderboard_frame = test_h2o,
                                max_runtime_secs = 90)
automl_leader <- automl_models_h2o@leader
# Error in performance_h2o
performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)
ERROR: Unexpected HTTP Status code: 404 Not Found
water.exceptions.H2OKeyNotFoundArgumentException
[1] "water.exceptions.H2OKeyNotFoundArgumentException: Object 'dummy' not
found in function: predict for argument: model"
[2] " water.api.ModelMetricsHandler.score(ModelMetricsHandler.java:235)"
[3] " sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"
[4] " sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)"
[5] " sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)"
[6] " java.lang.reflect.Method.invoke(Unknown Source)"
[7] " water.api.Handler.handle(Handler.java:63)"
[8] " water.api.RequestServer.serve(RequestServer.java:451)"
[9] " water.api.RequestServer.doGeneric(RequestServer.java:296)"
[10] " water.api.RequestServer.doPost(RequestServer.java:222)"
[11] " javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"
[12] " javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"
[13] " org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"
[14] " org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"
[15] " org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"
[16] " org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"
[17] " org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"
[18] " org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"
[19] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
[20] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
[21] " water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:197)"
[22] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
[23] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
[24] " org.eclipse.jetty.server.Server.handle(Server.java:370)"
[25] " org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"
[26] " org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"
[27] " org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"
[28] " org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"
[29] " org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"
[30] " org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"
[31] " org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"
[32] " org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"
[33] " org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"
[34] " org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"
[35] " java.lang.Thread.run(Unknown Source)"
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix =
page, :
ERROR MESSAGE:
Object 'dummy' not found in function: predict for argument: model
The issue here is that you only gave AutoML 90 seconds to run, so it did not have time to train even one model. In the next stable release of H2O, the error message will be gone and instead you will simply get a Leaderboard with no rows (we are fixing this so that it's handled more gracefully).
Rather than using max_runtime_secs = 90, you could increase that to something much larger (the default is 3600 secs, or 1 hour). Alternatively you can specify the number of models you want instead by setting max_models = 20, for example.
If you do use max_models, I'd recommend setting max_runtime_secs to something large (e.g. 999999999) so that you don't run out of time. The AutoML process will stop when it reaches the first of max_models or max_runtime_secs.
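For example, a sketch of the same call with a model-count budget instead of the tight time limit:
automl_models_h2o <- h2o.automl(x = x, y = y,
                                training_frame    = train_h2o,
                                validation_frame  = valid_h2o,
                                leaderboard_frame = test_h2o,
                                max_models        = 20,         # stop after 20 models
                                max_runtime_secs  = 999999999)  # effectively no time cap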
I posted a similar answer here.
My code was working fine, then I tweaked it and got the same error.
To fix it, instead of using automl_models_h2o@leader to save the leader for predictions/performance, save the leader using h2o.getModel().
Change your automl_leader initialization:
...
# get the model name from the leaderboard
automl_models_h2o@leaderboard
# change MODEL_NAME_HERE to a model name from your leaderboard
automl_leader <- h2o.getModel("MODEL_NAME_HERE")
performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)
...

Best way to clean data in R and convert to xts

I am trying to clean up some data I downloaded from the web and convert it to xts. I found some documentation on CRAN that uses grepl to clean up the data, but I am wondering if there is an easier way to do this than grepl. I was hoping someone could help me with the code to clean this data up, either with grepl or another function in R. Thank you in advance for any assistance you can provide.
[1] "{"
[2] " \"Meta Data\": {"
[3] " \"1. Information\": \"Daily Prices (open, high, low, close) and Volumes\","
[4] " \"2. Symbol\": \"MSFT\","
[5] " \"3. Last Refreshed\": \"2017-06-08 15:15:00\","
[6] " \"4. Output Size\": \"Compact\","
[7] " \"5. Time Zone\": \"US/Eastern\""
[8] " },"
[9] " \"2017-01-19\": {"
[10] " \"1. open\": \"62.2400\","
[11] " \"2. high\": \"62.9800\","
[12] " \"3. low\": \"62.1950\","
[13] " \"4. close\": \"62.3000\","
[14] " \"5. volume\": \"18451655\""
[15] " },"
[16] " \"2017-01-18\": {"
[17] " \"1. open\": \"62.6700\","
[18] " \"2. high\": \"62.7000\","
[19] " \"3. low\": \"62.1200\","
[20] " \"4. close\": \"62.5000\","
[21] " \"5. volume\": \"19670102\""
[22] " },"
[23] " \"2017-01-17\": {"
[24] " \"1. open\": \"62.6800\","
[25] " \"2. high\": \"62.7000\","
[26] " \"3. low\": \"62.0300\","
[27] " \"4. close\": \"62.5300\","
[28] " \"5. volume\": \"20663983\""
[29] " }"
[30] " }"
[31] "}"
The final output for this data would look like:
Open High Low Close Volume
2017-01-17 62.68 62.70 62.03 62.53 20663983
2017-01-18 62.67 62.70 62.12 62.50 19670102
2017-01-19 62.24 62.98 62.195 62.30 18451655
As beigel suggested, the first thing you need to do is parse the JSON.
Lines <-
"{
\"Meta Data\": {
\"1. Information\": \"Daily Prices (open, high, low, close) and Volumes\",
\"2. Symbol\": \"MSFT\",
\"3. Last Refreshed\": \"2017-06-08 15:15:00\",
\"4. Output Size\": \"Compact\",
\"5. Time Zone\": \"US/Eastern\"
},
\"2017-01-19\": {
\"1. open\": \"62.2400\",
\"2. high\": \"62.9800\",
\"3. low\": \"62.1950\",
\"4. close\": \"62.3000\",
\"5. volume\": \"18451655\"
},
\"2017-01-18\": {
\"1. open\": \"62.6700\",
\"2. high\": \"62.7000\",
\"3. low\": \"62.1200\",
\"4. close\": \"62.5000\",
\"5. volume\": \"19670102\"
},
\"2017-01-17\": {
\"1. open\": \"62.6800\",
\"2. high\": \"62.7000\",
\"3. low\": \"62.0300\",
\"4. close\": \"62.5300\",
\"5. volume\": \"20663983\"
}
}"
parsedLines <- jsonlite::fromJSON(Lines)
Now that the data are in a usable structure, we can start cleaning them. Notice that each element of parsedLines is another list. Let's convert those to vectors with unlist, so we will have a list of vectors instead of a list of lists.
parsedLines <- lapply(parsedLines, unlist)
Now you might have noticed that the first element in parsedLines is metadata. We can attach that to the final object later. But first, let's rbind all the other elements into a matrix. We can do that for a list of any length by using do.call.
ohlcv <- do.call(rbind, parsedLines[-1]) # [-1] removes the first element
Now we can clean up the column names and convert the data from character to numeric.
colnames(ohlcv) <- gsub("^[[:digit:]]\\.", "", colnames(ohlcv))
ohlcv <- type.convert(ohlcv)
At this point, I would personally convert to an xts object and attach the metadata. But you can continue with the ohlcv matrix, convert it to a data.frame, tibble, etc.
# convert to xts (as.xts comes from the xts package)
library(xts)
x <- as.xts(ohlcv, dateFormat = "Date")
# attach attributes
metadata <- parsedLines[[1]]
names(metadata) <- gsub("[[:digit:]]|\\.|[[:space:]]", "", names(metadata))
xtsAttributes(x) <- metadata
# view attributes
str(x)
An 'xts' object on 2017-01-17/2017-01-19 containing:
Data: num [1:3, 1:5] 62.7 62.7 62.2 62.7 62.7 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] " open" " high" " low" " close" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
List of 5
$ Information : chr "Daily Prices (open, high, low, close) and Volumes"
$ Symbol : chr "MSFT"
$ LastRefreshed: chr "2017-06-08 15:15:00"
$ OutputSize : chr "Compact"
$ TimeZone : chr "US/Eastern"
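The attached metadata can then be queried directly, for example:
xtsAttributes(x)$Symbol
# [1] "MSFT"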

xpathSApply webscrape returns NULL

Using my trusty Firebug and FirePath plug-ins, I'm trying to scrape some data.
require(XML)
url <- "http://www.hkjc.com/english/racing/display_sectionaltime.asp?racedate=25/05/2008&Raceno=2&All=0"
tree <- htmlTreeParse(url, useInternalNodes = T)
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/table/tbody/tr/td[1]/font", xmlValue) # works
This works! t now contains "Meeting Date: 25/05/2008, Sha Tin\r\n\t\t\t\t\t\t"
If I try to capture the first sectional time of 29.4 thusly:
t <- xpathSApply(tree, "//html/body/table/tbody/tr/td[2]/table[2]/tbody/tr/td/font/a/table/tbody/tr[3]/td/table/tbody/tr/td/table[2]/tbody/tr[5]/td[1]", xmlValue) # doesn't work
t contains NULL.
Any ideas what I've done wrong? Many thanks.
First off, I can't find that first sectional time of 29.4. The one I see on the page you linked is 24.5, or else I'm misunderstanding what you're looking for.
Here's a way of grabbing that one using rvest and SelectorGadget for Chrome:
library(rvest)
html <- read_html(url)
t <- html %>%
  html_nodes(".bigborder table tr+ tr td:nth-child(2) font") %>%
  html_text(trim = TRUE)
> t
[1] "24.5"
This differs a bit from your approach but I hope it helps. Not sure how to properly scrape the meeting time that way, but this at least works:
mt <- html %>%
  html_nodes("font > table font") %>%
  html_text(trim = TRUE)
> mt
[1] "Meeting Date: 25/05/2008, Sha Tin" "4 - 1200M - (060-040) - Turf - A Course - Good To Firm"
[3] "MONEY TALKS HANDICAP" "Race\tTime :"
[5] "(24.5)" "(48.1)"
[7] "(1.10.3)" "Sectional Time :"
[9] "24.5" "23.6"
[11] "22.2"
> mt[1]
[1] "Meeting Date: 25/05/2008, Sha Tin"
Looks like the comments just after the <a> may be throwing you off.
<a name="Race1">
<!-- test0 table start -->
<table class="bigborder" border="0" cellpadding="0" cellspacing="0" width="760">...
<!--0 table End -->
<!-- test1 table start -->
<br>
<br>
</a>
This seems to work:
t <- xpathSApply(tree, '//tr/td/font[text()="Sectional Time : "]/../following-sibling::td[1]/font', xmlValue)
You might want to try something a little less fragile than that long direct path.
Update
If you are after all of the times in the "1st Sec." column: 29.4, 28.7, etc...
t <- xpathSApply(
  tree,
  "//tr/td[starts-with(.,'1st Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[1]",
  xmlValue
)
This looks for the "1st Sec." cell, jumps up to its row, then grabs the first td value of every other following row.
[1] "29.4 "
[2] "28.7 "
[3] "29.2 "
[4] "29.0 "
[5] "29.3 "
[6] "28.2 "
[7] "29.5 "
[8] "29.5 "
[9] "30.1 "
[10] "29.8 "
[11] "29.6 "
[12] "29.9 "
[13] "29.1 "
[14] "29.8 "
I've removed all the extra whitespace (\r\n\t\t...) for display purposes here.
If you wanted to make it a little more dynamic, you could grab the column value under "1st Sec." or any other column. Replace
/td[1]
with
td[count(//tr/td[starts-with(.,'1st Sec.')]/preceding-sibling::*)+1]
Using that, you could update the name of the column, and grab the corresponding values. For all "3rd Sec." times:
"//tr/td[starts-with(.,'3rd Sec.')]/../following-sibling::*[position() mod 2 = 0]/td[count(//tr/td[starts-with(.,'3rd Sec.')]/preceding-sibling::*)+1]"
[1] "23.3 "
[2] "23.7 "
[3] "23.3 "
[4] "23.8 "
[5] "23.7 "
[6] "24.5 "
[7] "24.1 "
[8] "24.0 "
[9] "24.1 "
[10] "24.1 "
[11] "23.9 "
[12] "23.9 "
[13] "24.3 "
[14] "24.0 "
