Best way to clean data in R and convert to xts

I am trying to clean up some data I downloaded from the web and convert it to xts. I found some documentation on CRAN that uses grepl to clean up the data, but I am wondering if there is an easier way than grepl. I was hoping someone could help me with the code to clean this data up, either with grepl or another function in R. Thank you in advance for any assistance you can provide.
[1] "{"
[2] " \"Meta Data\": {"
[3] " \"1. Information\": \"Daily Prices (open, high, low, close) and Volumes\","
[4] " \"2. Symbol\": \"MSFT\","
[5] " \"3. Last Refreshed\": \"2017-06-08 15:15:00\","
[6] " \"4. Output Size\": \"Compact\","
[7] " \"5. Time Zone\": \"US/Eastern\""
[8] " },"
[9] " \"2017-01-19\": {"
[10] " \"1. open\": \"62.2400\","
[11] " \"2. high\": \"62.9800\","
[12] " \"3. low\": \"62.1950\","
[13] " \"4. close\": \"62.3000\","
[14] " \"5. volume\": \"18451655\""
[15] " },"
[16] " \"2017-01-18\": {"
[17] " \"1. open\": \"62.6700\","
[18] " \"2. high\": \"62.7000\","
[19] " \"3. low\": \"62.1200\","
[20] " \"4. close\": \"62.5000\","
[21] " \"5. volume\": \"19670102\""
[22] " },"
[23] " \"2017-01-17\": {"
[24] " \"1. open\": \"62.6800\","
[25] " \"2. high\": \"62.7000\","
[26] " \"3. low\": \"62.0300\","
[27] " \"4. close\": \"62.5300\","
[28] " \"5. volume\": \"20663983\""
[29] " }"
[30] " }"
[31] "}"
The final output for this data would look like:
             Open  High    Low Close   Volume
2017-01-17  62.68 62.70  62.03 62.53 20663983
2017-01-18  62.67 62.70  62.12 62.50 19670102
2017-01-19  62.24 62.98 62.195 62.30 18451655

As beigel suggested, the first thing you need to do is parse the JSON.
Lines <-
"{
\"Meta Data\": {
\"1. Information\": \"Daily Prices (open, high, low, close) and Volumes\",
\"2. Symbol\": \"MSFT\",
\"3. Last Refreshed\": \"2017-06-08 15:15:00\",
\"4. Output Size\": \"Compact\",
\"5. Time Zone\": \"US/Eastern\"
},
\"2017-01-19\": {
\"1. open\": \"62.2400\",
\"2. high\": \"62.9800\",
\"3. low\": \"62.1950\",
\"4. close\": \"62.3000\",
\"5. volume\": \"18451655\"
},
\"2017-01-18\": {
\"1. open\": \"62.6700\",
\"2. high\": \"62.7000\",
\"3. low\": \"62.1200\",
\"4. close\": \"62.5000\",
\"5. volume\": \"19670102\"
},
\"2017-01-17\": {
\"1. open\": \"62.6800\",
\"2. high\": \"62.7000\",
\"3. low\": \"62.0300\",
\"4. close\": \"62.5300\",
\"5. volume\": \"20663983\"
}
}"
parsedLines <- jsonlite::fromJSON(Lines)
Now that the data are in a usable structure, we can start cleaning them up. Notice that each element in parsedLines is another list. Let's convert them to vectors with unlist, so we have a list of vectors instead of a list of lists.
parsedLines <- lapply(parsedLines, unlist)
Now you might have noticed that the first element in parsedLines is metadata. We can attach that to the final object later. But first, let's rbind all the other elements into a matrix. We can do that for a list of any length by using do.call.
ohlcv <- do.call(rbind, parsedLines[-1]) # [-1] removes the first element
Now we can clean up the column names and convert the data from character to numeric.
colnames(ohlcv) <- gsub("^[[:digit:]]\\.", "", colnames(ohlcv))
ohlcv <- type.convert(ohlcv)
At this point, I would personally convert to an xts object and attach the metadata. But you could also continue with the ohlcv matrix and convert it to a data.frame, tibble, etc. (a sketch of the data.frame route follows the str() output below).
# convert to xts (as.xts and xtsAttributes come from the xts package)
library(xts)
x <- as.xts(ohlcv, dateFormat = "Date")
# attach attributes
metadata <- parsedLines[[1]]
names(metadata) <- gsub("[[:digit:]]|\\.|[[:space:]]", "", names(metadata))
xtsAttributes(x) <- metadata
# view attributes
str(x)
An 'xts' object on 2017-01-17/2017-01-19 containing:
  Data: num [1:3, 1:5] 62.7 62.7 62.2 62.7 62.7 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:5] " open" " high" " low" " close" ...
  Indexed by objects of class: [Date] TZ: UTC
  xts Attributes:
List of 5
 $ Information  : chr "Daily Prices (open, high, low, close) and Volumes"
 $ Symbol       : chr "MSFT"
 $ LastRefreshed: chr "2017-06-08 15:15:00"
 $ OutputSize   : chr "Compact"
 $ TimeZone     : chr "US/Eastern"
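As mentioned above, if you would rather end up with a plain data.frame than an xts object, here is a minimal sketch. It assumes the ohlcv matrix built above, whose columns type.convert has already made numeric and whose column names still carry a leading space:
colnames(ohlcv) <- trimws(colnames(ohlcv))    # drop the leading spaces left by gsub
ohlcv_df <- as.data.frame(ohlcv)
ohlcv_df$date <- as.Date(rownames(ohlcv_df))  # the dates live in the row names
ohlcv_df <- ohlcv_df[order(ohlcv_df$date), ]  # oldest first, like the desired output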


R H2O object not found: H2OKeyNotFoundArgumentException

R Version: R version 3.5.1 (2018-07-02)
H2O cluster version: 3.20.0.2
The dataset used here is available on Kaggle (Home credit risk). Before running h2o automl, the necessary treatment of missing values and selection of relevant categorical variables had already been carried out. Can you help me figure out the underlying cause of this error?
Thanks
Code:
h2o.init()
h2o.no_progress()
# y_train_processed_tbl is the target variable
# x_train_processed_tbl is the remaining data after dealing with missing values
data_h2o <- as.h2o(bind_cols(y_train_processed_tbl, x_train_processed_tbl))
splits_h2o <- h2o.splitFrame(data_h2o, ratios = c(0.7, 0.15), seed = 1234)
train_h2o <- splits_h2o[[1]]
valid_h2o <- splits_h2o[[2]]
test_h2o <- splits_h2o[[3]]
y <- "TARGET"
x <- setdiff(names(train_h2o), y)
automl_models_h2o <- h2o.automl(x = x, y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs = 90)
automl_leader <- automl_models_h2o@leader
# Error in performance_h2o
performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)
ERROR: Unexpected HTTP Status code: 404 Not Found
water.exceptions.H2OKeyNotFoundArgumentException
[1] "water.exceptions.H2OKeyNotFoundArgumentException: Object 'dummy' not
found in function: predict for argument: model"
[2] " water.api.ModelMetricsHandler.score(ModelMetricsHandler.java:235)"
[3] " sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"
[4] " sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)"
[5] " sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)"
[6] " java.lang.reflect.Method.invoke(Unknown Source)"
[7] " water.api.Handler.handle(Handler.java:63)"
[8] " water.api.RequestServer.serve(RequestServer.java:451)"
[9] " water.api.RequestServer.doGeneric(RequestServer.java:296)"
[10] " water.api.RequestServer.doPost(RequestServer.java:222)"
[11] " javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"
[12] " javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"
[13] " org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"
[14] " org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"
[15] " org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"
[16] " org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"
[17] " org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"
[18] " org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"
[19] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
[20] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
[21] " water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:197)"
[22] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
[23] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
[24] " org.eclipse.jetty.server.Server.handle(Server.java:370)"
[25] " org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"
[26] " org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"
[27] " org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"
[28] " org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"
[29] " org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"
[30] " org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"
[31] " org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"
[32] " org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"
[33] " org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"
[34] " org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"
[35] " java.lang.Thread.run(Unknown Source)"
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix =
page, :
ERROR MESSAGE:
Object 'dummy' not found in function: predict for argument: model
The issue here is that you only gave AutoML 90 seconds to run, so it did not have time to train even one model. In the next stable release of H2O, the error message will be gone and instead you will simply get a Leaderboard with no rows (we are fixing this so that it's handled more gracefully).
Rather than using max_runtime_secs = 90, you could increase that to something much larger (the default is 3600 secs, or 1 hour). Alternatively you can specify the number of models you want instead by setting max_models = 20, for example.
If you do use max_models, I'd recommend setting max_runtime_secs to something large (e.g. 999999999) so that you don't run out of time. The AutoML process will stop when it reaches the first of max_models or max_runtime_secs.
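Put together, a sketch of the adjusted call, using the same frames and variables as in the question:
# cap the number of models, with a generous time budget as a backstop
automl_models_h2o <- h2o.automl(x = x, y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  leaderboard_frame = test_h2o,
  max_models = 20,
  max_runtime_secs = 999999999)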
I posted a similar answer here.
My code was working fine, then I tweaked it and got the same error.
To fix it, instead of using automl_models_h2o@leader to save the leader for predictions/performance, save the leader using h2o.getModel().
Change your automl_leader initialization:
...
# print the leaderboard and pick a model name from it
automl_models_h2o@leaderboard
# change MODEL_NAME_HERE to a model name from your leaderboard list
automl_leader <- h2o.getModel("MODEL_NAME_HERE")
performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)
...

How may I delete numbers or symbols after "(" or "["

names(score)
[1] "(Intercept)" "aado2_calc(20,180]" "aado2_calc(360,460]"
[4] "aado2_calc(460,629]" "albumin[1,1.8]" "albumin(1.8,2.2]"
[7] "albumin(2.2,2.8]" "aniongap(15,18]" "aniongap(18,20]"
[10] "aniongap(20,22]" "aniongap(22,25]" "aniongap(25,49]"
[13] "ethnicityBLACK" "ethnicityUNKNOWN" "admission_typeEMERGENCY"
[16] "electivesurgery" "mechvent" "congestive_heart_failure"
[19] "cardiac_arrhythmias" "renal_failure" "liver_disease"
[22] "lymphoma" "metastatic_cancer" "coagulopathy"
[25] "obesity" "fluid_electrolyte"
In this program, I want to delete the symbols and numbers that follow "(" or "[". For example, "albumin[1,1.8]" should become "albumin".
We can use sub to match either a ( or a [ (the alternation (\\(|\\[)), followed by one or more digits ([0-9]+) and the rest of the characters, and replace the whole match with an empty string:
sub("(\\(|\\[)[0-9]+.*", "", names(score))
#[1] "(Intercept)" "aado2_calc" "aado2_calc" "aado2_calc" "albumin"
#[6] "albumin" "albumin" "aniongap" "aniongap" "aniongap"
#[11] "aniongap" "aniongap" "ethnicityBLACK" "ethnicityUNKNOWN" "admission_typeEMERGENCY"
#[16] "electivesurgery" "mechvent" "congestive_heart_failure" "cardiac_arrhythmias" "renal_failure"
#[21] "liver_disease" "lymphoma" "metastatic_cancer" "coagulopathy" "obesity"
#[26] "fluid_electrolyte"
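To write the cleaned names back onto the object, assign the result of sub:
# overwrite the original names of score with the cleaned-up versions
names(score) <- sub("(\\(|\\[)[0-9]+.*", "", names(score))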

Scraping Financial Tables From a Web Page with R, rvest, RCurl

I'm trying to parse financial tables from a web page. I got this far, but I am not able to arrange the result into a usable list or data.frame.
library(rvest)
link <- "http://www.marketwatch.com/investing/stock/garan/financials/balance-sheet/quarter"
read <- read_html(link)
prs <- html_nodes(read, ".financials")
irre <- html_text(prs)
re <- strsplit(irre, split = "\r\n")
re is something like this:
[27] "Assets"
[28] ""
[29] " "
[30] " "
[31] " All values TRY millions."
[32] " 31-Dec-201431-Mar-201530-Jun-201530-Sep-201531-Dec-2015"
[33] " 5-qtr trend"
[34] " "
[35] " "
[36] " "
[37] " "
[38] " Total Cash & Due from Banks"
[39] " 27.26B26.27B26.7B34.51B27.9B"
[40] " "
[41] " "
bla bla...
How can I turn this list into a data.frame laid out properly, like the table on that page?
Try
library(XML)
theurl <- "http://www.marketwatch.com/investing/stock/garan/financials/balance-sheet/quarter"
re <- readHTMLTable(theurl)
The result is a list with two data.frames.
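If you would rather stay within rvest, html_table does the same job. A sketch follows; which of the extracted tables you actually need is an assumption about the page layout, so check the indices:
library(rvest)
page <- read_html(theurl)
tables <- html_table(html_nodes(page, "table"), fill = TRUE)  # one data.frame per <table>
balance_sheet <- tables[[1]]  # pick the table you need by index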

How to find unique extra column name between two data.frames?

I have two almost identical data.frames, and I want to find the unique column name that is added to the x.2 object.
> colnames(x.1)
[1] "listPrice" "rent" "floor" "livingArea"
[5] "rooms" "published" "constructionYear" "objectType"
[9] "booliId" "soldDate" "soldPrice" "url"
[13] "additionalArea" "isNewConstruction" "location.namedAreas" "location.address.streetAddress"
[17] "location.address.city" "location.position.latitude" "location.position.longitude" "location.region.municipalityName"
[21] "location.region.countyName" "location.distance.ocean" "source.name" "source.id"
[25] "source.type" "source.url" "areaSize" "priceDiff"
[29] "perc.priceDiff" "sqrmPrice"
> colnames(x.2)
[1] "listPrice" "livingArea" "additionalArea" "plotArea"
[5] "rooms" "published" "constructionYear" "objectType"
[9] "booliId" "soldDate" "soldPrice" "url"
[13] "isNewConstruction" "floor" "rent" "location.namedAreas"
[17] "location.address.streetAddress" "location.address.city" "location.position.latitude" "location.position.longitude"
[21] "location.region.municipalityName" "location.region.countyName" "location.distance.ocean" "source.name"
[25] "source.id" "source.type" "source.url" "areaSize"
[29] "priceDiff" "perc.priceDiff" "sqrmPrice"
You can use setdiff to get the column names that are in 'x.2' but not in 'x.1':
setdiff(colnames(x.2), colnames(x.1))
Try
colnames(x.2)[!colnames(x.2) %in% colnames(x.1)]
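A toy example (made-up names, with "plotArea" standing in for the extra column) shows that both approaches agree:
x1_names <- c("listPrice", "rent", "floor")
x2_names <- c("listPrice", "rent", "floor", "plotArea")
setdiff(x2_names, x1_names)        # "plotArea"
x2_names[!x2_names %in% x1_names]  # "plotArea"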

How to invoke Stata and run syntax via R?

I have an odd situation, and please pardon me for not providing a reproducible example for this question. I have more than 1000 lines of syntax written for Stata to carry out multiple analyses (I wrote it before I started using R). This syntax is run on a quarterly dataset every 3 months to create a report. Results of the analyses are saved in csv files, read into R, and put into a Word document using the ReporteRs package.
Is there any way to invoke Stata via R and specify/pipe the syntax to run it? (I understand the reverse can be done using rsource, a user-written command, in Stata.) I can still fire up Stata manually and run the syntax there, but is it possible to do it via R? Then a shiny app/web interface could be created for this part, so the user doesn't need to do it manually.
As @thelatemail suggests, the easiest thing to do here is simply run Stata in batch mode from a system call.
Here's an example do file (called "example.do"):
log using out.log, replace
sysuse auto
regress mpg weight foreign
And here's the R code to run it and retrieve the output (assuming Stata is on your path and you replace Stata-64 with the appropriate binary file on your machine):
> system("Stata-64 /e do example.do"); readLines("out.log")
[1] "-----------------------------------------------------------------------------------------------------------------------"
[2] " name: <unnamed>"
[3] " log: FilePathHere"
[4] " log type: text"
[5] " opened on: 9 Jan 2015, 13:34:18"
[6] ""
[7] ". sysuse auto"
[8] "(1978 Automobile Data)"
[9] ""
[10] ". regress mpg weight foreign"
[11] ""
[12] " Source | SS df MS Number of obs = 74"
[13] "-------------+------------------------------ F( 2, 71) = 69.75"
[14] " Model | 1619.2877 2 809.643849 Prob > F = 0.0000"
[15] " Residual | 824.171761 71 11.608053 R-squared = 0.6627"
[16] "-------------+------------------------------ Adj R-squared = 0.6532"
[17] " Total | 2443.45946 73 33.4720474 Root MSE = 3.4071"
[18] ""
[19] "------------------------------------------------------------------------------"
[20] " mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]"
[21] "-------------+----------------------------------------------------------------"
[22] " weight | -.0065879 .0006371 -10.34 0.000 -.0078583 -.0053175"
[23] " foreign | -1.650029 1.075994 -1.53 0.130 -3.7955 .4954422"
[24] " _cons | 41.6797 2.165547 19.25 0.000 37.36172 45.99768"
[25] "------------------------------------------------------------------------------"
[26] ""
[27] ". "
[28] "end of do-file"
[29] ""
[30] ". exit, clear"
It may be easier to parse the output if you log using Stata Markup Control Language (SMCL), by replacing the first line of the do file with log using out.log, replace smcl. Then the output will be:
[1] "{smcl}"
[2] "{com}{sf}{ul off}{txt}{.-}"
[3] " name: {res}<unnamed>"
[4] " {txt}log: {res}FilePathHere"
[5] " {txt}log type: {res}smcl"
[6] " {txt}opened on: {res} 9 Jan 2015, 13:41:53"
[7] "{txt}"
[8] "{com}. sysuse auto"
[9] "{txt}(1978 Automobile Data)"
[10] ""
[11] "{com}. regress mpg weight foreign"
[12] ""
[13] " {txt}Source {c |} SS df MS Number of obs ={res} 74"
[14] "{txt}{hline 13}{char +}{hline 30} F( 2, 71) ={res} 69.75"
[15] " {txt} Model {char |} {res} 1619.2877 2 809.643849 {txt}Prob > F = {res} 0.0000"
[16] " {txt}Residual {char |} {res} 824.171761 71 11.608053 {txt}R-squared = {res} 0.6627"
[17] "{txt}{hline 13}{char +}{hline 30} Adj R-squared = {res} 0.6532"
[18] " {txt} Total {char |} {res} 2443.45946 73 33.4720474 {txt}Root MSE = {res} 3.4071"
[19] ""
[20] "{txt}{hline 13}{c TT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[21] "{col 1} mpg{col 14}{c |} Coef.{col 26} Std. Err.{col 38} t{col 46} P>|t|{col 54} [95% Con{col 67}f. Interval]"
[22] "{hline 13}{c +}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[23] "{space 6}weight {c |}{col 14}{res}{space 2}-.0065879{col 26}{space 2} .0006371{col 37}{space 1} -10.34{col 46}{space 3}0.000{col 54}{space 4}-.0078583{col 67}{space 3}-.0053175"
[24] "{txt}{space 5}foreign {c |}{col 14}{res}{space 2}-1.650029{col 26}{space 2} 1.075994{col 37}{space 1} -1.53{col 46}{space 3}0.130{col 54}{space 4} -3.7955{col 67}{space 3} .4954422"
[25] "{txt}{space 7}_cons {c |}{col 14}{res}{space 2} 41.6797{col 26}{space 2} 2.165547{col 37}{space 1} 19.25{col 46}{space 3}0.000{col 54}{space 4} 37.36172{col 67}{space 3} 45.99768"
[26] "{txt}{hline 13}{c BT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[27] "{res}{txt}"
[28] "{com}. "
[29] "{txt}end of do-file"
