I want to create a matrix of stock data for n companies from a ticker list, but I'm struggling to append them horizontally; it only works when I append them vertically.
I have also tried other functions such as merge and rbind, but they cannot work with variable names passed as strings. The hard part is that I want to append n variables, retrieved from a ticker list that holds n stocks. Other suggestions for getting the same result are welcome.
Stocklist data:
> dput(stockslist)
structure(list(V1 = c("AMD", "MSFT", "SBUX", "IBM", "AAPL", "GSPC",
"AMZN")), .Names = "V1", class = "data.frame", row.names = c(NA,
-7L))
code:
library(quantmod)
library(tseries)
library(plyr)
library(PortfolioAnalytics)
library(PerformanceAnalytics)
library(zoo)
library(plotly)
tickerlist <- "sp500.csv" #CSV containing tickers on rows
stockslist <- read.csv("sp500.csv", header = FALSE, stringsAsFactors = F)
nrstocks = length(stockslist[,1]) #The number of stocks to download
maxretryattempts <- 5 #If there is an error downloading a price how many times to retry
startDate = as.Date("2010-01-13")
for (i in 1:nrstocks) {
stockdata <- getSymbols(c(stockslist[i,1]), src = "yahoo", from =
startDate)
# pick 6th column of the ith stock
write.table((eval(parse(text=paste(stockslist[i,1]))))[,6], file =
"test.csv", append = TRUE, row.names=F)
}
This is a great opportunity to talk about lists of data frames. Having said that ...
Side bar: I really don't like side effects. By default, getSymbols works by side effect, saving the data into the parent frame/environment. Though this may be fine for most uses, I prefer functional methods. Luckily, auto.assign=FALSE brings its behavior back within my bounds of comfort.
library(quantmod)
stocklist <- c("AMD", "MSFT")
startDate <- as.Date("2010-01-13")
dat <- sapply(stocklist, getSymbols, src = "google", from = startDate, auto.assign = FALSE,
simplify = FALSE)
str(dat)
# List of 2
# $ AMD :An 'xts' object on 2010-01-13/2017-05-16 containing:
# Data: num [1:1846, 1:5] 8.71 9.18 9.13 8.84 8.98 9.01 8.55 8.01 8.03 8.03 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:5] "AMD.Open" "AMD.High" "AMD.Low" "AMD.Close" ...
# Indexed by objects of class: [Date] TZ: UTC
# xts Attributes:
# List of 2
# ..$ src : chr "google"
# ..$ updated: POSIXct[1:1], format: "2017-05-16 21:01:37"
# $ MSFT:An 'xts' object on 2010-01-13/2017-05-16 containing:
# Data: num [1:1847, 1:5] 30.3 30.3 31.1 30.8 30.8 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:5] "MSFT.Open" "MSFT.High" "MSFT.Low" "MSFT.Close" ...
# Indexed by objects of class: [Date] TZ: UTC
# xts Attributes:
# List of 2
# ..$ src : chr "google"
# ..$ updated: POSIXct[1:1], format: "2017-05-16 21:01:37"
Though I only did two symbols, it should work for many more without problem. Also, I shifted to using Google since Yahoo was asking for authentication.
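To see why the call above returns a named list without needing a network connection, here is a minimal sketch. fake_fetch is a hypothetical stand-in for getSymbols(..., auto.assign = FALSE) and the numbers it produces are made up; only the sapply() behaviour is the point.

```r
# Hypothetical stand-in for getSymbols(): returns a small data frame per ticker.
fake_fetch <- function(symbol, n = 3) {
  data.frame(date  = as.Date("2010-01-13") + seq_len(n) - 1,
             close = seq_len(n) + nchar(symbol))
}

tickers <- c("AMD", "MSFT")

# simplify = FALSE keeps one list element per ticker. Because the input is a
# character vector, sapply() also uses the tickers as the element names.
prices <- sapply(tickers, fake_fetch, n = 3, simplify = FALSE)

names(prices)
str(prices$MSFT)
```

Having every ticker as a named element of one list is what makes the later lapply() and Reduce() steps possible without eval(parse(...)).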
If you write the data out with row.names = FALSE, as in your write.table() call, realize that you will lose the timestamp for each datum. With write.csv(), for instance, the file will look something like:
"AMD.Open","AMD.High","AMD.Low","AMD.Close","AMD.Volume"
8.71,9.2,8.55,9.15,32741845
9.18,9.26,8.92,9,22658744
9.13,9.19,8.8,8.84,34344763
8.84,9.21,8.84,9.01,24875646
Using "AMD" as an example, consider:
# auto.assign = FALSE means there is no AMD object; take it from the list
write.csv(as.data.frame(dat$AMD), file = "AMD.csv", row.names = TRUE)
head(read.csv("AMD.csv", row.names = 1))
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
# 2010-01-13 8.71 9.20 8.55 9.15 32741845
# 2010-01-14 9.18 9.26 8.92 9.00 22658744
# 2010-01-15 9.13 9.19 8.80 8.84 34344763
# 2010-01-19 8.84 9.21 8.84 9.01 24875646
# 2010-01-20 8.98 9.00 8.76 8.87 22813520
# 2010-01-21 9.01 9.10 8.77 8.99 37888647
To save all of them at once:
# one file per ticker, e.g. "AMD.csv" and "MSFT.csv"
ign <- mapply(function(x, fn) write.csv(as.data.frame(x), file = fn, row.names = TRUE),
dat, paste0(names(dat), ".csv"))
There are other ways to store your data such as Rdata files (save()).
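As a sketch of the binary route, the following uses saveRDS()/readRDS(), a close relative of save()/load() that stores a single object, so it is a swapped-in variant rather than the save() call named above. The two small data frames are toy data and the file goes to a temporary path.

```r
# Toy stand-in for the list of per-ticker data; the values are made up.
prices <- list(
  AMD  = data.frame(date  = as.Date(c("2010-01-13", "2010-01-14")),
                    close = c(9.15, 9.00)),
  MSFT = data.frame(date  = as.Date(c("2010-01-13", "2010-01-14")),
                    close = c(30.3, 30.9))
)

# Write the whole list to one file and read it back.
path <- tempfile(fileext = ".rds")
saveRDS(prices, path)
restored <- readRDS(path)

# Unlike a CSV round trip, classes such as Date survive unchanged.
identical(prices, restored)
```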
It is not clear to me if you are intending to append them as additional columns (i.e., cbind behavior) or as rows (rbind). Between the two, I tend towards "rows", but I'll start with "columns" first.
"Appending" by column
This may be appropriate if you want day-by-day ticker comparisons (though there are arguably better ways to prepare for this). You'll run into problems, since they have (and most likely will have) different numbers of rows:
sapply(dat, nrow)
# AMD MSFT
# 1846 1847
In this case, you might want to join based on the dates (row names). To do this well, you should probably convert the row names (dates) to a column and merge on that column:
dat2 <- lapply(dat, function(x) {
x <- as.data.frame(x)
x$date <- rownames(x)
rownames(x) <- NULL
x
})
datwide <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), dat2)
As a simple demonstration, remembering that there is one more row in "MSFT" than in "AMD", we can find that row and prove that things are still looking alright with:
which(! complete.cases(datwide))
# [1] 1251
datwide[1251 + -2:2,]
# date AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume
# 1249 2014-12-30 2.64 2.70 2.63 2.63 7783709 47.44 47.62 46.84 47.02 16384692
# 1250 2014-12-31 2.64 2.70 2.64 2.67 11177917 46.73 47.44 46.45 46.45 21552450
# 1251 2015-01-02 NA NA NA NA NA 46.66 47.42 46.54 46.76 27913852
# 1252 2015-01-05 2.67 2.70 2.64 2.66 8878176 46.37 46.73 46.25 46.32 39673865
# 1253 2015-01-06 2.65 2.66 2.55 2.63 13916645 46.38 46.75 45.54 45.65 36447854
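The same Reduce()/merge() pattern can be tried on toy data to see how it scales past two tickers. This is only a sketch: the tickers AAA, BBB and CCC and all values are made up.

```r
# Three toy "tickers", each missing a different trading day.
toy <- list(
  AAA = data.frame(date = c("2015-01-02", "2015-01-05"), AAA.Close = c(1.0, 1.1)),
  BBB = data.frame(date = c("2015-01-02", "2015-01-06"), BBB.Close = c(2.0, 2.2)),
  CCC = data.frame(date = c("2015-01-05", "2015-01-06"), CCC.Close = c(3.0, 3.3))
)

# Reduce() folds the list pairwise: merge(merge(AAA, BBB), CCC).
# all = TRUE keeps every date seen in any input and fills the gaps with NA.
wide <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), toy)
wide
```

Each input contributes its own columns, and a date missing from one ticker shows up as NA in that ticker's columns only.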
"Appending" by row
getSymbols prefixes each column name with the ticker, a slight frustration when stacking rows. Since we'll be stripping that prefix from the column names, we should preserve the symbol name in the data.
dat3 <- lapply(dat, function(x) {
ticker <- gsub("\\..*", "", colnames(x)[1])
colnames(x) <- gsub(".*\\.", "", colnames(x))
x <- as.data.frame(x)
x$date <- rownames(x)
x$symbol <- ticker
rownames(x) <- NULL
x
}) # can also be accomplished with mapply(..., dat, names(dat))
datlong <- Reduce(function(a, b) rbind(a, b, make.row.names = FALSE), dat3)
head(datlong)
# Open High Low Close Volume date symbol
# 1 8.71 9.20 8.55 9.15 32741845 2010-01-13 AMD
# 2 9.18 9.26 8.92 9.00 22658744 2010-01-14 AMD
# 3 9.13 9.19 8.80 8.84 34344763 2010-01-15 AMD
# 4 8.84 9.21 8.84 9.01 24875646 2010-01-19 AMD
# 5 8.98 9.00 8.76 8.87 22813520 2010-01-20 AMD
# 6 9.01 9.10 8.77 8.99 37888647 2010-01-21 AMD
nrow(datlong)
# [1] 3693
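A minimal sketch of the mapply-style variant mentioned in the comment above, on toy data. The tickers and values are made up, and the symbol comes from the list names rather than from the column names.

```r
# Toy per-ticker data frames that already share the same column names.
toy <- list(
  AAA = data.frame(date = c("2015-01-02", "2015-01-05"), Close = c(1.0, 1.1)),
  BBB = data.frame(date = c("2015-01-02", "2015-01-06"), Close = c(2.0, 2.2))
)

# Map() walks the data frames and their names in parallel and tags each row
# with the ticker it came from.
tagged <- Map(function(x, nm) { x$symbol <- nm; x }, toy, names(toy))

# do.call(rbind, ...) stacks the whole list in a single call.
long <- do.call(rbind, tagged)
rownames(long) <- NULL
long
```

With many tickers, a single do.call(rbind, ...) avoids growing the result one pair at a time.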
Related
I am new to R and writing functions. I've spent hours trying to figure this out and searching Google, but can't seem to find anything. Hopefully you can help? I want to use lapply() to analyze the data below using the ts() function.
My code looks like this:
library(dplyr)
#group out different sites
mylist <- data %>%
group_by(Site)
mylist
#Write ts() function
alpha_function = function(x) {
ts_alpha = ts(x$Temperature, frequency=12, start=c(0017, 7, 20))
return(data.frame(ts_alpha))
}
#Run list through lapply()
results = lapply(mylist, alpha_function())
But I get this error: argument "x" is missing with no default.
I have a data set that looks like:
Site(factor) Date(POSIXct) Temperature(num)
1 0017-03-04 2.73
2 0017-03-04 3.73
3 0017-03-04 2.71
4 0017-03-04 2.22
5 0017-03-04 2.89
etc.
I have over 3,000 temperature readings at different dates for 5 different sites.
Thanks in advance!
I'm not exactly an R guy, but I would wager this line:
results = lapply(mylist, alpha_function())
should be
results = lapply(mylist, alpha_function).
What you have calls the alpha function when you are trying to supply it to lapply, when what you really (most likely) want to do is provide a reference to the function without calling it. (The error you are getting indicates that alpha_function needs an x parameter when being called like alpha_function()).
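One more thing that may be worth checking: group_by() only records grouping information on the data frame and does not turn it into a list of per-site pieces, so lapply() would walk over its columns. A minimal base R sketch, assuming a toy data frame named readings and a made-up start of c(2017, 3):

```r
# Toy data: two sites with four readings each (values are illustrative).
readings <- data.frame(
  Site        = factor(rep(c("1", "2"), each = 4)),
  Temperature = c(2.73, 3.73, 2.71, 2.22, 2.89, 3.10, 2.95, 3.02)
)

# split() really does produce a list with one data frame per site.
by_site <- split(readings, readings$Site)

# The function takes one per-site data frame and returns a ts object.
to_ts <- function(x) ts(x$Temperature, frequency = 12, start = c(2017, 3))

# Pass the function itself (no parentheses); lapply() calls it once per site.
results <- lapply(by_site, to_ts)
results[["1"]]
```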
A recommended approach when working with dplyr and the tidyverse is to keep things in data frames:
library(tidyverse)
library(zoo)
dat %>%
nest(-Site) %>%
mutate(data = map(data, ~ zoo(.x$Temperature, .x$Date)))
# # A tibble: 5 x 2
# Site data
# <fct> <list>
# 1 a <S3: zoo>
# 2 b <S3: zoo>
# 3 c <S3: zoo>
# 4 d <S3: zoo>
# 5 e <S3: zoo>
Or if we must have ts rather than zoo objects, we can use as.ts(zoo(...)).
In case we still prefer regular lists, we can use base split() and lapply():
dat %>%
split(.$Site) %>%
lapply(function(.x) zoo(.x$Temperature, .x$Date))
# List of 5
# $ a:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.37 5.49 5.32 5.44 5.43 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ b:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.36 5.22 5.15 5.41 5.41 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ c:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 6.08 6.11 6.22 6.13 6.03 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ d:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.06 4.96 5.23 5.16 5.29 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
# $ e:‘zoo’ series from 2017-03-04 12:00:00 to 2017-05-06 00:30:00
# Data: num [1:3000] 5.1 5.08 5.14 5.13 5.22 ...
# Index: POSIXct[1:3000], format: "2017-03-04 12:00:00" ...
where dat is generated as follows:
n_sites <- 5
n_dates <- 3000
set.seed(123) ; dat <- tibble(
Site = factor(rep(letters[1:n_sites], each = n_dates)),
Date = rep(seq.POSIXt(as.POSIXct("2017-03-04 12:00:00"), by = "30 min", length.out = n_dates), times = n_sites),
Temperature = as.vector(replicate(n_sites, runif(1, 5, 6) + cumsum(rnorm(n_dates, 0, 0.1))))
)
I would like to compute the range, and the dates of the maximum and minimum price, of each variable stored in xts format, and store the results in a data frame. The result I am looking for is a data frame containing the variable name, max value date, max value, min value date, and min value.
I've imported data for two stocks using the quantmod package and wrote a function to compute the ranges (so far without the dates of the maximum and minimum price), but with no success.
d<-getSymbols(c("ZTS","ZX") , src = 'yahoo', from = '2015-01-01', auto.assign = T)
d<-cbind(ZTS,ZX)
head(d)
ZTS.Open ZTS.High ZTS.Low ZTS.Close ZTS.Volume ZTS.Adjusted ZX.Open ZX.High ZX.Low ZX.Close
2015-01-02 43.46 43.70 43.07 43.31 1784200 43.07725 1.40 1.40 1.21 1.24
2015-01-05 43.25 43.63 42.97 43.05 3112100 42.81864 1.35 1.38 1.24 1.32
2015-01-06 43.15 43.36 42.30 42.63 3977200 42.40090 1.28 1.29 1.22 1.22
2015-01-07 43.00 43.56 42.98 43.51 2481800 43.27617 1.24 1.34 1.24 1.29
2015-01-08 44.75 44.87 44.00 44.18 3121300 43.94257 1.26 1.28 1.17 1.18
2015-01-09 44.06 44.44 43.68 44.25 2993200 44.01220 1.30 1.39 1.22 1.27
ZX.Volume ZX.Adjusted
2015-01-02 20400 1.24
2015-01-05 43200 1.32
2015-01-06 16700 1.22
2015-01-07 6200 1.29
2015-01-08 17200 1.18
2015-01-09 60200 1.27
s<- for (i in names(d[,-1])) function(x) {max(x);min(x) }
> s
NULL
str(d)
An ‘xts’ object on 2015-01-02/2015-08-13 containing:
Data: num [1:155, 1:12] 43.5 43.2 43.2 43 44.8 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:12] "ZTS.Open" "ZTS.High" "ZTS.Low" "ZTS.Close" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
List of 2
$ src : chr "yahoo"
$ updated: POSIXct[1:1], format: "2015-08-14 18:35:14"
I can't tell if your date column is the row names or an actual column, but the code below should produce your result if the date column is called date in the data frame.
do.call(rbind, apply(d[,-1], 2, function(col) {
max_ind <- which.max(col)
min_ind <- which.min(col)
list(max=col[max_ind], max_date=d$date[max_ind], min=col[min_ind],
min_date=d$date[min_ind])
}))
and if the date column is the rownames
do.call(rbind, apply(d, 2, function(col) {
max_ind <- which.max(col)
min_ind <- which.min(col)
list(max=col[max_ind], max_date=as.character(index(d))[max_ind], min=col[min_ind],
min_date=as.character(index(d))[min_ind])
}))
The code does an apply over columns finding the index of the maximum and minimum, then returns the values and dates corresponding to these.
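If an ordinary data frame with typed columns is preferred over the list-matrix that do.call(rbind, apply(...)) gives, the same idea can be sketched in base R on a plain matrix. The assumptions here: the dates are the row names, and only the first four closing prices from the question are used.

```r
# Closing prices copied from the head() output in the question.
prices <- matrix(
  c(43.31, 43.05, 42.63, 43.51,
     1.24,  1.32,  1.22,  1.29),
  ncol = 2,
  dimnames = list(c("2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07"),
                  c("ZTS.Close", "ZX.Close"))
)

# One single-row data frame per column, stacked at the end.
extremes <- do.call(rbind, lapply(colnames(prices), function(nm) {
  col <- prices[, nm]
  hi <- which.max(col)  # position of the first maximum
  lo <- which.min(col)  # position of the first minimum
  data.frame(variable = nm,
             max_date = rownames(prices)[hi], max = col[[hi]],
             min_date = rownames(prices)[lo], min = col[[lo]])
}))
extremes
```

For an xts object, as.character(index(d)) would take the place of rownames(prices), as in the second snippet above.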
I have two multivariate time series x and y, both covering approximately the same range in time (one starts two years before the other, but they end on the same date). Both series have missing observations in the form of empty columns next to the date column, and also in the sense that one of the series has several dates that are not found in the other, and vice versa.
I would like to create a data frame (or similar) with a column that lists all the dates found in x OR y, without duplicate dates. For each date (row), I would like to horizontally stack the observations from x next to the observations from y, with NA's filling the missing cells. Example:
>x
"1987-01-01" 7.1 NA 3
"1987-01-02" 5.2 5 2
"1987-01-06" 2.3 NA 9
>y
"1987-01-01" 55.3 66 45
"1987-01-03" 77.3 87 34
# result I would like
"1987-01-01" 7.1 NA 3 55.3 66 45
"1987-01-02" 5.2 5 2 NA NA NA
"1987-01-03" NA NA NA 77.3 87 34
"1987-01-06" 2.3 NA 9 NA NA NA
What I have tried: with the zoo package, I've tried the merge.zoo method, but this seems to just stack the two series next to each other, with the dates (as numbers, e.g. "1987-01-02" shown as 6210) from each series appearing in two separate columns.
I've sat for hours getting almost nowhere, so all help is appreciated.
EDIT: some code included below as per suggestion from Soumendra
atcoa <- read.csv(file = "ATCOA_full_adj.csv", header = TRUE)
atcob <- read.csv(file = "ATCOB_full_adj.csv", header = TRUE)
atcoa$date <- as.Date(atcoa$date)
atcob$date <- as.Date(atcob$date)
# only number of observations and the observations themselves differ
>str(atcoa)
'data.frame': 6151 obs. of 8 variables:
$ date :Class 'Date' num [1:6151] 6210 6213 6215 6216 6217 ...
$ max : num 4.31 4.33 4.38 4.18 4.13 4.05 4.08 4.05 4.08 4.1 ...
$ min : num 4.28 4.31 4.28 4.13 4.05 3.95 3.97 3.95 4 4.02 ...
$ close : num 4.31 4.33 4.31 4.15 4.1 3.97 4 3.97 4.08 4.02 ...
$ avg : num NA NA NA NA NA NA NA NA NA NA ...
$ tot.vol : int 877733 89724 889437 1927113 3050611 846525 1782774 1497998 2504466 5636999 ...
$ turnover : num 3762300 388900 3835900 8015900 12468100 ...
$ transactions: int 12 9 24 17 31 26 34 35 37 33 ...
>atcoa[1:1, ]
date a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
1 1987-01-02 4.31 4.28 4.31 NA 877733 3762300 12
# using timeSeries package
ts.atcoa <- timeSeries::as.timeSeries(atcoa, format = "%Y-%m-%d")
ts.atcob <- timeSeries::as.timeSeries(atcob, format = "%Y-%m-%d")
>str(ts.atcoa)
Time Series:
Name: object
Data Matrix:
Dimension: 6151 7
Column Names: a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
Row Names: 1970-01-01 01:43:30 ... 1970-01-01 04:12:35
Positions:
Start: 1970-01-01 01:43:30
End: 1970-01-01 04:12:35
With:
Format: %Y-%m-%d %H:%M:%S
FinCenter: GMT
Units: a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
Title: Time Series Object
Documentation: Wed Aug 17 13:00:50 2011
>ts.atcoa[1:1, ]
GMT
a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
1970-01-01 01:43:30 4.31 4.28 4.31 NA 877733 3762300 12
# The following will create an object of class "data frame" and mode "list", which contains observations for the days mutual for the two series
>ts.atco <- timeSeries::merge(atcoa, atcob) # produces same result as base::merge, apparently
>ts.atco[1:1, ]
date a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions b.max b.min b.close b.avg b.tot.vol b.turnover b.transactions
1 1989-08-25 7.92 7.77 7.79 NA 269172 2119400 19 7.69 7.56 7.64 NA 81176693 593858000 12
EDIT: problem solved by (using zoo package)
atcoa <- read.zoo(read.csv(file = "ATCOA_full_adj.csv", header = TRUE))
atcob <- read.zoo(read.csv(file = "ATCOB_full_adj.csv", header = TRUE))
names(atcoa) <- c("a.max", "a.min", "a.close",
"a.avg", "a.tot.vol", "a.turnover", "a.transactions")
names(atcob) <- c("b.max", "b.min", "b.close",
"b.avg", "b.tot.vol", "b.turnover", "b.transactions")
atco <- merge.zoo(atcoa, atcob)
Thank you all for your help.
Try this:
Lines.x <- '"1987-01-01" 7.1 NA 3
"1987-01-02" 5.2 5 2
"1987-01-06" 2.3 NA 9'
Lines.y <- '"1987-01-01" 55.3 66 45
"1987-01-03" 77.3 87 34'
library(zoo)
# in reality x might be in a file and might be read via: x <- read.zoo("x.dat")
# ditto for y. See ?read.zoo and the zoo-read vignette if you need other args too
x <- read.zoo(text = Lines.x)
y <- read.zoo(text = Lines.y)
merge(x, y)
giving:
V2.x V3.x V4.x V2.y V3.y V4.y
1987-01-01 7.1 NA 3 55.3 66 45
1987-01-02 5.2 5 2 NA NA NA
1987-01-03 NA NA NA 77.3 87 34
1987-01-06 2.3 NA 9 NA NA NA
You can create a timeSeries (timeSeries library) object from your dates, merge them (timeSeries default merge behaviour is different from zoo and xts and does exactly what you are asking for) and then make zoo/xts objects out of the result in case you don't want to stay with timeSeries.
One quick way to test is the following, assuming you have two zoo objects zz1 and zz2 -
library(timeSeries)
as.zoo(merge(as.timeSeries(zz1), as.timeSeries(zz2)))
Compare the output of the above command with
merge(zz1, zz2)
You can also cbind -
cbind(zz1, zz2)
provided there are no shared columns with the same names. Even if such columns exist, you can choose which columns to cbind, and you will still get a zoo object.
cbind(zz1[, 1:2], zz2[, 2:3]) #Assuming other columns are common
Here is a more generic approach that I found on stat.ethz.ch:
a <- ts(1:10, start=c(2014,6), frequency=12)
b <- ts(1:12, start=c(2015,1), frequency=12)
library(zoo)
m <- merge(a = as.zoo(a), b = as.zoo(b))
to get a ts object back:
as.ts(m)
How about this:
## Generate unique sorted time values.
i <- sort(unique(c(index(x), index(y))))
## Empty data matrix.
v <- matrix(nrow=length(i), ncol=6, NA)
## Pull in data items.
v[match(index(x), i), 1:3] <- coredata(x)
v[match(index(y), i), 4:6] <- coredata(y)
## Build new zoo object.
d <- zoo(v, order.by=i)
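The match()-based fill can also be sketched without zoo, assuming the dates are held in plain Date vectors and the values in numeric matrices. The data are the x and y from the question; the column labels x1..x3 and y1..y3 are made up.

```r
# The example series from the question, split into dates and values.
x_dates <- as.Date(c("1987-01-01", "1987-01-02", "1987-01-06"))
x_vals  <- matrix(c(7.1, NA, 3,
                    5.2,  5, 2,
                    2.3, NA, 9), ncol = 3, byrow = TRUE)
y_dates <- as.Date(c("1987-01-01", "1987-01-03"))
y_vals  <- matrix(c(55.3, 66, 45,
                    77.3, 87, 34), ncol = 3, byrow = TRUE)

# Union of all dates, sorted and without duplicates.
all_dates <- sort(unique(c(x_dates, y_dates)))

# Start from an all-NA matrix, then drop each series into its own rows.
out <- matrix(NA_real_, nrow = length(all_dates), ncol = 6,
              dimnames = list(format(all_dates),
                              c(paste0("x", 1:3), paste0("y", 1:3))))
out[match(x_dates, all_dates), 1:3] <- x_vals
out[match(y_dates, all_dates), 4:6] <- y_vals
out
```

Rows that a series never touches simply keep their initial NA, which is the layout asked for in the question.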
I am using the following code to get information from a web site (http://q.stock.sohu.com/cn/000002/lshq.shtml). But I do not know how to get a data frame which includes "date,open,close,high,low". Any help would be appreciated.
thepage = readLines('http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654')
How can I get the data frame?
I don't know which parts of the return JSON are the actual values you need, but I assume they are components of the hq record. This should work:
library(RJSONIO)
library(RCurl)
# get the raw data
dat.json.raw <- getURL("http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654")
tt <- textConnection(dat.json.raw)
dat.json <- readLines(tt)
close(tt)
# remove callback
dat.json <- gsub("^historySearchHandler\\(", "", dat.json)
dat.json <- gsub("\\)$", "", dat.json)
# convert to R structure
dat.l <- fromJSON(dat.json)
# get the meaty part of the data into a data.frame
dat <- data.frame(t(sapply(dat.l[[1]]$hq, unlist)), stringsAsFactors=FALSE)
dat$X1 <- as.Date(dat$X1)
dat$X2 <- as.numeric(dat$X2)
dat$X3 <- as.numeric(dat$X3)
dat$X4 <- as.numeric(dat$X4)
str(dat)
## 'data.frame': 79 obs. of 10 variables:
## $ X1 : Date, format: "2014-03-18" "2014-03-17" "2014-03-14" ...
## $ X2 : num 7.76 7.6 7.68 7.58 7.48 7.19 7.22 7.34 6.76 6.92 ...
## $ X3 : num 7.6 7.76 7.53 7.71 7.6 7.5 7.15 7.27 7.32 6.76 ...
## $ X4 : num -0.16 0.23 -0.18 0.11 0.1 0.35 -0.12 -0.05 0.56 -0.16 ...
## $ X5 : chr "-2.06%" "3.05%" "-2.33%" "1.45%" ...
## $ X6 : chr "7.55" "7.59" "7.50" "7.53" ...
## $ X7 : chr "7.76" "7.80" "7.81" "7.85" ...
## $ X8 : chr "843900" "1177079" "1303110" "1492359" ...
## $ X9 : chr "64268.06" "90829.30" "99621.34" "114990.40" ...
## $ X10: chr "0.87%" "1.22%" "1.35%" "1.54%" ...
head(dat)
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 2014-03-18 7.76 7.60 -0.16 -2.06% 7.55 7.76 843900 64268.06 0.87%
## 2 2014-03-17 7.60 7.76 0.23 3.05% 7.59 7.80 1177079 90829.30 1.22%
## 3 2014-03-14 7.68 7.53 -0.18 -2.33% 7.50 7.81 1303110 99621.34 1.35%
## 4 2014-03-13 7.58 7.71 0.11 1.45% 7.53 7.85 1492359 114990.40 1.54%
## 5 2014-03-12 7.48 7.60 0.10 1.33% 7.42 7.85 2089873 160315.88 2.16%
## 6 2014-03-11 7.19 7.50 0.35 4.90% 7.15 7.59 1892488 141250.94 1.96%
You'll need to fix some of the other columns (since I don't know exactly what you need).
For folks who don't like the warnings that come back from the fromJSON call, you can just wrap the readLines with a paste : dat.json <- paste(readLines(tt), collapse=""). It's not necessary (the warnings are harmless) so I don't usually bother with the extra step.
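The callback-stripping step can be checked offline on a literal string. This is a sketch only: the payload below is a shortened, made-up stand-in shaped like the real response, and the patterns are slightly more general than the ones above (any callback name, optional trailing semicolon).

```r
# Shortened stand-in for the JSONP response: callbackName( ...json... )
raw <- 'historySearchHandler([{"status":0,"hq":[["2014-03-18","7.76","7.60"]]}])'

# Drop the leading "callbackName(" ...
json <- sub("^[A-Za-z_][A-Za-z0-9_]*\\(", "", raw)
# ... and the closing ")" plus any trailing semicolon or whitespace.
json <- sub("\\)[[:space:]]*;?[[:space:]]*$", "", json)

json
```

What remains is plain JSON that fromJSON() can parse.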
Seems like you're trying to scrape a website that presents the data in JSON.
For that, in addition to the "usual steps" that you need to do in order to scrape a website you'll also need to deal with parsing and manipulating JSON data:
Usual approach
If you have a HTML that has an easy to grab table, this should work:
require("XML")
x <- readHTMLTable(
doc="www.someurl.com"
)
Otherwise you'll definitely need to use some XPath to get to the nodes that you're interested in.
This usually involves parsing the HTML code via htmlTreeParse() and getting to the respective nodes via getNodeSet() and the like:
x <- htmlTreeParse(
file="www.someurl.com",
isURL=TRUE,
useInternalNodes=TRUE
)
res <- getNodeSet(x, <your-xpath-statement>)
Approach including JSON data
Parse the HTML code:
x <- htmlTreeParse(
file="http://q.stock.sohu.com/hisHq?code=cn_000002&start=20131120&end=20140318&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.8740235545448934&0.28161772061461654",
isURL=TRUE,
useInternalNodes=TRUE
)
Retrieve the actual JSON data:
json <- getNodeSet(x, "//body/p")
json <- xmlValue(json[[1]])
Get rid of non-JSON components:
json <- gsub("historySearchHandler\\(", "", json, perl=TRUE)
json <- gsub("\\)$", "", json, perl=TRUE)
Parse the JSON data:
require("jsonlite")
fromJSON(json, simplifyVector=FALSE)
[[1]]
[[1]]$status
[1] 0
[[1]]$hq
[[1]]$hq[[1]]
[[1]]$hq[[1]][[1]]
[1] "2014-03-18"
[[1]]$hq[[1]][[2]]
[1] "7.76"
[...]
Now you need to bring that into a more data.frame-like order (methods that come to mind are do.call(), rbind(), and cbind()).
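A minimal sketch of that reshaping step with do.call() and rbind(), using a hand-written stand-in for the parsed hq component. The column names date, open and close are assumptions made for illustration; the source does not say what each field means.

```r
# Stand-in for the parsed "hq" component: one inner list per trading day.
hq <- list(
  list("2014-03-18", "7.76", "7.60"),
  list("2014-03-17", "7.60", "7.76")
)

# unlist() flattens each record to a character vector; rbind() stacks them
# into a character matrix with one row per record.
dat <- as.data.frame(do.call(rbind, lapply(hq, unlist)),
                     stringsAsFactors = FALSE)

# Hypothetical column names, then proper types.
names(dat) <- c("date", "open", "close")
dat$date <- as.Date(dat$date)
dat[c("open", "close")] <- lapply(dat[c("open", "close")], as.numeric)

str(dat)
```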
Encoding
Sooner or later (rather sooner than later as we see in this very example), you'll be confronted with encoding issues (stuff like "ÀÛ¼Æ:").
You can play around with different encodings either directly when parsing the HTML code (argument encoding in htmlTreeParse()) or by modifying a character string's encoding afterwards via Encoding(). I wasn't able to get it all correct for your values, though. Encoding issues can be quite a pain.
General suggestion
I'd recommend choosing English-language examples (an English-language website in this case) in the future, as otherwise you're tremendously limiting the number of people who might be able to help you.