How to scrape data from a web site? - r

I am using the following code to get information from a web site. But I do not know how to get a data frame that includes "date,open,close,high,low". Any help would be appreciated.
thepage = readLines('')
How can I get the data frame?

I don't know which parts of the returned JSON are the actual values you need, but I assume they are components of the hq record. This should work:
# get the raw data (getURL() comes from the RCurl package)
library(RCurl)
dat.json.raw <- getURL("")
tt <- textConnection(dat.json.raw)
dat.json <- readLines(tt)
# remove callback
dat.json <- gsub("^historySearchHandler\\(", "", dat.json)
dat.json <- gsub("\\)$", "", dat.json)
# convert to R structure
dat.l <- fromJSON(dat.json)
# get the meaty part of the data into a data.frame
dat <- data.frame(t(sapply(dat.l[[1]]$hq, unlist)), stringsAsFactors=FALSE)
dat$X1 <- as.Date(dat$X1)
dat$X2 <- as.numeric(dat$X2)
dat$X3 <- as.numeric(dat$X3)
dat$X4 <- as.numeric(dat$X4)
## 'data.frame': 79 obs. of 10 variables:
## $ X1 : Date, format: "2014-03-18" "2014-03-17" "2014-03-14" ...
## $ X2 : num 7.76 7.6 7.68 7.58 7.48 7.19 7.22 7.34 6.76 6.92 ...
## $ X3 : num 7.6 7.76 7.53 7.71 7.6 7.5 7.15 7.27 7.32 6.76 ...
## $ X4 : num -0.16 0.23 -0.18 0.11 0.1 0.35 -0.12 -0.05 0.56 -0.16 ...
## $ X5 : chr "-2.06%" "3.05%" "-2.33%" "1.45%" ...
## $ X6 : chr "7.55" "7.59" "7.50" "7.53" ...
## $ X7 : chr "7.76" "7.80" "7.81" "7.85" ...
## $ X8 : chr "843900" "1177079" "1303110" "1492359" ...
## $ X9 : chr "64268.06" "90829.30" "99621.34" "114990.40" ...
## $ X10: chr "0.87%" "1.22%" "1.35%" "1.54%" ...
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 2014-03-18 7.76 7.60 -0.16 -2.06% 7.55 7.76 843900 64268.06 0.87%
## 2 2014-03-17 7.60 7.76 0.23 3.05% 7.59 7.80 1177079 90829.30 1.22%
## 3 2014-03-14 7.68 7.53 -0.18 -2.33% 7.50 7.81 1303110 99621.34 1.35%
## 4 2014-03-13 7.58 7.71 0.11 1.45% 7.53 7.85 1492359 114990.40 1.54%
## 5 2014-03-12 7.48 7.60 0.10 1.33% 7.42 7.85 2089873 160315.88 2.16%
## 6 2014-03-11 7.19 7.50 0.35 4.90% 7.15 7.59 1892488 141250.94 1.96%
You'll need to fix some of the other columns (since I don't know exactly what you need).
For folks who don't like the warnings that come back from the fromJSON call, you can just wrap the readLines call in paste: dat.json <- paste(readLines(tt), collapse=""). It's not necessary (the warnings are harmless), so I don't usually bother with the extra step.
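To see what the callback removal and the paste() step do, here is a minimal sketch on made-up text. The payload below is invented and only mimics the historySearchHandler(...) wrapper; the jsonlite package is an assumption and is only used if it happens to be installed.

```r
# Hypothetical JSONP response split across several lines (invented data)
raw_lines <- c('historySearchHandler([{"status":0,',
               '"hq":[["2014-03-18","7.76","7.60"],',
               '["2014-03-17","7.60","7.76"]]}])')

# Collapse to one string first, so ^ and $ refer to the whole payload
jsonp <- paste(raw_lines, collapse = "")

# Strip the callback name with its opening parenthesis, then the closing one
json <- sub("^historySearchHandler\\(", "", jsonp)
json <- sub("\\)$", "", json)

# Parse only if jsonlite is available (an assumption of this sketch)
if (requireNamespace("jsonlite", quietly = TRUE)) {
  parsed <- jsonlite::fromJSON(json, simplifyVector = FALSE)
  str(parsed[[1]]$hq[[1]])
}
```

Collapsing first also means the trailing-parenthesis pattern can only match the very end of the response, not the end of some intermediate line.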

Seems like you're trying to scrape a website that presents the data in JSON.
For that, in addition to the "usual steps" that you need to do in order to scrape a website you'll also need to deal with parsing and manipulating JSON data:
Usual approach
If you have a HTML that has an easy to grab table, this should work:
x <- readHTMLTable(
Otherwise you'll definitely need to use some XPath to get to the nodes that you're interested in.
This usually involves parsing the HTML code via htmlTreeParse() and getting to the respective nodes via getNodeSet() and the like:
x <- htmlTreeParse(
res <- getNodeSet(x, <your-xpath-statement>)
Approach including JSON data
Parse the HTML code:
x <- htmlTreeParse(
Retrieve the actual JSON data:
json <- getNodeSet(x, "//body/p")
json <- xmlValue(json[[1]])
Get rid of non-JSON components:
json <- gsub("historySearchHandler\\(", "", json, perl=TRUE)
json <- gsub("\\)$", "", json, perl=TRUE)
Parse the JSON data:
fromJSON(json, simplifyVector=FALSE)
[1] 0
[1] "2014-03-18"
[1] "7.76"
Now you need to bring that into a more data.frame-like structure (functions that come to mind are rbind() and cbind()).
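As a rough sketch of that reshaping step, assuming each record arrives as a list of strings: the records and the column names below are invented for illustration, since the real meaning of each field is not known here.

```r
# Hypothetical records, shaped like the nested list a JSON parser might return
hq <- list(
  list("2014-03-18", "7.76", "7.60"),
  list("2014-03-17", "7.60", "7.76")
)

# Turn each record into a character vector, then stack the vectors as rows
mat <- do.call(rbind, lapply(hq, unlist))
dat <- as.data.frame(mat, stringsAsFactors = FALSE)
names(dat) <- c("date", "open", "close")  # assumed column meanings

# Convert the columns from character to proper types
dat$date <- as.Date(dat$date)
dat[c("open", "close")] <- lapply(dat[c("open", "close")], as.numeric)
str(dat)
```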
Sooner or later (rather sooner than later as we see in this very example), you'll be confronted with encoding issues (stuff like "ÀÛ¼Æ:").
You can play around with different encodings either directly when parsing the HTML code (the encoding argument of htmlTreeParse()) or by modifying a character string's encoding afterwards via Encoding(). I wasn't able to get it all correct for your values, though. Encoding issues can be quite a pain.
General suggestion
I'd recommend choosing English-language examples (an English-language website in this case) in the future; otherwise you're severely limiting the number of people who might be able to help you.


Problems with full_join in R. no applicable method to "character"

I am new to the R world and I am struggling with the full_join function. I am pretty sure the problem is easy. I got it working in other situations that I assume were the same as the present one. Anyhow, someone can probably help me. Let's go:
I have several datasets within a big list:
NDVI2003 <- ls(pattern = "x2003_meanNDVI_m.*$")
PixelQa2003 <- ls(pattern = "x2003_meanPixelQa_m.*$")
full_list <-, list(NDVI2003,PixelQa2003))
The first two calls just grab some files from a folder. These files look like:
> str(x2003_meanNDVI_m1)
'data.frame': 354 obs. of 5 variables:
$ date : chr "2001-12-03" "2001-12-10" "2001-12-19" "2001-12-26" ...
$ 2003_NDVI_1: num 0.441 0.518 0.322 0.311 0.499 0.319 0.163 0.134 0.452 0.536 ...
$ 2003_NDVI_2: num 0.377 0.446 0.075 0.1 0.006 0.279 0.368 0.135 0.423 0.522 ...
$ 2003_NDVI_3: num 0.332 0.397 0.07 0.093 0.006 0.236 0.469 0.127 0.411 0.535 ...
$ 2003_NDVI_4: num 0.653 0.621 0.536 0.064 0.652 0.576 0.52 0.158 0.666 0.663 ...
The third call simply puts all these files together:
> head(full_list,20)
[1] "x2003_meanNDVI_m1" "x2003_meanNDVI_m2" "x2003_meanNDVI_m3" "x2003_meanNDVI_m4" "x2003_meanNDVI_m5"
[6] "x2003_meanNDVI_m6" "x2003_meanPixelQa_m1" "x2003_meanPixelQa_m2" "x2003_meanPixelQa_m3" "x2003_meanPixelQa_m4"
[11] "x2003_meanPixelQa_m5" "x2003_meanPixelQa_m6"
So far, very simple. Now it comes to the problem... I want to join all these files by the column 'date'. This very same procedure is working on other scripts I built:
data2003 <- reduce(full_list, full_join, by="date")
But I keep getting an error:
> data2003 <- reduce(full_list, full_join, by="date")
Error in UseMethod("full_join") :
no applicable method for 'full_join' applied to an object of class "character"
So far, what I have tried:
Changing the column type from character, to date, to number... Nothing.
Altering the order in which the dplyr and plyr packages are loaded when opening R.
Changing variable names and so on.
Using full_lst <- list(NDVI2003,PixelQa2003) instead of full_list <-, list(NDVI2003,PixelQa2003))
Adding full_list <- mget(full_list)
Googling for hours looking for an answer...
Any help will be really welcome.
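One possible reading of the error, offered as an assumption rather than a diagnosis: ls() returns the names of the objects as character strings, so the reduce step receives strings instead of data frames. The sketch below uses two invented data frames and base R's merge() in place of full_join so that it runs without extra packages.

```r
# Invented stand-ins for the real objects
x2003_meanNDVI_m1 <- data.frame(date = c("2001-12-03", "2001-12-10"),
                                ndvi = c(0.441, 0.518))
x2003_meanPixelQa_m1 <- data.frame(date = c("2001-12-10", "2001-12-19"),
                                   qa = c(66, 68))

# ls() returns the names of matching objects, not the objects themselves
full_list <- ls(pattern = "^x2003_mean.*_m1$")
class(full_list)

# mget() looks the names up and returns a named list of data frames
df_list <- mget(full_list)

# Full outer join on "date" with base R
data2003 <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), df_list)
data2003
```

If the join should stay in dplyr, the same list of data frames (not the vector of names) is what would need to be passed to reduce().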

How to append columns horizontally in R in a loop?

I want to create a matrix of stock data for n companies from a ticker list, but I'm struggling to append them horizontally; it only works to append them vertically.
I have also tried other functions like merge or rbind, but they cannot work with the variables parsed as strings. The hard part is that I want to append n variables, which are retrieved from a ticker list containing n stocks. Other suggestions for getting the same result are welcome.
Stocklist data:
> dput(stockslist)
structure(list(V1 = c("AMD", "MSFT", "SBUX", "IBM", "AAPL", "GSPC",
"AMZN")), .Names = "V1", class = "data.frame", row.names = c(NA,
tickerlist <- "sp500.csv" #CSV containing tickers on rows
stockslist <- read.csv("sp500.csv", header = FALSE, stringsAsFactors = F)
nrstocks = length(stockslist[,1]) #The number of stocks to download
maxretryattempts <- 5 #If there is an error downloading a price how many times to retry
startDate = as.Date("2010-01-13")
for (i in 1:nrstocks) {
  stockdata <- getSymbols(c(stockslist[i,1]), src = "yahoo", from = startDate)
  # pick 6th column of the ith stock
  write.table((eval(parse(text=paste(stockslist[i,1]))))[,6], file =
    "test.csv", append = TRUE, row.names=F)
}
This is a great opportunity to talk about lists of data frames. Having said that ...
Side bar: I really don't like side effects. getSymbols defaults to saving the data into the parent frame/environment as a side effect, and though this may be fine for most uses, I prefer functional methods. Luckily, using auto.assign=FALSE brings its behavior back within my bounds of comfort.
library(quantmod)  # provides getSymbols()
stocklist <- c("AMD", "MSFT")
startDate <- as.Date("2010-01-13")
dat <- sapply(stocklist, getSymbols, src = "google", from = startDate, auto.assign = FALSE,
simplify = FALSE)
# List of 2
# $ AMD :An 'xts' object on 2010-01-13/2017-05-16 containing:
# Data: num [1:1846, 1:5] 8.71 9.18 9.13 8.84 8.98 9.01 8.55 8.01 8.03 8.03 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:5] "AMD.Open" "AMD.High" "AMD.Low" "AMD.Close" ...
# Indexed by objects of class: [Date] TZ: UTC
# xts Attributes:
# List of 2
# ..$ src : chr "google"
# ..$ updated: POSIXct[1:1], format: "2017-05-16 21:01:37"
# $ MSFT:An 'xts' object on 2010-01-13/2017-05-16 containing:
# Data: num [1:1847, 1:5] 30.3 30.3 31.1 30.8 30.8 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:5] "MSFT.Open" "MSFT.High" "MSFT.Low" "MSFT.Close" ...
# Indexed by objects of class: [Date] TZ: UTC
# xts Attributes:
# List of 2
# ..$ src : chr "google"
# ..$ updated: POSIXct[1:1], format: "2017-05-16 21:01:37"
Though I only did two symbols, it should work for many more without problem. Also, I shifted to using Google since Yahoo was asking for authentication.
You used write.table(..., row.names=F); realize that you will lose the timestamp for each datum, since the dates are written out as row names.
Using "AMD" as an example, consider:
write.csv(as.data.frame(dat$AMD), file="AMD.csv", row.names = TRUE)
head(read.csv("~/Downloads/AMD.csv", row.names = 1))
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume
# 2010-01-13 8.71 9.20 8.55 9.15 32741845
# 2010-01-14 9.18 9.26 8.92 9.00 22658744
# 2010-01-15 9.13 9.19 8.80 8.84 34344763
# 2010-01-19 8.84 9.21 8.84 9.01 24875646
# 2010-01-20 8.98 9.00 8.76 8.87 22813520
# 2010-01-21 9.01 9.10 8.77 8.99 37888647
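The effect of row.names on the saved dates can be checked on a tiny data frame. This is only a sketch with invented prices and temporary files, not the downloaded data:

```r
# Small data frame with dates as row names (invented prices)
prices <- data.frame(Close = c(9.15, 9.00),
                     row.names = c("2010-01-13", "2010-01-14"))

with_dates <- tempfile(fileext = ".csv")
without_dates <- tempfile(fileext = ".csv")
write.csv(prices, file = with_dates, row.names = TRUE)
write.csv(prices, file = without_dates, row.names = FALSE)

kept <- read.csv(with_dates, row.names = 1)
lost <- read.csv(without_dates)
rownames(kept)  # the dates survive the round trip
rownames(lost)  # only default row numbers remain
```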
To save all of them at once:
ign <- mapply(function(x, fn) write.csv(as.data.frame(x), file = fn, row.names = TRUE),
              dat, names(dat))
There are other ways to store your data such as Rdata files (save()).
It is not clear to me if you are intending to append them as additional columns (i.e., cbind behavior) or as rows (rbind). Between the two, I tend towards "rows", but I'll start with "columns" first.
"Appending" by column
This may be appropriate if you want day-by-day ticker comparisons (though there are arguably better ways to prepare for this). You'll run into problems, since they have (and most likely will have) different numbers of rows:
sapply(dat, nrow)
# 1846 1847
In this case, you might want to join based on the dates (row names). To do this well, you should probably convert the row names (dates) to a column and merge on that column:
dat2 <- lapply(dat, function(x) {
  x <- as.data.frame(x)  # xts -> data.frame; the dates become row names
  x$date <- rownames(x)
  rownames(x) <- NULL
  x
})
datwide <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), dat2)
As a simple demonstration, remembering that there is one more row in "MSFT" than in "AMD", we can find that row and prove that things are still looking alright with:
which(! complete.cases(datwide))
# [1] 1251
datwide[1251 + -2:2,]
# date AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume
# 1249 2014-12-30 2.64 2.70 2.63 2.63 7783709 47.44 47.62 46.84 47.02 16384692
# 1250 2014-12-31 2.64 2.70 2.64 2.67 11177917 46.73 47.44 46.45 46.45 21552450
# 1251 2015-01-02 NA NA NA NA NA 46.66 47.42 46.54 46.76 27913852
# 1252 2015-01-05 2.67 2.70 2.64 2.66 8878176 46.37 46.73 46.25 46.32 39673865
# 1253 2015-01-06 2.65 2.66 2.55 2.63 13916645 46.38 46.75 45.54 45.65 36447854
"Appending" by row
getSymbols prefixes each column name with the ticker, a slight frustration. Additionally, since we'll be discarding those prefixes, we should preserve the symbol name in the data.
dat3 <- lapply(dat, function(x) {
  ticker <- gsub("\\..*", "", colnames(x)[1])
  colnames(x) <- gsub(".*\\.", "", colnames(x))
  x <- as.data.frame(x)  # xts -> data.frame; the dates become row names
  x$date <- rownames(x)
  x$symbol <- ticker
  rownames(x) <- NULL
  x
}) # can also be accomplished with mapply(..., dat, names(dat))
datlong <- Reduce(function(a, b) rbind(a, b, make.row.names = FALSE), dat3)
# Open High Low Close Volume date symbol
# 1 8.71 9.20 8.55 9.15 32741845 2010-01-13 AMD
# 2 9.18 9.26 8.92 9.00 22658744 2010-01-14 AMD
# 3 9.13 9.19 8.80 8.84 34344763 2010-01-15 AMD
# 4 8.84 9.21 8.84 9.01 24875646 2010-01-19 AMD
# 5 8.98 9.00 8.76 8.87 22813520 2010-01-20 AMD
# 6 9.01 9.10 8.77 8.99 37888647 2010-01-21 AMD
# [1] 3693
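The variant that takes the ticker from names(dat) could look roughly like the following. It is a sketch on invented data frames rather than real xts objects, and it uses Map() (a thin wrapper around mapply()) plus do.call(rbind, ...) instead of Reduce():

```r
# Invented per-ticker data frames with ticker-prefixed column names
dat <- list(
  AMD  = data.frame(AMD.Close  = c(9.15, 9.00),
                    row.names = c("2010-01-13", "2010-01-14")),
  MSFT = data.frame(MSFT.Close = c(30.3, 30.9),
                    row.names = c("2010-01-13", "2010-01-14"))
)

long_list <- Map(function(x, ticker) {
  colnames(x) <- sub(".*\\.", "", colnames(x))  # "AMD.Close" -> "Close"
  x$date <- rownames(x)
  x$symbol <- ticker   # ticker comes from the list name, not the column name
  rownames(x) <- NULL
  x
}, dat, names(dat))

# Stack all pieces in one call; unname() keeps list names out of the row names
datlong <- do.call(rbind, c(unname(long_list), list(make.row.names = FALSE)))
datlong
```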

Web scraping data table in R not working, XML or getURL

Normally I don't have any issues getting table data from sites, but this one is throwing me for a loop.
I've tried various suggestions from the site:
R: Scraping Site, Incrementing Loop by Date in URL, Saving To CSV
Scraping from aspx website using R
web scraping in R
I've tried the two methods to try and get something from the site and end up with errors.
The first approach:
#####Reading in data
#pulling rainfall data csv
direct_rainfall <- read.csv(url(getURL(" /cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now")))
This ends with the following error:
Error in function (type, msg, asError = TRUE) :
Failed to connect to port 80: Timed out
The second method:
#xml data pull method
url = ""
doc = htmlParse(url)
Which ends with the following error:
Error: failed to load external entity ""
Any guidance would be appreciated. I just can't figure out why I'm getting nothing when I try and pull from the URL.
If you look at the website, it's a reasonably nicely formatted CSV. Happily, if you pass read.csv a URL, it will automatically handle the connection for you, so all you really need is:
url <- ''
df <- read.csv(url, skip = 3, nrows = 17, na.strings = 'm')
## X.station. X.sensor. X.year. X.month. X01 X02 X03 X04 X05 X06
## 1 CVT 2 2000 NA 20.90 19.44 3.74 3.31 5.02 0.85
## 2 CVT 2 2001 NA 7.23 9.53 3.86 7.47 0.00 0.15
## 3 CVT 2 2002 NA 3.60 4.43 8.71 2.76 2.78 0.00
## 4 CVT 2 2003 NA 1.71 4.34 4.45 13.45 2.95 0.00
## 5 CVT 2 2004 NA 3.41 10.57 1.80 0.87 0.90 0.00
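To see what skip, nrows and na.strings each do without hitting the site, here is a small sketch that reads from invented text shaped like such a file (title lines on top, 'm' for missing values, a footer at the end):

```r
# Invented text mimicking a CSV with title lines, 'm' for missing, and a footer
txt <- c("Title line",
         "Another title line",
         "Yet another title line",
         "year,jan,feb",
         "2000,20.90,19.44",
         "2001,m,9.53",
         "Footer that should not be read")

# skip the three title lines, read the header plus two data rows, treat 'm' as NA
df <- read.csv(text = txt, skip = 3, nrows = 2, na.strings = "m")
str(df)
```

Because 'm' is declared as a missing-value marker, the jan column stays numeric instead of being read as character.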

How do I read ragged/implied do data into r

How do I read data like the example below? (My actual files look like Fortran implied-do writes.)
The issue I have is that there are multiple headers and vectors within the file having differing numbers of values per line. Scan seems to start from the beginning for .gz files, while I want the reads to parse incrementally through the file.
This is a headerline with a name.
The fourth line has the number of elements in the first vector,
and the next vector is encoded similarly
1 2 3
4 5 6
1 2 3
4 5 6
7 8
This doesn't work as I'd like:
This sort of works, but I have to then calculate the skip values:
Ah... I discovered that using the gzfile() to open does not allow seek operations on the data, so the scan()s all rewind and start at the beginning of the file. If I unzip the file and operate on the uncompressed data, I can read the various bits incrementally with readLines(fh,n) and scan(fh,n=n)
if (skip !=0 ){junk<-readLines(fh,skip)}
... # still need to process a parenthesized complex array, but that is a different problem.
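For what it's worth, R reopens a connection on every read call if it was never opened explicitly, which would also explain reads restarting at the top. The sketch below assumes a layout of header line, count line, then values, and uses an invented file; it opens the gzfile() connection once and reads through it incrementally. The helper read_vector() is hypothetical.

```r
# Write a small gzip-compressed file shaped like the example (invented content)
path <- tempfile(fileext = ".gz")
out <- gzfile(path, "wt")
writeLines(c("This is a header line with a name.", "6", "1 2 3", "4 5 6",
             "Second header", "8", "1 2 3", "4 5 6", "7 8"), out)
close(out)

# Hypothetical helper: read one header, one count, then lines until count is met
read_vector <- function(con) {
  header <- readLines(con, n = 1)
  n <- as.integer(readLines(con, n = 1))
  values <- numeric(0)
  while (length(values) < n) {
    line <- readLines(con, n = 1)
    if (length(line) == 0) stop("unexpected end of file")
    values <- c(values, scan(text = line, quiet = TRUE))
  }
  list(header = header, values = values)
}

# Open the connection once; each read continues from the current position
fh <- gzfile(path, "rt")
first <- read_vector(fh)
second <- read_vector(fh)
close(fh)
```

Reading whole lines and parsing them with scan(text = ...) avoids having to compute skip values by hand.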
Looking at a few sample files, it looks like you only need to determine the number to be read once, and that can be used for processing all parts of the file.
As I mentioned in a comment, grep would be useful for helping automate the process. Here's a quick function I came up with:
ReadFunky <- function(myFile) {
  fh <- gzfile(myFile)
  myFile <- readLines(fh)
  # the fifth line holds the number of values per vector
  vecLen <- as.numeric(myFile[5])
  startAt <- grep(paste("^\\s+", vecLen), myFile)
  # the first four sections are plain numeric vectors
  T1 <- lapply(startAt[-5], function(x) {
    scan(fh, n = vecLen, skip = x)
  })
  # the fifth section is the parenthesized complex array
  T2 <- gsub("\\(|\\)", "",
             unlist(strsplit(myFile[(startAt[5]+1):length(myFile)], ")(",
                             fixed = TRUE)))
  T2 <- read.csv(text = T2, header = FALSE)
  T2 <- split(T2, rep(1:vecLen, each = vecLen))
  T1[[5]] <- T2
  names(T1) <- myFile[startAt-1]
  T1
}
You can apply it to a downloaded file. Just replace the path below with the actual path to where you downloaded the file.
temp <- ReadFunky("~/Downloads/AL182012_1030_0730.gz")
The function returns a list. The first four items in the list are the vectors of coordinates.
# List of 4
# $ MERCATOR X COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ MERCATOR Y COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ EAST LONGITUDE COORDINATES ... DEGREES: num [1:159] -81.1 -81 -80.9 -80.9 -80.8 ...
# $ NORTH LATITUDE COORDINATES ... DEGREES: num [1:159] 36.2 36.3 36.3 36.4 36.4 ...
The fifth item is a set of 2-column data.frames that contain the data from your "parenthesized complex array". Not really sure what the best structure for this data was, so I just stuck it in data.frames. You'll get as many data.frames as the expected number of values for the given data set (in this case, 159).
# [1] 159
# List of 4
# $ 1:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.59 7.6 7.59 7.59 7.58 ...
# ..$ V2: num [1:159] -1.33 -1.28 -1.22 -1.16 -1.1 ...
# $ 2:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.66 7.66 7.65 7.65 7.64 ...
# ..$ V2: num [1:159] -1.29 -1.24 -1.19 -1.13 -1.07 ...
# $ 3:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.73 7.72 7.72 7.71 7.7 ...
# ..$ V2: num [1:159] -1.26 -1.21 -1.15 -1.1 -1.04 ...
# $ 4:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.8 7.8 7.79 7.78 7.76 ...
# ..$ V2: num [1:159] -1.22 -1.17 -1.12 -1.06 -1.01 ...
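The handling of the parenthesized pairs can be tried in isolation. This sketch uses a few invented lines in the "(a,b)(c,d)" style and the same strsplit/gsub/read.csv sequence:

```r
# Invented lines in the "(real,imag)(real,imag)" style
lines <- c("(7.59,-1.33)(7.60,-1.28)", "(7.59,-1.22)")

# Split between adjacent pairs, then drop the remaining parentheses
pairs <- unlist(strsplit(lines, ")(", fixed = TRUE))
pairs <- gsub("\\(|\\)", "", pairs)

# Each element is now "a,b", which read.csv can parse into two numeric columns
df <- read.csv(text = pairs, header = FALSE)
df
```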
If you want to modify the function so you can read directly from the FTP url, change the first two lines to read as the following and continue from the "myFile" line:
ReadFunky <- function(myFile, fromURL = TRUE) {
if (isTRUE(fromURL)) {
x <- strsplit(myFile, "/")[[1]]
y <- download.file(myFile, destfile = x[length(x)])
fh <- gzfile(x[length(x)])
} else {
fh <- gzfile(myFile)
}
Usage would be like: temp <- ReadFunky("") for a file that you are going to download directly, and temp <- ReadFunky("~/AL182012_1023_1330.gz", fromURL=FALSE) for a file that you already have saved on your system.

R Summary based on column name length

I have the following problem:
I have a matrix with 80 columns whose names have either 10/11, 21/22, 31/32 or 42/43 characters. The names are totally different, but the length always fits one of the four groups. Now I would like to add four columns where I get the sum of all the values of the columns corresponding to one group. Here is a little example of what I mean:
g$group1<-"sum of all columns with headers of length 1 (in this case a+b)"
g$group2<-"sum of all columns with headers of length 2 (in this case cc+dd)"
g$group3<-"sum of all columns with headers of length 3 (in this case eee+fff)"
I was able to transfer the matrix to a dataframe using melt() and carrying out the operation using stringr::str_length(). However, I could not transform this back to a matrix which I really need as final output. The columns are not in order and ordering would not help me much, since the number of columns depends on the outcome of the previous calculation and it would be too tedious to define dataframe ranges every time again.
Hope you can help.
You want this:
tmp <- nchar(names(g))
chargroups <- split(1:dim(g)[2], tmp)
# `chargroups` is a list of groups of columns with same number of letters in name
sapply(chargroups, function(x) {
  # `x` is, for each number of letters, a vector of column indices of `g`
  if(length(x)>1) # rowSums can only accept 2+-dimensional object
    rowSums(g[,x])
  else g[,x]
})
The key part of this is that nchar is going to determine how long the column names are. The rest is pretty straightforward.
EDIT: In your actual code, though, you should deal with the ranges of name lengths by doing something like the following after you define tmp but before the sapply statement:
tmp[tmp==10] <- 11
tmp[tmp==21] <- 22
tmp[tmp==31] <- 32
tmp[tmp==42] <- 43
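A variation on the same idea, sketched for a matrix (where colnames() is needed instead of names()): findInterval() maps the name lengths 10/11, 21/22, 31/32 and 42/43 onto groups 1 to 4 in one step. The column names and values below are invented placeholders of the right lengths.

```r
# Placeholder column names with the lengths described in the question
lens <- c(10, 11, 21, 22, 31, 32, 42, 43)
nms <- strrep(letters[seq_along(lens)], lens)
g <- matrix(seq_len(2 * length(lens)), nrow = 2, dimnames = list(NULL, nms))

# Map each name length to its group: 10/11 -> 1, 21/22 -> 2, 31/32 -> 3, 42/43 -> 4
grp <- findInterval(nchar(colnames(g)), c(10, 21, 31, 42))

# Sum the columns of each group; drop = FALSE keeps single columns as matrices
sums <- sapply(split(seq_len(ncol(g)), grp),
               function(idx) rowSums(g[, idx, drop = FALSE]))
colnames(sums) <- paste0("group", colnames(sums))

# The result is still a matrix, with the four group columns appended
result <- cbind(g, sums)
```

Using drop = FALSE removes the need for a special case when a group contains only one column.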
Another approach
a <- rnorm(100)
b <- rnorm(100)
cc <- rnorm(100)
dd <- rnorm(100)
eee <- rnorm(100)
fff <- rnorm(100)
g <- data.frame(a,b,cc,dd,eee,fff)
# add one column per name length, summing the columns whose names have that length
for (i in 1:3) {
  g[[paste0("group", i)]] <- rowSums(g[nchar(names(g)) == i])
}
## 'data.frame': 100 obs. of 9 variables:
## $ a : num -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
## $ b : num -0.71 0.257 -0.247 -0.348 -0.952 ...
## $ cc : num 2.199 1.312 -0.265 0.543 -0.414 ...
## $ dd : num -0.715 -0.753 -0.939 -1.053 -0.437 ...
## $ eee : num -0.0736 -1.1687 -0.6347 -0.0288 0.6707 ...
## $ fff : num -0.602 -0.994 1.027 0.751 -1.509 ...
## $ group1: num -1.2709 0.0267 1.312 -0.277 -0.8223 ...
## $ group2: num 1.484 0.56 -1.204 -0.509 -0.851 ...
## $ group3: num -0.675 -2.162 0.392 0.722 -0.838 ...
