I have this example data, where I load some tickers
libs <- c('quantmod')
lapply(libs, require, character.only = T)
tickers<-c('T','AMD','AA','AMAT','BAC')
getSymbols(tickers,from="2013-01-01")
Then I created a function like
FUNtest<-function (x,y){
data<-x
close<-data[,y]
return(tail(close))
}
which works fine; for example
FUNtest(AMD,4)
and the result is the tail of the closing prices of AMD:
AMD.Close
2014-07-16 4.66
2014-07-17 4.57
2014-07-18 3.83
2014-07-21 3.78
2014-07-22 3.80
2014-07-23 3.76
But for later use, I need to be able to call the function this way:
FUNtest(tickers[2],4)
but it doesn't work. If I call
tickers[2]
it shows
> tickers[2]
[1] "AMD"
but it does not work inside the function. Any advice on how to fix this?
Thanks
There's a big difference between
FUNtest(AMD,4)
and
FUNtest("AMD",4)
With the former, you are passing a name which points to an xts object. In the latter, you are simply passing a character string. This string is in no way directly connected to the object of the same name.
If you want a function that works if you pass a character or an xts object, you can do
FUNtest<-function (x,y){
if(is(x, "xts")) {
data <- x
} else if (is(x, "character")) {
data <- get(x)
} else {
stop(paste("invalid x class:", class(x)))
}
close <- data[,y]
return(tail(close))
}
then both
FUNtest(AMD, 4)
FUNtest(tickers[2], 4)
will work.
But even better is not to rely on quantmod's behavior of adding variables to your global environment. That default is being phased out because it encourages bad habits. It's better to store all the symbols in a list like
symb<-lapply(setNames(tickers, tickers), function(x)
getSymbols(x,from="2013-01-01", auto.assign=F))
Then you can use symb$AMAT or symb[["AMAT"]], depending on how you want to extract the data. The latter form is more flexible because you can index with a variable holding a particular symbol name, or you can apply an action to all the data sets by lapply-ing over the list, as sketched below.
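For example, a minimal sketch of both access patterns (assuming the symb list created above):
sym <- "AMAT"
tail(symb[[sym]][, 4])                       # index with a variable that holds the symbol name
lapply(symb, function(x) tail(x[, 4], 3))    # apply the same action to every symbol in the list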
You could try using get in the function.
get("AMD") finds AMD in the evaluation frame (or not) and returns the value attached to it.
FUNtest <- function (x,y){
data<-get(x)
close<-data[,y]
return(tail(close))
}
> FUNtest(tickers[2], 4)
# AMD.Close
# 2014-07-16 4.66
# 2014-07-17 4.57
# 2014-07-18 3.83
# 2014-07-21 3.78
# 2014-07-22 3.80
# 2014-07-23 3.76
Also, there isn't really a need to use return here. This function might be better for you
> f <- function(x, y){ x <- get(x); tail(x[, y], 3) }
## on the entire tickers vector, get column 4 and bind them
> do.call(cbind, lapply(tickers, f, y = 4))
# T.Close AMD.Close AA.Close AMAT.Close BAC.Close
# 2014-07-16 36.45 4.66 16.60 22.85 15.51
# 2014-07-17 36.03 4.57 16.33 22.77 15.20
# 2014-07-18 36.17 3.83 16.49 23.00 15.49
eval can also be quite useful for unquoted arguments
> f <- function(x){ eval(x) }
> head(f(AMD), 3)
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume AMD.Adjusted
# 2013-01-02 2.55 2.57 2.45 2.53 27214800 2.53
# 2013-01-03 2.52 2.59 2.46 2.49 24966900 2.49
# 2013-01-04 2.51 2.59 2.49 2.59 22054200 2.59
In the first case you're passing the object called AMD itself; in the second you're just passing the character value "AMD".
I'm guessing that AMD is already loaded into your workspace, which is why FUNtest works in the first case.
If you want the function to work, either pass the object you want to the function, or tell the function where to find it, as sketched below.
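A minimal sketch of both options (FUNtest2 and its envir argument are illustrative names, not from the original post):
# option 1: pass the object itself
FUNtest(AMD, 4)
# option 2: pass a name and tell the function where to look it up
FUNtest2 <- function(x, y, envir = .GlobalEnv) {
  data <- get(x, envir = envir)
  tail(data[, y])
}
FUNtest2("AMD", 4)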
I am new to R.
I wrote code for an assignment that reads several CSV files, binds them into a data frame, and then, according to the id, calculates the mean of either nitrate or sulfate.
Data sample:
Date sulfate nitrate ID
<date> <dbl> <dbl> <dbl>
1 2003-10-06 7.21 0.651 1
2 2003-10-12 5.99 0.428 1
3 2003-10-18 4.68 1.04 1
4 2003-10-24 3.47 0.363 1
5 2003-10-30 2.42 0.507 1
6 2003-11-11 1.43 0.474 1
...
To read the files and create a data.frame, I wrote this function:
library(readr)
library(dplyr)

pollutantmean <- function (pollutant, id = 1:332) {
#creating a data frame from several files
file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)
read_file_m <- lapply(file_m, read_csv)
df_1 <- bind_rows(read_file_m)
# delete NAs
df_clean <- df_1[complete.cases(df_1),]
#select rows according to id
df_asid_clean <- filter(df_clean, ID %in% id)
#count the mean of the column
mean_result <- mean(df_asid_clean[, pollutant])
mean_result
}
However, when read_csv is applied, the nitrate column in certain files is parsed as col_logical, even though the entries are numeric. It seems the parser "expects" a logical value even though the actual values are not.
Throughout the reading I get this message:
<...>
Parsed with column specification:
cols(
Date = col_date(format = ""),
sulfate = col_double(),
nitrate = col_logical(),
ID = col_double()
)
Warning: 41 parsing failures.
row col expected actual file
2055 nitrate 1/0/T/F/TRUE/FALSE 0.383 'specdata/288.csv'
2067 nitrate 1/0/T/F/TRUE/FALSE 0.355 'specdata/288.csv'
2073 nitrate 1/0/T/F/TRUE/FALSE 0.469 'specdata/288.csv'
2085 nitrate 1/0/T/F/TRUE/FALSE 0.144 'specdata/288.csv'
2091 nitrate 1/0/T/F/TRUE/FALSE 0.0984 'specdata/288.csv'
.... ....... .................. ...... ..................
See problems(...) for more details.
I tried to change the column class by writing
df_1[, "nitrate"] <- as.numeric(as.character(df_1[, "nitrate"]))
after binding the rows, but NAs are still introduced in the step that calculates the mean.
What is wrong here, and how could I solve it?
Would appreciate your help!
UPDATE: I tried to insert read_csv(col_types = list...), but I get an error that the "file" argument is missing. As I understand it, R evaluates the read_csv() call before lapply(), and because no file is supplied at that point, it throws an error.
The readr::read_csv() column-type parsing failure can be overcome by passing a col_types= argument through lapply(). We do this as follows:
pollutantmean <- function (directory,pollutant,id=1:332){
require(readr)
require(dplyr)
file_m <- list.files(path = directory, pattern = "*.csv", full.names = TRUE)[id]
read_file_m <- lapply(file_m, read_csv,col_types=list(col_date(),col_double(),
col_double(),col_integer()))
# rest of code goes here. Since I am a Community Mentor in the
# JHU Data Science Specialization, I am not allowed to post
# a complete solution to the programming assignment
}
Note that I use the [ form of the extract operator to subset the list of file names with the id vector that is an argument to the function, which avoids reading a lot of data that isn't necessary. This eliminates the need for the filter() statement in the code posted in the question.
With some additional programming statements to complete the assignment, the code in my answer produces the correct results for the three examples posted with the assignment, as listed below.
> pollutantmean("specdata","sulfate",1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
Alternately we could implement lapply() with an anonymous function that also uses read_csv() as follows:
read_file_m <- lapply(file_m, function(x) {read_csv(x,col_types=list(col_date(),col_double(),
col_double(),col_integer()))})
NOTE: while it is completely understandable that students who have been exposed to the tidyverse would like to use it for the programming assignment, the fact that dplyr isn't introduced until the next course in the sequence (and readr isn't covered at all) makes it much more difficult to use for assignments in R Programming, especially the first assignment, where dplyr non-standard evaluation gives people fits. An example of this situation is yet another Stackoverflow question on pollutantmean().
With the updated read_csv you don't need lapply; you can simply pass the file paths you already defined directly to read_csv.
Regarding the column types, these can then be set manually in the col_types argument:
col_types = cols(Date = col_date(), sulfate = ...)
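A minimal sketch of this approach, assuming readr >= 2.0 (where read_csv() accepts a vector of file paths and row-binds the files for you):
library(readr)
file_m <- list.files("specdata", pattern = "*.csv", full.names = TRUE)
df_1 <- read_csv(
  file_m,
  col_types = cols(
    Date    = col_date(format = ""),
    sulfate = col_double(),
    nitrate = col_double(),
    ID      = col_integer()
  )
)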
There is a lot of documentation on how to read multiple CSVs and bind them into one data frame. I have 5000+ CSV files I need to read in and bind into one data structure.
In particular I've followed the discussion here: Issue in Loading multiple .csv files into single dataframe in R using rbind
The weird thing is that base R is much faster than any other solution I've tried.
Here's what my CSV looks like:
> head(PT)
Line Timestamp Lane.01 Lane.02 Lane.03 Lane.04 Lane.05 Lane.06 Lane.07 Lane.08
1 PL1 05-Jan-16 07:17:36 NA NA NA NA NA NA NA NA
2 PL1 05-Jan-16 07:22:38 NA NA NA NA NA NA NA NA
3 PL1 05-Jan-16 07:27:41 NA NA NA NA NA NA NA NA
4 PL1 05-Jan-16 07:32:43 9.98 10.36 10.41 10.16 10.10 9.97 10.07 9.59
5 PL1 05-Jan-16 07:37:45 9.65 8.87 9.88 9.86 8.85 8.75 9.19 8.51
6 PL1 05-Jan-16 07:42:47 9.14 8.98 9.29 9.04 9.01 9.06 9.12 9.08
I've created three methods for reading in and binding the data. The files are located in a separate directory which I define as:
dataPath <- "data"
PTfiles <- list.files(path=dataPath, full.names = TRUE)
Method 1: Base R
classes <- c("factor", "character", rep("numeric",8))
# build function to load data
load_data <- function(dataPath, classes) {
tables <- lapply(PTfiles, read.csv, colClasses=classes, na.strings=c("NA", ""))
do.call(rbind, tables)
}
#clock
method1 <- system.time(
PT <- load_data(dataPath, classes)
)
Method 2: read_csv
In this case I created a wrapper function for read_csv to use
#create wrapper function for read_csv
read_csv.wrap <- function(x) { read_csv(x, skip = 1, na=c("NA", ""),
col_names = c("tool", "timestamp", paste("lane", 1:8, sep="")),
col_types =
cols(
tool = col_character(),
timestamp = col_character(),
lane1 = col_double(),
lane2 = col_double(),
lane3 = col_double(),
lane4 = col_double(),
lane5 = col_double(),
lane6 = col_double(),
lane7 = col_double(),
lane8 = col_double()
)
)
}
##
# Same as method 1, just uses read_csv instead of read.csv
load_data2 <- function(dataPath) {
tables <- lapply(PTfiles, read_csv.wrap)
do.call(rbind, tables)
}
#clock
method2 <- system.time(
PT2 <- load_data2(dataPath)
)
Method 3: read_csv + dplyr::bind_rows
load_data3 <- function(dataPath) {
tables <- lapply(PTfiles, read_csv.wrap)
dplyr::bind_rows(tables)
}
#clock
method3 <- system.time(
PT3 <- load_data3(dataPath)
)
What I can't figure out is why the read_csv and dplyr methods are slower in elapsed time when they should be faster. The CPU time decreases, but why would the elapsed time (file system) increase? What's going on here?
Edit - I added the data.table method as suggested in the comments
Method 4 data.table
library(data.table)
load_data4 <- function(dataPath){
tables <- lapply(PTfiles, fread)
rbindlist(tables)
}
method4 <- system.time(
PT4 <- load_data4(dataPath)
)
The data.table method is the fastest from a CPU standpoint. But the question still stands: what is going on with the read_csv methods that makes them so slow?
> rbind(method1, method2, method3, method4)
user.self sys.self elapsed
method1 0.56 0.39 1.35
method2 0.42 1.98 13.96
method3 0.36 2.25 14.69
method4 0.34 0.67 1.74
I would do that in the terminal (Unix). I would put all the files in the same folder, navigate to that folder in the terminal, then use the following command to create a single CSV file:
cat *.csv > merged_csv_file.csv
One observation regarding this method is that the header of each file will show up in the middle of the observations. To solve this I would suggest you do:
Get just the header from the first file
head -2 file1.csv > merged_csv_file.csv
then skip the first "X" lines from the other files with the following command, where "X" is the number of lines to skip.
tail -n +3 -q file*.csv >> merged_csv_file.csv
-n +3 makes tail print lines from the 3rd to the end, -q tells it not to print a header with the file name (read the man page), and >> appends to the file instead of overwriting it as > would.
I might have found a related issue. I am reading in nested CSV data from some simulation output, where multiple columns have CSV formatted data as elements, which I need to unnest and reshape for analysis.
For simulations where I have many runs, this resulted in thousands of elements that needed to be parsed. Using map(., read_csv), this would take hours to transform. When I rewrote my script to apply read.csv in a lambda function, the operation completed in seconds; a sketch of that swap follows.
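A hypothetical sketch of that swap, using made-up nested data and assuming readr >= 2.0 (where I() marks a string as literal CSV text):
library(purrr)
library(readr)
nested <- c("a,b\n1,2\n3,4", "a,b\n5,6\n7,8")                   # stand-ins for CSV-valued cells
slow <- map(nested, ~ read_csv(I(.x), show_col_types = FALSE))  # readr route
fast <- lapply(nested, function(x) read.csv(text = x))          # base-R lambda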
I'm curious if there is some intermediate system I/O operation or error handling that creates a bottleneck you wouldn't run into with a single input file.
Any idea why this is so slow?
I'm trying to index a matrix to its first row.
> nrow(cfm2)
[1] 8326
> head(cfm2)
TSX TY1 GC1:CAD CL1:CAD
1983-12-01 2558.0 80.43750 497.3676 36.29842
1983-12-02 2550.2 79.84375 496.1024 36.55753
1983-12-05 2540.2 79.81250 496.9146 36.49022
1983-12-06 2536.8 79.84375 495.9524 36.41626
1983-12-07 2549.3 79.68750 501.9910 36.16226
1983-12-08 2535.5 79.25000 484.1805 36.44115
I tried this, got odd results:
> cfm <- cfm2 / drop(coredata(cfm2[1]))
> head(cfm)
TSX TY1 GC1:CAD CL1:CAD
1983-12-01 1.0000000 0.16172644 0.1944361 0.07298106
1983-12-02 31.7041175 2.19964833 6.1675513 1.00713855
1983-12-05 5.1072883 0.03120113 0.9990891 0.01426514
1983-12-06 69.8873485 0.99261849 13.6631959 0.45272736
1983-12-07 0.9965989 0.16021851 0.1962435 0.07270731
1983-12-08 31.5213675 2.18329086 6.0193380 1.00393224
I tried this and got the right results, but it took forever.
test.cfm <- function(cfm){
cfm1 <- cfm
for(cc in 1:ncol(cfm)){
for(rr in 1:nrow(cfm)){
coredata(cfm[rr,cc]) <- 100 * coredata(cfm1[rr,cc]) / coredata(cfm1[1,cc])
}
}
return(cfm)
}
> system.time(cfm <- test.cfm(cfm2))
user system elapsed
17.809 4.886 22.896
> head(cfm)
TSX TY1 GC1:CAD CL1:CAD
1983-12-01 100.00000 100.00000 100.00000 100.00000
1983-12-02 99.69507 99.26185 99.74561 100.71385
1983-12-05 99.30414 99.22300 99.90891 100.52841
1983-12-06 99.17123 99.26185 99.71545 100.32465
1983-12-07 99.65989 99.06760 100.92956 99.62491
1983-12-08 99.12041 98.52370 97.34861 100.39322
Any idea what's going on? I'm sure this must be obvious for those with experience, but I'm perplexed...
Your loop is over every. single. element. Let R's vectorization help you.
test.cfm.new <- function(cfm) {
for(cc in 1:ncol(cfm)) {
cfm[,cc] <- cfm[,cc] / drop(coredata(cfm[1,cc]))
}
return(100 * cfm)
}
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
system.time(cfm <- test.cfm(x))
# user system elapsed
# 0.111 0.000 0.112
system.time(cfm.new <- test.cfm.new(x))
# user system elapsed
# 0.000 0.000 0.001
all.equal(cfm, cfm.new)
# [1] TRUE
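A further-vectorized sketch (my own variation, not from the answer above): divide the whole coredata matrix by its first row with a single sweep() call and rebuild the xts object.
test.cfm.sweep <- function(cfm) {
  scaled <- sweep(coredata(cfm), 2, coredata(cfm)[1, ], "/")  # divide each column by its first value
  xts(100 * scaled, order.by = index(cfm))
}
head(test.cfm.sweep(x), 3)  # values should match test.cfm.new(x)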
I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code and it appears to slow down as the iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
library(RColorBrewer)
library(fields)

StreamFlux <- function(data,NoR,NTS){
###Read in data to display points###
WLX = c(8,19,29,20,13,20,21)
WLY = c(25,28,25,21,17,14,12)
WLY = 34 - WLY
WLX = WLX / 44
WLY = WLY / 33
timedata = NULL
mf <- function(i){
b = (NoR+8) * (i-1) + 8
###I read in data one section at a time to avoid headers
mydata = read.table(data,skip=b,nrows=NoR, header=FALSE)
rows = 34-mydata[,2]
cols = 45-mydata[,3]
flows = mydata[,7]
rows = as.numeric(rows)
cols = as.numeric(cols)
rm(mydata)
###Create Flux matrix
flow_mat <- matrix(0,44,33)
###Populate matrix###
flow_mat[(rows - 1) * 44 + (45-cols)] <- flows+flow_mat[(rows - 1) * 44 + (45-cols)]
flow_mat[flow_mat == 0] <- NA
rm(flows)
rm(rows)
rm(cols)
timestep = i
###Specifying jpeg info###
jpeg(paste("Steamflow", timestep, ".jpg",sep = ''),
width = 640, height=441,quality=75,bg="grey")
image.plot(flow_mat, zlim=c(-1,1),
col=brewer.pal(11, "RdBu"),yaxt="n",
xaxt="n", main=paste("Stress Period ",
timestep, sep = ""))
points(WLX,WLY)
dev.off()
rm(flow_mat)
}
ST<- function(x){functiontime=system.time(mf(x))
print(functiontime)}
lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a clumsy solution. I've read about dlply from the plyr package; it seems to have worked here, but would that help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow #PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with #joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
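For example, a rough sketch of the read-once approach, assuming the file layout implied by the question (8 header lines followed by NoR data rows per timestep):
all_lines <- readLines("stream_out.txt")     # read the file once

get_block <- function(i, NoR) {
  start <- (NoR + 8) * (i - 1) + 8 + 1       # first data row of timestep i
  read.table(text = all_lines[start:(start + NoR - 1)], header = FALSE)
}

# inside mf(i), the read.table(..., skip = b, ...) call could become:
# mydata <- get_block(i, NoR)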
I am trying to get historical prices for VIX futures by downloading all the CSV files on this page (http://cfe.cboe.com/Products/historicalVIX.aspx). Here is the code I am using to do this:
library(XML)
#Extract all links for url
url <- "http://cfe.cboe.com/Products/historicalVIX.aspx"
doc <- htmlParse(url)
links <- xpathSApply(doc, "//a/@href")
free(doc)
#Filter out URLs ending with csv and complete the link.
links <- links[substr(links, nchar(links) - 2, nchar(links)) == "csv"]
links <- paste("http://cfe.cboe.com", links, sep="")
#Perform read.csv on each url in links, skipping the first two URLs as they are not relevant.
c <- lapply(links[-(1:2)], read.csv, header = TRUE)
I get the error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
Upon further investigation, I realize this is because some of the CSV files are formatted differently. If I load the URL links[9] manually, I see that the first row has this disclaimer:
CFE data is compiled for the .......use of CFE data is subject to the Terms and Conditions of CBOE's Websites.
Most of the other files (e.g. links[8] and links[10]) are fine, so it seems this has been randomly inserted. Is there some R magic that can be done to handle this?
Thank you.
I have a getSymbols.cfe method in my qmao package (for the getSymbols function in quantmod package) that will make this a lot easier.
#install.packages('qmao', repos='http://r-forge.r-project.org')
library(qmao)
This is from the examples section of ?getSymbols.cfe (please read the help page as the function has a few arguments that you may want to be different than the defaults)
getSymbols(c("VX_U11", "VX_V11"),src='cfe')
#all contracts expiring in 2010 and 2011.
getSymbols("VX",Months=1:12,Years=2010:2011,src='cfe')
#getSymbols("VX",Months=1:12,Years=10:11,src='cfe') #same
And it's not just for VIX
getSymbols(c("VM","GV"),src='cfe') #The mini-VIX and Gold vol contracts expiring this month
If you're not familiar with getSymbols, by default it stores the data in your .GlobalEnv and returns the name of the object that was saved.
> getSymbols("VX_Z12", src='cfe')
[1] "VX_Z12"
> tail(VX_Z12)
VX_Z12.Open VX_Z12.High VX_Z12.Low VX_Z12.Close VX_Z12.Settle VX_Z12.Change VX_Z12.Volume VX_Z12.EFP VX_Z12.OpInt
2012-10-26 19.20 19.35 18.62 18.87 18.9 0.0 22043 15 71114
2012-10-31 18.55 19.50 18.51 19.46 19.5 0.6 46405 319 89674
2012-11-01 19.35 19.35 17.75 17.87 17.9 -1.6 40609 2046 95720
2012-11-02 17.90 18.65 17.55 18.57 18.6 0.7 42592 1155 100691
2012-11-05 18.60 20.15 18.43 18.86 18.9 0.3 28136 110 102746
2012-11-06 18.70 18.85 17.75 18.06 18.1 -0.8 35599 851 110638
Edit
I see now that I did not answer your question, but rather pointed you to another way to get the same error! A simple way to make your code work is to make a wrapper for read.csv that uses readLines to see if the first row contains the disclaimer; if it does, skip the first row, otherwise use read.csv as normal.
myRead.csv <- function(x, ...) {
if (grepl("Terms and Conditions", readLines(x, 1))) { #is the first row the disclaimer?
read.csv(x, skip=1, ...)
} else read.csv(x, ...)
}
L <- lapply(links[-(1:2)], myRead.csv, header = TRUE)
I also applied that patch to getSymbols.cfe. You can get the latest version of qmao (1.3.11) using svn checkout (see this post if you need help with that), or, you can wait until R-Forge builds it for you which usually happens pretty quickly, but could take up to a couple of days.