Very slow: xts dividing a column by first row - r

Any idea why this is so slow?
I'm trying to index (normalize) every column of an xts matrix by its first row.
> nrow(cfm2)
[1] 8326
> head(cfm2)
TSX TY1 GC1:CAD CL1:CAD
1983-12-01 2558.0 80.43750 497.3676 36.29842
1983-12-02 2550.2 79.84375 496.1024 36.55753
1983-12-05 2540.2 79.81250 496.9146 36.49022
1983-12-06 2536.8 79.84375 495.9524 36.41626
1983-12-07 2549.3 79.68750 501.9910 36.16226
1983-12-08 2535.5 79.25000 484.1805 36.44115
I tried this, got odd results:
> cfm <- cfm2 / drop(coredata(cfm2[1]))
> head(cfm)
TSX TY1 GC1:CAD CL1:CAD
1983-12-01 1.0000000 0.16172644 0.1944361 0.07298106
1983-12-02 31.7041175 2.19964833 6.1675513 1.00713855
1983-12-05 5.1072883 0.03120113 0.9990891 0.01426514
1983-12-06 69.8873485 0.99261849 13.6631959 0.45272736
1983-12-07 0.9965989 0.16021851 0.1962435 0.07270731
1983-12-08 31.5213675 2.18329086 6.0193380 1.00393224
I tried this, got the right results, but took forever.
test.cfm <- function(cfm){
  cfm1 <- cfm
  for(cc in 1:ncol(cfm)){
    for(rr in 1:nrow(cfm)){
      coredata(cfm[rr,cc]) <- 100 * coredata(cfm1[rr,cc]) / coredata(cfm1[1,cc])
    }
  }
  return(cfm)
}
> system.time(cfm <- test.cfm(cfm2))
user system elapsed
17.809 4.886 22.896
> head(cfm)
TSX TY1 GC1:CAD CL1:CAD
1983-12-01 100.00000 100.00000 100.00000 100.00000
1983-12-02 99.69507 99.26185 99.74561 100.71385
1983-12-05 99.30414 99.22300 99.90891 100.52841
1983-12-06 99.17123 99.26185 99.71545 100.32465
1983-12-07 99.65989 99.06760 100.92956 99.62491
1983-12-08 99.12041 98.52370 97.34861 100.39322
Any idea what's going on? I'm sure this must be obvious for those with experience, but I'm perplexed...

Your loop is over every. single. element. Let R's vectorization help you.
test.cfm.new <- function(cfm) {
  for(cc in 1:ncol(cfm)) {
    cfm[,cc] <- cfm[,cc] / drop(coredata(cfm[1,cc]))
  }
  return(100 * cfm)
}
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
system.time(cfm <- test.cfm(x))
# user system elapsed
# 0.111 0.000 0.112
system.time(cfm.new <- test.cfm.new(x))
# user system elapsed
# 0.000 0.000 0.001
all.equal(cfm, cfm.new)
# [1] TRUE
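As for the odd results from cfm2 / drop(coredata(cfm2[1])): the divisor is a plain vector of length ncol(cfm2), so R recycles it down the columns (column-major order) rather than across each row, which scrambles the values. A loop-free sketch of the same indexing, assuming cfm2 is an xts object as in the question, is to sweep the first row out of the underlying matrix and rebuild the xts:
# divide each column of the coredata matrix by its own first-row value,
# then restore the index and column names
cfm.vec <- xts(100 * sweep(coredata(cfm2), 2, drop(coredata(cfm2[1, ])), "/"),
               order.by = index(cfm2))
head(cfm.vec)  # should match the output of test.cfm()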

Related

How to let a countdown run in R [duplicate]

How do you pause an R script for a specified number of seconds or milliseconds? In many languages there is a sleep function, but ?sleep references a data set, and ?pause and ?wait don't exist.
The intended purpose is for self-timed animations. The desired solution works without asking for user input.
See help(Sys.sleep).
For example, from ?Sys.sleep
testit <- function(x)
{
  p1 <- proc.time()
  Sys.sleep(x)
  proc.time() - p1 # The cpu usage should be negligible
}
testit(3.7)
Yielding
> testit(3.7)
user system elapsed
0.000 0.000 3.704
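For the self-timed-animation use case from the question, a minimal sketch is to call Sys.sleep inside the loop that draws each frame (flush.console() only matters in GUIs that buffer output):
for (i in 5:1) {   # a five-step countdown, one second per step
  cat(i, "\n")
  flush.console()
  Sys.sleep(1)
}
cat("Go!\n")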
Sys.sleep() may not work as expected if CPU usage is very high, e.g. when other critical, high-priority processes are running in parallel.
This code worked for me. Here I am printing 1 to 1000 at a 2.5 second interval.
for (i in 1:1000)
{
  print(i)
  date_time <- Sys.time()
  while((as.numeric(Sys.time()) - as.numeric(date_time)) < 2.5){} # dummy (busy-wait) while loop
}
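Note that a dummy while loop like this busy-waits and keeps one CPU core at 100% for the whole interval. A gentler variant (a sketch, not from the original answer) checks the clock but sleeps in short slices in between:
wait_until <- function(seconds) {
  deadline <- Sys.time() + seconds
  while (Sys.time() < deadline) Sys.sleep(0.01)  # re-check roughly every 10 ms
}
wait_until(2.5)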
TL;DR: sys_sleep, a new stable and precise sleep function
We already know that Sys.sleep may not work as expected, e.g. when CPU usage is very high.
That is why I decided to prepare a higher-quality function powered by microbenchmark::get_nanotime() and a repeat loop.
#' Alternative to Sys.sleep function
#' Expected to be more stable
#' @param val `numeric(1)` value to sleep.
#' @param unit `character(1)` the available units are nanoseconds ("ns"), microseconds ("us"), milliseconds ("ms"), seconds ("s").
#' @note Dependency on the `microbenchmark` package to reuse `microbenchmark::get_nanotime()`.
#' @examples
#' # sleep 1 second in different units
#' sys_sleep(1, "s")
#' sys_sleep(1000, "ms")
#' sys_sleep(10**6, "us")
#' sys_sleep(10**9, "ns")
#'
#' sys_sleep(4.5)
#'
sys_sleep <- function(val, unit = c("s", "ms", "us", "ns")) {
  start_time <- microbenchmark::get_nanotime()
  stopifnot(is.numeric(val))
  unit <- match.arg(unit, c("s", "ms", "us", "ns"))
  # convert the requested interval to nanoseconds
  val_ns <- switch(unit,
    "s"  = val * 10**9,
    "ms" = val * 10**6,
    "us" = val * 10**3,
    "ns" = val
  )
  # busy-wait until the requested interval has elapsed
  repeat {
    current_time <- microbenchmark::get_nanotime()
    diff_time <- current_time - start_time
    if (diff_time > val_ns) break
  }
}
system.time(sys_sleep(1, "s"))
#> user system elapsed
#> 1.015 0.014 1.030
system.time(sys_sleep(1000, "ms"))
#> user system elapsed
#> 0.995 0.002 1.000
system.time(sys_sleep(10**6, "us"))
#> user system elapsed
#> 0.994 0.004 1.000
system.time(sys_sleep(10**9, "ns"))
#> user system elapsed
#> 0.992 0.006 1.000
system.time(sys_sleep(4.5))
#> user system elapsed
#> 4.490 0.008 4.500
Created on 2022-11-21 with reprex v2.0.2

base R faster than readr for reading multiple CSV files

There is a lot of documentation on how to read multiple CSVs and bind them into one data frame. I have 5000+ CSV files I need to read in and bind into one data structure.
In particular I've followed the discussion here: Issue in Loading multiple .csv files into single dataframe in R using rbind
The weird thing is that base R is much faster than any other solution I've tried.
Here's what my CSV looks like:
> head(PT)
Line Timestamp Lane.01 Lane.02 Lane.03 Lane.04 Lane.05 Lane.06 Lane.07 Lane.08
1 PL1 05-Jan-16 07:17:36 NA NA NA NA NA NA NA NA
2 PL1 05-Jan-16 07:22:38 NA NA NA NA NA NA NA NA
3 PL1 05-Jan-16 07:27:41 NA NA NA NA NA NA NA NA
4 PL1 05-Jan-16 07:32:43 9.98 10.36 10.41 10.16 10.10 9.97 10.07 9.59
5 PL1 05-Jan-16 07:37:45 9.65 8.87 9.88 9.86 8.85 8.75 9.19 8.51
6 PL1 05-Jan-16 07:42:47 9.14 8.98 9.29 9.04 9.01 9.06 9.12 9.08
I've created three methods for reading in and binding the data. The files are located in a separate directory which I define as:
dataPath <- "data"
PTfiles <- list.files(path=dataPath, full.names = TRUE)
Method 1: Base R
classes <- c("factor", "character", rep("numeric",8))
# build function to load data
load_data <- function(dataPath, classes) {
  tables <- lapply(PTfiles, read.csv, colClasses=classes, na.strings=c("NA", ""))
  do.call(rbind, tables)
}
#clock
method1 <- system.time(
  PT <- load_data(dataPath, classes)
)
Method 2: read_csv
In this case I created a wrapper function for read_csv to use
#create wrapper function for read_csv
read_csv.wrap <- function(x) {
  read_csv(x, skip = 1, na = c("NA", ""),
           col_names = c("tool", "timestamp", paste("lane", 1:8, sep="")),
           col_types = cols(
             tool = col_character(),
             timestamp = col_character(),
             lane1 = col_double(),
             lane2 = col_double(),
             lane3 = col_double(),
             lane4 = col_double(),
             lane5 = col_double(),
             lane6 = col_double(),
             lane7 = col_double(),
             lane8 = col_double()
           ))
}
##
# Same as method 1, just uses read_csv instead of read.csv
load_data2 <- function(dataPath) {
  tables <- lapply(PTfiles, read_csv.wrap)
  do.call(rbind, tables)
}
#clock
method2 <- system.time(
  PT2 <- load_data2(dataPath)
)
Method 3: read_csv + dplyr::bind_rows
load_data3 <- function(dataPath) {
  tables <- lapply(PTfiles, read_csv.wrap)
  dplyr::bind_rows(tables)
}
#clock
method3 <- system.time(
  PT3 <- load_data3(dataPath)
)
What I can't figure out is why the read_csv and dplyr methods are slower in elapsed time when they should be faster. The CPU time is lower, but why would the elapsed (wall-clock) time increase? What's going on here?
Edit - I added the data.table method as suggested in the comments
Method 4 data.table
library(data.table)
load_data4 <- function(dataPath){
  tables <- lapply(PTfiles, fread)
  rbindlist(tables)
}
method4 <- system.time(
  PT4 <- load_data4(dataPath)
)
The data.table method is the fastest from a CPU standpoint. But the question still stands on what is going on with the read_csv methods that makes them so slow.
> rbind(method1, method2, method3, method4)
user.self sys.self elapsed
method1 0.56 0.39 1.35
method2 0.42 1.98 13.96
method3 0.36 2.25 14.69
method4 0.34 0.67 1.74
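(As an aside: newer readr releases, 2.0 and later, are vroom-backed and can read a whole vector of paths in a single call, which may sidestep the per-file overhead. A sketch, assuming readr >= 2.0 and that all files share the layout above:)
PT_readr <- readr::read_csv(PTfiles, skip = 1, na = c("NA", ""),
                            col_names = c("tool", "timestamp", paste0("lane", 1:8)))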
I would do that in the terminal (Unix). I would put all the files in the same folder, navigate to that folder in the terminal, and then use the following command to create a single CSV file:
cat *.csv > merged_csv_file.csv
One observation regarding this method is that the header of each file will show up in the middle of the observations. To solve this, I would suggest you:
Get just the header from the first file
head -2 file1.csv > merged_csv_file.csv
Then skip the first "X" lines from the other files with the following command, where "X" is the number of lines to skip.
tail -n +3 -q file*.csv >> merged_csv_file.csv
-n +3 makes tail print lines from the 3rd to the end, -q tells it not to print the headers giving the file names (see the man page), and >> appends to the file instead of overwriting it as > would.
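Back in R, you would then read the single merged file once; a quick sketch, reusing the classes vector from the question and the merged_csv_file.csv name above:
PT_merged <- read.csv("merged_csv_file.csv", colClasses = classes,
                      na.strings = c("NA", ""))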
I might have found a related issue. I am reading in nested CSV data from some simulation output, where multiple columns have CSV formatted data as elements, which I need to unnest and reshape for analysis.
With simulations where I have many runs, this resulted in thousands of elements that needed to be parsed. Using map(., read_csv), this would take hours to transform. When I rewrote my script to apply read.csv in an anonymous (lambda) function instead, the operation completed in seconds.
I'm curious if there is some intermediate system I/O operation or error handling that creates a bottleneck you wouldn't run into with a single input file.

Subscript out of bound error in R

While using the factanal function from the stats package to perform factor analysis, I tried the following:
library(mirt)
library(ltm)
library(psych)
library(stats)
data(SAT12)
data=SAT12
cor_mat=polychoric(data, ML=TRUE, global=F)
fit <- factanal(factors=2, n.obs=nrow(data), covmat=cor_mat$rho)
Divide_item_Factor_Loading(fit)
When I try to run Divide_item_Factor_Loading(fit), an error
Error in a[[i]][[2]] : subscript out of bounds
pops up.
My complete code for Divide_item_Factor_Loading:
Divide_item_Factor_Loading = function(fit)
{
  a = list()
  items = NULL
  for(i in 1:nrow(fit$loadings)) ###### corresponding to rows of loading matrix
  {
    k = which(fit$loadings[i,] == max(abs(fit$loadings[i,])))
    a[[i]] = c(i, as.numeric(k))
  }
  fact_item_mat = matrix(, nrow=nrow(fit$loadings), ncol=ncol(fit$loadings))
  for(j in 1:(ncol(fit$loadings)))
  {
    for(i in 1:(nrow(fit$loadings)))
    {
      if(a[[i]][[2]] == j) {fact_item_mat[i,j] = a[[i]][[1]]}
    }
  }
  nam = names(fit$loadings[,1])
  factor = list()
  for(i in 1:ncol(fit$loadings))
  {
    factor[[i]] = sort(fact_item_mat[,i], decreasing = FALSE, na.last = NA)
    fac = factor[[i]]
    fac = nam[fac]
    factor[[i]] = fac
  }
  names(factor) = paste("factor", 1:ncol(fit$loadings), sep="")
  return(factor)
}
What steps should I take now to avoid this error?
To change the way the loadings are printed, use the cutoff argument to print.loadings.
Try something like this:
print(fit$loadings, cutoff=0)
The actual matrix contains all the values.
print(loadings(fit), cutoff=0)
Loadings:
Factor1 Factor2
Item 1 0.014 0.418
Item 2 0.130 0.350
Item 3 0.036 0.553
Item 4 0.166 0.294
Item 5 0.990 0.125
Factor1 Factor2
SS loadings 1.025 0.705
Proportion Var 0.205 0.141
Cumulative Var 0.205 0.346
Now extract the maximum loading on each factor, using apply():
apply(loadings(fit), 2, max)
Factor1 Factor2
0.9895743 0.5531770
Running your code and debugging your function (using the debug function), I can see why you're having a "subscript out of bounds" error:
the 15th element (among others) of your variable a is of length 1, so R is not happy when you try to reach a[[15]][2]...
The reason this element is of length one instead of 2 is that the maximum absolute loading is reached for a negative value, and you're asking which value (not absolute value) is equal to this maximal absolute value, so the answer is none...
Hence you need to change the line
which(fit$loadings[i,]==max(abs(fit$loadings[i,]))) to which(abs(fit$loadings[i,])==max(abs(fit$loadings[i,])))
and you'll get:
Divide_item_Factor_Loading(fit)
#$factor1
#[1] "Item.1" "Item.4" "Item.6" "Item.7" "Item.8" "Item.9" "Item.10" "Item.11" "Item.13" "Item.14" "Item.15"
#[12] "Item.17" "Item.19" "Item.20" "Item.24" "Item.26" "Item.27" "Item.28" "Item.29"
#$factor2
#[1] "Item.2" "Item.3" "Item.5" "Item.12" "Item.16" "Item.18" "Item.21" "Item.22" "Item.23" "Item.25" "Item.30"
#[12] "Item.31" "Item.32"
Even though the debugged function will now work, I think you should change it, because it is more complicated than it needs to be.
Here is my proposed alternative function:
Divide_item_Factor_Loading_v2 <- function(fit){
  a <- apply(fit$loadings, 1, function(facs) which(abs(facs) == max(abs(facs))))
  return(list(factor1 = names(a)[a==1], factor2 = names(a)[a==2]))
}
This gives for your fit object the exact same result as your (debugged) function:
Divide_item_Factor_Loading_v2(fit)
#$factor1
#[1] "Item.1" "Item.4" "Item.6" "Item.7" "Item.8" "Item.9" "Item.10" "Item.11" "Item.13" "Item.14" "Item.15"
#[12] "Item.17" "Item.19" "Item.20" "Item.24" "Item.26" "Item.27" "Item.28" "Item.29"
#$factor2
#[1] "Item.2" "Item.3" "Item.5" "Item.12" "Item.16" "Item.18" "Item.21" "Item.22" "Item.23" "Item.25" "Item.30"
#[12] "Item.31" "Item.32"
Check ?loadings and you'll find that there is a cutoff parameter, which defines a value such that "loadings smaller than this (in absolute value) are suppressed".

Loading variables into function in R

I have this example data, where I load some tickers
libs <- c('quantmod')
lapply(libs, require, character.only = T)
tickers<-c('T','AMD','AA','AMAT','BAC')
getSymbols(tickers,from="2013-01-01")
Then I created function like
FUNtest <- function(x, y){
  data <- x
  close <- data[,y]
  return(tail(close))
}
which works like for example
FUNtest(AMD,4)
and the result is tail of closing prices of AMD
AMD.Close
2014-07-16 4.66
2014-07-17 4.57
2014-07-18 3.83
2014-07-21 3.78
2014-07-22 3.80
2014-07-23 3.76
But, for later usage, I need to be able to use the function this way:
FUNtest(tickers[2],4)
but it doesn't work. If I call
tickers[2]
it shows
> tickers[2]
[1] "AMD"
but it does not work inside the function. Any advice on how to fix this?
Thanks
There's a big difference between
FUNtest(AMD,4)
and
FUNtest("AMD",4)
With the former, you are passing a name which points to an xts object. With the latter, you are simply passing a character string. This string is in no way directly connected to the object of the same name.
If you want a function that works if you pass a character or an xts object, you can do
FUNtest <- function(x, y){
  if(is(x, "xts")) {
    data <- x
  } else if (is(x, "character")) {
    data <- get(x)
  } else {
    stop(paste("invalid x class:", class(x)))
  }
  close <- data[,y]
  return(tail(close))
}
then both
FUNtest(AMD, 4)
FUNtest(tickers[2], 4)
will work.
But even better is not to rely on quantmod's behavior of adding variables to your global environment. That default is being phased out because it encourages bad practice. It's better to store them all in a list, like
symb <- lapply(setNames(tickers, tickers), function(x)
  getSymbols(x, from="2013-01-01", auto.assign=F))
Then you can use symb$AMAT or symb[["AMAT"]], depending on how you want to extract the data. The latter form is more flexible because you can supply a variable holding a particular value, or you can apply an action to all the data sets by lapply-ing over the list.
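For example (a sketch, assuming symb was built as above), you can pull one series by name or act on all of them at once:
tail(symb[["AMD"]][, 4])                   # closing prices for one ticker
lapply(symb, function(x) tail(x[, 4], 3))  # last three closes for every ticker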
You could try using get in the function.
get("AMD") finds AMD in the evaluation frame (or not) and returns the value attached to it.
> FUNtest <- function(x, y){
  data <- get(x)
  close <- data[,y]
  return(tail(close))
}
> FUNtest(tickers[2], 4)
# AMD.Close
# 2014-07-16 4.66
# 2014-07-17 4.57
# 2014-07-18 3.83
# 2014-07-21 3.78
# 2014-07-22 3.80
# 2014-07-23 3.76
Also, there isn't really a need to use return here. This function might be better for you
> f <- function(x, y){ x <- get(x); tail(x[, y], 3) }
## on the entire tickers vector, get column 4 and bind them
> do.call(cbind, lapply(tickers, f, y = 4))
# T.Close AMD.Close AA.Close AMAT.Close BAC.Close
# 2014-07-16 36.45 4.66 16.60 22.85 15.51
# 2014-07-17 36.03 4.57 16.33 22.77 15.20
# 2014-07-18 36.17 3.83 16.49 23.00 15.49
eval can also be quite useful for unquoted arguments
> f <- function(x){ eval(x) }
> head(f(AMD), 3)
# AMD.Open AMD.High AMD.Low AMD.Close AMD.Volume AMD.Adjusted
# 2013-01-02 2.55 2.57 2.45 2.53 27214800 2.53
# 2013-01-03 2.52 2.59 2.46 2.49 24966900 2.49
# 2013-01-04 2.51 2.59 2.49 2.59 22054200 2.59
In the first case you're passing a data frame called AMD; in the second you're just passing the character value "AMD".
I'm guessing that the data frame AMD is already loaded into your workspace, so that's why FUNtest works in the first case.
If you want the function to work, try either passing the data frame you want to the function, or telling the function where to find the data frame you want.

R code slowing with increased iterations

I've been trying to increase the speed of some code. I've removed all loops, am using vectors, and have streamlined just about everything. I've timed each iteration of my code, and it appears to be slowing down as the iterations increase.
### The beginning iterations
user system elapsed
0.03 0.00 0.03
user system elapsed
0.03 0.00 0.04
user system elapsed
0.03 0.00 0.03
user system elapsed
0.04 0.00 0.05
### The ending iterations
user system elapsed
3.06 0.08 3.14
user system elapsed
3.10 0.05 3.15
user system elapsed
3.08 0.06 3.15
user system elapsed
3.30 0.06 3.37
I have 598 iterations and right now it takes about 10 minutes. I'd like to speed things up. Here's how my code looks. You'll need the RColorBrewer and fields packages. Here's my data. Yes, I know it's big; make sure you download the zip file.
StreamFlux <- function(data, NoR, NTS){
  ### Read in data to display points ###
  WLX = c(8,19,29,20,13,20,21)
  WLY = c(25,28,25,21,17,14,12)
  WLY = 34 - WLY
  WLX = WLX / 44
  WLY = WLY / 33
  timedata = NULL
  mf <- function(i){
    b = (NoR+8) * (i-1) + 8
    ### I read in data one section at a time to avoid headers
    mydata = read.table(data, skip=b, nrows=NoR, header=FALSE)
    rows = 34 - mydata[,2]
    cols = 45 - mydata[,3]
    flows = mydata[,7]
    rows = as.numeric(rows)
    cols = as.numeric(cols)
    rm(mydata)
    ### Create Flux matrix
    flow_mat <- matrix(0, 44, 33)
    ### Populate matrix ###
    flow_mat[(rows - 1) * 44 + (45-cols)] <- flows + flow_mat[(rows - 1) * 44 + (45-cols)]
    flow_mat[flow_mat == 0] <- NA
    rm(flows)
    rm(rows)
    rm(cols)
    timestep = i
    ### Specifying jpeg info ###
    jpeg(paste("Steamflow", timestep, ".jpg", sep = ''),
         width = 640, height = 441, quality = 75, bg = "grey")
    image.plot(flow_mat, zlim = c(-1,1),
               col = brewer.pal(11, "RdBu"), yaxt = "n",
               xaxt = "n", main = paste("Stress Period ", timestep, sep = ""))
    points(WLX, WLY)
    dev.off()
    rm(flow_mat)
  }
  ST <- function(x){
    functiontime = system.time(mf(x))
    print(functiontime)
  }
  lapply(1:NTS, ST)
}
This is how to run the function
###To run all timesteps###
StreamFlux("stream_out.txt",687,598)
###To run the first 100 timesteps###
StreamFlux("stream_out.txt",687,100)
###The first 200 timesteps###
StreamFlux("stream_out.txt",687,200)
To test, remove print(functiontime) to stop it printing at every timestep, then:
> system.time(StreamFlux("stream_out.txt",687,100))
user system elapsed
28.22 1.06 32.67
> system.time(StreamFlux("stream_out.txt",687,200))
user system elapsed
102.61 2.98 106.20
What I'm looking for is any way to speed up running this code, and possibly an explanation of why it is slowing down. Should I just run it in parts? That seems like a clumsy solution. I've read about dlply from the plyr package; it seems to have worked here, but would it help in my case? How about parallel processing? I think I can figure that out, but is it worth the trouble in this case?
I will follow @PaulHiemstra's suggestion and post my comment as an answer. Who can resist Internet points? ;)
From a quick glance at your code, I agree with @joran's second point in his comment: your loop/function is probably slowing down due to repeatedly reading in your data. More specifically, this part of the code probably needs to be fixed:
read.table(data, skip=b, nrows=NoR, header=FALSE).
In particular, I think the skip=b argument is the culprit. Because read.table still has to scan past all b skipped lines on every call, each iteration gets slower as i grows, and the total run time grows roughly quadratically with the number of timesteps, which matches your timings. You should read in all the data at the beginning, if possible, and then retrieve the necessary parts from memory for the calculations.
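A minimal sketch of that idea (assumptions: NoR and NTS as passed to StreamFlux, and the 8 header lines before each block implied by the skip arithmetic in mf()):
all_lines <- readLines("stream_out.txt")             # scan the file once
block_start <- function(i) (NoR + 8) * (i - 1) + 9   # first data line of block i
get_block <- function(i) {
  idx <- block_start(i) + 0:(NoR - 1)
  read.table(text = all_lines[idx], header = FALSE)  # parse block i from memory
}
# then inside mf(), use mydata <- get_block(i) instead of read.table(..., skip=b)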
