The function below takes a folder of CSV files (each file is a financial time series with datetime, open, high, low, close columns) and builds a single xts object for each of the open, high, low and close prices, where each column is an individual security. For my use case, this representation allows much more convenient and faster processing than keeping a separate xts object per file.
require(quantmod)

LoadUniverseToEnv <- function(srcDir, env) {
  fileList <- list.files(srcDir)
  if (length(fileList) == 0)
    stop("No files found!")

  env$op <- NULL
  env$hi <- NULL
  env$lo <- NULL
  env$cl <- NULL
  cols <- NULL

  for (file in fileList) {
    filePath <- sprintf("%s/%s", srcDir, file)
    if (file.info(filePath)$isdir == FALSE) {
      x <- as.xts(read.zoo(filePath, header = TRUE, sep = ",", tz = ""))
      cols <- c(sub("_.*", "", file), cols)

      # do outer join: prepend the new security as a column on each object
      env$op <- merge(Op(x), env$op)
      env$hi <- merge(Hi(x), env$hi)
      env$lo <- merge(Lo(x), env$lo)
      env$cl <- merge(Cl(x), env$cl)

      cat(sprintf("%s : added: %s from: %s to: %s\n", as.character(Sys.time()), file, start(x), end(x)))
    }
  }

  colnames(env$op) <- cols
  colnames(env$hi) <- cols
  colnames(env$lo) <- cols
  colnames(env$cl) <- cols
}
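For reference, a minimal usage sketch (the directory path here is hypothetical):

universe <- new.env()
LoadUniverseToEnv("./csv_universe", universe)  # hypothetical folder of per-security CSVs
head(universe$cl)                              # one close-price column per security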
Performance is fine for a limited number of files, but each merge slows linearly with the width of the growing xts objects, so total run time grows roughly quadratically and becomes a problem for large datasets. The bottleneck is CPU during the merge, when a new column is appended to each of the four objects (e.g. roughly 100ms initially, slowing by about 1ms per additional column).
Since it's CPU bound, my first thought was to parallelize by merging n batches of files and then merging the results, but I'm wondering if there's a better way.
The best solution I found for this was to merge in "chunks". For example, assuming 100 columns, merging into 10 XTS objects with 10 columns each and then merging those 10 objects dramatically improves performance.
The example below shows a roughly 15x speed-up (36.3s down to 2.3s) when merging 2000 xts objects with 1000 rows each and identical indexes.
Example:
require(xts)
require(foreach)

nCols <- 2000
nRows <- 1000

x <- xts(runif(nRows), order.by = as.Date(seq_len(nRows), origin = "1970-01-01"))
xList <- list()
for (i in 1:nCols)
  xList[[i]] <- x

testA <- function() {
  merged <- NULL
  for (x in xList)
    merged <- merge(x, merged)
  colnames(merged) <- 1:length(xList)
  merged
}

testB <- function() {
  n <- length(xList)
  nChunks <- floor(sqrt(n))
  idx <- split(1:n, sort(1:n %% nChunks))

  merged <- foreach(chunk = 1:nChunks, .combine = "merge") %do% {
    merged <- foreach(i = idx[[chunk]], .combine = "merge") %do% {
      xList[[i]]
    }
    merged
  }
  colnames(merged) <- 1:length(xList)
  merged
}

print("Test A")
print(system.time(resultA <- testA()))
print("Test B")
print(system.time(resultB <- testB()))
print(sprintf("Identical : %s", identical(resultA, resultB)))
print(sprintf("Dimensions: %dx%d", ncol(resultA), nrow(resultA)))
Output:
[1] "Test A"
   user  system elapsed
  33.12    3.18   36.30
[1] "Test B"
   user  system elapsed
   2.28    0.01    2.31
[1] "Identical : TRUE"
[1] "Dimensions: 2000x1000"
Note that the foreach loops above use %do%, so they are not running in parallel.
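Since the chunks are independent until the final cross-chunk merge, the outer loop could also be run with %dopar%. A hedged sketch (untested here; the cluster setup and worker count are my assumptions, and the cost of shipping xList to the workers may offset the gains):

library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

n <- length(xList)
nChunks <- floor(sqrt(n))
idx <- split(1:n, sort(1:n %% nChunks))

# each worker merges one chunk of columns; the chunk results are merged on combine
resultC <- foreach(chunk = 1:nChunks, .combine = "merge", .packages = "xts") %dopar% {
  Reduce(merge, xList[idx[[chunk]]])
}
colnames(resultC) <- 1:n

stopCluster(cl)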
Related
I have 1000 JSON files and I would like to read them in parallel. I have 4 CPU cores.
I have a character vector with the names of all the files, as follows:
cik_files <- list.files("./data/", pattern = ".json")
Using this vector I load each file, extract the data, and add it to the following list:
data <- list()
Below is the code for extracting the data:
library(jsonlite)

for (i in 1:1000) {
  data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
    try({
      if (nrow(data1) > 0) {
        data1$cik <- strtrim(cik_files[i], 13)
        data[[length(data) + 1]] <- data1
      }
    }, silent = TRUE)
  }
}
This, however, takes quite a lot of time, so I was wondering how I can run the code inside the for loop in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files){
  filename <- file.path(path, cik_files[i])
  data1 <- fromJSON(filename, flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    found <- grep("CY20[0-9]{2}$", data1$frame)
    if (length(found) > 0) {
      tryCatch({
        out <- data1[found, c(3, 9)]
        out$cik <- strtrim(cik_files[i], 13)
        out
      },
      error = function(e) e,
      warning = function(w) w)
    } else NULL
  } else NULL
}
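Run serially, this is just an lapply call; the parallel versions in Step 2 simply swap in mclapply or parLapply:

library(jsonlite)
json_list <- lapply(seq_along(cik_files), f, cik_files = cik_files)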
Step 2
Now load the package parallel and run one of the following, depending on OS.
library(parallel)

# Not on Windows
library(jsonlite)
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files,
                      mc.cores = detectCores() - 1L)  # mclapply defaults to 2 cores

# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files")
clusterEvalQ(cl, library(jsonlite))
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok] # correctly read in
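To go from the filtered list to a single table, one option (my addition, assuming the successfully parsed elements share the same columns) is:

# stack the successfully parsed elements into one data frame
net_income <- do.call(rbind, json_list[ok])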
I have a foreach %dopar% setup to process my data in parallel, but I am looking for ways to increase the speed. Basically I have a large data frame that is loaded using fread; it is 225 obs x 655,369 variables. The foreach command selects two variables at a time and runs a process function (code that calculates various mediation, moderation, and conditional process models), for a total of 327,684 iterations. For this function the data must all be within the same data frame, and I have noticed that the size of the data frame seems to greatly slow down the foreach loop.
From what I can tell, the major cause of the slowdown is how the process function accesses the data: my guess is that each time the loop runs, the process function scans the entire data frame until it finds the correct variable for each of the inputs.
One of my thoughts is to just chunk the data into smaller data frames to speed up processing time, and then merge the outputs together at the end (see the sketch after the code below). But I was wondering if anyone else has any suggestions for speeding this up, as I am obviously not overly familiar with R.
The variable names in area_list and thickness_list (the mediators, and the only values that change between loops) are labelled like this: the last number is either 0 or 1 within a pair, and all other numbers match:
value_0_0_0 with value_0_0_1 for the first loop
value_1_0_0 with value_1_0_1 for the second loop
value_2_0_0 with value_2_0_1 for the third loop
value_3_0_0 with value_3_0_1 for the fourth loop
...
value_327684_1_0 with value_327684_1_1 etc.
options(scipen=999)
library(tidyverse)
library(foreach)
library(iterators)
library(parallel)
library(doParallel)
library("data.table")
library('janitor')
source("/scratch/R/process.r")
nCores <- detectCores() - 1
cl <- makeCluster(nCores)
registerDoParallel(cl)
my_data <- fread(
file = "/scratch/R/data.csv", header = TRUE, fill=TRUE, data.table = FALSE)
#Change values from -999 to NA in specific columns to avoid data issues (McAuley data)
my_data[, 88:133][my_data[, 88:133] == -999] <- NA
#Create dataframe for prepost useable data only
prepost_df <- subset(my_data, Select_PrePost==1)
pre_df <- subset(my_data, Select==1)
Large_MyData <- fread(
file = "/scratch/R/large.csv", header = TRUE, sep = ",", data.table = FALSE)
area_list <- names(Large_MyData)[grep("_1$",names(Large_MyData))]
thickness_list <- names(Large_MyData)[grep("_0$",names(Large_MyData))]
merged_data <- merge(pre_df, Large_MyData, by = "subs")
yvalue = "y"
xvalue = "x"
covariates = c("a","g","e")
ptm <- proc.time()
loopResults <-
  foreach(area = area_list, thickness = thickness_list, .combine = rbind) %dopar% {
    if (merged_data[area][1, 1] == 0) {
      merge_df3 <- rbind(area, thickness)
      merge_df_out <- cbind(merge_df3, yvalue, xvalue, '', '', '', '', '', '')
    } else {
      result <- process(data = merged_data, y = yvalue, x = xvalue, m = c(area, thickness), cov = covariates,
                        model = 4, contrast = 1, boot = 5000, save = 2, modelbt = 1, outscreen = 0)
      indirectEffects <- result[23:24, 1:4]
      indirectEffects_bootmean_area <- result[27, 2] * result[38, 2]
      indirectEffects_bootmean_thickness <- result[32, 2] * result[39, 2]
      indirectEffects_bootscore_area <- (indirectEffects_bootmean_area / result[23, 2])
      indirectEffects_bootscore_thickness <- (indirectEffects_bootmean_thickness / result[24, 2])
      merge_df1 <- rbind(indirectEffects_bootmean_area, indirectEffects_bootmean_thickness)
      merge_df2 <- rbind(indirectEffects_bootscore_area, indirectEffects_bootscore_thickness)
      merge_df3 <- rbind(area, thickness)
      merge_df_out <- cbind(merge_df3, yvalue, xvalue, indirectEffects, merge_df1, merge_df2)
    }
  }
proc.time() - ptm
stopCluster(cl)
colnames(loopResults) <- c("Vector","yvalue","xvalue","Effect","BootSE","BootLLCI","BootULCI","BootMean","boot_score")
loopResults
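To illustrate the chunking idea mentioned above: one option is to hand each worker a narrow data frame containing only the columns that iteration actually uses, instead of the full 655k-column merged_data. This is a hedged, untested sketch that reuses the names from the code above (yvalue, xvalue, covariates, "subs", process); whether process() behaves identically on the reduced data frame is an assumption.

# Sketch: build a narrow per-iteration data frame instead of passing merged_data whole
fixed_cols <- c("subs", yvalue, xvalue, covariates)

loopResults <- foreach(area = area_list, thickness = thickness_list,
                       .combine = rbind) %dopar% {
  small_df <- merged_data[, c(fixed_cols, area, thickness)]
  if (small_df[[area]][1] == 0) {
    cbind(rbind(area, thickness), yvalue, xvalue, '', '', '', '', '', '')
  } else {
    result <- process(data = small_df, y = yvalue, x = xvalue,
                      m = c(area, thickness), cov = covariates,
                      model = 4, contrast = 1, boot = 5000, save = 2,
                      modelbt = 1, outscreen = 0)
    # same post-processing as the original loop body
    indirectEffects <- result[23:24, 1:4]
    bootmean_area      <- result[27, 2] * result[38, 2]
    bootmean_thickness <- result[32, 2] * result[39, 2]
    bootscore_area      <- bootmean_area / result[23, 2]
    bootscore_thickness <- bootmean_thickness / result[24, 2]
    cbind(rbind(area, thickness), yvalue, xvalue, indirectEffects,
          rbind(bootmean_area, bootmean_thickness),
          rbind(bootscore_area, bootscore_thickness))
  }
}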
I'm currently trying to write an R script to import a variety of files I've created related to a dataset. This involves reading a lot of .txt files using several nested for loops based on how I've organized the directories and names of the files.
I can run the innermost loop fine (albeit a little slowly). However, trying to run the second loop or any further loops produces the following error:
Error: vector memory exhausted (limit reached?)
I believe this may be related to how R handles memory? I'm running R out of RStudio. I've also tried the solution posted here, with no luck.
R version 3.5.1 (2018-07-02) -- "Feather Spray"
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Code Below
subjects <- 72
loop1_names <- as.character(list('a','b','c'))
loop2_names <- as.character(list('one','two','three'))
loop3_names <- as.character(list('N1','N2'))
loop4_names <- as.character(list('choice1','choice2','choice3'))

i <- 1; j <- 1
loop3.subset <- data.frame()
for (k in 1:length(loop3_names)) {
  loop4.subset <- data.frame()  # data frame for handling each set of loop 4 values
  for (l in 1:length(loop4_names)) {
    # Code for extracting the variables for each measure
    measures.path <- file.path(results_fldr, 'amp_measures', loop1_names[i], loop2_names[j], 'mont', loop3_names[k])
    measures.data <- read.table(file.path(measures.path, paste(paste(loop1_names[i], loop2_names[j], loop3_names[k], loop4_names[l], sep = '_'), '.txt', sep = '')), header = T, nrows = subjects)
    # Get rid of the IDs, we'll add those back in later
    col_idx_ID <- grep('ID', names(measures.data))
    measures.data <- as.data.frame(measures.data[, -col_idx_ID])  # keep the measures as a data frame when trimming
    names(measures.data) <- c(paste(loop1_names[i], loop2_names[j], loop3_names[k], loop4_names[l], sep = '_'))  # add a label to the data
    # Now combine this data with the other data in the loop4 subset data frame
    if (l == 1) {
      loop4.subset <- measures.data
    } else {
      loop4.subset <- merge(loop4.subset, measures.data)
    }
  }  # End l/loop 4
  if (k == 1) {
    loop3.subset <- loop4.subset
  } else {
    freq.subset <- merge(loop3.subset, loop4.subset)
  }
}  # End k/loop 3
Generally I would suggest reading only part of the data into memory, then writing the partial merge to disk. In the example below (which of course I can't run, since I don't have your files), I write to disk after each i, j iteration, which leaves 9 files (3 values of i times 3 values of j). You then read those 9 files back in and merge them in another loop. If you still have memory problems, break it up further: do the "j" merge first and write each result to one of 3 "i" files. If you can't merge even those, you have a fundamental lack of memory on your machine.
subjects <- 72
loop1_names <- as.character(list('a','b','c'))
loop2_names <- as.character(list('one','two','three'))
loop3_names <- as.character(list('N1','N2'))
loop4_names <- as.character(list('choice1','choice2','choice3'))

for (i in 1:length(loop1_names)) {
  for (j in 1:length(loop2_names)) {
    loop3.subset <- data.frame()
    for (k in 1:length(loop3_names)) {
      loop4.subset <- data.frame()
      for (l in 1:length(loop4_names)) {
        ## Code for extracting the variables for each measure
        measures.path <- file.path(results_fldr,
                                   'amp_measures',
                                   loop1_names[i],
                                   loop2_names[j],
                                   'mont',
                                   loop3_names[k])
        measures.data <- read.table(file.path(measures.path, paste(paste(loop1_names[i],
                                                                         loop2_names[j],
                                                                         loop3_names[k],
                                                                         loop4_names[l],
                                                                         sep = '_'), '.txt', sep = '')),
                                    header = T, nrows = subjects)
        ## Get rid of the IDs, we'll add those back in later
        col_idx_ID <- grep('ID', names(measures.data))
        measures.data <- as.data.frame(measures.data[, -col_idx_ID])
        names(measures.data) <- c(paste(loop1_names[i],
                                        loop2_names[j],
                                        loop3_names[k],
                                        loop4_names[l],
                                        sep = '_'))
        ## Now combine this data with the other data in the loop4 subset data frame
        if (l == 1) {
          loop4.subset <- measures.data
        } else {
          loop4.subset <- merge(loop4.subset, measures.data)
        }
      }  # End l/loop 4
      if (k == 1) {
        loop3.subset <- loop4.subset
      } else {
        freq.subset <- merge(loop3.subset, loop4.subset)
      }
    }  # End k/loop 3
    write.table(freq.subset, paste0(i, "_", j, ".txt"))
  }
}
## Now you have 9 files to read in and merge.
## Something like this:
df <- NULL
for (i in 1:length(loop1_names)) {
  for (j in 1:length(loop2_names)) {
    df1 <- read.table(paste0(i, "_", j, ".txt"))
    if (is.null(df)) {
      df <- df1
    } else {
      df <- merge(df, df1)
    }
  }
}
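Equivalently (a small variant of my own, not part of the original answer), the nine files can be read and merged in one pass with Reduce:

# build the 9 file names and fold merge() over the data frames read back in
files <- as.vector(outer(seq_along(loop1_names), seq_along(loop2_names),
                         function(i, j) paste0(i, "_", j, ".txt")))
df <- Reduce(merge, lapply(files, read.table))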
I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm and wfdf functions from the qdap package, but have noticed that this is a bit slow for my needs. It appears that the production of the word-frequency matrix is the bottleneck.
The code for my function is as follows.
library(qdap)

liwcr <- function(inputText, dict) {
  if (!file.exists(dict))
    stop("Dictionary file does not exist.")

  # Read in dictionary categories
  # Start by figuring out where the category list begins and ends
  dictionaryText <- readLines(dict)
  if (!length(grep("%", dictionaryText)) == 2)
    stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")

  catStart <- grep("%", dictionaryText)[1]
  catStop <- grep("%", dictionaryText)[2]
  dictLength <- length(dictionaryText)
  dictionaryCategories <- read.table(dict, header = F, sep = "\t", skip = catStart, nrows = (catStop - 2))

  wordCount <- word_count(inputText)

  outputFrame <- dictionaryCategories
  outputFrame["count"] <- 0

  # Now read in dictionary words
  no_col <- max(count.fields(dict, sep = "\t"), na.rm = T)
  dictionaryWords <- read.table(dict, header = F, sep = "\t", skip = catStop, nrows = (dictLength - catStop), fill = TRUE, quote = "\"", col.names = 1:no_col)

  workingMatrix <- wfdf(inputText)

  for (i in workingMatrix[, 1]) {
    if (i %in% dictionaryWords[, 1]) {
      occurrences <- 0
      foundWord <- dictionaryWords[dictionaryWords$X1 == i, ]
      foundCategories <- foundWord[1, 2:no_col]
      for (w in foundCategories) {
        if (!is.na(w) & (!w == "")) {
          existingCount <- outputFrame[outputFrame$V1 == w, ]$count
          outputFrame[outputFrame$V1 == w, ]$count <- existingCount + workingMatrix[workingMatrix$Words == i, ]$all
        }
      }
    }
  }
  return(outputFrame)
}
I realize the for loop is inefficient, so in an effort to locate the bottleneck I tested the function without that portion of the code (simply reading in each text file and producing the word-frequency matrix), and saw very little in the way of speed improvements. Example:
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000

lapply(1:n, function(i) {
  out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
  cat(out, file = file.path(fn, sprintf("tweet%s.txt", i)))
})

filename <- sprintf("tweet%s.txt", 1:n)

for (i in 1:length(filename)) {
  print(filename[i])
  text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
  freq <- wfm(text)
}
The input files are Twitter and Facebook status postings.
Is there any way to improve the speed for this code?
EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, just to give an idea of what I'm dealing with: 25k text files, each with all the available tweets from an individual Twitter user. There are also an additional 100k files with Facebook status updates, structured in the same way.
Here is a qdap approach and a mixed qdap/tm approach that is faster. I provide the code and then the timings for each. Basically I read everything in at once and operate on the entire data set; you can then split it back apart afterwards with split if you want.
An MWE (of the kind you should provide with questions):
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000

lapply(1:n, function(i) {
  out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
  cat(out, file = file.path(fn, sprintf("tweet%s.txt", i)))
})

filename <- sprintf("tweet%s.txt", 1:n)
The qdap approach
tic <- Sys.time() ## time it

dat <- list2df(setNames(lapply(filename, function(x) {
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")

difftime(Sys.time(), tic) ## time to read in

the_wfm <- with(dat, wfm(text, tweet))

difftime(Sys.time(), tic) ## time to make wfm
Timing qdap approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 2.97617 secs
>
> the_wfm <- with(dat, wfm(text, tweet))
>
> difftime(Sys.time(), tic) ## time to make wfm
Time difference of 48.9238 secs
The qdap-tm combined approach
tic <- Sys.time() ## time it

dat <- list2df(setNames(lapply(filename, function(x) {
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")

difftime(Sys.time(), tic) ## time to read in

tweet_corpus <- with(dat, as.Corpus(text, tweet))

tdm <- tm::TermDocumentMatrix(tweet_corpus,
    control = list(removePunctuation = TRUE,
                   stopwords = FALSE))

difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Timing qdap-tm combined approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 3.108177 secs
>
>
> tweet_corpus <- with(dat, as.Corpus(text, tweet))
>
> tdm <- tm::TermDocumentMatrix(tweet_corpus,
+ control = list(removePunctuation = TRUE,
+ stopwords = FALSE))
>
> difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Time difference of 13.52377 secs
There is a qdap-tm package compatibility vignette to help users move between qdap and tm. As you can see, on 10,000 tweets the combined approach is roughly 3.5x faster. A purely tm approach may be faster still. Also, if you want the wfm, use as.wfm(tdm) to coerce the TermDocumentMatrix.
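For completeness, the coercion mentioned above is a one-liner (assuming the tdm object from the combined approach above):

# convert the tm TermDocumentMatrix back to a qdap word-frequency matrix
the_wfm_from_tdm <- as.wfm(tdm)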
Your code, though, is slower either way because it's not the R way of doing things. I'd recommend reading some additional material on R to get better at writing faster code. I'm currently working through Hadley Wickham's Advanced R, which I'd recommend.
I have a large list of objects (say 100k elements). Each element has to be processed by a function process, but I would like to do the processing in chunks (say 20 passes, for example), because I want to save the processing results to a file on disk and keep working memory free.
I'm new to R and I know that it should involve some apply magic but I don't know how to do it (yet).
Any guidance would be much appreciated.
A small example:
objects <- list()
for (i in 1:100) {
  objects <- append(objects, 500)
}
objects

processOneElement <- function(x) {
  x/20 + 23
}
I would like to process the first 20 elements in one go and save the results, then process the second 20 elements and save those results, and so on.
objects <- list()
for (i in 1:100) {
  objects <- append(objects, 500)
}
objects

process <- function(x) {
  x/20 + 23
}

# process everything at once
results <- lapply(objects, FUN = process)

# process in batches of 20, writing each batch to disk
index <- seq(1, length(objects), by = 20)
lapply(index, function(idx1) {
  idx2 <- min(idx1 + 20 - 1, length(objects))
  batch <- lapply(idx1:idx2, function(x) {
    process(objects[[x]])
  })
  write.table(batch, paste("batch", idx1, sep = ""))
})
With what you have given, this is the answer I could suggest. Assuming your list is stored in list.object,
lapply(seq(1, length(list.object), by = 20), function(idx) {
  # here idx will be 1, 21, 41 etc...
  idx2 <- min(idx + 20 - 1, length(list.object))

  # do what you want here..
  batch.20.processed <- lapply(idx:idx2, function(x) {
    process(list.object[[x]])  # passes idx:idx2 indices one at a time
  })

  # here you have a processed list with (up to) 20 elements
  # finally write each one to file
  lapply(seq_along(batch.20.processed), function(x) {
    write.table(batch.20.processed[[x]], ...)
    # where "..." is all other allowed arguments to write.table,
    # such as file, row.names, col.names, quote etc.
    # don't literally pass "..." to write.table
  })
})
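A slightly more compact way to build the same batches, using split() (a sketch of mine, not part of the answer above; file names here are illustrative):

# break the list into chunks of 20 via integer division of the indices
chunks <- split(list.object, ceiling(seq_along(list.object) / 20))

# process and write one chunk at a time
invisible(lapply(names(chunks), function(nm) {
  batch <- unlist(lapply(chunks[[nm]], process))
  write.table(batch, paste0("batch_", nm, ".txt"))
}))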