I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm and wfdf functions from the qdap package, but have noticed that this is a bit slow for my needs. It appears that the production of the word-frequency matrix is the bottleneck.
The code for my function is as follows.
library(qdap)
liwcr <- function(inputText, dict) {
  if(!file.exists(dict))
    stop("Dictionary file does not exist.")

  # Read in dictionary categories
  # Start by figuring out where the category list begins and ends
  dictionaryText <- readLines(dict)
  if(!length(grep("%", dictionaryText))==2)
    stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")

  catStart <- grep("%", dictionaryText)[1]
  catStop <- grep("%", dictionaryText)[2]
  dictLength <- length(dictionaryText)

  dictionaryCategories <- read.table(dict, header=F, sep="\t", skip=catStart, nrows=(catStop-2))

  wordCount <- word_count(inputText)

  outputFrame <- dictionaryCategories
  outputFrame["count"] <- 0

  # Now read in dictionary words
  no_col <- max(count.fields(dict, sep = "\t"), na.rm=T)
  dictionaryWords <- read.table(dict, header=F, sep="\t", skip=catStop,
                                nrows=(dictLength-catStop), fill=TRUE, quote="\"", col.names=1:no_col)

  workingMatrix <- wfdf(inputText)

  for (i in workingMatrix[,1]) {
    if (i %in% dictionaryWords[, 1]) {
      occurrences <- 0
      foundWord <- dictionaryWords[dictionaryWords$X1 == i,]
      foundCategories <- foundWord[1,2:no_col]
      for (w in foundCategories) {
        if (!is.na(w) & (!w=="")) {
          existingCount <- outputFrame[outputFrame$V1 == w,]$count
          outputFrame[outputFrame$V1 == w,]$count <- existingCount + workingMatrix[workingMatrix$Words == i,]$all
        }
      }
    }
  }

  return(outputFrame)
}
I realize the for loop is inefficient, so in an effort to locate the bottleneck I tested the code without that portion (simply reading in each text file and producing the word-frequency matrix) and saw very little in the way of speed improvement. Example:
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
  out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
  cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
for(i in 1:length(filename)){
  print(filename[i])
  text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
  freq <- wfm(text)
}
The input files are Twitter and Facebook status postings.
Is there any way to improve the speed for this code?
EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, just to give an idea of what I'm dealing with: 25k text files, each with all the available tweets from an individual Twitter user. There are also an additional 100k files with Facebook status updates, structured in the same way.
Here is a qdap approach and a mixed qdap/tm approach that is faster. I provide the code and then the timings for each. Basically, I read everything in at once and operate on the entire data set. You could then split it back apart afterwards with split if you wanted.
An MWE (the kind you should provide with questions)
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
  out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
  cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
The qdap approach
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
the_wfm <- with(dat, wfm(text, tweet))
difftime(Sys.time(), tic) ## time to make wfm
Timing qdap approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 2.97617 secs
>
> the_wfm <- with(dat, wfm(text, tweet))
>
> difftime(Sys.time(), tic) ## time to make wfm
Time difference of 48.9238 secs
The qdap-tm combined approach
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
tweet_corpus <- with(dat, as.Corpus(text, tweet))
tdm <- tm::TermDocumentMatrix(tweet_corpus,
    control = list(removePunctuation = TRUE,
                   stopwords = FALSE))
difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Timing qdap-tm combined approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 3.108177 secs
>
>
> tweet_corpus <- with(dat, as.Corpus(text, tweet))
>
> tdm <- tm::TermDocumentMatrix(tweet_corpus,
+ control = list(removePunctuation = TRUE,
+ stopwords = FALSE))
>
> difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Time difference of 13.52377 secs
There is a qdap-tm Package Compatibility vignette to help users move between qdap and tm. As you can see, on 10000 tweets the combined approach is ~3.5x faster. A purely tm approach may be faster still. Also, if you want the wfm, use as.wfm(tdm) to coerce the TermDocumentMatrix.
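To tie that together with the earlier remark about splitting the data back apart, here is a minimal sketch (assuming the dat and tdm objects created above; the names the_wfm2 and per_tweet are just illustrative):
the_wfm2 <- as.wfm(tdm)             # coerce the tm TermDocumentMatrix to a qdap wfm
per_tweet <- split(dat, dat$tweet)  # split the combined data back apart, one data frame per tweet file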
Your code, though, is slower either way because it's not written the R way. I'd recommend reading some additional material on R to get better at writing faster code; I'm currently working through Hadley Wickham's Advanced R, which I recommend.
Related
I have 1000 JSON files that I would like to read in parallel. I have 4 CPU cores.
I have a character vector with the names of all the files:
cik_files <- list.files("./data/", pattern = ".json")
Using this vector, I load each file, extract the data, and add it to the following list:
data <- list()
Below is the code for extracting the data:
library(jsonlite)  # fromJSON() comes from the jsonlite package

for(i in 1:1000){
  data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
  if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
    try({if(nrow(data1) > 0){
      data1$cik <- strtrim(cik_files[i], 13)
      data[[length(data) + 1]] <- data1
    }}, silent = TRUE)
  }
}
This, however, takes quite a lot of time, so I was wondering how I can run the code inside the for loop in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files){
  filename <- file.path(path, cik_files[i])
  data1 <- fromJSON(filename, flatten = TRUE)
  if(("NetIncomeLoss" %in% names(data1$facts$`us-gaap`))){
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    found <- grep("CY20[0-9]{2}$", data1$frame)
    if(length(found) > 0){
      tryCatch({
        out <- data1[found, c(3, 9)]
        out$cik <- strtrim(cik_files[i], 13)
        out
      },
      error = function(e) e,
      warning = function(w) w)
    } else NULL
  } else NULL
}
Step 2
Now load the package parallel and run one of the following, depending on OS.
library(parallel)
# Not on Windows
library(jsonlite)
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files)
# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files")
clusterEvalQ(cl, library(jsonlite))
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok] # correctly read in
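As a follow-up sketch, the successfully read elements can be stacked into one data frame. This assumes, as in f above, that each good element is a data frame and that non-matching files returned NULL; the name cik_data is just illustrative.
good <- json_list[ok]
good <- good[!vapply(good, is.null, logical(1))]  # drop files where f returned NULL
cik_data <- do.call(rbind, good)                  # one combined data frame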
RStudio Version 1.0.143
Windows Ver: Windows 10 Pro
I have 300+ files which have the same structure, and I want to create a loop so it can calculate the correlation index for the required files. I can get the right files and calculate the correlation index, but I can't get them all shown as a result. I tried to save them to a vector, but it tells me the object is not found. And even if it did work, I'm worried about whether the content of the vector would persist if I run the function several times. Here's the loop:
for(i in ind_larg){
  specdata_i <- read.csv(i)
  com_case_ind <- complete.cases(specdata_i)
  sulfate_i <- specdata_i[,2][com_case_ind]
  nitrate_i <- specdata_i[,3][com_case_ind]
  ou[i] <- cor(sulfate_i, nitrate_i)
}
and the result
Error: object 'ou' not found
I'm not sure if you need the rest of the code before this, so I attach it at the end here.
> setwd("C:/Users/sunxi/Coursera/specdata")
> ind <- dir(path = "C:/Users/sunxi/Coursera/specdata", pattern = ".csv") #Save the index of the files to a vector.
> specdata_ful <- lapply(ind, read.csv) #Read all the files into a list of data frames.
> specdat_recon_ful <- do.call(rbind, specdata_ful) #Combine them into one data frame, so each variable is in a single column.
> com_case_ful <- complete.cases(specdat_recon_ful) #Filter the complete cases.
> id_ful <- specdat_recon_ful[,4][com_case_ful] #The ID of the complete cases.
> sulfate_ful <- specdat_recon_ful[,2][com_case_ful] #The sulfate value of the complete cases.
> nitrate_ful <- specdat_recon_ful[,3][com_case_ful] #The nitrate value of the complete cases.
> id_freq_ful <- table(id_ful) #Summarise the frequency of each id.
> id_freq_mat_ful <- as.data.frame(id_freq_ful) #Convert the table into a data frame.
> good <- id_freq_mat_ful[["Freq"]] > 1000 #Flag the frequencies larger than the threshold.
> id_good <- id_freq_mat_ful[["id_ful"]][good] #Keep the ids whose frequency of complete cases exceeds the threshold.
> ind_larg <- ind[id_good] #Create an index of the ids with the required frequency.
You have to create the variable ou before you access it with ou[i]:
ou <- c()
for(i in ind_larg){
# your loop here...
ou[i] <- cor(sulfate_i, nitrate_i)
}
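As an aside, here is a sketch of a vectorised alternative that avoids growing ou inside a loop (assuming, as in your setup code, that ind_larg holds readable CSV file names and that columns 2 and 3 are sulfate and nitrate):
ou <- sapply(ind_larg, function(f) {
  specdata_i <- read.csv(f)
  ok <- complete.cases(specdata_i)
  cor(specdata_i[ok, 2], specdata_i[ok, 3])
})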
I came across this function a while back; it was created for fixing PCA values. The problem with the function is that it isn't compatible with xts time series objects.
amend <- function(result) {
  result.m <- as.matrix(result)
  n <- dim(result.m)[1]
  delta <- apply(abs(result.m[-1,] - result.m[-n,]), 1, sum)
  delta.1 <- apply(abs(result.m[-1,] + result.m[-n,]), 1, sum)
  signs <- c(1, cumprod(rep(-1, n-1) ^ (delta.1 <= delta)))
  zoo(result * signs)
}
The full sample can be found at https://stats.stackexchange.com/questions/34396/im-getting-jumpy-loadings-in-rollapply-pca-in-r-can-i-fix-it
The problem is that applying the function to an xts object with multiple columns and rows won't solve the problem. Is there an elegant way of applying the algorithm to a matrix of xts objects?
My current solution, given a single column with multiple rows, is to loop through row by row... which is slow and tedious. Imagine having to do it column by column as well.
Thanks,
Here is some code to get one started:
rm(list=ls())
require(RCurl)
sit = getURLContent('https://github.com/systematicinvestor/SIT/raw/master/sit.gz', binary=TRUE, followlocation = TRUE, ssl.verifypeer = FALSE)
con = gzcon(rawConnection(sit, 'rb'))
source(con)
close(con)
load.packages('quantmod')
data <- new.env()
tickers<-spl("VTI,IEF,VNQ,TLT")
getSymbols(tickers, src = 'yahoo', from = '1980-01-01', env = data, auto.assign = T)
for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T)
bt.prep(data, align='remove.na', dates='1990::2013')
prices<-data$prices[,-10] #don't include cash
retmat<-na.omit(prices/mlag(prices) - 1)
rollapply(retmat, 500, function(x) summary(princomp(x))$loadings[, 1], by.column = FALSE, align = "right") -> princomproll
require(lattice)
xyplot(amend(princomproll))
plotting "princomproll" will get you jumpy loadings...
It isn't very obvious how the amend function relates to the rest of the script (beyond the final xyplot call), or exactly what you are trying to achieve. There are a couple of small changes that can be made. I haven't profiled the difference, but it's a little more readable if nothing else.
You remove the first and last rows of the result twice.
rowSums might be slightly more efficient for getting the row sums than apply.
rep.int is a little bit faster than rep.
amend <- function(result) {
  result <- as.matrix(result)
  n <- nrow(result)
  without_first_row <- result[-1,]
  without_last_row <- result[-n,]
  delta_minus <- rowSums(abs(without_first_row - without_last_row))
  delta_plus <- rowSums(abs(without_first_row + without_last_row))
  signs <- c(1, cumprod(rep.int(-1, n-1) ^ (delta_plus <= delta_minus)))
  zoo(result * signs)
}
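If you want to stay in xts rather than dropping to zoo, a small variation of the same idea might look like the sketch below (untested against the data in the linked question; amend_xts is just an illustrative name). It only differs in preserving the original index:
library(xts)

amend_xts <- function(result) {
  m <- as.matrix(result)
  n <- nrow(m)
  delta_minus <- rowSums(abs(m[-1, , drop = FALSE] - m[-n, , drop = FALSE]))
  delta_plus  <- rowSums(abs(m[-1, , drop = FALSE] + m[-n, , drop = FALSE]))
  signs <- c(1, cumprod(rep.int(-1, n - 1) ^ (delta_plus <= delta_minus)))
  xts(m * signs, order.by = index(result))  # keep the original xts index
}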
The function below takes a folder of CSV files (each file is a financial time series with datetime, open, high, low, close columns) and creates a single xts object for each of the open, high, low, and close prices, where each column of each xts object is an individual security. For my use case, this representation allows much more convenient and faster processing than a single xts object per file.
require(quantmod)
LoadUniverseToEnv <- function(srcDir, env) {
  fileList <- list.files(srcDir)
  if (length(fileList) == 0)
    stop("No files found!")

  env$op <- NULL
  env$hi <- NULL
  env$lo <- NULL
  env$cl <- NULL
  cols <- NULL

  for (file in fileList) {
    filePath <- sprintf("%s/%s", srcDir, file)
    if (file.info(filePath)$isdir == FALSE) {
      x <- as.xts(read.zoo(filePath, header=TRUE, sep=",", tz=""))
      cols <- c(sub("_.*", "", file), cols)

      # do outer join
      env$op <- merge(Op(x), env$op)
      env$hi <- merge(Hi(x), env$hi)
      env$lo <- merge(Lo(x), env$lo)
      env$cl <- merge(Cl(x), env$cl)

      cat(sprintf("%s : added: %s from: %s to: %s\n", as.character(Sys.time()), file, start(x), end(x)))
    }
  }

  colnames(env$op) <- cols
  colnames(env$hi) <- cols
  colnames(env$lo) <- cols
  colnames(env$cl) <- cols
}
Performance is fine for a limited number of files, but it slows linearly with the width of the xts objects and so becomes a problem for large datasets. The bottleneck is CPU during the merge, when a new column is appended to each of the four objects (e.g. ~100ms initially, slowing by ~1ms per column).
Since it's CPU bound, my first thought is to parallelize by merging n batches of files and then merging the results, but I'm wondering if there's a better way.
The best solution I found for this was to merge in "chunks". For example, assuming 100 columns, merging into 10 XTS objects with 10 columns each and then merging those 10 objects dramatically improves performance.
The example below shows a roughly 15x improvement when merging 2000 xts objects with 1000 rows each and identical indexes.
Example:
require(xts)
require(foreach)
nCols <- 2000
nRows <- 1000
x <- xts(runif(nRows), order.by=as.Date(seq(1:nRows)))
xList <- list()
for (i in 1:nCols)
xList[[i]] <- x
testA <- function() {
  merged <- NULL
  for (x in xList)
    merged <- merge(x, merged)
  colnames(merged) <- 1:length(xList)
  merged
}
testB <- function() {
  n <- length(xList)
  nChunks <- floor(sqrt(n))
  idx <- split(1:n, sort(1:n %% nChunks))
  merged <- foreach (chunk = 1:nChunks, .combine = "merge") %do% {
    merged <- foreach (i = idx[[chunk]], .combine = "merge") %do% {
      xList[[i]]
    }
    merged
  }
  colnames(merged) <- 1:length(xList)
  merged
}
print("Test A")
print(system.time(resultA <- testA()))
print("Test B")
print(system.time(resultB <- testB()))
print(sprintf("Identical : %s", identical(resultA, resultB)))
print(sprintf("Dimensions: %dx%d", ncol(resultA), nrow(resultA)))
Output:
[1] "Test A"
user system elapsed
33.12 3.18 36.30
[1] "Test B"
user system elapsed
2.28 0.01 2.31
[1] "Identical : TRUE"
[1] "Dimensions: 2000x1000"
Note that the foreach is not running in parallel.
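For comparison, it may also be worth timing a single multi-way merge, since merge.xts accepts any number of arguments at once. A sketch using the same xList (testC is just an illustrative name):
testC <- function() {
  merged <- do.call(merge, xList)   # one multi-way merge instead of many pairwise merges
  colnames(merged) <- 1:length(xList)
  merged
}
print(system.time(resultC <- testC()))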
I have code that works for my purpose: it reads files matching a specific pattern, reads the matrix within each file, and computes something for each file pair; the final output is a square matrix with one row and column per file. It looks like this:
m<- 100
output<- matrix(0, m, m)
lista<- list.files(pattern = "q")
listan<- as.matrix(lista)
n <- nrow(listan)
for (i in 1:n) {
  AA <- read.table((listan[i,]), header = FALSE)
  A <- as.matrix(AA)
  dVarX <- sqrt(mean(A * A))
  for (j in i:n) {
    BB <- read.table((listan[j,]), header = FALSE)
    B <- as.matrix(BB)
    V <- sqrt(dVarX * (sqrt(mean(B * B))))
    output[i,j] <- (sqrt(mean(A * B))) / V
  }
}
My problem is that it takes a lot of time (I have about 5000 matrices, which means on the order of 5000 x 5000 loop iterations).
I would like to parallelize, but I need some help!
Waiting for your kind suggestions!
Thank you in advance!
Gab
The bottleneck is likely reading from disk. Running code in parallel isn't guaranteed to make things faster. In this case, multiple processes attempting to read from the same disk at the same time is likely to be even slower than a single process.
Since your matrices are being written by another R process, you really should save them in R's binary format. The nested loop reads files from disk over and over, so the biggest single win is to make each read as fast as possible.
Here's an example that shows you how much faster it could be:
# make some random data and write it to disk
set.seed(21)
for(i in 0:9) {
  m <- matrix(runif(700*700), 700, 700)
  f <- paste0("f",i)
  write(m, f, 700)             # text format
  saveRDS(m, paste0(f,".rds")) # binary format
}
# initialize two output objects
m <- 10
o1 <- o2 <- matrix(NA, m, m)
# get list of file names
files <- list.files(pattern="^f[[:digit:]]+$")
n <- length(files)
First, let's run your code using scan, which is already a lot faster than your current solution with read.table.
system.time({
  for (i in 1:n) {
    A <- scan(files[i], quiet=TRUE)
    for (j in i:n) {
      B <- scan(files[j], quiet=TRUE)
      o1[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
# user system elapsed
# 31.37 0.78 32.58
Now, let's re-run that code using the files saved in R's binary format:
system.time({
  for (i in 1:n) {
    fA <- paste0(files[i],".rds")
    A <- readRDS(fA)
    for (j in i:n) {
      fB <- paste0(files[j],".rds")
      B <- readRDS(fB)
      o2[i,j] <- sqrt(mean(A*B)) / sqrt(sqrt(mean(A*A)) * sqrt(mean(B*B)))
    }
  }
})
# user system elapsed
# 2.42 0.39 2.92
So the binary format is ~10x faster! And the output is the same:
all.equal(o1,o2)
# [1] TRUE
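If the matrices fit in memory (an assumption; with 5000 matrices of real size they may not), you can go a step further: read each .rds file exactly once and precompute the per-matrix term so it isn't recalculated inside the inner loop. A sketch using the same toy files and objects as above:
mats  <- lapply(paste0(files, ".rds"), readRDS)       # read each matrix only once
rootm <- sapply(mats, function(A) sqrt(mean(A * A)))  # precompute sqrt(mean(A*A)) per matrix
o3 <- matrix(NA, m, m)
for (i in 1:n) {
  for (j in i:n) {
    o3[i,j] <- sqrt(mean(mats[[i]] * mats[[j]])) / sqrt(rootm[i] * rootm[j])
  }
}
all.equal(o2, o3)  # should also be TRUE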