I have the sample code below:
f <- function(x) {
v = 0
start = Sys.time()
for (i in 1:x) {
v = v + 1
}
end = Sys.time()
print(end-start)
print(paste0("f() took: ", end-start))
}
f(10)
The 2 outputs are:
Time difference of 5.960464e-06 secs
"f() took: 5.96046447753906e-06"
My question is: why is the output different when the time difference is passed through paste0?
Edit:
If one wishes to get the units of time, do the following:
print(paste0("f() took: ", end-start, " ", units(end-start)))
Have a look at ?print.difftime:
The print() method calls these “time differences”.
It's a built-in method for print that displays differences between dates and times in this format.
class(Sys.Date()-Sys.Date()+1)
[1] "difftime"
class(paste(Sys.Date()-Sys.Date()+1))
[1] "character"
paste0("f() took: ", end-start) returns a character vector, so the difftime class (and with it the units) is lost, and print handles the result like any other string.
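A minimal sketch of the difference, using a hand-built difftime so the value is reproducible (the names d and msg are only for illustration). format() dispatches to the difftime method, which keeps the units, whereas paste0() on its own only sees the number:

```r
# Build a difftime by hand instead of timing real code
d <- as.difftime(5.960464e-06, units = "secs")

class(d)           # "difftime": print() dispatches to print.difftime
class(paste0(d))   # "character": the difftime class is gone

# format() keeps the units, so the pasted string carries them too
msg <- paste0("f() took: ", format(d))
print(msg)
```

This avoids having to paste units(end-start) on by hand.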
I'm working through a tutorial I found online about writing efficient R code. I expected to see the same result as the tutorial -- f1 should be faster than f2 -- but I'm seeing the opposite.
I've tried running the code in the console and in the RSweave environment, and I get different results. Of course, the timing is different each time it's run, but f1 usually has a lower runtime than f2 in the console, while the opposite is true in RSweave. I simply copied the code into one RSweave chunk.
f1 <- function(x){
return(log((sqrt(x*10^3))))
}
f2 <- function(x){
a<- x*10^3
b<- sqrt(a)
c<- log(b)
return(c)
}
start_time <- Sys.time()
for(i in 1:1000){
f1(1)
}
end_time <- Sys.time()
(runtime <-end_time - start_time)
R Console: Time difference of 0.065943 secs
RSweave: Time difference of 0.05585098 secs
start_time <- Sys.time()
for(i in 1:1000){
f2(1)
}
end_time <- Sys.time()
(runtime2 <-end_time - start_time)
Console: Time difference of 0.1935871 secs
RSweave: Time difference of 0.05243397 secs
Does RSweave do something different that would yield this strange behavior?
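Not specific to RSweave, but possibly worth ruling out first: a single Sys.time() pair around 1000 very cheap calls may be measuring mostly noise (garbage collection, whatever else the session is doing). A rough sketch of a steadier comparison in base R, repeating each measurement and comparing medians. The helper time_once and the repetition counts are made up for illustration:

```r
f1 <- function(x) {
  log(sqrt(x * 10^3))
}
f2 <- function(x) {
  a <- x * 10^3
  b <- sqrt(a)
  log(b)
}

# Hypothetical helper: elapsed seconds for n calls of f
time_once <- function(f, n = 50000) {
  system.time(for (i in seq_len(n)) f(1))[["elapsed"]]
}

# Repeat each measurement a few times and compare medians
reps <- 5
t1 <- replicate(reps, time_once(f1))
t2 <- replicate(reps, time_once(f2))
print(c(f1 = median(t1), f2 = median(t2)))
```

If the medians are close in both environments, the difference seen above is more likely measurement noise than anything RSweave does.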
I've tried to use Sys.time to get the time elapsed between two points. However, it doesn't output in a way I like.
This is how it looks now:
a <- Sys.time()
...running stuff between these two points...
b <- Sys.time()
c <- b - a
c
Time difference of 1.00558 hours
I only want the number and the units. I know that to get just the number I can do:
c[[1]]
However, sometimes the result of c is in seconds or minutes. I only want to act on cases where I have the number and the units are hours. Does anyone know of a way to get something like the following, using Sys.time() (or any alternative):
if (units == "hours")
{
if (number >= 1)
{
#do something
}
}
Base R's difftime lets you obtain the time difference in the units you choose. The rest is formatting.
a = Sys.time()
Sys.sleep(5) #do something
b = Sys.time()
paste0(round(as.numeric(difftime(time1 = b, time2 = a, units = "secs")), 3), " Seconds")
#[1] "5.091 Seconds"
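Applied to the hours check from the question, a small sketch could look like this. It uses two fixed timestamps instead of Sys.time() so the result is reproducible; the names elapsed and hours are only illustrative:

```r
# Fixed timestamps stand in for the two Sys.time() calls
a <- as.POSIXct("2024-01-01 08:00:00", tz = "UTC")
b <- as.POSIXct("2024-01-01 09:00:20", tz = "UTC")

# Force the units, so the result is always in hours
elapsed <- difftime(b, a, units = "hours")
hours <- as.numeric(elapsed)

if (units(elapsed) == "hours" && hours >= 1) {
  message("Took at least an hour: ", round(hours, 3), " hours")
}
```

Because the units are fixed in the difftime() call, the units() check is only a safeguard; the number can be compared directly.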
The package tictoc simplifies this kind of timing. It doesn't return hours, but we can create a new function that converts its second-based measurements into hours.
library(tictoc)
toc_hour <- function() {
x <- toc()
(x$toc - x$tic) / 3600
}
You normally start the timer with tic() and stop it with toc().
tic()
Sys.sleep(2)
toc()
# 2.02 sec elapsed
Calling toc_hour() instead of toc() returns the number of hours that have elapsed.
tic()
Sys.sleep(2)
toc_hour()
# 2.25 sec elapsed
# elapsed
# 0.000625
It still prints the number of seconds above the hours, but if you capture the result it will only store the number of hours for downstream analysis.
tic()
Sys.sleep(2)
x <- toc_hour()
if(x < 1) {print("This took under an hour")}
You can evaluate everything as an argument to the system.time function. It will give you the elapsed time in seconds.
paste0(system.time( rnorm(1000000, 0, 1) )[3] / 3600, " hours")
# "2.58333333334172e-05 hours"
Alternatively, you can use Frank's suggestion from the comments, difftime(b, a, units = "hours"), which is probably the best solution in most cases.
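For reference, a short sketch of pulling the elapsed component out of system.time() by name rather than by position, then converting it. Sys.sleep() is just a stand-in for real work:

```r
# system.time() returns a named vector; "elapsed" is wall-clock seconds
timing <- system.time(Sys.sleep(0.2))
elapsed_secs <- timing[["elapsed"]]

# Convert to hours and format without scientific notation
elapsed_hours <- elapsed_secs / 3600
msg <- sprintf("%.6f hours", elapsed_hours)
print(msg)
```

Indexing by name reads more clearly than [3] and gives the same value.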
The tictoc package normally returns seconds. The other solutions using this package manually convert this to other units, but I find the output still doesn't look right. Instead, use the built-in func.toc argument of toc() to change the output. For example:
toc_min <- function(tic, toc, msg = "") {
mins <- round((toc - tic) / 60, 2)
# Return the message string that toc() will print
paste0(mins, " minutes elapsed")
}
And then:
tic()
Sys.sleep(1)
toc(func.toc=toc_min)
returns
0.02 minutes elapsed
I think lubridate is the quickest solution for you:
start <- Sys.time()
## Do Stuff here
end <- Sys.time()
elapsed <- lubridate::ymd_hms(end) - lubridate::ymd_hms(start)
print(elapsed)
It should return something useful like:
Time difference of 12.1 hours
Maybe you can try the tictoc package.
As described in the documentation you can do the following:
tic()
#Do something
toc(log = TRUE, quiet = TRUE)
#Put the result in a log
log.txt <- tic.log(format = TRUE)
#Extract only the value
res <- gsub( " sec elapsed", "", unlist(log.txt))
#Clear the log
tic.clearlog()
That way, res holds only the value, in seconds (as a string), so it is simple to convert it to hours.
Moreover, if you don't clear the log, you can run successive tic() and toc() calls and collect everything in log.txt; gsub( " sec elapsed", "", unlist(log.txt)) then gives you a vector of strings with the value in seconds for each iteration, which can be pretty useful.
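A sketch of that last conversion step in base R only. The log entries below are hypothetical stand-ins, written in the "... sec elapsed" format that toc() prints, so this runs without tictoc installed:

```r
# Hypothetical log entries in the "<seconds> sec elapsed" format
log.txt <- list("2.02 sec elapsed", "7200 sec elapsed")

# Strip the suffix, then convert the strings to numbers
secs <- as.numeric(gsub(" sec elapsed", "", unlist(log.txt)))

# Seconds to hours
hours <- secs / 3600
print(hours)
```

The as.numeric() call matters: without it, res is still a character vector and cannot be divided by 3600.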
old <- Sys.time()
# my code
new <- Sys.time()
total_time = old - new
The output is "Time difference of -6.661923 secs",
but I want "Execution time : 0.35secs" instead.
You can use sprintf as below:
old <- Sys.time()
rnorm(500,0,1)
new <- Sys.time()
x <- (new - old)
sprintf("The execution time is %5.2f secs",x)
Output:
[1] "The execution time is 1.08 secs"
Something like
old <- Sys.time()
#code
new <- Sys.time()
total_time <- paste0("Execution time: ", round(as.numeric(new - old, units = "secs"), 2), " secs")
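Putting the two ideas together, a minimal sketch that fixes the units to seconds before formatting, so the label always matches the number. Sys.sleep() stands in for the real code, and the names secs and msg are illustrative. Note the subtraction order: new minus old, otherwise the result is negative as in the question.

```r
old <- Sys.time()
Sys.sleep(0.1)   # stand-in for the code being timed
new <- Sys.time()

# Force seconds, then drop the difftime class before formatting
secs <- as.numeric(difftime(new, old, units = "secs"))
msg <- sprintf("Execution time : %.2fsecs", secs)
print(msg)
```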
I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm and wfdf functions from the qdap package, but have noticed that this is a bit slow for my needs. It appears that the production of the word-frequency matrix is the bottleneck.
The code for my function is as follows.
library(qdap)
liwcr <- function(inputText, dict) {
if(!file.exists(dict))
stop("Dictionary file does not exist.")
# Read in dictionary categories
# Start by figuring out where the category list begins and ends
dictionaryText <- readLines(dict)
if(!length(grep("%", dictionaryText))==2)
stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")
catStart <- grep("%", dictionaryText)[1]
catStop <- grep("%", dictionaryText)[2]
dictLength <- length(dictionaryText)
dictionaryCategories <- read.table(dict, header=F, sep="\t", skip=catStart, nrows=(catStop-2))
wordCount <- word_count(inputText)
outputFrame <- dictionaryCategories
outputFrame["count"] <- 0
# Now read in dictionary words
no_col <- max(count.fields(dict, sep = "\t"), na.rm=T)
dictionaryWords <- read.table(dict, header=F, sep="\t", skip=catStop, nrows=(dictLength-catStop), fill=TRUE, quote="\"", col.names=1:no_col)
workingMatrix <- wfdf(inputText)
for (i in workingMatrix[,1]) {
if (i %in% dictionaryWords[, 1]) {
occurrences <- 0
foundWord <- dictionaryWords[dictionaryWords$X1 == i,]
foundCategories <- foundWord[1,2:no_col]
for (w in foundCategories) {
if (!is.na(w) && w != "") {
existingCount <- outputFrame[outputFrame$V1 == w,]$count
outputFrame[outputFrame$V1 == w,]$count <- existingCount + workingMatrix[workingMatrix$Words == i,]$all
}
}
}
}
return(outputFrame)
}
I realize the for loop is inefficient, so in an effort to locate the bottleneck, I tested the code without this portion (simply reading in each text file and producing the word-frequency matrix), and saw very little in the way of speed improvements. Example:
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
for(i in 1:length(filename)){
print(filename[i])
text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
freq <- wfm(text)
}
The input files are Twitter and Facebook status postings.
Is there any way to improve the speed for this code?
EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, just to give an idea of what I'm dealing with: 25k text files, each with all the available tweets from an individual Twitter user. There are also an additional 100k files with Facebook status updates, structured in the same way.
Here is a qdap approach and a mixed qdap/tm approach that is faster. I provide the code and then the timings for each. Basically, I read everything in at once and operate on the entire data set. You could then split it back apart with split if you wanted.
An MWE that you should provide with questions
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
The qdap approach
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
the_wfm <- with(dat, wfm(text, tweet))
difftime(Sys.time(), tic) ## time to make wfm
Timing qdap approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 2.97617 secs
>
> the_wfm <- with(dat, wfm(text, tweet))
>
> difftime(Sys.time(), tic) ## time to make wfm
Time difference of 48.9238 secs
The qdap-tm combined approach
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
tweet_corpus <- with(dat, as.Corpus(text, tweet))
tdm <- tm::TermDocumentMatrix(tweet_corpus,
control = list(removePunctuation = TRUE,
stopwords = FALSE))
difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Timing qdap-tm combined approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 3.108177 secs
>
>
> tweet_corpus <- with(dat, as.Corpus(text, tweet))
>
> tdm <- tm::TermDocumentMatrix(tweet_corpus,
+ control = list(removePunctuation = TRUE,
+ stopwords = FALSE))
>
> difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Time difference of 13.52377 secs
There is a qdap-tm Package Compatibility vignette to help users move between qdap and tm. As you can see, on 10000 tweets the combined approach is ~3.5x faster. A purely tm approach may be faster still. Also, if you want the wfm, use as.wfm(tdm) to coerce the TermDocumentMatrix.
Your code is slower either way, though, because it doesn't do things the R way. I'd recommend reading more about R to get better at writing faster code. I'm currently working through Hadley Wickham's Advanced R, which I'd recommend.
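As a rough illustration of what "the R way" could look like for the category-count loop, here is a base R sketch that replaces the nested loops with one join and one grouped sum. Everything in it is an assumption for illustration: freq is a toy stand-in for the output of wfdf(), and dict_long is a hypothetical dictionary reshaped to one row per word/category pair (the category names are invented).

```r
# Toy stand-in for the word-frequency data frame (columns as in the question)
freq <- data.frame(
  Words = c("happy", "sad", "run", "tree"),
  all   = c(3, 1, 2, 5),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary in long format: one row per (word, category) pair
dict_long <- data.frame(
  word     = c("happy", "sad", "sad", "run"),
  category = c("posemo", "negemo", "affect", "motion"),
  stringsAsFactors = FALSE
)

# Join once on the word, then sum the frequencies per category
merged <- merge(dict_long, freq, by.x = "word", by.y = "Words")
counts <- tapply(merged$all, merged$category, sum)
print(counts)
```

Words that are not in the dictionary ("tree" here) drop out in the merge, which mirrors the %in% check in the original loop.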