R profiling spending a lot of time using .External2

I am learning how to use R profiling and have run the Rprof command on my code.
The summaryRprof function shows that a lot of time is spent in .External2. What is this? Additionally, a large proportion of the total time is spent in <Anonymous>; is there a way to find out what that is?
> summaryRprof("test")
$by.self
self.time self.pct total.time total.pct
".External2" 4.30 27.74 4.30 27.74
"format.POSIXlt" 2.70 17.42 2.90 18.71
"which.min" 2.38 15.35 4.12 26.58
"-" 1.30 8.39 1.30 8.39
"order" 1.16 7.48 1.16 7.48
"match" 0.58 3.74 0.58 3.74
"file" 0.44 2.84 0.44 2.84
"abs" 0.40 2.58 0.40 2.58
"scan" 0.30 1.94 0.30 1.94
"anyDuplicated.default" 0.20 1.29 0.20 1.29
"unique.default" 0.20 1.29 0.20 1.29
"unlist" 0.18 1.16 0.20 1.29
"c" 0.16 1.03 0.16 1.03
"data.frame" 0.14 0.90 0.22 1.42
"structure" 0.12 0.77 1.74 11.23
"as.POSIXct.POSIXlt" 0.12 0.77 0.12 0.77
"strptime" 0.12 0.77 0.12 0.77
"as.character" 0.08 0.52 0.90 5.81
"make.unique" 0.08 0.52 0.16 1.03
"[.data.frame" 0.06 0.39 1.54 9.94
"<Anonymous>" 0.04 0.26 4.34 28.00
"lapply" 0.04 0.26 1.70 10.97
"rbind" 0.04 0.26 0.94 6.06
"as.POSIXlt.POSIXct" 0.04 0.26 0.04 0.26
"ifelse" 0.04 0.26 0.04 0.26
"paste" 0.02 0.13 0.92 5.94
"merge.data.frame" 0.02 0.13 0.56 3.61
"[<-.factor" 0.02 0.13 0.52 3.35
"stopifnot" 0.02 0.13 0.04 0.26
".deparseOpts" 0.02 0.13 0.02 0.13
".External" 0.02 0.13 0.02 0.13
"close.connection" 0.02 0.13 0.02 0.13
"doTryCatch" 0.02 0.13 0.02 0.13
"is.na" 0.02 0.13 0.02 0.13
"is.na<-.default" 0.02 0.13 0.02 0.13
"mean" 0.02 0.13 0.02 0.13
"seq.int" 0.02 0.13 0.02 0.13
"sum" 0.02 0.13 0.02 0.13
"sys.function" 0.02 0.13 0.02 0.13
$by.total
total.time total.pct self.time self.pct
"write.table" 5.10 32.90 0.00 0.00
"<Anonymous>" 4.34 28.00 0.04 0.26
".External2" 4.30 27.74 4.30 27.74
"mapply" 4.22 27.23 0.00 0.00
"head" 4.16 26.84 0.00 0.00
"which.min" 4.12 26.58 2.38 15.35
"eval" 3.16 20.39 0.00 0.00
"eval.parent" 3.14 20.26 0.00 0.00
"write.csv" 3.14 20.26 0.00 0.00
"format" 2.92 18.84 0.00 0.00
"format.POSIXlt" 2.90 18.71 2.70 17.42
"do.call" 1.78 11.48 0.00 0.00
"structure" 1.74 11.23 0.12 0.77
"lapply" 1.70 10.97 0.04 0.26
"FUN" 1.66 10.71 0.00 0.00
"format.POSIXct" 1.62 10.45 0.00 0.00
"[.data.frame" 1.54 9.94 0.06 0.39
"[" 1.54 9.94 0.00 0.00
"-" 1.30 8.39 1.30 8.39
"order" 1.16 7.48 1.16 7.48
"rbind" 0.94 6.06 0.04 0.26
"paste" 0.92 5.94 0.02 0.13
"as.character" 0.90 5.81 0.08 0.52
"read.csv" 0.84 5.42 0.00 0.00
"read.table" 0.84 5.42 0.00 0.00
"as.character.POSIXt" 0.82 5.29 0.00 0.00
"match" 0.58 3.74 0.58 3.74
"merge.data.frame" 0.56 3.61 0.02 0.13
"merge" 0.56 3.61 0.00 0.00
"[<-.factor" 0.52 3.35 0.02 0.13
"[<-" 0.52 3.35 0.00 0.00
"strftime" 0.48 3.10 0.00 0.00
"file" 0.44 2.84 0.44 2.84
"weekdays" 0.42 2.71 0.00 0.00
"weekdays.POSIXt" 0.42 2.71 0.00 0.00
"abs" 0.40 2.58 0.40 2.58
"unique" 0.38 2.45 0.00 0.00
"scan" 0.30 1.94 0.30 1.94
"data.frame" 0.22 1.42 0.14 0.90
"cbind" 0.22 1.42 0.00 0.00
"anyDuplicated.default" 0.20 1.29 0.20 1.29
"unique.default" 0.20 1.29 0.20 1.29
"unlist" 0.20 1.29 0.18 1.16
"anyDuplicated" 0.20 1.29 0.00 0.00
"as.POSIXct" 0.18 1.16 0.00 0.00
"as.POSIXlt" 0.18 1.16 0.00 0.00
"c" 0.16 1.03 0.16 1.03
"make.unique" 0.16 1.03 0.08 0.52
"as.POSIXct.POSIXlt" 0.12 0.77 0.12 0.77
"strptime" 0.12 0.77 0.12 0.77
"as.POSIXlt.character" 0.12 0.77 0.00 0.00
"object.size" 0.12 0.77 0.00 0.00
"as.POSIXct.default" 0.10 0.65 0.00 0.00
"Ops.POSIXt" 0.08 0.52 0.00 0.00
"type.convert" 0.08 0.52 0.00 0.00
"!=" 0.06 0.39 0.00 0.00
"as.POSIXlt.factor" 0.06 0.39 0.00 0.00
"as.POSIXlt.POSIXct" 0.04 0.26 0.04 0.26
"ifelse" 0.04 0.26 0.04 0.26
"stopifnot" 0.04 0.26 0.02 0.13
"$" 0.04 0.26 0.00 0.00
"$.data.frame" 0.04 0.26 0.00 0.00
"[[" 0.04 0.26 0.00 0.00
"[[.data.frame" 0.04 0.26 0.00 0.00
"head.default" 0.04 0.26 0.00 0.00
".deparseOpts" 0.02 0.13 0.02 0.13
".External" 0.02 0.13 0.02 0.13
"close.connection" 0.02 0.13 0.02 0.13
"doTryCatch" 0.02 0.13 0.02 0.13
"is.na" 0.02 0.13 0.02 0.13
"is.na<-.default" 0.02 0.13 0.02 0.13
"mean" 0.02 0.13 0.02 0.13
"seq.int" 0.02 0.13 0.02 0.13
"sum" 0.02 0.13 0.02 0.13
"sys.function" 0.02 0.13 0.02 0.13
"%in%" 0.02 0.13 0.00 0.00
".rs.getSingleClass" 0.02 0.13 0.00 0.00
"[.POSIXlt" 0.02 0.13 0.00 0.00
"==" 0.02 0.13 0.00 0.00
"close" 0.02 0.13 0.00 0.00
"data.row.names" 0.02 0.13 0.00 0.00
"deparse" 0.02 0.13 0.00 0.00
"factor" 0.02 0.13 0.00 0.00
"is.na<-" 0.02 0.13 0.00 0.00
"match.arg" 0.02 0.13 0.00 0.00
"match.call" 0.02 0.13 0.00 0.00
"pushBack" 0.02 0.13 0.00 0.00
"seq" 0.02 0.13 0.00 0.00
"seq.POSIXt" 0.02 0.13 0.00 0.00
"simplify2array" 0.02 0.13 0.00 0.00
"tryCatch" 0.02 0.13 0.00 0.00
"tryCatchList" 0.02 0.13 0.00 0.00
"tryCatchOne" 0.02 0.13 0.00 0.00
"which" 0.02 0.13 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 15.5
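
A note on reading this output: .External2 is one of R's internal interfaces for calling compiled C code, so that time is spent inside C routines rather than in any interpreted R function (given the $by.total stack, most likely the C workhorse behind write.table). For the <Anonymous> entries, line profiling can attribute the samples to actual source lines. A minimal sketch, assuming R >= 3.0 and that the profiled code lives in a file (the name script.R is hypothetical):

Rprof("test.out", line.profiling = TRUE)  # record source line numbers while sampling
source("script.R", keep.source = TRUE)    # source references are needed for line attribution
Rprof(NULL)
summaryRprof("test.out", lines = "show")  # report time by file:line instead of by function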

Related

Interpreting a profiling file (Rprof)

I have code which takes a long time to execute:
library(pbapply)  # provides pblapply and pbapply

dataRaw <- pblapply(femme, function(x) {
  article <- user(x, date = FALSE, weight = FALSE)
  names <- rep(x, length(article))
  result <- matrix(c(names, article), ncol = 2)
})
dataRaw <- do.call(rbind, dataRaw)
dataRaw[, 3] <- vector(length = length(dataRaw[, 2]))
dataRaw[, 3] <- pbapply(dataRaw, 1, function(x) {
  Rprof(filename = "profile.out")  # NB: restarts (and truncates) the profile on every row
  revisions <- revisionsPage(x[2])
  rank <- rankingContrib(revisions, 50)
  rank <- rank$contrib
  hit <- x[1] %in% rank
  Rprof(NULL)
  hit  # return the logical; as written originally, Rprof(NULL)'s NULL was returned
})
result <- as.vector(dataRaw[dataRaw[, 3] == TRUE, 2])  # dataRaw is a matrix, so $ranking would fail
Launching the summaryRprof function gives me this:
$by.self
self.time self.pct total.time total.pct
".Call" 0.46 95.83 0.46 95.83
"as.data.frame.numeric" 0.02 4.17 0.02 4.17
$by.total
total.time total.pct self.time self.pct
"FUN" 0.48 100.00 0.00 0.00
"pbapply" 0.48 100.00 0.00 0.00
".Call" 0.46 95.83 0.46 95.83
"<Anonymous>" 0.46 95.83 0.00 0.00
"GET" 0.46 95.83 0.00 0.00
"request_fetch" 0.46 95.83 0.00 0.00
"request_fetch.write_memory" 0.46 95.83 0.00 0.00
"request_perform" 0.46 95.83 0.00 0.00
"revisionsPage" 0.46 95.83 0.00 0.00
"as.data.frame.numeric" 0.02 4.17 0.02 4.17
"as.data.frame" 0.02 4.17 0.00 0.00
"data.frame" 0.02 4.17 0.00 0.00
"rankingContrib" 0.02 4.17 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 0.48
It appears that the ".Call" function takes all the machine time. What is this .Call entry?
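
(.Call is R's native interface for invoking compiled C code; in the $by.total stack above it is reached through GET/request_fetch, i.e. the HTTP request itself.) One thing worth noting about the snippet: Rprof(filename = "profile.out") runs inside the row-wise function, so the profile file is truncated and restarted on every row and only ever holds the last iteration's samples (Rprof's default is append = FALSE). A sketch of profiling the whole loop once instead (revisionsPage and rankingContrib are the question's own helpers):

Rprof(filename = "profile.out")  # start once, before the loop
dataRaw[, 3] <- pbapply(dataRaw, 1, function(x) {
  revisions <- revisionsPage(x[2])
  rank <- rankingContrib(revisions, 50)$contrib
  x[1] %in% rank
})
Rprof(NULL)                      # stop once, after the loop
summaryRprof("profile.out")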

corr.test arguments imply differing number of rows

I have seen this error multiple times in different projects, and I was wondering: is there a general way to tell which line caused the error?
My specific case:
http://archive.ics.uci.edu/ml/machine-learning-databases/00275/
#using the bike.csv
data <- read.csv("PATH_HERE\\Bike-Sharing-Dataset\\day.csv", header = TRUE)
require(psych)
corr.test(data)
data <- data[, c("atemp","casual","cnt","holiday","hum","mnth","registered",
                 "season","temp","weathersit","weekday","windspeed","workingday","yr")]
data[data == ''] <- NA
#View(data)
require(psych)
cors <- corr.test(data)
returns the error:
Error in data.frame(lower = lower, r = r[lower.tri(r)], upper = upper, :
arguments imply differing number of rows: 0, 91
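
In general, the failing call can be inspected right after the error occurs; a short sketch of the standard base-R tools (nothing here is specific to psych):

cors <- corr.test(data)   # ...the error is raised...
traceback()               # print the call stack of the last uncaught error

# or drop into an interactive browser at the moment of the error:
options(error = recover)
cors <- corr.test(data)   # pick a frame number to inspect its variables
options(error = NULL)     # restore the default afterwards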
It works for me
> #using the bike.csv
> data <- read.csv("day.csv",header=TRUE)
> require(psych)
> corr.test(data)
Error in cor(x, use = use, method = method) : 'x' must be numeric
> data <- data[,c("atemp","casual","cnt","holiday","hum","mnth","registered",
+ "season","temp","weathersit","weekday","windspeed","workingday","yr")]
> data[data==''] <- NA
> #View(data)
>
> require(psych)
> cors <- corr.test(data)
> cors
Call:corr.test(x = data)
Correlation matrix
atemp casual cnt holiday hum mnth registered season temp
atemp 1.00 0.54 0.63 -0.03 0.14 0.23 0.54 0.34 0.99
casual 0.54 1.00 0.67 0.05 -0.08 0.12 0.40 0.21 0.54
cnt 0.63 0.67 1.00 -0.07 -0.10 0.28 0.95 0.41 0.63
holiday -0.03 0.05 -0.07 1.00 -0.02 0.02 -0.11 -0.01 -0.03
hum 0.14 -0.08 -0.10 -0.02 1.00 0.22 -0.09 0.21 0.13
mnth 0.23 0.12 0.28 0.02 0.22 1.00 0.29 0.83 0.22
registered 0.54 0.40 0.95 -0.11 -0.09 0.29 1.00 0.41 0.54
season 0.34 0.21 0.41 -0.01 0.21 0.83 0.41 1.00 0.33
temp 0.99 0.54 0.63 -0.03 0.13 0.22 0.54 0.33 1.00
weathersit -0.12 -0.25 -0.30 -0.03 0.59 0.04 -0.26 0.02 -0.12
weekday -0.01 0.06 0.07 -0.10 -0.05 0.01 0.06 0.00 0.00
windspeed -0.18 -0.17 -0.23 0.01 -0.25 -0.21 -0.22 -0.23 -0.16
workingday 0.05 -0.52 0.06 -0.25 0.02 -0.01 0.30 0.01 0.05
yr 0.05 0.25 0.57 0.01 -0.11 0.00 0.59 0.00 0.05
weathersit weekday windspeed workingday yr
atemp -0.12 -0.01 -0.18 0.05 0.05
casual -0.25 0.06 -0.17 -0.52 0.25
cnt -0.30 0.07 -0.23 0.06 0.57
holiday -0.03 -0.10 0.01 -0.25 0.01
hum 0.59 -0.05 -0.25 0.02 -0.11
mnth 0.04 0.01 -0.21 -0.01 0.00
registered -0.26 0.06 -0.22 0.30 0.59
season 0.02 0.00 -0.23 0.01 0.00
temp -0.12 0.00 -0.16 0.05 0.05
weathersit 1.00 0.03 0.04 0.06 -0.05
weekday 0.03 1.00 0.01 0.04 -0.01
windspeed 0.04 0.01 1.00 -0.02 -0.01
workingday 0.06 0.04 -0.02 1.00 0.00
yr -0.05 -0.01 -0.01 0.00 1.00
Sample Size
[1] 731
Probability values (Entries above the diagonal are adjusted for multiple tests.)
atemp casual cnt holiday hum mnth registered season temp
atemp 0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.00
casual 0.00 0.00 0.00 1.00 1.00 0.04 0.00 0.00 0.00
cnt 0.00 0.00 0.00 1.00 0.28 0.00 0.00 0.00 0.00
holiday 0.38 0.14 0.06 0.00 1.00 1.00 0.15 1.00 1.00
hum 0.00 0.04 0.01 0.67 0.00 0.00 0.58 0.00 0.03
mnth 0.00 0.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00
registered 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00
season 0.00 0.00 0.00 0.78 0.00 0.00 0.00 0.00 0.00
temp 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00
weathersit 0.00 0.00 0.00 0.35 0.00 0.24 0.00 0.60 0.00
weekday 0.84 0.11 0.07 0.01 0.16 0.80 0.12 0.93 1.00
windspeed 0.00 0.00 0.00 0.87 0.00 0.00 0.00 0.00 0.00
workingday 0.16 0.00 0.10 0.00 0.51 0.87 0.00 0.74 0.15
yr 0.21 0.00 0.00 0.83 0.00 0.96 0.00 0.96 0.20
weathersit weekday windspeed workingday yr
atemp 0.05 1.00 0.00 1.00 1.00
casual 0.00 1.00 0.00 0.00 0.00
cnt 0.00 1.00 0.00 1.00 0.00
holiday 1.00 0.25 1.00 0.00 1.00
hum 0.00 1.00 0.00 1.00 0.13
mnth 1.00 1.00 0.00 1.00 1.00
registered 0.00 1.00 0.00 0.00 0.00
season 1.00 1.00 0.00 1.00 1.00
temp 0.05 1.00 0.00 1.00 1.00
weathersit 0.00 1.00 1.00 1.00 1.00
weekday 0.40 0.00 1.00 1.00 1.00
windspeed 0.29 0.70 0.00 1.00 1.00
workingday 0.10 0.33 0.61 0.00 1.00
yr 0.19 0.88 0.75 0.96 0.00
To see confidence intervals of the correlations, print with the short=FALSE option
>
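
For what it's worth, the first error in that transcript ('x' must be numeric) is the generic symptom of passing non-numeric columns to corr.test (the raw day.csv includes a date column). A hedged sketch that drops them without hard-coding column names:

data <- read.csv("day.csv", header = TRUE)
num  <- data[sapply(data, is.numeric)]  # keep only numeric columns
cors <- corr.test(num)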
It works for me:
rm(list=ls())
# http://archive.ics.uci.edu/ml/machine-learning-databases/00275/
#using the bike.csv
day <- read.csv("Bike-Sharing-Dataset//day.csv")
require(psych)
day<-day[,c("atemp","casual","cnt","holiday","hum","mnth","registered",
"season","temp","weathersit","weekday","windspeed","workingday","yr")]
day[day=='']<-NA
require(psych)
corr.test(day)
# corr.test(day)
# Call:corr.test(x = day)
# Correlation matrix
# atemp casual cnt holiday hum mnth registered season temp weathersit weekday windspeed workingday yr
# atemp 1.00 0.54 0.63 -0.03 0.14 0.23 0.54 0.34 0.99 -0.12 -0.01 -0.18 0.05 0.05
# casual 0.54 1.00 0.67 0.05 -0.08 0.12 0.40 0.21 0.54 -0.25 0.06 -0.17 -0.52 0.25
# cnt 0.63 0.67 1.00 -0.07 -0.10 0.28 0.95 0.41 0.63 -0.30 0.07 -0.23 0.06 0.57
# holiday -0.03 0.05 -0.07 1.00 -0.02 0.02 -0.11 -0.01 -0.03 -0.03 -0.10 0.01 -0.25 0.01
# hum 0.14 -0.08 -0.10 -0.02 1.00 0.22 -0.09 0.21 0.13 0.59 -0.05 -0.25 0.02 -0.11
# mnth 0.23 0.12 0.28 0.02 0.22 1.00 0.29 0.83 0.22 0.04 0.01 -0.21 -0.01 0.00
# registered 0.54 0.40 0.95 -0.11 -0.09 0.29 1.00 0.41 0.54 -0.26 0.06 -0.22 0.30 0.59
# season 0.34 0.21 0.41 -0.01 0.21 0.83 0.41 1.00 0.33 0.02 0.00 -0.23 0.01 0.00
# temp 0.99 0.54 0.63 -0.03 0.13 0.22 0.54 0.33 1.00 -0.12 0.00 -0.16 0.05 0.05
# weathersit -0.12 -0.25 -0.30 -0.03 0.59 0.04 -0.26 0.02 -0.12 1.00 0.03 0.04 0.06 -0.05
# weekday -0.01 0.06 0.07 -0.10 -0.05 0.01 0.06 0.00 0.00 0.03 1.00 0.01 0.04 -0.01
# windspeed -0.18 -0.17 -0.23 0.01 -0.25 -0.21 -0.22 -0.23 -0.16 0.04 0.01 1.00 -0.02 -0.01
# workingday 0.05 -0.52 0.06 -0.25 0.02 -0.01 0.30 0.01 0.05 0.06 0.04 -0.02 1.00 0.00
# yr 0.05 0.25 0.57 0.01 -0.11 0.00 0.59 0.00 0.05 -0.05 -0.01 -0.01 0.00 1.00
# Sample Size
# [1] 731
# Probability values (Entries above the diagonal are adjusted for multiple tests.)
# atemp casual cnt holiday hum mnth registered season temp weathersit weekday windspeed workingday yr
# atemp 0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.00 0.05 1.00 0.00 1.00 1.00
# casual 0.00 0.00 0.00 1.00 1.00 0.04 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
# cnt 0.00 0.00 0.00 1.00 0.28 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00
# holiday 0.38 0.14 0.06 0.00 1.00 1.00 0.15 1.00 1.00 1.00 0.25 1.00 0.00 1.00
# hum 0.00 0.04 0.01 0.67 0.00 0.00 0.58 0.00 0.03 0.00 1.00 0.00 1.00 0.13
# mnth 0.00 0.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00
# registered 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
# season 0.00 0.00 0.00 0.78 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00
# temp 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.05 1.00 0.00 1.00 1.00
# weathersit 0.00 0.00 0.00 0.35 0.00 0.24 0.00 0.60 0.00 0.00 1.00 1.00 1.00 1.00
# weekday 0.84 0.11 0.07 0.01 0.16 0.80 0.12 0.93 1.00 0.40 0.00 1.00 1.00 1.00
# windspeed 0.00 0.00 0.00 0.87 0.00 0.00 0.00 0.00 0.00 0.29 0.70 0.00 1.00 1.00
# workingday 0.16 0.00 0.10 0.00 0.51 0.87 0.00 0.74 0.15 0.10 0.33 0.61 0.00 1.00
# yr 0.21 0.00 0.00 0.83 0.00 0.96 0.00 0.96 0.20 0.19 0.88 0.75 0.96 0.00
#
# To see confidence intervals of the correlations, print with the short=FALSE option
cheers

R monthly average from monthly time series data

Monthly rainfall data is in a time series from January 1983 to December 2012.
One.Month.RainfallSJ.inch <- window(TS.RainfallSJ_inch, start=c(1983, 1), end=c(2012, 12))
One.Month.RainfallSJ.inch
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1983 7.41 4.87 5.92 3.90 0.15 0.00 0.00 0.02 1.08 0.19 5.26 3.82
1984 0.17 1.44 0.90 0.54 0.00 0.01 0.00 0.00 0.02 1.75 3.94 1.73
1985 0.74 0.76 2.98 0.48 0.23 0.00 0.13 0.00 0.35 0.98 2.47 1.40
1986 2.41 6.05 3.99 0.66 0.16 0.00 0.00 0.00 1.02 0.08 0.17 0.85
1987 1.60 2.10 1.87 0.14 0.00 0.00 0.00 0.00 0.00 0.93 1.65 3.31
1988 2.08 0.62 0.06 1.82 0.66 0.00 0.00 0.00 0.00 0.06 1.42 2.14
1989 1.06 1.07 1.91 0.57 0.09 0.00 0.00 0.00 0.83 1.33 0.80 0.04
1990 1.93 1.61 0.89 0.22 2.38 0.00 0.15 0.00 0.24 0.25 0.24 2.03
1991 0.18 2.22 6.17 0.18 0.15 0.06 0.00 0.04 0.12 0.85 0.43 2.43
1992 1.73 6.59 3.37 0.42 0.00 0.25 0.00 0.00 0.00 0.66 0.05 4.51
1993 6.98 4.71 2.81 0.54 0.47 0.54 0.00 0.00 0.00 0.67 2.17 1.99
1994 1.33 3.03 0.44 1.47 1.21 0.01 0.00 0.00 0.07 0.27 2.37 1.76
1995 8.66 0.53 6.85 1.06 1.27 0.84 0.01 0.00 0.00 0.00 0.05 4.71
1996 3.03 4.85 2.62 0.75 1.42 0.00 0.00 0.00 0.01 1.08 1.65 4.78
1997 6.80 0.14 0.17 0.11 0.55 0.21 0.00 0.51 0.00 0.69 5.01 1.85
1998 4.81 10.23 2.40 1.46 1.93 0.00 0.00 0.00 0.05 0.60 1.77 0.72
1999 3.25 2.88 2.69 1.56 0.02 0.14 0.14 0.00 0.00 0.00 0.50 0.55
2000 3.57 4.56 1.69 0.74 0.40 0.30 0.00 0.01 0.12 2.16 0.44 0.31
2001 2.87 4.44 1.71 1.48 0.00 0.13 0.00 0.00 0.13 0.12 2.12 4.47
2002 0.75 0.81 1.80 0.35 0.68 0.00 0.00 0.00 0.00 0.00 1.99 6.60
2003 0.65 1.65 0.77 2.95 0.72 0.00 0.00 0.03 0.03 0.00 1.91 4.91
2004 1.61 4.28 0.49 0.40 0.08 0.00 0.00 0.00 0.15 3.04 0.73 4.32
2005 3.47 5.31 3.55 2.52 0.00 0.00 0.01 0.00 0.00 0.10 0.45 5.47
2006 2.94 2.39 6.55 4.55 0.45 0.00 0.00 0.00 0.00 0.39 1.38 1.77
2007 0.93 3.49 0.46 0.96 0.08 0.00 0.01 0.00 0.26 1.13 0.55 1.18
2008 5.81 1.81 0.15 0.03 0.00 0.00 0.00 0.00 0.00 0.19 1.33 1.53
2009 1.30 5.16 1.89 0.30 0.09 0.01 0.00 0.02 0.19 2.41 0.41 2.16
2010 4.58 2.12 2.05 3.03 0.35 0.00 0.00 0.00 0.00 0.25 1.76 2.53
2011 0.96 3.15 4.32 0.20 0.40 1.51 0.00 0.00 0.00 0.77 0.08 0.08
2012 0.90 0.63 1.98 1.88 0.00 0.15 0.00 0.00 0.01 0.35 2.59 4.24
How can I compute the January average over 1983 to 2012, and so on for each month?
Thanks,
Nahm
Maybe try colMeans:
colMeans(One.Month.RainfallSJ.inch)
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
# 2.8170000 3.1166667 2.4483333 1.1756667 0.4646667 0.1386667 0.0150000 0.0210000 0.1560000 0.7100000 1.5230000
# Dec
# 2.6063333

Boxplot of table using ggplot2

I'm trying to plot a boxplot of my data using ggplot2 in R, but I just can't get it to work. Can anyone help me out?
The data is like the table below:
Paratio ShapeIdx FracD NNDis Core
-3.00 1.22 0.14 2.71 7.49
-1.80 0.96 0.16 0.00 7.04
-3.00 1.10 0.13 2.71 6.85
-1.80 0.83 0.16 0.00 6.74
-0.18 0.41 0.27 0.00 6.24
-1.66 0.12 0.11 2.37 6.19
-1.07 0.06 0.14 0.00 6.11
-0.32 0.18 0.23 0.00 5.93
-1.16 0.32 0.15 0.00 5.59
-0.94 0.14 0.15 1.96 5.44
-1.13 0.31 0.16 0.00 5.42
-1.35 0.40 0.15 0.00 5.38
-0.53 0.25 0.20 2.08 5.32
-1.96 0.36 0.12 0.00 5.27
-1.09 0.07 0.13 0.00 5.22
-1.35 0.27 0.14 0.00 5.21
-1.25 0.21 0.14 0.00 5.19
-1.02 0.25 0.16 0.00 5.19
-1.28 0.22 0.14 0.00 5.11
-1.44 0.32 0.14 0.00 5.00
What I want is exactly one boxplot per column, with no relation between the columns.
ggplot2 requires data in a specific format. Here you need x= and y= aesthetics, where y is the values and x is the corresponding column id. Use melt from the reshape2 package to convert the data to this long format, then plot.
require(reshape2)
require(ggplot2)
# dd is the data frame shown above
ggplot(data = melt(dd), aes(x = variable, y = value)) + geom_boxplot(aes(fill = variable))
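
A base-R alternative to reshape2 is stack(), which converts a data frame of numeric columns into the same long format (columns values and ind):

require(ggplot2)
ggplot(stack(dd), aes(x = ind, y = values)) + geom_boxplot(aes(fill = ind))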

R: Is there a fast approximate correlation library for large time series?

I am trying to make a software that will, in real time, find the top N correlated time series windows (to a query series).
There are approximately 5000 windows, each 34 rows in length. With respect to the query series I need the 300 most correlated windows.
Currently I am using the cor function, but it is proving to be far too slow. I need response times under a second; under 250 ms would be great, but anything in that vicinity would do.
Is there a "fast approximate correlation" library for R that I can use to reduce the size of my large "contestant list" (the 5000 windows)?
If not, is there another method to shrink this list somewhat?
Here is the function that I am running:
GetTopN <- function(n)
{
  Rprof()
  x <- LastBars()
  x <- as.data.frame(cbind(x[-1, 1], diff(x[, 2])))
  colnames(x) <- c('RPos', 'M1')
  actionlist <- GetFiltered()
  print(nrow(actionlist))
  crs <- mat.or.vec(nrow(actionlist), 2)  # will hold correlations
  for (i in 1:nrow(actionlist))
  {
    # z and n1 come from the enclosing environment
    crs[i, 2] <- cor(z[actionlist$RPos[i] + n1, 2], x[, 2])
  }
  crs[, 1] <- actionlist$OpenTime
  sorted <- crs[order(crs[, 2], decreasing = T), 1:2]
  topx <- head(sorted, n)
  bottomx <- tail(sorted, n)
  rownames(bottomx) <- NULL
  DF <- as.data.frame(rbind(topx, bottomx), row.names = NULL)
  colnames(DF) <- c('ptime', 'weight')
  sqlSave(channel, dat = DF, tablename = 'ReducedList', append = F, rownames = F, safer = F)
  FillActionList()
  Rprof(NULL)
  summaryRprof()
}
And here is the output from summaryRprof:
$by.self
self.time self.pct total.time total.pct
[.data.frame 0.68 25.37 0.98 36.57
.Call 0.22 8.21 0.22 8.21
cor 0.16 5.97 2.30 85.82
is.data.frame 0.14 5.22 1.26 47.01
[ 0.14 5.22 1.12 41.79
stopifnot 0.14 5.22 0.30 11.19
sys.call 0.14 5.22 0.18 6.72
GetTopN 0.12 4.48 2.68 100.00
eval 0.10 3.73 0.46 17.16
deparse 0.10 3.73 0.34 12.69
%in% 0.10 3.73 0.22 8.21
$ 0.10 3.73 0.10 3.73
c 0.08 2.99 0.08 2.99
.deparseOpts 0.06 2.24 0.14 5.22
formals 0.06 2.24 0.08 2.99
pmatch 0.06 2.24 0.08 2.99
names 0.06 2.24 0.06 2.24
match 0.04 1.49 0.12 4.48
sys.parent 0.04 1.49 0.04 1.49
match.arg 0.02 0.75 0.58 21.64
length 0.02 0.75 0.02 0.75
matrix 0.02 0.75 0.02 0.75
mode 0.02 0.75 0.02 0.75
order 0.02 0.75 0.02 0.75
parent.frame 0.02 0.75 0.02 0.75
sys.function 0.02 0.75 0.02 0.75
$by.total
total.time total.pct self.time self.pct
GetTopN 2.68 100.00 0.12 4.48
cor 2.30 85.82 0.16 5.97
is.data.frame 1.26 47.01 0.14 5.22
[ 1.12 41.79 0.14 5.22
[.data.frame 0.98 36.57 0.68 25.37
match.arg 0.58 21.64 0.02 0.75
eval 0.46 17.16 0.10 3.73
deparse 0.34 12.69 0.10 3.73
stopifnot 0.30 11.19 0.14 5.22
.Call 0.22 8.21 0.22 8.21
%in% 0.22 8.21 0.10 3.73
sqlQuery 0.20 7.46 0.00 0.00
sys.call 0.18 6.72 0.14 5.22
odbcQuery 0.18 6.72 0.00 0.00
GetFiltered 0.16 5.97 0.00 0.00
match.call 0.16 5.97 0.00 0.00
.deparseOpts 0.14 5.22 0.06 2.24
match 0.12 4.48 0.04 1.49
$ 0.10 3.73 0.10 3.73
c 0.08 2.99 0.08 2.99
formals 0.08 2.99 0.06 2.24
pmatch 0.08 2.99 0.06 2.24
names 0.06 2.24 0.06 2.24
sys.parent 0.04 1.49 0.04 1.49
LastBars 0.04 1.49 0.00 0.00
length 0.02 0.75 0.02 0.75
matrix 0.02 0.75 0.02 0.75
mode 0.02 0.75 0.02 0.75
order 0.02 0.75 0.02 0.75
parent.frame 0.02 0.75 0.02 0.75
sys.function 0.02 0.75 0.02 0.75
mat.or.vec 0.02 0.75 0.00 0.00
odbcFetchRows 0.02 0.75 0.00 0.00
odbcUpdate 0.02 0.75 0.00 0.00
sqlGetResults 0.02 0.75 0.00 0.00
sqlSave 0.02 0.75 0.00 0.00
sqlwrite 0.02 0.75 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.68
Looking at summaryRprof's output, it seems that [.data.frame takes the longest, but I do not see how to get around that.
As Vincent points out in the comments, computing (Pearson) correlation is itself pretty quick. Once you have exhausted the basic R profiling and speed-up tricks, you can always:
- go multicore and/or parallel via the appropriate R packages
- use compiled code (and I can think of a package to facilitate that)
- even consider GPUs; for example, my Intro to High-Performance Computing with R slides (on my presentations page) contain an example of computing the (more expensive) Kendall correlation for a large gain
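
Before (or instead of) going parallel, note that in the profile above the correlation arithmetic is not the main cost: [.data.frame alone accounts for about 25% of self time, i.e. most samples land in data-frame subsetting inside the loop. A minimal sketch of removing that overhead, assuming the 5000 windows can be pre-extracted once into a 34 x 5000 numeric matrix W and the query is a length-34 vector q (both names are hypothetical):

# One vectorized call computes the correlation of q with every column of W
cors   <- cor(W, q)                              # returns a 5000 x 1 matrix
top300 <- order(cors, decreasing = TRUE)[1:300]  # indices of the 300 best windows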
