Why does my carbon-cache process occupy an ever-increasing amount of memory? - graphite

I'm using graphite+diamond to monitor hundreds of my servers. I found that carbon-cache is killed by the oom-killer every 20 hours. At first I thought it might be because my disk is relatively slow, since it's a SATA disk rather than an SSD. However, when I use iostat to check my disk utilization, it's only about 70%:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdap1 2.00 0.00 313.00 0.00 2484.00 0.00 15.87 0.84 2.67 2.67 0.00 2.43 76.05
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdap1 1.50 144.50 261.50 306.50 2136.00 1804.00 13.87 1.13 2.00 3.03 1.11 1.27 72.30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdap1 0.50 97.00 137.00 332.50 1120.00 1718.00 12.09 1.98 4.23 6.69 3.21 1.70 79.90
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdap1 2.50 0.00 163.50 0.00 1334.00 0.00 16.32 0.63 3.86 3.86 0.00 3.58 58.50
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdap1 1.00 102.00 131.50 167.00 1048.00 1076.00 14.23 0.71 2.39 4.32 0.87 1.80 53.65
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdap1 0.00 0.00 83.00 0.50 642.00 4.00 15.47 0.20 2.46 2.47 0.00 2.33 19.45
And my CPU usage is also not very high:
%Cpu0 : 34.8 us, 5.2 sy, 0.0 ni, 58.2 id, 0.0 wa, 0.0 hi, 1.0 si, 0.7 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.3 st
%Cpu3 : 6.5 us, 1.7 sy, 0.0 ni, 5.4 id, 85.7 wa, 0.0 hi, 0.0 si, 0.7 st
How should I deal with this problem?
PS: our console.log is as follows:
07/06/2017 19:41:57 :: Sorted 16 cache queues in 0.000308 seconds
07/06/2017 19:41:57 :: Sorted 2 cache queues in 0.000200 seconds
07/06/2017 19:41:58 :: Sorted 564 cache queues in 0.000762 seconds
07/06/2017 19:41:58 :: Sorted 116 cache queues in 0.000388 seconds
07/06/2017 19:41:59 :: Sorted 820 cache queues in 0.001008 seconds
07/06/2017 19:42:00 :: Sorted 52 cache queues in 0.000354 seconds
07/06/2017 19:42:00 :: Sorted 1 cache queues in 0.000175 seconds
07/06/2017 19:42:01 :: Sorted 491 cache queues in 0.000530 seconds
07/06/2017 19:42:01 :: Sorted 101 cache queues in 0.000431 seconds
07/06/2017 19:42:01 :: Sorted 21 cache queues in 0.000283 seconds
07/06/2017 19:42:02 :: Sorted 1342 cache queues in 0.001589 seconds
07/06/2017 19:42:02 :: Sorted 224 cache queues in 0.000525 seconds
07/06/2017 19:42:02 :: Sorted 67 cache queues in 0.000299 seconds
07/06/2017 19:42:03 :: Sorted 1812 cache queues in 0.002230 seconds
07/06/2017 19:42:03 :: Sorted 360 cache queues in 0.000583 seconds
07/06/2017 19:42:03 :: Sorted 109 cache queues in 0.000430 seconds
07/06/2017 19:42:03 :: Sorted 27 cache queues in 0.000269 seconds
07/06/2017 19:42:04 :: Sorted 1570 cache queues in 0.001632 seconds
07/06/2017 19:42:05 :: Sorted 348 cache queues in 0.000656 seconds

Carbon has a rate limit on how many writes it performs per second. If your disk isn't saturated yet, you can increase this limit. Be careful: if you set it too high you can starve other applications on this system, or other hosts if you are on shared storage (SAN/NAS).
You can find this rate limit in the carbon.conf file. The setting is:
MAX_UPDATES_PER_SECOND =
To prevent the system from killing carbon, you could also configure a maximum cache size. This will keep carbon from being killed, but metrics will be dropped once the limit is hit. The limit is expressed as the number of data points in the cache; see the metric carbon.agents.$instance.cache.size to determine a good value. This setting also lives in carbon.conf:
MAX_CACHE_SIZE =
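For illustration, a carbon.conf fragment might look like the following; the numbers are placeholders, not recommendations, so tune them against your own disk throughput and the cache.size metric mentioned above:
[cache]
# Whisper updates per second; raise only while the disk still has headroom
MAX_UPDATES_PER_SECOND = 1000
# Maximum number of data points held in the cache before new metrics are dropped
MAX_CACHE_SIZE = 2000000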
Also keep in mind that, due to Python's Global Interpreter Lock (GIL), carbon can only use one core at a time. Your current CPU usage seems fine, but if your load increases further you could consider running 4 carbon-cache instances (as you have 4 cores) with a carbon-relay in front to fully utilize your system's resources.
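As a rough sketch of that layout (instance names, ports and the hashing method below are illustrative placeholders; check the Graphite documentation for the exact syntax of your version), carbon.conf can declare an extra cache instance and a relay that fans metrics out to them:
[cache:b]
# Second cache instance, listening on its own ports
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102
[relay]
# The relay receives metrics and distributes them across the cache instances
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2104:b
Each instance is then started separately, e.g. carbon-cache.py --instance=b start.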

Related

R only ever runs on a certain CPU in Linux

I have an 8-core RHEL Linux machine running R 4.0.2.
If I ask R for the number of cores, I can confirm that 8 are available.
> print(future::availableWorkers())
[1] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[7] "localhost" "localhost"
> print(parallel::detectCores())
[1] 8
However, if I run this simple example
f <- function(out=0) {
  for (i in 1:1e10) out <- out + 1
}
output <- parallel::mclapply(1:8, f, mc.cores = 8)
my top indicates that only 1 core is being used (so that each worker is using 1/8th of that core, or 1/64th of the entire machine).
%Cpu0 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 2.0 us, 0.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32684632 total, 28211076 free, 2409992 used, 2063564 buff/cache
KiB Swap: 16449532 total, 11475052 free, 4974480 used. 29213180 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3483 user 20 0 493716 57980 948 R 1.8 0.2 0:18.09 R
3479 user 20 0 493716 57980 948 R 1.5 0.2 0:18.09 R
3480 user 20 0 493716 57980 948 R 1.5 0.2 0:18.08 R
3481 user 20 0 493716 57980 948 R 1.5 0.2 0:18.09 R
3482 user 20 0 493716 57980 948 R 1.5 0.2 0:18.09 R
3484 user 20 0 493716 57980 948 R 1.5 0.2 0:18.09 R
3485 user 20 0 493716 57980 948 R 1.5 0.2 0:18.09 R
3486 user 20 0 493716 57980 948 R 1.5 0.2 0:18.09 R
Does anyone know what might be going on here? Another StackOverflow question that documents similar behavior is here. It's clear that I messed up the install somehow. I followed these install instructions for RHEL 7. I'm guessing there is a dependency missing, but I have no idea where to look. If anyone has any ideas of diagnostics to run, etc., they would be most appreciated.
For further context, I have R 3.4.1 also installed on my machine, and when I run this code, everything works fine. (I installed that version through yum.)
I also installed R 4.0.3 yesterday using the same instructions linked above, and it suffers from the same problem.
First run
system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
then your simple example
f <- function(out=0) { for (i in 1:1e10) out <- out + 1 }
output <- parallel::mclapply(1:8, f, mc.cores = 8)
works on all 8 cores.
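To double-check that the affinity mask was the culprit, you can print the current affinity of the R process before and after the call above (this assumes a Linux system with the taskset utility installed):
# Show which CPUs the current R process is allowed to run on (Linux only)
system(sprintf("taskset -cp %d", Sys.getpid()))
If the fix worked, the reported list should cover all 8 cores rather than a single one.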

How to guarantee that processes are properly allocated to each of the nodes

I am new to using the IMB benchmark and want to make sure that the configured number of processes (np) is evenly allocated across the nodes and their cores.
Currently the IMB benchmark reports throughput per second for a fixed message size over a fixed duration.
For example, like the run below.
$ mpirun -np 64 -machinefile hosts_infin ./IMB-MPI1 -map 32x2 Sendrecv
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.76 0.76 0.76 0.00
1 1000 0.85 0.85 0.85 2.35
2 1000 0.79 0.79 0.79 5.06
4 1000 0.80 0.80 0.80 10.00
8 1000 0.78 0.78 0.78 20.45
16 1000 0.79 0.80 0.80 40.16
32 1000 0.79 0.79 0.79 80.61
64 1000 0.79 0.79 0.79 162.59
128 1000 0.82 0.82 0.82 311.41
256 1000 0.91 0.91 0.91 565.42
512 1000 0.95 0.95 0.95 1082.13
1024 1000 0.99 0.99 0.99 2076.87
2048 1000 1.27 1.27 1.27 3229.91
4096 1000 1.71 1.71 1.71 4802.87
8192 1000 2.49 2.50 2.50 6565.97
16384 1000 4.01 4.01 4.01 8167.28
32768 1000 7.08 7.09 7.08 9249.23
65536 640 22.89 22.89 22.89 5725.50
131072 320 37.45 37.45 37.45 6999.22
262144 160 65.74 65.76 65.75 7972.53
524288 80 120.10 120.15 120.12 8727.37
1048576 40 228.63 228.73 228.68 9168.57
2097152 20 445.38 445.69 445.53 9410.86
4194304 10 903.77 905.97 904.87 9259.29
#-----------------------------------------------------------------------------
However, this does not guarantee that the processes are evenly distributed across the nodes when I configure a different number of processes.
Is it possible to assign processes to specific cores with the IMB benchmark, or do I need to use an alternative benchmark to do this?
The way processes (MPI ranks) are allocated to nodes depends on the MPI launcher (mpirun in your case); it has nothing to do with the application itself, although application performance, and even its execution, can be affected if it uses local resources such as files on local filesystems. The pinning of processes to specific cores is likewise the responsibility of the MPI launcher or the MPI library.
Since you don't specify which MPI implementation you use, I strongly encourage you to look at man mpirun. Some mpirun commands allow -ppn to specify the number of processes per node. Others allow you to limit the number of slots in the machinefile.
Finally, some MPI launchers offer certain levels of verbosity to show the user how the processes are mapped onto the nodes.
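As a sketch only (the exact flags differ between MPI implementations, so treat these as illustrative rather than definitive):
# MPICH/Intel MPI style: 32 ranks per node
mpirun -np 64 -ppn 32 -machinefile hosts_infin ./IMB-MPI1 Sendrecv
# Open MPI style: place 32 ranks per node, bind each rank to a core, and print the mapping
mpirun -np 64 --map-by ppr:32:node --bind-to core --report-bindings ./IMB-MPI1 Sendrecv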

Ramp up/down missing time-series data in R

I have a set of time-series data (GPS speed data, specifically) which includes gaps of missing values where the signal was lost. Missing periods of short duration I can fill simply using na.spline; however, this is inappropriate for longer periods. I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.
#create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"
#set rate of acceleration for ramp
ramp <- 6
#set sampling rate of receiver
Hz <- 1/10
So for missing data the ramp would take the previous value and subtract the rate of acceleration to get the next data point, until speed reaches zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:
3.59
2.99
2.39
1.79
1.19
0.59
0
Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again.
Hope this is clear.
Cheers
It's not really elegant, but you can do it in a loop.
na.pos <- which(is.na(test$speed))
acc <- FALSE
for (i in na.pos) {
  if (acc) {
    # past zero: ramp back up from the previous (filled) value
    speed <- test$speed[i-1] + (Hz * ramp)
  } else {
    # still decelerating: ramp down from the previous value
    speed <- test$speed[i-1] - (Hz * ramp)
    if (round(speed, 1) < 0) {
      # overshot below zero: switch to ramping up instead
      acc <- TRUE
      speed <- test$speed[i-1] + (Hz * ramp)
    }
  }
  test[i,] <- speed
}
The result is:
speed
1 6.00
2 5.70
3 5.40
4 5.14
5 4.89
6 4.64
7 4.41
8 4.19
9 3.59
10 2.99
11 2.39
12 1.79
13 1.19
14 0.59
15 -0.01
16 0.59
17 1.19
18 1.79
19 2.39
20 2.99
21 3.59
22 4.19
23 4.79
24 5.00
25 5.10
26 5.30
27 5.40
28 5.50
Note the '-0.01': 0.59 - (ramp * Hz) is -0.01, not 0. You can round it later; I decided not to.
When the question says "ramp the values from the last true value down to zero" in each run of NAs, I assume it means that any remaining NAs in the run after reaching zero are also to be replaced by zero.
Now, use rleid from data.table to create a grouping vector the same length as test$speed identifying each run in is.na(test$speed), and use ave to create sequence numbers within those groups, seqno. Then calculate the declining sequence, ramp_down, by combining na.locf(test$speed) and seqno. Finally replace the NAs.
library(data.table)
library(zoo)
test_speed <- test$speed
# sequence number within each run of NA / non-NA values
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
# last observed speed minus the ramp per sample, floored at zero
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
# only replace the positions that were NA
result <- ifelse(is.na(test_speed), ramp_down, test_speed)
giving:
> result
[1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50

Analysis of several behaviours of 2 individuals

I am analysing several animal behaviours during a defined time period.
I watch videos of the animals and record when each behaviour is displayed. Each behaviour is displayed several times during the recording (these are the different events). Sometimes 2 or 3 behaviours are displayed at the same time, but they don't usually start/finish at exactly the same moment (so they overlap partly).
I end up with a series of events for each behaviour, and for each event I have their onset, duration and end point (see example hereafter).
I need to extract from this data the total amount during which behaviour 1 overlaps with behaviour 2 / behaviour 1 overlaps with behaviour 3 / behaviour 2 overlaps with behaviour 3. This is so that I can find correlations between behaviours, which ones tend to be displayed at the same time, which ones do not, ...
I am only a beginner with programming (mostly R) and I find it hard to get started. Can you please advise me how to proceed? Many thanks!
Example with a series of events for 3 behaviours:
Event tracked Onset Duration End
Behaviour 1 _event 1 7.40 548.88 556.28
Behaviour 1 _event 2 36.20 0.47 36.67
Behaviour 1 _event 3 48.45 0.25 48.70
Behaviour 1 _event 4 68.92 1.53 70.45
Behaviour 1 _event 5 75.48 0.22 75.70
Behaviour 1 _event 6 89.75 0.66 90.41
Behaviour 1 _event 7 94.62 0.16 94.78
Behaviour 1 _event 8 101.78 0.22 102.00
Behaviour 1 _event 9 108.86 0.59 109.45
Behaviour 1 _event 10 146.35 0.66 147.00
Behaviour 1 _event 11 150.20 0.75 150.95
Behaviour 1 _event 12 152.98 0.66 153.64
Behaviour 1 _event 13 157.84 0.56 158.41
Behaviour 2_event 1 7.52 0.38 7.90
Behaviour 2_event 2 18.73 0.16 18.88
Behaviour 2_event 3 19.95 2.25 22.20
Behaviour 2_event 4 26.41 0.25 26.66
Behaviour 2_event 5 35.91 0.16 36.07
Behaviour 2_event 6 37.29 0.34 37.63
Behaviour 2_event 7 38.13 0.72 38.85
Behaviour 2_event 8 40.19 0.31 40.51
Behaviour 2_event 9 44.26 0.16 44.41
Behaviour 2_event 10 45.32 0.16 45.48
Behaviour 2_event 11 54.84 1.44 56.27
Behaviour 2_event 12 56.65 1.19 57.84
Behaviour 2_event 13 61.59 1.03 62.62
Behaviour 2_event 14 81.13 3.83 84.96
Behaviour 2_event 15 86.65 0.31 86.96
Behaviour 2_event 16 90.15 0.19 90.34
Behaviour 2_event 17 96.97 0.53 97.50
Behaviour 2_event 18 107.12 0.22 107.34
Behaviour 2_event 19 118.53 0.41 118.94
Behaviour 2_event 20 127.76 0.25 128.01
Behaviour 2_event 21 129.45 0.69 130.13
Behaviour 2_event 22 130.60 2.31 132.91
Behaviour 2_event 23 141.01 0.41 141.41
Behaviour 2_event 24 152.85 0.37 153.23
Behaviour 2_event 25 156.54 0.13 156.66
Behaviour 3_event 1 7.71 1.94 9.65
Behaviour 3_event 2 11.12 1.53 12.65
Behaviour 3_event 3 19.01 0.19 19.20
Behaviour 3_event 4 20.01 3.97 23.98
Behaviour 3_event 5 24.95 4.22 29.16
Behaviour 3_event 6 29.70 2.19 31.88
Behaviour 3_event 7 33.23 2.50 35.73
Behaviour 3_event 8 36.82 0.44 37.26
Behaviour 3_event 9 38.20 1.16 39.35
Behaviour 3_event 10 39.91 2.13 42.04
Behaviour 3_event 11 42.49 3.62 46.11
Behaviour 3_event 12 47.09 0.53 47.62
Behaviour 3_event 13 48.15 0.34 48.49
Behaviour 3_event 14 49.40 2.13 51.52
Behaviour 3_event 15 57.57 2.25 59.82
Behaviour 3_event 16 60.89 0.88 61.76
Behaviour 3_event 17 66.85 6.78 73.63
Behaviour 3_event 18 75.65 3.03 78.68
In order to do the kind of study you want to do, it might be easiest to convert the data to a time series with variables for states (i.e. whether behaviour 1, 2, 3, etc. is being displayed). So you want to transform the dataset you have into one that looks like this:
time animal behav_1 behav_2 behav_3
0 1 FALSE TRUE FALSE
0 2 TRUE FALSE FALSE
1 1 FALSE TRUE FALSE
1 2 TRUE FALSE TRUE
... ... ... ... ...
Each row tells whether a particular animal is displaying each of the three behaviors at the given time. (I am assuming here that you have multiple animals and you want to keep their behavior data separate.)
Then you could easily approximate many of the quantities you are interested in. For example, you could compute the probability that an animal is doing behavior 1 given it is doing behavior 2 by
Computing a column data$behav_1_and_2 <- data$behav_1 & data$behav_2
Dividing the sum of the column behav_1_and_2 by the sum of behav_2: sum(data$behav_1_and_2) / sum(data$behav_2)
Okay, but how do you transform the data? First, decide how many time points you want to check. Maybe you should increment by about 0.1.
num_animals <- 10  # how many animals you have
time_seq <- seq(from = 0, to = 600, by = 0.1)  # 'to' should be the end of the video
data <- expand.grid(time = time_seq, animal = seq_len(num_animals))
That gets you the first two columns of the data frame you want. Then you need to compute the three behavior columns. Define a function that takes the time, animal, and name of the behavior column, and returns TRUE if the animal is doing that behavior at the time, or FALSE if not.
has_behavior <- function(time, animal, behavior) {
...
}
(I'm going to let you figure out how to make that function.) With that function in hand, you can then create the last three columns with a loop:
# First create empty columns
data$behav_1 <- logical(nrow(data))
data$behav_2 <- logical(nrow(data))
data$behav_3 <- logical(nrow(data))
# Now loop through rows
for (i in 1:nrow(data)) {
  data$behav_1[i] <- has_behavior(data$time[i], data$animal[i], 1)
  data$behav_2[i] <- has_behavior(data$time[i], data$animal[i], 2)
  data$behav_3[i] <- has_behavior(data$time[i], data$animal[i], 3)
}
With data in this format, you should be able to study the problem much more easily. You can compute those summary quantities easily, as I outlined earlier. This data frame is also set up to be useful for doing time series modeling. And it's also tidy, making it easy to use with packages like dplyr for data summarising and ggplot2 for visualization. (You can learn more about those last two tools in the free online book R for Data Science by Hadley Wickham.)
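As a purely illustrative sketch of what has_behavior could look like, assuming the event table above has been read into a data frame called events with (hypothetical) columns animal, behaviour, onset and end:
has_behavior <- function(time, animal, behavior) {
  # TRUE if any event of this behaviour for this animal spans the given time point
  rows <- events$animal == animal & events$behaviour == behavior
  any(events$onset[rows] <= time & events$end[rows] >= time)
}
This is only one way to do it; an interval-join approach (e.g. data.table::foverlaps) would be faster for long recordings.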

Profiling data.table's setkey operation with Rprof

I am working with a relatively large data.table dataset and trying to profile/optimize the code. I am using Rprof, but I'm noticing that the majority of time spent within a setkey operation is not included in the Rprof summary. Is there a way to include this time spent?
Here is a small test to show how time spent setting the key for a data table is not represented in the Rprof summary:
Create a test function that runs a profiled setkey operation on a data table:
testFun <- function(testTbl) {
  Rprof()
  setkey(testTbl, x, y, z)
  Rprof(NULL)
  print(summaryRprof())
}
Then create a test data table that is large enough to feel the weight of the setkey operation:
testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
Then run the code, and wrap it within a system.time operation to show the difference between the system.time total time and the rprof total time:
> system.time(testFun(testTbl))
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.88 75.86 0.88 75.86
"<Anonymous>" 0.08 6.90 1.00 86.21
"regularorder1" 0.08 6.90 0.92 79.31
"radixorder1" 0.08 6.90 0.12 10.34
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72
$by.total
total.time total.pct self.time self.pct
"setkey" 1.16 100.00 0.00 0.00
"setkeyv" 1.16 100.00 0.00 0.00
"system.time" 1.16 100.00 0.00 0.00
"testFun" 1.16 100.00 0.00 0.00
"fastorder" 1.14 98.28 0.00 0.00
"tryCatch" 1.14 98.28 0.00 0.00
"tryCatchList" 1.14 98.28 0.00 0.00
"tryCatchOne" 1.14 98.28 0.00 0.00
"<Anonymous>" 1.00 86.21 0.08 6.90
"regularorder1" 0.92 79.31 0.08 6.90
"sort.list" 0.88 75.86 0.88 75.86
"radixorder1" 0.12 10.34 0.08 6.90
"doTryCatch" 0.12 10.34 0.00 0.00
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72
"is.unsorted" 0.02 1.72 0.00 0.00
"simpleError" 0.02 1.72 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.16
user system elapsed
31.112 0.211 31.101
Note the difference between the 1.16 and 31.101 timings.
Reading ?Rprof, I see why this difference might have occurred:
Functions will only be recorded in the profile log if they put a context on the call stack (see sys.calls). Some primitive functions do not do so: specifically those which are of type "special" (see the 'R Internals' manual for more details).
So is this the reason why time spent within the setkey operation isn't represented in Rprof? Is there a workaround to have Rprof watch all of data.table's operations (including setkey, and maybe others that I haven't noticed)? I essentially want the system.time and Rprof times to match up.
Here is the most-likely relevant sessionInfo():
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
data.table_1.8.11
I still observe this issue when Rprof() isn't within a function call:
> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }
> Rprof()
> system.time(testFun(testTbl))
user system elapsed
28.855 0.191 28.854
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.86 71.67 0.88 73.33
"regularorder1" 0.08 6.67 0.92 76.67
"<Anonymous>" 0.06 5.00 0.98 81.67
"radixorder1" 0.06 5.00 0.10 8.33
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67
$by.total
total.time total.pct self.time self.pct
"system.time" 1.20 100.00 0.00 0.00
"setkey" 1.10 91.67 0.00 0.00
"setkeyv" 1.10 91.67 0.00 0.00
"testFun" 1.10 91.67 0.00 0.00
"fastorder" 1.08 90.00 0.00 0.00
"tryCatch" 1.08 90.00 0.00 0.00
"tryCatchList" 1.08 90.00 0.00 0.00
"tryCatchOne" 1.08 90.00 0.00 0.00
"<Anonymous>" 0.98 81.67 0.06 5.00
"regularorder1" 0.92 76.67 0.08 6.67
"sort.list" 0.88 73.33 0.86 71.67
"radixorder1" 0.10 8.33 0.06 5.00
"doTryCatch" 0.10 8.33 0.00 0.00
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67
"formals" 0.02 1.67 0.00 0.00
"is.unsorted" 0.02 1.67 0.00 0.00
"match.arg" 0.02 1.67 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.2
EDIT2: Same issue with 1.8.10 on my machine with only the data.table package loaded. Times are not equal even when the Rprof() call is not within a function:
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> base::source("/tmp/r-plugin-claytonstanley/Rsource-86075-preProcess.R", echo=TRUE)
> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> system.time(testFun(testTbl))
user system elapsed
29.516 0.281 29.760
> Rprof(NULL)
> summaryRprof()
EDIT3: Doesn't work even if setkey is not within a function:
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> setkey(testTbl, x, y, z)
> Rprof(NULL)
> summaryRprof()
EDIT4: Doesn't work even when R is called from a --vanilla bare-bones terminal prompt.
EDIT5: Does work when tested on a Linux VM. But still does not work on darwin machine for me.
EDIT6: Doesn't work after watching the Rprof.out file get created, so it isn't a write access issue.
EDIT7: Doesn't work after compiling data.table from source and creating a new temp user and running on that account.
EDIT8: Doesn't work when compiling R 3.0.2 from source for darwin via MacPorts.
EDIT9: Does work on a different darwin machine, a Macbook Pro laptop running the same OS version (10.6.8). Still doesn't work on a MacPro desktop machine running same OS version, R version, data.table version, etc.
I'm thinking it's because the desktop machine is running in 64-bit kernel mode (not the default), and the laptop is 32-bit (the default). Confirmed.
Great question. Given the edits, I'm not sure then; I can't reproduce it. Leaving the remainder of the answer here for now.
I've tested on my (very slow) netbook and it works fine, see output below.
I can tell you right now why setkey is so slow on that test case. When the number of levels is large (greater than 100,000, as here) it reverts to comparison sort rather than counting sort. Yes, pretty poor if you have data like that in practice. Typically we have under 100,000 unique values in the first column and then, say, dates in the 2nd column. Both columns can be sorted using counting sort and performance is ok.
It's a known issue and we've been working hard on it. Arun has implemented radix sort for integers with range > 100,000 to solve this problem and that's in the next release. But we are still tidying up v1.8.11. See our presentation in Cologne which goes into more detail and gives some idea of speedups.
Introduction to data.table and news from v1.8.11
Here is the output with v1.8.10, along with R version and lscpu info (for your entertainment). I like to test on a very poor machine with small cache so that in development I can see what's likely to bite when the data is scaled up on larger machines with larger cache.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 20
Model: 2
Stepping: 0
CPU MHz: 800.000
BogoMIPS: 1995.01
Virtualisation: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
NUMA node0 CPU(s): 0,1
$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> require(data.table)
Loading required package: data.table
data.table 1.8.10 For help type: help("data.table")
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> testTbl
x y z
1: 1748920 6694402 7501082
2: 4571252 565976 5695727
3: 1284455 8282944 7706392
4: 8452994 8765774 6541097
5: 6429283 329475 5271154
---
9999996: 2019750 5956558 1735214
9999997: 1096888 1657401 3519573
9999998: 1310171 9002746 350394
9999999: 5393125 5888350 7657290
10000000: 2210918 7577598 5002307
> Rprof()
> setkey(testTbl, x, y, z)
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 195.44 91.34 195.44 91.34
".Call" 5.38 2.51 5.38 2.51
"<Anonymous>" 4.32 2.02 203.62 95.17
"radixorder1" 4.32 2.02 4.74 2.22
"regularorder1" 4.28 2.00 199.30 93.15
"is.na" 0.12 0.06 0.12 0.06
"any" 0.10 0.05 0.10 0.05
$by.total
total.time total.pct self.time self.pct
"setkey" 213.96 100.00 0.00 0.00
"setkeyv" 213.96 100.00 0.00 0.00
"fastorder" 208.36 97.38 0.00 0.00
"tryCatch" 208.36 97.38 0.00 0.00
"tryCatchList" 208.36 97.38 0.00 0.00
"tryCatchOne" 208.36 97.38 0.00 0.00
"<Anonymous>" 203.62 95.17 4.32 2.02
"regularorder1" 199.30 93.15 4.28 2.00
"sort.list" 195.44 91.34 195.44 91.34
".Call" 5.38 2.51 5.38 2.51
"radixorder1" 4.74 2.22 4.32 2.02
"doTryCatch" 4.74 2.22 0.00 0.00
"is.unsorted" 0.22 0.10 0.00 0.00
"is.na" 0.12 0.06 0.12 0.06
"any" 0.10 0.05 0.10 0.05
$sample.interval
[1] 0.02
$sampling.time
[1] 213.96
>
The problem was that the darwin machine was running Snow Leopard with a 64-bit kernel, which is not the default for that OS X version.
I also verified that this is not a problem for another darwin machine running Mountain Lion which uses a 64-bit kernel by default. So it's an interaction between Snow Leopard and running a 64-bit kernel specifically.
As a side note, the official OS X binary installer for R is still built with Snow Leopard, so I do think that this issue is still relevant, as Snow Leopard is still a widely-used OS X version.
When the 64-bit kernel in Snow Leopard is enabled, no kernel extensions that are compatible only with the 32-bit kernel are loaded. After booting into the default 32-bit kernel for Snow Leopard, kextfind shows that these 32-bit only kernel extensions are on the machine and (most likely) loaded:
$ kextfind -not -arch x86_64
/System/Library/Extensions/ACard6280ATA.kext
/System/Library/Extensions/ACard62xxM.kext
/System/Library/Extensions/ACard67162.kext
/System/Library/Extensions/ACard671xSCSI.kext
/System/Library/Extensions/ACard6885M.kext
/System/Library/Extensions/ACard68xxM.kext
/System/Library/Extensions/AppleIntelGMA950.kext
/System/Library/Extensions/AppleIntelGMAX3100.kext
/System/Library/Extensions/AppleIntelGMAX3100FB.kext
/System/Library/Extensions/AppleIntelIntegratedFramebuffer.kext
/System/Library/Extensions/AppleProfileFamily.kext/Contents/PlugIns/AppleIntelYonahProfile.kext
/System/Library/Extensions/IO80211Family.kext/Contents/PlugIns/AirPortAtheros.kext
/System/Library/Extensions/IONetworkingFamily.kext/Contents/PlugIns/AppleRTL8139Ethernet.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/InternalModemSupport.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/MotorolaSM56KUSB.kext
/System/Library/Extensions/JMicronATA.kext
/System/Library/Extensions/System.kext/PlugIns/BSDKernel6.0.kext
/System/Library/Extensions/System.kext/PlugIns/IOKit6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Libkern6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Mach6.0.kext
/System/Library/Extensions/System.kext/PlugIns/System6.0.kext
/System/Library/Extensions/ufs.kext
So it could be any one of those loaded extensions that is enabling something Rprof relies on, so that the setkey operation in data.table is profiled correctly.
If anyone wants to investigate this further, dig a bit deeper, and get to the root cause of the problem, please post an answer and I'll happily accept that one.
