My program execution was aborted with the following diagnostics:
Heap exhausted during garbage collection: 0 bytes available, 16 requested.
Gen Boxed Code Raw LgBox LgCode LgRaw Pin Alloc Waste Trig WP GCs Mem-age
3 21843 1 47 0 0 0 59 716955392 368896 2000000 21891 0 1.0481
4 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
5 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
6 491 2 223 55 0 10 0 24917312 674496 2000000 781 0 0.0000
7 10080 0 15 0 0 0 0 330663696 129264 2000000 10095 0 0.0000
Total bytes allocated = 1072536400
Dynamic-space-size bytes = 1073741824
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = true
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 88102(tid 0x7fff9e07c380):
Heap exhausted, game over.
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>
Is there a way to find where all of the memory was consumed?
The program itself is here: https://github.com/hemml/gridgen2
SBCL's (room t) will give you quite a bit more information if you can run it before you run out of heap. I'm unfamiliar with LDB and whether or not it can execute room. However, you could wrap a call to (room t) in something that redirects its output to a file, and add that function to the *after-gc-hooks* list so you can see the (extremely verbose) growth patterns.
My raw data contains numeric values, with the header lines repeated every 20 lines.
I want to remove the repeated header lines with R. I know it's quite easy with a sed command, but I want the R script to handle all steps of tidying the data.
> raw <- read.delim("./vmstat_archiveadm_s.txt")
> head(raw)
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100097600 97779056 285 426 53 0 0 0 367 86 6 0 0 1206 7711 2630 1 0 99
0 0 0 96908192 94414488 7 31 0 0 0 0 0 120 0 0 0 2782 5775 5042 2 0 97
0 0 0 96889840 94397152 0 1 0 0 0 0 0 122 0 0 0 2737 5591 4958 2 0 97
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s2 s3 vc -- in sy cs us sy id
0 0 0 100065744 97745448 282 422 52 0 0 0 363 89 6 0 0 1233 7690 2665 1 0 99
0 0 0 96725312 94222040 7 31 0 0 0 0 0 604 69 0 0 5269 5703 7910 2 1 97
0 0 0 96668624 94170784 0 0 0 0 0 0 0 155 53 0 0 3047 5505 5317 2 0 97
0 0 0 96595104 94086816 0 0 0 0 0 0 0 174 0 0 0 2879 5567 5068 2 0 97
1 0 0 96521376 94025504 0 0 0 0 0 0 0 121 0 0 0 2812 5471 5105 2 0 97
0 0 0 96503256 93994896 0 0 0 0 0 0 0 121 0 0 0 2731 5621 4981 2 0 97
(...)
Try this (where df is the data frame):
x = seq(6,100,21)
df = df[-x,]
seq() generates a sequence of numbers from 6 to 100 in steps of 21, which in this case gives:
6 27 48 69 90
Remove those rows from the data frame with:
df[-x,]
EDIT:
To do this for the entire data frame, replace 100 with the number of rows, i.e.
seq(6, nrow(df), 21)
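Putting the steps together, here is a minimal sketch of the whole approach, assuming the file and read.delim() call from the question and that the repeated header lines land at rows 6, 27, 48, ... as computed above:
# Read the raw file, then drop the rows holding the repeated headers.
raw <- read.delim("./vmstat_archiveadm_s.txt")
x   <- seq(6, nrow(raw), 21)   # positions of the repeated header rows
raw <- raw[-x, ]               # remove them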
Instead of processing the output in R, I would clean it at the generation level:
$ vmstat 1 | egrep -v '^ kthr|^ r'
0 0 0 154831904 153906536 215 471 0 0 0 0 526 33 32 0 0 1834 14171 5253 0 0 99
1 0 0 154805632 153354296 9 32 0 0 0 0 0 0 0 0 0 1463 610 739 0 0 100
1 0 0 154805632 153354696 0 4 0 0 0 0 0 0 0 0 0 1408 425 634 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1341 381 658 0 0 100
0 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1299 353 610 0 0 100
1 0 0 154805632 153354696 0 0 0 0 0 0 0 0 0 0 0 1319 375 638 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1308 367 614 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1336 395 650 0 0 100
1 0 0 154805632 153354640 0 0 0 0 0 0 0 44 44 0 0 1594 378 878 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 66 65 0 0 1763 382 1015 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1312 411 645 0 0 100
0 0 0 154805632 153354640 0 0 0 0 0 0 0 0 0 0 0 1342 390 647 0 0 100
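If you still want the result inside R, the same filter can be read directly through a pipe. This is only a sketch, assuming the egrep pattern above and a fixed sample count, with R's default V1, V2, ... column names:
# Read the pre-filtered vmstat stream straight into R (10 one-second samples).
vm <- read.table(pipe("vmstat 1 10 | egrep -v '^ kthr|^ r'"))
head(vm)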
I am creating a time series of daily sales of a given item at a retailer. I have several questions outlined below that I would like some help with (data and code follow below). Note that I am reading my actual data set from a CSV file, where observations (dates) are in rows and each variable is in a column. Thank you ahead of time for your help, and please know that I am new to R coding.
1) It appears as if R is reading my time series by the observation number of the date (i.e., April 5th, the 5th date in the data set, has a value of 5 rather than the 297 units that sold on that particular day). How can I remedy this?
2) I believe that my 'ts' statement is telling R that the data begins on the 91st day (April 1st) of 2013; have I coded this correctly? When I plot the data, it appears that R may be interpreting this statement differently.
3) Do I need to create a separate time series for my xreg? For example, should I create a time series for each variable, then take the union of those, and then cbind them?
4) Have I logged the variables in the correct statements, or should I do it elsewhere in the code?
require("forecast")
G<-read.csv("SingleItemToyDataset.csv")
GT<-ts(G$Units, start = c(2013, 91), frequency = 365.25)
X = cbind(log(G$Price), G$Time, as.factor(G$PromoOne), as.factor(G$PromoTwo), as.factor(G$Mon), as.factor(G$Tue), as.factor(G$Wed), as.factor(G$Thu), as.factor(G$Fri), as.factor(G$Sat))
Fit<-auto.arima(log(GT), xreg = X)
Date Day Units Price Time PromoOne PromoTwo Mon Tue Wed Thu Fri Sat
1 4/1/2013 Mon 351 5.06 1 1 0 1 0 0 0 0 0
2 4/2/2013 Tue 753 4.90 2 1 0 0 1 0 0 0 0
3 4/3/2013 Wed 133 5.32 3 1 0 0 0 1 0 0 0
4 4/4/2013 Thu 150 5.14 4 1 0 0 0 0 1 0 0
5 4/5/2013 Fri 297 5.00 5 1 0 0 0 0 0 1 0
6 4/6/2013 Sat 688 5.27 6 1 0 0 0 0 0 0 1
7 4/7/2013 Sun 1,160 5.06 7 1 0 0 0 0 0 0 0
8 4/8/2013 Mon 613 5.07 8 1 0 1 0 0 0 0 0
9 4/9/2013 Tue 430 5.07 9 1 0 0 1 0 0 0 0
10 4/10/2013 Wed 400 5.03 10 1 0 0 0 1 0 0 0
11 4/11/2013 Thu 1,530 4.97 11 1 0 0 0 0 1 0 0
12 4/12/2013 Fri 2,119 5.00 12 0 1 0 0 0 0 1 0
13 4/13/2013 Sat 1,094 5.09 13 0 1 0 0 0 0 0 1
14 4/14/2013 Sun 736 5.02 14 1 0 0 0 0 0 0 0
15 4/15/2013 Mon 518 5.10 15 1 0 1 0 0 0 0 0
16 4/16/2013 Tue 485 5.02 16 1 0 0 1 0 0 0 0
17 4/17/2013 Wed 472 5.05 17 1 0 0 0 1 0 0 0
18 4/18/2013 Thu 406 5.03 18 1 0 0 0 0 1 0 0
19 4/19/2013 Fri 564 5.00 19 1 0 0 0 0 0 1 0
20 4/20/2013 Sat 475 5.09 20 1 0 0 0 0 0 0 1
21 4/21/2013 Sun 621 5.04 21 1 0 0 0 0 0 0 0
22 4/22/2013 Mon 714 5.02 22 1 0 1 0 0 0 0 0
23 4/23/2013 Tue 1,217 5.32 23 0 0 0 1 0 0 0 0
24 4/24/2013 Wed 1,253 5.45 24 0 0 0 0 1 0 0 0
25 4/25/2013 Thu 1,169 5.06 25 0 0 0 0 0 1 0 0
26 4/26/2013 Fri 1,216 5.01 26 0 0 0 0 0 0 1 0
27 4/27/2013 Sat 1,127 5.02 27 0 0 0 0 0 0 0 1
28 4/28/2013 Sun 693 5.04 28 1 0 0 0 0 0 0 0
29 4/29/2013 Mon 388 5.01 29 1 0 1 0 0 0 0 0
30 4/30/2013 Tue 305 5.01 30 1 0 0 1 0 0 0 0
31 5/1/2013 Wed 207 5.03 31 1 0 0 0 1 0 0 0
32 5/2/2013 Thu 612 4.97 32 1 0 0 0 0 1 0 0
33 5/3/2013 Fri 671 5.01 33 1 0 0 0 0 0 1 0
34 5/4/2013 Sat 1,151 5.04 34 1 0 0 0 0 0 0 1
35 5/5/2013 Sun 2,578 5.00 35 1 0 0 0 0 0 0 0
36 5/6/2013 Mon 2,364 5.01 36 1 0 1 0 0 0 0 0
37 5/7/2013 Tue 423 5.03 37 1 0 0 1 0 0 0 0
38 5/8/2013 Wed 388 5.04 38 1 0 0 0 1 0 0 0
39 5/9/2013 Thu 1,417 4.70 39 0 1 0 0 0 1 0 0
40 5/10/2013 Fri 1,607 4.59 40 0 1 0 0 0 0 1 0
41 5/11/2013 Sat 1,217 4.86 41 1 0 0 0 0 0 0 1
42 5/12/2013 Sun 545 5.12 42 1 0 0 0 0 0 0 0
43 5/13/2013 Mon 461 5.01 43 1 0 1 0 0 0 0 0
44 5/14/2013 Tue 358 4.97 44 1 0 0 1 0 0 0 0
45 5/15/2013 Wed 310 5.00 45 1 0 0 0 1 0 0 0
46 5/16/2013 Thu 925 4.63 46 1 0 0 0 0 1 0 0
47 5/17/2013 Fri 266 4.99 47 1 0 0 0 0 0 1 0
48 5/18/2013 Sat 183 5.15 48 0 0 0 0 0 0 0 1
49 5/19/2013 Sun 363 5.20 49 0 0 0 0 0 0 0 0
50 5/20/2013 Mon 5,469 4.99 50 1 0 1 0 0 0 0 0
51 5/21/2013 Tue 647 4.81 51 1 0 0 1 0 0 0 0
52 5/22/2013 Wed 421 4.97 52 1 0 0 0 1 0 0 0
53 5/23/2013 Thu 353 4.93 53 1 0 0 0 0 1 0 0
54 5/24/2013 Fri 375 4.95 54 1 0 0 0 0 0 1 0
55 5/25/2013 Sat 575 4.88 55 1 0 0 0 0 0 0 1
56 5/26/2013 Sun 707 4.92 56 0 0 0 0 0 0 0 0
57 5/27/2013 Mon 533 4.89 57 0 0 1 0 0 0 0 0
58 5/28/2013 Tue 641 4.66 58 0 0 0 1 0 0 0 0
59 5/29/2013 Wed 264 4.85 59 0 0 0 0 1 0 0 0
60 5/30/2013 Thu 186 5.74 60 1 0 0 0 0 1 0 0
61 5/31/2013 Fri 207 6.40 61 1 0 0 0 0 0 1 0
1) I'm not sure exactly what you mean here, but perhaps you are confused by the row names (numbers, in this case) that R has assigned to your data frame G. Assuming the data frame printed below your code is what G looks like, G$Units does indeed hold the data you're interested in modeling (note, however, that R is perhaps treating G$Units as character because of the commas in the numbers; you should remove those from your .csv file).
2) For modeling with auto.arima() (or arima() in base R), the univariate series does not need to be an actual ts object, so you don't really need to create GT. That said, the start and freq arguments to ts() can be a bit odd to figure out. In this case, you need to set freq=365 even though a year is technically a bit longer (i.e., GT <- ts(G$Units, start=c(2013,91), freq=365)).
3) No, you do not need to create a separate time series for xreg. In fact, you don't need to create factors for your promos/days because they are already coded as 0/1. Thus, something like X <- G[,-c(1,2,3,5)]; X$Price <- log(X$Price) would suffice. (Aside: why are you using Time as a covariate? There doesn't appear to be any trend in the data.)
4) Yes, log-transforming the (co)variates where you did is fine, but I'm curious as to why the price covariate needs to be log-transformed.
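Pulling those four points together, here is a minimal sketch (the column positions assume the sample data frame shown in the question, and as.matrix() is used because recent versions of forecast expect a numeric matrix for xreg):
library(forecast)

G       <- read.csv("SingleItemToyDataset.csv")
G$Units <- as.numeric(gsub(",", "", G$Units))   # point 1: strip thousands separators

GT <- ts(G$Units, start = c(2013, 91), frequency = 365)   # point 2: freq = 365

X       <- G[, -c(1, 2, 3, 5)]   # point 3: drop Date, Day, Units and Time
X$Price <- log(X$Price)          # point 4: log-transform the price covariate
X       <- as.matrix(X)          # pass xreg as a numeric matrix

Fit <- auto.arima(log(GT), xreg = X)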
Some process (or some threads) is hammering CPU0, as you can see in the output of mpstat 30 2:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 13 0 2 7 0 151 0 4250 99 1 0 0
1 114 0 2 197 84 5220 5 10 109 0 10518 30 2 0 67
2 79 0 1 184 83 5208 5 5 89 0 9788 30 2 0 68
3 67 0 1 181 84 5150 5 4 87 0 9510 30 2 0 69
4 53 0 3 171 72 12238 4 7 183 0 22214 3 3 0 94
5 43 0 3 135 7 218 2 6 16 0 162 0 1 0 99
6 110 0 2 172 79 4918 5 3 164 0 9553 34 2 0 64
7 120 0 1 180 80 4873 4 4 194 0 9494 32 2 0 66
8 53 0 1 23 2 28665 5 7 494 0 62023 12 9 0 79
9 43 0 0 34 2 21469 6 8 676 0 58090 10 13 0 77
10 59 0 1 210 2 33462 4 4 227 0 63500 7 16 0 78
11 93 0 2 16940 16627 1261 2 6 1027 0 2043 0 10 0 90
12 17 0 1 65 3 59 0 3 3 0 19 0 0 0 100
13 6 0 1 89 4 104 0 3 2 0 9 0 0 0 100
14 4 0 10 65 5 54 0 3 1 0 12 0 0 0 100
15 4 0 1 66 6 56 0 3 2 0 21 0 0 0 100
16 2 0 0 91 16 78 0 3 2 0 30 0 0 0 100
17 17 0 1 80 15 70 0 4 2 0 79 0 0 0 100
18 76 0 3 14946 14928 25 0 4 24 0 102 0 4 0 96
19 57 0 0 20 2 17 0 3 15 0 107 0 0 0 100
20 18 0 0 26 0 25 0 3 10 0 21 0 0 0 100
21 0 0 0 106 70 46 0 3 4 0 40 0 1 0 99
22 13 0 0 31 3 28 0 3 4 0 49 0 0 0 100
23 0 0 0 35 5 24 0 3 5 0 54 0 0 0 100
but with prstat -P0 I only see ndbmtd running with around 15% on CPU0:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
20028 root 77G 75G cpu0 40 0 8369:33:0 15% ndbmtd/44
660 root 6200K 3700K sleep 59 0 0:00:53 0.0% inetd/4
159 daemon 4540K 2408K sleep 59 0 0:00:09 0.0% kcfd/3
11 root 11M 10M sleep 59 0 0:00:58 0.0% svc.configd/15
Is there a way to show all processes and threads on CPU0?
To show all processes and threads (LWPs) on CPU0, combine the CPU filter with per-LWP reporting:
prstat -P0 -L
Here is (a small part of) a data frame "df" with:
11 variables "v1" to "v11",
and an index column "indx" (with 1 <= indx <= 11).
"indx" was obtained in a previous step on another data frame and was then merged into "df":
> df
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 indx
1 223 0 95 605 95 0 0 0 0 189 0 10
2 32 0 0 32 0 26 0 0 0 32 0 6
3 0 0 127 95 64 32 0 0 0 350 0 10
4 141 0 188 0 361 0 0 0 0 145 0 3
5 32 0 183 0 127 0 0 0 0 246 0 3
6 67 0 562 0 0 0 0 0 0 173 0 3
7 64 0 898 0 6 0 0 0 0 0 0 3
8 0 0 16 0 32 0 0 0 0 55 0 10
9 0 0 165 0 0 0 312 0 0 190 0 10
10 0 0 210 0 0 0 190 0 0 11 0 7
I need to build a new column "vsel" whose value is "v(indx)"
(that is, for the 1st row, vsel = 189 because indx = 10 and v10 = 189).
I successfully obtained this result using a "for" loop:
> df
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 indx vsel
1 223 0 95 605 95 0 0 0 0 189 0 10 189
2 32 0 0 32 0 26 0 0 0 32 0 6 26
3 0 0 127 95 64 32 0 0 0 350 0 10 350
4 141 0 188 0 361 0 0 0 0 145 0 3 188
5 32 0 183 0 127 0 0 0 0 246 0 3 183
6 67 0 562 0 0 0 0 0 0 173 0 3 562
7 64 0 898 0 6 0 0 0 0 0 0 3 898
8 0 0 16 0 32 0 0 0 0 55 0 10 55
9 0 0 165 0 0 0 312 0 0 190 0 10 190
10 0 0 210 0 0 0 190 0 0 11 0 7 190
The code is:
df$vsel = NA
for (i in seq(1:nrow(df)) )
{
r = df[i,]
ind = r$indx
df[i,"vsel"] = r[ind]
}
... I would like to avoid this loop (as it is rather slow when the data frame is big).
There is probably a faster, more R-like way:
maybe with apply(df, 1, ...)?
or ddply?
Thanks for any help…
Matrix indexing to the rescue! R has a way of doing exactly what you are describing.
It is simple and powerful but surprisingly little-known.
df$vsel <- df[cbind(1:nrow(df), df$indx)]
You can do it like this:
f <- function(i) df[i, df$indx[i]]
temp <- sapply(seq_len(nrow(df)), FUN = f)
cbind(df, vsel = temp)
Here's a fully vectorized solution that is hard to beat in terms of speed.
df$vsel <- as.matrix(df)[1:nrow(df) + nrow(df)*(df$indx-1)]
This exploits the fact that a matrix is stored internally as one long vector (column-wise): the 1:nrow(df) part supplies the row offset and nrow(df)*(df$indx-1) the column offset. This does not work if df has mixed data types, because as.matrix() would then turn everything into strings.
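To see how the vectorized answers compare on something larger, here is a rough benchmarking sketch; the data frame, its size and its column layout are made up purely for illustration:
# Build a larger all-numeric data frame with the same shape as the question's.
set.seed(1)
n   <- 1e5
big <- as.data.frame(matrix(sample(0:999, n * 11, replace = TRUE), ncol = 11))
names(big) <- paste0("v", 1:11)
big$indx   <- sample(1:11, n, replace = TRUE)

# Time the matrix-indexing and linear-indexing versions and check they agree.
system.time(r1 <- big[cbind(1:nrow(big), big$indx)])
system.time(r2 <- as.matrix(big)[1:nrow(big) + nrow(big) * (big$indx - 1)])
identical(r1, r2)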