I had thought that R had a standard overhead for storing objects (40 bytes, it seems, at least for integer vectors), but a simple test revealed that it's more complex than I realized. For instance, taking integer vectors up to length 100 (using random sampling, hoping to avoid any sneaky sequence-compression tricks that might be out there), I found that vectors of different lengths could have the same size, as follows:
> N = 100
> V = vector(length = 100)
> for(L in 1:N){
+ z = sample(N, L, replace = TRUE)
+ V[L] = object.size(z)
+ }
>
> options('width'=88)
> V
[1] 48 48 56 56 72 72 72 72 88 88 88 88 104 104 104 104 168 168 168 168
[21] 168 168 168 168 168 168 168 168 168 168 168 168 176 176 184 184 192 192 200 200
[41] 208 208 216 216 224 224 232 232 240 240 248 248 256 256 264 264 272 272 280 280
[61] 288 288 296 296 304 304 312 312 320 320 328 328 336 336 344 344 352 352 360 360
[81] 368 368 376 376 384 384 392 392 400 400 408 408 416 416 424 424 432 432 440 440
I'm most intrigued by the 168 value that shows up sixteen times (observation: 168 = 128 + 40, though 296 = 256 + 40 isn't nearly as prominent). Can someone explain how these allocations arise? I have been unable to find a clear definition in the documentation, though Vcells come up.
Even if you try N <- 10000, all values occur exactly twice, except for vectors of length:
5 to 8 (72 bytes)
9 to 12 (88 bytes)
13 to 16 (104 bytes)
17 to 32 (168 bytes)
The fact that each number of bytes occurs twice comes from the simple fact that memory is allocated in pieces of 8 bytes (referred to as Vcells in ?gc) while integers take only 4 bytes.
Next to that, the internal structure of objects in R makes a distinction between small and large vectors when allocating memory. Small vectors are allocated in bigger blocks of about 2Kb, whereas larger vectors are allocated individually. The "small" vectors fall into 6 defined classes, based on length, able to store vector data of up to 8, 16, 32, 48, 64 and 128 bytes. As an integer takes only 4 bytes, you can store 2, 4, 8, 12, 16 or 32 integers in these 6 classes. This explains the pattern you see.
The extra bytes are for the header (which accounts for the Ncells in ?gc). If you're really interested in all this, read through the R Internals manual.
And, as you guessed, the 40 extra bytes come from the header (the Ncells). It's in fact a bit more complicated than that, but the exact details can be found in the R Internals manual.
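Under these assumptions, the size classes can be turned into a small predictor and checked against the observed sizes. This is only a sketch based on a 40-byte header and the six small-vector classes described above, not an official API:

```r
# Predicted object.size for an integer vector of length n on 64-bit R.
# Assumptions: a 40-byte header, small-vector data classes of
# 8, 16, 32, 48, 64 and 128 bytes, and larger vectors rounded up to
# whole 8-byte Vcells (see the R Internals manual).
predict_int_size <- function(n) {
  data <- 4 * n                               # integers take 4 bytes each
  small_classes <- c(8, 16, 32, 48, 64, 128)  # small-vector data sizes
  if (data <= 128) {
    data <- small_classes[which(small_classes >= data)[1]]
  } else {
    data <- 8 * ceiling(data / 8)             # round up to whole Vcells
  }
  40 + data                                   # add the header
}
predict_int_size(17)   # 40 + 128 = 168, as in the output above
predict_int_size(100)  # 40 + 400 = 440
```

Lengths 17 to 32 all land in the 128-byte class, which is why 168 shows up sixteen times in the output.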
I have a sample .txt file; a snapshot of my data is below:
I want to import this .txt file into R. The first column contains 13 characters. For the first row, the first column should be "201701001 011", and 236, 240, 236 are the 2nd, 3rd, and 4th columns.
I tried the code below:
data <- read.table("<path>\\Sample.txt", sep = "\t")
But all variables are condensed into a single column. How should I separate them into different columns?
The reason that all variables are condensed into one is that there are no tabs in the input file. Instead try one of these.
1) read.fwf This file has fixed-width fields, so use read.fwf, specifying the field widths as the second argument. No packages are used.
u <- "https://raw.githubusercontent.com/Patricklv/Importing-.txt-file/master/Sample.txt"
widths <- c(13, rep(8, 9))
read.fwf(u, widths)
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 201701001 011 236 240 236 226 224 238 239 240 232
2 201701001 111 299 285 237 252 227 249 237 233 238
3 201701001 211 287 292 296 230 237 234 235 254 251
4 201701001 311 286 287 311 283 237 240 226 240 246
5 201701001 411 270 273 282 318 277 243 248 236 243
6 201701001 511 279 276 284 280 305 285 262 249 241
7 201701001 611 288 284 286 281 272 299 284 257 238
8 201701001 711 293 290 292 284 269 278 298 282 257
9 201701001 811 314 305 290 298 267 265 282 292 277
10 201701001 911 314 310 310 295 288 270 261 292 292
11 2017010011011 308 311 321 309 281 277 270 250 301
12 2017010011111 325 319 312 332 303 287 294 275 254
It seems easy enough to count the fields by hand, as we have done above, but it could also be done automatically from the first line of data, L1, by locating the field ends, ends, which occur either at a digit followed by at least two spaces (\\d  ) or (|) at a digit followed by the end of the line (\\d$). It is important to require at least two spaces, since a single space can appear within the first field. Finally, the field widths, widths, are the first component of ends followed by the differences of successive positions in ends.
L1 <- readLines(u, 1)
ends <- gregexpr("\\d  |\\d$", L1)[[1]]
widths <- c(ends[1], diff(ends))
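The width detection can be exercised without downloading the file by running the same two lines on a literal first record. The sample line below is reconstructed from the printed output above (a 13-character first field, then nine right-aligned 8-character fields), so treat its exact spacing as an assumption:

```r
# A reconstructed first record: "201701001 011" (13 chars) followed by
# nine 8-character right-aligned numeric fields.
L1 <- paste0("201701001 011",
             paste0("     ", c(236, 240, 236, 226, 224, 238, 239, 240, 232),
                    collapse = ""))
# Field ends: a digit followed by two spaces, or a digit at end of line.
ends <- gregexpr("\\d  |\\d$", L1)[[1]]
widths <- c(ends[1], diff(ends))
widths
# [1] 13  8  8  8  8  8  8  8  8  8
```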
2) This is an alternative. Since a single space can appear in the first field and all real separators consist of at least two spaces, we can read in the file, replace every run of 2 or more spaces with a comma, and then read that using a comma separator. u is from above. This is a bit longer, but it is still only one line and eliminates the need to count field widths. No packages are used.
read.table(text = gsub("  +", ",", readLines(u)), sep = ",")
3) Another alternative is based on the fact that we already know from the question that the first field is 13 characters and that the remaining fields are well separated by spaces, so we can pick off the first field and cbind it to the rest, re-reading the remainder using read.table. Again, no packages are used.
L <- readLines(u)
cbind(V0 = substring(L, 1, 13), read.table(text = substring(L, 14)))
Use read_table from package readr:
df <- readr::read_table("https://raw.githubusercontent.com/Patricklv/Importing-.txt-file/master/Sample.txt", col_names = FALSE)
I am getting an error that I don't really understand at all. I was just messing around with generating some sequences, and I came across this problem:
This should create a sequence of 50 numbers.
seq.int(from=1,to=1000,by=5,length.out=50)
But if I enter this in the console I get the error message:
Error in seq.int(from = 1, to = 1000, by = 5, length.out = 50) :
too many arguments
If I look at the help (?seq), the Usage section contains this line, which makes it seem as though I called the function correctly and that it allows this many arguments:
seq.int(from, to, by, length.out, along.with, ...)
So what the heck is going on? Am I missing something fundamental, or are the docs out of date?
NOTE
The arguments I am providing to the function in the code sample are just for sake of example. I'm not trying to solve a particular problem, just curious as to why I get the error.
It's not clear what you expect as output from this line of code, and you're getting an error because R doesn't want to resolve the contradictions for you.
Here is some valid output, and the line of code you'd use to achieve each. This is a case where you need to decide for yourself which approach to use given the task you have in mind:
Override length.out
[1] 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86
...
[199] 991 996
#via:
seq.int(from=1,to=1000,by=5)
Override by
[1] 1.00000 21.38776 41.77551 62.16327 82.55102 102.93878 123.32653
[8] 143.71429 164.10204 184.48980 204.87755 225.26531 245.65306 266.04082
[15] 286.42857 306.81633 327.20408 347.59184 367.97959 388.36735 408.75510
[22] 429.14286 449.53061 469.91837 490.30612 510.69388 531.08163 551.46939
[29] 571.85714 592.24490 612.63265 633.02041 653.40816 673.79592 694.18367
[36] 714.57143 734.95918 755.34694 775.73469 796.12245 816.51020 836.89796
[43] 857.28571 877.67347 898.06122 918.44898 938.83673 959.22449 979.61224
[50] 1000.00000
#via:
seq.int(from=1,to=1000,length.out=50)
Override to
[1] 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 101
[22] 106 111 116 121 126 131 136 141 146 151 156 161 166 171 176 181 186 191 196 201 206
[43] 211 216 221 226 231 236 241 246
#via:
seq.int(from=1,by=5,length.out=50)
Override from
[1] 755 760 765 770 775 780 785 790 795 800 805 810 815 820 825 830 835 840
[19] 845 850 855 860 865 870 875 880 885 890 895 900 905 910 915 920 925 930
[37] 935 940 945 950 955 960 965 970 975 980 985 990 995 1000
#via:
seq.int(to=1000,by=5,length.out=50)
A priori, R has no way of telling which of the above you'd like, nor should it. You as programmer need to decide which inputs take precedence.
And you're right that this should be documented; for now, take a look at the source of .Primitive("seq.int"), as linked originally by @nongkrong.
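One way to see the contradiction concretely: from, to, and by already determine a length, and it need not match length.out. A minimal check with the same numbers as in the question:

```r
# from = 1, to = 1000, by = 5 already implies a sequence of 200 values,
# which contradicts the requested length.out = 50.
implied_length <- length(seq.int(from = 1, to = 1000, by = 5))
implied_length
# [1] 200
```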
No, there is nothing fundamental to the R language that I was missing. The problem is that the documentation, at least at the time of writing, is misleading and/or incorrect.
I am trying to create sequences of 6 consecutive cases, with the sub-sequences starting 144 cases apart.
Like this one for example
c(1:6, 144:149, 288:293)
1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
How could I automatically generate such a sequence with seq or another function?
I find the sequence function to be helpful in this case. If you had your data in a structure like this:
(info <- data.frame(start=c(1, 144, 288), len=c(6, 6, 6)))
# start len
# 1 1 6
# 2 144 6
# 3 288 6
then you could do this in one line with:
sequence(info$len) + rep(info$start-1, info$len)
# [1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
Note that this solution works even if the sequences you're combining are different lengths.
Here's one approach:
unlist(lapply(c(0L,(1:2)*144L-1L),`+`,seq_len(6)))
# or...
unlist(lapply(c(1L,(1:2)*144L),function(x)seq(x,x+5)))
Here's a way I like a little better:
rep(c(0L,(1:2)*144L-1L),each=6) + seq_len(6)
Generalizing...
rlen <- 6L
rgap <- 144L
rnum <- 3L
starters <- c(0L,seq_len(rnum-1L)*rgap-1L)
rep(starters, each=rlen) + seq_len(rlen)
# or...
unlist(lapply(starters+1L,function(x)seq(x,x+rlen-1L)))
This can also be done using seq or seq.int
x = c(1, 144, 288)
c(sapply(x, function(y) seq.int(y, length.out = 6)))
#[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
As @Frank mentioned in the comments, here is another way to achieve this using @josilber's data structure (this is particularly useful when different intervals need different sequence lengths):
c(with(info, mapply(seq.int, start, length.out=len)))
#[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
From R >= 4.0.0, you can now do this in one line with sequence:
sequence(c(6,6,6), from = c(1,144,288))
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
The first argument, nvec, is the length of each sequence; the second, from, is the starting point for each sequence.
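Unlike the rep-based approaches above, this form also copes with sub-sequences of different lengths; a small sketch (requires R >= 4.0.0, and the numbers here are made up purely for illustration):

```r
# Three runs of different lengths, each with its own starting point.
sequence(c(3, 2, 4), from = c(1, 10, 100))
# [1]   1   2   3  10  11 100 101 102 103
```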
As a function, with n being the number of intervals you want:
f <- function(n) sequence(rep(6,n), from = c(1,144*1:(n-1)))
f(3)
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
I am using R 3.3.2 on OS X 10.9.4.
I tried:
a<-c() # stores expected sequence
f<-288 # starting number of final sub-sequence
it<-144 # interval
for (d in seq(0,f,by=it))
{
if (d==0)
{
d=1
}
a<-c(a, seq(d,d+5))
print(d)
}
print(a)
And the expected sequence is stored in a:
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
And another try:
a<-c() # stores expected sequence
it<-144 # interval
lo<-4 # number of sub-sequences
for (d in seq(0,by=it, length.out = lo))
{
if (d==0)
{
d=1
}
a<-c(a, seq(d,d+5))
print(d)
}
print(a)
The result:
[1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293 432 433 434 435 436 437
I tackled this with the cumsum function:
seq_n <- 3 # number of sequences
rep(1:6, seq_n) + rep(c(0, cumsum(rep(144, seq_n-1))-1), each = 6)
# [1] 1 2 3 4 5 6 144 145 146 147 148 149 288 289 290 291 292 293
There is no need to calculate the starting values of the sequences as in @josilber's solution, but the length of each sequence has to be constant.
I am in the process of implementing a few algorithms for cluster analysis, especially cluster validation. There are a few approaches, such as cross-validation, external indices, internal indices, and relative indices. I am trying to implement an algorithm that falls under internal indices.
Internal index - Based on the intrinsic content of the data. It is used to measure the goodness of a clustering structure without respect to external information.
My interest is the Silhouette Coefficient:
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
To make it clearer, let's assume I have the following multi-modal distribution:
library(mixtools)
wait = faithful$waiting
mixmdl = normalmixEM(wait)
plot(mixmdl,which=2)
lines(density(wait), lty=2, lwd=2)
We see that there are two clusters and that the cut-off mark is around 68. There are no labels here, so there is no ground truth for cross-validation (it's unsupervised). So we need a mechanism to evaluate the clusters. In this case we know from the visualization that there are two clusters, but how do we clearly show that the two distributions actually belong to separate clusters? Based on what I read on Wikipedia, the Silhouette gives us that validation.
I want to implement a method (which implements the Silhouette) that takes an R list of values (in my example wait), the number of clusters (in this case 2), and the model, and returns the average s(i).
I have started but can't really figure out how to go forward:
Silhouette = function(rList, num_clusters, model) {
}
A summary of my list looks like this:
Length Class Mode
clust_A 416014 -none- numeric
clust_B 72737 -none- numeric
clust_C 6078 -none- numeric
myList$clust_A will return the points that belong to that cluster:
[1] 13 880 497 1864 392 55 1130 248 437 37 62 153 60 117
[15] 22 106 71 1026 446 1558 23 56 287 402 46 1506 115 2700
[29] 67 134 48 536 41 506 1098 33 30 280 225 16 25 17
[43] 63 1762 477 174 98 76 157 698 47 312 40 3 198 621
[57] 15 34 226 657 48 110 23 250 14 32 137 272 26 257
[71] 270 133 1734 78 134 8 5 225 187 166 35 15 94 2825
[85] 2 8 94 89 54 91 77 17 106 1397 16 25 16 103
The problem is that I don't think the existing libraries accept this type of data structure.
Silhouette assumes that all clusters have the same variance.
IMHO, it does not make sense to use this measure with EM clustering.
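If you still want to compute it, the silhouette() function in the recommended cluster package takes a vector of hard cluster labels plus a dissimilarity object. One sketch is to hard-assign points at the cut-off of about 68 noted in the question; the threshold and the label coding here are assumptions for illustration, not output of the EM fit:

```r
library(cluster)  # ships with R as a recommended package

wait <- faithful$waiting
# Hard-assign each point to one of the two clusters at the ~68 cut-off.
labels <- ifelse(wait < 68, 1L, 2L)
# silhouette() wants integer labels plus a dissimilarity object.
sil <- silhouette(labels, dist(wait))
mean(sil[, "sil_width"])  # the average s(i)
```

For a soft model such as the one from normalmixEM, you would first harden the assignments, e.g. labels <- apply(mixmdl$posterior, 1, which.max), before computing the silhouette.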
My objective is to list the drift coefficients from a random-walk-with-drift forecast function applied to a set of historical data (below). Specifically, I am trying to gather the drift coefficient from the model fitted to the first year alone, then cumulatively up to the last year, recording the coefficient at each step (into a list, if that is appropriate). To be clear, each new random-walk forecast includes all the previous years.
The data is a list of 241 consumption levels, and I am attempting to discern how the drift coefficient changes as I iteratively progress from n = 1 to n = 241.
For example, the random walk with drift model is Y[t] = c + Y[t-1] + Z[t], where Z[t] is a normal error and c is the coefficient I am looking for. My current attempts involve a for loop and extracting the coefficient c from the rwf() function in the forecast package in R.
To extract this, I am doing as such
rwf(x, h = 1, drift = TRUE)$model[[1]]
which extracts the drift coefficient.
The problem is that my attempts at subsetting the data within the rwf call have failed, and from trial and error and research I don't believe rwf() supports a subset argument the way an lm model does, for example. My attempts at looping the function have therefore also failed.
An example of such code is
for (i in 1:5){print((rwf(x[1:i], h = 1, drift = TRUE))$model[[1]])}
which gives me the following error
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In is.na(rows) : is.na() applied to non-(list or vector) of type 'NULL'
Any help would be much appreciated.
I read SO a lot for help but this is my first time asking a question.
The data is as follows
PCE
1 1306.7
2 1309.6
3 1335.3
4 1341.8
5 1389.2
6 1405.7
7 1414.2
8 1411.0
9 1401.6
10 1406.7
11 1425.0
12 1444.4
13 1474.7
14 1507.8
15 1536.6
16 1555.6
17 1575.2
18 1577.8
19 1583.0
20 1586.6
21 1608.4
22 1619.5
23 1622.4
24 1635.3
25 1636.1
26 1613.9
27 1627.1
28 1653.8
29 1675.6
30 1706.7
31 1732.9
32 1751.0
33 1752.9
34 1769.7
35 1792.1
36 1785.0
37 1787.4
38 1786.9
39 1813.4
40 1822.2
41 1858.7
42 1878.5
43 1901.6
44 1917.0
45 1944.2
46 1957.3
47 1976.0
48 2002.9
49 2019.6
50 2059.5
51 2095.8
52 2134.3
53 2140.2
54 2187.8
55 2212.0
56 2250.0
57 2313.2
58 2347.4
59 2353.5
60 2380.4
61 2390.3
62 2404.2
63 2437.0
64 2449.5
65 2464.6
66 2523.4
67 2562.1
68 2610.3
69 2622.3
70 2651.7
71 2668.6
72 2681.5
73 2702.9
74 2719.5
75 2731.9
76 2755.9
77 2748.4
78 2800.9
79 2826.6
80 2849.1
81 2896.5
82 2935.2
83 2991.2
84 3037.4
85 3108.6
86 3165.5
87 3163.9
88 3175.3
89 3166.0
90 3138.3
91 3149.2
92 3162.2
93 3115.8
94 3142.0
95 3194.4
96 3239.9
97 3274.2
98 3339.6
99 3370.3
100 3405.9
101 3450.3
102 3489.7
103 3509.0
104 3542.5
105 3595.9
106 3616.9
107 3694.2
108 3709.7
109 3739.6
110 3758.5
111 3756.3
112 3793.2
113 3803.3
114 3796.7
115 3710.5
116 3750.3
117 3800.3
118 3821.1
119 3821.1
120 3836.6
121 3807.6
122 3832.2
123 3845.9
124 3875.4
125 3946.1
126 3984.8
127 4063.9
128 4135.7
129 4201.3
130 4237.3
131 4297.9
132 4331.1
133 4388.1
134 4462.5
135 4503.2
136 4588.7
137 4598.8
138 4637.2
139 4686.6
140 4768.5
141 4797.2
142 4789.9
143 4854.0
144 4908.2
145 4920.0
146 5002.2
147 5038.5
148 5078.3
149 5138.1
150 5156.9
151 5180.0
152 5233.7
153 5259.3
154 5300.9
155 5318.4
156 5338.6
157 5297.0
158 5282.0
159 5322.2
160 5342.6
161 5340.2
162 5432.0
163 5464.2
164 5524.6
165 5592.0
166 5614.7
167 5668.6
168 5730.1
169 5781.1
170 5845.5
171 5888.8
172 5936.0
173 5994.6
174 6001.6
175 6050.8
176 6104.9
177 6147.8
178 6204.0
179 6274.2
180 6311.8
181 6363.2
182 6427.3
183 6453.3
184 6563.0
185 6638.1
186 6704.1
187 6819.5
188 6909.9
189 7015.9
190 7085.1
191 7196.6
192 7283.1
193 7385.8
194 7497.8
195 7568.3
196 7642.4
197 7710.0
198 7740.8
199 7770.0
200 7804.2
201 7926.4
202 7953.7
203 7994.1
204 8048.3
205 8076.9
206 8117.7
207 8198.1
208 8308.5
209 8353.7
210 8427.6
211 8465.1
212 8539.1
213 8631.3
214 8700.1
215 8786.2
216 8852.9
217 8874.9
218 8965.8
219 9019.8
220 9073.9
221 9158.3
222 9209.2
223 9244.5
224 9285.2
225 9312.6
226 9289.1
227 9285.8
228 9196.0
229 9076.0
230 9040.9
231 8998.5
232 9050.3
233 9060.2
234 9121.2
235 9186.9
236 9247.1
237 9328.4
238 9376.7
239 9392.7
240 9433.5
241 9482.1
You need at least two points to fit your model (the loop in the question starts at i = 1, i.e. a single observation, which is what triggers the "0 (non-NA) cases" error). Here's how I'd approach the problem after reading your data into a data.frame named x:
library(forecast)
drifts <- sapply(2:nrow(x), function(zz) rwf(x[1:zz,], drift = TRUE)$model$drift)
I'm not sure if this is what you were expecting or not, but plotting the drift values (e.g. plot(drifts, type = "l")) shows how the coefficient evolves as observations are added.
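If you would rather avoid refitting rwf() at every step: for a random walk with drift, Y[t] = c + Y[t-1] + Z[t], the estimated drift is simply the mean of the first differences of the series, so the same sequence of coefficients can be traced in base R. A sketch on the first few PCE values (assumption: x holds the numeric series from the question):

```r
x <- c(1306.7, 1309.6, 1335.3, 1341.8, 1389.2)  # first five PCE values
# The drift fitted on the first n points is the mean first difference,
# which telescopes to (x[n] - x[1]) / (n - 1).
drifts <- sapply(2:length(x), function(n) mean(diff(x[1:n])))
drifts
# [1]  2.900 14.300 11.700 20.625
```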