Number in results from R

In R, when I press return for a line of code (for example, a histogram), what does the [1] that comes up in the results mean?
If there's another line of output, it starts with [18], then [35].

The numbers that you see in the console in the situation you describe are the indices of the first element printed on each line.
1:20
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
# [13] 13 14 15 16 17 18 19 20
How many values are displayed per line depends, by default, on the width of the console (at least in RStudio).
The value printed above is a numeric vector of length 20. A single number is technically also a numeric vector, just of length 1; R has no separate scalar type, so when you print a single value the [1] still shows.
42
# [1] 42
It's not like this for everything: there is, for example, no vector of functions of length 2; c(mean, median) is a list (containing functions). But it does work like this for the atomic modes (see ?atomic) and usually for the classes built on them.
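We can verify that quickly:
mode(c(mean, median))
# [1] "list"
length(c(mean, median))
# [1] 2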
You might not always see these numbers on other objects, because what is shown depends on which print method is called, which in turn depends on the class.
library(glue)
glue("a")
# a # <- we don't see [1]!
mode(glue("a"))
# character
class(glue("a"))
# [1] "glue" "character"
The print method that is called when typing print(1:20) is print.default; it can be overridden to avoid displaying the bracketed numbers:
print.default <- function(x, ...) cat(x, "\n")  # masks base::print.default
print(1:20)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
rm(print.default) # necessary cleanup!
The autoprint (what you get when not calling print explicitly) won't change, however, as auto-printing only involves method dispatch for explicit classes (objects with a class attribute).
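To illustrate (using a made-up class of my own, not from the question): once an object carries an explicit class attribute, auto-printing dispatches to its print method and the [1] can disappear:
x <- structure(1:20, class = "quiet")
print.quiet <- function(x, ...) cat(unclass(x), "\n")
x
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  <- print.quiet was dispatched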
Type methods(print) to see all the available methods.


Difference between df["speed"] and df$speed

Suppose a data frame df has a column speed, then what is difference in the way accessing the column like so:
df["speed"]
or like so:
df$speed
The following calculates the mean value correctly:
lapply(df["speed"], mean)
But this prints all values under the column speed:
lapply(df$speed, mean)
There are two elements to the question in the OP. The first was addressed in the comments: df["speed"] is an object of class data.frame, whereas df$speed is a numeric vector. We can see this via the str() function.
We'll illustrate this with Ezekiel's 1930 analysis of speed and stopping distance, the cars data set from the datasets package.
> library(datasets)
> data(cars)
>
> str(cars["speed"])
'data.frame': 50 obs. of 1 variable:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
> str(cars$speed)
num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
>
The second element that was not addressed in the comments is that lapply() behaves differently when passed a vector versus a list().
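A quick check (not in the original answer) shows the distinction lapply() sees:
> is.list(cars["speed"])
[1] TRUE
> is.list(cars$speed)
[1] FALSE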
With a vector, lapply() processes each element in the vector independently, producing unexpected results for a function such as mean().
> unlist(lapply(cars$speed,mean))
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
What happened?
Since each element of cars$speed is processed by mean() independently, lapply() returns a list of 50 means of 1 number each: the original elements in the cars$speed vector.
Processing a list with lapply()
With a list, each element of the list is processed independently. We can calculate how many items will be processed by lapply() with the length() function.
> length(cars["speed"])
[1] 1
>
Since cars["speed"] is a data frame, and a data frame is also a list() whose elements are its columns, the length() function returns the value 1. Therefore, when it is processed by lapply(), a single mean is calculated, not one per row of the speed column.
> lapply(cars["speed"],mean)
$speed
[1] 15.4
>
If we pass the entire cars data frame as the input object for lapply(), we obtain one mean per column in the data frame, since both variables in the data frame are numeric.
> lapply(cars,mean)
$speed
[1] 15.4
$dist
[1] 42.98
>
A theoretical perspective
The differing behaviors of lapply() are explained by the fact that R is an object-oriented language. In fact, John Chambers, creator of the S language on which R is based, once said:
In R, two slogans are helpful.
-- Everything that exists is an object, and
-- Everything that happens is a function call.
John Chambers, quoted in Advanced R, p. 79.
The fact that lapply() works differently on a data frame than on a vector is an illustration of the object-oriented feature of polymorphism, where the same behavior is implemented in different ways for different types of objects.
While this looks like a beginner's question, I think it's worth answering, since many beginners could have a similar question and a guide to the corresponding documentation is helpful IMHO.
No up-votes please - I am just collecting the comment fragments from the question that contribute to the answer - feel free to edit this answer.
A data.frame is a list of vectors with the same length (number of elements). Please read the help in the R console (by typing ?data.frame)
The $ operator is implemented by returning one column as a vector (?"$.data.frame").
lapply applies a function to each element of a list (see ?lapply). If the first parameter X is an atomic vector (integer, double, ...) with multiple elements, each element of the vector is converted ("coerced") into a separate list element (the same as as.list(1:26)).
Examples:
x <- data.frame(a = LETTERS, b = 1:26, stringsAsFactors = FALSE)
b.vector <- x$b
b.data.frame <- x["b"]
class(b.vector) # integer
class(b.data.frame) # data.frame
lapply(b.vector, mean)
# each element of the input vector is coerced into a separate list
# element (same as `lapply(as.list(1:26), mean)`), so the result is a
# list of 26 means of one number each:
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
# ... up to list element 26
lapply(b.data.frame, mean)
# the data frame is a list with a single element (the column b),
# so the result is a single mean:
# $b
# [1] 13.5
So IMHO your original question can be reduced to: why does lapply behave differently when the first parameter is an atomic vector rather than a list?
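A one-line check of this coercion (my addition, reusing the x defined above):
identical(lapply(x$b, mean), lapply(as.list(x$b), mean))
# [1] TRUE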

shuffle values of factor in table [closed]

I am using a package for a GLM which reads a data frame in chunks. It requires that all levels of a factor occur in every chunk. I am looking for a nice strategy to rearrange observations so as to maximise the probability that all levels occur in every chunk.
The example would be
c(4,7,4,4,4,4,4,4,4,4,4,7,4,4,8,8,5,5)
for a size of the chunk of 8 the best rearrangement would be
c(4,7,5,8,4,4,4,4,4,4,4,7,4,4,8,8,4,5,8)
Is there some elegant way to shuffle the data around?
I just saw the comments. The library itself is called bigglm (it reads data chunkwise). The vectors should be of equal length. The question is really just about rearranging the data so that most levels are present in most chunks.
An example for the column of the data frame can be found here:
(https://www.dropbox.com/s/cth8kwcq9ph5j0p/d1.RData?dl=0)
The most important thing in this case is that as many levels as possible are present in as many chunks as possible. The smaller the chunk, the less memory is needed when reading in. I think 10 chunks would be a good starting assumption.
I think I understand what you are asking for, though admittedly I am not familiar with the function that reads data in by chunks, uses stringsAsFactors = TRUE, and makes a priori assumptions about the makeup of the data (without offering a way to superimpose other characteristics of the factors). I offer in advance the suggestion that either you are misinterpreting the function or you are mis-applying it to your specific data problem.
I'm easily wrong in problems like this, so I'll try to address the inferred problem regardless.
You say that the function will read in the first 8 elements, on which it does its processing. It must know that there are (in this case) four factor levels to be considered; the easiest way, as you are asking, is to have each of these levels present in each chunk. Once it has processed these first 8 rows, it will then read the second 8 elements. In the case of your sample data, this does not work, since the second 8 elements do not include a 5.
I'll define slightly augmented data later to remedy this.
Assumptions / Rules
- the number of unique values overall in the data must be no larger than the size of each chunk;
- each factor must have at least as many occurrences as the number of chunks to be read; and
- all chunks have precisely chunksize elements in them (i.e., full), except the last chunk, which will have between 1 and chunksize elements; ergo,
- the last chunk has at least as many elements as there are unique values.
Function Definition
Given those rules, here's some code. This is most certainly not the only solution, and it may not perform well with significantly large datasets (I have not done extensive testing).
myfunc <- function(x, chunksize = 8) {
  numChunks <- ceiling(length(x) / chunksize)
  uniqx <- unique(x)
  lastChunkSize <- chunksize * (1 - numChunks) + length(x)
  ## check to see if it is mathematically possible
  if (length(uniqx) > chunksize)
    stop('more factors than can fit in one chunk')
  if (any(table(x) < numChunks))
    stop('not enough of at least one factor to cover all chunks')
  if (lastChunkSize < length(uniqx))
    stop('last chunk will not have all factors')
  ## actually arrange things in one feasible permutation:
  ## the indices at which each unique value occurs
  allIndices <- lapply(uniqx, function(z) which(z == x))
  ## fill one of each unique x into each chunk
  chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
  ## everything not yet assigned to a chunk ...
  remainder <- unlist(sapply(allIndices, tail, n = -numChunks))
  ## ... pads the chunks back up to chunksize
  remainderCut <- split(remainder,
                        ceiling(seq_along(remainder) / (chunksize - length(uniqx))))
  ## combine them all together, wary of empty lists
  finalIndices <- sapply(1:numChunks,
                         function(i) {
                           if (i <= length(remainderCut))
                             c(chunks[[i]], remainderCut[[i]])
                           else
                             chunks[[i]]
                         })
  x[unlist(finalIndices)]
}
Supporting Execution
In your offered data, you have 18 elements, requiring three chunks. Your data will fail on two counts: three of the levels occur only twice, so the third chunk most certainly cannot contain all of them; and your last chunk will have only two elements, which cannot contain each of the four.
I'll augment your data to satisfy both misses, with:
dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)
which will not work unadjusted, if for no other reason than that the last chunk would contain only four 5's.
The solution:
myfunc(dat3, chunksize = 8)
## [1] 4 7 5 8 4 4 4 4 4 7 5 8 4 4 5 5 4 7 5 8
(Spaces were added to the output for easy inspection.) Each chunk has 4, 7, 5, 8 as its first four elements, therefore all factor levels are covered in each chunk.
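As a quick sanity check (my addition, not part of the original answer), we can verify that every chunk contains every level:
res <- myfunc(dat3, chunksize = 8)
chunkId <- ceiling(seq_along(res) / 8)
all(sapply(split(res, chunkId), function(ch) all(unique(dat3) %in% ch)))
## [1] TRUE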
Breakdown
A quick walkthrough (using debug(myfunc)), assuming x = dat3 and chunksize = 8. Jumping down the code:
## Browse[2]> uniqx
## [1] 4 7 5 8
## Browse[2]> allIndices
## [[1]]
## [1] 1 6 7 8 9 10 11 13 14
## [[2]]
## [1] 2 4 12
## [[3]]
## [1] 3 17 18 19 20
## [[4]]
## [1] 5 15 16
This shows the indices for each unique element. For example, there are 4's located at indices 1, 6, 7, etc.
## Browse[2]> chunks
## [[1]]
## [1] 1 2 3 5
## [[2]]
## [1] 6 4 17 15
## [[3]]
## [1] 7 12 18 16
There are three chunks to be filled, and this list starts forming those chunks. In this example, we have placed indices 1, 2, 3, and 5 in the first chunk. Looking back at allIndices, you'll see that these represent the first instance of each of uniqx, so the first chunk now contains c(4, 7, 5, 8), as do the other two chunks.
At this point, we have satisfied the basic requirement that each unique element be found in every chunk. The rest of the code fills with the remaining elements.
## Browse[2]> remainder
## [1] 8 9 10 11 13 14 19 20
These are all indices that have so far not been added to the chunks.
## Browse[2]> remainderCut
## $`1`
## [1] 8 9 10 11
## $`2`
## [1] 13 14 19 20
Though we have three chunks, we only have two lists here. This is fine; we have nothing (and need nothing) to add to the last chunk. We will then zip-merge these with chunks to form a list of index lists. (Note: you might be tempted to try mapply(function(a, b) c(a, b), chunks, remainderCut), but you may notice that if remainderCut is not the same size as chunks, as we see here, then its values are recycled. Not acceptable. Try it.)
## Browse[2]> finalIndices
## [[1]]
## [1] 1 2 3 5 8 9 10 11
## [[2]]
## [1] 6 4 17 15 13 14 19 20
## [[3]]
## [1] 7 12 18 16
Remember, each number represents the index from within x (originally dat3). We then unlist this split-vector and apply the indices to the data.

Understanding Dynamic Time Warping

We want to use the dtw library for R in order to shrink and expand certain time series data to a standard length.
Consider three time series with equivalent columns: moref has length (rows) 105, mobig 130, and mosmall 100. We want to project mobig and mosmall to a length of 105.
moref <- good_list[[2]]
mobig <- good_list[[1]]
mosmall <- good_list[[3]]
Therefore, we compute two alignments.
ali1 <- dtw(mobig, moref)
ali2 <- dtw(mosmall, moref)
If we print out the alignments the result is:
DTW alignment object
Alignment size (query x reference): 130 x 105
Call: dtw(x = mobig, y = moref)
DTW alignment object
Alignment size (query x reference): 100 x 105
Call: dtw(x = mosmall, y = moref)
So, is this exactly what we want? From my understanding, we need to use the warping functions ali1$index1 or ali1$index2 in order to shrink or expand the time series. However, if we invoke the following commands
length(ali1$index1)
length(ali2$index1)
length(ali1$index2)
length(ali2$index2)
the result is
[1] 198
[1] 162
[1] 198
[1] 162
These are vectors of indices (probably referring to other vectors). Which one of these can we use for the mapping? Aren't they all too long?
First of all, we need to agree that index1 and index2 are two vectors of the same length that map query/input data to reference/stored data and vice versa.
Since you did not share any data, here is some dummy data to give people an idea.
# Reference data is the template that we use as reference.
# say perfect pronunciation from CNN
data_reference <- 1:10
# Query data is the input data that we want to map to our reference
# say random youtube audio
data_query <- seq(1,10,0.5) + rnorm(19)
library(dtw)
alignment <- dtw(x=data_query, y=data_reference, keep=TRUE)
alignment$index1
alignment$index2
lcm <- alignment$costMatrix
image(x=1:nrow(lcm), y=1:ncol(lcm), lcm)
plot(alignment, type="threeway")
Here are the outputs:
> alignment$index1
[1] 1 2 3 4 5 6 7 7 8 9 10 11 12 13 13 14 14 15 16 17 18 19
> alignment$index2
[1] 1 1 1 2 2 3 3 4 5 6 6 6 6 6 7 8 9 9 9 9 10 10
So basically, the mapping from index1 to index2 is how to map the input data to the reference data;
for example, the 10th data point of the input has been matched to the 6th data point of the template.
index1: Warping function φx(k) for the query
index2: Warping function φy(k) for the reference
-- Toni Giorgino
Per your question, "what is the deal with the length of the index": since it is basically the coordinates of the optimal path, it could be as long as m+n (a really shallow path) or as short as max(m,n) (a perfect diagonal). Clearly, it is not a one-to-one mapping, which might bother people a little bit; I guess you can do more research from here on how to pick the mapping you want.
I don't know if there is built-in functionality to pick the best one-to-one mapping, but here is one way.
library(plyr)
mapping <- data.frame(index1=alignment$index1, index2=alignment$index2)
mapping <- ddply(mapping, .(index1), summarize, index2_new = max(index2))
Now mapping contains a one-to-one mapping from query to reference (for each query index, the largest matched reference index). Then you can map the query to the reference and scale the mapped input in whatever way you want.
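For the OP's stated goal of rescaling a series to the reference's length, here is one hedged sketch (my addition, not from the original answer; m2 and query_warped are illustrative names): group by reference index instead, keeping one query point per reference point.
m2 <- ddply(data.frame(index1 = alignment$index1,
                       index2 = alignment$index2),
            .(index2), summarize, index1_new = max(index1))
query_warped <- data_query[m2$index1_new]
length(query_warped)
## [1] 10   # same length as data_reference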
I am not exactly sure about the content below the line, and anyone is more than welcome to improve how the mapping and scaling should work.

convolution of positively supported functions in R

I want the convolution of two functions defined on [0,Inf), say
f = function(x)
  (1 + 0.5*cos(2*pi*x)) * (x >= 0)
and
g = function(x)
  exp(-2*x) * (x > 0)
Using the integrate function of R I can do this,
cfg = function(x)
  integrate(function(y) f(y)*g(x - y), 0, x)$value
By searching the web, it seems that there are more efficient (and more accurate) ways of doing this (say using fft() or convolve()). Can anyone with such experiences explain how please?
Thanks!
convolve or fft solutions give a discrete result rather than a function like your cfg; they can give you the numeric values of cfg on some regular, discrete grid of inputs.
fft is for periodic functions (only), so that is not going to help here. However, convolve has a mode of operation called "open" which emulates the operation performed by cfg.
Note that with type="open" you must reverse the second sequence (see ?convolve, "Details"). You also have to use only the first half of the result. Here is a pictorial example of the convolution of c(2,3,5) with c(7,11,13) as performed by convolve(c(2,3,5), rev(c(7,11,13)), type='open'):
               2   3   5
      13  11   7               ->                  2*7 = 14
          13  11   7           ->           2*11 + 3*7 = 43
              13  11   7       ->    2*13 + 3*11 + 5*7 = 94
                  13  11   7   ->          3*13 + 5*11 = 94
                      13  11   7  ->              5*13 = 65
Sum: 14 43 94 94 65
Note that evaluating the first three elements gives results similar to your integration. The last three would be used for the reverse convolution.
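We can check those sums directly in R:
convolve(c(2,3,5), rev(c(7,11,13)), type='open')
## [1] 14 43 94 94 65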
Here is a comparison with your functions. Your function, vectorized, is plotted with:
y <- seq(0,10,by=.01)
plot(y, Vectorize(cfg)(y), type='l')
And here is an application of convolve, plotted with the following code. Note that there are 100 points per unit interval in y, so division by 100 is appropriate.
plot(y, convolve(f(y), rev(g(y)), type='open')[1:1001]/100, type='l')
These do not quite agree, since the discrete convolution only approximates the integral, but the convolution is much faster:
max(abs(Vectorize(cfg)(y) - convolve(f(y), rev(g(y)), type='open')[1:1001]/100))
## [1] 0.007474999
library(rbenchmark)  # provides benchmark()
benchmark(Vectorize(cfg)(y), convolve(f(y), rev(g(y)), type='open')[1:1001]/100,
          columns=c('test', 'elapsed', 'relative'))
## test elapsed relative
## 2 convolve(f(y), rev(g(y)), type = "open")[1:1001]/100 0.056 1
## 1 Vectorize(cfg)(y) 5.824 104

Fixing set.seed for an entire session

I am using R to construct an agent-based model with a Monte Carlo process. This means I have many functions that use a random engine of some kind. In order to get reproducible results, I must fix the seed. But, as far as I understand, I must set the seed before every random draw or sample. This is a real pain in the neck. Is there a way to fix the seed for an entire session?
set.seed(123)
print(sample(1:10,3))
# [1] 3 8 4
print(sample(1:10,3))
# [1] 9 10 1
set.seed(123)
print(sample(1:10,3))
# [1] 3 8 4
There are several options, depending on your exact needs. I suspect the first and simplest option is not sufficient, but my second and third options may be more appropriate, with the third the most automatable.
Option 1
If you know in advance that the function using/creating random numbers will always draw the same number, and you don't reorder the function calls or insert a new call in between existing ones, then all you need to do is set the seed once. Indeed, you probably don't want to keep resetting the seed, as you would just keep getting the same set of random numbers for each function call.
For example:
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
>
> ## second time round
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
Option 2
If you really want to make sure that a function uses the same seed and you only want to set it once, pass the seed as an argument:
foo <- function(...., seed) {
  ## set the seed
  if (!missing(seed))
    set.seed(seed)
  ## do other stuff
  ....
}
my.seed <- 42
bar <- foo(...., seed = my.seed)
fbar <- foo(...., seed = my.seed)
(where .... means other args to your function; this is pseudo code).
Option 3
If you want to automate this even more, you could abuse the options mechanism, which is fine if you are just doing this in a script (for a package, you should use your own options object). Your function can then look for this option. E.g.
foo <- function() {
  if (!is.null(seed <- getOption("myseed")))
    set.seed(seed)
  sample(10)
}
Then in use we have:
> getOption("myseed")
NULL
> foo()
[1] 1 2 9 4 8 7 10 6 3 5
> foo()
[1] 6 2 3 5 7 8 1 4 10 9
> options(myseed = 42)
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
I think this question suffers from a confusion. In the example, the seed has been set for the entire session. However, this does not mean it will produce the same set of numbers every time you use the print(sample(1:10,3)) command during a run; that would not resemble a random process, as it would be entirely determinate that the same three numbers would appear every time. Instead, what actually happens is that once you have set the seed, every time you run the script the same seed is used to produce a pseudo-random selection of numbers, that is, numbers that look as if they are random but are in fact produced by a reproducible process using the seed you have set.
If you rerun the entire script from the beginning, you reproduce those numbers that look random but are not. So, in the example, the second time that the seed is set to 123, the output is again 3, 8, and 4, which is exactly what you'd expect to see, because the process is starting again from the beginning. If you were then to continue reproducing your first run with another print(sample(1:10,3)), the second set of output would again be 9, 10, and 1.
So the short answer to the question is: if you want to set a seed to create a reproducible process, then do what you have done and set the seed once; however, you should not set the seed before every random draw, because that would restart the pseudo-random process from the beginning each time.
This question is old, but still comes high in search results, and it seemed worth expanding on Spacedman's answer.
If you want to always return the same results from random processes, simply keep the seed set all the time with:
addTaskCallback(function(...) {set.seed(123);TRUE})
Now the output is the same every time:
print(sample(1:10,3))
# [1] 3 8 4
print(sample(1:10,3))
# [1] 3 8 4
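One caveat (my addition): the callback persists for the rest of the session, resetting the seed after every top-level command. addTaskCallback returns an id that you can use to remove it again:
id <- addTaskCallback(function(...) {set.seed(123);TRUE})
## ... reproducible work here ...
removeTaskCallback(id)  # restore normal random behaviour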
You could do a wrapper function, like so:
> wrap.3.digit.sample <- function(x) {
+ set.seed(123)
+ return(sample(x, 3))
+ }
> wrap.3.digit.sample(c(1:10))
[1] 3 8 4
> wrap.3.digit.sample(c(1:10))
[1] 3 8 4
There is probably a more elegant way, and I'm sure someone will chime in with it. But, if they don't, this should make your life easier.
No need. Although the results differ from sample to sample (which you almost certainly want; otherwise the randomness would be very questionable), results from run to run will be the same. See, here's the output from my machine.
> set.seed(123)
> sample(1:10,3)
[1] 3 8 4
> sample(1:10,3)
[1] 9 10 1
I suggest that you call set.seed before each random number generation in R. I think what you need is reproducibility for Monte Carlo simulations. If you are in a for loop, you can call set.seed(i) before calling sample, which guarantees full reproducibility. In your outer function, you may specify an argument seed = 1 so that inside the for loop you use set.seed(i + seed).
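A minimal sketch of that pattern (function and argument names are mine, purely illustrative):
run_mc <- function(n_iter, seed = 1) {
  results <- vector("list", n_iter)
  for (i in seq_len(n_iter)) {
    set.seed(i + seed)  # each iteration is reproducible on its own
    results[[i]] <- sample(1:10, 3)
  }
  results
}
identical(run_mc(5), run_mc(5))
# [1] TRUE  <- the whole run reproduces exactly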
