shuffle values of factor in table [closed] - r

I am using a package for a GLM which reads a data frame in chunks. It requires that all levels of a factor occur in every chunk. I am looking for a nice strategy to rearrange observations so as to maximise the probability of having all values in every chunk.
The example would be
c(4,7,4,4,4,4,4,4,4,4,4,7,4,4,8,8,5,5)
for a chunk size of 8, the best rearrangement would be
c(4,7,5,8,4,4,4,4,4,4,4,7,4,4,8,8,4,5,8)
Is there some elegant way to shuffle the data around?
I just saw the comments: the library itself is called bigglm (which reads the data chunkwise). The vectors should be of equal length. The question is really just about rearranging the data so that as many levels as possible are present in each chunk.
An example for the column of the data frame can be found here:
(https://www.dropbox.com/s/cth8kwcq9ph5j0p/d1.RData?dl=0)
The most important thing in this case is that as many levels as possible are present in as many chunks as possible. The smaller the chunk, the less memory is needed when reading in. I think it would be reasonable to assume 10 chunks.
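To make the requirement concrete, here is a small helper (my own sketch, not part of bigglm; the name missing_by_chunk is made up) that splits a vector into chunks and reports which levels each chunk is missing:
missing_by_chunk <- function(x, chunksize) {
  chunks <- split(x, ceiling(seq_along(x) / chunksize))
  lapply(chunks, function(ch) setdiff(unique(x), ch))
}
missing_by_chunk(c(4,7,4,4,4,4,4,4,4,4,4,7,4,4,8,8,5,5), chunksize = 8)
## chunk 1 lacks 8 and 5, chunk 2 lacks 5, and the short tail chunk lacks 4, 7 and 8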

I think I understand what you are asking for, though admittedly I am not familiar with a function that reads data in by chunks, uses stringsAsFactors = TRUE, makes a priori assumptions about the makeup of the data, and does not offer a way to superimpose other characteristics of the factors. I offer in advance the suggestion that either you are misinterpreting the function or you are mis-applying it to your specific data problem.
I'm easily wrong in problems like this, so I'll try to address the inferred problem regardless.
You claim that the function will read in the first 8 elements, on which it does its processing. It must know that there are (in this case) four factors to be considered; the easiest way, as you are asking, is to have each of these factors present in each chunk. Once it has processed these first 8 rows, it will then read the second 8 elements. In the case of your sample data, this does not work, since the second 8 elements do not include a 5.
I'll define slightly augmented data later to remedy this.
Assumptions / Rules
the number of unique values overall in the data must be no larger than the size of each chunk;
each factor must have at least as many occurrences as the number of chunks to be read; and
all chunks have precisely chunksize elements in them (i.e., are full) except for the last chunk, which will have between 1 and chunksize elements in it; ergo,
the last chunk has at least as many elements as there are unique values.
Function Definition
Given those rules, here's some code. This is most certainly not the only solution, and it may not perform well with significantly large datasets (I have not done extensive testing).
myfunc <- function(x, chunksize = 8) {
  numChunks <- ceiling(length(x) / chunksize)
  uniqx <- unique(x)
  lastChunkSize <- chunksize * (1 - numChunks) + length(x)
  ## check to see if it is mathematically possible
  if (length(uniqx) > chunksize)
    stop('more factors than can fit in one chunk')
  if (any(table(x) < numChunks))
    stop('not enough of at least one factor to cover all chunks')
  if (lastChunkSize < length(uniqx))
    stop('last chunk will not have all factors')
  ## actually arrange things in one feasible permutation
  allIndices <- lapply(uniqx, function(z) which(z == x))
  ## fill one of each unique x into every chunk
  chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
  ## indices not yet assigned: drop the first numChunks of each level
  remainder <- unlist(lapply(allIndices, tail, n = -numChunks))
  ## portion the remainder into the per-chunk space left over
  remainderCut <- split(remainder,
                        ceiling(seq_along(remainder) / (chunksize - length(uniqx))))
  ## combine them all together, wary of empty lists
  finalIndices <- sapply(1:numChunks,
                         function(i) {
                           if (i <= length(remainderCut))
                             c(chunks[[i]], remainderCut[[i]])
                           else
                             chunks[[i]]
                         })
  x[unlist(finalIndices)]
}
Supporting Execution
In your offered data, you have 18 elements, requiring three chunks. Your data will fail on two counts: three of the elements occur only twice, so the third chunk most certainly cannot contain all elements; and your last chunk will have only two elements, which cannot contain each of the four.
I'll augment your data to remedy both problems, with:
dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)
which will not work unadjusted, if for no other reason than the last chunk will only have four 5's in it.
The solution:
myfunc(dat3, chunksize = 8)
## [1] 4 7 5 8 4 4 4 4   4 7 5 8 4 4 5 5   4 7 5 8
(spaces were added to the output for easy inspection). Each chunk has 4, 7, 5, 8 as its first four elements, therefore all factors are covered in each chunk.
Breakdown
A quick walkthrough (using debug(myfunc)), assuming x = dat3 and chunksize = 8. Jumping down the code:
## Browse[2]> uniqx
## [1] 4 7 5 8
## Browse[2]> allIndices
## [[1]]
## [1] 1 6 7 8 9 10 11 13 14
## [[2]]
## [1] 2 4 12
## [[3]]
## [1] 3 17 18 19 20
## [[4]]
## [1] 5 15 16
This shows the indices for each unique element. For example, there are 4's located at indices 1, 6, 7, etc.
## Browse[2]> chunks
## [[1]]
## [1] 1 2 3 5
## [[2]]
## [1] 6 4 17 15
## [[3]]
## [1] 7 12 18 16
There are three chunks to be filled, and this list starts forming those chunks. In this example, we have placed indices 1, 2, 3, and 5 in the first chunk. Looking back at allIndices, you'll see that these represent the first instance of each of uniqx, so the first chunk now contains c(4, 7, 5, 8), as do the other two chunks.
At this point, we have satisfied the basic requirement that each unique element be found in every chunk. The rest of the code fills with the remaining elements.
## Browse[2]> remainder
## [1] 8 9 10 11 13 14 19 20
These are all indices that have so far not been added to the chunks.
## Browse[2]> remainderCut
## $`1`
## [1] 8 9 10 11
## $`2`
## [1] 13 14 19 20
Though we have three chunks, we only have two lists here. This is fine: we have nothing (and need nothing) to add to the last chunk. We will then zip-merge these with chunks to form a list of index lists. (Note: you might be tempted to try mapply(function(a, b) c(a, b), chunks, remainderCut), but you may notice that if remainderCut is not the same size as chunks, as we see here, then its values are recycled. Not acceptable. Try it.)
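If you do want an mapply() route, one safe variant (my own sketch, not part of the function above) pads the shorter list with NULLs so that nothing is recycled; c(x, NULL) is just x, so the last chunk passes through unchanged:
padded <- c(remainderCut, vector("list", length(chunks) - length(remainderCut)))
finalIndices2 <- mapply(c, chunks, padded, SIMPLIFY = FALSE)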
## Browse[2]> finalIndices
## [[1]]
## [1] 1 2 3 5 8 9 10 11
## [[2]]
## [1] 6 4 17 15 13 14 19 20
## [[3]]
## [1] 7 12 18 16
Remember, each number represents the index from within x (originally dat3). We then unlist this split-vector and apply the indices to the data.
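As a quick sanity check (assuming myfunc and dat3 from above), we can re-split the returned vector and confirm that every chunk covers all levels:
out <- myfunc(dat3, chunksize = 8)
outChunks <- split(out, ceiling(seq_along(out) / 8))
sapply(outChunks, function(ch) all(unique(dat3) %in% ch))
##    1    2    3
## TRUE TRUE TRUE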

Related

Number in results from R

In R, when I press return for a line of code (for example, a histogram), what does the [1] that comes up in the results mean?
If there's another line, it comes up as [18], then [35].
The numbers that you see in the console in the situation you are describing are the indices of the first element of each line.
1:20
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
# [13] 13 14 15 16 17 18 19 20
How many values are displayed per line depends by default on the width of the console (at least in RStudio).
The value I printed is a numeric vector of length 20. A single number is technically also a numeric vector, just of length 1; R has no separate concept for the two, which is why the [1] still shows when you print only one value.
42
# [1] 42
This doesn't hold for every object: for example, there is no vector of functions of length 2, and c(mean, median) is a list (containing functions). But it works like this for the atomic modes (see ?atomic) and usually for the classes built on them.
You might not always see these numbers on all objects because they depend on what print methods are called, which itself depends on the class.
library(glue)
glue("a")
# a # <- we don't see [1]!
mode(glue("a"))
# character
class(glue("a"))
# [1] "glue" "character"
The print method that is called when typing print(1:20) is print.default. It can be overridden to avoid displaying the bracketed [numbers]:
print.default <- function(x) cat(x,"\n")
print(1:20)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
rm(print.default) # necessary cleanup!
The autoprint (what you get when not calling print explicitly) won't change, however, as auto-printing can only involve method dispatch for explicit classes (those with a class attribute, a.k.a. objects).
Type methods(print) to see all the available methods.
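As a hedged illustration of that dispatch (the class name myclass is invented for the example), auto-printing an object that carries a class attribute calls the matching print method:
x <- structure(list(val = 1:3), class = "myclass")
print.myclass <- function(x, ...) cat("<myclass of", length(x$val), "values>\n")
x  # auto-printing dispatches to print.myclass; no [1] appears
# <myclass of 3 values>
rm(print.myclass)  # cleanup, as above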

Difference between df["speed"] and df$speed

Suppose a data frame df has a column speed. What is the difference between accessing the column like so:
df["speed"]
or like so:
df$speed
The following calculates the mean value correctly:
lapply(df["speed"], mean)
But this prints all values under the column speed:
lapply(df$speed, mean)
There are two elements to the question in the OP. The first element was addressed in the comments: df["speed"] is an object of type data.frame() whereas df$speed is a numeric vector. We can see this via the str() function.
We'll illustrate this with Ezekiel's 1930 analysis of speed and stopping distance, the cars data set from the datasets package.
> library(datasets)
> data(cars)
>
> str(cars["speed"])
'data.frame': 50 obs. of 1 variable:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
> str(cars$speed)
num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
>
The second element that was not addressed in the comments is that lapply() behaves differently when passed a vector versus a list().
With a vector, lapply() processes each element in the vector independently, producing unexpected results for a function such as mean().
> unlist(lapply(cars$speed,mean))
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
What happened?
Since each element of cars$speed is processed by mean() independently, lapply() returns a list of 50 means of 1 number each: the original elements in the cars$speed vector.
Processing a list with lapply()
With a list, each element of the list is processed independently. We can calculate how many items will be processed by lapply() with the length() function.
> length(cars["speed"])
[1] 1
>
Since cars["speed"] is a data frame, and a data frame is also a list() (here containing a single column), the length() function returns the value 1. Therefore, when it is processed by lapply(), a single mean is calculated, not one per row of the speed column.
> lapply(cars["speed"],mean)
$speed
[1] 15.4
>
If we pass the entire cars data frame as the input object for lapply(), we obtain one mean per column in the data frame, since both variables in the data frame are numeric.
> lapply(cars,mean)
$speed
[1] 15.4
$dist
[1] 42.98
>
A theoretical perspective
The differing behaviors of lapply() are explained by the fact that R is an object oriented language. In fact, John Chambers, creator of the S language on which R is based, once said:
In R, two slogans are helpful.
-- Everything that exists is an object, and
-- Everything that happens is a function call.
John Chambers, quoted in Advanced R, p. 79.
The fact that lapply() works differently on a data frame than a vector is an illustration of the object oriented feature of polymorphism where the same behavior is implemented in different ways for different types of objects.
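A compact way to see that polymorphism at work (a small sketch using the same cars data) is to compare the three extraction forms side by side:
class(cars["speed"])    # "data.frame": single brackets keep the container
class(cars[["speed"]])  # "numeric": double brackets extract the column vector
identical(cars$speed, cars[["speed"]])  # TRUE: $ is shorthand for [[ here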
While this looks like a beginner's question, I think it's worth answering, since many beginners could have a similar question and a guide to the corresponding documentation is helpful IMHO.
No up-votes please - I am just collecting the comment fragments from the question that contribute to the answer - feel free to edit this answer...
A data.frame is a list of vectors with the same length (number of elements). Please read the help in the R console (by typing ?data.frame)
The $ operator is implemented by returning one column as vector (?"$.data.frame")
lapply applies a function to each element of a list (see ?lapply). If the first param X is an atomic vector (integer, double, ...) with multiple elements, each element of the vector is converted ("coerced") into a separate list element (the same as as.list(1:26)).
Examples:
x <- data.frame(a = LETTERS, b = 1:26, stringsAsFactors = FALSE)
b.vector <- x$b
b.data.frame <- x["b"]
class(b.vector) # integer
class(b.data.frame) # data.frame
lapply(b.vector, mean)
# returns a result list with 26 list elements, the same as `lapply(1:26, mean)`
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
# ... up to list element 26
lapply(b.data.frame, mean)
# returns a list with one element per column of the data frame;
# here a single element $b containing the mean of that column
# $b
# [1] 13.5
So IMHO your original question can be reduced to: why does lapply behave differently if the first parameter is an atomic vector instead of a list?
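One way to see that coercion directly (a short sketch reusing the x defined above) is to compare the lengths after as.list():
length(as.list(x$b))    # 26: the vector is coerced to one list element per value
length(as.list(x["b"])) # 1: the data frame is already a list with a single column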

Add a percent chance of something happening in R

I have a vector of numbers stored in R.
my_vector <- c(1,2,3,4,5)
I want to add two to each number.
my_vector + 2
[1] 3 4 5 6 7
However, I want there to only be a twenty percent chance of adding two to the numbers in my vector each time I run the code. Is there a way to code this in R?
What I mean is, if I run the code, the output could be:
[1] 3 4 5 6 9
Or perhaps
[1] 5 4 5 6 7
i.e. there is only a 20% chance that any one number in the vector will get two added to it.
my_vector + 2*sample(c(TRUE,FALSE), length(my_vector), prob=c(0.2,0.8), replace=TRUE)
That will give a variable number of 2's added (which is what you were asking), but sometimes people want to know that exactly 20% will have a 2 added, in which case it would be:
my_vector + 2*sample(c(TRUE,rep(FALSE,4)))  # exactly one of the five elements gets +2
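An equivalent formulation (my own sketch, not from the answers above) draws one uniform number per element and relies on logicals coercing to 0/1 in arithmetic:
my_vector + 2 * (runif(length(my_vector)) < 0.2)
# each element independently has a 20% chance of being increased by 2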

Fixing set.seed for an entire session

I am using R to construct an agent-based model with a Monte Carlo process. This means I have many functions that use a random engine of some kind. In order to get reproducible results, I must fix the seed. But, as far as I understand it, I must set the seed before every random draw or sample. This is a real pain in the neck. Is there a way to fix the seed for an entire session?
set.seed(123)
print(sample(1:10,3))
# [1] 3 8 4
print(sample(1:10,3))
# [1] 9 10 1
set.seed(123)
print(sample(1:10,3))
# [1] 3 8 4
There are several options, depending on your exact needs. I suspect the first option, the simplest is not sufficient, but my second and third options may be more appropriate, with the third option the most automatable.
Option 1
If you know in advance that the function using/creating random numbers will always draw the same number, and you don't reorder the function calls or insert a new call in between existing ones, then all you need do is set the seed once. Indeed, you probably don't want to keep resetting the seed as you'll just keep on getting the same set of random numbers for each function call.
For example:
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
>
> ## second time round
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
Option 2
If you really want to make sure that a function uses the same seed and you only want to set it once, pass the seed as an argument:
foo <- function(...., seed) {
    ## set the seed
    if (!missing(seed))
        set.seed(seed)
    ## do other stuff
    ....
}
my.seed <- 42
bar <- foo(...., seed = my.seed)
fbar <- foo(...., seed = my.seed)
(where .... means other args to your function; this is pseudo code).
Option 3
If you want to automate this even more, then you could abuse the options mechanism, which is fine if you are just doing this in a script (for a package you should use your own options object). Then your function can look for this option. E.g.
foo <- function() {
    if (!is.null(seed <- getOption("myseed")))
        set.seed(seed)
    sample(10)
}
Then in use we have:
> getOption("myseed")
NULL
> foo()
[1] 1 2 9 4 8 7 10 6 3 5
> foo()
[1] 6 2 3 5 7 8 1 4 10 9
> options(myseed = 42)
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
I think this question suffers from a confusion. In the example, the seed has been set for the entire session. However, this does not mean it will produce the same set of numbers every time you use the print(sample(1:10,3)) command during a run; that would not resemble a random process, as it would be entirely determinate that the same three numbers would appear every time. Instead, what actually happens is that once you have set the seed, every time you run a script the same seed is used to produce a pseudo-random selection of numbers, that is, numbers that look as if they are random but are in fact produced by a reproducible process using the seed you have set.
If you rerun the entire script from the beginning, you reproduce those numbers that look random but are not. So, in the example, the second time the seed is set to 123, the output is again 3, 8, and 4, which is exactly what you'd expect to see because the process is starting again from the beginning. If you were to continue to reproduce your first run by writing print(sample(1:10,3)) again, then the second set of output would again be 9, 10, and 1.
So the short answer to the question is: if you want to set a seed to create a reproducible process then do what you have done and set the seed once; however, you should not set the seed before every random draw because that will start the pseudo-random process again from the beginning.
This question is old, but still comes high in search results, and it seemed worth expanding on Spacedman's answer.
If you want to always return the same results from random processes, simply keep the seed set all the time with:
addTaskCallback(function(...) {set.seed(123);TRUE})
Now the output is the same every time:
print(sample(1:10,3))
# [1] 3 8 4
print(sample(1:10,3))
# [1] 3 8 4
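Note that addTaskCallback() returns an id, so the callback can be removed once you no longer want every top-level call re-seeded (the variable h below is just illustrative):
h <- addTaskCallback(function(...) { set.seed(123); TRUE })
print(sample(1:10,3))
# [1] 3 8 4
removeTaskCallback(h)  # stop re-seeding after every top-level task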
You could do a wrapper function, like so:
> wrap.3.digit.sample <- function(x) {
+ set.seed(123)
+ return(sample(x, 3))
+ }
> wrap.3.digit.sample(c(1:10))
[1] 3 8 4
> wrap.3.digit.sample(c(1:10))
[1] 3 8 4
There is probably a more elegant way, and I'm sure someone will chime in with it. But, if they don't, this should make your life easier.
No need. Although the results are different from sample to sample (which you almost certainly want, otherwise the randomness is very questionable), results from run to run will be the same. See, here's the output from my machine.
> set.seed(123)
> sample(1:10,3)
[1] 3 8 4
> sample(1:10,3)
[1] 9 10 1
I suggest that you call set.seed before each random number generation in R. I think what you need is reproducibility for Monte Carlo simulations. If you are in a for loop, you can call set.seed(i) before calling sample, which guarantees full reproducibility. In your outer function, you may specify an argument seed = 1 so that in the for loop you use set.seed(i + seed).
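A minimal sketch of that pattern (the function run_mc and its arguments are illustrative, not from the question):
run_mc <- function(n_iter, seed = 1) {
  results <- numeric(n_iter)
  for (i in seq_len(n_iter)) {
    set.seed(i + seed)  # unique but reproducible seed per iteration
    results[i] <- mean(sample(1:10, 3))
  }
  results
}
run_mc(5)  # returns the same five values on every run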

Using cut2 from Hmisc to calculate cuts for different number of groups

I was trying to calculate equal quantile cuts for a vector by using cut2 from Hmisc.
library(Hmisc)
c <- c(-4.18304,-3.18343,-2.93237,-2.82836,-2.13478,-2.01892,-1.88773,
-1.83124,-1.74953,-1.74858,-0.63265,-0.59626,-0.5681)
cut2(c, g=3, onlycuts=TRUE)
[1] -4.18304 -2.01892 -1.74858 -0.56810
But I was expecting the following result (33%, 33%, 33%):
[1] -4.18304 -2.13478 -1.74858 -0.56810
Should I still use cut2 or try something different? How can I make it work? Thanks for your advice.
You are seeing the cutpoints, but you want the tabular counts, and you want them as fractions of the total, so do this instead:
> prop.table(table(cut2(c, g=3) ) )
[-4.18,-2.019) [-2.02,-1.749) [-1.75,-0.568]
0.3846154 0.3076923 0.3076923
(Obviously you cannot expect cut2 to create an exact split when the count of elements is not evenly divisible by 3.)
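For comparison, a base R approximation of what cut2() is doing (a sketch; note that quantile() interpolates, so its breaks need not be observed data values the way cut2()'s are):
breaks <- quantile(c, probs = seq(0, 1, length.out = 4))  # c is the vector from the question
prop.table(table(cut(c, breaks, include.lowest = TRUE)))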
It seems that there were accidentally thirteen values in the original data set, instead of twelve. Thirteen values cannot be equally divided into three quantile groups (as mentioned by BondedDust). Here is the original problem, except that one selected data value (-1.74953) is excluded, making it twelve values. This gives the result originally expected:
library(Hmisc)
c<-c(-4.18304,-3.18343,-2.93237,-2.82836,-2.13478,-2.01892,-1.88773,-1.83124,-1.74858,-0.63265,-0.59626,-0.5681)
cut2(c, g=3,onlycuts=TRUE)
#[1] -4.18304 -2.13478 -1.74858 -0.56810
To make it clearer to anyone not familiar with cut2 from the Hmisc package (like me as of this morning), here's a similar problem, except that we'll use the integers 1 through 12 (assigned to the vector dozen_values).
library(Hmisc)
dozen_values <-1:12
quantile_groups <- cut2(dozen_values,g=3)
levels(quantile_groups)
## [1] "[1, 5)" "[5, 9)" "[9,12]"
cutpoints <- cut2(dozen_values, g=3, onlycuts=TRUE)
cutpoints
## [1] 1 5 9 12
# Show which values belong to which quantile group, using a data frame
quantile_DF <- data.frame(dozen_values, quantile_groups)
names(quantile_DF) <- c("value", "quantile_group")
quantile_DF
## value quantile_group
## 1 1 [1, 5)
## 2 2 [1, 5)
## 3 3 [1, 5)
## 4 4 [1, 5)
## 5 5 [5, 9)
## 6 6 [5, 9)
## 7 7 [5, 9)
## 8 8 [5, 9)
## 9 9 [9,12]
## 10 10 [9,12]
## 11 11 [9,12]
## 12 12 [9,12]
Notice that the first quantile group includes everything up to, but not including, 5 (i.e. 1 through 4, in this case). The second quantile group contains 5 up to, but not including, 9 (i.e. 5 through 8, in this case). The third (last) quantile group contains 9 through 12, which includes the last value 12. Unlike the other quantile groups, the third quantile group is closed on the right.
Anyway, you can see that the "cutpoints" 1, 5, 9, and 12 describe the start and end points of the quantile groups in the most concise way, but the notation is obtuse without reading the relevant documentation.
See this explanation about the parentheses vs square bracket notation, if it is unfamiliar to you.
