I have the following vector in R: c(0,1).
I wish to randomly sample 10 elements at a time from this vector, but such that no value repeats more than 2 times in a row.
The code I have tried is sample(c(0,1), 10, replace = TRUE).
I would like to get results such as
(0,1,1,0,1,1,0,0,1,0)
(0,1,0,1,0,0,1,0,1,0)
but not
(1,0,0,0,1,1,0,0,0)
which contains a run of three 0s. And so on.
How could I accomplish this?
Since a run can only be 1 or 2 elements long, and the values must alternate between runs, you can achieve this in a one-liner: randomly choose 1 or 2 repeats for each element of an alternating sequence of 0s and 1s, and truncate the result to 10 elements.
rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10]
#> [1] 0 0 1 1 0 1 1 0 1 0
If you wish to remove the constraint of the sequence always starting with a zero, you can randomly flip the whole sequence by subtracting it from a random 0 or 1 and taking the absolute value:
abs(sample(0:1, 1) - rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10])
#> [1] 1 1 0 0 1 0 0 1 1 0
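A quick sanity check (the draw is random, so the exact vector varies, but the run-length condition always holds):
x <- rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10]
max(rle(x)$lengths) <= 2
#> [1] TRUE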
foo <- function() {
  innerfunc <- function() sample(c(0, 1), 10, replace = TRUE)
  x <- innerfunc()
  while (max(rle(x)$lengths) > 2) {
    x <- innerfunc()
  }
  x
}
foo()
This function looks at the maximum run length of zeroes and ones (via rle). If it is > 2, it reruns your sample call, wrapped here as innerfunc.
This is an interesting exercise if you would like to practice recursion; the option below might give some hints:
f <- function(n) {
  if (n <= 2) {
    return(sample(c(0, 1), n, replace = TRUE))
  }
  m <- sample(c(1, 2), 1)               # length of the final run
  v <- Recall(n - m)                    # build the first n - m elements recursively
  c(v, rep((tail(v, 1) + 1) %% 2, m))   # append a run of the opposite value
}
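A quick demonstration (the draw itself is random; the vector shown is just one possible outcome, but the run-length check always holds):
x <- f(10)
x
#> [1] 0 1 1 0 1 0 0 1 0 1
max(rle(x)$lengths) <= 2
#> [1] TRUE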
I know I can use expand.grid for this, but I am trying to learn actual programming. My goal is to take what I have below and use a recursion to get all 2^n binary sequences of length n.
I can do this for n = 1, but I don't understand how I would use the same function in a recursive way to get the answer for higher dimensions.
Here is what I have for n = 1:
binseq <- function(n){
  binmat <- matrix(nrow = 2^n, ncol = n)
  r <- 0  # row counter
  for (i in 0:1) {
    r <- r + 1
    binmat[r, ] <- i
  }
  return(binmat)
}
I know I probably have to use cbind in the return statement. My intuition says the return statement should be something like cbind(binseq(n-1), binseq(n)). But, honestly, I'm completely lost at this point.
The desired output should produce something like what expand.grid gives:
n = 5
expand.grid(replicate(n, 0:1, simplify = FALSE))
The output should just be a matrix, with binmat being filled recursively.
As requested in a comment (below), here is a limited implementation for binary sequences only:
eg.binary <- function(n, digits = 0:1) {
  if (n <= 0) return(matrix(0, 0, 0))
  if (n == 1) return(matrix(digits, 2))
  x <- eg.binary(n - 1)
  rbind(cbind(digits[1], x), cbind(digits[2], x))
}
After taking care of the degenerate case n <= 0, it treats the "base case" of n = 1 and then recursively obtains all (n-1)-digit binary strings and prepends each digit to each of them. The digits are prepended so that the binary strings end up in their usual lexicographic order; expand.grid produces the same rows, though ordered with its first column varying fastest.
Example:
eg.binary(3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 1
[3,] 0 1 0
[4,] 0 1 1
[5,] 1 0 0
[6,] 1 0 1
[7,] 1 1 0
[8,] 1 1 1
A general explanation (with a more flexible solution) follows.
Distill the problem down to the basic operation of tacking the values of an array y onto the rows of a dataframe X, associating a whole copy of X with each value (via cbind) and appending the whole lot (via rbind):
cross <- function(X, y) {
  do.call("rbind", lapply(y, function(z) cbind(X, z)))
}
For example,
cross(data.frame(A=1:2, b=letters[1:2]), c("X","Y"))
A b z
1 1 a X
2 2 b X
3 1 a Y
4 2 b Y
(Let's worry about the column names later.)
The recursive solution for a list of such arrays y assumes you have already carried out these operations for all but the last element of the list. It has to start somewhere, which evidently consists of converting an array into a one-column data frame. Thus:
eg_ <- function(y) {
  n <- length(y)
  if (n <= 1) {
    as.data.frame(y)
  } else {
    cross(eg_(y[-n]), y[[n]])
  }
}
Why the funny name? Because we might want to do some post-processing, such as giving the result nice names. Here's a fuller implementation:
eg <- function(y) {
  # (Define `eg_` here to keep it local to `eg` if you like)
  X <- eg_(y)
  names.default <- paste0("Var", seq.int(length(y)))
  if (is.null(names(y))) {
    colnames(X) <- names.default
  } else {
    colnames(X) <- ifelse(names(y) == "", names.default, names(y))
  }
  X
}
For example:
eg(replicate(3, 0:1, simplify=FALSE))
Var1 Var2 Var3
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
eg(list(0:1, B=2:3))
Var1 B
1 0 2
2 1 2
3 0 3
4 1 3
Apparently this was the desired recursive code:
binseq <- function(n){
  if (n == 1) {
    binmat <- matrix(c(0, 1), nrow = 2, ncol = 1)
  } else if (n > 1) {
    A <- binseq(n - 1)
    B <- cbind(rep(0, nrow(A)), A)
    C <- cbind(rep(1, nrow(A)), A)
    binmat <- rbind(B, C)
  }
  return(binmat)
}
Basically for n = 1 we create a [0, 1] matrix. For every n thereafter we add a column of 0's to the previous matrix, and, separately, a column of 1's. Then we rbind the two matrices to get the final product. So I get what the algorithm is doing, but I don't really understand what the recursion is doing. For example, I don't understand the step from n = 2 to n = 3 based on the algorithm.
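To make the step from n = 2 to n = 3 concrete, here is a trace of the recursion above (the outputs follow directly from running the code):
binseq(2)   # this is A inside the binseq(3) call
     [,1] [,2]
[1,]    0    0
[2,]    0    1
[3,]    1    0
[4,]    1    1
binseq(3) then forms B = cbind(0, A) (rows 1-4 below) and C = cbind(1, A) (rows 5-8), and stacks them with rbind(B, C):
binseq(3)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    1
[3,]    0    1    0
[4,]    0    1    1
[5,]    1    0    0
[6,]    1    0    1
[7,]    1    1    0
[8,]    1    1    1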
I have a large table with timestamps from several nights. Columns are an id for the night, an id for the timestamp within that night, and the heart rate at that timestamp. It looks like this:
allData <- data.table(nightNo=c(1,1,1,1,1,1,2,2,2,2), withinNightNo=c(1,2,3,4,5,6,1,2,3,4), HR=c(1:10))
nightNo withinNightNo HR
      1             1  1
      1             2  2
      1             3  3
      1             4  4
      1             5  5
      1             6  6
      2             1  7
      2             2  8
      2             3  9
      2             4 10
I'd like to add two new columns to the table: the slope and the cumsum of HR over up to the last 10 rows of the same night. I calculate the slope using linear regression, and define the cumsum as cumsum_n = max(cumsum_(n-1), 0) + (value_n - mean(value_1, ..., value_n)). The result should look like this:
nightNo withinNightNo HR HRSlope HRCumsum
      1             1  1     NaN      0.0
      1             2  2       1      0.5
      1             3  3       1      1.5
      1             4  4       1      3.0
      1             5  5       1      5.0
      1             6  6       1      7.5
      2             1  7     NaN      0.0
      2             2  8       1      0.5
      2             3  9       1      1.5
      2             4 10       1      3.0
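To make the cumsum definition concrete, here is a small check for night 1 (a direct transcription of the formula above, treating cumsum_0 as 0):
hr <- 1:6  # HR values of night 1
cs <- numeric(length(hr))
for (n in seq_along(hr)) {
  prev <- if (n == 1) 0 else cs[n - 1]
  cs[n] <- max(prev, 0) + (hr[n] - mean(hr[1:n]))
}
cs
#> [1] 0.0 0.5 1.5 3.0 5.0 7.5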
I've created code for both of these functions using for loops. They work, but my table is so large that it takes a long time to even calculate the slope/cumsum of a single value. My code looks like this:
# Add HRSlope column
allData$HRSlope <- 0
for (i in 1:nrow(allData)) {
  # Get points from up to last 10 seconds of the same night
  start <- ifelse(i < 11, 1, i - 10)
  points <- filter(allData[start:i, ],
                   nightNo == allData[i, ]$nightNo)[, c("withinNightNo", "HR")]
  # Calculate necessary values
  meanX  <- mean(points$withinNightNo)
  meanY  <- mean(points$HR)
  meanXY <- mean(points$withinNightNo * points$HR)
  meanX2 <- mean(points$withinNightNo^2)
  # Calculate slope and add to table
  allData[i, ]$HRSlope <- (meanX * meanY - meanXY) / (meanX^2 - meanX2)
  cat(i, "\n")
}
# Add cumsum column, and add first value to sum
allData$HRCumsum <- 0
Sum <- allData[1, ]$HR
for (i in 2:nrow(allData)) {
  # Get sum and average of HR in night so far, reset Sum if new night started
  Sum <- allData[i, ]$HR + ifelse(allData[i, ]$nightNo != allData[i - 1, ]$nightNo, 0, Sum)
  Average <- Sum / allData[i, ]$withinNightNo
  # Get previous cumsum, if available
  pCumsum <- ifelse(allData[i, ]$nightNo != allData[i - 1, ]$nightNo, 0, allData[i - 1, ]$HRCumsum)
  # Calculate current cumsum
  allData[i, ]$HRCumsum <- max(pCumsum, 0) + (allData[i, ]$HR - Average)
  cat(i, "\n")
}
Is there a more efficient way to do this, presumably without for loops?
EDIT:
I've been able to increase the speed of my slope function somewhat. However, it still uses a for loop, and it writes a wrong value to each field nine times before writing the correct value. Any thoughts on how to fix these two issues?
getSlope <- function(x, y) {
  # Calculate necessary values
  meanX  <- mean(x)
  meanY  <- mean(y)
  meanXY <- mean(x * y)
  meanX2 <- mean(x^2)
  # Calculate slope
  return((meanX * meanY - meanXY) / (meanX^2 - meanX2))
}
# Loop back to 1
for (i in max(allData$withinNightNo):1) {
  # Prevent i <= 0
  low <- ifelse(i < 10, 0, i - 10)
  # Grab up to the last 10 points and calculate slope
  allData[with(allData, withinNightNo > low & withinNightNo <= i),
          slope := getSlope(withinNightNo, HR), by = nightNo]
}
EDIT2:
I've also been able to improve my cumsum a little, but it suffers from the same problems as the slope. Besides that, it works on larger chunks of the table, because it needs the running average, and it loops over all the data twice. Any thoughts on improving this would also be highly appreciated.
# Calculate part of the cumsum
getCumsumPart <- function(x) {
  return(x - mean(x))
}
# Calculate valueN - mean(value1:N)
for (i in max(allData$withinNightNo):1) {
  allData[with(allData, withinNightNo <= i),
          cumsumPart := getCumsumPart(HR), by = nightNo]
}
# Calculate + max(cumsumN-1, 0)
for (i in max(allData$withinNightNo):1) {
  allData[with(allData, withinNightNo <= i & cumsumPart > 0),
          cumsum := sum(cumsumPart), by = nightNo]
}
# Remove part column
allData$cumsumPart <- NULL
# Set NA values to 0
allData[with(allData, is.na(cumsum)), cumsum := 0]
Try this approach
library(dplyr)
library(caTools)

allData <- data.frame(nightNo = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                      withinNightNo = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
                      HR = 1:10)
group_fun <- function(grouped_df, window = 10L) {
  # slope
  mean_x  <- runmean(grouped_df$withinNightNo, window, align = "right")
  mean_y  <- runmean(grouped_df$HR, window, align = "right")
  mean_xy <- runmean(grouped_df$withinNightNo * grouped_df$HR, window, align = "right")
  mean_xx <- runmean(grouped_df$withinNightNo * grouped_df$withinNightNo, window, align = "right")
  grouped_df$slope <- (mean_x * mean_y - mean_xy) / (mean_x^2 - mean_xx)
  # cumsum
  partial <- grouped_df$HR - mean_y  # from above
  # the "loop" is unavoidable here, I think
  cumsum <- 0
  grouped_df$cumsum <- sapply(partial, function(val) {
    cumsum <<- max(cumsum, 0) + val
    cumsum
  })
  grouped_df
}
out <- allData %>%
  group_by(nightNo) %>%
  do(group_fun(., window = 3L))  # change window as desired
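If you would rather stay inside data.table itself, here is a rough sketch of the slope part (not from the original answer, and untested) using adaptive rolling means; it assumes a data.table version that provides frollmean() with adaptive = TRUE (1.12.0 or later). The cumsum term would still need a per-group scan as above.
library(data.table)
setDT(allData)
# Window of up to the last 10 rows ending at the current row, per night
allData[, win := pmin(seq_len(.N), 10L), by = nightNo]
allData[, `:=`(
  mx  = frollmean(withinNightNo,      win, adaptive = TRUE),
  my  = frollmean(HR,                 win, adaptive = TRUE),
  mxy = frollmean(withinNightNo * HR, win, adaptive = TRUE),
  mxx = frollmean(withinNightNo^2,    win, adaptive = TRUE)
), by = nightNo]
allData[, HRSlope := (mx * my - mxy) / (mx^2 - mxx)]  # NaN where the window has a single row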
If I have a vector such as x <- c(1,2,3,4,5,6,7,8,9), I want a function f(vector, index, num) that takes the vector and gives me the num "closest" elements to the one at that index.
Examples:
f(x,3,4) = c(1,2,4,5)
f(x,1,5) = c(2,3,4,5,6)
f(x,8,3) = c(6,7,9)
Since for an odd num we also need to choose whether to take the extra element from the left or the right side, let's go with the left side (the right side would be fine too), i.e. f(x,4,5) = c(1,2,3,5,6) and f(x,7,3) = c(5,6,8).
I hope my question is clear, thank you for any help/responses!
Edit: The original vector of c(1:9) is arbitrary; the vector could be a vector of strings, or a vector of length 1000 with shuffled numbers and repeats, e.g. c(1,7,4,2,3,7,2,6,234,56,8).
num_closest_by_indices <- function(v, idx, num) {
  # Rank all indices by distance from idx
  i <- abs(seq_along(v) - idx)
  i[idx] <- +Inf  # sentinel, so the element itself is never picked
  # Try the base case, where idx is not within (num/2) of the edge;
  # if there are not enough elements, incrementally widen the cutoff
  for (cutoff_idx in seq(floor(num/2), num)) {
    if (sum(i <= cutoff_idx) >= num) {
      # This can add two extra indices per iteration. Strictly, on an even
      # count we should add the leftmost one first, to break ties towards the left.
      return(v[i <= cutoff_idx])
    }
  }
}
Here's an illustration of this algorithm, using x <- c(1:9) and idx <- 3: we rank the indices in order of desirability, then pick the num lowest legal ones:
> seq_along(x)
1 2 3 4 5 6 7 8 9
> seq_along(x) - idx
-2 -1 0 1 2 3 4 5 6
> i <- abs(seq_along(x) - idx)
2 1 0 1 2 3 4 5 6
> i[idx] <- +Inf # sentinel to prevent us returning the element itself
2 1 Inf 1 2 3 4 5 6
Now we can just find num elements with smallest values (break ties arbitrarily, unless you have a preference (left)).
Our first guess is all indices at distance <= num/2; this might not be enough if idx is within num/2 of the start/end.
> i <= 2
TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
> x[i <= 2]
1 2 4 5
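Putting it together with the function above:
> x <- c(1:9)
> num_closest_by_indices(x, 3, 4)
[1] 1 2 4 5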
So, adapting @dash2's code to handle the corner cases where some candidate indices are illegal (nonpositive, or greater than length(vec)), i.e. not in 1:length(vec). If elems[1] < 1, then 1 - elems[1] candidates fall off the left edge, and we must shift the whole window right by that amount; the right edge is handled symmetrically.
Notes:
in the end the code is simpler and faster when handling this as three piecewise cases.
it actually seems to simplify things if we pick num + 1 indices (a window that includes elem), then remove elem via setdiff before returning the answer.
Like so:
f <- function(vec, elem, n) {
  elems <- seq(elem - ceiling(n/2), elem + floor(n/2))
  if (max(elems) > length(vec)) elems <- elems - (max(elems) - length(vec))
  if (elems[1] < 1) elems <- elems + (1 - elems[1])
  elems <- setdiff(elems, elem)
  vec[elems]
}
Giving results:
> f(1:9, 1, 5)
[1] 2 3 4 5 6
> f(1:9, 9, 5)
[1] 4 5 6 7 8
> f(1:9, 2, 5)
[1] 1 3 4 5 6
> f(1:9, 4, 5)
[1] 1 2 3 5 6
> f(1:9, 4, 4)
[1] 2 3 5 6
> f(1:9, 2, 4)
[1] 1 3 4 5
> f(1:9, 1, 4)
[1] 2 3 4 5
> f(1:9, 9, 4)
[1] 5 6 7 8
Start the function with the variable argument x first, and the reference table and n after:
.nearest_n <- function(x, table, n) {
The algorithm assumes that table is numeric, without any duplicates, and all values finite; n has to be less than or equal to the length of the table
  ## assert & setup
  stopifnot(
    is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
    n <= length(table)
  )
Sort the table and then 'clamp' maximum and minimum values
  ## sort and clamp
  table <- c(-Inf, sort(table), Inf)
  len <- length(table)
Find the interval in table where x occurs; findInterval() uses an efficient search. Use the interval index as the initial lower index, and add 1 for the upper index, making sure to stay in-bounds.
  ## where to start?
  lower <- findInterval(x, table)
  upper <- min(lower + 1L, len)
Find the nearest n neighbors by comparing the distances from x at the lower and upper indices, recording the nearer value, and decrementing the lower or incrementing the upper index as appropriate, making sure to stay in bounds:
  ## find
  nearest <- numeric(n)
  for (i in seq_len(n)) {
    if (abs(x - table[lower]) < abs(x - table[upper])) {
      nearest[i] <- table[lower]
      lower <- max(1L, lower - 1L)
    } else {
      nearest[i] <- table[upper]
      upper <- min(len, upper + 1L)
    }
  }
Then return the solution and finish the function
  nearest
}
The code might seem verbose, but is actually relatively efficient because the only operations on the entire vector (sort(), findInterval()) are implemented efficiently in R.
A particular advantage of this approach is that it can be vectorized in its first argument, calculating the test for using lower (use_lower = ...) as a vector and using pmin() / pmax() as clamps.
.nearest_n <- function(x, table, n) {
  ## assert & setup
  stopifnot(
    is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
    n <= length(table)
  )
  ## sort and clamp
  table <- c(-Inf, sort(table), Inf)
  len <- length(table)
  ## where to start?
  lower <- findInterval(x, table)
  upper <- pmin(lower + 1L, len)
  ## find
  nearest <- matrix(0, nrow = length(x), ncol = n)
  for (i in seq_len(n)) {
    use_lower <- abs(x - table[lower]) < abs(x - table[upper])
    nearest[, i] <- ifelse(use_lower, table[lower], table[upper])
    lower[use_lower] <- pmax(1L, lower[use_lower] - 1L)
    upper[!use_lower] <- pmin(len, upper[!use_lower] + 1L)
  }
  ## return
  nearest
}
For instance
> set.seed(123)
> table <- sample(100, 10)
> sort(table)
[1] 5 29 41 42 50 51 79 83 86 91
> .nearest_n(c(30, 20), table, 4)
[,1] [,2] [,3] [,4]
[1,] 29 41 42 50
[2,] 29 5 41 42
Generalize this by taking any argument and coercing it to the required form, using a reference lookup table table0 and indices into it table1:
nearest_n <- function(x, table, n) {
  ## coerce to common form
  table0 <- sort(unique(c(x, table)))
  x <- match(x, table0)
  table1 <- match(table, table0)
  ## find nearest
  m <- .nearest_n(x, table1, n)
  ## result in original form
  matrix(table0[m], nrow = nrow(m))
}
As an example...
> set.seed(123)
> table <- sample(c(letters, LETTERS), 30)
> nearest_n(c("M", "Z"), table, 5)
[,1] [,2] [,3] [,4] [,5]
[1,] "o" "L" "O" "l" "P"
[2,] "Z" "z" "Y" "y" "w"
I have a data frame with {0,1} indicating whether a product was Small, Medium or Large.
dat <- data.frame(Sm = c(1,0,0), Med = c(0,1,0), Lg = c(0,0,1))
Sm Med Lg
1 1 0 0
2 0 1 0
3 0 0 1
I'm looking to assign 1's to the 0's leading up to a 1 in a given row. For example in row 2 the product is a "Med", so I'm looking to assign a 1 to the 0 in the "Sm" column.
Allocation size is a consideration so I'm looking for a vectorized approach without using a for loop please. The final solution should output the following:
Sm Med Lg
1 1 0 0
2 1 1 0
3 1 1 1
I've tried several variations of the code below, but the closest I can get is a ragged array which assigns all of the 1's correctly while dropping the elements that have legitimate 0's.
apply(dat, 1, function(x) {
  x[1:which.max(x)] <- 1
})
[1] 1 1 1
And below, which gets close but without the needed trailing 0's
apply(dat, 1, function(x) {
  temp <- x[1:which.max(x)]
  unlist(lapply(temp, function(y) {
    y <- 1
  }))
})
[[1]]
Sm
1
[[2]]
Sm Med
1 1
[[3]]
Sm Med Lg
1 1 1
First, convert to matrix and use max.col to get the index of the 1 in each row:
mat <- as.matrix(dat)
mc <- max.col(mat)
logical construction
Overwrite the matrix:
mat = +(col(mat) <= mc)
or construct an index of matrix positions to change and change 'em:
logical indexing
mat[col(mat) < mc] <- 1L
# or
mat[which(col(mat) < mc)] <- 1L
matrix indexing
idx <- do.call(rbind, lapply(seq_along(mc), function(i)
  if (mc[i] == 1L) NULL  # nothing to fill in this row
  else cbind(i, seq_len(mc[i] - 1))
))
mat[idx] <- 1L
vector indexing
nr <- nrow(mat)
idx <- unlist(lapply(seq_along(mc), function(i)
  if (mc[i] == 1L) NULL
  else seq(from = i, by = nr, length.out = mc[i] - 1L)
))
mat[idx] <- 1L
The help for all three indexing methods can be found at help("[<-").
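For instance, running the logical-indexing variant on the example data gives:
> mat <- as.matrix(dat)
> mc <- max.col(mat)
> mat[col(mat) < mc] <- 1L
> mat
  Sm Med Lg
1  1   0  0
2  1   1  0
3  1   1  1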
This will do what you want.
dat[which(dat$Med == 1), ]$Sm <- 1
dat[which(dat$Lg == 1), ]$Med <- 1
dat[which(dat$Lg == 1), ]$Sm <- 1