vector manipulation in R - r

Suppose I have a vector of dimension n and it is composed of 0 and 1. Then I divide this vector into m equal bins. A bin is called active if it contains at least one "1". I want to write a command that returns the place of active bins and how many "1" they contain.
For example, I have this vector: n=15, m=5
[1 0 0 | 0 1 1 | 0 0 0 | 0 1 0| 1 1 1]
I want to have matrix [1 2 4 5] (the active bins) and [1 2 1 3] (how many 1 they contain).
Can I write this in R without using for loops?

I would do it like this:
a <- c(1,0,0,0,1,1,0,0,0,0,1,0,1,1,1)
m <- 5
idx <- rep(1:m, each=length(a)/m)
# how many ones?
no <- sapply(1:5, function(x) sum(a[idx==x]))
# which bins contain ones?
bins <- 1:m
bins[no>0]

Another approach to obtain the vector with number of ones:
x <- c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1)
n <- length(x)
m <- 5
size <- n/m
x.list <- split(x, cut(seq_along(x)/size, 0:m))
vapply(x.list, sum, 0)
From there, do as jigr does.

Related

Random Sample in R: Limiting Number of Repetitionsout of a 2 digit vector

I have the following vector in R: c(0,1).
I am wishing to randomly sample from this vector 10 elements at a time, but such that no more than 2 elements repeat.
The code I have tried is sample(c(0,1),10,replace=T)
But I would like to get
sample(c(0,1),10,replace=T) = (0,1,1,0,1,1,0,0,1,0)
sample(z,4,replace=T) = (0,1,0,1,0,0,1,0,1,0)
but not
sample(z,4,replace=T) = (1,0,0,0,1,1,0,0,0)
And so on.
How could I accomplish this?
Since the number of repeats can only be 1 or 2, and since the value needs to alternate, you can achieve this in a one-liner by randomly choosing 1 or 2 repeats of each of a sequence of 1s and 0s, and truncating the result to 10 elements.
rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10]
#> [1] 0 0 1 1 0 1 1 0 1 0
If you wish to remove the constraint of the sequence always starting with a zero, you can randomly subtract the result from 1:
abs(sample(0:1, 1) - rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10])
#> [1] 1 1 0 0 1 0 0 1 1 0
foo <- function(){
innerfunc <- function(){sample(c(0, 1), 10, T)}
x <- innerfunc()
while(max(rle(x)$lengths) > 2){
x <- innerfunc()
}
x
}
foo()
This function will look at the max length of a sequence of zeroes and ones. If this is > 2, it reruns your sample function, named innerfunc in here.
I think this is an interesting coding practice if you would like to use recurssions, and below might be an option that gives some hints
f <- function(n) {
if (n <= 2) {
return(sample(c(0, 1), n, replace = TRUE))
}
m <- sample(c(1, 2), 1)
v <- Recall(n - m)
c(v, rep((tail(v, 1) + 1) %% 2, m))
}

R, dplyr: build sets with one non-zero element per column

I have a dataframe of values with thousands of rows and a couple dozen columns. For a given row, R_0, I'd like to iteratively find a complementary row, add it to a set, then find a row complementary to each element in the set. A complementary row is defined as:
if given row has a non-zero value for a column, then the complement must have a zero value for that column
The end result should be a set of SKUs whose combination should result in as few zero-valued columns as possible.
To illustrate, here is a toy dataframe (code at bottom):
sku p1_prop p2_prop p3_prop p4_prop p5_prop rowTally
1 1 0 0 0 0.1634774 0 1
2 2 0.1617101 0.1700415 0 0 0 2
3 3 0 0 0 0 0.1385715 1
4 4 0 0 0.1785431 0 0.1399401 2
5 5 0.1682469 0 0 0 0 1
totalDollarSales totalUnitSales dollarsPerRobot
1 386175.48 482131.9 0.80097474
2 13488.99 599605.9 0.02249643
3 382449.72 493592.0 0.77482973
4 869703.88 186299.0 4.66832335
5 340414.96 827390.6 0.41143200
I want a function that accepts the first SKU in the set as an input and finds all complementary elements to the set.
For example, I need a function f:
f(df=A, sku=1, rowTallyThreshold)
Process iteratively adds a SKU that is complementary to the existing set. If rowTallyThreshold = 3, then all rows where rowTally<=3 can be added to the set:
[1] -> [1, 2] -> [1, 2, 3]
[1] -> [1, 2] -> [1, 2, 4]
If 'rowTallyThreshold` = 1, then all rows where rowTally<=1, or rows 1, 3 and 5, may potentially be added to the set:
[1] -> [1, 3] -> [1, 3, 5]
The resulting output should be all sets possible.
Code to generate MWE:
set.seed(1)
a = runif(n=25, min=0, max=0.18); a[a<0.13] = 0
A = as.data.frame(matrix(a, nrow=5, ncol=5, byrow = TRUE))
A$rowTally <- rowSums(A != 0);
A$sku <- seq(from = 1, to = 5)
A$totalDollarSales <- runif(n=5, min=1*10^2, max=1*10^6)
A$totalUnitSales <- runif(n=5, min=1*10^2, max=1*10^6)
names(A) <- c("p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "rowTally", "sku", "totalDollarSales", "totalUnitSales")
A <- A[c("sku", "p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "rowTally", "totalDollarSales", "totalUnitSales")]
A$dollarsPerRobot <- A$totalDollarSales/A$totalUnitSales
How about this:
library(tidyverse)
## y matches to x iff y is zero when x is not zero
is_match <- function(x, y) {
all((x != 0 & y == 0) | (x == 0))
}
## Find complement skus of sku
find_matches <- function(df, sku, rowTallyThreshold, vars) {
## Vector of main sku
main_sku <- as.numeric(df[df$sku == sku, vars])
## Potential candidates
potential <- df %>%
filter(rowTally <= rowTallyThreshold)
## Indices of matches
match_idx <- apply(potential[vars], 1, function(y){is_match(main_sku, y)})
## Skus of matches
potential$sku[match_idx]
}
find_matches(A, 1, 3, c("p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop"))

Vectorize loop with repeating indices

I have a vector of indices that contains repeating values:
IN <- c(1, 1, 2, 2, 3, 4, 5)
I would like to uses these indices to subtract two vectors:
ST <- c(0, 0, 0, 0, 0, 0, 0)
SB <- c(1, 1, 1, 1, 1, 1, 1)
However, I would like to do the subtraction in "order" such that after subtraction of the first index values (0, 1), the second substraction would "build off" the first subtraction. I would like to end up with a vector FN that looks like this:
c(-2, -2, -1, -1, -1, 0, 0)
This is easy enough to do in a for loop:
for(i in seq_along(IN)){
ST[IN[i]] <- ST[IN[i]] - SB[IN[i]]
}
But I need to run this loop many times on long vectors and this can take many hours. Is there any way to vectorize this task and avoid a for loop? Maybe using a data.table technique?
Sure, with data.table, it's
library(data.table)
DT = data.table(ST)
mDT = data.table(IN, SB)[, .(sub = sum(SB)), by=.(w = IN)]
DT[mDT$w, ST := ST - mDT$sub ]
ST
1: -2
2: -2
3: -1
4: -1
5: -1
6: 0
7: 0
Or with base R:
w = sort(unique(IN))
ST[w] <- ST[w] - tapply(SB, IN, FUN = sum)
# [1] -2 -2 -1 -1 -1 0 0
Here is an option using aggregate in base R:
ag <- aggregate(.~IN, data.frame(IN, ST[IN]-SB[IN]), sum)
replace(ST, ag[,1], ag[,2])
#[1] -2 -2 -1 -1 -1 0 0
OR using xtabs:
d <- as.data.frame(xtabs(B~A, data.frame(A=IN, B=ST[IN]-SB[IN])))
replace(ST, d[,1], d[,2])

R: recode previous/following n observations

I have a dataframe of 0/1 dummy variables. Each dummy variable only takes the value 1 once. For each column, I would want to replace n preceding/following observations counting from the observation with the value 1 to a particular value (say 1).
So for single vector, with n=1:
c(0, 0, 1, 0, 0)
I would want to get
c(0, 1, 1, 1, 0)
What would be a good general approach with n columns and allowing for a different number of preceding/following observations to replace (e.g n-1 before & n after)?
Thanks for help!
x<-c(0,0,1,0,0)
ind<-which(x==1)
x[(ind-1):(ind+x)]<-1
Another option:
f <- function(x, pre, post) {
idx <- which.max(x)
x[max(1, (idx-pre)):min(length(x), (idx+post))] <- 1
x
}
Sample data:
df <- data.frame(x = c(0, 0, 1, 0, 0), y = c(0, 1, 0, 0, 0))
Application:
df[] <- lapply(df, f, pre=2, post=1)
#df
# x y
#1 1 1
#2 1 1
#3 1 1
#4 1 0
#5 0 0
What you can do is the following:
vec <- c(0, 0, 1, 0, 0)
sapply(1:length(vec), function(i) {
minval <- max(0, i - 1)
maxval <- min(i + 1, length(vec))
return(sum(vec[minval:maxval]))
})
# [1] 0 1 1 1 0
Or to put it in a function (same code but a bit more compact)
f <- function(vec){
sapply(1:length(vec), function(i)
sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
f(vec)
# [1] 0 1 1 1 0
Speedtest
To compare the two different solutions, I quickly ran a benchmark using microbenchmark, and the winner is: Clearly #Shenglin's code.... Always nice to see simple solutions (as well as to see how complicated some (my) solutions can be).
fDavid <- function(vec){
sapply(1:length(vec), function(i)
sum(vec[max(0, i-1):min(i+1, length(vec))]))
}
fHeroka <- function(vec){
res <- vec
test <- which(vec==1)
#create indices to be replaced
n=1 #variable n
replace_indices <- c(test+(1:n),test-(1:n))
#filter out negatives (may happen with larger n)
replace_indices <- replace_indices[replace_indices>0]
#replace items in 'res' that need to be replaced with 1
res[replace_indices] <- 1
}
fShenglin <- function(vec){
ind<-which(vec==1)
vec[(ind-1):(ind+x)]<-1
}
vect <- sample(0:1, size = 1000, replace = T)
library(microbenchmark)
microbenchmark(fHeroka(vect), fDavid(vect), fShenglin)
# # Unit: nanoseconds
# expr min lq mean median uq max
# fHeroka(vect) 38929 42999 54422.57 49546 61755.5 145451
# fDavid(vect) 2463805 2577935 2875024.99 2696844 2849548.5 5994596
# fShenglin 0 0 138.63 1 355.0 1063
# neval cld
# 100 a
# 100 b
# 100 a
# Warning message:
# In microbenchmark(fHeroka(vect), fDavid(vect), fShenglin) :
# Could not measure a positive execution time for 30 evaluations.
This might be a start:
myv <- c(0, 0, 1, 0, 0)
#make a copy
res <- myv
#check where the ones are
test <- which(myv==1)
#create indices to be replaced
n=1 #variable n
replace_indices <- c(test+(1:n),test-(1:n))
#filter out negatives (may happen with larger n)
replace_indices <- replace_indices[replace_indices>0]
#replace items in 'res' that need to be replaced with 1
res[replace_indices] <- 1
res
> res
[1] 0 1 1 1 0
This could be a solution:
dat<-data.frame(x=c(0,0,1,0,0,0),y=c(0,0,0,1,0,0),z=c(0,1,0,0,0,0))
which_to_change<-data.frame(prev=c(2,2,1),foll=c(1,1,3))
for(i in 1:nrow(which_to_change)){
dat[(which(dat[,i]==1)-which_to_change[i,1]):(which(dat[,i]==1)+which_to_change[i,2]),i]<-1
}

Remove zeros in the start and end of a vector

I have a vector like this:
x <- c(0, 0, 0, 0, 4, 5, 0, 0, 3, 2, 7, 0, 0, 0)
I want to keep only the elements from position 5 to 11. I want to delete the zeroes in the start and end. For this vector it is quite easy since it is small.
I have very large data and need something in general for all vectors.
Try this:
x[ min( which ( x != 0 )) : max( which( x != 0 )) ]
Find index for all values that are not zero, and take the first -min and last - max to subset x.
You can try something like:
x=c(0,0,0,0,4,5,0,0,3,2,7,0,0,0)
rl <- rle(x)
if(rl$values[1] == 0)
x <- tail(x, -rl$lengths[1])
if(tail(rl$values,1) == 0)
x <- head(x, -tail(rl$lengths,1))
x
## 4 5 0 0 3 2 7
Hope it helps,
alex
This would also work :
x[cumsum(x) & rev(cumsum(rev(x)))]
# [1] 4 5 0 0 3 2 7
I would probably define two functions, and compose them:
trim_leading <- function(x, value=0) {
w <- which.max(cummax(x != value))
x[seq.int(w, length(x))]
}
trim_trailing <- function(x, value=0) {
w <- which.max(cumsum(x != value))
x[seq.int(w)]
}
And then pipe your data through:
x %>% trim_leading %>% trim_trailing

Resources