R, dplyr: build sets with one non-zero element per column

I have a dataframe of values with thousands of rows and a couple dozen columns. For a given row, R_0, I'd like to iteratively find a complementary row, add it to a set, then find a row complementary to each element in the set. A complementary row is defined as:
if given row has a non-zero value for a column, then the complement must have a zero value for that column
The end result should be a set of SKUs whose combination leaves as few zero-valued columns as possible.
To illustrate, here is a toy dataframe (code at bottom):
  sku   p1_prop   p2_prop   p3_prop   p4_prop   p5_prop rowTally
1   1         0         0         0 0.1634774         0        1
2   2 0.1617101 0.1700415         0         0         0        2
3   3         0         0         0         0 0.1385715        1
4   4         0         0 0.1785431         0 0.1399401        2
5   5 0.1682469         0         0         0         0        1
  totalDollarSales totalUnitSales dollarsPerRobot
1        386175.48       482131.9      0.80097474
2         13488.99       599605.9      0.02249643
3        382449.72       493592.0      0.77482973
4        869703.88       186299.0      4.66832335
5        340414.96       827390.6      0.41143200
I want a function that accepts the first SKU in the set as an input and finds all complementary elements to the set.
For example, I need a function f:
f(df=A, sku=1, rowTallyThreshold)
The process iteratively adds a SKU that is complementary to the existing set. If rowTallyThreshold = 3, then all rows where rowTally <= 3 can be added to the set:
[1] -> [1, 2] -> [1, 2, 3]
[1] -> [1, 2] -> [1, 2, 4]
If rowTallyThreshold = 1, then only rows where rowTally <= 1 (rows 1, 3 and 5) may be added to the set:
[1] -> [1, 3] -> [1, 3, 5]
The resulting output should be all sets possible.
Code to generate MWE:
set.seed(1)
a = runif(n=25, min=0, max=0.18); a[a<0.13] = 0
A = as.data.frame(matrix(a, nrow=5, ncol=5, byrow = TRUE))
A$rowTally <- rowSums(A != 0);
A$sku <- seq(from = 1, to = 5)
A$totalDollarSales <- runif(n=5, min=1*10^2, max=1*10^6)
A$totalUnitSales <- runif(n=5, min=1*10^2, max=1*10^6)
names(A) <- c("p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "rowTally", "sku", "totalDollarSales", "totalUnitSales")
A <- A[c("sku", "p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "rowTally", "totalDollarSales", "totalUnitSales")]
A$dollarsPerRobot <- A$totalDollarSales/A$totalUnitSales

How about this:
library(tidyverse)
## y matches to x iff y is zero when x is not zero
is_match <- function(x, y) {
  all((x != 0 & y == 0) | (x == 0))
}
## Find complement skus of sku
find_matches <- function(df, sku, rowTallyThreshold, vars) {
  ## Vector of main sku
  main_sku <- as.numeric(df[df$sku == sku, vars])
  ## Potential candidates
  potential <- df %>%
    filter(rowTally <= rowTallyThreshold)
  ## Indices of matches
  match_idx <- apply(potential[vars], 1, function(y){is_match(main_sku, y)})
  ## Skus of matches
  potential$sku[match_idx]
}
find_matches(A, 1, 3, c("p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop"))
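The function above finds the SKUs complementary to a single row; the question also asks for sets that are grown step by step. The following is only a sketch of one way to do that, not part of the original answer: it recursively extends a set while tracking which columns are already covered, and records a set once no remaining candidate is complementary to it. The names build_sets, grow and toy are hypothetical, and toy uses made-up values that only reproduce the non-zero pattern of the question's example.

```r
# Hypothetical toy data: same non-zero pattern as the example, made-up values
toy <- data.frame(
  sku     = 1:5,
  p1_prop = c(0, 0.16, 0, 0, 0.17),
  p2_prop = c(0, 0.17, 0, 0, 0),
  p3_prop = c(0, 0, 0, 0.18, 0),
  p4_prop = c(0.16, 0, 0, 0, 0),
  p5_prop = c(0, 0, 0.14, 0.14, 0)
)
vars <- paste0("p", 1:5, "_prop")
toy$rowTally <- rowSums(toy[vars] != 0)

build_sets <- function(df, sku, rowTallyThreshold, vars) {
  nz    <- as.matrix(df[vars]) != 0          # TRUE where a column is non-zero
  start <- which(df$sku == sku)              # row of the starting SKU
  cand  <- setdiff(which(df$rowTally <= rowTallyThreshold), start)
  out   <- list()

  # set: row indices chosen so far; used: columns already covered;
  # from: last row added (rows are added in increasing order to avoid duplicates)
  grow <- function(set, used, from) {
    clash <- vapply(cand, function(i) any(nz[i, ] & used), logical(1))
    ok <- setdiff(cand[!clash], set)
    if (length(ok) == 0) {
      # nothing complementary is left, so the set cannot be extended
      out[[length(out) + 1]] <<- df$sku[set]
      return(invisible(NULL))
    }
    for (i in ok[ok > from]) grow(c(set, i), used | nz[i, ], i)
  }

  grow(start, nz[start, ], 0)
  out
}

build_sets(toy, sku = 1, rowTallyThreshold = 1, vars = vars)
build_sets(toy, sku = 1, rowTallyThreshold = 3, vars = vars)
```

With a threshold of 1 this returns the single set 1, 3, 5 from the question. With a threshold of 3 it returns every set that cannot be extended further, which includes 1, 2, 3 and 1, 2, 4. The recursion visits every complementary subset, so it may need pruning on a data frame with thousands of rows.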

Related

For and If in R data programming

I want to evaluate the distance between non-zero data points. For example, if I have 50 data points and only the first and last are non-zero, I want the result to be 49.
For example, my data is:
1. 0
2. 0
3. 5
4. 6
5. 0
6. 1
7. 0
Based on the data above, I want to get 4 variables:
v0 = 3 (because the distance from the 0th to the 3rd data point is 3 jumps)
v1 = 1 (because the distance from the 3rd to the 4th data point is 1 jump)
v2 = 2 (because the distance from the 4th to the 6th data point is 2 jumps)
v3 = 1 (because the distance from the 6th to the 7th data point is 1 jump)
This is my code:
data=c(0,0,5,6,0,1,0)
t=1
for (i in data) {
if (i == 0) {
t[i]=t+1
}
else {
t[i]=1
}
}
t
The result is:
[1] 1 NA NA NA 1 1
Could you help me figure out this problem? I would also like the code to use some kind of loop, so that it can be applied to any other data.
The general rule is not clear from the question but if x is the input we assume that:
the input is non-negative
the first element of the output is the position of the first positive element in x
subsequent elements of the output are the distances between successive positive elements of x
if that results in a vector whose sum is less than length(x) append the remainder
To do that determine the positions of the positive elements of c(1, x), calculate the differences between successive elements in that reduced vector using diff and then if they don't sum to length(x) append the remainder.
dists <- function(x) {
  d <- diff(which(c(1, x) > 0))
  if (sum(d) < length(x)) c(d, length(x) - sum(d)) else d
}
# distance to 5 is 3 and then to 6 is 1 and then to 1 is 2 and 1 is left
x1 <- c(0, 0, 5, 6, 0, 1, 0)
dists(x1)
## [1] 3 1 2 1
# distance to first 1 is 1 and from that to second 1 is 3
x2 <- c(1, 0, 0, 1)
dists(x2)
## [1] 1 3
Here it is redone using a loop:
dists2 <- function(x) {
  pos <- 0
  out <- numeric(0)
  for(i in seq_along(x)) {
    if (x[i]) {
      out <- c(out, i - pos)
      pos <- i
    }
  }
  if (sum(out) < length(x)) out <- c(out, length(x) - sum(out))
  out
}
dists2(x1)
## [1] 3 1 2 1
dists2(x2)
## [1] 1 3
Updates
Simplification based on comments below answer. Added loop approach.
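As a side note on why the loop in the question produces NA values: for (i in data) iterates over the values of data (0, 0, 5, ...), not over positions, so t[i] is indexed by value. Assigning t[5] extends t and pads it with NA, and assigning t[0] does nothing. Below is a minimal sketch that keeps the question's running-counter idea but loops over positions; the name gaps is hypothetical and the handling of trailing zeros follows the same assumption as dists above.

```r
gaps <- function(x) {
  out <- numeric(0)
  t <- 0                      # jumps counted since the last non-zero value
  for (i in seq_along(x)) {   # loop over positions, not values
    t <- t + 1
    if (x[i] != 0) {
      out <- c(out, t)        # record the distance and restart the counter
      t <- 0
    }
  }
  if (t > 0) out <- c(out, t) # trailing zeros: append the remainder
  out
}

gaps(c(0, 0, 5, 6, 0, 1, 0))
## [1] 3 1 2 1
```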

What solves my problem: Map, reduce or a recursion?

I really need some help to write a recursion in R.
The function that I want changes a certain observation according to a set of comparisons between different rows in a data frame, which I shall call g. One of these comparisons depends on the previous value of this same observation.
Suppose first that I want to update the value of column index, row i in my data df in the following way:
j <- 1:4
g <- (df$dom[i] > 0 &
        abs(df$V2009[i] - df$V2009[j]) <= w) |
  df$index[i] == df$index[j]
df$index[i] <- ifelse(any(g), which(g)[[1]], df$index[[i]])
The thing is, the object w is actually a list:
w = list(0, 1, 2, df$age[i])
So, as you can see, I want to create a function foo() that updates df$index iteratively. It changes it by looping through w and comparisons depend on updated values.
Here is some data:
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
I am not sure if a recursive function is actually needed or if something like reduce or map would do it.
Thank you!
The following function uses a double for loop to change the values of column index according to the condition defining g. It accepts a data.frame as input and returns the updated data.frame.
foo <- function(x){
  change_index <- function(x, i, w){
    j <- seq_len(nrow(x))
    (x$dom[i] > 0 & abs(x$V2009[i] - x$V2009[j]) <= w) |
      x$index[i] == x$index[j]
  }
  for(i in seq_len(nrow(x))){
    W <- list(0, 1, 2, x$age[i])
    for(w in W){
      g <- change_index(x, i, w)
      if(any(g)) x$index[i] <- which(g)[1]
    }
  }
  x
}
foo(df)
#  dom V2009 index age
#1   0     9     1   2
#2   0    11     2   2
#3   6     9     1   2
#4   6    11     1   2
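Since the question asks whether reduce would do, here is a sketch of the same update with the inner loop over w replaced by Reduce(). It is meant to mirror foo() above rather than improve on it; the names foo_reduce and step are hypothetical. The outer loop over rows is kept because later rows depend on the index values already updated for earlier rows.

```r
df <- data.frame(dom = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age = c(2, 2, 2, 2))

foo_reduce <- function(x) {
  for (i in seq_len(nrow(x))) {
    # One update of row i's index for a single threshold w;
    # idx is the value carried over from the previous threshold
    step <- function(idx, w) {
      index <- replace(x$index, i, idx)
      g <- (x$dom[i] > 0 & abs(x$V2009[i] - x$V2009) <= w) | idx == index
      if (any(g)) which(g)[1] else idx
    }
    # Fold over the thresholds, starting from the current index of row i
    x$index[i] <- Reduce(step, list(0, 1, 2, x$age[i]), x$index[i])
  }
  x
}

foo_reduce(df)
```

On the example data this gives the same index column as foo(df), namely 1, 2, 1, 1.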
One can define w inside a function and use lexical scoping (closure).
Using your instructions, the function index_value calculates for any given i the index value.
correct_index_col returns the corrected df.
df <- data.frame(dom = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age = c(2, 2, 2, 2))
index_value <- function(df, i) {
  j <- nrow(df)
  w <- c(0, 1, 2, df$age[i])
  g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w) |
    df$index[i] == df$index[j]
  ifelse(any(g), which(g)[[1]], df$index[[i]])
}
correct_index_col <- function(df) {
  indexes <- Vectorize(function(i) {
    index_value(df, i)
  })
  df$index <- indexes(1:nrow(df))
  df
}
# > correct_index_col(df)
#   dom V2009 index age
# 1   0     9     1   2
# 2   0    11     1   2
# 3   6     9     3   2
# 4   6    11     1   2
If you want to really update (mutate) your df, then you have to do
df <- correct_index_col(df).
Here is an attempt of my own. I guess I figured out a way to use recursion over mutate:
test <- function(i, df, k){
  j <- 1:nrow(df)
  w <- list(0, 1, 2, df$age[i])
  g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w[k]) |
    df$index[i] == df$index[j]
  l <- ifelse(any(g), which(g)[1], df$index[i])
  return(l)
}
loop <- function(data,
                 k = 1) {
  data <- data %>%
    mutate(index = map_dbl(seq(n()),
                           ~ test(.x, df = cur_data(), k)))
  if (k == 4) {
    return(data)
  } else {
    return(loop(data, k + 1))
  }
}
df %>% loop()
I welcome any comments, in case this is inefficient for large datasets.

Generating a random vector in R with a particular sum

My aim is to create a vector of even length that sums to 0, with as many entries equal to -x as entries equal to x.
I created a function that takes x as an input.
In it I use a sample of the vector positions, but in the end it doesn't work.
vector<-function(x){
for(i in length(sample)){
if(i %% 2!=0){
output[sample[i]]<-(-x)
}
if(i %% 2 ==0){
output[sample[i]]<-x
}
}
return(output)
}
Try this:
vector <- function(x, sample){
  c(rep(x, sample/2), rep(-x, sample/2))
}
print(vector(x = 1, sample = 4))
# [1] 1 1 -1 -1
Edit
If alternating is required:
vector <- function(x, sample){
  c(rbind(rep(-x, sample/2), rep(x, sample/2)))
}
print(vector(x = 1, sample = 4))
# [1] -1 1 -1 1
You can try
foo <- function(x, sample){
  a <- sample(sample, x/2, replace = TRUE)
  c(a,-a)
  # or alternating
  # c(rbind(a,-a))
}
set.seed(123)
foo(4, 1:10)
# [1] 3 8 -3 -8
According to the title you are looking for a random vector. In that case you can simply first generate an ordered vector with the desired properties and then use sample to shuffle it:
f <- function(x, size){
  sample(c(rep(x, size/2), rep(-x, size/2), if(size %% 2 != 0) 0))
}
f(x = 1, size = 6)
#> [1] 1 -1 -1 1 -1 1
f(x = 1, size = 7)
#> [1] 0 -1 -1 1 -1 1 1
Edit: Now the function even allows for an odd size.
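For comparison, the loop from the question can also be made to work. In the original, output is never created, sample refers to the base function rather than a vector, and for (i in length(sample)) runs only once. The sketch below keeps the odd/even idea and assumes the intent was to walk through a random permutation of the positions; the name make_vector and the argument n are hypothetical.

```r
make_vector <- function(x, n) {
  stopifnot(n %% 2 == 0)      # an even length is needed for the sum to be 0
  output <- numeric(n)
  pos <- sample(n)            # random permutation of the positions 1..n
  for (i in seq_along(pos)) {
    # odd steps write -x, even steps write x, so both occur n/2 times
    output[pos[i]] <- if (i %% 2 != 0) -x else x
  }
  output
}

v <- make_vector(x = 1, n = 6)
sum(v)    # 0
table(v)  # three entries each of -1 and 1
```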

Find closest value with condition

I have a function that finds me the nearest values for each row in a matrix. It then reports a list with an index of the nearest rows. However, I want it to exclude values if they are +1 in the first AND +1 in the second column away from a particular set of values (-1 in the first and -1 in the second column should also be removed). Moreover, +1 in first column and -1 in second column with respect to the values of interest should also be avoided.
As an example, if I want things closest to c(2, 1), it should accept c(3, 1), c(2, 2) or c(1, 1), but NOT c(3, 2) and not c(1, 0).
Basically, for an output to be reported either column 1 or column 2 should be a value of 1 away from a row of interest, but not both.
The input looks like this:
x
     v1 v2
[1,]  3  1
[2,]  2  1
[3,]  3  2
[4,]  1  2
[5,]  8  5
myfunc(x)
The output looks like this. Notice that the rows reported as closest to row 2 ($V2 in the output) are rows 1, 3 and 4. The answer should only be 1 though.
$V1
[1] 2 3
$V2
[1] 1 3 4
$V3
[1] 1 2
$V4
[1] 2
$V5
integer(0)
Here is myfunc
myfunc = function(t){
  d1 <- dist(t[,1])
  d2 <- dist(t[,2])
  dF <- as.matrix(d1) <= 1 & as.matrix(d2) <= 1
  diag(dF) <- NA
  colnames(dF) <- NULL
  dF2 <- lapply(as.data.frame(dF), which)
  return(dF2)
}
Basically, the rows that you want to find should differ from your reference element by +1 or -1 in one column and be identical in the other column. That means that the sum over the absolute values of the differences is exactly one. For your example c(2, 1), this works as follows:
c(3, 1): difference is c(1, 0), thus sum(abs(c(1, 0))) = 1 + 0 = 1
c(1, 1): difference is c(-1, 0), thus sum(abs(c(-1, 0))) = 1 + 0 = 1
etc.
The following function checks exactly this:
myfunc <- function(x) {
  do_row <- function(r) {
    r_mat <- matrix(rep(r, length = length(x)), ncol = ncol(x), byrow = TRUE)
    abs_dist <- abs(r_mat - x)
    return(which(rowSums(abs_dist) == 1))
  }
  return(apply(x, 1, do_row))
}
do_row() does the job for a single row, and then apply() is used to do this with each row. For your example, I get:
myfunc(x)
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
Using sweep(), one can write a shorter function:
myfunc2 <- function(x) {
  apply(x, 1, function(r) which(rowSums(abs(sweep(x, 2, r))) == 1))
}
But this seems harder to understand, and it turns out to be slower by about a factor of two for your matrix x. (I have also tried it with a large matrix, and there the efficiency seems about the same.)
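The sum of absolute differences used above is the Manhattan distance, so the original dist()-based function can also be adapted directly. This is only a sketch along those lines, assuming integer-valued columns as in the example so that testing the distance for equality with 1 is safe; the name myfunc3 is hypothetical.

```r
x <- matrix(c(3, 2, 3, 1, 8,
              1, 1, 2, 2, 5),
            ncol = 2, dimnames = list(NULL, c("v1", "v2")))

myfunc3 <- function(x) {
  # TRUE where two rows differ by exactly 1 in one column and 0 in the other
  d <- as.matrix(dist(x, method = "manhattan")) == 1
  # One vector of matching row numbers per row, always returned as a list
  lapply(seq_len(nrow(x)), function(i) unname(which(d[i, ])))
}

myfunc3(x)
```

Using lapply() here means the result is a list even when every row happens to have the same number of matches, a case in which apply() would simplify its result to a vector or matrix.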

Binary coding of pairwise comparisons

I'm working on a questionnaire where there are always three statements presented at a time and participants have to rank order these according to their preferences (3 = most preferred, 1 = least preferred).
For further analyses I have to transform these rankings into pairwise comparisons within each block of three. Below is code that does this for the first six items (2 blocks) of the questionnaire.
data <- matrix(c(1,2,3,1,2,3,2,1,3,3,1,2),2,6)
i1i2 <- ifelse(data[,1] > data[,2], 1, 0)
i1i3 <- ifelse(data[,1] > data[,3], 1, 0)
i2i3 <- ifelse(data[,2] > data[,3], 1, 0)
i4i5 <- ifelse(data[,4] > data[,5], 1, 0)
i4i6 <- ifelse(data[,4] > data[,6], 1, 0)
i5i6 <- ifelse(data[,5] > data[,6], 1, 0)
result <- cbind(i1i2, i1i3, i2i3, i4i5, i4i6, i5i6)
print(result)
I extended this code to fit a 45 item questionnaire and it works fine. Now, I'd like to write a function which automatically does this job for n items. I experimented with while and for loops but couldn't succeed.
Can anyone please give me a hint/ reference to the relevant functions I need/ an example on how to do this?
Related: Brown, A., & Maydeu-Olivares, A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502.
First off, remove the individual ifelse calls and apply a single ifelse at the end instead:
i1i2 <- data[,1] > data[,2]
i1i3 <- data[,1] > data[,3]
i2i3 <- data[,2] > data[,3]
…
result <- ifelse(cbind(i1i2, i1i3, i2i3, i4i5, i4i6, i5i6), 1, 0)
Next, avoid unnecessary repetition.
three_way_compare = function (data, index) {
  cbind(data[, index + 0] > data[, index + 1],
        data[, index + 0] > data[, index + 2],
        data[, index + 1] > data[, index + 2])
}
result = ifelse(do.call(cbind, lapply(seq(1, ncol(data), by = 3),
                                      three_way_compare, data = data)), 1, 0)
While there are probably more efficient alternatives, you could convert your matrix to a list of vectors of length 3 and apply the ifelse statements to them through a function.
Update:
If you have multiple rows in your matrix, you need to use t(data) inside split() to get the correct values.
# Put data in lists of 3
blocks <- split(t(data), ceiling(seq_along(data)/3))
# Define function
comparison <- function(x) {
  i1 <- ifelse(x[1] > x[2], 1, 0)
  i2 <- ifelse(x[1] > x[3], 1, 0)
  i3 <- ifelse(x[2] > x[3], 1, 0)
  return(cbind(i1,i2,i3))
}
# Apply function to list
lapply(blocks,comparison)
# $`1`
# i1 i2 i3
# [1,] 0 0 1
#
# $`2`
# i1 i2 i3
# [1,] 0 1 1
#
# $`3`
# i1 i2 i3
# [1,] 1 0 0
#
# $`4`
# i1 i2 i3
# [1,] 0 0 1
# Or unlist to get vector
unlist(lapply(blocks,comparison))
# 11 12 13 21 22 23 31 32 33 41 42 43
# 0 0 1 0 1 1 1 0 0 0 0 1
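If the column labels from the question (i1i2, i1i3, ...) should be kept, the pairs can be generated with combn() and the comparison done in one step on the whole matrix. The following is a sketch under the assumption that the number of columns is a multiple of the block size; the name pairwise_code and the block_size argument are hypothetical.

```r
data <- matrix(c(1,2,3,1,2,3,2,1,3,3,1,2), 2, 6)

pairwise_code <- function(data, block_size = 3) {
  stopifnot(ncol(data) %% block_size == 0)
  starts <- seq(1, ncol(data), by = block_size)
  # Column index pairs within each block; rows of `pairs` are first/second item
  pairs <- do.call(cbind, lapply(starts, function(s) {
    combn(seq(s, s + block_size - 1), 2)
  }))
  # Compare all pairs at once; multiplying by 1 turns TRUE/FALSE into 1/0
  out <- (data[, pairs[1, ], drop = FALSE] > data[, pairs[2, ], drop = FALSE]) * 1
  colnames(out) <- paste0("i", pairs[1, ], "i", pairs[2, ])
  out
}

pairwise_code(data)
```

For the example matrix this returns one row per participant with the columns i1i2, i1i3, i2i3, i4i5, i4i6 and i5i6, matching the hand-written version in the question.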
