What solves my problem: Map, reduce or a recursion? - r

I really need some help to write a recursion in R.
The function I want changes a given observation according to a set of comparisons between different rows of a data frame; I will call this set of comparisons g. One of these comparisons depends on the previous value of the same observation.
Suppose first that I want to update the value of column index, row i, of my data frame df in the following way:
j <- 1:4
g <- (df$dom[i] > 0 &
        abs(df$V2009[i] - df$V2009[j]) <= w) |
  df$index[i] == df$index[j]
df$index[i] <- ifelse(any(g), which(g)[[1]], df$index[[i]])
The thing is, the object w is actually a list:
w = list(0, 1, 2, df$age[i])
So, as you can see, I want to create a function foo() that updates df$index iteratively: it loops through w, and the comparisons depend on values that have already been updated.
Here is some data:
df <- data.frame(dom = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age = c(2, 2, 2, 2))
I am not sure if a recursive function is actually needed or if something like reduce or map would do it.
Thank you!

The following function uses a double for loop to change the values of column index according to the condition defining g. It accepts a data.frame as input and returns the updated data.frame.
foo <- function(x){
  change_index <- function(x, i, w){
    j <- seq_len(nrow(x))
    (x$dom[i] > 0 & abs(x$V2009[i] - x$V2009[j]) <= w) |
      x$index[i] == x$index[j]
  }
  for(i in seq_len(nrow(x))){
    W <- list(0, 1, 2, x$age[i])
    for(w in W){
      g <- change_index(x, i, w)
      if(any(g)) x$index[i] <- which(g)[1]
    }
  }
  x
}
foo(df)
#  dom V2009 index age
#1   0     9     1   2
#2   0    11     2   2
#3   6     9     1   2
#4   6    11     1   2

One can define w inside a function and use lexical scoping (a closure).
Following your instructions, the function index_value calculates the index value for any given i.
correct_index_col returns the corrected df.
df <- data.frame(dom = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age = c(2, 2, 2, 2))

index_value <- function(df, i) {
  j <- nrow(df)
  w <- c(0, 1, 2, df$age[i])
  g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w) |
    df$index[i] == df$index[j]
  ifelse(any(g), which(g)[[1]], df$index[[i]])
}

correct_index_col <- function(df) {
  indexes <- Vectorize(function(i) {
    index_value(df, i)
  })
  df$index <- indexes(1:nrow(df))
  df
}
# > correct_index_col(df)
#   dom V2009 index age
# 1   0     9     1   2
# 2   0    11     1   2
# 3   6     9     3   2
# 4   6    11     1   2
#
If you really want to update (mutate) your df, then you have to assign the result back: df <- correct_index_col(df).

Here is an attempt of my own. I guess I figured out a way to use recursion over mutate:
test <- function(i, df, k){
  j <- 1:nrow(df)
  w <- list(0, 1, 2, df$age[i])
  g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w[[k]]) |
    df$index[i] == df$index[j]
  l <- ifelse(any(g), which(g)[1], df$index[i])
  return(l)
}

loop <- function(data, k = 1) {
  data <- data %>%
    mutate(index = map_dbl(seq(n()),
                           ~ test(.x, df = cur_data(), k)))
  if (k == 4) {
    return(data)
  } else {
    return(loop(data, k + 1))
  }
}

df %>% loop()
I welcome any comments in case this turns out to be inefficient on large datasets.
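For what it's worth, the recursion over k can also be expressed as a fold. Here is a minimal sketch with purrr::reduce, reusing the test() helper above (loop_reduce is a hypothetical name, and the sketch assumes dplyr and purrr are loaded; treat it as an untested variant of loop() rather than a drop-in replacement):
library(dplyr)
library(purrr)
loop_reduce <- function(data) {
  # fold over k = 1..4, feeding the updated data frame from one step into the next
  reduce(1:4, function(d, k) {
    d %>% mutate(index = map_dbl(seq(n()), ~ test(.x, df = cur_data(), k)))
  }, .init = data)
}
df %>% loop_reduce()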

Related

How to use a loop with mutate dplyr

I have a question regarding a for-loop within R's dplyr. Imagine I have the following dataframe:
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df$count_x <- cumsum(df$check)
df$count_y <- NA
df$count_y[1] <- ifelse(df$check[1] == 0, 0, 1)
co <- df$count_y[1]
I want to fill the variable count_y using the adjusted cumulative logic below:
for (idx in 2:nrow(df)){
  if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
    co <- 1
    df[idx, 4] <- co
  } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
    co <- co + 1
    df[idx, 4] <- co
  } else if (df[idx, 2] == 0){
    df[idx, 4] <- co
  }
}
The output of this for-loop is correct. However, in my current data set, I have many IDs, and using a for loop to iterate over the IDs will take too much time. I'm trying to use the functionality of dplyr to speed up the process.
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df <- df %>% group_by(id) %>% mutate(count_x = cumsum(check),
                                     count_y = NA) %>% ungroup()
df <- df %>% group_by(id) %>%
  mutate(count_y = replace(count_y, 1, ifelse(check[1] == 0, 0, 1)))
count_n <- function(df){
  co <- df$count_y[1]
  for (idx in 2:nrow(df)){
    if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
      co <- 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
      co <- co + 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 0){
      df[idx, 4] <- co
    }
  }
}
I want to use mutate to call the function count_n to fill count_y as described above. I'm aware that I'm passing just one variable, whereas I have to pass a data frame, since the function relies on the columns 'check' (column 2) and 'count_y' (column 4). I have tried multiple options (mutate_at, all, etc.) but I couldn't make it work. What can I do differently?
df <- df %>% group_by(id) %>% mutate_at(vars(count_y), ~count_n(.))
I think this is the perfect case to use purrr::accumulate2().
purrr::accumulate() is often used to calculate conditional cumulative sums. It takes a function as the second argument. This function should have 2 arguments: the cumulative output co, and the currently evaluated value x.
purrr::accumulate2() allows us to use a second variable to iterate on, and here we use lag(check) as lx. The tricky part is that this second variable should be one item shorter, as it does not matter for the initial value.
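In case accumulate()/accumulate2() are unfamiliar, here is a tiny standalone illustration of how they thread the running value through (toy inputs, unrelated to the data above):
library(purrr)
# accumulate() keeps every intermediate result of the fold:
accumulate(1:5, `+`)
#> [1]  1  3  6 10 15
# accumulate2() also walks a second vector, one element shorter than the first;
# here the running values are 1, 13, 36, 70:
accumulate2(c(1, 2, 3, 4), c(10, 20, 30), function(acc, x, y) acc + x + y)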
Here is the code, matching your expected output.
library(tidyverse)

df = structure(list(id = c(8, 8, 8, 8, 8, 8, 8, 8, 8),
                    check = c(0, 1, 1, 0, 0, 1, 0, 0, 0),
                    count_x = c(0, 1, 2, 2, 2, 3, 3, 3, 3)),
               row.names = c(NA, -9L), class = "data.frame")

df %>%
  mutate(
    count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
      case_when(
        x == 0 ~ co,
        x == 1 & lx == 0 ~ 1,
        x == 1 & lx == 1 ~ co + 1,
        TRUE ~ 999 # error value in case of unexpected input
      )
    })
  )
#>   id check count_x count_y
#> 1  8     0       0       0
#> 2  8     1       1       1
#> 3  8     1       2       2
#> 4  8     0       2       2
#> 5  8     0       2       2
#> 6  8     1       3       1
#> 7  8     0       3       1
#> 8  8     0       3       1
#> 9  8     0       3       1
Created on 2021-05-05 by the reprex package (v2.0.0)
The first issue is that you weren't returning anything from your function. The second issue is that you don't need mutate_at (or even mutate, as would be more appropriate for a single variable) when you're writing a function that modifies the entire tibble. The simplest way to get it working is adding a return statement and running it inline, like so:
count_n <- function(df){
  co <- df$count_y[1]
  for (idx in 2:nrow(df)){
    if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
      co <- 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
      co <- co + 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 0){
      df[idx, 4] <- co
    }
  }
  return(df)
}
df %>% group_by(id) %>% count_n(.)
However, I would use Dan's answer above because it's much cleaner and has the advantage of not running a for loop, which isn't very "R". :)

Multivariate cumulative sum

Assume one wished to calculate a cumulative sum based on a multivariate condition, all(Z[j, ] <= x[i, ]), for every row i of a multivariate grid x. One may obviously implement this naively:
cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
  for(j in seq(nrow(Z))){
    if(all(Z[j, ] <= x[i, ]))
      cSums[i] <- cSums[i] + R[j] # <== R is a single vector to be summed
  }
}
which would be somewhere around O(n^2 * p) for n rows and p columns. One can be slightly faster by iteratively subsetting the columns:
cSums <- numeric(nrow(x))
for(i in seq(nrow(x))){
  indx <- seq(nrow(Z))
  for(j in seq(ncol(Z))){
    indx <- indx[which(Z[indx, j] <= x[i, j])]
  }
  cSums[i] <- sum(R[indx])
}
but in the worst case this is still as slow as the naive implementation. How could one improve this to achieve faster performance, while still allowing an arbitrary number of columns to be compared?
Dummy data and Reproducible example
var1 <- c(3,3,3,5,5,5,4,4,4,6)
var2 <- rep(seq(1,5), each = 2)
Z <- cbind(var1, var2)
x <- Z
R <- rep(1, nrow(x))
# Result using either method.
#[1] 2 2 3 4 6 6 5 5 6 10
outer is your friend: just Vectorize your comparison, and colSums then yields the desired result. Should be fast.
f <- Vectorize(function(k, l) all(Z[k, ] <= x[l, ]))
res <- colSums(outer(1:nrow(Z), 1:nrow(x), f))
res
# [1] 2 2 3 4 6 6 5 5 6 10
Data
x <- Z <- structure(c(3, 3, 3, 5, 5, 5, 4, 4, 4, 6, 1, 1, 2, 2, 3, 3, 4,
                      4, 5, 5), .Dim = c(10L, 2L),
                    .Dimnames = list(NULL, c("var1", "var2")))
We can use apply row-wise to compare every row with every other row and count how many of them satisfy the criteria.
apply(Z, 1, function(x) sum(rowSums(Z <= as.list(x)) == length(x)))
#[1] 2 2 3 4 6 6 5 5 6 10
Similar approach can also be performed using sapply + split
sapply(split(Z, seq_len(nrow(Z))), function(x)
  sum(rowSums(Z <= as.list(x)) == length(x)))
data
var1 <- c(3,3,3,5,5,5,4,4,4,6)
var2 <- rep(seq(1,5), each = 2)
Z <- data.frame(var1, var2)

Applying a Logical Calculation to Two Vectors and Returning the Result in a Third Vector

I'm fairly new to R and am having trouble implementing something that should be very basic. Can someone point me in the right direction?
I need to apply a logical calculation based on the values of two vectors and return the value of that function in a third vector.
I want to do this in a user defined function so I can easily apply this in several other areas of the algorithm and make modifications to the implementation with ease.
Here's what I have tried, but I cannot get this implementation to work. I believe it is because I cannot send vectors as parameters to this function.
calcSignal <- function(fVector, sVector) {
  if(!is.numeric(fVector) || !is.numeric(sVector)) {
    0
  }
  else if (fVector > sVector) {
    1
  }
  else if (fVector < sVector) {
    -1
  }
  else {
    0 # is equal case
  }
}
# set up data frame
df <- data.frame(x=c("NA", 2, 9, 7, 0, 5), y=c(4, 1, 5, 9, 0, "NA"))
# call function
df$z <- calcSignal(df$x, df$y)
I want the output to be a vector with the following values, but I am not implementing the function correctly.
[0,-1,1,-1,0,0]
Can someone help explain how to implement this function to properly perform the logic outlined?
I appreciate your assistance!
There are some misunderstandings in your code:
In R, "NA" is treated as a character (strings are called character in R); the correct form is NA, without quotes.
It is worth noting that data.frame automatically converts character columns to factors, which can be disabled with data.frame(..., stringsAsFactors = FALSE).
Each column of a data.frame has a type, not each element. So when a column contains numbers and NA, the class of that column is numeric, and is.numeric returns TRUE even for the NA elements; is.na is what does the job.
|| only compares the first element of each vector; | does element-wise comparison (see the short illustration after this list).
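To make the last two points concrete, here is a quick illustration (the vectors are just examples):
x <- c(NA, 2, 9)
is.numeric(x)  # TRUE: the whole vector is numeric, even though it contains an NA
is.na(x)       # TRUE FALSE FALSE: this is how to detect the missing elements
c(TRUE, FALSE) | c(FALSE, FALSE)    # element-wise: TRUE FALSE
# c(TRUE, FALSE) || c(FALSE, FALSE) # '||' expects single values: older R silently used
#                                   # only the first elements, newer versions warn or error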
Now let's implement what you wanted:
Implementation 1:
# set up data frame
df <- data.frame(x = c(NA, 2, 9, 7, 0, 5), y = c(4, 1, 5, 9, 0, NA))

calcSignal <- function(f, s){
  if(is.na(f) | is.na(s))
    return(0)
  else if(f > s)
    return(1)
  else if(f < s)
    return(-1)
  else
    return(0)
}

df$z = mapply(calcSignal, df$x, df$y, SIMPLIFY = TRUE)
To run a function on two or more vectors element-wise, we can use mapply.
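As a tiny illustration of that element-wise pairing, separate from the task above:
mapply(function(a, b) a + b, 1:3, 4:6)
#> [1] 5 7 9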
Implementation 2
Not much different from the previous one; here the function is easier to use.
# set up data frame
df <- data.frame(x = c(NA, 2, 9, 7, 0, 5), y = c(4, 1, 5, 9, 0, NA))

calcSignal <- function(fVector, sVector) {
  res = mapply(function(f, s){
    if(is.na(f) | is.na(s))
      return(0)
    else if(f > s)
      return(1)
    else if(f < s)
      return(-1)
    else
      return(0)
  }, fVector, sVector, SIMPLIFY = TRUE)
  return(res)
}

df$z = calcSignal(df$x, df$y)
Implementation 3 (vectorized)
This one is much better, because it is vectorized and much faster:
calcSignal <- function(fVector, sVector) {
  res = rep(0, length(fVector))
  res[fVector > sVector] = 1
  res[fVector < sVector] = -1
  # This line isn't necessary; it's just for clarification.
  res[is.na(fVector) | is.na(sVector)] = 0
  return(res)
}

df$z = calcSignal(df$x, df$y)
Output:
> df
   x  y  z
1 NA  4  0
2  2  1  1
3  9  5  1
4  7  9 -1
5  0  0  0
6  5 NA  0
No need for loopage as ?sign has your back:
# fixing the "NA" issue:
df <- data.frame(x=c(NA, 2, 9, 7, 0, 5), y=c(4, 1, 5, 9, 0, NA))
s <- sign(df$x - df$y)
s[is.na(s)] <- 0
s
#[1] 0 1 1 -1 0 0
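For reference, sign() maps negatives to -1, zeros to 0, and positives to 1, and propagates NA, which is why subtracting the columns does all the work here:
sign(c(-3, 0, 5, NA))
#> [1] -1  0  1 NA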
ifelse is another handy function, though less elegant here than sign:
df <- data.frame(x = c(NA, 2, 9, 7, 0, 5), y = c(4, 1, 5, 9, 0, NA))

cs <- function(x, y){
  a <- x > y
  b <- x < y
  out <- ifelse(a, 1, ifelse(b, -1, 0))
  ifelse(is.na(out), 0, out)
}

cs(df$x, df$y)

Vectorise a function with a supplied variable

I have a function I've been trying to vectorise going from if(){} to ifelse(). It works fine when all the arguments to the function are contained within the data set it is working on, but if I supply an argument as a string, then the vectorisation stops and the first result is used for the whole data set.
Here's an example
# data
dat <- data.frame(var1 = rep(c(0, 1), 4),
                  var2 = c(rep("a", 4), rep("b", 4)))

# function
my_fun <- function(x, y){
  z <- ifelse(y == "a", fun_a(x), fun_b(x))
  return(z)
}

fun_a <- function(x){
  z <- ifelse(x == 0, "zero", x)
  return(z)
}

fun_b <- function(x){
  z <- ifelse(x == 1, "ONE", x)
  return(z)
}

dat$var3 <- my_fun(dat$var1, dat$var2)
This returns what I expect, a vector with a row-wise value based on var1 and var2
> dat
  var1 var2 var3
1    0    a zero
2    1    a    1
3    0    a zero
4    1    a    1
5    0    b    0
6    1    b  ONE
7    0    b    0
8    1    b  ONE
However, I want to use this functions on different data sets where var2 is not included. I realise that an easy way around would be to add var2 as an extra column in the data set, but I don't really want to do that.
This is what happens when I supply var2 as a string:
other_dat <- data.frame(var1 = rep(c(0,1), 4))
other_dat$var3 <- my_fun(other_dat$var1, y = "a")
other_dat
  var1 var3
1    0 zero
2    1 zero
3    0 zero
4    1 zero
5    0 zero
6    1 zero
7    0 zero
8    1 zero
How can I vectorise this function so that it accepts a string argument and returns the result I desire?
You can vectorise y, i.e. make y the same length as x. ifelse returns a result as long as its test, so with a length-one y the whole result collapses to a single value, which is then recycled; once y matches the length of x, ifelse applies fun_a and fun_b element-wise. Revised code:
# data
dat <- data.frame(var1 = rep(c(0, 1), 4),
                  var2 = c(rep("a", 4), rep("b", 4)))

# function
my_fun <- function(x, y){
  if(length(y) == 1) {
    y <- rep(y, length(x))
  }
  z <- ifelse(y == "a", fun_a(x), fun_b(x))
  return(z)
}

fun_a <- function(x){
  z <- ifelse(x == 0, "zero", x)
  return(z)
}

fun_b <- function(x){
  z <- ifelse(x == 1, "ONE", x)
  return(z)
}

dat$var3 <- my_fun(dat$var1, "a")

other_dat <- data.frame(var1 = rep(c(0, 1), 4))
other_dat$var3 <- my_fun(other_dat$var1, y = "a")
other_dat
Hope this helps.
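If y is only ever a single flag in those other data sets, a simpler variant is to branch once instead of recycling. A minimal sketch (my_fun2 is a hypothetical name; it reuses fun_a() and fun_b() from above and is untested against your real data):
my_fun2 <- function(x, y){
  if (length(y) == 1) {
    # scalar y: pick the branch once and vectorise over x only
    if (y == "a") fun_a(x) else fun_b(x)
  } else {
    ifelse(y == "a", fun_a(x), fun_b(x))
  }
}
other_dat$var3 <- my_fun2(other_dat$var1, y = "a")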

Nested for loop in R

I wrote the following code, and I need to repeat it 100 times. I know I need to use another for loop, but I don't know how to do it. Here is the code:
mean <- c(5, 5, 10, 10, 5, 5, 5)
x <- NULL
u <- NULL
delta1 <- NULL
w1 <- NULL

for (i in 1:7) {
  x[i] <- rexp(1, rate = mean[i])
  u[i] <- (1/1.2) * runif(1, min = 0, max = 1)
  y1 <- min(x, u)
  if (y1 == min(x)) {
    delta1 <- 1
  } else {
    delta1 <- 0
  }
  if (delta1 == 0) {
    w1 <- NULL
  } else {
    if (y1 == x[[1]]) {
      w1 <- "x1"
    }
  }
}

output <- cbind(delta1, w1)
output
I want the final output to be a 100-row by 3-column matrix representing the run number, delta1, and w1.
Any thoughts will be truly appreciated.
Here's what I gather you're trying to achieve from your code:
Given two vectors drawn from different distributions (Exponential and Uniform)
Find out which distribution the smallest number comes from
Repeat this 100 times.
There are a couple of problems with your code if you want to achieve this, so here's a cleaned-up example:
rates <- c(5, 5, 10, 10, 5, 5, 5) # 'mean' is an inbuilt function

# Initialise the output data frame:
output <- data.frame(number = rep(0, 100), delta1 = rep(1, 100), w1 = rep("x1", 100))

for (i in 1:100) {
  # Generating u doesn't require a for loop. Additionally, we can bring
  # the (1/1.2) out the front.
  u <- runif(7, min = 0, max = 5/6)
  # Generating x doesn't need a loop either. It's better to use apply functions
  # when you can!
  x <- sapply(rates, function(x) { rexp(1, rate = x) })
  y1 <- min(x, u)
  # Now we can store the output
  output[i, "number"] <- y1
  # Two things here:
  # 1) use all.equal instead of == to compare floating point numbers
  # 2) We initialised the data frame to assume they always came from x,
  #    so we only need to overwrite it where it comes from u.
  if (isTRUE(all.equal(y1, min(u)))) {
    output[i, "delta1"] <- 0
    output[i, "w1"] <- NA # Can't use NULL in a character vector.
  }
}
output
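As a quick aside on point 1) in the comments (why all.equal rather than ==), the issue is easy to see in isolation:
0.1 + 0.2 == 0.3                  # FALSE: each term is rounded in binary floating point
isTRUE(all.equal(0.1 + 0.2, 0.3)) # TRUE: all.equal compares within a small tolerance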
Here's an alternative, more efficient approach with replicate:
Mean <- c(5, 5, 10, 10, 5, 5, 5)
n <- 100 # number of runs

res <- t(replicate(n, {
  x <- rexp(n = length(Mean), rate = Mean)
  u <- runif(n = length(Mean), min = 0, max = 1/1.2)
  mx <- min(x)
  delta1 <- mx <= min(u)
  w1 <- delta1 & mx == x[1]
  c(delta1, w1)
}))

output <- data.frame(run = seq.int(n), delta1 = as.integer(res[ , 1]),
                     w1 = c(NA, "x1")[res[ , 2] + 1])
The result:
head(output)
#   run delta1   w1
# 1   1      1 <NA>
# 2   2      1 <NA>
# 3   3      1 <NA>
# 4   4      1   x1
# 5   5      1 <NA>
# 6   6      0 <NA>
