dplyr mutate based on columns condition and an external vector

I am trying to add a list column to a tibble data frame. The resulting list column is calculated from two columns contained in the data frame and a vector which is external / independent.
Suppose that the data frame and the vector are the following:
library(dplyr)
library(magrittr)
dat <- tibble(A = c(12, 27, 22, 1, 15, 30, 20, 28, 19),
B = c(68, 46, 69, 7, 44, 76, 72, 50, 51))
vec <- c(12, 25, 28, 58, 98)
Now, I would like to add (mutate) the column y so that for each row y is a list containing the elements of vec between A and B (inclusive).
The not-so-proper way to do this would be via a loop. I initialize the column y as a list and update it row-wise based on the condition A <= vec & vec <= B:
dat %<>%
mutate(y = list(vec))
for (i in 1:nrow(dat)){
dat[i,]$y[[1]] <- (vec[dat[i,]$A <= vec & vec <= dat[i,]$B])
}
The result is a data frame with y being a list of dbl of variable length:
> dat
# A tibble: 9 x 3
A B y
<dbl> <dbl> <list>
1 12 68 <dbl [4]>
2 27 46 <dbl [1]>
3 22 69 <dbl [3]>
4 1 7 <dbl [0]>
5 15 44 <dbl [2]>
6 30 76 <dbl [1]>
7 20 72 <dbl [3]>
8 28 50 <dbl [1]>
9 19 51 <dbl [2]>
The first four values of y are:
[[1]]
[1] 12 25 28 58
[[2]]
[1] 28
[[3]]
[1] 25 28 58
[[4]]
numeric(0)
Note: the 4-th list is empty, because no value of vec is between A=1 and B=7.
As an intermediate step, I tried getting the subscripts with which, using mutate(y = list(which(A <= vec & vec <= B))), and with a combination of seq and %in%, for instance mutate(y = list(vec %in% seq(A, B))). Both give an error. In any case, I don't need the subscripts; I need a subset of vec.

Create a small helper function with the logic that you want to implement.
return_values_in_between <- function(vec, A, B) {
vec[A <= vec & vec <= B]
}
Then call the function for each row using rowwise():
library(dplyr)
result <- dat %>%
rowwise() %>%
mutate(y = list(return_values_in_between(vec, A, B))) %>%
ungroup()
result
# A tibble: 9 × 3
# A B y
# <dbl> <dbl> <list>
#1 12 68 <dbl [4]>
#2 27 46 <dbl [1]>
#3 22 69 <dbl [3]>
#4 1 7 <dbl [0]>
#5 15 44 <dbl [2]>
#6 30 76 <dbl [1]>
#7 20 72 <dbl [3]>
#8 28 50 <dbl [1]>
#9 19 51 <dbl [2]>
Checking the first 4 values in result$y -
result$y
#[[1]]
#[1] 12 25 28 58
#[[2]]
#[1] 28
#[[3]]
#[1] 25 28 58
#[[4]]
#numeric(0)
#...
#...
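If rowwise() is not wanted, the same subsetting can be sketched in base R with Map(), which walks A and B in parallel and returns a list with one element per row. This is an alternative to the approach above rather than part of the original answer, and a plain data.frame stands in for the tibble here.

```r
# Same data as in the question, stored in a plain data.frame (an assumption
# made so the sketch needs no packages)
dat <- data.frame(A = c(12, 27, 22, 1, 15, 30, 20, 28, 19),
                  B = c(68, 46, 69, 7, 44, 76, 72, 50, 51))
vec <- c(12, 25, 28, 58, 98)

# Map() calls the function once per (A, B) pair and always returns a list,
# so the result can be stored directly as a list column
dat$y <- Map(function(a, b) vec[a <= vec & vec <= b], dat$A, dat$B)

# Number of matches per row
lengths(dat$y)
```

Because Map() never simplifies its result, rows without any match keep an empty numeric vector instead of being dropped.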

With the help of @Ronak Shah, I was able to come up with a solution that doesn't require a dedicated function and also makes sure that vec is pulled from the global environment (in case there is a column named vec in the data frame):
library(tidyverse)
dat |>
rowwise() |>
mutate(y = list(.GlobalEnv$vec[.GlobalEnv$vec >= A & .GlobalEnv$vec <= B])) |>
ungroup()
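To see why the environment matters, the sketch below adds a column that is also called vec. The tibble dat2 and the column names masked and external are hypothetical. It uses the .env pronoun available inside dplyr's data-masking verbs, which plays a similar role to .GlobalEnv$vec when the code runs at top level.

```r
library(dplyr)

vec <- c(12, 25, 28, 58, 98)

# Hypothetical data: the column `vec` shadows the external vector
dat2 <- tibble(A = c(12, 1), B = c(68, 7), vec = c(0, 0))

res <- dat2 |>
  rowwise() |>
  mutate(
    # a bare `vec` resolves to the column first (a single 0 per row)
    masked = list(vec[vec >= A & vec <= B]),
    # `.env$vec` skips the data frame and uses the external vector
    external = list(.env$vec[.env$vec >= A & .env$vec <= B])
  ) |>
  ungroup()

res$external
```

With the bare name, every row compares the column value 0 against A and B, so the list elements come back empty; the qualified version returns the intended subset of the external vector.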

Related

dplyr: Why do some operations work "rowwise" without calling rowwise() and others don't?

I am still trying to figure out how exactly rowwise works in R/dplyr.
For example I have this code:
library(dplyr)
df = data.frame(
group = c("a", "a", "a", "b", "b", "c"),
var1 = 1:6,
var2 = 7:12
)
df %>%
mutate(
concatNotRW = paste0(var1, "-", group), # works on rows
meanNotRW = mean(c(var1, var2)), # does not work on rows
charsNotRW = strsplit(concatNotRW, "-") # works on rows
) %>%
rowwise() %>%
mutate(
concatRW = paste0(var1, "-", group), # all work on rows
meanRW = mean(c(var1, var2)),
charsRW = strsplit(concatRW, "-")
) -> res
The res dataframe looks like this:
group var1 var2 concatNotRW meanNotRW charsNotRW concatRW meanRW chars
<chr> <int> <int> <chr> <dbl> <list> <chr> <dbl> <list>
1 a 1 7 1-a 6.5 <chr [2]> 1-a 4 <chr [2]>
2 a 2 8 2-a 6.5 <chr [2]> 2-a 5 <chr [2]>
3 a 3 9 3-a 6.5 <chr [2]> 3-a 6 <chr [2]>
4 b 4 10 4-b 6.5 <chr [2]> 4-b 7 <chr [2]>
5 b 5 11 5-b 6.5 <chr [2]> 5-b 8 <chr [2]>
6 c 6 12 6-c 6.5 <chr [2]> 6-c 9 <chr [2]>
What I do not understand is why paste0 can take each cell of a row and paste them together (essentially performing a rowwise operation), yet mean can't do that. What am I missing, and are there any rules on what already works rowwise without calling rowwise()? I did not find much info in the rowwise() vignette here: https://dplyr.tidyverse.org/articles/rowwise.html
paste accepts several vectors through its variadic argument (...) and returns a vector as long as its longest input, whereas mean takes a single vector x (its ... is reserved for other arguments such as trim) and returns a single value. Here we need rowMeans. strsplit, in turn, returns a list with one element of split pieces per input string, so it also yields one result per row.
library(dplyr)
df %>%
mutate(
concatNotRW = paste0(var1, "-", group),
meanNotRW = rowMeans(across(c(var1, var2))),
charsNotRW = strsplit(concatNotRW, "-")
)
> mean(c(1:5, 6:10))
[1] 5.5
Note that c() concatenates 1:5 and 6:10, so mean receives one single vector,
whereas
> paste(1:5, 6:10)
[1] "1 6" "2 7" "3 8" "4 9" "5 10"
passes two separate vectors to paste.
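The contrast can be checked on the question's own columns in base R. This is a minimal sketch with var1 and var2 as free-standing vectors rather than data frame columns:

```r
var1 <- 1:6
var2 <- 7:12

# paste0() is elementwise: six inputs give six outputs
pasted <- paste0(var1, "-", var2)
length(pasted)

# mean() collapses the concatenated vector of 12 values into one number
overall_mean <- mean(c(var1, var2))

# rowMeans() averages across columns and keeps one value per row
row_means <- rowMeans(cbind(var1, var2))
```

Inside mutate(), the single value from mean() is recycled to every row, which is why meanNotRW shows 6.5 throughout.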
For splitting the column into two columns, we can use separate
library(tidyr)
df %>%
mutate(
concatNotRW = paste0(var1, "-", group),
meanNotRW = rowMeans(across(c(var1, var2)))) %>%
separate(concatNotRW, into = c("ind", "chars"))
group var1 var2 ind chars meanNotRW
1 a 1 7 1 a 4
2 a 2 8 2 a 5
3 a 3 9 3 a 6
4 b 4 10 4 b 7
5 b 5 11 5 b 8
6 c 6 12 6 c 9
Whether an operation works row by row without rowwise() depends on the function. If the function is vectorized, it operates on the whole column elementwise and doesn't need rowwise. Here, paste is vectorized over its variadic inputs and returns one value per element, whereas mean takes a single vector and reduces it to a single value. A function that checks each value with if/else is not vectorized, because if expects a single logical value. In that case, either use rowwise or wrap the function with Vectorize.
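As a small illustration of the if/else case, the sketch below uses a hypothetical scalar function classify() and wraps it with Vectorize(). Calling classify() directly on a whole column fails in recent R versions because the condition has length greater than one.

```r
# Hypothetical scalar-only function: `if` needs a single TRUE/FALSE
classify <- function(x) {
  if (x > 3) "high" else "low"
}

x <- 1:6

# Vectorize() wraps the function so that it is called once per element
classify_v <- Vectorize(classify)
via_vectorize <- classify_v(x)

# ifelse() is already vectorized and produces the same labels
via_ifelse <- ifelse(x > 3, "high", "low")
```

When a vectorized equivalent such as ifelse() exists, it is usually the simpler choice; Vectorize() and rowwise() are fallbacks for functions that cannot be rewritten that way.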

R: Coding looping constructs

I am looking for some help with achieving the looping requirement below through simplified indexing or apply constructs in R. Doing it with 'for' loops seems computationally complex and inefficient, so I am looking for a way to achieve it efficiently;
The data table reference is as below;
The sequence I am trying to get is all positive and negative number index (row and column) sequences per row as below;
For row 1: 1-4-5, 1-4-8, 1-4-11;
Column 'Occurances' specifies the potential number of sequences per row.
Finally, I am trying to get a data frame similar to below (shown only for first and second rows) with all occurrences with each index on a column;
Any help is highly appreciated. Thank you very much
There are a lot of ways you can do this. To do it in an efficient manner you should probably use base R. The more rows and columns you have to check, the more careful you will need to be with how you code this.
Here are two examples of how you could run it, see which works best for you.
library(purrr)
library(dplyr)
library(tidyr) # provides crossing()
# create table to test code on, n1 x n2 dataframe with a random sample of -1, 0, 1
n1 <- 10
n2 <- 10
to_test <- map(1:n1, ~sample(c(-1, 0, 1), size = c(n2), replace = T)) %>%
`names<-`(seq_along(.)) %>%
bind_cols()
# Split table into a list of rows
to_test_row_list <- split(to_test, 1:nrow(to_test))
# For each item in the list
sub_tables <- mapply(FUN = function(list_in, row_in){
# create a dataframe with the row number in the first column
crossing(row = row_in,
# cross join the indexes of the columns which are greater than and
# less than zero for the other two cols
crossing(data.frame(gt = which(list_in > 0)),
data.frame(lt = which(list_in < 0))))},
# Inputs for the mapply function FUN, the list of rows and the number for each row
list_in = to_test_row_list,
row_in = names(to_test_row_list),
# Do not simplify the result; keep a list of dataframes
SIMPLIFY = F)
# Turn list of tables into one long table
res1 <- bind_rows(sub_tables)
res1
# The same code in one pipe
res2 <- to_test %>%
split(seq_len(nrow(.))) %>%
map2(.x = .,
.y = names(.),
~crossing (data.frame(gt = which(.x > 0)),
data.frame(lt = which(.x < 0))) %>%
mutate(row = .y) %>% select(row, everything())) %>%
bind_rows()
res2
This works in base R; the task is accomplished almost entirely in the first line of code. The rest is just cleaning the output to make it exactly as asked for. Without proper example data (you can use dput(...) to share it) there will certainly be issues with using this code exactly as presented with your data.
new_data <- do.call(rbind,apply(mydata,1, function(x) merge(x[x > 0], x[x < 0]) ))
new_data$from <- sub("X(\\d).*","\\1",row.names(new_data))
new_data <- new_data[,c(3,1,2)]
rownames(new_data) <- c()
sample data:
mydata <- data.frame("1"=c(0,0,0,-45,57,0,0,51,0,0,45,0),"3"=c(4,4,0,5,654,34,-6,65,-37,4,56,56))
mydata <- t(mydata)
output:
> new_data
from x y
1 1 57 -45
2 1 51 -45
3 1 45 -45
4 3 4 -6
5 3 4 -6
6 3 5 -6
7 3 654 -6
8 3 34 -6
9 3 65 -6
10 3 4 -6
11 3 56 -6
12 3 56 -6
13 3 4 -37
14 3 4 -37
15 3 5 -37
16 3 654 -37
17 3 34 -37
18 3 65 -37
19 3 4 -37
20 3 56 -37
21 3 56 -37
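The output above pairs the values themselves. If the column positions are wanted instead, as in the sequences listed in the question (1-4-5, 1-4-8, 1-4-11), a base R sketch along the following lines may help. The matrix m is hypothetical: its first row has a negative value in column 4 and positive values in columns 5, 8, and 11, and its second row is made up for illustration.

```r
# Hypothetical input: two rows, twelve columns
m <- rbind(
  c(0, 0, 0, -45.2, 57, 0, 0, 82.7, 0, 0, 58.7, 0),
  c(3, -2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)

pairs <- do.call(rbind, lapply(seq_len(nrow(m)), function(i) {
  neg <- which(m[i, ] < 0)   # column positions of negative values
  pos <- which(m[i, ] > 0)   # column positions of positive values
  # rows that lack either sign produce no pairs, so skip them
  if (length(neg) == 0 || length(pos) == 0) return(NULL)
  # every negative position combined with every positive position
  data.frame(row = i, expand.grid(neg = neg, pos = pos))
}))

pairs
```

Each row of pairs holds a row index, a negative-value column, and a positive-value column, which corresponds to one sequence in the question's notation.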
Here's a tidyverse approach if you want to keep things nested neatly:
library(tidyverse)
df <- tibble::tribble(
~`1`, ~`2`, ~`3`, ~`4`, ~`5`, ~`6`, ~`7`, ~`8`, ~`9`, ~`10`, ~`11`, ~`12`,
0, 0, 0L, -45.2, 57, 0, 0, 82.7, 0, 0, 58.7, 0,
48.8, 65, 0L, 35.5, 50.8, 42.2, -89.6, 52.8, -45.8, 26.4, 51.1, 85.7,
63.1, 83.3, 0L, 21.5, 60, 0, 0, 69, 0, -84.3, 61, 0
)
df %>%
rownames_to_column(var = "row_idx") %>%
pivot_longer(cols = -row_idx, names_to = "col_idx") %>%
group_by(row_idx) %>%
nest() %>%
mutate(
df_of_pairs = map(data, ~ expand.grid(which(.$value < 0), which(.$value > 0))),
combos = map_int(df_of_pairs, nrow)
)
#> # A tibble: 3 x 4
#> # Groups: row_idx [3]
#> row_idx data df_of_pairs combos
#> <chr> <list> <list> <int>
#> 1 1 <tibble [12 x 2]> <df[,2] [3 x 2]> 3
#> 2 2 <tibble [12 x 2]> <df[,2] [18 x 2]> 18
#> 3 3 <tibble [12 x 2]> <df[,2] [6 x 2]> 6
Created on 2020-05-13 by the reprex package (v0.3.0)
Then if you want to get the list of pairs, simply add %>% unnest(df_of_pairs) to the end of the pipeline:
df %>%
rownames_to_column(var = "row_idx") %>%
pivot_longer(cols = -row_idx, names_to = "col_idx") %>%
group_by(row_idx) %>%
nest() %>%
mutate(
df_of_pairs = map(data, ~ expand.grid(which(.$value < 0), which(.$value > 0))),
combos = map_int(df_of_pairs, nrow)
) %>%
unnest(df_of_pairs)
# A tibble: 27 x 5
# Groups: row_idx [3]
row_idx data Var1 Var2 combos
<chr> <list> <int> <int> <int>
1 1 <tibble [12 x 2]> 4 5 3
2 1 <tibble [12 x 2]> 4 8 3
3 1 <tibble [12 x 2]> 4 11 3
4 2 <tibble [12 x 2]> 7 1 18
5 2 <tibble [12 x 2]> 9 1 18
6 2 <tibble [12 x 2]> 7 2 18
7 2 <tibble [12 x 2]> 9 2 18
8 2 <tibble [12 x 2]> 7 4 18
9 2 <tibble [12 x 2]> 9 4 18
10 2 <tibble [12 x 2]> 7 5 18
# ... with 17 more rows

Subset a vector of lists in R

Let's say I have a vector of lists:
library(tidyverse)
d <- tribble(
~x,
c(10, 20, 64),
c(22, 11),
c(5, 9, 99),
c(55, 67),
c(76, 65)
)
How can I subset this vector so that, for example, I keep only the rows whose lists have a length greater than 2? Here is my unsuccessful attempt using the tidyverse:
filter(d, length(x) > 2)
# A tibble: 5 x 1
x
<list>
1 <dbl [3]>
2 <dbl [2]>
3 <dbl [3]>
4 <dbl [2]>
5 <dbl [2]>
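Use lengths(), since x is a list column. length(x) returns the number of elements in the whole column (5 here), so length(x) > 2 is TRUE for every row, whereas lengths(x) returns the length of each element.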
library(dplyr)
d %>%
filter(lengths(x) > 2)
You can also use base R's subset() together with lengths():
subset(d,lengths(x)>2)
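The difference between the two functions is easy to check on a bare list. This is a minimal sketch using the same values as the x column, without any packages:

```r
x <- list(c(10, 20, 64), c(22, 11), c(5, 9, 99), c(55, 67), c(76, 65))

# length() counts the elements of the list itself
n_elements <- length(x)

# lengths() reports the length of each element
per_element <- lengths(x)

# logical indexing keeps only the elements longer than 2
long_ones <- x[per_element > 2]
```

Here n_elements is a single number, so comparing it with 2 gives one TRUE that filter() recycles to all rows, while per_element gives one comparison per row.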

difference between two comma separated strings

I have the following dataframe yy
fundId Year Qtr StockCurrentQtr StockNextQtr
1 2015 1 1,2,3,4,5 2,3,4,51
1 2015 2 2,3,4,51 7,8,9,4,2
1 2015 3 7,8,9,4,2 NA
2 2015 1 10,11,14 14,16,19
2 2015 2 14,16,19 20,21,45
2 2015 3 20,21,45 NA
I want to know the difference between StockNextQtr and StockCurrentQtr for each row, grouped by fundId, or the difference between successive rows of the column 'StockCurrentQtr' within each fundId
yy <- yy %>%
group_by(fundId) %>%
mutate(StockDiff = apply(yy,2,function(x){
paste(setdiff(unlist(strsplit(x[5], split = ",")), unlist(strsplit(x[4],
split = ","))),collapse = ",")}))
I am getting following error:
Column StockDiff must be length 3 (the group size) or one, not 5
You don't have to use apply here. Just use rowwise, i.e.
library(dplyr)
df %>%
mutate_at(vars(4:5), funs(strsplit(., ','))) %>%
rowwise() %>%
mutate(new = toString(setdiff(StocCurrentQtr, StockNextQtr)))
which gives,
Source: local data frame [6 x 6]
Groups: <by row>
# A tibble: 6 x 6
fundId Year Qtr StocCurrentQtr StockNextQtr new
<int> <int> <int> <list> <list> <chr>
1 1 2015 1 <chr [5]> <chr [4]> 1, 5
2 1 2015 2 <chr [4]> <chr [5]> 3, 51
3 1 2015 3 <chr [5]> <chr [1]> 7, 8, 9, 4, 2
4 2 2015 1 <chr [3]> <chr [3]> 10, 11
5 2 2015 2 <chr [3]> <chr [3]> 14, 16, 19
6 2 2015 3 <chr [3]> <chr [1]> 20, 21, 45
The equivalent in base R,
mapply(function(x, y)toString(setdiff(x, y)), strsplit(df$StocCurrentQtr, ','),
strsplit(df$StockNextQtr, ','))
#[1] "1, 5" "3, 51" "7, 8, 9, 4, 2" "10, 11" "14, 16, 19" "20, 21, 45"
If StockNextQtr is missing, we can create it first and continue in the same manner as before, i.e.
df %>%
group_by(fundId) %>%
mutate(StockNextQtr = lead(StocCurrentQtr)) %>%
mutate_at(vars(4:5), funs(strsplit(., ','))) %>%
rowwise() %>%
mutate(new = toString(setdiff(StocCurrentQtr, StockNextQtr)))
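Note that funs() has been deprecated since dplyr 0.8.0, so the snippets above may warn or fail on current installations. A sketch of the same pipeline with across() follows. It assumes the column is spelled StockCurrentQtr, as in the question's data, and rebuilds the data frame inline so that it is self-contained.

```r
library(dplyr)

# Data from the question; StockNextQtr is derived below with lead()
df <- tibble(
  fundId = c(1L, 1L, 1L, 2L, 2L, 2L),
  Year = 2015L,
  Qtr = rep(1:3, times = 2),
  StockCurrentQtr = c("1,2,3,4,5", "2,3,4,51", "7,8,9,4,2",
                      "10,11,14", "14,16,19", "20,21,45")
)

res <- df %>%
  group_by(fundId) %>%
  mutate(StockNextQtr = lead(StockCurrentQtr)) %>%
  ungroup() %>%
  # across() replaces mutate_at() + funs(): split both columns into lists
  mutate(across(c(StockCurrentQtr, StockNextQtr), ~ strsplit(.x, ","))) %>%
  rowwise() %>%
  mutate(new = toString(setdiff(StockCurrentQtr, StockNextQtr))) %>%
  ungroup()

res$new
```

Selecting the columns by name instead of by position (4:5) also avoids surprises when grouping columns are excluded from the selection.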
I found another way
yy <- yy %>%
  group_by(fundId, Year, Qtr) %>%
  mutate(new = paste(setdiff(unlist(strsplit(StockCurrentQtr, split = ",")),
                             unlist(strsplit(StockNextQtr, split = ","))),
                     collapse = ","))

Within tidyverse, create seq() column based on existing variables

Goal: Within tidyverse, create a sequence column called my_seq. Each seq() number should use existing columns for "from" (x column) and "to" (y column).
Bonus points for self-referential "dot" combo (and explanation of dot grammar).
boo <- tribble(
~ x, ~y,
5, 20,
6, 10,
2, 20)
# Desired results should reflect these results in new column:
seq(5, 20, by = 2)
#> [1] 5 7 9 11 13 15 17 19
seq(6, 10, by = 2)
#> [1] 6 8 10
seq(2, 20, by = 2)
#> [1] 2 4 6 8 10 12 14 16 18 20
# These straightforward solutions do not work
boo %>%
mutate(my_seq = seq(x, y, by = 2))
boo %>%
mutate(my_seq = seq(boo$x, boo$y, by = 2))
# The grammar of self-referential dots is super arcane, but
# here are some additional tries. All fail.
boo %>%
mutate(my_seq = map_int(boo, ~seq(.$x, .$y, by = 2)))
boo %>%
mutate(my_seq = seq(.$x, .$y, by = 2))
With purrr, you can use map2 to loop through x and y in parallel, which is similar to Map/mapply in base R but with a different syntax:
boo %>% mutate(my_seq = map2(x, y, seq, by=2))
# A tibble: 3 x 3
# x y my_seq
# <dbl> <dbl> <list>
#1 5 20 <dbl [8]>
#2 6 10 <dbl [3]>
#3 2 20 <dbl [10]>
my_seq is a column of list type; we can pull the column out to see its content:
boo %>% mutate(my_seq = map2(x, y, seq, by=2)) %>% pull(my_seq)
#[[1]]
#[1] 5 7 9 11 13 15 17 19
#[[2]]
#[1] 6 8 10
#[[3]]
# [1] 2 4 6 8 10 12 14 16 18 20
In general, when there are multiple arguments, pmap can be used as well
library(dplyr)
library(purrr)
res <- boo %>%
mutate(my_seq = pmap(., .f = ~seq(..1, ..2, by = 2)))
res
# A tibble: 3 x 3
# x y my_seq
# <dbl> <dbl> <list>
#1 5.00 20.0 <dbl [8]>
#2 6.00 10.0 <dbl [3]>
#3 2.00 20.0 <dbl [10]>
res$my_seq
#[[1]]
#[1] 5 7 9 11 13 15 17 19
#[[2]]
#[1] 6 8 10
#[[3]]
#[1] 2 4 6 8 10 12 14 16 18 20
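For comparison, here is a sketch of the base R counterpart mentioned above. The direct attempts in the question fail because seq() is not vectorized over from and to; Map() works around that by calling seq() once per pair of values. A plain data.frame stands in for the tibble so the example needs no packages.

```r
boo <- data.frame(x = c(5, 6, 2), y = c(20, 10, 20))

# Map() pairs up x and y elementwise; `by = 2` is recycled for every call
boo$my_seq <- Map(seq, boo$x, boo$y, by = 2)

# One sequence per row, with differing lengths
lengths(boo$my_seq)
boo$my_seq[[2]]
```

The result is a list column, just like the one produced by map2() and pmap().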
