Iteration in R using tidyverse

I am trying to avoid a for loop and instead use the tidyverse for iteration. Specifically, I have a vector of values, and for each value I want to compare it against a single variable in a data frame and create a new prefixed variable. I've tried dplyr::across but am unsuccessful when the vector has length > 1.
Sample code:
library(tidyverse)
library(glue)
data <- data.frame(id = 1:10,
                   y = letters[1:10],
                   z = LETTERS[1:10])
letter_list <- letters[1:10]

var_naming <- function(dat, list){
  dat %>%
    mutate(!!glue("hx_{list}") := ifelse(y == {list}, 1, 0))
}
Code I've tried:
**the correct dimensions of the data frame should be 13 variables and 10 observations**
# data_b outputs the correct number of observations but has 40 variables
data_b <- map(letter_list,
              ~var_naming(data, .x)) %>%
  as.data.frame()

# data_c gives me the correct number of variables but has 100 observations
data_c <- map_df(letter_list,
                 ~var_naming(data, .x))
# error message from data_d when using dplyr::across:
>> Error in `mutate()`:
>> ! Problem while computing `..1 =
>> across(...)`.
>> Caused by error in `across()`:
>> ! All unnamed arguments must be length 1
>> Run `rlang::last_error()` to see where the error occurred.
data_d <- data %>%
  mutate(
    across(
      .cols = y,
      .fns = ~ifelse(y == {letter_list}, 1, 0),
      .names = glue("hx_{letter_list}")
    ))
Desired output:
id y z hx_a hx_b hx_c hx_d hx_e hx_f hx_g hx_h hx_i hx_j
1 a A 1 0 0 0 0 0 0 0 0 0
2 b B 0 1 0 0 0 0 0 0 0 0
3 c C 0 0 1 0 0 0 0 0 0 0
4 d D 0 0 0 1 0 0 0 0 0 0
5 e E 0 0 0 0 1 0 0 0 0 0
6 f F 0 0 0 0 0 1 0 0 0 0
7 g G 0 0 0 0 0 0 1 0 0 0
8 h H 0 0 0 0 0 0 0 1 0 0
9 i I 0 0 0 0 0 0 0 0 1 0
10 j J 0 0 0 0 0 0 0 0 0 1

You were close with the mutate call, but what you ultimately want is a list of functions (one for each letter in letter_list) to pass to .fns. Since they are anonymous functions, name the list elements after letter_list to help across() name the new columns:
myFxs <- map(letter_list, ~function(y) ifelse(y == .x, 1, 0)) %>%
  setNames(letter_list)
For whatever reason, .names seemed to have a problem with the glue character vector (for me, anyway). Since the functions are named for the letters they match against, you can instead use the .fn pronoun to pass a naming template to across():
data %>%
  mutate(
    across(
      .cols = y,
      .fns = myFxs,
      .names = "hx_{.fn}"
    )
  )
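For the sample data this should return the requested 10 rows and 13 columns:
result <- data %>%
  mutate(across(.cols = y, .fns = myFxs, .names = "hx_{.fn}"))

dim(result)   # 10 13
names(result) # "id" "y" "z" "hx_a" "hx_b" ... "hx_j"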

The code can be modified:
- Remove the {} around list on the rhs of :=.
- Use transmute instead of mutate, since mutate returns the whole data by default.
- Once we get the column-bound (_dfc) data from map, attach the original data with bind_cols.
library(dplyr)
library(purrr)
var_naming <- function(dat, list){
  dat %>%
    transmute(!!glue::glue('hx_{list}') := ifelse(y == list, 1, 0))
}
NOTE: list is the name of a base R function that constructs the list data structure. It is better to give function arguments names that do not clash with reserved words or existing function names (see the sketch after the output below).
-testing
map_dfc(letter_list, var_naming, dat = data) %>%
bind_cols(data, .)
-output
id y z hx_a hx_b hx_c hx_d hx_e hx_f hx_g hx_h hx_i hx_j
1 1 a A 1 0 0 0 0 0 0 0 0 0
2 2 b B 0 1 0 0 0 0 0 0 0 0
3 3 c C 0 0 1 0 0 0 0 0 0 0
4 4 d D 0 0 0 1 0 0 0 0 0 0
5 5 e E 0 0 0 0 1 0 0 0 0 0
6 6 f F 0 0 0 0 0 1 0 0 0 0
7 7 g G 0 0 0 0 0 0 1 0 0 0
8 8 h H 0 0 0 0 0 0 0 1 0 0
9 9 i I 0 0 0 0 0 0 0 0 1 0
10 10 j J 0 0 0 0 0 0 0 0 0 1
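Following the NOTE, here is a minimal sketch of the same helper with the argument renamed (the name letter is arbitrary) so that it no longer shadows base R's list(); it is used exactly like var_naming above:
var_naming2 <- function(dat, letter){
  dat %>%
    transmute(!!glue::glue('hx_{letter}') := ifelse(y == letter, 1, 0))
}

map_dfc(letter_list, var_naming2, dat = data) %>%
  bind_cols(data, .)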

Another way to get the same result is with model.matrix(), which builds one indicator column per level of y (the + 0 drops the intercept); the indicator columns are then renamed from y<letter> to hx_<letter>:
data %>%
cbind(model.matrix(~y + 0, .)) %>%
rename_with(~str_replace(., 'y\\B', 'hx_'))
id y z hx_a hx_b hx_c hx_d hx_e hx_f hx_g hx_h hx_i hx_j
1 1 a A 1 0 0 0 0 0 0 0 0 0
2 2 b B 0 1 0 0 0 0 0 0 0 0
3 3 c C 0 0 1 0 0 0 0 0 0 0
4 4 d D 0 0 0 1 0 0 0 0 0 0
5 5 e E 0 0 0 0 1 0 0 0 0 0
6 6 f F 0 0 0 0 0 1 0 0 0 0
7 7 g G 0 0 0 0 0 0 1 0 0 0
8 8 h H 0 0 0 0 0 0 0 1 0 0
9 9 i I 0 0 0 0 0 0 0 0 1 0
10 10 j J 0 0 0 0 0 0 0 0 0 1
If you only want to consider the letters in letter_list:
data %>%
  mutate(y = factor(y, letter_list)) %>%
  cbind(model.matrix(~y + 0, .) %>%
          as_tibble() %>%
          select(paste0('y', letter_list)) %>%
          rename_with(~str_replace(., 'y', 'hx_')))

Related

Calculate the number of times an interaction has been made by the same client_id when a conversion is made

I have a dataset like so:
client_id interaction_1 interaction_2 conversion
A         1             0             0
B         0             1             0
C         0             0             1
A         0             0             1
B         0             1             0
B         0             0             1
C         0             1             0
C         0             0             1
The dataset is already ordered by a timestamp (ascending). Both the interaction and conversion columns are dummies (0/1). For every conversion, I need to count how many times that client_id had each interaction or a conversion, but only counting since the client's last conversion (therefore, the lag_conversion column can never be > 1).
The output should look something like this:
client_id interaction_1 interaction_2 conversion lag_interaction_1 lag_interaction_2 lag_conversion
A         1             0             0          0                 0                 0
B         0             1             0          0                 0                 0
C         0             0             1          0                 0                 0
A         0             0             1          1                 0                 0
B         0             1             0          0                 0                 0
B         0             0             1          0                 2                 0
C         0             1             0          0                 0                 0
C         0             0             1          0                 1                 1
I've tried the code:
for (i in 1:nrow(mydata)) {
  client_id <- mydata$client_id[i]
  if (mydata$conversion[i] == 1) {
    last_conversion_index <- max(which(mydata$client_id == client_id & mydata$conversion == 1 & 1:nrow(mydata) <= i))
    mydata$interaction_1[i:last_conversion_index & mydata$interaction_1 == 1] <- 1
  }
}
although this only results in a 1 in the row itself, meaning that the first row would look like:
client_id interaction_1 interaction_2 conversion lag_interaction_1 lag_interaction_2 lag_conversion
A         1             0             0          1                 0                 0
Any help is much appreciated! Thanks in advance
With dplyr:
library(dplyr) #1.1.0 and above
dat %>%
mutate(across(everything(), ~ ifelse(conversion == 1, cumsum(lag(.x, default = 0)), 0),
.names = "lag_{col}"), .by = client_id)
client_id interaction_1 interaction_2 conversion lag_interaction_1 lag_interaction_2 lag_conversion
1 A 1 0 0 0 0 0
2 B 0 1 0 0 0 0
3 C 0 0 1 0 0 0
4 A 0 0 1 1 0 0
5 B 0 1 0 0 0 0
6 B 0 0 1 0 2 0
7 C 0 1 0 0 0 0
8 C 0 0 1 0 1 1
below 1.1.0:
dat %>%
group_by(client_id) %>%
mutate(across(everything(), ~ ifelse(conversion == 1, cumsum(lag(.x, default = 0)), 0),
.names = "lag_{col}")) %>%
ungroup()
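The key piece is cumsum(lag(.x, default = 0)): within each client it counts how many of the preceding rows carried a 1, and the ifelse() keeps that count only on conversion rows. A small illustration on a single vector:
x <- c(1, 0, 0, 1)                  # one client's flags, in time order
dplyr::lag(x, default = 0)          # 0 1 0 0 -- shift down by one row
cumsum(dplyr::lag(x, default = 0))  # 0 1 1 1 -- running count of earlier 1s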

Turning data.frame into adjacency matrix lists does not show values (in R)

I have a data.frame with 4 columns (sender, receiver, year and value). I want to create a list that contains, for each year, an adjacency matrix of sender and receiver holding the value.
An MVE would be:
df = data.frame(sender = c("a","a","b","c","d","d","d","b","e","e"),
                receiver = c("b","d","a","a","b","a","c","e","c","a"),
                value = 1:10,
                year = c(2000,2000,2001,2002,2002,2002,2003,2003,2003,2004))
I also have a country list that it needs to match against:
country_list = data.frame(country = c("a","b","c","d","e"))
What I have tried looks like this. My problem is that the wrong values end up in the resulting adjacency matrices.
transfer_list <- list()
for (t in 2000:2004){
  matrix <- matrix(0, 5, 5)
  rownames(matrix) <- country_list[1:5, 1]
  colnames(matrix) <- country_list[1:5, 1]
  year = which(df[, 4] == t)
  dyad = df[year, c(1, 2)]
  for (i in 1:dim(dyad)[1]){
    partner1 <- which(country_list[, 1] == dyad[i, 1])
    partner2 <- which(country_list[, 1] == dyad[i, 2])
    matrix[partner1, partner2] <- df[i, 3]
  }
  transfer_list[[t-1999]] = matrix
}
The result for 2004 should show a 10 for the transfer from e to a, but instead it is:
> transfer_list[[5]]
a b c d e
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 0 0 0 0 0
e 1 0 0 0 0
What is my error?
One approach might be the igraph package.
We can use igraph::graph_from_data_frame to create a graph for each year. We can include all of the possible vertices using the vertices = argument. Otherwise, only vertices with edges will be included.
Once we have created the weighted graph, we can use as_adj with attr = "value" to create the adjacency matrix with the values in the matrix. sparse = FALSE gets you 0s in the other positions.
library(igraph)

result <- lapply(unique(df$year), function(x) {
  g <- graph_from_data_frame(df[df$year == x, -4],
                             vertices = unique(country_list))
  as_adj(g, attr = "value", sparse = FALSE)
})
names(result) <- unique(df$year)
result$`2004`
# a b c d e
#a 0 0 0 0 0
#b 0 0 0 0 0
#c 0 0 0 0 0
#d 0 0 0 0 0
#e 10 0 0 0 0
You can try the code below
library(igraph)
library(dplyr)
lapply(
  split(df[setdiff(names(df), "year")], df$year),
  function(x) {
    x %>%
      graph_from_data_frame(vertices = country_list$country) %>%
      as_adj(attr = "value", sparse = FALSE)
  }
)
which gives
$`2000`
a b c d e
a 0 1 0 2 0
b 0 0 0 0 0
c 0 0 0 0 0
d 0 0 0 0 0
e 0 0 0 0 0
$`2001`
a b c d e
a 0 0 0 0 0
b 3 0 0 0 0
c 0 0 0 0 0
d 0 0 0 0 0
e 0 0 0 0 0
$`2002`
a b c d e
a 0 0 0 0 0
b 0 0 0 0 0
c 4 0 0 0 0
d 6 5 0 0 0
e 0 0 0 0 0
$`2003`
a b c d e
a 0 0 0 0 0
b 0 0 0 0 8
c 0 0 0 0 0
d 0 0 7 0 0
e 0 0 9 0 0
$`2004`
a b c d e
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 0 0 0 0 0
e 10 0 0 0 0
The error is in the inner loop: i indexes dyad (only the rows for year t), but df[i, 3] takes the value from row i of the full data frame. Carry the value column along in dyad and index that instead. Change this section:
dyad = df[year, c(1, 2, 3)]
for (i in 1:dim(dyad)[1]){
  partner1 <- which(country_list[, 1] == dyad[i, 1])
  partner2 <- which(country_list[, 1] == dyad[i, 2])
  matrix[partner1, partner2] <- dyad[i, 3]
}
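Putting the fix into the full loop, a lightly cleaned-up sketch of the original code (the matrix variable is renamed mat so it does not shadow matrix()):
transfer_list <- list()
for (t in 2000:2004) {
  mat <- matrix(0, 5, 5)
  rownames(mat) <- country_list[1:5, 1]
  colnames(mat) <- country_list[1:5, 1]
  year <- which(df[, 4] == t)
  dyad <- df[year, c(1, 2, 3)]              # carry the value column along
  for (i in seq_len(nrow(dyad))) {
    partner1 <- which(country_list[, 1] == dyad[i, 1])
    partner2 <- which(country_list[, 1] == dyad[i, 2])
    mat[partner1, partner2] <- dyad[i, 3]   # index the year's rows, not df
  }
  transfer_list[[t - 1999]] <- mat
}
transfer_list[[5]]   # e -> a should now show 10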

Function using conditions

I am trying to create a function that takes a given number of samples (rivers), each with a given number of observations. Given 10 samples, each with 12 observations from a lognormal distribution with mean = 4 and sd = 1.4, I would like to count how many observations reach or exceed a particular value (6, which refers to a standard value for water-quality measurement).
The following is the code for one experiment, where limit is the maximum number of observations allowed to exceed 6.
set.seed(1001)
nobs <- 12
limit <- round(0.10 * nobs, digits = 0)

h2o <- as.data.frame(matrix(rnorm(10*12, mean = 4, sd = 1.4), ncol = 12))
paste(rep("Riv", nrow(h2o)), c(1:nrow(h2o)), sep = "")
rownames(h2o) <- paste(rep("Riv", nrow(h2o)), c(1:nrow(h2o)), sep = "")
colnames(h2o) <- paste(rep("Obs", ncol(h2o)), c(1:ncol(h2o)), sep = "")

# Number of rivers declared impaired, based on the assumption that the number of observations per river is 2 or more?
ifelse(h2o >= 6, 1, 0)
h2o$Test <- rowSums(ifelse(h2o >= 6, 1, 0))
length(h2o$Test[h2o$Test > 1])
The function should wrap up the code above and work for different numbers of observations and samples.
Thanks
Here's a function using dplyr and tidyr from the tidyverse.
library(tidyverse)
test_h2o <- function(data, threshold_quality = 6, limit = 1) {
  table <- data %>%
    rownames_to_column("river") %>%
    gather(observation, value, -river) %>%
    mutate(over_lim = value > threshold_quality)

  table_wide <- table %>%
    select(river, observation, over_lim) %>%
    mutate(over_lim = over_lim %>% as.integer()) %>%
    spread(observation, over_lim)

  summary <- table %>%
    group_by(river) %>%
    summarize(over_lim_count = sum(over_lim))

  result <- summary %>%
    summarize(num_impaired = sum(over_lim_count > limit))

  list(table_wide, summary, result)
}
Here's the output, meant to show the steps in your example:
> test_h2o(h2o)
[[1]]
river Obs1 Obs10 Obs11 Obs12 Obs2 Obs3 Obs4 Obs5 Obs6 Obs7 Obs8 Obs9
1 Riv1 0 0 0 0 0 0 0 0 0 0 0 0
2 Riv10 0 0 0 0 0 0 0 0 0 0 0 0
3 Riv2 0 0 1 0 0 0 0 0 0 0 0 0
4 Riv3 1 0 0 0 0 0 0 0 0 0 0 0
5 Riv4 0 0 0 0 0 0 0 0 0 0 0 1
6 Riv5 0 1 0 0 0 0 0 0 0 0 0 0
7 Riv6 0 1 1 0 0 1 0 0 0 0 0 0
8 Riv7 0 0 0 0 0 0 0 0 0 0 0 0
9 Riv8 0 0 0 0 0 0 0 1 0 1 0 0
10 Riv9 1 0 0 0 0 0 0 0 0 0 0 0
[[2]]
# A tibble: 10 x 2
river over_lim_count
<chr> <int>
1 Riv1 0
2 Riv10 0
3 Riv2 1
4 Riv3 1
5 Riv4 1
6 Riv5 1
7 Riv6 3
8 Riv7 0
9 Riv8 2
10 Riv9 1
[[3]]
# A tibble: 1 x 1
num_impaired
<int>
1 2
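Because the sample size and number of observations are read from the data, the same function can be reused on a different-sized simulation; a sketch (the values are random, so the counts will differ from run to run):
set.seed(42)
h2o_big <- as.data.frame(matrix(rnorm(20 * 24, mean = 4, sd = 1.4), ncol = 24))
rownames(h2o_big) <- paste0("Riv", seq_len(nrow(h2o_big)))
colnames(h2o_big) <- paste0("Obs", seq_len(ncol(h2o_big)))
test_h2o(h2o_big, threshold_quality = 6, limit = round(0.10 * ncol(h2o_big)))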

Using any in nested ifelse statement

data:
library(dplyr)    # needed for mutate() and %>% in the code below
library(magrittr) # needed for %<>%

set.seed(1337)
m <- matrix(sample(c(0,0,0,1), size = 50, replace = TRUE), ncol = 5) %>% as.data.frame()
colnames(m) <- LETTERS[1:5]
code:
m %<>%
  mutate(newcol = ifelse(A==1 & (B==1|C==1) & (D==1|E==1), 1,
                         ifelse(any(A,B,C,D,E), 0, NA)),
         desiredResult = ifelse(A==1 & (B==1|C==1) & (D==1|E==1), 1,
                                ifelse(!(A==0&B==0&C==0&D==0&E==0), 0, NA)))
looks like:
A B C D E newcol desiredResult
1 0 1 1 1 0 0 0
2 0 1 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 0 0 0 0 0 NA
5 0 1 0 1 0 0 0
6 0 0 1 0 0 0 0
7 1 1 1 1 0 1 1
8 0 1 1 0 0 0 0
9 0 0 0 0 0 0 NA
10 0 0 1 0 0 0 0
question
I want newcol to be the same as desiredResult.
Why can't I use any in that "stratified" manner inside ifelse? Is there a function like any that would work in that situation?
possible workaround
I could define a function
any_vec <- function(...) {apply(cbind(...), 1, any)}
but this does not make me smile too much.
As suggested in the answer, using pmax works exactly like a vectorized any:
m %>%
  mutate(pmaxResult = ifelse(A==1 & pmax(B,C) & pmax(D,E), 1,
                             ifelse(pmax(A,B,C,D,E), 0, NA)),
         desiredResult = ifelse(A==1 & (B==1|C==1) & (D==1|E==1), 1,
                                ifelse(!(A==0&B==0&C==0&D==0&E==0), 0, NA)))
Here's an alternative approach. I converted to logical at the beginning and back to integer at the end:
m %>%
  mutate_all(as.logical) %>%
  mutate(newcol = A & pmax(B,C) & pmax(D,E),
         newcol = replace(newcol, !newcol & !pmax(A,B,C,D,E), NA)) %>%
  mutate_all(as.integer)
# A B C D E newcol
# 1 0 1 1 1 0 0
# 2 0 1 0 0 1 0
# 3 0 1 0 0 0 0
# 4 0 0 0 0 0 NA
# 5 0 1 0 1 0 0
# 6 0 0 1 0 0 0
# 7 1 1 1 1 0 1
# 8 0 1 1 0 0 0
# 9 0 0 0 0 0 NA
# 10 0 0 1 0 0 0
I basically replaced the any with pmax.
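A quick illustration of the difference: any() collapses all of its inputs to a single value, while pmax() compares element by element, which is exactly the row-wise OR the ifelse() needs:
any(c(0, 1, 0), c(0, 0, 1))   # TRUE -- one answer for everything
pmax(c(0, 1, 0), c(0, 0, 1))  # 0 1 1 -- element-wise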

merge two rows of n dimension by considering one column

I have a matrix that was formed using cbind():
     c e i k l t
dif  0 0 1 0 0 0
dor  1 0 0 0 0 0
dor  0 0 0 0 0 1
same 0 0 0 1 0 0
same 0 1 0 0 0 0
Can you suggest R code that merges the rows as shown below?
     c e i k l t
same 0 1 0 1 0 0
dif  0 0 1 0 0 0
dor  1 0 0 0 0 1
Thank you..
df<-read.table(header=T,text="ID c e i k l t
dif 0 0 1 0 0 0
dor 1 0 0 0 0 0
dor 0 0 0 0 0 1
same 0 0 0 1 0 0
same 0 1 0 0 0 0")
require(plyr)
ddply(df,.(ID),function(x)colSums(x[,-1]))
ID c e i k l t
1 dif 0 0 1 0 0 0
2 dor 1 0 0 0 0 1
3 same 0 1 0 1 0 0
Using base R's aggregate():
aggregate(df[, -1], list(df[, 1]), function(x) {
Reduce("|", x)
})
# Group.1 c e i k l t
# 1 dif 0 0 1 0 0 0
# 2 dor 1 0 0 0 0 1
# 3 same 0 1 0 1 0 0
Do you want the sum, or do you want the logical OR:
Logical OR:
require(functional)
aggregate(. ~ ID, data=df, FUN=Compose(any, as.numeric))
ID c e i k l t
1 dif 0 0 1 0 0 0
2 dor 1 0 0 0 0 1
3 same 0 1 0 1 0 0
Sum:
aggregate(. ~ ID, data=df, FUN=sum)
The result here is the same, since no ID has more than one 1 in any single column.
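For completeness, a dplyr sketch of the same logical-OR merge (assuming the df from the first answer and a dplyr version with across()):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ as.integer(any(.x == 1))))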
