Count interactions with unique accounts in a financial transaction dataset (R)

I have a question about a dataset with financial transactions:
  Account_from Account_to Value
1            1          2  25.0
2            1          3  30.0
3            2          1  28.0
4            2          3  10.0
5            2          3  12.0
6            3          1  40.0
7            3          1  30.0
8            3          1  20.0
Each row represents a transaction. I would like to create extra columns containing, for each row, the number of unique accounts the sending account interacts with, so that it looks like the following:
  Account_from Account_to Value Count_interactions_out Count_interactions_in
1            1          2  25.0                      2                     2
2            1          3  30.0                      2                     2
3            2          1  28.0                      2                     1
4            2          3  10.0                      2                     1
5            2          3  12.0                      2                     1
6            3          1  40.0                      1                     2
7            3          1  30.0                      1                     2
8            3          1  20.0                      1                     2
Account 3 only sends to account 1, therefore its Count_interactions_out is 1. However, it receives transactions from accounts 1 and 2, therefore its Count_interactions_in is 2.
How can I apply this to the whole dataset?
Thanks
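For reference, here is a reproducible version of the example data, built from the printout above (the answers below call the object financial.data or df):
financial.data <- data.frame(
  Account_from = c(1, 1, 2, 2, 2, 3, 3, 3),
  Account_to   = c(2, 3, 1, 3, 3, 1, 1, 1),
  Value        = c(25, 30, 28, 10, 12, 40, 30, 20)
)
df <- financial.data  # the base R answers use the name df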

Here's an approach using dplyr (note that, as written, it attaches Count_interactions_in to each row's Account_to, i.e. the receiving account):
library(dplyr)

financial.data %>%
  group_by(Account_from) %>%
  mutate(Count_interactions_out = nlevels(factor(Account_to))) %>%
  ungroup() %>%
  group_by(Account_to) %>%
  mutate(Count_interactions_in = nlevels(factor(Account_from))) %>%
  ungroup()
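Because the in-count above is keyed to the receiving account, its values differ from the expected output (e.g. row 1 would get 1, not 2). A sketch of a variant that reproduces the expected output, where both counts describe Account_from, using n_distinct():
library(dplyr)

financial.data %>%
  group_by(Account_from) %>%
  mutate(Count_interactions_out = n_distinct(Account_to)) %>%  # distinct receivers per sender
  ungroup() %>%
  group_by(Account_to) %>%
  mutate(in_per_receiver = n_distinct(Account_from)) %>%       # distinct senders per receiver
  ungroup() %>%
  # remap each per-receiver count onto the rows where that account is the sender;
  # an account that never receives anything would get NA here
  mutate(Count_interactions_in = in_per_receiver[match(Account_from, Account_to)]) %>%
  select(-in_per_receiver)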

Here is a base R solution using ave():
df <- cbind(df, with(df, list(
  Count_interactions_out = ave(Account_to, Account_from,
                               FUN = function(x) length(unique(x))),
  Count_interactions_in  = ave(Account_from, Account_to,
                               FUN = function(x) length(unique(x)))[match(Account_from, Account_to)]
)))
such that
> df
  Account_from Account_to Value Count_interactions_out Count_interactions_in
1            1          2    25                      2                     2
2            1          3    30                      2                     2
3            2          1    28                      2                     1
4            2          3    10                      2                     1
5            2          3    12                      2                     1
6            3          1    40                      1                     2
7            3          1    30                      1                     2
8            3          1    20                      1                     2
or
df <- within(df, list(
  Count_interactions_out <- ave(Account_to, Account_from, FUN = function(x) length(unique(x))),
  Count_interactions_in  <- ave(Account_from, Account_to, FUN = function(x) length(unique(x)))[match(Account_from, Account_to)]
))
such that (note that the two new columns come out in the opposite order here):
> df
  Account_from Account_to Value Count_interactions_in Count_interactions_out
1            1          2    25                     2                      2
2            1          3    30                     2                      2
3            2          1    28                     1                      2
4            2          3    10                     1                      2
5            2          3    12                     1                      2
6            3          1    40                     2                      1
7            3          1    30                     2                      1
8            3          1    20                     2                      1

Related

Efficient code to remove rows containing non-unique max?

Here's a simple example of an array for which I want to extract only those rows whose max value is unique (in that row).
foo <- expand.grid(1:3, 1:3, 1:3)
   Var1 Var2 Var3
1     1    1    1
2     2    1    1
3     3    1    1
4     1    2    1
5     2    2    1
6     3    2    1
7     1    3    1
8     2    3    1
9     3    3    1
10    1    1    2
11    2    1    2
12    3    1    2
13    1    2    2
14    2    2    2
15    3    2    2
16    1    3    2
17    2    3    2
18    3    3    2
19    1    1    3
20    2    1    3
21    3    1    3
22    1    2    3
23    2    2    3
24    3    2    3
25    1    3    3
26    2    3    3
27    3    3    3
I've got working code:
winners  <- max.col(foo)
finddupe <- rep(0, length = length(winners))
for (jf in 1:length(winners))
  finddupe[jf] <- sum(foo[jf, ] == foo[jf, winners[jf]])
winners <- winners[finddupe == 1]
foo     <- foo[finddupe == 1, ]
That just looks inefficient to me.
I'd prefer a solution which only uses base R calls, but am open to using tools from other libraries.
Another base R solution: a row's maximum is unique exactly when its first and last positions coincide:
subset(foo, max.col(foo, 'first') == max.col(foo, 'last'))
   Var1 Var2 Var3
2     2    1    1
3     3    1    1
4     1    2    1
6     3    2    1
7     1    3    1
8     2    3    1
10    1    1    2
12    3    1    2
15    3    2    2
16    1    3    2
17    2    3    2
19    1    1    3
20    2    1    3
22    1    2    3
23    2    2    3
The same logic, dplyr style:
library(dplyr)

foo %>%
  filter(max.col(., 'first') == max.col(., 'last'))
Create a column of row maxima with pmax from all the columns, then keep only the rows with a single unique max, counting the matches with rowSums on a logical dataset:
library(dplyr)

foo %>%
  mutate(mx = do.call(pmax, c(across(everything()), na.rm = TRUE))) %>%
  filter(rowSums(across(Var1:Var3, ~ .x == mx), na.rm = TRUE) == 1)
Output:
   Var1 Var2 Var3 mx
1     2    1    1  2
2     3    1    1  3
3     1    2    1  2
4     3    2    1  3
5     1    3    1  3
6     2    3    1  3
7     1    1    2  2
8     3    1    2  3
9     3    2    2  3
10    1    3    2  3
11    2    3    2  3
12    1    1    3  3
13    2    1    3  3
14    1    2    3  3
15    2    2    3  3
Or with base R
subset(foo, rowSums(foo == do.call(pmax, c(foo, na.rm = TRUE)),
                    na.rm = TRUE) == 1)
A base R approach using apply (the count is always at least 1, so <= 1 means exactly 1):
foo[apply(foo, 1, function(x) sum(x[which.max(x)] == x) <= 1), ]
   Var1 Var2 Var3
2     2    1    1
3     3    1    1
4     1    2    1
6     3    2    1
7     1    3    1
8     2    3    1
10    1    1    2
12    3    1    2
15    3    2    2
16    1    3    2
17    2    3    2
19    1    1    3
20    2    1    3
22    1    2    3
23    2    2    3
After verifying the answers so far (18:00 EST, Wed 15 Feb), I ran a benchmark comparison; #onyambu wins the race. (cgw is me; the ak** entries are akrun's solutions.)
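The wrapper functions aren't defined in the post; here is a plausible reconstruction from the answers above (a sketch; in particular, the dplyr pipeline is generalized to any number of columns, which is my assumption):
library(dplyr)

# onyambu's subset on first/last max positions
ony <- function(d) subset(d, max.col(d, 'first') == max.col(d, 'last'))

# the question's original loop
cgw <- function(d) {
  winners  <- max.col(d)
  finddupe <- rep(0, length = length(winners))
  for (jf in 1:length(winners))
    finddupe[jf] <- sum(d[jf, ] == d[jf, winners[jf]])
  d[finddupe == 1, ]
}

# akrun's dplyr pipeline, with the column range generalized
akply <- function(d) {
  d %>%
    mutate(mx = do.call(pmax, c(across(everything()), na.rm = TRUE))) %>%
    filter(rowSums(across(-mx, ~ .x == mx), na.rm = TRUE) == 1)
}

# akrun's base R variant
akbase <- function(d) {
  subset(d, rowSums(d == do.call(pmax, c(d, na.rm = TRUE)), na.rm = TRUE) == 1)
}

# the apply() answer
andre <- function(d) d[apply(d, 1, function(x) sum(x[which.max(x)] == x) <= 1), ]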
library(microbenchmark)

bar5  <- 1:5
foo55 <- expand.grid(bar5, bar5, bar5, bar5, bar5)
microbenchmark(ony(foo55), cgw(foo55), akply(foo55), akbase(foo55), andre(foo55))
Unit: microseconds
          expr        min          lq        mean      median          uq         max neval cld
    ony(foo55)    455.117    495.2335    589.6801    517.3755    634.9795    3107.222   100  a
    cgw(foo55) 314076.038 317184.4050 348711.9522 319784.5870 324921.0335 2691161.873   100  b
  akply(foo55)  14156.653  14835.2230  16194.3699  15160.0270  16441.3550   74019.622   100  a
 akbase(foo55)    858.969    896.8310   1055.4277    970.6395   1117.2420    4098.860   100  a
  andre(foo55)   8161.406   8531.1700   9188.4801   8872.0325   9284.0995   14548.383   100  a

Retrieve a value by another column criteria in R

I need some help. I have this df:
df <- data.frame(month = c(1,1,1,1,1,2,2,2,2,2),
                 day   = c(1,2,3,4,5,1,2,3,4,5),
                 flow  = c(2,5,7,8,5,4,6,7,9,2))
   month day flow
1      1   1    2
2      1   2    5
3      1   3    7
4      1   4    8
5      1   5    5
6      2   1    4
7      2   2    6
8      2   3    7
9      2   4    9
10     2   5    2
but I want to know, for each month, the day of the minimum flow:
   month day flow dayminflowofthemonth
1      1   1    2                    1
2      1   2    5                    1
3      1   3    7                    1
4      1   4    8                    1
5      1   5    5                    1
6      2   1    4                    5
7      2   2    6                    5
8      2   3    7                    5
9      2   4    9                    5
10     2   5    2                    5
The repetition is not a problem; I will use a pivot function afterwards.
Thanks, people!
We can use which.min to return the index of the minimum 'flow' per group, and use that to pick the corresponding 'day' when creating the column with mutate:
library(dplyr)

df <- df %>%
  group_by(month) %>%
  mutate(dayminflowofthemonth = day[which.min(flow)]) %>%
  ungroup()
Output:
df
# A tibble: 10 x 4
#    month   day  flow dayminflowofthemonth
#    <dbl> <dbl> <dbl>                <dbl>
#  1     1     1     2                    1
#  2     1     2     5                    1
#  3     1     3     7                    1
#  4     1     4     8                    1
#  5     1     5     5                    1
#  6     2     1     4                    5
#  7     2     2     6                    5
#  8     2     3     7                    5
#  9     2     4     9                    5
# 10     2     5     2                    5
Another option, using indexing inside a dplyr pipeline:
library(dplyr)

newdf <- df %>%
  group_by(month) %>%
  mutate(Val = day[flow == min(flow)][1])
Output:
# A tibble: 10 x 4
# Groups:   month [2]
   month   day  flow   Val
   <dbl> <dbl> <dbl> <dbl>
 1     1     1     2     1
 2     1     2     5     1
 3     1     3     7     1
 4     1     4     8     1
 5     1     5     5     1
 6     2     1     4     5
 7     2     2     6     5
 8     2     3     7     5
 9     2     4     9     5
10     2     5     2     5
Here is a base R option using ave: the inner ave flags the rows where flow equals the monthly minimum, multiplying by day zeroes out every other day, and the outer ave with max spreads the one remaining day across the month:
transform(
  df,
  dayminflowofthemonth = ave(day * (ave(flow, month, FUN = min) == flow), month, FUN = max)
)
which gives
   month day flow dayminflowofthemonth
1      1   1    2                    1
2      1   2    5                    1
3      1   3    7                    1
4      1   4    8                    1
5      1   5    5                    1
6      2   1    4                    5
7      2   2    6                    5
8      2   3    7                    5
9      2   4    9                    5
10     2   5    2                    5
One more base R approach, using by (the result holds one day per month, which the final [df$month] recycles back onto the rows):
df$dayminflowofthemonth <- by(
  df,
  df$month,
  function(x) x$day[which.min(x$flow)]
)[df$month]

R: recode by a splitting rule

I have a student dataset containing student information, question id (5 questions), and the sequence of each trial to answer the questions. I would like to create a variable marking where exactly a student starts reviewing questions after finishing all of them.
Here is a sample dataset:
data <- data.frame(
  person   = c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
  sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))
data
   person question sequence
1       1        1        1
2       1        2        1
3       1        2        2
4       1        3        1
5       1        3        2
6       1        3        3
7       1        4        1
8       1        3        4
9       1        5        1
10      1        1        2
11      1        2        3
12      2        1        1
13      2        1        2
14      2        1        3
15      2        2        1
16      2        3        1
17      2        4        1
18      2        4        2
19      2        4        3
20      2        5        1
21      2        5        2
22      2        4        4
23      2        3        2
24      2        4        5
25      2        4        6
26      2        5        3
27      2        4        7
28      2        5        4
The sequence variable records each visit to a question with a running count. Revisits can also occur before a student has seen all the questions; however, the attempt variable should mark 'review' only once the student has seen all 5 questions. With the new variable, I am after this dataset:
> data
   person question sequence attempt
1       1        1        1 initial
2       1        2        1 initial
3       1        2        2 initial
4       1        3        1 initial
5       1        3        2 initial
6       1        3        3 initial
7       1        4        1 initial
8       1        3        4 initial
9       1        5        1 initial
10      1        1        2  review
11      1        2        3  review
12      2        1        1 initial
13      2        1        2 initial
14      2        1        3 initial
15      2        2        1 initial
16      2        3        1 initial
17      2        4        1 initial
18      2        4        2 initial
19      2        4        3 initial
20      2        5        1 initial
21      2        5        2 initial
22      2        4        4  review
23      2        3        2  review
24      2        4        5  review
25      2        4        6  review
26      2        5        3  review
27      2        4        7  review
28      2        5        4  review
Any ideas?
Thanks!
What a challenging question. Took almost 2 hours to find the solution.
Try this
library(dplyr)

dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))

data %>%
  mutate(var0 = n_distinct(question)) %>%
  group_by(person) %>%
  mutate(var1 = dist_cum(question),
         var2 = cumsum(c(1, diff(question) != 0))) %>%
  ungroup() %>%
  mutate(var3 = if_else(sequence == 1 | var1 < var0, 0, 1)) %>%
  group_by(person, var2) %>%
  mutate(var4 = min(var3)) %>%
  ungroup() %>%
  mutate(attempt = if_else(var4 == 0, "initial", "review")) %>%
  select(-starts_with("var")) %>%
  as.data.frame
Result
   person question sequence attempt
1       1        1        1 initial
2       1        2        1 initial
3       1        2        2 initial
4       1        3        1 initial
5       1        3        2 initial
6       1        3        3 initial
7       1        4        1 initial
8       1        3        4 initial
9       1        5        1 initial
10      1        1        2  review
11      1        2        3  review
12      2        1        1 initial
13      2        1        2 initial
14      2        1        3 initial
15      2        2        1 initial
16      2        3        1 initial
17      2        4        1 initial
18      2        4        2 initial
19      2        4        3 initial
20      2        5        1 initial
21      2        5        2 initial
22      2        4        4  review
23      2        3        2  review
24      2        4        5  review
25      2        4        6  review
26      2        5        3  review
27      2        4        7  review
28      2        5        4  review
dist_cum is a helper function that calculates a rolling distinct count; var0...var4 are intermediate helpers.
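For instance, applied to person 1's question column, dist_cum returns the running count of distinct questions seen so far:
dist_cum(c(1, 2, 2, 3, 3, 3, 4, 3, 5, 1, 2))
# [1] 1 2 2 3 3 3 4 4 5 5 5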
One way to do it is to find where reviewing starts, i.e. the next entry after the fifth question has been seen, where the sequence equals 2 (see v1 and v2 below). Then, subsetting per person and looping over the subsets, you can fill in the remaining entries of the attempt variable, since the start of reviewing is now known.
v1 <- c(FALSE, (data$question == 5)[-nrow(data)])
v2 <- data$sequence == 2
data$attempt <- ifelse(v1 * v2 == 1, "review", NA)

persons      <- unique(data$person)
persons.list <- vector(mode = "list", length = length(persons))
for (i in 1:length(persons)) {
  person.i <- subset(data, person == persons[i])
  n <- which(person.i$attempt == "review")
  m <- nrow(person.i)
  person.i$attempt[(n + 1):m] <- "review"
  person.i$attempt[which(is.na(person.i$attempt))] <- "initial"
  persons.list[[i]] <- person.i
}
do.call(rbind, persons.list)
   person question sequence attempt
1       1        1        1 initial
2       1        2        1 initial
3       1        2        2 initial
4       1        3        1 initial
5       1        3        2 initial
6       1        3        3 initial
7       1        4        1 initial
8       1        3        4 initial
9       1        5        1 initial
10      1        1        2  review
11      1        2        3  review
12      2        1        1 initial
13      2        1        2 initial
14      2        1        3 initial
15      2        2        1 initial
16      2        3        1 initial
17      2        4        1 initial
18      2        4        2 initial
19      2        4        3 initial
20      2        5        1 initial
21      2        5        2  review
22      2        4        4  review
23      2        3        2  review
24      2        4        5  review
25      2        4        6  review
26      2        5        3  review
27      2        4        7  review
28      2        5        4  review
Alternatively, you can also use lapply:
do.call(rbind, lapply(persons, function(x) {
  person.x <- subset(data, person == x)
  n <- which(person.x$attempt == "review")
  m <- nrow(person.x)
  person.x$attempt[(n + 1):m] <- "review"
  person.x$attempt[which(is.na(person.x$attempt))] <- "initial"
  person.x
}))
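For comparison, here is a more direct sketch of the rule as I read it (assuming every student eventually sees all questions): once all questions have appeared, review starts at the student's first move to a different, already-visited question, and every row from there on is review.
library(dplyr)

n_q <- n_distinct(data$question)  # total number of questions (5 here)

data %>%
  group_by(person) %>%
  mutate(
    seen_all = cumsum(!duplicated(question)) == n_q,  # has this person seen every question yet?
    new_run  = c(TRUE, diff(question) != 0),          # did the question just change?
    attempt  = if_else(cumsum(seen_all & new_run & sequence > 1) > 0,
                       "review", "initial")
  ) %>%
  ungroup() %>%
  select(-seen_all, -new_run)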

Calculating a mean while keeping all variables in the dataset in R

I am trying to calculate the mean of time while keeping all the variables in the final dataset, using the dplyr package.
Here is how my sample dataset looks:
library(dplyr)

id       <- c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4)
gender   <- c(1,1,1,1, 2,2,2,2, 2,2,2,2, 1,1,1,1)
item.id  <- c(1,1,1,2, 1,1,2,2, 1,2,3,4, 1,2,2,3)
sequence <- c(1,2,3,1, 1,2,1,2, 1,1,1,1, 1,1,2,1)
time     <- c(5,6,7,1, 2,3,4,9, 1,2,3,9, 5,6,7,8)
data     <- data.frame(id, gender, item.id, sequence, time)
> data
   id gender item.id sequence time
1   1      1       1        1    5
2   1      1       1        2    6
3   1      1       1        3    7
4   1      1       2        1    1
5   2      2       1        1    2
6   2      2       1        2    3
7   2      2       2        1    4
8   2      2       2        2    9
9   3      2       1        1    1
10  3      2       2        1    2
11  3      2       3        1    3
12  3      2       4        1    9
13  4      1       1        1    5
14  4      1       2        1    6
15  4      1       2        2    7
16  4      1       3        1    8
id is the student id, gender is gender, item.id identifies the question a student takes, sequence is the running number of attempts at a question (students might return to a question and try to answer it again), and time is the time spent on each trial.
When calculating the mean of the time, I need to follow three steps:
(a) Students have multiple trials for each question, so first calculate the mean time per item across its trials.
(b) Then calculate the overall mean time for each id. For example, id = 1 has two items; the first item has 3 trials and the second has 1. First aggregate the time for the first item: (5 + 6 + 7) / 3 = 6, so id = 1 has item 1 time 6 and item 2 time 1. Second, take 6 and 1 and calculate this student's mean: (6 + 1) / 2 = 3.5.
(c) Lastly, I would like to keep all the variables in the dataset.
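As a quick base R check of the arithmetic in (b), using the data above:
with(subset(data, id == 1), mean(tapply(time, item.id, mean)))
# [1] 3.5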
data <- data %>%
  group_by(id) %>%
  select(id, gender, item.id, sequence, time) %>%
  summarize(mean.time = mean(time))
I got this, but it obviously only aggregates the overall mean without first taking the within-item mean across trials, and it also does not keep all the variables:
> data
# A tibble: 4 x 2
     id mean.time
  <dbl>     <dbl>
1     1      4.75
2     2      4.5
3     3      3.75
4     4      6.5
I thought select() was going to keep all variables.
The final dataset should look like this:
> data
   id gender item.id sequence time mean.time
1   1      1       1        1    5      3.5
2   1      1       1        2    6      3.5
3   1      1       1        3    7      3.5
4   1      1       2        1    1      3.5
5   2      2       1        1    2      4.5
6   2      2       1        2    3      4.5
7   2      2       2        1    4      4.5
8   2      2       2        2    9      4.5
9   3      2       1        1    1      3.75
10  3      2       2        1    2      3.75
11  3      2       3        1    3      3.75
12  3      2       4        1    9      3.75
13  4      1       1        1    5      6.5
14  4      1       2        1    6      6.5
15  4      1       2        2    7      6.5
16  4      1       3        1    8      6.5
I used dplyr, but I am open to any other solutions.
Thanks in advance!
We can use mutate instead of summarise, as summarise returns a summarised output of one row per group, while mutate creates a new column in the full dataset:
... %>%
  mutate(mean.time = mean(time))
If we want the mean of means, first group by 'id' and 'item.id' and get the mean, then group by 'id' and get the mean of the unique elements:
data %>%
  group_by(id, item.id) %>%
  mutate(mean.time = mean(time)) %>%
  group_by(id) %>%
  mutate(mean.time = mean(unique(mean.time)))
# A tibble: 16 x 6
# Groups:   id [4]
#       id gender item.id sequence  time mean.time
#    <dbl>  <dbl>   <dbl>    <dbl> <dbl>     <dbl>
#  1     1      1       1        1     5      3.5
#  2     1      1       1        2     6      3.5
#  3     1      1       1        3     7      3.5
#  4     1      1       2        1     1      3.5
#  5     2      2       1        1     2      4.5
#  6     2      2       1        2     3      4.5
#  7     2      2       2        1     4      4.5
#  8     2      2       2        2     9      4.5
#  9     3      2       1        1     1      3.75
# 10     3      2       2        1     2      3.75
# 11     3      2       3        1     3      3.75
# 12     3      2       4        1     9      3.75
# 13     4      1       1        1     5      6.5
# 14     4      1       2        1     6      6.5
# 15     4      1       2        2     7      6.5
# 16     4      1       3        1     8      6.5
Or, after computing the per-item means, regroup by 'id' and use match to pick the 'mean.time' at the first position of each 'item.id' before averaging (unlike unique(), this does not collapse two items that happen to share the same mean):
data %>%
  group_by(id, item.id) %>%
  mutate(mean.time = mean(time)) %>%
  group_by(id) %>%
  mutate(mean.time = mean(mean.time[match(unique(item.id), item.id)]))
Or use summarise and then join the summary back onto the original data:
data %>%
  group_by(id, item.id) %>%
  summarise(mean.time = mean(time)) %>%
  group_by(id) %>%
  summarise(mean.time = mean(mean.time)) %>%
  right_join(data)

Delete rows with value if not only value in group

Somewhat new to R, and I find myself needing to delete rows based on multiple criteria. The data frame has 3 columns, and I need to delete rows where bid == 99 whenever the group defined by rid and qid also contains values less than 99. In other words, at the rid and qid level, the desired output keeps either the bid values below 99, or a lone bid == 99.
rid qid bid
  1   1   5
  1   1   6
  1   1  99
  1   2   6
  2   1   7
  2   1  99
  2   2   2
  2   2   3
  3   1   7
  3   1   8
  3   2   1
  3   2  99
  4   1   2
  4   1   6
  4   2   1
  4   2   2
  4   2  99
  5   1  99
  5   2  99
The expected output...
rid qid bid
  1   1   5
  1   1   6
  1   2   6
  2   1   7
  2   2   2
  2   2   3
  3   1   7
  3   1   8
  3   2   1
  4   1   2
  4   1   6
  4   2   1
  4   2   2
  5   1  99
  5   2  99
Any assistance would be appreciated.
You can use the base R function ave to generate a dropping variable like this:
df$dropper <- with(df, ave(bid, rid, qid, FUN = function(i) i == 99 & length(i) > 1))
ave calculates a function on bid, grouping by rid and qid. The function tests whether each element of the grouped bid values i equals 99 and whether i has a length greater than 1. Also, with is used to reduce typing.
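A tiny standalone illustration of that ave call on one hypothetical three-row group (the logical result is coerced back to bid's numeric type, which is why dropper comes out as 0/1):
ave(c(5, 6, 99), c(1, 1, 1), FUN = function(i) i == 99 & length(i) > 1)
# [1] 0 0 1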
which returns
df
   rid qid bid dropper
1    1   1   5       0
2    1   1   6       0
3    1   1  99       1
4    1   2   6       0
5    2   1   7       0
6    2   1  99       1
7    2   2   2       0
8    2   2   3       0
9    3   1   7       0
10   3   1   8       0
11   3   2   1       0
12   3   2  99       1
13   4   1   2       0
14   4   1   6       0
15   4   2   1       0
16   4   2   2       0
17   4   2  99       1
18   5   1  99       0
19   5   2  99       0
then drop the undesired observations with df[df$dropper == 0, 1:3], which simultaneously drops the new helper variable.
If you simply want to delete every row where bid == 99, you can use dplyr:
library(dplyr)

df <- df %>%
  filter(bid != 99)
where df is your data frame, and != means "not equal to".
Updated solution using dplyr:
df %>%
  group_by(rid, qid) %>%
  mutate(tempcount = n()) %>%
  ungroup() %>%
  mutate(DropValue = ifelse(bid == 99 & tempcount > 1, 1, 0)) %>%
  filter(DropValue == 0) %>%
  select(rid, qid, bid)
Here is another option, using all with an if condition in data.table to subset the rows after grouping by 'rid' and 'qid':
library(data.table)
setDT(df1)[, if (all(bid == 99)) .SD else .SD[bid != 99], .(rid, qid)]
#     rid qid bid
#  1:   1   1   5
#  2:   1   1   6
#  3:   1   2   6
#  4:   2   1   7
#  5:   2   2   2
#  6:   2   2   3
#  7:   3   1   7
#  8:   3   1   8
#  9:   3   2   1
# 10:   4   1   2
# 11:   4   1   6
# 12:   4   2   1
# 13:   4   2   2
# 14:   5   1  99
# 15:   5   2  99
Or, without using if, collect the qualifying row indices with .I and subset once:
setDT(df1)[df1[, .I[all(bid == 99) | bid != 99], .(rid, qid)]$V1]
Here is a solution using dplyr, which is a very expressive framework for this kind of problem.
df <- read.table(text =
" rid qid bid
1 1 5
1 1 6
1 1 99
1 2 6
2 1 7
2 1 99
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
3 2 99
4 1 2
4 1 6
4 2 1
4 2 2
4 2 99
5 1 99
5 2 99",
header = TRUE, stringsAsFactors = FALSE)
dplyr verbs let you express the program in terms that stay close to the wording of your question:
library(dplyr)
res <- df %>%
  group_by(rid, qid) %>%
  filter(!(any(bid < 99) & bid == 99)) %>%
  ungroup()
# # A tibble: 15 × 3
#      rid   qid   bid
#    <int> <int> <int>
#  1     1     1     5
#  2     1     1     6
#  3     1     2     6
#  4     2     1     7
#  5     2     2     2
#  6     2     2     3
#  7     3     1     7
#  8     3     1     8
#  9     3     2     1
# 10     4     1     2
# 11     4     1     6
# 12     4     2     1
# 13     4     2     2
# 14     5     1    99
# 15     5     2    99
Let's check we get the desired output:
desired_output <- read.table(text =
" rid qid bid
1 1 5
1 1 6
1 2 6
2 1 7
2 2 2
2 2 3
3 1 7
3 1 8
3 2 1
4 1 2
4 1 6
4 2 1
4 2 2
5 1 99
5 2 99",
header = TRUE, stringsAsFactors = FALSE)
identical(as.data.frame(res), desired_output)
# [1] TRUE
