Calculating a mean while keeping all variables in the dataset in R

I am trying to calculate the mean of time while keeping all the variables in the final dataset, using the dplyr package.
Here is what my sample dataset looks like:
library(dplyr)
id <- c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4)
gender <- c(1,1,1,1, 2,2,2,2, 2,2,2,2, 1,1,1,1)
item.id <-c(1,1,1,2, 1,1,2,2, 1,2,3,4, 1,2,2,3)
sequence<-c(1,2,3,1, 1,2,1,2, 1,1,1,1, 1,1,2,1)
time <- c(5,6,7,1, 2,3,4,9, 1,2,3,9, 5,6,7,8)
data <- data.frame(id, gender, item.id, sequence, time)
> data
id gender item.id sequence time
1 1 1 1 1 5
2 1 1 1 2 6
3 1 1 1 3 7
4 1 1 2 1 1
5 2 2 1 1 2
6 2 2 1 2 3
7 2 2 2 1 4
8 2 2 2 2 9
9 3 2 1 1 1
10 3 2 2 1 2
11 3 2 3 1 3
12 3 2 4 1 9
13 4 1 1 1 5
14 4 1 2 1 6
15 4 1 2 2 7
16 4 1 3 1 8
Here id is the student id, gender is the student's gender, item.id is the id of the question the student takes, sequence is the attempt number for that question (students may return to a question and try to answer it again), and time is the time spent on each trial.
When calculating the mean of the time, I need to follow three steps:
(a) students have multiple trials for each question. I need to calculate the mean of the time for each item having multiple trials.
(b) then calculate the overall mean of the time for each id. For example, for id=1, I have two items, the first item has 3 trials and the second item has 1 trial. First I need to aggregate the time for the first item by (5+6+7)/3=6, so id=1 has item1 time 6 and item2 time 1. Second, taking 6 and 1 and calculating the mean for this student (6+1)/2=3.5.
(c) Lastly, I would like to keep all the variables in the dataset.
data <- data %>%
group_by(id) %>%
select(id, gender, item.id, sequence, time) %>%
summarize(mean.time = mean(time))
I got this, but obviously it only computes the overall mean per id without first averaging the trials within each item, and it does not keep all the variables:
> data
# A tibble: 4 x 2
id mean.time
<dbl> <dbl>
1 1 4.75
2 2 4.5
3 3 3.75
4 4 6.5
I thought select() was going to keep all variables.
The final dataset should look like this below:
> data
id gender item.id sequence time mean.time
1 1 1 1 1 5 3.5
2 1 1 1 2 6 3.5
3 1 1 1 3 7 3.5
4 1 1 2 1 1 3.5
5 2 2 1 1 2 4.5
6 2 2 1 2 3 4.5
7 2 2 2 1 4 4.5
8 2 2 2 2 9 4.5
9 3 2 1 1 1 3.75
10 3 2 2 1 2 3.75
11 3 2 3 1 3 3.75
12 3 2 4 1 9 3.75
13 4 1 1 1 5 6.5
14 4 1 2 1 6 6.5
15 4 1 2 2 7 6.5
16 4 1 3 1 8 6.5
I used dplyr but am open to any other solutions.
Thanks in advance!

We can use mutate instead of summarise: summarise returns one row per group, while mutate creates a new column and keeps every row (and every other column) of the dataset
...
%>%
mutate(mean.time = mean(time))
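To see the difference on the question's data, here is a quick sketch. Only the id and time columns are rebuilt, and the names by_id, summarised and mutated are just illustrative:

```r
library(dplyr)

# id and time columns from the question's sample data
data <- data.frame(id = rep(1:4, each = 4),
                   time = c(5,6,7,1, 2,3,4,9, 1,2,3,9, 5,6,7,8))

by_id <- group_by(data, id)

# summarise() collapses each group to a single row
summarised <- summarise(by_id, mean.time = mean(time))

# mutate() keeps every row and adds the group mean as a new column
mutated <- ungroup(mutate(by_id, mean.time = mean(time)))

dim(summarised)  # 4 rows, 2 columns
dim(mutated)     # 16 rows, 3 columns
```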
If we want the mean of means, first group by 'id' and 'item.id' and get the mean, then group by 'id' and get the mean of the unique per-item means
data %>%
group_by(id, item.id) %>%
mutate(mean.time = mean(time)) %>%
group_by(id) %>%
mutate(mean.time = mean(unique(mean.time)))
# A tibble: 16 x 6
# Groups: id [4]
# id gender item.id sequence time mean.time
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1 5 3.5
# 2 1 1 1 2 6 3.5
# 3 1 1 1 3 7 3.5
# 4 1 1 2 1 1 3.5
# 5 2 2 1 1 2 4.5
# 6 2 2 1 2 3 4.5
# 7 2 2 2 1 4 4.5
# 8 2 2 2 2 9 4.5
# 9 3 2 1 1 1 3.75
#10 3 2 2 1 2 3.75
#11 3 2 3 1 3 3.75
#12 3 2 4 1 9 3.75
#13 4 1 1 1 5 6.5
#14 4 1 2 1 6 6.5
#15 4 1 2 2 7 6.5
#16 4 1 3 1 8 6.5
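One caveat with unique(): it works on the sample data because no two items of the same student share a mean. A small sketch with made-up data (toy and the column names item.mean, via.unique and via.first are hypothetical) shows what can happen when they do, and one way around it by taking one mean per item instead:

```r
library(dplyr)

# Hypothetical student with three items; items 1 and 2 both average 4
toy <- data.frame(id = 1, item.id = c(1, 1, 2, 3), time = c(3, 5, 4, 10))

res <- toy %>%
  group_by(id, item.id) %>%
  mutate(item.mean = mean(time)) %>%
  group_by(id) %>%
  mutate(via.unique = mean(unique(item.mean)),                  # (4 + 10) / 2
         via.first  = mean(item.mean[!duplicated(item.id)])) %>% # (4 + 4 + 10) / 3
  ungroup()

res$via.unique[1]  # 7
res$via.first[1]   # 6
```

Here the two identical item means are collapsed by unique(), so the student's mean comes out as 7 instead of 6.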
Or, instead of taking unique() of the means, after grouping by 'id' again we can do a match to get the first position of each 'item.id', extract the 'mean.time' there and get the mean
data %>%
  group_by(id, item.id) %>%
  mutate(mean.time = mean(time)) %>%
  group_by(id) %>%
  mutate(mean.time = mean(mean.time[match(unique(item.id), item.id)]))
Or use summarise and then do a right_join back onto the original data
data %>%
  group_by(id, item.id) %>%
  summarise(mean.time = mean(time)) %>%
  group_by(id) %>%
  summarise(mean.time = mean(mean.time)) %>%
  right_join(data, by = "id")
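Since the question is open to other solutions, here is a rough base R sketch of the same two-step mean, assuming the sample data from the question. The helper names item_mean, first_trial and id_mean are made up for illustration:

```r
# Sample data from the question
id       <- c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4)
gender   <- c(1,1,1,1, 2,2,2,2, 2,2,2,2, 1,1,1,1)
item.id  <- c(1,1,1,2, 1,1,2,2, 1,2,3,4, 1,2,2,3)
sequence <- c(1,2,3,1, 1,2,1,2, 1,1,1,1, 1,1,2,1)
time     <- c(5,6,7,1, 2,3,4,9, 1,2,3,9, 5,6,7,8)
data <- data.frame(id, gender, item.id, sequence, time)

# Step (a): mean time per student and item, repeated on every trial row
item_mean <- ave(data$time, data$id, data$item.id)

# Keep one row per student/item so each item counts once
first_trial <- !duplicated(data[c("id", "item.id")])

# Step (b): mean of the item means per student
id_mean <- tapply(item_mean[first_trial], data$id[first_trial], mean)

# Step (c): attach the result to every row, looking up by student id
data$mean.time <- as.numeric(id_mean[as.character(data$id)])
```

All original columns stay in place because nothing is summarised away; the new column is simply looked up per row.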

Related

How to give a consecutive id number for each distinct study in r

I am trying to create consecutive ID numbers within each distinct study. I found an example dataset where such an ID number was created in the esid variable
Browse[1]> dat <- dat.assink2016
Browse[1]> head(dat, 9)
study esid id yi vi pubstatus year deltype
1 1 1 1 0.9066 0.0740 1 4.5 general
2 1 2 2 0.4295 0.0398 1 4.5 general
3 1 3 3 0.2679 0.0481 1 4.5 general
4 1 4 4 0.2078 0.0239 1 4.5 general
5 1 5 5 0.0526 0.0331 1 4.5 general
6 1 6 6 -0.0507 0.0886 1 4.5 general
7 2 1 7 0.5117 0.0115 1 1.5 general
8 2 2 8 0.4738 0.0076 1 1.5 general
9 2 3 9 0.3544 0.0065 1 1.5 general
I would like to create the same for my study, can anyone show me how to do it?
The key is to group_by study, then use row_number
library(dplyr)
df %>%
group_by(study) %>%
mutate(esid = row_number())
with the example data from @njp:
# A tibble: 9 × 3
# Groups: study [3]
study id esid
<dbl> <int> <int>
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 1
5 2 5 2
6 2 6 3
7 2 7 4
8 3 8 1
9 3 9 2
If the id column is consecutive (i.e. no jumps or repeated values) you could subtract the minimum value of id for each study and add one:
# Example data
df = data.frame(study = c(1,1,1,2,2,2,2,3,3),
                id = 1:9)
# Calculate the minimum id per study (a vector named by study)
min.id = tapply(X = df$id,
                INDEX = df$study,
                FUN = min)
# Merge this with the data, looking up by study name rather than by position
df$min.id = min.id[as.character(df$study)]
# Calculate consecutive id as required
df$esid = df$id - df$min.id + 1
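The same subtraction can probably be done in one step with ave(), which applies a function within each study and writes the result back to the matching rows. This is only a sketch on the example data above and still assumes the ids are consecutive within each study:

```r
# Example data as above
df <- data.frame(study = c(1,1,1,2,2,2,2,3,3),
                 id = 1:9)

# Within each study, shift the ids so that the smallest one becomes 1
df$esid <- ave(df$id, df$study, FUN = function(x) x - min(x) + 1L)

df$esid  # 1 2 3 1 2 3 4 1 2
```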

Count interactions with unique accounts in financial transaction dataset

I have a question about a dataset with financial transactions:
Account_from Account_to Value
1 1 2 25.0
2 1 3 30.0
3 2 1 28.0
4 2 3 10.0
5 2 3 12.0
6 3 1 40.0
7 3 1 30.0
8 3 1 20.0
Each row represents a transaction. I would like to add extra columns containing the number of unique accounts each account interacts with.
It would look like the following:
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25.0 2 2
2 1 3 30.0 2 2
3 2 1 28.0 2 1
4 2 3 10.0 2 1
5 2 3 12.0 2 1
6 3 1 40.0 1 2
7 3 1 30.0 1 2
8 3 1 20.0 1 2
Account 3 only interacts with account 1, therefore Count_interactions_out is 1. However, it receives interactions from account 1 and 2, therefore the count_interactions_in is 2.
How can I apply this to the whole dataset?
Thanks
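Here's an approach using dplyr. Note that Count_interactions_in is computed here for the receiving account of each row (Account_to), whereas the desired output in the question reports the incoming count for Account_from, so the two columns will not match on this data.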
library(dplyr)
financial.data %>%
group_by(Account_from) %>%
mutate(Count_interactions_out = nlevels(factor(Account_to))) %>%
ungroup() %>%
group_by(Account_to) %>%
mutate(Count_interactions_in = nlevels(factor(Account_from))) %>%
ungroup()
Here is a base R solution using ave()
df <- cbind(df,
            with(df, list(
              Count_interactions_out = ave(Account_to, Account_from, FUN = function(x) length(unique(x))),
              Count_interactions_in = ave(Account_from, Account_to, FUN = function(x) length(unique(x)))[match(Account_from, Account_to)])))
such that
> df
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 2 1
4 2 3 10 2 1
5 2 3 12 2 1
6 3 1 40 1 2
7 3 1 30 1 2
8 3 1 20 1 2
or
df <- within(df, list(
  Count_interactions_out <- ave(Account_to, Account_from, FUN = function(x) length(unique(x))),
  Count_interactions_in <- ave(Account_from, Account_to, FUN = function(x) length(unique(x)))[match(Account_from, Account_to)]))
such that
> df
Account_from Account_to Value Count_interactions_in Count_interactions_out
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 1 2
4 2 3 10 1 2
5 2 3 12 1 2
6 3 1 40 2 1
7 3 1 30 2 1
8 3 1 20 2 1
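Another way to think about it is as two lookup tables keyed by account: one with the number of distinct receivers per sender, one with the number of distinct senders per receiver. The sketch below rebuilds the question's data; n_unique, out_by_account, in_by_account and key are illustrative names, not part of any answer above:

```r
# Transactions from the question
df <- data.frame(Account_from = c(1, 1, 2, 2, 2, 3, 3, 3),
                 Account_to   = c(2, 3, 1, 3, 3, 1, 1, 1),
                 Value        = c(25, 30, 28, 10, 12, 40, 30, 20))

n_unique <- function(x) length(unique(x))

# Distinct receivers per sending account, and distinct senders per receiving account
out_by_account <- tapply(df$Account_to, df$Account_from, n_unique)
in_by_account  <- tapply(df$Account_from, df$Account_to, n_unique)

# Look both counts up for the sending account of each row
key <- as.character(df$Account_from)
df$Count_interactions_out <- as.integer(out_by_account[key])
df$Count_interactions_in  <- as.integer(in_by_account[key])
```

An account that sends but never receives would get NA in Count_interactions_in with this lookup; replacing that with 0 is a separate decision.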

R cummax function with NA

data
data=data.frame("person"=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
"score"=c(1,2,1,2,3,1,3,NA,4,2,1,NA,2,NA,3,1,2,4),
"want"=c(1,2,1,2,3,3,3,3,4,2,1,1,2,2,3,3,3,4))
attempt
library(dplyr)
data = data %>%
group_by(person) %>%
mutate(wantTEST = ifelse(score >= 3 | (row_number() >= which.max(score == 3)),
cummax(score), score),
wantTEST = replace(wantTEST, duplicated(wantTEST == 4) & wantTEST == 4, NA))
I am basically trying to use the cummax function, but only under specific circumstances. I want to keep the values as they are (e.g. 1-2-1-1) unless a 3 or 4 occurs; from that point on the running maximum should be used, so 1-2-1-3-2-1-4 should become 1-2-1-3-3-3-4. If there is an NA value, I want to carry the previous value forward. Thank you.
Here's one way with the tidyverse. You may want to call fill() after group_by() instead, so that values are not carried across persons, but the question leaves that somewhat unclear.
library(dplyr)
library(tidyr)
data %>%
  fill(score) %>%
  group_by(person) %>%
  mutate(
    w = ifelse(cummax(score) > 2, cummax(score), score)
  ) %>%
  ungroup()
# A tibble: 18 x 4
person score want w
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 2 2
3 1 1 1 1
4 1 2 2 2
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 1 3 3 3
9 1 4 4 4
10 2 2 2 2
11 2 1 1 1
12 2 1 1 1
13 2 2 2 2
14 2 2 2 2
15 2 3 3 3
16 2 1 3 3
17 2 2 3 3
18 2 4 4 4
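Whether fill() comes before or after group_by() makes no difference on the question's data, because no person starts with an NA. A small sketch with made-up data (toy, ungrouped and grouped are hypothetical names) shows where the two orders would diverge:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: person 2 starts with a missing score
toy <- data.frame(person = c(1, 1, 2, 2), score = c(1, 4, NA, 2))

# fill() on ungrouped data carries the last value across persons
ungrouped <- toy %>% fill(score)

# fill() on grouped data only carries values within each person
grouped <- toy %>% group_by(person) %>% fill(score) %>% ungroup()

ungrouped$score  # 1 4 4 2: the 4 leaks from person 1 into person 2
grouped$score    # 1 4 NA 2: the leading NA stays missing
```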
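One way to do this is to first fill the NA values and then, for each row, check whether a score of 3 has already been reached in the group. If it has been reached by that row, we take the maximum score up to that row; otherwise we return the score unchanged. Note that which.max(score == 3) returns 1 when a group contains no 3, so this assumes every person has a score of exactly 3 at some point.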
library(tidyverse)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(want1 = map_dbl(seq_len(n()), ~if(. >= which.max(score == 3))
max(score[seq_len(.)]) else score[.]))
# person score want want1
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 1 2 2 2
# 3 1 1 1 1
# 4 1 2 2 2
# 5 1 3 3 3
# 6 1 1 3 3
# 7 1 3 3 3
# 8 1 3 3 3
# 9 1 4 4 4
#10 2 2 2 2
#11 2 1 1 1
#12 2 1 1 1
#13 2 2 2 2
#14 2 2 2 2
#15 2 3 3 3
#16 2 1 3 3
#17 2 2 3 3
#18 2 4 4 4
Another way is to use accumulate from purrr. I use if_else_ from hablar for type stability:
library(tidyverse)
library(hablar)
data %>%
fill(score) %>%
group_by(person) %>%
mutate(wt = accumulate(score, ~if_else_(.x > 2, max(.x, .y), .y)))
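For comparison, here is a base R sketch of the same idea without extra packages. The helpers carry_forward and sticky_max are made-up names, and the threshold of "greater than 2" is taken from the first answer rather than from the question itself:

```r
# Data from the question
data <- data.frame(person = c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
                   score  = c(1,2,1,2,3,1,3,NA,4,2,1,NA,2,NA,3,1,2,4),
                   want   = c(1,2,1,2,3,3,3,3,4,2,1,1,2,2,3,3,3,4))

# Replace each NA with the most recent non-NA value
carry_forward <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the latest non-NA value
  idx[idx == 0] <- NA                      # nothing to carry into leading NAs
  x[idx]
}

# Keep scores as they are until the running maximum exceeds 2, then use the maximum
sticky_max <- function(score) {
  filled  <- carry_forward(score)
  running <- cummax(filled)
  ifelse(running > 2, running, filled)
}

# Apply per person; ave() writes each result back to that person's rows
data$w <- ave(data$score, data$person, FUN = sticky_max)
```

On the question's data the new column w agrees with the want column.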

R: Assign incremental ids based on the groups [duplicate]

I have the following sample data frame:
> test = data.frame(UserId = sample(1:5, 10, replace = T)) %>% arrange(UserId)
> test
UserId
1 1
2 1
3 1
4 1
5 1
6 3
7 4
8 4
9 4
10 5
I now want another column called loginCount that numbers each user's rows incrementally within that user's group, as shown further below. Using mutate as below assigns one id per group, but how do I get incremental ids within each group, independent of the other groups?
> test %>% mutate(loginCount = group_indices_(test, .dots = "UserId"))
UserId loginCount
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 3 2
7 4 3
8 4 3
9 4 3
10 5 4
I want something like shown below:
UserId loginCount
1 1
1 2
1 3
1 4
1 5
3 1
4 1
4 2
4 3
5 1
You could group and use row_number:
test %>%
arrange(UserId) %>%
group_by(UserId) %>%
mutate(loginCount = row_number()) %>%
ungroup()
# A tibble: 10 x 2
UserId loginCount
<int> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
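One solution using base R tapply(). Note that this relies on test being sorted by UserId, as it is here, because unlist() returns the counts group by group.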
test$loginCount <- unlist(tapply(rep(1, nrow(test)), test$UserId, cumsum))
> test
UserId loginCount
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
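If the rows were not sorted by UserId, the tapply()/unlist() result would no longer line up with the rows. A sketch with a deliberately unsorted, made-up UserId column shows the difference, along with ave() as one alternative that does not depend on row order:

```r
# Hypothetical, deliberately unsorted user ids
test <- data.frame(UserId = c(3, 1, 3, 1, 1))

# tapply() returns results grouped by user, so unlist() loses the row order
via_tapply <- unlist(tapply(rep(1, nrow(test)), test$UserId, cumsum),
                     use.names = FALSE)

# ave() writes each group's result back to that group's own rows
via_ave <- ave(seq_along(test$UserId), test$UserId, FUN = seq_along)

via_tapply  # 1 2 3 1 2
via_ave     # 1 1 2 2 3
```

Only via_ave numbers the rows correctly here: the second row is user 1's first login, yet via_tapply labels it 2.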

calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum: whenever the value of group differs from the previous row, the counter increases and a new chunk number is assigned. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
  group_by(foo) %>%
  mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
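To make the chunk numbering concrete, here is a small base R sketch on the question's values. It builds the run id two ways (the cumsum trick and rle) and then takes within-chunk differences with ave(); run_id, run_id_rle and expected are illustrative names:

```r
# Columns from the question
group <- c(1, 1, 1, 2, 2, 1, 1)
value <- c(2, 4, 5, 6, 11, 12, 15)

# A new run starts at row 1 and wherever group changes from the previous row
run_id <- cumsum(c(TRUE, diff(group) != 0))
run_id  # 1 1 1 2 2 3 3

# The same run ids built from run-length encoding
r <- rle(group)
run_id_rle <- rep(seq_along(r$values), times = r$lengths)

# Difference to the previous value within each run, NA for the first row of a run
expected <- ave(value, run_id, FUN = function(x) c(NA, diff(x)))
expected  # NA 2 1 NA 5 NA 3
```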
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table. The functions rleid and shift, introduced in the development version 1.9.5 (for shift, the default type is "lag" and the default fill is NA), are useful for this.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
